-
Notifications
You must be signed in to change notification settings - Fork 341
Data frame binary operations #2656
Conversation
|
Could you reference an issue or describe the goal? /cc @roji |
| public string Name; | ||
|
|
||
| public virtual object this[long rowIndex] { get { throw new NotImplementedException(); } set { throw new NotImplementedException(); } } | ||
| public virtual object this[long startIndex, int length] { get { throw new NotImplementedException(); } } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how current API of System.Data is designed. It uses boxing in many places. I think that this methods should be removed from the base class and be implemented in DataFrameColumn<T>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are overridden in DataFrameColumn at the moment?. Do you mean they should occur only in DataFrameColumn and not in the base class? I'm not completely convinced about that yet. At the moment, the DataFrame class has a DataFrameTable which has an IList called columns i.e. the DataFrameTable does not know the real type of the columns it holds => therefore, the approach at the moment is to add APIs on BaseDataFrameColumn and override them in the derived columns. It is possible that this will change down the line as I add more features, but for the moment I think this looks reasonable. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, let's change it a bit later.
src/Microsoft.Data/BinaryOperations/DataFrameBinaryOperations.tt
Outdated
Show resolved
Hide resolved
| for (int ii = 0; ii < NumColumns; ii++) | ||
| { | ||
| DataFrameColumn<T> column = _table.Column(ii) as DataFrameColumn<T>; | ||
| if (column != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Invert condition and use continue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to the style but I do have some questions here for the larger crowd:
If I had a DataFrameColumn<int> and called DataFrameColumn<int>.Add(5.0),
The lines
DataFrameColumn<T> column = _table.Column(ii) as DataFrameColumn<T>; if (column != null)
mean that nothing will happen since the dynamic cast will fail. I'm leaning towards checking the type of the resulting operation and creating a new column with that type?
For ex: DataFrameColumn.Add(5.0) will result in a new DataFrameColumn, similar to what 1+5.0 does in C#.
For cases such as DataFrameColumn<int> - string, we just don't perform the operation instead of throwing?
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect it to throw if I try an operation that isn't supported, but why wouldn't that one follow the same rule as number + string with scalars? i.e. convert to string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, my bad. I typed in a hurry. I meant to write "DataFrameColumn - string"(Basically whenever we have incompatible operands). Edited it now.
src/Microsoft.Data/BinaryOperations/DataFrameBinaryOperations.tt
Outdated
Show resolved
Hide resolved
|
|
||
| <# foreach (MethodConfiguration method in methodConfiguration) { #> | ||
| <# if (method.MethodType == MethodType.BinaryScalar || method.MethodType == MethodType.ComparisonScalar) {#> | ||
| public DataFrame <#=method.MethodName#><T>(T value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad indentation here and above. It should be multiple of 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I don't think T4TT engine does a great job with the indentations. I did try that before deciding to stay with this scheme
| <# } else { #> | ||
| public DataFrame <#=method.MethodName#><T>(IList<T> values) | ||
| <# } #> | ||
| where T : struct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is T limited to value types only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we want this to be a DataFrame of primitive types. Since there's no way at the moment to specify primitive types alone, I put in value types. A DataFrame is used for data wrangling with primitive types, similar to Pandas. Is there a way to add reference types to Pandas/DataFrame that makes sense? For ex, what does it mean to say DataFrameColumn.Max() or DataFrameColumn.Add(5)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should be able to store strings in a dataframe column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. At the moment, my plan is to store strings as bytes so I can still keep the struct constraint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed an issue to supports string columns in the DataFrame. Working on a patch for it at the moment
src/Microsoft.Data/BinaryOperations/DataFrameBinaryOperations.tt
Outdated
Show resolved
Hide resolved
src/Microsoft.Data/Dependencies/netcoreapp2.1/Apache.Arrow.deps.json
Outdated
Show resolved
Hide resolved
corefxlab.sln
Outdated
| EndProject | ||
| Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "System.Text.JsonLab.Serialization.Tests", "tests\System.Text.JsonLab.Serialization.Tests\System.Text.JsonLab.Serialization.Tests.csproj", "{544D4C8C-B5C6-4C3C-9763-E4CB6AF9A90C}" | ||
| EndProject | ||
| Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "Microsoft.Data", "src\Microsoft.Data\Microsoft.Data.csproj", "{9542C4FF-BC8F-47E3-916D-D2FF38003823}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should come up with a better package/assembly name than Microsoft.Data. That namespace is going to have many things in it, and this work shouldn't try taking the whole thing.
Maybe Microsoft.DataFrames?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there going to be more than one DataFrame? What about System.Data,Frame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there will be more than one DataFrame (at least not owned/maintained/worked on by the corefx team at Microsoft).
We could consider putting this in System, but I don't think the type would ever be in the .NET Core shared framework or in the netstandard specification. I imagine this type to always be available through a NuGet package where we can add more APIs without having to rely on a new .NET Core version/TFM. With that in mind, it is unclear to me what should go into a System namespace and what shouldn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrame is yet another abstraction over data in general and I think putting it in System is better than putting it in Microsoft. namespace, when other data-related abstractions are in System.Data, etc. It is not windows-specific, and should not be seen as such to encourage wider adoption.
| namespace Microsoft.Data | ||
| { | ||
| public class DataFrameColumn<T> : BaseDataFrameColumn | ||
| where T : struct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be PrimitiveDataFrameColumn<T> if T must be a struct. With the current name, it feels like DataFrameColumn<string> should be possible, but isn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless, of course, we remove the T : struct restriction, then I think the current name would make sense.
In reply to: 273526844 [](ancestors = 273526844)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think the T : struct restriction is required. If there was a restriction that truly only allowed primitive types, I would've used that. The DataFrame can only hold primitive types. No ref types and ideally no user defined structs/value types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a reason of that restriction? string isn't a primitive type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been renamed to PrimitiveColumn, so the T : struct remains. Strings will be supported with a new column type
| public class DataFrameBuffer<T> | ||
| where T : struct | ||
| { | ||
| public Memory<byte> Memory { get; private set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't the backing store for this a Memory<T>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I was going for similarity with Arrow. This way, the Memory can just be wrapped in an Arrow Buffer with no conversion overhead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to be compatible with Arrow? I've seen the dependency commented out, but I'm not sure whether Arrow is going to be supported as the engine behind the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is going to be compatible with Arrow but, if I understand you right, not exactly in the way you mention. The C# Arrow implementation as it exists now requires that the buffer storing the data is immutable. No APIs exist to mutate the contents of an ArrowBuffer. I had a discussion in the Arrow repo a month or so back about removing this restriction, but a lot of stuff has already been built with immutability in mind, so I think it's safe to say that that is a no-go.
By compatible here I mean that a DataFrameBuffer can be wrapped in an ArrowBuffer easily with no overhead. That way the DataFrameBuffer can be mutated by the DataFrame, and anyone who consumes an ArrowBuffer can also get to the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is below:
public Span<T> Span
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get => MemoryMarshal.Cast<byte, T>(Memory.Span);
}This isn't a free operation. If someone calls this many times in a hot path, this Cast is going to take cycles, when if it was just in terms of T to begin with, there is no need for a cast.
Similarly, you wouldn't need to hold onto _size and do division for things like Capacity and MaxCapacity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created an issue for this
| { | ||
| static public DataFrame MakeTestTableWithTwoColumns(int length) | ||
| { | ||
| BaseDataFrameColumn dataFrameColumn1 = new PrimitiveDataFrameColumn<int>("Int1", Enumerable.Range(0, length).Select(x => x)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PrimitiveDataFrameColumn feels very verbose for such a common thing. It would be good to have some shortcuts to create them, like some static functions that could at least infer the generic type and maybe just be called Column()
| @@ -0,0 +1,555 @@ | |||
| | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove empty lines
gfoidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's too early for some (perf-) comments, please ignore them.
| Memory = new byte[numberOfValues * _size]; | ||
| Length = 0; | ||
| } | ||
| public void Append(T value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code-path could be optimized. Similar to what was done in List.Add. Or is it too early for such kind of things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: As part of perf work
| { | ||
| get | ||
| { | ||
| if (index > Length) throw new ArgumentOutOfRangeException(nameof(index)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the (ugly) ThrowHelper-pattern, to move the throwing code out of the hot-path, thus allowing the indexer to be inlineable.
Other places as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note you only do this when it is definitely hot path .. we are all assuming the JIT will do this one day
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: As part of perf work
tests/Microsoft.Data.DataFrame.Tests/Microsoft.DataFrames.Tests.csproj
Outdated
Show resolved
Hide resolved
| @@ -0,0 +1,126 @@ | |||
| // Licensed to the .NET Foundation under one or more agreements. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need a little more tests in this PR. This doesn't seem like enough given the functionality in the library with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sort of agree. With this patch, the only ops allowed are things like floatColumn + float, so the scope is pretty narrow. Would you consider waiting till the next patch(will be up in an hour) where I've added a couple more tests for binary ops? That patch tests different functions and more combinations such as int + bool/double/decimal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is OK for now. Let's be sure to add adequate tests in future PRs.
| { | ||
| get | ||
| { | ||
| var ret = new List<T>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should initialize this list to the length capacity. That way it doesn't have to grow while you are adding elements to it.
| { | ||
| long nextRowIndex = startIndex + ret.Count + 1; | ||
| arrayIndex++; | ||
| temp = Buffers[arrayIndex][(int)nextRowIndex, length - ret.Count, ret]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though this is internal, this API usage/pattern is a little unusual:
[(int)nextRowIndex, length - ret.Count, ret];
Specifically, passing in an object ret to get filled up. It feels like it should just be a standard method call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I don't like that I have to do this either. Basically, this is a slice of length "length" that can begin in the middle of some buffer and spill over into the next buffer. I was trying to get by with passing just 1 return list "ret" to both calls to avoid creating more objects. That's why the return type of the indexer is a bool. Ideally, if there was API that made a List from a Span efficiently, or that allowed me to append to a List from a Span, I would've used that. Unless there's a less ugly way to do this? Definitely open to suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just make this a method and not use an indexer. Honestly, I don't think I've ever seen someone use a C# indexer this way.
| { | ||
| long nextRowIndex = startIndex + ret.Count + 1; | ||
| arrayIndex++; | ||
| temp = Buffers[arrayIndex][(int)nextRowIndex, length - ret.Count, ret]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is casting nextRowIndex to int, the right thing here? Same for above where we are casting startIndex to an int. Why take in a long if you are casting it to int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nextRowIndex should've been 0. That's a bug. Fixing it.
I'm taking a long because the df can hold long number of elements. Each buffer can hold only intMax number of elements though, so I'm casting once I get to the right buffer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public IList<T> this[long startIndex, int length]
{
get
{
var ret = new List<T>(length);
long endIndex = startIndex + length;
int arrayIndex = GetArrayContainingRowIndex(ref startIndex);
bool temp = Buffers[arrayIndex][(int)startIndex, length, ret];Here we take a long startIndex and then are casting it (int)startIndex.
I see that GetArrayContainingRowIndex is decreasing the number, which honestly is not intuitive. Maybe make GetArrayContainingRowIndex return an out int newStartIndex (or a better name). Have GetArrayContainingRowIndex ensure that the number is less than int.MaxValue, then the 3 calling places don't need to cast here. If it ever fails, I'm not sure what would even happen with the current code.
src/Microsoft.Data/DataFrameTable.cs
Outdated
|
|
||
| namespace Microsoft.Data | ||
| { | ||
| public class DataFrameTable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between a DataFrame and a DataFrameTable?
Maybe it would be good to have high-level class summary comments describing each public class. You can write them like this:
/// <summary>
/// High level summary comment
/// </summary>
public class Foo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A DataFrameTable is just a convenient store to hold the columns. The DataFrame class uses it to manipulate the columns and implement its algorithms. I changed the access modifier to internal now so users can only create a DataFrame and don't know anything about the Table
| </ItemGroup>--> | ||
|
|
||
| <ItemGroup> | ||
| <PackageReference Include="CsvHelper" Version="12.1.2" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we actually using CsvHelper? If not, can you delete this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Deleted it. I put this in a long time ago to write unit tests
tests/Microsoft.Data.DataFrame.Tests/Microsoft.DataFrames.Tests.csproj
Outdated
Show resolved
Hide resolved
|
|
||
| namespace Microsoft.Data | ||
| { | ||
| public partial class PrimitiveDataFrameColumnContainer<T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a public (or a user's) perspective, what is the difference between PrimitiveDataFrameColumn and PrimitiveDataFrameColumnContainer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A PrimitiveDataFrameColumnContainer is just an internal store(changed the modifier to internal now so users can't create one) to hold the column data. PrimitiveDataFrameColumn defines the APIs that manipulate the container
|
Just a note: I'm hoping to get this approved soon so I can put up the patches I have for StringColumn support and BinaryOperator overloads I'll be filing issues for the various TODOs and other feedback from here shortly so that I can fix them in patches separately |
| for (int ii = 0; ii < ColumnCount; ii++) | ||
| { | ||
| PrimitiveDataFrameColumn<T> column = _table.Column(ii) as PrimitiveDataFrameColumn<T>; | ||
| if (column != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it correct that it should jsut return silently if this is null (if it's not the expected type)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, this is wrong. I can throw an error in an else block for this patch but my next patch already handles it by implementing support for things like (intColumn + float). It's hard to support binary ops on generic columns at the moment.
|
|
||
| public Span<T> Span | ||
| { | ||
| [MethodImpl(MethodImplOptions.AggressiveInlining)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally we try to avoid inlining attributes unless we know for sure we need them. ie., we know it will not get inlined (for some bad reason) and we know inlining is important to get the performance we want. In this case I would trust it will get inlined, and when we do perf work later, if it's not, we can do appropriate things (which might include modifying the jit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. This will go away soon enough though for #2659
| @@ -0,0 +1 @@ | |||
| No newline at end of file | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does templating require the empty file be in the repo for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. I had to rebase on upstream/master and it got generated this time.
| } | ||
| public void RemoveColumn(int columnIndex) | ||
| { | ||
| _columnNameToIndexDictionary.Remove(_columnNames[columnIndex]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you validate for -1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope I think. _columnName[columnIndex] will throw the exception I want
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bad practice not to check parameters and delegate their validation.
| @@ -0,0 +1,4138 @@ | |||
| | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, blank lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to remove 1 of the blank lines in my next patch, but for the life of me, I am not able to find out where the first blank line is coming from. All I know is that it is from some random space in the .ttinclude file
| internal interface IPrimitiveDataFrameColumnArithmetic<T> | ||
| where T : struct | ||
| { | ||
| void Add(PrimitiveDataFrameColumnContainer<T> left, PrimitiveDataFrameColumnContainer<T> right); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider sorting. It's at least a stable ordering..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next patch. This'll cause too much pain when I rebase
| public static IPrimitiveDataFrameColumnArithmetic<T> GetArithmetic<T>() | ||
| where T : struct | ||
| { | ||
| if (typeof(T) == typeof(bool)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use a switch statement on the type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could also set a return value then return return (IPrimitiveDataFrameColumnArithmetic<T>)value which will save all these pasted casts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I just took this from some of the Tensor code we have checked in, so I didn't pay too much attention to it. Will consider it for my next patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you specify the case where a constant value is required, i.e.
switch (typeof(T))
{
case typeof(int):
// ...Won't compile (CS0150) (with C# 7.3 and any version SharpLab offers).
Or does pattern matching provide a syntax I'm not aware of?
A way via TypeCode won't be optimized by JIT -- I believe JIT won't optimize (dead code eliminate) on case either, as it does with the if-chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an expert on this, but my understanding of this pattern was that the JIT was smart enough to eliminate chunks of code. When JIT'ing for T == bool, it can see that if (typeof(T) == typeof(bool)) always holds, similarly else if (typeof(T) == typeof(int)) never holds, so it can completely remove non-bool code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eerhardt correct.
The JIT does this by dead code elimination, as typeof is a JIT-time-constant, so all other branches will fall away.
Side note: As the IL for such methods is quite big, the JIT won't inline these methods in most cases. Here AggressiveInline can help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting.
|
|
||
| internal class BoolArithmetic : IPrimitiveDataFrameColumnArithmetic<bool> | ||
| { | ||
| public void Add(PrimitiveDataFrameColumnContainer<bool> left, PrimitiveDataFrameColumnContainer<bool> right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VS has a magic refactoring that will implement an interface for you. When it does this it uses NotImplementedEXception. You should probably use that.
NotImplementedException --> I am not done here
NotSupportedException --> It is not meaningful to call this (usually when something really should be abstract)
PlatformNotSupportedException --> Works on other platforms, not this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ie in theory in CoreFX there are zero NotImplementedException, but in practice I see we weren't consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NotSupportedException is the right thing here I think. It doesn't make sense to call any arithmetic ops on a bool
src/Microsoft.Data/PrimitiveDataFrameColumnContainer.BinaryOperations.cs
Show resolved
Hide resolved
danmoseley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine to me but someone who understands the semantics yo'ure shooting for here should signo ff.
|
|
||
| public long NullCount { get; protected set; } | ||
|
|
||
| public string Name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public fields is a bad practice. Must be replaced by a property.
| public string Name; | ||
|
|
||
| public virtual object this[long rowIndex] { get { throw new NotImplementedException(); } set { throw new NotImplementedException(); } } | ||
| public virtual object this[long startIndex, int length] { get { throw new NotImplementedException(); } } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, let's change it a bit later.
| private DataFrameTable(IList<BaseDataFrameColumn> columns) | ||
| { | ||
| columns = columns ?? throw new ArgumentNullException(nameof(columns)); | ||
| _columns = columns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you assigning the value to columns if it's a parameter you're checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand?? I'm assigning to _columns? columns without the "_" is the parameter. Technically I'm not 100% convinced with this code yet. I'm still not certain if I should be making a copy of the columns being passed in, so I'll wait for more usage scenarios before I make the choice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're assigning columns to columns in line 32 to check for a null value. Just check the value while assigning it to the field:
| _columns = columns; | |
| _columns = columns ?? throw new ArgumentNullException(nameof(columns)); |
| } | ||
| public void RemoveColumn(int columnIndex) | ||
| { | ||
| _columnNameToIndexDictionary.Remove(_columnNames[columnIndex]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bad practice not to check parameters and delegate their validation.
| <Project Sdk="Microsoft.NET.Sdk"> | ||
|
|
||
| <PropertyGroup> | ||
| <TargetFramework>netcoreapp2.2</TargetFramework> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this not be netstandard2.0?
| </ItemGroup> | ||
|
|
||
| <ItemGroup> | ||
| <Compile Update="BaseColumn.BinaryOperations.cs"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this file in your PR. Are you sure all of these are still necessary?
| namespace Microsoft.Data | ||
| { | ||
| /// <summary> | ||
| /// The base column type. All APIs should have atleast a stub here first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
atleast is two words.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
![]()
Let's be sure any feedback you didn't address here gets tracked and addressed in a future PR (there are a lot of comments to wade through).
This is work towards #26854 to add a DataFrame type to .NET Core
This patch implements the basic infrastructure, indexing and binary operations on a dataframe. I thought this was a good spot to get an initial review. More patches will follow on top of this to support computations, aggregations, joins, merges, group by etc
@eerhardt @stephentoub @ericstj