Skip to content
This repository was archived by the owner on Aug 2, 2023. It is now read-only.

Conversation

@pgovind
Copy link

@pgovind pgovind commented Apr 9, 2019

This is work towards #26854 to add a DataFrame type to .NET Core

This patch implements the basic infrastructure, indexing and binary operations on a dataframe. I thought this was a good spot to get an initial review. More patches will follow on top of this to support computations, aggregations, joins, merges, group by etc

@eerhardt @stephentoub @ericstj

@dnfclas
Copy link

dnfclas commented Apr 9, 2019

CLA assistant check
All CLA requirements met.

@YohDeadfall
Copy link
Contributor

Could you reference an issue or describe the goal?

/cc @roji

public string Name;

public virtual object this[long rowIndex] { get { throw new NotImplementedException(); } set { throw new NotImplementedException(); } }
public virtual object this[long startIndex, int length] { get { throw new NotImplementedException(); } }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how current API of System.Data is designed. It uses boxing in many places. I think that this methods should be removed from the base class and be implemented in DataFrameColumn<T>.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are overridden in DataFrameColumn at the moment?. Do you mean they should occur only in DataFrameColumn and not in the base class? I'm not completely convinced about that yet. At the moment, the DataFrame class has a DataFrameTable which has an IList called columns i.e. the DataFrameTable does not know the real type of the columns it holds => therefore, the approach at the moment is to add APIs on BaseDataFrameColumn and override them in the derived columns. It is possible that this will change down the line as I add more features, but for the moment I think this looks reasonable. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let's change it a bit later.

for (int ii = 0; ii < NumColumns; ii++)
{
DataFrameColumn<T> column = _table.Column(ii) as DataFrameColumn<T>;
if (column != null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invert condition and use continue.

Copy link
Author

@pgovind pgovind Apr 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to the style but I do have some questions here for the larger crowd:

If I had a DataFrameColumn<int> and called DataFrameColumn<int>.Add(5.0),
The lines
DataFrameColumn<T> column = _table.Column(ii) as DataFrameColumn<T>; if (column != null)

mean that nothing will happen since the dynamic cast will fail. I'm leaning towards checking the type of the resulting operation and creating a new column with that type?
For ex: DataFrameColumn.Add(5.0) will result in a new DataFrameColumn, similar to what 1+5.0 does in C#.

For cases such as DataFrameColumn<int> - string, we just don't perform the operation instead of throwing?
Thoughts?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect it to throw if I try an operation that isn't supported, but why wouldn't that one follow the same rule as number + string with scalars? i.e. convert to string

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, my bad. I typed in a hurry. I meant to write "DataFrameColumn - string"(Basically whenever we have incompatible operands). Edited it now.


<# foreach (MethodConfiguration method in methodConfiguration) { #>
<# if (method.MethodType == MethodType.BinaryScalar || method.MethodType == MethodType.ComparisonScalar) {#>
public DataFrame <#=method.MethodName#><T>(T value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad indentation here and above. It should be multiple of 4.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I don't think T4TT engine does a great job with the indentations. I did try that before deciding to stay with this scheme

<# } else { #>
public DataFrame <#=method.MethodName#><T>(IList<T> values)
<# } #>
where T : struct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is T limited to value types only?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we want this to be a DataFrame of primitive types. Since there's no way at the moment to specify primitive types alone, I put in value types. A DataFrame is used for data wrangling with primitive types, similar to Pandas. Is there a way to add reference types to Pandas/DataFrame that makes sense? For ex, what does it mean to say DataFrameColumn.Max() or DataFrameColumn.Add(5)?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be able to store strings in a dataframe column.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. At the moment, my plan is to store strings as bytes so I can still keep the struct constraint

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed an issue to supports string columns in the DataFrame. Working on a patch for it at the moment

corefxlab.sln Outdated
EndProject
Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "System.Text.JsonLab.Serialization.Tests", "tests\System.Text.JsonLab.Serialization.Tests\System.Text.JsonLab.Serialization.Tests.csproj", "{544D4C8C-B5C6-4C3C-9763-E4CB6AF9A90C}"
EndProject
Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "Microsoft.Data", "src\Microsoft.Data\Microsoft.Data.csproj", "{9542C4FF-BC8F-47E3-916D-D2FF38003823}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should come up with a better package/assembly name than Microsoft.Data. That namespace is going to have many things in it, and this work shouldn't try taking the whole thing.

Maybe Microsoft.DataFrames?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there going to be more than one DataFrame? What about System.Data,Frame?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there will be more than one DataFrame (at least not owned/maintained/worked on by the corefx team at Microsoft).

We could consider putting this in System, but I don't think the type would ever be in the .NET Core shared framework or in the netstandard specification. I imagine this type to always be available through a NuGet package where we can add more APIs without having to rely on a new .NET Core version/TFM. With that in mind, it is unclear to me what should go into a System namespace and what shouldn't.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame is yet another abstraction over data in general and I think putting it in System is better than putting it in Microsoft. namespace, when other data-related abstractions are in System.Data, etc. It is not windows-specific, and should not be seen as such to encourage wider adoption.

namespace Microsoft.Data
{
public class DataFrameColumn<T> : BaseDataFrameColumn
where T : struct
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be PrimitiveDataFrameColumn<T> if T must be a struct. With the current name, it feels like DataFrameColumn<string> should be possible, but isn't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless, of course, we remove the T : struct restriction, then I think the current name would make sense.


In reply to: 273526844 [](ancestors = 273526844)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think the T : struct restriction is required. If there was a restriction that truly only allowed primitive types, I would've used that. The DataFrame can only hold primitive types. No ref types and ideally no user defined structs/value types

Copy link
Contributor

@YohDeadfall YohDeadfall Apr 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a reason of that restriction? string isn't a primitive type.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been renamed to PrimitiveColumn, so the T : struct remains. Strings will be supported with a new column type

public class DataFrameBuffer<T>
where T : struct
{
public Memory<byte> Memory { get; private set; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't the backing store for this a Memory<T>?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I was going for similarity with Arrow. This way, the Memory can just be wrapped in an Arrow Buffer with no conversion overhead

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to be compatible with Arrow? I've seen the dependency commented out, but I'm not sure whether Arrow is going to be supported as the engine behind the interface.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is going to be compatible with Arrow but, if I understand you right, not exactly in the way you mention. The C# Arrow implementation as it exists now requires that the buffer storing the data is immutable. No APIs exist to mutate the contents of an ArrowBuffer. I had a discussion in the Arrow repo a month or so back about removing this restriction, but a lot of stuff has already been built with immutability in mind, so I think it's safe to say that that is a no-go.

By compatible here I mean that a DataFrameBuffer can be wrapped in an ArrowBuffer easily with no overhead. That way the DataFrameBuffer can be mutated by the DataFrame, and anyone who consumes an ArrowBuffer can also get to the data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is below:

public Span<T> Span
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => MemoryMarshal.Cast<byte, T>(Memory.Span);
        }

This isn't a free operation. If someone calls this many times in a hot path, this Cast is going to take cycles, when if it was just in terms of T to begin with, there is no need for a cast.

Similarly, you wouldn't need to hold onto _size and do division for things like Capacity and MaxCapacity.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created an issue for this

{
static public DataFrame MakeTestTableWithTwoColumns(int length)
{
BaseDataFrameColumn dataFrameColumn1 = new PrimitiveDataFrameColumn<int>("Int1", Enumerable.Range(0, length).Select(x => x));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrimitiveDataFrameColumn feels very verbose for such a common thing. It would be good to have some shortcuts to create them, like some static functions that could at least infer the generic type and maybe just be called Column()

@@ -0,0 +1,555 @@

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove empty lines

Copy link
Member

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's too early for some (perf-) comments, please ignore them.

Memory = new byte[numberOfValues * _size];
Length = 0;
}
public void Append(T value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code-path could be optimized. Similar to what was done in List.Add. Or is it too early for such kind of things?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: As part of perf work

{
get
{
if (index > Length) throw new ArgumentOutOfRangeException(nameof(index));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the (ugly) ThrowHelper-pattern, to move the throwing code out of the hot-path, thus allowing the indexer to be inlineable.

Other places as well.

Copy link
Member

@danmoseley danmoseley Apr 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note you only do this when it is definitely hot path .. we are all assuming the JIT will do this one day

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: As part of perf work

@@ -0,0 +1,126 @@
// Licensed to the .NET Foundation under one or more agreements.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need a little more tests in this PR. This doesn't seem like enough given the functionality in the library with this PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort of agree. With this patch, the only ops allowed are things like floatColumn + float, so the scope is pretty narrow. Would you consider waiting till the next patch(will be up in an hour) where I've added a couple more tests for binary ops? That patch tests different functions and more combinations such as int + bool/double/decimal

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK for now. Let's be sure to add adequate tests in future PRs.

{
get
{
var ret = new List<T>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should initialize this list to the length capacity. That way it doesn't have to grow while you are adding elements to it.

{
long nextRowIndex = startIndex + ret.Count + 1;
arrayIndex++;
temp = Buffers[arrayIndex][(int)nextRowIndex, length - ret.Count, ret];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though this is internal, this API usage/pattern is a little unusual:

[(int)nextRowIndex, length - ret.Count, ret];

Specifically, passing in an object ret to get filled up. It feels like it should just be a standard method call.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I don't like that I have to do this either. Basically, this is a slice of length "length" that can begin in the middle of some buffer and spill over into the next buffer. I was trying to get by with passing just 1 return list "ret" to both calls to avoid creating more objects. That's why the return type of the indexer is a bool. Ideally, if there was API that made a List from a Span efficiently, or that allowed me to append to a List from a Span, I would've used that. Unless there's a less ugly way to do this? Definitely open to suggestions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just make this a method and not use an indexer. Honestly, I don't think I've ever seen someone use a C# indexer this way.

{
long nextRowIndex = startIndex + ret.Count + 1;
arrayIndex++;
temp = Buffers[arrayIndex][(int)nextRowIndex, length - ret.Count, ret];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is casting nextRowIndex to int, the right thing here? Same for above where we are casting startIndex to an int. Why take in a long if you are casting it to int?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nextRowIndex should've been 0. That's a bug. Fixing it.

I'm taking a long because the df can hold long number of elements. Each buffer can hold only intMax number of elements though, so I'm casting once I get to the right buffer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        public IList<T> this[long startIndex, int length]
        {
            get
            {
                var ret = new List<T>(length);
                long endIndex = startIndex + length;
                int arrayIndex = GetArrayContainingRowIndex(ref startIndex);
                bool temp = Buffers[arrayIndex][(int)startIndex, length, ret];

Here we take a long startIndex and then are casting it (int)startIndex.

I see that GetArrayContainingRowIndex is decreasing the number, which honestly is not intuitive. Maybe make GetArrayContainingRowIndex return an out int newStartIndex (or a better name). Have GetArrayContainingRowIndex ensure that the number is less than int.MaxValue, then the 3 calling places don't need to cast here. If it ever fails, I'm not sure what would even happen with the current code.


namespace Microsoft.Data
{
public class DataFrameTable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between a DataFrame and a DataFrameTable?

Maybe it would be good to have high-level class summary comments describing each public class. You can write them like this:

    /// <summary>
    /// High level summary comment
    /// </summary>
   public class Foo

Copy link
Author

@pgovind pgovind Apr 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A DataFrameTable is just a convenient store to hold the columns. The DataFrame class uses it to manipulate the columns and implement its algorithms. I changed the access modifier to internal now so users can only create a DataFrame and don't know anything about the Table

</ItemGroup>-->

<ItemGroup>
<PackageReference Include="CsvHelper" Version="12.1.2" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we actually using CsvHelper? If not, can you delete this line?

Copy link
Author

@pgovind pgovind Apr 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Deleted it. I put this in a long time ago to write unit tests


namespace Microsoft.Data
{
public partial class PrimitiveDataFrameColumnContainer<T>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a public (or a user's) perspective, what is the difference between PrimitiveDataFrameColumn and PrimitiveDataFrameColumnContainer?

Copy link
Author

@pgovind pgovind Apr 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A PrimitiveDataFrameColumnContainer is just an internal store(changed the modifier to internal now so users can't create one) to hold the column data. PrimitiveDataFrameColumn defines the APIs that manipulate the container

@pgovind
Copy link
Author

pgovind commented Apr 15, 2019

Just a note: I'm hoping to get this approved soon so I can put up the patches I have for StringColumn support and BinaryOperator overloads

I'll be filing issues for the various TODOs and other feedback from here shortly so that I can fix them in patches separately

for (int ii = 0; ii < ColumnCount; ii++)
{
PrimitiveDataFrameColumn<T> column = _table.Column(ii) as PrimitiveDataFrameColumn<T>;
if (column != null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it correct that it should jsut return silently if this is null (if it's not the expected type)?

Copy link
Author

@pgovind pgovind Apr 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this is wrong. I can throw an error in an else block for this patch but my next patch already handles it by implementing support for things like (intColumn + float). It's hard to support binary ops on generic columns at the moment.


public Span<T> Span
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we try to avoid inlining attributes unless we know for sure we need them. ie., we know it will not get inlined (for some bad reason) and we know inlining is important to get the performance we want. In this case I would trust it will get inlined, and when we do perf work later, if it's not, we can do appropriate things (which might include modifying the jit)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. This will go away soon enough though for #2659

@@ -0,0 +1 @@
 No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does templating require the empty file be in the repo for some reason?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. I had to rebase on upstream/master and it got generated this time.

}
public void RemoveColumn(int columnIndex)
{
_columnNameToIndexDictionary.Remove(_columnNames[columnIndex]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you validate for -1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope I think. _columnName[columnIndex] will throw the exception I want

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bad practice not to check parameters and delegate their validation.

@@ -0,0 +1,4138 @@

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, blank lines

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to remove 1 of the blank lines in my next patch, but for the life of me, I am not able to find out where the first blank line is coming from. All I know is that it is from some random space in the .ttinclude file

internal interface IPrimitiveDataFrameColumnArithmetic<T>
where T : struct
{
void Add(PrimitiveDataFrameColumnContainer<T> left, PrimitiveDataFrameColumnContainer<T> right);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider sorting. It's at least a stable ordering..

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next patch. This'll cause too much pain when I rebase

public static IPrimitiveDataFrameColumnArithmetic<T> GetArithmetic<T>()
where T : struct
{
if (typeof(T) == typeof(bool))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use a switch statement on the type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could also set a return value then return return (IPrimitiveDataFrameColumnArithmetic<T>)value which will save all these pasted casts

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I just took this from some of the Tensor code we have checked in, so I didn't pay too much attention to it. Will consider it for my next patch

Copy link
Member

@gfoidl gfoidl Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you specify the case where a constant value is required, i.e.

switch (typeof(T))
{
    case typeof(int):
// ...

Won't compile (CS0150) (with C# 7.3 and any version SharpLab offers).
Or does pattern matching provide a syntax I'm not aware of?

A way via TypeCode won't be optimized by JIT -- I believe JIT won't optimize (dead code eliminate) on case either, as it does with the if-chain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert on this, but my understanding of this pattern was that the JIT was smart enough to eliminate chunks of code. When JIT'ing for T == bool, it can see that if (typeof(T) == typeof(bool)) always holds, similarly else if (typeof(T) == typeof(int)) never holds, so it can completely remove non-bool code paths.

Copy link
Member

@gfoidl gfoidl Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eerhardt correct.

The JIT does this by dead code elimination, as typeof is a JIT-time-constant, so all other branches will fall away.

Side note: As the IL for such methods is quite big, the JIT won't inline these methods in most cases. Here AggressiveInline can help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting.


internal class BoolArithmetic : IPrimitiveDataFrameColumnArithmetic<bool>
{
public void Add(PrimitiveDataFrameColumnContainer<bool> left, PrimitiveDataFrameColumnContainer<bool> right)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VS has a magic refactoring that will implement an interface for you. When it does this it uses NotImplementedEXception. You should probably use that.

NotImplementedException --> I am not done here
NotSupportedException --> It is not meaningful to call this (usually when something really should be abstract)
PlatformNotSupportedException --> Works on other platforms, not this one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ie in theory in CoreFX there are zero NotImplementedException, but in practice I see we weren't consistent

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NotSupportedException is the right thing here I think. It doesn't make sense to call any arithmetic ops on a bool

Copy link
Member

@danmoseley danmoseley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me but someone who understands the semantics yo'ure shooting for here should signo ff.


public long NullCount { get; protected set; }

public string Name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public fields is a bad practice. Must be replaced by a property.

public string Name;

public virtual object this[long rowIndex] { get { throw new NotImplementedException(); } set { throw new NotImplementedException(); } }
public virtual object this[long startIndex, int length] { get { throw new NotImplementedException(); } }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let's change it a bit later.

private DataFrameTable(IList<BaseDataFrameColumn> columns)
{
columns = columns ?? throw new ArgumentNullException(nameof(columns));
_columns = columns;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you assigning the value to columns if it's a parameter you're checking?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand?? I'm assigning to _columns? columns without the "_" is the parameter. Technically I'm not 100% convinced with this code yet. I'm still not certain if I should be making a copy of the columns being passed in, so I'll wait for more usage scenarios before I make the choice

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're assigning columns to columns in line 32 to check for a null value. Just check the value while assigning it to the field:

Suggested change
_columns = columns;
_columns = columns ?? throw new ArgumentNullException(nameof(columns));

}
public void RemoveColumn(int columnIndex)
{
_columnNameToIndexDictionary.Remove(_columnNames[columnIndex]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bad practice not to check parameters and delegate their validation.

<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<TargetFramework>netcoreapp2.2</TargetFramework>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this not be netstandard2.0?

</ItemGroup>

<ItemGroup>
<Compile Update="BaseColumn.BinaryOperations.cs">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this file in your PR. Are you sure all of these are still necessary?

namespace Microsoft.Data
{
/// <summary>
/// The base column type. All APIs should have atleast a stub here first
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atleast is two words.

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Let's be sure any feedback you didn't address here gets tracked and addressed in a future PR (there are a lot of comments to wade through).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants