DataFrame enhancements #6088

Open
Tracked by #6144
GKrivosheev-rms opened this issue Feb 15, 2022 · 7 comments
Labels
enhancement New feature or request Microsoft.Data.Analysis All DataFrame related issues and PRs
Milestone

Comments

@GKrivosheev-rms

GKrivosheev-rms commented Feb 15, 2022

I see dozens of issues and enhancement suggestions for DataFrame in the Microsoft.Data.Analysis namespace that have gone untouched for almost a year.
Are there any resources allocated to address those?
Is the project dead?
Are there any plans to fund the work on those features in the future?
Should we base any future development on these?

Specific enhancements desired:

  • Array/VBuffer column types
  • Sort by multiple columns
  • GroupBy by multiple columns
  • Parquet read/write (currently ParquetSharp.DataFrame has some limited support)
@GKrivosheev-rms GKrivosheev-rms added the enhancement New feature or request label Feb 15, 2022
@luisquintanilla
Contributor

Hi @GKrivosheev-rms

Thanks for raising this issue. We're planning on evaluating the data preparation / data wrangling story in the coming months as outlined in the roadmap. We suspect the DataFrame API has a role to play there but until we have a clearer picture on common uses, asks, and pain points with the existing API, there is no active development on the DataFrame API at this time. That doesn't mean the project is dead or issues and feature requests like these aren't being taken into account. They are going to help frame our investigations and prioritize our efforts. Because the DataFrame API is currently in preview and we don't expect to add new features within the next couple of months, personally I would not take hard dependencies on it at this time for critical systems.

Let us know if you have additional questions or issues.

@GKrivosheev-rms
Author

Thanks, Luis!

@michaelgsharp michaelgsharp added this to the ML.NET Future milestone Feb 15, 2022
@michaelgsharp michaelgsharp changed the title Is DataFrame project dead? (No Enhancements to Microsoft.Data.Analysis for almost a year) DataFrame enhancements Feb 15, 2022
@GKrivosheev-rms
Author

GKrivosheev-rms commented Feb 15, 2022

Luis,
Just to give you some context, we are considering the DataFrame and related code to build a natural disaster modeling framework for RMS / Moody's Analytics that underpins the trillion-dollar Catastrophe (Re)Insurance industry. The columnar data type fits nicely for processing insurance losses while doing large-scale analytics and data processing. It's a very nice paradigm. However, in order for us to use it, it needs support and the basic enhancements listed above.

@luisquintanilla
Contributor

luisquintanilla commented Feb 15, 2022

Tagging for visibility: @GKrivosheev-rms

Thanks Gleb for providing additional context around your scenario. To clarify, you're looking to use DataFrame for data processing and analytics, not exactly for building predictive analytics / machine learning models? If so, have you taken a look at .NET for Apache Spark?

It has its own implementation of DataFrames which support:

Not sure if that would help solve your problem, but thought I'd mention it.

Here's an E2E example of .NET for Apache Spark and ML.NET as well as standalone examples from the .NET for Apache Spark repo.

@GKrivosheev-rms
Author

Thanks for the suggestion, @luisquintanilla. I'll take a look.

A few questions:

  1. Why are there two very similar implementations of dataframes (Spark and Data Analytics)? Is there something in Analytics dataframes that Spark dataframes can't do? Are there plans to consolidate?
  2. Do Spark dataframes or DF operations require a full Spark engine installed and running for single-machine operations?
  3. For ML and data prep workloads for analytics DataFrames, can I apply ML.NET transforms, readers, writers and learners? If the answer is yes, then how can it work without supporting metadata and vector/VBuffer type columns?
  4. What percent of ML.NET features are supported via dataframes? Do you have any samples?

Regards,
Gleb

@luisquintanilla
Contributor

@GKrivosheev-rms great questions. I've tried to answer them below.

  1. The DataAnalytics DataFrame can be thought of as similar to the Pandas DataFrame, whereas the .NET for Spark DataFrame is just the DataFrame implementation Spark uses. As a result, you can use the DataAnalytics DataFrame locally on your PC without any dependencies, while Spark DataFrames run on the Spark engine, which you could run on your own PC, but there's some setup and dependencies required. At the moment there are no plans for consolidation, though there is some interop. Here are some examples of that:
  2. You need the full Spark engine installed to run operations on Spark DataFrames. These are the setup instructions. You could also use a cloud service / product like Azure Synapse, HDInsight, Databricks, AWS EMR, etc. .NET for Spark runs anywhere that Spark runs, including a single machine like your PC. You can also use .NET notebooks locally on your PC if you prefer a more interactive way of working with Spark than spark-submit jobs on the command line.
  3. The short answer is yes, though the interop between DataAnalytics DataFrames and ML.NET is limited at the moment. DataFrame implements IDataView so you can take a DataFrame and use it like you would an IDataView, but going from IDataView to DataFrame doesn't always work because some types like vector/VBuffer aren't supported. Here are some examples of using a DataFrame for training and inferencing:
  4. I don't have an exact percentage of ML.NET features supported by DataFrames, but from the examples I've included above you can use DataFrames for training and inferencing, so long as the data types you're working with are supported.
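
As a rough illustration of the DataFrame / IDataView interop described above, here is a minimal sketch rather than an official sample. It assumes the Microsoft.Data.Analysis and Microsoft.ML NuGet packages are referenced; the column names ("Size", "Price"), the values, and the choice of the SDCA regression trainer are all made up for the example. The round trip back to a DataFrame selects only scalar columns, since vector-valued columns are the case that may not convert.

```csharp
using System;
using Microsoft.Data.Analysis;
using Microsoft.ML;

// Build a small in-memory DataFrame (hypothetical columns and values).
var df = new DataFrame(
    new PrimitiveDataFrameColumn<float>("Size", new[] { 1.1f, 1.9f, 2.8f, 3.4f }),
    new PrimitiveDataFrameColumn<float>("Price", new[] { 1.2f, 2.3f, 3.0f, 3.7f }));

var mlContext = new MLContext(seed: 0);

// DataFrame implements IDataView, so it can be passed directly to Fit and Transform.
var pipeline = mlContext.Transforms.Concatenate("Features", "Size")
    .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price"));
var model = pipeline.Fit(df);
IDataView predictions = model.Transform(df);

// Going back from IDataView to DataFrame: pick only the scalar columns,
// because the vector-valued "Features" column may not be convertible.
DataFrame scores = predictions.ToDataFrame("Price", "Score");
Console.WriteLine(scores);
```

If the column selection is dropped, the conversion would also have to handle the "Features" vector column, which is the limitation mentioned in the third answer.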

Hope this helps. Happy to clarify anything.

@aloneguid

To add here, Parquet.Net, which is already used in ML.NET, has full built-in support for DataFrame read and write.

There is a sample C# interactive notebook demonstrating basic use (it's a one-liner) as well. It just works.


4 participants