Commit 0111cff

[BEAM-12906] Add a dataframe extra for installing a pandas version supported by the DataFrame API (#15528)
* Add 'dataframe' extra
* Update documentation to reference 'dataframe' extra
* Add dataframe to default extras
* fixup! Add 'dataframe' extra
* Install dataframe extra in installGcpTest task (for integration tests)
1 parent 5e7e66f commit 0111cff

6 files changed: 26 additions, 18 deletions

buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy (1 addition, 1 deletion)

```diff
@@ -2337,7 +2337,7 @@ class BeamModulePlugin implements Plugin<Project> {
       def distTarBall = "${pythonRootDir}/build/apache-beam.tar.gz"
       project.exec {
         executable 'sh'
-        args '-c', ". ${project.ext.envdir}/bin/activate && pip install --retries 10 ${distTarBall}[gcp,test,aws,azure]"
+        args '-c', ". ${project.ext.envdir}/bin/activate && pip install --retries 10 ${distTarBall}[gcp,test,aws,azure,dataframe]"
       }
     }
   }
```

examples/notebooks/tour-of-beam/dataframes.ipynb (2 additions, 2 deletions)

```diff
@@ -65,7 +65,7 @@
     "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n",
     "\n",
     "First, we need to install Apache Beam with the `interactive` extra for the Interactive runner.",
-    "We also need `pandas` for this notebook, but the Interactive runner already depends on it."
+    "We also need to install a version of `pandas` supported by the DataFrame API, which we can get with the `dataframe` extra in Beam 2.34.0 and newer."
   ],
   "metadata": {
     "id": "hDuXLLSZnI1D"
@@ -75,7 +75,7 @@
   "cell_type": "code",
   "execution_count": null,
   "source": [
-    "%pip install --quiet apache-beam[interactive]"
+    "%pip install --quiet apache-beam[interactive,dataframe]"
   ],
   "outputs": [],
   "metadata": {
```

sdks/python/apache_beam/examples/dataframe/README.md (4 additions, 6 deletions)

```diff
@@ -26,12 +26,10 @@ API](https://beam.apache.org/documentation/dsls/dataframes/overview/).
 
 You must have `apache-beam>=2.30.0` installed in order to run these pipelines,
 because the `apache_beam.examples.dataframe` module was added in that release.
-Additionally using the DataFrame API requires `pandas>=1.0.0` to be installed
-in your local Python session. The _same_ version should be installed on workers
-when executing DataFrame API pipelines on distributed runners. Reference
-[`base_image_requirements.txt`](../../../container/base_image_requirements.txt)
-for the Beam release you are using to see what version of pandas will be used
-by default on distributed workers.
+Using the DataFrame API also requires a compatible pandas version to be
+installed, see the
+[documentation](https://beam.apache.org/documentation/dsls/dataframes/overview/#pre-requisites)
+for details.
 
 ## Wordcount Pipeline
```
sdks/python/setup.py (3 additions, 2 deletions)

```diff
@@ -165,7 +165,7 @@ def get_version():
 REQUIRED_TEST_PACKAGES = [
     'freezegun>=0.3.12',
     'mock>=1.0.1,<3.0.0',
-    'pandas>=1.0,<1.4.0',
+    'pandas<2.0.0',
     'parameterized>=0.7.1,<0.8.0',
     'pyhamcrest>=1.9,!=1.10.0,<2.0.0',
     'pyyaml>=3.12,<6.0.0',
@@ -305,7 +305,8 @@ def run(self):
         'interactive': INTERACTIVE_BEAM,
         'interactive_test': INTERACTIVE_BEAM_TEST,
         'aws': AWS_REQUIREMENTS,
-        'azure': AZURE_REQUIREMENTS
+        'azure': AZURE_REQUIREMENTS,
+        'dataframe': ['pandas>=1.0,<1.4']
     },
     zip_safe=False,
     # PyPI package information.
```
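The `extras_require` entry above is what lets `pip install apache-beam[dataframe]` pull in a compatible pandas. As an illustrative sketch of what a pin like `pandas>=1.0,<1.4` means (simplified, hypothetical code, not Beam or pip internals; real tools parse versions per PEP 440):

```python
def satisfies(version, lower, upper):
    """Check lower <= version < upper by numeric tuple comparison.

    A toy model of the half-open range in the 'dataframe' extra's
    'pandas>=1.0,<1.4' pin; real installers handle pre-releases,
    epochs, and local versions per PEP 440.
    """
    def parse(v):
        return tuple(int(part) for part in v.split('.'))
    return parse(lower) <= parse(version) < parse(upper)

print(satisfies('1.3.5', '1.0', '1.4'))   # True: inside the pin
print(satisfies('1.4.0', '1.0', '1.4'))   # False: upper bound is exclusive
print(satisfies('0.25.3', '1.0', '1.4'))  # False: below the lower bound
```

The exclusive upper bound is why the extra tracks the newest pandas the DataFrame API has been validated against, rather than whatever latest release pip would otherwise choose.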

sdks/python/tox.ini (5 additions, 5 deletions)

```diff
@@ -30,7 +30,7 @@ select = E3
 # allow apps that support color to use it.
 passenv=TERM
 # Set [] options for pip installation of apache-beam tarball.
-extras = test
+extras = test,dataframe
 # Don't warn that these commands aren't installed.
 whitelist_externals =
   false
@@ -88,7 +88,7 @@ commands =
   {toxinidir}/scripts/run_pytest.sh {envname} "{posargs}"
 
 [testenv:py{36,37,38}-cloud]
-extras = test,gcp,interactive,aws,azure
+extras = test,gcp,interactive,dataframe,aws,azure
 commands =
   {toxinidir}/scripts/run_pytest.sh {envname} "{posargs}"
 
@@ -98,7 +98,7 @@ deps =
   codecov
   pytest-cov==2.9.0
 passenv = GIT_* BUILD_* ghprb* CHANGE_ID BRANCH_NAME JENKINS_* CODECOV_*
-extras = test,gcp,interactive,aws
+extras = test,gcp,interactive,dataframe,aws
 commands =
   -rm .coverage
   {toxinidir}/scripts/run_pytest.sh {envname} "{posargs}" "--cov-report=xml --cov=. --cov-append"
@@ -138,7 +138,7 @@ commands =
   python setup.py mypy
 
 [testenv:py38-docs]
-extras = test,gcp,docs,interactive
+extras = test,gcp,docs,interactive,dataframe
 deps =
   Sphinx==1.8.5
   sphinx_rtd_theme==0.4.3
@@ -197,7 +197,7 @@ commands =
 # pulls in the latest docutils. Uncomment this line once botocore does not
 # conflict with Sphinx:
 # extras = docs,test,gcp,aws,interactive,interactive_test
-extras = test,gcp,aws,interactive,interactive_test
+extras = test,gcp,aws,dataframe,interactive,interactive_test
 passenv = WORKSPACE
 commands =
   time {toxinidir}/scripts/run_dependency_check.sh
```
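For context, the `extras` key in a tox testenv maps directly to pip extras when tox installs the package under test, which is why adding `dataframe` to each environment above pulls in the pinned pandas. A minimal, hypothetical testenv showing the mechanism (the environment name and command are illustrative, not from this commit):

```ini
[testenv:py38-dataframe]
# tox installs the sdist under test as: pip install apache-beam[test,dataframe]
extras = test,dataframe
commands = pytest apache_beam/dataframe
```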

website/www/site/content/en/documentation/dsls/dataframes/overview.md (11 additions, 2 deletions)

```diff
@@ -30,9 +30,18 @@ The Beam DataFrame API is intended to provide access to a familiar programming i
 
 If you’re new to pandas DataFrames, you can get started by reading [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html), which shows you how to import and work with the `pandas` package. pandas is an open-source Python library for data manipulation and analysis. It provides data structures that simplify working with relational or labeled data. One of these data structures is the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which contains two-dimensional tabular data and provides labeled rows and columns for the data.
 
-## Using DataFrames
+## Pre-requisites
+
+To use Beam DataFrames, you need to install Beam python version 2.26.0 or higher (for complete setup instructions, see the [Apache Beam Python SDK Quickstart](https://beam.apache.org/get-started/quickstart-py/)) and a supported `pandas` version. In Beam 2.34.0 and newer the easiest way to do this is with the "dataframe" extra:
+
+```
+pip install apache_beam[dataframe]
+```
 
-To use Beam DataFrames, you need to install Apache Beam version 2.26.0 or higher (for complete setup instructions, see the [Apache Beam Python SDK Quickstart](https://beam.apache.org/get-started/quickstart-py/)) and pandas version 1.0 or higher. You can use DataFrames as shown in the following example, which reads New York City taxi data from a CSV file, performs a grouped aggregation, and writes the output back to CSV:
+Note that the _same_ `pandas` version should be installed on workers when executing DataFrame API pipelines on distributed runners. Reference [`base_image_requirements.txt`](https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt) for the Beam release you are using to see what version of `pandas` will be used by default on workers.
+
+## Using DataFrames
+You can use DataFrames as shown in the following example, which reads New York City taxi data from a CSV file, performs a grouped aggregation, and writes the output back to CSV:
 
 {{< highlight py >}}
 from apache_beam.dataframe.io import read_csv
```
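The taxi example that the updated docs reference continues past this hunk; only its first import is shown. As a hedged sketch of the grouped aggregation it describes, here is equivalent eager pandas logic (the data and column names are hypothetical; the Beam DataFrame API mirrors such expressions but defers execution to a Beam pipeline instead of computing immediately):

```python
import pandas as pd

# Hypothetical in-memory rides; the real example reads NYC taxi data from CSV.
rides = pd.DataFrame({
    'passenger_count': [1, 2, 1, 2],
    'fare_amount': [5.0, 7.5, 6.0, 8.5],
})

# Grouped aggregation: total fares per passenger count. The same
# groupby/sum expression works on a Beam deferred DataFrame.
totals = rides.groupby('passenger_count').sum()
print(totals.loc[1, 'fare_amount'])  # 11.0
print(totals.loc[2, 'fare_amount'])  # 16.0
```

This is also why the pandas version pin matters: the deferred implementation delegates to the installed pandas, so workers and the launching session should agree on the version.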
