[SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py file. #20089
ueshin wants to merge 5 commits into apache:master from
Conversation
Btw, should we add
Test build #85424 has finished for PR 20089 at commit
Yea, I think we could. I added the support and tested it before - SPARK-19019. I think it's okay to add it; they are just metadata AFAIK.

@HyukjinKwon Thanks! I'll add it soon.
HyukjinKwon left a comment:

LGTM
Not a big deal but I know one more place we might also update - https://github.com/apache/spark/blob/master/python/README.md#python-requirements
@HyukjinKwon I'll update it as well.
Test build #85425 has finished for PR 20089 at commit
Test build #85426 has finished for PR 20089 at commit
python/README.md (Outdated)

```diff
 ## Python Requirements

-At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy, pandas, and pyarrow).
```
This sounds mandatory, but I think pyarrow is still an optional choice. Right?
Yea, Pandas and PyArrow are optional. Maybe it's nicer if we have some more details here too.
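Since Pandas and PyArrow are optional, libraries typically import them lazily and only fail when the optional feature is actually used. Below is a minimal sketch of that common pattern, assuming a hypothetical helper name `require_arrow`; it is an illustration, not Spark's actual code:

```python
# Sketch of the optional-dependency pattern (illustration only, not Spark's
# actual code; the helper name `require_arrow` is hypothetical).
def require_arrow():
    """Return the pyarrow module, or raise a clear error if it is missing."""
    try:
        import pyarrow  # optional: only needed for Arrow-based features
    except ImportError as e:
        raise ImportError(
            "pyarrow is required for Arrow-based features; "
            "install it, e.g. via the package's 'sql' extra") from e
    return pyarrow
```

This way, users who never touch Arrow-based features are unaffected, while users who do get an actionable error message instead of a bare ImportError deep inside library code.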
I added some more details. WDYT?
```diff
     'ml': ['numpy>=1.7'],
     'mllib': ['numpy>=1.7'],
-    'sql': ['pandas>=0.19.2']
+    'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0']
```
If no pyarrow is installed, will setup force users to install it?
Nope, extras_require does not do anything in normal cases, but they can be installed together with an extras option via pip IIRC.
Test build #85432 has finished for PR 20089 at commit
python/README.md (Outdated)

```diff
 ## Python Requirements

-At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages might have their own requirements declared as "Extras" (including numpy, pandas, and pyarrow). You can install the requirements by specifying their extra names.
```
Ah, I see. How about simply something like:

> At its core PySpark depends on Py4J (currently version 0.10.6), but some additional sub-packages have their own extra requirements for some features (including numpy, pandas, and pyarrow).

for now? I just noticed we are a bit unclear on this (e.g., actually I have been under the impression that NumPy is required for ML/MLlib so far), but I think this roughly describes it correctly and is good enough.
Will maybe try to make a PR to fully describe the dependencies and related features later. This PR targets PyArrow anyway.
Not a big deal anyway. I am actually fine as is too if you prefer @ueshin.
Let's use the simple one you suggested and leave the detailed description for future PRs.
Still LGTM
Test build #85434 has finished for PR 20089 at commit
Merged to master.
What changes were proposed in this pull request?
This is a follow-up PR of #19884, updating the setup.py file to add the pyarrow dependency.
How was this patch tested?
Existing tests.