[ZEPPELIN-1639] Add tests with external python dependencies to CI build #1632
agoodm wants to merge 10 commits into apache:master
Conversation
ping @bzz @felixcheung @Leemoonsoo
.travis.yml (outdated)

    # Test python/pyspark with python2
    - jdk: "oraclejdk7"
      env: PYTHON="2.7" SCALA_VER="2.10" SPARK_VER="1.6.1" HADOOP_VER="2.3" PROFILE="-Pspark-1.6 -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -pl spark,python -am -DskipTests -DskipRat" TEST_FLAG="verify -DskipRat" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-display,spark-dependencies,spark,python -Dtest=org.apache.zeppelin.spark.PySpark*Test,org.apache.zeppelin.python.* -Dpyspark.test.exclude='' -DfailIfNoTests=false"
I think maybe we could run only a small subset of this list
-pl zeppelin-interpreter,zeppelin-display,spark-dependencies,spark,python
or perhaps we could have one of python in the test mix of the spark one above?
Though I listed all of the projects which the spark and python depend on, the only actual tests that are being run in this build profile are Pyspark as well as all of the Python interpreter tests.
Essentially, the purpose of this profile is to run only the python-specific tests, and to do so for both python 2 and 3. Since we have two pure spark testing profiles, bundling the python tests into them would mean doubling those profiles to cover both python 2 and 3, which would be wasteful since the non-python tests would then run multiple times for no benefit. Am I misunderstanding, or did you mean something else?
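The flag interplay being discussed can be made concrete by recomposing the profile's env variables into the effective Maven call (a sketch; values are copied from the env line above, and CI's actual wrapper scripts may assemble the command differently):

```shell
# Recompose TEST_FLAG/TEST_PROJECTS into the Maven command that CI effectively
# runs. -pl limits which modules Maven builds, while -Dtest limits which test
# classes actually run, so only the PySpark and python interpreter tests execute.
TEST_FLAG="verify -DskipRat"
TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-display,spark-dependencies,spark,python -Dtest=org.apache.zeppelin.spark.PySpark*Test,org.apache.zeppelin.python.* -Dpyspark.test.exclude='' -DfailIfNoTests=false"
echo "mvn $TEST_FLAG $TEST_PROJECTS"
```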
I was thinking the existing spark profile for spark 2.0.0 (or the latest) could also run the Python 3 tests?
Basically, I think we are finding that additional test profiles in the matrix increase test run time and decrease overall stability, so the fewer profiles we add, the better.
Ah, I see. I think that's a bit of a tough call. On one hand, it makes sense because we originally designated that profile for "testing all modules", so it would be logical to include these additional tests too, and it saves a little time by reducing the number of builds by one. On the other hand, it also means we can't guarantee an "apples to apples" comparison with the python 2 test, so to speak, in case a test failure happens prior to the python tests in the build. Given the general flakiness of the recent builds, though, your suggestion seems like the right way to go. Do you have any thoughts on this, @bzz or @Leemoonsoo ?
    conda update -q conda
    conda info -a
    conda config --add channels conda-forge
    conda install -q matplotlib pandasql
Where do they go in the install? Is this something where we could leverage Travis's container-based build caching to avoid redownloading them every time?
In this case, conda installs packages to $HOME/miniconda/lib/pythonX.Y/site-packages. Note that these dependencies are binary packages, so the total download plus installation time is very short. I think this would be worth looking into for the R packages though as those take several minutes to download, compile, and install.
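A quick way to confirm that layout is to ask the active interpreter where its own package directory lives (a sketch; run inside the miniconda environment, where the path would follow the $HOME/miniconda layout described above):

```shell
# Print where the current python places conda/pip-installed packages
# (the "purelib" path, e.g. .../lib/pythonX.Y/site-packages on Linux).
python3 -c "import sysconfig; print(sysconfig.get_paths()['purelib'])"
```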
Even so, if there are opportunities to optimize or to avoid a download, we should try to cache it. I think we are finding that any download from Travis can be a source of fragility, whether from a network issue on Travis's side or the target host being temporarily unavailable, etc.
Fair enough. It might be slightly trickier to do this here than with native python projects since we would need to cache two different installation environments, one for each python version. I'll see what I can do.
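If caching were pursued, one way to keep the two environments from colliding would be a per-version cache directory, along these lines (hypothetical .travis.yml fragment; the directory names are assumptions, not taken from the PR):

```yaml
cache:
  directories:
    # one cache entry per interpreter line, so the python 2 and python 3
    # profiles each restore their own miniconda install
    - $HOME/miniconda2
    - $HOME/miniconda3
```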
.travis.yml (outdated)

    # Test python/pyspark with python3
    - jdk: "oraclejdk7"
      env: PYTHON="3.5" SCALA_VER="2.10" SPARK_VER="1.6.1" HADOOP_VER="2.3" PROFILE="-Pspark-1.6 -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -pl spark,python -am -DskipTests -DskipRat" TEST_FLAG="verify -DskipRat" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-display,spark-dependencies,spark,python -Dtest=org.apache.zeppelin.spark.PySpark*Test,org.apache.zeppelin.python.* -Dpyspark.test.exclude='' -DfailIfNoTests=false"
Below, it looks like we are always downloading the latest conda, so if Python 3.6 is released this PYTHON="3.5" would not match. Is that going to be an issue?
No, the PYTHON variable is simply used to determine whether to use Miniconda2 or Miniconda3. I merely labeled it as 3.5 to be specific, since that is the actual version of python being used in this case. I should probably just set this to 2 or 3 so it won't confuse anyone.
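The selection logic described here might look like the following sketch (the installer filename pattern is an assumption, not copied from the PR):

```shell
# Map the PYTHON env value to a Miniconda flavor: only the major version
# matters, since the "latest" installer is downloaded either way.
miniconda_flavor() {
  case "$1" in
    2*) echo "Miniconda2" ;;
    *)  echo "Miniconda3" ;;
  esac
}
miniconda_flavor "3.5"   # Miniconda3
```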
Thank you for the review @felixcheung, see my responses. Although this is beyond the original scope of the PR, I have also added the R dependencies to the cache, which could save several minutes for most of the build profiles.
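The conditional-install idea behind this PR's install_external_dependencies.sh could be sketched roughly as follows (hypothetical; the real script may branch differently, and the profile patterns and package lists here are illustrative only):

```shell
# Decide which dependency sets a build profile needs, keyed off the same
# env variables the .travis.yml profiles export: PYTHON is only set for the
# python profiles, and an R profile flag marks the profiles needing R packages.
needed_deps() {
  # $1: PYTHON value (empty when the profile runs no python tests)
  # $2: PROFILE value
  [ -n "$1" ] && echo "conda: matplotlib pandasql"
  case "$2" in *-Pr*|*-Psparkr*) echo "r packages" ;; esac
  return 0
}
needed_deps "2.7" "-Pspark-1.6 -Phadoop-2.3 -Ppyspark"   # conda: matplotlib pandasql
```

Skipping the R install for python-only profiles is where the several minutes are saved, since the R packages compile from source.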
@felixcheung so I have tested caching the miniconda directory and found that it actually slows down the build by 1-2 minutes on average. From what I can tell, the other big python projects are not fully caching their conda installations either, at least not the dependencies themselves. I think the only compelling reason to cache the miniconda installation at this point is to prevent the build from failing due to download errors. In my experience this was very rare until recently, and the errors always happen after conda gets updated. This makes me think the problem is most likely a bug in the latest version of conda, so I will test some more with an older version of conda and see how stable the build is.
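Pinning conda, as suggested, would be a small change to the install step (hypothetical .travis.yml fragment; the version to pin is a placeholder, not taken from the PR):

```yaml
before_install:
  # install a known-good conda instead of `conda update -q conda`,
  # to rule out regressions in the newest conda release
  - conda install -q conda=<known-good-version>
```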
That's great. Thanks for looking into it.
(force-pushed 8f88213 to ad71f5a)
@agoodm thank you, looks great! There have been some changes in …
(force-pushed 2ed2305 to a308e7e)
@bzz done.
@agoodm as CI is failing, could you please rebase on latest master (it has some CI improvements) and then open/close the PR to trigger a new CI run here?
(force-pushed a308e7e to 9ce982a)
I think the CI failures were due to me somehow inadvertently removing -Ppyspark from the build profile. Let's see how it goes now.
So the pyspark cluster test has two failures. One is a result of matplotlib always printing a warning message when it is first loaded, and the other is how python 3 handles unicode. For these reasons I think we should either move the python 3 tests to another profile which doesn't perform this test, or keep them in a separate profile as I had already done before. What do you think?
    # Test python/pyspark with python 2
    - jdk: "oraclejdk7"
      env: PYTHON="2" SCALA_VER="2.10" SPARK_VER="1.6.1" HADOOP_VER="2.3" PROFILE="-Pspark-1.6 -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -pl spark,python -am -DskipTests -DskipRat" TEST_FLAG="verify -DskipRat" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-display,spark-dependencies,spark,python -Dtest=org.apache.zeppelin.spark.PySpark*Test,org.apache.zeppelin.python.* -Dpyspark.test.exclude='' -DfailIfNoTests=false"
Did not know that -Dtest= overrides -DskipTests
Thank you for the CI failure assessment @agoodm! What do you think, how crucial are those tests? Maybe just documenting how to run them manually could be an option? Generally, we have tried to avoid an exploding number of CI profiles to keep the CI build time minimal, but if more profiles are not going to cost substantial build time, then it definitely could be an option. In that case, either of the suggested solutions sounds good to me.
To be honest, I don't see it being that big of a deal. Both profiles take about 5 minutes each, which I think is fairly short, considering that until recently we were spending 4 minutes just installing the R dependencies for all of the other profiles. Additionally, this makes it much easier to test everything for both python 2 and 3, which can be quite a pain to do manually with conda / virtualenv (e.g., fiddling with the PATH before running tests for each case). Also, matplotlib and pandas are most likely used in a large percentage of python / pyspark use cases in Zeppelin, so I think having that integration tested would be really useful.
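The PATH fiddling mentioned here, for anyone trying the manual route, amounts to something like this sketch (the env names and miniconda path are hypothetical):

```shell
# Point `python` at one conda env or the other before running the tests;
# this is the per-version switching the two CI profiles automate.
use_conda_env() {
  CONDA_ENV_BIN="$HOME/miniconda/envs/$1/bin"
  PATH="$CONDA_ENV_BIN:$PATH"
  export PATH
}
use_conda_env "zeppelin-py3"
# a subsequent `mvn verify ...` would now pick up python from the py3 env
```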
Thanks for spending more time investigating this. I'd agree with your assessment.
|
.travis.yml (outdated)

    env: SCALA_VER="2.11" PROFILE="-Prat" BUILD_FLAG="clean" TEST_FLAG="org.apache.rat:apache-rat-plugin:check" TEST_PROJECTS=""

    - # Test all modules with spark 2.0.0 and scala 2.11
    + # Test all modules with spark 2.0.0 and scala 2.11 and python 3
.travis.yml (outdated)

    # Test python/pyspark with python 3
    - jdk: "oraclejdk7"
      env: PYTHON="3" SCALA_VER="2.10" SPARK_VER="1.6.1" HADOOP_VER="2.3" PROFILE="-Pspark-1.6 -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -pl spark,python -am -DskipTests -DskipRat" TEST_FLAG="verify -DskipRat" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-display,spark-dependencies,spark,python -Dtest=org.apache.zeppelin.spark.PySpark*Test,org.apache.zeppelin.python.* -Dpyspark.test.exclude='' -DfailIfNoTests=false"
was there a reason Python tests are building with Spark 1.6.1 instead of 2.0.0? Perhaps it is better to stay with the latest?
There is no particular reason; any spark version should work. I simply copied/pasted another profile when I began writing these.
I see. Perhaps one could stay at 1.6.1, and another (like the Python 3 one) could change to 2.0.0? What do you think?
Great! A few minor comments (looks like you addressed one already!). So to recap, we are able to cache R but not miniconda, correct? Any other comments? Merging when CI completes.
Correct, the main reason being that caching it actually increases the build time. I still have one slight nitpick myself, but I'll leave it up to others whether it's worth pursuing right now: the newly added docker and conda tests have increased the build time for the python profiles by a good bit, so perhaps they should be excluded somehow, since testing them again here has no benefit.
I see. I think generally, if we could isolate test sets so that they don't overlap (i.e. python tests not running docker tests), we would likely have more stable CI runs. I think that's really our main concern; CI stability is high priority since it affects everyone. But if CI passes for these last 2 runs, then I don't mind either way :)
Oh for the love of... looks like I will need to update the cache again. Let me run this one more time.
Looks like a transient network error, but the python 3 test failed. Could you rebase and kick this off again?
(force-pushed 833dbf5 to 363019e)
@felixcheung so I looked at the travis log more carefully, and I am pretty sure the main thing I was missing was an additional "-Pscala-2.11" in the maven build profile. Hopefully it will finally be green this time.
The final remaining travis error is unrelated; it seems to be occurring with other recent PRs (https://travis-ci.org/apache/zeppelin/builds/179341529). Perhaps we should focus on making the logs less verbose, since this profile already produces a log file bigger than 3 MB on master. In any case, is this good to finally merge?
I think it is mostly the test output that pollutes the CI logs... So, if I read the TravisCI logs right with the 2 new profiles, it looks great to me; merging to master if there is no further discussion!
What is this PR for?
Take 2 of #1618, because I had some earlier problems with rebasing. Since then, I have added some new features, namely:

- install_external_dependencies.sh, which conditionally installs the R or python dependencies based on the specified build profile in .travis.yml. This saves several minutes for a few of the build profiles, since the R dependencies are compiled from source and therefore take quite a bit of time to install.
- The tests with external python dependencies (matplotlib and pandas) are now relegated to two separate build profiles. This is done primarily to efficiently test both Python 2 and 3.

What type of PR is it?
Improvement and Bugfix
What is the Jira issue?
ZEPPELIN-1639
How should this be tested?
CI tests should be green!
Questions: