
ZEPPELIN-160 Working with provided Spark, Hadoop. #244

Closed
Leemoonsoo wants to merge 32 commits into apache:master from
Leemoonsoo:spark_provided

Conversation

@Leemoonsoo
Member

Zeppelin currently embeds all Spark dependencies under interpreter/spark and loads them at runtime.

This is useful because a user can try Zeppelin + Spark in local mode without installing or configuring Spark.

However, when a user already has a Spark and Hadoop installation, it is much more convenient to simply point Zeppelin at it, instead of building Zeppelin against a specific combination of Spark and Hadoop versions.

This PR adds the ability to use an external Spark and Hadoop installation by doing the following:

  • The spark-dependencies module packages the Spark/Hadoop dependencies under interpreter/spark/dep, to support local mode (the current behavior).
  • When SPARK_HOME and HADOOP_HOME are defined, bin/interpreter.sh excludes interpreter/spark/dep from the classpath and puts the system-installed Spark and Hadoop on the classpath instead.

This patch makes the Zeppelin binary independent of the Spark version. Once Zeppelin is built, SPARK_HOME can point to any version of Spark.

@Leemoonsoo
Member Author

Here's a summary of the changes made by this patch.

Add spark-dependencies submodule

A spark-dependencies Maven submodule is created. It is responsible for copying all Spark/Hadoop dependencies under interpreter/spark/dep.

The Spark/Hadoop dependencies in the spark Maven submodule are set to provided scope; at runtime they are loaded from either interpreter/spark/dep or SPARK_HOME/HADOOP_HOME.

bin/interpreter.sh

bin/interpreter.sh checks whether SPARK_HOME and HADOOP_HOME are defined.
If they are not defined, it adds interpreter/spark/dep to the classpath.
If they are defined, it instead adds directories from SPARK_HOME and HADOOP_HOME to the classpath.
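The branching described above can be sketched as a small shell function. This is a hedged sketch only, not the actual bin/interpreter.sh; the jar locations under SPARK_HOME and HADOOP_HOME are illustrative assumptions.

```shell
# Sketch of the classpath selection. Directory layouts under
# SPARK_HOME/HADOOP_HOME are illustrative assumptions, not the real script.
select_classpath() {
  local spark_home="$1" hadoop_home="$2" cp
  if [[ -z "${spark_home}" ]]; then
    # No external Spark: fall back to the bundled dependencies (local mode).
    cp="interpreter/spark/dep/*"
  else
    # External Spark: use its jars instead of the bundled ones.
    cp="${spark_home}/lib/*"
    if [[ -n "${hadoop_home}" ]]; then
      cp="${cp}:${hadoop_home}/share/hadoop/common/*"
    fi
  fi
  echo "${cp}"
}
```

Called with no homes it falls back to the bundled dependencies; given a Spark home (and optionally a Hadoop home) it builds the classpath from the system installation instead.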

It also searches for spark-*.conf files under SPARK_HOME/conf and automatically adds their settings to ZEPPELIN_JAVA_OPTS.
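The spark-*.conf handling might look roughly like this. This is a hedged sketch: Spark conf files such as spark-defaults.conf contain whitespace-separated "key value" lines, and the real script may parse them differently.

```shell
# Sketch: convert "key value" lines of a Spark conf file into -Dkey=value
# Java options; comment and blank lines are skipped. Illustrative only.
conf_to_java_opts() {
  local opts="" key value
  while read -r key value; do
    # Skip blank lines and comments.
    [[ -z "${key}" || "${key}" == \#* ]] && continue
    opts="${opts} -D${key}=${value}"
  done < "$1"
  # Strip the leading space before printing.
  echo "${opts# }"
}
```

The resulting string could then be appended to ZEPPELIN_JAVA_OPTS before the interpreter JVM is launched.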

Remove use of travis-install.sh from .travis

While travis-install.sh reduces log output, it causes a problem:
when a build hangs for some reason, Travis terminates the build container before travis-install.sh can detect the error and print it. That makes debugging very hard.

This is ready to review.

Member


Would you elaborate on why we would need to do this? Curious: doesn't SQLContext always have a sql() method?

Member Author


It's because the method signature is not exactly the same:
version 1.3 and later has def sql(sqlText: String): DataFrame, while version 1.2 and earlier has def sql(sqlText: String): SchemaRDD.

Member


Good catch! Maybe it is worth adding a comment about this to the source code itself, for the history?

@felixcheung
Member

Looks good. This is a very important change; thanks for making it.

@bzz
Member

bzz commented Aug 25, 2015

Thank you for a really cool feature!

Quick question:

This patch makes the Zeppelin binary independent of the Spark version. Once Zeppelin is built, SPARK_HOME can point to any version of Spark.

This is true only for Spark, though, not for Hadoop; am I right?

It's a bit unclear whether both SPARK_HOME and HADOOP_HOME are required to use this mode, or whether SPARK_HOME alone is enough.

Member


Why not

if [[ -z "${PYTHONPATH}" ]]; then

@Leemoonsoo
Member Author

It's a bit unclear whether both SPARK_HOME and HADOOP_HOME are required to use this mode, or whether SPARK_HOME alone is enough.

It depends on the Spark distribution.
If the system-provided Spark is "pre-built with user-provided Hadoop", then both SPARK_HOME and HADOOP_HOME are required. Otherwise, SPARK_HOME alone is enough.
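For example, in conf/zeppelin-env.sh (the paths below are purely illustrative):

```shell
# Hypothetical paths. A Spark build that bundles Hadoop needs only SPARK_HOME:
export SPARK_HOME=/opt/spark-1.4.1-bin-hadoop2.6

# A "pre-built with user-provided Hadoop" distribution needs both:
# export SPARK_HOME=/opt/spark-1.4.1-bin-without-hadoop
# export HADOOP_HOME=/opt/hadoop-2.6.0
```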

@Leemoonsoo Leemoonsoo force-pushed the spark_provided branch 2 times, most recently from 7c1745f to 57b3f96 on August 31, 2015.
@Leemoonsoo
Member Author

I have rebased to resolve the merge conflict.
Merging if there are no more discussions.

@bzz
Member

bzz commented Sep 1, 2015

👍

@asfgit asfgit closed this in 5de01c6 Sep 1, 2015
Leemoonsoo added a commit to Leemoonsoo/zeppelin that referenced this pull request Sep 17, 2015

Author: Lee moon soo <[email protected]>

Closes apache#244 from Leemoonsoo/spark_provided and squashes the following commits:

654c378 [Lee moon soo] use consistant, simpler expressions
57b3f96 [Lee moon soo] Add comment
eb4ec09 [Lee moon soo] fix reading spark-*.conf file
bacfd93 [Lee moon soo] Update readme
3a88c77 [Lee moon soo] Test use explicitly %spark
5a17d9c [Lee moon soo] Call sqlContext.sql using reflection
615c395 [Lee moon soo] get correct method
0c28561 [Lee moon soo] call listenerBus() using reflection
62b8c45 [Lee moon soo] Print all logs
5edb6fd [Lee moon soo] Use reflection to call addListener
af7a925 [Lee moon soo] add pyspark flag
5f8a734 [Lee moon soo] test -> package
a0150cf [Lee moon soo] not use travis-install for mvn test
cd4519c [Lee moon soo] try sys.stdout.write instead of print
6304180 [Lee moon soo] enable 1.2.x test
797c0e2 [Lee moon soo] enable 1.3.x test
8de7add [Lee moon soo] trying to find why travis is not closing the test
cf0a61e [Lee moon soo] rm -rf only interpreter directory instead of mvn clean
2606c04 [Lee moon soo] bringing travis-install.sh back
df8f0ba [Lee moon soo] test more efficiently
9d6b40f [Lee moon soo] Update .travis
2ca3d95 [Lee moon soo] set SPARK_HOME
2a61ecd [Lee moon soo] Clear interpreter directory on mvn clean
f1e8789 [Lee moon soo] update travis config
9e812e7 [Lee moon soo] Use reflection not to use import org.apache.spark.scheduler.Stage
c3d96c1 [Lee moon soo] Handle ZEPPELIN_CLASSPATH proper way
0f9598b [Lee moon soo] py4j version as a property
1b7f951 [Lee moon soo] Add dependency for compile and test
b1d62a5 [Lee moon soo] Add scala-library in test scope
c49be62 [Lee moon soo] Add hadoop jar and spark jar from HADOOP_HOME, SPARK_HOME when they are defined
2052aa3 [Lee moon soo] Load interpreter/spark/dep only when SPARK_HOME is undefined
54fdf0d [Lee moon soo] Separate spark-dependency into submodule

(cherry picked from commit 5de01c6)
Signed-off-by: Lee moon soo <[email protected]>
lelou6666 pushed a commit to lelou6666/incubator-zeppelin that referenced this pull request Mar 25, 2016