COMPUTER SCIENCE DEPARTMENT, UNIVERSITY OF BONN
Lab Distributed Big Data Analytics
Worksheet-3: ML on Spark (Spark ML and BigDL)
Dr. Hajira Jabeen, Gezim Sejdiu, Denis Lukovnikov, Prof. Dr. Jens Lehmann
April 25, 2019
In this lab we are going to perform basic Spark ML and BigDL operations (described in “Spark Fundamentals II (ML on Spark)”).
IN CLASS
- Setup
- Download Spark 2.2, unpack to /opt/spark (or anywhere)
- Set SPARK_HOME var to /opt/spark (or where it was unpacked to)
- Download BigDL 0.7, unpack anywhere
- Set BIGDL_HOME var to unpacked BigDL directory
- do pip install bigdl==0.7 somewhere
- download https://gist.github.com/lukovnikov/461d1165ea04317d2be6b66995ffa73c
- start jupyter using the script (must be marked as executable)
- Implement PySpark-BigDL dummy linreg notebook.
- Implement PySpark-BigDL mnist notebook.
- Implement PySpark-BigDL mnist cnn notebook.
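Before opening the notebooks, it can help to confirm that the setup steps above took effect. The following is a minimal sketch, not part of the worksheet: the function name check_setup and the script file name start_jupyter.sh are hypothetical (use whatever name the downloaded gist script was saved under). It only checks that SPARK_HOME and BIGDL_HOME point to existing directories and that the start script is marked as executable.

```python
import os


def check_setup(env, script_path):
    """Return a list of problems found; an empty list means all checks passed.

    env         -- a mapping of environment variables (e.g. os.environ)
    script_path -- path to the Jupyter start script (hypothetical name below)
    """
    problems = []

    # Both variables must be set and must point to an existing directory.
    for var in ("SPARK_HOME", "BIGDL_HOME"):
        value = env.get(var)
        if not value:
            problems.append(f"{var} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{var} points to a missing directory: {value}")

    # The start script must exist and be marked as executable.
    if not os.path.isfile(script_path):
        problems.append(f"start script not found: {script_path}")
    elif not os.access(script_path, os.X_OK):
        problems.append(f"start script is not executable: {script_path}")

    return problems


if __name__ == "__main__":
    # "start_jupyter.sh" is an assumed file name, not taken from the worksheet.
    found = check_setup(os.environ, "start_jupyter.sh")
    for problem in found:
        print("PROBLEM:", problem)
    if not found:
        print("Setup looks fine.")
```

If the script is reported as not executable, marking it with chmod +x and rerunning the check should clear that message.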
AT HOME
- Reading:
- Read “Pattern Recognition and Machine Learning” by Bishop
- Read “Deep Learning” by Goodfellow, Bengio, and Courville (or check some blog posts/tutorials)
- Check out the MLlib programming guide
- Read the BigDL whitepaper
- Check out the BigDL programming guide
- Check out the BigDL tutorials in Python (https://github.com/intel-analytics/BigDL-Tutorials/)
- Complete the notebooks
- Convert the mnist_cnn notebook to use MLlib’s Pipeline API