Your first simple PySpark Script – Create and Run
In this post, we will see how you can create your first PySpark script and then run it in
batch mode.
Many people use notebooks such as Jupyter or Zeppelin; however, you may want to
create a PySpark script and run it on a schedule instead.
This is especially helpful if you want to run an ETL-like process with PySpark on a
fixed schedule.
How to write a PySpark Script
Let's create a simple PySpark script that reads data from a path and writes the
first 10 records to HDFS. The script will also show you how to create a dummy helper
function alongside the main function, call it from inside main, and print some
information along the way.
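Below is a minimal sketch of such a script. The input path, the output path, and the print_info helper are placeholder names used purely for illustration; adjust them to your own data and cluster.

# run_sample_pyspark.py
from pyspark.sql import SparkSession


def print_info(message):
    # Dummy helper function, called from main() to show how to split logic across functions
    print("INFO: {}".format(message))


def main():
    # Create (or reuse) the SparkSession for this application
    spark = SparkSession.builder.appName("run_sample_pyspark").getOrCreate()

    # Keep Spark's console output quiet; change to INFO, DEBUG or WARN if needed
    spark.sparkContext.setLogLevel("ERROR")

    print_info("Reading source data")
    # Placeholder input path -- replace with your actual source
    df = spark.read.csv("/data/input/sample.csv", header=True, inferSchema=True)

    print_info("Writing the first 10 records to HDFS")
    # Placeholder output path on HDFS
    df.limit(10).write.mode("overwrite").parquet("/data/output/sample_top10")

    print_info("Job finished")
    spark.stop()


if __name__ == "__main__":
    main()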
Save the file as "run_sample_pyspark.py".
How to run a PySpark Script
You can run the PySpark script using spark-submit, the tool that submits PySpark
applications to the cluster. You may also want to create a dedicated log file for
each script execution. Use the command below to run the PySpark script we created
above on the cluster.
spark-submit <filename>
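For example, here is a sketch assuming the file name used earlier; the output redirection is what creates the log file, and the trailing & sends the run to the background:

nohup spark-submit run_sample_pyspark.py > run_sample_pyspark.log 2>&1 &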
The above statement runs the PySpark script in the background by calling spark-
submit. It also creates a log file in which you can see all the print statement output
along with other Spark log information. We set the logging level to ERROR in the
script above; you can change it to INFO, DEBUG, or WARN as well.
You can also pass parameters to the spark-submit command and set Spark-level
configuration as command-line arguments. Below is one sample example of how to
execute the PySpark script with such options.
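The invocation below is only a sketch; the YARN master, executor settings, and the trailing date argument are illustrative values, not requirements. Anything placed after the script name is passed to the script itself and can be read inside it via sys.argv.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=2 \
  run_sample_pyspark.py 2023-01-01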
As part of this post, I wanted to show you how easily you can create your first PySpark
script and run it on the cluster.
Summary
We saw how easy it is to create a PySpark script, how to define multiple functions in
the same script, and how to call one function from another. You could put all the logic
into a single "main" method, though I would not encourage you to do so.
To execute the PySpark script, you pass it to spark-submit, which takes care of running
the logic on the cluster. You can also create a dedicated log file for each run, for easy
reference and debugging at a later time.