
Your first simple PySpark Script – Create and Run

In this post, we will see how you can create your first PySpark script and then run it in
batch mode.

Many people use notebooks like Jupyter or Zeppelin; however, you may want to create a PySpark script and run it on a schedule.

This is especially helpful if you want to run an ETL-like process with PySpark on a fixed schedule.

How to write a PySpark Script


Let's create a simple PySpark script that reads data from some path and writes the first 10 records into HDFS. The script also shows how to create a dummy helper function alongside the main function, call it from inside the main function, and print some information along the way.
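Below is a minimal sketch of such a script. The input path, the CSV format, and the HDFS output location are placeholder assumptions; adjust them for your environment.

# run_sample_pyspark.py
from pyspark.sql import SparkSession

def print_app_info(app_name):
    # Dummy helper function, called from main() below, to show how
    # one function in the script can call another.
    print("Starting application: " + app_name)

def main():
    app_name = "run_sample_pyspark"
    print_app_info(app_name)

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    # Keep Spark's console output quiet; INFO, DEBUG, or WARN also work.
    spark.sparkContext.setLogLevel("ERROR")

    # Hypothetical input path; replace with your actual source.
    df = spark.read.csv("/data/input/sample.csv", header=True, inferSchema=True)
    print("Records read: " + str(df.count()))

    # Write the first 10 records to a hypothetical HDFS location.
    df.limit(10).write.mode("overwrite").csv("hdfs:///user/output/sample_top10")

    spark.stop()

if __name__ == "__main__":
    main()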

Save the file as "run_sample_pyspark.py"

How to run a PySpark Script

You can run the PySpark script using spark-submit, which submits PySpark applications to the cluster. You may also want to create a dedicated log file for each script execution. Use the command below to run the PySpark script we created above on the cluster.

nohup spark-submit run_sample_pyspark.py > run_sample_pyspark.log 2>&1 &
The above statement runs the PySpark script in the background by calling spark-submit, and redirects output to a log file in which you can see all the print statement output and other Spark log info. We set the logging level to ERROR in the script above; you can change it to INFO, DEBUG, or WARN as well.
You can also pass parameters in the spark-submit command and set Spark-level configuration as command-line arguments. Below is one sample of how to execute the PySpark script this way.
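As an illustration only (the exact flags and the date argument are assumptions, not part of the script above), a parameterized invocation might look like this:

nohup spark-submit --master yarn --conf spark.executor.memory=2g run_sample_pyspark.py 2024-01-01 > run_sample_pyspark.log 2>&1 &

Here --conf sets a Spark property on the command line, and the trailing date argument would be read inside the script via sys.argv.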

As part of this post, I wanted to show you how easily you can create your first PySpark script and run it on the cluster.

Summary
We saw how easy it is to create a PySpark script. We also saw how you can create multiple functions in the same script and call one from another. You could create a single "main" method and put all the logic in it, though I would not encourage you to do so.

To execute a PySpark script, you pass it to spark-submit, which takes care of running the logic on the cluster. You can also create a dedicated log file for each run, for easy reference and debugging at a later time.
