The NVIDIA RAPIDS Accelerator for Apache Spark software plug-in pioneered a zero code change user experience (UX) for GPU-accelerated data processing. It accelerates existing Apache Spark SQL and DataFrame-based applications on NVIDIA GPUs by over 9x without requiring a change to your queries or source code.
This led to the new Spark RAPIDS ML Python library, which can speed up applications that also invoke MLlib, Apache Spark’s scalable machine learning library, by over 100x.
Until recently, Spark RAPIDS ML’s MLlib acceleration still required one small code change to get Python to use the accelerated implementation: replacing pyspark.ml with spark_rapids_ml in the import statements for the ML classes you wished to accelerate. For example, to use accelerated KMeans instead of the baseline KMeans, you had to change the KMeans import accordingly throughout your code. On the plus side, no further code changes were needed to use the accelerated version of KMeans.
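Concretely, the previously required change looked like the following (the spark_rapids_ml package mirrors the pyspark.ml module layout, so KMeans lives in its clustering module):

# baseline CPU MLlib import
from pyspark.ml.clustering import KMeans

# previously required change for GPU acceleration
from spark_rapids_ml.clustering import KMeans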
In this blog post, we describe new functionality in Spark RAPIDS ML, available since the 25.02 release, that lets you skip even the import statement changes noted above, for a truly zero code change, end-to-end acceleration experience across Spark SQL, DataFrame, and MLlib code.
Zero code change MLlib acceleration
Consider the following simple PySpark application code:
from pyspark.ml.clustering import KMeans
from pyspark.ml.functions import array_to_vector
from pyspark.sql import SparkSession

# A SparkSession is created automatically in pyspark shells and notebooks;
# creating one explicitly also makes the script runnable via spark-submit.
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/embedding_vectors")
df = df.select(array_to_vector(df.features).alias("features"))
kmeans_estim = (
    KMeans()
    .setK(100)
    .setFeaturesCol("features")
    .setMaxIter(30)
)
kmeans_model = kmeans_estim.fit(df)
transformed = kmeans_model.transform(df)
transformed.write.parquet("/data/embedding_vectors_clusters")
This code reads a file of vector embeddings, previously computed using a deep learning language model and stored in parquet format as an array type column. It then uses the KMeans algorithm in Spark MLlib to cluster the vectors.
By combining the new zero code change functionality of Spark RAPIDS ML with the RAPIDS Accelerator for Apache Spark plug-in, you can accelerate this PySpark code end-to-end without any changes: the parquet decompression and decoding when reading the file in read.parquet(), the KMeans clustering numerical computations in fit() and transform(), and the encoding and compression when saving the clustered vectors to another parquet file in write.parquet().
Next, we describe how you can trigger accelerated execution using new variants of the familiar ways to launch Spark applications: command line interfaces (CLIs), Jupyter notebooks attached to on-premises Spark clusters, and Jupyter notebooks in cloud provider-hosted Spark services.
Command line interfaces
Suppose the example application code above is in a file called app.py. Conventionally, you’d use the well-known Spark CLI spark-submit to launch app.py on different types of clusters (local/test, standalone, yarn, kubernetes, etc.):
spark-submit <options> app.py
To accelerate the MLlib parts, after installing the Spark RAPIDS ML library via pip install spark-rapids-ml, you can simply replace the spark-submit command with its newly included accelerated CLI counterpart (while including the configs and classpath settings, as before, for SQL and DataFrame acceleration):
spark-rapids-submit <options> app.py
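As a rough sketch of what such a launch might look like on a yarn cluster (the jar path and version are placeholders, spark.plugins=com.nvidia.spark.SQLPlugin is the RAPIDS Accelerator’s plugin class, and real deployments also need the GPU resource configs described in the plug-in documentation):

spark-rapids-submit \
  --master yarn \
  --jars /path/to/rapids-4-spark_<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  <options> \
  app.py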
If you prefer to run code similar to app.py interactively in a PySpark shell using the CLI pyspark, you can accelerate this too, with zero code changes, by using the newly included counterpart CLI pyspark-rapids to launch an accelerated PySpark shell instead.
Jupyter notebooks: on-premise Spark clusters
Spark applications are also commonly run interactively in Jupyter notebooks running kernels attached to Spark clusters.
As explained in the Spark RAPIDS ML documentation, to get started on a workstation with an NVIDIA GPU, you can launch Jupyter with accelerated Spark in local mode using the pyspark-rapids command:
PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0' \
pyspark-rapids --master local[*] <options>
Then connect to the Jupyter notebook server at the logged URL, and run code similar to app.py interactively in one or more notebook cells.
You can also add the RAPIDS Accelerator for Apache Spark plug-in jar and the spark.plugins config for end-to-end acceleration, as sketched below.
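For instance, a local-mode Jupyter launch that also enables SQL and DataFrame acceleration might look like the following sketch (the jar path and version are placeholders for the RAPIDS Accelerator jar you downloaded):

PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0' \
pyspark-rapids --master local[*] \
  --jars /path/to/rapids-4-spark_<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  <options>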
With suitable settings for --master, you can then use the same command to enable zero code change notebook acceleration in other Spark cluster deployments (like standalone or yarn), as in the example below.
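For example, to attach the notebook to a standalone cluster, you might point --master at the cluster’s master URL (host and port here are placeholders), keeping the same environment variables as above:

PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0' \
pyspark-rapids --master spark://<master-host>:7077 <options>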
Jupyter notebooks: cloud provider-hosted Spark
For a zero code change UX in cloud provider-hosted Spark Jupyter notebooks, the Spark RAPIDS ML repo provides example initialization and bootstrap scripts to configure when launching GPU Spark clusters, enabling both SQL/DataFrame acceleration and MLlib acceleration. Examples are provided for Databricks, GCP Dataproc, and AWS EMR.
The init scripts inject simple modifications into the respective hosted Spark environments that result in Jupyter notebooks being launched with zero code change acceleration enabled.
How it works
The zero code change acceleration of Spark MLlib enabled by the above CLIs and Jupyter notebook deployments is powered under the hood by importing or running the new spark_rapids_ml.install module in the Spark RAPIDS ML Python library.
This new module is based heavily on similar functionality in the RAPIDS cudf.pandas Python package released at last year’s GTC, which brought a zero code change GPU-accelerated UX to users of the popular Pandas Data Analysis library.
Importing or running the new spark_rapids_ml.install module overrides Python’s module import mechanism to transparently redirect imports of pyspark.ml estimators in application code to their accelerated spark_rapids_ml counterparts, when available. One tricky aspect is avoiding this redirection when the imports originate from within PySpark or Spark RAPIDS ML code itself, as in those cases it’s crucial to import the actual pyspark.ml estimators.
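As a rough illustration of the general technique only (this is not the actual spark_rapids_ml.install implementation, which is more careful and includes the caller checks described above), a Python meta path finder can redirect submodule imports from one package to another roughly as follows:

# Illustrative sketch of import redirection; NOT the real spark_rapids_ml.install
# code, which additionally skips redirection for imports made from within
# pyspark or spark_rapids_ml itself.
import importlib
import importlib.abc
import importlib.util
import sys

class _AliasLoader(importlib.abc.Loader):
    """Expose an already-imported module object under a second name."""
    def __init__(self, module):
        self._module = module
    def create_module(self, spec):
        return self._module
    def exec_module(self, module):
        pass  # the aliased module was already executed on its first import

class _RedirectFinder(importlib.abc.MetaPathFinder):
    """Redirect imports of source submodules to target counterparts, if present."""
    def __init__(self, source="pyspark.ml", target="spark_rapids_ml"):
        self.source, self.target = source, target
    def find_spec(self, fullname, path=None, target=None):
        if not fullname.startswith(self.source + "."):
            return None  # not a pyspark.ml submodule: let the normal finders handle it
        redirected = self.target + fullname[len(self.source):]
        try:
            accelerated = importlib.import_module(redirected)
        except ImportError:
            return None  # no accelerated counterpart: fall back to stock pyspark.ml
        return importlib.util.spec_from_loader(fullname, _AliasLoader(accelerated))

# Installed before application imports, the finder makes statements like
# `from pyspark.ml.clustering import KMeans` resolve to the accelerated module.
sys.meta_path.insert(0, _RedirectFinder())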
Next steps
You can try out the new zero code change accelerated Spark MLlib functionality, which augments the original RAPIDS Accelerator for Apache Spark, by installing the spark-rapids-ml Python package, consulting the documentation in the Spark RAPIDS ML GitHub repo for zero code change CLIs and notebooks, and running the zero code change test script, also in the repo.