What’s Next for the Upcoming Apache Spark 4.0 Release?
Wenchen Fan (cloud-fan)
Xiao Li (gatorsmile)
Data + AI Summit 2024
©2024 Databricks Inc. — All rights reserved
Apache Spark 4.0
GA Major Features
• Spark Connect • ANSI Mode • Arrow optimized Python UDF • pandas 2 API parity • Structured Logging • Variant Data Types [WIP] • Python Data Source APIs • SQL UDF/UDTF [WIP]
• Python: applyInArrow, DF.toArrow, Python UDTF, Polymorphic Python UDTF, PySpark UDF Unified Profiling, UDF-level Dependency Control [WIP]
• SQL: Collation Support [WIP], Stored Procedures, Execute Immediate, View Evolution
• Streaming: Arbitrary Stateful Processing V2, State Data Source Reader, Streaming Python Data Sources, New Streaming Doc
• More: Error Class Enhancements, XML Connectors, Spark K8S operator, Java 21
Agenda
• New Functionalities: Spark Connect, ANSI Mode, Arbitrary Stateful Processing V2, Collation Support, Variant Data Types, pandas 2.x Support
• Extensions: Python Data Source APIs, XML/Databricks Connectors and DSV2 Extension, Delta 4.0
• Custom Functions and Procedures: SQL UDFs, SQL Scripting, Python UDTF, Arrow optimized Python UDF, PySpark UDF Unified Profiler
• Usability: Structured Logging Framework, Error Class Framework, Behavior Change Process
Spark Connect
How to embed Spark in applications?
Up until Spark 3.4: hard to support today's developer experience requirements.
[Diagram: modern data applications, IDEs/notebooks, and programming languages/SDKs all talk to Spark's monolithic driver, which bundles application logic, analyzer, optimizer, scheduler, and the distributed execution engine. Limitations: applications get SQL only, IDEs/notebooks must stay close to the REPL, and other languages/SDKs have no JVM interop.]
Connect to Spark from Any App
Thin client, with the full power of Apache Spark.
[Diagram: modern data applications, IDEs/notebooks, and programming languages/SDKs use the Spark Connect Client API to reach an application gateway in front of Spark's driver (analyzer, optimizer, scheduler) and the distributed execution engine.]
Spark Connect GA in Apache Spark 4.0
• Interactively develop & debug from your IDE: pip install pyspark>=3.4.0 and use it in your favorite IDE!
• New connectors and SDKs in any language (Scala 3, Go, Databricks Connect): check out, use & contribute to the Go client!
• Build interactive data applications: get started with our GitHub example!
The lightweight Spark Connect Package
• pip install pyspark-connect
• Pure Python library, no JVM.
• Pure Spark Connect client, not the entire PySpark package
• Only 1.5 MB (vs. 355 MB for full PySpark)
• Preferred if your application has fully migrated to the Spark Connect API.
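As a quick illustration, here is a minimal sketch of the slim client in action; the endpoint URL is a placeholder and assumes a Spark Connect server is already running.

from pyspark.sql import SparkSession

# Connect to a running Spark Connect server (placeholder endpoint).
spark = (
    SparkSession.builder
    .remote("sc://localhost:15002")
    .getOrCreate()
)
spark.range(5).show()
Python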
ANSI MODE
ON by default in 4.0
Migration to ANSI ON
Action: turn on ANSI mode now to catch (and fix) silent data corruption!
Without ANSI mode
Data Corruption!
With ANSI mode (3.5)
Error callsite is captured
With ANSI mode (4.0)
Error callsite is highlighted
With ANSI mode (4.0)
DataFrame queries with error callsite
(The error message highlights the culprit operation and its line number.)
DataFrame queries with error callsite
• PySpark support is on the way.
• Spark Connect support is on the way.
• Native notebook integration is on the way (so that you can see the highlighting).
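A minimal sketch of the behavior difference described above; the literal is illustrative.

# ANSI mode off: an invalid cast silently produces NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('not a number' AS INT)").show()

# ANSI mode on (the 4.0 default): the same cast raises an error that
# identifies the offending expression instead of corrupting the result.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('not a number' AS INT)").show()
Python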
Variant Data Type for Semi-
Structured Data
Motivation
Data Engineer's Dilemma: of Flexible, Fast, and Open, you can only pick 2 out of 3…
What if you could ingest JSON, maintain flexibility, boost performance, and use an open standard?
Variant is flexible
INSERT INTO variant_tbl (event_data)
VALUES
(
PARSE_JSON(
'{"level": "warning",
"message": "invalid request",
"user_agent": "Mozilla/5.0 ..."}'
)
);
SELECT
*
FROM
variant_tbl
WHERE
event_data:user_agent ilike '%Mozilla%';
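As a further sketch reusing the table and column above, individual fields can be pulled out of the variant with the colon (:) path syntax and cast to a concrete type when needed.

spark.sql("""
    SELECT
      CAST(event_data:level AS STRING)      AS level,
      CAST(event_data:user_agent AS STRING) AS user_agent
    FROM variant_tbl
""").show(truncate=False)
Python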
Performance
String Collation Support
ANSI SQL COLLATE
Sorting and comparing strings according to locale
● Associate columns, fields, and array elements with a collation of choice (see the sketch after this list)
○ Case insensitive
○ Accent insensitive
○ Locale aware
● Supported by many string functions such as
○ lower()/upper()
○ substr()
○ locate()
○ like
● GROUP BY, ORDER BY, comparisons, …
● Supported by Delta and Photon
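A minimal sketch of column-level collation (table and column names are illustrative); once declared on the column, the collation applies to comparisons, GROUP BY and ORDER BY without an explicit COLLATE in every query.

spark.sql("""
    CREATE TABLE customers (
      name STRING COLLATE UNICODE_CI
    )
""")

# 'anthony' and 'Anthony' compare equal and group together under UNICODE_CI.
spark.sql("""
    SELECT name, count(*) AS cnt
    FROM customers
    WHERE name = 'anthony'
    GROUP BY name
""").show()
Python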
A look at the default collation
A<Z<a<z<Ā
> SELECT name FROM names ORDER BY name;
name
Anthony
Bertha
anthony
bertha
Ānthōnī
SQL
Is this really what we want here?
COLLATE UNICODE
One size, fits most
> SELECT name
FROM names
ORDER BY name COLLATE unicode;
name
Ānthōnī
anthony
Anthony
bertha
Bertha
SQL
Root collation with decent sort order for most locales
COLLATE UNICODE_CI
Case insensitive comparisons have entered the chat
> SELECT name
FROM names
WHERE startswith(name COLLATE unicode_ci, 'a')
ORDER BY name COLLATE unicode_ci;
name
anthony
Anthony
SQL
Case insensitive is not accent insensitive: We lost Ānthōnī
COLLATE UNICODE_CI_AI
Equality from a to Ź
> SELECT name
FROM names
WHERE startswith(name COLLATE unicode_ci_ai, 'a')
ORDER BY name COLLATE unicode_ci_ai;
name
Ānthōnī
anthony
Anthony
SQL
100s of supported predefined collations across many locales
Streaming State Data Source
Stateful Stream Processing
Streaming State data source
• Allows you to inspect the internal states of streaming applications,
for debugging, profiling, testing, troubleshooting, etc.
• Allows you to manipulate the internal state as a quick workaround to recover from urgent issues.
• All with your familiar data source APIs.
State Reader API: state-metadata
High-level API
State Reader API: statestore
Granular API
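A minimal sketch of both readers (the checkpoint path is a placeholder): "state-metadata" lists the stateful operators and batches recorded in a checkpoint, while "statestore" reads the key/value state rows themselves.

checkpoint = "/tmp/checkpoints/my-streaming-query"

# High-level view: which stateful operators and batches exist in the checkpoint.
spark.read.format("state-metadata").load(checkpoint).show()

# Granular view: the key/value state rows of the stateful operator.
spark.read.format("statestore").load(checkpoint).show()
Python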
Arbitrary Stateful Processing V2
(flat)MapGroupsWithState: current V1 version
• Supports a single user-defined state object per grouping key.
• The state object can be updated while evaluating the current group, and the updated value will be available in the next trigger.

val ds = spark.readStream.json(path)
  .as[CreditCardTransaction]

ds.groupByKey(_.cardId)
  .flatMapGroupsWithState[
    CreditCardTransactionState,
    CreditCardTransaction
  ](
    OutputMode.Append(),
    GroupStateTimeout.NoTimeout())(
    (_, txns, groupState) => {
      // read state, compute new average, and save to state
      ...
    }
  )
Scala
Limitations of (flat)MapGroupsWithState
Existing API
• Lack of data modelling flexibility: prevents users from splitting state (for a grouping key) into multiple logical instances that can be read/updated independently.
• Lack of composite types: values stored in GroupState are single types and cannot efficiently support data structures like List, Map, etc.; the current approach requires users to read/update the entire data structure.
• Lack of state eviction support: no support for eventual state cleanup using TTL.
• Lack of state schema evolution: no support for changes to the state schema once the streaming query has started.
transformWithState: the V2 version
Layered, Flexible,
Extensible State API
pandas 2 support/API parity
Pandas API on Spark (Koalas)
pandas:
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
Python

Pandas API on Spark:
import pyspark.pandas as ps
df = ps.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
Python
Pandas 2.x Support
API parity with pandas 2.2.2
• Backwards-incompatible API changes from pandas 2.0.0 and 2.1.0 are reflected
• See SPARK-44101 and the Spark migration guide
Agenda
• New Functionalities: Spark Connect, ANSI Mode, Arbitrary Stateful Processing V2, Collation Support, Variant Data Types, pandas 2.x Support
• Extensions: Python Data Source APIs, XML/Databricks Connectors and DSV2 Extension, Delta 4.0
• Custom Functions and Procedures: SQL UDFs, SQL Scripting, Python UDTF, Arrow optimized Python UDF, PySpark UDF Unified Profiler
• Usability: Structured Logging Framework, Error Class Framework, Behavior Change Process
Streaming and Batch Python Data Sources
Why Python Data Source?
• People like writing Python!
• pip install is so convenient.
• Simplified API, without the complicated performance-oriented features of Data Source V2.
Python Data Source APIs
• SPIP: Python Data Source API (SPARK-44076)
• Available in Spark 4.0 preview version and Databricks Runtime 15.2
• Support both batch and streaming, read and write
Python Data Source Overview
Easy three steps to create and use your custom data sources:

Step 1: Create a data source
class MySource(DataSource):
    …

Step 2: Register the data source
Register the data source in the current Spark session using the Python data source class:
spark.dataSource.register(MySource)

Step 3: Read from or write to the data source
spark.read
    .format("my-source")
    .load(...)

df.write
    .format("my-source")
    .mode("append")
    .save(...)
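Putting the three steps together, here is a minimal end-to-end sketch; the source name, schema, and rows are illustrative, and it assumes the pyspark.sql.datasource module that ships with the 4.0 preview.

from pyspark.sql.datasource import DataSource, DataSourceReader

class MySourceReader(DataSourceReader):
    def read(self, partition):
        # Yield rows that match the schema declared by the data source.
        yield (0, "hello")
        yield (1, "world")

class MySource(DataSource):
    @classmethod
    def name(cls):
        return "my-source"

    def schema(self):
        return "id INT, value STRING"

    def reader(self, schema):
        return MySourceReader()

spark.dataSource.register(MySource)
spark.read.format("my-source").load().show()
Python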
DataFrame.toArrow
GroupedData.applyInArrow
DataFrame.toArrow
• A simple API to convert a PySpark DataFrame to a PyArrow Table.
• Makes it easier to integrate with the Arrow ecosystem.
• Note: all the data is loaded into the driver's memory, which may cause out-of-memory errors for large datasets.
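A minimal sketch (column names are illustrative): collect a small DataFrame to the driver as a pyarrow.Table.

df = spark.range(3).selectExpr("id", "id * 2 AS doubled")
table = df.toArrow()   # returns a pyarrow.Table on the driver
print(table.schema)
Python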
GroupedData.applyInArrow
• Utilizes Apache Arrow to map functions over DataFrame groups
• Returns the result as a DataFrame
• Supports functions taking a pyarrow.Table or a tuple of grouping keys and a pyarrow.Table
• Note: the function requires a full shuffle and may cause out-of-memory errors for large data groups

>>> import pyarrow
>>> import pyarrow.compute as pc
>>> df = spark.createDataFrame(
...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
...     ("id", "v"))
>>> def normalize(table):
...     v = table.column("v")
...     norm = pc.divide(pc.subtract(v, pc.mean(v)),
...                      pc.stddev(v, ddof=1))
...     return table.set_column(1, "v", norm)
>>> df.groupby("id").applyInArrow(
...     normalize, schema="id long, v double").show()
+---+-------------------+
| id|                  v|
+---+-------------------+
|  1|-0.7071067811865475|
|  1| 0.7071067811865475|
|  2|-0.8320502943378437|
|  2|-0.2773500981126146|
|  2| 1.1094003924504583|
+---+-------------------+
Python
XML Connectors
Reading XML files out of the box
spark.read.xml("/path/to/my/file.xml").show()
+-------+-----+
| name | age |
+-------+-----+
| Alice | 23 |
| Bob   |  32 |
+-------+-----+
Python
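When the repeating record element is not the default, the reader can be pointed at it explicitly; a small sketch with an assumed rowTag and file path:

(spark.read
    .option("rowTag", "person")   # the repeating element to treat as a row
    .xml("/path/to/people.xml")
    .show())
Python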
More Than Just a Simple Port
Databricks JDBC Dialect
Reading Databricks SQL out of the box
spark.read.jdbc(
"jdbc:databricks://…",
"my_table",
properties
).show()
+-------+-----+
| name | age |
+-------+-----+
| Alice | 23 |
| Bob   |  32 |
+-------+-----+
Python
Delta Lake 4.0
The biggest Delta release yet
Delta 3.x: Deletion Vectors, Liquid clustering, Optimized Writes, Delta Kernel, Incremental checkpoints, Log compactions, Table features, Row IDs, MERGE improvements, Auto-compaction, Table cloning, CDF

Delta 4.0:
• VARIANT: lightning-fast semi-structured data
• UniForm GA: write once, read as all formats
• Liquid GA: easy migration from partitioned tables
• Multi-cluster writes: cross-cloud, cross-engine
• Delta Connect: Spark Connect support
• Collations: flexible sort and comparison
• Identity columns: pain-free primary and foreign keys
• Type widening: data types expand with your data
UniForm GA
Towards full lakehouse format interoperability
[Diagram: with UniForm, a single copy of Parquet data carries the metadata for multiple lakehouse formats.]
Liquid clustering Usage Walkthrough
Create a new Delta table with liquid clustering
CREATE [EXTERNAL] TABLE tbl (id INT, name STRING) CLUSTER BY(id)
Change Liquid Clustering keys on existing clustered table:
ALTER TABLE tbl CLUSTER BY (name);
Clustering data in a Delta table with liquid clustering:
OPTIMIZE tbl;
What you don’t need to worry about:
● Optimal file sizes
● Whether a column can be used as a clustering key
● Order of clustering keys
Public doc: https://docs.databricks.com/delta/clustering.html
Liquid Clustering GA
Easy to use
Up to 7x faster writes
Up to 12x faster reads
Highly flexible
Agenda
• New Functionalities: Spark Connect, ANSI Mode, Arbitrary Stateful Processing V2, Collation Support, Variant Data Types, pandas 2.x Support
• Extensions: Python Data Source APIs, XML/Databricks Connectors and DSV2 Extension, Delta 4.0
• Custom Functions and Procedures: SQL UDFs, SQL Scripting, Python UDTF, Arrow optimized Python UDF, PySpark UDF Unified Profiler
• Usability: Structured Logging Framework, Error Class Framework, Behavior Change Process
Python UDTF
Apache Spark 3.5
Python User Defined Table Functions
This is a new kind of function that returns
an entire table as output instead of a
single scalar result value
○ Once registered, they can appear in
the FROM clause of a SQL query
○ Or use the DataFrame API to call them
from pyspark.sql.functions import udtf

@udtf(returnType="num: int, squared: int")
class SquareNumbers:
    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)
Python
Apache Spark 3.5
Python User Defined Table Functions

SQL Lang:
SELECT * FROM SquareNumbers(start => 1, end => 3);
+-----+--------+
| num | squared|
+-----+--------+
|   1 |      1 |
|   2 |      4 |
|   3 |      9 |
+-----+--------+
SQL

PySpark DataFrame APIs:
SquareNumbers(lit(1), lit(3)).show()
+-----+--------+
| num | squared|
+-----+--------+
|   1 |      1 |
|   2 |      4 |
|   3 |      9 |
+-----+--------+
Python
Apache Spark 4.0
Python UDTF: Polymorphic Analysis and Input Table Partitioning

Polymorphic analysis: compute the output schema for each call depending on the arguments, using analyze.

class ReadFromConfigFile:
    @staticmethod
    def analyze(filename: AnalyzeArgument):
        with open(os.path.join(
                SparkFiles.getRootDirectory(),
                filename.value), "r") as f:
            # Compute the UDTF output schema
            # based on the contents of the file.
            return AnalyzeResult(
                from_file(f.read()))
    ...

ReadFromConfigFile(lit("config.txt")).show()
+------------+-------------+
| start_date | other_field |
+------------+-------------+
| 2024-04-02 |           1 |
+------------+-------------+
Python

Input table partitioning: split input rows among class instances; eval runs once for each row, then terminate runs last.

class CountAndMax:
    def __init__(self):
        self._count = 0
        self._max = 0
    def eval(self, row: Row):
        self._count += 1
        self._max = max(self._max, row[0])
    def terminate(self):
        yield self._count, self._max

WITH t AS (SELECT id FROM RANGE(0, 100))
SELECT * FROM CountAndMax(
  TABLE(t) PARTITION BY id / 10 ORDER BY id);
+-------+-----+
| count | max |
+-------+-----+
|    10 |   0 |
|    10 |   1 |
...
Python
Apache Spark 4.0
Python UDTF: Variable Keyword Arguments and Custom Initialization

Variable keyword arguments: the analyze and eval methods may accept *args or **kwargs.

class VarArgs:
    @staticmethod
    def analyze(**kwargs: AnalyzeArgument):
        return AnalyzeResult(StructType(
            [StructField(key, arg.dataType)
             for key, arg in sorted(
                 kwargs.items())]))
    def eval(self, **kwargs):
        yield tuple(value for _, value
                    in sorted(kwargs.items()))

SELECT * FROM VarArgs(a => 10, b => 'x');
+----+-----+
|  a |  b  |
+----+-----+
| 10 | "x" |
+----+-----+
Python

Custom initialization: create a subclass of AnalyzeResult and consume it in each subsequent __init__.

class SplitWords:
    @dataclass
    class MyAnalyzeResult(AnalyzeResult):
        numWords: int
        numArticles: int

    @staticmethod
    def analyze(text: str):
        words = text.split(" ")
        return MyAnalyzeResult(
            schema=StructType()
                .add("word", StringType())
                .add("total", IntegerType()),
            withSinglePartition=True,
            numWords=len(words),
            numArticles=len([
                word for word in words
                if word in ("a", "an", "the")]))

    def __init__(self, r: MyAnalyzeResult):
        ...
Python
Arrow Optimized Python UDF
Enhancing Python UDFs with Apache Arrow
● Introduction to Arrow and Its Role in UDF Optimization:
○ Utilizes Apache Arrow
○ Supported since Spark 3.5 and ON by default since Spark 4.0
● Key Benefits
○ Enhances data serialization and deserialization speed
○ Provides standardized type coercion
Enabling Arrow Optimization

Local activation in a UDF
• Activates Arrow optimization for a specific UDF, improving performance

# An Arrow Python UDF
@udf(returnType='int', useArrow=True)
def arrow_slen(s):
    return len(s)
Python

Global activation in the Spark session
• Activates Arrow optimization for all Python UDFs in the Spark session

spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)

# An Arrow Python UDF
@udf(returnType='int')
def arrow_slen(s):
    return len(s)
Python
@udf(returnType='int', useArrow=True)
def arrow_slen(s):
    return len(s)
Python
Performance
sdf.select(
udf(lambda v: v + 1, DoubleType(), useArrow=True)("v"),
udf(lambda v: v - 1, DoubleType(), useArrow=True)("v"),
udf(lambda v: v * v, DoubleType(), useArrow=True)("v")
)
Python
Performance
Pickled Python UDF
>>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show()
+-----------------------------------------------------------------------+
| date_in_string |
+-----------------------------------------------------------------------+
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet..|
+-----------------------------------------------------------------------+
Python
Arrow-optimized Python UDF
>>> df.select(udf(lambda x: x, 'string')('value').alias('date_in_string')).show()
+--------------+
|date_in_string|
+--------------+
|    1970-01-01|
|    1970-01-02|
+--------------+
Python
Comparing Pickled and Arrow-optimized Python UDFs on type coercion [Link]
SQL UDF / UDTF
Easily extend SQL function library
● SQL User Defined Scalar Functions
○ Persisted SQL Expressions
● SQL User Defined Table Functions
○ Persisted Parameterized Views
● Support named parameter invocation and defaulting
● Table functions with lateral correlation
SQL User Defined Scalar Functions
● Encapsulate (complex) expressions
● May contain subqueries
● Return a scalar value
● Can be used in most places where built-in functions go
SQL User Defined Scalar Functions
Persists complex expression patterns
> CREATE FUNCTION roll_dice(
    num_dice INT DEFAULT 1 COMMENT 'number of dice to roll (Default: 1)',
    num_sides INT DEFAULT 6 COMMENT 'number of sides per die (Default: 6)'
  ) COMMENT 'Roll a number of n-sided dice'
  RETURN aggregate(
    sequence(1, roll_dice.num_dice, 1),
    0,
    (acc, x) -> acc + (rand() * roll_dice.num_sides) :: INT,
    acc -> acc + roll_dice.num_dice
  );

> SELECT roll_dice();
 3
-- Roll 3 6-sided dice
> SELECT roll_dice(3);
 15
-- Roll 3 10-sided dice
> SELECT roll_dice(3, 10);
 21
SQL
SQL User Defined Table Functions
● Encapsulate (complex) correlated subqueries aka a parameterized view
● Can be used in the FROM clause
SQL User Defined Table Functions
Persist complex parameterized queries
CREATE FUNCTION weekdays(start DATE, end DATE)
  RETURNS TABLE(day_of_week STRING, day DATE)
  RETURN SELECT
    to_char(day, 'E'),
    day
  FROM
    (
      SELECT sequence(weekdays.start, weekdays.end)
    ) AS t(days),
    LATERAL(explode(days)) AS dates(day)
  WHERE
    extract(DAYOFWEEK_ISO FROM day) BETWEEN 1 AND 5;
SQL
SQL User Defined Table Functions
Persist complex parameterized queries
> SELECT
    day_of_week,
    day
  FROM
    weekdays(DATE '2024-01-01',
             DATE '2024-01-14');
 Mon 2024-01-01
 …
 Fri 2024-01-05
 Mon 2024-01-08
SQL

> -- Return weekdays for date ranges originating from a LATERAL correlation
> SELECT
    weekdays.*
  FROM
    VALUES
      (DATE '2020-01-01'),
      (DATE '2021-01-01') AS starts(start),
    LATERAL weekdays(start, start + INTERVAL '7' DAYS);
 Wed 2020-01-01
 Thu 2020-01-02
 Fri 2020-01-03
 …
SQL
Named parameter invocation
Self documenting and safer SQL UDF invocation
> DESCRIBE FUNCTION roll_dice;
Function: default.roll_dice
Type: SCALAR
Input: num_dice INT
num_sides INT
Returns: INT
> -- Roll 1 10-sided dice - skip dice count
> SELECT roll_dice(num_sides => 10)
7
> -- Roll 3 10-sided dice - reversed order
> SELECT roll_dice(num_sides => 10, num_dice => 3)
21
SQL
Stored Procedure
External Stored Procedures
External Iceberg Stored Procedure
SQL Scripting
It’s SQL, but with control flow!
• Support for control flow, iterators & error handling
Natively in SQL
• Control flow → IF/ELSE, CASE
• Looping → WHILE, REPEAT, ITERATE
• Resultset iterator → FOR
• Exception handling → CONTINUE/EXIT
• Parameterized queries → EXECUTE IMMEDIATE
• Following the SQL/PSM standard
SQL Scripting
BEGIN
DECLARE c INT = 10;
WHILE c > 0 DO
INSERT INTO t VALUES (c);
SET VAR c = c - 1;
END WHILE;
END
SQL
SQL Scripting
-- parameters
DECLARE oldColName = 'ColoUr';
DECLARE newColName = 'color';

BEGIN
  DECLARE tableArray Array < STRING >;
  DECLARE tableType STRING;
  DECLARE i INT = 0;
  DECLARE alterQuery STRING;
  SET
    tableArray = (
      SELECT
        array_agg(table_name)
      FROM
        INFORMATION_SCHEMA.columns
      WHERE
        column_name COLLATE UNICODE_CI = oldColName
    );
  WHILE i < array_size(tableArray) DO
    SET
      tableType = (
        SELECT
          table_type
        FROM
          INFORMATION_SCHEMA.tables
        WHERE
          table_name = tableArray [i]
      );
    IF tableType != 'VIEW' COLLATE UNICODE_CI THEN
      SET
        alterQuery = 'ALTER TABLE ' || tableArray [i] ||
          ' RENAME COLUMN ' || oldColName || ' TO ' || newColName;
      EXECUTE IMMEDIATE alterQuery;
    END IF;
    SET i = i + 1;
  END WHILE;
END;
SQL
PySpark UDF Unified Profiling
Overview of Unified Profiling
• Key Components: Performance and memory profiling
• Benefits: Tracks function calls, execution time, memory usage
• Replaces the legacy profiling and addresses its drawbacks
• Advantages of the new unified profiling: session-based, works with Spark Connect, runtime toggling
Overview of Unified Profiling
• How to Enable:
• Performance Profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
• Memory Profiler: spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
• API Features: "show", "dump", and "clear" commands
• Show results:
• Performance: spark.profile.show(type="perf")
• Memory: spark.profile.show(type="memory")
• Dump results: spark.profile.dump("/your_path/...")
• Clear results: spark.profile.clear()
PySpark
Performance
Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
added.show()
Python
PySpark
Performance
Profiler
PySpark
Memory
Profiler
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))

spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
added.show()
Python
PySpark
Memory
Profiler
Agenda
• New Functionalities: Spark Connect, ANSI Mode, Arbitrary Stateful Processing V2, Collation Support, Variant Data Types, pandas 2.x Support
• Extensions: Python Data Source APIs, XML/Databricks Connectors and DSV2 Extension, Delta 4.0
• Custom Functions and Procedures: SQL UDFs, SQL Scripting, Python UDTF, Arrow optimized Python UDF, PySpark UDF Unified Profiler
• Usability: Structured Logging Framework, Error Class Framework, Behavior Change Process
Structured Logging Framework
What are we going to build to improve this?
• Transition to Structured Logging in Apache Spark
• Introducing Spark System Log Directories
Analyzing Spark logs is challenging due to their
unstructured nature
Structured Spark Logging
Starting from Spark 4.0, the default log format is JSON lines, making it
easier to parse and analyze.
{
"ts": "2023-03-12T12:02:46.661-0700",
"level": "ERROR",
"msg": "Fail to know the executor 289 is alive or not",
"context": {
"executor_id": "289"
},
"exception": {
"class": "org.apache.spark.SparkException",
"msg": "Exception thrown in awaitResult",
"stackTrace": "..."
},
"source": "BlockManagerMasterEndpoint"
}
JSON
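If you prefer the previous plain-text format, the JSON layout can be switched off; a sketch assuming the spark.log.structuredLogging.enabled flag that ships with the framework:

from pyspark.sql import SparkSession

# Opt back out of JSON logs for this application (set before the
# session starts, e.g. via spark-defaults.conf or --conf).
spark = (
    SparkSession.builder
    .config("spark.log.structuredLogging.enabled", "false")
    .getOrCreate()
)
Python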
Use Spark to Analyze Spark Logs
System Log Directories
from pyspark.sql.functions import col

logs = spark.read.json("/var/spark/logs.json")

# To get all the errors on host 100.116.29.4
errors_host_logs = logs.filter(
    (col("context.host") == "100.116.29.4") & (col("level") == "ERROR"))

# To get all the exceptions from Spark
spark_exceptions_logs = logs.filter(
    col("exception.class").startswith("org.apache.spark"))
Python
System Log Directories
from pyspark.sql.functions import col

logs = spark.read.json("/var/spark/logs.json")

# To get all the executor loss logs
executor_lost_logs = logs.filter(
    col("msg").contains("Lost executor"))

# To get all the distributed logs about executor 289
executor_289_logs = logs.filter(
    col("context.executor_id") == 289)
Python
Error Conditions and Messages
Error Conditions
800+ top-frequency error conditions from the server.
Error Conditions in PySpark
All 200+ error conditions issued by the PySpark client
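A small sketch (the table name is illustrative) of inspecting an error condition programmatically via the pyspark.errors API:

from pyspark.errors import AnalysisException

try:
    spark.sql("SELECT * FROM nonexistent_table")
except AnalysisException as e:
    # Error condition name plus its message parameters,
    # e.g. TABLE_OR_VIEW_NOT_FOUND.
    print(e.getErrorClass())
    print(e.getMessageParameters())
Python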
Quality PySpark Errors
• Clear and specific error
classes
• Improved documentation
• Consistency and
standardization
• Enhanced debugging and
maintenance
Spark 3.5
Spark 4.0
Behavior Changes
Overview of Apache Spark Versioning Policy
• Semantic Versioning Structure: [MAJOR].[FEATURE].[MAINTENANCE]
• MAJOR: Long-term API stability
• FEATURE: New features and improvements
• MAINTENANCE: Frequent, urgent patches
• API Compatibility Commitments: Maintain compatibility across
feature releases, reducing the need for users to refactor code.
Categories of Impactful Behavior Changes
• Query Results Impact: Changes affecting the accuracy and outcome of
data queries.
• Schema and Configuration Changes: Adjustments to the database or
application schema and Spark configuration settings.
• API Modifications: Alterations to the public and developer APIs across
multiple programming languages.
• Error Handling Adjustments: Modifications in how errors are classified
and handled within the system.
• Deployment and Management Revisions: Changes in the methods and
tools used for deploying and managing Spark environments.
Best Practices for API Changes
• General Approach:
• Avoid API changes whenever possible.
• Prefer deprecating features over direct modifications to ensure smoother
transitions.
• Implement legacy flags to allow users to opt into previous behaviors
temporarily, easing the transition to new versions.
• Communicate changes clearly through deprecation warnings and
documentation updates.
Best Practices for API Changes
• User-facing documentation
• Migration Guide Updates: Regularly update the migration guide with detailed
information on changes affecting user operations.
• Legacy Configs: Include information on legacy configurations that might help users
transition smoothly between versions.
• Error Messages:
• Clarity and Actionability: Ensure that all error messages are clear and direct, informing
the user precisely what went wrong.
• Workarounds: Wherever possible, provide actionable advice within the error message,
including configuration changes that can revert to previous behaviors or other
immediate solutions.
Best Practices for API Changes
• PR Descriptions
• Detail: Provide comprehensive explanations of the changes, highlighting the
modifications and their implications.
• Transparency: Explain clearly how the new behavior differs from the old, and the
reasons for these changes.
Documentation
PySpark Doc – Dark Mode
PySpark Doc
• SPARK-44728
• More examples
• Environment Setup
• Quickstart
• Type System
Versionless Spark Programming Guide
[Proposal] Versionless Spark Programming Guide
• Motivation: Allow for real-time
updates and rapid content iteration
without release schedule
constraints.
• Small changes pose less SEO
deranking risk than major updates.
• Transition 9 existing programming
guides to Spark website repo.
Python: The Number One Choice
Birth of PySpark: 2013
>330 million
PyPI downloads per year
PyPI downloads of PySpark in the last 12 months, by country and region [world map]
PySpark's View of the World
[Diagram: SQL, DataFrame, and pandas APIs sit on top of PySpark, backed by the Catalyst optimizer, adaptive execution, and the Spark ecosystem (connectors, 3rd-party libraries).]
Key Focus of PySpark
• Functionality parity: complete feature availability; Python native APIs
• Ease of use: no JVM knowledge required; Pythonic APIs
• Performance parity: matched performance with Scala; optimized query compiler and engine
• Ecosystem integration: integrated with the Python ecosystem; Spark ecosystem growth
Key Enhancements in PySpark 3.5 and 4.0
• Functionality parity: 180 new built-in functions in Spark 3.5+; Python native data source APIs; stateful streaming processing V2; Python UDTFs
• Ease of use: Spark Connect; error framework; enriched documentation; unified UDF profiling
• Performance: Spark Connect; Arrow-optimized Python UDF; Variant type for semi-structured processing; advanced optimizer and adaptive execution
• Ecosystem integration: compatible with pandas 2 (pandas API); Delta Lake 4.0; [ML] distributed training with TorchDistributor; Arrow integration: toArrow API
PySpark highlights: Python UDF Memory/CPU Profiler, pandas 2 compatibility, pandas API on Spark coverage, NumPy inputs, UDF-level Dependency Control [WIP], applyInArrow, 180+ new Python functions, Spark Connect, mapInArrow, DeepSpeed Distributor, visualization/plotting, richer connectors, Pythonic error handling, DF/UDF debuggability, Arrow-optimized Python UDFs, DF.toArrow(), Python Arbitrary Stateful Processing, type annotations, Python UDTF, PySpark Testing API
Thank you for your
contributions!