This is the code repository for Data Engineering with Databricks Cookbook, published by Packt.
Build effective data and AI solutions using Apache Spark, Databricks, and Delta Lake
This book shows you how to use Apache Spark, Delta Lake, and Databricks to build data pipelines, manage and transform data, optimize performance, and more. Additionally, you’ll implement DataOps and DevOps practices, and orchestrate data workflows.
This book covers the following exciting features:
- Perform data loading, ingestion, and processing with Apache Spark
- Discover data transformation techniques and custom user-defined functions (UDFs) in Apache Spark
- Manage and optimize Delta tables with Apache Spark and Delta Lake APIs
- Use Spark Structured Streaming for real-time data processing
- Optimize Apache Spark application and Delta table query performance
- Implement DataOps and DevOps practices on Databricks
- Orchestrate data pipelines with Delta Live Tables and Databricks Workflows
- Implement data governance policies with Unity Catalog
If you feel this book is for you, get your copy today!
All of the code is organized into folders. For example, Chapter01.
The code will look like the following:

    from pyspark.sql import SparkSession

    # Connect to the standalone Spark master and cap executor memory at 512 MB
    spark = (SparkSession.builder
        .appName("read-csv-data")
        .master("spark://spark-master:7077")
        .config("spark.executor.memory", "512m")
        .getOrCreate())

    # Log only errors to keep the console output readable
    spark.sparkContext.setLogLevel("ERROR")
Following is what you need for this book: This book is for data engineers, data scientists, and data practitioners who want to learn how to build efficient and scalable data pipelines using Apache Spark, Delta Lake, and Databricks. To get the most out of this book, you should have basic knowledge of data architecture, SQL, and Python programming.
The software listed below is everything you need to run the code files in the book (Chapters 1-11).
| Chapter | Software required | OS required |
|---|---|---|
| 1-11 | Docker Engine version 18.02.0+ | Windows, Mac OS X, and Linux (any) |
| 1-11 | Docker Compose version 1.25.5+ | Windows, Mac OS X, and Linux (any) |
| 1-11 | Docker Desktop | Windows, Mac OS X, and Linux (any) |
| 1-11 | Git | Windows, Mac OS X, and Linux (any) |
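
As a quick sanity check before starting, the minimum versions in the table can be compared against what is installed locally. The following is a minimal sketch rather than part of the book's code: the helper names (`parse_version`, `meets_minimum`, `installed_version`) are hypothetical, and it assumes the tools are on the `PATH` under the command names `docker` and `docker-compose`. On machines where Compose ships only as the `docker compose` plugin, the standalone `docker-compose` command may be absent and will be reported as missing.

```python
import re
import shutil
import subprocess

# Minimum versions taken from the table above.
MINIMUMS = {
    "docker": (18, 2, 0),
    "docker-compose": (1, 25, 5),
}


def parse_version(text):
    """Return the first X.Y.Z version found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def meets_minimum(found, minimum):
    """Compare version tuples; an unparseable version (None) never passes."""
    return found is not None and found >= minimum


def installed_version(command):
    """Run `<command> --version` and parse its output; None if not installed."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run(
        [command, "--version"], capture_output=True, text=True, check=False
    )
    return parse_version(result.stdout)


if __name__ == "__main__":
    for command, minimum in MINIMUMS.items():
        found = installed_version(command)
        status = "OK" if meets_minimum(found, minimum) else "missing or too old"
        print(f"{command}: found {found}, need {minimum} -> {status}")
```

Git has no minimum version in the table, so the sketch does not check it; running `git --version` by hand is enough to confirm it is installed.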
Pulkit Chadha is a seasoned technologist with over 15 years of experience in data engineering. His proficiency in crafting and refining data pipelines has been instrumental in driving success across diverse sectors such as healthcare, media and entertainment, hi-tech, and manufacturing. Pulkit’s tailored data engineering solutions are designed to address the unique challenges and aspirations of each enterprise he collaborates with.

