
Data Engineering with Databricks Cookbook


This is the code repository for Data Engineering with Databricks Cookbook, published by Packt.

Build effective data and AI solutions using Apache Spark, Databricks, and Delta Lake

What is this book about?

This book shows you how to use Apache Spark, Delta Lake, and Databricks to build data pipelines, manage and transform data, optimize performance, and more. Additionally, you’ll implement DataOps and DevOps practices, and orchestrate data workflows.

This book covers the following exciting features:

  • Perform data loading, ingestion, and processing with Apache Spark
  • Discover data transformation techniques and custom user-defined functions (UDFs) in Apache Spark
  • Manage and optimize Delta tables with Apache Spark and Delta Lake APIs
  • Use Spark Structured Streaming for real-time data processing
  • Optimize Apache Spark application and Delta table query performance
  • Implement DataOps and DevOps practices on Databricks
  • Orchestrate data pipelines with Delta Live Tables and Databricks Workflows
  • Implement data governance policies with Unity Catalog

If you feel this book is for you, get your copy today!

https://www.packtpub.com/

Instructions and Navigations

All of the code is organized into folders. For example, Chapter01.

The code will look like the following:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-csv-data")
         .master("spark://spark-master:7077")
         .config("spark.executor.memory", "512m")
         .getOrCreate())

spark.sparkContext.setLogLevel("ERROR")

Following is what you need for this book: This book is for data engineers, data scientists, and data practitioners who want to learn how to build efficient and scalable data pipelines using Apache Spark, Delta Lake, and Databricks. To get the most out of this book, you should have basic knowledge of data architecture, SQL, and Python programming.

With the following software and hardware list, you can run all of the code files in the book (Chapters 1-11).

Software and Hardware List

| Chapter | Software required | OS required |
| ------- | ----------------- | ----------- |
| 1-11 | Docker Engine version 18.02.0+ | Windows, macOS, and Linux (any) |
| 1-11 | Docker Compose version 1.25.5+ | Windows, macOS, and Linux (any) |
| 1-11 | Docker Desktop | Windows, macOS, and Linux (any) |
| 1-11 | Git | Windows, macOS, and Linux (any) |

Get to Know the Author

Pulkit Chadha is a seasoned technologist with over 15 years of experience in data engineering. His proficiency in crafting and refining data pipelines has been instrumental in driving success across diverse sectors such as healthcare, media and entertainment, hi-tech, and manufacturing. Pulkit’s tailored data engineering solutions are designed to address the unique challenges and aspirations of each enterprise he collaborates with.
