LLM-driven Data Engineering
Day 1
EcZachly Inc
Is data engineering doomed to
AI?
And what you can do about it!
EcZachly Inc
Here’s why data engineering is doomed
(according to shitty LinkedIn influencers)
EcZachly Inc
- LLMs are getting good at writing SQL
- LLMs increase productivity, reduce debug time, and reduce documentation
writing time
- LLMs can even write Spark code! Y’all saw the English SDK for Databricks
right? EcZachly Inc
- LLMs will guarantee amazing data quality!
LLMs and SQL
EcZachly Inc
I’m sure you’ve seen the
SQL is an important part of the data engineering equation
- LLMs are good but not great at SQL (we will see this in the lab today)
EcZachly Inc
I’m going to show you a chart soon
EcZachly Inc
- Red means, LLMs are going to have a significant impact, soon
- Yellow means, LLMs aren’t quite there yet but could disrupt in the future
- Green means, LLMs are far from accomplishing this
EcZachly Inc
EcZachly Inc
EcZachly Inc
Answering business questions (red)
EcZachly Inc
LangChain hooks into databases
directly and can easily make sense
of data warehouses!
This disruption is imminent! EcZachly Inc
Fixing Broken Pipelines (red)
EcZachly Inc
LLMs + Agents will be able to
significantly reduce the oncall burden.
- LLMs that process stack traces /
quality failures can recommend
possible remedies to unblock
(maybe even via Slack)
The amount of data engineering hours
EcZachly Inc
that are recovered from this is
something to be celebrated!
Tricky failures will still need manual
troubleshooting though
Writing analytical queries (red)
EcZachly Inc
- This will be disrupted 100% but complexity
matters here a lot.
- SQL practitioners probably won’t feel much
difference here
- Self-serve analytics will catch fire here though
This is something to be celebrated!
EcZachly Inc
Data engineers won’t have to hand hold
stakeholders through their analytical questions
nearly as much!
Writing pipeline code in:
SQL, Spark, or Flink (orange)
EcZachly Inc
LLMs can oftentimes give us a
good boilerplate to start with, but
debugging it is often not worth it!
We are still pretty far from
EcZachly Inc
prompts generating us optimized
and correct data pipelines!
Why ChatGPT sucks for SQL!
EcZachly Inc
LLMs are like having a good junior engineer
who can write SQL for you but you have to
check their work
The chat.openai.com interface is
NON-DETERMINISTIC!
EcZachly Inc
The same prompt will output different things
depending on when you input it!
REMEMBER FROM CLASS, IF THINGS ARE NOT
IDEMPOTENT, THEY ARE NOT TO BE TRUST
Debugging
EcZachly Inc
EcZachly Inc
Writing data documentation (orange)
EcZachly Inc
If you’ve done the upfront work of all the
conceptual and physical data modeling,
ChatGPT can give you beautiful
summaries and example queries. (We’ll
see this in the lab today)
ChatGPT will miss most business EcZachly Inc
context unless directly given it and if
you’re directly giving ChatGPT the
context, you might as well write the
documentation yourself at that point
Sprint Planning (orange)
EcZachly Inc
LLMs should be able to give us a
good idea about sprint planning.
The soft skills needed to prioritize EcZachly Inc
and push back are sadly not
present in ChatGPT
Dashboarding (orange)
EcZachly Inc
Presenting data is actually pretty
easy. Usually simple GROUP BY
queries does the job
Knowing you customer and what
EcZachly Inc
data to display and when is the
hard part that ChatGPT will
continue to suck at
Physical Data Modeling (orange)
EcZachly Inc
Optimizing logical/conceptual
data models to a specific
architecture is something
ChatGPT will excel at.
Having all the best practices in EcZachly Inc
place and understand the interplay
between data table, pipeline, etc is
hard for ChatGPT to understand
Writing tests for you pipelines (orange)
EcZachly Inc
Generating fake data will be a big
pain that LLMs can give us
already.
EcZachly Inc
Knowing if we have adequate test
coverage is something ChatGPT
will continue to be bad at
Automated Data Quality Checks (orange)
EcZachly Inc
Suggesting automated checks is
something LLMs can do already
Having the business context to EcZachly Inc
know what checks are valuable is
something LLMs will have a hard
time with
Building data processing frameworks
(green)
EcZachly Inc
Once you hit a sufficient level of
complexity, LLMs fall apart.
Imagine asking ChatGPT,
“Give me the code for the next EcZachly Inc
generation of Spark”
Creating data best practices (green)
EcZachly Inc
Data best practices are most
often learned from experience.
Experience that needs to
permeate through the
organization. EcZachly Inc
ChatGPT will catch up 2-3 years
late!
Conceptual / Logical Data Modeling (green)
EcZachly Inc
Data Modeling is a very soft skills
focused task.
Understanding what data is
needed, where it is, and how to get EcZachly Inc
it is a human-oriented task that
ChatGPT will continue to struggle
with
What’s the lab going to be about today?
EcZachly Inc
- When you should and shouldn’t use LLMs
- What types of tasks are LLMs better suited for
- How can you get the most out of each prompt?
EcZachly Inc
When should you use LLMs?
EcZachly Inc
- Debugging specific problems (similar to StackOverflow or Google)
- Getting your starter DAG boilerplate
- Boilerplate documentation
- Solving a very specific part of a query
EcZachly Inc
Generating Documentation
EcZachly Inc
- LLMs are actually decent at generating text and documentation
- Business context is often lacking though!
- Make sure to specify to use Markdown in the prompt
- The more context you give, the better the documentation
EcZachly Inc
How to use the OpenAI API
EcZachly Inc
- Install the openai Python library
- Get an OpenAI API key (REMEMBER ITS PAY PER USE)
EcZachly Inc
How the prompts work in Python API
EcZachly Inc
- You have multiple roles you can play
- System
- Assistant
- User
EcZachly Inc
The System Role
EcZachly Inc
Think of this as the “contextual” clues that guide the prompt the right way
EcZachly Inc
The Assistant Role
EcZachly Inc
This is for when you want to have a multiple prompt/response conversation
with ChatGPT
EcZachly Inc
The User Role
EcZachly Inc
This is the prompt that is used for the specific task at hand
EcZachly Inc
API Configuration
EcZachly Inc
Temperature - Scales from 0 to 2. Closer to 0 the more deterministic the
output.
Max Tokens - you can set this as high as 4000. One token is about 4 characters
in English
EcZachly Inc
Today’s Lab
EcZachly Inc
- We’ll be interacting with GPT-4 in today’s lab
- We’ll generate an Airflow DAG
- A SQL query
- And documentation for our data sets
EcZachly Inc