0% found this document useful (0 votes)

46 views31 pages

Slides

The document discusses how LLMs are impacting and may impact various data engineering tasks such as answering business questions, fixing broken pipelines, and writing documentation. It notes that LLMs are better suited for some tasks like generating boilerplate but are not as good at tasks requiring business context or soft skills. The lab will demonstrate when to use LLMs, how to get the most from prompts, and interacting with the OpenAI API.

Uploaded by

Prashant Kurve

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

46 views31 pages

Slides

Uploaded by

Prashant Kurve

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

LLM-driven Data Engineering

Day 1
EcZachly Inc
Is data engineering doomed to
AI?
And what you can do about it!
EcZachly Inc
Here’s why data engineering is doomed
(according to shitty LinkedIn inﬂuencers)
EcZachly Inc

- LLMs are getting good at writing SQL

- LLMs increase productivity, reduce debug time, and reduce documentation
writing time
- LLMs can even write Spark code! Y’all saw the English SDK for Databricks
right? EcZachly Inc
- LLMs will guarantee amazing data quality!
LLMs and SQL
EcZachly Inc

I’m sure you’ve seen the

SQL is an important part of the data engineering equation

- LLMs are good but not great at SQL (we will see this in the lab today)
EcZachly Inc
I’m going to show you a chart soon
EcZachly Inc

- Red means, LLMs are going to have a signiﬁcant impact, soon

- Yellow means, LLMs aren’t quite there yet but could disrupt in the future
- Green means, LLMs are far from accomplishing this

EcZachly Inc
EcZachly Inc

EcZachly Inc
Answering business questions (red)
EcZachly Inc

LangChain hooks into databases

directly and can easily make sense
of data warehouses!

This disruption is imminent! EcZachly Inc

Fixing Broken Pipelines (red)
EcZachly Inc

LLMs + Agents will be able to

signiﬁcantly reduce the oncall burden.

- LLMs that process stack traces /

quality failures can recommend
possible remedies to unblock
(maybe even via Slack)

The amount of data engineering hours

EcZachly Inc
that are recovered from this is
something to be celebrated!

Tricky failures will still need manual

troubleshooting though
Writing analytical queries (red)
EcZachly Inc

- This will be disrupted 100% but complexity

matters here a lot.
- SQL practitioners probably won’t feel much
difference here
- Self-serve analytics will catch ﬁre here though
This is something to be celebrated!
EcZachly Inc
Data engineers won’t have to hand hold
stakeholders through their analytical questions
nearly as much!
Writing pipeline code in:
SQL, Spark, or Flink (orange)
EcZachly Inc

LLMs can oftentimes give us a

good boilerplate to start with, but
debugging it is often not worth it!

We are still pretty far from

EcZachly Inc
prompts generating us optimized
and correct data pipelines!
Why ChatGPT sucks for SQL!
EcZachly Inc

LLMs are like having a good junior engineer

who can write SQL for you but you have to
check their work

The chat.openai.com interface is

NON-DETERMINISTIC!
EcZachly Inc
The same prompt will output different things
depending on when you input it!

REMEMBER FROM CLASS, IF THINGS ARE NOT

IDEMPOTENT, THEY ARE NOT TO BE TRUST
Debugging
EcZachly Inc

EcZachly Inc
Writing data documentation (orange)
EcZachly Inc

If you’ve done the upfront work of all the

conceptual and physical data modeling,
ChatGPT can give you beautiful
summaries and example queries. (We’ll
see this in the lab today)
ChatGPT will miss most business EcZachly Inc
context unless directly given it and if
you’re directly giving ChatGPT the
context, you might as well write the
documentation yourself at that point
Sprint Planning (orange)
EcZachly Inc

LLMs should be able to give us a

good idea about sprint planning.

The soft skills needed to prioritize EcZachly Inc

and push back are sadly not
present in ChatGPT
Dashboarding (orange)
EcZachly Inc

Presenting data is actually pretty

easy. Usually simple GROUP BY
queries does the job

Knowing you customer and what

EcZachly Inc
data to display and when is the
hard part that ChatGPT will
continue to suck at
Physical Data Modeling (orange)
EcZachly Inc

Optimizing logical/conceptual
data models to a speciﬁc
architecture is something
ChatGPT will excel at.
Having all the best practices in EcZachly Inc
place and understand the interplay
between data table, pipeline, etc is
hard for ChatGPT to understand
Writing tests for you pipelines (orange)
EcZachly Inc

Generating fake data will be a big

pain that LLMs can give us
already.

EcZachly Inc
Knowing if we have adequate test
coverage is something ChatGPT
will continue to be bad at
Automated Data Quality Checks (orange)
EcZachly Inc

Suggesting automated checks is

something LLMs can do already

Having the business context to EcZachly Inc

know what checks are valuable is
something LLMs will have a hard
time with
Building data processing frameworks
(green)
EcZachly Inc

Once you hit a suﬃcient level of

complexity, LLMs fall apart.

Imagine asking ChatGPT,

“Give me the code for the next EcZachly Inc
generation of Spark”
Creating data best practices (green)
EcZachly Inc

Data best practices are most

often learned from experience.
Experience that needs to
permeate through the
organization. EcZachly Inc

ChatGPT will catch up 2-3 years

late!
Conceptual / Logical Data Modeling (green)
EcZachly Inc

Data Modeling is a very soft skills

focused task.

Understanding what data is

needed, where it is, and how to get EcZachly Inc
it is a human-oriented task that
ChatGPT will continue to struggle
with
What’s the lab going to be about today?
EcZachly Inc

- When you should and shouldn’t use LLMs

- What types of tasks are LLMs better suited for
- How can you get the most out of each prompt?

EcZachly Inc
When should you use LLMs?
EcZachly Inc

- Debugging speciﬁc problems (similar to StackOverﬂow or Google)

- Getting your starter DAG boilerplate
- Boilerplate documentation
- Solving a very speciﬁc part of a query
EcZachly Inc
Generating Documentation
EcZachly Inc

- LLMs are actually decent at generating text and documentation

- Business context is often lacking though!
- Make sure to specify to use Markdown in the prompt
- The more context you give, the better the documentation
EcZachly Inc
How to use the OpenAI API
EcZachly Inc

- Install the openai Python library

- Get an OpenAI API key (REMEMBER ITS PAY PER USE)

EcZachly Inc
How the prompts work in Python API
EcZachly Inc

- You have multiple roles you can play

- System
- Assistant
- User

EcZachly Inc
The System Role
EcZachly Inc

Think of this as the “contextual” clues that guide the prompt the right way

EcZachly Inc
The Assistant Role
EcZachly Inc

This is for when you want to have a multiple prompt/response conversation

with ChatGPT

EcZachly Inc
The User Role
EcZachly Inc

This is the prompt that is used for the speciﬁc task at hand

EcZachly Inc
API Conﬁguration
EcZachly Inc

Temperature - Scales from 0 to 2. Closer to 0 the more deterministic the

output.
Max Tokens - you can set this as high as 4000. One token is about 4 characters
in English
EcZachly Inc
Today’s Lab
EcZachly Inc

- We’ll be interacting with GPT-4 in today’s lab

- We’ll generate an Airﬂow DAG
- A SQL query
- And documentation for our data sets

EcZachly Inc

Generative AI for Data Democratization
No ratings yet
Generative AI for Data Democratization
46 pages
Build a Chatbot with LangChain
No ratings yet
Build a Chatbot with LangChain
11 pages
LLM Deployment Strategies and Insights
No ratings yet
LLM Deployment Strategies and Insights
5 pages
PDF Div Class 2qs3tf Truncatedtext Module Wrapper Fg1km9p Classtruncatedtext Module Lineclamped 85ulhh Style Max Lines5building Llms For Production Louis Francois Bouchard P Div Compress
No ratings yet
PDF Div Class 2qs3tf Truncatedtext Module Wrapper Fg1km9p Classtruncatedtext Module Lineclamped 85ulhh Style Max Lines5building Llms For Production Louis Francois Bouchard P Div Compress
120 pages
LLMs in Production-MLC - GRC
No ratings yet
LLMs in Production-MLC - GRC
39 pages
Little Guide To Building Large Language Models in 2024
No ratings yet
Little Guide To Building Large Language Models in 2024
65 pages
LLM Mastery Pathways
No ratings yet
LLM Mastery Pathways
8 pages
Little Guide To Building Large Language Models in 2024
100% (1)
Little Guide To Building Large Language Models in 2024
65 pages
Ai SQL Accuracy 2023 08 17
No ratings yet
Ai SQL Accuracy 2023 08 17
12 pages
Aryan A. What Is LLMOps. Large Language Models in Production 2024
100% (1)
Aryan A. What Is LLMOps. Large Language Models in Production 2024
67 pages
Genai
No ratings yet
Genai
26 pages
Generative AI Index
No ratings yet
Generative AI Index
9 pages
Ship A I To Production
No ratings yet
Ship A I To Production
13 pages
Doubt Clearance
No ratings yet
Doubt Clearance
5 pages
Ai Roadmap
No ratings yet
Ai Roadmap
15 pages
Gen AI & ML
No ratings yet
Gen AI & ML
41 pages
Day 5
No ratings yet
Day 5
48 pages
AI Fastest Path
No ratings yet
AI Fastest Path
20 pages
LLM For Data Management
No ratings yet
LLM For Data Management
98 pages
Data Science Generative-AI Curriculum
No ratings yet
Data Science Generative-AI Curriculum
14 pages
Guoliang Li Vision-LLM-Enhanced Data Management
No ratings yet
Guoliang Li Vision-LLM-Enhanced Data Management
7 pages
Genai & ML 2025
No ratings yet
Genai & ML 2025
42 pages
Deedy Resume
No ratings yet
Deedy Resume
2 pages
AI Startup in 2025
No ratings yet
AI Startup in 2025
9 pages
What We Learned From A Year of Building With LLMs (Part I) - O'Reilly
No ratings yet
What We Learned From A Year of Building With LLMs (Part I) - O'Reilly
22 pages
Roadmap To LLM
No ratings yet
Roadmap To LLM
12 pages
Training For AI Engineer Interns
No ratings yet
Training For AI Engineer Interns
4 pages
Your Roadmap To Becoming A World Class AI Generalist
No ratings yet
Your Roadmap To Becoming A World Class AI Generalist
10 pages
Resume Prep and Clarification
No ratings yet
Resume Prep and Clarification
10 pages
Generative AI & LLMs for Developers
No ratings yet
Generative AI & LLMs for Developers
9 pages
Generative AI Course Overview
No ratings yet
Generative AI Course Overview
6 pages
Career in AI Checklist
No ratings yet
Career in AI Checklist
2 pages
AI For TPMs EdgeUp Curriculum
No ratings yet
AI For TPMs EdgeUp Curriculum
12 pages
ChatGPT for Data Analytics Training
100% (1)
ChatGPT for Data Analytics Training
124 pages
Data Science Career Insights and Tips
No ratings yet
Data Science Career Insights and Tips
23 pages
GenAI LLM Foundations and Building Blocks
No ratings yet
GenAI LLM Foundations and Building Blocks
6 pages
PHP & LLM Integration with Laravel
No ratings yet
PHP & LLM Integration with Laravel
49 pages
AIML Job Roadmap Improved
No ratings yet
AIML Job Roadmap Improved
2 pages
LLM's For Code Generation
No ratings yet
LLM's For Code Generation
31 pages
Aiml Guide
No ratings yet
Aiml Guide
4 pages
LLMOps Toolkit - Prashant Sahu
No ratings yet
LLMOps Toolkit - Prashant Sahu
12 pages
Pragmatic Engineer - How GenAI Is Reshaping Tech Hiring
No ratings yet
Pragmatic Engineer - How GenAI Is Reshaping Tech Hiring
11 pages
?all Job Roadmap
No ratings yet
?all Job Roadmap
36 pages
Backend Engineering EdgeUp Curriculum
No ratings yet
Backend Engineering EdgeUp Curriculum
9 pages
Machine Learning Careers in Bangladesh
No ratings yet
Machine Learning Careers in Bangladesh
32 pages
AI Strategies for Business Growth
No ratings yet
AI Strategies for Business Growth
54 pages
Ai Engineer Roadmap-1
No ratings yet
Ai Engineer Roadmap-1
3 pages
Day 2 Short Course AI For Data Analytics
No ratings yet
Day 2 Short Course AI For Data Analytics
112 pages
Breaking Into Machine Learning Engineering: A Primer On MLE Skills and Interviews For Beginners
No ratings yet
Breaking Into Machine Learning Engineering: A Primer On MLE Skills and Interviews For Beginners
39 pages
Practical Guide To Using LLMs by Andrej Karpathy Feb 29 2025
No ratings yet
Practical Guide To Using LLMs by Andrej Karpathy Feb 29 2025
8 pages
53 Streamlit
No ratings yet
53 Streamlit
6 pages
Survey Report MLOPS v16 FINAL
No ratings yet
Survey Report MLOPS v16 FINAL
20 pages
LLMs in Python Free Course by Inder P Singh
No ratings yet
LLMs in Python Free Course by Inder P Singh
28 pages
ML AI Roadmap Guide To Epert
No ratings yet
ML AI Roadmap Guide To Epert
6 pages
Mohammed Faheem Resume
No ratings yet
Mohammed Faheem Resume
2 pages
2024 11 15 AI Updates
No ratings yet
2024 11 15 AI Updates
20 pages
Scale Zeitgeist AI Readiness Report 2023
No ratings yet
Scale Zeitgeist AI Readiness Report 2023
24 pages
Genarative AI - Dev Doc-1
No ratings yet
Genarative AI - Dev Doc-1
48 pages
Chat GPT Bible - Lawyers and Legal Professionals Special Edition (Lucas Foster)
No ratings yet
Chat GPT Bible - Lawyers and Legal Professionals Special Edition (Lucas Foster)
114 pages
Rapid LLM App Development Guide
No ratings yet
Rapid LLM App Development Guide
8 pages
ChatGPT for Data Analysis Guide
100% (1)
ChatGPT for Data Analysis Guide
11 pages
2023 Gartner® Market Guide For IT Service Management Platforms
No ratings yet
2023 Gartner® Market Guide For IT Service Management Platforms
31 pages
Gym For Robots
No ratings yet
Gym For Robots
2 pages
Generative AI Startup Landscape in India - Revised Version
100% (3)
Generative AI Startup Landscape in India - Revised Version
44 pages
Agentic AI Framework Comparison
No ratings yet
Agentic AI Framework Comparison
1 page
Artificial Intelligence
No ratings yet
Artificial Intelligence
3 pages
Chapter
No ratings yet
Chapter
27 pages
Wjarr 2025 0279
No ratings yet
Wjarr 2025 0279
11 pages
Microsoft's AI Strategy Shift
No ratings yet
Microsoft's AI Strategy Shift
11 pages
Mastering Agentic RAG - Multi-Tool Orchestration For AI Agents
No ratings yet
Mastering Agentic RAG - Multi-Tool Orchestration For AI Agents
25 pages
Data Structure Algorithm &: System Design
No ratings yet
Data Structure Algorithm &: System Design
46 pages
DeepMind Slows Down Research Releases To Keep Competitive Edge in AI Race
No ratings yet
DeepMind Slows Down Research Releases To Keep Competitive Edge in AI Race
16 pages
GPT Blaster: AI-Powered SEO Content Tool
No ratings yet
GPT Blaster: AI-Powered SEO Content Tool
13 pages
GenAI Platform For Customer Presentations
No ratings yet
GenAI Platform For Customer Presentations
16 pages
Custom GPT Email Generator Workbook
100% (1)
Custom GPT Email Generator Workbook
6 pages
Removing RLHF Protections in GPT-4 Via Fine-Tuning
No ratings yet
Removing RLHF Protections in GPT-4 Via Fine-Tuning
7 pages
Chat GPT Should Be Regulated by The Government Especially For Schools. PLAN
No ratings yet
Chat GPT Should Be Regulated by The Government Especially For Schools. PLAN
2 pages
Bridging the AI Skills Gap in Finance
No ratings yet
Bridging the AI Skills Gap in Finance
8 pages
GENERATIVE AI - Final Web
100% (2)
GENERATIVE AI - Final Web
80 pages
AIfM Unit3 Notes
No ratings yet
AIfM Unit3 Notes
36 pages
France Concludes Global Ai Action Summit Artprice by Artmarket Unveils 2025 2029 Strategic Plan
No ratings yet
France Concludes Global Ai Action Summit Artprice by Artmarket Unveils 2025 2029 Strategic Plan
15 pages
Job Ready ML Skills
No ratings yet
Job Ready ML Skills
9 pages
Generative AI Prompts Productivity, Imagination, and Innovation in The Enterprise
No ratings yet
Generative AI Prompts Productivity, Imagination, and Innovation in The Enterprise
11 pages
Generative AI with Copilot in Bing
No ratings yet
Generative AI with Copilot in Bing
19 pages
Perry Belcher-AI Summit Salesletter
100% (1)
Perry Belcher-AI Summit Salesletter
27 pages

Slides

Uploaded by

Slides

Uploaded by

LLM-driven Data Engineering

- LLMs are getting good at writing SQL

I’m sure you’ve seen the

SQL is an important part of the data engineering equation

- Red means, LLMs are going to have a signiﬁcant impact, soon

LangChain hooks into databases

This disruption is imminent! EcZachly Inc

LLMs + Agents will be able to

- LLMs that process stack traces /

The amount of data engineering hours

Tricky failures will still need manual

- This will be disrupted 100% but complexity

LLMs can oftentimes give us a

We are still pretty far from

LLMs are like having a good junior engineer

The chat.openai.com interface is

REMEMBER FROM CLASS, IF THINGS ARE NOT

If you’ve done the upfront work of all the

LLMs should be able to give us a

The soft skills needed to prioritize EcZachly Inc

Presenting data is actually pretty

Knowing you customer and what

Generating fake data will be a big

Suggesting automated checks is

Having the business context to EcZachly Inc

Once you hit a suﬃcient level of

Imagine asking ChatGPT,

Data best practices are most

ChatGPT will catch up 2-3 years

Data Modeling is a very soft skills

Understanding what data is

- When you should and shouldn’t use LLMs

- Debugging speciﬁc problems (similar to StackOverﬂow or Google)

- LLMs are actually decent at generating text and documentation

- Install the openai Python library

- You have multiple roles you can play

This is for when you want to have a multiple prompt/response conversation

Temperature - Scales from 0 to 2. Closer to 0 the more deterministic the

- We’ll be interacting with GPT-4 in today’s lab

You might also like