Data Pipelines Explained
00:00
Let's talk about data pipelines: what they are, and when and how they're used. I want to start with a simple idea. Most of us are fortunate enough to turn on the tap whenever we like and have fresh, clean water come out. But have you ever thought about how that water actually gets to you? Well, water starts out in our lakes, our oceans,
00:30
and even our rivers. But most of us probably wouldn't drink straight from a lake, right? We have to treat and transform this water into something that's safe for us to use. We do this using treatment facilities, and we get the water from where it is to where it needs to go using water pipelines.
01:00
Now, once that water has gotten from the source to the treatment plants, it's cleansed and made safe to use, and then it's sent out through even more pipelines to where we need it. We use it in a few different places: we need it for drinking water, we need it for cleaning,
01:30
and we also need it for agriculture. So we use even more pipelines to get this water to where it's needed. Okay, so as you can see, water pipelines take water from where it is to where it's needed. Now, we can start to think about data in organizations in a very similar way. Data in an organization starts out in
02:00
data lakes. It's in different databases that are part of different SaaS applications. Some applications are on-premises. And then we also have streaming data, which is kind of like our river here. This can be data that is coming in in real time; an example could be sensor data from factories, where data is being collected every second
02:30
and sent back up to our repositories. Just like our water sources, this data is dirty. It's contaminated, and it must be cleaned and transformed before it's useful in helping us make business decisions. So how do we do this work? We do it using not water pipelines, but data pipelines.
03:00
So when we talk about data pipelines, we have a few different processes that can help us handle the task of transforming and cleaning this data: we can use ETL, we can use data replication, and we can also use something called data virtualization.
03:30
Okay, so one of the most common processes is ETL, which stands for extract, transform, and load, and it does exactly what it sounds like. It extracts data from where it is; it transforms it by cleaning up mismatched data, taking care of missing values, getting rid of duplicate data, and making sure the right columns are there; and then it loads the result into
04:00
a landing repository for ready-to-use business data. An example of one of these repositories could be an enterprise data warehouse. Most of the time we use something called batch processing, which means that on a given schedule we pull data into our ETL tool and then load it to where it needs to be.
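To make that concrete, here's a minimal sketch of a batch ETL job in Python. The database files, the orders table, and the column names are all hypothetical stand-ins; a real pipeline would typically run inside a dedicated ETL tool or orchestrator, but the extract, transform, and load steps look the same.

```python
import sqlite3
import pandas as pd

# Hypothetical databases standing in for a real operational
# source system and an enterprise data warehouse.
SOURCE_DB = "operational.db"
WAREHOUSE_DB = "warehouse.db"

def extract() -> pd.DataFrame:
    # Extract: pull the raw data from where it lives.
    with sqlite3.connect(SOURCE_DB) as conn:
        return pd.read_sql("SELECT * FROM orders", conn)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: drop duplicates, handle missing values, and make
    # sure the columns we expect are actually there.
    df = df.drop_duplicates()
    df["amount"] = df["amount"].fillna(0.0)
    expected = {"order_id", "customer_id", "amount"}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df

def load(df: pd.DataFrame) -> None:
    # Load: write the cleaned data into the landing repository.
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        df.to_sql("orders_clean", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    # In batch processing this would run on a schedule,
    # e.g. from cron or an orchestrator, rather than by hand.
    load(transform(extract()))
```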
04:30
But we could also have stream ingestion, which supports the streaming data I mentioned earlier: it continuously takes data in, transforms it, and continuously loads it to where it needs to be.
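As a toy illustration of stream ingestion, here's a sketch where a Python generator stands in for a real message stream (say, the factory sensor feed from earlier); each reading is transformed and loaded the moment it arrives rather than on a schedule. All of the names here are made up for the example.

```python
import random
import time
from typing import Iterator, Optional

def sensor_stream() -> Iterator[dict]:
    # Hypothetical stand-in for a real message stream (e.g. a Kafka
    # topic) of factory sensor readings arriving every second.
    while True:
        yield {"sensor_id": 7, "temp_c": random.gauss(70, 5), "ts": time.time()}
        time.sleep(1)

def transform(reading: dict) -> Optional[dict]:
    # Drop physically implausible readings instead of loading dirty data.
    if not -40 <= reading["temp_c"] <= 150:
        return None
    return {**reading, "temp_c": round(reading["temp_c"], 1)}

def load(reading: dict) -> None:
    # Stand-in for a continuous write to the downstream repository.
    print("loading:", reading)

# Unlike a batch job, this loop runs continuously.
for raw in sensor_stream():
    clean = transform(raw)
    if clean is not None:
        load(clean)
```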
Okay, now another tool we might see is data replication. What this involves is continuously replicating and copying data into another repository before it's loaded or used by our use
05:00
case. So we could have a repository here in the middle that copies data from our source into it. Why would we do that? Well, one reason could be that the application or use case that needs this data requires a really high-performance back end, and it's possible that our source system can't support something like that. Another reason could be backup and
05:30
disaster recovery: in the situation where our source data goes offline for some reason, we still have this backup to keep running our business processes against.
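Here's a naive sketch of what continuous replication can look like, assuming a source table with a monotonically increasing id column; real replication tools usually capture changes from the database's transaction log, but the idea of keeping a second, queryable copy current is the same. The database and table names are hypothetical.

```python
import sqlite3

# Hypothetical databases: a primary source, and a replica that serves
# high-performance reads and acts as a backup if the source goes offline.
SOURCE_DB = "primary.db"
REPLICA_DB = "replica.db"

def replicate_new_rows() -> None:
    # Copy any rows the replica hasn't seen yet, keyed on the id column.
    with sqlite3.connect(SOURCE_DB) as src, sqlite3.connect(REPLICA_DB) as dst:
        dst.execute(
            "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)"
        )
        last_id = dst.execute("SELECT COALESCE(MAX(id), 0) FROM orders").fetchone()[0]
        rows = src.execute(
            "SELECT id, amount FROM orders WHERE id > ?", (last_id,)
        ).fetchall()
        dst.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)

# Run this continuously (or on a tight interval) so the replica stays current.
```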
Okay, so the last process I want to touch on is data virtualization. All of the methods I've described so far require you to copy data from where it is and move it into another repository. But what if we want to test out a new
06:00
data use case and don't want to go through a large data transformation project? In that case, we can use a technology called data virtualization to simply virtualize access to our data sources, querying them in real time, only when we need them, without copying anything over. Once we're happy with the outcome of our test use case, we can go back and build out formal data pipelines. So data virtualization technology allows
06:30
us to access all these disparate data sources without having to build out permanent data pipelines. Once we're satisfied with the results of our data virtualization project, we can build a formal data pipeline that can support the massive amounts of data that we need in a production use case.
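As a rough sketch of the idea (not any particular data virtualization product), here's a toy federated query in Python that reads two hypothetical sources in place at query time and joins them in memory, without copying anything into a new repository.

```python
import sqlite3
import pandas as pd

def query_virtual_view() -> pd.DataFrame:
    # Query both sources in place, at request time, and join the
    # results in memory -- nothing is copied into a new repository.
    with sqlite3.connect("crm.db") as conn:  # hypothetical on-prem database
        customers = pd.read_sql("SELECT customer_id, region FROM customers", conn)
    orders = pd.read_csv("exports/orders.csv")  # hypothetical SaaS export
    return orders.merge(customers, on="customer_id")

# If the test use case pans out, this ad hoc view would be replaced by
# a formal, permanent data pipeline built to production scale.
```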
07:00
Now, unfortunately, we haven't figured out a way to virtualize water, but we can definitely do it with the data in our organizations. Okay, so after we've used all these different processes to get data ready for analysis or for different applications, we can start using it. What are the different ways we can use this data? Well, we might need it for our business intelligence platforms, which
07:30
are needed for different types of reporting. We might also need it for machine learning use cases. Machine learning requires tons and tons of high-quality data, so we use these data pipeline tools to feed our machine learning algorithms: the clean data can be fed into our machine learning models to help us start making better and smarter decisions in our business.
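For instance, here's a minimal, hypothetical sketch of that last step: reading a cleaned table that a pipeline produced and feeding it straight into a model. The table and column names are invented for the example.

```python
import sqlite3
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Read a cleaned, pipeline-produced table (names are hypothetical).
with sqlite3.connect("warehouse.db") as conn:
    df = pd.read_sql("SELECT amount, churned FROM customer_features", conn)

model = LogisticRegression()
model.fit(df[["amount"]], df["churned"])  # clean features in, model out
```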
08:00
Okay, so as we can see, data pipelines take data from data producers and deliver it to data consumers. Thank you! If you have questions, please drop us a line below, and if you want to see more videos like this in the future, please like and subscribe.