Data Engineering Unit 1 Notes
1. Introduction
Data-driven decisions rely entirely on factual information. Since the decisions are based
on data collected from real activities, the chances of personal bias or emotional
influence are minimized. This makes the decision more reliable and consistent.
The approach involves the use of analytical tools, statistical techniques, dashboards,
visualizations, and predictive models. This systematic use of technology ensures that
decisions are made using a step-by-step logical process.
Using techniques like machine learning and statistical forecasting, businesses can
predict future outcomes such as customer demand, sales trends, risk levels, and market
shifts. This allows proactive planning.
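As a rough illustration of statistical forecasting, the short Python sketch below fits a straight-line trend to a few months of made-up sales figures and projects the next month; the numbers and variable names are purely illustrative, not a prescribed method.

```python
# Minimal illustration: fitting a linear trend to past monthly sales
# and projecting the next month. All figures are hypothetical.
import numpy as np

sales = np.array([120, 135, 150, 160, 172, 185])  # last six months (made up)
months = np.arange(len(sales))

# Fit a straight line (degree-1 polynomial) through the historical points
slope, intercept = np.polyfit(months, sales, deg=1)

next_month = len(sales)
forecast = slope * next_month + intercept
print(f"Forecast for month {next_month + 1}: {forecast:.1f} units")
```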
(iv) Continuous and Ongoing Process
Data-driven decisions are not one-time actions. As new data is generated every day, the
decision-making cycle keeps repeating—collect data, analyze it, and refine decisions
continuously.
The first step is gathering data from reliable sources. These include customer purchase
history, sales data, website clicks, social media engagement, transaction records,
surveys, sensor data, CCTV systems, mobile apps, and operational reports. Collecting
a wide variety of data provides a holistic view of the business environment.
Collected data is stored in systems like relational databases, cloud storage platforms,
data warehouses, or data lakes. Good storage ensures fast data retrieval, security,
backup, and scalability. Cloud platforms like AWS, Azure, and GCP make storage
flexible and cost-effective.
Raw data often contains errors, duplicates, missing values, and inconsistencies. These
issues must be corrected before analysis. Data cleaning improves accuracy by
removing incorrect entries, fixing incomplete fields, and formatting data properly. Clean
data ensures that decisions based on it are trustworthy.
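The following Python sketch (using pandas on a small hypothetical customer table) shows what this kind of cleaning can look like in practice; the column names and the specific fixes are illustrative, not a fixed procedure.

```python
# Minimal sketch of data cleaning with pandas (hypothetical customer data).
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "name": [" Asha ", "Ravi", "Ravi", None, "Meera"],
    "purchase_amount": ["250", "400", "400", "150", None],
})

clean = (
    raw.drop_duplicates(subset="customer_id")        # remove duplicate records
       .assign(
           name=lambda d: d["name"].str.strip(),     # fix formatting issues
           purchase_amount=lambda d: pd.to_numeric(d["purchase_amount"]),
       )
       .dropna(subset=["name"])                      # drop rows missing a name
       .fillna({"purchase_amount": 0})               # fill remaining missing values
)
print(clean)
```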
(iv) Data Analysis and Modelling
This is the core stage where tools like Python, R, Excel, Power BI, Tableau, and
statistical methods are used to identify patterns, correlations, and trends. Machine
learning models can classify data, cluster similar users, or predict future scenarios.
Visualization tools convert complex data into charts and graphs for easy interpretation.
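As one small example of the visualization step, the sketch below uses Python's matplotlib to turn a hypothetical sales summary into a bar chart; any of the tools listed above could serve the same purpose.

```python
# Minimal sketch: turning a small (hypothetical) sales summary into a chart.
import matplotlib.pyplot as plt

categories = ["Electronics", "Clothing", "Groceries", "Toys"]
weekend_sales = [420, 310, 515, 180]   # made-up figures

plt.bar(categories, weekend_sales)
plt.title("Weekend sales by product category")
plt.ylabel("Units sold")
plt.tight_layout()
plt.savefig("weekend_sales.png")   # or plt.show() in an interactive session
```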
Insights are the meaningful conclusions drawn from analysis. For example, a business
may discover that “Younger customers purchase more during weekends” or “A particular
product sells more during festivals.” These insights reveal the hidden story behind the
numbers.
(iii) Healthcare Sector: Doctors and researchers use patient data, diagnostic records,
and medical images to predict diseases, recommend personalized treatments, track
epidemics, and improve hospital resource allocation.
(iv) Manufacturing and Supply Chain
Since decisions are based on facts and real numbers, the chances of making mistakes
are much lower than with decisions based on intuition alone.
Data insights help businesses identify inefficiencies, bottlenecks, wastage, and delays.
This allows them to streamline operations and improve output.
Predictive models allow companies to anticipate future trends, demands, sales levels,
risks, and market changes, helping them plan in advance.
Poor, inaccurate, or incomplete data can mislead decision-makers and cause major
losses.
(ii) High Cost of Tools and Technology
Advanced analytics tools, cloud services, skilled analysts, and storage systems can be
expensive for small businesses.
Large volumes of sensitive data require strong cybersecurity measures, otherwise the
data may be stolen or misused.
Unstructured data such as videos, images, or social media posts requires complex
analytical techniques and powerful computing systems.
Business decisions sometimes also require intuition and experience; excessive reliance
on data may cause organizations to ignore human judgment.
* Data Pipeline Infrastructure
Introduction
A Data Pipeline Infrastructure refers to the complete, end-to-end system used for
collecting, transferring, processing, storing, and delivering data to support data-driven
decisions. It ensures that data flows smoothly from source systems to analytics
tools, enabling organizations to convert raw data into meaningful insights. In today's
digital world, where businesses rely heavily on dashboards, forecasting models, AI, ML,
and BI tools, a strong pipeline becomes the backbone of the entire decision-making
ecosystem.
1. Data pipeline infrastructure is the technical foundation that automates the flow
of data from multiple origins (databases, IoT sensors, apps, websites, CRM, ERP
etc.) to destinations such as data warehouses, data lakes, analytics tools, and
ML models.
2. The purpose is to collect, clean, integrate, transform, and deliver large
volumes of data in a reliable, repeatable, and scalable manner.
3. It reduces human effort and data errors, and helps organizations make
decisions based on accurate, real-time, consistent, and trusted data.
4. It creates a structured pathway, ensuring data is not scattered, redundant,
inconsistent, or incomplete.
5. Without pipelines, data would remain locked in silos, making insights slow,
unreliable, and difficult.
1. Data is generated in systems like websites, mobile apps, smart devices, or
internal enterprise systems.
2. Ingestion tools capture data continuously or in scheduled intervals.
3. Data is stored in raw form in a scalable storage solution such as a data lake.
4. ETL/ELT tools clean, filter, and transform the data, ensuring high quality (a
compact sketch of steps 3–5 appears after this list).
5. Transformed datasets are loaded into warehouses for analytics.
6. Governance rules maintain security, privacy, and consistency.
7. The final datasets are sent to dashboards, ML pipelines, or decision-making
tools.
8. Managers, analysts, and executives use these insights to make strategic
decisions.
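The following Python sketch compresses steps 3 to 5 above into one hypothetical example: raw records (shown inline here rather than read from a data lake) are cleaned and transformed with pandas, then loaded into a SQLite table that stands in for a real warehouse. All file, table, and column names are illustrative.

```python
# Hypothetical mini-pipeline: raw records -> transform -> "warehouse" table.
import sqlite3
import pandas as pd

# Step 3: raw data as it might land in a data lake (illustrative records)
raw = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": ["250.0", "400.0", "400.0", None],
    "country": ["in", "IN", "IN", "us"],
})

# Step 4: ETL-style cleaning and transformation
transformed = (
    raw.drop_duplicates(subset="order_id")
       .dropna(subset=["amount"])
       .assign(
           amount=lambda d: d["amount"].astype(float),
           country=lambda d: d["country"].str.upper(),
       )
)

# Step 5: load into a warehouse (SQLite stands in for a real warehouse here)
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("orders", conn, if_exists="replace", index=False)
    print(pd.read_sql(
        "SELECT country, SUM(amount) AS total FROM orders GROUP BY country", conn
    ))
```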
1. E-commerce:
Customer behavior analytics, product recommendations, dynamic pricing.
2. Banking:
Fraud detection, risk scoring, loan approval optimization.
3. Healthcare:
Patient monitoring, diagnostics, medical record consolidation.
4. Manufacturing:
Predictive maintenance, supply chain optimization, IoT analytics.
5. Government/Public Sector:
Smart city monitoring, crime prediction, citizen services analytics.
6. Marketing:
Campaign optimization, customer segmentation, lead scoring.
7. Education:
Student performance tracking, admission analytics, curriculum insights.
7. Advantages
8. Limitations
Data Engineers are responsible for creating the entire data architecture that supports
continuous data flow across the organization.
They decide how data will enter the system, how it will move across pipelines, how it
will be stored, and how it will reach the analytics layer. These architectures must be
scalable so that they can handle massive amounts of structured, semi-structured, and
unstructured data.
They ensure:
This pipeline is the “heart” of a data-driven organization because it ensures that fresh,
reliable, and consistent data is always available to analysts.
They handle:
● Missing values
● Incorrect data types
● Duplicate records
● Outlier detection
● Schema mismatches
This ensures that all downstream analytics — dashboards, prediction models, and
business decisions — are based on clean and trusted data.
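A minimal sketch of such checks is shown below, assuming a small pandas DataFrame with hypothetical column names and an expected schema; real pipelines would typically rely on dedicated validation frameworks with much richer rules.

```python
# Minimal sketch of automated quality checks a pipeline might run before data
# reaches dashboards or models. Column names and schema are hypothetical.
import pandas as pd

expected_schema = {"user_id": "int64", "signup_date": "datetime64[ns]", "plan": "object"}

df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06", None]),
    "plan": ["basic", "pro", "pro", "basic"],
})

issues = []
for column, expected_dtype in expected_schema.items():
    if column not in df.columns:                      # schema mismatch: missing column
        issues.append(f"missing column: {column}")
    elif str(df[column].dtype) != expected_dtype:     # schema mismatch: wrong type
        issues.append(f"unexpected dtype for {column}: {df[column].dtype}")

if df.duplicated().sum() > 0:
    issues.append(f"duplicate rows: {int(df.duplicated().sum())}")
if df.isna().any().any():
    issues.append(f"missing values: {int(df.isna().sum().sum())}")

print(issues if issues else "all checks passed")
```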
Data Engineers architect and maintain centralized storage systems such as data
warehouses, data lakes, and cloud object storage.
They choose appropriate file formats (Parquet, ORC, Avro) and optimize storage for
both performance and cost.
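As a small illustration of why columnar formats are preferred, the sketch below writes the same hypothetical dataset as CSV and as Parquet and compares file sizes; it assumes pandas with a Parquet engine such as pyarrow installed.

```python
# Minimal sketch: writing the same (hypothetical) dataset as CSV and as Parquet
# and comparing file sizes. Requires pandas plus a Parquet engine such as pyarrow.
import os
import pandas as pd

df = pd.DataFrame({
    "sensor_id": list(range(1000)) * 10,
    "reading": [21.5] * 10000,
})

df.to_csv("readings.csv", index=False)
df.to_parquet("readings.parquet", index=False)  # columnar, compressed format

print("CSV size (bytes):    ", os.path.getsize("readings.csv"))
print("Parquet size (bytes):", os.path.getsize("readings.parquet"))
```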
Data engineers are responsible for protecting sensitive data and ensuring compliance
with laws like GDPR, HIPAA, etc. Their role includes:
This ensures that data is secure, traceable, and compliant with organizational and legal
standards.
6. Enabling Real-Time Data Processing
They use techniques such as indexing, partitioning, caching, and query optimization to
achieve peak performance.
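The sketch below illustrates two of these ideas in miniature, using SQLite purely as a stand-in: an index on a frequently filtered column, and an in-memory cache for repeated query results. The table and function names are hypothetical.

```python
# Minimal sketch of indexing (avoid full-table scans on a filter column) and
# caching (serve repeated queries from memory). SQLite is an illustrative stand-in.
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 175.0)],
)

# Indexing: speeds up filters on the 'region' column
conn.execute("CREATE INDEX idx_sales_region ON sales(region)")

# Caching: repeated calls for the same region skip the database entirely
@lru_cache(maxsize=128)
def total_for_region(region: str) -> float:
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE region = ?", (region,)
    ).fetchone()
    return row[0]

print(total_for_region("north"))  # hits the database
print(total_for_region("north"))  # served from the cache
```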
Data Engineers act as a bridge between raw operational systems and analytical
teams. They work closely with:
This collaboration ensures that business decisions are based on accurate, timely, and
well-structured data.
Automation reduces human error and makes the data ecosystem reliable and
self-maintaining.
This makes Data Engineers essential for the success of AI/ML projects.
● Pipeline workflows
● Data lineage maps
● Schema definitions
● Data quality rules
● Storage structures
In a data-driven organization, the Data Engineer is the backbone of the entire data
ecosystem. They build the pipelines, architectures, governance systems, storage
layers, and processing engines that transform raw data into intelligent insights. Their
work enables accurate analytics, reliable dashboards, efficient ML models, and strategic
decision-making.
Without Data Engineers, organizations cannot become data-driven.
* Introduction to Elements of Data in Data Engineering
Data Engineering plays a crucial role in today’s data-driven world. It focuses on
designing systems to collect, clean, transform, and store data efficiently. To do this
effectively, engineers must understand the core elements of data that form the building
blocks of all data systems.
Data elements define how information is represented, structured, and processed. These
elements ensure that the data pipeline can handle diverse data formats — from
structured database tables to unstructured multimedia files.
Conclusion
Understanding the elements of data is fundamental for every data engineer. These
concepts not only help build robust data systems but also ensure that organizations can
turn raw data into actionable insights with confidence.
Ultimately, mastering these elements empowers engineers to bridge the gap between
raw data and intelligent decision-making — the true goal of data engineering.
** THE FIVE V’S OF BIG DATA: VOLUME, VELOCITY, VARIETY, VERACITY & VALUE
1. Introduction
The Five V’s of Big Data represent the essential characteristics that define modern
large-scale datasets. As organizations collect information from diverse platforms like
mobile apps, IoT devices, social media, financial systems, and sensors, the structure,
speed, reliability, and usefulness of this data become critical. The Five Vs help
categorize and understand data properties so that suitable technologies, storage
systems, and analytics techniques can be applied.
1. Volume: the size and amounts of big data that companies manage and analyze
2. Value: the most important “V” from the perspective of the business, the value of
big data usually comes from insight discovery and pattern recognition that lead to
more effective operations, stronger customer relationships and other clear and
quantifiable business benefits
3. Variety: the diversity and range of different data types, including unstructured
data, semi-structured data and raw data
4. Velocity: the speed at which companies receive, store and manage data – e.g.,
the specific number of social media posts or search queries received within a
day, hour or other unit of time
5. Veracity: the “truth” or accuracy of data and information assets, which often
determines executive-level confidence
Volume refers to the scale, magnitude, and sheer amount of data that organizations
generate and need to process. Today’s data ecosystems deal not just with gigabytes or
terabytes but move into petabytes, exabytes, and even zettabytes. The rapid growth
of digital platforms, cloud applications, smart devices, and automation systems means
that every action, click, transaction, interaction, or sensor reading contributes to the
growing volume of data.
The increasing volume of data comes from several sources such as business
transactions, social networking platforms, streaming videos, photos, satellite imagery,
GPS signals, IoT sensors, medical records, and machine-generated logs. In traditional
systems, handling such large data volumes was impossible because of limited storage
and slow processing capabilities. However, modern distributed storage frameworks like
HDFS, cloud-based object storage systems like AWS S3, Azure Blob, and Google
Cloud Storage, and scalable platforms like Snowflake or BigQuery allow
organizations to store huge datasets at low cost.
2. VELOCITY
Velocity refers to the speed at which data is created, collected, transmitted, and
processed. In earlier periods, data was collected manually or in batches (e.g.,
end-of-day reports). Today, data is produced continuously and at extremely high
speeds, creating a need for systems that can process information in real-time or near
real-time.
Examples of high-velocity data include live social media activity, streaming videos,
clickstream logs from websites, financial market trades, location signals from
smartphones, sensor readings from manufacturing machines, and real-time health
monitoring devices. In many applications, waiting minutes or hours for data analysis is
not acceptable. A banking fraud detection system, for instance, must detect suspicious
activity within milliseconds. Similarly, ride-sharing applications need real-time GPS
tracking to match drivers with customers instantly.
Because of these requirements, new systems such as Apache Kafka, Flink, Spark
Streaming, and Storm have been developed to process continuous streams of data at
high speed. Velocity is not only about how fast data arrives but also about how fast it
must be stored, analyzed, and used to make decisions. Real-time dashboards,
automated alerts, predictive analytics, and IoT-based systems all rely heavily on
velocity. Managing velocity ensures that organizations can act based on fresh,
up-to-date, and accurate information.
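A minimal, library-free Python sketch of this idea is given below: simulated clickstream timestamps are processed one at a time and a rolling one-minute count is kept up to date, in the spirit of (but far simpler than) the streaming tools named above.

```python
# Minimal sketch of stream-style processing: events arrive one at a time and a
# rolling one-minute count is updated immediately, instead of waiting for a batch.
from collections import deque

WINDOW_SECONDS = 60

def rolling_count(events):
    """Yield the number of events seen in the last WINDOW_SECONDS after each event."""
    window = deque()  # timestamps currently inside the window
    for timestamp in events:
        window.append(timestamp)
        while window and window[0] <= timestamp - WINDOW_SECONDS:
            window.popleft()               # drop events that fell out of the window
        yield timestamp, len(window)

# Simulated clickstream timestamps (seconds); in practice these arrive continuously
clicks = [1, 5, 20, 61, 62, 130]
for ts, count in rolling_count(clicks):
    print(f"t={ts:>3}s  clicks in last minute: {count}")
```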
Using real-time alerting, Walmart sales analysts noted that a particular, rather popular,
Halloween novelty cookie was not selling in two stores. A quick investigation showed
that, due to a stocking oversight, those cookies hadn’t been put on the shelves. By
receiving automated alerts, Walmart was quickly able to rectify the situation and save its
sales.
3. VARIETY
Variety refers to the diversity of data formats, structures, types, and sources that
organizations handle in modern environments. Historically, companies mostly dealt with
structured data stored in relational databases, spreadsheets, and tables. Today,
however, data comes in countless forms—text, images, videos, sensor readings,
emails, logs, documents, voice recordings, and more. This enormous diversity creates
complexity in storage, processing, and analysis.
Structured Data
This type of data is organized into well-defined rows and columns. It is easy to search,
analyze, and store in relational databases such as SQL. Examples include financial
transactions, customer records, attendance sheets, and sales reports. Structured data
has clear formatting rules and follows strict schemas.
Semi-Structured Data
Semi-structured data does not follow strict tabular structure but contains tags or
markers that make it partially organized. Common examples are JSON files, XML
documents, email headers, and log files. Though not organized into tables, it still carries
metadata that helps with analysis.
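The sketch below shows, with made-up log entries, how the tags in semi-structured JSON allow it to be flattened into a structured table for analysis.

```python
# Minimal sketch: semi-structured JSON log entries carry keys rather than a fixed
# table layout, but those keys let us flatten them into rows for analysis.
import json
import pandas as pd

log_lines = [
    '{"user": "a1", "event": "login",    "device": {"os": "android"}}',
    '{"user": "b2", "event": "purchase", "amount": 499}',
    '{"user": "a1", "event": "logout"}',
]

records = [json.loads(line) for line in log_lines]   # parse each entry
table = pd.json_normalize(records)                   # flatten nested keys into columns
print(table)
```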
Unstructured Data
Unstructured data has no fixed schema or organization. It is difficult for traditional
systems to process and requires AI-based tools such as natural language processing,
image recognition, and audio processing. Examples include images, videos, scanned
documents, chat messages, social media posts, recorded calls, and sensor signals.
Walmart tracks each one of its 145 million American consumers individually, resulting in
accrued data per hour that’s equivalent to 167 times the books in America’s Library of
Congress. Most of that is unstructured data that comes from its videos, tweets,
Facebook posts, call-center conversations, closed-circuit TV footage, mobile phone
calls and texts, and website clicks.
Variety also covers the multiple origins of data. Modern data comes from traditional
systems like ERP and CRM, but also from new sources such as IoT devices, social
platforms, cloud applications, and web analytics.
4. VERACITY
Veracity refers to the quality, accuracy, reliability, and trustworthiness of data. In the
Big Data environment, data is collected from numerous systems and automated
processes, and it often contains errors, inconsistencies, missing values, noise,
duplication, or outdated information. Poor-quality data can lead to flawed insights and
wrong decisions, which may harm a business severely.
Ensuring veracity means organizations must maintain strict data governance, validation
checks, cleansing procedures, and quality monitoring mechanisms. For example,
sensor data may generate corrupted readings due to hardware failure. Social media
data may contain fake accounts, spam, or manipulated content. Customer databases
may include duplicate entries or incorrect contact details. Logs may contain irrelevant
entries caused by system errors.
To improve veracity, organizations use data profiling, cleansing tools, filtering
mechanisms, and statistical checks. They implement authentication protocols, access
control, metadata management, and auditing. Veracity ensures that data science teams
and decision-makers work with clean, consistent, and trustworthy datasets, reducing
the risk involved in analytics.
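A minimal profiling sketch is shown below: a hypothetical customer table is checked for completeness, duplicates, and out-of-range values before it is released for analytics; real systems would apply far richer rule sets.

```python
# Minimal sketch of a veracity check: profile a (hypothetical) customer table
# and report completeness, duplication, and one out-of-range rule.
import pandas as pd

customers = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
    "age": [34, 29, 29, 212],   # 212 is clearly an invalid entry
})

report = {
    "rows": len(customers),
    "completeness_%": round(100 * customers.notna().mean().mean(), 1),
    "duplicate_rows": int(customers.duplicated().sum()),
    "invalid_age_rows": int(((customers["age"] < 0) | (customers["age"] > 120)).sum()),
}
print(report)
```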
According to Jaya Kolhatkar, vice president of global data for Walmart labs, Walmart’s
priority is making sure its data is correct and of high quality. Clean data helps with
privacy issues, ensuring sensitive details are encrypted while customer contact
information is segregated.
5. VALUE
Value is considered the most important V because it refers to the usefulness and
benefits that organizations gain from data. Collecting huge amounts of data has no
purpose unless it results in actionable insights. Value focuses on how data can improve
business performance, reduce costs, increase efficiency, enhance customer
experiences, support automation, and generate revenue.
Value is derived through techniques such as data mining, machine learning, predictive
analytics, visualization, and strategic reporting. For example, retail companies use data
analytics to personalize offers, forecast sales, optimize inventory, and understand
customer behavior. Healthcare providers use patient data to predict disease risks and
enhance diagnosis accuracy. Financial institutions use data to detect fraud, assess
creditworthiness, and recommend investment opportunities.
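As one example of extracting value, the sketch below segments a handful of made-up customers by spend and visit frequency using scikit-learn's KMeans; the resulting segment labels could then drive personalized offers or inventory decisions.

```python
# Minimal sketch: clustering (hypothetical) customers by spend and visit
# frequency with scikit-learn's KMeans to form actionable segments.
import numpy as np
from sklearn.cluster import KMeans

# columns: [annual_spend, visits_per_month] (made-up values)
customers = np.array([
    [200, 1], [220, 2], [250, 1],      # low-spend, infrequent
    [1500, 8], [1600, 9], [1450, 7],   # high-spend, frequent
])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = model.fit_predict(customers)
print("segment labels:", segments)
print("segment centres:", model.cluster_centers_)
```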
Walmart uses its big data to make its pharmacies more efficient, help it improve store
checkout, personalize its shopping experience, manage its supply chain, and optimize
product assortment among other ends.
* Activities to Improve Veracity and Value in Data Engineering
In Data Engineering, veracity and value are two key dimensions of data quality. Veracity
represents the truthfulness and reliability of data, while value measures its usefulness to
the business. Improving both ensures accurate analytics and effective decision-making.
Example:
A banking institution maintains customer transaction data. By cleaning and verifying the
data (veracity) and enriching it with customer segmentation (value), the bank can
improve fraud detection and marketing performance simultaneously.
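A small, hypothetical sketch of this combination is given below: cleaned transactions (veracity) are joined with customer segments (value) so that unusually large amounts can be flagged relative to each segment's typical behaviour; all names and thresholds are illustrative.

```python
# Minimal sketch: enrich cleaned transactions with customer segments and flag
# amounts that exceed the segment's typical maximum. All values are illustrative.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, 95.0, 4000.0, 60.0],
})
segments = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["regular", "premium", "regular"],
    "typical_max_amount": [500.0, 5000.0, 500.0],
})

enriched = transactions.merge(segments, on="customer_id", how="left")
enriched["flag_for_review"] = enriched["amount"] > enriched["typical_max_amount"]
print(enriched)
```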
Improving data veracity and value is an ongoing process involving validation,
enrichment, governance, and alignment with business objectives. Data engineers play a
central role in ensuring that the data driving analytics is both trustworthy and
meaningful, empowering organizations to make smarter decisions.