Data Engineering Unit 1 Notes
1. Introduction
Data-driven decisions rely entirely on factual information. Since the decisions are based
on data collected from real activities, the chances of personal bias or emotional
influence are minimized. This makes the decision more reliable and consistent.
The approach involves the use of analytical tools, statistical techniques, dashboards,
visualizations, and predictive models. This systematic use of technology ensures that
decisions are made using a step-by-step logical process.
Using techniques like machine learning and statistical forecasting, businesses can
predict future outcomes such as customer demand, sales trends, risk levels, and market
shifts. This allows proactive planning.
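As a rough illustration of statistical forecasting, the short Python sketch below fits a straight-line trend to a few months of made-up sales figures and projects the next month; the numbers and variable names are purely illustrative, not a prescribed method.

```python
# Minimal illustration: fitting a linear trend to past monthly sales
# and projecting the next month. All figures are hypothetical.
import numpy as np

sales = np.array([120, 135, 150, 160, 172, 185])  # last six months (made up)
months = np.arange(len(sales))

# Fit a straight line (degree-1 polynomial) through the historical points
slope, intercept = np.polyfit(months, sales, deg=1)

next_month = len(sales)
forecast = slope * next_month + intercept
print(f"Forecast for month {next_month + 1}: {forecast:.1f} units")
```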
(iv) Continuous and Ongoing Process
Data-driven decisions are not one-time actions. As new data is generated every day, the
decision-making cycle keeps repeating—collect data, analyze it, and refine decisions
continuously.
The first step is gathering data from reliable sources. These include customer purchase
history, sales data, website clicks, social media engagement, transaction records,
surveys, sensor data, CCTV systems, mobile apps, and operational reports. Collecting
a wide variety of data provides a holistic view of the business environment.
Collected data is stored in systems like relational databases, cloud storage platforms,
data warehouses, or data lakes. Good storage ensures fast data retrieval, security,
backup, and scalability. Cloud platforms like AWS, Azure, and GCP make storage
flexible and cost-effective.
Raw data often contains errors, duplicates, missing values, and inconsistencies. These
issues must be corrected before analysis. Data cleaning improves accuracy by
removing incorrect entries, fixing incomplete fields, and formatting data properly. Clean
data ensures that decisions based on it are trustworthy.
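The following Python sketch (using pandas on a small hypothetical customer table) shows what this kind of cleaning can look like in practice; the column names and the specific fixes are illustrative, not a fixed procedure.

```python
# Minimal sketch of data cleaning with pandas (hypothetical customer data).
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "name": [" Asha ", "Ravi", "Ravi", None, "Meera"],
    "purchase_amount": ["250", "400", "400", "150", None],
})

clean = (
    raw.drop_duplicates(subset="customer_id")        # remove duplicate records
       .assign(
           name=lambda d: d["name"].str.strip(),     # fix formatting issues
           purchase_amount=lambda d: pd.to_numeric(d["purchase_amount"]),
       )
       .dropna(subset=["name"])                      # drop rows missing a name
       .fillna({"purchase_amount": 0})               # fill remaining missing values
)
print(clean)
```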
(iv) Data Analysis and Modelling
This is the core stage where tools like Python, R, Excel, Power BI, Tableau, and
statistical methods are used to identify patterns, correlations, and trends. Machine
learning models can classify data, cluster similar users, or predict future scenarios.
Visualization tools convert complex data into charts and graphs for easy interpretation.
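As one small example of the visualization step, the sketch below uses Python's matplotlib to turn a hypothetical sales summary into a bar chart; any of the tools listed above could serve the same purpose.

```python
# Minimal sketch: turning a small (hypothetical) sales summary into a chart.
import matplotlib.pyplot as plt

categories = ["Electronics", "Clothing", "Groceries", "Toys"]
weekend_sales = [420, 310, 515, 180]   # made-up figures

plt.bar(categories, weekend_sales)
plt.title("Weekend sales by product category")
plt.ylabel("Units sold")
plt.tight_layout()
plt.savefig("weekend_sales.png")   # or plt.show() in an interactive session
```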
Insights are the meaningful conclusions drawn from analysis. For example, a business
may discover that “Younger customers purchase more during weekends” or “A particular
product sells more during festivals.” These insights reveal the hidden story behind the
numbers.
(iii) Healthcare Sector: Doctors and researchers use patient data, diagnostic records,
and medical images to predict diseases, recommend personalized treatments, track
epidemics, and improve hospital resource allocation.
(iv) Manufacturing and Supply Chain
Since decisions are based on facts and real numbers, the chances of making mistakes
are much lower than with decisions based on intuition alone.
Data insights help businesses identify inefficiencies, bottlenecks, wastage, and delays.
This allows them to streamline operations and improve output.
Predictive models allow companies to anticipate future trends, demands, sales levels,
risks, and market changes, helping them plan in advance.
Poor, inaccurate, or incomplete data can mislead decision-makers and cause major
losses.
(ii) High Cost of Tools and Technology
Advanced analytics tools, cloud services, skilled analysts, and storage systems can be
expensive for small businesses.
Large volumes of sensitive data require strong cybersecurity measures, otherwise the
data may be stolen or misused.
Unstructured data such as videos, images, or social media posts requires complex
analytical techniques and powerful computing systems.
Business decisions sometimes also require intuition and experience; excessive reliance
on data may cause organizations to ignore human judgment.
* Data Pipeline Infrastructure
Introduction
A Data Pipeline Infrastructure refers to the complete, end-to-end system used for
collecting, transferring, processing, storing, and delivering data to support data-driven
decisions. It ensures that data flows smoothly from source systems to analytics
tools, enabling organizations to convert raw data into meaningful insights. In today's
digital world, where businesses rely heavily on dashboards, forecasting models, AI, ML,
and BI tools, a strong pipeline becomes the backbone of the entire decision-making
ecosystem.
1. Data pipeline infrastructure is the technical foundation that automates the flow
of data from multiple origins (databases, IoT sensors, apps, websites, CRM, ERP
etc.) to destinations such as data warehouses, data lakes, analytics tools, and
ML models.
2. The purpose is to collect, clean, integrate, transform, and deliver large
volumes of data in a reliable, repeatable, and scalable manner.
3. It reduces human effort and data errors, and helps organizations make
decisions based on accurate, real-time, consistent, and trusted data.
4. It creates a structured pathway, ensuring data is not scattered, redundant,
inconsistent, or incomplete.
5. Without pipelines, data would remain locked in silos, making insights slow,
unreliable, and difficult.
1. Data is generated in systems like websites, mobile apps, smart devices, or
internal enterprise systems.
2. Ingestion tools capture data continuously or in scheduled intervals.
3. Data is stored in raw form in a scalable storage solution such as a data lake.
4. ETL/ELT tools clean, filter, and transform the data, ensuring high quality (a
compact sketch of steps 3–5 appears after this list).
5. Transformed datasets are loaded into warehouses for analytics.
6. Governance rules maintain security, privacy, and consistency.
7. The final datasets are sent to dashboards, ML pipelines, or decision-making
tools.
8. Managers, analysts, and executives use these insights to make strategic
decisions.
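The following Python sketch compresses steps 3 to 5 above into one hypothetical example: raw records (shown inline here rather than read from a data lake) are cleaned and transformed with pandas, then loaded into a SQLite table that stands in for a real warehouse. All file, table, and column names are illustrative.

```python
# Hypothetical mini-pipeline: raw records -> transform -> "warehouse" table.
import sqlite3
import pandas as pd

# Step 3: raw data as it might land in a data lake (illustrative records)
raw = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": ["250.0", "400.0", "400.0", None],
    "country": ["in", "IN", "IN", "us"],
})

# Step 4: ETL-style cleaning and transformation
transformed = (
    raw.drop_duplicates(subset="order_id")
       .dropna(subset=["amount"])
       .assign(
           amount=lambda d: d["amount"].astype(float),
           country=lambda d: d["country"].str.upper(),
       )
)

# Step 5: load into a warehouse (SQLite stands in for a real warehouse here)
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("orders", conn, if_exists="replace", index=False)
    print(pd.read_sql(
        "SELECT country, SUM(amount) AS total FROM orders GROUP BY country", conn
    ))
```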
1. E-commerce:
Customer behavior analytics, product recommendations, dynamic pricing.
2. Banking:
Fraud detection, risk scoring, loan approval optimization.
3. Healthcare:
Patient monitoring, diagnostics, medical record consolidation.
4. Manufacturing:
Predictive maintenance, supply chain optimization, IoT analytics.
5. Government/Public Sector:
Smart city monitoring, crime prediction, citizen services analytics.
6. Marketing:
Campaign optimization, customer segmentation, lead scoring.
7. Education:
Student performance tracking, admission analytics, curriculum insights.
7. Advantages
8. Limitations
Data Engineers are responsible for creating the entire data architecture that supports
continuous data flow across the organization.
They decide how data will enter the system, how it will move across pipelines, how it
will be stored, and how it will reach the analytics layer. These architectures must be
scalable so that they can handle massive amounts of structured, semi-structured, and
unstructured data.
They ensure:
This pipeline is the “heart” of a data-driven organization because it ensures that fresh,
reliable, and consistent data is always available to analysts.
They handle:
● Missing values
● Incorrect data types
● Duplicate records
● Outlier detection
● Schema mismatches
This ensures that all downstream analytics — dashboards, prediction models, and
business decisions — are based on clean and trusted data.
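A minimal sketch of such checks is shown below, assuming a small pandas DataFrame with hypothetical column names and an expected schema; real pipelines would typically rely on dedicated validation frameworks with much richer rules.

```python
# Minimal sketch of automated quality checks a pipeline might run before data
# reaches dashboards or models. Column names and schema are hypothetical.
import pandas as pd

expected_schema = {"user_id": "int64", "signup_date": "datetime64[ns]", "plan": "object"}

df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06", None]),
    "plan": ["basic", "pro", "pro", "basic"],
})

issues = []
for column, expected_dtype in expected_schema.items():
    if column not in df.columns:                      # schema mismatch: missing column
        issues.append(f"missing column: {column}")
    elif str(df[column].dtype) != expected_dtype:     # schema mismatch: wrong type
        issues.append(f"unexpected dtype for {column}: {df[column].dtype}")

if df.duplicated().sum() > 0:
    issues.append(f"duplicate rows: {int(df.duplicated().sum())}")
if df.isna().any().any():
    issues.append(f"missing values: {int(df.isna().sum().sum())}")

print(issues if issues else "all checks passed")
```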
Data Engineers architect and maintain centralized storage systems such as data
warehouses, data lakes, and cloud object storage.
They choose appropriate file formats (Parquet, ORC, Avro) and optimize storage for
both performance and cost.
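As a small illustration of why columnar formats are preferred, the sketch below writes the same hypothetical dataset as CSV and as Parquet and compares file sizes; it assumes pandas with a Parquet engine such as pyarrow installed.

```python
# Minimal sketch: writing the same (hypothetical) dataset as CSV and as Parquet
# and comparing file sizes. Requires pandas plus a Parquet engine such as pyarrow.
import os
import pandas as pd

df = pd.DataFrame({
    "sensor_id": list(range(1000)) * 10,
    "reading": [21.5] * 10000,
})

df.to_csv("readings.csv", index=False)
df.to_parquet("readings.parquet", index=False)  # columnar, compressed format

print("CSV size (bytes):    ", os.path.getsize("readings.csv"))
print("Parquet size (bytes):", os.path.getsize("readings.parquet"))
```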
Data engineers are responsible for protecting sensitive data and ensuring compliance
with laws like GDPR, HIPAA, etc. Their role includes:
This ensures that data is secure, traceable, and compliant with organizational and legal
standards.
6. Enabling Real-Time Data Processing
They use techniques such as indexing, partitioning, caching, and query optimization to
achieve peak performance.
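The sketch below illustrates two of these ideas in miniature, using SQLite purely as a stand-in: an index on a frequently filtered column, and an in-memory cache for repeated query results. The table and function names are hypothetical.

```python
# Minimal sketch of indexing (avoid full-table scans on a filter column) and
# caching (serve repeated queries from memory). SQLite is an illustrative stand-in.
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 175.0)],
)

# Indexing: speeds up filters on the 'region' column
conn.execute("CREATE INDEX idx_sales_region ON sales(region)")

# Caching: repeated calls for the same region skip the database entirely
@lru_cache(maxsize=128)
def total_for_region(region: str) -> float:
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE region = ?", (region,)
    ).fetchone()
    return row[0]

print(total_for_region("north"))  # hits the database
print(total_for_region("north"))  # served from the cache
```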
Data Engineers act as a bridge between raw operational systems and analytical
teams. They work closely with:
This collaboration ensures that business decisions are based on accurate, timely, and
well-structured data.
Automation reduces human error and makes the data ecosystem reliable and
self-maintaining.
This makes Data Engineers essential for the success of AI/ML projects.
● Pipeline workflows
● Data lineage maps
● Schema definitions
● Data quality rules
● Storage structures
In a data-driven organization, the Data Engineer is the backbone of the entire data
ecosystem. They build the pipelines, architectures, governance systems, storage
layers, and processing engines that transform raw data into intelligent insights. Their
work enables accurate analytics, reliable dashboards, efficient ML models, and strategic
decision-making.
Without Data Engineers, organizations cannot become data-driven.
* Introduction to Elements of Data in Data Engineering
Data Engineering plays a crucial role in today’s data-driven world. It focuses on
designing systems to collect, clean, transform, and store data efficiently. To do this
effectively, engineers must understand the core elements of data that form the building
blocks of all data systems.
Data elements define how information is represented, structured, and processed. These
elements ensure that the data pipeline can handle diverse data formats — from
structured database tables to unstructured multimedia files.
Conclusion
Understanding the elements of data is fundamental for every data engineer. These
concepts not only help build robust data systems but also ensure that organizations can
turn raw data into actionable insights with confidence.
Ultimately, mastering these elements empowers engineers to bridge the gap between
raw data and intelligent decision-making — the true goal of data engineering.
** THE FIVE V’S OF BIG DATA: VOLUME, VELOCITY, VARIETY, VERACITY & VALUE
1. Introduction
The Five V’s of Big Data represent the essential characteristics that define modern
large-scale datasets. As organizations collect information from diverse platforms like
mobile apps, IoT devices, social media, financial systems, and sensors, the structure,
speed, reliability, and usefulness of this data become critical. The Five Vs help
categorize and understand data properties so that suitable technologies, storage
systems, and analytics techniques can be applied.
1. Volume: the size and amounts of big data that companies manage and analyze
2. Value: the most important “V” from the perspective of the business, the value of
big data usually comes from insight discovery and pattern recognition that lead to
more effective operations, stronger customer relationships and other clear and
quantifiable business benefits
3. Variety: the diversity and range of different data types, including unstructured
data, semi-structured data and raw data
4. Velocity: the speed at which companies receive, store and manage data – e.g.,
the specific number of social media posts or search queries received within a
day, hour or other unit of time
5. Veracity: the “truth” or accuracy of data and information assets, which often
determines executive-level confidence
Volume refers to the scale, magnitude, and sheer amount of data that organizations
generate and need to process. Today’s data ecosystems deal not just with gigabytes or
terabytes but move into petabytes, exabytes, and even zettabytes. The rapid growth
of digital platforms, cloud applications, smart devices, and automation systems means
that every action, click, transaction, interaction, or sensor reading contributes to the
growing volume of data.
The increasing volume of data comes from several sources such as business
transactions, social networking platforms, streaming videos, photos, satellite imagery,
GPS signals, IoT sensors, medical records, and machine-generated logs. In traditional
systems, handling such large data volumes was impossible because of limited storage
and slow processing capabilities. However, modern distributed storage frameworks like
HDFS, cloud-based object storage systems like AWS S3, Azure Blob, and Google
Cloud Storage, and scalable platforms like Snowflake or BigQuery allow
organizations to store huge datasets at low cost.
2. VELOCITY
Velocity refers to the speed at which data is created, collected, transmitted, and
processed. In earlier periods, data was collected manually or in batches (e.g.,
end-of-day reports). Today, data is produced continuously and at extremely high
speeds, creating a need for systems that can process information in real-time or near
real-time.
Examples of high-velocity data include live social media activity, streaming videos,
clickstream logs from websites, financial market trades, location signals from
smartphones, sensor readings from manufacturing machines, and real-time health
monitoring devices. In many applications, waiting minutes or hours for data analysis is
not acceptable. A banking fraud detection system, for instance, must detect suspicious
activity within milliseconds. Similarly, ride-sharing applications need real-time GPS
tracking to match drivers with customers instantly.
Because of these requirements, new systems such as Apache Kafka, Flink, Spark
Streaming, and Storm have been developed to process continuous streams of data at
high speed. Velocity is not only about how fast data arrives but also about how fast it
must be stored, analyzed, and used to make decisions. Real-time dashboards,
automated alerts, predictive analytics, and IoT-based systems all rely heavily on
velocity. Managing velocity ensures that organizations can act based on fresh,
up-to-date, and accurate information.
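A minimal, library-free Python sketch of this idea is given below: simulated clickstream timestamps are processed one at a time and a rolling one-minute count is kept up to date, in the spirit of (but far simpler than) the streaming tools named above.

```python
# Minimal sketch of stream-style processing: events arrive one at a time and a
# rolling one-minute count is updated immediately, instead of waiting for a batch.
from collections import deque

WINDOW_SECONDS = 60

def rolling_count(events):
    """Yield the number of events seen in the last WINDOW_SECONDS after each event."""
    window = deque()  # timestamps currently inside the window
    for timestamp in events:
        window.append(timestamp)
        while window and window[0] <= timestamp - WINDOW_SECONDS:
            window.popleft()               # drop events that fell out of the window
        yield timestamp, len(window)

# Simulated clickstream timestamps (seconds); in practice these arrive continuously
clicks = [1, 5, 20, 61, 62, 130]
for ts, count in rolling_count(clicks):
    print(f"t={ts:>3}s  clicks in last minute: {count}")
```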
Using real-time alerting, Walmart sales analysts noted that a particular, rather popular,
Halloween novelty cookie was not selling in two stores. A quick investigation showed
that, due to a stocking oversight, those cookies hadn’t been put on the shelves. By
receiving automated alerts, Walmart was quickly able to rectify the situation and save its
sales.
3. VARIETY
Variety refers to the diversity of data formats, structures, types, and sources that
organizations handle in modern environments. Historically, companies mostly dealt with
structured data stored in relational databases, spreadsheets, and tables. Today,
however, data comes in countless forms—text, images, videos, sensor readings,
emails, logs, documents, voice recordings, and more. This enormous diversity creates
complexity in storage, processing, and analysis.
Structured Data
This type of data is organized into well-defined rows and columns. It is easy to search,
analyze, and store in relational databases such as SQL. Examples include financial
transactions, customer records, attendance sheets, and sales reports. Structured data
has clear formatting rules and follows strict schemas.
Semi-Structured Data
Semi-structured data does not follow strict tabular structure but contains tags or
markers that make it partially organized. Common examples are JSON files, XML
documents, email headers, and log files. Though not organized into tables, it still carries
metadata that helps with analysis.
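The sketch below shows, with made-up log entries, how the tags in semi-structured JSON allow it to be flattened into a structured table for analysis.

```python
# Minimal sketch: semi-structured JSON log entries carry keys rather than a fixed
# table layout, but those keys let us flatten them into rows for analysis.
import json
import pandas as pd

log_lines = [
    '{"user": "a1", "event": "login",    "device": {"os": "android"}}',
    '{"user": "b2", "event": "purchase", "amount": 499}',
    '{"user": "a1", "event": "logout"}',
]

records = [json.loads(line) for line in log_lines]   # parse each entry
table = pd.json_normalize(records)                   # flatten nested keys into columns
print(table)
```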
Unstructured Data
Unstructured data has no fixed schema or organization. It is difficult for traditional
systems to process and requires AI-based tools such as natural language processing,
image recognition, and audio processing. Examples include images, videos, scanned
documents, chat messages, social media posts, recorded calls, and sensor signals.
Walmart tracks each one of its 145 million American consumers individually, resulting in
accrued data per hour that’s equivalent to 167 times the books in America’s Library of
Congress. Most of that is unstructured data that comes from its videos, tweets,
Facebook posts, call-center conversations, closed-circuit TV footage, mobile phone
calls and texts, and website clicks.
Variety also covers the multiple origins of data. Modern data comes from traditional
systems like ERP and CRM, but also from new sources such as IoT devices, social
platforms, cloud applications, and web analytics.
4. VERACITY
Veracity refers to the quality, accuracy, reliability, and trustworthiness of data. In the
Big Data environment, data is collected from numerous systems and automated
processes, and it often contains errors, inconsistencies, missing values, noise,
duplication, or outdated information. Poor-quality data can lead to flawed insights and
wrong decisions, which may harm a business severely.
Ensuring veracity means organizations must maintain strict data governance, validation
checks, cleansing procedures, and quality monitoring mechanisms. For example,
sensor data may generate corrupted readings due to hardware failure. Social media
data may contain fake accounts, spam, or manipulated content. Customer databases
may include duplicate entries or incorrect contact details. Logs may contain irrelevant
entries caused by system errors.
To improve veracity, organizations use data profiling, cleansing tools, filtering
mechanisms, and statistical checks. They implement authentication protocols, access
control, metadata management, and auditing. Veracity ensures that data science teams
and decision-makers work with clean, consistent, and trustworthy datasets, reducing
the risk involved in analytics.
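A minimal profiling sketch is shown below: a hypothetical customer table is checked for completeness, duplicates, and out-of-range values before it is released for analytics; real systems would apply far richer rule sets.

```python
# Minimal sketch of a veracity check: profile a (hypothetical) customer table
# and report completeness, duplication, and one out-of-range rule.
import pandas as pd

customers = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
    "age": [34, 29, 29, 212],   # 212 is clearly an invalid entry
})

report = {
    "rows": len(customers),
    "completeness_%": round(100 * customers.notna().mean().mean(), 1),
    "duplicate_rows": int(customers.duplicated().sum()),
    "invalid_age_rows": int(((customers["age"] < 0) | (customers["age"] > 120)).sum()),
}
print(report)
```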
According to Jaya Kolhatkar, vice president of global data for Walmart labs, Walmart’s
priority is making sure its data is correct and of high quality. Clean data helps with
privacy issues, ensuring sensitive details are encrypted while customer contact
information is segregated.
5. VALUE
Value is considered the most important V because it refers to the usefulness and
benefits that organizations gain from data. Collecting huge amounts of data has no
purpose unless it results in actionable insights. Value focuses on how data can improve
business performance, reduce costs, increase efficiency, enhance customer
experiences, support automation, and generate revenue.
Value is derived through techniques such as data mining, machine learning, predictive
analytics, visualization, and strategic reporting. For example, retail companies use data
analytics to personalize offers, forecast sales, optimize inventory, and understand
customer behavior. Healthcare providers use patient data to predict disease risks and
enhance diagnosis accuracy. Financial institutions use data to detect fraud, assess
creditworthiness, and recommend investment opportunities.
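As one example of extracting value, the sketch below segments a handful of made-up customers by spend and visit frequency using scikit-learn's KMeans; the resulting segment labels could then drive personalized offers or inventory decisions.

```python
# Minimal sketch: clustering (hypothetical) customers by spend and visit
# frequency with scikit-learn's KMeans to form actionable segments.
import numpy as np
from sklearn.cluster import KMeans

# columns: [annual_spend, visits_per_month] (made-up values)
customers = np.array([
    [200, 1], [220, 2], [250, 1],      # low-spend, infrequent
    [1500, 8], [1600, 9], [1450, 7],   # high-spend, frequent
])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = model.fit_predict(customers)
print("segment labels:", segments)
print("segment centres:", model.cluster_centers_)
```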
Walmart uses its big data to make its pharmacies more efficient, help it improve store
checkout, personalize its shopping experience, manage its supply chain, and optimize
product assortment among other ends.
* Activities to Improve Veracity and Value in Data Engineering
In Data Engineering, veracity and value are two key dimensions of data quality. Veracity
represents the truthfulness and reliability of data, while value measures its usefulness to
the business. Improving both ensures accurate analytics and effective decision-making.
Example:
A banking institution maintains customer transaction data. By cleaning and verifying the
data (veracity) and enriching it with customer segmentation (value), the bank can
improve fraud detection and marketing performance simultaneously.
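A small, hypothetical sketch of this combination is given below: cleaned transactions (veracity) are joined with customer segments (value) so that unusually large amounts can be flagged relative to each segment's typical behaviour; all names and thresholds are illustrative.

```python
# Minimal sketch: enrich cleaned transactions with customer segments and flag
# amounts that exceed the segment's typical maximum. All values are illustrative.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, 95.0, 4000.0, 60.0],
})
segments = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["regular", "premium", "regular"],
    "typical_max_amount": [500.0, 5000.0, 500.0],
})

enriched = transactions.merge(segments, on="customer_id", how="left")
enriched["flag_for_review"] = enriched["amount"] > enriched["typical_max_amount"]
print(enriched)
```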
Improving data veracity and value is an ongoing process involving validation,
enrichment, governance, and alignment with business objectives. Data engineers play a
central role in ensuring that the data driving analytics is both trustworthy and
meaningful, empowering organizations to make smarter decisions.