Data Engineering Report Final
OCTOBER 2024
R.V.R. & J.C. College of Engineering (Autonomous)
(NAAC A+ Grade) (Approved by A.I.C.T.E)
(Affiliated to Acharya Nagarjuna University)
Chandramoulipuram : Chowdavaram
Guntur – 522 019
R.V.R. & J.C. COLLEGE OF ENGINEERING
DEPARTMENT OF
COMPUTER SCIENCE AND BUSINESS SYSTEMS
CERTIFICATE
This is to certify that this internship report, “Data Engineering Virtual Internship”,
is the bona fide work of “Chetana Guntupalli (Y21CB015)”, who carried out the work
under my supervision and submitted it in partial fulfilment of the requirements for the award
of the Summer Internship (CB-451) during the year 2024 - 2025.
I would like to express my sincere gratitude to the dignitaries who supported me throughout
the journey of my summer internship, “AWS Data Engineering Virtual Internship”.
First and foremost, I extend my heartfelt thanks to Dr. Kolla Srinivas, Principal of
R.V.R. & J.C. College of Engineering, Guntur, for providing me with an encouraging
environment in which to undertake this internship.
I would also like to express my sincere thanks to my friends and family for their moral
support throughout this journey.
Data engineering is the process of designing and building systems that let people collect
and analyse raw data from multiple sources and formats. These systems empower people to
find practical applications of the data, which businesses can use to thrive. Data engineers play
a crucial role in designing, operating, and supporting the increasingly complex environments
that power modern data analytics. Historically, data engineers have carefully crafted data
warehouse schemas, with table structures and indexes designed to process queries quickly to
ensure adequate performance. With the rise of data lakes, data engineers have more data to
manage and deliver to downstream data consumers for analytics. Data that is stored in data
lakes may be unstructured and unformatted – it needs attention from data engineers before the
business can derive value from it. Fortunately, once a data set has been fully cleaned and
formatted through data engineering, it’s easier and faster to read and understand. Since
businesses are creating data constantly, it’s important to find software that will automate some
of these processes. The skill set of a data engineer encompasses the “undercurrents” of data
engineering: security, data management, DataOps, data architecture, and software engineering.
This skill set requires an understanding of how to evaluate data tools and how they fit together
across the data engineering lifecycle. It’s also critical to know how data is produced in source
systems and how analysts and data scientists will consume and create value after processing
and curating data. Finally, a data engineer juggles a lot of complex moving parts and must
constantly optimize along the axes of cost, agility, scalability, simplicity, reuse, and
interoperability. Nowadays, the data-tooling landscape is dramatically less complicated to
manage and deploy. Modern data tools considerably abstract and simplify workflows. As a
result, data engineers are now focused on balancing the simplest and most cost-effective, best-
of-breed services that deliver value to the business. The data engineer is also expected to create
agile data architectures that evolve as new trends emerge.
Table of Contents
Chapter 2
2.1 Cloud Service Models
2.2 TCO considerations
2.3 Selecting a region
2.4 AWS availability zones
2.5 AWS data centers
2.6 AWS different categories of services
2.7 AWS IAM users and groups
2.8 AWS VPCs and subnets
2.9 AWS cloud Internet gateway
2.10 Content delivery network
2.11 Amazon CloudFront
2.12 AWS computing services
2.13 Amazon EC2 instances
2.14 EC2 instance life cycle
2.15 AWS Lambda
2.16 AWS EBS and types
2.17 Amazon EFS architecture
2.18 Amazon RDS
2.19 Difference between relational and non-relational databases
2.20 Pillars of AWS Well-Architected Framework
2.21 Load balancing
2.22 EC2 auto scaling
Chapter 3
3.1 Benefits of Data Driven Organization
3.2 5V's of data
3.3 Data pipeline structure
3.4 Types of scaling
3.5 Batch data ingestion
3.6 Streaming data ingestion
3.7 Data Warehouse
3.8 Data Mart
3.9 OLAP cubes
3.10 Big Data analysis
3.11 Processing data ML
3.12 Data Analysis
3.13 Data Visualization
3.14 Fully automated Data pipeline
Chapter 1: Introduction
AWS Academy Cloud Foundations is intended for students who seek an overall
understanding of cloud computing concepts, independent of specific technical roles. It provides
a detailed overview of cloud concepts, AWS core services, security, architecture, pricing, and
support.
By the end of the course, you should be able to:
i. Define the AWS Cloud and explain the AWS pricing philosophy.
ii. Identify the global infrastructure components of AWS.
iii. Describe the security and compliance measures of the AWS Cloud, including AWS Identity and Access Management (IAM).
iv. Create an AWS Virtual Private Cloud (Amazon VPC).
v. Demonstrate when to use Amazon Elastic Compute Cloud (EC2), AWS Lambda, and AWS Elastic Beanstalk.
vi. Differentiate between Amazon S3, Amazon EBS, Amazon EFS, and Amazon S3 Glacier.
vii. Demonstrate when to use AWS database services, including Amazon Relational Database Service (RDS), Amazon DynamoDB, Amazon Redshift, and Amazon Aurora.
viii. Explain AWS Cloud architectural principles and explore key concepts related to Elastic Load Balancing (ELB), Amazon CloudWatch, and Auto Scaling.
Data Engineering with AWS combines data engineering principles with the utilization
of Amazon Web Services (AWS) tools and services, focusing on designing, building, and
maintaining data pipelines and infrastructure for large-scale data processing and analysis.
Career opportunities in this field are abundant and promising as organizations increasingly rely
on data-driven decision-making. With the dominance of AWS in the cloud services market,
having expertise in data engineering with AWS opens up diverse career options and provides
versatility across industries.
Professionals skilled in data engineering with AWS are in high demand, and their
proficiency in data modelling, ETL processes, data warehousing, big data frameworks, and
cloud technologies makes them valuable assets for organizations.
Chapter 2: AWS Academy Cloud Foundations
There are three main cloud service models. Each model represents a different part of
the cloud computing stack and gives you a different level of control over your IT resources:
i. Infrastructure as a service (IaaS): Services in this category are the basic building
blocks for cloud IT and typically provide you with access to networking features,
computers (virtual or on dedicated hardware), and data storage space.
ii. Platform as a service (PaaS): Services in this category reduce the need for you to
manage the underlying infrastructure (usually hardware and operating systems) and
enable you to focus on the deployment and management of your applications.
iii. Software as a service (SaaS): Services in this category provide you with a completed
product that the service provider runs and manages. In most cases, software as a service
refers to end-user applications. With a SaaS offering, you do not have to think about
how the service is maintained or how the underlying infrastructure is managed.
Advantages of cloud computing
i. Trade capital expense for variable expense: Capital expenses (capex) are funds
that a company uses to acquire, upgrade, and maintain physical assets such as
property, industrial buildings, or equipment. With cloud computing, instead of investing
in data centres and servers before you know how you will use them, you pay only when
you consume computing resources, and only for how much you consume.
ii. Benefit from massive economies of scale: By using cloud computing, you can
achieve a lower variable cost than you can get on your own.
iii. Stop guessing capacity: Eliminate guessing about your infrastructure capacity
needs. When you make a capacity decision before you deploy an application, you
often either have expensive idle resources or deal with limited capacity.
iv. Increase speed and agility: In a cloud computing environment, new IT resources
are only a click away, which means that you reduce the time it takes to make those
resources available to your developers from weeks to just minutes.
v. Stop spending money on running and maintaining data centres: Cloud computing lets
you focus on your own customers and projects rather than on the heavy lifting of
racking, stacking, and powering servers.
vi. Go global in minutes: You can deploy your application in multiple AWS Regions
around the world with just a few clicks. As a result, you can provide lower latency
and a better experience for your customers simply and at minimal cost.
Amazon Web Services (AWS) is a secure cloud platform that offers a broad set of global
cloud-based products. Because these products are delivered over the internet, you have on-
demand access to the compute, storage, network, database, and other IT resources that you
might need for your projects—and the tools to manage them.
AWS offers flexibility. Your AWS environment can be reconfigured and updated on
demand, scaled up or down automatically to meet usage patterns and optimize spending, or
shut down temporarily or permanently. AWS services are designed to work together to support
virtually any type of application or workload. Think of these services like building blocks,
which you can assemble quickly to build sophisticated, scalable solutions, and then adjust them
as your needs change.
There are three fundamental drivers of cost with AWS: compute, storage, and outbound
data transfer. These characteristics vary somewhat, depending on the AWS product and pricing
model you choose. In most cases, there is no charge for inbound data transfer or for data transfer
between other AWS services within the same AWS Region. There are some exceptions, so be
sure to verify data transfer rates before you begin to use a service. Outbound data
transfer is aggregated across services and then charged at the outbound data transfer rate. This
charge appears on the monthly statement as AWS Data Transfer Out.
Additional services
i. AWS Identity and Access Management (IAM): It controls your users' access
to AWS services and resources.
ii. AWS Organizations Consolidated Billing: Consolidated Billing is a billing feature in
AWS Organizations that you can use to consolidate payment for multiple AWS accounts
or multiple Amazon Internet Services Private Limited (AISPL) accounts.
iii. AWS Elastic Beanstalk: It is an even easier way for you to quickly deploy and
manage applications in the AWS Cloud.
What is TCO?
Total Cost of Ownership (TCO) is a financial estimate of the direct and indirect costs of
running a system. It is commonly used to compare the cost of running workloads in an
on-premises data centre with the cost of running the same workloads on AWS.
The AWS Pricing Calculator enables you to estimate your monthly AWS costs; you can name
your estimate and create and name groups of services.
2.3 AWS Global Infrastructure overview
The AWS Cloud infrastructure is built around regions. AWS has 22 Regions worldwide.
An AWS Region is a physical geographical location with one or more Availability Zones.
Availability Zones in turn consist of one or more data centres.
Fig 2.5 AWS data centres
Several characteristics define the AWS Global Infrastructure:
i. First, it is elastic and scalable. This means resources can dynamically adjust to
increases or decreases in capacity requirements. It can also rapidly adjust to
accommodate growth.
ii. Second, this infrastructure is fault tolerant, which means it has built-in component
redundancy that enables it to continue operations despite a failed component.
AWS storage services
i. Amazon Simple Storage Service (Amazon S3): It is an object storage service that
offers scalability, data availability, security, and performance. Use it to store and protect
any amount of data for websites, mobile apps, backup and restore, archive, enterprise
applications, Internet of Things (IoT) devices, and big data analytics.
ii. Amazon Elastic Block Store (Amazon EBS): It is high-performance block storage
that is designed for use with Amazon EC2 for both throughput and transaction intensive
workloads. It is used for a broad range of workloads, such as relational and non-
relational databases, enterprise applications, containerized applications, big data
analytics engines, file systems, and media workflows.
iii. Amazon Elastic File System (Amazon EFS): It provides a scalable, fully managed
elastic Network File System (NFS) file system for use with AWS Cloud services and
on-premises resources. It is built to scale on demand to petabytes, growing and
shrinking automatically as you add and remove files. It reduces the need to provision
and manage capacity to accommodate growth.
iv. Amazon S3 Glacier: It is a secure, durable, and extremely low-cost Amazon S3 cloud
storage class for data archiving and long-term backup. It is designed to deliver 11 9s of
durability and to provide comprehensive security and compliance capabilities that help meet
stringent regulatory requirements.
AWS compute services
i. Amazon Elastic Compute Cloud (Amazon EC2): It provides resizable virtual machines
in the cloud.
ii. AWS Elastic Beanstalk: It is a service for deploying and scaling web applications and
services on familiar servers such as Apache and Microsoft Internet Information Services (IIS).
iii. AWS Lambda: It enables you to run code without provisioning or managing servers.
You pay only for the compute time that you consume. There is no charge when your
code is not running.
iv. Amazon Elastic Kubernetes Service (Amazon EKS): It makes it easy to deploy, manage,
and scale containerized applications that use Kubernetes on AWS.
v. AWS Fargate: It is a compute engine for Amazon ECS that allows you to run containers
without having to manage servers or clusters.
i. Amazon Virtual Private Cloud (Amazon VPC): It enables you to provision logically
isolated sections of the AWS Cloud.
iii. Amazon CloudFront: It is a fast content delivery network (CDN) service that securely
delivers data, videos, applications, and application programming interfaces (APIs) to
customers globally, with low latency and high transfer speeds.
iv. AWS Transit Gateway: It is a service that enables customers to connect their Amazon
Virtual Private Clouds (VPCs) and their on-premises networks to a single gateway.
v. Amazon Route 53: It is a scalable cloud Domain Name System (DNS) web service
designed to give you a reliable way to route end users to internet applications. It
translates names (like www.example.com) into the numeric IP addresses (like
192.0.2.1) that computers use to connect to each other.
vi. AWS Direct Connect: It provides a way to establish a dedicated private network
connection from your data centre or office to AWS, which can reduce network costs
and increase bandwidth throughput.
vii. AWS VPN: It provides a secure private tunnel from your network or device to the AWS
global network.
Security and compliance are a shared responsibility between AWS and the customer.
This shared responsibility model is designed to help relieve the customer’s operational burden.
At the same time, to provide the flexibility and customer control that enables the deployment
of customer solutions on AWS, the customer remains responsible for some aspects of the
overall security. The differentiation of who is responsible for what is commonly referred to as
security “of” the cloud versus security “in” the cloud.
AWS operates, manages, and controls the components from the software virtualization
layer down to the physical security of the facilities where AWS services operate. AWS is
responsible for protecting the infrastructure that runs all the services that are offered by AWS
Cloud. This infrastructure is composed of the hardware, software, networking, and facilities
that run the AWS Cloud services.
The customer is responsible for the encryption of data at rest and data in transit. The
customer should also ensure that the network is configured for security and that security
credentials and logins are managed safely. Additionally, the customer is responsible for the
configuration of security groups and the configuration of the operating systems that run on
compute instances that they launch (including updates and security patches).
AWS Identity and Access Management (IAM) allows you to control access to compute,
storage, database, and application services in the AWS Cloud. IAM can be used to handle
authentication, and to specify and enforce authorization policies so that you can specify which
users can access which services. With IAM, you can manage which resources can be accessed
by whom, and how those resources can be accessed.
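To make the IAM model concrete, the following minimal boto3 (AWS SDK for Python) sketch creates a group, attaches an AWS managed policy to it, and adds a user to the group. The group name, user name, and chosen policy are illustrative assumptions, and administrator credentials are assumed to be configured.

    import boto3

    # A minimal sketch (names are hypothetical) of how users, groups,
    # and managed policies fit together in IAM.
    iam = boto3.client("iam")

    iam.create_group(GroupName="DataEngineers")
    iam.attach_group_policy(
        GroupName="DataEngineers",
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",  # AWS managed policy
    )

    iam.create_user(UserName="chetana")
    iam.add_user_to_group(GroupName="DataEngineers", UserName="chetana")

With this setup, any permissions attached to the group apply to every user placed in it, which is the pattern the figure below illustrates.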
Fig 2.7 AWS IAM users and groups
Data encryption is an essential tool to use when your objective is to protect digital data.
Data encryption takes data that is legible and encodes it so that it is unreadable to anyone who
does not have access to the secret key that can be used to decode it. Thus, even if an attacker
gains access to your data, they cannot make sense of it. Data at rest refers to data that is
physically stored on disk or on tape.
Data in transit refers to data that is moving across the network. Encryption of data in
transit is accomplished by using Transport Layer Security (TLS) 1.2 with an open standard
AES-256 cipher. TLS was formerly called Secure Sockets Layer (SSL).
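As a hedged illustration of protecting data at rest and in transit, the boto3 sketch below uploads an object to Amazon S3 with server-side encryption requested; the bucket and key names are hypothetical. The SDK calls AWS over HTTPS, so the upload itself travels over TLS.

    import boto3

    s3 = boto3.client("s3")  # API calls are made over HTTPS (TLS) by default

    # Request server-side encryption so the object is encrypted at rest.
    s3.put_object(
        Bucket="example-report-bucket",   # hypothetical bucket name
        Key="reports/2024/summary.csv",
        Body=b"id,value\n1,42\n",
        ServerSideEncryption="AES256",    # SSE-S3 (AES-256) managed by Amazon S3
    )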
A computer network is two or more client machines that are connected together to share
resources. A network can be logically partitioned into subnets. Networking requires a
networking device (such as a router or switch) to connect all the clients together and enable
communication between them.
Each client machine in a network has a unique Internet Protocol (IP) address that
identifies it. An IP address is a numerical label in decimal format. Machines convert that
decimal number to a binary format.
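The dotted-decimal to binary conversion described above can be illustrated with a few lines of Python (a simple standalone sketch, not tied to any AWS service):

    # Convert a dotted-decimal IPv4 address into its 32-bit binary form.
    address = "192.0.2.1"
    binary = ".".join(f"{int(octet):08b}" for octet in address.split("."))
    print(binary)  # 11000000.00000000.00000010.00000001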
IPv4 and IPv6 addresses
A 32-bit IP address is called an IPv4 address. IPv6 addresses, which are 128 bits, are
also available. IPv6 addresses can accommodate more user devices.
The Open Systems Interconnection (OSI) model is a conceptual model that is used to
explain how data travels over a network. It consists of seven layers and shows the common
protocols and addresses that are used to send data at each layer.
Amazon VPC
Amazon VPC enables you to provision a logically isolated section of the AWS Cloud where
you can launch AWS resources in a virtual network that you define. It gives you control over
your virtual networking resources, including the selection of your own IP address range, the
creation of subnets, and the configuration of route tables and network gateways. It also enables
you to customize the network configuration for your VPC and to use multiple layers of security.
Fig 2.8 AWS VPCs and subnets
An internet gateway is a scalable, redundant, and highly available VPC component that
allows communication between instances in your VPC and the internet. An internet gateway
serves two purposes: to provide a target in your VPC route tables for internet-routable traffic,
and to perform network address translation for instances that were assigned public IPv4
addresses.
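The boto3 sketch below, offered as a hedged illustration rather than a prescribed procedure, provisions a small VPC with one subnet, attaches an internet gateway, and adds the default route that makes the subnet internet-routable; the CIDR ranges are arbitrary examples.

    import boto3

    ec2 = boto3.client("ec2")

    # Create a VPC and a subnet inside it (example CIDR ranges).
    vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
    subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]

    # Attach an internet gateway and route 0.0.0.0/0 traffic through it.
    igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

    rt_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
    ec2.create_route(RouteTableId=rt_id, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
    ec2.associate_route_table(RouteTableId=rt_id, SubnetId=subnet_id)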
Amazon Route 53
Amazon Route 53 is a highly available and scalable cloud Domain Name System (DNS)
web service. It is designed to give developers and businesses a reliable and cost-effective way
to route users to internet applications by translating names (like www.example.com) into the
numeric IP addresses (like 192.0.2.1) that computers use to connect to each other. In addition,
Amazon Route 53 is fully compliant with IPv6.
Fig 2.11 Amazon CloudFront
Amazon CloudFront delivers content through a worldwide network of data centres that
are called edge locations. When a user requests content that you serve with CloudFront, the
user is routed to the edge location that provides the lowest latency (or time delay) so that
content is delivered with the best possible performance.
2.6 Compute
Amazon Elastic Compute Cloud (Amazon EC2)
Elastic refers to the fact that you can easily increase or decrease the number of servers
you run to support an application automatically, and you can also increase or decrease the size
of existing servers.
Compute refers to the reason why most users run servers in the first place, which is to host
running applications or process data, actions that require compute resources, including
processing power (CPU) and memory (RAM). Cloud refers to the fact that the EC2 instances
that you run are hosted in the cloud.
Docker
Docker is a software platform that packages software (such as applications) into
containers. Docker is installed on each server that will host containers, and it provides simple
commands that you can use to build, start, or stop containers.
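As a hedged illustration of those commands from Python, the sketch below uses the Docker SDK for Python (the docker package, a library choice assumed here and not named in the original report) to build an image and then start and stop a container; the image tag and build path are hypothetical.

    import docker

    client = docker.from_env()  # talks to the local Docker daemon

    # Build an image from a Dockerfile in the current directory (hypothetical tag).
    image, build_logs = client.images.build(path=".", tag="demo-app:latest")

    # Start a container in the background, then stop it.
    container = client.containers.run("demo-app:latest", detach=True)
    container.stop()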
Docker is best used as a solution when you want to:
i. Standardize environments
ii. Reduce conflicts between language stacks and versions
iii. Use containers as a service
iv. Run microservices using standardized code deployments
v. Require portability for data processing
What is Kubernetes?
Kubernetes is open-source software for deploying, managing, and scaling containerized
applications. Amazon ECR is a fully managed Docker container registry that makes it easy for
developers to store, manage, and deploy Docker container images.
It is integrated with Amazon ECS, so you can store, run, and manage container images
for applications that run on Amazon ECS. Specify the Amazon ECR repository in your task
definition, and Amazon ECS will retrieve the appropriate images for your applications.
AWS Lambda: Run code without servers:
AWS Lambda is a serverless compute service.
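To illustrate the programming model, here is a minimal Python Lambda handler of the kind that could be deployed to the service; the event field and the greeting logic are purely illustrative.

    import json

    def lambda_handler(event, context):
        """Entry point that AWS Lambda invokes for each event."""
        name = event.get("name", "world")       # hypothetical event field
        return {
            "statusCode": 200,
            "body": json.dumps({"message": f"Hello, {name}!"}),
        }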
2.7 Storage
Amazon EBS use cases include: boot volumes and storage for Amazon Elastic Compute Cloud
(Amazon EC2) instances, data storage with a file system, database hosts, and enterprise
applications.
Fig 2.16 AWS EBS and types
Amazon S3 is object-level storage, which means that if you want to change a part of a
file, you must make the change and then re-upload the entire modified file.
Amazon S3 is a managed cloud storage solution that is designed to scale seamlessly
and provide 11 9s of durability. You can store virtually as many objects as you want in a bucket,
and you can write, read, and delete objects in your bucket.
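The following boto3 sketch is a simple, hedged example of the object-level write, read, and delete cycle described above; the bucket is assumed to exist already and the key name is hypothetical. It also shows the point made earlier: changing part of a file means re-uploading the whole object.

    import boto3

    s3 = boto3.client("s3")
    bucket, key = "example-report-bucket", "notes/hello.txt"          # hypothetical names

    s3.put_object(Bucket=bucket, Key=key, Body=b"first version")      # write (upload)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()       # read
    s3.put_object(Bucket=bucket, Key=key, Body=b"second version")     # edit = re-upload whole object
    s3.delete_object(Bucket=bucket, Key=key)                          # delete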
Amazon EFS provides simple, scalable, elastic file storage for use with AWS services
and on-premises resources. It offers a simple interface that enables you to create and configure
file systems quickly and easily.
2.8 Databases
Unmanaged services require you to handle changes in load, errors, and situations where
resources become unavailable. Say that you launch a web server on an Amazon Elastic
Compute Cloud (Amazon EC2) instance: because EC2 is an unmanaged service, you are
responsible for scaling and recovering that web server yourself. Managed services still require
the user to configure them. For example, you create an Amazon Simple Storage Service
(Amazon S3) bucket and then set permissions for it. However, managed services typically
require less configuration.
Amazon RDS
Managed service that sets up and operates a relational database in the cloud.
Amazon RDS enables you to focus on your application, so you can give applications
the performance, high availability, security, and compatibility that they need. With Amazon
RDS, your primary focus is your data and optimizing your application.
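As a hedged sketch of how such a managed relational database is provisioned through the API (the instance identifier, engine choice, and credentials below are placeholders, and real credentials should never be hard-coded):

    import boto3

    rds = boto3.client("rds")

    # Launch a small managed MySQL instance; RDS handles patching, backups, and failover.
    rds.create_db_instance(
        DBInstanceIdentifier="demo-db",        # hypothetical identifier
        DBInstanceClass="db.t3.micro",
        Engine="mysql",
        MasterUsername="admin",
        MasterUserPassword="REPLACE_ME",       # placeholder only
        AllocatedStorage=20,                   # GiB
    )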
Amazon DynamoDB
Amazon DynamoDB is a fully managed, non-relational (NoSQL) key-value and document
database that delivers consistent performance at any scale. To see where it fits, it helps to
contrast the two database families. A relational database (RDB) works with structured data
that is organized by tables, records, and columns. RDBs establish a well-defined relationship
between database tables. RDBs use Structured Query Language (SQL), a standard language
that provides a programming interface for database interaction.
A non-relational database is any database that does not follow the relational model that
is provided by traditional relational database management systems (RDBMS). Non-relational
databases have grown in popularity because they were designed to overcome the limitations of
relational databases for handling the demands of variable structured data.
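A hedged boto3 sketch of the key-value access pattern used by non-relational databases such as DynamoDB (the table is assumed to already exist with a partition key named order_id; all names and values are illustrative):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("Orders")            # hypothetical existing table

    # Items are schemaless apart from the key attributes.
    table.put_item(Item={"order_id": "1001", "customer": "Asha", "total": 499})
    item = table.get_item(Key={"order_id": "1001"}).get("Item")
    print(item)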
Amazon Aurora
Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database engine that is
built for the cloud and is offered as part of Amazon RDS.
2.9 Cloud Architecture
The AWS Well-Architected Framework describes key concepts, design principles, and
architectural best practices for designing and running workloads in the AWS Cloud. It is
organized into the pillars summarized below.
Operational Excellence pillar
The Operational Excellence pillar focuses on the ability to run and monitor systems to
deliver business value, and to continually improve supporting processes and procedures.
There are five design principles for operational excellence in the cloud:
i. Perform operations as code.
ii. Make frequent, small, reversible changes.
iii. Refine operations procedures frequently.
iv. Anticipate failure.
v. Learn from all operational failures.
Security pillar
The Security pillar focuses on the ability to protect information, systems, and assets
while delivering business value through risk assessments and mitigation strategies.
There are seven design principles that can improve security:
i. Implement a strong identity foundation
ii. Enable traceability
iii. Apply security at all layers
iv. Automate security best practices
v. Protect data in transit and at rest
vi. Keep people away from data
vii. Prepare for security events
Reliability pillar
The Reliability pillar focuses on ensuring a workload performs its intended function
correctly and consistently when it’s expected to.
Performance Efficiency pillar
The Performance Efficiency pillar focuses on the ability to use IT and computing
resources efficiently to meet system requirements, and to maintain that efficiency as demand
changes or technologies evolve.
There are five design principles that can improve performance efficiency:
i. Democratize advanced technologies.
ii. Go global in minutes.
iii. Use serverless architectures.
iv. Experiment more often.
v. Consider mechanical sympathy.
Cost Optimization pillar
The Cost Optimization pillar focuses on the ability to avoid unnecessary costs. Key
topics include: understanding and controlling where money is being spent, selecting the most
appropriate and right number of resource types, analysing spend over time, and scaling to
meet business needs without overspending.
A load balancer accepts incoming traffic from clients and routes requests to its
registered targets (such as EC2 instances) in one or more Availability Zones.
There is a key difference in how the load balancer types are configured. With
Application Load Balancers and Network Load Balancers, you register targets in target groups,
and route traffic to the target groups. With Classic Load Balancers, you register instances with
the load balancer.
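A hedged boto3 sketch of the target-group pattern used by Application Load Balancers and Network Load Balancers (the VPC ID and instance ID are placeholders; a load balancer and listener would also be needed for end-to-end traffic):

    import boto3

    elbv2 = boto3.client("elbv2")

    # Create a target group, then register an EC2 instance as a target.
    tg_arn = elbv2.create_target_group(
        Name="demo-web-targets",            # hypothetical name
        Protocol="HTTP",
        Port=80,
        VpcId="vpc-0123456789abcdef0",      # placeholder VPC ID
        TargetType="instance",
    )["TargetGroups"][0]["TargetGroupArn"]

    elbv2.register_targets(
        TargetGroupArn=tg_arn,
        Targets=[{"Id": "i-0123456789abcdef0"}],  # placeholder instance ID
    )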
Amazon CloudWatch
Amazon CloudWatch is a monitoring and observability service that is built for DevOps
engineers, developers, site reliability engineers (SRE), and IT managers. CloudWatch monitors
your AWS resources (and the applications that you run on AWS) in real time.
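As a simple, hedged illustration of publishing a custom metric that CloudWatch can then graph or alarm on (the namespace and metric name are hypothetical):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish one data point for a custom application metric.
    cloudwatch.put_metric_data(
        Namespace="DemoApp",                       # hypothetical namespace
        MetricData=[{
            "MetricName": "OrdersProcessed",
            "Value": 12.0,
            "Unit": "Count",
        }],
    )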
When you run your applications on AWS, you want to ensure that your architecture can
scale to handle changes in demand. In this section, you will learn how to automatically scale
your EC2 instances with Amazon EC2 Auto Scaling.
Scaling is the ability to increase or decrease the compute capacity of your application.
To understand why scaling is important, consider this example of a workload that has varying
resource requirements.
AWS Auto Scaling
AWS Auto Scaling is a separate service that monitors your applications. It automatically
adjusts capacity to maintain steady, predictable performance at the lowest possible cost. The
service provides a simple, powerful user interface that enables you to build scaling plans for
resources, including Amazon EC2 Auto Scaling groups, Amazon EC2 Spot Fleet requests,
Amazon ECS services, Amazon DynamoDB tables and indexes, and Amazon Aurora Replicas
(a minimal scaling-policy sketch follows).
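The hedged boto3 sketch below attaches a target-tracking scaling policy to an existing Amazon EC2 Auto Scaling group (the group name is a placeholder), which keeps average CPU utilization near 50 percent:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Target tracking: add or remove instances to hold average CPU near 50%.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="demo-web-asg",      # placeholder group name
        PolicyName="keep-cpu-at-50",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": 50.0,
        },
    )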
Chapter 3: AWS Academy Data Engineering
Data is playing a key role across the whole business cycle, informing key strategic
decisions rather than only back-end tasks. It is foreseen that organizations that use data-driven
analytics are 19 times more profitable than organizations that are not data-driven, and 23 times
more likely to acquire customers.
There are a variety of paths organizations can take to jump-start their data-driven roadmap,
such as launching Big Data initiatives, broadening data-collection initiatives, hiring a Chief
Data Officer (CDO), and creating new analytics functions. But there’s more to it than this. To
become data-driven, organizations need to:
i. Create a culture of innovation that positions data at the core of your business strategy.
ii. Build data capabilities to help drive that culture.
3.2 The elements of data
i. Volume
It refers to the sheer amount of data that is generated and stored.
ii. Velocity
It refers to the speed at which the data is getting accumulated. This is mainly due
to IoT devices, mobile data, social media, etc.
iii. Variety
• Structured data
• Semi-structured data
• Unstructured data
iv. Veracity
It refers to the trustworthiness and quality of the collected data, since data gathered
from many sources can be incomplete or inconsistent.
v. Value
Just because we collected lots of data, it is of no value unless we garner some
insights out of it. Value refers to how useful the data is in decision making. We need to
extract the value of the Big Data using proper analytics.
Fig 3.2 5v’s of data
3.3 Design principles and patterns for data pipelines
vi. Insights are the output of the big data pipeline. The business owners use these
insights to make critical decisions.
i. Batch processing
The development of batch processing was a critical step in building data infrastructures that
were reliable and scalable. In 2004, MapReduce, a batch processing algorithm, was patented
and then subsequently integrated into open-source systems such as Hadoop, CouchDB, and
MongoDB.
As the name implies, batch processing loads “batches” of data into a repository during set
time intervals, which are typically scheduled during off-peak business hours. This way, other
workloads aren’t impacted as batch processing jobs tend to work with large volumes of data,
which can tax the overall system. Batch processing is usually the optimal data pipeline when
there isn’t an immediate need to analyse a specific dataset (e.g. monthly accounting), and it is
more associated with the ETL data integration process, which stands for “extract, transform,
and load.”
Batch processing jobs form a workflow of sequenced commands, where the output of one
command becomes the input of the next command. For example, one command may kick off
data ingestion, the next command may trigger filtering of specific columns, and the subsequent
command may handle aggregation. This series of commands continues until the data is
completely transformed and written into a data repository.
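A minimal, hedged sketch of such a sequenced batch job using pandas (the file paths and column names are illustrative assumptions):

    import pandas as pd

    # Step 1: ingestion - read the raw batch file.
    raw = pd.read_csv("raw/sales_2024_10.csv")          # hypothetical path

    # Step 2: filtering - keep only the columns needed downstream.
    filtered = raw[["order_id", "region", "amount"]]

    # Step 3: aggregation - total sales per region.
    summary = filtered.groupby("region", as_index=False)["amount"].sum()

    # Step 4: load - write the transformed output to the repository.
    summary.to_csv("curated/sales_by_region_2024_10.csv", index=False)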
ii. Stream processing
Unlike batch processing, streaming data is leveraged when it is required for data to be
continuously updated. For example, apps or point-of-sale systems need real-time data to update
the inventory and sales history of their products; that way, sellers can inform consumers whether
a product is in stock or not. A single action, like a product sale, is considered an “event”, and
related events, such as adding an item to checkout, are typically grouped together as a “topic”
or “stream.” These events are then transported via messaging systems or message brokers, such
as the open-source offering Apache Kafka.
Since data events are processed shortly after occurring, stream processing systems have
lower latency than batch systems, but they are not considered as reliable as batch processing
systems because messages can be unintentionally dropped or spend a long time in a queue.
Message brokers help to address this concern through acknowledgements, where a consumer
confirms processing of the message to the broker so that it can be removed from the queue.
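A hedged sketch of that acknowledgement pattern using the kafka-python client (a library choice assumed here, not named in the original; the topic, broker address, and group ID are placeholders). The consumer disables auto-commit and acknowledges an offset only after the message has actually been processed:

    from kafka import KafkaConsumer

    def update_inventory(payload: bytes) -> None:
        """Placeholder processing step for each event."""
        print("processing", payload)

    consumer = KafkaConsumer(
        "sales-events",                      # hypothetical topic
        bootstrap_servers="localhost:9092",  # placeholder broker address
        group_id="inventory-updater",
        enable_auto_commit=False,            # we acknowledge manually
    )

    for message in consumer:
        update_inventory(message.value)
        consumer.commit()                    # acknowledge only after successful processing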
ii. Data Transformation: During this step, a series of jobs is executed to
process data into the format required by the destination data repository. These
jobs embed automation and governance for repetitive workstreams, like
business reporting, ensuring that data is cleansed and transformed
consistently. For example, a data stream may come in a nested JSON format,
and the data transformation stage will aim to unroll that JSON to extract the key
fields for analysis (see the flattening sketch after this list).
iii. Data Storage: The transformed data is then stored within a data repository,
where it can be exposed to various stakeholders. In streaming systems, the
downstream stakeholders and applications that read this transformed data are
typically known as consumers, subscribers, or recipients.
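A minimal, hedged example of unrolling nested JSON with pandas.json_normalize (the record structure and field names are invented for illustration):

    import pandas as pd

    # A nested JSON-style record, as it might arrive on a stream.
    records = [
        {"order_id": 1001, "customer": {"name": "Asha", "city": "Guntur"},
         "amount": 499},
    ]

    # Flatten the nested fields into ordinary columns for analysis.
    flat = pd.json_normalize(records)
    print(flat.columns.tolist())   # e.g. ['order_id', 'amount', 'customer.name', 'customer.city']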
Scaling refers to changing the number of machines or the size of the machine depending
on the size of the data to be processed. Increasing the number of machines or the size of the
machine is called scaling up, and decreasing them is called scaling down. Scaling down is done
to keep costs low.
i. Vertical scaling: Increasing the memory and/or disk size of the data processing
machine.
ii. Horizontal scaling: Using multiple processes or machines to process a large data set
in parallel (a small sketch follows this list).
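A hedged, purely local illustration of the horizontal-scaling idea using Python's multiprocessing module: the data set is split into chunks that several worker processes handle in parallel, and the partial results are then combined.

    from multiprocessing import Pool

    def process_chunk(chunk):
        """Toy per-chunk work: sum the values in the chunk."""
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]

        with Pool(processes=4) as pool:          # four parallel workers
            partial_sums = pool.map(process_chunk, chunks)

        print(sum(partial_sums))                 # combine the partial results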
Data ingestion and preparation involve the processes of collecting, curating, and preparing
data for ML. Data ingestion involves collecting batch or streaming data in unstructured or
structured format. Data preparation takes the ingested data and processes it into a format that
can be used with ML.
Identifying, collecting, and transforming data is the foundation for ML. There is
widespread consensus among ML practitioners that data preparation accounts for
approximately 80% of the time spent in developing a viable ML model.
There are several challenges that public sector organizations face in this phase: First is
the ability to connect to and extract data from different types of data sources. Once the data is
extracted, it needs to be catalogued and organized so that it is available for consumption, and
there needs to be a mechanism in place to ensure that only authorized resources have access to
the data. Mechanisms are also needed to ensure that source data transformed for ML is
reviewed and approved for compliance with federal government guidelines.
Data Ingestion
The AWS Cloud enables public sector customers to overcome the challenge of
connecting to and extracting data from both streaming and batch data, as described in the
following:
Streaming Data: For streaming data, Amazon Kinesis and Amazon Managed Streaming for
Apache Kafka (Amazon MSK) enable the collection, processing, and analysis of data in real
time. Amazon Kinesis provides a suite of capabilities to collect, process, and analyse real-time
streaming data. Amazon Kinesis Data Streams (KDS) is a service that enables ingestion of
streaming data. Producers of data push data directly into a stream, which consists of a group of
stored data units called records.
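A hedged boto3 sketch of a producer pushing one record into a Kinesis data stream (the stream name and event payload are placeholders):

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    event = {"user_id": "u-42", "action": "add_to_cart", "sku": "B0123"}  # sample event

    # Producers push records into the stream; the partition key controls sharding.
    kinesis.put_record(
        StreamName="clickstream-demo",            # placeholder stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],
    )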
Batch Data: There are a number of mechanisms available for data ingestion in batch
format. With AWS Database Migration Service (AWS DMS), you can replicate and ingest
existing databases while the source databases remain fully operational. The service supports
multiple database sources and targets, including writing data directly to Amazon S3.
Data Preparation
Once the data is extracted, it needs to be transformed and loaded into a data store for
feeding into an ML model. It also needs to be catalogued and organized so that it is available
for consumption, and also needs to enable data lineage for compliance with federal government
guidelines. AWS Cloud provides three services that provide these mechanisms. They are:
AWS Glue: It is a fully managed ETL (extract, transform, and load) service that makes
it simple and cost-effective to categorize, clean, enrich, and migrate data from a source system
to a data store for ML. The AWS Glue Data Catalog stores the location and schema of your
data as metadata tables (where each table specifies a single source data store) that ETL jobs
use as sources and targets. A crawler can be set up to automatically take inventory of the data
in your data stores.
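A hedged boto3 sketch of creating and starting a Glue crawler over an S3 location (the crawler name, IAM role ARN, database name, and S3 path are placeholders):

    import boto3

    glue = boto3.client("glue")

    # Define a crawler that catalogues raw data sitting in S3.
    glue.create_crawler(
        Name="raw-sales-crawler",                                     # placeholder
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",        # placeholder role ARN
        DatabaseName="sales_catalog",
        Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/sales/"}]},
    )

    glue.start_crawler(Name="raw-sales-crawler")  # populates tables in the Data Catalog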
Amazon SageMaker Data Wrangler: It is a service that enables the aggregation and
preparation of data for ML and is directly integrated into Amazon SageMaker Studio.
Amazon EMR: Many organizations use Spark for data processing and other purposes
such as for a data warehouse. These organizations already have a complete end-to-end pipeline
in Spark and also the skillset and inclination to run a persistent Spark cluster for the long term.
In these situations, Amazon EMR, a managed service for Hadoop-ecosystem clusters, can be
used to process data.
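Data processing jobs on clusters such as Amazon EMR are commonly written with PySpark. The following is a small, hedged sketch of such a job (the input and output S3 paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

    # Read raw CSV data from S3, aggregate it, and write the result back as Parquet.
    raw = spark.read.csv("s3://example-bucket/raw/sales/", header=True, inferSchema=True)
    summary = raw.groupBy("region").sum("amount")
    summary.write.mode("overwrite").parquet("s3://example-bucket/curated/sales_by_region/")

    spark.stop()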
In today’s data-driven world, the efficient ingestion of data is crucial for organizations
seeking to extract valuable insights and make informed decisions. Two primary methods
employed for data ingestion are batch ingestion and streaming ingestion. Each approach comes
with its own set of advantages and limitations, and the choice between batch and streaming
ingestion depends on various factors, including the nature of the data, data volume, processing
requirements, and the need for real-time analysis. This section explores and compares batch
and streaming data ingestion, shedding light on their respective strengths, weaknesses, and use
cases.
Batch data ingestion involves collecting and processing data in predefined, fixed-size
chunks or batches. In this approach, data is ingested periodically, typically at regular intervals
(e.g., hourly, daily, or weekly). The data collected in each batch is processed as a unit, often in
parallel, to extract valuable insights. Batch data ingestion is well-suited for applications that do
not require real-time or immediate analysis, and it is characterized by its ability to handle large
data volumes efficiently.
i. Scalability: Batch data ingestion allows organizations to process large volumes of data
efficiently by breaking it into manageable chunks. It facilitates horizontal scalability,
enabling the addition of more resources to meet growing data demands.
iii. Cost-Effectiveness: Since batch processing is not real-time, it requires fewer
computational resources compared to streaming ingestion, making it cost-effective.
i. Latency: Batch ingestion introduces latency as data is collected and processed in fixed
intervals. Consequently, insights are not immediately available, which can be limiting
for time-critical applications.
ii. Stale Insights: The insights generated from batch data ingestion may become outdated
by the time they are available for analysis, potentially affecting decision-making
accuracy.
iii. Unsuitable for Real-Time Applications: Batch ingestion is not suitable for
applications that require real-time analysis and immediate actions based on incoming
data.
i. Real-Time Insights: Streaming ingestion provides real-time insights and actions, as it
enables organizations to respond immediately to time-sensitive events.
ii. Low Latency: The real-time nature of streaming ingestion minimizes latency, ensuring
that insights are always up-to-date and relevant.
iii. Cost: Real-time processing can be more costly than batch processing, as it demands
significant investments in infrastructure and technology.
3.7 Storing and organizing data
Data warehouse
A data warehouse (DW) is a central repository storing data in queryable forms. From
a technical standpoint, a data warehouse is a relational database optimized for reading,
aggregating, and querying large volumes of data. Traditionally, DWs only contained structured
data or data that can be arranged in tables. However, modern DWs can also support
unstructured data (such as images, PDF files, and audio formats).
Data marts
Simply speaking, a data mart is a smaller data warehouse (its size is usually less than
100 GB). Data marts become necessary when the company and the amount of its data grow,
and it becomes too slow and inefficient to search for information in an enterprise DW. Instead,
data marts are built to allow different departments (e.g., sales, marketing, C-suite) to access
relevant information quickly and easily.
OLAP and OLAP cubes
Online analytical processing (OLAP) is an approach to answering analytical queries quickly.
An OLAP cube stores pre-aggregated data along multiple dimensions (for example, product,
region, and time) so that it can be sliced and summarized rapidly.
In the real world, most data is unstructured, making it difficult to streamline data
processing tasks. And since there is no end to the data generation process, collecting and
storing information has become increasingly difficult. Today, it has become essential to have a
systematic approach to handling Big Data to ensure organizations can effectively harness the
power of data.
In this section, you will learn about Big Data, its types, the steps for Big Data processing,
and the tools used to handle enormous volumes of information.
Once the data has been processed and cleaned, you can further use it for statistical analysis
or for building Machine Learning models for predictions.
Fig 3.11 Processing data ML
Analytics is a broad term that encompasses numerous subfields. It refers to all the tools
and activities involved in processing data to develop valuable insights and interpretations. It is
worth noting that Data Analytics is dependent on computer tools and software that help extract
data and analyse them for business decisions to be made accordingly.
Data Analytics is widely adopted in the commercial industry since it helps companies
to better understand their customers and boost their advertising campaigns. It is a highly
dynamic field, with new innovations appearing regularly. Today, Data Analytics relies heavily
on computer algorithms that process raw data to arrive at sensible conclusions.
There are two main Data Analysis techniques (a short worked sketch follows the list):
i. Descriptive Analysis: This type of Data Analysis lets you see the
patterns and trends in a particular set of data. It includes processes such
as calculating frequencies, percentages, and measures of central
tendency, including mean, mode, and median.
ii. Inferential Analysis: It is used when the differences and correlations
between particular data sets need to be examined. The processes
involved include ANOVA, t-Tests, and Chi-Square.
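A hedged Python sketch of both techniques on a tiny invented data set: descriptive statistics with pandas, and an inferential t-test with SciPy.

    import pandas as pd
    from scipy import stats

    scores = pd.DataFrame({
        "group": ["A"] * 5 + ["B"] * 5,
        "score": [72, 75, 78, 74, 73, 81, 85, 83, 84, 82],   # invented values
    })

    # Descriptive analysis: central tendency and spread per group.
    print(scores.groupby("group")["score"].describe())

    # Inferential analysis: t-test for a difference between the two group means.
    group_a = scores.loc[scores["group"] == "A", "score"]
    group_b = scores.loc[scores["group"] == "B", "score"]
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(t_stat, p_value)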
Fig 3.12 Data Analysis
Data Visualization
Data Visualization deals with presenting information pictorially so that trends and
conclusions can be identified. In Data Visualization, information is organized into charts,
graphs, and other forms of visual representation. This simplifies otherwise complicated
information and makes it accessible to all the involved stakeholders so that they can make
critical business decisions.
ii. Graphs: These are excellent tools for analysing the time-series relationship in a
particular set of data. For instance, a company's annual profits could be analysed
month by month using a graph (a minimal plotting sketch follows this list).
iii. Fever Charts: A Fever Chart is an indispensable tool for any business since it
shows how data changes over time. For instance, a particular product’s performance
could be analysed based on its yearly profits.
iv. Heatmap Visualization: This tool is based on the psychological fact that the human
brain interprets colours much faster than numbers. It is a graph in which numerical
data points are highlighted with cool or warm colours to represent low or high
values.
v. Infographics: Infographics are effective when analysing complex datasets. They
take large amounts of data and organize it into an easy to interpret format.
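A hedged Matplotlib sketch of the monthly-profit graph described above (the figures are invented):

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    profits = [12, 15, 11, 18, 21, 19]          # invented monthly profits

    plt.plot(months, profits, marker="o")
    plt.title("Monthly profit")
    plt.xlabel("Month")
    plt.ylabel("Profit")
    plt.tight_layout()
    plt.show()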
A fully automated data pipeline enables your organization to extract data at the source,
transform it, and integrate it with data from other sources before sending it to a data warehouse
or lake for loading into business applications and analytics platforms.
The three main reasons to implement a fully automated data pipeline are:
The benefits range from targeted marketing to more efficient operations and increased
performance and productivity.
Chapter 4: Learning Outcomes
During this internship, I acquired many new skills that are very important for my
future career.
i. Core AWS services: I explored and gained hands-on experience with essential AWS
services like Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service
(S3), Identity and Access Management (IAM), and Virtual Private Cloud (VPC).
ii. Data warehousing with Amazon Redshift: I learned how to use Amazon Redshift, a
cloud-based data warehouse, to store, manage, and analyse large datasets efficiently.
iii. Big data processing with Amazon EMR: I gained experience in utilizing Amazon
EMR, a managed Hadoop framework, to process and analyse massive datasets
efficiently.
iv. Data visualization: I gained proficiency in creating data visualizations using tools like
Tableau or Matplotlib. I learned about choosing appropriate chart types, effective
visual design principles, and storytelling with data.
Chapter 5: Conclusion and Future Career
Conclusion
The AWS Academy Data Analytics virtual internship is an excellent opportunity for
individuals seeking to gain hands-on experience in the field of data analytics and explore the
capabilities of AWS data analytics services. The internship provides a comprehensive
foundation in cloud computing concepts, data analysis techniques, and AWS data analytics
services, preparing participants for real-world data analytics projects and potential certification.
Future Career
With the growing demand for data analytics professionals, the AWS Academy Data
Analytics virtual internship can open doors to various career opportunities in the field.
Here are some potential career paths for individuals who have completed the internship:
i. Data Analyst
ii. Data Scientist
iii. Data Engineer
iv. Business Intelligence Analyst
v. Machine Learning Engineer
Chapter 6: References