DATA ENGINEERING VIRTUAL INTERNSHIP REPORT

Submitted in partial fulfilment of the requirements of CSBS


Summer Internship (CB - 451)
IV/IV B. Tech CSBS (VII Semester)
Submitted by
Chetana Guntupalli (Y21CB015)

OCTOBER 2024
R.V.R. & J.C. College of Engineering (Autonomous)
(NAAC A+ Grade) (Approved by A.I.C.T.E)
(Affiliated to Acharya Nagarjuna University)
Chandramoulipuram : Chowdavaram
Guntur – 522 019
R.V.R. & J.C. COLLEGE OF ENGINEERING
DEPARTMENT OF
COMPUTER SCIENCE AND BUSINESS SYSTEMS

CERTIFICATE

This is to certify that this internship report “Data Engineering Virtual Internship”
is the bona fide work of “Chetana Guntupalli (Y21CB015)”, who has carried out the work
under my supervision and submitted in partial fulfilment for the award of Summer
Internship (CB-451) during the year 2024 - 2025.

Mr. K. Subramanyam Dr. A. Sri Nagesh


Internship Incharge                                  Prof. & HOD, CSBS
ACKNOWLEDGEMENT

I would like to express my sincere gratitude to the dignitaries who supported me throughout
the journey of my summer internship “AWS Data Engineering Virtual Internship”.

First and foremost, I extend my heartfelt thanks to Dr. Kolla Srinivas, Principal of
R.V.R. & J.C. College of Engineering, Guntur, for providing such an encouraging
environment in which to undertake this internship.

I am deeply grateful to Dr. A. Sri Nagesh, Head of the Department of Computer
Science and Business Systems, for paving a path for me and assisting me in meeting the
requirements needed to complete this internship.

I extend my gratitude to the internship in-charge, Mr. K. Subramanyam, for his excellent
guidance and feedback throughout the internship. His constant support and constructive
criticism have helped me in completing the internship.

I would also like to express my sincere thanks to my friends and family for their moral
support throughout this journey.

Chetana Guntupalli (Y21CB015)


SUMMER INTERNSHIP CERTIFICATE
ABSTRACT

Data engineering is the process of designing and building systems that let people collect
and analyse raw data from multiple sources and formats. These systems empower people to
find practical applications of the data, which businesses can use to thrive. Data engineers play
a crucial role in designing, operating, and supporting the increasingly complex environments
that power modern data analytics. Historically, data engineers have carefully crafted data
warehouse schemas, with table structures and indexes designed to process queries quickly to
ensure adequate performance. With the rise of data lakes, data engineers have more data to
manage and deliver to downstream data consumers for analytics. Data that is stored in data
lakes may be unstructured and unformatted – it needs attention from data engineers before the
business can derive value from it. Fortunately, once a data set has been fully cleaned and
formatted through data engineering, it’s easier and faster to read and understand. Since
businesses are creating data constantly, it’s important to find software that will automate some
of these processes. The skill set of a data engineer encompasses the “undercurrents” of data
engineering: security, data management, DataOps, data architecture, and software engineering.
This skill set requires an understanding of how to evaluate data tools and how they fit together
across the data engineering lifecycle. It’s also critical to know how data is produced in source
systems and how analysts and data scientists will consume and create value after processing
and curating data. Finally, a data engineer juggles a lot of complex moving parts and must
constantly optimize along the axes of cost, agility, scalability, simplicity, reuse, and
interoperability. Nowadays, the data-tooling landscape is dramatically less complicated to
manage and deploy. Modern data tools considerably abstract and simplify workflows. As a
result, data engineers are now focused on balancing the simplest and most cost-effective, best-
of-breed services that deliver value to the business. The data engineer is also expected to create
agile data architectures that evolve as new trends emerge.
Table of Contents

Title Pg. No.


1. Introduction 1
2. AWS Academy Cloud Foundations 2
2.1. Cloud concepts overview 2-4
2.2. Cloud economics and billing 4-6
2.3. AWS global infrastructure overview 6-10
2.4. AWS cloud security 10-11
2.5. Networking and content delivery 11-15
2.6. Compute 15-18
2.7. Storage 18-20
2.8. Database 20-23
2.9. Cloud architecture 23-25
2.10. Auto scaling and monitoring 25-28
3. AWS Academy Data Engineering 28
3.1. Data driven organizations 28
3.2. The elements of data 29-30
3.3. Design principles and patterns for data pipelines 30-32
3.4. Securing and scaling the data pipeline 32-33
3.5. Ingesting and preparing data 33-35
3.6. Ingestion by batch or by stream 35-37
3.7. Storing and organizing data 38-39
3.8. Processing big data 39-40
3.9. Processing data for ML 40-41
3.10. Analysing and visualizing data 41-44
3.11. Automating the pipeline 44-46
4. Learning Outcomes 47
5. Conclusion and Future Career 48
6. References 49
List of Figures

Fig. No. Figure Name Pg. No.

Chapter 2
2.1 Cloud Service Models 2
2.2 TCO considerations 5
2.3 Selecting a region 6
2.4 AWS availability zones 6
2.5 AWS data centers 7
2.6 AWS different categories of services 7
2.7 AWS IAM users and groups 11
2.8 AWS VPCs and subnets 13
2.9 AWS cloud Internet gateway 13
2.10 Content delivery network 14
2.11 Amazon CloudFront 15
2.12 AWS computing services 15
2.13 Amazon EC2 instances 16
2.14 EC2 instance life cycle 16
2.15 AWS Lambda 18
2.16 AWS EBS and types 19
2.17 Amazon EFS architecture 20
2.18 Amazon RDS 21
2.19 Difference between relational and non-relational databases 22
2.20 Pillars of AWS Well-Architected Framework 23
2.21 Load balancing 25
2.22 EC2 auto scaling 26

Chapter 3
3.1 Benefits of Data Driven Organization 28
3.2 5 Vs of data 30
3.3 Data pipeline structure 30
3.4 Types of scaling 33
3.5 Batch data ingestion 36
3.6 Streaming data ingestion 37
3.7 Data Warehouse 38
3.8 Data Mart 38
3.9 OLAP cubes 39
3.10 Big Data analysis 40
3.11 Processing data ML 41
3.12 Data Analysis 43
3.13 Data Visualization 44
3.14 Fully automated Data pipeline 46
Chapter 1: Introduction

AWS Academy Cloud Foundations is intended for students who seek an overall
understanding of cloud computing concepts, independent of specific technical roles. It provides
a detailed overview of cloud concepts, AWS core services, security, architecture, pricing, and
support.

By the end of the course, you should be able to define the AWS Cloud; explain the
AWS pricing philosophy; identify the global infrastructure components of AWS; describe the
security and compliance measures of the AWS Cloud, including AWS Identity and Access
Management (IAM); create an AWS Virtual Private Cloud (Amazon VPC); demonstrate when
to use Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, and AWS Elastic
Beanstalk; differentiate between Amazon S3, Amazon EBS, Amazon EFS, and Amazon S3
Glacier; demonstrate when to use AWS database services, including Amazon Relational
Database Service (Amazon RDS), Amazon DynamoDB, Amazon Redshift, and Amazon
Aurora; explain AWS Cloud architectural principles; and explore key concepts related to
Elastic Load Balancing (ELB), Amazon CloudWatch, and Auto Scaling.

Data Engineering with AWS combines data engineering principles with the utilization
of Amazon Web Services (AWS) tools and services, focusing on designing, building, and
maintaining data pipelines and infrastructure for large-scale data processing and analysis.
Career opportunities in this field are abundant and promising as organizations increasingly rely
on data-driven decision-making. With the dominance of AWS in the cloud services market,
having expertise in data engineering with AWS opens up diverse career options and provides
versatility across industries.

Professionals skilled in data engineering with AWS are in high demand, and their
proficiency in data modelling, ETL processes, data warehousing, big data frameworks, and
cloud technologies makes them valuable assets for organizations.

The “Data Engineering with AWS” course provides a comprehensive understanding of
data engineering principles and the effective utilization of Amazon Web Services. You will
learn to build scalable and robust data solutions using services like Amazon S3, RDS,
Redshift, and DynamoDB. The course covers data storage, management, and processing,
including popular frameworks like Hadoop, Spark, and Kafka. Through hands-on
exercises and real-world examples, you will gain the skills to design and implement efficient
data pipelines for advanced analytics and handling large volumes of data.

Chapter 2: AWS Academy Cloud Foundations

2.1 Cloud concepts overview

Cloud computing is the on-demand delivery of compute power, database, storage,
applications, and other IT resources via the internet with pay-as-you-go pricing. These
resources run on server computers that are located in large data centres in different locations
around the world. When you use a cloud service provider like AWS, that service provider owns
the computers that you are using. These resources can be used together like building blocks to
build solutions that help meet business goals and satisfy technology requirements.

There are three main cloud service models. Each model represents a different part of
the cloud computing stack and gives you a different level of control over your IT resources:

i. Infrastructure as a service (IaaS): Services in this category are the basic building
blocks for cloud IT and typically provide you with access to networking features,
computers (virtual or on dedicated hardware), and data storage space.

ii. Platform as a service (PaaS): Services in this category reduce the need for you to
manage the underlying infrastructure (usually hardware and operating systems) and
enable you to focus on the deployment and management of your applications.

iii. Software as a service (SaaS): Services in this category provide you with a completed
product that the service provider runs and manages. In most cases, software as a service
refers to end-user applications. With a SaaS offering, you do not have to think about
how the service is maintained or how the underlying infrastructure is managed.

Fig 2.1 Cloud Service Models

Advantages of cloud computing

i. Trade capital expense for variable expense: Capital expenses (capex) are funds
that a company uses to acquire, upgrade, and maintain physical assets such as
property, industrial buildings, or equipment.

ii. Benefit from massive economies of scale: By using cloud computing, you can
achieve a lower variable cost than you can get on your own.

iii. Stop guessing capacity: Eliminate guessing about your infrastructure capacity
needs. When you make a capacity decision before you deploy an application, you
often either have expensive idle resources or deal with limited capacity.

iv. Increase speed and agility: In a cloud computing environment, new IT resources
are only a click away, which means that you reduce the time it takes to make those
resources available to your developers from weeks to just minutes.

v. Stop spending money on running and maintaining data centres: Focus on
projects that differentiate your business instead of focusing on the infrastructure.
Cloud computing enables you to focus on your own customers instead of the heavy
lifting of racking, stacking, and powering servers.

vi. Go global in minutes: You can deploy your application in multiple AWS Regions
around the world with just a few clicks. As a result, you can provide a lower latency
and better experience for your customers simply and at minimal cost.

Amazon Web Services

Amazon Web Services (AWS) is a secure cloud platform that offers a broad set of global
cloud-based products. Because these products are delivered over the internet, you have on-
demand access to the compute, storage, network, database, and other IT resources that you
might need for your projects—and the tools to manage them.

AWS offers flexibility. Your AWS environment can be reconfigured and updated on
demand, scaled up or down automatically to meet usage patterns and optimize spending, or
shut down temporarily or permanently. AWS services are designed to work together to support
virtually any type of application or workload. Think of these services like building blocks,
which you can assemble quickly to build sophisticated, scalable solutions, and then adjust them
as your needs change.

2.2 Cloud Economics and Billing

There are three fundamental drivers of cost with AWS: compute, storage, and outbound
data transfer. These characteristics vary somewhat, depending on the AWS product and pricing
model you choose. In most cases, there is no charge for inbound data transfer or for data transfer
between other AWS services within the same AWS Region. There are some exceptions, so be
sure to verify data transfer rates before you begin to use an AWS service. Outbound data
transfer is aggregated across services and then charged at the outbound data transfer rate. This
charge appears on the monthly statement as AWS Data Transfer Out.

Additional services

i. Amazon Virtual Private Cloud (Amazon VPC): It enables you to provision a
logically isolated section of the AWS Cloud where you can launch AWS
resources in a virtual network that you define.

ii. AWS Identity and Access Management (IAM): It controls your users’ access
to AWS services and resources. Consolidated Billing is a billing feature in AWS
Organizations to consolidate payment for multiple AWS accounts or multiple
Amazon Internet Services Private Limited (AISPL) accounts.

iii. AWS Elastic Beanstalk: It is an even easier way for you to quickly deploy and
manage applications in the AWS Cloud.

iv. AWS CloudFormation: It gives developers and systems administrators an easy
way to create a collection of related AWS resources and provision them in an
orderly and predictable fashion.

What is TCO?

Total cost of ownership (TCO) is an estimate of the full cost of owning and operating a
solution, covering both direct costs (such as servers and software licences) and indirect costs
(such as facilities, power, and administration). It is commonly used to compare running
workloads on premises with running them in the AWS Cloud.

Fig 2.2 TCO considerations

AWS Pricing Calculator


AWS offers the AWS Pricing Calculator to help you estimate a monthly AWS bill. You
can use this tool to explore AWS services and create an estimate for the cost of your use cases
on AWS.

The AWS Pricing Calculator helps you:


i. Estimate monthly costs of AWS services
ii. Identify opportunities for cost reduction
iii. Model your solutions before building them
iv. Explore price points and calculations behind your estimate
v. Find the available instance types and contract terms that meet your needs

The AWS Pricing Calculator enables you to name your estimate and create and name
groups of services.
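While the AWS Pricing Calculator is a web-based tool, pricing data can also be queried
programmatically through the AWS Price List API. A minimal Python (boto3) sketch is shown
below; the service code and filter values are illustrative assumptions:

    import json
    import boto3

    # The Price List API is served from selected Regions such as us-east-1.
    pricing = boto3.client("pricing", region_name="us-east-1")

    response = pricing.get_products(
        ServiceCode="AmazonEC2",
        Filters=[
            {"Type": "TERM_MATCH", "Field": "instanceType", "Value": "t3.micro"},
            {"Type": "TERM_MATCH", "Field": "location", "Value": "US East (N. Virginia)"},
        ],
        MaxResults=5,
    )

    # Each entry in PriceList is a JSON document describing one priced product.
    for item in response["PriceList"]:
        product = json.loads(item)
        print(product["product"]["attributes"].get("instanceType"))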

2.3 AWS Global Infrastructure overview

The AWS Cloud infrastructure is built around regions. AWS has 22 Regions worldwide.
An AWS Region is a physical geographical location with one or more Availability Zones.
Availability Zones in turn consist of one or more data centres.

Fig 2.3 Selecting a region

Fig 2.4 AWS availability zones

Fig 2.5 AWS data centres

The AWS Global Infrastructure has several valuable features:

i. First, it is elastic and scalable. This means resources can dynamically adjust to
increases or decreases in capacity requirements. It can also rapidly adjust to
accommodate growth.

ii. Second, this infrastructure is fault tolerant, which means it has built-in component
redundancy which enables it to continue operations despite a failed component.

iii. Finally, it requires minimal to no human intervention, while providing high
availability with minimal downtime.

Fig 2.6 AWS different categories of services

AWS storage services

i. Amazon Simple Storage Service (Amazon S3): It is an object storage service that
offers scalability, data availability, security, and performance. Use it to store and protect
any amount of data for websites, mobile apps, backup and restore, archive, enterprise
applications, Internet of Things (IoT) devices, and big data analytics.

ii. Amazon Elastic Block Store (Amazon EBS): It is high-performance block storage
that is designed for use with Amazon EC2 for both throughput and transaction intensive
workloads. It is used for a broad range of workloads, such as relational and non-
relational databases, enterprise applications, containerized applications, big data
analytics engines, file systems, and media workflows.

iii. Amazon Elastic File System (Amazon EFS): It provides a scalable, fully managed
elastic Network File System (NFS) file system for use with AWS Cloud services and
on-premises resources. It is built to scale on demand to petabytes, growing and
shrinking automatically as you add and remove files. It reduces the need to provision
and manage capacity to accommodate growth.

iv. Amazon S3 Glacier: It is a secure, durable, and extremely low-cost Amazon S3 cloud
storage class for data archiving and long-term backup. It is designed to deliver 11 9s of
durability, and to provide comprehensive security and compliance capabilities to meet
stringent regulatory requirements.
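A minimal Python (boto3) sketch of how data might be archived to the S3 Glacier storage class
with a lifecycle rule; the bucket name and prefix below are hypothetical:

    import boto3

    s3 = boto3.client("s3")

    # Transition objects under the "logs/" prefix to the Glacier storage class
    # after 90 days for long-term, low-cost archiving.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-data-engineering-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-old-logs",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "logs/"},
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                }
            ]
        },
    )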

AWS compute services

i. Amazon Elastic Compute Cloud (Amazon EC2): It provides resizable compute
capacity as virtual machines in the cloud. Amazon EC2 Auto Scaling enables you to
automatically add or remove EC2 instances according to conditions that you define.
Amazon Elastic Container Service (Amazon ECS) is a highly scalable, high-
performance container orchestration service that supports Docker containers.

ii. Amazon Elastic Container Registry (Amazon ECR): It is a fully-managed Docker
container registry that makes it easy for developers to store, manage, and deploy Docker
container images. AWS Elastic Beanstalk is a service for deploying and scaling web
applications and services on familiar servers such as Apache and Microsoft Internet
Information Services (IIS).

iii. AWS Lambda: It enables you to run code without provisioning or managing servers.
You pay only for the compute time that you consume. There is no charge when your
code is not running. Amazon Elastic Kubernetes Service (Amazon EKS) makes it easy
to deploy, manage, and scale containerized applications that use Kubernetes on AWS.

iv. AWS Fargate: It is a compute engine for Amazon ECS that allows you to run containers
without having to manage servers or clusters.

AWS networking and content delivery services

i. Amazon Virtual Private Cloud (Amazon VPC): It enables you to provision logically
isolated sections of the AWS Cloud.

ii. Elastic Load Balancing: It automatically distributes incoming application traffic
across multiple targets, such as Amazon EC2 instances, containers, IP addresses, and
Lambda functions.

iii. Amazon CloudFront: It is a fast content delivery network (CDN) service that securely
delivers data, videos, applications, and application programming interfaces (APIs) to
customers globally, with low latency and high transfer speeds.

iv. AWS Transit Gateway: It is a service that enables customers to connect their Amazon
Virtual Private Clouds (VPCs) and their on-premises networks to a single gateway.

v. Amazon Route 53: It is a scalable cloud Domain Name System (DNS) web service
designed to give you a reliable way to route end users to internet applications. It
translates names (like www.example.com) into the numeric IP addresses (like
192.0.2.1) that computers use to connect to each other.

vi. AWS Direct Connect: It provides a way to establish a dedicated private network
connection from your data centre or office to AWS, which can reduce network costs
and increase bandwidth throughput.

vii. AWS VPN: It provides a secure private tunnel from your network or device to the AWS
global network.

2.4 AWS Cloud Security

Security and compliance are a shared responsibility between AWS and the customer.
This shared responsibility model is designed to help relieve the customer’s operational burden.
At the same time, to provide the flexibility and customer control that enables the deployment
of customer solutions on AWS, the customer remains responsible for some aspects of the
overall security. The differentiation of who is responsible for what is commonly referred to as
security “of” the cloud versus security “in” the cloud.

AWS operates, manages, and controls the components from the software virtualization
layer down to the physical security of the facilities where AWS services operate. AWS is
responsible for protecting the infrastructure that runs all the services that are offered by AWS
Cloud. This infrastructure is composed of the hardware, software, networking, and facilities
that run the AWS Cloud services.

The customer is responsible for the encryption of data at rest and data in transit. The
customer should also ensure that the network is configured for security and that security
credentials and logins are managed safely. Additionally, the customer is responsible for the
configuration of security groups and the configuration of the operating system that run on
compute instances that they launch (including updates and security patches).

AWS and the customer share security responsibilities:


i. AWS is responsible for security of the cloud
ii. Customer is responsible for security in the cloud

AWS Identity and Access Management (IAM) allows you to control access to compute,
storage, database, and application services in the AWS Cloud. IAM can be used to handle
authentication, and to specify and enforce authorization policies so that you can specify which
users can access which services. With IAM, you can manage which resources can be accessed
by who, and how these resources can be accessed.

Fig 2.7 AWS IAM users and groups
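A minimal Python (boto3) sketch of the users-and-groups model shown in the figure; the user,
group, and managed policy are illustrative assumptions:

    import boto3

    iam = boto3.client("iam")

    # Create a group, attach an AWS managed policy to it, and add a user,
    # so the user inherits the group's permissions.
    iam.create_group(GroupName="data-engineers")
    iam.attach_group_policy(
        GroupName="data-engineers",
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
    )
    iam.create_user(UserName="example-user")
    iam.add_user_to_group(GroupName="data-engineers", UserName="example-user")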

Data encryption is an essential tool to use when your objective is to protect digital data.
Data encryption takes data that is legible and encodes it so that it is unreadable to anyone who
does not have access to the secret key that can be used to decode it. Thus, even if an attacker
gains access to your data, they cannot make sense of it. Data at rest refers to data that is
physically stored on disk or on tape.

Data in transit refers to data that is moving across the network. Encryption of data in
transit is accomplished by using Transport Layer Security (TLS) 1.2 with an open standard
AES-256 cipher. TLS was formerly called Secure Sockets Layer (SSL).

2.5 Networking and Content Delivery

A computer network is two or more client machines that are connected together to share
resources. A network can be logically partitioned into subnets. Networking requires a
networking device (such as a router or switch) to connect all the clients together and enable
communication between them.

Each client machine in a network has a unique Internet Protocol (IP) address that
identifies it. An IP address is a numerical label in decimal format. Machines convert that
decimal number to a binary format.

IPv4 and IPv6 addresses

A 32-bit IP address is called an IPv4 address. IPv6 addresses, which are 128 bits, are
also available. IPv6 addresses can accommodate more user devices.

A common method to describe networks is Classless Inter-Domain Routing (CIDR).


The CIDR address is expressed as follows (a short example is shown after this list):
i. An IP address (which is the first address of the network)
ii. Next, a slash character (/)
iii. Finally, a number that tells you how many bits of the routing prefix must be fixed
or allocated for the network identifier
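A short example of CIDR notation using Python's standard ipaddress module; the network
range is arbitrary:

    import ipaddress

    # 10.0.0.0/24: the first 24 bits identify the network, leaving 8 bits for hosts.
    network = ipaddress.ip_network("10.0.0.0/24")

    print(network.network_address)   # 10.0.0.0 (first address of the network)
    print(network.prefixlen)         # 24 (bits fixed for the network identifier)
    print(network.num_addresses)     # 256 (addresses in the block)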

The Open Systems Interconnection (OSI) model is a conceptual model that is used to
explain how data travels over a network. It consists of seven layers and shows the common
protocols and addresses that are used to send data at each layer.

Amazon VPC

Enables you to provision a logically isolated section of the AWS Cloud where you can
launch AWS resources in a virtual network that you define:

Gives you control over your virtual networking resources, including Selection of IP
address range, creation of subnets, configuration of route tables and network gateways.

Enables you to customize the network configuration for your VPC, enables you to use
multiple layers of security.

Fig 2.8 AWS VPCs and subnets

An internet gateway is a scalable, redundant, and highly available VPC component that
allows communication between instances in your VPC and the internet. An internet gateway
serves two purposes: to provide a target in your VPC route tables for internet-routable traffic,
and to perform network address translation for instances that were assigned public IPv4
addresses.

Fig 2.9 AWS cloud Internet gateway
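A minimal Python (boto3) sketch of creating a VPC, a subnet, and an internet gateway as
described above; the CIDR ranges are illustrative:

    import boto3

    ec2 = boto3.client("ec2")

    # Create a VPC with a /16 range and a subnet inside it.
    vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
    subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]

    # Create and attach an internet gateway so instances with public IPv4
    # addresses in the VPC can communicate with the internet.
    igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)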

Amazon Route 53

Amazon Route 53 is a highly available and scalable cloud Domain Name System (DNS)
web service. It is designed to give developers and businesses a reliable and cost-effective way
to route users to internet applications by translating names (like www.example.com) into the
numeric IP addresses (like 192.0.2.1) that computers use to connect to each other. In addition,
Amazon Route 53 is fully compliant with IPv6.

Amazon CloudFront

Fig 2.10 Content delivery network

A content delivery network (CDN) is a globally distributed system of caching servers.
A CDN caches copies of commonly requested files (static content, such as Hypertext Markup
Language, or HTML; Cascading Style Sheets, or CSS; JavaScript; and image files) that are
hosted on the application origin server. The CDN delivers a local copy of the requested content
from a cache edge or Point of Presence that provides the fastest delivery to the requester.

Fig 2.11 Amazon CloudFront

Amazon CloudFront delivers content through a worldwide network of data centres that
are called edge locations. When a user requests content that you serve with CloudFront, the
user is routed to the edge location that provides the lowest latency (or time delay) so that
content is delivered with the best possible performance.

2.6 Compute

Fig 2.12 AWS computing services

Amazon Elastic Compute Cloud (Amazon EC2)

Elastic refers to the fact that you can easily increase or decrease the number of servers
you run to support an application automatically, and you can also increase or decrease the size
of existing servers.

Compute refers to the reason why most users run servers in the first place, which is to host
running applications or process data—actions that require compute resources, including
processing power (CPU) and memory (RAM). Cloud refers to the fact that the EC2 instances
that you run are hosted in the cloud.

Fig 2.13 Amazon EC2 instances

Docker
Docker is a software platform that packages software (such as applications) into
containers. Docker is installed on each server that will host containers, and it provides simple
commands that you can use to build, start, or stop containers.

Fig 2.14 EC2 instance life cycle

Docker is best used as a solution when you want to:
i. Standardize environments
ii. Reduce conflicts between language stacks and versions
iii. Use containers as a service
iv. Run microservices using standardized code deployments
v. Require portability for data processing

Amazon Elastic Container Service (Amazon ECS)

Amazon Elastic Container Service (Amazon ECS) is a highly scalable, high-
performance container management service that supports Docker containers. Amazon ECS
enables you to easily run applications on a managed cluster of Amazon EC2 instances.

What is Kubernetes?

Kubernetes is open-source software for container orchestration. Kubernetes can work
with many containerization technologies, including Docker. Because it is a popular open-
source project, a large community of developers and companies build extensions, integrations,
and plugins that keep the software relevant, and new and in-demand features are added
frequently.

Amazon Elastic Kubernetes Service (Amazon EKS)

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service
that makes it easy for you to run Kubernetes on AWS without needing to install, operate, and
maintain your own Kubernetes control plane. It is certified Kubernetes conformant, so existing
applications that run on upstream Kubernetes are compatible with Amazon EKS.

Amazon Elastic Container Registry (Amazon ECR)

Amazon ECR is a fully managed Docker container registry that makes it easy for
developers to store, manage, and deploy Docker container images.

It is integrated with Amazon ECS, so you can store, run, and manage container images
for applications that run on Amazon ECS. Specify the Amazon ECR repository in your task
definition, and Amazon ECS will retrieve the appropriate images for your applications.
AWS Lambda: Run code without servers:
AWS Lambda is a serverless compute service.

Fig 2.15 AWS Lambda
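A minimal sketch of a Python Lambda handler; the event fields used here are hypothetical:

    import json

    def lambda_handler(event, context):
        # Lambda invokes this function with an event payload; there are no
        # servers to provision or manage, and you pay only while it runs.
        name = event.get("name", "world")
        return {
            "statusCode": 200,
            "body": json.dumps({"message": f"Hello, {name}!"}),
        }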

2.7 Storage

Amazon Elastic Block Store (Amazon EBS)


Amazon EBS provides persistent block storage volumes for use with Amazon EC2
instances. Amazon EBS enables you to create individual storage volumes and attach them to
an Amazon EC2 instance. Amazon EBS offers block-level storage. Volumes are automatically
replicated within its Availability Zone. It can be backed up automatically to Amazon S3 through
snapshots.

Uses include boot volumes and storage for Amazon Elastic Compute Cloud (Amazon
EC2) instances, data storage with a file system, database hosts, and enterprise applications.

Fig 2.16 AWS EBS and types

Amazon Simple Storage Service (Amazon S3)

Amazon S3 is object-level storage, which means that if you want to change a part of a
file, you must make the change and then re-upload the entire modified file.

Amazon S3 is a managed cloud storage solution that is designed to scale seamlessly
and provide 11 9s of durability. You can store virtually as many objects as you want in a bucket,
and you can write, read, and delete objects in your bucket.
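A minimal Python (boto3) sketch of writing and reading an object; the bucket and key names are
hypothetical. Because Amazon S3 is object-level storage, changing part of a file means
re-uploading the whole object:

    import boto3

    s3 = boto3.client("s3")

    # Write an object: the entire body is uploaded as a single unit.
    s3.put_object(
        Bucket="example-data-engineering-bucket",
        Key="raw/orders/2024-10-01.csv",
        Body=b"order_id,amount\n1,250\n2,99\n",
    )

    # Read the object back.
    obj = s3.get_object(
        Bucket="example-data-engineering-bucket",
        Key="raw/orders/2024-10-01.csv",
    )
    print(obj["Body"].read().decode())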

Amazon Elastic File System (Amazon EFS)

Amazon EFS provides simple, scalable, elastic file storage for use with AWS services
and on-premises resources. It offers a simple interface that enables you to create and configure
file systems quickly and easily.

Amazon EFS is built to dynamically scale on demand without disrupting applications—
it will grow and shrink automatically as you add and remove files. It is designed so that your
applications have the storage they need, when they need it.

Fig 2.17 Amazon EFS architecture

2.8 Databases

Amazon Relational Database Service

Unmanaged services are typically provisioned in discrete portions as specified by the
user. You must manage how the service responds to changes in load, errors, and situations
where resources become unavailable. Say that you launch a web server on an Amazon Elastic
Compute Cloud (Amazon EC2) instance.

Managed services require the user to configure them. For example, you create an
Amazon Simple Storage Service (Amazon S3) bucket and then set permissions for it. However,
managed services typically require less configuration.

Amazon RDS

Managed service that sets up and operates a relational database in the cloud.

Fig 2.18 Amazon RDS

Amazon RDS enables you to focus on your application, so you can give applications
the performance, high availability, security, and compatibility that they need. With Amazon
RDS, your primary focus is your data and optimizing your application.

Amazon DynamoDB

Fig 2.19 Difference between relational and non-relational databases

A relational database (RDB) works with structured data that is organized by tables,
records, and columns. RDBs establish a well-defined relationship between database tables.
RDBs use Structured Query Language (SQL), a standard language that provides a
programming interface for database interaction.

A non-relational database is any database that does not follow the relational model that
is provided by traditional relational database management systems (RDBMS). Non-relational
databases have grown in popularity because they were designed to overcome the limitations of
relational databases for handling the demands of variable structured data.
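A minimal Python (boto3) sketch of working with a non-relational (key-value) table in Amazon
DynamoDB; the table name and attributes are hypothetical, and the table is assumed to exist
with "order_id" as its partition key:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("Orders")

    # Items do not need a fixed schema beyond the primary key.
    table.put_item(Item={"order_id": "1001", "customer": "Asha", "amount": 250})

    # Read the item back by its primary key.
    response = table.get_item(Key={"order_id": "1001"})
    print(response.get("Item"))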

Amazon Aurora

Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database that is
built for the cloud. It combines the performance and availability of high-end commercial
databases with the simplicity and cost-effectiveness of open-source databases. Using Amazon
Aurora can reduce your database costs while improving the reliability and availability of the
database.

2.9 Cloud Architecture

Fig 2.20 Pillars of AWS Well-Architected Framework

Operational Excellence pillar

The Operational Excellence pillar focuses on the ability to run and monitor systems to
deliver business value, and to continually improve supporting processes and procedures.

There are five design principles for operational excellence in the cloud:
i. Perform operations as code.
ii. Make frequent, small, reversible changes.
iii. Refine operations procedures frequently.
iv. Anticipate failure.
v. Learn from all operational failures.

Security pillar
The Security pillar focuses on the ability to protect information, systems, and assets
while delivering business value through risk assessments and mitigation strategies.

There are seven design principles that can improve security:
i. Implement a strong identity foundation
ii. Enable traceability
iii. Apply security at all layers
iv. Automate security best practices
v. Protect data in transit and at rest
vi. Keep people away from data
vii. Prepare for security events

Reliability pillar

The Reliability pillar focuses on ensuring a workload performs its intended function
correctly and consistently when it’s expected to.

There are five design principles that can increase reliability:


i. Automatically recover from failure.
ii. Test recovery procedure.
iii. Scale horizontally to increase aggregate workload availability.
iv. Stop guessing capacity.
v. Manage change in automation.

Performance Efficiency pillar

The Performance Efficiency pillar focuses on the ability to use IT and computing
resources efficiently to meet system requirements, and to maintain that efficiency as demand
changes or technologies evolve.

There are five design principles that can improve performance efficiency:
i. Democratize advanced technologies.
ii. Go global in minutes.
iii. Use serverless architectures.
iv. Experiment more often.
v. Consider mechanical sympathy.

Cost Optimization pillar

The Cost Optimization pillar focuses on the ability to avoid unnecessary costs. Key
topics include: understanding and controlling where money is being spent, selecting the most
appropriate and right number of resource types, analysing spend over time, and scaling to
meet business needs without overspending.

There are five design principles that can optimize costs:


i. Implement Cloud Financial Management.
ii. Adopt a consumption model.
iii. Measure overall efficiency.
iv. Stop spending money on undifferentiated heavy lifting.
v. Analyse and attribute expenditure.

2.10 Auto Scaling and Monitoring

Elastic Load Balancing

Elastic Load Balancing is an AWS service that distributes incoming application or
network traffic across multiple targets—such as Amazon Elastic Compute Cloud (Amazon
EC2) instances, containers, internet protocol (IP) addresses, and Lambda functions—in a single
Availability Zone or across multiple Availability Zones. Elastic Load Balancing scales your
load balancer as traffic to your application changes over time. It can automatically scale to most
workloads.

Fig 2.21 Load balancing

A load balancer accepts incoming traffic from clients and routes requests to its
registered targets (such as EC2 instances) in one or more Availability Zones.

There is a key difference in how the load balancer types are configured. With
Application Load Balancers and Network Load Balancers, you register targets in target groups,
and route traffic to the target groups. With Classic Load Balancers, you register instances with
the load balancer.

Amazon CloudWatch

Amazon CloudWatch is a monitoring and observability service that is built for DevOps
engineers, developers, site reliability engineers (SRE), and IT managers. CloudWatch monitors
your AWS resources (and the applications that you run on AWS) in real time.
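A minimal Python (boto3) sketch of publishing a custom metric to CloudWatch; the namespace
and metric name are hypothetical:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish one data point for a custom application metric; CloudWatch can
    # graph it, raise alarms on it, or feed it into scaling decisions.
    cloudwatch.put_metric_data(
        Namespace="DataPipeline/Ingestion",
        MetricData=[
            {"MetricName": "RecordsIngested", "Value": 1250, "Unit": "Count"}
        ],
    )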

Amazon EC2 Auto Scaling

When you run your applications on AWS, you want to ensure that your architecture can
scale to handle changes in demand. In this section, you will learn how to automatically scale
your EC2 instances with Amazon EC2 Auto Scaling.

Why is scaling important?

Scaling is the ability to increase or decrease the compute capacity of your application.
To understand why scaling is important, consider this example of a workload that has varying
resource requirements.

Fig 2.22 EC2 auto scaling

AWS Auto Scaling

AWS Auto Scaling is a separate service that monitors your applications. It automatically
adjusts capacity to maintain steady, predictable performance at the lowest possible cost. The
service provides a simple, powerful user interface that enables you to build scaling plans for
resources, including:

i. Amazon EC2 instances and Spot Fleets.
ii. Amazon Elastic Container Service (Amazon ECS) tasks.
iii. Amazon DynamoDB tables and indexes.
iv. Amazon Aurora Replicas.
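A minimal Python (boto3) sketch of a target tracking scaling policy for an EC2 Auto Scaling
group; the group name and target value are hypothetical:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Keep the group's average CPU utilization near 50% by adding or removing
    # EC2 instances automatically.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="data-pipeline-workers",
        PolicyName="keep-cpu-at-50-percent",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 50.0,
        },
    )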

Chapter 3: AWS Academy Data Engineering

3.1 Data driven organizations


A data-driven organization is an organization that knows how to refine its data, mine it, and
use it to drive growth and profitability.

Data now plays a key role across the whole business cycle, informing key strategic
decisions rather than only back-end tasks. Organizations that use data-driven analytics are
reported to be 19 times more profitable than organizations that are not data-driven, and 23
times more likely to acquire customers.

Benefits of democratizing data across your organization


i. Make better decisions, faster
ii. Respond better to the unexpected
iii. Enhance customer experience and engagement
iv. Uncover new opportunities
v. Improve efficiency

The four Es of data culture


i. Engage in data-driven decision making
ii. Educate everyone
iii. Eliminate data blockers
iv. Enable frontline action

There are a variety of paths organizations can take to jump-start their data-driven roadmap,
such as launching Big Data initiatives, broadening data-collection initiatives, hiring a Chief
Data Officer (CDO), and creating new analytics functions. But there’s more to it than this. To
become data-driven, organizations need to:
i. Create a culture of innovation that positions data at the core of your business strategy.
ii. Build data capabilities to help drive that culture.

Fig 3.1 Benefits of Data Driven Organization

3.2 The elements of data

The 5 Vs of Big Data


i. Volume
It refers to the size of Big Data. Whether data is considered Big Data or not is based
on its volume. The rapidly increasing volume of data is due to cloud-computing traffic,
IoT, mobile traffic, etc.

ii. Velocity

It refers to the speed at which data is getting accumulated. This is mainly due
to IoT devices, mobile data, social media, etc.

iii. Variety

• Structured data
• Semi-structured data
• Unstructured data

iv. Veracity

It refers to assurance of quality/integrity of data. Since the data is collected from
multiple sources, we need to check the data for accuracy before using it for business
insights.

v. Value

Just because we collected lots of Data, it’s of no value unless we garner some
insights out of it. Value refers to how useful the data is in decision making. We need to
extract the value of the Big Data using proper analytics.

Fig 3.2 5 Vs of data
3.3 Design principles and patterns for data pipelines

Fig 3.3 Data pipeline structure

The pipeline consists of the following phases (a minimal sketch follows the list):

i. Data is collected (or ingested) by an appropriate tool.


ii. The data is persisted to storage.
iii. The data is processed or analysed. The data processing or analysis solution takes
the data from storage, performs operations, and then stores the data again.
iv. Analysis can be repeated to get further answers from the data.
v. Data can be visualized with business intelligence (BI) tools to provide useful
answers to business users.

vi. Insights are the output of the big data pipeline. The business owners use these
insights to make critical decisions.
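A minimal, self-contained sketch of these phases in Python (pandas); the file names and the toy
aggregation are made up, and in practice each phase would map to managed services such as
Amazon Kinesis, Amazon S3, AWS Glue, and a BI tool:

    import os
    import pandas as pd

    # 1. Ingest: collect raw data (here, from a hypothetical CSV file).
    raw = pd.read_csv("raw_orders.csv")  # columns assumed: region, amount

    # 2. Store: persist the raw data before processing.
    os.makedirs("storage", exist_ok=True)
    raw.to_csv("storage/raw_orders.csv", index=False)

    # 3./4. Process and analyse: aggregate sales per region.
    insights = raw.groupby("region")["amount"].sum().reset_index()

    # 5. Visualize / hand off to BI: persist the results for a reporting tool.
    insights.to_csv("storage/sales_by_region.csv", index=False)

    # 6. Insights: business owners consume the aggregated output.
    print(insights)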

Types of data pipelines

i. Batch processing

The development of batch processing was a critical step in building data infrastructures that
were reliable and scalable. In 2004, MapReduce, a batch processing algorithm, was patented
and then subsequently integrated into open-source systems such as Hadoop, CouchDB, and
MongoDB.

As the name implies, batch processing loads “batches” of data into a repository during set
time intervals, which are typically scheduled during off-peak business hours. This way, other
workloads aren’t impacted as batch processing jobs tend to work with large volumes of data,
which can tax the overall system. Batch processing is usually the optimal data pipeline when
there isn’t an immediate need to analyse a specific dataset (e.g. monthly accounting), and it is
more associated with the ETL data integration process, which stands for “extract, transform,
and load.”

Batch processing jobs form a workflow of sequenced commands, where the output of one
command becomes the input of the next command. For example, one command may kick off
data ingestion, the next command may trigger filtering of specific columns, and the subsequent
command may handle aggregation. This series of commands will continue until the data is
completely transformed and written into the data repository.

ii. Streaming data

Unlike batch processing, streaming data is leveraged when it is required for data to be
continuously updated. For example, apps or point of sale systems need real-time data to update
inventory and sales history of their products; that way, sellers can inform consumers if a product
is in stock or not. A single action, like a product sale, is considered an “event”, and related
events, such as adding an item to checkout, are typically grouped together as a “topic” or
“stream.” These events are then transported via messaging systems or message brokers, such
as the open-source offering, Apache Kafka.

Since data events are processed shortly after occurring, streaming processing systems have
lower latency than batch systems, but aren’t considered as reliable as batch processing systems
as messages can be unintentionally dropped or spend a long time in queue. Message brokers
help to address this concern through acknowledgements, where a consumer confirms
processing of the message to the broker to remove it from the queue.

Three core steps make up the architecture of a data pipeline.


i. Data ingestion: Data is collected from various data sources, which includes
various data structures (i.e. structured and unstructured data). Within streaming
data, these raw data sources are typically known as producers, publishers, or
senders. While businesses can choose to extract data only when they are ready
to process it, it’s a better practice to land the raw data within a cloud data
warehouse provider first. This way, the business can update any historical data
if they need to make adjustments to data processing jobs.

ii. Data Transformation: During this step, a series of jobs are executed to
process data into the format required by the destination data repository. These
jobs embed automation and governance for repetitive workstreams, like
business reporting, ensuring that data is cleansed and transformed
consistently. For example, a data stream may come in a nested JSON format,
and the data transformation stage will aim to unroll that JSON to extract the key
fields for analysis.

iii. Data Storage: The transformed data is then stored within a data repository,
where it can be exposed to various stakeholders. Within streaming data, the
destinations that receive this transformed data are typically known as
consumers, subscribers, or recipients.

3.4 Securing and scaling the data pipeline

Scaling refers to changing the number of machines or the size of the machine depending
on the size of the data to be processed. Increasing the number of machines or the size of the
machine is called scaling up, and decreasing them is called scaling down. Scaling down is done
to keep costs low.

The main reasons for scaling up are to


i. Increase the speed of data processing.
ii. Handle large input data.

There are two types of scaling:

i. Vertical scaling: Increasing memory and/or disk size of the data processing
machine.

ii. Horizontal scaling: Using multiple processes to process a large data set.

Fig 3.4 Types of scaling
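As an illustration of horizontal scaling at the process level, a small Python multiprocessing
sketch; the chunking and the processing function are toy examples, and on AWS the same idea
extends to adding more machines:

    from multiprocessing import Pool

    def process_chunk(chunk):
        # Placeholder for real work, e.g. cleaning or aggregating records.
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        # Split the data set into chunks and process them in parallel workers.
        chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
        with Pool(processes=4) as pool:
            partial_results = pool.map(process_chunk, chunks)
        print(sum(partial_results))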

3.5 Ingesting and preparing data

Data ingestion and preparation involves processes in collecting, curating, and preparing
the data for ML. Data ingestion involves collecting batch or streaming data in unstructured or
structured format. Data preparation takes the ingested data and processes to a format that can
be used with ML.

Identifying, collecting, and transforming data is the foundation for ML. There is
widespread consensus among ML practitioners that data preparation accounts for
approximately 80% of the time spent in developing a viable ML model.

There are several challenges that public sector organizations face in this phase: First is
the ability to connect to and extract data from different types of data sources. Once the data is
extracted, it needs to be catalogued and organized so that it is available for consumption, and
there needs to be a mechanism in place to ensure that only authorized resources have access to
the data. Mechanisms are also needed to ensure that source data transformed for ML is
reviewed and approved for compliance with federal government guidelines.

Data Ingestion

The AWS Cloud enables public sector customers to overcome the challenge of
connecting to and extracting data from both streaming and batch data, as described in the
following:
Streaming Data: For streaming data, Amazon Kinesis and Amazon MSK enable the
collection, processing, and analysis of data in real time. Amazon Kinesis provides a suite of
capabilities to collect, process, and analyse real-time, streaming data. Kinesis Data Streams
(KDS) is a service that enables ingestion of streaming data. Producers of data push data
directly into a stream, which consists of a group of stored data units called records.
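A minimal Python (boto3) sketch of a producer pushing a record into a Kinesis data stream; the
stream name and record fields are hypothetical:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # A producer pushes one record into the stream; consumers read it downstream.
    kinesis.put_record(
        StreamName="clickstream-events",
        Data=json.dumps({"user_id": 42, "action": "add_to_cart"}).encode("utf-8"),
        PartitionKey="42",  # determines which shard stores the record
    )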

Batch Data: There are a number of mechanisms available for data ingestion in batch
format. With AWS Database Migration Service (AWS DMS), you can replicate and ingest
existing databases while the source databases remain fully operational. The service supports
multiple database sources and targets, including writing data directly to Amazon S3.

Data Preparation

Once the data is extracted, it needs to be transformed and loaded into a data store for
feeding into an ML model. It also needs to be catalogued and organized so that it is available
for consumption, and also needs to enable data lineage for compliance with federal government
guidelines. AWS Cloud provides three services that provide these mechanisms. They are:

AWS Glue: It is a fully managed ETL (extract, transform and load) service that makes
it simple and cost-effective to categorize, clean, enrich, and migrate data from a source system
to a data store for ML. The AWS Glue Data Catalog provides the location and schema of ETL
jobs as well as metadata tables (where each table specifies a single source data store). A crawler
can be set to automatically take inventory of the data in your data stores.
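A minimal Python (boto3) sketch of running a crawler and an ETL job in AWS Glue; the crawler
and job names are hypothetical and assumed to be already defined:

    import boto3

    glue = boto3.client("glue")

    # Run a crawler to take inventory of the data and update the Data Catalog.
    glue.start_crawler(Name="raw-orders-crawler")

    # Start an ETL job that cleans and transforms the catalogued data.
    run = glue.start_job_run(JobName="clean-orders-job")
    print(run["JobRunId"])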

Amazon SageMaker Data Wrangler: It is a service that enables the aggregation and
preparation of data for ML and is directly integrated into Amazon SageMaker Studio.

Amazon EMR: Many organizations use Spark for data processing and other purposes
such as for a data warehouse. These organizations already have a complete end-to-end pipeline
in Spark and also the skillset and inclination to run a persistent Spark cluster for the long term.
In these situations, Amazon EMR, a managed service for Hadoop-ecosystem clusters, can be
used to process data.

3.6 Ingestion by batch or by stream

In today’s data-driven world, the efficient ingestion of data is crucial for organizations
seeking to extract valuable insights and make informed decisions. Two primary methods
employed for data ingestion are batch ingestion and streaming ingestion. Each approach comes
with its own set of advantages and limitations, and the choice between batch and streaming
ingestion depends on various factors, including the nature of the data, data volume, processing
requirements, and the need for real-time analysis. This essay aims to explore and compare batch
and streaming data ingestion, shedding light on their respective strengths, weaknesses, and use
cases.

Batch Data Ingestion

Batch data ingestion involves collecting and processing data in predefined, fixed-size
chunks or batches. In this approach, data is ingested periodically, typically at regular intervals
(e.g., hourly, daily, or weekly). The data collected in each batch is processed as a unit, often in
parallel, to extract valuable insights. Batch data ingestion is well-suited for applications that do
not require real-time or immediate analysis, and it is characterized by its ability to handle large
data volumes efficiently.

Advantages of Batch Data Ingestion

i. Scalability: Batch data ingestion allows organizations to process large volumes of data
efficiently by breaking it into manageable chunks. It facilitates horizontal scalability,
enabling the addition of more resources to meet growing data demands.

ii. Simplicity: Batch ingestion is relatively straightforward to implement, as it does not
require complex real-time data handling mechanisms.

iii. Cost-Effectiveness: Since batch processing is not real-time, it requires fewer
computational resources compared to streaming ingestion, making it cost-effective.

Disadvantages of Batch Data Ingestion

i. Latency: Batch ingestion introduces latency as data is collected and processed in fixed
intervals. Consequently, insights are not immediately available, which can be limiting
for time-critical applications.

ii. Stale Insights: The insights generated from batch data ingestion may become outdated
by the time they are available for analysis, potentially affecting decision-making
accuracy.

iii. Unsuitable for Real-Time Applications: Batch ingestion is not suitable for
applications that require real-time analysis and immediate actions based on incoming
data.

Fig 3.5 Batch data ingestion

Streaming Data Ingestion

Streaming data ingestion involves processing data in real-time as it arrives in a
continuous flow or stream. In this approach, data is ingested and analysed as soon as it is
generated or made available. Streaming data ingestion is ideal for applications that require
real-time insights and actions, as it enables organizations to respond immediately to time-sensitive
events.

Advantages of Streaming Data Ingestion

i. Real-Time Insights: Streaming ingestion allows organizations to extract insights and
make decisions as data arrives, enabling rapid responses to time-critical events.

ii. Low Latency: The real-time nature of streaming ingestion minimizes latency, ensuring
that insights are always up-to-date and relevant.

iii. Immediate Actions: Streaming ingestion facilitates immediate actions based on
incoming data, making it suitable for applications that require real-time responsiveness.

Disadvantages of Streaming Data Ingestion

i. Complexity: Implementing streaming ingestion can be more complex than batch
ingestion due to the need for handling data in real-time and ensuring data consistency.

ii. Resource Intensive: Streaming ingestion can be resource-intensive, as it requires highly
responsive and optimized distributed systems to handle continuous data streams.

iii. Cost: Real-time processing can be more costly than batch processing, as it demands
significant investments in infrastructure and technology.

Fig 3.6 Streaming data ingestion

3.7 Storing and organizing data

Data warehouse

A data warehouse (DW) is a central repository storing data in queryable forms. From
a technical standpoint, a data warehouse is a relational database optimized for reading,
aggregating, and querying large volumes of data. Traditionally, DWs only contained structured
data or data that can be arranged in tables. However, modern DWs can also support
unstructured data (such as images, pdf files, and audio formats).

Fig 3.7 Data Warehouse

Data marts

Simply speaking, a data mart is a smaller data warehouse (its size is usually less
than 100 GB). Data marts become necessary when the company and the amount of its data
grow and it becomes too slow and ineffective to search for information in an enterprise DW.
Instead, data marts are built to allow different departments (e.g., sales, marketing, C-suite) to
access relevant information quickly and easily.

Fig 3.8 Data Mart

OLAP and OLAP cubes

OLAP, or Online Analytical Processing, refers to the computing approach allowing
users to analyse multidimensional data. It’s contrasted with OLTP, or Online Transactional
Processing, a simpler method of interacting with databases, not designed for analysing massive
amounts of data from different perspectives.

Traditional databases resemble spreadsheets, using the two-dimensional structure of
rows and columns. However, in OLAP, datasets are presented in multidimensional structures
called OLAP cubes. Such structures enable efficient processing and advanced analysis of vast
amounts of varied data. For example, a sales department report would include such dimensions
as product, region, sales representative, sales amount, month, and so on.

Fig 3.9 OLAP cubes
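A small illustration of slicing data along several dimensions, in the spirit of an OLAP cube,
using a pandas pivot table; the sales records are made up:

    import pandas as pd

    sales = pd.DataFrame({
        "product": ["A", "A", "B", "B"],
        "region":  ["North", "South", "North", "South"],
        "month":   ["Jan", "Jan", "Feb", "Feb"],
        "amount":  [100, 150, 200, 120],
    })

    # Aggregate the "amount" measure along the product and region dimensions.
    cube_slice = pd.pivot_table(sales, values="amount", index="product",
                                columns="region", aggfunc="sum")
    print(cube_slice)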

3.8 Processing big data

In the real world, most of the data is unstructured, making it difficult to streamline the Data
Processing tasks. And since there is no end to the Data Generation process, collecting and
storing information has become increasingly difficult. Today, it has become essential to have a
systematic approach to handling Big Data to ensure organizations can effectively harness the
power of data.

In this section, you will learn about Big Data, its types, the steps for Big Data Processing,
and the tools used to handle enormous information.

Big Data Processing is the collection of methodologies or frameworks enabling access
to enormous amounts of information and extracting meaningful insights. Initially, Big Data
Processing involves data acquisition and data cleaning. Once you have gathered the quality
data, you can further use it for Statistical Analysis or building Machine Learning models for
predictions.

Fig 3.10 Big Data analysis

5 Stages of Big Data Processing


i. Data Extraction
ii. Data Transformation
iii. Data Loading
iv. Data Visualization/BI Analytics
v. Machine Learning Application

3.9 Processing data for ml

Data preprocessing is an important step in any machine learning workflow. Imagine a
situation where you are working on an assignment at your college, and the lecturer does not
provide the headings or the idea of the topic. In this case, it will be very difficult for you to
complete that assignment because the raw material is not presented well to you. The same is
the case in machine learning: if the data preprocessing step is skipped, it will affect your work
at the end, at the final stage of applying the available data set to your algorithm.

While performing data preprocessing, it is important to ensure data accuracy so that it
doesn't affect your machine learning algorithm at the final stage.

Fig 3.11 Processing data ML

There are six steps of data preprocessing in machine learning, illustrated in the sketch after this list:

i. Import the Libraries
ii. Import the Loaded Data
iii. Check for Missing Values
iv. Arrange the Data
v. Do Scaling
vi. Distribute Data into Training, Evaluation and Validation Sets
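A minimal sketch of these six steps using pandas and scikit-learn; the file name, columns, and
split ratios are hypothetical:

    # i. Import the libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # ii. Import the loaded data (columns assumed: feature1, feature2, label)
    df = pd.read_csv("sensor_readings.csv")

    # iii. Check for missing values and iv. arrange the data
    print(df.isna().sum())
    df = df.dropna().reset_index(drop=True)
    X, y = df[["feature1", "feature2"]], df["label"]

    # v. Do scaling
    X_scaled = StandardScaler().fit_transform(X)

    # vi. Distribute data into training, evaluation, and validation sets
    X_train, X_rest, y_train, y_rest = train_test_split(
        X_scaled, y, test_size=0.3, random_state=42)
    X_eval, X_val, y_eval, y_val = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=42)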

3.10 Analysing and visualizing data

Analytics is a broad term that encompasses numerous subfields. It refers to all the tools
and activities involved in processing data to develop valuable insights and interpretations. It is
worth noting that Data Analytics is dependent on computer tools and software that help extract
data and analyse them for business decisions to be made accordingly.

Data Analytics is widely adopted in the commercial sector because it helps companies better understand their customers and improve their advertising campaigns. It is a highly dynamic field, with new innovations emerging constantly, and it has grown into a largely automated discipline that relies on computer algorithms to turn raw data into sensible conclusions.

There are two main Data Analysis techniques:

i. Quantitative Data Analysis

i. Descriptive Analysis: This type of Data Analysis lets you see the
patterns and trends in a particular set of data. It includes processes such
as calculating frequencies, percentages, and measures of central
tendency, including the mean, mode, and median (a short sketch of
these measures follows the lists below).
ii. Inferential Analysis: This is used when the differences and correlations
between particular data sets need to be examined. The processes
involved include ANOVA, t-tests, and chi-square tests.

ii. Qualitative Data Analysis

While Quantitative Analysis focuses on numeric data, Qualitative Analysis is the
complete opposite, dealing with non-numeric data such as audio, video
recordings, images, texts, and transcripts. In general, Qualitative Data
tells you how your data is changing.

Data Analytics components refer to the different techniques you can use
for processing any set of data. They include:
i. Text Analytics: This is the technique behind autocorrect on phones and in
software such as Microsoft Word. It involves analysing large amounts
of text to derive algorithms, with applications including Linguistic
Analysis and Pattern Recognition.
ii. Data Mining: As the name suggests, Data Mining breaks large chunks of
data into smaller pieces that fit a specific purpose. One of its most
critical applications is determining behavioural patterns in inpatient
data during clinical trials.
iii. Business Intelligence: This is one of the essential processes for any
successful business. It involves transforming data into actionable
strategies for a particular commercial entity; for example, it is the
process behind product placement and pricing in most companies.
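As referenced above, the following minimal sketch computes the descriptive measures listed under Quantitative Data Analysis (frequencies, percentages, mean, median, and mode) with pandas on a small made-up series; the numbers are purely illustrative.

import pandas as pd

# A made-up set of customer ratings used only to demonstrate the measures.
ratings = pd.Series([4, 5, 3, 4, 5, 5, 2, 4, 4, 3])

print("Frequencies:\n", ratings.value_counts())
print("Percentages:\n", ratings.value_counts(normalize=True) * 100)
print("Mean:  ", ratings.mean())
print("Median:", ratings.median())
print("Mode:  ", ratings.mode().tolist())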

Fig 3.12 Data Analysis

Data Visualization

Data Visualization deals with presenting information pictorially so that trends and conclusions can be identified. In Data Visualization, information is organized into charts, graphs, and other forms of visual representation. This simplifies otherwise complicated information and makes it accessible to all the stakeholders involved in making critical business decisions.

Some of the popular visualization techniques include:
i. Histograms: This is a Graphical Visualization Tool that organizes a set of data into
a range of frequencies. It bears key similarities to a Bar Graph and organizes
information in a way that makes it easy to interpret.

ii. Graphs: These are excellent tools for analysing the time series relationship in a
particular set of data. For instance, a company’s annual profits could be analysed
based on each month using a graph.

iii. Fever Charts: A Fever Chart is an indispensable tool for any business since it
shows how data changes over time. For instance, a particular product’s performance
could be analysed based on its yearly profits.

iv. Heatmap Visualization: This tool is based on the psychological fact that the human
brain interprets colours much faster than numbers. It is a graph that uses numerical
data points highlighted in light or warm colours to represent high or low-value
points.

v. Infographics: Infographics are effective when analysing complex datasets. They
take large amounts of data and organize them into an easy-to-interpret format.
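To tie a couple of these techniques to code, the short sketch below draws a histogram and a heatmap with Matplotlib (one of the tools mentioned in Chapter 4) over randomly generated sample data; the data and styling choices are illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Histogram: bucket 1,000 synthetic order values into frequency ranges.
order_values = rng.normal(loc=50, scale=12, size=1000)
plt.figure()
plt.hist(order_values, bins=20)
plt.title("Distribution of order values")
plt.xlabel("Order value")
plt.ylabel("Frequency")

# Heatmap: colour-code a synthetic region-by-month sales matrix.
sales_matrix = rng.integers(low=10, high=100, size=(4, 12))
plt.figure()
plt.imshow(sales_matrix, cmap="hot", aspect="auto")
plt.colorbar(label="Units sold")
plt.title("Sales heatmap (region x month)")
plt.xlabel("Month")
plt.ylabel("Region")

plt.show()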

Fig 3.13 Data Visualisation

3.11 Automating the pipeline

A fully automated data pipeline enables your organization to extract data at the source,
transform it, and integrate it with data from other sources before sending it to a data warehouse
or lake for loading into business applications and analytics platforms.
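One common way to automate such a pipeline is with a workflow orchestrator. The sketch below uses Apache Airflow, which was not part of the internship material and is shown here only as an assumed example of scheduling extract, transform, and load steps to run daily; the task names and schedule are hypothetical.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Pull raw records from the source system (details omitted in this sketch).
    pass

def transform():
    # Clean and reshape the extracted data.
    pass

def load():
    # Write the curated data to the warehouse or data lake.
    pass

# A daily schedule re-runs the whole pipeline without manual intervention.
with DAG(
    dag_id="automated_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task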

Data Pipeline Automation streamlines complex change processes such as cloud migration, eliminates the need for manual data pipeline adjustments, and establishes a secure platform for data-driven enterprises.

The three main reasons to implement a fully automated data pipeline are:

i. Advanced BI and Analytics

According to Gartner, 87% of organizations lack data maturity. Almost all businesses struggle to extract the full value from their data and to derive the insights that might help them increase organizational efficiency, performance, and profitability.

Business intelligence tools, no-code platforms for non-technical business users, and democratized data access across the whole organization are what make organizations data-driven.

ii. Better data analytics and business insights

A fully automated data pipeline enables data to flow between systems, removing the need for manual data coding and formatting and allowing transformations to occur on-platform, which enables real-time analytics and granular insight delivery. Integrating data from a variety of sources results in improved business intelligence and customer insights, including the following:
• Increased visibility into the customer journey and experience as a whole
• Enhancements to performance, efficiency, and productivity
• Organizational decision-making that is timelier and more effective

iii. Establishing a data-driven company culture

Decision intelligence is a term that encompasses a variety of fields, including decision management and decision assistance. It comprises applications in the realm of complex adaptive systems that bridge the gap between numerous established and emerging disciplines.

While digitalization accelerates the collection of data, most businesses are still far from properly leveraging their data and acquiring deeper insights and real-time visibility through advanced analytics. Data is a critical driver of everything from targeted marketing to more efficient operations and increased performance and productivity.

Fig 3.14 Fully automated Data pipeline

Chapter 4: Learning Outcomes

During this internship, I acquired many new skills that are very important for my future career.

Foundational knowledge of cloud computing concepts and AWS services

i. Cloud computing fundamentals: I gained a comprehensive understanding of cloud computing concepts, including cloud models (SaaS, PaaS, IaaS), deployment strategies (cloud-native, hybrid cloud, multi-cloud), security best practices (IAM, VPC, data encryption), and cost optimization techniques.

ii. Core AWS services: I explored and gained hands-on experience with essential AWS
services like Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service
(S3), Identity and Access Management (IAM), and Virtual Private Cloud (VPC).
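As a small, hedged illustration of working with one of these services programmatically, the snippet below lists S3 buckets and uploads a file using boto3. The bucket and file names are hypothetical, and it assumes AWS credentials are already configured in the environment.

import boto3

# Assumes credentials are available via the AWS CLI configuration or environment.
s3 = boto3.client("s3")

# List the buckets visible to the current account.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Upload a local file to a hypothetical bucket and key.
s3.upload_file("curated_sales.csv", "my-example-bucket", "curated/sales.csv")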

Hands-on experience with AWS Data Engineering services

i. Data warehousing with Amazon Redshift: I learned how to use Amazon Redshift, a
cloud-based data warehouse, to store, manage, and analyse large datasets efficiently.

ii. Big data processing with Amazon EMR: I gained experience in utilizing Amazon
EMR, a managed Hadoop framework, to process and analyse massive datasets
efficiently.

Practical skills in data analysis and visualization

i. Data cleaning and transformation: I developed skills in data cleaning techniques to handle missing values, data inconsistencies, and data quality issues.

ii. Data visualization: I gained proficiency in creating data visualizations using tools like Tableau and Matplotlib, and I learned about choosing appropriate chart types, effective visual design principles, and storytelling with data.

Chapter 5: Conclusion and Future Career

Conclusion

The AWS Academy Data Analytics virtual internship is an excellent opportunity for
individuals seeking to gain hands-on experience in the field of data analytics and explore the
capabilities of AWS data analytics services. The internship provides a comprehensive
foundation in cloud computing concepts, data analysis techniques, and AWS data analytics
services, preparing participants for real-world data analytics projects and potential certification.

Future Career

With the growing demand for data analytics professionals, the AWS Academy Data
Analytics virtual internship can open doors to various career opportunities in the field.

Here are some potential career paths for individuals who have completed the internship:
i. Data Analyst
ii. Data Scientist
iii. Data Engineer
iv. Business Intelligence Analyst
v. Machine Learning Engineer

Chapter 6: References

[1] Company overview, https://www.previewtechs.com/about

[2] AWS Academy Data Analytics Virtual Internship, https://awsacademy.instructure.com/courses/47485/modules

[3] Amazon Elastic Compute Cloud documentation, https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html

[4] Amazon Simple Storage Service documentation, https://docs.aws.amazon.com/AmazonS3/latest/userguide/GetStartedWithS3.html

[5] Amazon Virtual Private Cloud documentation, https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html

[6] Amazon Relational Database Service documentation, https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Welcome.html
