Technical Framework for Big Data Analytics
Olujimi Osinaike
Nexford University
Introduction
Amazon is a massive online retailer and cloud services provider, offering a vast range of
products while also renting out computing power. Amazon deals with an enormous amount of data
from both its e-commerce platform and its cloud services arm, Amazon Web Services (AWS).
Data Architecture Implementation
Amazon's data architecture is like a well-organized library, but instead of books, it's filled with
data.
Big data analytics frameworks often include layers for data ingestion, processing, and
visualization to handle the volume and velocity of big data (Gandomi & Haider, 2015).
Data Lake (Amazon S3): This is the main storage area, a giant pool where all the raw data is
dumped. It's like the library's storage room, holding everything from customer purchase history
to website clickstream data. S3 (Simple Storage Service) is used because it's highly scalable and
cost-effective for storing massive amounts of data.
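For illustration, here is a minimal sketch in Python (using the AWS SDK, boto3) of how a single raw clickstream event might land in an S3 data lake; the bucket name and key layout are hypothetical placeholders, not Amazon's actual structure.

import json
from datetime import datetime, timezone
import boto3

BUCKET = "example-data-lake-raw"  # hypothetical bucket name for this sketch

def store_raw_event(event: dict) -> str:
    """Write one raw event to the data lake as a JSON object."""
    s3 = boto3.client("s3")
    now = datetime.now(timezone.utc)
    # Partitioning raw data by date lets downstream jobs scan only what they need.
    key = f"clickstream/year={now:%Y}/month={now:%m}/day={now:%d}/{now.timestamp()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event).encode("utf-8"))
    return key

store_raw_event({"customer_id": "c-123", "page": "/product/42", "action": "view"})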
Specialized Databases: They use different types of databases for different purposes, like having
different sections in the library.
o DynamoDB: This is a NoSQL database, perfect for fast access to specific pieces of
information. Imagine it as the card catalog, allowing quick lookups of customer profiles,
product details, and order information. It's designed for high performance and scalability
(a minimal lookup sketch follows after this list).
o Redshift: This is a data warehouse, designed for analyzing large datasets. Think of it as
the research section of the library, where analysts can run complex queries to understand
trends, customer behavior, and business performance. It's optimized for analytical
workloads.
o Other Databases: Amazon also uses other database types, such as relational databases
(e.g., PostgreSQL, MySQL), for specific applications and data needs.
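As a sketch of the fast key-based lookup described for DynamoDB above, the snippet below fetches one customer profile by its primary key with boto3; the table name and attribute names are assumptions made for the example.

import boto3

def get_customer_profile(customer_id: str):
    """Fetch one customer profile by primary key -- a fast, single-item lookup."""
    # "CustomerProfiles" and the "customer_id" key are hypothetical names for this sketch.
    table = boto3.resource("dynamodb").Table("CustomerProfiles")
    response = table.get_item(Key={"customer_id": customer_id})
    return response.get("Item")  # None if the profile does not exist

print(get_customer_profile("c-123"))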
Data Pipelines: Data pipelines are like the delivery trucks that move data from one place
to another. They use tools like AWS Glue and Apache Kafka to ingest, process, and
transform data before it's stored in the data lake or databases.
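To make the ingestion step concrete, here is a hedged sketch of publishing clickstream events to an Apache Kafka topic with the kafka-python client; the broker address and topic name are assumptions, and downstream consumers (e.g., Glue or Spark jobs) would transform the stream before it lands in S3.

import json
from kafka import KafkaProducer  # kafka-python client

# Broker address and topic name are illustrative placeholders.
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"customer_id": "c-123", "page": "/product/42", "action": "add_to_cart"}
producer.send("clickstream-events", value=event)  # asynchronous send
producer.flush()  # block until buffered events are delivered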
A well-structured technical framework must support distributed storage and real-time processing
using platforms like Hadoop and Spark (Hashem et al., 2015).
Support for the Data Value Chain: Data is the engine that drives Amazon's entire business. It's
how they make money and stay ahead of the competition.
Personalized Recommendations: When you see "Customers who bought this item also
bought...", that's data in action. Amazon analyzes your past purchases, browsing history,
and other data to suggest products you might like (a simplified sketch follows after this list).
Targeted Advertising: Amazon uses data to show you ads that are relevant to your
interests. This makes the ads more effective and helps Amazon earn more revenue.
Supply Chain Optimization: Amazon uses data to predict demand, manage inventory, and
optimize its logistics network. This helps them ensure they have the right products in
stock and can deliver them to customers quickly.
Fraud Detection: Amazon uses data to identify and prevent fraudulent activities,
protecting both the company and its customers.
Pricing Optimization: Amazon uses data to dynamically adjust prices based on demand,
competitor pricing, and other factors.
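The "customers who bought this also bought" idea from the first bullet above can be illustrated with a tiny co-purchase count. This is a toy sketch of the general technique (item-to-item co-occurrence), not Amazon's actual algorithm, and the order data is invented.

from collections import Counter, defaultdict
from itertools import combinations

# Toy order data: each set holds the items bought together in one order.
orders = [
    {"kindle", "case", "charger"},
    {"kindle", "case"},
    {"kindle", "charger"},
    {"case", "screen_protector"},
]

# Count how often every pair of items appears in the same order.
co_counts = defaultdict(Counter)
for order in orders:
    for a, b in combinations(sorted(order), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def also_bought(item, top_n=2):
    """Return the items most frequently purchased alongside `item`."""
    return [other for other, _ in co_counts[item].most_common(top_n)]

print(also_bought("kindle"))  # e.g. ['case', 'charger']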
Distributed Data Processing Models: To handle the massive volume of data, Amazon uses
distributed processing, which is like having a team of workers instead of one person.
EMR (Elastic MapReduce): This is a managed Hadoop and Spark service. Hadoop and Spark are
open-source frameworks designed for processing large datasets in a distributed manner. EMR
allows Amazon to easily spin up clusters of computers to process data in parallel.
Spark: Spark is a fast, in-memory data processing engine that's often used with EMR. It's great
for iterative algorithms and real-time data processing.
Other AWS Services: Amazon also uses other AWS services like Kinesis (for real-time data
streaming) and Lambda (for serverless computing) to process data.
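As a hedged sketch of the real-time streaming path, the snippet below pushes a single event into a Kinesis data stream with boto3; the stream name is a placeholder, and in practice a Lambda function or Spark job would read and process these records downstream.

import json
import boto3

kinesis = boto3.client("kinesis")

event = {"order_id": "o-789", "amount": 59.99, "currency": "USD"}

# Stream name is hypothetical; the partition key decides which shard receives the record.
kinesis.put_record(
    StreamName="example-order-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["order_id"],
)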
How it Works: Data is broken down into smaller chunks and processed simultaneously across
multiple computers. The results are then aggregated to provide insights.
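A minimal PySpark sketch of that split-process-aggregate pattern is shown below: each executor works on its own chunk of the order data in parallel, and the groupBy step combines the partial results. The input path and column names are assumptions for the example; on EMR the same code would run across a whole cluster rather than a single machine.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("order-aggregation-sketch").getOrCreate()

# Hypothetical input: order records spread across many partitions/files.
orders = spark.read.json("s3://example-data-lake-raw/orders/")

# Each partition is aggregated in parallel; partial results are merged at the end.
daily_revenue = (
    orders.groupBy("order_date")
          .agg(F.sum("amount").alias("revenue"), F.count("*").alias("order_count"))
)

daily_revenue.show()
spark.stop()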
Data Challenges Across the Value Chain: Dealing with big data isn't always easy. Amazon faces
several challenges.
Volume: The sheer amount of data is overwhelming. They need to store, process, and
analyze petabytes of data every day.
Velocity: Data is coming in at a rapid pace. They need to process data in real-time or near
real-time to make timely decisions.
Variety: Data comes in many different formats (structured, semi-structured,
unstructured). They need to be able to handle all types of data.
Veracity: Ensuring data quality and accuracy is crucial. They need to clean, validate, and
transform data to ensure its reliability (a small validation sketch follows after this list).
Security: Protecting sensitive customer data is paramount. They need to implement robust
security measures to prevent data breaches.
Scalability: As the business grows, the data processing infrastructure needs to scale to
handle the increasing volume of data.
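To illustrate the veracity point, here is a small pandas sketch that flags obvious quality problems (missing values, duplicate order IDs, negative amounts) before a batch moves further down the pipeline; the column names and records are invented for the example.

import pandas as pd

# Toy batch of order records containing typical quality problems.
orders = pd.DataFrame({
    "order_id": ["o-1", "o-2", "o-2", "o-3"],
    "amount":   [19.99, None, 35.00, -5.00],
})

issues = {
    "missing_amount": int(orders["amount"].isna().sum()),
    "duplicate_order_ids": int(orders["order_id"].duplicated().sum()),
    "negative_amounts": int((orders["amount"] < 0).sum()),
}

print(issues)  # {'missing_amount': 1, 'duplicate_order_ids': 1, 'negative_amounts': 1}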
Challenges and Recommendations for Their Data Modeling: Data modeling is like creating the
blueprints for how data is organized.
Challenges:
o Complexity: The relationships between different data points can be complex, making it
difficult to design effective data models.
o Evolving Business Needs: Business requirements change over time, which can require
frequent updates to data models.
o Data Silos: Data may be stored in different systems, making it difficult to integrate and
analyze.
Recommendations:
o Flexible and Scalable Models: Use data models that can easily adapt to changing business
needs and scale to handle increasing data volumes. Consider using a data lake approach
with a schema-on-read strategy, allowing for flexibility (see the sketch after this list).
o Data Governance: Implement strong data governance practices to ensure data quality,
consistency, and security.
o Data Cataloging: Use a data catalog to document and manage data assets, making it
easier for users to find and understand data.
o Continuous Model Refinement: Regularly review and refine data models to ensure they
meet business needs and optimize performance.
o Focus on Data Lineage: Track the origin and transformation of data to improve data
quality and facilitate troubleshooting.
o Embrace Automation: Automate data modeling tasks, such as data discovery, data
profiling, and model generation, to improve efficiency and reduce errors.
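A brief sketch of the schema-on-read idea from the first recommendation above: raw JSON stays in the data lake without a fixed schema, and the structure is inferred (or supplied) only when the data is read for analysis. The S3 path is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# The raw events were written with no fixed schema; Spark infers one at read time.
events = spark.read.json("s3://example-data-lake-raw/clickstream/")

events.printSchema()  # schema is discovered from the data itself
events.filter(events.action == "add_to_cart").show(5)

spark.stop()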
Scalability, fault tolerance, and low latency are key technical requirements for an effective big
data analytics infrastructure (Zikopoulos & Eaton, 2011).
Reference List
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics.
International Journal of Information Management, 35(2), 137–144.
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The
rise of “big data” on cloud computing: Review and open research issues. Information Systems,
47, 98–115.
Zikopoulos, P. C., & Eaton, C. (2011). Understanding big data: Analytics for enterprise class
Hadoop and streaming data. McGraw-Hill Osborne Media.