BDA Unit 1 Notes-1
UNIT I :
Classification of Digital Data, Structured and Unstructured Data - Introduction to Big Data:
Characteristics – Evolution – Definition - Challenges with Big Data - Other Characteristics of
Data - Why Big Data - Traditional Business Intelligence versus Big Data - Data Warehouse
and Hadoop Environment. Big Data Analytics: Classification of Analytics – Challenges – Why
Big Data Analytics is important – Top Analytics Tools.
UNIT I
Big data analytics is a field of study and practice that focuses on extracting valuable insights and
meaningful patterns from large and complex datasets. It involves the use of various
techniques, tools, and technologies to process, analyze, and interpret massive volumes of data
to make data-driven decisions, identify trends, and gain valuable insights.
The term "big data" refers to the vast amounts of structured, semi-structured, and unstructured
data that organizations and businesses collect from various sources such as social media,
sensors, mobile devices, transaction records, and more. This data is typically characterized by
its volume, velocity, variety, and veracity, which makes it challenging to manage and analyze
using traditional data processing methods. Big data analytics typically involves the following steps:
1. Data collection: Gathering and aggregating data from multiple sources, including
structured databases, log files, social media platforms, and IoT devices.
2. Data storage and management: Storing and organizing large volumes of data using
distributed file systems, NoSQL databases, and data warehouses.
3. Data preprocessing: Cleaning, transforming, and filtering the data to ensure its
quality, consistency, and relevance for analysis.
4. Data analysis: Applying various statistical, machine learning, and data mining
techniques to extract patterns, correlations, and insights from the data. This can
involve tasks such as data exploration, clustering, classification, regression, and
predictive modelling.
5. Data visualization: Presenting the analyzed data in a visual format, such as charts,
graphs, and dashboards, to facilitate better understanding and decision-making.
6. Real-time analytics: Performing analysis on streaming data to gain immediate
insights and enable real-time decision-making.
7. Data security and privacy: Ensuring appropriate measures are in place to protect
sensitive data and comply with relevant regulations and privacy policies.
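The steps above can be sketched end-to-end in a few lines of Python (a minimal illustration using only the standard library; the source records and the "spend" field are hypothetical):

```python
# Minimal sketch of the analytics steps listed above, standard library only.
# The sample records and the "spend" field are hypothetical.
from statistics import mean

# 1-2. Data collection / storage: records aggregated from two sources
source_a = [{"user": "u1", "spend": 120.0}, {"user": "u2", "spend": None}]
source_b = [{"user": "u3", "spend": 80.0}, {"user": "u4", "spend": 95.5}]
raw = source_a + source_b

# 3. Data preprocessing: filter out incomplete records
clean = [r for r in raw if r["spend"] is not None]

# 4. Data analysis: a simple descriptive statistic
avg_spend = mean(r["spend"] for r in clean)

# 5. Data visualization: a crude text "bar chart"
for r in clean:
    print(f'{r["user"]:>4} | {"#" * int(r["spend"] // 10)}')
print(f"average spend: {avg_spend:.2f}")
```

Real pipelines replace each step with heavier machinery (databases, Spark jobs, dashboards), but the shape of the flow is the same.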
Big Data:
A massive amount of data that cannot be stored, processed, or analyzed using traditional
tools is known as Big Data. Hadoop is a framework that stores and processes big data.
Common sources of big data include:
Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of
data on a day-to-day basis as they have billions of users worldwide.
E-commerce site: Sites like Amazon, Flipkart, and Alibaba generate many logs from
which users buying trends can be traced.
Weather stations: Weather stations and satellites give huge amounts of data
that are stored and analyzed to forecast weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly; for this, they store the data of their millions of users.
Share market: Stock exchanges worldwide generate huge amounts of data through
daily transactions.
Big data analytics is the process of extracting meaningful insights from big data, such as hidden
patterns, unknown correlations, market trends, and customer preferences, that can help
organizations make informed business decisions.
There are quite a few advantages to incorporating big data analytics into a business or
organization. These include:
Cost reduction: Big data technologies can reduce the cost of storing all business data in
one place. Tracking analytics also helps companies find ways to work more efficiently to
cut costs wherever possible.
Product development: Developing and marketing new products, services, or
brands is much easier when based on data collected from customers’ needs and
wants. Big data analytics also helps businesses understand product viability and
keep up with trends.
Strategic business decisions: The ability to constantly analyze data helps businesses
make better and faster decisions, such as cost and supply chain optimization.
Customer experience: Data-driven algorithms help marketing efforts (targeted
ads, for example) increase customer satisfaction by delivering an enhanced
customer experience.
Risk management: Businesses can identify risks by analyzing data patterns and
developing solutions for managing those risks.
I. Data Classification:
Data classification is the process of organizing data into relevant categories so that it can be
used or applied more efficiently. Classification makes it easy for users to retrieve data. It is
important for data security and compliance, and for meeting different business or personal
objectives. It is also a major requirement, as data must be easily retrievable within a
specific period.
Data is broadly classified into structured, semi-structured, and unstructured data.
1. Structured Data:
Structured data is created using a fixed schema and is maintained in tabular format.
The elements in structured data are addressable for effective analysis.
It contains all the data which can be stored in the SQL database in a tabular format.
Structured data is the simplest kind of data to manage and process.
Examples: relational data, geo-location, credit card numbers, addresses, etc.
Consider an example of relational data: a university maintains a record of students,
including each student's name, ID, address, and email. To store these records, a
relational schema and table are used.
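As a sketch of how such a structured record is defined and queried, the student table above can be modelled with Python's built-in sqlite3 module (the column names follow the example in the notes; the inserted row is hypothetical):

```python
# Structured data sketch: a fixed schema in tabular (relational) form.
# Column names follow the student example above; the row data is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        address TEXT,
        email   TEXT
    )
""")
conn.execute("INSERT INTO student VALUES (1, 'Asha', 'Hyderabad', 'asha@example.edu')")
conn.commit()

# Structured data is addressable: every value can be reached by column name
row = conn.execute("SELECT name, email FROM student WHERE id = 1").fetchone()
print(row)  # ('Asha', 'asha@example.edu')
```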
2. Unstructured Data:
Unstructured data does not follow a pre-defined schema or any organized format.
This kind of data does not fit the relational model, which requires data to be
arranged in a pre-defined, organized way.
Unstructured data is very important in the big data domain, and many platforms,
such as NoSQL databases, exist to store and manage it.
3. Semi-Structured Data:
Semi-structured data is information that does not reside in a relational database but has
some organizational properties that make it easier to analyze. With some processing it can
be stored in a relational database, although this is very hard for some semi-structured data.
Common semi-structured formats include XML and JSON.
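A short sketch of semi-structured data using JSON, a common semi-structured format (the records themselves are hypothetical):

```python
# Semi-structured data sketch: JSON records share organizational properties
# (named keys) but need not follow one fixed schema. Records are hypothetical.
import json

doc = json.loads("""
[
  {"id": 1, "name": "Asha", "email": "asha@example.edu"},
  {"id": 2, "name": "Ravi", "phones": ["98765", "91234"]}
]
""")

# A field can be present in one record and absent in another,
# which is exactly what a rigid relational schema would not allow.
emails = [r.get("email") for r in doc]
print(emails)  # ['asha@example.edu', None]
```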
The main goal of organizing data is to arrange it in a form that is readily available to
users. The basic features of a good classification are as follows:
Homogeneity – The data items in a particular group should be similar to each other.
Clarity – There must be no confusion in positioning any data item in a particular
group.
Stability – The classification must be stable, i.e., it should not have to change with
each new investigation carried out on the data.
Elastic – One should be able to change the basis of classification as the purpose of
classification changes.
II. Characteristics of Big Data
Big Data consists of data that cannot be processed by traditional storage or processing
units. Many multinational companies use big data technologies to process such data and
run their businesses. Global data flow is estimated to exceed 150 exabytes per day before
replication. There are five V's of Big Data that explain its characteristics:
o Volume
o Veracity
o Variety
o Value
o Velocity
1. Volume
o The name Big Data itself refers to its enormous size. Big Data is a vast ‘volume’
of data generated daily from many sources, such as business processes, machines,
social media platforms, networks, human interactions, and many more.
o Facebook, for example, generates approximately a billion messages daily, the
"Like" button is recorded about 4.5 billion times, and more than 350 million new
posts are uploaded every day. Big data technologies can handle such large amounts of data.
2. Variety
Big Data can be structured, unstructured, or semi-structured, and is collected from many
different sources. In the past, data was collected only from databases and spreadsheets,
but today it arrives in many forms: arrays, PDFs, emails, audio, social media posts,
photos, videos, etc.
3. Veracity
Veracity refers to how reliable and trustworthy the data is. Because big data comes from
many sources, there are many ways it must be filtered or translated; handling and
managing data quality efficiently is essential before big data can support business
development.
4. Value
Value is an essential characteristic of big data. It is not enough simply to store or process
data: only data that is valuable and reliable, and that can be analyzed into insight, is
worth storing, processing, and analyzing.
5. Velocity
Velocity plays an important role compared to the other characteristics. Velocity is the
speed at which data is created, often in real time. It covers the rate of incoming data,
the rate of change, and bursts of activity. A primary requirement of Big Data systems is
to make this rapidly arriving data available on demand.
Big data velocity deals with the speed of data flows from sources like application logs,
business processes, networks, social media sites, sensors, mobile devices, etc.
Illustrative figures for this velocity: 100,000+ tweets; 168,000,000+ emails.
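The idea of processing high-velocity data as it arrives, rather than in batches, can be sketched as follows (a toy illustration; the event sources and payloads are hypothetical):

```python
# Toy sketch of real-time (streaming) processing: events arrive one at a
# time and a running count per source is updated immediately, instead of
# waiting for a batch job. Event sources and payloads are hypothetical.
from collections import Counter

def stream():
    # stand-in for a live feed of (source, payload) events
    yield ("sensor", 21.5)
    yield ("log", "GET /index")
    yield ("sensor", 21.7)

counts = Counter()
for source, _payload in stream():
    counts[source] += 1          # state is updated as each event arrives
    print(source, dict(counts))  # insight is available immediately
```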
Evolution of Big Data:
The history of big data can be traced back to the early days of computing and the
emergence of digital data.
The history of big data demonstrates the increasing importance of managing and
analyzing large and diverse datasets. The field continues to evolve, and organizations
are continually exploring innovative approaches and technologies to leverage the
potential of big data for insights and decision-making.
Big Data:
A massive amount of data that cannot be stored, processed, or analyzed using traditional
tools is known as Big Data. It deals with large volumes of structured, semi-structured,
and unstructured data.
Traditional Business Intelligence (BI) versus Big Data:

Data Sources:
• BI: Typically deals with structured data from internal transactional systems, such as
databases, spreadsheets, and enterprise resource planning (ERP) systems.
• Big Data: Involves a wide range of data sources, including structured, unstructured,
and semi-structured data from social media, sensors, log files, multimedia content, web
interactions, and more. Big Data often includes data from external sources beyond an
organization's traditional data repositories.

Data Volume and Velocity:
• BI: Deals with relatively smaller datasets, typically in the gigabyte to terabyte range,
and focuses on historical analysis. Data updates and processing are usually done in
batches.
• Big Data: Handles massive datasets that can range from terabytes to petabytes or even
exabytes. Big Data platforms are designed to handle high-velocity data streams, often
requiring real-time or near-real-time processing to extract timely insights.

Processing Methods:
• BI: Relies on structured query language (SQL) and uses pre-defined, structured queries
to retrieve and analyze data. It primarily relies on relational databases and data
warehouses.
• Big Data: Utilizes distributed computing frameworks like Apache Hadoop and Apache
Spark, which enable parallel processing of large datasets across a cluster of computers.
Big Data technologies support both batch processing and real-time/streaming processing,
allowing for more complex and advanced analytics.

Scalability:
• BI: Typically operates on a fixed infrastructure with limited scalability. It may face
challenges when dealing with rapidly growing data volumes or sudden spikes in data
processing requirements.
• Big Data: Offers horizontal scalability, allowing organizations to scale their
infrastructure dynamically by adding or removing computing resources based on demand.
Big Data platforms can handle ever-increasing data volumes and accommodate diverse data
sources.
A data warehouse and a Hadoop environment are two different concepts and technologies
used in managing and processing large amounts of data.
Data Warehouse:
A data warehouse is a central repository that stores structured, historical data from
various sources within an organization.
It is designed to support business intelligence (BI) and reporting activities.
Hadoop Environment:
Hadoop is an open-source framework that enables distributed storage and processing of
large datasets across clusters of commodity hardware. It provides a scalable and
cost-effective solution for managing Big Data.
Distributed Storage:
• Hadoop uses the Hadoop Distributed File System (HDFS) to store data across
multiple nodes in a cluster.
• Data is split into blocks and distributed across the cluster, ensuring high availability
and fault tolerance.
Distributed Processing:
• Hadoop leverages the MapReduce framework to process data in parallel across the
cluster. MapReduce divides data processing tasks into smaller subtasks and distributes
them to different nodes, allowing for efficient parallel processing.
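The map/shuffle/reduce flow described above can be illustrated with a single-machine word count in Python (a conceptual sketch only; real Hadoop runs these phases in parallel across cluster nodes and handles fault tolerance, data locality, and disk spilling):

```python
# Single-machine sketch of the MapReduce idea: map emits (key, value)
# pairs, a shuffle groups them by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(line):
    # emit (word, 1) for every word in one input line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # group all values by key, as the framework's shuffle stage would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # aggregate each key's values into a final result
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big insights", "big cluster"]
pairs = [p for line in lines for p in map_phase(line)]
result = reduce_phase(shuffle(pairs))
print(result)  # {'big': 3, 'data': 1, 'insights': 1, 'cluster': 1}
```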
Scalability:
• Hadoop is designed to scale horizontally by adding more nodes to the cluster. This
enables organizations to store and process large volumes of data without relying on
expensive and specialized hardware.
Flexibility:
• Hadoop can handle various types of data, including structured, unstructured, and
semi-structured data. It allows organizations to store and process diverse data
formats, such as text, log files, images, videos, and more.
Ecosystem:
• The Hadoop ecosystem includes several data processing frameworks and tools built on
top of Hadoop, such as Apache Spark, Apache Hive, and Apache Pig.
• These frameworks provide higher-level abstractions and APIs for data manipulation,
querying, and analysis.
Classification of Analytics:
Big data analytics is the process of extracting meaningful insights from big data, such as
hidden patterns, unknown correlations, market trends, and customer preferences, that can
help organizations make informed business decisions.
• Data analytics can be of four types depending on the type and scope of analysis being
conducted on the data set.
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
• Descriptive analytics uses historical data from a single internal source to describe
what happened.
• For example: How many people viewed the website?
• Which products had the most defects?
• Used by most businesses, descriptive analytics forms the crux of everyday reporting,
especially through dashboards.
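A descriptive-analytics question like "how many people viewed the website?" reduces to simple counting over historical records (the sample page-view records below are hypothetical):

```python
# Descriptive analytics sketch: summarizing historical page-view records
# (hypothetical sample data) to describe what happened.
from collections import Counter

views = [
    {"page": "/home", "user": "u1"},
    {"page": "/home", "user": "u2"},
    {"page": "/pricing", "user": "u1"},
]

total_views = len(views)
unique_visitors = len({v["user"] for v in views})
views_per_page = Counter(v["page"] for v in views)

print(total_views, unique_visitors, views_per_page.most_common(1))
```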
• Diagnostic analytics is a form that dives deep into historical data to identify
anomalies, find patterns, identify correlations, and determine causal relationships.
• With the help of diagnostic analysis, data analysts can understand why a certain
product did not do well in the market, or why customer satisfaction decreased in a
certain month.
• Predictive analytics is a more advanced form of analytics that is often used to answer
the question ‘what will happen next?’ in a business situation.
• As the name suggests, this data analytics type predicts the future outcome of a
situation depending on all the available data.
• This data includes both market trends and older data about your business
performance. By combining the two, predictive analytics can forecast how your
business will perform during the next season.
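A minimal predictive-analytics sketch: fitting a least-squares trend line to past quarterly sales (hypothetical figures) and extrapolating one quarter ahead. Real predictive models use far richer features and techniques, but the "learn from history, project forward" shape is the same:

```python
# Predictive analytics sketch: ordinary least-squares line through past
# quarterly sales (hypothetical numbers), extrapolated to the next quarter.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # slope and intercept

quarters = [1, 2, 3, 4]
sales = [100.0, 110.0, 121.0, 130.0]
slope, intercept = fit_line(quarters, sales)

forecast = slope * 5 + intercept  # predicted sales for quarter 5
print(round(forecast, 1))  # 140.5
```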
• Prescriptive analytics answers the question ‘What do I need to do?’ by recommending
actions based on the outcomes predicted by the other forms of analytics.
Challenges of Big Data Analytics:
1. Data Quality:
• Poor data quality can significantly impact the accuracy and reliability of analytical
results. Incomplete, inconsistent, and inaccurate data can lead to biased or misleading
insights.
2. Data Integration:
• Organizations often have data stored in different systems and formats across various
departments or sources.
• Integrating data from disparate sources can be complex and time-consuming. Data
integration challenges include data harmonization, resolving schema and format
differences, and aligning data from different databases or systems.
3. Data Privacy and Security:
• With the increasing focus on data privacy regulations, protecting sensitive and
personal data is a significant challenge.
4. Skills Shortage:
• There is a shortage of skilled data analysts, data scientists, and data engineers who
possess the necessary expertise in data analytics.
• Organizations may struggle to find and retain professionals with the right skill set to
drive effective data analytics initiatives.
5. Technology Selection:
• The data analytics landscape is vast, with numerous tools, platforms, and technologies
available.
• Selecting the right technologies that align with organizational needs and integrating
them seamlessly with existing systems can be challenging.
6. Scalability and Performance:
• As data volumes and complexity grow, organizations must ensure that their analytics
infrastructure can scale accordingly.
• Processing and analyzing large datasets within acceptable timeframes can strain
computational resources and affect performance. Designing scalable architectures and
optimizing query performance are vital.
7. Cost Management:
• Storing and processing large volumes of data consumes significant computing, storage,
and licensing resources, so controlling the cost of analytics infrastructure is an
ongoing challenge.
Top Analytics Tools:
There are several top analytical tools available in the market that are widely used for
processing, analyzing, and deriving insights from data. Here are some of the popular
analytical tools.
Python:
• Python is a versatile programming language widely used for data analysis and
scientific computing.
• It offers numerous libraries and frameworks, such as pandas, NumPy, and scikit-learn,
which provide powerful data manipulation, analysis, and machine learning
capabilities.
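A small taste of the pandas style mentioned above (this assumes pandas is installed; the regions and sales figures are hypothetical):

```python
# Sketch of pandas-style data manipulation: build a small table and run a
# group-by aggregation. Sample regions and sales figures are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "sales":  [100, 80, 120, 90],
})

# group-by aggregation, a typical analysis step
totals = df.groupby("region")["sales"].sum()
print(totals.to_dict())  # {'north': 220, 'south': 170}
```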
R:
• R is a programming language and environment designed for statistical computing and
graphics. It offers a large ecosystem of packages (distributed through CRAN) for
statistical modelling, data manipulation, and visualization.
Apache Spark:
• Apache Spark is an open-source distributed computing system that provides fast and
scalable data processing and analytics capabilities.
• It supports various programming languages and offers libraries for distributed data
processing, machine learning, and graph analytics.
Cassandra:
• Apache Cassandra is an open-source, distributed NoSQL database designed for high
availability and linear scalability across commodity servers.
MongoDB:
• MongoDB is a document-oriented NoSQL database that stores data as flexible,
JSON-like documents.
Tableau:
• Tableau is a powerful and user-friendly data visualization and business intelligence tool.
• It allows users to create interactive dashboards, reports, and visualizations from
various data sources. Tableau supports advanced analytics, data blending, and provides
intuitive drag-and-drop functionality.
Microsoft Power BI:
• Microsoft Power BI is a business intelligence tool that helps users analyze data and
share insights.
• It provides interactive dashboards, data visualization, and reporting capabilities.
• Power BI integrates with various data sources, offers AI-powered features, and
supports collaboration and data sharing.
SAS:
• SAS is a commercial software suite for advanced analytics, business intelligence, and
data management, widely used in industries such as banking, insurance, and healthcare.
QlikView/Qlik Sense:
• QlikView and Qlik Sense are data visualization and business intelligence tools that
enable users to explore and analyze data visually.
• They offer drag-and-drop functionality, interactive dashboards, and powerful data
discovery capabilities.
MATLAB:
• MATLAB is a numerical computing environment and programming language widely used in
engineering and science for matrix computation, signal processing, and data
visualization.
Why Big Data Analytics is Important:
Improved Decision Making: Big Data Analytics enables organizations to extract valuable
insights from vast and diverse datasets. By analyzing this data, organizations can make
data-driven decisions; identify patterns, trends, and correlations that would otherwise be
difficult to uncover; and gain a competitive advantage.
Enhanced Operational Efficiency: Big Data Analytics helps optimize operations by providing
insights into inefficiencies, bottlenecks, and areas for improvement. By analyzing large
datasets, organizations can identify ways to streamline processes, reduce costs, and improve
overall operational efficiency.
Personalized Customer Experiences: Analyzing large volumes of customer data allows
organizations to understand customer behavior, preferences, and needs on an individual level.
With this knowledge, organizations can personalize customer experiences, tailor marketing
campaigns, offer personalized recommendations, and enhance customer satisfaction.
Fraud Detection and Security: Big Data Analytics plays a critical role in fraud detection
and security. By analyzing large datasets, organizations can identify unusual patterns, detect
anomalies, and proactively mitigate risks. This is particularly important in industries such as
finance, insurance, and cybersecurity.
Product Development and Innovation: Big Data Analytics enables organizations to gain
insights into market trends, customer demands, and emerging patterns. This information is
valuable for product development, innovation, and identifying new business opportunities.
Organizations can leverage big data to develop new products, improve existing ones, and stay
ahead in competitive markets.
Predictive Analytics: Big Data Analytics allows organizations to leverage historical data to
build predictive models. These models can forecast future trends, anticipate customer
behavior, predict demand, and optimize resource allocation. Predictive analytics helps
organizations make proactive decisions and take preventive measures.
Scalability and Agility: Big Data Analytics platforms and technologies are designed to
handle large volumes of data, providing the scalability required to process and analyze
massive datasets. This enables organizations to adapt quickly to changing business needs and
leverage data to gain insights in real-time.
Overall, Big Data Analytics is important as it enables organizations to harness the vast
amounts of data available to drive better decision-making, improve operational efficiency,
deliver personalized experiences, detect fraud, fuel innovation, and gain a competitive
edge in today's data-driven world.
IMPORTANT QUESTIONS