BDA Unit 1 Notes-1
UNIT I :
Classification of Digital Data, Structured and Unstructured Data - Introduction to Big Data:
Characteristics – Evolution – Definition - Challenges with Big Data - Other Characteristics of
Data - Why Big Data - Traditional Business Intelligence versus Big Data - Data Warehouse
and Hadoop Environment. Big Data Analytics: Classification of Analytics – Challenges – Why
Big Data Analytics is important – Top Analytics Tools.
UNIT I
Big data analytics is a field of study and practice that focuses on extracting valuable insights and
meaningful patterns from large and complex datasets. It involves the use of various
techniques, tools, and technologies to process, analyze, and interpret massive volumes of data
to make data-driven decisions, identify trends, and gain valuable insights.
The term "big data" refers to the vast amounts of structured, semi-structured, and unstructured
data that organizations and businesses collect from various sources such as social media,
sensors, mobile devices, transaction records, and more. This data is typically characterized by
its volume, velocity, variety, and veracity, which makes it challenging to manage and analyze
using traditional data processing methods. Big data analytics typically involves the following steps:
1. Data collection: Gathering and aggregating data from multiple sources, including
structured databases, log files, social media platforms, and IoT devices.
2. Data storage and management: Storing and organizing large volumes of data using
distributed file systems, NoSQL databases, and data warehouses.
3. Data preprocessing: Cleaning, transforming, and filtering the data to ensure its
quality, consistency, and relevance for analysis.
4. Data analysis: Applying various statistical, machine learning, and data mining
techniques to extract patterns, correlations, and insights from the data. This can
involve tasks such as data exploration, clustering, classification, regression, and
predictive modelling.
5. Data visualization: Presenting the analyzed data in a visual format, such as charts,
graphs, and dashboards, to facilitate better understanding and decision-making.
6. Real-time analytics: Performing analysis on streaming data to gain immediate
insights and enable real-time decision-making.
7. Data security and privacy: Ensuring appropriate measures are in place to protect
sensitive data and comply with relevant regulations and privacy policies.
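The steps above can be sketched end-to-end in a few lines of Python (a minimal illustration using only the standard library; the source records and the "spend" field are hypothetical):

```python
# Minimal sketch of the analytics steps listed above, standard library only.
# The sample records and the "spend" field are hypothetical.
from statistics import mean

# 1-2. Data collection / storage: records aggregated from two sources
source_a = [{"user": "u1", "spend": 120.0}, {"user": "u2", "spend": None}]
source_b = [{"user": "u3", "spend": 80.0}, {"user": "u4", "spend": 95.5}]
raw = source_a + source_b

# 3. Data preprocessing: filter out incomplete records
clean = [r for r in raw if r["spend"] is not None]

# 4. Data analysis: a simple descriptive statistic
avg_spend = mean(r["spend"] for r in clean)

# 5. Data visualization: a crude text "bar chart"
for r in clean:
    print(f'{r["user"]:>4} | {"#" * int(r["spend"] // 10)}')
print(f"average spend: {avg_spend:.2f}")
```

Real pipelines replace each step with heavier machinery (databases, Spark jobs, dashboards), but the shape of the flow is the same.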
Big Data:
A massive amount of data that cannot be stored, processed, or analyzed using traditional
tools is known as Big Data. Hadoop is a framework that stores and processes big data.
Common sources of big data include:
Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of
data on a day-to-day basis as they have billions of users worldwide.
E-commerce site: Sites like Amazon, Flipkart, and Alibaba generate many logs from
which users buying trends can be traced.
Weather stations: Weather stations and satellites give huge amounts of data
that are stored and analyzed to forecast weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly; for this, they store the data of their millions of users.
Share market: Stock exchanges worldwide generate huge amounts of data through
daily transactions.
Big data analytics is the process of extracting meaningful insights from big data, such as hidden
patterns, unknown correlations, market trends, and customer preferences, that can help
organizations make informed business decisions.
There are quite a few advantages to incorporating big data analytics into a business or
organization. These include:
Cost reduction: Big data technologies can reduce the cost of storing all business data in
one place. Tracking analytics also helps companies find ways to work more efficiently to
cut costs wherever possible.
Product development: Developing and marketing new products, services, or
brands is much easier when based on data collected from customers’ needs and
wants. Big data analytics also helps businesses understand product viability and
keep up with trends.
Strategic business decisions: The ability to constantly analyze data helps businesses
make better and faster decisions, such as cost and supply chain optimization.
Customer experience: Data-driven algorithms help marketing efforts (targeted
ads, for example) increase customer satisfaction by delivering an enhanced
customer experience.
Risk management: Businesses can identify risks by analyzing data patterns and
developing solutions for managing those risks.
I. Data Classification:
Data classification is the process of organizing data into relevant categories so that it can be
used or applied more efficiently. Classification makes it easy for users to retrieve data. It is
important for data security and compliance, and for meeting different business or personal
objectives. It is also a major requirement, as data must be easily retrievable within a
specific period.
Data is broadly classified into structured, semi-structured, and unstructured data.
1. Structured Data:
Structured data is created using a fixed schema and is maintained in tabular format.
The elements in structured data are addressable for effective analysis.
It contains all the data which can be stored in the SQL database in a tabular format.
Structured data is the simplest kind of data to manage and process.
Examples: relational data, geo-location, credit card numbers, addresses, etc.
Consider an example of relational data: a university maintains a record of students,
including each student's name, ID, address, and email. To store these records, a
relational schema and table are used.
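As a sketch of how such a structured record is defined and queried, the student table above can be modelled with Python's built-in sqlite3 module (the column names follow the example in the notes; the inserted row is hypothetical):

```python
# Structured data sketch: a fixed schema in tabular (relational) form.
# Column names follow the student example above; the row data is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        address TEXT,
        email   TEXT
    )
""")
conn.execute("INSERT INTO student VALUES (1, 'Asha', 'Hyderabad', 'asha@example.edu')")
conn.commit()

# Structured data is addressable: every value can be reached by column name
row = conn.execute("SELECT name, email FROM student WHERE id = 1").fetchone()
print(row)  # ('Asha', 'asha@example.edu')
```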
2. Unstructured Data:
Unstructured data does not follow a pre-defined schema or any organized format.
This kind of data does not fit the relational model, which requires data to be
arranged in a pre-defined, organized way.
Unstructured data is very important in the big data domain, and many platforms,
such as NoSQL databases, exist to store and manage it.
3. Semi-Structured Data:
Semi-structured data is information that does not reside in a relational database but has
some organizational properties that make it easier to analyze. With some processing it can
be stored in a relational database, although this is very hard for some semi-structured data.
Common semi-structured formats include XML and JSON.
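A short sketch of semi-structured data using JSON, a common semi-structured format (the records themselves are hypothetical):

```python
# Semi-structured data sketch: JSON records share organizational properties
# (named keys) but need not follow one fixed schema. Records are hypothetical.
import json

doc = json.loads("""
[
  {"id": 1, "name": "Asha", "email": "asha@example.edu"},
  {"id": 2, "name": "Ravi", "phones": ["98765", "91234"]}
]
""")

# A field can be present in one record and absent in another,
# which is exactly what a rigid relational schema would not allow.
emails = [r.get("email") for r in doc]
print(emails)  # ['asha@example.edu', None]
```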
The main goal of organizing data is to arrange it in a form that is readily available to
users. The basic features of a good classification are as follows:
Homogeneity – The data items in a particular group should be similar to each other.
Clarity – There must be no confusion in positioning any data item in a particular
group.
Stability – The classification must be stable, i.e., it should not have to change with
each new investigation carried out on the data.
Elastic – One should be able to change the basis of classification as the purpose of
classification changes.
II. Characteristics of Big Data
Big Data consists of data that cannot be processed by traditional storage or processing
units. Many multinational companies use big data technologies to process such data and
run their businesses. Global data flow is estimated to exceed 150 exabytes per day before
replication. There are five V's of Big Data that explain its characteristics:
o Volume
o Veracity
o Variety
o Value
o Velocity
1. Volume
o The name Big Data itself refers to its enormous size. Big Data is a vast ‘volume’
of data generated daily from many sources, such as business processes, machines,
social media platforms, networks, human interactions, and many more.
o Facebook, for example, generates approximately a billion messages daily, the
"Like" button is recorded about 4.5 billion times, and more than 350 million new
posts are uploaded every day. Big data technologies can handle such large amounts of data.
2. Variety
Big Data can be structured, unstructured, or semi-structured, and is collected from many
different sources. In the past, data was collected only from databases and spreadsheets,
but today it arrives in many forms: arrays, PDFs, emails, audio, social media posts,
photos, videos, etc.
3. Veracity
Veracity refers to how reliable and trustworthy the data is. Because big data comes from
many sources, there are many ways it must be filtered or translated; handling and
managing data quality efficiently is essential before big data can support business
development.
4. Value
Value is an essential characteristic of big data. It is not enough simply to store or process
data: only data that is valuable and reliable, and that can be analyzed into insight, is
worth storing, processing, and analyzing.
5. Velocity
Velocity plays an important role compared to the other characteristics. Velocity is the
speed at which data is created, often in real time. It covers the rate of incoming data,
the rate of change, and bursts of activity. A primary requirement of Big Data systems is
to make this rapidly arriving data available on demand.
Big data velocity deals with the speed of data flows from sources like application logs,
business processes, networks, social media sites, sensors, mobile devices, etc.
Illustrative figures for this velocity: 100,000+ tweets; 168,000,000+ emails.
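The idea of processing high-velocity data as it arrives, rather than in batches, can be sketched as follows (a toy illustration; the event sources and payloads are hypothetical):

```python
# Toy sketch of real-time (streaming) processing: events arrive one at a
# time and a running count per source is updated immediately, instead of
# waiting for a batch job. Event sources and payloads are hypothetical.
from collections import Counter

def stream():
    # stand-in for a live feed of (source, payload) events
    yield ("sensor", 21.5)
    yield ("log", "GET /index")
    yield ("sensor", 21.7)

counts = Counter()
for source, _payload in stream():
    counts[source] += 1          # state is updated as each event arrives
    print(source, dict(counts))  # insight is available immediately
```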
Evolution of Big Data:
The history of big data can be traced back to the early days of computing and the
emergence of digital data.
The history of big data demonstrates the increasing importance of managing and
analyzing large and diverse datasets. The field continues to evolve, and organizations
are continually exploring innovative approaches and technologies to leverage the
potential of big data for insights and decision-making.
Big Data:
A massive amount of data that cannot be stored, processed, or analyzed using traditional
tools is known as Big Data. It deals with large volumes of structured, semi-structured,
and unstructured data.
Traditional Business Intelligence (BI) versus Big Data:

Data Sources:
• BI: Typically deals with structured data from internal transactional systems, such as
databases, spreadsheets, and enterprise resource planning (ERP) systems.
• Big Data: Involves a wide range of data sources, including structured, unstructured,
and semi-structured data from social media, sensors, log files, multimedia content, web
interactions, and more. Big Data often includes data from external sources beyond an
organization's traditional data repositories.

Data Volume and Velocity:
• BI: Deals with relatively smaller datasets, typically in the gigabyte to terabyte range,
and focuses on historical analysis. Data updates and processing are usually done in
batches.
• Big Data: Handles massive datasets that can range from terabytes to petabytes or even
exabytes. Big Data platforms are designed to handle high-velocity data streams, often
requiring real-time or near-real-time processing to extract timely insights.

Processing Methods:
• BI: Relies on structured query language (SQL) and uses pre-defined, structured queries
to retrieve and analyze data. It primarily relies on relational databases and data
warehouses.
• Big Data: Utilizes distributed computing frameworks like Apache Hadoop and Apache
Spark, which enable parallel processing of large datasets across a cluster of computers.
Big Data technologies support both batch processing and real-time/streaming processing,
allowing for more complex and advanced analytics.

Scalability:
• BI: Typically operates on a fixed infrastructure with limited scalability. It may face
challenges when dealing with rapidly growing data volumes or sudden spikes in data
processing requirements.
• Big Data: Offers horizontal scalability, allowing organizations to scale their
infrastructure dynamically by adding or removing computing resources based on demand.
Big Data platforms can handle ever-increasing data volumes and accommodate diverse data
sources.
A data warehouse and a Hadoop environment are two different concepts and technologies
used in managing and processing large amounts of data.
Data Warehouse:
A data warehouse is a central repository that stores structured, historical data from
various sources within an organization.
It is designed to support business intelligence (BI) and reporting activities.
Hadoop Environment:
Hadoop is an open-source framework that enables distributed storage and processing of
large datasets across clusters of commodity hardware. It provides a scalable and
cost-effective solution for managing Big Data.
Distributed Storage:
• Hadoop uses the Hadoop Distributed File System (HDFS) to store data across
multiple nodes in a cluster.
• Data is split into blocks and distributed across the cluster, ensuring high availability
and fault tolerance.
Distributed Processing:
• Hadoop leverages the MapReduce framework to process data in parallel across the
cluster. MapReduce divides data processing tasks into smaller subtasks and distributes
them to different nodes, allowing for efficient parallel processing.
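The map/shuffle/reduce flow described above can be illustrated with a single-machine word count in Python (a conceptual sketch only; real Hadoop runs these phases in parallel across cluster nodes and handles fault tolerance, data locality, and disk spilling):

```python
# Single-machine sketch of the MapReduce idea: map emits (key, value)
# pairs, a shuffle groups them by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(line):
    # emit (word, 1) for every word in one input line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # group all values by key, as the framework's shuffle stage would
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # aggregate each key's values into a final result
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big insights", "big cluster"]
pairs = [p for line in lines for p in map_phase(line)]
result = reduce_phase(shuffle(pairs))
print(result)  # {'big': 3, 'data': 1, 'insights': 1, 'cluster': 1}
```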
Scalability:
• Hadoop is designed to scale horizontally by adding more nodes to the cluster. This
enables organizations to store and process large volumes of data without relying on
expensive and specialized hardware.
Flexibility:
• Hadoop can handle various types of data, including structured, unstructured, and
semi-structured data. It allows organizations to store and process diverse data
formats, such as text, log files, images, videos, and more.
Ecosystem:
• The Hadoop ecosystem includes several data processing frameworks and tools built on
top of Hadoop, such as Apache Spark, Apache Hive, and Apache Pig.
• These frameworks provide higher-level abstractions and APIs for data manipulation,
querying, and analysis.
Classification of Analytics:
Big data analytics is the process of extracting meaningful insights from big data, such as
hidden patterns, unknown correlations, market trends, and customer preferences, that can
help organizations make informed business decisions.
• Data analytics can be of four types depending on the type and scope of analysis being
conducted on the data set.
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
• Descriptive analytics uses historical data from a single internal source to describe
what happened.
• For example: How many people viewed the website?
• Which products had the most defects?
• Used by most businesses, descriptive analytics forms the crux of everyday reporting,
especially through dashboards.
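A descriptive-analytics question like "how many people viewed the website?" reduces to simple counting over historical records (the sample page-view records below are hypothetical):

```python
# Descriptive analytics sketch: summarizing historical page-view records
# (hypothetical sample data) to describe what happened.
from collections import Counter

views = [
    {"page": "/home", "user": "u1"},
    {"page": "/home", "user": "u2"},
    {"page": "/pricing", "user": "u1"},
]

total_views = len(views)
unique_visitors = len({v["user"] for v in views})
views_per_page = Counter(v["page"] for v in views)

print(total_views, unique_visitors, views_per_page.most_common(1))
```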
• Diagnostic analytics is a form that dives deep into historical data to identify
anomalies, find patterns, identify correlations, and determine causal relationships.
• With the help of diagnostic analysis, data analysts can understand why a certain
product did not do well in the market, or why customer satisfaction decreased in a
certain month.
• Predictive analytics is a more advanced form of analytics that is often used to answer
the question ‘what will happen next?’ in a business situation.
• As the name suggests, this data analytics type predicts the future outcome of a
situation depending on all the available data.
• This data includes both market trends and older data about your business
performance. By combining the two, predictive analytics can forecast how your
business will perform during the next season.
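A minimal predictive-analytics sketch: fitting a least-squares trend line to past quarterly sales (hypothetical figures) and extrapolating one quarter ahead. Real predictive models use far richer features and techniques, but the "learn from history, project forward" shape is the same:

```python
# Predictive analytics sketch: ordinary least-squares line through past
# quarterly sales (hypothetical numbers), extrapolated to the next quarter.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # slope and intercept

quarters = [1, 2, 3, 4]
sales = [100.0, 110.0, 121.0, 130.0]
slope, intercept = fit_line(quarters, sales)

forecast = slope * 5 + intercept  # predicted sales for quarter 5
print(round(forecast, 1))  # 140.5
```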
• Prescriptive analytics answers the question ‘What do I need to do?’ by recommending
actions based on the outcomes predicted by the other forms of analytics.
Challenges of Big Data Analytics:
1. Data Quality:
• Poor data quality can significantly impact the accuracy and reliability of analytical
results. Incomplete, inconsistent, and inaccurate data can lead to biased or misleading
insights.
2. Data Integration:
• Organizations often have data stored in different systems and formats across various
departments or sources.
• Integrating data from disparate sources can be complex and time-consuming. Data
integration challenges include data harmonization, resolving schema and format
differences, and aligning data from different databases or systems.
3. Data Privacy and Security:
• With the increasing focus on data privacy regulations, protecting sensitive and
personal data is a significant challenge.
4. Skills Shortage:
• There is a shortage of skilled data analysts, data scientists, and data engineers who
possess the necessary expertise in data analytics.
• Organizations may struggle to find and retain professionals with the right skill set to
drive effective data analytics initiatives.
5. Technology Selection:
• The data analytics landscape is vast, with numerous tools, platforms, and technologies
available.
• Selecting the right technologies that align with organizational needs and integrating
them seamlessly with existing systems can be challenging.
6. Scalability and Performance:
• As data volumes and complexity grow, organizations must ensure that their analytics
infrastructure can scale accordingly.
• Processing and analyzing large datasets within acceptable timeframes can strain
computational resources and affect performance. Designing scalable architectures and
optimizing query performance are vital.
7. Cost Management:
• Storing and processing large volumes of data consumes significant computing, storage,
and licensing resources, so controlling the cost of analytics infrastructure is an
ongoing challenge.
Top Analytics Tools:
There are several top analytical tools available in the market that are widely used for
processing, analyzing, and deriving insights from data. Here are some of the popular
analytical tools.
Python:
• Python is a versatile programming language widely used for data analysis and
scientific computing.
• It offers numerous libraries and frameworks, such as pandas, NumPy, and scikit-learn,
which provide powerful data manipulation, analysis, and machine learning
capabilities.
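A small taste of the pandas style mentioned above (this assumes pandas is installed; the regions and sales figures are hypothetical):

```python
# Sketch of pandas-style data manipulation: build a small table and run a
# group-by aggregation. Sample regions and sales figures are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "sales":  [100, 80, 120, 90],
})

# group-by aggregation, a typical analysis step
totals = df.groupby("region")["sales"].sum()
print(totals.to_dict())  # {'north': 220, 'south': 170}
```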
R:
• R is a programming language and environment designed for statistical computing and
graphics. It offers a large ecosystem of packages (distributed through CRAN) for
statistical modelling, data manipulation, and visualization.
Apache Spark:
• Apache Spark is an open-source distributed computing system that provides fast and
scalable data processing and analytics capabilities.
• It supports various programming languages and offers libraries for distributed data
processing, machine learning, and graph analytics.
Cassandra:
• Apache Cassandra is an open-source, distributed NoSQL database designed for high
availability and linear scalability across commodity servers.
MongoDB:
• MongoDB is a document-oriented NoSQL database that stores data as flexible,
JSON-like documents.
Tableau:
• Tableau is a powerful and user-friendly data visualization and business intelligence tool.
• It allows users to create interactive dashboards, reports, and visualizations from
various data sources. Tableau supports advanced analytics, data blending, and provides
intuitive drag-and-drop functionality.
Microsoft Power BI:
• Microsoft Power BI is a business intelligence tool that helps users analyze data and
share insights.
• It provides interactive dashboards, data visualization, and reporting capabilities.
• Power BI integrates with various data sources, offers AI-powered features, and
supports collaboration and data sharing.
SAS:
• SAS is a commercial software suite for advanced analytics, business intelligence, and
data management, widely used in industries such as banking, insurance, and healthcare.
QlikView/Qlik Sense:
• QlikView and Qlik Sense are data visualization and business intelligence tools that
enable users to explore and analyze data visually.
• They offer drag-and-drop functionality, interactive dashboards, and powerful data
discovery capabilities.
MATLAB:
• MATLAB is a numerical computing environment and programming language widely used in
engineering and science for matrix computation, signal processing, and data
visualization.
Why Big Data Analytics is Important:
Improved Decision Making: Big Data Analytics enables organizations to extract valuable
insights from vast and diverse datasets. By analyzing this data, organizations can make
data-driven decisions; identify patterns, trends, and correlations that would otherwise be
difficult to uncover; and gain a competitive advantage.
Enhanced Operational Efficiency: Big Data Analytics helps optimize operations by providing
insights into inefficiencies, bottlenecks, and areas for improvement. By analyzing large
datasets, organizations can identify ways to streamline processes, reduce costs, and improve
overall operational efficiency.
Personalized Customer Experiences: Analyzing large volumes of customer data allows
organizations to understand customer behavior, preferences, and needs on an individual level.
With this knowledge, organizations can personalize customer experiences, tailor marketing
campaigns, offer personalized recommendations, and enhance customer satisfaction.
Fraud Detection and Security: Big Data Analytics plays a critical role in fraud detection
and security. By analyzing large datasets, organizations can identify unusual patterns, detect
anomalies, and proactively mitigate risks. This is particularly important in industries such as
finance, insurance, and cybersecurity.
Product Development and Innovation: Big Data Analytics enables organizations to gain
insights into market trends, customer demands, and emerging patterns. This information is
valuable for product development, innovation, and identifying new business opportunities.
Organizations can leverage big data to develop new products, improve existing ones, and stay
ahead in competitive markets.
Predictive Analytics: Big Data Analytics allows organizations to leverage historical data to
build predictive models. These models can forecast future trends, anticipate customer
behavior, predict demand, and optimize resource allocation. Predictive analytics helps
organizations make proactive decisions and take preventive measures.
Scalability and Agility: Big Data Analytics platforms and technologies are designed to
handle large volumes of data, providing the scalability required to process and analyze
massive datasets. This enables organizations to adapt quickly to changing business needs and
leverage data to gain insights in real-time.
Overall, Big Data Analytics is important as it enables organizations to harness the vast
amounts of data available to drive better decision-making, improve operational efficiency,
deliver personalized experiences, detect fraud, fuel innovation, and gain a competitive
edge in today's data-driven world.
IMPORTANT QUESTIONS