Unit 1


Department of Applied Computational Science & Engineering

Course Code: KDS 601


Course Name: Big Data
Faculty Name: Rinki Chauhan
Email : [email protected]
Vision
To build a strong teaching environment that responds to the needs of industry and the challenges of society

Mission
• M1 : Developing a strong mathematical & computing skill set among the students.
• M2 : Extending the role of computer science and engineering in diverse areas like Internet of Things (IoT),
Artificial Intelligence & Machine Learning and Data Analytics.
• M3 : Imbibing the students with a deep understanding of professional ethics and high integrity to serve the
Nation.
• M4 : Providing an environment to the students for their growth both as individuals and as globally competent
Computer Science professionals with encouragement for innovation & start-up culture.

Subject: Big Data


Course Outcome

CO    Title

CO1 Understand the key concerns that are common to all software development processes.

CO2 Select appropriate process models, approaches and techniques to manage a given software development process.

CO3 Elicit requirements for a software product and translate these into a documented design.

CO4 Recognize the importance of software reliability, how dependable software can be designed, and what measures are
used.

CO5 Understand the principles and techniques underlying the process of inspecting and testing software and making it
free of errors and fault-tolerant.

CO6 Understand the latest advances and their applications in software engineering and testing.

Syllabus

Unit 1

Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction to Big Data
platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of Big Data, Big Data
technology components, Big Data importance and applications, Big Data features – security, compliance,
auditing and protection, Big Data privacy and ethics, Big Data Analytics, Challenges of conventional systems,
intelligent data analysis, nature of data, analytic processes and tools, analysis vs reporting, modern data
analytic tools.

Introduction to Big Data

Introduction to Big Data:

• Definition: Big Data refers to large and complex sets of data that traditional data
processing systems are unable to handle effectively.
• Characteristics: Big Data is typically characterized by its volume, variety, velocity,
and veracity.
• Volume: The sheer size of Big Data, measured in petabytes or exabytes, requires
specialized storage and processing systems.
• Variety: Big Data comes in different forms, including structured, semi-structured, and
unstructured data.
• Velocity: Big Data is generated and processed at high speeds, making it necessary
to handle data in real-time.
• Veracity: The quality and accuracy of Big Data can be uncertain, requiring
verification and validation.

Topics to be Covered

Importance of Big Data

• Business Intelligence: Big Data helps organizations to gain valuable insights into their
operations, customer behavior, and market trends.
• Improved Decision-Making: By analyzing large amounts of data, organizations can
make better and more informed decisions.
• Cost Savings: Big Data can help organizations optimize their operations and reduce
costs by identifying inefficiencies and waste.
• Customer Satisfaction: Big Data can help organizations improve their customer
experiences by providing personalized services and products.
• Innovation: Big Data drives innovation by enabling organizations to identify new
opportunities and create new products and services.
• Example: A retail company can use Big Data to analyze customer purchasing
patterns and make better inventory decisions, leading to increased sales and reduced
waste.


Types of digital data

•Definition of Digital Data: Data that exists in a digital format, stored on a computer,
server, or other digital storage device.

Types of Digital Data

•Structured,
•Unstructured,
•Semi-Structured Data.


❖ Structured Data

Definition: Data that is organized in a specific format and follows a consistent pattern.
Examples: Customer data in a database, financial transactions, stock market data.

❖ Unstructured Data

Definition: Data that does not follow a specific format or pattern and does not fit neatly
into a database or spreadsheet.
Examples: Images, videos, audio files, emails, and text documents.

❖ Semi-Structured Data

Definition: Data that has some structure but does not fit neatly into a traditional
database or spreadsheet.
Examples: XML files, JSON data, log files.
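The distinction between semi-structured and structured data can be made concrete with a short Python sketch (the record and field names below are illustrative): a semi-structured JSON string carries named, possibly nested fields but no fixed schema, and parsing flattens it into a structured row.

```python
import json

# A semi-structured record: it has named fields (some structure), but fields
# can be nested or optional, so it does not map directly onto a table.
record = '{"user": "alice", "action": "purchase", "details": {"item": "book", "price": 12.5}}'

parsed = json.loads(record)                      # parse the JSON text
price = parsed["details"]["price"]               # navigate the nested part
row = (parsed["user"], parsed["action"], price)  # flatten into a structured row
print(row)                                       # ('alice', 'purchase', 12.5)
```

Unstructured data (an image or free-form email body, say) offers no such named fields at all, which is what makes it the hardest of the three types to query.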


History of Big Data innovation :

Introduction to Big Data Innovation

Definition: Big Data refers to the large and complex data sets that cannot be processed and
analyzed using traditional data processing methods.

Origin: The term "Big Data" was popularized in the mid-1990s by John Mashey, a computer scientist at Silicon Graphics.

Importance: Big Data has become critical in today's business and social world, as it helps
organizations make informed decisions, improve operational efficiency, and drive innovation.


History of Big Data innovation :

Pre-Big Data Era (1960-2000)

•Early use of data: During this time, data was primarily used for scientific and engineering
research.

•Data storage: Data was stored on mainframe computers and was limited by their storage
capacity.

•Data analysis: Data analysis was manual and time-consuming, and only a few experts were
trained in the process.


History of Big Data innovation :

Emergence of Big Data (2000-2010)

•Development of the Internet: The widespread use of the Internet led to an exponential
increase in the volume of data being generated.

•Increase in data sources: The emergence of social media, e-commerce, and other online
platforms created new sources of data.

•Increase in data storage capacity: The development of new data storage technologies, such
as cloud storage, allowed organizations to store and manage larger amounts of data.


History of Big Data innovation :

Big Data Takes Over (2010-Present)

•Advancements in data processing: The development of big data technologies, such as Hadoop and Spark, has made it possible to process large amounts of data quickly and efficiently.

•Business impact: Organizations are now able to extract insights from big data to make
informed business decisions, improve customer engagement, and optimize operations.

•Widespread adoption: Big Data has become a critical aspect of many industries, including
healthcare, finance, retail, and marketing.


History of Big Data innovation :

Examples of Big Data Innovation

•Fraud detection in the financial industry

•Predictive maintenance in the manufacturing industry

•Personalized marketing in the retail industry

•Precision medicine in the healthcare industry


Introduction to Big Data Platform

Definition: A platform that enables organizations to process and manage vast amounts of
structured and unstructured data.

Importance: Helps organizations make informed business decisions by analyzing large data sets.

Characteristics: Scalable, flexible, and highly adaptable to changing business needs.

Example: Apache Hadoop, a popular open-source Big Data platform.


Introduction to Big Data Platform

Key Components of Big Data Platform

Data Collection: Gathering large amounts of data from various sources such as social
media, databases, and sensors.

Data Storage: Storing the data in a way that allows for efficient and effective retrieval and
analysis.

Data Processing: The ability to process and analyze large data sets in real-time.

Data Visualization: Presenting data in a visual format for easy interpretation and
understanding.


Introduction to Big Data Platform

Advantages of Big Data Platform

Improved Decision-Making: Provides insights into data that can be used to make
informed business decisions.

Increased Efficiency: Allows for real-time processing and analysis of large data sets.

Cost Savings: Reduces costs by automating manual processes and reducing errors.

Competitive Advantage: Helps organizations stay ahead of their competition by providing valuable insights into data.


Drivers for Big Data

• Increased Data Generation: The rise of the internet and connected devices has led to
an exponential growth in data generation, making big data a critical resource.
• Data-Driven Business Decisions: Companies are turning towards big data analytics
to drive better business decisions, leading to a higher competitive advantage.
• Advanced Analytics: Big data analytics tools and techniques allow for advanced data
processing and provide deep insights into complex data sets.
• Cloud Computing: The availability of cloud computing solutions has made it easier for
companies to store and process large amounts of data.
• IoT and Machine Learning: The Internet of Things (IoT) and machine learning are key
drivers of big data, enabling companies to collect, analyze and extract value from large
volumes of data.

Example: A retailer using big data analytics to analyze customer purchase patterns, demographics, and preferences can provide better customer experiences and increase sales.

Big Data architecture

Figure from: Erraissi, A., Belangour, A., & Tragha, A. (2017). A Big Data Hadoop building blocks comparative study. International Journal of Computer Trends and Technology, 48, 36-40. doi:10.14445/22312803/IJCTT-V48P109.


Big Data architecture

Definition: The combination of hardware, software, and data management systems used
to store, process, and analyze Big Data.

Components: Hadoop, Spark, NoSQL databases, cloud computing, data warehousing, etc.


Big Data architecture

❖ Hadoop

Definition: An open-source framework for distributed storage and processing of large data sets.
Example: Yahoo, Facebook, Airbnb, etc. use Hadoop to store and process their Big Data.

❖ Spark

Definition: An open-source data processing framework for real-time data analysis.


Example: eBay, Yahoo, and Airbnb use Spark to process real-time data.

❖ NoSQL Databases

Definition: Databases designed for handling large and complex data sets, often
unstructured.
Example: MongoDB, Cassandra, Couchbase, etc.
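The distributed-processing model that Hadoop popularized can be illustrated without Hadoop itself. The following is a pure-Python sketch of MapReduce-style word counting on in-memory data — not Hadoop's actual API; the function names and sample lines are illustrative only:

```python
from collections import defaultdict

def map_phase(lines):
    # Map step: each mapper emits (word, 1) pairs from its share of the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_and_reduce(pairs):
    # Shuffle step groups pairs by key; reduce step sums the counts per word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["Big data needs big tools", "data drives decisions"]
counts = shuffle_and_reduce(map_phase(lines))
print(counts["big"], counts["data"])  # 2 2
```

In a real cluster, Hadoop runs many mappers and reducers in parallel across machines and the shuffle moves data over the network; Spark expresses the same idea with in-memory transformations, which is why it is faster for iterative workloads.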

Big Data architecture

Cloud Computing

Definition: A technology that allows organizations to store and process Big Data on
remote servers.
Example: Amazon Web Services, Microsoft Azure, Google Cloud Platform, etc.

Data Warehousing

Definition: The process of collecting, storing, and managing large amounts of data for
analysis and reporting.
Example: Oracle, Microsoft SQL Server, Teradata, etc.


Big Data architecture

Conclusion

•Big Data is a complex and rapidly growing field that requires a combination of hardware,
software, and data management systems.

•Understanding the characteristics and architecture of Big Data is crucial for organizations
to effectively store, process, and analyze the vast amount of data generated.


5 Vs of Big Data

Introduction

•Big Data is the massive amount of structured and unstructured data generated by
businesses and individuals.

•The 5 Vs of Big Data are crucial characteristics of Big Data that help organizations
manage and analyze data.


5 Vs of Big Data

Volume

•The first V of Big Data refers to the sheer amount of data that is generated and collected.
Examples: Data from social media, IoT devices, and cloud storage.

Variety

•The second V of Big Data refers to the different types of data that are collected.
Examples: Text, images, audio, and video.

Velocity

•The third V of Big Data refers to the speed at which data is generated and needs to be
processed.
Examples: Stock market data, real-time sensor data.

5 Vs of Big Data

Veracity

•The fourth V of Big Data refers to the quality and accuracy of the data.
Examples: Data from unreliable sources, data that is incomplete or inconsistent.
Value

•The fifth V of Big Data refers to the potential business value that can be derived from the
data.
Examples: Predictive analytics, customer behavior insights, and fraud detection.

The 5 Vs of Big Data provide a framework for understanding the complexities of Big
Data and how it can be effectively managed and analyzed. Organizations can
leverage these characteristics to gain valuable insights and make data-driven
decisions.

Introduction to Big Data Technology Components

Definition: Big Data technology refers to a set of tools and techniques used to process, store, analyze
and visualize large and complex data sets.

❖ Storage Component

Definition: The storage component is where the big data is stored and managed.
Example: Hadoop Distributed File System (HDFS)

❖ Processing Component

Definition: The processing component is responsible for the processing and analysis of big data.
Example: Apache Spark

❖ Data Ingestion Component


Definition: Data ingestion refers to the process of capturing and transferring data into the storage
component.
Example: Apache Flume
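Apache Flume itself moves event streams into HDFS; the idea can be sketched in a few lines of plain Python (the events and the in-memory sink below are hypothetical stand-ins, not Flume's API): ingestion parses and validates each raw event, then appends it to the storage layer in a uniform format.

```python
import io
import json

# In-memory sink standing in for the storage layer (e.g. a distributed
# file system); real ingestion tools write to durable storage instead.
sink = io.StringIO()

raw_events = ['{"sensor": 1, "temp": 21.5}', '{"sensor": 2, "temp": 19.0}']

def ingest(events, out):
    ingested = 0
    for event in events:
        record = json.loads(event)            # parse/validate on the way in
        out.write(json.dumps(record) + "\n")  # append one record per line
        ingested += 1
    return ingested

count = ingest(raw_events, sink)
print(count)  # 2
```

Validating at ingestion time, as here, keeps malformed events out of the storage component, which simplifies every downstream processing step.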

Introduction to Big Data Technology Components

❖ Data Visualization Component

Definition: Data visualization is the process of presenting data in a graphical or pictorial form.
Example: Tableau

❖ Management Component

Definition: The management component provides the framework to manage and monitor the big data
technology components.
Example: Apache Ambari

Big Data technology components are essential in handling large and complex data sets. A
combination of these components provides a complete solution for big data processing and
analysis.


Big Data importance and applications

Definition: Big Data refers to large and complex data sets that are generated in high volumes and are difficult to process using traditional methods.

❖ Importance of Big Data

Definition: The importance of Big Data lies in the fact that it provides valuable insights and
information that can help organizations to make informed decisions.
Example: Customer behavior analysis, Fraud detection in financial institutions.

❖ Applications of Big Data


Definition: Big Data can be applied in various industries, including healthcare, finance, retail, and
government.
Example: Predictive maintenance in manufacturing, Medical record analysis in healthcare.


Big Data importance and applications

• Big Data in Retail

Definition: In the retail industry, Big Data is used to analyze customer behavior and preferences,
improving product recommendations and pricing strategies.
Example: Personalized shopping recommendations, Inventory management optimization.

• Big Data in Healthcare

Definition: In healthcare, Big Data is used to analyze patient medical records and to improve medical diagnosis and treatment plans.
Example: Medical record analysis, Predictive analysis of disease outbreaks.


Big Data importance and applications

• Big Data in Finance

Definition: In finance, Big Data is used to detect fraudulent activities and manage risk.
Example: Fraud detection in financial institutions, Risk management in insurance companies.

Big Data plays a crucial role in various industries, providing valuable insights and information
for informed decision making. With the increasing demand for data-driven decision making,
Big Data is becoming increasingly important for organizations of all sizes.

Big Data features – security, compliance, auditing and protection

Definition: Big Data features refer to the various aspects of Big Data technology that contribute to its
security, compliance, auditing and protection.
❖ Security Feature
Definition: The security feature of Big Data is designed to protect the data from unauthorized access
and theft.
Example: Encryption, User access controls.
❖ Compliance Feature
Definition: The compliance feature ensures that the data is collected, stored and processed in
accordance with legal and regulatory requirements.
Example: Compliance with data privacy laws, Audit trails.
❖ Auditing Feature
Definition: The auditing feature provides a complete record of the data processing activities, including
who accessed the data and when.

Example: User access logs, Data modification records.
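A minimal sketch of such an audit trail in Python (the field names and sample users are illustrative): every access is recorded with who, what, which action, and when.

```python
from datetime import datetime, timezone

# In a real deployment this would be durable, append-only storage,
# not an in-memory list.
audit_log = []

def record_access(user, resource, action):
    # Each entry answers: who touched which data, doing what, and when.
    entry = {
        "user": user,
        "resource": resource,
        "action": action,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(entry)
    return entry

record_access("alice", "customers.csv", "read")
record_access("bob", "customers.csv", "modify")
print(len(audit_log), audit_log[1]["action"])  # 2 modify
```

Because the log is append-only, it also serves the compliance feature: regulators can be shown a complete, tamper-evident history of data processing activities.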

Big Data features – security, compliance, auditing and protection

❖ Protection Feature

Definition: The protection feature helps to protect the data from damage or loss due to hardware
failure, software errors or natural disasters.
Example: Data backups, Disaster recovery plans.

Role of Security, Compliance, Auditing and Protection in Big Data

Definition: The security, compliance, auditing and protection features of Big Data technology play a
crucial role in ensuring the safety, privacy and reliability of the data.
Example: Protecting sensitive information, Maintaining data integrity and reliability.

Big Data features of security, compliance, auditing and protection are essential in ensuring
that the data is collected, stored and processed securely and in accordance with legal and
regulatory requirements. These features play a crucial role in maintaining the reliability and
privacy of the data.


Big Data privacy and ethics

❖ Introduction to Big Data Privacy and Ethics

Definition: Big Data refers to the massive amounts of structured and unstructured data generated by
various sources, including individuals, organizations, and devices.

Importance: The increasing use of Big Data technologies has led to concerns about privacy and
ethical issues associated with the collection, storage, and use of large amounts of personal data.

Point: It is essential to understand the privacy and ethical implications of Big Data to ensure that the
benefits of this technology can be leveraged while also protecting individuals' rights.


Big Data privacy and ethics

❖ Big Data Privacy Concerns

Definition: Privacy concerns in Big Data refer to the unauthorized collection, storage, and use of
personal data by organizations or individuals.

Example: The Cambridge Analytica scandal, where the personal data of millions of Facebook users
was collected and used to influence the 2016 US Presidential Election, is a prime example of Big Data
privacy concerns.

Point: Organizations must be transparent about the data they collect and how it is used, and
individuals must be informed about their rights and the ways in which their data is being used.


Big Data privacy and ethics

❖ Big Data Ethical Concerns

Definition: Ethical concerns in Big Data refer to the moral and ethical implications of the use of large
amounts of personal data, including issues such as bias, discrimination, and fairness.

Example: Algorithms used for decision-making, such as those used for hiring or lending, may
perpetuate existing biases and discriminate against certain groups if not designed and tested to be fair
and unbiased.

Point: It is crucial to consider the ethical implications of Big Data and ensure that its use aligns with
fundamental values such as fairness, non-discrimination, and privacy.


Big Data privacy and ethics

❖ Best Practices for Big Data Privacy and Ethics

Definition: Best practices for Big Data privacy and ethics refer to guidelines and strategies that
organizations can follow to ensure that their use of Big Data aligns with privacy and ethical principles.

Example: Best practices include implementing privacy-by-design principles, conducting privacy impact assessments, and ensuring that data is collected and used with the informed consent of individuals.

Point: Organizations must prioritize privacy and ethics in their use of Big Data and take steps to
ensure that their practices align with these principles.

Big Data has the potential to transform various industries and bring significant
benefits, but it is essential to address the privacy and ethical concerns associated
with its use. By following best practices for Big Data privacy and ethics,
organizations can ensure that they are using this technology in a responsible and
ethical manner while also protecting individuals' rights.

Introduction to Big Data Analytics

Definition: The process of analyzing and interpreting vast amounts of data to uncover hidden
patterns, correlations and other insights.

Importance: Helps organizations make data-driven decisions, improve efficiency, and gain a
competitive advantage.

Characteristics:

• Volume: Large amounts of data being generated every day
• Variety: Different types of data, such as structured and unstructured data
• Velocity: Speed at which data is generated and processed
• Veracity: Quality and accuracy of data

❖ Methods of Big Data Analytics
Descriptive Analytics: Summarizing and describing data to understand past events and trends.
Predictive Analytics: Using statistical models and algorithms to predict future outcomes.
Prescriptive Analytics: Analyzing data to suggest and implement actions that can be taken to
achieve a desired outcome.
Real-time Analytics: Analyzing data as it is generated and making decisions on the spot.

❖ Tools and Technologies


•Hadoop: Open-source software framework for distributed storage and processing of big data.
•Spark: An open-source data processing engine for large-scale data processing.
•NoSQL databases: Non-relational databases that are designed to handle large amounts of
unstructured data.
•Machine Learning: A type of artificial intelligence that enables systems to learn from data without
being explicitly programmed.
❖ Applications of Big Data Analytics

Healthcare: Analyzing patient data to improve diagnoses and treatments.
Retail: Analyzing customer data to improve customer experience and drive sales.
Finance: Analyzing market trends and risk management.
Manufacturing: Analyzing production data to improve efficiency and reduce waste.

❖ Challenges of Big Data Analytics

Data Privacy: Ensuring that sensitive data is kept confidential and secure.
Data Integration: Combining and cleaning data from multiple sources.
Data Quality: Ensuring that data is accurate and relevant.
Data Processing: Dealing with large amounts of data in a timely and efficient manner.


Big Data Analytics is a rapidly growing field that provides organizations with the ability
to make informed decisions based on data. It requires the use of advanced tools and
technologies, as well as the ability to overcome challenges such as data privacy, data
integration, and data quality. As the amount of data being generated continues to grow,
the importance of Big Data Analytics will only continue to increase.


Challenges of conventional systems

❖ Introduction to Conventional Systems

Definition: Conventional systems refer to traditional methods of data processing and management that
have been in use for decades.

Limitations: Conventional systems face a number of challenges when it comes to managing big data.


Challenges of conventional systems

❖ Limited Storage Capacity

Definition: Conventional systems often have limited storage capacity, which makes it difficult to store
large amounts of data.
Example: A legacy system with only 500GB of storage space may struggle to store petabyte-sized
data sets.

❖ Processing Speed

Definition: Conventional systems are often slow in processing large amounts of data, leading to delays
and inefficiencies.
Example: A traditional system may take several hours to process a terabyte of data, whereas a big
data platform could process the same data in a matter of minutes.

Challenges of conventional systems

❖ Scalability Issues

Definition: Conventional systems are not easily scalable, meaning they cannot easily adapt to
changing data volume or complexity.
Example: A conventional system may have difficulty handling a sudden increase in data volume,
such as a surge in online sales during the holiday season.

❖ Inadequate Data Management

Definition: Conventional systems often have limited data management capabilities, making it
difficult to organize and analyze data in a meaningful way.
Example: A traditional system may not have the ability to filter or categorize data, making it
difficult to make informed business decisions.

Challenges of conventional systems

❖ Lack of Integration

Definition: Conventional systems may not be easily integrated with other systems, making it
difficult to share data and work collaboratively.
Example: A legacy system may not be able to easily exchange data with a modern cloud-based
system, leading to data silos and decreased efficiency.

Conventional systems face a number of challenges in managing big data, including limited
storage capacity, slow processing speed, scalability issues, inadequate data management,
and a lack of integration. These challenges highlight the need for modern big data
platforms to effectively manage and utilize large amounts of data.


Intelligent data analysis

❖ Introduction to Intelligent Data Analysis in Big Data

Definition: The process of using advanced analytical techniques and algorithms to extract
meaningful insights from vast amounts of data.

Importance: With the explosion of big data, traditional data analysis methods are no longer
effective. Intelligent data analysis helps organizations make better decisions and improve their
overall performance.

Example: A retail company uses intelligent data analysis to analyze customer behavior and
buying patterns, leading to personalized marketing strategies and increased sales.


Intelligent data analysis

❖ Characteristics of Intelligent Data Analysis

Advanced analytical techniques: Machine learning, predictive analytics, and deep learning
algorithms are used to analyze vast amounts of data.
Automated insights: Intelligent data analysis software can automate the discovery of
insights, reducing the time and effort required for manual analysis.
Scalability: The ability to handle large amounts of data and provide real-time insights is a key
characteristic of intelligent data analysis.
Integration: Integration with other data sources, such as social media and IoT devices,
provides a more comprehensive view of data.


Intelligent data analysis

❖ Benefits of Intelligent Data Analysis


Improved decision making: By providing accurate and relevant insights, intelligent data analysis
helps organizations make better decisions and improve their overall performance.
Increased efficiency: Automated insights and real-time analysis reduce the time and effort
required for manual analysis, freeing up valuable resources.
Enhanced customer experience: By analyzing customer behavior and preferences, intelligent
data analysis helps organizations provide a personalized experience, leading to increased
customer satisfaction and loyalty.
Better risk management: Intelligent data analysis helps organizations identify potential risks and
make proactive decisions to mitigate them.

Intelligent data analysis

❖ Applications of Intelligent Data Analysis

Customer behavior analysis: Understanding customer behavior and preferences is critical for
success in many industries, including retail, e-commerce, and banking.
Fraud detection: Intelligent data analysis can help organizations detect and prevent fraudulent
activities, leading to increased security and improved customer trust.
Supply chain optimization: By analyzing data from suppliers, manufacturers, and distributors,
organizations can optimize their supply chain and reduce costs.

Intelligent data analysis is a critical component of big data analysis, providing valuable insights
that can improve decision making, enhance customer experience, and optimize operations. As
big data continues to grow, intelligent data analysis will play an increasingly important role in
helping organizations make the most of their data.


Introduction to Nature of Data in Big Data

Definition: The nature of data refers to the characteristics of the data being collected and analyzed.

•Point 1: Variety - Big data encompasses a wide variety of data types such as text, images, videos,
audio, and sensor data.
•Point 2: Volume - Big data refers to the massive amounts of data being generated on a daily basis.
•Point 3: Velocity - The speed at which big data is being generated and processed is rapidly
increasing.
•Point 4: Veracity - The uncertainty and unpredictability of big data is a major challenge in the analysis
process.

Example: A social media platform like Twitter generates a huge volume of text data in real-time. This
data is varied in nature as it includes text, images, videos, and more. It is also unpredictable in terms of
its content, as people post various types of messages at different times of the day.
Introduction to Nature of Data in Big Data

Understanding the Importance of Nature of Data in Big Data

Point 1: Data Management - Understanding the nature of data is essential for effective data
management and storage.
Point 2: Data Analysis - The nature of data determines the type of algorithms and techniques that can be
used to analyze it.
Point 3: Data Quality - The quality of data is closely tied to its nature and affects the accuracy of analysis
results.
Point 4: Data Privacy - The nature of data can also impact privacy concerns and the need for data
protection measures.

Example: In the healthcare industry, patient data is private and sensitive. Understanding the nature of this
data is crucial for ensuring that it is stored, analyzed, and protected in a secure manner.


The nature of data plays a crucial role in the big data ecosystem and affects
the way data is managed, analyzed, and used. It is essential for
organizations to understand the nature of data and to use the right tools and
techniques to analyze it effectively. With the increasing volume and
complexity of big data, understanding the nature of data will become even
more important in the future.

Analytic processes and tools

Definition: Analytic processes and tools are a set of techniques and methodologies used to process
and analyze big data in order to extract meaningful insights and knowledge from it.

Importance: With the explosion of big data, the use of analytic processes and tools has become
essential in order to make sense of the vast amounts of information and turn it into valuable insights.

❖ Data Collection and Preparation

Definition: The first step in the analytical process is the collection and preparation of the data, which
involves gathering and cleaning the data in order to make it usable for analysis.
Example: In a customer analytics project, data from various sources such as social media, customer
feedback, and purchase history are collected and then cleaned and transformed to prepare for
analysis.
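A hypothetical sketch of this preparation step, on invented data: records from two sources are merged, email addresses are normalised, and exact duplicates are dropped before analysis.

```python
social_media = [{"email": "Ana@Example.com ", "mentions": 3}]
purchases = [
    {"email": "ana@example.com", "orders": 2},
    {"email": "ana@example.com", "orders": 2},  # duplicate row from an export
]

def prepare(records):
    seen, cleaned = set(), []
    for record in records:
        # Normalise: trim whitespace and lowercase the email field.
        record = {**record, "email": record["email"].strip().lower()}
        key = tuple(sorted(record.items()))
        if key not in seen:  # drop exact duplicates
            seen.add(key)
            cleaned.append(record)
    return cleaned

prepared = prepare(social_media + purchases)
```

Real projects would add schema validation and handling of missing values, but normalise-then-deduplicate is the core of most cleaning pipelines.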


❖ Data Exploration

Definition: Data exploration involves exploring the data to understand its characteristics,
relationships, and patterns.
Example: In a customer analytics project, data exploration might include creating visualizations and
graphs to understand the distribution of customer demographics and purchasing behavior.
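A small exploration sketch on invented customer data: before any modelling, summary statistics and distributions reveal the shape of the data.

```python
import statistics
from collections import Counter

ages = [23, 31, 45, 23, 38, 52, 31, 23]
spend = [120.0, 340.5, 80.0, 560.0, 230.0, 90.0, 410.0, 150.0]

age_distribution = Counter(ages)         # which ages dominate the customer base
median_spend = statistics.median(spend)  # robust central tendency
spend_spread = statistics.stdev(spend)   # how dispersed purchasing is

print(age_distribution.most_common(1))   # → [(23, 3)]
print(median_spend)                      # → 190.0
```

The same questions answered here numerically are usually answered visually (histograms, scatter plots) in a real exploration pass.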

❖ Modeling

Definition: Modeling involves creating mathematical representations of the data to understand the
relationships and patterns between different variables.
Example: In a customer analytics project, modeling might include creating a regression model to
predict customer lifetime value based on demographic and purchasing data.
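The regression idea can be sketched with a one-variable least-squares fit predicting lifetime value from order count. This uses the standard library only and invented, perfectly linear data; a real project would use a statistics library and far noisier inputs.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

orders = [1, 2, 3, 4, 5]                    # past orders per customer
lifetime_value = [100, 200, 300, 400, 500]  # observed value per customer

slope, intercept = fit_line(orders, lifetime_value)
predicted = slope * 6 + intercept  # forecast for a six-order customer
```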


❖ Visualization

Definition: Visualization is the process of creating graphs, charts, and other visual representations of
the data to communicate insights and findings to stakeholders.
Example: In a customer analytics project, visualization might include creating a dashboard that shows
customer demographics, purchase behavior, and lifetime value.
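A hypothetical visualisation sketch: a plain-text bar chart of purchases per customer segment. Dashboard tools such as Tableau do this interactively; the point here is only that visual encoding makes comparisons immediate.

```python
segments = {"new": 12, "returning": 30, "vip": 6}

def bar_chart(data, width=30):
    """Render a simple horizontal bar chart as text."""
    peak = max(data.values())
    rows = []
    for name, value in data.items():
        bar = "#" * round(value / peak * width)  # scale bars to the largest value
        rows.append(f"{name:<10}{bar} {value}")
    return "\n".join(rows)

print(bar_chart(segments))
```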

❖ Data Mining

Definition: Data mining is the process of discovering hidden patterns and relationships in the data
through the use of statistical techniques and algorithms.
Example: In a customer analytics project, data mining might include discovering customer segments
based on purchasing behavior and creating recommendations for personalized marketing.
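One common data-mining technique for discovering such segments is clustering. Below is a tiny one-dimensional k-means on invented annual-spend figures; a real project would use a library implementation, but the sketch shows how the data can reveal its own groups.

```python
def kmeans_1d(values, centers, rounds=10):
    """Cluster 1-D values around the given initial centres."""
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for v in values:  # assign each value to its nearest centre
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # move each centre to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [50, 60, 55, 900, 950, 880]  # low spenders vs. high spenders
centers, clusters = kmeans_1d(spend, centers=[0.0, 1000.0])
```

The two recovered centres (roughly 55 and 910) correspond to a budget segment and a premium segment, each of which could then receive its own marketing treatment.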


❖ Machine Learning

Definition: Machine learning is a branch of artificial intelligence that involves using algorithms to learn
from data and make predictions.
Example: In a customer analytics project, machine learning might include creating a recommendation
engine that predicts which products a customer is likely to purchase based on their past behavior.
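A toy version of such a recommendation engine, on invented purchase baskets: suggest the product most often bought alongside what the customer already owns. A production engine would learn from far richer behaviour, but co-occurrence counting is the simplest form of the idea.

```python
from collections import Counter

baskets = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"laptop", "mouse"},
    {"phone", "case"},
]

def recommend(owned, baskets):
    """Suggest the item most frequently co-purchased with the owned items."""
    counts = Counter()
    for basket in baskets:
        if owned & basket:                 # basket shares an item with the customer
            counts.update(basket - owned)  # count the co-purchased products
    return counts.most_common(1)[0][0] if counts else None

print(recommend({"laptop"}, baskets))  # → mouse
```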

The use of analytic processes and tools is crucial in making sense of big data and turning it into
valuable insights. From data collection and preparation, to modeling and visualization, to data
mining and machine learning, the right analytical approach will vary depending on the problem
and the data.


Analysis vs Reporting

•Analysis is the process of breaking down data and information to gain insights and understand patterns.
•Reporting is the process of presenting information and data in a structured and organized format.

Analysis                               Reporting
Examines data and information          Presents data and information
Provides insights and understanding    Provides facts and figures
Involves critical thinking             Involves summarizing and presenting
Involves decision making               Involves informing and communicating


Example:

•An analyst may analyze sales data to understand why sales have been declining, and make
recommendations for improvement.
•A report on sales may present the sales figures, but does not necessarily provide insights or
recommendations.
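The contrast can be shown in a few lines on invented monthly sales figures: reporting states the numbers, while analysis examines the trend behind them to support a decision.

```python
sales = {"Jan": 120, "Feb": 110, "Mar": 95, "Apr": 80}

# Reporting: present the facts.
total = sum(sales.values())
report = f"Total sales: {total} units"

# Analysis: look for the pattern and draw a conclusion.
values = list(sales.values())
declining = all(later < earlier for earlier, later in zip(values, values[1:]))
insight = ("Sales fall every month; investigate pricing and competition."
           if declining else "No consistent decline detected.")
```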


Introduction to Modern Data Analytics Tools in Big Data

Definition: Modern data analytics tools are computer software applications designed to process and
analyze large amounts of data to provide valuable insights and support data-driven decision making.

Importance: With the increase in the volume, velocity, and variety of data, these tools have become
crucial for businesses to make informed decisions and stay ahead of the competition.

Key Features: The key features of modern data analytics tools include real-time data processing,
scalable architecture, advanced data visualization, and integration with multiple data sources.

Example: Some popular examples of modern data analytics tools include Hadoop, Spark, Tableau, and
PowerBI.


❖ Hadoop

Definition: Hadoop is an open-source software framework for storing and processing big data in a
distributed computing environment.

Importance: Hadoop enables organizations to store and process massive amounts of structured and
unstructured data, enabling them to make informed decisions.

Key Features: Hadoop provides a scalable and fault-tolerant platform for data processing, making it
ideal for big data applications. It also supports multiple programming languages and integrates with
other big data tools.

Example: Hadoop is used by organizations in various industries, including finance, healthcare, and
retail, to process and analyze customer data, perform predictive analytics, and more.
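Hadoop's processing model is MapReduce, which can be sketched in plain Python: a map phase emits (word, 1) pairs, a shuffle groups them by key, and a reduce phase sums each group. Hadoop runs this same pattern in parallel across a cluster of machines.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values to get a count per word."""
    return {word: sum(ones) for word, ones in groups.items()}

lines = ["big data big ideas", "data beats opinions"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

On a cluster, the map and reduce functions stay the same; the framework handles distributing the data and the shuffle between machines.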


❖ Spark

Definition: Apache Spark is a fast, in-memory data processing engine that is designed to process big
data efficiently.

Importance: Spark is ideal for large-scale data processing, enabling organizations to analyze data in
real-time and make informed decisions quickly.

Key Features: Spark offers a flexible and scalable architecture that supports batch processing,
interactive queries, and real-time streaming. It also supports multiple programming languages and
integrates with other big data tools.

Example: Spark is used by organizations in various industries, including healthcare,
telecommunications, and transportation, to perform data analysis, machine learning, and more.
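Spark's style of chained, lazy transformations can be sketched with plain Python generators. In real PySpark the same shape appears as something like a filter followed by a map and then an action; nothing executes until the final aggregation pulls data through the pipeline, which is part of what makes Spark efficient.

```python
readings = [("sensor-1", 20.5), ("sensor-2", -999.0),  # -999.0: a bad reading
            ("sensor-1", 21.0), ("sensor-3", 19.5)]

valid = (r for r in readings if r[1] > -100.0)  # like a filter() transformation
temps = (temp for _name, temp in valid)         # like a map() transformation

# The "action": only now do the lazy generators actually run.
total, count = 0.0, 0
for t in temps:
    total, count = total + t, count + 1
average = total / count
```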


❖ Tableau

Definition: Tableau is a data visualization and business intelligence tool that enables organizations to
analyze and present data in an interactive and visually appealing way.

Importance: Tableau helps organizations make sense of their data by creating interactive dashboards,
reports, and charts, enabling them to make informed decisions.

Key Features: Tableau offers an intuitive drag-and-drop interface, real-time data processing, and a
large library of templates and visualizations. It also integrates with multiple data sources, including
Hadoop and Spark.

Example: Tableau is used by organizations in various industries, including finance, retail, and
healthcare, to create interactive dashboards, reports, and charts for data analysis and presentation.


❖ PowerBI

Definition: PowerBI is a cloud-based business intelligence and data visualization tool offered by
Microsoft.

Importance: PowerBI enables organizations to analyze and present data in a visually appealing way,
making it easier to understand and make informed decisions.

Key Features: PowerBI offers an intuitive drag-and-drop interface, real-time data processing, and
integration with other Microsoft tools. It also supports multiple data sources, including Hadoop and
Spark.

Example: PowerBI is used by organizations in various industries, including finance, healthcare, and
retail, to create interactive dashboards, reports, and charts for data analysis and presentation.

The use of modern data analytics tools has become increasingly important as the volume,
velocity, and variety of data continue to grow. Tools such as Hadoop, Spark, Tableau, and
PowerBI help organizations turn that data into timely, actionable insights.



Thank You
