
Unit-1 Introduction of Big Data

The document provides an overview of Big Data, highlighting its characteristics defined by volume, variety, and velocity, and the challenges associated with managing and analyzing vast amounts of data. It discusses the significance of Big Data across various industries, emphasizing the need for new tools and technologies to derive insights and business value. Additionally, it outlines different data structures, including structured, semi-structured, quasi-structured, and unstructured data, and the role of enterprise data warehouses and analytic sandboxes in managing data effectively.

Uploaded by

Suja Mary

Data Science and Big Data Analytics
Unit - I
Big Data Overview
 Data is created constantly, and at an ever-increasing rate.
 Mobile phones, social media, and imaging technologies used to determine a medical diagnosis all create new data, and that data must be stored somewhere for some purpose.
 Devices and sensors automatically generate diagnostic information that needs to be stored and processed in real time.
 Merely keeping up with this huge influx of data is difficult; substantially more challenging is analyzing vast amounts of it, especially when it does not conform to traditional notions of data structure, to identify meaningful patterns and extract useful information.
 These challenges of the data deluge present the opportunity to transform business, government, science, and everyday life.
Several industries have led the way in developing their
ability to gather and exploit data:
• Credit card companies monitor every purchase their
customers make and can identify fraudulent purchases
with a high degree of accuracy using rules derived by
processing billions of transactions.
• Mobile phone companies analyze subscribers' calling
patterns to determine, for example, whether a caller's
frequent contacts are on a rival network. If that rival
network is offering an attractive promotion that might
cause the subscriber to defect, the mobile phone
company can proactively offer the subscriber an
incentive to remain in her contract.
• For companies such as LinkedIn and Facebook, data
itself is their primary product. The valuations of these
companies are heavily derived from the data they gather
and host, which contains more and more intrinsic value
as the data grows.
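The rule-driven fraud monitoring described above can be sketched as a small set of checks applied to a stream of transactions. The rules and thresholds below are invented for illustration; real issuers derive far more sophisticated rules from billions of transactions.

```python
# Illustrative rule-based fraud screening over a stream of card transactions.
# The rules and thresholds are hypothetical examples, not real issuer logic.

def flag_suspicious(transactions):
    """Return (transaction, reasons) pairs that trip any simple rule."""
    flagged = []
    last_country = {}
    for t in transactions:  # each t: dict with card, amount, country
        reasons = []
        if t["amount"] > 5000:                       # unusually large purchase
            reasons.append("high amount")
        prev = last_country.get(t["card"])
        if prev is not None and prev != t["country"]:
            reasons.append("country changed")        # sudden change of location
        last_country[t["card"]] = t["country"]
        if reasons:
            flagged.append((t, reasons))
    return flagged

txns = [
    {"card": "A", "amount": 40.0,   "country": "US"},
    {"card": "A", "amount": 6200.0, "country": "FR"},  # large AND abroad
    {"card": "B", "amount": 12.5,   "country": "IN"},
]
for t, why in flag_suspicious(txns):
    print(t["card"], why)
```

In practice such rules would be mined from historical transaction data rather than hand-written, but the shape of the computation is the same: evaluate each incoming transaction against the rule set as it arrives.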
Three attributes stand out as defining Big Data
characteristics:
• Huge volume of data: Rather than thousands
or millions of rows, Big Data can be billions of
rows and millions of columns.
• Complexity of data types and structures:
Big Data reflects the variety of new data
sources, formats, and structures, including
digital traces being left on the web and other
digital repositories for subsequent analysis.
• Speed of new data creation and growth: Big
Data can describe high velocity data, with rapid
data ingestion and near real time analysis.
 Big Data is described as having 3 Vs: volume,
variety, and velocity
 Big Data problems require new tools and
technologies to store, manage, and realize the
business benefit. These new tools and
technologies enable the creation, manipulation,
and management of large datasets and the
storage environments that house them.
 Definition: Big Data is data whose scale,
distribution, diversity, and/or timeliness
require the use of new technical architectures
and analytics to enable insights that unlock
new sources of business value.
Figure highlights several sources of the Big
Data deluge.
Example
 In 2012 Facebook users posted 700 status updates per second
worldwide, which can be leveraged to deduce latent interests or
political views of users and show relevant ads.
 For instance, an update in which a woman changes her
relationship status from "single" to "engaged" would trigger ads
on bridal dresses, wedding planning, or name-changing services.
 Facebook can also construct social graphs to analyze which
users are connected to each other as an interconnected network.
 In March 2013, Facebook released a new feature called "Graph
Search," enabling users and developers to search social graphs
for people with similar interests, hobbies, and shared locations.
 Genetic sequencing and human genome mapping provide a
detailed understanding of genetic makeup and lineage.
 The health care industry is looking toward these advances to
help predict which illnesses a person is likely to get in his
lifetime and take steps to avoid these maladies or reduce their
impact through the use of personalized medicine and treatment.
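A feature like Graph Search can be thought of as a query over an interconnected network of users. A minimal sketch of the idea, using a tiny invented graph and invented interest data, is a breadth-first search that collects nearby users sharing an interest:

```python
# Toy social graph: find users within two hops who share an interest.
# The users, friendships, and interests are invented for illustration only.
from collections import deque

friends = {
    "ana": {"ben", "cara"},
    "ben": {"ana", "dev"},
    "cara": {"ana", "dev"},
    "dev": {"ben", "cara", "eli"},
    "eli": {"dev"},
}
interests = {
    "ana": {"hiking", "jazz"},
    "ben": {"chess"},
    "cara": {"hiking"},
    "dev": {"hiking", "chess"},
    "eli": {"jazz"},
}

def search_graph(start, interest, max_depth=2):
    """BFS out to max_depth hops, collecting users who share the interest."""
    seen, hits = {start}, []
    queue = deque([(start, 0)])
    while queue:
        user, depth = queue.popleft()
        if user != start and interest in interests.get(user, set()):
            hits.append(user)
        if depth < max_depth:
            for f in friends.get(user, set()):
                if f not in seen:
                    seen.add(f)
                    queue.append((f, depth + 1))
    return sorted(hits)

print(search_graph("ana", "hiking"))  # cara (1 hop) and dev (2 hops)
```

At Facebook's scale the same logical query runs against a graph of billions of nodes, which is precisely why such workloads demand the new storage and processing technologies this unit describes.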
Data Structures
Big Data can come in multiple forms, including
structured and nonstructured data such as
financial data, text files, multimedia files, and
genetic mappings.
Figure shows four types of data structures,
with 80-90% of future data growth coming from
nonstructured data types.
 The RDBMS may store characteristics of the
support calls as typical structured data, with
attributes such as time stamps, machine type,
problem type, and operating system.
 Unstructured, quasi-structured, or semi-structured
data includes free-form call log information taken
from an e-mail ticket of the problem, customer
chat history, a transcript of a phone call
describing the technical problem and its
solution, or an audio file of the phone call
conversation.
 Different techniques are required to meet the
challenges of analyzing semi-structured,
quasi-structured, and unstructured data.
 Structured data: Data containing a defined data type,
format, and structure (that is, transaction data, online
analytical processing [OLAP] data cubes, traditional
RDBMS tables, CSV files, and even simple spreadsheets).
 Semi-structured data: Textual data files with a discernible
pattern that enables parsing (such as Extensible Markup
Language [XML] data files that are self-describing and
defined by an XML schema).
 Quasi-structured data: Textual data with erratic data
formats that can be formatted with effort, tools, and
time (for instance, web clickstream data that may
contain inconsistencies in data values and formats).
 Unstructured data: Data that has no inherent structure,
which may include text documents, PDFs, images, and video.
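The four categories above can be made concrete with one tiny sample of each. All of the sample records below are invented, and the clickstream line is just one plausible log format:

```python
# One toy sample of each data structure category described above.
# All sample data here is invented for illustration.
import csv, io, re
import xml.etree.ElementTree as ET

# Structured: CSV with a fixed schema -- every row has the same typed fields.
rows = list(csv.DictReader(io.StringIO("id,amount\n1,9.99\n2,14.50\n")))
print(rows[0]["amount"])          # fields parse directly by name

# Semi-structured: self-describing XML -- tags carry the schema with the data.
doc = ET.fromstring("<ticket><os>linux</os><problem>boot</problem></ticket>")
print(doc.find("os").text)

# Quasi-structured: a web clickstream line -- parseable, but only with effort
# and a pattern that must tolerate inconsistent formats.
line = '10.0.0.7 - [12/Mar/2013:10:15:01] "GET /products HTTP/1.1" 200'
m = re.search(r'"(\w+) (\S+) HTTP', line)
print(m.group(1), m.group(2))

# Unstructured: free text -- no inherent structure to parse against.
note = "Customer says the laptop won't boot after the update."
print(len(note.split()), "words")
```

The practical point is the gradient of effort: the CSV parses in one call, the XML describes itself, the clickstream needs a hand-built pattern, and the free text needs entirely different (text-analytics) techniques.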
Analyst Perspective on Data Repositories
 Database administrator training is not required to create
spreadsheets: they can be set up to do many things quickly
and independently of information technology (IT) groups.
 This remains an ongoing challenge because spreadsheet programs such as
Microsoft Excel still run on many computers worldwide. With
the proliferation of data islands (or spreadmarts), the need to
centralize the data is more pressing than ever.
 As data needs grew, so did more scalable data warehousing
solutions. These technologies enabled data to be managed
centrally, providing benefits of security, failover, and a single
repository where users could rely on getting an "official"
source of data for financial reporting or other mission-critical
tasks.
 Enterprise Data Warehouses (EDWs) are critical for reporting
and business intelligence (BI) tasks and solve many of the problems that
proliferating spreadsheets introduce, such as determining which of
multiple versions of a spreadsheet is correct.
 With the EDW model, data is managed and controlled by IT groups
and database administrators (DBAs), and data analysts must
depend on IT for access and changes to the data schemas. This
imposes longer lead times for analysts to get data; most of the
time is spent waiting for approvals rather than starting
meaningful work.
 A solution to this problem is the analytic sandbox, which attempts
to resolve the conflict for analysts and data scientists with EDW
and more formally managed corporate data.
 In this model, the IT group may still manage the analytic
sandboxes, but they will be purposefully designed to enable robust
analytics, while being centrally managed and secured.
 These sandboxes, often referred to as workspaces, are designed to
enable teams to explore many datasets in a controlled fashion and
are not typically used for enterprise level financial reporting and
sales dashboards.
 Analytic sandboxes enable high-performance computing using in-
database processing, in which the analytics occur within the database itself.
The idea is that performance of the analysis will be better if the
analytics are run in the database itself, rather than bringing the
data to an analytical tool that resides somewhere else.
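The in-database idea can be sketched with SQLite, which stands in for an enterprise warehouse purely for illustration: the aggregation executes inside the database engine, and only the small summary result crosses over to the analysis tool.

```python
# In-database processing sketch: push the computation to the database and
# pull back only the summary, instead of extracting all raw rows first.
# SQLite stands in for an enterprise warehouse purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 250.0), ("west", 80.0)],
)

# The GROUP BY aggregation runs inside the database engine;
# only two summary rows are transferred back to the client.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(summary)  # [('east', 350.0), ('west', 80.0)]
conn.close()
```

With three rows the difference is invisible, but when the table holds billions of rows, moving only the aggregated result rather than the raw data is what makes the sandbox workload feasible.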