
Unit-1 Introduction of Big Data

The document provides an overview of Big Data, highlighting its characteristics defined by volume, variety, and velocity, and the challenges associated with managing and analyzing vast amounts of data. It discusses the significance of Big Data across various industries, emphasizing the need for new tools and technologies to derive insights and business value. Additionally, it outlines different data structures, including structured, semi-structured, quasi-structured, and unstructured data, and the role of enterprise data warehouses and analytic sandboxes in managing data effectively.

Uploaded by

Suja Mary

Data Science and Big Data Analytics
Unit - I
Big Data Overview
 Data is created constantly, and at an ever-increasing rate.
 Mobile phones, social media, and imaging technologies used to determine a medical diagnosis all create new data, and that data must be stored somewhere for some purpose.
 Devices and sensors automatically generate diagnostic information that needs to be stored and processed in real time.
 Merely keeping up with this huge influx of data is difficult; substantially more challenging is analyzing vast amounts of it, especially when it does not conform to traditional notions of data structure, to identify meaningful patterns and extract useful information.
 These challenges of the data deluge present the opportunity to transform business, government, science, and everyday life.
Several industries have led the way in developing their
ability to gather and exploit data:
• Credit card companies monitor every purchase their
customers make and can identify fraudulent purchases
with a high degree of accuracy using rules derived by
processing billions of transactions.
• Mobile phone companies analyze subscribers' calling
patterns to determine, for example, whether a caller's
frequent contacts are on a rival network. If that rival
network is offering an attractive promotion that might
cause the subscriber to defect, the mobile phone
company can proactively offer the subscriber an
incentive to remain in her contract.
• For companies such as LinkedIn and Facebook, data
itself is their primary product. The valuations of these
companies are heavily derived from the data they gather
and host, which contains more and more intrinsic value
as the data grows.
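The rule-driven fraud monitoring described above can be sketched as a small set of checks applied to a stream of transactions. The rules and thresholds below are invented for illustration; real issuers derive far more sophisticated rules from billions of transactions.

```python
# Illustrative rule-based fraud screening over a stream of card transactions.
# The rules and thresholds are hypothetical examples, not real issuer logic.

def flag_suspicious(transactions):
    """Return (transaction, reasons) pairs that trip any simple rule."""
    flagged = []
    last_country = {}
    for t in transactions:  # each t: dict with card, amount, country
        reasons = []
        if t["amount"] > 5000:                       # unusually large purchase
            reasons.append("high amount")
        prev = last_country.get(t["card"])
        if prev is not None and prev != t["country"]:
            reasons.append("country changed")        # sudden change of location
        last_country[t["card"]] = t["country"]
        if reasons:
            flagged.append((t, reasons))
    return flagged

txns = [
    {"card": "A", "amount": 40.0,   "country": "US"},
    {"card": "A", "amount": 6200.0, "country": "FR"},  # large AND abroad
    {"card": "B", "amount": 12.5,   "country": "IN"},
]
for t, why in flag_suspicious(txns):
    print(t["card"], why)
```

In practice such rules would be mined from historical transaction data rather than hand-written, but the shape of the computation is the same: evaluate each incoming transaction against the rule set as it arrives.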
Three attributes stand out as defining Big Data
characteristics:
• Huge volume of data: Rather than thousands
or millions of rows, Big Data can be billions of
rows and millions of columns.
• Complexity of data types and structures:
Big Data reflects the variety of new data
sources, formats, and structures, including
digital traces being left on the web and other
digital repositories for subsequent analysis.
• Speed of new data creation and growth: Big
Data can describe high velocity data, with rapid
data ingestion and near real time analysis.
 Big Data is described as having 3 Vs: volume,
variety, and velocity
 Big Data problems require new tools and
technologies to store, manage, and realize the
business benefit. These new tools and
technologies enable the creation, manipulation,
and management of large datasets and the
storage environments that house them.
 Definition: Big Data is data whose scale,
distribution, diversity, and/or timeliness
require the use of new technical architectures
and analytics to enable insights that unlock
new sources of business value.
Figure highlights several sources of the Big
Data deluge.
Example
 In 2012 Facebook users posted 700 status updates per second
worldwide, which can be leveraged to deduce latent interests or
political views of users and show relevant ads.
 For instance, an update in which a woman changes her
relationship status from "single" to "engaged" would trigger ads
on bridal dresses, wedding planning, or name-changing services.
 Facebook can also construct social graphs to analyze which
users are connected to each other as an interconnected network.
 In March 2013, Facebook released a new feature called "Graph
Search," enabling users and developers to search social graphs
for people with similar interests, hobbies, and shared locations.
 Genetic sequencing and human genome mapping provide a
detailed understanding of genetic makeup and lineage.
 The health care industry is looking toward these advances to
help predict which illnesses a person is likely to get in his
lifetime and take steps to avoid these maladies or reduce their
impact through the use of personalized medicine and treatment.
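A feature like Graph Search can be thought of as a query over an interconnected network of users. A minimal sketch of the idea, using a tiny invented graph and invented interest data, is a breadth-first search that collects nearby users sharing an interest:

```python
# Toy social graph: find users within two hops who share an interest.
# The users, friendships, and interests are invented for illustration only.
from collections import deque

friends = {
    "ana": {"ben", "cara"},
    "ben": {"ana", "dev"},
    "cara": {"ana", "dev"},
    "dev": {"ben", "cara", "eli"},
    "eli": {"dev"},
}
interests = {
    "ana": {"hiking", "jazz"},
    "ben": {"chess"},
    "cara": {"hiking"},
    "dev": {"hiking", "chess"},
    "eli": {"jazz"},
}

def search_graph(start, interest, max_depth=2):
    """BFS out to max_depth hops, collecting users who share the interest."""
    seen, hits = {start}, []
    queue = deque([(start, 0)])
    while queue:
        user, depth = queue.popleft()
        if user != start and interest in interests.get(user, set()):
            hits.append(user)
        if depth < max_depth:
            for f in friends.get(user, set()):
                if f not in seen:
                    seen.add(f)
                    queue.append((f, depth + 1))
    return sorted(hits)

print(search_graph("ana", "hiking"))  # cara (1 hop) and dev (2 hops)
```

At Facebook's scale the same logical query runs against a graph of billions of nodes, which is precisely why such workloads demand the new storage and processing technologies this unit describes.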
Data Structures
Big Data can come in multiple forms, including
structured and nonstructured data such as
financial data, text files, multimedia files, and
genetic mappings.
Figure shows four types of data structures,
with 80-90% of future data growth coming from
nonstructured data types.
 The RDBMS may store characteristics of the
support calls as typical structured data, with
attributes such as time stamps, machine type,
problem type, and operating system.
 Unstructured, quasi-structured, or semi-structured
data includes free-form call log information taken
from an e-mail ticket of the problem, customer
chat history, a transcript of a phone call
describing the technical problem and its
solution, or an audio file of the phone call
conversation.
 Different techniques are required to meet the
challenges of analyzing semi-structured,
quasi-structured, and unstructured data.
 Structured data: Data containing a defined data type,
format, and structure (that is, transaction data, online
analytical processing [OLAP] data cubes, traditional
RDBMS tables, CSV files, and even simple spreadsheets).
 Semi-structured data: Textual data files with a discernible
pattern that enables parsing (such as Extensible Markup
Language [XML] data files that are self-describing and
defined by an XML schema).
 Quasi-structured data: Textual data with erratic data
formats that can be formatted with effort, tools, and
time (for instance, web clickstream data that may
contain inconsistencies in data values and formats).
 Unstructured data: Data that has no inherent structure,
which may include text documents, PDFs, images, and video.
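The four categories above can be made concrete with one tiny sample of each. All of the sample records below are invented, and the clickstream line is just one plausible log format:

```python
# One toy sample of each data structure category described above.
# All sample data here is invented for illustration.
import csv, io, re
import xml.etree.ElementTree as ET

# Structured: CSV with a fixed schema -- every row has the same typed fields.
rows = list(csv.DictReader(io.StringIO("id,amount\n1,9.99\n2,14.50\n")))
print(rows[0]["amount"])          # fields parse directly by name

# Semi-structured: self-describing XML -- tags carry the schema with the data.
doc = ET.fromstring("<ticket><os>linux</os><problem>boot</problem></ticket>")
print(doc.find("os").text)

# Quasi-structured: a web clickstream line -- parseable, but only with effort
# and a pattern that must tolerate inconsistent formats.
line = '10.0.0.7 - [12/Mar/2013:10:15:01] "GET /products HTTP/1.1" 200'
m = re.search(r'"(\w+) (\S+) HTTP', line)
print(m.group(1), m.group(2))

# Unstructured: free text -- no inherent structure to parse against.
note = "Customer says the laptop won't boot after the update."
print(len(note.split()), "words")
```

The practical point is the gradient of effort: the CSV parses in one call, the XML describes itself, the clickstream needs a hand-built pattern, and the free text needs entirely different (text-analytics) techniques.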
Analyst Perspective on Data Repositories
 Database administrator training is not required to create
spreadsheets: they can be set up to do many things quickly
and independently of information technology (IT) groups.
 This remains an ongoing challenge because spreadsheet programs such as
Microsoft Excel still run on many computers worldwide. With
the proliferation of data islands (or spreadmarts), the need to
centralize the data is more pressing than ever.
 As data needs grew, so did more scalable data warehousing
solutions. These technologies enabled data to be managed
centrally, providing benefits of security, failover, and a single
repository where users could rely on getting an "official"
source of data for financial reporting or other mission-critical
tasks.
 Enterprise Data Warehouses (EDWs) are critical for reporting
and business intelligence (BI) tasks and solve many of the problems that
proliferating spreadsheets introduce, such as determining which of
multiple versions of a spreadsheet is correct.
 With the EDW model, data is managed and controlled by IT groups
and database administrators (DBAs), and data analysts must
depend on IT for access and changes to the data schemas. This
imposes longer lead times for analysts to get data; most of the
time is spent waiting for approvals rather than starting
meaningful work.
 A solution to this problem is the analytic sandbox, which attempts
to resolve the conflict for analysts and data scientists with EDW
and more formally managed corporate data.
 In this model, the IT group may still manage the analytic
sandboxes, but they will be purposefully designed to enable robust
analytics, while being centrally managed and secured.
 These sandboxes, often referred to as workspaces, are designed to
enable teams to explore many datasets in a controlled fashion and
are not typically used for enterprise level financial reporting and
sales dashboards.
 Analytic sandboxes enable high-performance computing using in-
database processing, in which the analytics occur within the database itself.
The idea is that performance of the analysis will be better if the
analytics are run in the database itself, rather than bringing the
data to an analytical tool that resides somewhere else.
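The in-database idea can be sketched with SQLite, which stands in for an enterprise warehouse purely for illustration: the aggregation executes inside the database engine, and only the small summary result crosses over to the analysis tool.

```python
# In-database processing sketch: push the computation to the database and
# pull back only the summary, instead of extracting all raw rows first.
# SQLite stands in for an enterprise warehouse purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 250.0), ("west", 80.0)],
)

# The GROUP BY aggregation runs inside the database engine;
# only two summary rows are transferred back to the client.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(summary)  # [('east', 350.0), ('west', 80.0)]
conn.close()
```

With three rows the difference is invisible, but when the table holds billions of rows, moving only the aggregated result rather than the raw data is what makes the sandbox workload feasible.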