Big Data

The document discusses big data and how organizations are dealing with increasingly large datasets. It begins with examples of how much data companies like Google process daily, which runs into petabytes. It then explains that traditional databases and tools are not equipped to handle such volume, velocity, and variety of data. The rest of the document summarizes what big data is, its common characteristics (volume, velocity, and variety), and how distributed frameworks like Hadoop provide a scalable solution for storing and processing big data across clusters of servers.

Uploaded by

zeeshan

Prepared by: EduTechLearners

How much time did it take?

Excel: Have you ever tried a pivot table on a 500 MB file?
SAS/R: Have you ever tried a frequency table on a 2 GB file?
Access: Have you ever tried running a query on a 10 GB file?
SQL: Have you ever tried running a query on a 50 GB file?

Can you imagine?

Can you think of running a query on a 20,980,000 GB file?
What if we get a new data set like this every day?
What if we need to execute complex queries on this data set every day?
Does anybody really deal with this type of data set?
Is it possible to store and analyze this data?
Yes. Google deals with more than 20 PB of data every day.
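
To see how the 20,980,000 GB figure relates to the 20 PB figure, here is a small unit-conversion sketch. It assumes binary units (1 TB = 1024 GB, 1 PB = 1024 TB); the function name gb_to_pb is made up for illustration.

```python
# Assumption: binary units, i.e. 1 TB = 1024 GB and 1 PB = 1024 TB.
GB_PER_TB = 1024
TB_PER_PB = 1024


def gb_to_pb(gigabytes: float) -> float:
    """Convert gigabytes to petabytes using binary units."""
    return gigabytes / GB_PER_TB / TB_PER_PB


size_pb = gb_to_pb(20_980_000)
print(f"{size_pb:.2f} PB")  # 20.01 PB
```

With decimal units (1 PB = 1,000,000 GB) the same file would be 20.98 PB, so either way it is "more than 20 PB".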

In fact, in a single minute:
Email users send more than 204 million messages;
Mobile Web receives 217 new users;
Google receives over 2 million search queries;
YouTube users upload 48 hours of new video;
Facebook users share 684,000 pieces of content;
Twitter users send more than 100,000 tweets;
Consumers spend $272,000 on Web shopping;
Apple receives around 47,000 application downloads;
Brands receive more than 34,000 Facebook 'likes';
Tumblr blog owners publish 27,000 new posts;
Instagram users share 3,600 new photos;
Flickr users, on the other hand, add 3,125 new photos;
Foursquare users perform 2,000 check-ins;
WordPress users publish close to 350 new blog posts.
And these figures are from a year ago!

A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications

Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it

Big Data is similar to small data, but bigger in size

An aim to solve new problems, or old problems in a better way

Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.

Volume: data quantity
Velocity: data speed
Variety: data types

A typical PC might have had 10 gigabytes of storage in 2000.
Today, Facebook ingests 500 terabytes of new data every day.
A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
Smartphones, the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.

Clickstreams and ad impressions capture user behavior at millions of events per second

High-frequency stock trading algorithms reflect market changes within microseconds

Machine-to-machine processes exchange data between billions of devices

Infrastructure and sensors generate massive log data in real time

Online gaming systems support millions of concurrent users, each producing multiple inputs per second.

Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.

Traditional database systems were designed to address smaller volumes of structured data, fewer updates or a predictable, consistent data structure.

Big Data analysis includes different types of data

Handling big data: Parallel computing

Imagine a 1 GB text file containing all the status updates on Facebook in a day.
Now suppose that simply counting the number of rows takes 10 minutes.
SELECT COUNT(*) FROM fb_status;

What do you do if you have 6 months of data, a file of size 200 GB, and you still want the result in 10 minutes?

Parallel computing?
Put multiple CPUs in a machine (100?)
Write code that calculates 200 parallel counts and finally sums them up
But you need a supercomputer
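
The "parallel counts, then sum" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the names count_rows, split, and parallel_count are made up, the data is an in-memory list rather than a 200 GB file, and a thread pool stands in for the many CPUs. In CPython, threads show the pattern but would not actually speed up CPU-bound counting.

```python
from concurrent.futures import ThreadPoolExecutor


def count_rows(chunk: list[str]) -> int:
    """Local count: number of rows in one chunk."""
    return len(chunk)


def split(rows: list[str], parts: int) -> list[list[str]]:
    """Split rows into at most `parts` roughly equal chunks."""
    size = max(1, -(-len(rows) // parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_count(rows: list[str], workers: int = 4) -> int:
    """Count each chunk on its own worker, then sum the partial counts."""
    chunks = split(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_rows, chunks))
    return sum(partial_counts)


# Hypothetical stand-in for the fb_status table.
statuses = [f"status update {i}" for i in range(1000)]
print(parallel_count(statuses, workers=8))  # 1000
```

The final sum is cheap. The expensive part, scanning the rows, is what gets divided among the workers.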

Handling big data: Is there a better way?

Until around 1985, there was no practical way to connect multiple computers. All systems were centralized systems.
So multi-core systems or supercomputers were the only options for big data problems.

After 1985, we have powerful microprocessors and high-speed computer networks (LANs, WANs), which led to distributed systems.
Now that we have distributed systems, in which a collection of independent computers appears to its users as a single coherent system, can we use some cheap computers and process our big data quickly?

MapReduce Programming Model

Processing data using special map() and reduce() functions
The map() function is called on every item in the input and emits a series of intermediate key/value pairs (local calculation)
All values associated with a given key are grouped together
The reduce() function is called on every unique key and its value list, and emits a value that is added to the output (final organization)
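
The three steps above (map, group by key, reduce) can be sketched on a single machine with a word count. This is a minimal in-memory illustration, not Hadoop's API; the names map_fn, shuffle, reduce_fn, and map_reduce are assumptions made for this example.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all values that share the same key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the value list of one key into a single output value."""
    return key, sum(values)


def map_reduce(lines: Iterable[str]) -> dict[str, int]:
    intermediate = (pair for line in lines for pair in map_fn(line))
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


lines = ["big data is big", "data is everywhere"]
print(map_reduce(lines))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster, the map calls run on many machines and the grouping step moves data across the network. Here everything happens in one process so the data flow is easy to follow.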

Not just MapReduce

Earlier, count = count + 1 was sufficient, but now we need to:
1. Set up a cluster of machines, then divide the whole data set into blocks and store them on local machines
2. Assign a master node that takes charge of all metadata, work scheduling and distribution, and job orchestration
3. Assign worker slots to execute map or reduce functions
4. Load balance (What if one machine in the cluster is very slow?)
5. Fault tolerance (What if the intermediate data is partially read, but the machine fails before all reduce (collation) operations can complete?)
6. Finally, write the MapReduce code that solves our problem
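
A toy sketch of steps 1, 2, 3 and 5 is given below: the data is divided into blocks, a master loop hands blocks to workers, and a block is rescheduled on another worker when one fails. Everything here is hypothetical (FlakyWorker, run_job, the node names), and the loop runs sequentially so that the orchestration logic stays visible. A real framework runs the workers in parallel on separate machines.

```python
def split_into_blocks(records, block_size):
    """Step 1: divide the data set into fixed-size blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


class FlakyWorker:
    """Hypothetical worker that crashes on its first `failures` tasks."""

    def __init__(self, name, failures=0):
        self.name = name
        self.failures_left = failures

    def run(self, task, block):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RuntimeError(f"{self.name} crashed")
        return task(block)


def run_job(blocks, workers, task, max_attempts=3):
    """Steps 2, 3, 5: the master assigns blocks and retries failed ones elsewhere."""
    results, log = [], []
    for index, block in enumerate(blocks):
        for attempt in range(max_attempts):
            worker = workers[(index + attempt) % len(workers)]
            try:
                results.append(worker.run(task, block))
                log.append((index, worker.name))
                break
            except RuntimeError:
                continue  # reschedule the same block on the next worker
        else:
            raise RuntimeError(f"block {index} failed after {max_attempts} attempts")
    return results, log


records = list(range(10))
blocks = split_into_blocks(records, block_size=4)  # 3 blocks
workers = [FlakyWorker("node-a"), FlakyWorker("node-b", failures=1), FlakyWorker("node-c")]
partial_counts, log = run_job(blocks, workers, task=len)
print(sum(partial_counts))  # 10
print(log)  # [(0, 'node-a'), (1, 'node-c'), (2, 'node-c')]
```

Block 1 is first given to node-b, which crashes, so the master retries it on node-c. The final count is still correct.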

OK. Analysis of big data can give us awesome insights.

But datasets are huge, complex, and difficult to process.
I found a solution: distributed computing, or MapReduce.
But it looks like this data storage and parallel processing is complicated.
What is the solution?

Hadoop

Hadoop is a collection of tools with many components. HDFS and MapReduce are the two core components of Hadoop
HDFS: Hadoop Distributed File System
Makes it easy to store the data on commodity hardware
Built to expect hardware failures
Intended for large files and batch inserts
MapReduce
For parallel processing
So Hadoop is a software platform that lets one easily write and run applications that process big data

Why Hadoop is useful

Scalable: It can reliably store and process petabytes.
Economical: It distributes the data and processing across
clusters of commonly available computers (in thousands).
Efficient: By distributing the data, it can process it in parallel
on the nodes where the data is located.
Reliable: It automatically maintains multiple copies of data
and automatically redeploys computing tasks based on
failures.
And Hadoop is free
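
The "multiple copies of data" point can be illustrated with a small placement sketch. It assumes a replication factor of 3 (the HDFS default) and a simple round-robin placement, which is not HDFS's actual rack-aware policy. The names place_blocks and surviving_copies are made up for illustration.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: simplified placement for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_copies(placement, failed):
    """Return the replicas of each block that remain after some nodes fail."""
    return {
        block: [node for node in replicas if node not in failed]
        for block, replicas in placement.items()
    }


nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
placement = place_blocks(num_blocks=6, nodes=nodes)
after_failure = surviving_copies(placement, failed={"node-2"})

print(placement[0])  # ['node-1', 'node-2', 'node-3']
print(all(len(copies) >= 2 for copies in after_failure.values()))  # True
```

Because every block lives on three different nodes, losing one node still leaves at least two copies of each block, from which the lost replica can be recreated.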

So what is Hadoop?
Hadoop is not Big Data
Hadoop is not a database
Hadoop is a platform/framework
Which allows the user to quickly write and test distributed systems
Which is efficient in automatically distributing the data and work across machines

Hadoop ecosystem

Big Data ecosystem


Examining large amounts of data

Appropriate information

Identification of hidden patterns, unknown correlations

Competitive advantage

Better business decisions: strategic and operational

Effective marketing, customer satisfaction, increased revenue

Where is processing hosted?
Distributed servers / cloud (e.g. Amazon EC2)
Where is data stored?
Distributed storage (e.g. Amazon S3)
What is the programming model?
Distributed processing (e.g. MapReduce)
How is data stored and indexed?
High-performance schema-free databases (e.g. MongoDB)
What operations are performed on data?
Analytic / semantic processing

Applications of Big Data analytics

Smarter healthcare
Homeland security
Traffic control
Manufacturing
Multichannel sales
Telecom
Trading analytics
Search quality

Organizations will be overwhelmed
Need the right people, solving the right problems
Costs escalate too fast
It isn't necessary to capture 100%
Privacy is an issue for many sources of big data
Self-regulation
Legal regulation

Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data and build a better information ecosystem.

Big Data is already an important part of the $64 billion database and data analytics market

It offers commercial opportunities of a comparable scale to enterprise software in the late 1980s, the Internet boom of the 1990s, and the social media explosion of today.
