⦿ Collection of data sets so large and complex that it
becomes difficult to process using on-hand
database management tools or traditional data
processing applications
⦿ “Big Data” is the data whose scale, diversity, and
complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract value
and hidden knowledge from it
⦿ ‘Big Data’ is similar to ‘small data’, but bigger in size
⦿ It aims to solve new problems, or to solve old problems in a
better way
⦿ Big Data generates value from the storage and processing
of very large quantities of digital information that cannot
be analyzed with traditional computing techniques.
Handling big data
Parallel computing
• Imagine a 1 GB text file containing all the status updates on Facebook in a day
• Now suppose that simply counting the number of rows
takes 10 minutes
• SELECT COUNT(*) FROM fb_status
• What do you do if you have 6 months of data, a file of size 200 GB, and
you still want the result in 10 minutes?
• Parallel computing?
• Put multiple CPUs in a machine (100?)
• Write code that calculates 200 parallel counts and finally
sums them up
• But you need a supercomputer
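The split, count, and sum idea can be sketched on a single machine. This is a minimal illustration, not how a cluster does it: the `statuses` list is a made-up stand-in for the fb_status data, the helper names are hypothetical, and threads are used only to keep the sketch self-contained (a real workload would spread the chunks over processes or machines).

```python
from concurrent.futures import ThreadPoolExecutor


def count_rows(chunk):
    """Count the rows in one chunk (the work done by a single worker)."""
    return sum(1 for _ in chunk)


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_count(rows, n_workers=4):
    """Count rows by summing independent partial counts."""
    chunks = split_into_chunks(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_rows, chunks))
    return sum(partial_counts)


# Hypothetical stand-in for the fb_status data: 1,000 made-up status updates.
statuses = [f"status update {i}" for i in range(1000)]
total = parallel_count(statuses, n_workers=8)
```

Each worker only sees its own chunk, so the partial counts are independent and the final step is a plain sum. That independence is what lets the same pattern scale from threads to many machines.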
MapReduce Programming Model
• Processing data using special map() and reduce()
functions
• The map() function is called on every item in the input
and emits a series of intermediate key/value pairs (local
calculation)
• All values associated with a given key are grouped
together
• The reduce() function is called on every unique key and
its value list, and emits a value that is added to the
output (final organization)
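The three steps above (map, group by key, reduce) can be sketched as a word count in plain Python. This is an in-memory illustration of the programming model only, not Hadoop's API; the function names `map_fn`, `shuffle`, `reduce_fn`, and `map_reduce` and the sample lines are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(line):
    """Map step: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Grouping step: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce step: combine the value list of one key into a single output value."""
    return (key, sum(values))


def map_reduce(lines):
    """Run map on every input item, group by key, then reduce every unique key."""
    intermediate = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Made-up input: each string plays the role of one input record.
lines = ["big data is big", "data is everywhere"]
word_counts = map_reduce(lines)
```

Because `map_fn` looks at one record at a time and `reduce_fn` looks at one key at a time, both steps can be run on many machines at once. Only the grouping step needs to move data between them.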
Hadoop
• Hadoop is a collection of tools with many components. HDFS
and MapReduce are its two core components
• HDFS: Hadoop Distributed File System
• Makes it easy to store data on commodity
hardware
• Built to expect hardware failures
• Intended for large files & batch inserts
• MapReduce
• For parallel processing
• So Hadoop is a software platform that lets one easily write
and run applications that process big data
So what is Hadoop?
• Hadoop is not big data
• Hadoop is not a database
• Hadoop is a platform/framework
• Which allows the user to quickly write and test distributed
systems
• Which is efficient in automatically distributing the data
and work across machines
Hadoop ecosystem
Big Data ecosystem
Application of Big Data analytics
• Smarter Healthcare
• Multi-channel sales
• Homeland Security
• Telecom
• Trading Analytics
• Traffic Control
• Search Quality
• Manufacturing
• Organizations can be overwhelmed by the data
• Need the right people solving the right problems
• Costs can escalate too fast
• It isn't necessary to capture 100% of the data
• A concern with many sources of big data
is privacy
• Self-regulation
• Legal regulation
⦿ Our newest research finds that organizations are using big
data to target customer-centric outcomes, tap into
internal data and build a better information ecosystem.
⦿ Big Data is already an important part of the $64 billion
database and data analytics market
⦿ It offers commercial opportunities on a scale comparable
to enterprise software in the late 1980s, the Internet boom
of the 1990s, and the social media explosion of today.