Bigdata Unit 1
What is Big Data?
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently. Big data is also data, but of huge size.
What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
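What is an Example of Big Data?
+ Social media: a very large volume of new data is ingested into the databases of social media sites every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
+ A single jet engine can generate a large volume of data in a short period of flight time. With many thousands of flights per day, the data generated adds up quickly.
Big Data Analytics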
+ wait a ie reise ede ania
etn seston arisen scr ged em ae
‘Saget artnet ncn ts ec ce ln a
ron remanent
+ What dante nic et any aan ens
limes eer tts mate arama sec ee
‘Peta hustler yb topo naan mar
Data Analytics vs Data Analysis
Data Analytics:
+ Data analytics is an advanced, broader form of data analysis. It includes data analysis as a sub-component and defines the logical framework on which analysis is done.
+ There are many analytics tools in the market, mainly Python, Apache Spark, etc.
Data Analysis:
+ Data analysis consists of defining a data investigation, then cleaning and transforming the data to give a meaningful outcome.
+ Tools for analysing the data include Tableau, Excel, etc.
Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data.
For example: internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone records, and machine data captured by sensors connected to the Internet of Things (IoT).
The importance of big data analytics:
Driven by specialized analytics systems and software, as well as high-powered computing systems, big data analytics offers various business benefits, including:
New revenue opportunities
More effective marketing
Better customer service
Improved operational efficiency
Competitive advantages over rivals
Structuring Big Data
Three different data structures
For the analysis of data, it is important to understand that there are three common types of data structures: structured, unstructured and semi-structured.
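Structured Data
+ Structured data is data that adheres to a predefined data model and is therefore straightforward to analyse. Structured data conforms to a tabular format with relationships between the different rows and columns. Common examples are Excel files or SQL databases; each of these has structured rows and columns that can be sorted.
+ Structured data depends on the existence of a data model: a model of how data can be stored, processed and accessed. Because of the data model, each field is discrete and can be accessed separately or jointly along with data from other fields. This makes structured data extremely powerful: it is possible to quickly aggregate data from various locations in the database.
+ Structured data is considered the most traditional form of data storage, since the earliest versions of database management systems (DBMS) were able to store, process and access structured data.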
[Example table of employee records with the columns Employee ID, Employee Name, Gender and Department.]
Unstructured Data
+ Unstructured data is information that either does not have a predefined data model or is not organised in a predefined manner.
+ Unstructured information is typically text-heavy, but may also contain data such as dates, numbers and facts.
+ This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
+ Common examples of unstructured data include audio and video files or NoSQL databases.
Semi-structured Data
+ Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
+ Therefore, it is also known as a self-describing structure. JSON and XML are examples of semi-structured data.
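To make the idea of a self-describing structure concrete, here is a minimal Python sketch using only the standard library. The employee record and its field names are invented for illustration; they are not taken from the slides. The point is that the field names travel with the data, so a program can discover them without a predefined table schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration only.
json_text = '{"employee": {"name": "Asha", "department": "Finance", "skills": ["SQL", "Python"]}}'
xml_text = "<employee><name>Asha</name><department>Finance</department></employee>"

# JSON: keys act as the markers that separate semantic elements.
record = json.loads(json_text)
json_fields = sorted(record["employee"].keys())

# XML: tags play the same role and also express the hierarchy.
root = ET.fromstring(xml_text)
xml_fields = sorted(child.tag for child in root)

print(json_fields)
print(xml_fields)
```

Note that the two records carry different sets of fields, which is allowed in semi-structured data but would require a schema change in a relational table.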
Big data analytics technologies and tools:
+ Unstructured and semi-structured data types typically don't fit well in traditional data warehouses that are based on relational databases oriented to structured data sets.
+ Further, data warehouses may not be able to handle the processing demands posed by sets of big data that need to be updated frequently or even continually, as in the case of real-time data on stock trading, the online activities of website visitors or the performance of mobile applications.
+ As a result, many of the organizations that collect, process and analyze big data turn to NoSQL databases, as well as Hadoop and its companion data analytics tools.
Exploring the use of Big Data in Business Context:
+ Almost all organisations collect relevant data (either directly or through an agency).
+ This data is related to customer feedback, information about supplies and retail, current market trends, etc.
+ The continuously increasing cost of collecting this information will be just a waste of resources unless some logical conclusion and business insight can be derived from it. This is where big data analytics comes into the picture.
+ This will help organisations to reduce cycle time, fulfil orders quickly and improve forecast accuracy.
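Hadoop's companion data analytics tools include:
+ YARN: a cluster management technology and one of the key features in second-generation Hadoop.
+ MapReduce: a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.
+ Spark: an open source, parallel processing framework that enables users to run large-scale data analytics applications across clustered systems.
+ Hive: an open source data warehouse system for querying and analyzing large data sets stored in Hadoop files.
+ Kafka: a distributed publish/subscribe messaging system designed to replace traditional message brokers.
+ Pig: an open source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs executed on Hadoop clusters.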
AGENDA
+ We are going to discuss different areas of big data applications:
  + Use of Big Data in Social Networking
  + Use of Big Data in Preventing Fraudulent Activities
  + Use of Big Data in Detecting Fraudulent Activities in the Insurance Sector
  + Use of Big Data in the Retail Sector
+ In each area we will discuss the following aspects:
  + What is the data involved?
  + How to make optimum use of the data?
  + What are the useful insights from analytics of the data?
Use of big data in social networking
A. What is social network data?
+ It refers to data generated by people socializing on social media websites such as Twitter, Facebook, etc.
+ On a social media website you will find different people constantly adding and updating comments, statuses, preferences, etc.
+ The following URL shows the social network data generated per second through various social media:
www.internetlivestats.com
B. How to make optimum use of social networking big data?
https://youtu.be/JAO_3EvD3DY
+ Analyzing and mining the large volume of data on social networking sites, such as comments, statuses, posts and likes, shows business trends in general with respect to the “wants” and “preferences” of a wide audience.
+ If this data can be systematically segregated on the basis of different age groups, locations, gender, etc., then organisations can design products and services specific to people's needs.
+ This is called social network analytics.
Example
+ In fact, the data generated from social networking analytics enables an organisation to calculate the total revenue a customer can influence, instead of only the direct revenue that customer generates. Ex: food bloggers.
+ Social networking analytics even has advanced applications, such as predicting the online reputation of a brand (ex: TripAdvisor) and increasing profitability in business by targeting influential customers (ex: Instagram influencers).
C. What are the useful insights from Big Data in social networking?
The following are the areas in which the decision-making processes of an organisation are influenced by social networking data:
Business intelligence: a data analysis process that converts a raw dataset into meaningful information that can add value to decision making. Social networking data and its appropriate analysis have proven to be a good aid in providing business intelligence. This can be understood from the following examples from different sectors in business.
II. Marketing: Today the preferences of consumers have changed due to their busy schedules, so marketers aim to deliver what consumers desire by using interactive communication channels such as email, mobile, web, etc.
Example: Walmart acquired a social media analytics company called Kosmix and established a branch, Walmart Labs, which analyses media communication such as blogs, tweets and transaction data to predict trends and learn about customers' wants.
IV. Product Design and Development: By listening to what consumers want, by understanding where the gap in the product offering is, and so on, organisations can make the right decisions about the direction of their product design and development.
Customer relationship management: With the help of social networking analytics, organisations can identify customers in their customer networks who make a large number of calls and text messages and have a large network of friends. Such a customer is said to be highly influential, as studies have shown that when a user of a telephone network leaves, his or her friends also leave. In fact, some organisations reward their influential customers with discounts and offers, and these customers in turn spread a positive brand image. Examples from other sectors: Google Pay, Airtel, etc.
Link analysis: Social network analytics can also help in law enforcement and anti-terrorism efforts, as it is possible to identify trouble groups or people who are directly or indirectly connected to each other. This type of analysis is called link analysis.
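As a rough illustration of finding people who are "directly or indirectly connected", the sketch below groups people into connected components with a breadth-first search. The person labels and links are made up for the example, and real link analysis tools work on far larger graphs with richer relationship data; this only shows the core idea.

```python
from collections import defaultdict, deque

def connected_groups(links):
    """Group people who are directly or indirectly linked to each other."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Breadth-first search collects everyone reachable from `start`.
        queue = deque([start])
        seen.add(start)
        group = []
        while queue:
            person = queue.popleft()
            group.append(person)
            for neighbour in sorted(graph[person]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(sorted(group))
    return groups

# Hypothetical contact links: A-B and B-C make A and C indirectly connected.
links = [("A", "B"), ("B", "C"), ("D", "E")]
print(connected_groups(links))
```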
Sentiment analysis refers to a computer programming technique to analyze human emotions, attitudes and views across popular social networking sites, including Facebook, Twitter and blogs.
The technique requires analytics skills as well as advanced computing applications.
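One of the simplest forms of sentiment analysis is a word-list (lexicon) approach, sketched below. The word lists are tiny and invented for illustration; production systems typically use much larger lexicons or trained models, so treat this as a toy under those assumptions.

```python
import re

# Hypothetical, deliberately tiny word lists.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}

def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sentiment_label(text):
    """Map the score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment_label("I love this phone, great battery"))
print(sentiment_label("Terrible service and slow delivery"))
```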
Business research organisations and marketing professionals across the globe use sentiment analysis in one form or another to identify and measure customer behaviour and online trends.
Preventing Fraudulent Activities
Types of financial frauds
1. Credit card fraud: This type of fraud is very common and relates to the use of credit card facilities. It commonly occurs when a fake or stolen card is used in an online transaction in spite of security checks on the valid owner of the card, such as address verification or the card verification value (CVV) number. Fraudsters manage to exploit the loopholes in the system.
What are fraudulent activities?
+ Fraud can be committed through both words and behaviour intended to deceive the other party, generally to gain an advantage over that party. Here financial frauds are discussed.
+ Frauds that occur frequently in financial institutions such as banks and insurance companies and involve any type of monetary transaction are called financial frauds.
+ Due to such frauds, online retailers such as Amazon, eBay and Groupon suffer huge losses, and this is where big data analytics comes to use.
2. Exchange or return policy fraud: Occurs when people take advantage of the exchange and return policies offered by an online retailer.
+ Example: customers returning the product after using it, reporting non-delivery and later attempting to sell the product online, etc.
+ The online retailer can prevent such fraud by charging a restocking fee on returned goods, getting customer signatures on delivery, and tracking customers known to commit such frauds using their transaction patterns. This is where big data analytics comes to use.
+ For example: the retailer can study customers' ordering patterns, frequency of change in shipping address, rush orders, sudden huge orders, etc.
3. Personal information fraud: This type of fraud occurs when fraudsters obtain the login credentials of customers, change the existing delivery address and purchase a product using those credentials.
+ When the original customer realises this, he or she keeps calling the retailer to refund the amount, as he or she has not made the transaction.
+ According to consumer goods regulations, once fraud is proved the retailer has to refund the amount to the customer.
What are the useful insights from big data analytics in real-time fraud detection?
+ Live data matching: in this study, organisations can compare live details of customers obtained from different sources to validate their authenticity.
+ Ex: In an online transaction, big data could compare the incoming address with the geodata received from the customer's smartphone apps. A valid match between the two confirms the authenticity of the transaction.
+ Ex: Also, costly products can have sensors attached to them that transmit their location information. When such products are delivered to customers, the streaming data obtained from the sensors provides a good source of information to trace any fraud.
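The live data matching idea can be sketched as a distance check between two locations. The sketch assumes the incoming address has already been converted to latitude/longitude, and the 50 km tolerance is an arbitrary illustrative threshold, not a value from the slides. The distance uses the standard haversine formula.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def locations_match(address_coords, phone_coords, max_km=50.0):
    """True when the order address and the phone's geodata are close enough.

    max_km is a hypothetical tolerance chosen only for this example.
    """
    return haversine_km(*address_coords, *phone_coords) <= max_km

# Hypothetical coordinates: same city vs. two cities far apart.
print(locations_match((12.97, 77.59), (12.98, 77.60)))
print(locations_match((12.97, 77.59), (28.61, 77.21)))
```

A mismatch would not prove fraud on its own; in the spirit of the text it is one signal that could trigger further checks.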
How to make optimum use of customer data to prevent fraud
+ In order to deal with huge amounts of data and gain meaningful insights to avoid fraud, organisations need to derive analytics tools to differentiate between real (genuine) and fraudulent customer entries.
+ Organisations have to upgrade their knowledge about emerging methods of fraud and design the necessary prevention checks.
+ Example: a secure OTP acting as a second round of checks after the CVV; Google Pay introducing a separate security PIN apart from the regular PIN.
Image analytics
+ This is another emerging field that can help detect frauds.
+ Image analysis (also known as "computer vision" or image recognition) is the ability of computers to recognize attributes within an image.
+ Some examples include facial recognition (smartphones) and position and movement analysis (Google Maps), etc.
+ Analytical systems that deal with big data are designed to integrate and understand images, videos, text, numbers and all forms of unstructured data to facilitate image analytics.
Use of big data in detecting fraud in the insurance sector
MPP (Massively Parallel Processing database)
+ This technology is used in powerful fraud management systems in order to detect frauds. The system analyses each customer transaction on the basis of 500 different criteria or aspects to differentiate between a real and a fraudulent transaction.
+ This level of analytic scalability needs an MPP system.
+ MPP is a widely used database management approach for storing and analysing huge volumes of data.
+ An MPP database has several independent pieces of data stored on multiple networks of connected computers.
+ It eliminates the concept of one central server having a single CPU and disk.
+ Visa payment services make use of MPP in their fraud management system.
+ This is important to study because most cases of cheating and fraudulent activity occur in the insurance and retail sectors.
What is the data available in the insurance sector?
+ In general, a company offering insurance is always willing to improve its ability to take decisions while processing claims and to ensure that a claim is a genuine one.
+ The company has policies and procedures to help underwriters (officers who evaluate insurance coverage, claim details, etc.). However, underwriters do not always have the required data at the right time to make the necessary decisions, which delays processing and increases the chances of fraud.
+ Before big data, insurance companies used to analyse only a small sample of customer data with fewer parameters, making the process less foolproof.
How to make optimum use of big data analytics in insurance
+ As a solution to these problems, big data analytical platforms increase the validity of data about customers by integrating their internal data with data obtained from social media or other sources.
+ Ex: A customer might indicate that his or her car was destroyed in a flood, but documentation from social media may show that the car was actually in another city on the day the floods occurred. This mismatch may hint at the presence of fraud.
+ Thus information obtained from these platforms will enable insurance companies to diagnose customer claim behaviour and other related issues.
+ Big data can detect patterns of fraudulent behaviour from large amounts of structured and unstructured data given to it (ex: bank statements, medical bills, criminal records, etc.) and help in detecting frauds quicker and ensuring better actions.
Social customer relationship management:
Social customer relationship management is not a platform or technology, but a process. It makes it critical for insurance companies to link social media sites, such as Facebook and Twitter, to their CRM systems.
When social media is integrated within an organisation, it provides high transparency on various issues related to customers.
What are the useful insights from big data analytics in insurance?
+ Social network analysis (SNA): a mixed approach using statistical methods, pattern analysis and link analysis to identify relationships within large amounts of data collected from different sources. For example, data from public records such as criminal records, frequency of address changes, foreclosures (a legal practice in which a lender recovers money from a debtor who has defaulted on repayments) and declarations of bankruptcy are all data sources that can be assimilated into the SNA model, which helps to effectively detect the existence of fraud.
+ Using this approach, by incorporating information obtained from various data sources into a model, the insurance company can rate claims (a high rating indicates that a claim may be fraudulent). Ex: suppose a customer files a claim for insurance money on a car destroyed by fire, but the statements in the claim reports show that valuables were removed from the car beforehand; this might indicate that the car was burned on purpose.
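The claim-rating idea can be sketched as a weighted sum of indicators drawn from the data sources listed above. The indicator names, the weights and the review threshold below are all hypothetical values chosen for illustration; a real SNA model would be fitted to historical claims rather than hand-weighted.

```python
# Hypothetical indicator weights; higher means more suspicious.
WEIGHTS = {
    "criminal_record": 3,
    "frequent_address_change": 2,
    "foreclosure": 2,
    "bankruptcy": 2,
    "valuables_removed_before_loss": 4,
}

def claim_rating(indicators):
    """Sum the weights of every indicator that is present on the claim."""
    return sum(
        WEIGHTS[name]
        for name, present in indicators.items()
        if present and name in WEIGHTS
    )

def needs_review(indicators, threshold=5):
    """Flag the claim for manual review when its rating reaches the threshold."""
    return claim_rating(indicators) >= threshold

claim = {"bankruptcy": True, "valuables_removed_before_loss": True}
print(claim_rating(claim), needs_review(claim))
```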
Retail industry
What is big data in the retail industry?
+ In recent times, omni-channel retailing is a new buzzword. This process focuses on consumer experience by using all available channels (the word "omni" means all directions), including mobile, internet, television, showrooms, radio, mall, apps and many more evolving channels.
+ Hence, considering the immense number of transactions in the omni-channel retail industry from all channels, there is a lot of scope for the use of big data technologies in extracting useful information such as relationship patterns and trends in the sales of products.
+ For example: at what time of the year do we sell the maximum number of leggings, and through which channel?
+ Design promotional coupons for customers based on their ordering patterns.
+ Further, to meet the demands of new customers, retailers are adopting specialized software applications. For example, customers are given information on whether a particular item is in stock in a nearby store or not (e.g. Apollo Pharmacy).
+ This is where big data analytics comes to use.
RFID video: https://www.youtube.com/watch?v-reQUE7BOUSY
How to make optimum use of retail data: RFID technology
+ The biggest evolution in automating the process of labelling and tracking retail goods is RFID (radio frequency identification).
+ Walmart was one of the first retailers to implement RFID on its products.
+ RFID helps with better item tracking by differentiating items that are out of stock from those that are available on the shelf.
+ With this technology, the huge volume of transactional data associated with omni-channel retailing can be handled easily, and measures can be taken to enhance customer experience.
Useful insights from retail data analytics:
Asset management: Retail organisations can tag their material handling equipment, such as vehicles and tools, with RFID in order to trace them at any time and from any location. Readers fixed at specific locations can observe and record all movements of the tagged assets with great accuracy. This information also lessens the time needed for documentation.
Regulatory compliance: To meet the regulations of agencies such as the FDA (Food and Drug Administration), OSHA (Occupational Safety and Health Administration), etc., manufacturers need to dispatch products such as medicines, regulated drugs, special foods with preservatives and hazardous chemicals with updated labels. RFID tags can be used as a labelling system for these goods. Logistics companies such as DTC can also differentiate speed-delivery products from normal-delivery ones using RFID tags.
Inventory control: RFID data allows manufacturers to track inventory of raw materials, work in progress (WIP) and finished goods (FG). Readers installed on shelves can update inventory automatically and raise alarms in case the need for restocking arises. Further, the readers can be programmed to raise an alarm in case items are removed and placed elsewhere. Even Apollo Pharmacy manages its inventory of available drugs using this technology.
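The shelf-reader behaviour described under inventory control can be modelled in a few lines. This is a simulation only: the class, the event methods and the reorder levels are invented for illustration and do not correspond to any real RFID reader API.

```python
from collections import Counter

class ShelfInventory:
    """Toy model of a shelf reader that tracks stock from tag events."""

    def __init__(self, reorder_levels):
        # reorder_levels: item name -> minimum stock before an alarm is raised
        self.reorder_levels = dict(reorder_levels)
        self.stock = Counter()
        self.alarms = []

    def tag_placed(self, item):
        """A tagged item was detected on the shelf."""
        self.stock[item] += 1

    def tag_removed(self, item):
        """A tagged item left the shelf; raise an alarm if stock runs low."""
        if self.stock[item] > 0:
            self.stock[item] -= 1
        if self.stock[item] < self.reorder_levels.get(item, 0):
            self.alarms.append(item)

# Hypothetical usage: alarm when fewer than 2 packs remain.
shelf = ShelfInventory({"paracetamol": 2})
for _ in range(3):
    shelf.tag_placed("paracetamol")
shelf.tag_removed("paracetamol")   # 2 left, no alarm yet
shelf.tag_removed("paracetamol")   # 1 left, restocking alarm
print(shelf.stock["paracetamol"], shelf.alarms)
```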
Shipping and receiving: RFID tags can be used to speed up the final shipping of finished goods.
Service and warranty authorisations: RFID tags can hold updated information about repairs and services done on the product. Once a repair or service has been completed, the information can be fed into the RFID tag on the product. If future repairs are required, technicians can access this information without accessing any external database, which helps in reducing calls and time-expensive enquiries into documents.
Introducing technologies for handling big data:
+ Huge amounts of data from different sources need to be managed properly to derive productive results.
+ The astronomical increase in the volume, velocity and variety of data collected from different sources at the same time is forcing organisations to adopt a data analysis strategy that can be used for analysing the entire data in a very short time.
+ This is done with the help of new software programs or applications that do the following:
  + Breaking up the given task into sub-tasks
  + Surveying the available resources on hand
  + Assigning the sub-tasks to the nodes or computing devices that are interconnected via a network
  + Finally, collecting the outputs from all sub-tasks
+ The above applications are based on the concepts of distributed and parallel computing.
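The four steps (break up the task, survey resources, assign sub-tasks, collect outputs) can be sketched on a single machine. In this sketch, worker threads stand in for networked nodes, which is an assumption made to keep the example self-contained; a real distributed system would ship the sub-tasks to other machines.

```python
import math
import os
from concurrent.futures import ThreadPoolExecutor

def split_into_subtasks(numbers, parts):
    """Step 1: break the task (a list to be summed) into roughly equal chunks."""
    size = max(1, math.ceil(len(numbers) / parts))
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]

def run_job(numbers, workers=None):
    # Step 2: survey the available resources (here: local CPU count, capped at 4).
    workers = workers or min(4, os.cpu_count() or 1)
    chunks = split_into_subtasks(numbers, workers)
    # Step 3: assign each sub-task to a worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # Step 4: collect the outputs from all sub-tasks.
    return sum(partial_sums)

print(run_job(list(range(1, 101))))
```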
Distributed Computing and Parallel Computing Techniques for Big Data
+ Distributed computing: In distributed computing, multiple computing resources are connected in a network and computing tasks are distributed across these resources.
+ This sharing of tasks increases the speed as well as the efficiency of the system.
+ It is also more suitable for processing huge amounts of data in a limited time.
Characteristics of a Distributed System
+ Heterogeneity refers to the ability of the system to operate on a variety of different hardware and software components.
+ Openness of a distributed system defines the difficulty involved in extending or improving an existing system.
+ Concurrency refers to the system's ability to handle the access and use of shared resources.
+ Scalability is one of the major characteristics determining the effectiveness of a distributed system; it refers to how easily the system can adapt to a change in size.
Parallel computing: This is another way to improve the processing capability of a computer system, by adding additional computational resources to it. In this method, complex computations are divided into sub-tasks, which can be handled individually by processing units running in parallel.
In general, organisations use a combination of parallel and distributed techniques to process big data.
Difference:
+ Parallel computing is a type of computing architecture in which several processors simultaneously execute multiple, smaller calculations broken down from an overall larger, complex problem.
+ Big data is a collection of such huge data that it cannot be processed by traditional tools.
Issues in big data handling systems:
Latency: can be defined as the aggregate delay in the system because of delays in the completion of individual tasks.
+ Such a delay automatically leads to a slowdown in system performance as a whole and is often termed system delay.
+ The number of nodes designed into the distributed computing system to process individual tasks determines the level of scalability of the big data system.
+ Thus implementing distributed and parallel computing methodologies helps in handling latency.
Load balancing: The sharing of workload across various systems throughout the network to manage the load is known as load balancing. Distributed and parallel computing methodologies make use of the load balancing feature to handle growing amounts of big data more efficiently and flexibly.
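One simple load balancing policy is "give the next task to the least-loaded node", sketched below with a min-heap. The policy, the task costs and the node names are illustrative assumptions; the slides do not prescribe a particular algorithm, and real systems also account for data locality and node failures.

```python
import heapq

def assign_tasks(task_costs, node_names):
    """Greedy least-loaded assignment: largest tasks first, each to the idlest node."""
    heap = [(0, name) for name in node_names]   # (current load, node)
    heapq.heapify(heap)
    assignment = {name: [] for name in node_names}
    # Sort by descending cost, then by task name for a deterministic order.
    for task, cost in sorted(task_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, name = heapq.heappop(heap)        # node with the smallest load
        assignment[name].append(task)
        heapq.heappush(heap, (load + cost, name))
    return assignment

# Hypothetical workload: task name -> estimated cost.
tasks = {"t1": 5, "t2": 3, "t3": 2, "t4": 2}
print(assign_tasks(tasks, ["n1", "n2"]))
```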
Virtualization: Big data virtualization is a process of creating virtual structures for big data systems, such as the hardware platform, storage device and operating system, to meet the goals and objectives of big data analytics.
+ This virtualization helps organisations to understand and easily navigate the flow of information across these physical systems.
+ Distributed and parallel computing methodologies make use of virtualisation to segregate the processing and analysis tasks in a systematic framework to minimize errors.
Special techniques of Distributed & Parallel computing:
+ Distributed and parallel computing techniques have been around for almost 50 years. Initially the technology was used in computer science research to solve complex problems by increasing scalability without investing in a massive computing system.
+ Over time, the concepts of distributed and parallel computing technology have evolved into a number of techniques to process and manage huge amounts of data produced at a high velocity.
+ Some of these techniques are described below.
+ Cluster or grid computing: a form of parallel computing in which a bunch of computers (often called nodes) are connected through a LAN and used to solve complex operations so that they behave like a single machine. This reduces downtime and provides larger storage capacity. Primarily used in Hadoop.
+ Massively parallel processing (MPP): primarily used in data warehousing. MPP is a widely used database management approach for storing and analysing huge volumes of data. An MPP database has several independent pieces of data stored on multiple networks of connected computers. It eliminates the concept of one central server having a single CPU and disk. Examples of MPP platforms are Greenplum and ParAccel (both popular database management companies).
+ High performance computing (HPC): HPC environments are the ones that are specially designed for processing floating point data at high speed. HPC is used in research and business organisations to develop specialized applications where accurate results are more valuable. Example: pollution level detection, etc.
Why Hadoop?
Hadoop: High Availability Distributed Object Oriented Platform (a popular expansion; the name itself comes from a toy elephant belonging to Doug Cutting's son).
What it is and why it matters
+ Over the course of the evolution of big data handling systems, distributed computing environments have been used to process high volumes of data. However, the multiple nodes in such an environment may not always cooperate with each other (due to issues such as latency, data-related problems, system delay, etc.), thus leaving a lot of scope for errors.
+ In this context Hadoop evolved as a platform or framework providing an improved programming model, which is used to create and run distributed systems quickly and efficiently with the fewest errors.
What is Hadoop?
“Hadoop is a framework that allows you to first store Big Data in a distributed environment, so that you can process it in parallel.”
Hortonworks (a data software company based in California that developed and supported open-source software for big data processing) definition: an open source software platform for distributed storage and distributed processing that runs on an entire cluster instead of one PC.
+ "Distributed storage": a data set is spread across multiple drives. If one of them burns down, the data is still reliably stored.
+ "Distributed processing": Hadoop can aggregate data using many CPUs in the cluster.
When to use Hadoop?
+ Search: Yahoo, Amazon
+ Log processing: Facebook, Yahoo
+ Data warehouse: Facebook
+ Video and image analysis: New York Times
When not to use Hadoop?
+ Low latency data access: quick access to small parts of data.
+ Multiple data modification: Hadoop is a better fit only if we are primarily concerned with reading data and not modifying it.
+ Lots of small files: Hadoop is suitable for scenarios where we have few but large files.
Evolution of Hadoop
+ In 2003, Doug Cutting (a software designer who created open-source search technologies) launched the project "Nutch" to handle billions of searches and index millions of web pages.
+ In Oct 2003, Google released a paper on GFS (Google File System).
+ In Dec 2004, Google released a paper on MapReduce.
+ In 2005, Nutch used GFS and MapReduce to perform operations.
+ In 2006, Yahoo created Hadoop based on GFS and MapReduce with Doug Cutting and team.
+ In 2007, Yahoo started using Hadoop on a 1000-node cluster.
+ In Jan 2008, Yahoo released Hadoop as an open source project to the Apache Software Foundation.
+ In July 2008, Apache successfully tested a 4000-node cluster with Hadoop.
+ In 2009, Hadoop successfully sorted a petabyte of data in less than 17 hours to handle billions of searches and index millions of web pages.
+ In Dec 2011, Apache Hadoop released version 1.0.
+ In Aug 2013, version 2.0.6 became available.
Hadoop Ecosystem:
+ As we understand, Hadoop is an open source software framework (a set of programs written in Java that allows massively parallel computing, allowing big data sets to be stored and spread across multiple servers with little reduction in performance).
+ Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
+ Thus the Hadoop ecosystem is defined as a platform which provides various services to solve the problems associated with big data.
There are 4 major services provided:
+ Data processing (tools: MapReduce, YARN)
+ Data storage (tools: HDFS, HBase)
+ Data access (tools: Hive, Pig, Sqoop, etc.)
+ Data management (tools: Oozie, Flume, ZooKeeper, etc.)
Understanding Hadoop Ecosystem:
The following are components that collectively form a Hadoop ecosystem:
+ HDFS: Hadoop Distributed File System
+ YARN: Yet Another Resource Negotiator
+ Spark: in-memory data processing
+ Pig, Hive: query-based processing of data services
+ HBase: NoSQL database
+ Mahout: machine learning
+ ZooKeeper: managing the cluster
+ Oozie: job scheduling
Video: www.youtube.com/watch?v-aReuLtYOYMI
+ HDFS, MapReduce and YARN are the core components of Apache Hadoop, and they form the basic distributed Hadoop framework.
+ There are several other Hadoop components that form an integral part of the Hadoop ecosystem with the intent of enhancing the power of Apache Hadoop in some way or the other, such as providing better integration with databases, making Hadoop faster or developing novel features and functionalities.
+ In the next few slides we will discuss some of the eminent Hadoop components used extensively by enterprises and mentioned in our syllabus.
+ They are Mahout, Sqoop, Oozie, Flume, ZooKeeper and HBase.
What is HDFS?
HDFS: Hadoop Distributed File System
+ In HDFS, the abstraction means representing the data as the blocks of a file rather than as a single file, which simplifies the storage subsystem.
+ Similar to virtualization, you can see HDFS logically as a single unit for storing Big Data, but actually you are storing your data across multiple nodes in a distributed fashion.
+ HDFS follows a master-slave architecture.
+ The NameNode is the master node, and it records which data blocks are stored in which DataNode, where the replicas of each data block are kept, etc.
+ The actual data is stored in the DataNodes.
What is YARN?
YARN: Yet Another Resource Negotiator
+ Hadoop YARN (Yet Another Resource Negotiator) is a component of the Hadoop ecosystem. It is responsible for allocating resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.
+ YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework.
+ It performs data processing activities by allocating resources and scheduling tasks.
MapReduce: a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster.
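The paradigm is easiest to see in the classic word-count example. The sketch below runs the map, shuffle and reduce phases in plain Python on one machine; it illustrates the programming model only and does not use the Hadoop API. On a real cluster, the map and reduce calls would run on many servers in parallel.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(word, counts):
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())

print(word_count(["big data", "Big deal"]))
```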
YARN
+ It stands for Yet Another Resource Negotiator.
+ It was introduced in Hadoop 2, where the resource negotiation part was split out from MapReduce.
+ Where HDFS splits up the data storage across your cluster, YARN splits up the computation.
+ YARN aims to run jobs as fast as possible.
HIVE
• Hive is an application that runs over the Hadoop framework and provides a SQL-like interface for processing or querying data.
• HIVE provides a SQL-like interface for working on data stored on Hadoop-integrated systems.
• That is, it makes your HDFS-stored data look like a SQL database.
• SQL is a structured query language used for processing structured and semi-structured data.
What is Hive ?
• Apache Hive is a data warehouse system built on top of Hadoop that facilitates reading, writing, and managing large datasets using SQL.
• As a result, Hive is closely integrated with Hadoop and is designed to work quickly on petabytes of data.
• In short, it transforms the queries into efficient MapReduce or Spark jobs.
• HIVE is built upon Hadoop, and the query language for processing data in Hive is HiveQL; a query is converted into a MapReduce program and then processed by Hadoop.
Pig
PIG Vs HIVE
The corresponding scripting language, called Pig Latin, has a SQL-like syntax and can perform MapReduce jobs; its programs are easier to write than native Java.
Oozie
• Oozie is an orchestration system for Hadoop jobs.
• Oozie is an open source Java web application available under the Apache license.
• Oozie is designed to run multistage Hadoop jobs as a single job: an Oozie job.
Oozie In Operation
• Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment.
• It allows combining multiple complex jobs to be run in a sequential order to achieve a bigger task.
• Oozie detects completion of tasks through callback and polling.
• When Oozie starts a task, it provides a unique callback HTTP URL to the task, and the task notifies that URL when it is complete.
• If the task fails to invoke the callback URL, Oozie can poll the task for completion.
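The callback-then-poll behaviour described above can be sketched as follows. This is a simplified model, not Oozie code: a plain Python function call stands in for the callback HTTP URL, and the `Task` and `Scheduler` classes are hypothetical names introduced for illustration.

```python
class Task:
    """Hypothetical stand-in for a Hadoop job started by the scheduler."""

    def __init__(self, name, invokes_callback):
        self.name = name
        self.invokes_callback = invokes_callback
        self.done = False

    def run(self, callback):
        self.done = True
        if self.invokes_callback:
            callback(self.name)  # stands in for hitting the callback URL


class Scheduler:
    def __init__(self):
        self.completed = {}  # task name -> how completion was detected

    def _on_callback(self, task_name):
        self.completed[task_name] = "callback"

    def start(self, task):
        task.run(self._on_callback)

    def poll(self, task):
        """Fallback: ask the task directly if no callback was received."""
        if task.name not in self.completed and task.done:
            self.completed[task.name] = "polling"


scheduler = Scheduler()
tasks = [Task("import", invokes_callback=True),
         Task("transform", invokes_callback=False)]
for task in tasks:  # tasks run in sequential order
    scheduler.start(task)
    scheduler.poll(task)
```

Polling only matters for the second task, which finishes without ever reporting back.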
Workflow in Oozie
[Figure: an Oozie workflow combining MapReduce, Pig, Spark, and Hive query jobs with Fork and Join nodes]
Features Of Apache Oozie
• Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment.
• Oozie allows combining multiple complex jobs to be run in a sequential order to achieve the desired output.
• It is strongly integrated with the Hadoop stack, supporting various jobs like Pig, Hive, Sqoop, etc.
• Further, Oozie is able to use the existing Hadoop machinery for problems such as load balancing and fail-over.
Types of Apache Oozie Jobs
The following three types of jobs are common in Oozie:
• Oozie Workflow Jobs: Oozie jobs running on demand. Workflow actions can be different tasks, like Hive tasks, Pig tasks, Shell actions, etc.
• Oozie Coordinator Jobs: Oozie jobs running periodically.
• Oozie Bundle: a collection of coordinator jobs managed as a single job.
Sqoop
Special features of Sqoop:
Apache Sqoop undertakes the following tasks to integrate bulk data movement between Hadoop and structured databases:
• Sqoop fulfils the growing need to transfer data from the mainframe to HDFS.
• It facilitates parallel data transfer for effective performance and optimal system utilization.
• Sqoop creates fast data copies from an external source into Hadoop.
• It acts as a load balancer by mitigating extra storage and processing loads to other devices.
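Parallel transfer relies on dividing the source table among several workers. The sketch below shows one way to divide a numeric key range into contiguous chunks; it illustrates the idea only and is not Sqoop's actual splitting algorithm. The table name `orders` and column `id` are hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        if size == 0:
            continue  # more mappers than rows: skip empty chunks
        ranges.append((start, start + size - 1))
        start += size
    return ranges


ranges = split_ranges(1, 10, 4)
# Each worker would then import only its own slice of the table.
queries = [
    f"SELECT * FROM orders WHERE id BETWEEN {low} AND {high}"
    for low, high in ranges
]
```

Because the chunks do not overlap, the workers can copy their slices at the same time without coordinating with each other.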
• The Sqoop component is used for importing data from external sources, such as relational databases and variously structured data marts, into related Hadoop components like HDFS, HBase, or Hive.
• Sqoop mainly helps in moving data from an enterprise database to a Hadoop cluster to perform the ETL (extract, transform, load) process.
• It can also be used for exporting data from Hadoop components to external structured data stores, as shown below.
FLUME
What is Flume ?
Flume is relevant in cases when data is required to be brought from multiple servers immediately into Hadoop.
In such cases, the Flume component is used to gather and aggregate large amounts of data.
It facilitates the streaming of huge volumes of log files from various sources (like web servers such as Twitter, Facebook, etc.) into the Hadoop Distributed File System (HDFS).
Why Apache Flume?
• Organizations running multiple web services across multiple servers and hosts will generate multitudes of log files on a daily basis.
• These log files will contain information about activities that is required for both auditing and analytical purposes.
• Also, when the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between data producers and the centralized stores and provides a steady flow of data between them.
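The mediator role can be pictured as a buffer between a bursty producer and a slower destination. The following is a minimal simulation under that assumption, using a plain in-memory queue; it does not use any Flume API, and the names are hypothetical.

```python
from collections import deque


def simulate(arrivals_per_tick, sink_rate):
    """Return how many events the sink writes in each tick.

    Events that arrive faster than the sink can write wait in the channel.
    """
    channel = deque()
    delivered = []
    event_id = 0
    for arrivals in arrivals_per_tick:
        for _ in range(arrivals):
            channel.append(event_id)
            event_id += 1
        batch = [channel.popleft() for _ in range(min(sink_rate, len(channel)))]
        delivered.append(len(batch))
    while channel:  # keep draining after the producers have stopped
        batch = [channel.popleft() for _ in range(min(sink_rate, len(channel)))]
        delivered.append(len(batch))
    return delivered


delivered = simulate([5, 0, 1, 0], sink_rate=2)
```

The burst of five events in the first tick is smoothed into a steady flow of at most two writes per tick, and nothing is lost.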
ZOOKEEPER
Introduction
• Apache Zookeeper is an open source software framework designed to coordinate multiple services in the Hadoop ecosystem.
• Organizing and maintaining a service in a distributed environment is a complicated task.
• ZooKeeper allows developers to focus on core application logic without worrying about the distributed nature of the application.
• ZooKeeper is a distributed coordination service to manage a large set of hosts.
Features of Zookeeper contd.
• Zookeeper is a coordinator. Many other tools (HDFS, HBase) rely on it.
• It can keep track of which node is up or down, which one is the master node, what workers are available, and many more things.
• It keeps track of things that can go wrong on the cluster, including the master node crashing, a worker crashing, or network trouble where a part of the cluster can no longer see the rest of it.
• Zookeeper sits on the side of your system and tries to maintain a consistent picture of state across the entire distributed system.
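The bookkeeping described above can be sketched with heartbeats. This is a single-process toy, not the ZooKeeper API: the `Coordinator` class, the timeout rule, and the "earliest live node is master" rule are all assumptions made for illustration.

```python
class Coordinator:
    """Toy registry that tracks live nodes and picks a master."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node -> time of its last heartbeat
        self.joined = []     # nodes in order of first registration

    def heartbeat(self, node, now):
        if node not in self.last_seen:
            self.joined.append(node)
        self.last_seen[node] = now

    def live_nodes(self, now):
        """A node counts as up if it sent a heartbeat within the timeout."""
        return [n for n in self.joined
                if now - self.last_seen[n] <= self.timeout]

    def master(self, now):
        """Assumed rule: the earliest-registered live node is the master."""
        live = self.live_nodes(now)
        return live[0] if live else None


zk = Coordinator(timeout=3)
zk.heartbeat("a", now=0)
zk.heartbeat("b", now=1)
zk.heartbeat("b", now=4)  # "a" has gone silent by now
```

At time 2 both nodes are live and `a` is master; at time 5 node `a` has timed out, so `b` takes over.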
HBase
HBase is a column-oriented non-relational database management system that runs on top of Hadoop Distributed File System (HDFS).
More on Data Storage part of HADOOP Ecosystem: HBASE.
• HBase is the Hadoop database: an open source, non-relational, distributed, column-oriented database developed as part of Hadoop that runs on top of HDFS.
• It is beneficial when large amounts of information are required to be stored and updated.
• While MapReduce enhances Big Data processing, HBase takes care of its storage.
• HBase is used when you need random, realtime read/write access to your Big Data.
• As a database, it lets enterprises create large tables with millions of rows and columns.
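Random read/write access in a column-oriented store can be pictured as a map from row key to named cells. The sketch below is an in-memory illustration only; the `ColumnTable` class and its methods are hypothetical and are not the HBase client API. The `family:qualifier` column naming and sorted row keys follow HBase's data model.

```python
class ColumnTable:
    """Toy table: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        """Write one cell; rows may have entirely different columns."""
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        """Random read of a whole row, or of a single cell."""
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start, stop):
        """Yield (row_key, cells) for start <= row_key < stop, in key order."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, dict(self.rows[key])


table = ColumnTable()
table.put("user2", "info:name", "Ravi")
table.put("user1", "info:name", "Asha")
table.put("user1", "stats:logins", 7)
scanned = list(table.scan("user1", "user3"))
```

Each `put` or `get` touches a single row directly by its key, which is the access pattern the bullets above describe.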
Features of HBase:
+ asenebs propane teste antes ns aay Cb
+ eave dainacangrecidtomtanacne cc at mano pc
+ Vismateseuy tad aon tir fhe date oped acai
+ nsteyeu tive epee na arene ta eemmercea te
+ elsutae ern stage sul Denson cea Paces ee
nda vp trea spn sts dane age cite
Why Hbase ?
• HBase is a specialized data store that runs on top of HDFS and is relevant in the following cases:
  • When more random read and write access to data is needed.
  • When you want the data to be stored in a more structured fashion.
  • When the velocity of data is very high.
  • When the log data of a website needs to be stored. Example: Facebook data stored in HBase.
Basic Blocks of HBase
+ tonne ni att ela ted Dandy Vln ean
sate song ie os
• HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.
Frocnag no sore beh ens acs Rando ace)
+ MStratrence tren Diference need and
Snore Tieagemsy. fsb Mme Aee Fone oer oe ont
1 Random hese ia te ans thatb contra
‘Symi aan wee oa
Summary and How Hbase is combined with Hadoop
• Apache HBase is the Hadoop database: a distributed, column-oriented, scalable big data store.
• Use Apache HBase when you need random, realtime read/write access to your Big Data.
• Its goal is hosting very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.
• Apache HBase is an open source, distributed, non-relational database modelled after Google's Bigtable: A Distributed Storage System for Structured Data.
• Just as Google Bigtable uses the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Understanding Hadoop components:
Use cases Tabulation
[Table: Component of Hadoop, Brief Description]