Big Data Tutorial Part4

Marko Grobelnik (marko.grobelnik@ijs.si)
Jozef Stefan Institute, Ljubljana, Slovenia

Stavanger, May 8th 2012

Introduction

Techniques, Tools, Applications, Literature

What is Big Data? Why Big Data? When is Big Data really a problem?

Big data is similar to small data, but bigger.
Having bigger data consequently requires different approaches: techniques, tools & architectures.
The aim is to solve new problems, and old problems in a better way.

From "Understanding Big Data" by IBM

Big-Data

Key enablers for the growth of Big Data are:

Increase of storage capacities
Increase of processing power
Availability of data

NoSQL databases: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper

MapReduce: Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum

Storage: S3, Hadoop Distributed File System

Servers: EC2, Google App Engine, Elastic Beanstalk, Heroku

Processing: R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop

Big data is really a problem when the operations on the data are complex:

Simple counting, for example, is not a complex problem
Modeling and reasoning with data of different kinds can get extremely complex

Good news about big data:

Often, because of the vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics), as long as we can deal with the scale
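As one possible illustration of counting standing in for a more complex model, the sketch below predicts the next word purely from bigram frequencies. The toy corpus and the function names `train_bigram_counts` and `predict_next` are assumptions made for this example; they do not come from the tutorial.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word is followed by each other word."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus; a real system would count over a very large one.
corpus = [
    "big data needs new tools",
    "big data needs more storage",
    "big data needs new techniques",
]
model = train_bigram_counts(corpus)
print(predict_next(model, "needs"))
```

With enough text, frequency tables of this kind can be competitive with hand-built models, which is the sense in which scale can simplify modeling.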

Research areas (such as IR, KDD, ML, NLP, SemWeb, ...) are subcubes within the data cube

Usage, Quality, Context, Dynamicity, Scalability

Good recommendations can make a big difference in keeping a user on a web site

The key is how rich a context model the system uses to select information for a user
With bad recommendations <1% of users click; with good ones >5% of users click

Contextual personalized recommendations are generated in ~20ms

Domain, Sub-domain, Page URL, URL sub-directories, Page Meta Tags, Page Title, Page Content, Named Entities, Has Query, Referrer Query

Referring Domain, Referring URL, Outgoing URL, GeoIP Country, GeoIP State, GeoIP City, Absolute Date, Day of the Week, Day period, Hour of the day, User Agent

Zip Code, State, Income, Age, Gender, Country, Job Title, Job Industry
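A few of the context features listed above can be derived directly from a page view. The sketch below is a minimal example under stated assumptions: the function name `context_features`, the dictionary keys and the sample URLs are hypothetical, and the domain split is naive (it takes the last two host labels, so suffixes such as `co.uk` are handled incorrectly).

```python
from datetime import datetime
from urllib.parse import urlparse

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def context_features(page_url, referrer_url, timestamp):
    """Build a small context-feature dictionary for one page view."""
    page = urlparse(page_url)
    ref = urlparse(referrer_url) if referrer_url else None
    host_parts = page.hostname.split(".") if page.hostname else []
    path_parts = [p for p in page.path.split("/") if p]
    return {
        # Naive split: last two labels are the domain, the rest the sub-domain.
        "domain": ".".join(host_parts[-2:]),
        "sub_domain": ".".join(host_parts[:-2]),
        # Assumes the last path segment is the page itself, not a directory.
        "url_sub_directories": path_parts[:-1],
        "has_query": bool(page.query),
        "referring_domain": ref.hostname if ref else None,
        "day_of_week": DAY_NAMES[timestamp.weekday()],
        "hour_of_day": timestamp.hour,
    }


features = context_features(
    "http://travel.example.com/europe/slovenia/guide.html?src=home",
    "http://www.example.org/search?q=ljubljana",
    datetime(2012, 5, 8, 14, 30),
)
print(features)
```

Features such as GeoIP location, named entities or user demographics need external data sources and are left out of this sketch.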

Trend Detection System

[Diagram labels: log files (~100M page clicks per day), stream of clicks, user profiles, stream of profiles, NYT articles, trends and updated segments]

Segment: Keywords
Stock Market: Stock Market, mortgage, banking, investors, Wall Street, turmoil, New York Stock Exchange
Health: diabetes, heart disease, disease, heart, illness
Green Energy: Hybrid cars, energy, power, model, carbonated, fuel, bulbs
Hybrid cars: Hybrid cars, vehicles, model, engines, diesel
Travel: travel, wine, opening, tickets, hotel, sites, cars, search, restaurant

[Diagram labels: sales, segments, campaign to sell segments, advertisers]

50 GB of uncompressed log files
10 GB of compressed log files
0.5 GB of processed log files
50-100M clicks
4-6M unique users
7000 unique pages with more than 100 hits
Index size: 2 GB
Pre-processing & indexing time:
~10 min on a workstation (4 cores & 32 GB)
~1 hour on EC2 (2 cores & 16 GB)

[Diagram labels: telecom network (~25 000 devices), alarms (~10-100/sec), Alarms Server, live feed of data, Alarms Explorer Server, operator, big board display]

The Alarms Explorer Server implements three real-time scenarios on the alarms stream:

1. Root-Cause Analysis: finding which device is responsible for an occasional flood of alarms
2. Short-Term Fault Prediction: predicting which device will fail in the next 15 minutes
3. Long-Term Anomaly Detection: detecting unusual trends in the network

The system is used at British Telecom.
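To make the first scenario more concrete, here is a deliberately simple sketch: it keeps a sliding time window over the alarm stream and, when the window holds at least a threshold number of alarms, reports the device that raised the most of them. This frequency heuristic is only an assumption for illustration and is not the method used in the deployed system; the class name `AlarmFloodDetector`, its parameters and the device identifiers are hypothetical.

```python
from collections import Counter, deque


class AlarmFloodDetector:
    """Sliding-window alarm counter that names the noisiest device in a flood."""

    def __init__(self, window_seconds=60.0, flood_threshold=100):
        self.window_seconds = window_seconds
        self.flood_threshold = flood_threshold
        self.events = deque()        # (timestamp, device_id), oldest first
        self.per_device = Counter()  # alarms per device inside the window

    def add(self, timestamp, device_id):
        """Record one alarm; return the suspect device during a flood, else None."""
        self.events.append((timestamp, device_id))
        self.per_device[device_id] += 1
        # Evict alarms that have fallen out of the time window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_device = self.events.popleft()
            self.per_device[old_device] -= 1
            if self.per_device[old_device] == 0:
                del self.per_device[old_device]
        if len(self.events) >= self.flood_threshold:
            return self.per_device.most_common(1)[0][0]
        return None


detector = AlarmFloodDetector(window_seconds=10, flood_threshold=5)
stream = [(0, "a"), (1, "b"), (2, "b"), (3, "b"), (4, "b"), (20, "c")]
results = [detector.add(ts, dev) for ts, dev in stream]
print(results)
```

A realistic root-cause analysis would also use the network topology, since a failing device often triggers alarms on its neighbors rather than on itself.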

Presented in "Planetary-Scale Views on a Large Instant-Messaging Network" by Jure Leskovec and Eric Horvitz, WWW 2008

Observe social and communication phenomena at a planetary scale
Largest social network analyzed to date

Research questions:
How does communication change with user demographics (age, sex, language, country)?
How does geography affect communication?
What is the structure of the communication network?

We collected the data for June 2006
Log size: 150 GB/day (compressed)
Total: 1 month of communication data, 4.5 TB of compressed data
Activity over June 2006 (30 days):
245 million users logged in
180 million users engaged in conversations
17.5 million new accounts activated
More than 30 billion conversations
More than 255 billion exchanged messages

Count the number of users logging in from a particular location on the Earth
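One simple way to count logins per location is to snap each login's coordinates to a latitude/longitude grid cell and count per cell. The sketch below assumes each login already carries a (latitude, longitude) pair, for example from a GeoIP lookup; the function names, the one-degree cell size and the sample coordinates are assumptions for illustration.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_degrees=1.0):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))


def count_logins_by_cell(logins, cell_degrees=1.0):
    """Count logins per grid cell; `logins` is an iterable of (lat, lon) pairs."""
    return Counter(grid_cell(lat, lon, cell_degrees) for lat, lon in logins)


# Hypothetical sample: two logins near Ljubljana, one near Stavanger.
logins = [(46.05, 14.51), (46.9, 14.2), (58.97, 5.73)]
counts = count_logins_by_cell(logins)
print(counts)
```

Fixed-degree cells cover less ground area toward the poles, so a map built this way is a rough density view rather than an equal-area one.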

Logins from Europe

6 degrees of separation [Milgram 60s]
Average distance between two random users is 6.6
90% of nodes can be reached in < 8 hops

Hops  Nodes
1     10
2     78
3     396
4     8648
5     3299252
6     28395849
7     79059497
8     52995778
9     10321008
10    1955007
11    518410
12    149945
13    44616
14    13740
15    4476
16    1542
17    536
18    167
19    71
20    29
21    16
22    10
23    3
24    2
25    3
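A hops/nodes table of this kind can be produced by a breadth-first search from a starting user, counting how many users are first reached at each distance. The sketch below runs on a tiny made-up graph; the function name `hop_histogram` and the adjacency-dictionary representation are assumptions for illustration, and at the scale of the real network the search would need a distributed or out-of-core implementation.

```python
from collections import Counter, deque


def hop_histogram(graph, source):
    """Return {hops: number of nodes first reached at that distance from source}."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    # Leave out the source itself (distance 0).
    return Counter(d for d in distance.values() if d > 0)


# Hypothetical undirected toy graph stored as adjacency lists.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
histogram = hop_histogram(toy_graph, "a")
for hops in sorted(histogram):
    print(hops, histogram[hops])
```

Averaging such histograms over many randomly chosen start nodes gives an estimate of the typical distance between two users.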
