
Lecture 1 - Data Science and Big Data

Uploaded by

abdulhady378

CSCI461: Introduction to Big Data

(Sunday 10:30 AM, Building 1, GUB1, Room 145)
(Sunday 2:30 PM, Building 1, GUB1, Room 114)

Course Instructors: Dr. Ibrahim Zaghloul

Course TA : Eng. Salma


Course Content
Week 1 (29/9): Course specs and Introduction to Big Data Analytics
Week 2 (13/10): Data Analytics Lifecycle
Week 3 (20/10): Overview of Methods (Including Pre-processing)
Week 4 (27/10): Text Analysis (Quiz #1)
Week 5 (3/11): Big Data Technology & NoSQL
Weeks 6,7 (9/11 to 17/11): Mid-Term Exam
Week 7 (17/11): Apache Hadoop & MapReduce & HDFS (Part 1)
Week 8 (24/11): Apache Spark
Week 9 (1/12): Apache Flink (Quiz #2)
Week 10 (8/12): Apache ZooKeeper & Apache Pig
Week 11 (15/12): Apache Hive & Apache HBase
Week 12 (22/12): Apache Sqoop & Apache Flume (Quiz #3)
Week 13 (29/12): Project Discussions
The Course Aim

The course introduces Big Data problems and the associated frameworks and technologies.
First, the course motivates the topic using real-world big data problems.
Second, it sheds light on handling big data, from data collection to monitoring, storage, analysis, and reporting.
The course also covers programming models used to develop scalable big data analyses.
It also introduces one of the most common Big Data frameworks, namely Hadoop, in addition to the MapReduce programming model.
Finally, it solves sample case studies using the covered Big Data analytics tools.
References
• EMC Education Services, Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, ISBN: 978-1-118-87613-8, Wiley; 1st edition (January 5, 2015).
Assessment

Final-Term Examination 30
Mid-Term Examination 10
Lab Work 10
Quizzes 10
Project 25
In-lab and lect. assignment 10
Lecture Attendance 5
Course Objectives
Upon completion of this course, you should be able to:
• Immediately participate and contribute as a data science team member on big data and other analytics projects by:
  - Deploying a structured lifecycle approach to data science and big data analytics projects
  - Reframing a business challenge as an analytics challenge
  - Applying analytic techniques and tools to analyze big data, create statistical models, and identify insights that can lead to actionable results
  - Selecting optimal visualization techniques to clearly communicate analytic insights to business sponsors and others
  - Using programming environments such as Python tools, MapReduce/Hadoop, in-database analytics, and window and MADlib functions
• Explain how advanced analytics can be leveraged to create competitive advantage and how the data scientist role and skills differ from those of a traditional business intelligence analyst



Module 1 – Introduction to Big Data Analytics

Introduction to Big Data Analytics

Big Data Overview

• Definition of big data
• Big data characteristics and considerations
• Unstructured data fueling big data analytics
• Analyst perspective on data repositories

Module 1: Introduction to BDA 9


Introduction to Big Data Analytics

Your Thoughts?

What is Big Data?

What makes data, “Big” Data?



Big Data Defined
• “Big Data” is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
  - Requires new data architectures, analytic sandboxes
  - New tools
  - New analytical methods
  - Integrating multiple skills into the new role of data scientist
• Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real-time capabilities
Source: McKinsey, May 2011 article, Big Data: The next frontier for innovation, competition, and productivity



Key Characteristics of Big Data
1. Data Volume
  - 44x increase from 2009 to 2020 (0.8 zettabytes to 35.2 zettabytes; 1 ZB = 10^21 bytes)

2. Processing Complexity
  - Changing data structures
  - Use cases warranting additional transformations and analytical techniques

3. Data Structure
  - Greater variety of data structures for mining and analyzing
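The volume figures above can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, which matches the 1 ZB = 10^21 bytes definition on the slide; the names `UNITS` and `to_bytes` are illustrative and not part of the course material.

```python
# Decimal (SI) storage units expressed in bytes.
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * UNITS[unit]


volume_2009_zb = 0.8   # figure quoted on the slide
volume_2020_zb = 35.2  # figure quoted on the slide

# Ratio between the two quoted volumes.
growth_factor = volume_2020_zb / volume_2009_zb

print(f"Growth factor: {growth_factor:.0f}x")
print(f"2020 volume: {to_bytes(volume_2020_zb, 'ZB'):.3e} bytes")
```

Dividing the two quoted volumes reproduces the 44x growth factor stated on the slide.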



Characteristics of Big Data
Data Volume



The Power of 10



Cloud Computing
Computing anywhere and anytime
• Cloud computing is one of the ways in which we can do computing anytime and anywhere.
• Cloud computing is also called on-demand computing.
• It enables us to run computations on large amounts of data.
[Figure labels: Data Torrent; Computing Anytime, Anywhere; Big Data Era]


Big Data Characteristics: Data Structures
Data Growth is Increasingly Unstructured

(Listed from most structured to least structured)

Structured
• Data containing a defined data type, format, and structure
• Example: Transaction data and OLAP

Semi-Structured
• Textual data files with a discernible pattern, enabling parsing
• Example: XML data files that are self-describing and defined by an XML schema

“Quasi” Structured
• Textual data with erratic data formats that can be formatted with effort, tools, and time
• Example: Web clickstream data that may contain some inconsistencies in data values and formats

Unstructured
• Data that has no inherent structure and is usually stored as different types of files
• Example: Text documents, PDFs, images, and video
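To make the semi-structured case concrete, the sketch below parses a small self-describing XML document into rows that could be loaded into a structured table. The XML content, tag names, and field names are invented for illustration; only the standard-library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: the tags describe the data they hold.
xml_text = """
<transactions>
  <transaction id="1"><customer>Ann</customer><amount>25.50</amount></transaction>
  <transaction id="2"><customer>Bob</customer><amount>10.00</amount></transaction>
</transactions>
""".strip()

root = ET.fromstring(xml_text)

# The discernible pattern lets us turn each element into a typed row.
rows = [
    {
        "id": int(t.get("id")),
        "customer": t.findtext("customer"),
        "amount": float(t.findtext("amount")),
    }
    for t in root.findall("transaction")
]

total = sum(row["amount"] for row in rows)
print(rows)
print(total)
```

Unstructured data such as free text or images has no such pattern to rely on, which is why it usually needs heavier processing before analysis.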



Four Main Types of Data Structures
Structured Data
[Figure: example of structured data]

Semi-Structured Data
[Figure: “View Source” listing of a web page]

Quasi-Structured Data
http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651

Unstructured Data
The Red Wheelbarrow, by William Carlos Williams
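The clickstream URL shown above is a typical quasi-structured record: it follows no published schema, yet it can be pulled apart with some effort. The sketch below uses Python's standard `urllib.parse` module. Reading `q` as the current search, `pq` as the previous search, and `oq` as the text typed so far is an assumption based on the values themselves, not a documented specification.

```python
from urllib.parse import urlsplit, parse_qs

# The URL from the slide, joined into a single string.
url = (
    "http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t"
    "&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1"
    "&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl="
    "&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651"
)

parts = urlsplit(url)

# In this URL the key=value pairs follow '#', so they sit in the fragment.
params = parse_qs(parts.fragment)

current_query = params["q"][0]    # assumed meaning: current search text
previous_query = params["pq"][0]  # assumed meaning: previous search text
typed_so_far = params["oq"][0]    # assumed meaning: text typed before completion

print(parts.netloc, current_query, previous_query, typed_so_far)
```

Note that `parse_qs` drops keys with empty values (such as `gs_sm`) by default, one of the small inconsistencies an analyst has to decide how to handle.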



Data Repositories, An Analyst Perspective
Data Islands (“Spreadmarts”): isolated data marts
• Spreadsheets and low-volume DBs for recordkeeping
• Analyst dependent on data extracts

Data Warehouses: centralized data containers in a purpose-built space
• Supports BI and reporting, but restricts robust analyses
• Analyst dependent on IT & DBAs for data access and schema changes
• Analysts must spend significant time to get extracts from multiple sources

Analytic Sandbox: data assets gathered from multiple sources and technologies for analysis
• Enables high performance analytics using in-db processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA owned”
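The difference between working from extracts and in-database processing can be sketched with Python's built-in `sqlite3` module. This is a toy illustration only: the `loans` table and its rows are invented, and a real analytic sandbox would run on a far larger database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (branch TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [("north", 100.0), ("north", 250.0), ("south", 75.0)],  # invented rows
)

# Extract style: pull every row out, then aggregate in application code.
extracted = conn.execute("SELECT branch, amount FROM loans").fetchall()
totals_from_extract = {}
for branch, amount in extracted:
    totals_from_extract[branch] = totals_from_extract.get(branch, 0.0) + amount

# In-database style: the engine aggregates and returns only the summary.
totals_in_db = dict(
    conn.execute("SELECT branch, SUM(amount) FROM loans GROUP BY branch")
)
conn.close()

print(totals_from_extract)
print(totals_in_db)
```

Both approaches give the same totals, but the second moves only one row per branch out of the database, which is the point of in-db processing when tables are large.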



Introduction to Big Data Analytics: Mini-Case Study
Yoyodyne Bank Scenario
• Evolving from small community bank to a global bank
• Needs to move away from its legacy mainframes to an environment that
supports more robust analytics
• Growing through mergers and acquisitions
• Subject to many new regulatory requirements
• Increasing customer base and increased product offerings
Your Thoughts?

Discussion Questions
1. Discuss how the bank’s data would change under these circumstances.
2. How are their needs changing with these business changes?
3. What do you need to consider from an analyst point of view? What are
some things to consider implementing as the bank grows?



Introduction to Big Data Analytics
State of the Practice in Analytics

• Business drivers for analytics
• Current analytical architecture
• Business intelligence vs. data science
• Drivers of big data and new big data ecosystem



Business Drivers for Analytics
Current Business Problems Provide Opportunities for Organizations to
Become More Analytical & Data Driven
Drivers and examples:
1. Desire to optimize business operations. Examples: sales, pricing, profitability, efficiency
2. Desire to identify business risk. Examples: customer churn, fraud, default
3. Predict new business opportunities. Examples: upsell, cross-sell, best new customer prospects
4. Comply with laws or regulatory requirements. Examples: Anti-Money Laundering, Fair Lending, Basel II



Analytical Approaches for Meeting Business Drivers
Business Intelligence vs. Data Science
Data Science (Predictive Analytics & Data Mining)
Typical Techniques & Data Types
• Optimization, predictive modeling, forecasting, statistical analysis
• Structured/unstructured data, many types of sources, very large data sets
Common Questions
• What if…?
• What’s the optimal scenario for our business?
• What will happen next? What if these trends continue? Why is this happening?

Business Intelligence
Typical Techniques & Data Types
• Standard and ad hoc reporting, dashboards, alerts, queries, details on demand
• Structured data, traditional sources, manageable data sets
Common Questions
• What happened last quarter?
• How many did we sell?
• Where is the problem? In which situations?

[Figure axes: business value (low to high) against time (past to future); Business Intelligence sits toward low/past and Data Science toward high/future]
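The contrast between the two kinds of question can be sketched on a tiny data set. The quarterly sales numbers below are invented, and the straight-line least-squares fit is just one simple stand-in for the predictive modeling mentioned above, not a recommended forecasting method.

```python
# Hypothetical quarterly unit sales, oldest first (invented numbers).
sales = [120, 135, 150, 165]

# BI-style question: "How many did we sell last quarter?"
last_quarter = sales[-1]

# Data-science-style question: "What if these trends continue?"
# Fit y = intercept + slope * x by ordinary least squares, x = quarter index.
n = len(sales)
xs = range(n)
mean_x = sum(xs) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sales)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# Extrapolate one step beyond the observed quarters.
next_quarter_forecast = intercept + slope * n

print(last_quarter)
print(next_quarter_forecast)
```

The first answer is a lookup over data that already exists; the second requires a model and comes with uncertainty, which is the shift from reporting on the past to reasoning about the future.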



A Typical Analytical Architecture
[Figure: a typical analytical architecture with four numbered stages. Component labels: Data Sources; Departmental Warehouse; “Spread Marts”; Enterprise Data Warehouse; Enterprise Applications; Reporting; Analytics. Annotations: Non-Agile Models; Static schemas accrete over time; Prioritized Operational Processes; Non-Prioritized Data Provisioning; Siloed Analytics; Errant data & marts]



Implications of Typical Architecture for Data Science

• High-value data is hard to reach and leverage
• Predictive analytics & data mining activities are last in line for data
  - Queued after prioritized operational processes
• Data is moving in batches from the Enterprise Data Warehouse (EDW) to local analytical tools
  - In-memory analytics (such as R, SAS, SPSS, Excel)
  - Sampling can skew model accuracy
• Isolated, ad hoc analytic projects, rather than centrally-managed harnessing of analytics
  - Non-standardized initiatives
  - Frequently, not aligned with corporate business goals
Result: slow “time-to-insight” & reduced business impact



Opportunities for a New Approach to Analytics
New Applications Driving Data Volume

[Figure: volume of information (small to large) by decade]
• 1990s (RDBMS & data warehouse): measured in terabytes (1 TB = 1,000 GB)
• 2000s (content & digital asset management): measured in petabytes (1 PB = 1,000 TB)
• 2010s (NoSQL & key/value): will be measured in exabytes (1 EB = 1,000 PB)
https://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Sets.html
Opportunities for a New Approach to Analytics
Big Data Ecosystem

1. Data Devices
2. Data Collectors: collect data from the device and users
3. Data Aggregators: make sense of the data collected from the various entities from the “SensorNet”
4. Data Users/Buyers
[Figure: entities shown include Individual, Analytic Services, Medical, Information Brokers, Advertising, Marketers, Employers, Law Enforcement, Government, Internet, Websites, Catalog Co-Ops, Media, Phone/TV, Retail, Private Investigators/Lawyers, Media Archives, Credit Bureaus, Financial, List Brokers, Delivery Service, Banks]



Considerations for Big Data Analytics
Criteria for Big Data Projects
1. Speed of decision making
2. Throughput
3. Analysis flexibility

New Analytic Architecture: Analytic Sandbox
Data assets gathered from multiple sources and technologies for analysis
• Enables high performance analytics using in-db processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA owned”



State of the Practice in Analytics: Mini-Case Study
Big Data Enabled Loan Processing at Yoyodyne

[Figure: underwriting risk level under traditional underwriting (traditional data leveraged) compared with big-data-enabled underwriting (big data leveraged)]

Your Thoughts?

Objectives
1) Using additional data sources, dramatically improve the quality of the loan underwriting process
2) Streamline the process to yield results in less time
Directions
1) Suggest kinds of publicly available data (big data) that you can leverage to supplement the traditional lending process
2) Suggest types of analysis you would perform with the data to reduce the bank’s risk and expedite the lending process



Introduction to Big Data Analytics
The Data Scientist
During this lesson the following topics are covered:
• Key Roles of the New Big Data Ecosystem
• Profile of a Data Scientist



Skills Needed In the New Data Ecosystem

Your Thoughts?

• What new skill sets do you need to take advantage of the big data sets in the loan processing improvement case study?
• Do most large organizations have people with these skill sets?
• If so, who are they?



Three Key Roles of the New Data Ecosystem

Roles and role descriptions:
• Deep Analytical Talent: People with advanced training in quantitative disciplines, such as mathematics, statistics, and machine learning.
• Data Savvy Professionals: People with a basic knowledge of statistics and/or machine learning, who can define key questions that can be answered using advanced analytics.
• Technology & Data Enablers: People providing technical expertise to support analytical projects. Skill sets include computer programming and database administration.
Note: Figures above reflect a projected talent gap in US in 2018, as shown in McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity



Roles Needed for Analytical Projects
Data Scientist Key Activities
• Reframe business challenges as analytics challenges
• Design, implement and deploy statistical models and data mining techniques on big data
• Create insights that lead to actionable recommendations

[Figure: roles shown are Data Scientists, Data Engineers, Data Analyst, BI Analyst, Line of Business (LOB) User, and Data Platform Admin; layers shown are Analytic Productivity Platform, Tools & Services, Data Access & Query, and Cloud Infrastructure]

Copyright © 2011 EMC Corporation. All Rights Reserved.
Profile of a Data Scientist



Introduction to Big Data Analytics

Example Applications of Big Data Analytics



Applications:
What makes Big Data valuable
• Big data is now being generated all around us.
• Applications of Big Data: it is the way in which big data can serve human needs that makes it valued. Examples:
  - How companies market themselves and sell products.
  - How human resources are managed.
  - How disasters are responded to.
  - Many other applications in which evidence-based data is used to influence decisions.
[Figure labels: Big Data; Better Models; Higher Precision]
Big Data Analytics: Industry Examples

1. Health Care
• Reducing Cost of Care
2. Public Services
• Preventing Pandemics
3. Life Sciences
• Genomic Mapping
4. IT Infrastructure
• Unstructured Data Analysis
5. Online Services
• Social Media for Professionals



1
Big Data Analytics: Healthcare

Situation
• Poor police response and problems with medical care, triggered by the shooting of a Rutgers student
• The event drove a local doctor to map crime data and examine local health care

Use of Big Data
• Dr. Jeffrey Brenner generated his own crime maps from the medical billing records of 3 hospitals

Key Outcomes
• City hospitals & emergency rooms provided expensive but low-quality care
• Reduced hospital costs by 56% by realizing that 80% of the city’s medical costs came from 13% of its residents, mainly low-income or elderly
• Now offers preventative care over the phone or through home visits
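The key finding above is a cost-concentration measure: what share of total cost comes from the most expensive fraction of residents. A minimal sketch of that calculation follows. The function name `cost_share_of_top` and the per-patient costs are invented for illustration and are not the actual billing data.

```python
def cost_share_of_top(costs, top_fraction):
    """Share of total cost from the most expensive `top_fraction` of patients."""
    ranked = sorted(costs, reverse=True)
    # Always include at least one patient in the "top" group.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Invented per-patient annual costs, for illustration only.
costs = [9000, 7000, 500, 400, 300, 300, 200, 150, 100, 50]

share = cost_share_of_top(costs, 0.2)
print(f"Top 20% of patients account for {share:.0%} of cost")
```

Run against real billing records, a measure like this is what would point to a small group of residents for targeted preventative care.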

https://www.datapine.com/blog/big-data-examples-in-healthcare/


2
Big Data Analytics: Public Services

Situation
• Threat of global pandemics has increased exponentially
• Pandemics spread at faster rates and are more resistant to antibiotics

Use of Big Data
• Created a network of viral listening posts
• Combines data from viral discovery in the field, research in disease hotspots, and social media trends
• Using Big Data to make accurate predictions on the spread of new pandemics

Key Outcomes
• Identified a fifth form of human malaria, including its origin
• Identified why efforts failed to control swine flu
• Proposing more proactive approaches to preventing outbreaks

https://www.analyticssteps.com/blogs/big-data-public-sector-applications-and-benefits
3
Big Data Analytics: Life Sciences

Situation
• Broad Institute (MIT & Harvard) mapping the Human Genome

Use of Big Data
• In 13 yrs, mapped 3 billion genetic base pairs; 8 petabytes
• Developed 30+ software packages, now shared publicly, along with the genomic data

Key Outcomes
• Using genetic mappings to identify cellular mutations causing cancer and other serious diseases
• Innovating how genomic research informs new pharmaceutical drugs

https://www.propharmagroup.com/thought-leadership/big-data-life-science-industries
https://www.mdpi.com/2227-9717/10/1/41
4
Big Data Analytics: IT Infrastructure

Situation
• Explosion of unstructured data required new technology to analyze it quickly and efficiently

Use of Big Data
• Doug Cutting created Hadoop to divide large processing tasks into smaller tasks across many computers
• Analyzes social media data generated by hundreds of thousands of users

Key Outcomes
• New York Times used Hadoop to transform its entire public archive, from 1851 to 1922, into 11 million PDF files in 24 hrs
• Applications range from social media, sentiment analysis, wartime chatter, natural language processing
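The idea of dividing a large task into smaller ones can be sketched with the classic word-count example in the MapReduce style. The code below is a single-process simulation in plain Python, not Hadoop's actual API; the function names and sample documents are illustrative. In a real cluster each input split would be mapped on a different machine and the framework would perform the grouping step.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Each string stands in for one input split (invented text).
documents = ["big data needs new tools", "new tools for big data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` depends only on its own split and each call to `reduce_phase` only on one key, both phases can be spread across many computers.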

https://www.projectpro.io/article/sentiment-analysis-project-ideas-with-source-code/518
https://www.projectpro.io/article/text-mining-projects/755
5
Big Data Analytics: Online Services

Situation
• Opportunity to create social media space for professionals

Use of Big Data
• Collects and analyzes data from over 100 million users
• Adding 1 million new users per week

Key Outcomes
• LinkedIn Skills, InMaps, Job Recommendations, Recruiting
• Established a diverse data scientist group, as founder believes this is the start of Big Data revolution

https://www.hindawi.com/journals/jhe/2022/6967158/
https://github.com/topics/social-media-analytics
• Machine-generated data comes from sources such as sensors.
• Human-generated data refers to the vast amount of social media data: status updates, tweets, photos, and other media.
• Organization-generated data refers to more traditional types of data, including transaction information in databases and structured data often stored in data warehouses.

• Note that big data can be structured, semi-structured, or unstructured.

Why can Big Data help?

People Sensors

Organizations

Diverse Data Sources
• What makes this a Big Data problem?
- Because novel approaches and responses can be taken if we can integrate this many diverse data streams.

• Disaster management today is a dynamic system which integrates:
- real-time sensor networks,
- satellite imagery,
- near-real-time data management tools,
- wildfire simulation tools,
- connectivity to emergency command centers

Machine Data

• Sensor data streaming in from weather stations and satellites:
- temperature,
- humidity, and
- air pressure.
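A stream of sensor readings like the one described above is usually processed one record at a time rather than loaded all at once. The sketch below models a reading and keeps a running mean of temperature. The `Reading` fields, the station name, and the values are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One hypothetical weather-station record."""
    station: str
    temperature_c: float
    humidity_pct: float
    pressure_hpa: float


def running_mean(values):
    """Yield the mean of all values seen so far, one result per input."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


# Invented readings standing in for a live stream.
stream = [
    Reading("st-1", 30.0, 20.0, 1010.0),
    Reading("st-1", 32.0, 18.0, 1009.0),
    Reading("st-1", 34.0, 15.0, 1008.0),
]

means = list(running_mean(r.temperature_c for r in stream))
print(means)
```

Because `running_mean` is a generator, it never needs the whole stream in memory, which matters when readings arrive continuously.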

Organizational Data

• A supercomputer can generate data related to wildfire modeling.
• These include past and current fire perimeter and fuel maps such as vegetation in a fire's path.

People

• A huge part of the data on fires is generated by the public on social media sites such as Twitter, which support photo sharing.

• These phones and the apps we install on them are a big source of big data.

2023
• One billion people log in in a single day,
• More than 30 billion pieces of content are shared every month.

McKinsey Report (2013)


5
Big Data Analytics: Extra cases (Football sports)

