Lecture 1 - Data Science and Big Data
Final-Term Examination 30
Mid-Term Examination 10
Lab Work 10
Quizzes 10
Project 25
In-lab and lecture assignments 10
Lecture Attendance 5
Course Objectives
Upon completion of this course, you should be able to:
• Immediately participate and contribute as a data science team member on big data and other analytics projects by:
- Deploying a structured lifecycle approach to data science and big data analytics projects
- Reframing a business challenge as an analytics challenge
- Applying analytic techniques and tools to analyze big data, create statistical models, and identify insights that can lead to actionable results
- Selecting optimal visualization techniques to clearly communicate analytic insights to business sponsors and others
- Using programming environments such as Python tools, MapReduce/Hadoop, in-database analytics, and window and MADlib functions
• Explain how advanced analytics can be leveraged to create
competitive advantage and how the data scientist role and skills
differ from those of a traditional business intelligence analyst
Introduction to Big Data Analytics
Your Thoughts?
2. Processing Complexity
Changing data structures
Use cases warranting additional transformations and
analytical techniques
3. Data Structure
Greater variety of data structures for mining and analyzing
enabling parsing
Semi-Structured Data
View Source
http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
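The search URL above is semi-structured because its key=value pairs follow a discernible pattern that a program can parse. A minimal sketch in Python, assuming the standard-library urllib.parse module; the helper name parse_search_url is hypothetical and introduced only for illustration:

```python
from urllib.parse import urlsplit, parse_qs

# The search URL from the slide, rejoined into a single string.
url = (
    "http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t"
    "&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1"
    "&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl="
    "&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651"
)


def parse_search_url(u):
    """Return (host, fields) for a URL whose parameters follow the '#'."""
    parts = urlsplit(u)
    # In this URL the parameters sit after '#', so they are in the fragment,
    # not in the query string. Blank values such as 'gs_sm=' are kept.
    fields = parse_qs(parts.fragment, keep_blank_values=True)
    # Each key appears once here, so keep only the first value per key.
    return parts.netloc, {key: values[0] for key, values in fields.items()}


host, fields = parse_search_url(url)
print(host)          # the site that was queried
print(fields["q"])   # the current search term
print(fields["pq"])  # the previous search term
```

No schema was declared anywhere, yet the field names and values can still be recovered from the text pattern alone.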
Unstructured Data
The Red Wheelbarrow, by
William Carlos Williams
• Spreadsheets and low-volume DBs for recordkeeping
• Analyst dependent on data extracts

• Supports BI and reporting, but restricts robust analyses
• Analyst dependent on IT & DBAs for data access and schema changes
• Analysts must spend significant time to get extracts from multiple sources

• Enables high performance analytics using in-db processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA owned”
Discussion Questions
1. Discuss how the bank’s data would change under these circumstances.
2. How are their needs changing with these business changes?
3. What do you need to consider from an analyst point of view? What are
some things to consider implementing as the bank grows?
2. Desire to identify business risk: customer churn, fraud, default
3. Predict new business opportunities: upsell, cross-sell, best new customer prospects
4. Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending, Basel II
[Figure: analytic architecture diagram with four numbered stages. Recoverable labels: Data, Devices, Individual, Departmental Warehouse, “Spread Marts”, Enterprise, Departmental Applications, Prioritized Operational Processes, Reporting, Siloed Analytics, "Static schemas accrete over time".]
Analytic Sandbox
Data assets gathered from multiple sources and technologies for analysis
1. Speed of decision making
2. Throughput
Objectives
1) Using additional data sources,
dramatically improve the quality of the
loan underwriting process
2) Streamline the process to yield results in
less time
Directions
1) Suggest kinds of publicly available data
(big data) that you can leverage to
supplement the traditional lending
process
2) Suggest types of analysis you would
perform with the data to reduce the
bank’s risk and expedite the lending
process
Your Thoughts?
Copyright © 2011 EMC Corporation. All Rights Reserved. Module 1: Introduction to BDA
Profile of a Data Scientist
1. Health Care: Reducing Cost of Care
2. Public Services: Preventing Pandemics
3. Life Sciences: Genomic Mapping
4. IT Infrastructure: Unstructured Data Analysis
[Figure labels: Data Collectors: Medical, Government, Internet, Phone/TV, Retail]
https://www.datapine.com/blog/big-data-examples-in-healthcare/
1. Big Data Analytics: Healthcare
Use of Big Data:
• Dr. Jeffrey Brenner generated his own crime maps from the medical billing records of 3 hospitals
https://www.analyticssteps.com/blogs/big-data-public-sector-applications-and-benefits
3. Big Data Analytics: Life Sciences
Situation:
• Broad Institute (MIT & Harvard) mapping the Human Genome
https://www.propharmagroup.com/thought-leadership/big-data-life-science-industries
https://www.mdpi.com/2227-9717/10/1/41
4. Big Data Analytics: IT Infrastructure
Key Outcomes:
• New York Times used Hadoop to transform its entire public archive, from 1851 to 1922, into 11 million PDF files in 24 hrs
• Applications include social media, sentiment analysis, wartime chatter, and natural language processing
https://www.projectpro.io/article/sentiment-analysis-project-ideas-with-source-code/518
https://www.projectpro.io/article/text-mining-projects/755
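Hadoop jobs like the one above follow the MapReduce pattern: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group. The sketch below imitates that pattern in plain, single-process Python on two made-up sentences. It is not the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical:

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word.strip(".,;:!?\"'"), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, skipping empty keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        if key:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a dict of {doc_id: text}."""
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Made-up sample documents, for illustration only.
docs = {"a": "Big data, big insights.", "b": "Data drives insights"}
counts = word_count(docs)
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines; here they run in sequence only to show the data flow.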
5. Big Data Analytics: Online Services
Key Outcomes:
• LinkedIn Skills, InMaps, Job Recommendations, Recruiting
• Established a diverse data scientist group, as the founder believes this is the start of the Big Data revolution
https://www.hindawi.com/journals/jhe/2022/6967158/
https://github.com/topics/social-media-analytics
• Machine-generated data comes from sources such as sensors.
• Human-generated data refers to the vast
amount of social media data: status updates,
tweets, photos, and other media.
• Organization-generated data refers to more
traditional types of data, including transaction
information in databases and structured data
often stored in data warehouses.
• Note that big data can be either structured,
semi-structured, or unstructured.
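To make the three categories concrete, the sketch below reads one made-up customer example in each form: a CSV row (structured), a JSON record (semi-structured), and a free-text note (unstructured). All sample values and variable names are assumptions for illustration:

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured = "customer_id,balance\n42,1050.75\n"

# Semi-structured: self-describing keys, but nesting and fields may vary.
semi_structured = (
    '{"customer_id": 42, "tags": ["loan", "mortgage"], '
    '"notes": {"branch": "north"}}'
)

# Unstructured: free text with no inherent fields.
unstructured = (
    "Customer called about refinancing the mortgage and mentioned moving north."
)

# Structured data maps directly onto named columns.
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured data is parsed by following its embedded pattern.
record = json.loads(semi_structured)

# Unstructured data needs extra work; here, a naive keyword search.
keywords = [word for word in ("mortgage", "loan") if word in unstructured.lower()]

print(rows[0]["balance"])
print(record["notes"]["branch"])
print(keywords)
```

The amount of work needed before analysis grows from the first form to the third, which is why data structure matters for big data projects.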
Why can Big Data help?
People, Sensors, Organizations
Diverse Data Sources
• What makes this a Big Data problem?
- Because novel approaches and responses become possible
if we can integrate these many diverse data streams.
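One simple form of integration is merging time-ordered events from several sources into a single timeline. The sketch below assumes each source is already sorted by timestamp and uses heapq.merge from the standard library. The events, timestamps, and the integrate helper are made up for illustration:

```python
import heapq

# Made-up events as (timestamp, source, message); each list is sorted by time.
sensors = [(1, "sensor", "temperature spike"), (4, "sensor", "smoke detected")]
people = [(2, "people", "photo of smoke posted"), (5, "people", "road closure reported")]
organizations = [(3, "organization", "crew dispatched")]


def integrate(*streams):
    """Merge already-sorted event streams into one timeline ordered by timestamp."""
    return list(heapq.merge(*streams, key=lambda event: event[0]))


timeline = integrate(sensors, people, organizations)
for timestamp, source, message in timeline:
    print(timestamp, source, message)
```

Real streams would also need matching on location and cleaning of inconsistent formats; this sketch covers only the ordering step.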
Machine Data
Organizational Data
• A large share of the data on fires is
generated by the public on
social media sites such as
Twitter, which support photo
sharing.
People
• These phones and the apps we install
on them are a big source of big data.
2023
• One billion people log in in a single day.
• More than 30 billion pieces of content
are shared every month.
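The monthly figure is easier to grasp as a rate. A rough back-of-the-envelope sketch, assuming a 30-day month (an assumption, not stated on the slide) and treating 30 billion as a lower bound:

```python
# Lower bound from the slide: more than 30 billion pieces of content per month.
CONTENT_PER_MONTH = 30_000_000_000

# Assumption for this estimate: a month of 30 days.
DAYS_PER_MONTH = 30
SECONDS_PER_DAY = 24 * 60 * 60

per_day = CONTENT_PER_MONTH / DAYS_PER_MONTH
per_second = per_day / SECONDS_PER_DAY

print(f"{per_day:,.0f} pieces per day")
print(f"{per_second:,.0f} pieces per second")
```

Under these assumptions that is at least a billion pieces per day, or more than eleven thousand every second.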