+
Introduction to Data Science
Assoc. Prof. Peerapon Vateekul, Ph.D.
Department of Computer Engineering,
Faculty of Engineering, Chulalongkorn University
* Parts of these slides are adapted from slides by Prof. Natawut
[email protected] www.cp.eng.chula.ac.th/~peerapon/
+ 2
Outline
• Introduction
  • Data is important
  • Data Science Definition by Dr. Virote
  • Data Science Definition by Aj. Natawut
• Big Data
• Data Science Process & Data Science Trend
+ 3
Introduction
+ 4
Data is important (in 2017)
• Alphabet (Google’s parent company), Amazon, Apple, Facebook and Microsoft
  • $25bn in combined net profit in the first quarter of 2017
• Amazon captures half of all dollars spent online in America.
• Google and Facebook accounted for almost all the revenue growth in digital advertising in America last year (2016)
https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data
+ 5
Data is important (in 2018)! (cont.)
Data Science (AI, ML, DM)
https://www.epmag.com/new-oil-1720651
+ 6
Who analyzes these data?
+ 7
What is Data Science?
• Data
  • Facts and statistics collected for reference or analysis
• Science
  • A systematic study through observation and experiment
• Data Science
  • The scientific exploration of data to extract meaning or insight, and the construction of software to utilize such insight in a business context.
[Figure: Data Preparation, Data Analysis, Data Visualization, Data Product]
+ 8
What is Data Science? (cont.)
1. Transform data into valuable insights
2. Transform data into data products
3. Transform data into interesting stories
Code Mania 2 (01), Jan-2015
+ 9
1) Transform data into valuable insights
+ 10
1) Transform data into valuable insights (cont.)
Code Mania 2, Jan-2015
http://nypost.com/2016/12/05/amazon-introduces-next-major-job-killer-to-face-americans/
+ 11
2) Transform data into data products
+ 12
3) Transform data into interesting stories
Consumer Price Index (CPI) - Inflation
http://www.thebillionpricesproject.com/
+ 13
+ 14
https://www.hbs.edu/faculty/Publication%20Files/BPP_JEP_m_13b5e009-4162-4f2c-b507-593a9a98c082.pdf
+ 15
Google Flu Trends
Ginsberg, Jeremy; Mohebbi, Matthew H.; Patel, Rajan S.; Brammer, Lynnette;
Smolinski, Mark S.; Brilliant, Larry (19 February 2009). "Detecting influenza
epidemics using search engine query data". Nature. 457 (7232): 1012–1014.
+ 16
What are they using data science for?
1. Measurement
2. Insights
3. Data Products
+ 17
1) Measurement
• To make a decision based on data
  • Also known as benchmarking
• Turning qualitative information into quantitative values
  • Usually called metrics or indicators
• Direct and indirect measurement
+ 18
Why do we need to measure?
• Comparison between alternatives (make a selection)
  • Choosing which notebook to buy
• Comparison after improvement or tuning
  • Should I add memory to my notebook?
• A/B Testing (split testing)
  • Let the actual users decide their preferences
  • Very popular for UI design
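One common way to judge an A/B test is a two-proportion z-test on the conversion rates of the two variants. The sketch below is a minimal illustration using only the Python standard library. The function name `ab_test` and the visitor and conversion counts are hypothetical, chosen only to show the arithmetic.

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test comparing conversion rates of variants A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_value

# Hypothetical traffic: 1,000 users saw each design.
p_a, p_b, z, p_value = ab_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```

A small p-value (conventionally below 0.05) suggests the difference between the two designs is unlikely to be due to chance alone, which is what lets the actual users "decide".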
+ 19
A/B Testing
Source: https://vwo.com/ab-testing/
+ 20
Example: SimCity
Source: https://blog.optimizely.com/2015/06/04/ecommerce-conversion-optimization-case-studies/
+ 21
Example: SmartWool
Source: https://blog.optimizely.com/2015/06/04/ecommerce-conversion-optimization-case-studies/
+ 22
2) Insights
https://blogs.scientificamerican.com/guest-blog/9-bizarre-and-surprising-insights-from-data-science/
• A good understanding of user behavior can lead to new product development or improvements to existing products
• Walmart -- Pop-Tarts before a hurricane
  • Before a hurricane, strawberry Pop-Tart sales increased about sevenfold
• Financial startup -- Typing with proper capitalization indicates creditworthiness
  • Online loan applicants who complete the application form with correct capitalization are more dependable debtors
• Starbucks uses customer purchase information from the My Starbucks mobile app to figure out new products
+ 23
Example: Tracing Traffic
+ 24
Example: Tracing Traffic
+ 25
GPS Average Speed
[Figure panels: 6:00-10:00, 10:00-15:00, 15:00-18:00]
+ 26
Bus Drivers’ Behaviors
[Figure panels: Bus A and Bus B on 07/03/2016 ~17:00, 10/03/2016 ~09:00, and 10/03/2016 ~17:00]
+ 27
3) Data Products
• An application or system that uses data to provide “intelligent” products or services, which in turn create more data that can be used further
• Machine learning plays an important role in building great data products
+ 28
Machine Learning Classification
• Identify which of a set of categories a new observation belongs to
  • Examples: spam filtering, customer churn prediction, complaint classification
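To make the idea of classification concrete, here is a minimal sketch of a spam filter based on naive Bayes with add-one smoothing, written with the standard library only. Naive Bayes is just one possible classifier, picked here as an assumption for illustration. The function names `train` and `predict` and the four training messages are hypothetical.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label). Returns word counts per label and label counts."""
    word_counts = {}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Return the label with the highest naive Bayes score for the given text."""
    word_counts, label_counts = model
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Log prior plus log likelihood of each word, with add-one smoothing
        # so that unseen words do not produce a zero probability.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled history used for training.
examples = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch tomorrow with the team", "ham"),
]
model = train(examples)
print(predict(model, "claim your free prize"))
print(predict(model, "team meeting tomorrow"))
```

The same two-step shape (train on historical data, then predict for new observations) applies to churn prediction and complaint classification; only the features and labels change.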
+ 29
Example: Students Grade Prediction
+ 30
[Figure: Historical Data → Training → Model; Current Students → Predicting → Predicted Outcomes; labels: "OS × Data Struct × Prog", ">7", "9"]
+ 31
Example: Amazon Recommendation
• Amazon sells 480M products (485k new products per day)
• Uses recommendation systems to bring products to customers
• Analyzes data from 300M customers
  • Purchase history
  • Reviews / Ratings
  • Search history
  • Views
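A very simple way to turn purchase history into recommendations is to count which items are bought together. The sketch below illustrates that idea only; it is not a description of Amazon's actual system. The function names `build_co_purchases` and `recommend` and the sample baskets are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_co_purchases(baskets):
    """Count how often each pair of items appears in the same basket."""
    co = defaultdict(Counter)
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co

def recommend(co, item, k=2):
    """Return up to k items most often bought together with the given item."""
    return [other for other, _ in co[item].most_common(k)]

# Hypothetical purchase history, one list per order.
baskets = [
    ["laptop", "mouse", "laptop bag"],
    ["laptop", "mouse"],
    ["laptop", "mouse", "keyboard"],
    ["keyboard", "mouse"],
]
co = build_co_purchases(baskets)
print(recommend(co, "laptop", k=1))
```

Production systems also use the other signals listed above (ratings, searches, views) and must scale to hundreds of millions of customers, which is where the big data techniques later in the deck come in.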
+ 32
Case study: Alibaba Fraud Detection
Source: http://www.sciencedirect.com/science/article/pii/S2405918815000021
+ 33
Case study: Predictive Policing
Used by 60 cities in the US, e.g. Atlanta and LA
Source: http://www.forbes.com/sites/ellenhuet/2015/02/11/predpol-predictive-policing
+ 34
Drew Conway’s Data Science Venn diagram (Skills)
Drew Conway’s Venn diagram of data science, 2010
[Figure: Data Preparation, Data Analysis, Data Visualization, Data Product]
Chula Data Science
+ 35
https://odsc.medium.com/40-must-know-data-science-skills-and-frameworks-for-2023-582fef0bc3fa
+ 36
Big Data
+ 37
Big Data Explosion
What Happens in an Internet Minute? (Source: Intel)
• 204 million emails sent
• 47,000 app downloads
• $83,000 in sales
• 20 million photo views
• 3,000 photo uploads
• 61,141 hours of music
• 320 new Twitter accounts
• 100,000 new tweets
• 1,300 new mobile users
• 100+ new LinkedIn accounts
• 277,000 logins
• 6 million Facebook views
• 2+ million search queries
• 30 hours of video uploaded
• 1.3 million video views
+ 38
https://www.ibmbigdatahub.com/infographic/four-vs-big-data
+ 39
Now 42 V’s of Big Data
42 V’s?!?
+ 40
Big Data Driver: Internal + External Data
+ 41
https://owletcare.com/
+ 42
https://findair.eu/#Produkt
+ 43
Big Data Analytics
• It is the process of examining Big Data to uncover useful information and knowledge.
• More data means better decisions!
Big Challenges: External Data, Unstructured Data
+ 44
Big Data Challenges
Same tasks, but much more difficult!
+ 45
Big Data Solution
INFRASTRUCTURE ALGORITHM
+ 46
Big Data Solution (cont.)
Scale-out Infrastructure
• Vertical Scaling (Scale-up)
• Horizontal Scaling (Scale-out)
+ 47
Big Data Solution (cont.)
In-memory & Distributed Computing
Resilient Distributed Datasets (RDD)
[Figure: COM 1, COM 2, COM 3, COM 4, COM …, each with its own RAM]
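The idea behind in-memory distributed computing can be sketched as: split the data into partitions, let each computer work on its own partition in its own RAM, then combine the small partial results. The example below only imitates this on one machine, with threads standing in for COM 1 to COM 4. It does not use Spark or the real RDD API, and the function names are hypothetical.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(data, num_parts):
    """Split data into num_parts roughly equal chunks, one per worker."""
    return [data[i::num_parts] for i in range(num_parts)]

def partial_sum_of_squares(chunk):
    # "Map" step: work done locally by one worker on its in-memory chunk.
    return sum(x * x for x in chunk)

def distributed_sum_of_squares(data, num_workers=4):
    chunks = partition(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    # "Reduce" step: combine the small per-worker results into one answer.
    return reduce(operator.add, partials, 0)

total = distributed_sum_of_squares(list(range(1, 11)))
print(total)
```

Scaling out then means adding more workers (more partitions) rather than buying one bigger machine, which matches the horizontal scaling idea on the previous slide.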
+ 48
SQL
NoSQL
Python
Hadoop
Spark
+ 49
https://blog.datath.com/data-engineer-guide/
Data Scientist + ML Engineer
+ 50
https://vocal.media/education/data-scientist-vs-data-engineer-vs-ml-engineer-vs-ml-ops-engineer
+ 51
MLOps = ML + DEV + OPS
+ 52
Data Science Process
+ 53
Data Science Process
Dr. Virote
1. Transform data into valuable insights
2. Transform data into data products
3. Transform data into interesting stories
Aj. Natawut
1. Measurement (decision)
2. Insights (knowledge)
3. Data Products (innovation, intelligence)
+ 54
Data Analytics (Data Science)
+ 55
Types of Data Science Projects
Valuable insights
• Data visualization
• Analytical skills & storytelling
• Infographic
Advanced analytics
• AI/Machine Learning/Deep Learning
• Prediction, Forecasting, Clustering, etc.
+ 56
+ 57
+ 58
https://dataforest.ai/blog/best-business-intelligence-tool-of-2023-top-16-bi-tools-by-dataforest
+ 59
• 1) Rule-based AI
• 2) Machine Learning (ML)
https://mc.ai/machine-learning-basics-artificial-intelligence-machine-learning-and-deep-learning/
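The difference between the two approaches can be shown with a toy pass/fail decision. In the rule-based version a person writes the cutoff by hand; in the machine learning version the cutoff is chosen from labeled examples. This is a minimal sketch: the functions, the hand-picked cutoff of 5, and the study-hours data are all hypothetical.

```python
def rule_based(hours_studied):
    # Rule-based AI: the cutoff (5 hours) is decided by a person.
    return "pass" if hours_studied >= 5 else "fail"

def learn_threshold(examples):
    """Machine learning: pick the cutoff that misclassifies the fewest examples."""
    candidates = sorted({hours for hours, _ in examples})

    def errors(cut):
        # Count examples where "hours >= cut" disagrees with the true label.
        return sum((hours >= cut) != (label == "pass") for hours, label in examples)

    return min(candidates, key=errors)

# Hypothetical historical data: (hours studied, outcome).
examples = [(1, "fail"), (2, "fail"), (3, "pass"), (4, "pass"), (6, "pass")]
learned_cut = learn_threshold(examples)

def learned_model(hours_studied):
    return "pass" if hours_studied >= learned_cut else "fail"

print(rule_based(3), learned_model(3))
```

With these examples the hand-written rule and the learned model disagree for a student who studied 3 hours, because the data suggest a lower cutoff than the one the rule author guessed.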
+ 60
Machine Learning (ML)
+ 61
https://www.gartner.com/en/articles/gartner-top-10-strategic-technology-trends-for-2024
+ 62
Data Trend in 2024 (cont.)
• AI (AI everywhere & Gen AI) is the key component.
• Knowledge without action (Platform Engineering) is meaningless.
• Cloud technology is the modern infrastructure.
+ 63
Vit Niennattrakul, Ph.D.
+ 64
+ 65
AWS Academy Service
AWS Academy Learner Lab
• Amazon API Gateway • AWS Cost and Usage Report • Amazon Forecast • AWS OpsWorks
• AWS App Mesh • AWS Cost Explorer • AWS Glue • Amazon Personalize
• Application Auto Scaling • AWS Data Pipeline • AWS Glue DataBrew • Amazon QuickSight
• AWS AppSync • AWS DeepComposer • Amazon GuardDuty • Amazon Redshift
• Amazon Athena • AWS DeepLens • AWS Health • Amazon Relational Database Service (RDS)
• Amazon Aurora • AWS DeepRacer • AWS Identity and Access Management (IAM) • AWS Resource Groups & Tag Editor
• AWS Backup • AWS Directory Service • AWS IAM Access Analyzer • AWS RoboMaker
• AWS Certificate Manager (ACM) • Amazon EC2 Auto Scaling • Amazon Inspector • Amazon Route 53
• AWS Batch • AWS Elastic Beanstalk • AWS IoT 1-Click • AWS Secrets Manager
• AWS Cloud9 • Amazon Elastic Block Store (EBS) • AWS IoT Analytics • AWS Security Hub
• AWS CloudFormation • Amazon Elastic Container Registry (ECR) • AWS IoT Core • AWS Security Token Service (STS)
• Amazon CloudFront • Amazon Elastic Container Service (ECS) • AWS IoT Greengrass • AWS Serverless Application Repository (SAR)
• Amazon CloudSearch • Amazon Elastic File System (EFS) • Amazon Kendra • AWS Service Catalog
• AWS CloudShell • Amazon Elastic Inference • AWS Key Management Service (KMS) • Amazon Simple Notification Service (SNS)
• AWS CloudTrail • Amazon Elastic Kubernetes Service (EKS) • Amazon Kinesis • Amazon Simple Queue Service (SQS)
• Amazon CloudWatch • Elastic Load Balancing (ELB) • Amazon Lex • Amazon Simple Storage Service (S3)
• AWS CodeCommit • Amazon Elastic MapReduce (EMR) • Amazon Machine Learning (Amazon ML) • Amazon Simple Storage Service Glacier (S3 Glacier)
• AWS CodeDeploy • Amazon ElastiCache • AWS Marketplace Subscriptions • Amazon Simple Workflow Service (SWF)
• Amazon CodeWhisperer • Amazon EventBridge • AWS Mobile Hub • AWS Step Functions
• AWS Config • AWS Fargate • Amazon Neptune • AWS Storage Gateway
• AWS Systems Manager (SSM) • Amazon Timestream • Amazon Virtual Private Cloud (Amazon VPC) • AWS Well-Architected Tool
• Amazon Textract • AWS Trusted Advisor • AWS WAF - Web Application Firewall • AWS X-Ray
AWS Academy Lab Project - Cloud Data Pipeline Builder Both Learner Lab & Lab Project - Cloud Data Pipeline Builder
• Amazon Managed Streaming for Apache Kafka (Amazon MSK) • Amazon SageMaker
• Amazon Elastic Compute Cloud (EC2)
• Amazon DynamoDB
• AWS Lambda
• Amazon Kinesis Video Streams
• Amazon Rekognition
https://awsacademy.instructure.com/login/canvas
+ 67
Conclusion
1) Data Analytics Module (AI/ML)
2) Data Engineering Module (Data Pipeline)
3) Data Visualization Module
4) Cloud technology
+ 68
Any questions? :)