9/22/2020
INTRODUCTION TO DATA MINING
UNIT # 1
SPRING 2020 Sajjad Haider 1
TODAY’S AGENDA
Course management
Brief overview of Data Mining and allied fields
Summary of a few impactful articles and recent trends
COURSE MANAGEMENT
LEARNING OBJECTIVES
Learn the art of modeling and interpreting large complicated data sets
via predictive and descriptive data mining methods.
Get to know several online data repositories and how to participate in
data analytics competitions held at [Link] and other sites
Gain advanced-level expertise in data analytics software and languages
such as KNIME and Python.
COURSE OVERVIEW
Data Preparation
Classification Techniques
Clustering
Text Analytics
Regression Analysis
Principal Component Analysis
Association Rule Mining
SOFTWARE AND DATA REPOSITORIES
KNIME
Python
Data on Kaggle Website
[Link]
BOOKS
Data Mining for Business Analytics: Concepts, Techniques and
Applications in R (2017)
Data Mining and Data Warehousing: Principles and Practical Techniques
(2019)
Learning Data Mining with Python (2017)
Data Mining: Practical Machine Learning Tools and Techniques by Witten
and Frank (2016)
ACKNOWLEDGEMENT
Although I am not following the two books below closely, their
slides are still very popular in academia, and I will use them
occasionally:
Data Mining: Concepts and Techniques (2011)
Introduction to Data Mining (2018)
(TENTATIVE) MARKS DISTRIBUTION
Final 40
Project 15
Assignments + Quizzes 45
MEETING HOURS
Office Hours:
Monday/Wednesday: noon – 1 PM and 4 – 5 PM
or by appointment (by e-mailing me at sahaider@[Link]).
Note: I DO NOT entertain SMS/WhatsApp messages. E-mail is the
official medium of correspondence.
OVERVIEW OF DATA MINING AND ALLIED FIELDS
APPLICATIONS OF DATA MINING/MACHINE LEARNING
Traffic Predictions
Google Maps
Online Transportation Networks
Uber/Careem for price prediction
Video Surveillance
Crime detection
Fraud Detection
Financial institutions
APPLICATIONS OF DATA MINING/MACHINE LEARNING (CONT’D)
Social Media Services
Face recognition by Facebook
Hate speech detection by Facebook/Twitter
Inappropriate content detection by YouTube
Emails
Product Recommendation
Amazon, YouTube, and others
Machine Translation
Autonomous Vehicles
MACHINE LEARNING
“A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E.”
(Tom Mitchell, 1997)
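The roles of T, P, and E in this definition can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the task (classifying a number against a hidden rule x >= 5), the hand-picked training values, and the very simple learner that places a threshold midway between the two class means.

```python
def fit_threshold(examples):
    """Learn a decision threshold as the midpoint between the two class means."""
    zeros = [x for x, label in examples if label == 0]
    ones = [x for x, label in examples if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, test_set):
    """Performance measure P: fraction of test points classified correctly."""
    correct = sum((x >= threshold) == bool(label) for x, label in test_set)
    return correct / len(test_set)


# Task T: decide whether x belongs to class 1 (hypothetical true rule: x >= 5).
test_set = [(i + 0.5, int(i + 0.5 >= 5)) for i in range(10)]

# Experience E: a stream of labelled examples (toy values, chosen by hand).
experience = [(0.0, 0), (6.0, 1), (4.0, 0), (8.0, 1),
              (3.0, 0), (8.0, 1), (3.0, 0), (8.0, 1)]

# Measure P after the learner has seen 2, 4, and 8 examples.
scores = []
for n in (2, 4, 8):
    threshold = fit_threshold(experience[:n])
    scores.append(accuracy(threshold, test_set))
    print(f"after {n} examples: threshold={threshold:.2f}, accuracy={scores[-1]:.1f}")
```

On this hand-built data the accuracy rises as more examples are seen, which is what the definition calls learning. Real data will not always improve this smoothly.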
A SIMPLIFIED TAXONOMY
Data Science > Data Analytics > Data Mining > Machine Learning
Data Analytics also deals with Visualization
Data Science also deals with data acquisition and data management
Besides machine learning, data mining also makes use of statistical models
Because of a significant overlap and due to the popularity of different terms in
different communities, the boundaries of these terms are not as crisp as
shown in this slide.
DATA MINING
Data mining is the process of automated discovery of previously unknown
patterns in large volumes of data.
This data is usually the historical data of an organization, stored in
what is known as a data warehouse.
Data mining deals with large volumes of data, in gigabytes or terabytes,
and sometimes as much as zettabytes (in the case of big data).
Patterns must be valid, novel, useful and understandable.
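As a small illustration of automated pattern discovery, the sketch below counts item pairs that occur together in a toy set of shopping baskets and keeps those above a minimum support. The baskets, the 0.6 threshold, and the variable names are assumptions for illustration only. This is a brute-force pair count, not a full association rule mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Toy market-basket data (hypothetical transactions).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# A pair counts as a pattern only if it appears in at least 60% of baskets.
min_support = 0.6

# Count every unordered pair of items that occurs in the same basket.
pair_counts = Counter()
for basket in transactions:
    pair_counts.update(combinations(sorted(basket), 2))

# Keep the pairs whose support (fraction of baskets) meets the threshold.
frequent_pairs = {
    pair: count / len(transactions)
    for pair, count in pair_counts.items()
    if count / len(transactions) >= min_support
}

for pair, support in sorted(frequent_pairs.items()):
    print(pair, f"support={support:.1f}")
```

The support threshold acts as a rough check on validity. Whether a discovered pair is also novel or useful still needs human judgement.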
DATA MINING LIFE CYCLE (CRISP-DM)
[Figure: CRISP-DM life cycle diagram, annotated with: 1. Statistical models, 2. Machine learning]
SUMMARY OF A FEW ARTICLES
HBR ARTICLE (CONT’D)
Data scientists are the people who understand how to fish out answers
to important business questions from today’s tsunami of unstructured
information.
As companies rush to capitalize on the potential of big data, the largest
constraint many face is the scarcity of this special talent.
BIG DATA: THE NEXT FRONTIER FOR INNOVATION (MCKINSEY 2011)
Big data refers to datasets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyze.
The demand for deep analytical positions in a big data world could exceed the
supply being produced on current trends by 140K to 190K positions.
There is also a need for 1.5 million additional managers and analysts in the US
who can ask the right questions and consume the results of the analysis
of big data effectively.
WHAT IS BIG DATA?
There is no consensus on how to define big data.
“Big data exceeds the reach of commonly used hardware environments and
software tools to capture, manage, and process it within a tolerable elapsed
time for its user population.” - Teradata Magazine article, 2011
“Big data refers to data sets whose size is beyond the ability of
typical database software tools to capture, store, manage and analyze.”
- The McKinsey Global Institute, 2011
One reasonable definition is that it’s data which can’t comfortably be
processed on a single machine.
3 V’S
Doug Laney was the first to describe the 3 V's of big data management:
Volume: there is more data than ever before, and its size keeps growing, but the percentage of data that our tools can process does not
Variety: there are many different types of data, such as text, sensor data, audio, video, graphs, and more
Velocity: data arrives continuously as streams, and we are interested in obtaining useful information from it in real time
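The velocity point implies that a summary has to be updated as each record arrives instead of after the full dataset is stored. A minimal sketch of this idea is shown below, using Welford's one-pass method for a running mean and variance. The `RunningStats` class and the sample sensor readings are hypothetical names and values chosen for illustration.

```python
class RunningStats:
    """One-pass (Welford) running mean and variance for a data stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new observation into the summary without storing it."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of all values seen so far."""
        return self._m2 / self.count if self.count else 0.0


# Hypothetical sensor readings arriving one at a time.
stream = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for reading in stream:
    stats.update(reading)
    print(f"n={stats.count} mean={stats.mean:.2f} variance={stats.variance:.2f}")
```

Because only three numbers are kept in memory, the same code works whether the stream has eight readings or billions. That is the property that matters when data cannot be stored before it is analyzed.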
4 V’S (IBM 2014)