Lecture 2 - The Data Science Process

The document outlines the Data Science process, emphasizing its role in navigating Big Data through various methodologies including data preparation, modeling, and application. It discusses the characteristics of Big Data, the differences between structured and unstructured data, and introduces RapidMiner as a tool for data analysis. The document also includes a home assignment for practical application of the concepts discussed.

Applied Management Research Methods

The Data Science process


Dott. Federico Mangiò

Room 23

20 February 2025
Agenda

• Data Science & Big Data

• The Data Science Process

• Getting started with RapidMiner


Data Science: a discipline to navigate the Big Data deluge

• Data science is a discipline that “combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data” (IBM).

• Focus: (network of) interactions (Quattrociocchi & Vicini 2023)

• These insights can be used to guide decision making and strategic planning

• Data science utilizes certain specialized computational methods in order to discover meaningful and useful structures within a dataset.

• The scientific discipline (Diebold, 2021) of data science coexists and is closely
associated with a number of related areas such as database systems, data
engineering, visualization, data analysis, experimentation, and business
intelligence (BI).
Data Science: a discipline to navigate the Big Data deluge

Littlewood (2018). Prioritize Which Data Skills Your Company Needs with This 2×2 Matrix. Harvard Business Review
Big Data

Big Data can be described as: “high volume, velocity and variety of data that demand cost-effective, innovative forms of processing for enhanced insight and decision making.” (Laney, 2013)

Big Data 5 Vs (Lee, 2017):

1. Volume: amount of data an organization or an individual collects and/or generates
2. Velocity: the speed at which data are generated and
processed
3. Variety: heterogeneity of data formats, structure,
types, source
4. Veracity: the unreliability and uncertainty latent in
data sources
5. Value: economic value stemming from BD analytics
Big Data

“Traces”: web traffic data, sensor data, social media data, medical reports, IoT…
How big is Big data?
Yearly production of big data by prominent stakeholders

Fifties: “90 observations on each of 10 variables” (in financial econometrics)

Now: it’s not about observations, but file size (e.g., “a 200GB dataset”)

(Clissa et al. 2023)


Structured vs Unstructured Data

• Structured data: data that is formatted and organized in a data structure so that
elements can be addressed and accessed in various ways to make better use of the
information (e.g., attribute and relational databases).

• Unstructured data: data that is “non-numeric, multifaceted” and bearer of “concurrent representation” (Balducci & Marinova, 2018)

• 80%+ of (Big) Data are unstructured

Balducci & Marinova (2018)


Structured vs Unstructured Data

• UD Pros:

1. Nonnumeric: allows more flexibility for (theoretical) discovery

2. Multifaceted: Richer/deeper
conceptual and managerial insights

3. Concurrent representation:
Enables dynamic analysis at a given
time through simultaneous capture
of facets

• UD Cons:

...tough to handle!
The perils of working with big data

«SMALL» checklist (Brave et al. 2022)


Checklist Question | Big Data dimension | Main finding
1. How are the data Sampled? | Volume | Size is not a substitute for coverage in sampling
2. How are the data Measured? | Veracity | Accuracy begins with proper measurement
3. How are the data Assembled? | Variety | When assembling the data, choose the right tool for the job
4. Do the data exhibit reporting Lags? | Velocity | Duration is not the sole criterion for judging reporting lags
5. Are the data a Leading indicator? | Value | Leading information requires the right variation in the data
Big Data: “the new oil”?

• Value-generating resource

• Need (pre)processing to generate value

• Economies of scale → concentrated markets

• Negative externalities

• Finite

• Non-rivalrous

Puntoni (2022)
Structured or unstructured?

Where does each data source fall along the structured–unstructured continuum?

• Eye tracking
• Click-throughs
• Facial cues
• Gestural cues
• Customer satisfaction scores
• Geographic data
• Financial ratings
• ROI
• IG pictures
• Product prices
• Video
• Vocal characteristics
• Online conversations
Structured or unstructured?

Does “unboxing” determine customer affect upon product launch?

Do assistance ticket length and supply chain relationship affect customer churn?

What is the ROI of paid online search advertising and what keywords are the most effective?
The Data Science process

Figures: the Data Science process; the Cross Industry Standard Process for Data Mining (CRISP-DM)

1. Prior Knowledge

• information that is already known about a subject (agnostic but not atheoretical)
• helps to define what problem is being solved, how it fits the business context, and what data is
needed

• “Prior knowledge” of what?

1. (business) objective: well-defined statement of the problem; building-block of the process (everything
but this one can be changed)

2. subject area: knowledge about the subject matter, the context, and the business process generating
the data

3. The data: Understanding how the data is collected, stored, transformed, reported, and used. Factors to
consider:
• quality of the data;
• quantity of data;
• availability of data
(GIGO: “garbage in, garbage out”)
2. Data Preparation

• (One of the) most time-consuming tasks in the DSP


• It involves:

A) Exploratory data analysis.

Goal: achieve a basic understanding of the data (e.g., structure of the data, distribution of the values, presence of outliers, presence of inter-relationships, etc.). How:

1. Descriptive statistics (mean, median, mode, sd, range)

2. Data visualization

B) Data cleansing: elimination of duplicate records, quarantining outliers, standardization, handling missing values…
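
A minimal sketch of both steps in Python/pandas (illustrative only, not the course tool; file name and columns are assumptions):

import pandas as pd

# Load a hypothetical CSV file (name is illustrative)
df = pd.read_csv("dataset.csv")

# A) Exploratory data analysis
print(df.shape)                         # structure: number of records and attributes
print(df.dtypes)                        # data type of each attribute
print(df.describe())                    # mean, sd, min/max, quartiles for numeric attributes
print(df.describe(include="object"))    # mode and frequencies for categorical attributes
print(df.corr(numeric_only=True))       # inter-relationships among numeric attributes
df.hist(figsize=(10, 6))                # distribution of the values (needs matplotlib installed)

# B) Data cleansing
df = df.drop_duplicates()               # eliminate duplicate records
print(df.isna().sum())                  # locate missing values, attribute by attribute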
Metadata (example): Source: playground records; Data collector: company X; Data collection date: 01.01.18; […]

• A dataset (example set) is a collection of data with a defined structure (“dataframe”).

• A data point (record, object or example) is a single instance in the dataset.

• An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset (numeric, categorical, date-time, text, or Boolean data types).

• A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes (Play, in the playground example).

• Identifiers are special attributes that are used for locating or providing context to individual records.
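
A toy example set in Python/pandas illustrating this terminology (values invented for illustration):

import pandas as pd

# Each row is an example (record); each column is an attribute.
# "Play" is the label to be predicted; "Day" acts as an identifier.
golf = pd.DataFrame({
    "Day":         [1, 2, 3, 4],                             # identifier
    "Outlook":     ["sunny", "rain", "overcast", "sunny"],   # categorical attribute
    "Temperature": [30, 18, 22, 27],                         # numeric attribute
    "Windy":       [False, True, False, True],               # Boolean attribute
    "Play":        ["no", "yes", "yes", "no"],               # label (target)
})

X = golf[["Outlook", "Temperature", "Windy"]]   # input attributes
y = golf["Play"]                                # label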
Datasets

• Relational (e.g. dataframe) vs non-relational (e.g. corpus)


• Sparsity → “curse of dimensionality”

• Training dataset: dataset used to create the model, with known attributes
and target

• Test/Validation dataset: a known dataset against which the model’s validity is tested
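
A minimal sketch of the split with scikit-learn (assuming X and y as in the toy example above):

from sklearn.model_selection import train_test_split

# Hold out 30% of the examples as the test/validation dataset;
# the remaining 70% form the training dataset used to build the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)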
Attributes

• Quantitative variables: they take on numerical values (e.g., income, stock price, etc.).
Continuous, integer, real...
• Qualitative (or categorical) variables (or factors): they take on values in one of K different classes or categories (e.g. male/female for gender; yes/no for a fraudulent financial transaction; increase/decrease for a stock index; ...).
Qualitative variables are typically represented numerically by codes.
• Independent variables (or inputs or regressors or predictors or features) usually denoted by
X.
• Dependent variable (or output or response variable) usually denoted by Y.
2. Data Preparation
1) Handling missing values (“NAs”):

• understanding the source of missing values (e.g. recording error vs count data, like document-term matrix)
• data substitution (mean, min, max, depending on the characteristics of the attribute); only if missing values occur randomly and rarely
• (alternatively) exclude records with missing values (NB: reduces the size of the dataset)

2) Data type conversion: fit attribute values to the specific model (e.g., categorical var → numeric var in regression, factor)

3) Transformation: some problems require normalizing attributes to prevent one from dominating the others (e.g., distance-based algorithms)

4) Handling outliers: 1. understand the source of the outlier (e.g., an error); 2. manage it

5) Feature selection: reducing the number of attributes without significant loss in the performance of the model

6) Sampling: process of selecting a subset of records as a representation of the original dataset for use in data
analysis or modeling. The sample data serve as a representative of the original dataset with similar properties,
such as a similar mean.
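
A rough pandas sketch of steps 1–6 on toy data (column names and values are assumptions, not the lecture’s dataset):

import pandas as pd

# Toy data: one missing value and one suspiciously large income
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "gender":      ["M", "F", "F", "M", "F"],
    "income":      [35000.0, 52000.0, None, 61000.0, 480000.0],
    "age":         [25, 41, 37, 29, 33],
})

# 1) Missing values: substitute with the mean (only if NAs are rare and random)
df["income"] = df["income"].fillna(df["income"].mean())

# 2) Data type conversion: categorical -> numeric via dummy coding
df = pd.get_dummies(df, columns=["gender"])

# 3) Transformation: z-score normalization so no attribute dominates
num_cols = ["income", "age"]
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# 4) Outliers: flag records more than 3 standard deviations from the mean
outliers = (df[num_cols].abs() > 3).any(axis=1)

# 5) Feature selection: drop attributes that carry no predictive information
df = df.drop(columns=["customer_id"])

# 6) Sampling: random subset that keeps similar properties (e.g., a similar mean)
sample = df.sample(n=3, random_state=42)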
3. Modeling

• Model: abstract (biased) representation of reality

• Model → (hundreds of) Machine Learning algorithms

• Tasks: predictive (classification & regression) vs descriptive (clustering, association rules)

• Evaluation: how well does the model generalize from the data? How: comparing real vs predicted values

Modeling steps: Training data → Build Model → Evaluation (against testing data) → Final Model
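
A minimal scikit-learn sketch of these steps (the iris data and the decision tree are illustrative choices, not the lecture’s example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Build the model on training data, then evaluate how well it generalizes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)   # a predictive (classification) task
model.fit(X_train, y_train)                       # build the model

y_pred = model.predict(X_test)                    # predicted values on the testing data
print(accuracy_score(y_test, y_pred))             # compare real vs predicted values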
4. Application

Application (Deployment): stage at which the model becomes production ready or live

• Assessing model readiness (real-time vs time-lagged problems)


• Technical integration: invoking data science tools into production applications (e.g., PMML)
• Response time (model building vs prediction trade-off)
• Model maintenance (data refresh)
• Assimilation: articulate these findings, establish relevance to the original business question, quantify
the risks in the model, and quantify the business impact
5. (A posteriori) knowledge

• The output of the process has to be relevant additional knowledge which cannot be gained
with traditional research methods

• It is up to the practitioner to

1) skillfully transform a business problem to a data problem and apply the right algorithm.

2) invalidate the irrelevant patterns and identify the meaningful information


Introduction to RapidMiner

• Repository: a folder-like structure inside RapidMiner where users can organize their data, processes, and
models. When RapidMiner is launched for the first time, an option to set up the New Local Repository will
be given

• Operator: an atomic piece of functionality (which in fact is a chunk of encapsulated code) performing a
certain task.

• Nested operators: contain a sub-process, made up of a sequence of operators aimed at a specific task

• Process: when a series of operators are connected together to accomplish the desired data science task
(A process that is created visually in RapidMiner is stored by RapidMiner as a platform-independent XML
code that can be exchanged between RapidMiner users)
Import

A) Repositories Tab → “Import Data”

1. Select the file on the disk that should be read or imported.


2. Specify how the file should be parsed and how the columns are delimited. If the data use a comma “,” as the column separator, be sure to select it in the configuration parameters.
3. Annotate the attributes by indicating if the first row of the dataset contains attribute names (which is
usually the case).
4. Detect data types
5. Hit “Finish” and indicate location

To import in the Design View, simply drag and drop the desired dataset from the repository!

B) Alternatively, use the Read Excel / Read CSV operators


Descriptive Statistics Analysis (quick way)

By clicking on the Statistics tab, one can examine the type, missing values, and basic statistics for all the imported dataset attributes. The data type of each attribute (integer, real, or binomial) and some basic statistics can also be identified. This high-level overview is a good way to ensure that a dataset has been loaded correctly; the data can then be explored in more detail using the visualization tools described later.

Data types
• numeric/continuous: take infinite values, can be the object of mathematical (e.g. +, -) and logical (>/<) computations; integer: no decimals; ratio/real: a zero point is defined (e.g., income)

• categorical/nominal: treated as symbols/names; can be ordered (hot, mild, cold temperature); not all algorithms deal with them, but they can be transformed (NB: loss of info)
Data transformation, Pivot, & Normalization

• Data transformation: transforming data so that they are suitable for the algorithm (dummy coding: Numerical to binomial, Nominal to binomial, Nominal to numerical, Numerical to polynomial, Discretization)

• Pivot: rotate an example set around a given attribute


Home assignment
Getting started with RapidMiner:

A. Import in RM the "ikea.csv" DS;

B. Create a subprocess folder named "pre-processing" in which you should:

- assign a unique ID to each example
- assign the "label" role to the DV ("price")
- retain only the attributes "id", "price", "category", "width", "heigth", "length"
- filter out NAs
- extract a random subsample of N = 100
- Z-transform all numeric attributes except for ID

C. Group examples by "category" and compute descriptive stats (mean, SD, range)
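
For orientation only, a rough pandas sketch of the same steps (the assignment itself should be built with RapidMiner operators; the columns listed above are assumed to exist in ikea.csv):

import pandas as pd

df = pd.read_csv("ikea.csv")

# B. Pre-processing
df["id"] = range(1, len(df) + 1)                                   # unique ID per example
df = df[["id", "price", "category", "width", "heigth", "length"]]  # retain listed attributes; "price" is the label
df = df.dropna()                                                   # filter out NAs
df = df.sample(n=100, random_state=42)                             # random subsample of N = 100

num_cols = ["price", "width", "heigth", "length"]                  # all numeric attributes except ID
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()  # Z-transform

# C. Descriptive stats by category
stats = df.groupby("category")["price"].agg(mean="mean", sd="std", rng=lambda s: s.max() - s.min())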

You might also like