Lecture 2 - The Data Science Process

The document outlines the Data Science process, emphasizing its role in navigating Big Data through various methodologies including data preparation, modeling, and application. It discusses the characteristics of Big Data, the differences between structured and unstructured data, and introduces RapidMiner as a tool for data analysis. The document also includes a home assignment for practical application of the concepts discussed.

Applied Management Research Methods

The Data Science process


Dott. Federico Mangiò

Room 23

20 February 2025
Agenda

• Data Science & Big Data

• The Data Science Process

• Getting started with RapidMiner


Data Science: a discipline to navigate the Big Data deluge

• Data science is a discipline that “combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data” (IBM).

• Focus: (network of) interactions (Quattrociocchi & Vicini 2023)

• These insights can be used to guide decision making and strategic planning

• Data science utilizes certain specialized computational methods in order to discover meaningful and useful structures within a dataset.

• The scientific discipline (Diebold, 2021) of data science coexists and is closely
associated with a number of related areas such as database systems, data
engineering, visualization, data analysis, experimentation, and business
intelligence (BI).
Data Science: a discipline to navigate the Big Data deluge

Littlewood (2018). Prioritize Which Data Skills Your Company Needs with This 2×2 Matrix. Harvard Business Review
Big Data

Big Data can be described as: “high volume, velocity and variety of data that demand cost-effective, innovative forms of processing for enhanced insight and decision making.” (Laney, 2013)

Big Data 5 Vs (Lee, 2017):

1. Volume: amount of data an organization or an individual collects and/or generates
2. Velocity: the speed at which data are generated and
processed
3. Variety: heterogeneity of data formats, structure,
types, source
4. Veracity: the unreliability and uncertainty latent in
data sources
5. Value: economic value stemming from BD analytics
Big Data

“Traces”: web traffic data, sensor data, social media data, medical reports, IoT…
How big is Big data?
Yearly production of big data by prominent stakeholders

Fifties: “90 observations on each of 10 variables” (in financial econometrics)

Now: it’s not about observations, but file size (e.g., “a 200GB dataset”)

(Clissa et al. 2023)


Structured vs Unstructured Data

• Structured data: data that is formatted and organized in a data structure so that
elements can be addressed and accessed in various ways to make better use of the
information (e.g., attribute and relational databases).

• Unstructured data: data that is “non-numeric, multifaceted” and bearer of “concurrent representation” (Balducci & Marinova, 2018)

• 80%+ of (Big) Data are unstructured

Balducci & Marinova (2018)


Structured vs Unstructured Data

• UD Pros:

1. Nonnumeric: allows more flexibility for (theoretical) discovery

2. Multifaceted: Richer/deeper
conceptual and managerial insights

3. Concurrent representation:
Enables dynamic analysis at a given
time through simultaneous capture
of facets

• UD Cons:

...tough to handle!
The perils of working with big data

«SMALL» checklist (Brave et al. 2022)


Checklist Question | Big Data dimension | Main finding
1. How are the data Sampled? | Volume | Size is not a substitute for coverage in sampling
2. How are the data Measured? | Veracity | Accuracy begins with proper measurement
3. How are the data Assembled? | Variety | When assembling the data, choose the right tool for the job
4. Do the data exhibit reporting Lags? | Velocity | Duration is not the sole criterion for judging reporting lags
5. Are the data a Leading indicator? | Value | Leading information requires the right variation in the data
Big Data: “the new oil”?

• Value-generating resource

• Need (pre)processing to generate value

• Economies of scale → concentrated markets

• Negative externalities

• Finite

• Non-rivalrous

Puntoni (2022)
Structured or unstructured?

Where does each data source fall along the structured–unstructured continuum?

• Eye tracking
• Click-throughs
• Facial cues
• Gestural cues
• Customer satisfaction scores
• Geographic data
• Financial ratings
• ROI
• IG pictures
• Product prices
• Video
• Vocal characteristics
• Online conversations
Structured or unstructured?

Does “unboxing” determine customer affect upon product launch?

Do assistance ticket length and supply chain relationship affect customer churn?

What is the ROI of paid online search advertising and what keywords are the most effective?
The Data Science process

Figures: the Data Science process; the Cross Industry Standard Process for Data Mining (CRISP-DM)

1. Prior Knowledge

• information that is already known about a subject (agnostic but not atheoretical)
• helps to define what problem is being solved, how it fits the business context, and what data is
needed

• “Prior knowledge” of what?

1. (business) objective: well-defined statement of the problem; building-block of the process (everything
but this one can be changed)

2. subject area: knowledge about the subject matter, the context, and the business process generating
the data

3. The data: Understanding how the data is collected, stored, transformed, reported, and used. Factors to
consider:
• quality of the data;
• quantity of data;
• availability of data
(GIGO: “garbage in, garbage out”)
2. Data Preparation

• (One of the) most time-consuming tasks in the DSP


• It involves:

A) Exploratory data analysis.

Goal: achieve a basic understanding of the data (e.g., structure of the data, distribution of the values, presence of outliers, presence of inter-relationships, etc.). How:

1. Descriptive statistics (mean, median, mode, sd, range)

2. Data visualization

B) Data cleansing: elimination of duplicate records, quarantining outliers, standardization, handling missing values…
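
A minimal sketch of both steps in Python/pandas (illustrative only, not the course tool; file name and columns are assumptions):

import pandas as pd

# Load a hypothetical CSV file (name is illustrative)
df = pd.read_csv("dataset.csv")

# A) Exploratory data analysis
print(df.shape)                         # structure: number of records and attributes
print(df.dtypes)                        # data type of each attribute
print(df.describe())                    # mean, sd, min/max, quartiles for numeric attributes
print(df.describe(include="object"))    # mode and frequencies for categorical attributes
print(df.corr(numeric_only=True))       # inter-relationships among numeric attributes
df.hist(figsize=(10, 6))                # distribution of the values (needs matplotlib installed)

# B) Data cleansing
df = df.drop_duplicates()               # eliminate duplicate records
print(df.isna().sum())                  # locate missing values, attribute by attribute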
Metadata (example): Source: playground records; Data collector: company X; Data collection date: 01.01.18; […]

• A dataset (example set) is a collection of data with a defined structure (“dataframe”).

• A data point (record, object or example) is a single instance in the dataset.

• An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset (numeric, categorical, date-time, text, or Boolean data types).

• A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes (Play, in the playground example).

• Identifiers are special attributes that are used for locating or providing context to individual records.
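
A toy example set in Python/pandas illustrating this terminology (values invented for illustration):

import pandas as pd

# Each row is an example (record); each column is an attribute.
# "Play" is the label to be predicted; "Day" acts as an identifier.
golf = pd.DataFrame({
    "Day":         [1, 2, 3, 4],                             # identifier
    "Outlook":     ["sunny", "rain", "overcast", "sunny"],   # categorical attribute
    "Temperature": [30, 18, 22, 27],                         # numeric attribute
    "Windy":       [False, True, False, True],               # Boolean attribute
    "Play":        ["no", "yes", "yes", "no"],               # label (target)
})

X = golf[["Outlook", "Temperature", "Windy"]]   # input attributes
y = golf["Play"]                                # label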
Datasets

• Relational (e.g. dataframe) vs non-relational (e.g. corpus)


• Sparsity → “curse of dimensionality”

• Training dataset: dataset used to create the model, with known attributes
and target

• Test/Validation dataset: a known dataset against which the model’s validity is tested
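
A minimal sketch of the split with scikit-learn (assuming X and y as in the toy example above):

from sklearn.model_selection import train_test_split

# Hold out 30% of the examples as the test/validation dataset;
# the remaining 70% form the training dataset used to build the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)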
Attributes

• Quantitative variables: they take on numerical values (e.g., income, stock price, etc.).
Continuous, integer, real...
• Qualitative (or categorical) variables (or factors): they take on values in one of K different classes or categories (e.g. male/female for gender; yes/no for a fraudulent financial transaction; increase/decrease for a stock index; ...).
Qualitative variables are typically represented numerically by codes.
• Independent variables (or inputs or regressors or predictors or features) usually denoted by
X.
• Dependent variable (or output or response variable) usually denoted by Y.
2. Data Preparation
1) Handling missing values (“NAs”):

• understanding the source of missing values (e.g. recording error vs count data, like document-term matrix)
• data substitution (mean, min, max, depending on the characteristics of the attribute); only if missing values occur randomly and rarely
• (alternatively) exclude records with missing values (NB: reduces the size of the dataset)

2) Data type conversion: fit attribute values to the specific model (e.g., categorical var → numeric var in regression, factor)

3) Transformation: some problems require normalizing attributes to prevent one from dominating the others (e.g., distance-based algorithms)

4) Handling outliers: 1. understand the source of the outlier (e.g., an error); 2. manage it

5) Feature selection: reducing the number of attributes without significant loss in the performance of the model

6) Sampling: process of selecting a subset of records as a representation of the original dataset for use in data
analysis or modeling. The sample data serve as a representative of the original dataset with similar properties,
such as a similar mean.
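
A rough pandas sketch of steps 1–6 on toy data (column names and values are assumptions, not the lecture’s dataset):

import pandas as pd

# Toy data: one missing value and one suspiciously large income
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "gender":      ["M", "F", "F", "M", "F"],
    "income":      [35000.0, 52000.0, None, 61000.0, 480000.0],
    "age":         [25, 41, 37, 29, 33],
})

# 1) Missing values: substitute with the mean (only if NAs are rare and random)
df["income"] = df["income"].fillna(df["income"].mean())

# 2) Data type conversion: categorical -> numeric via dummy coding
df = pd.get_dummies(df, columns=["gender"])

# 3) Transformation: z-score normalization so no attribute dominates
num_cols = ["income", "age"]
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# 4) Outliers: flag records more than 3 standard deviations from the mean
outliers = (df[num_cols].abs() > 3).any(axis=1)

# 5) Feature selection: drop attributes that carry no predictive information
df = df.drop(columns=["customer_id"])

# 6) Sampling: random subset that keeps similar properties (e.g., a similar mean)
sample = df.sample(n=3, random_state=42)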
3. Modeling

• Model: abstract (biased) representation of reality

• Model → (hundreds of) Machine Learning algorithms

• Tasks: predictive (classification & regression) vs descriptive (clustering, association rules)

• Evaluation: how well does the model generalize from the data? How: comparing real vs predicted values

Modeling steps: Training data → Build Model → Evaluation (against testing data) → Final Model
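
A minimal scikit-learn sketch of these steps (the iris data and the decision tree are illustrative choices, not the lecture’s example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Build the model on training data, then evaluate how well it generalizes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)   # a predictive (classification) task
model.fit(X_train, y_train)                       # build the model

y_pred = model.predict(X_test)                    # predicted values on the testing data
print(accuracy_score(y_test, y_pred))             # compare real vs predicted values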
4. Application

Application (Deployment): stage at which the model becomes production ready or live

• Assessing model readiness (real-time vs time-lagged problems)


• Technical integration: invoking data science tools into production applications (e.g., PMML)
• Response time (model building vs prediction trade-off)
• Model maintenance (data refresh)
• Assimilation: articulate these findings, establish relevance to the original business question, quantify
the risks in the model, and quantify the business impact
5. (A posteriori) knowledge

• The output of the process has to be relevant additional knowledge which cannot be gained
with traditional research methods

• It is up to the practitioner to

1) skillfully transform a business problem to a data problem and apply the right algorithm.

2) invalidate the irrelevant patterns and identify the meaningful information


Introduction to RapidMiner

• Repository: a folder-like structure inside RapidMiner where users can organize their data, processes, and
models. When RapidMiner is launched for the first time, an option to set up the New Local Repository will
be given

• Operator: an atomic piece of functionality (which in fact is a chunk of encapsulated code) performing a
certain task.

• Nested operators: contain a sub-process, made up of a sequence of operators aimed at a specific task

• Process: when a series of operators are connected together to accomplish the desired data science task
(A process that is created visually in RapidMiner is stored by RapidMiner as a platform-independent XML
code that can be exchanged between RapidMiner users)
Import

A) Repositories Tab → “Import Data”

1. Select the file on the disk that should be read or imported.


2. Specify how the file should be parsed and how the columns are delimited. If the data use a comma “,” as the column separator, be sure to select it in the configuration parameters.
3. Annotate the attributes by indicating if the first row of the dataset contains attribute names (which is
usually the case).
4. Detect data types
5. Hit “Finish” and indicate location

To import in the Design View, simply drag and drop the desired dataset from the repository!

B) Alternatively, use the Read Excel / Read CSV operators


Descriptive Statistics Analysis (quick way)

By clicking on the Statistics tab, one can examine the type, missing values, and basic statistics for all the imported dataset attributes. The data type of each attribute (integer, real, or binomial) and some basic statistics can also be identified. This high-level overview is a good way to ensure that a dataset has been loaded correctly; the data can then be explored in more detail using the visualization tools described later.

Data types
• numeric/continuous: take infinite values, can be the object of mathematical (e.g. +, -) and logical (>/<) computations; integer: no decimals; ratio/real: a zero point is defined (e.g., income)

• categorical/nominal: treated as symbols/names; can be ordered (hot, mild, cold temperature); not all algorithms deal with them, but they can be transformed (NB: loss of info)
Data transformation, Pivot, & Normalization

• Data transformation: transforming data so that they are suitable for the algorithm (dummy coding: Numerical to binomial, Nominal to binomial, Nominal to numerical, Numerical to polynomial, Discretization)

• Pivot: rotate an example set around a given attribute


Home assignment
Getting started with RapidMiner:

A. Import in RM the "ikea.csv" DS;

B. Create a subprocess folder named "pre-processing" in which you should:

- assign a unique ID to each example
- assign the "label" role to the DV ("price")
- retain only the attributes "id", "price", "category", "width", "heigth", "length"
- filter out NAs
- extract a random subsample of N = 100
- Z-transform all numeric attributes except for ID

C. Group examples by "category" and compute descriptive stats (mean, SD, range)
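
For orientation only, a rough pandas sketch of the same steps (the assignment itself should be built with RapidMiner operators; the columns listed above are assumed to exist in ikea.csv):

import pandas as pd

df = pd.read_csv("ikea.csv")

# B. Pre-processing
df["id"] = range(1, len(df) + 1)                                   # unique ID per example
df = df[["id", "price", "category", "width", "heigth", "length"]]  # retain listed attributes; "price" is the label
df = df.dropna()                                                   # filter out NAs
df = df.sample(n=100, random_state=42)                             # random subsample of N = 100

num_cols = ["price", "width", "heigth", "length"]                  # all numeric attributes except ID
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()  # Z-transform

# C. Descriptive stats by category
stats = df.groupby("category")["price"].agg(mean="mean", sd="std", rng=lambda s: s.max() - s.min())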

You might also like