
ECS766P Data Mining

Week 11: Data Mining Applications & Data Ethics

Emmanouil Benetos
[email protected]

December 2022
School of EECS, Queen Mary University of London
Last week: Web Mining

• Six paradigms for today’s Internet
• Technology review
• Internet Mining Applications
• Ingesting Internet data
• Search Engine Indexing & Ranking

This week’s contents

1. Mining Text Data

2. Mining Timeseries Data

3. Data Ethics

Reading

• Chapters 13, 14, 16, and 20 of C. C. Aggarwal, “Data Mining: The Textbook”, Springer, 2015 [non-essential reading]

Data Ethics content adapted from material by Dr Usman Naeem and the
Institute of Coding (IoC)
http://eecs.qmul.ac.uk/ioc/

Mining Text Data
Mining Text Data: Introduction

Mining Text Data


The text domain is sometimes challenging for mining purposes
because of its sparse and high-dimensional nature. Therefore,
specialised algorithms need to be designed. The first step is the
construction of a bag-of-words representation for text data.

Several preprocessing steps need to be applied, such as stop-word
removal, stemming, and the removal of digits from the
representation.

Algorithms for problems such as clustering and classification need
to be modified as well. The k-means method, hierarchical methods,
and probabilistic methods can be suitably modified to work for text
data.

Mining Text Data: Introduction

Text data are found in many domains:

• Digital libraries
• Web and Web-enabled applications
• News services

Modeling of Text:

• A sequence (string)
• A multidimensional record

Mining Text Data: Multidimensional Representations

Some terminology:

• Data point: document
• Data set: corpus
• Feature: word/term
• The set of features: lexicon

Vector Space Representation:

• Common words are removed
• Variations of the same word are consolidated
• Displays frequencies of individual words

Mining Text Data: Vector Space Representation

Figure: vector space representation for a collection of documents.

This particular representation is also called a document-term matrix.
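
As an illustration, here is a minimal sketch of building a document-term matrix in Python, assuming scikit-learn is available; the toy corpus and variable names are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus: each string is one document.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# CountVectorizer builds the lexicon and counts word frequencies.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the lexicon (features)
print(dtm.toarray())  # rows = documents, columns = term frequencies
```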

Mining Text Data: Specific Characteristics of Text

Number of “Zero” Attributes (Sparsity):

• Most attribute values in a document are 0. This phenomenon is
referred to as high-dimensional sparsity.
• Affects many fundamental aspects of text mining, such as
distance computation.

Nonnegativity:

• Frequencies are nonnegative.
• The presence of a word is statistically more significant than its
absence.

Side Information:

• Hyperlinks or other metadata associated with a document.

Mining Text Data: Data Preprocessing

Stop Word Removal:

• Words in a language that are not very discriminative for mining
• Articles, prepositions, and conjunctions

Stemming:

• Consolidate variations of the same word
• Singular and plural representations, different tenses, common
root extraction

Punctuation Marks:

• Commas, semicolons, digits, hyphens
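
A minimal sketch of these preprocessing steps in plain Python; the stop-word list is deliberately tiny and the “stemmer” is a crude suffix-stripper, both invented for illustration (a real pipeline would use a library such as NLTK):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "on", "in"}  # toy stop-word list

def preprocess(document: str) -> list[str]:
    # Remove punctuation marks and digits, keeping only letters and spaces.
    document = re.sub(r"[^\sa-zA-Z]", " ", document.lower())
    tokens = document.split()
    # Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude stemming: strip a common plural suffix.
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]
    return tokens

print(preprocess("The 2 cats sat on the mats!"))  # ['cat', 'sat', 'mat']
```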

Mining Text Data: tf–idf representation

Inverse Document Frequency:

idf(w) = log_10(|D| / |D_w|)

where |D_w| is the number of documents in which the word w occurs,
and |D| is the total number of documents.

Term Frequency:

tf(w, d) = f_{w,d} / Σ_{w′∈d} f_{w′,d}

The ratio of the number of appearances f_{w,d} of word w in document
d, divided by the total number of words in document d.

Term frequency–Inverse document frequency (tf–idf):

tfidf(w, d) = tf(w, d) · idf(w)
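
A minimal sketch of these formulas in plain Python; the toy tokenised corpus and the function names are invented for illustration:

```python
import math

docs = [["cat", "sat", "mat"],
        ["dog", "chased", "cat"],
        ["dog", "cat", "pet"]]

def tf(word, doc):
    # Appearances of the word divided by the total number of words in the document.
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # log10 of (total documents / documents containing the word).
    n_containing = sum(1 for d in corpus if word in d)
    return math.log10(len(corpus) / n_containing)

def tfidf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

print(tfidf("cat", docs[0], docs))  # 0.0: "cat" occurs in every document
print(tfidf("mat", docs[0], docs))  # > 0: "mat" is distinctive to this document
```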

Mining Text Data: Representative-Based Algorithms

Most clustering algorithms can be extended to text data, following
some modifications.

Representative-Based Algorithms: Since the vector space
representation of text is also a multidimensional data point,
algorithms such as k-means can be used for text data.

Modifications:

• Choice of Similarity Function: Cosine similarity (see the sketch
below)
• Computation of the cluster centroids:
• Low-frequency words in the cluster are not retained
• A representative set of words is retained for each cluster
(200 to 400 words)
• This truncation has significant effectiveness advantages
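
A minimal sketch of cosine similarity on term-frequency vectors, assuming NumPy; the two toy vectors are invented for illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product normalised by the vector magnitudes, so document
    # length does not dominate the comparison.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy term-frequency vectors over a shared lexicon.
doc_a = np.array([2, 0, 1, 0, 3])
doc_b = np.array([1, 0, 0, 0, 2])

print(cosine_similarity(doc_a, doc_b))  # close to 1.0 for similar documents
```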

Mining Text Data: Scatter/Gather Approach

The scatter/gather approach is effective because of its ability to
combine hierarchical and k-means algorithms.

• While the k-means algorithm scales as O(k · n), it is sensitive to
initialisation.
• While hierarchical partitioning algorithms are very robust, they
typically do not scale well.
• A Two-phase Approach:
1. Apply a procedure to create a robust set of initial seeds
(buckshot or fractionation procedure)
2. Apply a k-means approach on the resulting set of seeds

Mining Text Data: Scatter/Gather Approach

Buckshot

• Select a seed (sample of documents) of size √(k · n)
• k is the number of clusters
• n is the number of documents
• Apply agglomerative hierarchical clustering to this initial sample
of seeds
• The time complexity is O(k · n), since agglomerative clustering is
quadratic in the sample size
• Agglomerative clustering methods
• The individual data points are successively merged into
higher-level clusters.

Mining Text Data: Scatter/Gather Approach

Fractionation

• Break up the corpus into n/m buckets, each of size m
• An agglomerative algorithm is applied to each bucket to reduce
them by a factor ν ∈ (0, 1)
• Then, we obtain νn agglomerated documents over all buckets
• An “agglomerated document” is defined as the
concatenation of the documents in a cluster.
• Repeat the above process until only k agglomerated documents remain

Mining Text Data: Scatter/Gather Approach

Fractionation

• Types of Partition
• Random partitioning
• Sort the documents by the index of the jth most common
word in the document. Contiguous groups of m documents
in this sort order are mapped to clusters.
• Time Complexity
• O(nm(1 + ν + ν² + ...)) = O(nm)

Mining Text Data: Scatter/Gather Approach

k-means algorithm
When the initial cluster centers have been determined with the use
of the buckshot or fractionation algorithms, one can apply the
k-means algorithm with the seeds obtained in the first step.

• Each document is assigned to the nearest of the k cluster
centers
• The centroid of each such cluster is determined as the
concatenation of the documents in that cluster
• Furthermore, the less frequent words of each centroid are
removed
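
A minimal sketch of the two-phase idea (buckshot seeding followed by k-means), assuming scikit-learn and NumPy; the placeholder document-term matrix `dtm` and all parameter choices are invented for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
dtm = rng.random((1000, 50))   # placeholder document-term matrix (n x lexicon)
k = 5
n = dtm.shape[0]

# Phase 1 (buckshot): agglomerative clustering on a sample of size sqrt(k*n).
sample_size = int(np.sqrt(k * n))
sample = dtm[rng.choice(n, size=sample_size, replace=False)]
agg = AgglomerativeClustering(n_clusters=k).fit(sample)

# The seed for each cluster is the centroid of its sample documents.
seeds = np.array([sample[agg.labels_ == c].mean(axis=0) for c in range(k)])

# Phase 2: k-means over the full corpus, initialised with the robust seeds.
km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(dtm)
print(np.bincount(km.labels_))  # cluster sizes
```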

Mining Timeseries Data
Mining Timeseries Data: Introduction

Mining Timeseries Data


Timeseries data is common in many domains, such as sensor
networking, healthcare, and financial markets.

Typically, timeseries data needs to be normalised, and missing
values need to be imputed for effective processing. Numerous data
reduction techniques such as Fourier and wavelet transforms are
used in timeseries analysis. The choice of similarity function is the
most crucial aspect of time series analysis.

Forecasting is an important problem in timeseries analysis because
it can be used to make predictions about data points in the future.
Most timeseries applications use either point-wise or shape-wise
analysis.

Mining Timeseries Data: Introduction

• Temporal data may be either discrete or continuous:
• Continuous temporal data sets are timeseries
• Discrete temporal data sets are sequences
• Time series data are viewed as contextual data representations,
with contextual and behavioural attributes.
• Two types of models:
• Real-time analysis
• Retrospective analysis

Mining Timeseries Data: Data Preparation

Multivariate Time Series Data

A time series of length n and dimensionality d contains d numeric
features at each of n timestamps t_1, ..., t_n. Each timestamp contains a
component for each of the d series. Therefore, the set of values
received at timestamp t_i is Ȳ_i = (y_i^1, ..., y_i^d). The value of the jth series
at timestamp t_i is y_i^j.

In a univariate time series, the value of d is 1. In such cases, a series
of length n is represented as a set of scalar behavioural values
y_1, ..., y_n, associated with the timestamps t_1, ..., t_n.
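
A minimal sketch of this representation with NumPy; the shapes and toy values are invented for illustration:

```python
import numpy as np

# Multivariate series: n timestamps x d dimensions (here n=4, d=3).
timestamps = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([[0.1, 2.0, 5.0],
              [0.2, 2.1, 4.8],
              [0.2, 2.3, 4.9],
              [0.3, 2.2, 5.1]])

print(Y[1])      # all d values received at timestamp t_2
print(Y[1, 2])   # value of the 3rd series at timestamp t_2

# Univariate case (d = 1): just a vector of scalar behavioural values.
y = Y[:, 0]
```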

Mining Timeseries Data: Data Preprocessing

Handling Missing Values

The most common methodology used for handling missing,
unequally spaced, or unsynchronised values is linear interpolation.

Let y_i and y_j be values of the timeseries at times t_i and t_j,
respectively, where i < j. Let t be a time drawn from the interval (t_i, t_j).
Then, the interpolated value of the series is given by:

y = y_i + ((t − t_i) / (t_j − t_i)) · (y_j − y_i)

Polynomial or spline interpolation is also possible.
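
A minimal sketch of linear interpolation over missing values, using NumPy’s np.interp; the toy series is invented for illustration:

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, np.nan, np.nan, 5.0, 6.0])  # two missing values

missing = np.isnan(y)
# np.interp evaluates the piecewise-linear function through the known
# (t, y) points at the timestamps where values are missing.
y[missing] = np.interp(t[missing], t[~missing], y[~missing])
print(y)  # [2. 3. 4. 5. 6.]
```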

Mining Timeseries Data: Data Preprocessing

Noise Removal

• Binning
• Grouping data into time intervals of size k
• Averaging value of data points in each interval
• Let y_{i·k+1}, ..., y_{i·k+k} be the values at timestamps t_{i·k+1}, ..., t_{i·k+k}.
The new binned value is:

y′_{i+1} = (1/k) · Σ_{r=1}^{k} y_{i·k+r}

• Moving-Average Smoothing: Moving-average (rolling average)
methods reduce the loss in binning by using overlapping bins,
over which the averages are computed. Here a bin is
constructed starting at each timestamp in the series (see the
sketch below).
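
A minimal sketch of binning and moving-average smoothing, assuming pandas; the toy series is invented for illustration:

```python
import pandas as pd

y = pd.Series([3.0, 5.0, 4.0, 9.0, 7.0, 8.0])

# Binning: average over non-overlapping intervals of size k.
k = 2
binned = y.groupby(y.index // k).mean()
print(binned.tolist())    # [4.0, 6.5, 7.5]

# Moving average: overlapping bins, one starting at each timestamp.
smoothed = y.rolling(window=k).mean()
print(smoothed.tolist())  # [nan, 4.0, 4.5, 6.5, 8.0, 7.5]
```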

Mining Timeseries Data: Data Preprocessing

• Exponential Smoothing
The smoothed value y′_i is defined as a linear combination of the
current value y_i and the previously smoothed value y′_{i−1}.
Parameter α ∈ (0, 1) controls the smoothing:

y′_i = α · y_i + (1 − α) · y′_{i−1}
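
A minimal sketch of exponential smoothing in plain Python; the toy series and the choice of α are invented for illustration:

```python
def exponential_smoothing(y, alpha=0.5):
    # The first smoothed value is initialised to the first observation.
    smoothed = [y[0]]
    for value in y[1:]:
        # y'_i = alpha * y_i + (1 - alpha) * y'_{i-1}
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

print(exponential_smoothing([3.0, 5.0, 4.0, 9.0], alpha=0.5))
# [3.0, 4.0, 4.0, 6.5]
```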

Mining Timeseries Data: Data Preprocessing

Normalisation

• Minmax normalisation to (0, 1)
Let the minimum and maximum value of the time series be min
and max, respectively. Then, the time series value y_i is mapped
to the new value y′_i in the range (0, 1) as:

y′_i = (y_i − min) / (max − min)

• z-score normalisation
Let µ and σ represent the mean and standard deviation of the
values in the timeseries. Then, the timeseries value y_i is mapped
to a new value z_i as:

z_i = (y_i − µ) / σ
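
A minimal sketch of both normalisations with NumPy; the toy series is invented for illustration:

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0, 9.0, 7.0])

# Min-max normalisation to the (0, 1) range.
minmax = (y - y.min()) / (y.max() - y.min())

# z-score normalisation: zero mean, unit standard deviation.
zscore = (y - y.mean()) / y.std()

print(minmax)  # values in [0, 1]
print(zscore)  # mean 0, std 1
```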

Mining Timeseries Data: Data Transformation

Discrete Wavelet Transform (DWT)

• DWT converts a timeseries to multidimensional data.
• A key advantage is that the DWT can capture both frequency and
temporal information.
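
A minimal sketch of a one-level DWT, assuming the PyWavelets package (pywt) is installed; the toy signal and the Haar wavelet choice are illustrative:

```python
import numpy as np
import pywt

signal = np.array([2.0, 2.0, 6.0, 6.0, 5.0, 5.0, 1.0, 1.0])

# One-level Haar DWT: approximation (low-pass) and detail (high-pass)
# coefficients, which localise both frequency and time information.
approx, detail = pywt.dwt(signal, "haar")
print(approx)  # smoothed, half-length version of the series
print(detail)  # local differences (zero where neighbours are equal)
```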

Mining Timeseries Data: Data Transformation

Discrete Fourier Transform (DFT)


Idea: Decompose a given signal into a superposition of sinusoids
(elementary signals).

• The magnitude reflects the intensity at which the sinusoid of a
specific frequency appears in the signal.
• The phase reflects how the sinusoid has to be shifted to best
correlate with the signal.
Mining Timeseries Data: Data Transformation

Discrete Fourier Transform (DFT)


Any series of length n can be expressed as a linear combination of
smooth periodic sinusoidal series. Consider a time series x_0, ..., x_{n−1}.
Each coefficient X_k of the Fourier transform is a complex value which
is defined as follows:

X_k = Σ_{r=0}^{n−1} x_r · e^{−irωk} = Σ_{r=0}^{n−1} x_r · cos(rωk) − i · Σ_{r=0}^{n−1} x_r · sin(rωk),  ∀k ∈ {0, ..., n−1}

where ω is set to 2π/n radians, and the notation i denotes the
imaginary number √−1.
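
A minimal sketch checking this definition against NumPy’s FFT; the toy signal is invented for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0])
n = len(x)
omega = 2 * np.pi / n

# Direct evaluation of X_k = sum_r x_r * exp(-i * r * omega * k).
r = np.arange(n)
X_direct = np.array([np.sum(x * np.exp(-1j * r * omega * k)) for k in range(n)])

# NumPy's fast Fourier transform computes the same coefficients.
X_fft = np.fft.fft(x)
print(np.allclose(X_direct, X_fft))  # True
print(np.abs(X_fft))    # magnitudes
print(np.angle(X_fft))  # phases
```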

Mining Timeseries Data: Forecasting

The prediction of future trends has applications in:

• Retail sales
• Stock markets
• Weather forecasting
• Medicine and health

Mining Timeseries Data: Forecasting

Timeseries can be either stationary or nonstationary:

• A stationary stochastic process is one whose parameters, such
as the mean and variance, do not change with time.
• A nonstationary process is one whose parameters change with
time.

In forecasting, we typically convert or assume timeseries to be
stationary and use statistical parameters for forecasting.
Statistical methods for timeseries forecasting (a fitting sketch
follows below):

• Autoregressive (AR) models
• Moving average (MA) models
• Autoregressive Moving Average (ARMA) models
• Autoregressive Integrated Moving Average (ARIMA) models
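
A minimal sketch of fitting such a model, assuming the statsmodels package; the random-walk data, the (p, d, q) order, and the forecast horizon are all illustrative choices:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# A toy nonstationary series: a random walk.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))

# ARIMA(1, 1, 1): one differencing step (d=1) makes the series
# stationary before the AR and MA parts are fitted.
model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.forecast(steps=5))  # predictions for the next 5 timestamps
```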

Data Ethics
Is our underlying data fit for purpose?

The objective of this section is to provide students with an
understanding of the key ethical and legal issues as well as
challenges that they might face when working on data mining. The
lecture will also provide insights on how to address these issues
based on the UK’s Data Ethics Framework.

Fundamental questions:

• “Does my analysis of the dataset infringe on a user’s privacy?”
• “Does the use of a particular dataset lead to ethical issues?”
• “Is the dataset accurate and fit for purpose?”

What is Data Ethics?

In simple terms, ethics can be considered as conducting an activity
in a ‘good’, ‘acceptable’ or ‘right’ way. But how can we determine
what is ‘good’, ‘acceptable’ or ‘right’? This can be subjective, as it is
based on the values that are the norm within different groups of
people, which are in turn influenced by factors such as culture.
The moral philosophy discipline categorises
ethics into the following two perspectives:

• Kantian: The ethical action is driven by moral values and
principles of the individual. This perspective is not concerned
about the consequence of an individual’s actions.
• Utilitarian: The action is ethical if the intention is to maximise
positive outcomes for a larger population of individuals. This
perspective is concerned about the consequence of an
individual’s actions.

What is Data Ethics?

Both Kantian and Utilitarian perspectives have their advantages and
disadvantages:

• Under the Kantian perspective, it can be difficult to recognise the
moral (good) values of an individual.
• The Utilitarian perspective can overlook minority groups, as it
only considers positive outcomes for the larger group of
individuals.

Data Ethics is concerned with the values and methods that are
adopted when we generate, analyse and disseminate data. Hence, a
fundamental objective of data ethics is to ensure that you consider
the social and legal implications of how and for what purpose you
use the data and algorithms as a data scientist.

Data Ethics - Suggested Reading

• Mingers, J., & Walsham, G. (2010). Towards ethical information systems:
The contribution of discourse ethics. MIS Quarterly, 34(4), 833–854.
• Pasquale, F., & Citron, D. K. (2014). Promoting innovation while
preventing discrimination: Policy goals for the scored society.
Washington Law Review, 89, 1413.
• Newell, S., & Marabelli, M. (2015). Strategic opportunities (and
challenges) of algorithmic decision making: A call for action on the
long-term societal effects of ‘datification’. The Journal of Strategic
Information Systems.
• Vallor, S. (2016). Technology and the virtues: A philosophical guide to a
future worth wanting. Oxford University Press.
• Gumbus, A., & Grodzinsky, F. (2016). Era of big data: Danger of
discrimination. ACM SIGCAS Computers and Society, 45(3), 118–125.

Case study

“We also should be worried about misdirection of the innovation of scoring
in the employment context—particularly if firms can effectively hide
misconduct via scores. Existing laws prohibit some discriminatory uses of
the data. For example, an employer cannot fire workers simply because they
have an illness. But Big Data methods are able to predict diabetes from a
totally innocuous data set (including items like eating habits, drugstore
visits, magazine subscriptions, and the like). [...] For example, a firm could
conclude a worker is likely to be diabetic and that they are likely to be a
“high cost worker” given the significant monthly costs of diabetic medical
care.”
(from Pasquale and Citron, 2014)

What is the Data Ethics Framework?

The Data Ethics Framework, developed by the UK government,
prescribes the design of appropriate data use. It is aimed at
statisticians, analysts and data scientists working directly or
indirectly within the public sector. The objective of this
framework is to encourage ethical data use to build better services,
based on the following values of the Civil Service Code:

• Integrity
• Honesty
• Objectivity
• Impartiality

Resources:

• Data Ethics Framework


https://www.gov.uk/government/publications/
data-ethics-framework/data-ethics-framework
• Data Ethics Workbook
https://www.gov.uk/government/publications/
data-ethics-workbook

Which data are we allowed to use?

Quantitative secondary research sources include datasets such
as census data, birth/death rates and unemployment rates, types of
data normally generated by governments, organisations and
charities.

Are we allowed to make use of this data? The answer is ‘yes’;
however, we need to be aware of the legislation related to the usage
of data. According to gov.uk, this includes how we:

• Produce statistics
• Protect privacy by design
• Minimise the data needed to achieve our need
• Keep personal and non-personal data secure

Personal Data Protection

If you intend to use personal data, then you must ensure that you
comply with the principles of the General Data Protection Regulation
(GDPR) and Data Protection Act 2018 (DPA 2018).

The importance of GDPR cannot be overstated, as it aims to
improve the protection of data subjects’ rights within Europe. In
addition to this, GDPR clearly articulates what companies must do to
protect personal data.

Personal Data Protection

Data scientists also need to take into consideration the
interpretability of data, as this is also a GDPR requirement.

There are two aspects to the interpretability of data (this legal
definition also includes models), which are transparency and post
hoc explanations.

Personal Data Protection

Transparency is based on how your model works, while post hoc
explanations are based on the information derived from your model.
From a GDPR perspective, this is important as a user has the legal
right to find out how an algorithmic decision was made about them.

Resources:

• General Data Protection Regulation (GDPR)


https://gdpr-info.eu

• Data Protection Act 2018 (DPA 2018)


https://www.legislation.gov.uk/ukpga/2018/12/enacted

Case Study: Autonomous Vehicles

Decision-making models are dependent on data that is generated
given a particular scenario. One such example is the series of
decisions that have to be made given the data captured by the
multiple sensors in Autonomous Vehicles (AVs). The questions that
we need to think about are:

• “Who makes these decisions?”
• “Are there any legal liabilities for these decisions?”

The advent of AVs is seen as a progressive step towards a smart city
infrastructure, where the motivation is to provide safe roads by
reducing traffic accidents. However, decision-making models will
likely have to make a series of difficult moral decisions if the vehicle
is involved in a crash.

Case Study: Autonomous Vehicles

Let us consider the following scenarios:

• Scenario A:
The vehicle will keep on driving straight on the road and kill a group of
pedestrians or
The vehicle will swerve to the right and kill one person walking on the
pavement
• Scenario B:
The vehicle will keep on driving straight on the road and kill one
pedestrian or
The vehicle will swerve to the right onto the pavement and kill the
passenger in the vehicle
• Scenario C:
The vehicle will keep on driving straight on the road and kill a group of
pedestrians or
The vehicle will swerve to the right onto the pavement and kill the
passenger in the vehicle

Case Study: Autonomous Vehicles

These scenarios clearly illustrate why the design of these models
can lead to ethical dilemmas. However, will the legal liabilities be the
same for a pre-programmed AV and a human-driven car?
Readings:

• Contissa, G., Lagioia, F., & Sartor, G. (2017). The Ethical Knob:
ethically-customisable automated vehicles and the law. Artificial
Intelligence and Law, 25(3), 365-378.
• Ethics guidelines for trustworthy AI
https://ec.europa.eu/digital-single-market/
en/news/ethics-guidelines-trustworthy-ai

Data Reliability

Limitations with datasets can lead to data analysis being misleading
and unreliable. Hence, this is seen as an ethical concern.

How do we determine if a dataset is reliable?

We need to take into consideration the lineage of the dataset, as this
will allow us to trace back any errors or discrepancies to the
beginning of the data analysis process.

Data Reliability

Identification of the data lineage can be done by answering the
following set of questions:

• What is the source of the data?
• How was the data collected? Was it by humans? Or automated
systems?
• Why was the data collected?
• Does the data reflect its target population?
• Are there any patterns in the data?
• Is data likely to change over time?
• Are there omissions from the dataset?
• What was the sampling method used to collect this data?

Data Bias

Bias within datasets can be caused by:

• Datasets that do not accurately represent the cohort that the
insights will be based on.
• Datasets produced by humans, such as curated news articles or
social media content, which can embed bias against a group of
people.

There are huge ethical implications of having bias within datasets:
it can lead to biased models that will be prejudiced and harmful
towards people.

An example of this is the case study about the “COMPAS Recidivism
Algorithm”, which was used to predict a defendant’s likelihood of
committing a crime.

Data Bias - Types of Biases

Selection Bias
This type of bias occurs when the dataset does not reflect the
population or cohort that the insights or decisions will be based on.
This is very common with surveys, which tend to lead to a
situation where you only end up with willing participants, who are a
small subset of the population and do not reflect the characteristics
of an average person. This bias often arises from the need to work
with data that is easily accessible.

Self-Selection Bias
This is a subcategory of the selection bias, where the subject within
the analysis selects themselves. For example, suppose you are running an
online poll on how many people in a town can use an email client.
The results will not represent the entire town, as only the
participants who received the poll via email would be likely to respond.
Data Bias - Types of Biases

Omitted Variable Bias
This bias occurs when variables or features are omitted from the
dataset with the belief that they are not relevant to the output given
existing beliefs.

Observer Bias
This type of bias occurs when the data scientist subconsciously
influences the outcome of their research by:

• Having previous knowledge or subjective feelings about a
sample of people being studied.
• Unintentional manipulation of participants during surveys or
interviews.
• Cherry picking a group of people who have characteristics that
will support the data scientist’s hypothesis.

Data Bias - Types of Biases

Social Bias
Social bias can be positive or negative, and refers to being in favour
of or against individuals or groups based on their social identities. It
commonly occurs in data science when using data collected from the
web, news, and social media.

The following case study illustrates an example of this, where text
features trained on Google News articles exhibited female and male
gender stereotypes:

• T. Bolukbasi et al., “Man is to Computer Programmer as Woman is
to Homemaker? Debiasing Word Embeddings”, 30th Conference
on Neural Information Processing Systems, 2016.

Personal vs Sensitive Data

What is the difference between personal and sensitive data?

Personal Data
Personal data is information that can be used to identify an
individual. Typical examples are name (first/middle/last name),
address, email address, national insurance number, location data, IP
address, signature, date of birth and bank account details.

Typically, datasets are made up of multiple pieces of personal data
which can be combined to identify an individual.

Personal vs Sensitive Data

Sensitive Data
Sensitive data is a category of personal information that may lead to
harm or discrimination if not treated with extra care and security. For
example, sensitive information about an individual could be on:

• ethnicity
• religious beliefs
• political views and opinions
• sexual orientation
• trade union membership
• biometric data
• health records
• criminal records

This type of data should be encrypted or pseudonymised and stored
separately from other personal data.
Resources:

• https://www.itgovernance.co.uk/data-protection-dpa-and-eu-data-protection-regulation
• https://ico.org.uk/media/for-organisations/documents/1554/determining-what-is-personal-data.pdf

Quiz

Question 1. Which of the following is considered personal data?

A Salary/wages
B Religious beliefs
C Sexual orientation
D Philosophical beliefs

Question 2. Which of the following is considered sensitive data?

A Hours of employment
B Emergency contact person details
C IP address
D Religious affiliation

Research Ethics at QMUL

All projects that involve human participants or personal data require ethics
approval from the university - including MSc projects!
Most projects which involve surveys/questionnaires can be approved
through the EECS ‘low risk’ ethics approval process:
https://qmulprod.sharepoint.com/sites/EECS-DevolvedSchoolResearchEthicsCommittee/

Medium/high risk ethics applications are submitted to the Queen Mary
Ethics of Research Committee:
http://www.jrmo.org.uk/performing-research/conducting-research-with-human-participants-outside-the-nhs/

Summary

Mining Text Data is the process of deriving high-quality information
from text datasets.
Mining Timeseries Data comprises methods for analysing
timeseries data in order to extract meaningful statistics and other
characteristics of the datasets.
Data Ethics evaluates moral problems related to data, algorithms
and corresponding practices in order to formulate and support
morally good solutions.
Data Reliability refers to the assurance of the accuracy and
consistency of datasets.
Data Bias results in skewed outcomes, low accuracy levels, and
analytical errors.
Personal Data is information on an individual. Sensitive Data is
specific personal information that can cause discrimination if misused.
2022 Intelligent Sensing Winter School

Questions?
also please use the forum on QM+

