Data science over small movie dataset – Part 2

Introduction

This document (notebook) shows the transformation of a movie dataset into a form more suitable for making a movie recommender system. (It builds upon Part 1 of the blog post series.)

The movie data was downloaded from “IMDB Movie Ratings Dataset”. That dataset was chosen because:

  • It has the right size for demonstrating data wrangling techniques
    • ≈5000 rows and 15 columns (each row corresponding to a movie)
  • It is “real life” data with expected skewness of variable distributions
  • It is diverse enough over movie years and genres
  • It has a relatively small number of missing values

The full “Raku for Data Science” showcase is done with three notebooks, [AAn1, AAn2, AAn3]:

  1. Data transformations and analysis, [AAn1]
  2. Sparse matrix recommender, [AAn2]
  3. Relationships graphs, [AAn3]

Remark: All three notebooks feature the same introduction, setup, and references sections in order to make it easier for readers to browse, access, or reproduce the content.

Remark: The series data files can be found in the folder “Data” of the GitHub repository “RakuForPrediction-blog”, [AAr1].

The notebook series can be used in several ways:

  • Just reading this introduction and then browsing the notebooks
  • Reading only this (data transformations) notebook in order to see how data wrangling is done
  • Evaluating all three notebooks in order to learn and reproduce the computational steps in them

Outline

Here are the transformation, data analysis, and machine learning steps taken in the notebook series, [AAn1, AAn2, AAn3]:

  1. Ingest the data — Part 1
    • Shape, size, and summaries
    • Numerical columns transformation
    • Renaming columns to have more convenient names
    • Separating the non-uniform genres column into movie-genre associations
      • Into long format
  2. Basic data analysis — Part 1
    • Number of movies per year distribution
    • Movie-genre distribution
    • Pareto principle adherence for movie directors
    • Correlation between number of votes and rating
  3. Association Rules Learning (ARL) — Part 1
    • Converting long format dataset into “baskets” of genres
    • Most frequent combinations of genres
    • Implications between genres
      • E.g., a biography movie is also a drama movie 94% of the time
    • LLM-derived dictionary of most commonly used ARL measures
  4. Recommender system creation — Part 2
    • Conversion of numerical data into categorical data
    • Application of one-hot embedding
    • Experimenting / observing recommendation results
    • Getting familiar with the movie data by computing profiles for sets of movies
  5. Relationships graphs — Part 3
    • Find the nearest neighbors for every movie in a certain range of years
    • Make the corresponding nearest neighbors graph
      • Using different weights for the different types of movie metadata
    • Visualize largest components
    • Make and visualize graphs based on different filtering criteria

Comments & observations

  • This notebook series started as a demonstration of making a “real life” data Recommender System (RS).
    • The data transformations notebook would not be needed if the data had “nice” tabular form.
      • Since the data has aggregated values in its “genres” column, typical long-format transformations have to be done.
      • On the other hand, the actor names per movie are not aggregated but spread out across three columns.
      • Both cases represent a single movie metadata type.
        • For both, long-format transformations (or similar) are needed in order to make an RS.
    • After a corresponding Sparse Matrix Recommender (SMR) is made its sparse matrix can be used to do additional analysis.
      • Such extensions are: deriving clusters, making and visualizing graphs, making and evaluating suitable classifiers.
  • Most of the data transformation steps listed above are taken in most “real life” data processing.
  • ARL can be also used for deriving recommendations if the data is large enough.
  • The SMR object is based on Nearest Neighbors finding over “bags of tags.”
    • Latent Semantic Indexing (LSI) tag-weighting functions are applied.
  • The data does not have movie-viewer data, hence only item-item recommenders are created and used.
  • One-hot embedding is a common technique, which in this notebook is done via cross-tabulation.
  • The categorization of numerical data means putting numbers into suitable bins or “buckets.”
    • The bin or bucket boundaries can be on a regular grid or a quantile grid.
  • For categorized numerical data, one-hot embedding matrices can be processed to increase similarity between numeric buckets that are close to each other.
  • Nearest-neighbors based recommenders — like SMR — can be used as classifiers.
    • These are the so-called K-Nearest Neighbors (KNN) classifiers.
    • Although the data is small (both row-wise & column-wise) we can consider making classifiers predicting IMDB ratings or number of votes.
  • Using the recommender matrix similarities between different movies can be computed and a corresponding graph can be made.
  • Centrality analysis and simulations of random walks over the graph can be made.
    • Like Google’s “Page-rank” algorithm.
  • The relationship graphs can be used to visualize the “structure” of movie dataset.
  • Alternatively, clustering can be used.
    • Hierarchical clustering might be of interest.
  • If the movies had reviews or summaries associated with them, then Latent Semantic Analysis (LSA) could be applied.
    • SMR can use both LSA-terms-based and LSA-topics-based representations of the movies.
    • LLMs can be used to derive the LSA representation.
    • Again, this is not done in this series of notebooks.
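The one-hot-embedding-by-cross-tabulation idea from the comments above can be sketched in Python (pandas is used here purely for illustration; the notebook itself does this with Raku's cross-tabulation):

```python
import pandas as pd

# Long-format (item, tag) records, as produced by the genre separation step
long_form = pd.DataFrame({
    "Item": ["m1", "m1", "m2", "m3", "m3", "m3"],
    "Tag":  ["Drama", "Romance", "Drama", "Action", "Drama", "Thriller"],
})

# Cross-tabulating items against tags yields a 0/1 incidence matrix,
# i.e. a one-hot embedding of each movie over its genres
one_hot = pd.crosstab(long_form["Item"], long_form["Tag"])
print(one_hot)
```

Each row is a movie, each column a genre, and each entry counts how many times the tag occurs for that movie (0 or 1 here).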

Setup

Load packages used in the notebook:

use Math::SparseMatrix;
use ML::SparseMatrixRecommender;
use ML::SparseMatrixRecommender::Utilities;
use Statistics::OutlierIdentifiers;

#% javascript
require.config({
    paths: {
        d3: 'https://d3js.org/d3.v7.min'
    }
});

require(['d3'], function(d3) {
     console.log(d3);
});

#% js
js-d3-list-line-plot(10.rand xx 40, background => 'none', stroke-width => 2)

my $title-color = 'Silver';
my $stroke-color = 'SlateGray';
my $tooltip-color = 'LightBlue';
my $tooltip-background-color = 'none';
my $tick-labels-font-size = 10;
my $tick-labels-color = 'Silver';
my $tick-labels-font-family = 'Helvetica';
my $background = '#1F1F1F';
my $color-scheme = 'schemeTableau10';
my $color-palette = 'Inferno';
my $edge-thickness = 3;
my $vertex-size = 6;
my $mmd-theme = q:to/END/;
%%{
  init: {
    'theme': 'forest',
    'themeVariables': {
      'lineColor': 'Ivory'
    }
  }
}%%
END
my %force = collision => {iterations => 0, radius => 10}, link => {distance => 180};
my %force2 = charge => {strength => -30, iterations => 4}, collision => {radius => 50, iterations => 4}, link => {distance => 30};

my %opts = :$background, :$title-color, :$edge-thickness, :$vertex-size;

# {background => #1F1F1F, edge-thickness => 3, title-color => Silver, vertex-size => 6}


Ingest data

Download the data files from GitHub and unzip them.

Ingest movie data:

my $fileName = $*HOME ~ '/Downloads/movie_data.csv';
my @dsMovieData = data-import($fileName, headers => 'auto');
@dsMovieData .= map({ $_<title_year> = $_<title_year>.Int.Str; $_});
deduce-type(@dsMovieData)

# Vector(Assoc(Atom((Str)), Atom((Str)), 15), 5043)

Here is a sample of the movie data over the columns we are most interested in:

#% html
my @movie-columns = <index movie_title title_year genres imdb_score num_voted_users>;
@dsMovieData.pick(4)
==> to-html(field-names => @movie-columns)

+-------+-------------------------+------------+-------------------------------------------+------------+-----------------+
| index | movie_title             | title_year | genres                                    | imdb_score | num_voted_users |
+-------+-------------------------+------------+-------------------------------------------+------------+-----------------+
| 3322  | Veronika Decides to Die | 2009       | Drama|Romance                             | 6.5        | 10100           |
| 1511  | The Maze Runner         | 2014       | Action|Mystery|Sci-Fi|Thriller            | 6.8        | 310903          |
| 1301  | Big Miracle             | 2012       | Biography|Drama|Romance                   | 6.5        | 15231           |
| 55    | The Good Dinosaur       | 2015       | Adventure|Animation|Comedy|Family|Fantasy | 6.8        | 62836           |
+-------+-------------------------+------------+-------------------------------------------+------------+-----------------+

Ingest the movie data already transformed in the first notebook, [AAn1]:

my @dsMovieDataLongForm = data-import($*HOME ~ '/Downloads/dsMovieDataLongForm.csv', headers => 'auto');
deduce-type(@dsMovieDataLongForm)

# Vector(Assoc(Atom((Str)), Atom((Str)), 3), 84481)

Data summary:

my @field-names = <Item TagType Tag>;
sink records-summary(@dsMovieDataLongForm, :@field-names)

# +------------------+------------------------+-------------------+
# | Item             | TagType                | Tag               |
# +------------------+------------------------+-------------------+
# | 1387    => 27    | genre         => 29008 | Drama    => 5188  |
# | 3539    => 27    | actor         => 15129 | English  => 4704  |
# | 902     => 27    | title         => 5043  | USA      => 3807  |
# | 2340    => 27    | reviews_count => 5043  | Comedy   => 3744  |
# | 839     => 25    | language      => 5043  | Thriller => 2822  |
# | 1667    => 25    | country       => 5043  | Action   => 2306  |
# | 466     => 25    | director      => 5043  | Romance  => 2214  |
# | (Other) => 84298 | (Other)       => 15129 | (Other)  => 59696 |
# +------------------+------------------------+-------------------+



Recommender system

One way to investigate (browse) the data is to make a recommender system and use it to explore different aspects of the movie dataset, like movie profiles and the distribution of nearest-neighbor similarities.

Make the recommender

In order to make a more meaningful recommender we put the values of the different numerical variables into “buckets”, i.e. intervals derived from the value distribution of each variable. The boundaries of the intervals can form a regular grid, correspond to quantile values, or be specially made. Here we use quantiles:

my @bucketVars = <score votes_count reviews_count>;
my @dsMovieDataLongForm2;
sink for @dsMovieDataLongForm.map(*<TagType>).unique -> $var {
    if $var ∈ @bucketVars {
        my %bucketizer = ML::SparseMatrixRecommender::Utilities::categorize-to-intervals(@dsMovieDataLongForm.grep(*<TagType> eq $var).map(*<Tag>)».Numeric, probs => (0..6) >>/>> 6, :interval-names):pairs;
        @dsMovieDataLongForm2.append(@dsMovieDataLongForm.grep(*<TagType> eq $var).map(*.clone).map({ $_<Tag> = %bucketizer{$_<Tag>}; $_ }))
    } else {
        @dsMovieDataLongForm2.append(@dsMovieDataLongForm.grep(*<TagType> eq $var))
    }
}

sink records-summary(@dsMovieDataLongForm2, :@field-names, :12max-tallies)

# +------------------+------------------------+--------------------+
# | Item             | TagType                | Tag                |
# +------------------+------------------------+--------------------+
# | 902     => 19    | actor         => 15129 | English   => 4704  |
# | 2340    => 19    | genre         => 14504 | USA       => 3807  |
# | 1387    => 19    | score         => 5043  | Drama     => 2594  |
# | 3539    => 19    | country       => 5043  | Comedy    => 1872  |
# | 152     => 18    | votes_count   => 5043  | Thriller  => 1411  |
# | 466     => 18    | language      => 5043  | Action    => 1153  |
# | 1424    => 18    | year          => 5043  | Romance   => 1107  |
# | 839     => 18    | director      => 5043  | Adventure => 923   |
# | 132     => 18    | title         => 5043  | 6.1≤v<6.6 => 901   |
# | 113     => 18    | reviews_count => 5043  | 7≤v<7.5   => 891   |
# | 720     => 18    |                        | Crime     => 889   |
# | 1284    => 18    |                        | 7.5≤v<9.5 => 886   |
# | (Other) => 69757 |                        | (Other)   => 48839 |
# +------------------+------------------------+--------------------+
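The quantile bucketization above can be sketched in Python (a hypothetical `categorize_to_intervals`, mimicking the idea of the Raku utility, not its actual code):

```python
import numpy as np

def categorize_to_intervals(values, n_buckets=6):
    """Assign each value to a quantile-based bucket and return interval labels."""
    values = np.asarray(values, dtype=float)
    # Quantile boundaries at probabilities 0, 1/n, ..., 1 (n+1 boundaries, n buckets)
    bounds = np.quantile(values, np.linspace(0, 1, n_buckets + 1))
    # np.digitize assigns each value to the interval [bounds[i], bounds[i+1])
    idx = np.clip(np.digitize(values, bounds[1:-1]), 0, n_buckets - 1)
    labels = [f"{bounds[i]:g}<=v<{bounds[i + 1]:g}" for i in range(n_buckets)]
    return [labels[i] for i in idx]

scores = [6.5, 6.8, 6.5, 6.8, 7.4, 5.9, 7.5, 7.1, 6.0, 7.3]
buckets = categorize_to_intervals(scores, n_buckets=3)
print(buckets)
```

With equal-probability (quantile) boundaries each bucket receives roughly the same number of values, which is what makes the resulting tags useful for recommendations over skewed distributions.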


Here we make a Sparse Matrix Recommender (SMR):

my $smrObj = 
    ML::SparseMatrixRecommender.new
    .create-from-long-form(
        @dsMovieDataLongForm2, 
        item-column-name => 'Item', 
        tag-type-column-name => 'TagType',
        tag-column-name => 'Tag',
        :add-tag-types-to-column-names)        
    .apply-term-weight-functions('IDF', 'None', 'Cosine')

# ML::SparseMatrixRecommender(:matrix-dimensions((5043, 13825)), :density(<23319/23239825>), :tag-types(("reviews_count", "score", "votes_count", "genre", "country", "language", "actor", "director", "title", "year")))

Here are the recommender sub-matrices dimensions (rows and columns):

.say for $smrObj.take-matrices.deepmap(*.dimensions).sort(*.key)

# actor => (5043 6256)
# country => (5043 66)
# director => (5043 2399)
# genre => (5043 26)
# language => (5043 48)
# reviews_count => (5043 7)
# score => (5043 7)
# title => (5043 4917)
# votes_count => (5043 7)
# year => (5043 92)


Note that the sub-matrices of “reviews_count”, “score”, and “votes_count” have a small number of columns, corresponding to the number of probabilities specified when categorizing to intervals.

Enhance with one-hot embedding

my $mat = $smrObj.take-matrices<year>;

my $matUp = Math::SparseMatrix.new(
    diagonal => 1/2 xx ($mat.columns-count - 1), k => 1, 
    row-names => $mat.column-names,
    column-names => $mat.column-names
);

my $matDown = $matUp.transpose;

# mat = mat + mat . matUp + mat . matDown
$mat = $mat.add($mat.dot($matUp)).add($mat.dot($matDown));

# Math::SparseMatrix(:specified-elements(14915), :dimensions((5043, 92)), :density(<14915/463956>))
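The “smearing” over neighboring buckets done above is just adding shifted, half-weighted copies of the one-hot matrix. A NumPy sketch of the same idea (illustrative, not the package code):

```python
import numpy as np

# One-hot matrix: 3 items over 4 consecutive year buckets
mat = np.array([
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

n = mat.shape[1]
# Super-diagonal matrix with 1/2: multiplying by it shifts weight to the next bucket
mat_up = np.diag(np.full(n - 1, 0.5), k=1)
# Its transpose shifts weight to the previous bucket
mat_down = mat_up.T

# mat = mat + mat . mat_up + mat . mat_down
smoothed = mat + mat @ mat_up + mat @ mat_down
print(smoothed)
```

After smoothing, a movie from year bucket i also carries half weight in buckets i-1 and i+1, so movies from adjacent years become more similar.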

Make a new recommender with the enhanced matrices:

my %matrices = $smrObj.take-matrices;
%matrices<year> = $mat;
my $smrObj2 = ML::SparseMatrixRecommender.new(%matrices)

# ML::SparseMatrixRecommender(:matrix-dimensions((5043, 13825)), :density(<79829/69719475>), :tag-types(("genre", "title", "year", "actor", "director", "votes_count", "reviews_count", "score", "country", "language")))

Recommendations

Example recommendation by profile:

sink $smrObj2
.apply-tag-type-weights({genre => 2})
.recommend-by-profile(<genre:History year:1999>, 12, :!normalize)
.join-across(select-columns(@dsMovieData, @movie-columns), 'index')
.echo-value(as => {to-pretty-table($_, align => 'l', field-names => ['score', |@movie-columns])})

# +----------+-------+------------------------------------------+------------+----------------------------------------------+------------+-----------------+
# | score    | index | movie_title                              | title_year | genres                                       | imdb_score | num_voted_users |
# +----------+-------+------------------------------------------+------------+----------------------------------------------+------------+-----------------+
# | 1.887751 | 553   | Anna and the King                       | 1999       | Drama|History|Romance                        | 6.7        | 31080           |
# | 1.817476 | 215   | The 13th Warrior                        | 1999       | Action|Adventure|History                     | 6.6        | 101411          |
# | 1.567726 | 1016  | The Messenger: The Story of Joan of Arc | 1999       | Adventure|Biography|Drama|History|War        | 6.4        | 55889           |
# | 1.500264 | 2468  | One Man's Hero                          | 1999       | Action|Drama|History|Romance|War|Western     | 6.2        | 899             |
# | 1.487091 | 2308  | Topsy-Turvy                             | 1999       | Biography|Comedy|Drama|History|Music|Musical | 7.4        | 10037           |
# | 1.479006 | 4006  | La otra conquista                       | 1998       | Drama|History                                | 6.8        | 1024            |
# | 1.411933 | 492   | Thirteen Days                           | 2000       | Drama|History|Thriller                       | 7.3        | 45231           |
# | 1.312900 | 909   | Beloved                                 | 1998       | Drama|History|Horror                         | 5.9        | 6082            |
# | 1.237700 | 1931  | Elizabeth                               | 1998       | Biography|Drama|History                      | 7.5        | 75973           |
# | 1.168287 | 253   | The Patriot                             | 2000       | Action|Drama|History|War                     | 7.1        | 207613          |
# | 1.069476 | 1820  | The Newton Boys                         | 1998       | Action|Crime|Drama|History|Western           | 6.0        | 8309            |
# | 1.000000 | 4767  | America Is Still the Place              | 2015       | History                                      | 7.5        | 22              |
# +----------+-------+------------------------------------------+------------+----------------------------------------------+------------+-----------------+
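Conceptually, recommendation by profile with this kind of recommender boils down to a (sparse) matrix-vector multiplication: each item is scored by the weighted overlap of its tags with the profile. Here is a minimal dense sketch (tiny illustrative matrix with simplified tags, not the actual recommender data):

```python
import numpy as np

items = ["The 13th Warrior", "Elizabeth", "The Patriot"]
tags = ["genre:Action", "genre:History", "year:1999"]

# Item-tag incidence matrix (rows: items, columns: tags)
M = np.array([
    [1, 1, 1],   # Action, History, 1999
    [0, 1, 0],   # History
    [1, 1, 0],   # Action, History
], dtype=float)

# Profile vector marking which tags the query profile contains
profile = np.array([0, 1, 1], dtype=float)  # genre:History, year:1999

# Score each item against the profile and rank
scores = M @ profile
ranking = sorted(zip(items, scores), key=lambda p: -p[1])
print(ranking)
```

In the real recommender the matrix is sparse and the entries are LSI-weighted rather than 0/1, but the scoring step is the same.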


Recommendation by history:

sink $smrObj
.recommend(<2125 2308>, 12, :!normalize, :!remove-history)
.join-across(select-columns(@dsMovieData, @movie-columns), 'index')
.echo-value(as => {to-pretty-table($_, align => 'l', field-names => ['score', |@movie-columns])})

# +-----------+-------+-------------------------+------------+----------------------------------------------+------------+-----------------+
# | score     | index | movie_title             | title_year | genres                                       | imdb_score | num_voted_users |
# +-----------+-------+-------------------------+------------+----------------------------------------------+------------+-----------------+
# | 12.510011 | 2125  | Molière                | 2007       | Comedy|History                               | 7.3        | 5166            |
# | 12.510011 | 2308  | Topsy-Turvy            | 1999       | Biography|Comedy|Drama|History|Music|Musical | 7.4        | 10037           |
# | 8.364831  | 1728  | The Color of Freedom   | 2007       | Biography|Drama|History                      | 7.1        | 10175           |
# | 8.182233  | 1724  | Little Nicholas        | 2009       | Comedy|Family                                | 7.2        | 9214            |
# | 7.753039  | 3619  | Little Voice           | 1998       | Comedy|Drama|Music|Romance                   | 7.0        | 13892           |
# | 7.439471  | 2285  | Mrs Henderson Presents | 2005       | Comedy|Drama|Music|War                       | 7.1        | 13505           |
# | 7.430299  | 3404  | Made in Dagenham       | 2010       | Biography|Comedy|Drama|History               | 7.2        | 11158           |
# | 7.270637  | 1799  | A Passage to India     | 1984       | Adventure|Drama|History                      | 7.4        | 12980           |
# | 7.264810  | 3837  | The Names of Love      | 2010       | Comedy|Drama|Romance                         | 7.2        | 6304            |
# | 7.117232  | 4648  | The Hammer             | 2007       | Comedy|Romance|Sport                         | 7.3        | 5489            |
# | 7.046925  | 4871  | Shotgun Stories        | 2007       | Drama|Thriller                               | 7.3        | 7148            |
# | 7.040720  | 3194  | The House of Mirth     | 2000       | Drama|Romance                                | 7.1        | 6377            |
# +-----------+-------+-------------------------+------------+----------------------------------------------+------------+-----------------+
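Recommendation by history is analogous: the history items are first folded into a combined tag profile, so the scores are M·(Mᵀ·h), i.e. item-item similarity through shared tags. A minimal sketch (illustrative matrix):

```python
import numpy as np

# Item-tag incidence matrix (4 items, 5 tags); illustrative values
M = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
], dtype=float)

# History vector: the "consumed" item is item 0
h = np.array([1, 0, 0, 0], dtype=float)

# Fold the history into a tag profile, then score all items against it:
# scores = M . (M^T . h) -- item-item similarity via shared tags
scores = M @ (M.T @ h)
print(scores)
```

Note that the history item itself gets the highest score, which is why the history movies top the table above when the adverb :!remove-history is used.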


Profiles

Find movie IDs for certain criteria (e.g. historic action movies):

my @movieIDs = $smrObj.recommend-by-profile(<genre:Action genre:History>, Inf, :!normalize).take-value.grep(*.value > 1)».key;
deduce-type(@movieIDs)

# Vector(Atom((Str)), 14)

Find the profile of the movie set:

my @profile = |$smrObj.profile(@movieIDs).take-value;
deduce-type(@profile)

# Vector(Pair(Atom((Str)), Atom((Numeric))), 108)

Find the top outliers in that profile:

outlier-identifier(@profile».value, identifier => &top-outliers o &quartile-identifier-parameters)
==> {@profile[$_]}()
==> my @profile2;

deduce-type(@profile2)

# Vector(Pair(Atom((Str)), Atom((Numeric))), 26)

Here is a table of the top profile tags and their scores:

#%html
@profile.head(28)
==> { $_.map({ to-html-table([$_,]) }) }()
==> to-html(:multi-column, :4columns, :html-elements)

+------------------------------+---------------------+
| Tag                          | Score               |
+------------------------------+---------------------+
| genre:History                | 0.9999999999999999  |
| language:English             | 0.8159209423532133  |
| genre:Action                 | 0.46214109363846967 |
| genre:Adventure              | 0.38097093240387203 |
| score:6.6≤v<7                | 0.3626315299347615  |
| country:China                | 0.3626315299347615  |
| votes_count:14985≤v<34359    | 0.3626315299347615  |
| language:Mandarin            | 0.3626315299347615  |
| reviews_count:0≤v<37         | 0.3626315299347615  |
| score:6.1≤v<6.6              | 0.36263152993476144 |
| country:USA                  | 0.36263152993476144 |
| reviews_count:450≤v<5060     | 0.36263152993476144 |
| votes_count:147317≤v<1689764 | 0.2719736474510711  |
| reviews_count:91≤v<155       | 0.2719736474510711  |
| score:7.5≤v<9.5              | 0.2719736474510711  |
| votes_count:5≤v<4120         | 0.2719736474510711  |
| title:Hero                   | 0.18131576496738075 |
| votes_count:68935≤v<147317   | 0.18131576496738075 |
| reviews_count:37≤v<91        | 0.18131576496738075 |
| year:2002                    | 0.18131576496738075 |
| director:Yimou Zhang         | 0.18131576496738075 |
| year:2015                    | 0.18131576496738075 |
| year:2014                    | 0.18131576496738075 |
| country:UK                   | 0.18131576496738075 |
| score:7≤v<7.5                | 0.18131576496738075 |
| votes_count:4120≤v<14985     | 0.18131576496738072 |
| genre:Drama                  | 0.1320986315690731  |
| genre:Romance                | 0.13001981085966202 |
+------------------------------+---------------------+

Plot all of the profile’s scores together with the score outliers:

#%js
js-d3-list-plot(
    [|@profile».value.kv.map(-> $x, $y { %(:$x, :$y, group => 'full profile' ) }), 
     |@profile2».value.kv.map(-> $x, $y { %(:$x, :$y, group => 'outliers' ) })], 
    :$background,
    :300height,
    :600width
    )


References

Articles, blog posts

[AA1] Anton Antonov, “Introduction to data wrangling with Raku”, (2021), RakuForPrediction at WordPress.

[AA2] Anton Antonov, “Implementing Machine Learning algorithms in Raku (TRC-2022 talk)”, (2021), RakuForPrediction at WordPress.

Notebooks

[AAn1] Anton Antonov,
“Small movie dataset analysis”,
(2025),
RakuForPrediction-blog at GitHub.

[AAn2] Anton Antonov,
“Small movie dataset recommender”,
(2025),
RakuForPrediction-blog at GitHub.

[AAn3] Anton Antonov,
“Small movie dataset graph”,
(2025),
RakuForPrediction-blog at GitHub.

Packages

[AAp1] Anton Antonov, Data::Importers, Raku package, (2024-2025), GitHub/antononcube.

[AAp2] Anton Antonov, Data::Reshapers, Raku package, (2021-2025), GitHub/antononcube.

[AAp3] Anton Antonov, Data::Summarizers, Raku package, (2021-2024), GitHub/antononcube.

[AAp4] Anton Antonov, Graph, Raku package, (2024-2025), GitHub/antononcube.

[AAp5] Anton Antonov, JavaScript::D3, Raku package, (2022-2025), GitHub/antononcube.

[AAp6] Anton Antonov, Jupyter::Chatbook, Raku package, (2023-2025), GitHub/antononcube.

[AAp7] Anton Antonov, Math::SparseMatrix, Raku package, (2024-2025), GitHub/antononcube.

[AAp8] Anton Antonov, ML::AssociationRuleLearning, Raku package, (2022-2024), GitHub/antononcube.

[AAp9] Anton Antonov, ML::SparseMatrixRecommender, Raku package, (2025), GitHub/antononcube.

[AAp10] Anton Antonov, Statistics::OutlierIdentifiers, Raku package, (2022), GitHub/antononcube.

Videos

[AAv1] Anton Antonov, “Simplified Machine Learning Workflows Overview (Raku-centric)”, (2022), YouTube/@AAA4prediction.

[AAv2] Anton Antonov, “TRC 2022 Implementation of ML algorithms in Raku”, (2022), YouTube/@AAA4prediction.

[AAv3] Anton Antonov, “Exploratory Data Analysis with Raku”, (2024), YouTube/@AAA4prediction.

[AAv4] Anton Antonov, “Raku RAG demo”, (2024), YouTube/@AAA4prediction.

Outlier detection in a list of numbers

Introduction

Outlier identification is indispensable for data cleaning, normalization, and analysis.

I frequently include outlier identification in the interfaces and algorithms I make for search and recommendation engines. Another fundamental application of 1D outlier detection is in algorithms for anomaly detection in time series. (See [AAv1, AAv2].)

My first introduction to outlier detection was through the book “Mining Imperfect Data: Dealing with Contamination and Incomplete Records” by Ronald K. Pearson, [RKP1].

This notebook shows examples of using the Raku package “Statistics::OutlierIdentifiers”, [AAp1]. There are related Mathematica and R packages; see [AAp2, AAp3].

Remark: This Mathematica notebook uses the Raku connection described in [AA2]. See the section “Setup” at the end. The Raku function for data summarization is described in [AA3].

Remark: In this WordPress blog post the programming languages are not marked in the code blocks (as they are in GitHub Markdown). Hence the Wolfram Language (WL) code is announced in the explanations immediately before that code.


Outlier detection basics

The purpose of outlier detection algorithms is to find the elements in a list of numbers whose values are significantly higher or lower than the rest of the values.

Taking a certain number of elements with the highest values is not the same as outlier detection, but it can be used as a substitute.

Let us consider the following set of 50 numbers (WL):

SeedRandom[1212];
points = RandomVariate[GammaDistribution[5, 1], 50];
ResourceFunction["RecordsSummary"][points]

If we sort those numbers in descending order and plot them we get (WL):

points = points // Sort // Reverse;
ListPlot[points, PlotStyle -> {PointSize[0.015]}, PlotTheme -> "Detailed", PlotRange -> All, Filling -> Axis, ImageSize -> Medium]

Here are the outlier positions found with WL's built-in function (WL):

OutlierPosition[points]

(*{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 45, 46, 47, 48, 49, 50}*)

Let us use the following outlier detection algorithm:

  1. Find all values in the list that are larger than the mean value multiplied by 1.5;
  2. Then find the positions of these values in the list of numbers.

Let us show how we can implement that algorithm in Raku.

First we “transfer” the WL generated points to Raku (WL):

RakuInputExecute["my @points = " <> ToRakuCode[points]];

Here is the summary:

use Data::Summarizers;
records-summary(@points)

# +---------------------+
# | numerical           |
# +---------------------+
# | Max    => 9.82048   |
# | Mean   => 4.8709482 |
# | 3rd-Qu => 6.04842   |
# | Min    => 1.88537   |
# | 1st-Qu => 3.5015    |
# | Median => 4.44519   |
# +---------------------+

Here are the first and second steps combined:

@points.pairs.grep({ $_.value > 1.5 * mean(@points) })

# (0 => 9.82048 1 => 8.78346 2 => 8.55282 3 => 7.94426 4 => 7.6337 5 => 7.43507)

Here we transfer the found outlier positions (the keys of the pairs above) from Raku to WL:

pos = 1 + RakuInputExecute["@points.pairs.grep({ $_.value > 1.5 * mean(@points) })>>.key ==>encode-to-wl()"]

# {1, 2, 3, 4, 5, 6}

Here we plot the data and the outliers (WL):

ListPlot[{points, Transpose[{pos, points[[pos]]}]}, PlotStyle -> {{PointSize[0.02]}, {Red, PointSize[0.012]}}, Filling -> Axis, PlotRange -> All, PlotTheme -> "Detailed", ImageSize -> Medium, PlotLegends -> {"data", "outliers"}]

Instead of the mean value we can use another reference point, like the median value.

Obviously, we can also use a multiplier different than 1.5.

Using the package

First let us load the outlier identification package:

use Statistics::OutlierIdentifiers;

We can find the outliers in a list of numbers with the function outlier-identifier (using the adverb “values”):

outlier-identifier(@points):values

# (9.82048 8.78346 8.55282 7.94426 7.6337 7.43507 7.25105 7.18306 7.1653 6.66771 6.44773 6.27979 2.65329 2.59209 1.92725 1.88537)

The package has three functions for the calculation of outlier identifier parameters over a list of numbers:

.say for (&hampel-identifier-parameters, &splus-quartile-identifier-parameters, &quartile-identifier-parameters).map({ $_ => $_.(@points) });

# &hampel-identifier-parameters => (2.678434884 6.211945116)
# &splus-quartile-identifier-parameters => (-0.31888 9.8688)
# &quartile-identifier-parameters => (1.89827 6.99211)

Elements of the number list that are outside of the numerical interval made by one of these pairs of numbers are considered outliers.

In many cases we want only the top outliers or only the bottom outliers. We can use the functions top-outliers and bottom-outliers for that. Here is an example of finding top outliers using the Hampel outlier identifier:

@points ==> 
outlier-identifier(identifier => &top-outliers o &hampel-identifier-parameters ):values

# (9.82048 8.78346 8.55282 7.94426 7.6337 7.43507 7.25105 7.18306 7.1653 6.66771 6.44773 6.27979)

Remark: We use the composition operator (o) in the code above.
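Based on the intervals printed earlier, the three identifier parameter functions can be reconstructed in Python as follows (a sketch inferred from the printed values; consult the package source for the authoritative definitions):

```python
import numpy as np

def hampel_identifier_parameters(x):
    # Median +/- 1.4826 * MAD (the 1.4826 factor makes the MAD consistent
    # with the standard deviation for normally distributed data)
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))
    return med - mad, med + mad

def splus_quartile_identifier_parameters(x):
    # The classic boxplot rule: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def quartile_identifier_parameters(x):
    # Median +/- IQR
    q1, med, q3 = np.quantile(x, [0.25, 0.5, 0.75])
    return med - (q3 - q1), med + (q3 - q1)

def outlier_positions(x, params):
    # Positions of the values falling outside the identifier's interval
    lo, hi = params(x)
    return [i for i, v in enumerate(x) if v < lo or v > hi]

data = [1.9, 2.7, 3.5, 4.1, 4.4, 4.8, 5.2, 6.0, 7.4, 9.8]
print(outlier_positions(data, splus_quartile_identifier_parameters))
```

Composing with top-outliers corresponds to keeping only the upper-bound test (v > hi), and bottom-outliers to keeping only the lower one.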

Comparison

Here is a visual comparison of the three outlier identifiers in the package “Statistics::OutlierIdentifiers”:

Assume we have a (sorted) list of values:

use Data::Generators;
my @vals = random-variate(NormalDistribution.new(:mean(12), :sd(6)), 600).sort;
records-summary(@vals)

# +------------------------------+
# | numerical                    |
# +------------------------------+
# | 1st-Qu => 7.83167266283631   |
# | Mean   => 11.93490153897063  |
# | Min    => -4.807707682102588 |
# | 3rd-Qu => 15.873759102570908 |
# | Median => 11.743637620896695 |
# | Max    => 31.0034374940894   |
# +------------------------------+

Here we get the Raku values into WL:

vals = RakuInputExecute["@vals==>encode-to-wl()"];

Here is a plot of the sorted values (WL):

ListPlot[vals, PlotRange -> All, PlotTheme -> "Detailed"]

Here we find the outlier positions for each identifier:

(&hampel-identifier-parameters, &splus-quartile-identifier-parameters, &quartile-identifier-parameters).map({ $_.name => outlier-identifier(@vals, identifier=>$_) }).Hash==>encode-to-wl()

(*<|"hampel-identifier-parameters" -> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599}, 
"splus-quartile-identifier-parameters" -> {0, 1, 2, 3, 596, 597, 598,599}, 
"quartile-identifier-parameters" -> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599}|>*)

Here we assign the last Raku value – a hash – to a WL variable:

aOutliers = %;

Here is visual comparison of the three outlier detection algorithms (WL):

vals2 = Transpose[{Range[Length[vals]], vals}];
ListPlot[{vals2, vals2[[aOutliers[#] + 1]]}, PlotLabel -> #, ImageSize -> Medium, PlotStyle -> {{}, {Red, PointSize[0.01]}}, PlotTheme -> "Detailed", PlotRange -> All] & /@ Sort[Keys[aOutliers]]

We can see that the Hampel outlier identifier is most “permissive” at labeling points as outliers, and the SPLUS quartile-based identifier is the most “conservative.”


Application example

Let us consider an application of outlier detection using one of the Raku challenges: challenge 165, “Task 2: Line of Best Fit”.

In this section we use the code given in the blog post “Writing it down”, [WP1].

Data

Here we get the data:

my $input = '333,129  39,189 140,156 292,134 393,52  160,166 362,122  13,193
                341,104 320,113 109,177 203,152 343,100 225,110  23,186 282,102
                284,98  205,133 297,114 292,126 339,112 327,79  253,136  61,169
                128,176 346,72  316,103 124,162  65,181 159,137 212,116 337,86
                215,136 153,137 390,104 100,180  76,188  77,181  69,195  92,186
                275,96  250,147  34,174 213,134 186,129 189,154 361,82  363,89';

my @points = $input.words».split(',')».Int;
@points.elems

# 48

Here we plot the points (WL):

points = RakuInputExecute["@points==>encode-to-wl()"];
grData = ListPlot[points, PlotRange -> All, PlotStyle -> Gray, PlotTheme -> "Detailed", ImageSize -> Medium]
10jw29bko1kjj

Best fit line

Here we compute the best linear fit:

my \term:<x²> := @points[*;0]».&(*²);
my \xy = @points[*;0] »*« @points[*;1];
my \Σx = [+] @points[*;0];
my \Σy = [+] @points[*;1];
my \term:<Σx²> = [+] x²;
my \Σxy = [+] xy;
my \N = +@points;

my $m = (N * Σxy - Σx * Σy) / (N * Σx² - (Σx)²);
my $b = (Σy - $m * Σx) / N;

say [$m, $b];

# [-0.2999565 200.132272536]
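As a sanity check outside the Raku/WL session, the same closed-form least-squares formulas can be evaluated with a short Python sketch (Python is not part of the original workflow; the data string is copied verbatim from above):

```python
# Cross-check of the closed-form least-squares fit, using the same
# "x,y" pairs as the Raku code above.
input_str = """
333,129  39,189 140,156 292,134 393,52  160,166 362,122  13,193
341,104 320,113 109,177 203,152 343,100 225,110  23,186 282,102
284,98  205,133 297,114 292,126 339,112 327,79  253,136  61,169
128,176 346,72  316,103 124,162  65,181 159,137 212,116 337,86
215,136 153,137 390,104 100,180  76,188  77,181  69,195  92,186
275,96  250,147  34,174 213,134 186,129 189,154 361,82  363,89
"""

points = [tuple(map(int, w.split(','))) for w in input_str.split()]
n = len(points)
sx = sum(x for x, _ in points)
sy = sum(y for _, y in points)
sxx = sum(x * x for x, _ in points)
sxy = sum(x * y for x, y in points)

# Same formulas as the Raku code:
#   m = (N*Σxy - Σx*Σy) / (N*Σx² - (Σx)²),  b = (Σy - m*Σx) / N
m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - m * sx) / n
print(m, b)  # m ≈ -0.299957, b ≈ 200.1323
```

The resulting slope and intercept match the Raku output and the WL `Fit` check below.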

Here we bring into WL the slope and intercept of the fitted line computed above:

{m, b} = RakuInputExecute["[$m, $b]==>encode-to-wl()"];

Here we plot the best fit line (WL):

grLine = Plot[x*m + b, {x, Min[points[[All, 1]]], Max[points[[All, 1]]]}];
Show[{grData, grLine}]
0uhd1yjdca24j

Remark: Of course, we can quickly check the Raku line fit result with WL:

Fit[points, {1, x}, x]

(* 200.132 - 0.299957 x *)

Fit-wise outliers

Let us find the points that are closest to, and most distant from, the fitted line.

First, we find the distances:

my @diffs = @points.map({ my $y = $m * $_[0] + $b; abs($_[1] - $y ) / $y })

Here we find the positions of the top outliers:

my @topPos = outlier-identifier(@diffs, identifier => (&top-outliers o &hampel-identifier-parameters));

# [0 3 4 6 13 21 25 34 40 41]

Here we find the positions of the bottom outliers:

my @bottomPos = outlier-identifier(@diffs, identifier => (&bottom-outliers o &hampel-identifier-parameters))

# [1 27 28 32]
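Conceptually, the `top-outliers` and `bottom-outliers` compositions above just keep one of the two Hampel bounds: residuals above the upper bound mark the most distant points, and residuals below the lower bound mark the closest ones. Here is a hypothetical Python sketch of that splitting logic (median/MAD bounds with the assumed constant 1.4826; not the Raku package's actual implementation):

```python
import statistics

def hampel_bounds(xs, k=1.4826):
    """Hampel identifier bounds: median +/- k * MAD (k is an assumption)."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    return (med - k * mad, med + k * mad)

def top_outlier_positions(xs):
    """Positions above the upper Hampel bound (most distant points)."""
    _, hi = hampel_bounds(xs)
    return [i for i, x in enumerate(xs) if x > hi]

def bottom_outlier_positions(xs):
    """Positions below the lower Hampel bound (closest points)."""
    lo, _ = hampel_bounds(xs)
    return [i for i, x in enumerate(xs) if x < lo]

diffs = [0.10, 0.12, 0.11, 0.50, 0.01, 0.10]
print(top_outlier_positions(diffs))     # includes 3 (the large residual 0.50)
print(bottom_outlier_positions(diffs))  # [4] (the small residual 0.01)
```

By construction a residual cannot be both a top and a bottom outlier, which is why the two position lists above are disjoint.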

Here is a plot with the data, the linear fit, and the found top and bottom outliers (WL):

grTopOutliers = 
  ListPlot[points[[1 + RakuInputExecute["@topPos==>encode-to-wl()"]]], PlotStyle -> {PointSize[0.015], Blue}];
grBottomOutliers = 
  ListPlot[ points[[1 + RakuInputExecute["@bottomPos==>encode-to-wl()"]]], PlotStyle -> {PointSize[0.015], Red}];
Legended[ Show[{grData, grLine, grTopOutliers, grBottomOutliers}, ImageSize -> Large],  SwatchLegend[{Gray, Blue, Red}, {"data", "top outliers", "bottom outliers"}]]
10g4x6f5c7z0i

Setup

WL Raku process initialization:

RakuMode[]
KillRakuProcess[]
StartRakuProcess["Raku" -> "~/.rakubrew/shims/raku"]
0oelkun1d4hsw

Serializers load (WL)

Import["https://raw.githubusercontent.com/antononcube/ConversationalAgents/master/Packages/WL/RakuDecoder.m"]
Import["https://raw.githubusercontent.com/antononcube/ConversationalAgents/master/Packages/WL/RakuEncoder.m"]
SetOptions[RakuInputExecute, Epilog -> FromRakuCode];
SetOptions[Dataset, MaxItems -> {Automatic, 40}];

Load Raku packages

use Data::Generators;
use Data::Reshapers;
use Data::Summarizers;
use Stats;
use Mathematica::Serializer;

use Statistics::OutlierIdentifiers; 

References

Articles, books

[AA1] Anton Antonov, “Outlier detection in a list of numbers”, (2013), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “Connecting Mathematica and Raku”, (2021), RakuForPrediction at WordPress.

[AA3] Anton Antonov, “Introduction to data wrangling with Raku”, (2021), RakuForPrediction at WordPress.

[RKP1] Ronald K. Pearson, Mining Imperfect Data: Dealing with Contamination and Incomplete Records, 2005, SIAM. (Volume 93 of Other titles in applied mathematics.). ISBN 0898715822, 9780898715828.

[WP1] Wenzel P.P. Peppmeyer, “Writing it down”, (2022), Playing Perl6 → Raku at WordPress.

Packages

[AAp1] Anton Antonov, Statistics::OutlierIdentifiers Raku package, (2022), GitHub/antononcube.

[AAp2] Anton Antonov, “Implementation of one dimensional outlier identifying algorithms in Mathematica”, (2013), MathematicaForPrediction at GitHub.

[AAp3] Anton Antonov, “OutlierIdentifiers” R-package, (2019), R-packages at GitHub/antononcube.

Videos

[AAv1] Anton Antonov, “Anomalies, Breaks, and Outlier Detection in Time Series (in WL)”, (2020), Wolfram Research Inc, at YouTube.

[AAv2] Anton Antonov, “Anomalies, Breaks, and Outlier Detection in Time Series (in R)”, (2021), A.Antonov channel at YouTube.