Data science over small movie dataset – Part 2

Introduction

This document (notebook) shows the transformation of a movie dataset into a form more suitable for making a movie recommender system. (It builds upon Part 1 of the blog post series.)

The movie data was downloaded from “IMDB Movie Ratings Dataset”. That dataset was chosen because:

  • It has the right size for demonstrating data wrangling techniques
    • ≈5000 rows and 15 columns (each row corresponding to a movie)
  • It is “real life” data with expected skewness of variable distributions
  • It is diverse enough over movie years and genres
  • It has a relatively small number of missing values

The full “Raku for Data Science” showcase is done with three notebooks, [AAn1, AAn2, AAn3]:

  1. Data transformations and analysis, [AAn1]
  2. Sparse matrix recommender, [AAn2]
  3. Relationships graphs, [AAn3]

Remark: All three notebooks feature the same introduction, setup, and references sections in order to make it easier for readers to browse, access, or reproduce the content.

Remark: The series data files can be found in the folder “Data” of the GitHub repository “RakuForPrediction-blog”, [AAr1].

The notebook series can be used in several ways:

  • Just reading this introduction and then browsing the notebooks
  • Reading only this (data transformations) notebook in order to see how data wrangling is done
  • Evaluating all three notebooks in order to learn and reproduce the computational steps in them

Outline

Here are the transformation, data analysis, and machine learning steps taken in the notebook series, [AAn1, AAn2, AAn3]:

  1. Ingest the data — Part 1
    • Shape, size, and summaries
    • Numerical columns transformation
    • Renaming columns to have more convenient names
    • Separating the non-uniform genres column into movie-genre associations
      • Into long format
  2. Basic data analysis — Part 1
    • Number of movies per year distribution
    • Movie-genre distribution
    • Pareto principle adherence for movie directors
    • Correlation between number of votes and rating
  3. Association Rules Learning (ARL) — Part 1
    • Converting long format dataset into “baskets” of genres
    • Most frequent combinations of genres
    • Implications between genres
      • E.g., a biography movie is also a drama movie 94% of the time
    • LLM-derived dictionary of most commonly used ARL measures
  4. Recommender system creation — Part 2
    • Conversion of numerical data into categorical data
    • Application of one-hot embedding
    • Experimenting / observing recommendation results
    • Getting familiar with the movie data by computing profiles for sets of movies
  5. Relationships graphs — Part 3
    • Find the nearest neighbors for every movie in a certain range of years
    • Make the corresponding nearest neighbors graph
      • Using different weights for the different types of movie metadata
    • Visualize largest components
    • Make and visualize graphs based on different filtering criteria

Comments & observations

  • This notebook series started as a demonstration of making a “real life” data Recommender System (RS).
    • The data transformations notebook would not be needed if the data had “nice” tabular form.
      • Since the data has aggregated values in its “genres” column, typical long-format transformations have to be done.
      • On the other hand, the actor names per movie are not aggregated but spread out across three columns.
      • Both cases represent a single movie metadata type.
        • For both, long-format transformations (or similar) are needed in order to make an RS.
    • After a corresponding Sparse Matrix Recommender (SMR) is made its sparse matrix can be used to do additional analysis.
      • Such extensions are: deriving clusters, making and visualizing graphs, making and evaluating suitable classifiers.
  • Most of the data transformation steps listed above are taken in most “real life” data processing.
  • ARL can be also used for deriving recommendations if the data is large enough.
  • The SMR object is based on Nearest Neighbors finding over “bags of tags.”
    • Latent Semantic Indexing (LSI) tag-weighting functions are applied.
  • The data does not have movie-viewer data, hence only item-item recommenders are created and used.
  • One-hot embedding is a common technique, which in this notebook is done via cross-tabulation.
  • The categorization of numerical data means putting numbers into suitable bins or “buckets.”
    • The bin or bucket boundaries can be on a regular grid or a quantile grid.
  • For categorized numerical data, one-hot embedding matrices can be processed to increase similarity between numeric buckets that are close to each other.
  • Nearest-neighbors based recommenders — like SMR — can be used as classifiers.
    • These are the so-called K-Nearest Neighbors (KNN) classifiers.
    • Although the data is small (both row-wise & column-wise) we can consider making classifiers predicting IMDB ratings or number of votes.
  • Using the recommender matrix similarities between different movies can be computed and a corresponding graph can be made.
  • Centrality analysis and simulations of random walks over the graph can be made.
    • Like Google’s “Page-rank” algorithm.
  • The relationship graphs can be used to visualize the “structure” of movie dataset.
  • Alternatively, clustering can be used.
    • Hierarchical clustering might be of interest.
  • If the movies had reviews or summaries associated with them, then Latent Semantic Analysis (LSA) could be applied.
    • SMR can use both LSA-terms-based and LSA-topics-based representations of the movies.
    • LLMs can be used to derive the LSA representation.
    • Again, this is not done in this series of notebooks.
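The one-hot-embedding-by-cross-tabulation idea from the comments above can be sketched in Python (pandas is used here purely for illustration; the notebook itself does this with Raku's cross-tabulation):

```python
import pandas as pd

# Long-format (item, tag) records, as produced by the genre separation step
long_form = pd.DataFrame({
    "Item": ["m1", "m1", "m2", "m3", "m3", "m3"],
    "Tag":  ["Drama", "Romance", "Drama", "Action", "Drama", "Thriller"],
})

# Cross-tabulating items against tags yields a 0/1 incidence matrix,
# i.e. a one-hot embedding of each movie over its genres
one_hot = pd.crosstab(long_form["Item"], long_form["Tag"])
print(one_hot)
```

Each row is a movie, each column a genre, and each entry counts how many times the tag occurs for that movie (0 or 1 here).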

Setup

Load packages used in the notebook:

use Math::SparseMatrix;
use ML::SparseMatrixRecommender;
use ML::SparseMatrixRecommender::Utilities;
use Statistics::OutlierIdentifiers;

#% javascript
require.config({
    paths: {
        d3: 'https://d3js.org/d3.v7.min'
    }
});

require(['d3'], function(d3) {
     console.log(d3);
});

#% js
js-d3-list-line-plot(10.rand xx 40, background => 'none', stroke-width => 2)

my $title-color = 'Silver';
my $stroke-color = 'SlateGray';
my $tooltip-color = 'LightBlue';
my $tooltip-background-color = 'none';
my $tick-labels-font-size = 10;
my $tick-labels-color = 'Silver';
my $tick-labels-font-family = 'Helvetica';
my $background = '#1F1F1F';
my $color-scheme = 'schemeTableau10';
my $color-palette = 'Inferno';
my $edge-thickness = 3;
my $vertex-size = 6;
my $mmd-theme = q:to/END/;
%%{
  init: {
    'theme': 'forest',
    'themeVariables': {
      'lineColor': 'Ivory'
    }
  }
}%%
END
my %force = collision => {iterations => 0, radius => 10}, link => {distance => 180};
my %force2 = charge => {strength => -30, iterations => 4}, collision => {radius => 50, iterations => 4}, link => {distance => 30};

my %opts = :$background, :$title-color, :$edge-thickness, :$vertex-size;

# {background => #1F1F1F, edge-thickness => 3, title-color => Silver, vertex-size => 6}


Ingest data

Download the data files from GitHub and unzip them.

Ingest movie data:

my $fileName = $*HOME ~ '/Downloads/movie_data.csv';
my @dsMovieData = data-import($fileName, headers => 'auto');
@dsMovieData .= map({ $_<title_year> = $_<title_year>.Int.Str; $_});
deduce-type(@dsMovieData)

# Vector(Assoc(Atom((Str)), Atom((Str)), 15), 5043)

Here is a sample of the movie data over the columns we are most interested in:

#% html
my @movie-columns = <index movie_title title_year genres imdb_score num_voted_users>;
@dsMovieData.pick(4)
==> to-html(field-names => @movie-columns)

+-------+-------------------------+------------+-------------------------------------------+------------+-----------------+
| index | movie_title             | title_year | genres                                    | imdb_score | num_voted_users |
+-------+-------------------------+------------+-------------------------------------------+------------+-----------------+
| 3322  | Veronika Decides to Die | 2009       | Drama|Romance                             | 6.5        | 10100           |
| 1511  | The Maze Runner         | 2014       | Action|Mystery|Sci-Fi|Thriller            | 6.8        | 310903          |
| 1301  | Big Miracle             | 2012       | Biography|Drama|Romance                   | 6.5        | 15231           |
| 55    | The Good Dinosaur       | 2015       | Adventure|Animation|Comedy|Family|Fantasy | 6.8        | 62836           |
+-------+-------------------------+------------+-------------------------------------------+------------+-----------------+

Ingest the movie data already transformed in the first notebook, [AAn1]:

my @dsMovieDataLongForm = data-import($*HOME ~ '/Downloads/dsMovieDataLongForm.csv', headers => 'auto');
deduce-type(@dsMovieDataLongForm)

# Vector(Assoc(Atom((Str)), Atom((Str)), 3), 84481)

Data summary:

my @field-names = <Item TagType Tag>;
sink records-summary(@dsMovieDataLongForm, :@field-names)

# +------------------+------------------------+-------------------+
# | Item             | TagType                | Tag               |
# +------------------+------------------------+-------------------+
# | 1387    => 27    | genre         => 29008 | Drama    => 5188  |
# | 3539    => 27    | actor         => 15129 | English  => 4704  |
# | 902     => 27    | title         => 5043  | USA      => 3807  |
# | 2340    => 27    | reviews_count => 5043  | Comedy   => 3744  |
# | 839     => 25    | language      => 5043  | Thriller => 2822  |
# | 1667    => 25    | country       => 5043  | Action   => 2306  |
# | 466     => 25    | director      => 5043  | Romance  => 2214  |
# | (Other) => 84298 | (Other)       => 15129 | (Other)  => 59696 |
# +------------------+------------------------+-------------------+



Recommender system

One way to investigate (browse) the data is to make a recommender system and use it to explore different aspects of the movie dataset, like movie profiles and the distribution of nearest-neighbor similarities.

Make the recommender

In order to make a more meaningful recommender we put the values of the different numerical variables into “buckets”, i.e. intervals derived from the value distribution of each variable. The boundaries of the intervals can form a regular grid, correspond to quantile values, or be specially made. Here we use quantiles:

my @bucketVars = <score votes_count reviews_count>;
my @dsMovieDataLongForm2;
sink for @dsMovieDataLongForm.map(*<TagType>).unique -> $var {
    if $var ∈ @bucketVars {
        my %bucketizer = ML::SparseMatrixRecommender::Utilities::categorize-to-intervals(@dsMovieDataLongForm.grep(*<TagType> eq $var).map(*<Tag>)».Numeric, probs => (0..6) >>/>> 6, :interval-names):pairs;
        @dsMovieDataLongForm2.append(@dsMovieDataLongForm.grep(*<TagType> eq $var).map(*.clone).map({ $_<Tag> = %bucketizer{$_<Tag>}; $_ }))
    } else {
        @dsMovieDataLongForm2.append(@dsMovieDataLongForm.grep(*<TagType> eq $var))
    }
}

sink records-summary(@dsMovieDataLongForm2, :@field-names, :12max-tallies)

# +------------------+------------------------+--------------------+
# | Item             | TagType                | Tag                |
# +------------------+------------------------+--------------------+
# | 902     => 19    | actor         => 15129 | English   => 4704  |
# | 2340    => 19    | genre         => 14504 | USA       => 3807  |
# | 1387    => 19    | score         => 5043  | Drama     => 2594  |
# | 3539    => 19    | country       => 5043  | Comedy    => 1872  |
# | 152     => 18    | votes_count   => 5043  | Thriller  => 1411  |
# | 466     => 18    | language      => 5043  | Action    => 1153  |
# | 1424    => 18    | year          => 5043  | Romance   => 1107  |
# | 839     => 18    | director      => 5043  | Adventure => 923   |
# | 132     => 18    | title         => 5043  | 6.1≤v<6.6 => 901   |
# | 113     => 18    | reviews_count => 5043  | 7≤v<7.5   => 891   |
# | 720     => 18    |                        | Crime     => 889   |
# | 1284    => 18    |                        | 7.5≤v<9.5 => 886   |
# | (Other) => 69757 |                        | (Other)   => 48839 |
# +------------------+------------------------+--------------------+
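The quantile bucketization above can be sketched in Python (a hypothetical `categorize_to_intervals`, mimicking the idea of the Raku utility, not its actual code):

```python
import numpy as np

def categorize_to_intervals(values, n_buckets=6):
    """Assign each value to a quantile-based bucket and return interval labels."""
    values = np.asarray(values, dtype=float)
    # Quantile boundaries at probabilities 0, 1/n, ..., 1 (n+1 boundaries, n buckets)
    bounds = np.quantile(values, np.linspace(0, 1, n_buckets + 1))
    # np.digitize assigns each value to the interval [bounds[i], bounds[i+1])
    idx = np.clip(np.digitize(values, bounds[1:-1]), 0, n_buckets - 1)
    labels = [f"{bounds[i]:g}<=v<{bounds[i + 1]:g}" for i in range(n_buckets)]
    return [labels[i] for i in idx]

scores = [6.5, 6.8, 6.5, 6.8, 7.4, 5.9, 7.5, 7.1, 6.0, 7.3]
buckets = categorize_to_intervals(scores, n_buckets=3)
print(buckets)
```

With equal-probability (quantile) boundaries each bucket receives roughly the same number of values, which is what makes the resulting tags useful for recommendations over skewed distributions.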


Here we make a Sparse Matrix Recommender (SMR):

my $smrObj = 
    ML::SparseMatrixRecommender.new
    .create-from-long-form(
        @dsMovieDataLongForm2, 
        item-column-name => 'Item', 
        tag-type-column-name => 'TagType',
        tag-column-name => 'Tag',
        :add-tag-types-to-column-names)        
    .apply-term-weight-functions('IDF', 'None', 'Cosine')

# ML::SparseMatrixRecommender(:matrix-dimensions((5043, 13825)), :density(<23319/23239825>), :tag-types(("reviews_count", "score", "votes_count", "genre", "country", "language", "actor", "director", "title", "year")))

Here are the recommender sub-matrices dimensions (rows and columns):

.say for $smrObj.take-matrices.deepmap(*.dimensions).sort(*.key)

# actor => (5043 6256)
# country => (5043 66)
# director => (5043 2399)
# genre => (5043 26)
# language => (5043 48)
# reviews_count => (5043 7)
# score => (5043 7)
# title => (5043 4917)
# votes_count => (5043 7)
# year => (5043 92)


Note that the sub-matrices of “reviews_count”, “score”, and “votes_count” have a small number of columns, corresponding to the number of probabilities specified when categorizing to intervals.

Enhance with one-hot embedding

my $mat = $smrObj.take-matrices<year>;

my $matUp = Math::SparseMatrix.new(
    diagonal => 1/2 xx ($mat.columns-count - 1), k => 1, 
    row-names => $mat.column-names,
    column-names => $mat.column-names
);

my $matDown = $matUp.transpose;

# mat = mat + mat . matUp + mat . matDown
$mat = $mat.add($mat.dot($matUp)).add($mat.dot($matDown));

# Math::SparseMatrix(:specified-elements(14915), :dimensions((5043, 92)), :density(<14915/463956>))
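The “smearing” over neighboring buckets done above is just adding shifted, half-weighted copies of the one-hot matrix. A NumPy sketch of the same idea (illustrative, not the package code):

```python
import numpy as np

# One-hot matrix: 3 items over 4 consecutive year buckets
mat = np.array([
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

n = mat.shape[1]
# Super-diagonal matrix with 1/2: multiplying by it shifts weight to the next bucket
mat_up = np.diag(np.full(n - 1, 0.5), k=1)
# Its transpose shifts weight to the previous bucket
mat_down = mat_up.T

# mat = mat + mat . mat_up + mat . mat_down
smoothed = mat + mat @ mat_up + mat @ mat_down
print(smoothed)
```

After smoothing, a movie from year bucket i also carries half weight in buckets i-1 and i+1, so movies from adjacent years become more similar.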

Make a new recommender with the enhanced matrices:

my %matrices = $smrObj.take-matrices;
%matrices<year> = $mat;
my $smrObj2 = ML::SparseMatrixRecommender.new(%matrices)

# ML::SparseMatrixRecommender(:matrix-dimensions((5043, 13825)), :density(<79829/69719475>), :tag-types(("genre", "title", "year", "actor", "director", "votes_count", "reviews_count", "score", "country", "language")))

Recommendations

Example recommendation by profile:

sink $smrObj2
.apply-tag-type-weights({genre => 2})
.recommend-by-profile(<genre:History year:1999>, 12, :!normalize)
.join-across(select-columns(@dsMovieData, @movie-columns), 'index')
.echo-value(as => {to-pretty-table($_, align => 'l', field-names => ['score', |@movie-columns])})

# +----------+-------+------------------------------------------+------------+----------------------------------------------+------------+-----------------+
# | score    | index | movie_title                              | title_year | genres                                       | imdb_score | num_voted_users |
# +----------+-------+------------------------------------------+------------+----------------------------------------------+------------+-----------------+
# | 1.887751 | 553   | Anna and the King                       | 1999       | Drama|History|Romance                        | 6.7        | 31080           |
# | 1.817476 | 215   | The 13th Warrior                        | 1999       | Action|Adventure|History                     | 6.6        | 101411          |
# | 1.567726 | 1016  | The Messenger: The Story of Joan of Arc | 1999       | Adventure|Biography|Drama|History|War        | 6.4        | 55889           |
# | 1.500264 | 2468  | One Man's Hero                          | 1999       | Action|Drama|History|Romance|War|Western     | 6.2        | 899             |
# | 1.487091 | 2308  | Topsy-Turvy                             | 1999       | Biography|Comedy|Drama|History|Music|Musical | 7.4        | 10037           |
# | 1.479006 | 4006  | La otra conquista                       | 1998       | Drama|History                                | 6.8        | 1024            |
# | 1.411933 | 492   | Thirteen Days                           | 2000       | Drama|History|Thriller                       | 7.3        | 45231           |
# | 1.312900 | 909   | Beloved                                 | 1998       | Drama|History|Horror                         | 5.9        | 6082            |
# | 1.237700 | 1931  | Elizabeth                               | 1998       | Biography|Drama|History                      | 7.5        | 75973           |
# | 1.168287 | 253   | The Patriot                             | 2000       | Action|Drama|History|War                     | 7.1        | 207613          |
# | 1.069476 | 1820  | The Newton Boys                         | 1998       | Action|Crime|Drama|History|Western           | 6.0        | 8309            |
# | 1.000000 | 4767  | America Is Still the Place              | 2015       | History                                      | 7.5        | 22              |
# +----------+-------+------------------------------------------+------------+----------------------------------------------+------------+-----------------+
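Conceptually, recommendation by profile with this kind of recommender boils down to a (sparse) matrix-vector multiplication: each item is scored by the weighted overlap of its tags with the profile. Here is a minimal dense sketch (tiny illustrative matrix with simplified tags, not the actual recommender data):

```python
import numpy as np

items = ["The 13th Warrior", "Elizabeth", "The Patriot"]
tags = ["genre:Action", "genre:History", "year:1999"]

# Item-tag incidence matrix (rows: items, columns: tags)
M = np.array([
    [1, 1, 1],   # Action, History, 1999
    [0, 1, 0],   # History
    [1, 1, 0],   # Action, History
], dtype=float)

# Profile vector marking which tags the query profile contains
profile = np.array([0, 1, 1], dtype=float)  # genre:History, year:1999

# Score each item against the profile and rank
scores = M @ profile
ranking = sorted(zip(items, scores), key=lambda p: -p[1])
print(ranking)
```

In the real recommender the matrix is sparse and the entries are LSI-weighted rather than 0/1, but the scoring step is the same.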


Recommendation by history:

sink $smrObj
.recommend(<2125 2308>, 12, :!normalize, :!remove-history)
.join-across(select-columns(@dsMovieData, @movie-columns), 'index')
.echo-value(as => {to-pretty-table($_, align => 'l', field-names => ['score', |@movie-columns])})

# +-----------+-------+-------------------------+------------+----------------------------------------------+------------+-----------------+
# | score     | index | movie_title             | title_year | genres                                       | imdb_score | num_voted_users |
# +-----------+-------+-------------------------+------------+----------------------------------------------+------------+-----------------+
# | 12.510011 | 2125  | Molière                | 2007       | Comedy|History                               | 7.3        | 5166            |
# | 12.510011 | 2308  | Topsy-Turvy            | 1999       | Biography|Comedy|Drama|History|Music|Musical | 7.4        | 10037           |
# | 8.364831  | 1728  | The Color of Freedom   | 2007       | Biography|Drama|History                      | 7.1        | 10175           |
# | 8.182233  | 1724  | Little Nicholas        | 2009       | Comedy|Family                                | 7.2        | 9214            |
# | 7.753039  | 3619  | Little Voice           | 1998       | Comedy|Drama|Music|Romance                   | 7.0        | 13892           |
# | 7.439471  | 2285  | Mrs Henderson Presents | 2005       | Comedy|Drama|Music|War                       | 7.1        | 13505           |
# | 7.430299  | 3404  | Made in Dagenham       | 2010       | Biography|Comedy|Drama|History               | 7.2        | 11158           |
# | 7.270637  | 1799  | A Passage to India     | 1984       | Adventure|Drama|History                      | 7.4        | 12980           |
# | 7.264810  | 3837  | The Names of Love      | 2010       | Comedy|Drama|Romance                         | 7.2        | 6304            |
# | 7.117232  | 4648  | The Hammer             | 2007       | Comedy|Romance|Sport                         | 7.3        | 5489            |
# | 7.046925  | 4871  | Shotgun Stories        | 2007       | Drama|Thriller                               | 7.3        | 7148            |
# | 7.040720  | 3194  | The House of Mirth     | 2000       | Drama|Romance                                | 7.1        | 6377            |
# +-----------+-------+-------------------------+------------+----------------------------------------------+------------+-----------------+
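Recommendation by history is analogous: the history items are first folded into a combined tag profile, so the scores are M·(Mᵀ·h), i.e. item-item similarity through shared tags. A minimal sketch (illustrative matrix):

```python
import numpy as np

# Item-tag incidence matrix (4 items, 5 tags); illustrative values
M = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
], dtype=float)

# History vector: the "consumed" item is item 0
h = np.array([1, 0, 0, 0], dtype=float)

# Fold the history into a tag profile, then score all items against it:
# scores = M . (M^T . h) -- item-item similarity via shared tags
scores = M @ (M.T @ h)
print(scores)
```

Note that the history item itself gets the highest score, which is why the history movies top the table above when the adverb :!remove-history is used.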


Profiles

Find movie IDs for certain criteria (e.g. historic action movies):

my @movieIDs = $smrObj.recommend-by-profile(<genre:Action genre:History>, Inf, :!normalize).take-value.grep(*.value > 1)».key;
deduce-type(@movieIDs)

# Vector(Atom((Str)), 14)

Find the profile of the movie set:

my @profile = |$smrObj.profile(@movieIDs).take-value;
deduce-type(@profile)

# Vector(Pair(Atom((Str)), Atom((Numeric))), 108)

Find the top outliers in that profile:

outlier-identifier(@profile».value, identifier => &top-outliers o &quartile-identifier-parameters)
==> {@profile[$_]}()
==> my @profile2;

deduce-type(@profile2)

# Vector(Pair(Atom((Str)), Atom((Numeric))), 26)

Here is a table of the top profile tags and their scores:

#%html
@profile.head(28)
==> { $_.map({ to-html-table([$_,]) }) }()
==> to-html(:multi-column, :4columns, :html-elements)

+------------------------------+---------------------+
| Tag                          | Score               |
+------------------------------+---------------------+
| genre:History                | 0.9999999999999999  |
| language:English             | 0.8159209423532133  |
| genre:Action                 | 0.46214109363846967 |
| genre:Adventure              | 0.38097093240387203 |
| score:6.6≤v<7                | 0.3626315299347615  |
| country:China                | 0.3626315299347615  |
| votes_count:14985≤v<34359    | 0.3626315299347615  |
| language:Mandarin            | 0.3626315299347615  |
| reviews_count:0≤v<37         | 0.3626315299347615  |
| score:6.1≤v<6.6              | 0.36263152993476144 |
| country:USA                  | 0.36263152993476144 |
| reviews_count:450≤v<5060     | 0.36263152993476144 |
| votes_count:147317≤v<1689764 | 0.2719736474510711  |
| reviews_count:91≤v<155       | 0.2719736474510711  |
| score:7.5≤v<9.5              | 0.2719736474510711  |
| votes_count:5≤v<4120         | 0.2719736474510711  |
| title:Hero                   | 0.18131576496738075 |
| votes_count:68935≤v<147317   | 0.18131576496738075 |
| reviews_count:37≤v<91        | 0.18131576496738075 |
| year:2002                    | 0.18131576496738075 |
| director:Yimou Zhang         | 0.18131576496738075 |
| year:2015                    | 0.18131576496738075 |
| year:2014                    | 0.18131576496738075 |
| country:UK                   | 0.18131576496738075 |
| score:7≤v<7.5                | 0.18131576496738075 |
| votes_count:4120≤v<14985     | 0.18131576496738072 |
| genre:Drama                  | 0.1320986315690731  |
| genre:Romance                | 0.13001981085966202 |
+------------------------------+---------------------+

Plot all of the profile’s scores together with the score outliers:

#%js
js-d3-list-plot(
    [|@profile».value.kv.map(-> $x, $y { %(:$x, :$y, group => 'full profile' ) }), 
     |@profile2».value.kv.map(-> $x, $y { %(:$x, :$y, group => 'outliers' ) })], 
    :$background,
    :300height,
    :600width
    )


References

Articles, blog posts

[AA1] Anton Antonov, “Introduction to data wrangling with Raku”, (2021), RakuForPrediction at WordPress.

[AA2] Anton Antonov, “Implementing Machine Learning algorithms in Raku (TRC-2022 talk)”, (2021), RakuForPrediction at WordPress.

Notebooks

[AAn1] Anton Antonov,
“Small movie dataset analysis”,
(2025),
RakuForPrediction-blog at GitHub.

[AAn2] Anton Antonov,
“Small movie dataset recommender”,
(2025),
RakuForPrediction-blog at GitHub.

[AAn3] Anton Antonov,
“Small movie dataset graph”,
(2025),
RakuForPrediction-blog at GitHub.

Packages

[AAp1] Anton Antonov, Data::Importers, Raku package, (2024-2025), GitHub/antononcube.

[AAp2] Anton Antonov, Data::Reshapers, Raku package, (2021-2025), GitHub/antononcube.

[AAp3] Anton Antonov, Data::Summarizers, Raku package, (2021-2024), GitHub/antononcube.

[AAp4] Anton Antonov, Graph, Raku package, (2024-2025), GitHub/antononcube.

[AAp5] Anton Antonov, JavaScript::D3, Raku package, (2022-2025), GitHub/antononcube.

[AAp6] Anton Antonov, Jupyter::Chatbook, Raku package, (2023-2025), GitHub/antononcube.

[AAp7] Anton Antonov, Math::SparseMatrix, Raku package, (2024-2025), GitHub/antononcube.

[AAp8] Anton Antonov, ML::AssociationRuleLearning, Raku package, (2022-2024), GitHub/antononcube.

[AAp9] Anton Antonov, ML::SparseMatrixRecommender, Raku package, (2025), GitHub/antononcube.

[AAp10] Anton Antonov, Statistics::OutlierIdentifiers, Raku package, (2022), GitHub/antononcube.

Videos

[AAv1] Anton Antonov, “Simplified Machine Learning Workflows Overview (Raku-centric)”, (2022), YouTube/@AAA4prediction.

[AAv2] Anton Antonov, “TRC 2022 Implementation of ML algorithms in Raku”, (2022), YouTube/@AAA4prediction.

[AAv3] Anton Antonov, “Exploratory Data Analysis with Raku”, (2024), YouTube/@AAA4prediction.

[AAv4] Anton Antonov, “Raku RAG demo”, (2024), YouTube/@AAA4prediction.

Outlier detection in a list of numbers

Introduction

Outlier identification is indispensable for data cleaning, normalization, and analysis.

I frequently include outlier identification in the interfaces and algorithms I make for search and recommendation engines. Another fundamental application of 1D outlier detection is in algorithms for anomaly detection in time series. (See [AAv1, AAv2].)

My first introduction to outlier detection was through the book “Mining Imperfect Data: Dealing with Contamination and Incomplete Records” by Ronald K. Pearson, [RKP1].

This notebook shows examples of using the Raku package “Statistics::OutlierIdentifiers”, [AAp1]. There are related Mathematica and R packages; see [AAp2, AAp3].

Remark: This Mathematica notebook uses the Raku connection described in [AA2]. See the section “Setup” at the end. The Raku function for data summarization is described in [AA3].

Remark: In this WordPress blog post the programming languages are not marked in the code blocks (as they are in GitHub Markdown). Hence the Wolfram Language (WL) code is announced in the explanations immediately before that code.


Outlier detection basics

The purpose of outlier detection algorithms is to find the elements in a list of numbers whose values are significantly higher or lower than the rest of the values.

Taking a certain number of elements with the highest values is not the same as outlier detection, but it can be used as a substitute.

Let us consider the following set of 50 numbers (WL):

SeedRandom[1212];
points = RandomVariate[GammaDistribution[5, 1], 50];
ResourceFunction["RecordsSummary"][points]

If we sort those numbers in descending order and plot them we get (WL):

points = points // Sort // Reverse;
ListPlot[points, PlotStyle -> {PointSize[0.015]}, PlotTheme -> "Detailed", PlotRange -> All, Filling -> Axis, ImageSize -> Medium]

Here are the outlier positions found with WL's built-in function (WL):

OutlierPosition[points]

(*{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 45, 46, 47, 48, 49, 50}*)

Let us use the following outlier detection algorithm:

  1. Find all values in the list that are larger than the mean value multiplied by 1.5;
  2. Then find the positions of these values in the list of numbers.

Let us show how we can implement that algorithm in Raku.

First we “transfer” the WL generated points to Raku (WL):

RakuInputExecute["my @points = " <> ToRakuCode[points]];

Here is the summary:

use Data::Summarizers;
records-summary(@points)

# +---------------------+
# | numerical           |
# +---------------------+
# | Max    => 9.82048   |
# | Mean   => 4.8709482 |
# | 3rd-Qu => 6.04842   |
# | Min    => 1.88537   |
# | 1st-Qu => 3.5015    |
# | Median => 4.44519   |
# +---------------------+

Here are the first and second steps combined:

@points.pairs.grep({ $_.value > 1.5 * mean(@points) })

# (0 => 9.82048 1 => 8.78346 2 => 8.55282 3 => 7.94426 4 => 7.6337 5 => 7.43507)

Here we transfer the found outlier positions (the keys of the pairs above) from Raku to WL:

pos = 1 + RakuInputExecute["@points.pairs.grep({ $_.value > 1.5 * mean(@points) })>>.key ==>encode-to-wl()"]

# {1, 2, 3, 4, 5, 6}

Here we plot the data and the outliers (WL):

ListPlot[{points, Transpose[{pos, points[[pos]]}]}, PlotStyle -> {{PointSize[0.02]}, {Red, PointSize[0.012]}}, Filling -> Axis, PlotRange -> All, PlotTheme -> "Detailed", ImageSize -> Medium, PlotLegends -> {"data", "outliers"}]

Instead of the mean value we can use another reference point, like the median value.

Obviously, we can also use a multiplier different than 1.5.

Using the package

First let us load the outlier identification package:

use Statistics::OutlierIdentifiers;

We can find the outliers in a list of numbers with the function outlier-identifier (using the adverb “values”):

outlier-identifier(@points):values

# (9.82048 8.78346 8.55282 7.94426 7.6337 7.43507 7.25105 7.18306 7.1653 6.66771 6.44773 6.27979 2.65329 2.59209 1.92725 1.88537)

The package has three functions for the calculation of outlier identifier parameters over a list of numbers:

.say for (&hampel-identifier-parameters, &splus-quartile-identifier-parameters, &quartile-identifier-parameters).map({ $_ => $_.(@points) });

# &hampel-identifier-parameters => (2.678434884 6.211945116)
# &splus-quartile-identifier-parameters => (-0.31888 9.8688)
# &quartile-identifier-parameters => (1.89827 6.99211)

Elements of the number list that are outside of the numerical interval made by one of these pairs of numbers are considered outliers.

In many cases we want only the top outliers or only the bottom outliers. We can use the functions top-outliers and bottom-outliers for that. Here is an example of finding top outliers using the Hampel outlier identifier:

@points ==> 
outlier-identifier(identifier => &top-outliers o &hampel-identifier-parameters ):values

# (9.82048 8.78346 8.55282 7.94426 7.6337 7.43507 7.25105 7.18306 7.1653 6.66771 6.44773 6.27979)

Remark: We use the composition operator (o) in the code above.
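Based on the intervals printed earlier, the three identifier parameter functions can be reconstructed in Python as follows (a sketch inferred from the printed values; consult the package source for the authoritative definitions):

```python
import numpy as np

def hampel_identifier_parameters(x):
    # Median +/- 1.4826 * MAD (the 1.4826 factor makes the MAD consistent
    # with the standard deviation for normally distributed data)
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))
    return med - mad, med + mad

def splus_quartile_identifier_parameters(x):
    # The classic boxplot rule: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def quartile_identifier_parameters(x):
    # Median +/- IQR
    q1, med, q3 = np.quantile(x, [0.25, 0.5, 0.75])
    return med - (q3 - q1), med + (q3 - q1)

def outlier_positions(x, params):
    # Positions of the values falling outside the identifier's interval
    lo, hi = params(x)
    return [i for i, v in enumerate(x) if v < lo or v > hi]

data = [1.9, 2.7, 3.5, 4.1, 4.4, 4.8, 5.2, 6.0, 7.4, 9.8]
print(outlier_positions(data, splus_quartile_identifier_parameters))
```

Composing with top-outliers corresponds to keeping only the upper-bound test (v > hi), and bottom-outliers to keeping only the lower one.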

Comparison

Here is a visual comparison of the three outlier identifiers in the package “Statistics::OutlierIdentifiers”:

Assume we have a (sorted) list of values:

use Data::Generators;
my @vals = random-variate(NormalDistribution.new(:mean(12), :sd(6)), 600).sort;
records-summary(@vals)

# +------------------------------+
# | numerical                    |
# +------------------------------+
# | 1st-Qu => 7.83167266283631   |
# | Mean   => 11.93490153897063  |
# | Min    => -4.807707682102588 |
# | 3rd-Qu => 15.873759102570908 |
# | Median => 11.743637620896695 |
# | Max    => 31.0034374940894   |
# +------------------------------+

Here we get the Raku values into WL:

vals = RakuInputExecute["@vals==>encode-to-wl()"];

Here is a plot of the sorted values (WL):

ListPlot[vals, PlotRange -> All, PlotTheme -> "Detailed"]

Here we find the outlier positions for each identifier:

(&hampel-identifier-parameters, &splus-quartile-identifier-parameters, &quartile-identifier-parameters).map({ $_.name => outlier-identifier(@vals, identifier=>$_) }).Hash==>encode-to-wl()

(*<|"hampel-identifier-parameters" -> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599}, 
"splus-quartile-identifier-parameters" -> {0, 1, 2, 3, 596, 597, 598,599}, 
"quartile-identifier-parameters" -> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599}|>*)

Here we assign the last Raku value – a hash – to a WL variable:

aOutliers = %;

Here is visual comparison of the three outlier detection algorithms (WL):

vals2 = Transpose[{Range[Length[vals]], vals}];
ListPlot[{vals2, vals2[[aOutliers[#] + 1]]}, PlotLabel -> #, ImageSize -> Medium, PlotStyle -> {{}, {Red, PointSize[0.01]}}, PlotTheme -> "Detailed", PlotRange -> All] & /@ Sort[Keys[aOutliers]]

We can see that the Hampel outlier identifier is most “permissive” at labeling points as outliers, and the SPLUS quartile-based identifier is the most “conservative.”


Application example

Let us consider an application of outlier detection using one of the Raku challenges: challenge 165, “Task 2: Line of Best Fit”.

In this section we use the code given in the blog post “Writing it down”, [WP1].

Data

Here we get the data:

my $input = '333,129  39,189 140,156 292,134 393,52  160,166 362,122  13,193
                341,104 320,113 109,177 203,152 343,100 225,110  23,186 282,102
                284,98  205,133 297,114 292,126 339,112 327,79  253,136  61,169
                128,176 346,72  316,103 124,162  65,181 159,137 212,116 337,86
                215,136 153,137 390,104 100,180  76,188  77,181  69,195  92,186
                275,96  250,147  34,174 213,134 186,129 189,154 361,82  363,89';

my @points = $input.words».split(',')».Int;
@points.elems

# 48

Here we plot the points (WL):

points = RakuInputExecute["@points==>encode-to-wl()"];
grData = ListPlot[points, PlotRange -> All, PlotStyle -> Gray, PlotTheme -> "Detailed", ImageSize -> Medium]
10jw29bko1kjj

Best fit line

Here we compute the best linear fit:

my \term:<x²> := @points[*;0]».&(*²);
my \xy = @points[*;0] »*« @points[*;1];
my \Σx = [+] @points[*;0];
my \Σy = [+] @points[*;1];
my \term:<Σx²> = [+] x²;
my \Σxy = [+] xy;
my \N = +@points;

my $m = (N * Σxy - Σx * Σy) / (N * Σx² - (Σx)²);
my $b = (Σy - $m * Σx) / N;

say [$m, $b];

# [-0.2999565 200.132272536]
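As a sanity check outside the Raku/WL session, the same closed-form least-squares formulas can be evaluated with a short Python sketch (Python is not part of the original workflow; the data string is copied verbatim from above):

```python
# Cross-check of the closed-form least-squares fit, using the same
# "x,y" pairs as the Raku code above.
input_str = """
333,129  39,189 140,156 292,134 393,52  160,166 362,122  13,193
341,104 320,113 109,177 203,152 343,100 225,110  23,186 282,102
284,98  205,133 297,114 292,126 339,112 327,79  253,136  61,169
128,176 346,72  316,103 124,162  65,181 159,137 212,116 337,86
215,136 153,137 390,104 100,180  76,188  77,181  69,195  92,186
275,96  250,147  34,174 213,134 186,129 189,154 361,82  363,89
"""

points = [tuple(map(int, w.split(','))) for w in input_str.split()]
n = len(points)
sx = sum(x for x, _ in points)
sy = sum(y for _, y in points)
sxx = sum(x * x for x, _ in points)
sxy = sum(x * y for x, y in points)

# Same formulas as the Raku code:
#   m = (N*Σxy - Σx*Σy) / (N*Σx² - (Σx)²),  b = (Σy - m*Σx) / N
m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - m * sx) / n
print(m, b)  # m ≈ -0.299957, b ≈ 200.1323
```

The resulting slope and intercept match the Raku output and the WL `Fit` check below.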

Here we bring into WL the slope and intercept of the fitted line computed above:

{m, b} = RakuInputExecute["[$m, $b]==>encode-to-wl()"];

Here we plot the best fit line (WL):

grLine = Plot[x*m + b, {x, Min[points[[All, 1]]], Max[points[[All, 1]]]}];
Show[{grData, grLine}]
0uhd1yjdca24j

Remark: Of course, we can quickly check the Raku line fit result with WL:

Fit[points, {1, x}, x]

(* 200.132 - 0.299957 x *)

Fit-wise outliers

Let us find the points that are closest to, and most distant from, the fitted line.

First, we find the distances:

my @diffs = @points.map({ my $y = $m * $_[0] + $b; abs($_[1] - $y ) / $y })

Here we find the positions of the top outliers:

my @topPos = outlier-identifier(@diffs, identifier => (&top-outliers o &hampel-identifier-parameters));

# [0 3 4 6 13 21 25 34 40 41]

Here we find the positions of the bottom outliers:

my @bottomPos = outlier-identifier(@diffs, identifier => (&bottom-outliers o &hampel-identifier-parameters))

# [1 27 28 32]
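Conceptually, the `top-outliers` and `bottom-outliers` compositions above just keep one of the two Hampel bounds: residuals above the upper bound mark the most distant points, and residuals below the lower bound mark the closest ones. Here is a hypothetical Python sketch of that splitting logic (median/MAD bounds with the assumed constant 1.4826; not the Raku package's actual implementation):

```python
import statistics

def hampel_bounds(xs, k=1.4826):
    """Hampel identifier bounds: median +/- k * MAD (k is an assumption)."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    return (med - k * mad, med + k * mad)

def top_outlier_positions(xs):
    """Positions above the upper Hampel bound (most distant points)."""
    _, hi = hampel_bounds(xs)
    return [i for i, x in enumerate(xs) if x > hi]

def bottom_outlier_positions(xs):
    """Positions below the lower Hampel bound (closest points)."""
    lo, _ = hampel_bounds(xs)
    return [i for i, x in enumerate(xs) if x < lo]

diffs = [0.10, 0.12, 0.11, 0.50, 0.01, 0.10]
print(top_outlier_positions(diffs))     # includes 3 (the large residual 0.50)
print(bottom_outlier_positions(diffs))  # [4] (the small residual 0.01)
```

By construction a residual cannot be both a top and a bottom outlier, which is why the two position lists above are disjoint.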

Here is a plot with the data, the linear fit, and the found top and bottom outliers (WL):

grTopOutliers = 
  ListPlot[points[[1 + RakuInputExecute["@topPos==>encode-to-wl()"]]], PlotStyle -> {PointSize[0.015], Blue}];
grBottomOutliers = 
  ListPlot[ points[[1 + RakuInputExecute["@bottomPos==>encode-to-wl()"]]], PlotStyle -> {PointSize[0.015], Red}];
Legended[ Show[{grData, grLine, grTopOutliers, grBottomOutliers}, ImageSize -> Large],  SwatchLegend[{Gray, Blue, Red}, {"data", "top outliers", "bottom outliers"}]]
10g4x6f5c7z0i

Setup

WL Raku process initialization:

RakuMode[]
KillRakuProcess[]
StartRakuProcess["Raku" -> "~/.rakubrew/shims/raku"]
0oelkun1d4hsw

Serializers load (WL)

Import["https://raw.githubusercontent.com/antononcube/ConversationalAgents/master/Packages/WL/RakuDecoder.m"]
Import["https://raw.githubusercontent.com/antononcube/ConversationalAgents/master/Packages/WL/RakuEncoder.m"]
SetOptions[RakuInputExecute, Epilog -> FromRakuCode];
SetOptions[Dataset, MaxItems -> {Automatic, 40}];

Load Raku packages

use Data::Generators;
use Data::Reshapers;
use Data::Summarizers;
use Stats;
use Mathematica::Serializer;

use Statistics::OutlierIdentifiers; 

References

Articles, books

[AA1] Anton Antonov, “Outlier detection in a list of numbers”, (2013), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “Connecting Mathematica and Raku”, (2021), RakuForPrediction at WordPress.

[AA3] Anton Antonov, “Introduction to data wrangling with Raku”, (2021), RakuForPrediction at WordPress.

[RKP1] Ronald K. Pearson, Mining Imperfect Data: Dealing with Contamination and Incomplete Records, 2005, SIAM. (Volume 93 of Other titles in applied mathematics.). ISBN 0898715822, 9780898715828.

[WP1] Wenzel P.P. Peppmeyer, “Writing it down”, (2022), Playing Perl6 → Raku at WordPress.

Packages

[AAp1] Anton Antonov, Statistics::OutlierIdentifiers Raku package, (2022), GitHub/antononcube.

[AAp2] Anton Antonov, “Implementation of one dimensional outlier identifying algorithms in Mathematica”, (2013), MathematicaForPrediction at GitHub.

[AAp3] Anton Antonov, “OutlierIdentifiers” R-package, (2019), R-packages at GitHub/antononcube.

Videos

[AAv1] Anton Antonov, “Anomalies, Breaks, and Outlier Detection in Time Series (in WL)”, (2020), Wolfram Research Inc, at YouTube.

[AAv2] Anton Antonov, “Anomalies, Breaks, and Outlier Detection in Time Series (in R)”, (2021), A.Antonov channel at YouTube.