Data science over small movie dataset — Part 1

«Data transformations and analysis»

Introduction

This document (notebook) shows transformations of a movie dataset into a format more suitable for data analysis and for making a movie recommender system. It is the first of a three-part series of notebooks that showcase Raku packages for doing Data Science (DS). As a whole, the notebook series goes through a general DS loop: ingest data, transform it, analyze it, and build and interpret models over it.

The movie data was downloaded from “IMDB Movie Ratings Dataset”. That dataset was chosen because:

  • It has the right size for demonstration of data wrangling techniques
    • ≈5000 rows and 15 columns (each row corresponding to a movie)
  • It is “real life” data with expected skewness of variable distributions
  • It is diverse enough over movie years and genres
  • It has a relatively small number of missing values

The full “Raku for Data Science” showcase is done with three notebooks, [AAn1, AAn2, AAn3]:

  1. Data transformations and analysis, [AAn1]
  2. Sparse matrix recommender, [AAn2]
  3. Relationships graphs, [AAn3]

Remark: All three notebooks feature the same introduction, setup, and references sections in order to make it easier for readers to browse, access, or reproduce the content.

Remark: The series data files can be found in the folder “Data” of the GitHub repository “RakuForPrediction-blog”, [AAr1].

The notebook series can be used in several ways:

  • Just reading this introduction and then browsing the notebooks
  • Reading only this (data transformations) notebook in order to see how data wrangling is done
  • Evaluating all three notebooks in order to learn and reproduce the computational steps in them

Outline

Here are the transformation, data analysis, and machine learning steps taken in the notebook series, [AAn1, AAn2, AAn3]:

  1. Ingest the data — Part 1
    • Shape, size, and summaries
    • Numerical columns transformation
    • Renaming columns to have more convenient names
    • Separating the non-uniform genres column into movie-genre associations
      • Into long format
  2. Basic data analysis — Part 1
    • Number of movies per year distribution
    • Movie-genre distribution
    • Pareto principle adherence for movie directors
    • Correlation between number of votes and rating
  3. Association Rules Learning (ARL) — Part 1
    • Converting long format dataset into “baskets” of genres
    • Most frequent combinations of genres
    • Implications between genres
      • E.g., a biography movie is also a drama movie 94% of the time
    • LLM-derived dictionary of most commonly used ARL measures
  4. Recommender system creation — Part 2
    • Conversion of numerical data into categorical data
    • Application of one hot embedding
    • Experimenting / observing recommendation results
    • Getting familiar with the movie data by computing profiles for sets of movies
  5. Relationships graphs — Part 3
    • Find the nearest neighbors for every movie in a certain range of years
    • Make the corresponding nearest neighbors graph
      • Using different weights for the different types of movie metadata
    • Visualize largest components
    • Make and visualize graphs based on different filtering criteria

Comments & observations

  • This notebook series started as a demonstration of making a “real life” data Recommender System (RS).
    • The data transformations notebook would not be needed if the data had “nice” tabular form.
      • Since the data has aggregated values in its “genres” column, typical long-format transformations have to be done.
      • On the other hand, the actor names per movie are not aggregated, but spread out over three columns.
      • Both cases represent a single movie metadata type.
        • For both, long-format transformations (or similar) are needed in order to make an RS.
    • After a corresponding Sparse Matrix Recommender (SMR) is made, its sparse matrix can be used to do additional analysis.
      • Such extensions are: deriving clusters, making and visualizing graphs, making and evaluating suitable classifiers.
  • In most “real life” data processing, most of the data transformation steps listed above are taken.
  • ARL can also be used for deriving recommendations if the data is large enough.
  • The SMR object is based on Nearest Neighbors finding over “bags of tags.”
    • Latent Semantic Indexing (LSI) tag-weighting functions are applied.
  • The data does not have movie-viewer data, hence only item-item recommenders are created and used.
  • One hot embedding is a common technique, which in this notebook is done via cross-tabulation.
  • The categorization of numerical data means putting numbers into suitable bins or “buckets.”
    • The bin or bucket boundaries can be on a regular grid or a quantile grid.
  • For categorized numerical data, one-hot embedding matrices can be processed to increase similarity between numeric buckets that are close to each other.
  • Nearest-neighbors based recommenders — like SMR — can be used as classifiers.
    • These are the so called K-Nearest Neighbors (KNN) classifiers.
    • Although the data is small (both row-wise & column-wise) we can consider making classifiers predicting IMDB ratings or number of votes.
  • Using the recommender matrix, similarities between different movies can be computed and a corresponding graph can be made.
  • Centrality analysis and simulations of random walks over the graph can be made.
    • Like Google’s “Page-rank” algorithm.
  • The relationship graphs can be used to visualize the “structure” of the movie dataset.
  • Alternatively, clustering can be used.
    • Hierarchical clustering might be of interest.
  • If the movies had reviews or summaries associated with them, then Latent Semantic Analysis (LSA) could be applied.
    • SMR can use both LSA-terms-based and LSA-topics-based representations of the movies.
    • LLMs can be used to derive the LSA representation.
    • Again, this is not done in this series of notebooks.
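To make the binning and one-hot remarks above concrete, here is a minimal pure-Raku sketch. It uses no packages, the helper `quantile-bins` is hypothetical (not from the notebooks' toolchain), and the scores are made up; in the notebooks the one-hot step is done with cross-tabulation instead:

```raku
# Hypothetical helper: assign each number the index of the quantile bucket it falls in.
sub quantile-bins(@nums, UInt :$k = 4) {
    my @sorted = @nums.sort;
    # Bucket boundaries on a quantile grid (quartiles for k = 4)
    my @cuts = (1 ..^ $k).map({ @sorted[floor($_ / $k * @sorted.elems)] });
    @nums.map(-> $x { @cuts.grep($x >= *).elems });   # bucket index in 0 .. k-1
}

my @scores  = 5.5, 6.1, 6.6, 7.2, 8.3, 4.9, 6.8, 7.9;   # made-up IMDB-like scores
my @buckets = quantile-bins(@scores);
say @buckets;   # [0 1 1 2 3 0 2 3]

# One-hot embedding: each item becomes a 0/1 vector over the buckets --
# the same effect that cross-tabulating (item, bucket) pairs achieves in Part 2.
my @oneHot = @buckets.map(-> $b { (^4).map({ $_ == $b ?? 1 !! 0 }).List });
say @oneHot.head;   # (1 0 0 0)
```

The “increase similarity between close buckets” remark then amounts to post-processing such 0/1 vectors, e.g. adding a fraction of each column to its neighboring bucket columns.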

Setup

Load packages used in the notebook:

use Data::Importers;
use Data::Reshapers;
use Data::Summarizers;
use Data::TypeSystem;
use Math::SparseMatrix;
use ML::AssociationRuleLearning;
use ML::SparseMatrixRecommender;
use ML::SparseMatrixRecommender::Utilities;
use Statistics::OutlierIdentifiers;

Prime the notebook to show JavaScript plots:

#% javascript
require.config({
     paths: {
     d3: 'https://d3js.org/d3.v7.min'
}});

require(['d3'], function(d3) {
     console.log(d3);
});

Example JavaScript plot:

#% js
js-d3-list-line-plot(10.rand xx 40, background => 'none', stroke-width => 2)

Set different plot style variables:

my $title-color = 'Silver';
my $stroke-color = 'SlateGray';
my $tooltip-color = 'LightBlue';
my $tooltip-background-color = 'none';
my $tick-labels-font-size = 10;
my $tick-labels-color = 'Silver';
my $tick-labels-font-family = 'Helvetica';
my $background = 'White'; #'#1F1F1F';
my $color-scheme = 'schemeTableau10';
my $color-palette = 'Inferno';
my $edge-thickness = 3;
my $vertex-size = 6;
my $mmd-theme = q:to/END/;
%%{
  init: {
    'theme': 'forest',
    'themeVariables': {
      'lineColor': 'Ivory'
    }
  }
}%%
END
my %force = collision => {iterations => 0, radius => 10}, link => {distance => 180};
my %force2 = charge => {strength => -30, iterations => 4}, collision => {radius => 50, iterations => 4}, link => {distance => 30};

sink my %opts = :$background, :$title-color, :$edge-thickness, :$vertex-size;


Ingest data

Ingest the movie data:

# Download and unzip: https://github.com/antononcube/RakuForPrediction-blog/raw/refs/heads/main/Data/movie_data.csv.zip
my $fileName = $*HOME ~ '/Downloads/movie_data.csv';
my @dsMovieData = data-import($fileName, headers => 'auto');

deduce-type(@dsMovieData)

# Vector(Assoc(Atom((Str)), Atom((Str)), 15), 5043)

Show a sample of the movie data:

#% html
my @field-names = <index movie_title title_year country duration language actor_1_name actor_2_name actor_3_name director_name imdb_score num_user_for_reviews num_voted_users movie_imdb_link>;
@dsMovieData.pick(8)
==> to-html(:@field-names)

Convert string values of the numerical columns into numbers:

@dsMovieData .= map({ 
    $_<title_year> = $_<title_year>.trim.Int; 
    $_<imdb_score> = $_<imdb_score>.Numeric; 
    $_<num_user_for_reviews> = $_<num_user_for_reviews>.Int; 
    $_<num_voted_users> = $_<num_voted_users>.Int; 
    $_});
deduce-type(@dsMovieData)

# Vector(Struct([actor_1_name, actor_2_name, actor_3_name, country, director_name, duration, genres, imdb_score, index, language, movie_imdb_link, movie_title, num_user_for_reviews, num_voted_users, title_year], [Str, Str, Str, Str, Str, Str, Str, Rat, Str, Str, Str, Str, Int, Int, Int]), 5043)

Summary of the numerical columns:

sink 
<index title_year imdb_score num_voted_users num_user_for_reviews>
andthen [select-columns(@dsMovieData, $_), $_]
andthen records-summary($_.head, field-names => $_.tail);

+-----------------+-----------------------+--------------------+------------------------+----------------------+
| index           | title_year            | imdb_score         | num_voted_users        | num_user_for_reviews |
+-----------------+-----------------------+--------------------+------------------------+----------------------+
| 252     => 1    | Min    => 0           | Min    => 1.6      | Min    => 5            | Min    => 0          |
| 1453    => 1    | 1st-Qu => 1998        | 1st-Qu => 5.8      | 1st-Qu => 8589         | 1st-Qu => 64         |
| 2004    => 1    | Mean   => 1959.585961 | Mean   => 6.442138 | Mean   => 83668.160817 | Mean   => 271.63494  |
| 3545    => 1    | Median => 2005        | Median => 6.6      | Median => 34359        | Median => 155        |
| 2903    => 1    | 3rd-Qu => 2011        | 3rd-Qu => 7.2      | 3rd-Qu => 96385        | 3rd-Qu => 324        |
| 2429    => 1    | Max    => 2016        | Max    => 9.5      | Max    => 1689764      | Max    => 5060       |
| 2764    => 1    |                       |                    |                        |                      |
| (Other) => 5036 |                       |                    |                        |                      |
+-----------------+-----------------------+--------------------+------------------------+----------------------+

Summary of the name-columns in the data:

sink 
<director_name actor_1_name actor_2_name actor_3_name>
andthen [select-columns(@dsMovieData, $_), $_]
andthen records-summary($_.head, field-names => $_.tail);

+--------------------------+---------------------------+-------------------------+------------------------+
| director_name            | actor_1_name              | actor_2_name            | actor_3_name           |
+--------------------------+---------------------------+-------------------------+------------------------+
|                  => 104  | Robert De Niro    => 49   | Morgan Freeman  => 20   |                => 23   |
| Steven Spielberg => 26   | Johnny Depp       => 41   | Charlize Theron => 15   | Steve Coogan   => 8    |
| Woody Allen      => 22   | Nicolas Cage      => 33   | Brad Pitt       => 14   | John Heard     => 8    |
| Clint Eastwood   => 20   | J.K. Simmons      => 31   |                 => 13   | Ben Mendelsohn => 8    |
| Martin Scorsese  => 20   | Denzel Washington => 30   | Meryl Streep    => 11   | Anne Hathaway  => 7    |
| Ridley Scott     => 17   | Bruce Willis      => 30   | James Franco    => 11   | Stephen Root   => 7    |
| Spike Lee        => 16   | Matt Damon        => 30   | Jason Flemyng   => 10   | Sam Shepard    => 7    |
| (Other)          => 4818 | (Other)           => 4799 | (Other)         => 4949 | (Other)        => 4975 |
+--------------------------+---------------------------+-------------------------+------------------------+

Convert to long form by skipping special columns (like “genres”):

my @varnames = <movie_title title_year country actor_1_name actor_2_name actor_3_name num_voted_users num_user_for_reviews imdb_score director_name language>;
my @dsMovieDataLongForm = to-long-format(@dsMovieData, 'index', @varnames, variables-to => 'TagType', values-to => 'Tag');

deduce-type(@dsMovieDataLongForm)

#  Vector((Any), 55473)

Remark: The transformation above is also known as “unpivoting” or “pivoting columns into rows”.

Show a sample of the converted data:

#% html
@dsMovieDataLongForm.pick(8)
==> to-html(field-names => <index TagType Tag>)

| index | TagType              | Tag             |
|-------|----------------------|-----------------|
| 3586  | title_year           | 1980            |
| 539   | actor_3_name         | Ben Mendelsohn  |
| 1087  | country              | USA             |
| 968   | language             | English         |
| 4856  | director_name        | Maria Maggenti  |
| 3101  | movie_title          | The Longest Day |
| 2297  | num_user_for_reviews | 26              |
| 684   | num_user_for_reviews | 175             |

Give some tag types more convenient names:

my %toBetterTagTypes = 
    movie_title => 'title', 
    title_year => 'year', 
    director_name => 'director',
    actor_1_name => 'actor', actor_2_name => 'actor', actor_3_name => 'actor', 
    num_voted_users => 'votes_count', num_user_for_reviews => 'reviews_count',
    imdb_score => 'score', 
    ;

@dsMovieDataLongForm = @dsMovieDataLongForm.map({ $_<TagType> = %toBetterTagTypes{$_<TagType>} // $_<TagType>; $_ });
@dsMovieDataLongForm = |rename-columns(@dsMovieDataLongForm, {index=>'Item'});

deduce-type(@dsMovieDataLongForm)

# Vector((Any), 55473)

Summarize the long form data:

sink records-summary(@dsMovieDataLongForm, :12max-tallies)

+------------------------+------------------+------------------+
| TagType                | Tag              | Item             |
+------------------------+------------------+------------------+
| actor         => 15129 | English => 4704  | 4173    => 11    |
| title         => 5043  | USA     => 3807  | 1330    => 11    |
| votes_count   => 5043  | UK      => 448   | 552     => 11    |
| reviews_count => 5043  | 2009    => 260   | 5022    => 11    |
| country       => 5043  | 2014    => 252   | 4503    => 11    |
| language      => 5043  | 2006    => 239   | 463     => 11    |
| year          => 5043  | 2013    => 237   | 395     => 11    |
| score         => 5043  | 2010    => 230   | 3122    => 11    |
| director      => 5043  | 2015    => 226   | 4873    => 11    |
|                        | 2011    => 226   | 2959    => 11    |
|                        | 2008    => 225   | 23      => 11    |
|                        | 2012    => 223   | 715     => 11    |
|                        | (Other) => 44396 | (Other) => 55341 |
+------------------------+------------------+------------------+

Make a separate dataset with movie-genre associations:

my @dsMovieGenreLongForm = @dsMovieData.map({ $_<index> X $_<genres>.split('|', :skip-empty)}).flat(1).map({ <index genre> Z=> $_ })».Hash;
deduce-type(@dsMovieGenreLongForm)

# Vector(Assoc(Atom((Str)), Atom((Str)), 2), 14504)

Make the genres long form similar to that of the rest of the movie metadata:

@dsMovieGenreLongForm = rename-columns(@dsMovieGenreLongForm, {index => 'Item', genre => 'Tag'}).map({ $_.push('TagType' => 'genre') });

deduce-type(@dsMovieGenreLongForm)

# Vector(Assoc(Atom((Str)), Atom((Str)), 3), 14504)

#% html
@dsMovieGenreLongForm.head(8)
==> to-html(field-names => <Item TagType Tag>)

| Item | TagType | Tag       |
|------|---------|-----------|
| 0    | genre   | Action    |
| 0    | genre   | Adventure |
| 0    | genre   | Fantasy   |
| 0    | genre   | Sci-Fi    |
| 1    | genre   | Action    |
| 1    | genre   | Adventure |
| 1    | genre   | Fantasy   |
| 2    | genre   | Action    |

Statistics

In this section we compute different statistics that should give us a better idea of what the data contains.

Show movie years distribution:

#% js
js-d3-bar-chart(@dsMovieData.map(*<title_year>.Str).&tally.sort(*.head), title => 'Movie years distribution', :$title-color, :1200width, :$background)
~
js-d3-box-whisker-chart(@dsMovieData.map(*<title_year>)».Int.grep(*>1916), :horizontal, :$background)

Show movie genre distribution:

#% js
my %genreCounts = cross-tabulate(@dsMovieGenreLongForm, 'Item', 'Tag', :sparse).column-sums(:p);
js-d3-bar-chart(%genreCounts.sort, title => 'Genre distributions', :$background, :$title-color)

Check Pareto principle adherence for director names:

#% js
pareto-principle-statistic(@dsMovieData.map(*<director_name>))
==> js-d3-list-line-plot(
        :$background,
        title => 'Pareto principle adherence for movie directors',
        y-label => 'probability', x-label => 'index',
        :grid-lines, :5stroke-width, :$title-color)
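The Pareto statistic plotted above presumably amounts to sorting the per-director movie tallies in descending order and accumulating their fractions of the grand total. Here is a minimal pure-Raku sketch of that assumed computation (with a made-up tally, not the package's actual code):

```raku
# Assumed semantics of a Pareto-principle statistic:
# sort tallies descending, then take cumulative fractions of the total.
my @tallies = <a a a a b b c d>.Bag.values.sort(-*);   # stand-in for director tallies
my @paretoCurve = ([\+] @tallies).map(* / @tallies.sum);
say @paretoCurve;   # [0.5 0.75 0.875 1]
```

Reading the curve: if half of all movies are covered by the first few percent of directors, the data adheres to the Pareto principle.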

Plot the number of IMDB votes vs IMDB scores:

#% js
@dsMovieData.map({ %( x => $_<num_voted_users>».Num».log(10), y => $_<imdb_score>».Num ) })
==> js-d3-list-plot(
        :$background,
        title => 'Number of IMDB votes vs IMDB scores',
        x-label => 'Number of votes, lg', y-label => 'score',
        :grid-lines, point-size => 4, :$title-color)


Association rules learning

It is interesting to see which genres are closely associated with each other. One way to find those associations is to use Association Rule Learning (ARL).

For each movie make a “basket” of genres:

my @baskets = cross-tabulate(@dsMovieGenreLongForm, 'Item', 'Tag').values».keys».List;
@baskets».elems.&tally

# {1 => 633, 2 => 1355, 3 => 1628, 4 => 981, 5 => 349, 6 => 75, 7 => 18, 8 => 4}

Find frequent sets that are seen in at least 300 movies:

my @freqSets = frequent-sets(@baskets, min-support => 300, min-number-of-items => 2, max-number-of-items => Inf);
deduce-type(@freqSets):tally

# Tuple([Pair(Vector(Atom((Str)), 2), Atom((Rat))) => 14, Pair(Vector(Atom((Str)), 3), Atom((Rat))) => 1], 15)

to-pretty-table(@freqSets.map({ %( FrequentSet => $_.key.join(' '), Frequency => $_.value) }).sort(-*<Frequency>), field-names => <FrequentSet Frequency>, align => 'l');

+----------------------+-----------+
| FrequentSet          | Frequency |
+----------------------+-----------+
| Drama Romance        | 0.146143  |
| Drama Thriller       | 0.138211  |
| Comedy Drama         | 0.131469  |
| Action Thriller      | 0.116796  |
| Comedy Romance       | 0.116796  |
| Crime Thriller       | 0.108665  |
| Crime Drama          | 0.104303  |
| Action Adventure     | 0.093198  |
| Comedy Family        | 0.070989  |
| Mystery Thriller     | 0.070196  |
| Action Drama         | 0.068412  |
| Action Sci-Fi        | 0.066627  |
| Crime Drama Thriller | 0.066032  |
| Action Crime         | 0.065041  |
| Adventure Comedy     | 0.061670  |
+----------------------+-----------+

Here are the corresponding association rules:

association-rules(@baskets, min-support => 0.025, min-confidence => 0.70)
==> { .sort(-*<confidence>) }()
==> { to-pretty-table($_, field-names => <antecedent consequent count support confidence lift leverage conviction>) }()

+---------------------+------------+-------+----------+------------+----------+----------+------------+
|      antecedent     | consequent | count | support  | confidence |   lift   | leverage | conviction |
+---------------------+------------+-------+----------+------------+----------+----------+------------+
|      Biography      |   Drama    |  275  | 0.054531 |  0.938567  | 1.824669 | 0.024646 |  7.904874  |
|       History       |   Drama    |  189  | 0.037478 |  0.913043  | 1.775049 | 0.016364 |  5.584672  |
|   Animation Comedy  |   Family   |  154  | 0.030537 |  0.895349  | 8.269678 | 0.026845 |  8.520986  |
| Adventure Animation |   Family   |  151  | 0.029942 |  0.893491  | 8.252520 | 0.026314 |  8.372364  |
|         War         |   Drama    |  190  | 0.037676 |  0.892019  | 1.734175 | 0.015950 |  4.497297  |
|      Animation      |   Family   |  205  | 0.040650 |  0.847107  | 7.824108 | 0.035455 |  5.832403  |
|    Crime Mystery    |  Thriller  |  129  | 0.025580 |  0.821656  | 2.936649 | 0.016869 |  4.038299  |
|     Action Crime    |  Thriller  |  259  | 0.051358 |  0.789634  | 2.822201 | 0.033160 |  3.423589  |
|  Adventure Thriller |   Action   |  175  | 0.034702 |  0.781250  | 3.417037 | 0.024546 |  3.526246  |
|    Drama Mystery    |  Thriller  |  200  | 0.039659 |  0.769231  | 2.749278 | 0.025234 |  3.120894  |
|   Animation Family  |   Comedy   |  154  | 0.030537 |  0.751220  | 2.023718 | 0.015448 |  2.527499  |
|   Adventure Sci-Fi  |   Action   |  193  | 0.038271 |  0.736641  | 3.221927 | 0.026393 |  2.928956  |
|   Animation Family  | Adventure  |  151  | 0.029942 |  0.736585  | 4.024485 | 0.022502 |  3.101475  |
|      Animation      |   Comedy   |  172  | 0.034107 |  0.710744  | 1.914680 | 0.016293 |  2.173825  |
|       Mystery       |  Thriller  |  354  | 0.070196 |  0.708000  | 2.530435 | 0.042456 |  2.466460  |
+---------------------+------------+-------+----------+------------+----------+----------+------------+

Measure cheat-sheet

Here is a table showing the formulas for the Association Rules Learning measures (confidence, lift, leverage, conviction), along with their minimum value, maximum value, and value of indifference:

| Measure    | Formula                                                                                                   | Min     | Max      | Value of indifference |
|------------|-----------------------------------------------------------------------------------------------------------|---------|----------|-----------------------|
| Confidence | $\text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \cup B)}{\text{support}(A)}$                 | $0$     | $1$      | $\text{support}(B)$   |
| Lift       | $\text{lift}(A \Rightarrow B) = \frac{\text{confidence}(A \Rightarrow B)}{\text{support}(B)}$             | $0$     | $\infty$ | $1$                   |
| Leverage   | $\text{leverage}(A \Rightarrow B) = \text{support}(A \cup B) - \text{support}(A)\,\text{support}(B)$      | $-0.25$ | $0.25$   | $0$                   |
| Conviction | $\text{conviction}(A \Rightarrow B) = \frac{1 - \text{support}(B)}{1 - \text{confidence}(A \Rightarrow B)}$ | $0$     | $\infty$ | $1$                   |

Explanation of terms:

  • support(X) = P(X), the proportion of transactions containing itemset X.
  • ¬A = complement of A (transactions not containing A).
  • Value of indifference generally means the value where the measure indicates independence or no association.
  • For Confidence, the baseline is support(B) (probability of B alone).
  • For Lift and Conviction, 1 indicates no association.
  • Leverage’s minimum and maximum depend on the supports of A and B.
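As a concrete check of these definitions, take the top association rule found above (Biography ⇒ Drama, support 0.054531, confidence 0.938567, lift 1.824669). Back-computing support(Biography) ≈ 0.0581 and support(Drama) ≈ 0.5144 from those tabulated values gives:

```latex
\begin{aligned}
\text{confidence} &= \frac{\text{support}(A \cup B)}{\text{support}(A)} = \frac{0.05453}{0.05810} \approx 0.9386 \\
\text{lift} &= \frac{\text{confidence}}{\text{support}(B)} = \frac{0.9386}{0.51438} \approx 1.8247 \\
\text{leverage} &= \text{support}(A \cup B) - \text{support}(A)\,\text{support}(B) = 0.05453 - 0.05810 \cdot 0.51438 \approx 0.0246 \\
\text{conviction} &= \frac{1 - \text{support}(B)}{1 - \text{confidence}} = \frac{1 - 0.51438}{1 - 0.9386} \approx 7.90
\end{aligned}
```

These agree with the tabulated leverage (0.024646) and conviction (7.904874) for that rule.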

LLM prompt

Here is the prompt used to generate the ARL metrics dictionary table above:

Give the formulas for the Association Rules Learning measures: confidence, lift, leverage, and conviction.
In a Markdown table for each measure give the min value, max value, value of indifference. Make sure the formulas are in LaTeX code.


Export transformed data

Here we export the transformed data in order to streamline the computations in the other notebooks of the series:

data-export($*HOME ~ '/Downloads/dsMovieDataLongForm.csv', @dsMovieDataLongForm.append(@dsMovieGenreLongForm))


References

Articles, blog posts

[AA1] Anton Antonov, “Introduction to data wrangling with Raku”, (2021), RakuForPrediction at WordPress.

[AA2] Anton Antonov, “Implementing Machine Learning algorithms in Raku (TRC-2022 talk)”, (2021), RakuForPrediction at WordPress.

Notebooks

[AAn1] Anton Antonov, “Data science over small movie dataset — Part 1”, (2025), RakuForPrediction-blog at GitHub.

[AAn2] Anton Antonov, “Data science over small movie dataset — Part 2”, (2025), RakuForPrediction-blog at GitHub.

[AAn3] Anton Antonov, “Data science over small movie dataset — Part 3”, (2025), RakuForPrediction-blog at GitHub.

Packages

[AAp1] Anton Antonov, Data::Importers, Raku package, (2024-2025), GitHub/antononcube.

[AAp2] Anton Antonov, Data::Reshapers, Raku package, (2021-2025), GitHub/antononcube.

[AAp3] Anton Antonov, Data::Summarizers, Raku package, (2021-2024), GitHub/antononcube.

[AAp4] Anton Antonov, Graph, Raku package, (2024-2025), GitHub/antononcube.

[AAp5] Anton Antonov, JavaScript::D3, Raku package, (2022-2025), GitHub/antononcube.

[AAp6] Anton Antonov, Jupyter::Chatbook, Raku package, (2023-2025), GitHub/antononcube.

[AAp7] Anton Antonov, Math::SparseMatrix, Raku package, (2024-2025), GitHub/antononcube.

[AAp8] Anton Antonov, ML::AssociationRuleLearning, Raku package, (2022-2024), GitHub/antononcube.

[AAp9] Anton Antonov, ML::SparseMatrixRecommender, Raku package, (2025), GitHub/antononcube.

[AAp10] Anton Antonov, Statistics::OutlierIdentifiers, Raku package, (2022), GitHub/antononcube.

Repositories

[AAr1] Anton Antonov, RakuForPrediction-blog, (2022-2025), GitHub/antononcube.

[AAr2] Anton Antonov, RakuForPrediction-book, (2021-2025), GitHub/antononcube.

Videos

[AAv1] Anton Antonov, “Simplified Machine Learning Workflows Overview (Raku-centric)”, (2022), YouTube/@AAA4prediction.

[AAv2] Anton Antonov, “TRC 2022 Implementation of ML algorithms in Raku”, (2022), YouTube/@AAA4prediction.

[AAv3] Anton Antonov, “Exploratory Data Analysis with Raku”, (2024), YouTube/@AAA4prediction.

[AAv4] Anton Antonov, “Raku RAG demo”, (2024), YouTube/@AAA4prediction.

ML::AssociationRuleLearning

This blog post announces and describes the Raku package “ML::AssociationRuleLearning” for Association Rule Learning (ARL) functions, [Wk1].

The ARL framework includes the algorithms Apriori and Eclat, and the measures confidence, lift, and conviction (among others).

For a computational introduction to ARL utilization (in Mathematica), see the article “Movie genre associations”, [AA1].

The examples below use the packages “Data::Generators”, “Data::Reshapers”, and “Data::Summarizers”, described in the article “Introduction to data wrangling with Raku”, [AA2].


Installation

Via zef-ecosystem:

zef install ML::AssociationRuleLearning

From GitHub:

zef install https://github.com/antononcube/Raku-ML-AssociationRuleLearning

Frequent sets finding

Here we get the Titanic dataset (from “Data::Reshapers”) and summarize it:

use Data::Reshapers;
use Data::Summarizers;
my @dsTitanic = get-titanic-dataset();
records-summary(@dsTitanic);
# +---------------+-------------------+----------------+----------------+-----------------+
# | passengerSex  | passengerSurvival | passengerClass | passengerAge   | id              |
# +---------------+-------------------+----------------+----------------+-----------------+
# | male   => 843 | died     => 809   | 3rd => 709     | 20      => 334 | 607     => 1    |
# | female => 466 | survived => 500   | 1st => 323     | -1      => 263 | 849     => 1    |
# |               |                   | 2nd => 277     | 30      => 258 | 519     => 1    |
# |               |                   |                | 40      => 190 | 724     => 1    |
# |               |                   |                | 50      => 88  | 189     => 1    |
# |               |                   |                | 60      => 57  | 948     => 1    |
# |               |                   |                | 0       => 56  | 287     => 1    |
# |               |                   |                | (Other) => 63  | (Other) => 1302 |
# +---------------+-------------------+----------------+----------------+-----------------+

Problem: Find all combinations of values of the variables “passengerAge”, “passengerClass”, “passengerSex”, and “passengerSurvival” that appear more than 200 times in the Titanic dataset.

Here is how we use the function frequent-sets to give an answer:

use ML::AssociationRuleLearning;
my @freqSets = frequent-sets(@dsTitanic, min-support => 200, min-number-of-items => 2, max-number-of-items => Inf):counts;
@freqSets.elems
# 11

The function frequent-sets returns the frequent sets together with their support; with the adverb :counts (used above) the counts are returned instead.

Here we tabulate the result:

say to-pretty-table(@freqSets.map({ %( Frequent-set => $_.key.join(' '), Count => $_.value) }), align => 'l');
# +-------+-------------------------------------------------------------+
# | Count | Frequent-set                                                |
# +-------+-------------------------------------------------------------+
# | 208   | passengerAge:-1 passengerClass:3rd                          |
# | 206   | passengerAge:20 passengerClass:3rd                          |
# | 207   | passengerAge:20 passengerSex:male                           |
# | 207   | passengerAge:20 passengerSurvival:died                      |
# | 200   | passengerClass:1st passengerSurvival:survived               |
# | 216   | passengerClass:3rd passengerSex:female                      |
# | 493   | passengerClass:3rd passengerSex:male                        |
# | 418   | passengerClass:3rd passengerSex:male passengerSurvival:died |
# | 528   | passengerClass:3rd passengerSurvival:died                   |
# | 339   | passengerSex:female passengerSurvival:survived              |
# | 682   | passengerSex:male passengerSurvival:died                    |
# +-------+-------------------------------------------------------------+

We can verify the result by looking into these group counts, [AA2]:

my $obj = group-by( @dsTitanic, <passengerClass passengerSex>);
.say for $obj>>.elems.grep({ $_.value >= 200 });
$obj = group-by( @dsTitanic, <passengerClass passengerSurvival passengerSex>);
.say for $obj>>.elems.grep({ $_.value >= 200 });
# 3rd.female => 216
# 3rd.male => 493
# 3rd.died.male => 418

Or these contingency tables:

my $obj = group-by( @dsTitanic, "passengerClass") ;
$obj = $obj.map({ $_.key => cross-tabulate( $_.value, "passengerSex", "passengerSurvival" ) });
.say for $obj.Array;
# 2nd => {female => {died => 12, survived => 94}, male => {died => 146, survived => 25}}
# 1st => {female => {died => 5, survived => 139}, male => {died => 118, survived => 61}}
# 3rd => {female => {died => 110, survived => 106}, male => {died => 418, survived => 75}}

Remark: For datasets – i.e. arrays of hashes – frequent-sets preprocesses the data by concatenating column names with corresponding column values. This is done in order to prevent “collisions” of same values coming from different columns. If that concatenation is not desired then manual preprocessing like this can be used:

@dsTitanic.map({ $_.values.List }).Array

Remark: frequent-sets’s argument min-support can take both integers greater than 1 and frequencies between 0 and 1. (If an integer greater than one is given, then the corresponding frequency is derived.)
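For example (assuming the conversion simply divides by the transaction count, which the Titanic dataset above puts at 1309 rows):

```raku
# Assumed conversion: an integer min-support is divided by the number of
# transactions to obtain the equivalent frequency (1309 = Titanic row count).
my $n-transactions = 1309;
say (200 / $n-transactions).round(0.0001);   # 0.1528
```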

Remark: By default frequent-sets uses the Eclat algorithm. The functions apriori and eclat call frequent-sets with the option settings method=>'Apriori' and method=>'Eclat' respectively.


Association rules finding

Here we find association rules with min support 0.3 and min confidence 0.7:

association-rules(@dsTitanic, min-support => 0.3, min-confidence => 0.7)
==> to-pretty-table
# +----------+------------+-------------------------------------------+----------+------------+------------------------+----------+-------+
# | support  | conviction |                antecedent                 | leverage | confidence |       consequent       |   lift   | count |
# +----------+------------+-------------------------------------------+----------+------------+------------------------+----------+-------+
# | 0.403361 |  1.496229  |             passengerClass:3rd            | 0.068615 |  0.744711  | passengerSurvival:died | 1.204977 |  528  |
# | 0.521008 |  2.000009  |             passengerSex:male             | 0.122996 |  0.809015  | passengerSurvival:died | 1.309025 |  682  |
# | 0.521008 |  2.267729  |           passengerSurvival:died          | 0.122996 |  0.843016  |   passengerSex:male    | 1.309025 |  682  |
# | 0.319328 |  2.510823  |    passengerClass:3rd passengerSex:male   | 0.086564 |  0.847870  | passengerSurvival:died | 1.371894 |  418  |
# | 0.319328 |  1.708785  | passengerClass:3rd passengerSurvival:died | 0.059562 |  0.791667  |   passengerSex:male    | 1.229290 |  418  |
# +----------+------------+-------------------------------------------+----------+------------+------------------------+----------+-------+

Reusing found frequent sets

The function frequent-sets takes the adverb “:object” that makes frequent-sets return an object of type ML::AssociationRuleLearning::Apriori or ML::AssociationRuleLearning::Eclat, which can be “pipelined” to find association rules.

Here we find frequent sets, return the corresponding object, and retrieve the result:

my $eclatObj = frequent-sets(@dsTitanic.map({ $_.values.List }).Array, min-support => 0.12, min-number-of-items => 2, max-number-of-items => 6):object;
$eclatObj.result.elems
# 23

Here we find association rules and pretty-print them:

$eclatObj.find-rules(min-confidence=>0.7)
==> to-pretty-table 
# +------------+-------+------------+----------+----------+------------+-------------+----------+
# | consequent | count | confidence |   lift   | support  | conviction | antecedent  | leverage |
# +------------+-------+------------+----------+----------+------------+-------------+----------+
# |    3rd     |  208  |  0.790875  | 1.460162 | 0.158900 |  2.191819  |      -1     | 0.050076 |
# |    died    |  190  |  0.722433  | 1.168931 | 0.145149 |  1.376142  |      -1     | 0.020977 |
# |    died    |  528  |  0.744711  | 1.204977 | 0.403361 |  1.496229  |     3rd     | 0.068615 |
# |    died    |  158  |  0.759615  | 1.229093 | 0.120703 |  1.588999  |    -1 3rd   | 0.022498 |
# |    3rd     |  158  |  0.831579  | 1.535313 | 0.120703 |  2.721543  |   -1 died   | 0.042085 |
# |    male    |  185  |  0.703422  | 1.092265 | 0.141329 |  1.200349  |      -1     | 0.011938 |
# |    male    |  682  |  0.843016  | 1.309025 | 0.521008 |  2.267729  |     died    | 0.122996 |
# |    died    |  682  |  0.809015  | 1.309025 | 0.521008 |  2.000009  |     male    | 0.122996 |
# |    male    |  159  |  0.836842  | 1.299438 | 0.121467 |  2.181917  |   -1 died   | 0.027990 |
# |    died    |  159  |  0.859459  | 1.390646 | 0.121467 |  2.717870  |   -1 male   | 0.034121 |
# |    male    |  176  |  0.846154  | 1.313897 | 0.134454 |  2.313980  |   20 died   | 0.032122 |
# |    died    |  176  |  0.846154  | 1.369117 | 0.134454 |  2.482811  |   20 male   | 0.036249 |
# |    male    |  418  |  0.791667  | 1.229290 | 0.319328 |  1.708785  |   3rd died  | 0.059562 |
# |    died    |  418  |  0.847870  | 1.371894 | 0.319328 |  2.510823  |   3rd male  | 0.086564 |
# |  survived  |  339  |  0.727468  | 1.904511 | 0.258976 |  2.267729  |    female   | 0.122996 |
# +------------+-------+------------+----------+----------+------------+-------------+----------+

Remark: Note that because of the specified minimum confidence, the number of association rules is “contained” – a (much) larger number of rules would be produced with, say, min-confidence=>0.2.


Implementation considerations

UML diagram

Here is a UML diagram that shows the package’s structure:

class-diagram

The PlantUML spec and diagram were obtained with the CLI script to-uml-spec of the package “UML::Translators”, [AAp6].

Here we get the PlantUML spec:

to-uml-spec ML::AssociationRuleLearning > ./resources/class-diagram.puml

Here we get the diagram:

to-uml-spec ML::AssociationRuleLearning | java -jar ~/PlantUML/plantuml-1.2022.5.jar -pipe > ./resources/class-diagram.png

Remark: It might be a good idea to have an abstract class, named, say, ML::AssociationRuleLearning::AbstractFinder, that is a parent of both ML::AssociationRuleLearning::Apriori and ML::AssociationRuleLearning::Eclat, but at this point of development I have not found that to be necessary.

Eclat

We can say that Eclat uses a “vertical database representation” of the transactions.

Eclat is based on Raku’s sets, bags, and mixes functionalities.

Eclat represents the transactions as a hash of sets:

  • The keys of the hash are items.
  • The elements of the sets are transaction identifiers.

(In other words, for each item an inverse index is made.)

This representation allows for quick computation of the support of item combinations.
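The vertical representation can be sketched in a few lines of Python (an illustrative sketch, not the package’s actual Raku code): each item maps to the set of transaction identifiers that contain it, and the support of an item combination is obtained by intersecting the corresponding sets.

```python
# Illustrative sketch of Eclat's vertical database representation.

def vertical_index(transactions):
    """Map each item to the set of transaction ids that contain it
    (an inverse index per item)."""
    index = {}
    for tid, items in enumerate(transactions):
        for item in items:
            index.setdefault(item, set()).add(tid)
    return index

def support(index, itemset, n_transactions):
    """Support of an item combination: the fraction of transactions
    containing all items, found by intersecting the inverse indexes."""
    tids = set.intersection(*(index[item] for item in itemset))
    return len(tids) / n_transactions

# A toy transactions list in the spirit of the Titanic data above:
transactions = [
    {'3rd', 'male', 'died'},
    {'1st', 'female', 'survived'},
    {'3rd', 'female', 'died'},
    {'3rd', 'male', 'died'},
]
idx = vertical_index(transactions)
print(support(idx, ['3rd', 'died'], len(transactions)))  # 0.75
```

Note that extending a frequent set by one item only requires one more set intersection, which is what makes this representation efficient.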

Apriori

Apriori uses the standard, horizontal representation of the database transactions.

Apriori is usually (much) slower than Eclat. Historically, Apriori is the first ARL method, and its implementation in the package is didactic.
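Apriori’s level-wise search can be sketched as follows (an illustrative Python sketch, not the package’s implementation): frequent k-item sets are grown only from frequent (k-1)-item sets, because every subset of a frequent set must itself be frequent.

```python
from itertools import combinations

def apriori_frequent_sets(transactions, min_support):
    """Level-wise Apriori search over a horizontal transactions list."""
    n = len(transactions)

    def supp(itemset):
        # Support computed by scanning all transactions (the slow part).
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if supp(frozenset([i])) >= min_support]
    frequent = list(level)
    while level:
        # Candidate generation: unions of frequent sets from the previous
        # level that differ by exactly one item.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if supp(c) >= min_support]
        frequent.extend(level)
    return frequent

transactions = [
    {'3rd', 'male', 'died'},
    {'1st', 'female', 'survived'},
    {'3rd', 'female', 'died'},
    {'3rd', 'male', 'died'},
]
fsets = apriori_frequent_sets(transactions, 0.5)
for s in fsets:
    print(sorted(s))
```

Every level requires a full pass over the transactions, which is the main reason Apriori tends to be slower than Eclat’s intersection-based counting.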

Association rules

We can say that the association rule finding function is a general one, but it requires fast computation of confidence, lift, etc. Hence, Eclat’s representation of the transactions is used.

Association rules finding with Apriori is the same as with Eclat. The package function association-rules with the option setting method=>'Apriori' simply sends the frequent sets found with Apriori to the Eclat-based association rule finding.
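The quality measures reported in the rule tables above follow the standard definitions and can be computed directly from three supports: that of the antecedent, that of the consequent, and that of both together. Here is an illustrative Python sketch, checked against the “male ⇒ died” row of the table above (the supports 0.644 and 0.618 are approximate values for “male” and “died” in the Titanic data):

```python
def rule_measures(supp_a, supp_c, supp_ac):
    """Standard association-rule measures from supports:
    supp_a  - support of the antecedent
    supp_c  - support of the consequent
    supp_ac - support of antecedent and consequent together"""
    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    leverage = supp_ac - supp_a * supp_c
    conviction = (1 - supp_c) / (1 - confidence)
    return dict(confidence=confidence, lift=lift,
                leverage=leverage, conviction=conviction)

# Rule "male => died": supp(male) ~ 0.644, supp(died) ~ 0.618,
# supp(male & died) = 0.521 (from the table above).
m = rule_measures(supp_a=0.644, supp_c=0.618, supp_ac=0.521)
print(round(m['confidence'], 3))  # 0.809
print(round(m['lift'], 3))        # 1.309
print(round(m['leverage'], 3))    # 0.123
```

These match the confidence, lift, and leverage columns of the corresponding table row.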


References

Articles

[Wk1] Wikipedia entry, “Association Rule Learning”.

[AA1] Anton Antonov, “Movie genre associations”, (2013), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “Introduction to data wrangling with Raku”, (2021), RakuForPrediction at WordPress.

Packages

[AAp1] Anton Antonov, Implementation of the Apriori algorithm in Mathematica, (2014-2016), MathematicaForPrediction at GitHub/antononcube.

[AAp1a] Anton Antonov, Implementation of the Apriori algorithm via Tries in Mathematica, (2022), MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov, Implementation of the Eclat algorithm in Mathematica, (2022), MathematicaForPrediction at GitHub/antononcube.

[AAp3] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.

[AAp4] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.

[AAp5] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.

[AAp6] Anton Antonov, UML::Translators Raku package, (2022), GitHub/antononcube.

[AAp7] Anton Antonov, ML::TrieWithFrequencies Raku package, (2021), GitHub/antononcube.