sqlite-utils analyze-tables command and table.analyze_column() method #208

simonw · 2020-12-12T05:27:49Z

…st_common as null

* Record total_rows for each column
* Record (value, count) if there is just a single distinct value
* Do not calculate most/least common if all values are distinct
* Calculate table count once per table, not once per column
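The rules in the list above could be sketched roughly as follows, using only the standard library `sqlite3` module. This is not the code from this PR: `analyze_column_sketch`, its parameters, and the SQL are assumptions made for illustration. Only the `ColumnDetails` field names are taken from the output shown later in this thread.

```python
import sqlite3
from collections import namedtuple

# Field names copied from the ColumnDetails repr shown in this thread.
ColumnDetails = namedtuple(
    "ColumnDetails",
    "table column total_rows num_null num_blank num_distinct "
    "most_common least_common",
)


def analyze_column_sketch(conn, table, column, total_rows=None, common_limit=10):
    # Hypothetical helper, not the sqlite-utils implementation.
    # Identifiers cannot be bound as parameters, so quote them by hand.
    t = '"{}"'.format(table.replace('"', '""'))
    c = '"{}"'.format(column.replace('"', '""'))

    def scalar(sql):
        return conn.execute(sql).fetchone()[0]

    if total_rows is None:
        # A caller looping over columns can compute this once per table
        # and pass it in, rather than recounting for every column.
        total_rows = scalar(f"select count(*) from {t}")
    num_null = scalar(f"select count(*) from {t} where {c} is null")
    num_blank = scalar(f"select count(*) from {t} where {c} = ''")
    num_distinct = scalar(f"select count(distinct {c}) from {t}")

    most_common = least_common = None
    if num_distinct == 1:
        # Single distinct value: record just (value, count).
        value = scalar(f"select {c} from {t} where {c} is not null limit 1")
        most_common = [(value, total_rows - num_null)]
    elif num_distinct != total_rows:
        # Skip both lists entirely when every value is distinct.
        grouped = f"select {c}, count(*) from {t} group by {c} order by count(*)"
        most_common = conn.execute(
            f"{grouped} desc, {c} limit {common_limit}"
        ).fetchall()
        least_common = conn.execute(
            f"{grouped}, {c} limit {common_limit}"
        ).fetchall()
    return ColumnDetails(
        table, column, total_rows, num_null, num_blank,
        num_distinct, most_common, least_common,
    )
```

Passing `total_rows` in from the caller is one way to get the "once per table" behaviour; the sketch falls back to counting only when it is omitted.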

simonw · 2020-12-12T05:42:26Z

Should truncate values in the least/most common JSON array to a sensible length, otherwise you end up with stuff like this:

[
    [
        "b'\\x00\\x05barry\\x03\\x01\\x02\\x00\\x00\\x03cat\\x03\\x01\\x03\\x00\\x00\\x03dog\\x08\\x01\\x01\\x01\\x03\\x00\\x01\\x03\\x00\\x00\\x07panther\\x05\\x01\\x01\\x02\\x02\\x00\\x01\\x03uma\\x05\\x02\\x01\\x02\\x02\\x00\\x00\\x04sara\\x05\\x02\\x01\\x01\\x02\\x00\\x00\\x05terry\\x08\\x01\\x01\\x01\\x02\\x00\\x01\\x02\\x00\\x00\\x06weasel\\x05\\x02\\x01\\x01\\x03\\x00'",
        1
    ]
]

This example also shows that binary values (like those in _fts tables) look a bit weird, but I think I'm OK with that since binary data can't be represented neatly in JSON anyway.
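A truncation step along these lines might look like the sketch below. The helper name `truncate_value`, the `"..."` suffix, and the choice to convert bytes with `repr()` first are assumptions for illustration; the 100-character default matches the limit named in a later commit on this PR.

```python
def truncate_value(value, length=100):
    # Hypothetical helper: shorten long values before they go into
    # the most_common / least_common JSON arrays.
    if isinstance(value, bytes):
        # Bytes cannot be stored in JSON directly, so fall back to the
        # Python repr (this matches the b'...' string shown above).
        value = repr(value)
    if isinstance(value, str) and len(value) > length:
        return value[:length] + "..."
    # Numbers, None and short strings pass through unchanged.
    return value
```

Non-string values such as integers are left alone, so counts and numeric column values are unaffected.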

simonw · 2020-12-12T05:43:45Z

CLI output looks like this at the moment, which is bad:

 % sqlite-utils analyze-tables ../datasette/fixtures.db facetable
1/10: ColumnDetails(table='facetable', column='pk', total_rows=15, num_null=0, num_blank=0, num_distinct=15, most_common=None, least_common=None)
2/10: ColumnDetails(table='facetable', column='created', total_rows=15, num_null=0, num_blank=0, num_distinct=4, most_common=[('2019-01-17 08:00:00', 4), ('2019-01-15 08:00:00', 4), ('2019-01-14 08:00:00', 4), ('2019-01-16 08:00:00', 3)], least_common=[('2019-01-16 08:00:00', 3), ('2019-01-14 08:00:00', 4), ('2019-01-15 08:00:00', 4), ('2019-01-17 08:00:00', 4)])
3/10: ColumnDetails(table='facetable', column='planet_int', total_rows=15, num_null=0, num_blank=0, num_distinct=2, most_common=[(1, 14), (2, 1)], least_common=[(2, 1), (1, 14)])
4/10: ColumnDetails(table='facetable', column='on_earth', total_rows=15, num_null=0, num_blank=0, num_distinct=2, most_common=[(1, 14), (0, 1)], least_common=[(0, 1), (1, 14)])
5/10: ColumnDetails(table='facetable', column='state', total_rows=15, num_null=0, num_blank=0, num_distinct=3, most_common=[('CA', 10), ('MI', 4), ('MC', 1)], least_common=[('MC', 1), ('MI', 4), ('CA', 10)])
6/10: ColumnDetails(table='facetable', column='city_id', total_rows=15, num_null=0, num_blank=0, num_distinct=4, most_common=[(1, 6), (3, 4), (2, 4), (4, 1)], least_common=[(4, 1), (2, 4), (3, 4), (1, 6)])
7/10: ColumnDetails(table='facetable', column='neighborhood', total_rows=15, num_null=0, num_blank=0, num_distinct=14, most_common=[('Downtown', 2), ('Tenderloin', 1), ('SOMA', 1), ('Mission', 1), ('Mexicantown', 1), ('Los Feliz', 1), ('Koreatown', 1), ('Hollywood', 1), ('Hayes Valley', 1), ('Greektown', 1)], least_common=[('Arcadia Planitia', 1), ('Bernal Heights', 1), ('Corktown', 1), ('Dogpatch', 1), ('Greektown', 1), ('Hayes Valley', 1), ('Hollywood', 1), ('Koreatown', 1), ('Los Feliz', 1), ('Mexicantown', 1)])
8/10: ColumnDetails(table='facetable', column='tags', total_rows=15, num_null=0, num_blank=0, num_distinct=3, most_common=[('[]', 13), ('["tag1", "tag3"]', 1), ('["tag1", "tag2"]', 1)], least_common=[('["tag1", "tag2"]', 1), ('["tag1", "tag3"]', 1), ('[]', 13)])
9/10: ColumnDetails(table='facetable', column='complex_array', total_rows=15, num_null=0, num_blank=0, num_distinct=2, most_common=[('[]', 14), ('[{"foo": "bar"}]', 1)], least_common=[('[{"foo": "bar"}]', 1), ('[]', 14)])
10/10: ColumnDetails(table='facetable', column='distinct_some_null', total_rows=15, num_null=13, num_blank=0, num_distinct=2, most_common=[(None, 13), ('two', 1), ('one', 1)], least_common=[('one', 1), ('two', 1), (None, 13)])
(sqlite-utils) sqlite-utils %

simonw · 2020-12-12T05:44:46Z

If there are fewer than ten values, is it worth outputting them twice, once in most_common and then in reverse in least_common? Feels redundant - I think I should leave least_common empty in that case.
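As a rough sketch of that rule, assuming the per-value counts are already available as `(value, count)` pairs: the function name and the `limit` parameter are hypothetical, and treating exactly `limit` distinct values as redundant too is my reading rather than something stated here.

```python
def most_and_least_common(value_counts, limit=10):
    # value_counts: one (value, count) pair per distinct value.
    # sorted() is stable, so ties keep their incoming order.
    ordered = sorted(value_counts, key=lambda pair: pair[1], reverse=True)
    most_common = ordered[:limit]
    if len(ordered) <= limit:
        # most_common already lists every value, so a least_common
        # list would only repeat it in reverse order.
        return most_common, None
    least_common = sorted(value_counts, key=lambda pair: pair[1])[:limit]
    return most_common, least_common
```

With the three `state` values from the output above, this returns the full list as `most_common` and `None` for `least_common`.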

simonw · 2020-12-12T05:46:27Z

It would be neat if you could optionally specify a subset of columns to analyze, using -c or --column.

simonw · 2020-12-12T05:48:20Z

% sqlite-utils analyze-tables ../datasette/fixtures.db facetable --column pk
1/1: ColumnDetails(table='facetable', column='pk', total_rows=15, num_null=0, num_blank=0, num_distinct=15, most_common=None, least_common=None)
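One way the column filtering behind `--column` could work is sketched below with the standard library `sqlite3` module. `columns_to_analyze` is a hypothetical helper, not code from this PR, and raising an error for unknown column names is an assumption.

```python
import sqlite3


def columns_to_analyze(conn, table, requested=()):
    # Hypothetical helper: list the table's columns in schema order,
    # optionally restricted to the names passed via -c / --column.
    all_columns = [
        row[0]
        for row in conn.execute("select name from pragma_table_info(?)", [table])
    ]
    if not requested:
        return all_columns
    unknown = [name for name in requested if name not in all_columns]
    if unknown:
        raise ValueError("Unknown column(s): " + ", ".join(unknown))
    return [name for name in all_columns if name in requested]
```

The length of the returned list would then give the denominator for the `1/1:` style progress prefix.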

simonw · 2020-12-13T05:44:49Z

Example output:

% sqlite-utils analyze-tables github.db tags
tags.repo: (1/3)

  Total rows: 261
  Null rows: 0
  Blank rows: 0

  Distinct values: 14

  Most common:
    88: 107914493
    75: 140912432
    27: 206156866
    21: 207052882
    17: 197431109
    8: 197882382
    5: 256834907
    5: 205429375
    4: 248903544
    3: 206202864

  Least common:
    1: 209590345
    2: 206649770
    2: 303218369
    3: 206202864
    3: 213286752
    4: 248903544
    5: 205429375
    5: 256834907
    8: 197882382
    17: 197431109

tags.name: (2/3)

  Total rows: 261
  Null rows: 0
  Blank rows: 0

  Distinct values: 175

  Most common:
    10: 0.2
    9: 0.1
    7: 0.3
    6: 0.4
    5: 0.7
    5: 0.5
    5: 0.1a
    4: 0.9
    4: 0.8
    4: 0.6

  Least common:
    1: 0.1.1
    1: 0.11.1
    1: 0.1a2
    1: 0.20.1
    1: 0.21.1
    1: 0.21.2
    1: 0.21.3
    1: 0.22
    1: 0.22.1
    1: 0.23

tags.sha: (3/3)

  Total rows: 261
  Null rows: 0
  Blank rows: 0

  Distinct values: 261
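The layout above could be produced by something like the following sketch. `format_details` is a hypothetical function written to match the example output, not the code in this PR; the `ColumnDetails` fields are the ones shown earlier in this thread.

```python
from collections import namedtuple

ColumnDetails = namedtuple(
    "ColumnDetails",
    "table column total_rows num_null num_blank num_distinct "
    "most_common least_common",
)


def format_details(details, index, total):
    # Hypothetical formatter for one column's block of output.
    lines = [
        f"{details.table}.{details.column}: ({index}/{total})",
        "",
        f"  Total rows: {details.total_rows}",
        f"  Null rows: {details.num_null}",
        f"  Blank rows: {details.num_blank}",
        "",
        f"  Distinct values: {details.num_distinct}",
    ]
    sections = (
        ("Most common", details.most_common),
        ("Least common", details.least_common),
    )
    for label, pairs in sections:
        if pairs:  # omitted when None, as for tags.sha above
            lines += ["", f"  {label}:"]
            # Pairs are stored as (value, count) but printed count first.
            lines += [f"    {count}: {value}" for value, count in pairs]
    return "\n".join(lines)
```

Printing the count first keeps the numbers left-aligned even when values are long strings.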

Refs #204, #207, #208

Initial prototype of sqlite-utils analyze-tables, refs #207

ca2cccc

simonw added the enhancement label Dec 12, 2020

simonw added 2 commits December 11, 2020 21:30

If only one distinct value, record that in most_common and leave lea…

5c176cc

…st_common as null

Improvements to most/least common values

d4b8d9e

* Record total_rows for each column
* Record (value, count) if there is just a single distinct value
* Do not calculate most/least common if all values are distinct
* Calculate table count once per table, not once per column

Don't bother with least_common if less than 10 values

61461b9

-c / --column option

ef219d6

simonw added 7 commits December 12, 2020 20:46

Tests for analyze tables

f7efe77

Test for analyze-tables --save

c0d2195

Truncate values longer than 100 characters

955daaa

Much improved design for CLI output of analyze-tables

736dca6

Documentation for analyze-tables

f65fe8a

Swap count and value in analyze output

1491467

table.analyze_column() documentation

faedf95

simonw marked this pull request as ready for review December 13, 2020 05:40

Ensure reliable sort order for tests

5a813ba

Make least_common/most_common order dependable for tests

95a966b

simonw merged commit 69a121e into main Dec 13, 2020

simonw deleted the analyze-table branch December 13, 2020 07:20

simonw added a commit that referenced this pull request Dec 13, 2020

Release 3.1

6eea94a

Refs #204, #207, #208

simonw added a commit that referenced this pull request Dec 13, 2020

Release 3.1

1ce96d8

Refs #204, #207, #208
