Skip to content

sqlite-utils analyze-tables command and table.analyze_column() method #208

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Dec 13, 2020

Conversation

simonw
Copy link
Owner

@simonw simonw commented Dec 12, 2020

Refs #207

  • Improve design of CLI output
  • Truncate long values in least/most common
  • Add a -c column selection option
  • Tests
  • Documentation

@simonw simonw added the enhancement New feature or request label Dec 12, 2020
* Record total_rows for each column
* Record (value, count) if there is just a single distinct value
* Do not calculate most/least common if all values are distinct
* Calculate table count once per table, not once per column
@simonw
Copy link
Owner Author

simonw commented Dec 12, 2020

Should truncate values in the least/most common JSON array to a sensible length, otherwise you end up with stuff like this:

[
    [
        "b'\\x00\\x05barry\\x03\\x01\\x02\\x00\\x00\\x03cat\\x03\\x01\\x03\\x00\\x00\\x03dog\\x08\\x01\\x01\\x01\\x03\\x00\\x01\\x03\\x00\\x00\\x07panther\\x05\\x01\\x01\\x02\\x02\\x00\\x01\\x03uma\\x05\\x02\\x01\\x02\\x02\\x00\\x00\\x04sara\\x05\\x02\\x01\\x01\\x02\\x00\\x00\\x05terry\\x08\\x01\\x01\\x01\\x02\\x00\\x01\\x02\\x00\\x00\\x06weasel\\x05\\x02\\x01\\x01\\x03\\x00'",
        1
    ]
]

This example also shows that binary values (like those in _fts tables) look a bit weird, but I think I'm OK with that since binary data can't be represented neatly in JSON anyway.

@simonw
Copy link
Owner Author

simonw commented Dec 12, 2020

CLI output looks like this at the moment, which is bad:

 % sqlite-utils analyze-tables ../datasette/fixtures.db facetable
1/10: ColumnDetails(table='facetable', column='pk', total_rows=15, num_null=0, num_blank=0, num_distinct=15, most_common=None, least_common=None)
2/10: ColumnDetails(table='facetable', column='created', total_rows=15, num_null=0, num_blank=0, num_distinct=4, most_common=[('2019-01-17 08:00:00', 4), ('2019-01-15 08:00:00', 4), ('2019-01-14 08:00:00', 4), ('2019-01-16 08:00:00', 3)], least_common=[('2019-01-16 08:00:00', 3), ('2019-01-14 08:00:00', 4), ('2019-01-15 08:00:00', 4), ('2019-01-17 08:00:00', 4)])
3/10: ColumnDetails(table='facetable', column='planet_int', total_rows=15, num_null=0, num_blank=0, num_distinct=2, most_common=[(1, 14), (2, 1)], least_common=[(2, 1), (1, 14)])
4/10: ColumnDetails(table='facetable', column='on_earth', total_rows=15, num_null=0, num_blank=0, num_distinct=2, most_common=[(1, 14), (0, 1)], least_common=[(0, 1), (1, 14)])
5/10: ColumnDetails(table='facetable', column='state', total_rows=15, num_null=0, num_blank=0, num_distinct=3, most_common=[('CA', 10), ('MI', 4), ('MC', 1)], least_common=[('MC', 1), ('MI', 4), ('CA', 10)])
6/10: ColumnDetails(table='facetable', column='city_id', total_rows=15, num_null=0, num_blank=0, num_distinct=4, most_common=[(1, 6), (3, 4), (2, 4), (4, 1)], least_common=[(4, 1), (2, 4), (3, 4), (1, 6)])
7/10: ColumnDetails(table='facetable', column='neighborhood', total_rows=15, num_null=0, num_blank=0, num_distinct=14, most_common=[('Downtown', 2), ('Tenderloin', 1), ('SOMA', 1), ('Mission', 1), ('Mexicantown', 1), ('Los Feliz', 1), ('Koreatown', 1), ('Hollywood', 1), ('Hayes Valley', 1), ('Greektown', 1)], least_common=[('Arcadia Planitia', 1), ('Bernal Heights', 1), ('Corktown', 1), ('Dogpatch', 1), ('Greektown', 1), ('Hayes Valley', 1), ('Hollywood', 1), ('Koreatown', 1), ('Los Feliz', 1), ('Mexicantown', 1)])
8/10: ColumnDetails(table='facetable', column='tags', total_rows=15, num_null=0, num_blank=0, num_distinct=3, most_common=[('[]', 13), ('["tag1", "tag3"]', 1), ('["tag1", "tag2"]', 1)], least_common=[('["tag1", "tag2"]', 1), ('["tag1", "tag3"]', 1), ('[]', 13)])
9/10: ColumnDetails(table='facetable', column='complex_array', total_rows=15, num_null=0, num_blank=0, num_distinct=2, most_common=[('[]', 14), ('[{"foo": "bar"}]', 1)], least_common=[('[{"foo": "bar"}]', 1), ('[]', 14)])
10/10: ColumnDetails(table='facetable', column='distinct_some_null', total_rows=15, num_null=13, num_blank=0, num_distinct=2, most_common=[(None, 13), ('two', 1), ('one', 1)], least_common=[('one', 1), ('two', 1), (None, 13)])
(sqlite-utils) sqlite-utils % 

@simonw
Copy link
Owner Author

simonw commented Dec 12, 2020

If there are less than ten values is it worth outputting them twice, once in most_common and then in reverse in least_common? Feels redundant - I think I should leave least_common empty in that case.

@simonw
Copy link
Owner Author

simonw commented Dec 12, 2020

It would be neat if you could optionally specify a subset of columns to analyze, using -c or --column.

@simonw
Copy link
Owner Author

simonw commented Dec 12, 2020

% sqlite-utils analyze-tables ../datasette/fixtures.db facetable --column pk
1/1: ColumnDetails(table='facetable', column='pk', total_rows=15, num_null=0, num_blank=0, num_distinct=15, most_common=None, least_common=None)

@simonw simonw marked this pull request as ready for review December 13, 2020 05:40
@simonw
Copy link
Owner Author

simonw commented Dec 13, 2020

Example output:

% sqlite-utils analyze-tables github.db tags             
tags.repo: (1/3)

  Total rows: 261
  Null rows: 0
  Blank rows: 0

  Distinct values: 14

  Most common:
    88: 107914493
    75: 140912432
    27: 206156866
    21: 207052882
    17: 197431109
    8: 197882382
    5: 256834907
    5: 205429375
    4: 248903544
    3: 206202864

  Least common:
    1: 209590345
    2: 206649770
    2: 303218369
    3: 206202864
    3: 213286752
    4: 248903544
    5: 205429375
    5: 256834907
    8: 197882382
    17: 197431109

tags.name: (2/3)

  Total rows: 261
  Null rows: 0
  Blank rows: 0

  Distinct values: 175

  Most common:
    10: 0.2
    9: 0.1
    7: 0.3
    6: 0.4
    5: 0.7
    5: 0.5
    5: 0.1a
    4: 0.9
    4: 0.8
    4: 0.6

  Least common:
    1: 0.1.1
    1: 0.11.1
    1: 0.1a2
    1: 0.20.1
    1: 0.21.1
    1: 0.21.2
    1: 0.21.3
    1: 0.22
    1: 0.22.1
    1: 0.23

tags.sha: (3/3)

  Total rows: 261
  Null rows: 0
  Blank rows: 0

  Distinct values: 261

@simonw simonw merged commit 69a121e into main Dec 13, 2020
@simonw simonw deleted the analyze-table branch December 13, 2020 07:20
simonw added a commit that referenced this pull request Dec 13, 2020
simonw added a commit that referenced this pull request Dec 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant