
Handle case where subsequent records (after first batch) include extra columns #146


Merged: 3 commits into simonw:main on Sep 8, 2020

Conversation

simonwiles
Contributor

Addresses #145.

I think this should do the job. If it meets with your approval I'll update this PR to include an update to the documentation -- I came across this bug while preparing a PR to update the documentation around batch_size in any event.

@simonwiles simonwiles force-pushed the main branch 2 times, most recently from d282783 to 74907aa on September 7, 2020 at 18:57
@simonwiles
Contributor Author

simonwiles commented Sep 7, 2020

@simonw -- I've gone ahead and updated the documentation to reflect the changes introduced in this PR. IMO it's ready to merge now.

In writing the documentation changes, I began to wonder about the value and role of batch_size at all, tbh. May I assume it was originally intended to prevent using the entire row set to determine columns and column types, and that this was a performance consideration? If so, this PR entirely undermines its purpose. I've been passing in excess of 500,000 rows at a time to insert_all() with these changes, and although I'm sure the performance difference is measurable, it's not really noticeable; given #145, I don't know that any performance advantage outweighs the problems this approach removes. What do you think about just dropping the argument and defaulting to the maximum batch_size permissible given SQLITE_MAX_VARS? Are there other reasons one might want to restrict batch_size that I've overlooked? I could open a new issue to discuss/implement this.
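
For concreteness, the ceiling I have in mind could be derived along these lines (just a sketch, not code from this PR; the 999 figure is SQLite's historical default variable limit, and max_batch_size is a placeholder name):

SQLITE_MAX_VARS = 999  # assumed default cap on "?" parameters per statement

def max_batch_size(num_columns):
    # Each row consumes one parameter per column, so the number of rows
    # per INSERT statement is capped by the variable limit.
    return max(1, SQLITE_MAX_VARS // num_columns)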

Of course the documentation will need to change again too if/when something is done about #147.

@simonwiles
Contributor Author

Just force-pushed to update d042f9c with more formatting changes to satisfy black==20.8b1 and pass the GitHub Actions "Test" workflow.

@simonw
Owner

simonw commented Sep 7, 2020

The problem with this approach is that it requires us to consume the entire iterator before we can start inserting rows into the table - here on line 1052:

assert not (
    ignore and replace
), "Use either ignore=True or replace=True, not both"
num_records_processed = 0
records = list(records)
if not records:
    return self  # It was an empty list

I designed .insert_all() to avoid doing this, because I want to be able to pass it an iterator (or, more likely, a generator) that could produce potentially millions of records. Doing things one batch of 100 records at a time means that the Python process doesn't need to pull millions of records into memory at once.
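
The batching itself only needs lazy chunking of the iterator, along these lines (a rough sketch, not the library's actual code):

import itertools

def chunks(iterable, size):
    # Lazily yield lists of up to `size` records without ever
    # materializing the whole iterable in memory.
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            break
        yield batch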

db-to-sqlite is one example of a tool that uses that characteristic, in https://github.com/simonw/db-to-sqlite/blob/63e4ee972f292de13bb11767c0fb64b35339d954/db_to_sqlite/cli.py#L94-L106

So we need to solve this issue without consuming the entire iterator with a records = list(records) call.

I think one way to do this is to execute each chunk one at a time and watch out for an exception that indicates that we sent too many parameters - then adjust the chunk size down and try again.
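
Roughly this shape, perhaps (a sketch only, not a concrete proposal; build_sql stands in for whatever assembles the INSERT statement and its parameters, and I'm assuming the failure surfaces as a sqlite3.OperationalError mentioning too many variables):

import sqlite3

def insert_chunk(db, build_sql, chunk, size):
    # Sketch: insert one chunk, shrinking the working batch size whenever
    # SQLite complains about too many parameters. A failed INSERT inserts
    # nothing, so the failed rows are re-queued and retried at the smaller
    # size; the caller keeps the returned size for later chunks.
    pending = list(chunk)
    while pending:
        head, pending = pending[:size], pending[size:]
        try:
            db.execute(*build_sql(head))
        except sqlite3.OperationalError as error:
            if "too many" not in str(error) or size == 1:
                raise
            size = max(1, size // 2)
            pending = head + pending  # retry these rows at the new size
    return size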

@simonwiles
Contributor Author

Okay, I've rewritten this PR to preserve the batching behaviour but still fix #145, and rebased the branch to account for the db.execute() API change. It's not terribly sophisticated -- if it attempts to insert a batch that has too many variables, the exception is caught, the batch is split in two, each half is inserted separately, and then it carries on as before with the same batch_size. In the edge case where this gets triggered, subsequent batches will all be inserted in two groups too if they continue to have the same number of columns (which is presumably reasonably likely). Do you reckon this is acceptable when set against the awkwardness of recalculating the batch_size on the fly?
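
In outline, the fallback behaves like this (a simplified sketch rather than the PR's actual code; build_sql again stands in for the statement/parameter builder):

import sqlite3

def insert_batch(db, build_sql, batch):
    # Sketch: try the whole batch first; if SQLite rejects it for having
    # too many variables, insert it in two halves instead. batch_size
    # itself is left untouched, so later batches take the same path.
    try:
        db.execute(*build_sql(batch))
    except sqlite3.OperationalError as error:
        if "too many" not in str(error) or len(batch) < 2:
            raise
        mid = len(batch) // 2
        db.execute(*build_sql(batch[:mid]))
        db.execute(*build_sql(batch[mid:]))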

@simonw
Owner

simonw commented Sep 8, 2020

That seems like a reasonable approach to me, especially since this is going to be a pretty rare edge-case.

@simonw simonw changed the title Get columns, column types, and batch size from all records when using insert_all() Don't break if subsequent records contain extra columns Sep 8, 2020
@simonw simonw changed the title Don't break if subsequent records contain extra columns Handle case where subsequent records (after first batch) include extra columns Sep 8, 2020
@simonw simonw merged commit e6d202b into simonw:main Sep 8, 2020