CSV files with too many values in a row cause errors #440


Closed
frafra opened this issue May 27, 2022 · 20 comments
Labels
bug Something isn't working python-library

Comments


frafra commented May 27, 2022

Original title: csv.DictReader can have None as key

In some cases, csv.DictReader can produce rows with None as the key for unnamed columns, and a list of values as the corresponding value. sqlite_utils.utils.rows_from_file() cannot handle that:

from urllib.request import urlopen

import sqlite_utils

url = "https://artsdatabanken.no/Fab2018/api/export/csv"
db = sqlite_utils.Database(memory=True)

with urlopen(url) as fab:
    reader, _ = sqlite_utils.utils.rows_from_file(fab, encoding="utf-16le")
    db["fab2018"].insert_all(reader, pk="Id")

Result:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/user/.local/pipx/venvs/sqlite-utils/lib/python3.8/site-packages/sqlite_utils/db.py", line 2924, in insert_all
    chunk = list(chunk)
  File "/home/user/.local/pipx/venvs/sqlite-utils/lib/python3.8/site-packages/sqlite_utils/db.py", line 3454, in fix_square_braces
    if any("[" in key or "]" in key for key in record.keys()):
  File "/home/user/.local/pipx/venvs/sqlite-utils/lib/python3.8/site-packages/sqlite_utils/db.py", line 3454, in <genexpr>
    if any("[" in key or "]" in key for key in record.keys()):
TypeError: argument of type 'NoneType' is not iterable

Code:

if any("[" in key or "]" in key for key in record.keys()):

The sqlite-utils insert command-line tool is not affected by this issue.
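The failing check can be reproduced in isolation. A minimal standalone sketch (not library code), using a dict shaped the way csv.DictReader produces it for an over-long row:

```python
# A dict shaped like csv.DictReader's output for a row with too many values
record = {"id": "1", "name": "Cleo", None: ["nohead"]}

try:
    # The same membership test that fix_square_braces() performs
    any("[" in key or "]" in key for key in record.keys())
except TypeError as e:
    print(e)  # argument of type 'NoneType' is not iterable
```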

@simonw simonw added the bug Something isn't working label Jun 13, 2022

simonw commented Jun 13, 2022

rows_from_file() isn't part of the documented API but maybe it should be!


simonw commented Jun 13, 2022

Steps to demonstrate that sqlite-utils insert is not affected:

curl -o artsdatabanken.csv https://artsdatabanken.no/Fab2018/api/export/csv
sqlite-utils insert arts.db artsdatabanken artsdatabanken.csv --sniff --csv --encoding utf-16le


simonw commented Jun 13, 2022

I don't understand why that works but calling insert_all() does not.


simonw commented Jun 13, 2022

Fixing that key thing (to ignore any key that is None) revealed a new bug:

File ~/Dropbox/Development/sqlite-utils/sqlite_utils/utils.py:376, in hash_record(record, keys)
    373 if keys is not None:
    374     to_hash = {key: record[key] for key in keys}
    375 return hashlib.sha1(
--> 376     json.dumps(to_hash, separators=(",", ":"), sort_keys=True, default=repr).encode(
    377         "utf8"
    378     )
    379 ).hexdigest()

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/__init__.py:234, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    232 if cls is None:
    233     cls = JSONEncoder
--> 234 return cls(
    235     skipkeys=skipkeys, ensure_ascii=ensure_ascii,
    236     check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237     separators=separators, default=default, sort_keys=sort_keys,
    238     **kw).encode(obj)

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195         return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed.  The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

TypeError: '<' not supported between instances of 'NoneType' and 'str'
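That new failure can also be reproduced in isolation: json.dumps() with sort_keys=True cannot order a dictionary that mixes None and string keys. A minimal sketch mimicking the serialization that hash_record() performs:

```python
import json

# A row dict with the None key that csv.DictReader can produce
record = {"id": "1", None: ["nohead"]}

try:
    json.dumps(record, separators=(",", ":"), sort_keys=True, default=repr)
except TypeError as e:
    print(e)  # '<' not supported between instances of 'NoneType' and 'str'
```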


simonw commented Jun 13, 2022

Here are full steps to replicate the bug:

from urllib.request import urlopen
import sqlite_utils
db = sqlite_utils.Database(memory=True)
with urlopen("https://artsdatabanken.no/Fab2018/api/export/csv") as fab:
    reader, other = sqlite_utils.utils.rows_from_file(fab, encoding="utf-16le")
    db["fab2018"].insert_all(reader, pk="Id")


simonw commented Jun 13, 2022

Aha! I think I see what's happening here. Here's what DictReader does if one of the lines has too many items in it:

>>> import csv, io
>>> list(csv.DictReader(io.StringIO("id,name\n1,Cleo,nohead\n2,Barry")))
[{'id': '1', 'name': 'Cleo', None: ['nohead']}, {'id': '2', 'name': 'Barry'}]

See what that row with too many items gets:

{'id': '1', 'name': 'Cleo', None: ['nohead']}

That's None for the key and (weirdly) a list containing the single extra item for the value!


simonw commented Jun 13, 2022

That weird behaviour is documented here: https://docs.python.org/3/library/csv.html#csv.DictReader

If a row has more fields than fieldnames, the remaining data is put in a list and stored with the fieldname specified by restkey (which defaults to None). If a non-blank row has fewer fields than fieldnames, the missing values are filled-in with the value of restval (which defaults to None).
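Both parameters from that documentation can be seen in action on a tiny sample (the "extras" column name here is purely illustrative):

```python
import csv
import io

# restkey collects the extra value in row 1 under "extras";
# restval fills the missing value in row 2 with ""
rows = list(csv.DictReader(
    io.StringIO("id,name\n1,Cleo,nohead\n2\n"),
    restkey="extras",
    restval="",
))
print(rows)
# [{'id': '1', 'name': 'Cleo', 'extras': ['nohead']}, {'id': '2', 'name': ''}]
```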


simonw commented Jun 13, 2022

So I need to make a design decision here: what should sqlite-utils do with CSV files that have rows with more values than there are headings?

Some options:

  • Ignore those extra fields entirely - silently drop that data. I'm not keen on this.
  • Throw an error. The library does this already, but the error is incomprehensible - it could turn into a useful, human-readable error instead.
  • Put the data in a JSON list in a column with a known name (None is not a valid column name, so not that). This could be something like _restkey or _values_with_no_heading. This feels like a better option, but I'd need to carefully pick a name for it - and come up with an answer for the question of what to do if the CSV file being imported already uses that heading name for something else.


simonw commented Jun 13, 2022

Whatever I decide, I can implement it in rows_from_file(), maybe as an optional parameter - then decide how to call it from the sqlite-utils insert CLI (perhaps with a new option there too).


simonw commented Jun 13, 2022

Here's the current function signature for rows_from_file():

def rows_from_file(
    fp: BinaryIO,
    format: Optional[Format] = None,
    dialect: Optional[Type[csv.Dialect]] = None,
    encoding: Optional[str] = None,
) -> Tuple[Iterable[dict], Format]:

@simonw simonw changed the title csv.DictReader can have None as key CSV files with too many values in a row cause errors Jun 13, 2022

simonw commented Jun 13, 2022

Decision: I'm going to default to raising an exception if a row has too many values in it.

You'll be able to pass ignore_extras=True to ignore those extra values, or pass restkey="the_rest" to stick them in a list in the restkey column.
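That decided behaviour can be sketched as a wrapper around csv.DictReader. This is a hypothetical illustration, not the actual sqlite-utils implementation; clean_rows() and the RowError message text are made up here:

```python
import csv
from typing import Iterable, Optional


class RowError(Exception):
    pass


def clean_rows(
    reader: csv.DictReader,
    ignore_extras: bool = False,
    restkey: Optional[str] = None,
) -> Iterable[dict]:
    # DictReader stores extra values in a list under the None key
    for row in reader:
        if None in row:
            extras = row.pop(None)
            if restkey is not None:
                row[restkey] = extras
            elif not ignore_extras:
                raise RowError("Row had too many values: {}".format(extras))
        yield row
```

With restkey="the_rest" the extra values end up in a "the_rest" column; with neither option set, the over-long row raises RowError.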


simonw commented Jun 13, 2022

The exception will be called RowError.


simonw commented Jun 14, 2022

Interesting challenge in writing tests for this: if you give csv.Sniffer a short example with an invalid row in it, sometimes it picks the wrong delimiter!

id,name\r\n1,Cleo,oops

It decided the delimiter there was e.
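That misdetection is reproducible directly with csv.Sniffer - the "e" happens to appear exactly once in each line of this short sample, which makes it look like a consistent delimiter:

```python
import csv

# Sniff a short sample containing one row with too many values
dialect = csv.Sniffer().sniff("id,name\r\n1,Cleo,oops")
print(dialect.delimiter)  # e
```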


simonw commented Jun 14, 2022

I think that's unavoidable: it looks like csv.Sniffer only works if you feed it a CSV file with an equal number of values in each row, which is understandable.


simonw commented Jun 14, 2022

That broke mypy:

sqlite_utils/utils.py:229: error: Incompatible types in assignment (expression has type "Iterable[Dict[Any, Any]]", variable has type "DictReader[str]")


simonw commented Jun 14, 2022

Getting this past mypy is really hard!

% mypy sqlite_utils
sqlite_utils/utils.py:189: error: No overload variant of "pop" of "MutableMapping" matches argument type "None"
sqlite_utils/utils.py:189: note: Possible overload variants:
sqlite_utils/utils.py:189: note:     def pop(self, key: str) -> str
sqlite_utils/utils.py:189: note:     def [_T] pop(self, key: str, default: Union[str, _T] = ...) -> Union[str, _T]

That's because of this line:

row.pop(None)

Which is legit here - we have a dictionary where one of the keys is None and we want to remove that key. But the baked-in type is apparently def pop(self, key: str) -> str.
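One way past this (a sketch of a possible workaround, not necessarily the fix that shipped) is to cast the row so mypy accepts the None key; the pop_none_key() helper name is made up here:

```python
from typing import Any, Dict, cast


def pop_none_key(row: dict) -> None:
    # mypy infers DictReader rows as Dict[str, str], so a cast
    # (or a `# type: ignore` comment) is needed to pop the None key
    cast(Dict[Any, Any], row).pop(None, None)


record = {"id": "1", "name": "Cleo", None: ["nohead"]}
pop_none_key(record)
print(record)  # {'id': '1', 'name': 'Cleo'}
```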

simonw added a commit that referenced this issue Jun 14, 2022
simonw added a commit that referenced this issue Jun 14, 2022
simonw added a commit that referenced this issue Jun 14, 2022

simonw commented Jun 14, 2022

I'm going to rename restkey to extras_key for consistency with ignore_extras.

@simonw simonw closed this as completed in ce670e2 Jun 14, 2022

simonw commented Jun 14, 2022

I forgot to add equivalents of extras_key= and ignore_extras= to the CLI tool - will do that in a separate issue.
