CSV files with too many values in a row cause errors #440


Closed
frafra opened this issue May 27, 2022 · 20 comments
Labels
bug Something isn't working python-library

Comments


frafra commented May 27, 2022

Original title: csv.DictReader can have None as key

In some cases, csv.DictReader can produce rows with None as the key for unnamed columns, and a list of values as the corresponding value. sqlite_utils.utils.rows_from_file() cannot handle that:

from urllib.request import urlopen

import sqlite_utils

url = "https://artsdatabanken.no/Fab2018/api/export/csv"
db = sqlite_utils.Database(memory=True)

with urlopen(url) as fab:
    reader, _ = sqlite_utils.utils.rows_from_file(fab, encoding="utf-16le")
    db["fab2018"].insert_all(reader, pk="Id")

Result:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/user/.local/pipx/venvs/sqlite-utils/lib/python3.8/site-packages/sqlite_utils/db.py", line 2924, in insert_all
    chunk = list(chunk)
  File "/home/user/.local/pipx/venvs/sqlite-utils/lib/python3.8/site-packages/sqlite_utils/db.py", line 3454, in fix_square_braces
    if any("[" in key or "]" in key for key in record.keys()):
  File "/home/user/.local/pipx/venvs/sqlite-utils/lib/python3.8/site-packages/sqlite_utils/db.py", line 3454, in <genexpr>
    if any("[" in key or "]" in key for key in record.keys()):
TypeError: argument of type 'NoneType' is not iterable

Code:

if any("[" in key or "]" in key for key in record.keys()):

The sqlite-utils insert command-line tool is not affected by this issue.
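The failing check can be reproduced in isolation. A minimal standalone sketch (not library code), using a dict shaped the way csv.DictReader produces it for an over-long row:

```python
# A dict shaped like csv.DictReader's output for a row with too many values
record = {"id": "1", "name": "Cleo", None: ["nohead"]}

try:
    # The same membership test that fix_square_braces() performs
    any("[" in key or "]" in key for key in record.keys())
except TypeError as e:
    print(e)  # argument of type 'NoneType' is not iterable
```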

@simonw simonw added the bug Something isn't working label Jun 13, 2022

simonw commented Jun 13, 2022

rows_from_file() isn't part of the documented API but maybe it should be!


simonw commented Jun 13, 2022

Steps to demonstrate that sqlite-utils insert is not affected:

curl -o artsdatabanken.csv https://artsdatabanken.no/Fab2018/api/export/csv
sqlite-utils insert arts.db artsdatabanken artsdatabanken.csv --sniff --csv --encoding utf-16le


simonw commented Jun 13, 2022

I don't understand why that works but calling insert_all() does not.


simonw commented Jun 13, 2022

Fixing that key thing (to ignore any key that is None) revealed a new bug:

File ~/Dropbox/Development/sqlite-utils/sqlite_utils/utils.py:376, in hash_record(record, keys)
    373 if keys is not None:
    374     to_hash = {key: record[key] for key in keys}
    375 return hashlib.sha1(
--> 376     json.dumps(to_hash, separators=(",", ":"), sort_keys=True, default=repr).encode(
    377         "utf8"
    378     )
    379 ).hexdigest()

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/__init__.py:234, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    232 if cls is None:
    233     cls = JSONEncoder
--> 234 return cls(
    235     skipkeys=skipkeys, ensure_ascii=ensure_ascii,
    236     check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237     separators=separators, default=default, sort_keys=sort_keys,
    238     **kw).encode(obj)

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195         return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed.  The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File ~/.pyenv/versions/3.8.2/lib/python3.8/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

TypeError: '<' not supported between instances of 'NoneType' and 'str'
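That new failure can also be reproduced in isolation: json.dumps() with sort_keys=True cannot order a dictionary that mixes None and string keys. A minimal sketch mimicking the serialization that hash_record() performs:

```python
import json

# A row dict with the None key that csv.DictReader can produce
record = {"id": "1", None: ["nohead"]}

try:
    json.dumps(record, separators=(",", ":"), sort_keys=True, default=repr)
except TypeError as e:
    print(e)  # '<' not supported between instances of 'NoneType' and 'str'
```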


simonw commented Jun 13, 2022

Here are full steps to replicate the bug:

from urllib.request import urlopen
import sqlite_utils
db = sqlite_utils.Database(memory=True)
with urlopen("https://artsdatabanken.no/Fab2018/api/export/csv") as fab:
    reader, other = sqlite_utils.utils.rows_from_file(fab, encoding="utf-16le")
    db["fab2018"].insert_all(reader, pk="Id")


simonw commented Jun 13, 2022

Aha! I think I see what's happening here. Here's what DictReader does if one of the lines has too many items in it:

>>> import csv, io
>>> list(csv.DictReader(io.StringIO("id,name\n1,Cleo,nohead\n2,Barry")))
[{'id': '1', 'name': 'Cleo', None: ['nohead']}, {'id': '2', 'name': 'Barry'}]

See what that row with too many items gets:

{'id': '1', 'name': 'Cleo', None: ['nohead']}

That's None for the key and (weirdly) a list containing the single extra item for the value!


simonw commented Jun 13, 2022

That weird behaviour is documented here: https://docs.python.org/3/library/csv.html#csv.DictReader

If a row has more fields than fieldnames, the remaining data is put in a list and stored with the fieldname specified by restkey (which defaults to None). If a non-blank row has fewer fields than fieldnames, the missing values are filled-in with the value of restval (which defaults to None).
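Both parameters from that documentation can be seen in action on a tiny sample (the "extras" column name here is purely illustrative):

```python
import csv
import io

# restkey collects the extra value in row 1 under "extras";
# restval fills the missing value in row 2 with ""
rows = list(csv.DictReader(
    io.StringIO("id,name\n1,Cleo,nohead\n2\n"),
    restkey="extras",
    restval="",
))
print(rows)
# [{'id': '1', 'name': 'Cleo', 'extras': ['nohead']}, {'id': '2', 'name': ''}]
```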


simonw commented Jun 13, 2022

So I need to make a design decision here: what should sqlite-utils do with CSV files that have rows with more values than there are headings?

Some options:

  • Ignore those extra fields entirely - silently drop that data. I'm not keen on this.
  • Throw an error. The library does this already, but the error is incomprehensible - it could turn into a useful, human-readable error instead.
  • Put the data in a JSON list in a column with a known name (None is not a valid column name, so not that). This could be something like _restkey or _values_with_no_heading. This feels like a better option, but I'd need to carefully pick a name for it - and come up with an answer for the question of what to do if the CSV file being imported already uses that heading name for something else.


simonw commented Jun 13, 2022

Whatever I decide, I can implement it in rows_from_file(), maybe as an optional parameter - then decide how to call it from the sqlite-utils insert CLI (perhaps with a new option there too).


simonw commented Jun 13, 2022

Here's the current function signature for rows_from_file():

def rows_from_file(
    fp: BinaryIO,
    format: Optional[Format] = None,
    dialect: Optional[Type[csv.Dialect]] = None,
    encoding: Optional[str] = None,
) -> Tuple[Iterable[dict], Format]:

@simonw simonw changed the title csv.DictReader can have None as key CSV files with too many values in a row cause errors Jun 13, 2022

simonw commented Jun 13, 2022

Decision: I'm going to default to raising an exception if a row has too many values in it.

You'll be able to pass ignore_extras=True to ignore those extra values, or pass restkey="the_rest" to stick them in a list in the restkey column.
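That decided behaviour can be sketched as a wrapper around csv.DictReader. This is a hypothetical illustration, not the actual sqlite-utils implementation; clean_rows() and the RowError message text are made up here:

```python
import csv
from typing import Iterable, Optional


class RowError(Exception):
    pass


def clean_rows(
    reader: csv.DictReader,
    ignore_extras: bool = False,
    restkey: Optional[str] = None,
) -> Iterable[dict]:
    # DictReader stores extra values in a list under the None key
    for row in reader:
        if None in row:
            extras = row.pop(None)
            if restkey is not None:
                row[restkey] = extras
            elif not ignore_extras:
                raise RowError("Row had too many values: {}".format(extras))
        yield row
```

With restkey="the_rest" the extra values end up in a "the_rest" column; with neither option set, the over-long row raises RowError.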


simonw commented Jun 13, 2022

The exception will be called RowError.


simonw commented Jun 14, 2022

Interesting challenge in writing tests for this: if you give csv.Sniffer a short example with an invalid row in it, sometimes it picks the wrong delimiter!

id,name\r\n1,Cleo,oops

It decided the delimiter there was e.
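That misdetection is reproducible directly with csv.Sniffer - the "e" happens to appear exactly once in each line of this short sample, which makes it look like a consistent delimiter:

```python
import csv

# Sniff a short sample containing one row with too many values
dialect = csv.Sniffer().sniff("id,name\r\n1,Cleo,oops")
print(dialect.delimiter)  # e
```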


simonw commented Jun 14, 2022

I think that's unavoidable: it looks like csv.Sniffer only works if you feed it a CSV file with an equal number of values in each row, which is understandable.


simonw commented Jun 14, 2022

That broke mypy:

sqlite_utils/utils.py:229: error: Incompatible types in assignment (expression has type "Iterable[Dict[Any, Any]]", variable has type "DictReader[str]")


simonw commented Jun 14, 2022

Getting this past mypy is really hard!

% mypy sqlite_utils
sqlite_utils/utils.py:189: error: No overload variant of "pop" of "MutableMapping" matches argument type "None"
sqlite_utils/utils.py:189: note: Possible overload variants:
sqlite_utils/utils.py:189: note:     def pop(self, key: str) -> str
sqlite_utils/utils.py:189: note:     def [_T] pop(self, key: str, default: Union[str, _T] = ...) -> Union[str, _T]

That's because of this line:

row.pop(None)

Which is legit here - we have a dictionary where one of the keys is None and we want to remove that key. But the baked-in type is apparently def pop(self, key: str) -> str.
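One way past this (a sketch of a possible workaround, not necessarily the fix that shipped) is to cast the row so mypy accepts the None key; the pop_none_key() helper name is made up here:

```python
from typing import Any, Dict, cast


def pop_none_key(row: dict) -> None:
    # mypy infers DictReader rows as Dict[str, str], so a cast
    # (or a `# type: ignore` comment) is needed to pop the None key
    cast(Dict[Any, Any], row).pop(None, None)


record = {"id": "1", "name": "Cleo", None: ["nohead"]}
pop_none_key(record)
print(record)  # {'id': '1', 'name': 'Cleo'}
```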

simonw added a commit that referenced this issue Jun 14, 2022
simonw added a commit that referenced this issue Jun 14, 2022
simonw added a commit that referenced this issue Jun 14, 2022

simonw commented Jun 14, 2022

I'm going to rename restkey to extras_key for consistency with ignore_extras.

@simonw simonw closed this as completed in ce670e2 Jun 14, 2022

simonw commented Jun 14, 2022

I forgot to add equivalents of extras_key= and ignore_extras= to the CLI tool - will do that in a separate issue.
