[Enhancement] Please allow 'insert-files' to insert content as text. #319
Comments
This is a good idea. |
Will need to re-title this section of the documentation: https://sqlite-utils.datasette.io/en/3.16/cli.html#inserting-binary-data-from-files - "Inserting binary data from files" will become "Inserting data from files". I'm OK with keeping the default as If the text can't be stored as |
One approach could be to make FILE_COLUMNS a generator (_get_file_columns(mode)), or you could just have a different set of columns (is there anything else that would make sense to change in the text scenario?). Regarding UTF-8, I was referring to the encoding to use when reading files. This can be difficult to auto-detect, but I believe UTF-8 is pretty much the standard for text files. |
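To make the suggestion above concrete, here is a minimal sketch of a mode-dependent column set. It is not sqlite-utils' actual code: `get_file_columns`, `build_row`, the column names, and the `mode` and `encoding` parameters are all hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path


def get_file_columns(mode="binary", encoding="utf-8"):
    """Hypothetical helper: map column names to callables that take a Path."""
    columns = {
        "name": lambda p: p.name,
        "path": lambda p: str(p),
        "size": lambda p: p.stat().st_size,
    }
    if mode == "text":
        # Text mode: decode the file while reading, using the given encoding.
        columns["content"] = lambda p: p.read_text(encoding=encoding)
    else:
        # Binary mode: keep the raw bytes.
        columns["content"] = lambda p: p.read_bytes()
    return columns


def build_row(path, columns):
    """Evaluate every column callable against one file."""
    return {name: fn(path) for name, fn in columns.items()}


# Usage: the same file yields bytes in binary mode and str in text mode.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page.html"
    sample.write_text("<p>héllo</p>", encoding="utf-8")
    binary_row = build_row(sample, get_file_columns("binary"))
    text_row = build_row(sample, get_file_columns("text"))
    print(type(binary_row["content"]).__name__, type(text_row["content"]).__name__)
```

In this sketch only the `content` column differs between the two modes; whether other columns should change in the text scenario is the open question raised in the comment.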
I'm going to assume utf-8 but allow |
Here's the error message I have working for invalid unicode:
|
Oh, I misread. Yes, some files will not be valid UTF-8. I'd throw a warning and continue (without adding that file), but if you want to get more elaborate you could let the user define a policy for what to do: skip the file, index the binary content, or use a conversion policy like the ones available in Python's decode.
|
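The policies mentioned above could look roughly like the following sketch. It relies only on Python's built-in `bytes.decode` and its `errors` argument; the function name `decode_file_bytes` and the policy names `"skip"` and `"binary"` are assumptions made for illustration, not options that sqlite-utils exposes.

```python
import warnings


def decode_file_bytes(data, encoding="utf-8", policy="skip"):
    """Decode file contents, applying a fallback policy on invalid input.

    policy (hypothetical names):
      "skip"   - warn and return None so the caller can leave the file out
      "binary" - return the original bytes unchanged
      other    - passed to bytes.decode as the errors handler,
                 e.g. "replace" or "ignore"
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as ex:
        if policy == "skip":
            warnings.warn(f"Skipping file, could not decode as {encoding}: {ex}")
            return None
        if policy == "binary":
            return data
        return data.decode(encoding, errors=policy)


# b"caf\xe9" is "café" in Latin-1 but is not valid UTF-8.
latin1_bytes = b"caf\xe9"
print(decode_file_bytes(latin1_bytes, policy="replace"))
print(decode_file_bytes(latin1_bytes, encoding="latin-1"))
```

Being consistent about the encoding, as in the last call, avoids needing a fallback policy at all.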
I had a few doubts about the design just now. Since
This does exactly the same thing as just using sqlite-utils/sqlite_utils/cli.py Lines 1851 to 1855 in 0c796cd
But actually I think that's OK - |
I thought about supporting those different policies (with something like
If someone has data that can't be translated to valid text using a known encoding, I'm happy leaving them to have to insert it into a |
I'm happy with this functionality left the way you describe. In my case the data is homogeneous but other cases would work just by being consistent on the encoding. Thanks a lot, Simon! |
'insert-files' creates BLOB columns for file contents. Transforming the column to TEXT still keeps the content as binary. Even though I'm sure there is a transform that can decode the text, it would be great to have an argument that makes 'insert-files' insert the content as text (with an optional text encoding).
The use case is a set of single-file HTML pages in a directory structure which, once inserted with this command, could be served in Datasette with full-text search.
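The BLOB-versus-TEXT behaviour described in the issue can be reproduced with Python's standard `sqlite3` module alone. This is a small sketch, not the sqlite-utils implementation: the `files` table and its columns are illustrative, and the `CAST` step assumes the stored bytes are valid in the database encoding (UTF-8 by default).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, content TEXT)")

html = "<h1>Héllo</h1>"
# Python bytes are stored as a BLOB value, even in a column declared TEXT.
conn.execute(
    "INSERT INTO files VALUES (?, ?)", ("as_bytes.html", html.encode("utf-8"))
)
# A Python str is stored as a TEXT value.
conn.execute("INSERT INTO files VALUES (?, ?)", ("as_text.html", html))

before = dict(conn.execute("SELECT path, typeof(content) FROM files"))
print(before)

# Converting existing rows: CAST interprets the BLOB bytes as text
# in the database encoding.
conn.execute(
    "UPDATE files SET content = CAST(content AS TEXT) "
    "WHERE typeof(content) = 'blob'"
)
after = dict(conn.execute("SELECT path, typeof(content) FROM files"))
print(after)
```

Decoding in Python before inserting, as in the second `INSERT`, avoids the conversion step and lets the caller choose the encoding.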