
[Enhancement] Please allow 'insert-files' to insert content as text. #319


Closed
pjamargh opened this issue Aug 22, 2021 · 10 comments
Labels
cli-tool, enhancement (New feature or request)

Comments

@pjamargh

'insert-files' creates BLOB columns for file contents. Transforming the column to TEXT still keeps the content as binary. Even though I'm sure there is a transform that could be applied to decode the content, it would be great to have an argument that makes 'insert-files' store it as text (with an optional text encoding).

The use case is a bunch of single-file HTML documents in a directory structure which, inserted with this command, could be served in Datasette with full-text search enabled.
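Until such an option exists, the use case can be approximated by hand with the standard library: decode each file before inserting it, so the column ends up as TEXT rather than BLOB. A minimal sketch (the table name, column names, and function name are illustrative, not part of sqlite-utils):

```python
import sqlite3
from pathlib import Path

def insert_files_as_text(db_path, root, encoding="utf-8"):
    """Walk a directory tree and store each HTML file as TEXT, not BLOB."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, content_text TEXT, size INTEGER)"
    )
    for path in Path(root).rglob("*.html"):
        raw = path.read_bytes()
        # Decoding here is what makes the stored value TEXT;
        # inserting the raw bytes would produce a BLOB column instead.
        db.execute(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
            (str(path), raw.decode(encoding), len(raw)),
        )
    db.commit()
    return db
```

A TEXT column like content_text can then be indexed for full-text search and served in Datasette.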

simonw added the cli-tool and enhancement labels Aug 24, 2021
simonw (Owner) commented Aug 24, 2021

This is a good idea.

simonw (Owner) commented Aug 24, 2021

Will need to re-title this section of the documentation: https://sqlite-utils.datasette.io/en/3.16/cli.html#inserting-binary-data-from-files - "Inserting binary data from files" will become "Inserting data from files"

I'm OK with keeping the default as BLOB but I could add a --text option which stores the content as text instead.

If the text can't be stored as utf-8 I'll probably raise an error.
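That failure mode could be sketched as follows, using a hypothetical helper (read_file_as_text is not a real sqlite-utils function): read the raw bytes, decode eagerly, and turn the UnicodeDecodeError into a readable error message:

```python
def read_file_as_text(path, encoding="utf-8"):
    # Hypothetical helper: read raw bytes, then decode, raising a
    # human-readable error instead of a bare UnicodeDecodeError.
    with open(path, "rb") as fp:
        raw = fp.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as ex:
        raise ValueError(f"Could not read file '{path}' as text\n\n{ex}")
```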

pjamargh (Author) commented

I was thinking one approach could be making FILE_COLUMNS a generator (_get_file_columns(mode)), or you could just use a different set of columns (is there anything else that would make sense to change in the text scenario?).

Regarding UTF-8, I was referring to the encoding used when reading files. This can be difficult to auto-detect, but I believe UTF-8 is pretty much the standard for text files.

simonw (Owner) commented Aug 24, 2021

I'm going to assume utf-8 but allow --encoding to be used to specify something different, since that option is already supported by other commands.

simonw (Owner) commented Aug 24, 2021

Here's the error message I have working for invalid unicode:

sqlite-utils insert-files /tmp/text.db files *.txt --text
  [------------------------------------]    0%
Error: Could not read file '/Users/simon/Dropbox/Development/sqlite-utils/data.txt' as text

'utf-8' codec can't decode byte 0xe3 in position 83: invalid continuation byte

pjamargh (Author) commented

Oh, I misread. Yes, some files will not be valid UTF-8. I'd throw a warning and continue (not adding that file), but if you want to get more elaborate you could allow users to define a policy for what to do: skip the file, index the binary content, or apply a conversion policy like the ones available in Python's decode.

From https://stackoverflow.com/questions/24616678/unicodedecodeerror-in-python-when-reading-a-file-how-to-ignore-the-error-and-ju :

  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.
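These handlers map directly onto the errors argument of bytes.decode. A quick illustration using the 0xe3 byte from the error message earlier in this thread:

```python
bad = b"caf\xe3"  # 0xe3 on its own is not valid UTF-8

print(bad.decode("utf-8", errors="ignore"))   # 'caf' - the bad byte is dropped
print(bad.decode("utf-8", errors="replace"))  # 'caf\ufffd' - replacement marker
lossless = bad.decode("utf-8", errors="surrogateescape")
print(lossless)                               # 'caf\udce3' - private-use surrogate
# surrogateescape round-trips the original bytes exactly:
print(lossless.encode("utf-8", errors="surrogateescape") == bad)  # True
```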

simonw (Owner) commented Aug 24, 2021

I had a few doubts about the design just now. Since content_text is supported as a special argument, an alternative way of handling the above would be:

sqlite-utils insert-files /tmp/text.db files *.txt -c path -c content_text -c size

This does exactly the same thing as just using --text and not specifying any columns, because the actual implementation of --text is as follows:

if not column:
    if text:
        column = ["path:path", "content_text:content_text", "size:size"]
    else:
        column = ["path:path", "content:content", "size:size"]

But actually I think that's OK - --text is a useful shorthand that saves you from having to remember how to manually specify those columns with -c. So I'm going to leave the design as-is.

simonw (Owner) commented Aug 24, 2021

> Oh, I misread. Yes, some files will not be valid UTF-8. I'd throw a warning and continue (not adding that file), but if you want to get more elaborate you could allow users to define a policy for what to do: skip the file, index the binary content, or apply a conversion policy like the ones available in Python's decode.

I thought about supporting those different policies (with something like --errors ignore) but I feel like that's getting a little bit too deep into the weeds. Right now if you try to import an invalid file the behaviour is the same as for the sqlite-utils insert command (I added the same detailed error message):

Error: Could not read file '/Users/simon/Dropbox/Development/sqlite-utils/data.txt' as text

'utf-8' codec can't decode byte 0xe3 in position 83: invalid continuation byte

The input you provided uses a character encoding other than utf-8.

You can fix this by passing the --encoding= option with the encoding of the file.

If you do not know the encoding, running 'file filename.csv' may tell you.

It's often worth trying: --encoding=latin-1

If someone has data that can't be translated to valid text using a known encoding, I'm happy leaving them to have to insert it into a BLOB column instead.
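The suggestion to try --encoding=latin-1 works because latin-1 assigns every byte value 0x00-0xff its own code point, so decoding with it can never raise (though the result may be mojibake if the file was really in some other encoding). A quick demonstration:

```python
every_byte = bytes(range(256))
text = every_byte.decode("latin-1")  # never raises, whatever the input
assert len(text) == 256
# The mapping is also lossless: re-encoding recovers the original bytes.
assert text.encode("latin-1") == every_byte
```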

pjamargh (Author) commented

I'm happy with this functionality the way you describe it. In my case the data is homogeneous, but other cases would work too as long as the encoding is consistent. Thanks a lot, Simon!

@simonw simonw closed this as completed Aug 24, 2021
simonw added a commit that referenced this issue Aug 24, 2021
simonw added a commit that referenced this issue Aug 24, 2021