
[Enhancement] Please allow 'insert-files' to insert content as text. #319


Closed
pjamargh opened this issue Aug 22, 2021 · 10 comments
Labels
cli-tool, enhancement (New feature or request)

Comments

@pjamargh

'insert-files' creates BLOB columns for file contents. Transforming the column to TEXT still keeps the content as binary. Even though I'm sure there is a transform that could be applied to decode the content, it would be great to have an argument that makes 'insert-files' store it as text (with an optional text encoding).

The use case is a bunch of single-file HTML documents in a directory structure which, inserted with this command, could be served in Datasette with full-text search enabled.
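Until such an option exists, the use case can be approximated by hand with the standard library: decode each file before inserting it, so the column ends up as TEXT rather than BLOB. A minimal sketch (the table name, column names, and function name are illustrative, not part of sqlite-utils):

```python
import sqlite3
from pathlib import Path

def insert_files_as_text(db_path, root, encoding="utf-8"):
    """Walk a directory tree and store each HTML file as TEXT, not BLOB."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, content_text TEXT, size INTEGER)"
    )
    for path in Path(root).rglob("*.html"):
        raw = path.read_bytes()
        # Decoding here is what makes the stored value TEXT;
        # inserting the raw bytes would produce a BLOB column instead.
        db.execute(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
            (str(path), raw.decode(encoding), len(raw)),
        )
    db.commit()
    return db
```

A TEXT column like content_text can then be indexed for full-text search and served in Datasette.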

simonw added the cli-tool and enhancement labels Aug 24, 2021
simonw (Owner) commented Aug 24, 2021

This is a good idea.

simonw (Owner) commented Aug 24, 2021

Will need to re-title this section of the documentation: https://sqlite-utils.datasette.io/en/3.16/cli.html#inserting-binary-data-from-files - "Inserting binary data from files" will become "Inserting data from files"

I'm OK with keeping the default as BLOB but I could add a --text option which stores the content as text instead.

If the text can't be stored as utf-8 I'll probably raise an error.
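That failure mode could be sketched as follows, using a hypothetical helper (read_file_as_text is not a real sqlite-utils function): read the raw bytes, decode eagerly, and turn the UnicodeDecodeError into a readable error message:

```python
def read_file_as_text(path, encoding="utf-8"):
    # Hypothetical helper: read raw bytes, then decode, raising a
    # human-readable error instead of a bare UnicodeDecodeError.
    with open(path, "rb") as fp:
        raw = fp.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as ex:
        raise ValueError(f"Could not read file '{path}' as text\n\n{ex}")
```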

pjamargh (Author) commented

I was thinking one approach could be making FILE_COLUMNS a generator (_get_file_columns(mode)), or you could just use a different set of columns (is there anything else that would make sense to change in the text scenario?).

Regarding UTF-8, I was referring to the encoding used when reading files. This can be difficult to auto-detect, but I believe UTF-8 is pretty much the standard for text files.

simonw (Owner) commented Aug 24, 2021

I'm going to assume utf-8 but allow --encoding to be used to specify something different, since that option is already supported by other commands.

simonw (Owner) commented Aug 24, 2021

Here's the error message I have working for invalid unicode:

sqlite-utils insert-files /tmp/text.db files *.txt --text
  [------------------------------------]    0%
Error: Could not read file '/Users/simon/Dropbox/Development/sqlite-utils/data.txt' as text

'utf-8' codec can't decode byte 0xe3 in position 83: invalid continuation byte

pjamargh (Author) commented

Oh, I misread. Yes, some files will not be valid UTF-8. I'd throw a warning and continue (not adding that file), but if you want to get more elaborate you could allow users to define a policy for what to do: skip the file, index the binary content, or apply a conversion policy like the ones available in Python's decode.

From https://stackoverflow.com/questions/24616678/unicodedecodeerror-in-python-when-reading-a-file-how-to-ignore-the-error-and-ju :

  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.
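These handlers map directly onto the errors argument of bytes.decode. A quick illustration using the 0xe3 byte from the error message earlier in this thread:

```python
bad = b"caf\xe3"  # 0xe3 on its own is not valid UTF-8

print(bad.decode("utf-8", errors="ignore"))   # 'caf' - the bad byte is dropped
print(bad.decode("utf-8", errors="replace"))  # 'caf\ufffd' - replacement marker
lossless = bad.decode("utf-8", errors="surrogateescape")
print(lossless)                               # 'caf\udce3' - private-use surrogate
# surrogateescape round-trips the original bytes exactly:
print(lossless.encode("utf-8", errors="surrogateescape") == bad)  # True
```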

simonw (Owner) commented Aug 24, 2021

I had a few doubts about the design just now. Since content_text is supported as a special argument, an alternative way of handling the above would be:

sqlite-utils insert-files /tmp/text.db files *.txt -c path -c content_text -c size

This does exactly the same thing as just using --text and not specifying any columns, because the actual implementation of --text is as follows:

if not column:
    if text:
        column = ["path:path", "content_text:content_text", "size:size"]
    else:
        column = ["path:path", "content:content", "size:size"]

But actually I think that's OK - --text is a useful shorthand that saves you from having to remember how to manually specify those columns with -c. So I'm going to leave the design as-is.

simonw (Owner) commented Aug 24, 2021

> Oh, I misread. Yes, some files will not be valid UTF-8. I'd throw a warning and continue (not adding that file), but if you want to get more elaborate you could allow users to define a policy for what to do: skip the file, index the binary content, or apply a conversion policy like the ones available in Python's decode.

I thought about supporting those different policies (with something like --errors ignore) but I feel like that's getting a little bit too deep into the weeds. Right now if you try to import an invalid file the behaviour is the same as for the sqlite-utils insert command (I added the same detailed error message):

Error: Could not read file '/Users/simon/Dropbox/Development/sqlite-utils/data.txt' as text

'utf-8' codec can't decode byte 0xe3 in position 83: invalid continuation byte

The input you provided uses a character encoding other than utf-8.

You can fix this by passing the --encoding= option with the encoding of the file.

If you do not know the encoding, running 'file filename.csv' may tell you.

It's often worth trying: --encoding=latin-1

If someone has data that can't be translated to valid text using a known encoding, I'm happy leaving them to have to insert it into a BLOB column instead.
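The suggestion to try --encoding=latin-1 works because latin-1 assigns every byte value 0x00-0xff its own code point, so decoding with it can never raise (though the result may be mojibake if the file was really in some other encoding). A quick demonstration:

```python
every_byte = bytes(range(256))
text = every_byte.decode("latin-1")  # never raises, whatever the input
assert len(text) == 256
# The mapping is also lossless: re-encoding recovers the original bytes.
assert text.encode("latin-1") == every_byte
```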

pjamargh (Author) commented

I'm happy with this functionality the way you describe it. In my case the data is homogeneous, but other cases would work too as long as the encoding is consistent. Thanks a lot, Simon!

@simonw simonw closed this as completed Aug 24, 2021
simonw added a commit that referenced this issue Aug 24, 2021
simonw added a commit that referenced this issue Aug 24, 2021