--no-headers option for CSV and TSV #228

simonw · 2021-02-12T17:56:51Z

https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c has a fascinating CSV file that doesn't have a header row - it starts like this:

Computation and measurement of turbulent flow through idealized turbine blade passages,,"Loizou, Panos A.",https://isni.org/isni/0000000136122593,,University of Manchester,https://isni.org/isni/0000000121662407,1989,Thesis (Ph.D.),,Physical Sciences,,,https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.232781,
"Prolactin and growth hormone secretion in normal, hyperprolactinaemic and acromegalic man",,"Prescott, R. W. G.",https://isni.org/isni/0000000134992122,,University of Newcastle upon Tyne,https://isni.org/isni/0000000104627212,1983,Thesis (Ph.D.),,Biological Sciences,,,https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.232784,

It would be useful if sqlite-utils insert ... --csv had a mechanism for importing files like this one.

The text was updated successfully, but these errors were encountered:

simonw · 2021-02-12T18:00:43Z

I could combine this with #131 to allow types to be specified in addition to column names.

Probably need an option that means "ignore the existing heading row and use this one instead".

simonw · 2021-02-12T23:27:50Z

For the moment, a workaround can be to cat an additional row onto the start of the file.

echo "name,url,description" | cat - missing_headings.csv | sqlite-utils insert blah.db table - --csv

simonw · 2021-02-14T17:39:30Z

I'm going to detach this from the #131 column types idea.

The three things I need to handle here are:

The CSV file doesn't have a header row at all, so I need to specify what the column names should be
The CSV file DOES have a header row but I want to ignore it and use alternative column names
The CSV doesn't have a header row at all and I want to automatically use unknown1,unknown2... so I can start exploring it as quickly as possible.

Here's a potential design that covers the first two:

--replace-header="foo,bar,baz" - ignore whatever is in the first row and pretend it was this instead
--add-header="foo,bar,baz" - add a first row with these details, to use as the header

It doesn't cover the "give me unknown column names" case though.

simonw · 2021-02-14T17:40:48Z

Another pattern that might be useful is to generate a header that is just "unknown1,unknown2,unknown3" for each of the columns in the rest of the file. This makes it easy to e.g. facet-explore within Datasette to figure out the correct names, then use sqlite-utils transform --rename to rename the columns.

I needed to do that for the https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c example.

simonw · 2021-02-14T17:41:30Z

I just spotted that csv.Sniffer in the Python standard library has a .has_header(sample) method which detects if the first row appears to be a header or not, which is interesting. https://docs.python.org/3/library/csv.html#csv.Sniffer

simonw · 2021-02-14T21:15:11Z

Implementation tip: I have code that reads the first row and uses it as headers here:

sqlite-utils/sqlite_utils/cli.py

Lines 689 to 691 in 8f042ae

    
           reader = csv_std.reader(decoded, **csv_reader_args) 
        
           headers = next(reader) 
        
           docs = (dict(zip(headers, row)) for row in reader)

So If I want to use unknown1,unknown2... I can do that by reading the first row, counting the number of columns, generating headers based on that range and then continuing to build that generator (maybe with itertools.chain() to replay the record we already read).

simonw · 2021-02-14T21:15:43Z

I'm not convinced the .has_header() rules are useful for the kind of CSV files I work with: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L383

    def has_header(self, sample):
        # Creates a dictionary of types of data in each column. If any
        # column is of a single type (say, integers), *except* for the first
        # row, then the first row is presumed to be labels. If the type
        # can't be determined, it is assumed to be a string in which case
        # the length of the string is the determining factor: if all of the
        # rows except for the first are the same length, it's a header.
        # Finally, a 'vote' is taken at the end for each column, adding or
        # subtracting from the likelihood of the first row being a header.

simonw · 2021-02-14T22:06:53Z

For the moment I think just adding --no-header - which causes column names "unknown1,unknown2,..." to be used - should be enough.

Users can import with that option, then use sqlite-utils transform --rename to rename them.

simonw · 2021-02-14T22:23:46Z

I called this --no-headers for consistency with the existing output option:

sqlite-utils/sqlite_utils/cli.py

Lines 61 to 64 in 427dace

    
           click.option("--csv", is_flag=True, help="Output CSV"), 
        
           click.option("--tsv", is_flag=True, help="Output TSV"), 
        
           click.option("--no-headers", is_flag=True, help="Omit CSV headers"), 
        
           click.option("-t", "--table", is_flag=True, help="Output as a table"),

Refs #228, #229, #230, #231, #232, #234

agguser · 2021-12-26T07:01:31Z

--no-headers does not work?

$ echo 'a,1\nb,2' | sqlite-utils memory --no-headers -t - 'select * from stdin'
a      1                                                                                                                             
---  ---                                                                                                                             
b      2

simonw added the enhancement label Feb 12, 2021

simonw mentioned this issue Feb 14, 2021

--sniff option for sniffing delimiters #230

Closed

simonw changed the title ~~Support reading CSVs with no header row~~ --no-header option for CSV and TSV Feb 14, 2021

simonw changed the title ~~--no-header option for CSV and TSV~~ --no-headers option for CSV and TSV Feb 14, 2021

simonw closed this as completed in 50d2096 Feb 14, 2021

simonw added a commit that referenced this issue Feb 14, 2021

Release 3.5

8fcaee0

Refs #228, #229, #230, #231, #232, #234

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

--no-headers option for CSV and TSV #228

--no-headers option for CSV and TSV #228

simonw commented Feb 12, 2021

simonw commented Feb 12, 2021

simonw commented Feb 12, 2021

simonw commented Feb 14, 2021 •

edited

Loading

simonw commented Feb 14, 2021

simonw commented Feb 14, 2021

simonw commented Feb 14, 2021

simonw commented Feb 14, 2021

simonw commented Feb 14, 2021

simonw commented Feb 14, 2021

agguser commented Dec 26, 2021

--no-headers option for CSV and TSV #228

--no-headers option for CSV and TSV #228

Comments

simonw commented Feb 12, 2021

simonw commented Feb 12, 2021

simonw commented Feb 12, 2021

simonw commented Feb 14, 2021 • edited Loading

simonw commented Feb 14, 2021

simonw commented Feb 14, 2021

simonw commented Feb 14, 2021

simonw commented Feb 14, 2021

simonw commented Feb 14, 2021

simonw commented Feb 14, 2021

agguser commented Dec 26, 2021

simonw commented Feb 14, 2021 •

edited

Loading