Skip to content

--no-headers option for CSV and TSV #228

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
simonw opened this issue Feb 12, 2021 · 10 comments
Closed

--no-headers option for CSV and TSV #228

simonw opened this issue Feb 12, 2021 · 10 comments
Labels
enhancement New feature or request

Comments

@simonw
Copy link
Owner

simonw commented Feb 12, 2021

https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c has a fascinating CSV file that doesn't have a header row - it starts like this:

Computation and measurement of turbulent flow through idealized turbine blade passages,,"Loizou, Panos A.",https://isni.org/isni/0000000136122593,,University of Manchester,https://isni.org/isni/0000000121662407,1989,Thesis (Ph.D.),,Physical Sciences,,,https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.232781,
"Prolactin and growth hormone secretion in normal, hyperprolactinaemic and acromegalic man",,"Prescott, R. W. G.",https://isni.org/isni/0000000134992122,,University of Newcastle upon Tyne,https://isni.org/isni/0000000104627212,1983,Thesis (Ph.D.),,Biological Sciences,,,https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.232784,

It would be useful if sqlite-utils insert ... --csv had a mechanism for importing files like this one.

@simonw simonw added the enhancement New feature or request label Feb 12, 2021
@simonw
Copy link
Owner Author

simonw commented Feb 12, 2021

I could combine this with #131 to allow types to be specified in addition to column names.

Probably need an option that means "ignore the existing heading row and use this one instead".

@simonw
Copy link
Owner Author

simonw commented Feb 12, 2021

For the moment, a workaround can be to cat an additional row onto the start of the file.

echo "name,url,description" | cat - missing_headings.csv | sqlite-utils insert blah.db table - --csv

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

I'm going to detach this from the #131 column types idea.

The three things I need to handle here are:

  • The CSV file doesn't have a header row at all, so I need to specify what the column names should be
  • The CSV file DOES have a header row but I want to ignore it and use alternative column names
  • The CSV doesn't have a header row at all and I want to automatically use unknown1,unknown2... so I can start exploring it as quickly as possible.

Here's a potential design that covers the first two:

--replace-header="foo,bar,baz" - ignore whatever is in the first row and pretend it was this instead
--add-header="foo,bar,baz" - add a first row with these details, to use as the header

It doesn't cover the "give me unknown column names" case though.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

Another pattern that might be useful is to generate a header that is just "unknown1,unknown2,unknown3" for each of the columns in the rest of the file. This makes it easy to e.g. facet-explore within Datasette to figure out the correct names, then use sqlite-utils transform --rename to rename the columns.

I needed to do that for the https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c example.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

I just spotted that csv.Sniffer in the Python standard library has a .has_header(sample) method which detects if the first row appears to be a header or not, which is interesting. https://docs.python.org/3/library/csv.html#csv.Sniffer

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

Implementation tip: I have code that reads the first row and uses it as headers here:

reader = csv_std.reader(decoded, **csv_reader_args)
headers = next(reader)
docs = (dict(zip(headers, row)) for row in reader)

So If I want to use unknown1,unknown2... I can do that by reading the first row, counting the number of columns, generating headers based on that range and then continuing to build that generator (maybe with itertools.chain() to replay the record we already read).

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

I'm not convinced the .has_header() rules are useful for the kind of CSV files I work with: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L383

    def has_header(self, sample):
        # Creates a dictionary of types of data in each column. If any
        # column is of a single type (say, integers), *except* for the first
        # row, then the first row is presumed to be labels. If the type
        # can't be determined, it is assumed to be a string in which case
        # the length of the string is the determining factor: if all of the
        # rows except for the first are the same length, it's a header.
        # Finally, a 'vote' is taken at the end for each column, adding or
        # subtracting from the likelihood of the first row being a header.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

For the moment I think just adding --no-header - which causes column names "unknown1,unknown2,..." to be used - should be enough.

Users can import with that option, then use sqlite-utils transform --rename to rename them.

@simonw simonw changed the title Support reading CSVs with no header row --no-header option for CSV and TSV Feb 14, 2021
@simonw simonw changed the title --no-header option for CSV and TSV --no-headers option for CSV and TSV Feb 14, 2021
@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

I called this --no-headers for consistency with the existing output option:

click.option("--csv", is_flag=True, help="Output CSV"),
click.option("--tsv", is_flag=True, help="Output TSV"),
click.option("--no-headers", is_flag=True, help="Omit CSV headers"),
click.option("-t", "--table", is_flag=True, help="Output as a table"),

@simonw simonw closed this as completed in 50d2096 Feb 14, 2021
simonw added a commit that referenced this issue Feb 14, 2021
@agguser
Copy link

agguser commented Dec 26, 2021

--no-headers does not work?

$ echo 'a,1\nb,2' | sqlite-utils memory --no-headers -t - 'select * from stdin'
a      1                                                                                                                             
---  ---                                                                                                                             
b      2 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants