fread handling of row names with header=TRUE #5634

@MLopez-Ibanez

library(data.table)
txt <- '
  foo  foo2
1   0 false
2   1    NA
'
print(read.table(text = txt, header = TRUE))
print(fread(text = txt, header="auto"))
print(fread(text = txt, header = TRUE))

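read.table() handles the row names as expected: the header line has one fewer field than the data lines, so the first column is used as row names.
fread(header="auto") gives a warning and adds an extra column for the row names. This is not ideal, but it can be worked around with suppressWarnings() and by removing the extra column.
fread(header=TRUE) should do the same, but instead gives: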

   1 0 false
1: 2 1  <NA>

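which is completely wrong: the real header line (foo, foo2) is dropped and the first data row is used as the column names, leaving only one data row.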
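
The header="auto" workaround described above can be sketched as follows. This is only a minimal illustration: it assumes the guessed row-name column is called V1, as in the output of this report, and note that suppressWarnings() hides any other warning fread might raise as well.

```r
library(data.table)

txt <- '
  foo  foo2
1   0 false
2   1    NA
'

# Silence the "Detected 2 column names but the data has 3 columns" warning.
# Caveat: this also hides any other warning raised by fread().
dt <- suppressWarnings(fread(text = txt, header = "auto"))

# Assumption: the auto-added row-name column is named "V1", as in this report.
if ("V1" %in% names(dt)) {
  set(dt, j = "V1", value = NULL)  # drop the column by reference
}

print(dt)
```

After this, dt keeps only the foo and foo2 columns, with the two data rows intact.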

Verbose output:

fread(text=txt, header="auto", verbose=TRUE)
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            4
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          4
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 2 threads with throttle==1024. See ?setDTthreads.
Input contains a \n or is ")". Taking this to be text input (not a filename)
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=4, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  `input` argument is provided rather than a file name, interpreting as raw text to read
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 2 starting: <<  foo  foo2>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=' '  with 2 lines of 3 fields using quote rule 0
  Detected 3 columns on line 2. This line is either column names or first data row. Line starts as: <<1   0 false>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 3
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 1 because (24 bytes from row 1 to eof) / (2 * 24 jump0size) == 0
  Type codes (jump 000)    : 55C  Quote rule 0
Types in 1st data row match types in 2nd data row but previous row has 2 fields. Taking previous row as column names.  All rows were sampled since file is small so we know nrow=1 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 55C
[10] Allocate memory for the datatable
  Allocating 3 column slots (3 - 0 dropped) with 1 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=24
  Too few rows allocated. Allocating additional 1024 rows (now nrows=1025) and continue reading from jump 0
  jumps=[0..1), chunk_size=1048576, total_size=24
Read 2 rows x 3 columns from 37 bytes file in 00:00.001 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : int32     '5'
         1 : string    'C'
=============================
   0.000s ( 18%) Memory map 0.000GB file
   0.001s ( 61%) sep=' ' ncol=3 and header detection
   0.000s (  4%) Column type detection using 1 sample rows
   0.000s (  4%) Allocation of 1025 rows x 3 cols (0.000GB) of which 2 (  0%) rows used
   0.000s ( 13%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 2 rows) using 1 threads
   +    0.000s (  1%) Parse to row-major thread buffers (grown 0 times)
   +    0.000s (  0%) Transpose
   +    0.000s ( 12%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.001s        Total
   V1 foo  foo2
1:  1   0 false
2:  2   1  <NA>
Warning message:
In fread(text = txt, header = "auto",  :
  Detected 2 column names but the data has 3 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.
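
The warning above suggests setnames() when the guessed column should be kept under a different name. A minimal sketch, assuming the auto-added column is named V1 as in this output; the new name row_id is only an illustrative choice:

```r
library(data.table)

txt <- '
  foo  foo2
1   0 false
2   1    NA
'

dt <- suppressWarnings(fread(text = txt, header = "auto"))

# Assumption: fread() named the guessed row-name column "V1".
# "row_id" is a hypothetical replacement name chosen for this example.
setnames(dt, "V1", "row_id")

print(names(dt))
```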

fread(text=txt, header=TRUE, verbose=TRUE)
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            4
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          4
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 2 threads with throttle==1024. See ?setDTthreads.
Input contains a \n or is ")". Taking this to be text input (not a filename)
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=4, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  `input` argument is provided rather than a file name, interpreting as raw text to read
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 2 starting: <<  foo  foo2>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=' '  with 2 lines of 3 fields using quote rule 0
  Detected 3 columns on line 2. This line is either column names or first data row. Line starts as: <<1   0 false>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 3
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 1 because (24 bytes from row 1 to eof) / (2 * 24 jump0size) == 0
  Type codes (jump 000)    : 552  Quote rule 0
  All rows were sampled since file is small so we know nrow=1 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 552
[10] Allocate memory for the datatable
  Allocating 3 column slots (3 - 0 dropped) with 1 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=12
Read 1 rows x 3 columns from 37 bytes file in 00:00.001 wall clock time
[12] Finalizing the datatable
  Type counts:
         1 : bool8     '2'
         2 : int32     '5'
=============================
   0.000s ( 24%) Memory map 0.000GB file
   0.000s ( 53%) sep=' ' ncol=3 and header detection
   0.000s (  6%) Column type detection using 1 sample rows
   0.000s (  6%) Allocation of 1 rows x 3 cols (0.000GB) of which 1 (100%) rows used
   0.000s ( 11%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 1 rows) using 1 threads
   +    0.000s (  1%) Parse to row-major thread buffers (grown 0 times)
   +    0.000s (  0%) Transpose
   +    0.000s ( 10%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.001s        Total
   1 0 false
1: 2 1    NA
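
While header=TRUE behaves this way, one possible fallback is to let read.table() parse the text, since it handles this layout as expected, and then convert the result. This is a sketch of a workaround rather than a fix, and it gives up fread's speed on large inputs:

```r
library(data.table)

txt <- '
  foo  foo2
1   0 false
2   1    NA
'

# read.table() treats the first column as row names because the header
# line has one fewer field than the data lines.
df <- read.table(text = txt, header = TRUE)

# as.data.table() drops row names by default (keep.rownames = FALSE).
dt <- as.data.table(df)

print(dt)
```

Passing keep.rownames = TRUE to as.data.table() would instead keep the row names as an extra column.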
