Fix int overflows in the ASCII fast reader c parser #5319

saimn · 2016-09-09T13:46:40Z

Fixes #5302 : int overflow when the file size is too big.
Tested with @yannick1974, works on his example.
I don't see a mean to add a regression test, obviously we can't generate a ~4Gb file on Travis ;-).

saimn · 2016-09-09T13:58:17Z

Added changelog for 1.2, but just saw that the fast reader was added in 1.0
cc @taldcroft

taldcroft · 2016-09-09T14:36:11Z

Thanks @saimn. I agree that there is no obvious way to test this on Travis. A while ago I had played around with changing int to long in various places and ran into some issues maintaining the concordance between the Cython and C code. I'm not sure if that will apply here, hopefully tests will pass.

taldcroft · 2016-09-09T14:38:45Z

CHANGES.rst


 - ``astropy.table``

+  - Fix reading of big ascii tables with the fast reader. [#5319]


Being more specific would be useful for users. I assume the cutoff is 4 Gb file length? ASCII should be upper case here. Also, this should be in the astropy.io.ascii section, not table.

As @astrofrog noted, this entry should be moved to 1.0.x section.

Right, the limit is 2Gb, as the code used signed int.
I updated the changelog, and moved it to 1.0. Should I duplicate the line in the 1.2 section ?

pllim · 2016-09-09T15:27:11Z

Adding milestone to be consistent with change log for now. When milestone is changed, change log should also be updated accordingly.

astrofrog · 2016-09-09T16:07:30Z

Ok, I think we indeed need to backport to v1.0.x

saimn · 2016-09-09T21:21:46Z

@taldcroft

ran into some issues maintaining the concordance between the Cython and C code

Good point, as these two variables are also declared as int in the .pyx file. It seems it doesn't matter though, and I can't find the result of this in the generated c file. The doc is also not very clear ...
There are also a few variables in _read_parallel that use int for source_len and source_pos, and the bug is also valid with parallel mode, so I will update these declarations.

MSeifert04 · 2016-09-10T16:28:43Z

@saimn I just had a quick look at the compiler warnings (some were already present before) but maybe you could have a look if any of these might break things:

cparser.c
astropy\io\ascii\cparser.c(3254): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(3925): warning C4244: 'function': conversion from 'long' to 'char', possible loss of data
astropy\io\ascii\cparser.c(4747): warning C4244: '=': conversion from 'Py_ssize_t' to 'unsigned long', possible loss of data
astropy\io\ascii\cparser.c(5028): warning C4244: '=': conversion from 'Py_ssize_t' to 'unsigned long', possible loss of data
astropy\io\ascii\cparser.c(5205): warning C4244: '=': conversion from 'Py_ssize_t' to 'unsigned long', possible loss of data
astropy\io\ascii\cparser.c(5628): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(5988): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(6156): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(6466): warning C4244: 'function': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(7271): warning C4244: '=': conversion from 'Py_ssize_t' to 'unsigned long', possible loss of data
astropy\io\ascii\cparser.c(14503): warning C4244: 'function': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(17969): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(17982): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data

tokenizer.c
astropy\io\ascii\src\tokenizer.c(82): warning C4244: 'initializing': conversion from '__int64' to 'long', possible loss of data
astropy\io\ascii\src\tokenizer.c(447): warning C4018: '<': signed/unsigned mismatch
astropy\io\ascii\src\tokenizer.c(956): warning C4244: '=': conversion from '__int64' to 'int', possible loss of data

saimn · 2016-09-13T15:20:15Z

@MSeifert04 - Good point, thanks. I didn't find how to reproduce these warnings with gcc though. Most of the warnings seem unrelated to this PR. I pushed a small change because source_pos can possibly be negative, this was maybe source of one of the warnings.

juliantaylor · 2016-09-15T11:16:03Z

a long is 32 bit on windows, use Py_ssize_t, intptr_t, npy_intp, ...

juliantaylor · 2016-09-15T11:26:25Z

The correct type in regards to files is actually off_t. It also works on 32 bit machines if you have large file support enabled. Though I don't know if that exists in windows. But there is Py_off_t which should be portable.

saimn · 2016-09-19T20:43:27Z

Thanks @juliantaylor , I used off_t as it seems a reasonable choice (and I don't know from where I can import Py_off_t).
I also fixed the warnings reported by @MSeifert04

juliantaylor

some basics on C types:

int is a 32 bit (on relevant systems at least) the places where the int type is the correct type are extremely rare nowadays, an int in C/cython code should always be a red flag for reviews
long sounds like the correct type for signed sizes as it is 32 bit on 32 bit linux and 64 bit on 64 bit linux, unfortunately it is 32 bit on 64 bit windows so its usually wrong for portable code, should also be a red flag for reviews
32 bit machines can handle large files, but they can never load them into memory at once, thus there is the special off_t type which is 64 bit (if large file support is enabled) it is the correct type for offsetting into files, e.g. fseeko and ftello, it is not necessary for sizes in memory but as it always large enough its not bad either
possible sizes in memory depend on whether the machine is 32 or 64 bit, there are types in C/cython that handle that, namely intptr_t and size_t. These (in the absence of resurgence of segmented memory) will always have the correct size to index anything in memory. Thus they are the best types to be using for integers. They are always big enough and small enough at the same time for the processor. If you don't care about 32 bit performance using int64_t for memory indexing is fine too.

juliantaylor · 2016-09-20T11:06:13Z

astropy/io/ascii/cparser.pyx

        cdef char *tmp
        cdef int line_len
-        cdef int map_len = len(self.mmap)
+        cdef int map_len = <int>len(self.mmap)


this is wrong, never use int in cython code when not absolutely needed (which is very rare)

mmap is a pointer to memory so it can exceed 32 bit on 64 bit machines, use a wordsized type, like Py_ssize_t, intptr_t, npy_intp, size_t, ...

juliantaylor · 2016-09-20T11:07:21Z

astropy/io/ascii/cparser.pyx

                else:
                    name += chr(c)
-            self.width = len(self.header_names)
+            self.width = <int>len(self.header_names)


probably fine as header names is likely small, but still why risk it? use a correct type for size in memory

It is the number of columns, so hopefully it's fine.

juliantaylor · 2016-09-20T11:07:35Z

astropy/io/ascii/cparser.pyx


        if self.names:
-            self.width = len(self.names)
+            self.width = <int>len(self.names)


no int again

juliantaylor · 2016-09-20T11:09:15Z

astropy/io/ascii/cparser.pyx

        # This queue is used to signal processes to reconvert if necessary
        reconvert_queue = multiprocessing.Queue()

+        cdef int i


whats the expected size of N? is it guaranteed to be smaller than 32 bit?
or just don't risk it and use large enough type

N is the number of processes for the parallel mode.

taldcroft · 2016-09-20T12:40:03Z

This is good work and all the effort in handling arcane details of types is awesome. Some time ago I took a crack at trying to change the default int type on windows in the fast reader from int4 to int8 in #4686, but could not get it to work. This issue is currently causing a test fail in #5347. If you have ideas or would like to take over #4686 that would excellent. Having platform consistency in output types from the ASCII reader would be a good thing IMO.

saimn · 2016-09-20T14:22:31Z

Ok, thanks for the useful explanations @juliantaylor ! int is used almost everywhere in the C code ... Most of these should be fine (indexing columns/rows/...), though for map_len it was indeed a problematic case (I made the change).

saimn · 2016-11-02T10:06:12Z

Rebased to get rid off a conflict in the Changes file. Anything else I can do to get this in ?

taldcroft · 2016-11-02T22:55:07Z

@saimn - can you write a test which is skipped unless an env. var TEST_READ_HUGE_FILE (or whatever) is defined? The test then writes a sufficiently large file to temp and then reads it back and verifies the right content. The idea is to allow local testing as needed.

@MSeifert04 @juliantaylor - are you happy with this otherwise?

taldcroft · 2016-11-02T22:56:44Z

@saimn - p.s. you can [skip ci] the commit for the new test of course.

saimn · 2016-11-02T23:10:57Z

@taldcroft - Yep, good idea, I will add it in the next days.

saimn · 2016-11-07T10:28:02Z

@taldcroft - Done (and realized I forgot [ci skip] when doing the push, sorry :/)

saimn · 2016-11-07T10:28:35Z

BTW, if someone is available and can cancel the builds ...

taldcroft · 2016-11-07T12:21:49Z

@saimn - excellent, looks good. For the record (since the new test is not actually run), can you confirm that this test:

Fails with current master.
Succeeds with this PR.

saimn · 2016-11-07T13:18:15Z

Fails with current master.
Succeeds with this PR.

Yep, I tested both. However I realized that with the code given in #5302 by @yannick1974 (from which the test is adapted) we were able to read the table with master, even if it had then a wrong number of rows. And now on master I get a ValueError: Inconsistent data column lengths: set([]) and I don't find why I cannot reproduce the partial reading. I don't think it is a problem, but it is puzzling.

taldcroft · 2016-11-07T13:29:43Z

OK, well it is not reading successfully on master, and perhaps the failure mode(s) are tricky because of the overflow situation. So that is sufficient, merging! Thanks.

saimn · 2016-11-07T13:50:56Z

Ok, just to be sure I checked with 500'000 rows, and then it managed to read 54248 rows ! But with 250'000 lines it crashes.

In astropy#5319 an import of `off_t` was added, but `off_t` was moved from `posix.unistd` to `posix.types` in Cython 0.21 (cython/cython@4345655). So using a direct import avoids to require a specific Cython version.

Fix int overflows in the ASCII fast reader c parser

taldcroft added io.ascii Bug Affects-release labels Sep 9, 2016

taldcroft reviewed Sep 9, 2016
View reviewed changes

pllim added this to the v1.2.2 milestone Sep 9, 2016

pllim modified the milestones: v1.0.11, v1.2.2 Sep 9, 2016

saimn force-pushed the fix-cparser branch from 879e698 to d5a1245 Compare September 9, 2016 21:28

saimn force-pushed the fix-cparser branch from d5a1245 to 7a37007 Compare September 13, 2016 15:17

saimn force-pushed the fix-cparser branch from d5c92b6 to 1f36561 Compare September 19, 2016 13:29

juliantaylor suggested changes Sep 20, 2016

View reviewed changes

saimn and others added 6 commits November 2, 2016 10:59

Fix int overflow when the file size is too big for an int

05e7117

Add changelog

ca90b6f

Change the type of a few variables to unsigned long

e8270f8

Use off_t instead of unsigned long for file indexing

135ec56

Fix type conversion warnings

483578f

Fix a problematic use of int for get_line

4741de7

saimn force-pushed the fix-cparser branch from 55e1721 to 4741de7 Compare November 2, 2016 10:00

taldcroft self-assigned this Nov 2, 2016

Add a test to generate and read a huge file, disabled by default

6d9aff8

taldcroft merged commit ab28e3e into astropy:master Nov 7, 2016

taldcroft changed the title ~~Fix int overflow in the c parser~~ Fix int overflows in the ASCII fast reader c parser Nov 7, 2016

saimn deleted the fix-cparser branch November 7, 2016 13:35

astrofrog mentioned this pull request Nov 23, 2016

Update Cython requirement #5498

Merged

MSeifert04 mentioned this pull request Nov 23, 2016

Implementation of units in models [simplified version of #4615] #4855

Merged

astrofrog mentioned this pull request Nov 23, 2016

astropy.test() has error Segmentation fault (core dumped) #5497

Closed

saimn mentioned this pull request Nov 24, 2016

Avoid to require a more recent version of Cython #5501

Closed

eteq pushed a commit that referenced this pull request Dec 21, 2016

Merge pull request #5319 from saimn/fix-cparser

b5b3b9b

Fix int overflows in the ASCII fast reader c parser

eteq pushed a commit that referenced this pull request Dec 21, 2016

Merge pull request #5319 from saimn/fix-cparser

81356d8

Fix int overflows in the ASCII fast reader c parser

This was referenced Oct 10, 2018

Unable to read in big csv file #7871

Closed

Fix reading big csv files #7885

Merged


		- ``astropy.table``

		- Fix reading of big ascii tables with the fast reader. [#5319]

Uh oh!

Fix int overflows in the ASCII fast reader c parser #5319

Fix int overflows in the ASCII fast reader c parser #5319

Uh oh!

Conversation

saimn commented Sep 9, 2016

Uh oh!

saimn commented Sep 9, 2016

Uh oh!

taldcroft commented Sep 9, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pllim commented Sep 9, 2016

Uh oh!

astrofrog commented Sep 9, 2016

Uh oh!

saimn commented Sep 9, 2016

Uh oh!

MSeifert04 commented Sep 10, 2016

Uh oh!

saimn commented Sep 13, 2016

Uh oh!

juliantaylor commented Sep 15, 2016

Uh oh!

juliantaylor commented Sep 15, 2016

Uh oh!

saimn commented Sep 19, 2016

Uh oh!

juliantaylor left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taldcroft commented Sep 20, 2016

Uh oh!

saimn commented Sep 20, 2016

Uh oh!

saimn commented Nov 2, 2016

Uh oh!

taldcroft commented Nov 2, 2016

Uh oh!

taldcroft commented Nov 2, 2016

Uh oh!

saimn commented Nov 2, 2016

Uh oh!

saimn commented Nov 7, 2016

Uh oh!

saimn commented Nov 7, 2016

Uh oh!

taldcroft commented Nov 7, 2016

Uh oh!

saimn commented Nov 7, 2016

Uh oh!

taldcroft commented Nov 7, 2016

Uh oh!

saimn commented Nov 7, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development