-
-
Notifications
You must be signed in to change notification settings - Fork 2k
Fix int overflows in the ASCII fast reader c parser #5319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Added changelog for 1.2, but just saw that the fast reader was added in 1.0 |
|
Thanks @saimn. I agree that there is no obvious way to test this on Travis. A while ago I had played around with changing |
CHANGES.rst
Outdated
|
|
||
| - ``astropy.table`` | ||
|
|
||
| - Fix reading of big ascii tables with the fast reader. [#5319] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being more specific would be useful for users. I assume the cutoff is 4 Gb file length? ASCII should be upper case here. Also, this should be in the astropy.io.ascii section, not table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @astrofrog noted, this entry should be moved to 1.0.x section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the limit is 2Gb, as the code used signed int.
I updated the changelog, and moved it to 1.0. Should I duplicate the line in the 1.2 section ?
|
Adding milestone to be consistent with change log for now. When milestone is changed, change log should also be updated accordingly. |
|
Ok, I think we indeed need to backport to v1.0.x |
Good point, as these two variables are also declared as int in the .pyx file. It seems it doesn't matter though, and I can't find the result of this in the generated c file. The doc is also not very clear ... |
|
@saimn I just had a quick look at the compiler warnings (some were already present before) but maybe you could have a look if any of these might break things: |
|
@MSeifert04 - Good point, thanks. I didn't find how to reproduce these warnings with gcc though. Most of the warnings seem unrelated to this PR. I pushed a small change because |
|
a long is 32 bit on windows, use Py_ssize_t, intptr_t, npy_intp, ... |
|
The correct type in regards to files is actually off_t. It also works on 32 bit machines if you have large file support enabled. Though I don't know if that exists in windows. But there is Py_off_t which should be portable. |
|
Thanks @juliantaylor , I used |
juliantaylor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some basics on C types:
- int is a 32 bit (on relevant systems at least) the places where the int type is the correct type are extremely rare nowadays, an int in C/cython code should always be a red flag for reviews
- long sounds like the correct type for signed sizes as it is 32 bit on 32 bit linux and 64 bit on 64 bit linux, unfortunately it is 32 bit on 64 bit windows so its usually wrong for portable code, should also be a red flag for reviews
- 32 bit machines can handle large files, but they can never load them into memory at once, thus there is the special off_t type which is 64 bit (if large file support is enabled) it is the correct type for offsetting into files, e.g. fseeko and ftello, it is not necessary for sizes in memory but as it always large enough its not bad either
- possible sizes in memory depend on whether the machine is 32 or 64 bit, there are types in C/cython that handle that, namely intptr_t and size_t. These (in the absence of resurgence of segmented memory) will always have the correct size to index anything in memory. Thus they are the best types to be using for integers. They are always big enough and small enough at the same time for the processor. If you don't care about 32 bit performance using int64_t for memory indexing is fine too.
astropy/io/ascii/cparser.pyx
Outdated
| cdef char *tmp | ||
| cdef int line_len | ||
| cdef int map_len = len(self.mmap) | ||
| cdef int map_len = <int>len(self.mmap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is wrong, never use int in cython code when not absolutely needed (which is very rare)
mmap is a pointer to memory so it can exceed 32 bit on 64 bit machines, use a wordsized type, like Py_ssize_t, intptr_t, npy_intp, size_t, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
| else: | ||
| name += chr(c) | ||
| self.width = len(self.header_names) | ||
| self.width = <int>len(self.header_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably fine as header names is likely small, but still why risk it? use a correct type for size in memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the number of columns, so hopefully it's fine.
|
|
||
| if self.names: | ||
| self.width = len(self.names) | ||
| self.width = <int>len(self.names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no int again
| # This queue is used to signal processes to reconvert if necessary | ||
| reconvert_queue = multiprocessing.Queue() | ||
|
|
||
| cdef int i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the expected size of N? is it guaranteed to be smaller than 32 bit?
or just don't risk it and use large enough type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N is the number of processes for the parallel mode.
|
This is good work and all the effort in handling arcane details of types is awesome. Some time ago I took a crack at trying to change the default int type on windows in the fast reader from int4 to int8 in #4686, but could not get it to work. This issue is currently causing a test fail in #5347. If you have ideas or would like to take over #4686 that would excellent. Having platform consistency in output types from the ASCII reader would be a good thing IMO. |
|
Ok, thanks for the useful explanations @juliantaylor ! |
|
Rebased to get rid off a conflict in the Changes file. Anything else I can do to get this in ? |
|
@saimn - can you write a test which is skipped unless an env. var TEST_READ_HUGE_FILE (or whatever) is defined? The test then writes a sufficiently large file to temp and then reads it back and verifies the right content. The idea is to allow local testing as needed. @MSeifert04 @juliantaylor - are you happy with this otherwise? |
|
@saimn - p.s. you can |
|
@taldcroft - Yep, good idea, I will add it in the next days. |
|
@taldcroft - Done (and realized I forgot |
|
BTW, if someone is available and can cancel the builds ... |
|
@saimn - excellent, looks good. For the record (since the new test is not actually run), can you confirm that this test:
|
Yep, I tested both. However I realized that with the code given in #5302 by @yannick1974 (from which the test is adapted) we were able to read the table with master, even if it had then a wrong number of rows. And now on master I get a |
|
OK, well it is not reading successfully on master, and perhaps the failure mode(s) are tricky because of the overflow situation. So that is sufficient, merging! Thanks. |
|
Ok, just to be sure I checked with 500'000 rows, and then it managed to read 54248 rows ! But with 250'000 lines it crashes. |
In astropy#5319 an import of `off_t` was added, but `off_t` was moved from `posix.unistd` to `posix.types` in Cython 0.21 (cython/cython@4345655). So using a direct import avoids to require a specific Cython version.
In astropy#5319 an import of `off_t` was added, but `off_t` was moved from `posix.unistd` to `posix.types` in Cython 0.21 (cython/cython@4345655). So using a direct import avoids to require a specific Cython version.
Fix int overflows in the ASCII fast reader c parser
Fix int overflows in the ASCII fast reader c parser
Fixes #5302 : int overflow when the file size is too big.
Tested with @yannick1974, works on his example.
I don't see a mean to add a regression test, obviously we can't generate a ~4Gb file on Travis ;-).