Skip to content

Conversation

@saimn
Copy link
Contributor

@saimn saimn commented Sep 9, 2016

Fixes #5302 : int overflow when the file size is too big.
Tested with @yannick1974, works on his example.
I don't see a mean to add a regression test, obviously we can't generate a ~4Gb file on Travis ;-).

@saimn
Copy link
Contributor Author

saimn commented Sep 9, 2016

Added changelog for 1.2, but just saw that the fast reader was added in 1.0
cc @taldcroft

@taldcroft
Copy link
Member

Thanks @saimn. I agree that there is no obvious way to test this on Travis. A while ago I had played around with changing int to long in various places and ran into some issues maintaining the concordance between the Cython and C code. I'm not sure if that will apply here, hopefully tests will pass.

CHANGES.rst Outdated

- ``astropy.table``

- Fix reading of big ascii tables with the fast reader. [#5319]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being more specific would be useful for users. I assume the cutoff is 4 Gb file length? ASCII should be upper case here. Also, this should be in the astropy.io.ascii section, not table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @astrofrog noted, this entry should be moved to 1.0.x section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the limit is 2Gb, as the code used signed int.
I updated the changelog, and moved it to 1.0. Should I duplicate the line in the 1.2 section ?

@pllim pllim added this to the v1.2.2 milestone Sep 9, 2016
@pllim
Copy link
Member

pllim commented Sep 9, 2016

Adding milestone to be consistent with change log for now. When milestone is changed, change log should also be updated accordingly.

@astrofrog
Copy link
Member

Ok, I think we indeed need to backport to v1.0.x

@pllim pllim modified the milestones: v1.0.11, v1.2.2 Sep 9, 2016
@saimn
Copy link
Contributor Author

saimn commented Sep 9, 2016

@taldcroft

ran into some issues maintaining the concordance between the Cython and C code

Good point, as these two variables are also declared as int in the .pyx file. It seems it doesn't matter though, and I can't find the result of this in the generated c file. The doc is also not very clear ...
There are also a few variables in _read_parallel that use int for source_len and source_pos, and the bug is also valid with parallel mode, so I will update these declarations.

@MSeifert04
Copy link
Contributor

@saimn I just had a quick look at the compiler warnings (some were already present before) but maybe you could have a look if any of these might break things:

cparser.c
astropy\io\ascii\cparser.c(3254): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(3925): warning C4244: 'function': conversion from 'long' to 'char', possible loss of data
astropy\io\ascii\cparser.c(4747): warning C4244: '=': conversion from 'Py_ssize_t' to 'unsigned long', possible loss of data
astropy\io\ascii\cparser.c(5028): warning C4244: '=': conversion from 'Py_ssize_t' to 'unsigned long', possible loss of data
astropy\io\ascii\cparser.c(5205): warning C4244: '=': conversion from 'Py_ssize_t' to 'unsigned long', possible loss of data
astropy\io\ascii\cparser.c(5628): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(5988): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(6156): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(6466): warning C4244: 'function': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(7271): warning C4244: '=': conversion from 'Py_ssize_t' to 'unsigned long', possible loss of data
astropy\io\ascii\cparser.c(14503): warning C4244: 'function': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(17969): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
astropy\io\ascii\cparser.c(17982): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data

tokenizer.c
astropy\io\ascii\src\tokenizer.c(82): warning C4244: 'initializing': conversion from '__int64' to 'long', possible loss of data
astropy\io\ascii\src\tokenizer.c(447): warning C4018: '<': signed/unsigned mismatch
astropy\io\ascii\src\tokenizer.c(956): warning C4244: '=': conversion from '__int64' to 'int', possible loss of data

@saimn
Copy link
Contributor Author

saimn commented Sep 13, 2016

@MSeifert04 - Good point, thanks. I didn't find how to reproduce these warnings with gcc though. Most of the warnings seem unrelated to this PR. I pushed a small change because source_pos can possibly be negative, this was maybe source of one of the warnings.

@juliantaylor
Copy link

a long is 32 bit on windows, use Py_ssize_t, intptr_t, npy_intp, ...

@juliantaylor
Copy link

The correct type in regards to files is actually off_t. It also works on 32 bit machines if you have large file support enabled. Though I don't know if that exists in windows. But there is Py_off_t which should be portable.

@saimn
Copy link
Contributor Author

saimn commented Sep 19, 2016

Thanks @juliantaylor , I used off_t as it seems a reasonable choice (and I don't know from where I can import Py_off_t).
I also fixed the warnings reported by @MSeifert04

Copy link

@juliantaylor juliantaylor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some basics on C types:

  • int is a 32 bit (on relevant systems at least) the places where the int type is the correct type are extremely rare nowadays, an int in C/cython code should always be a red flag for reviews
  • long sounds like the correct type for signed sizes as it is 32 bit on 32 bit linux and 64 bit on 64 bit linux, unfortunately it is 32 bit on 64 bit windows so its usually wrong for portable code, should also be a red flag for reviews
  • 32 bit machines can handle large files, but they can never load them into memory at once, thus there is the special off_t type which is 64 bit (if large file support is enabled) it is the correct type for offsetting into files, e.g. fseeko and ftello, it is not necessary for sizes in memory but as it always large enough its not bad either
  • possible sizes in memory depend on whether the machine is 32 or 64 bit, there are types in C/cython that handle that, namely intptr_t and size_t. These (in the absence of resurgence of segmented memory) will always have the correct size to index anything in memory. Thus they are the best types to be using for integers. They are always big enough and small enough at the same time for the processor. If you don't care about 32 bit performance using int64_t for memory indexing is fine too.

cdef char *tmp
cdef int line_len
cdef int map_len = len(self.mmap)
cdef int map_len = <int>len(self.mmap)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrong, never use int in cython code when not absolutely needed (which is very rare)

mmap is a pointer to memory so it can exceed 32 bit on 64 bit machines, use a wordsized type, like Py_ssize_t, intptr_t, npy_intp, size_t, ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

else:
name += chr(c)
self.width = len(self.header_names)
self.width = <int>len(self.header_names)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably fine as header names is likely small, but still why risk it? use a correct type for size in memory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the number of columns, so hopefully it's fine.


if self.names:
self.width = len(self.names)
self.width = <int>len(self.names)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no int again

# This queue is used to signal processes to reconvert if necessary
reconvert_queue = multiprocessing.Queue()

cdef int i

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the expected size of N? is it guaranteed to be smaller than 32 bit?
or just don't risk it and use large enough type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N is the number of processes for the parallel mode.

@taldcroft
Copy link
Member

This is good work and all the effort in handling arcane details of types is awesome. Some time ago I took a crack at trying to change the default int type on windows in the fast reader from int4 to int8 in #4686, but could not get it to work. This issue is currently causing a test fail in #5347. If you have ideas or would like to take over #4686 that would excellent. Having platform consistency in output types from the ASCII reader would be a good thing IMO.

@saimn
Copy link
Contributor Author

saimn commented Sep 20, 2016

Ok, thanks for the useful explanations @juliantaylor ! int is used almost everywhere in the C code ... Most of these should be fine (indexing columns/rows/...), though for map_len it was indeed a problematic case (I made the change).

@saimn
Copy link
Contributor Author

saimn commented Nov 2, 2016

Rebased to get rid off a conflict in the Changes file. Anything else I can do to get this in ?

@taldcroft
Copy link
Member

@saimn - can you write a test which is skipped unless an env. var TEST_READ_HUGE_FILE (or whatever) is defined? The test then writes a sufficiently large file to temp and then reads it back and verifies the right content. The idea is to allow local testing as needed.

@MSeifert04 @juliantaylor - are you happy with this otherwise?

@taldcroft taldcroft self-assigned this Nov 2, 2016
@taldcroft
Copy link
Member

@saimn - p.s. you can [skip ci] the commit for the new test of course.

@saimn
Copy link
Contributor Author

saimn commented Nov 2, 2016

@taldcroft - Yep, good idea, I will add it in the next days.

@saimn
Copy link
Contributor Author

saimn commented Nov 7, 2016

@taldcroft - Done (and realized I forgot [ci skip] when doing the push, sorry :/)

@saimn
Copy link
Contributor Author

saimn commented Nov 7, 2016

BTW, if someone is available and can cancel the builds ...

@taldcroft
Copy link
Member

@saimn - excellent, looks good. For the record (since the new test is not actually run), can you confirm that this test:

  • Fails with current master.
  • Succeeds with this PR.

@saimn
Copy link
Contributor Author

saimn commented Nov 7, 2016

Fails with current master.
Succeeds with this PR.

Yep, I tested both. However I realized that with the code given in #5302 by @yannick1974 (from which the test is adapted) we were able to read the table with master, even if it had then a wrong number of rows. And now on master I get a ValueError: Inconsistent data column lengths: set([]) and I don't find why I cannot reproduce the partial reading. I don't think it is a problem, but it is puzzling.

@taldcroft
Copy link
Member

OK, well it is not reading successfully on master, and perhaps the failure mode(s) are tricky because of the overflow situation. So that is sufficient, merging! Thanks.

@taldcroft taldcroft merged commit ab28e3e into astropy:master Nov 7, 2016
@taldcroft taldcroft changed the title Fix int overflow in the c parser Fix int overflows in the ASCII fast reader c parser Nov 7, 2016
@saimn saimn deleted the fix-cparser branch November 7, 2016 13:35
@saimn
Copy link
Contributor Author

saimn commented Nov 7, 2016

Ok, just to be sure I checked with 500'000 rows, and then it managed to read 54248 rows ! But with 250'000 lines it crashes.

saimn added a commit to saimn/astropy that referenced this pull request Nov 24, 2016
In astropy#5319 an import of `off_t` was added, but `off_t` was moved from
`posix.unistd` to `posix.types` in Cython 0.21
(cython/cython@4345655).
So using a direct import avoids to require a specific Cython version.
saimn added a commit to saimn/astropy that referenced this pull request Nov 24, 2016
In astropy#5319 an import of `off_t` was added, but `off_t` was moved from
`posix.unistd` to `posix.types` in Cython 0.21
(cython/cython@4345655).
So using a direct import avoids to require a specific Cython version.
eteq pushed a commit that referenced this pull request Dec 21, 2016
Fix int overflows in the ASCII fast reader c parser
eteq pushed a commit that referenced this pull request Dec 21, 2016
Fix int overflows in the ASCII fast reader c parser
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants