Use int instead of long to keep CSV row number #1287
isabelle-dr merged 2 commits into MobilityData:master
Generate TooManyEntriesNotice (error) for files that have more than MAX_INT rows. Such large files would require too much memory and cause OOM.
Thanks @aababilov, one question: how do we define This new rule will have to be added to the documentation. 😊
Thanks! The change looks good. Could you please add a bit more information in the first comment on what impact the change has (e.g. memory savings)? Also, should we use something more conservative than INT_MAX [0], such as restricting a single file to 500 million or 1 billion entries? @isabelle-dr do you have any opinion on this? [0] INT_MAX is 2'147'483'647
@asvechnikov2 what is the proportion of datasets that have > 500 million entries?
@isabelle-dr I think the biggest I saw was under 100 million entries, so it should be safe to assume that none have > 500 million (not even close).
This is a risky practice.
…nyRows We are counting the amount of rows, not entities. A CSV file may have empty rows that have no GTFS entities.
@aababilov agreed that RULES.md is a pain to maintain. I don't recommend opening it in IDEA; use a simple text editor instead. I've explored generating the file from comments directly and have a local patch that automates a big chunk of it, but it's not quite there yet.
asvechnikov2
left a comment
LGTM! @isabelle-dr the proposal is to use 1 billion entries as a limit; this is well beyond what is available now and won't cause any issues.
isabelle-dr
left a comment
LGTM! Merging this PR. 🥳
Generate TooManyRowsNotice (error) for files that have more than 1 billion rows. Such large files would require too much memory and cause OOM.
This change reduces memory consumption by 4 bytes per row. Large stop_times.txt files may contain 30 M or even 500 M lines, so we save between 120 MB and 2 GB.
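
The arithmetic and the row limit described above can be sketched roughly as follows. This is a minimal illustration, not the validator's actual code: the class `RowLimitSketch`, the constant `MAX_ROW_NUMBER`, and both helper methods are hypothetical names, and the 4-byte figure simply assumes the row number moves from a `long` field to an `int` field.

```java
// Hypothetical sketch: none of these names come from the validator itself.
class RowLimitSketch {
    // Assumed limit, taken from the 1 billion figure agreed on in the review.
    static final int MAX_ROW_NUMBER = 1_000_000_000;

    // A long takes 8 bytes and an int takes 4, so each stored row number shrinks by 4 bytes.
    static final long BYTES_SAVED_PER_ROW = Long.BYTES - Integer.BYTES;

    // Total bytes saved for a file with the given number of rows.
    static long bytesSaved(long rowCount) {
        return rowCount * BYTES_SAVED_PER_ROW;
    }

    // True when a file has more rows than the limit, i.e. when an error
    // such as TooManyRowsNotice would be reported instead of loading the file.
    static boolean exceedsRowLimit(long rowsRead) {
        return rowsRead > MAX_ROW_NUMBER;
    }

    public static void main(String[] args) {
        System.out.println(bytesSaved(30_000_000L));         // 120000000 (120 MB)
        System.out.println(bytesSaved(500_000_000L));        // 2000000000 (2 GB)
        System.out.println(exceedsRowLimit(1_000_000_000L)); // false
        System.out.println(exceedsRowLimit(1_000_000_001L)); // true
    }
}
```

The limit is kept well below INT_MAX (2'147'483'647), so a counter of type `int` cannot overflow before the check fires.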