Conversation
- implement new rule logic - write additional unit tests - update documentation
barbeau
left a comment
There was a problem hiding this comment.
Thanks @lionel-nj, mostly LGTM! Two formatting items in-line.
core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java
Outdated
Show resolved
Hide resolved
Co-authored-by: Sean Barbeau <[email protected]>
… into non-ascii-id
core/src/main/java/org/mobilitydata/gtfsvalidator/notice/NonAsciiOrNonPrintableCharNotice.java
Show resolved
Hide resolved
aababilov
left a comment
There was a problem hiding this comment.
Unfortunately, that change treats many real production feeds as invalid since they use non-ASCII characters and therefore parsing stops for them. So, that becomes a blocker for me since I cannot update to the latest code from GitHub. This change ought to be submitted after support for warnings was added. This new NonAsciiOrNonPrintableCharNotice should be a warnings rather than an error because GTFS spec says:
An ID field value is an internal ID, not intended to be shown to riders, and is a sequence of any UTF-8 characters. Using only printable ASCII characters is recommended.
| if (value == null) { | ||
| return asString(columnIndex, required); | ||
| } | ||
| for (char ch : value.toCharArray()) { |
There was a problem hiding this comment.
This makes a copy of the string that is a small performance issue. It is better to avoid them from the beginning rather than fixing them later. Please use a usual cycle here:
for(int i = 0, n = value.length() ; i < n ; i++) {
...
There was a problem hiding this comment.
Good piece of advice. Strings are full of hidden quirks.
🗨️ https://www.baeldung.com/java-string-performance 🗨️
| row.getFileName(), row.getRowNumber(), row.getColumnName(columnIndex))); | ||
| break; | ||
| } | ||
| } |
There was a problem hiding this comment.
Why are you calling asString once again at the end if you already have the value to return?
| public String asId(int columnIndex, boolean required) { | ||
| String value = row.asString(columnIndex); | ||
| if (value == null) { | ||
| return asString(columnIndex, required); |
There was a problem hiding this comment.
Why are you calling asString which is calling row.asString again if you already have the value?
| for (char ch : value.toCharArray()) { | ||
| if (!(ch >= 32 && ch < 127)) { | ||
| addErrorInRow( | ||
| new NonAsciiOrNonPrintableCharNotice( |
There was a problem hiding this comment.
That would be great to include the invalid value as well.
|
🙏🏾 Thanks @aababilov for your review. As you suggest, the changes will be provided after support for warnings was added. |
|
@aababilov My bad, I just noticed: #728 |
|
No worries, things happen! Can we extend the amount of feeds that you use to test your commits? A large amount of feeds would probably show that valid data is rejected with errors instead of warnings or infos. |
I can propose to extend the end to end workflow here before submitting it so that @lionel-nj could merge it ?nackko#1 |
Sure! Do you have recommendations on a set of feeds that we could test the validator against? @aababilov @MobilityData/transit-specs |
|
I could take said list and implement the workflow extension? :)
…On Tue., Feb. 9, 2021, 10:40 Lionel Nébot Janvier, ***@***.***> wrote:
Can we extend the amount of feeds that you use to test your commits? A
large amount of feeds would probably show that valid data is rejected with
errors instead of warnings or infos.
Sure! Do you have recommendations on a set of feeds that we could test the
validator against? @aababilov <https://github.com/aababilov>
@MobilityData/transit-specs
<https://github.com/orgs/MobilityData/teams/transit-specs>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#712 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADFXC6YCVHX4UEUT3DDFXLS6FCOTANCNFSM4W6XU5WA>
.
|
|
|
Thks @timMillet , will implement those tonight.
…On Tue., Feb. 9, 2021, 13:34 Tim Millet, ***@***.***> wrote:
- Suggestions of feeds using non-ASCII characters in IDs:
I checked 10+ feeds in countries where the primary language is not
transcribed in the Latin alphabet, but I haven't found any datasets with
non-ASCII characters in IDs.
- Suggestion of feeds with "good data", to extend the pool of the
feeds tested, available in OpenMobilityData:
- AMT (Genova, Italy)
- Bibus (Brest, France)
- Metro (Christchurch, New Zealand)
- Ruter (Oslo, Norway)
- TAG (Grenoble, France)
- Translink (Vancouver, Canada)
- VAG (Freiburg, Germany)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#712 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADFXC33HB2EMD2B7S3LBU3S6FW37ANCNFSM4W6XU5WA>
.
|
|
The more feeds we use to test the validator, the better. It increases the chance to find a problem. At Google, I just run the validator against all prod feeds without exception. That allows me to find many hidden bugs. How many open-data feeds you have? |
|
Maxime would now the answer @maximearmstrong
…On Tue., Feb. 9, 2021, 15:49 Alexej Ababilov, ***@***.***> wrote:
The more feeds we use to test the validator, the better. It increases the
chance to find a problem.
At Google, I just run the validator against all prod feeds without
exception. That allows me to find many hidden bugs.
How many open-data feeds you have?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#712 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADFXC7BRJA4T4XBKFO6I3LS6GGVRANCNFSM4W6XU5WA>
.
|
|
@aababilov We have access to over 2000 datasets (static and real-time) through OpenMobilityData. A first step could be sampling a part of the collection to integrate it to our end-to-end workflow. This should be discussed with @carlfredl, @barbeau and @lionel-nj in order to find the best strategy and amount of feeds to integrate. |
|
2000 feeds is what we need! Do you have technologies to run validation in
parallel on them? I use a proprietary Google library for that, so I can't
publish my code that validates several thousands of our feeds.
Den ons 10 feb. 2021 08:30Maxime Armstrong <[email protected]> skrev:
… @aababilov <https://github.com/aababilov> We have access to over 2000
datasets (static and real-time) through OpenMobilityData. A first step
could be sampling a part of the collection to integrate it to our
end-to-end workflow. This should be discussed with @carlfredl
<https://github.com/carlfredl>, @barbeau <https://github.com/barbeau> and
@lionel-nj <https://github.com/lionel-nj> in order to find the best
strategy and amount of feeds to integrate.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#712 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAG6IJDXPS6CTSTPFJGBFXDS6GSN7ANCNFSM4W6XU5WA>
.
|
|
As @maximearmstrong suggested we will evaluate our possibilities. Meanwhile I would be curious to know your process to check all the validation reports that are generated @aababilov, we could get inspiration from that. |
|
@aababilov I guess the end to end workflow could be split in like, 50 or a 100 feeds per workflow file? They would run in parallel I think. Excellent point by @lionel-nj. |
As suggested by @timMillet here: MobilityData#712 (comment)
|
@aababilov @lionel-nj #736 is ready for comments. Lemme know and I'll complete it to a 100 feeds. |
closes #529
Summary:
This PR implements a new validation rule: no non-ascii character in IDs.
Expected behavior:
Please make sure these boxes are checked before submitting your pull request - thanks!
gradle testto make sure you didn't break anything[] Include screenshot(s) showing how this pull request works and fixes the issue(s)