feat: non ascii character in ID by lionel-nj · Pull Request #712 · MobilityData/gtfs-validator

lionel-nj · 2021-02-02T14:37:12Z

closes #529
Summary:

This PR implements a new validation rule: IDs must not contain non-ASCII characters.

Expected behavior:

A notice should be generated if an ID contains non-ASCII characters.

Please make sure these boxes are checked before submitting your pull request - thanks!

Run the unit tests with gradle test to make sure you didn't break anything
Format the title like "feat: -new feature short description-" (PR title must follow the conventional commit specification)
Linked all relevant issues
~~[] Include screenshot(s) showing how this pull request works and fixes the issue(s)~~

- implement new rule logic - write additional unit tests - update documentation

barbeau

Thanks @lionel-nj, mostly LGTM! Two formatting items in-line.

RULES.md

core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java


core/src/main/java/org/mobilitydata/gtfsvalidator/notice/NonAsciiOrNonPrintableCharNotice.java

aababilov

Unfortunately, this change treats many real production feeds as invalid, since they use non-ASCII characters in IDs, and parsing therefore stops for them. That makes it a blocker for me, since I cannot update to the latest code from GitHub. This change ought to have been submitted after support for warnings was added. The new NonAsciiOrNonPrintableCharNotice should be a warning rather than an error, because the GTFS spec says:

An ID field value is an internal ID, not intended to be shown to riders, and is a sequence of any UTF-8 characters. Using only printable ASCII characters is recommended.

aababilov · 2021-02-08T00:15:06Z

core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java

+    if (value == null) {
+      return asString(columnIndex, required);
+    }
+    for (char ch : value.toCharArray()) {


toCharArray() makes a copy of the string, which is a small performance issue. It is better to avoid such issues from the beginning than to fix them later. Please use a plain indexed loop here:

for(int i = 0, n = value.length() ; i < n ; i++) {
...

Good piece of advice. Strings are full of hidden quirks.
🗨️ https://www.baeldung.com/java-string-performance 🗨️
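A minimal sketch of the loop suggested in this thread, assuming the same printable-ASCII range as the diff (`ch >= 32 && ch < 127`). The class and method names are hypothetical and are not part of the validator; the point is only that `charAt` reads characters in place, while `toCharArray()` allocates a new array on every call.

```java
// Hypothetical helper, not part of gtfs-validator.
final class PrintableAsciiCheck {
  private PrintableAsciiCheck() {}

  /** Returns true if every char of value lies in the printable ASCII range [32, 126]. */
  static boolean hasOnlyPrintableAscii(String value) {
    // Indexed loop: no copy of the string's characters is made.
    for (int i = 0, n = value.length(); i < n; i++) {
      char ch = value.charAt(i);
      if (!(ch >= 32 && ch < 127)) {
        return false;
      }
    }
    return true;
  }
}
```

With this helper, an ID such as `stop_42` passes, while an ID containing an accented letter or a tab character does not.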

aababilov · 2021-02-08T00:16:48Z

core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java

+                row.getFileName(), row.getRowNumber(), row.getColumnName(columnIndex)));
+        break;
+      }
+    }


Why are you calling asString once again at the end if you already have the value to return?

aababilov · 2021-02-08T00:17:14Z

core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java

  public String asId(int columnIndex, boolean required) {
+    String value = row.asString(columnIndex);
+    if (value == null) {
+      return asString(columnIndex, required);


Why are you calling asString which is calling row.asString again if you already have the value?

aababilov · 2021-02-08T00:22:49Z

core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java

+    for (char ch : value.toCharArray()) {
+      if (!(ch >= 32 && ch < 127)) {
+        addErrorInRow(
+            new NonAsciiOrNonPrintableCharNotice(


It would be great to include the invalid value in the notice as well.
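The three review comments on `asId` (read the value once, return the value already in hand, and carry the offending value in the notice) could be combined roughly as follows. This is a sketch under stated assumptions: `NonAsciiNoticeSketch` and `IdParserSketch` are hypothetical stand-ins for the validator's notice and `RowParser`, the real constructor signatures may differ, and handling of missing required values is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for NonAsciiOrNonPrintableCharNotice; the real class may differ.
final class NonAsciiNoticeSketch {
  final String fileName;
  final long rowNumber;
  final String columnName;
  final String fieldValue; // the invalid value, as requested in the review

  NonAsciiNoticeSketch(String fileName, long rowNumber, String columnName, String fieldValue) {
    this.fileName = fileName;
    this.rowNumber = rowNumber;
    this.columnName = columnName;
    this.fieldValue = fieldValue;
  }
}

// Hypothetical stand-in for RowParser; only the ID check is modelled.
final class IdParserSketch {
  final List<NonAsciiNoticeSketch> notices = new ArrayList<>();

  String asId(String fileName, long rowNumber, String columnName, String value) {
    if (value == null) {
      return null; // required-field handling is omitted in this sketch
    }
    for (int i = 0, n = value.length(); i < n; i++) {
      char ch = value.charAt(i);
      if (!(ch >= 32 && ch < 127)) {
        notices.add(new NonAsciiNoticeSketch(fileName, rowNumber, columnName, value));
        break; // one notice per value is enough
      }
    }
    return value; // reuse the value already read instead of reading the cell again
  }
}
```

Whether the collected notice is reported as an error or as a warning is a separate decision, which is the subject of the top-level comment in this review.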

lionel-nj · 2021-02-08T16:52:46Z

🙏🏾 Thanks @aababilov for your review. As you suggest, the changes will be provided once support for warnings has been added.

lionel-nj · 2021-02-08T17:29:30Z

@aababilov My bad, I just noticed: #728

aababilov · 2021-02-08T19:22:08Z

No worries, things happen!

Can we extend the amount of feeds that you use to test your commits? A large amount of feeds would probably show that valid data is rejected with errors instead of warnings or infos.

nackko · 2021-02-08T19:33:29Z

A large amount of feeds would probably show that valid data is rejected with errors instead of warnings or infos.

I can propose extending the end-to-end workflow here before submitting it, so that @lionel-nj could merge it? nackko#1

lionel-nj · 2021-02-09T14:40:23Z

Can we extend the amount of feeds that you use to test your commits? A large amount of feeds would probably show that valid data is rejected with errors instead of warnings or infos.

Sure! Do you have recommendations on a set of feeds that we could test the validator against? @aababilov @MobilityData/transit-specs

nackko · 2021-02-09T16:15:40Z

I could take said list and implement the workflow extension? :)


timMillet · 2021-02-09T17:34:35Z

Suggestions of feeds using non-ASCII characters in IDs:
I checked 10+ feeds in countries where the primary language is not transcribed in the Latin alphabet, but I haven't found any datasets with non-ASCII characters in IDs.
Suggestion of feeds with "good data", to extend the pool of the feeds tested, available in OpenMobilityData:
- AMT (Genova, Italy)
- Bibus (Brest, France)
- Metro (Christchurch, New Zealand)
- Ruter (Oslo, Norway)
- TAG (Grenoble, France)
- Translink (Vancouver, Canada)
- VAG (Freiburg, Germany)

nackko · 2021-02-09T19:17:50Z

Thanks @timMillet, will implement those tonight.


aababilov · 2021-02-09T19:49:29Z

The more feeds we use to test the validator, the better. It increases the chance to find a problem.

At Google, I just run the validator against all prod feeds without exception. That allows me to find many hidden bugs.

How many open-data feeds do you have?

nackko · 2021-02-09T20:10:54Z

Maxime would know the answer @maximearmstrong


maximearmstrong · 2021-02-09T21:29:51Z

@aababilov We have access to over 2000 datasets (static and real-time) through OpenMobilityData. A first step could be sampling part of the collection to integrate it into our end-to-end workflow. This should be discussed with @carlfredl, @barbeau and @lionel-nj in order to find the best strategy and number of feeds to integrate.

aababilov · 2021-02-09T21:33:17Z

2000 feeds is what we need! Do you have technologies to run validation in parallel on them? I use a proprietary Google library for that, so I can't publish my code that validates several thousands of our feeds.

lionel-nj · 2021-02-09T21:43:56Z

As @maximearmstrong suggested we will evaluate our possibilities. Meanwhile I would be curious to know your process to check all the validation reports that are generated @aababilov, we could get inspiration from that.

#734

nackko · 2021-02-09T21:52:47Z

@aababilov I guess the end to end workflow could be split in like, 50 or a 100 feeds per workflow file? They would run in parallel I think.

Excellent point by @lionel-nj.
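The batching idea above (50 or 100 feeds per workflow file) amounts to partitioning the feed list into fixed-size chunks. A minimal sketch of that step, assuming the feeds are available as a plain list; `FeedBatcher` is a hypothetical name, and how each batch is then mapped to a workflow file or run in parallel is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of gtfs-validator or its CI workflows.
final class FeedBatcher {
  private FeedBatcher() {}

  /** Splits items into consecutive batches of at most batchSize elements. */
  static <T> List<List<T>> partition(List<T> items, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < items.size(); start += batchSize) {
      int end = Math.min(start + batchSize, items.size());
      // Copy the view so each batch is independent of the source list.
      batches.add(new ArrayList<>(items.subList(start, end)));
    }
    return batches;
  }
}
```

With a batch size of 100, a collection of 2000 feeds would yield 20 batches, each of which could be validated independently.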


nackko · 2021-02-10T03:30:53Z

@aababilov @lionel-nj #736 is ready for comments. Lemme know and I'll complete it to a 100 feeds.

feat: non ascii character in id

7318510

- implement new rule logic - write additional unit tests - update documentation

lionel-nj added the v2.0-feature-parity-v1.4 label Feb 2, 2021

lionel-nj added this to the v2.0 milestone Feb 2, 2021

lionel-nj requested review from barbeau and maximearmstrong February 2, 2021 14:37

lionel-nj self-assigned this Feb 2, 2021

lionel-nj changed the title ~~feat: non ascii character in id~~ feat: non ascii character in ID Feb 2, 2021

chore: modify notice code

db185dd

barbeau approved these changes Feb 2, 2021

RULES.md (outdated)

core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java (outdated)

lionel-nj and others added 3 commits February 2, 2021 18:28

apply suggestion fro code review

e06c2d4

Co-authored-by: Sean Barbeau <[email protected]>

fix typo in formatting

a978248

Merge branch 'non-ascii-id' of github.com:MobilityData/gtfs-validator…

035ea5b

… into non-ascii-id

lionel-nj requested a review from barbeau February 2, 2021 17:30

barbeau merged commit 96d4206 into master Feb 2, 2021

barbeau deleted the non-ascii-id branch February 2, 2021 18:55

nackko reviewed Feb 3, 2021

core/src/main/java/org/mobilitydata/gtfsvalidator/notice/NonAsciiOrNonPrintableCharNotice.java

aababilov reviewed Feb 8, 2021


barbeau mentioned this pull request Feb 9, 2021

Write higher level acceptance tests #734

Closed

nackko added a commit to nackko/gtfs-validator that referenced this pull request Feb 10, 2021

3 feeds

4c31947

As suggested by @timMillet here: MobilityData#712 (comment)

nackko mentioned this pull request Feb 10, 2021

ci: Add more feeds to end-to-end GitHub workflow #736

Merged

2 tasks

niyalist mentioned this pull request Jun 24, 2021

Allow non ASCII but printable characters in ID field #918

Open

maximearmstrong mentioned this pull request Aug 9, 2022

Lighten end-to-end workflows or remove them #1233

Closed

6 participants
