Skip to content

feat: non ascii character in ID#712

Merged
barbeau merged 5 commits intomasterfrom
non-ascii-id
Feb 2, 2021
Merged

feat: non ascii character in ID#712
barbeau merged 5 commits intomasterfrom
non-ascii-id

Conversation

@lionel-nj
Copy link
Copy Markdown
Contributor

closes #529
Summary:

This PR implements a new validation rule: no non-ascii character in IDs.

Expected behavior:

  • a notice should be generated if an ID contains non ascii characters

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with gradle test to make sure you didn't break anything
  • Format the title like "feat: -new feature short description-" (PR title must follow the conventional commit specification)
  • Linked all relevant issues
  • [] Include screenshot(s) showing how this pull request works and fixes the issue(s)

- implement new rule logic
- write additional unit tests
- update documentation
@lionel-nj lionel-nj added this to the v2.0 milestone Feb 2, 2021
@lionel-nj lionel-nj self-assigned this Feb 2, 2021
@lionel-nj lionel-nj changed the title feat: non ascii character in id feat: non ascii character in ID Feb 2, 2021
Copy link
Copy Markdown
Member

@barbeau barbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lionel-nj, mostly LGTM! Two formatting items in-line.

@lionel-nj lionel-nj requested a review from barbeau February 2, 2021 17:30
@barbeau barbeau merged commit 96d4206 into master Feb 2, 2021
@barbeau barbeau deleted the non-ascii-id branch February 2, 2021 18:55
Copy link
Copy Markdown
Collaborator

@aababilov aababilov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, that change treats many real production feeds as invalid since they use non-ASCII characters and therefore parsing stops for them. So, that becomes a blocker for me since I cannot update to the latest code from GitHub. This change ought to be submitted after support for warnings was added. This new NonAsciiOrNonPrintableCharNotice should be a warnings rather than an error because GTFS spec says:

An ID field value is an internal ID, not intended to be shown to riders, and is a sequence of any UTF-8 characters. Using only printable ASCII characters is recommended.

if (value == null) {
return asString(columnIndex, required);
}
for (char ch : value.toCharArray()) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a copy of the string that is a small performance issue. It is better to avoid them from the beginning rather than fixing them later. Please use a usual cycle here:

for(int i = 0, n = value.length() ; i < n ; i++) {
...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

row.getFileName(), row.getRowNumber(), row.getColumnName(columnIndex)));
break;
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you calling asString once again at the end if you already have the value to return?

public String asId(int columnIndex, boolean required) {
String value = row.asString(columnIndex);
if (value == null) {
return asString(columnIndex, required);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you calling asString which is calling row.asString again if you already have the value?

for (char ch : value.toCharArray()) {
if (!(ch >= 32 && ch < 127)) {
addErrorInRow(
new NonAsciiOrNonPrintableCharNotice(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be great to include the invalid value as well.

@lionel-nj
Copy link
Copy Markdown
Contributor Author

🙏🏾 Thanks @aababilov for your review. As you suggest, the changes will be provided after support for warnings was added.

@lionel-nj
Copy link
Copy Markdown
Contributor Author

lionel-nj commented Feb 8, 2021

@aababilov My bad, I just noticed: #728

@aababilov
Copy link
Copy Markdown
Collaborator

No worries, things happen!

Can we extend the amount of feeds that you use to test your commits? A large amount of feeds would probably show that valid data is rejected with errors instead of warnings or infos.

@nackko
Copy link
Copy Markdown
Contributor

nackko commented Feb 8, 2021

A large amount of feeds would probably show that valid data is rejected with errors instead of warnings or infos.

I can propose to extend the end to end workflow here before submitting it so that @lionel-nj could merge it ?nackko#1

@lionel-nj
Copy link
Copy Markdown
Contributor Author

Can we extend the amount of feeds that you use to test your commits? A large amount of feeds would probably show that valid data is rejected with errors instead of warnings or infos.

Sure! Do you have recommendations on a set of feeds that we could test the validator against? @aababilov @MobilityData/transit-specs

@nackko
Copy link
Copy Markdown
Contributor

nackko commented Feb 9, 2021 via email

@timMillet
Copy link
Copy Markdown

  • Suggestions of feeds using non-ASCII characters in IDs:
    I checked 10+ feeds in countries where the primary language is not transcribed in the Latin alphabet, but I haven't found any datasets with non-ASCII characters in IDs.
  • Suggestion of feeds with "good data", to extend the pool of the feeds tested, available in OpenMobilityData:
    • AMT (Genova, Italy)
    • Bibus (Brest, France)
    • Metro (Christchurch, New Zealand)
    • Ruter (Oslo, Norway)
    • TAG (Grenoble, France)
    • Translink (Vancouver, Canada)
    • VAG (Freiburg, Germany)

@nackko
Copy link
Copy Markdown
Contributor

nackko commented Feb 9, 2021 via email

@aababilov
Copy link
Copy Markdown
Collaborator

The more feeds we use to test the validator, the better. It increases the chance to find a problem.

At Google, I just run the validator against all prod feeds without exception. That allows me to find many hidden bugs.

How many open-data feeds you have?

@nackko
Copy link
Copy Markdown
Contributor

nackko commented Feb 9, 2021 via email

@maximearmstrong
Copy link
Copy Markdown
Contributor

@aababilov We have access to over 2000 datasets (static and real-time) through OpenMobilityData. A first step could be sampling a part of the collection to integrate it to our end-to-end workflow. This should be discussed with @carlfredl, @barbeau and @lionel-nj in order to find the best strategy and amount of feeds to integrate.

@aababilov
Copy link
Copy Markdown
Collaborator

aababilov commented Feb 9, 2021 via email

@lionel-nj
Copy link
Copy Markdown
Contributor Author

lionel-nj commented Feb 9, 2021

As @maximearmstrong suggested we will evaluate our possibilities. Meanwhile I would be curious to know your process to check all the validation reports that are generated @aababilov, we could get inspiration from that.

#734

@nackko
Copy link
Copy Markdown
Contributor

nackko commented Feb 9, 2021

@aababilov I guess the end to end workflow could be split in like, 50 or a 100 feeds per workflow file? They would run in parallel I think.

Excellent point by @lionel-nj.

nackko added a commit to nackko/gtfs-validator that referenced this pull request Feb 10, 2021
@nackko
Copy link
Copy Markdown
Contributor

nackko commented Feb 10, 2021

@aababilov @lionel-nj #736 is ready for comments. Lemme know and I'll complete it to a 100 feeds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Non ascii or non printable char in id

6 participants