Skip to content

feat: Strip leading and trailing whitespaces and emit an error#729

Merged
barbeau merged 3 commits intoMobilityData:masterfrom
aababilov:strip-whitespace
Feb 11, 2021
Merged

feat: Strip leading and trailing whitespaces and emit an error#729
barbeau merged 3 commits intoMobilityData:masterfrom
aababilov:strip-whitespace

Conversation

@aababilov
Copy link
Copy Markdown
Collaborator

feat: Strip leading and trailing whitespaces and emit a warning

GTFS reference says:
Remove any extra spaces between fields or field names. Many parsers
consider the spaces to be part of the value, which may cause errors.

GTFS Validator strips whitespaces from protected values. We do not
see any use case when such a whitespace may be needed. On the other
hand, some real-world feeds use trailing whitespaces for some values
and omit them for the others. This is causing the largest problem
when a primary key and a foreign key differ just by a whitespace:
it is clear that they are intended to be the same, that is why we
always strip whitespaces.

Note that this is a warning:

  • not an error because the feed may be still parsed;
  • not an info because leading and trailing whitespaces are ambiguous
    and we want the feed providers to be aware of them.

Copy link
Copy Markdown
Member

@barbeau barbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aababilov! Generally this looks good although there is a question of ERROR vs WARNING here.

@MobilityData/transit-specs please see below too

* <p>This warning is emitted for values protected with double quotes since whitespaces for
* non-protected values are trimmed automatically by CSV parser.
*
* <p>Note that this is a warning: not an error because the feed may be still parsed and also not an
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example where if we're treating this validator as canonical with the spec as discussed in #549, then IMHO this needs to be an ERROR and not a WARNING.

Under "File Requirements" the spec says:

Remove any extra spaces between fields or field names. Many parsers consider the spaces to be part of the value, which may cause errors.

@MobilityData/transit-specs thoughts?

As I mentioned in #549 (comment) if we want to make this a "lower severity" error we could by making the concept of "severity" orthogonal to the designation of ERROR vs WARNING.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not become prescriptivists. If a ton of feeds are using those spaces, then our validator must accept them, so that may not be an error :)

This is like linguistics. If thousands of native speakers use a certain way of saying, it may not be proclaimed as an error. It may be a "lower style" (i.e., a warning in the case of validator) but not an error.

A nice video talking about "mistakes" that native English speaker make: https://www.youtube.com/watch?v=vGDb-fbvJmQ

And finally: the real way to fix that problem and avoid ambiguities is to give up CSV and switch static GTFS to protobufs. CSV is not a production data format, it is a toy format nice for office workers and students - everywhere where you do not care much about accuracy, speed and unambiguity.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm against making discrepancies between the GTFS specification and the GTFS validator. If it is a common mistake and if it does not cause issues to GTFS consumers, then the rule

Remove any extra spaces between fields or field names. Many parsers consider the spaces to be part of the value, which may cause errors.

should not be a requirement anymore, but one of the Best Practices.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that you propose to amend the GTFS reference instead of enforcing it in its current shape.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what is our order of operations here @timMillet @carlfredl?

Is it a):

  1. Merge this PR as a warning, even though the spec indicates it should be an error
  2. Revise the spec to match the validator

...or b):

  1. Modify and merge this PR as an error to match the current spec
  2. Modify the spec to a treat as a warning (best practice instead of stated requirement)
  3. Modify the validator to a warning to match the updated spec

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aababilov I like to find a solution, as long as it is consistent :)

@barbeau I'd go with solution B. There is no guarantee that step 2 in both solutions will be completed, so going with option A will result in keeping a discrepancy between the spec and the validator.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @timMillet!

Which changes do you want me to make to this pull request before submit?

@aababilov So to merge this PR we just need the notice to be classified as an error instead of a warning.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I hope that we either fix the feeds or update the spec.

@aababilov
Copy link
Copy Markdown
Collaborator Author

aababilov commented Feb 10, 2021 via email

@barbeau
Copy link
Copy Markdown
Member

barbeau commented Feb 10, 2021

If you wish, I can add an option to ValidatorContext to choose between a warning and an error.

@aababilov This would be great, although I think it needs to be done within the construct of a larger "profile" support, where we could select a "Google" or "X" profile and change certain notices from error to warning or vice versa, as mentioned in #549 (comment). In other words, I don't think this notice will be the only one that we need to configure this way. Do you want to open a separate issue to discuss this?

@aababilov
Copy link
Copy Markdown
Collaborator Author

I am not sure that we need a concept of profile right now.

Which changes do you want me to make to this pull request before submit?

GTFS reference says:
Remove any extra spaces between fields or field names. Many parsers
consider the spaces to be part of the value, which may cause errors.

GTFS Validator strips whitespaces from protected values. We do not
see any use case when such a whitespace may be needed. On the other
hand, some real-world feeds use trailing whitespaces for some values
and omit them for the others. This is causing the largest problem
when a primary key and a foreign key differ just by a whitespace:
it is clear that they are intended to be the same, that is why we
always strip whitespaces.

Note that this is a warning:
* not an error because the feed may be still parsed;
* not an info because leading and trailing whitespaces are ambiguous
  and we want the feed providers to be aware of them.
Copy link
Copy Markdown
Member

@barbeau barbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aababilov!

@barbeau barbeau changed the title feat: Strip leading and trailing whitespaces and emit a warning feat: Strip leading and trailing whitespaces and emit a notice Feb 11, 2021
@barbeau barbeau changed the title feat: Strip leading and trailing whitespaces and emit a notice feat: Strip leading and trailing whitespaces and emit an error Feb 11, 2021
@barbeau barbeau merged commit 933a8bc into MobilityData:master Feb 11, 2021
@aababilov aababilov deleted the strip-whitespace branch April 20, 2021 02:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants