feat: Strip leading and trailing whitespaces and emit an error#729
feat: Strip leading and trailing whitespaces and emit an error#729barbeau merged 3 commits intoMobilityData:masterfrom
Conversation
163fe26 to
95494f9
Compare
barbeau
left a comment
There was a problem hiding this comment.
Thanks @aababilov! Generally this looks good although there is a question of ERROR vs WARNING here.
@MobilityData/transit-specs please see below too
| * <p>This warning is emitted for values protected with double quotes since whitespaces for | ||
| * non-protected values are trimmed automatically by CSV parser. | ||
| * | ||
| * <p>Note that this is a warning: not an error because the feed may be still parsed and also not an |
There was a problem hiding this comment.
This is an example where if we're treating this validator as canonical with the spec as discussed in #549, then IMHO this needs to be an ERROR and not a WARNING.
Under "File Requirements" the spec says:
Remove any extra spaces between fields or field names. Many parsers consider the spaces to be part of the value, which may cause errors.
@MobilityData/transit-specs thoughts?
As I mentioned in #549 (comment) if we want to make this a "lower severity" error we could by making the concept of "severity" orthogonal to the designation of ERROR vs WARNING.
There was a problem hiding this comment.
Let's not become prescriptivists. If a ton of feeds are using those spaces, then our validator must accept them, so that may not be an error :)
This is like linguistics. If thousands of native speakers use a certain way of saying, it may not be proclaimed as an error. It may be a "lower style" (i.e., a warning in the case of validator) but not an error.
A nice video talking about "mistakes" that native English speaker make: https://www.youtube.com/watch?v=vGDb-fbvJmQ
And finally: the real way to fix that problem and avoid ambiguities is to give up CSV and switch static GTFS to protobufs. CSV is not a production data format, it is a toy format nice for office workers and students - everywhere where you do not care much about accuracy, speed and unambiguity.
There was a problem hiding this comment.
I'm against making discrepancies between the GTFS specification and the GTFS validator. If it is a common mistake and if it does not cause issues to GTFS consumers, then the rule
Remove any extra spaces between fields or field names. Many parsers consider the spaces to be part of the value, which may cause errors.
should not be a requirement anymore, but one of the Best Practices.
There was a problem hiding this comment.
I like that you propose to amend the GTFS reference instead of enforcing it in its current shape.
There was a problem hiding this comment.
So what is our order of operations here @timMillet @carlfredl?
Is it a):
- Merge this PR as a warning, even though the spec indicates it should be an error
- Revise the spec to match the validator
...or b):
- Modify and merge this PR as an error to match the current spec
- Modify the spec to a treat as a warning (best practice instead of stated requirement)
- Modify the validator to a warning to match the updated spec
There was a problem hiding this comment.
@aababilov I like to find a solution, as long as it is consistent :)
@barbeau I'd go with solution B. There is no guarantee that step 2 in both solutions will be completed, so going with option A will result in keeping a discrepancy between the spec and the validator.
There was a problem hiding this comment.
Thanks @timMillet!
Which changes do you want me to make to this pull request before submit?
@aababilov So to merge this PR we just need the notice to be classified as an error instead of a warning.
There was a problem hiding this comment.
Done. I hope that we either fix the feeds or update the spec.
core/src/main/java/org/mobilitydata/gtfsvalidator/parsing/RowParser.java
Show resolved
Hide resolved
95494f9 to
b3b3b04
Compare
|
I am OK with both options. If we decide to make it an error, I can apply a
local patch at Google to make it a warning because we have to support a
fair amount of feeds that use those spaces in double quotes.
If you wish, I can add an option to ValidatorContext to choose between a
warning and an error.
Den tors 11 feb. 2021 07:15Sean Barbeau <[email protected]> skrev:
… ***@***.**** commented on this pull request.
------------------------------
In
core/src/main/java/org/mobilitydata/gtfsvalidator/notice/LeadingOrTrailingWhitespacesNotice.java
<#729 (comment)>
:
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.mobilitydata.gtfsvalidator.notice;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * The value in CSV file has leading or trailing whitespaces.
+ *
+ * <p>This warning is emitted for values protected with double quotes since whitespaces for
+ * non-protected values are trimmed automatically by CSV parser.
+ *
+ * <p>Note that this is a warning: not an error because the feed may be still parsed and also not an
So what is our order of operations here @timMillet
<https://github.com/timMillet> @carlfredl <https://github.com/carlfredl>?
Is it a):
1. Merge this PR as a warning, even though the spec indicates it
should be an error
2. Revise the spec to match the validator
...or b):
1. Modify and merge this PR as an error to match the current spec
2. Modify the spec to a treat as a warning (best practice instead of
stated requirement)
3. Modify the validator to a warning to match the updated spec
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#729 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAG6IJCRZVF2MEXIBIAF5J3S6LSNBANCNFSM4XIV6HPQ>
.
|
@aababilov This would be great, although I think it needs to be done within the construct of a larger "profile" support, where we could select a "Google" or "X" profile and change certain notices from error to warning or vice versa, as mentioned in #549 (comment). In other words, I don't think this notice will be the only one that we need to configure this way. Do you want to open a separate issue to discuss this? |
|
I am not sure that we need a concept of profile right now. Which changes do you want me to make to this pull request before submit? |
GTFS reference says: Remove any extra spaces between fields or field names. Many parsers consider the spaces to be part of the value, which may cause errors. GTFS Validator strips whitespaces from protected values. We do not see any use case when such a whitespace may be needed. On the other hand, some real-world feeds use trailing whitespaces for some values and omit them for the others. This is causing the largest problem when a primary key and a foreign key differ just by a whitespace: it is clear that they are intended to be the same, that is why we always strip whitespaces. Note that this is a warning: * not an error because the feed may be still parsed; * not an info because leading and trailing whitespaces are ambiguous and we want the feed providers to be aware of them.
b3b3b04 to
297a5dd
Compare
297a5dd to
1af9c25
Compare
feat: Strip leading and trailing whitespaces and emit a warning
GTFS reference says:
Remove any extra spaces between fields or field names. Many parsers
consider the spaces to be part of the value, which may cause errors.
GTFS Validator strips whitespaces from protected values. We do not
see any use case when such a whitespace may be needed. On the other
hand, some real-world feeds use trailing whitespaces for some values
and omit them for the others. This is causing the largest problem
when a primary key and a foreign key differ just by a whitespace:
it is clear that they are intended to be the same, that is why we
always strip whitespaces.
Note that this is a warning:
and we want the feed providers to be aware of them.