Skip to content

Factorize field types#104

Merged
LeoFrachet merged 7 commits intogoogle:masterfrom
MobilityData:factorize_field_types
Oct 2, 2018
Merged

Factorize field types#104
LeoFrachet merged 7 commits intogoogle:masterfrom
MobilityData:factorize_field_types

Conversation

@LeoFrachet
Copy link
Copy Markdown
Contributor

Hi GTFS community!

We are starting to have some heavy duplication in the specification. The definition of an URL is given five times (stop_url, route_url, feed_publisher_url, agency_url, agency_fare_url), the definition of a timezone is given twice, just like the definition of a language code or a service day or a color.

This pull request offers to factorize those definitions of types on the top of the specification, to avoid duplication and make it more concise. It also makes it easier for consumers who wants to parse a GTFS and who wants to see which data is in which field.

Last but not least, it highlights the specific definition of "date" and "time" in a GTFS dataset, which isn't the usual definition of date and time.

You can read the pull request using the "Files changed" tab, or you can read the result formatted markdown, which might be easier:

https://github.com/MobilityData/transit/blob/field_types_2/gtfs/spec/en/reference.md#field-types

Thoughts?

Side note: I would have loved to tighten some definitions, e.g. to force the phone numbers to be in the E123 format, but I didn't do it because I wanted to keep this proposal as being just a factorization of definitions and not a change in the specification per se.

@harringtonp
Copy link
Copy Markdown

With regard to:

"Time - The field contains the time in the HH:MM:SS format (H:MM:SS is also accepted)"

would have preferred if the optional H:MM:SS was removed. After loading data into a DB I look for anything like this and prepend it with a 0. Doing this allows the times to be compared as strings which means a DB clause such as departure_time>"07:35:00" works as expected.

@skinkie
Copy link
Copy Markdown
Contributor

skinkie commented Sep 14, 2018

@harringtonp while I am not against what you suggest, but you can never expect a HH:MM:SS timestamp anyway. Because the time might even go up to HHH:MM:SS. Which is not supported by Google Transit, but certainly not against the spirit of the standard. The take away: HH is not 00-23,

@abyrd
Copy link
Copy Markdown

abyrd commented Sep 14, 2018

@LeoFrachet tightening definitions would be breaking backward compatibility, which has generally been avoided in the past. It might be nearly impossible to track down and force all data producers to change their previously-conforming output - my sense is that many of them cannot or will not do so because they paid an external contractor or vendor to make a one-off export tool. I believe things like phone numbers were intended for literal display to human users, so is there a need to specify an exact format?

@abyrd
Copy link
Copy Markdown

abyrd commented Sep 14, 2018

@harringtonp similar to my comment to @LeoFrachet, requiring two or three digits for the hours field would cause existing feeds to be invalid, and this type of change has been avoided in the past. Another point of view: arguably it is not optimal to store and sort times as strings - if these are treated as numbers (we typically convert to seconds after midnight) this eliminates the perceived problem.

In any case, as @LeoFrachet said, factoring out definitions should be separate from changing syntax. If you have proposals for changing how times are expressed, those should be separate issues.

@LeoFrachet
Copy link
Copy Markdown
Contributor Author

@LeoFrachet tightening definitions would be breaking backward compatibility, which has generally been avoided in the past.

@abyrd: I fully agree, and this is why this proposal (almost, see below) doesn't contains change in the definition of the types, I'm just grouping them on top of the spec to avoid duplication, and I'm adding examples to help comprehension.

The only two "changes" that I've made are:

  • Defining latitude as being between -90 and +90 (longitude was already defined as being between -180 and +180).
  • Strongly suggesting to use only printable ASCII characters in ID (aka no special character which could produce encoding error while being passed through different system).

I can remove those two "changes" if they are controversial.

Copy link
Copy Markdown

@slai slai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Makes it easier to maintain, although at the slight cost of having to jump around the docs if you're unfamiliar with the types. I think consistent definitions of types is worth it though.

- **ID** - The field contains an identifier which uniquely identifies an entity (therefore is dataset unique). Its value is not aimed to be displayed to the user, and is a sequence of any UTF-8 characters, but using only printable ASCII characters is recommended.
- **Language Code** - The field contains a IETF BCP 47 language code. For an introduction to IETF BCP 47, please refer to http://www.rfc-editor.org/rfc/bcp/bcp47.txt and http://www.w3.org/International/articles/language-tags/. For example: "en" for English, "en-US" for American English or "de" for German.
- **Latitude** - The field contains the WGS84 latitude in decimal degrees. The value must be greater than or equal to -90.0 and less than or equal to 90.0. For example: "41.890169" for the Colloseum in Rome.
- **Longitude** - The field contains the WGS84 longitude in decimal degrees. The value must be geater than or equal to -180.0 and less than or equal to 180.0. For example: "12.492269" for the Colloseum in Rome.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for both. Fixed.

@abyrd
Copy link
Copy Markdown

abyrd commented Sep 20, 2018

About needing to jump around in the docs - any chance we can make links back to the definitions or somehow get them to show up in tooltips? Probably asking too much from markdown.

In response to my own previous observation:

things like phone numbers were intended for literal display to human users, so is there a need to specify an exact format?

Looking at this again I realize few people would actually type a phone number, they'd probably tap it to auto-dial on their phone, so maybe format is important. But I think it should be a "strong recommendation" rather than a breaking spec change.

@abyrd
Copy link
Copy Markdown

abyrd commented Sep 20, 2018

I completely agree with the two changes you listed @LeoFrachet - the latitude one is clearly just a mistake, and it would be great to have full voluntary adoption of IDs being plain ASCII (excluding commas and whitespace). It would be interesting to do a scan of all existing known data to see if there are exceptions to that, and if not it might even be worth a breaking, more restrictive definition.

@LeoFrachet
Copy link
Copy Markdown
Contributor Author

@abyrd

Phone number formatting

Regarding phone number formatting, my current proposal do not change the spec. I keep the current definition and I just move it. If we want to change it, I'm expecting a whole conversation will be needed, and I would rather do that separate.

Jump around in the spec

I can add links to the type definition. Aka you could clic on "Date" and it would move up to the date. But IMHO the best case would be that we keep the raw spec on GitHub, and we have a nice and dynamic website showing the specification in a better way, e.g. gtfs.org. On this more complex page, we could for example have the possibility to "expend" the type definition when we want to see more, but it would be hidden by default to avoid cluttering it. But yeah I can add links to the type definition.

IDs tightening

I fully agree that spaces and commas are also annoying, and I would be open to forbid them. We can ask big consumers to do a survey. Maybe I could just remove the sentence from this proposal and open another thread to really make it stricter.

@barbeau
Copy link
Copy Markdown
Contributor

barbeau commented Sep 20, 2018

Maybe I could just remove the sentence from this proposal and open another thread to really make it stricter.

My preference is to keep the current text in this proposal for IDs, unless someone has a strong preference to remove it. It's only "recommended", so it's not breaking anything.

@prhod
Copy link
Copy Markdown

prhod commented Sep 24, 2018

Nothing to add at our side, we are totally inline with the proposal and the comments above.

@LeoFrachet
Copy link
Copy Markdown
Contributor Author

Since the Text data format is also used for feed_version, which is not meant to be displayed to the user, I changed sightly the definition of the data type Text.

From:

  • Text - The field contains a string of UTF-8 characters, which is aimed to be displayed to the rider and which must therefore be human readable.

To:

  • Text - The field contains a string of UTF-8 characters, which is aimed to be displayed and which must therefore be human readable.

@LeoFrachet
Copy link
Copy Markdown
Contributor Author

This pull request has been open more than one week ago. As per the Official Process, we can now open the vote on it.

Vote will be closed on Monday October 1st at 23:59:59 UTC.

@barbeau
Copy link
Copy Markdown
Contributor

barbeau commented Sep 24, 2018

+1

3 similar comments
@TeXitoi
Copy link
Copy Markdown

TeXitoi commented Sep 24, 2018

+1

@slai
Copy link
Copy Markdown

slai commented Sep 24, 2018

+1

@gcamp
Copy link
Copy Markdown
Contributor

gcamp commented Sep 24, 2018

+1

@skinkie
Copy link
Copy Markdown
Contributor

skinkie commented Sep 24, 2018

+1 thanks for the effort

@abyrd
Copy link
Copy Markdown

abyrd commented Sep 25, 2018

+1

@ECF-dot-com
Copy link
Copy Markdown

I am not sure how to vote?

Is writing "+1" here, all I have to do to say I agree with this proposal ?

@harringtonp
Copy link
Copy Markdown

+1

@paulswartz
Copy link
Copy Markdown
Contributor

IDs tightening

I fully agree that spaces and commas are also annoying, and I would be open to forbid them. We can ask big consumers to do a survey. Maybe I could just remove the sentence from this proposal and open another thread to really make it stricter.

Here @mbta, we currently have a handful of IDs with spaces, but none with commas. It's possible that we'll be changing them in the future, but since we use them in many different places it's unlikely to happen soon.

One other note about the documentation for IDs: we use them in URLs, so we consider them partially user-visible.

@LeoFrachet
Copy link
Copy Markdown
Contributor Author

@ECF-dot-com: Yes, you can just "+1" to give your approval.

@paulswartz: Good to know about spaces in MBTA's ids. The current proposal under vote do not speak about spaces in ID. It just "strongly suggest" to avoid non ASCII character in it (e.g. ü, ß or ç).

@TeXitoi
Copy link
Copy Markdown

TeXitoi commented Sep 26, 2018

Just for the record, we have data with ids that contains "é".

@LeoFrachet
Copy link
Copy Markdown
Contributor Author

7 votes in favor (even 8 if we include the one from @ECF-dot-com).
No vote against.

=> The proposal is adopted.

@LeoFrachet LeoFrachet merged commit 77a771d into google:master Oct 2, 2018
@LeoFrachet LeoFrachet deleted the factorize_field_types branch October 2, 2018 14:42
| | | | • If this stop ID represents **a stop located inside a station**, this entry's location type must be **0 or blank**, and this entry's parent_station field contains **the stop ID of the station where this stop is located and the stop referenced by parent_station must have location_type=1.** |
| | | | • If this stop ID represents **a stop located outside a station**, this entry's location type must be **0 or blank**, and this entry's parent_station field contains **a blank value (the parent_station field doesn't apply to this stop).** |
| | | | • If this stop ID represents **a station**, this entry's location type must be **1**, and this entry's parent_station field contains **a blank value (stations can't contain other stations).** |
| stop_timezone | Timezone | Optional | The **stop_timezone** field contains the timezone in which this stop, station, or station entrance is located. If omitted, the stop should be assumed to be located in the timezone specified by **agency_timezone** in [agency.txt](#agencytxt). When a stop has a parent station, the stop is considered to be in the timezone specified by the parent station's **stop_timezone** value. If the parent has no stop_timezone value, the stops that belong to that station are assumed to be in the timezone specified by **agency_timezone**, even if the stops have their own **stop_timezone** values. In other words, if a given stop has a **parent_station** value, any **stop_timezone** value specified for that stop must be ignored. Even if **stop_timezone** values are provided in stops.txt, the times in [stop_times.txt](#stop_timestxt) should continue to be specified as time since midnight in the timezone specified by **agency_timezone** in agency.txt. This ensures that the time values in a trip always increase over the course of a trip, regardless of which timezones the trip crosses. |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git blame lead me here, I have a question.

The stop_timezone definition says that

[...] if stop_timezone values are provided in stops.txt, the times in stop_times.txt should continue to be specified as time since midnight in the timezone specified by agency_timezone in agency.txt.

But the newly added GTFS Time definition says:

The time is measured from "noon minus 12h" of the service day (effectively midnight except for days on which daylight savings time changes occur).

I assume that the latter is the correct definition, and that the explanation for stop_timezone should be adapted?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.