Skip to content

PR6695: Do not treat paths as encoded in ISO-8859-1#124

Closed
whitequark wants to merge 6 commits intoocaml:trunkfrom
whitequark:utf8-paths
Closed

PR6695: Do not treat paths as encoded in ISO-8859-1#124
whitequark wants to merge 6 commits intoocaml:trunkfrom
whitequark:utf8-paths

Conversation

@gasche
Copy link
Member

gasche commented Dec 12, 2014

I think the merge of "Make sure the compiler only uses ASCII" would be a no-brainer if it didn't imply changing the stdlib, which is always more controversial (as it should be). I'd merge a commit with the ascii_foo functions in utils/misc (and let people discuss a second commit that moves them to the stdlib directly). Unfortunately, that would require duplicating them in ocamlbuild/my_std as well, as we don't want ocamlbuild to start depending on internal compiler libraries.

@gasche
Copy link
Member

gasche commented Dec 12, 2014

I forgot to say: I do support adding the ascii_* to the stdlib (I would personally have just changed the semantics of the existing String....) and have reviewed the patch, I think it is fine.

(We may want to expose latin1_* functions and encourage people to use them if they explicitly rely on the accented-characters behavior.)

@whitequark
Copy link
Member Author

@gasche Yes, that was the reason I did not put them in Misc.

Should I move/duplicate them or wait for more input?

@alainfrisch
Copy link
Contributor

It might be useful to check that no OPAM package uses compilation units whose name starts with a non-ASCII letter. I assume and hope the answer is no.

@gasche
Copy link
Member

gasche commented Dec 12, 2014

Should I move/duplicate them or wait for more input?

Well it's as you prefer. If you value the "at least the compiler is fixed in trunk now" aspect enough to put micro-work in splitting the commits, you want to do that. Otherwise, just wait -- but be patient, these things can stay silent for months. Both are fine options.

@whitequark
Copy link
Member Author

@alainfrisch The answer is actually no. ocamldep does not print dependencies whose names start with non-'A'..'Z' -- this will be the topic of a later PR of mine.

Additionally, since Ubuntu, Github et al use UTF-8 and the compiler uses Latin-1, even if it did not, that would still fail the compilation.

@whitequark
Copy link
Member Author

@gasche I will put them in Misc. This will make it easier to decide on how/if change the semantics of String.

@damiendoligez
Copy link
Member

I'm in favor of adding them to stdlib, but we need to discuss the interface:

  1. I don't like the names, I think capitalize_ascii would look better than ascii_capitalize
  2. We need to discuss the idea of adding an optional argument to the existing functions instead of adding new functions (it looks cleaner but makes it harder to deprecate the old behavior).

@whitequark
Copy link
Member Author

@gasche I tried to put them in Misc. It is not easy. There's a never-ending hole of dependencies all around the compiler.

@planar

  1. I will rename them.
  2. How about adding ?(latin1=true) ?

@gasche
Copy link
Member

gasche commented Dec 12, 2014

I think adding optional arguments is a bad way to evolve APIs. In this particular case, the default value would be the bad value for this parameter.

@dbuenzli
Copy link
Contributor

@gasche If that's the bad value for the optional parameter it means that you ready to break compatibility ?

If it's ok to break backward compatibility then why not move the old functions with the same name to a deprecated String.Latin1 module and keep the rest as is but now only acting on the ASCII set. Code relying on the old behaviour will only have to add Latin1 in front of a bunch of identifiers.

Another alternative (which I'm not very fond of) would be to have a full copy of the old String module in the deprecated String.Latin1 module and tell people to open String.Latin1 if they need the deprecated behaviour. The advantage and disadvantage of this solution is that backward compatibility can be maintained at the build system level with -open String.Latin1. But I'm not very fond of that command line option...

@dbuenzli
Copy link
Contributor

Oh btw. adding optional parameters to an API is an interface breaking change anyways.

@whitequark
Copy link
Member Author

I think the required change is by far not invasive enough to warrant a full copy. In any case it is never worse than adding ascii affix and is simple enough that you could even apply it with sed. (Not that you should.)

Personally I prefer changing the behavior of existing functions, given that it seems that breaking backwards compatibility is inevitable.

@lpw25
Copy link
Contributor

lpw25 commented Dec 12, 2014

If we're going to break backwards compatibility, why not break it in the same way it was broken for Bytes. This would meaning adding a Latin1 module to stdlib with the old implementations, and a latin1 type to the predefined types. Then a compiler option would decide whether string was equal to latin1 and whether latin1 characters were accepted in string literals.

We could also add a "foo"l literal form to create latin1 strings (which could be a first step towards allowing Unicode literals as expressions one day).

The benefit of this approach is that we use types to properly separate latin1 encoded strings from ascii encoded strings -- which should hopefully make bugs around incorrect use of the encodings less likely.

@whitequark
Copy link
Member Author

@lpw25 I think this is the best solution. We also know migration pitfalls reasonably well.

@dbuenzli
Copy link
Contributor

@whitequark this as in this PR or this as in @lpw25 ? I personally think that @lpw25's solution is overkill for four functions acting on a legacy charset. If backward compatibility should not be broken on that I prefer deprecation of the current functions and the _ascii suffix way.

@whitequark
Copy link
Member Author

@dbuenzli I was originally talking about @lpw25 but after consideration I realized I do not understand the ramifications of either breaking or not breaking compatibility well enough to have an opinion, so I'll let someone else figure out how it should be done.

@damiendoligez
Copy link
Member

@dbuenzli Indeed, @lpw25's solution would be fine if Latin1 was here to stay, but we're trying to get rid of it so I don't see much point in adding that much complexity to the stdlib. Like you, I prefer the "deprecation and _ascii suffix" solution.

@gasche
Copy link
Member

gasche commented Dec 13, 2014

Thirded. I think the current patch is optimal (I would maybe add latin1_* functions as well, but it's a detail).

This updates Char, String, Bytes in the stdlib.

For now, they are hidden from documentation and are only for
internal compiler use.
@whitequark
Copy link
Member Author

@gasche I've updated the patch to include @planar's suggestion on naming. Otherwise it is unchanged. Anything else needs to be done?

@gasche
Copy link
Member

gasche commented Dec 14, 2014

If you also marked the legacy functions as deprecated this patch would solve Mantis PR#6695 as well.

@whitequark
Copy link
Member Author

@gasche Did you mean PR6694?

@gasche
Copy link
Member

gasche commented Dec 14, 2014

Yes indeed.

This should cover all places involving filenames in the compiler.
There are a few more paths still using Latin-1 in other ways,
e.g. in ocamldoc.
The only place that includes changes is the code for checking
the suffix. It is highly unlikely that the change has any
impact at all.
Also, add documentation for the US-ASCII variants.
@whitequark
Copy link
Member Author

Updated. The deprecation warnings actually uncovered a bug (a stray lowercase in Filename), and added some fallout, mainly in Str.

@gasche
Copy link
Member

gasche commented Dec 21, 2014

Merged in trunk. Thanks!

@gasche gasche closed this Dec 21, 2014
@whitequark whitequark deleted the utf8-paths branch December 21, 2014 11:56
@whitequark whitequark restored the utf8-paths branch December 21, 2014 11:56
@whitequark whitequark deleted the utf8-paths branch December 21, 2014 11:56
@dbuenzli
Copy link
Contributor

@gasche given the recent #131 you may want to add @SInCE directives to {Char,String}.{uppercase,lowercase}_ascii now.

@whitequark
Copy link
Member Author

@dbuenzli Already done in trunk.

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants