PR6695: Do not treat paths as encoded in ISO-8859-1 by whitequark · Pull Request #124 · ocaml/ocaml

whitequark · 2014-12-12T16:18:29Z

See http://caml.inria.fr/mantis/view.php?id=6695.

gasche · 2014-12-12T16:39:13Z

I think the merge of "Make sure the compiler only uses ASCII" would be a no-brainer if it didn't imply changing the stdlib, which is always more controversial (as it should be). I'd merge a commit with the ascii_foo functions in utils/misc (and let people discuss a second commit that moves them to the stdlib directly). Unfortunately, that would require duplicating them in ocamlbuild/my_std as well, as we don't want ocamlbuild to start depending on internal compiler libraries.

gasche · 2014-12-12T16:42:15Z

I forgot to say: I do support adding the ascii_* to the stdlib (I would personally have just changed the semantics of the existing String....) and have reviewed the patch, I think it is fine.

(We may want to expose latin1_* functions and encourage people to use them if they explicitly rely on the accented-characters behavior.)
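To make the distinction under discussion concrete, here is a minimal sketch contrasting ASCII-only and Latin-1 case mapping on the same byte string. `String.uppercase_ascii` is the stdlib function as it exists in current OCaml; `latin1_uppercase` is a hand-written helper for illustration only, with character ranges chosen on the assumption that they mirror the legacy behaviour.

```ocaml
(* Hand-written Latin-1 upper-casing, for illustration only.
   Assumed ranges: a-z plus the accented lowercase letters 0xE0-0xF6
   and 0xF8-0xFE (0xF7 is the division sign, so it is skipped). *)
let latin1_uppercase_char c =
  match c with
  | 'a' .. 'z' | '\xE0' .. '\xF6' | '\xF8' .. '\xFE' ->
    Char.chr (Char.code c - 32)
  | _ -> c

let latin1_uppercase s = String.map latin1_uppercase_char s

(* "ecole" with an acute accent, encoded in Latin-1: the first byte is 0xE9. *)
let ecole_latin1 = "\xE9cole"

(* ASCII-only mapping leaves the accented byte alone: "\xE9COLE". *)
let ascii_result = String.uppercase_ascii ecole_latin1

(* Latin-1 mapping also shifts 0xE9 to 0xC9: "\xC9COLE". *)
let latin1_result = latin1_uppercase ecole_latin1
```

The two results differ only in the first byte, which is exactly the case where code may silently depend on the accented-characters behavior.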

whitequark · 2014-12-12T16:42:36Z

@gasche Yes, that was the reason I did not put them in Misc.

Should I move/duplicate them or wait for more input?

alainfrisch · 2014-12-12T16:45:05Z

It might be useful to check that no OPAM package uses compilation units whose name starts with a non-ASCII letter. I assume and hope the answer is no.

gasche · 2014-12-12T16:45:35Z

> Should I move/duplicate them or wait for more input?

Well it's as you prefer. If you value the "at least the compiler is fixed in trunk now" aspect enough to put micro-work in splitting the commits, you want to do that. Otherwise, just wait -- but be patient, these things can stay silent for months. Both are fine options.

whitequark · 2014-12-12T16:46:37Z

@alainfrisch The answer is actually no. ocamldep does not print dependencies whose names start with non-'A'..'Z' -- this will be the topic of a later PR of mine.

Additionally, since Ubuntu, Github et al use UTF-8 and the compiler uses Latin-1, even if it did not, that would still fail the compilation.
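The encoding mismatch can be illustrated with a small sketch: applying a Latin-1 case mapping to the first byte of a UTF-8 encoded name rewrites a lead byte and corrupts the sequence. `latin1_uncapitalize` is a hypothetical helper written for this example; this is not claimed to be the exact code path that fails inside the compiler.

```ocaml
(* Hand-written Latin-1 lower-casing, for illustration only.
   Assumed ranges: A-Z plus 0xC0-0xD6 and 0xD8-0xDE. *)
let latin1_lowercase_char c =
  match c with
  | 'A' .. 'Z' | '\xC0' .. '\xD6' | '\xD8' .. '\xDE' ->
    Char.chr (Char.code c + 32)
  | _ -> c

(* Lower-case only the first byte, as an uncapitalize function would. *)
let latin1_uncapitalize s =
  String.mapi (fun i c -> if i = 0 then latin1_lowercase_char c else c) s

(* "Ecole" with an acute accent, encoded in UTF-8: the capital letter
   is the two-byte sequence 0xC3 0x89. *)
let ecole_utf8 = "\xC3\x89cole"

(* The lead byte 0xC3 is treated as a Latin-1 capital and becomes 0xE3,
   which announces a three-byte sequence that never completes. *)
let mangled = latin1_uncapitalize ecole_utf8

(* The ASCII-only variant does not touch bytes above 0x7F. *)
let untouched = String.uncapitalize_ascii ecole_utf8
```

With the ASCII-only function the UTF-8 bytes pass through unchanged, so a path or unit name survives the round trip.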

whitequark · 2014-12-12T16:49:10Z

@gasche I will put them in Misc. This will make it easier to decide on how/if to change the semantics of String.

damiendoligez · 2014-12-12T17:24:42Z

I'm in favor of adding them to stdlib, but we need to discuss the interface:

I don't like the names, I think capitalize_ascii would look better than ascii_capitalize
We need to discuss the idea of adding an optional argument to the existing functions instead of adding new functions (it looks cleaner but makes it harder to deprecate the old behavior).

whitequark · 2014-12-12T17:27:27Z

@gasche I tried to put them in Misc. It is not easy. There's a never-ending hole of dependencies all around the compiler.

@planar

I will rename them.
How about adding ?(latin1=true) ?

gasche · 2014-12-12T17:54:56Z

I think adding optional arguments is a bad way to evolve APIs. In this particular case, the default value would be the bad value for this parameter.
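For reference, the optional-argument variant floated above might look like the following sketch. The signature is hypothetical and is not what was merged; it only shows why the default matters: a caller who writes nothing gets the Latin-1 behaviour.

```ocaml
(* Hypothetical interface sketched from the discussion; not the merged API. *)
let uppercase_char ~latin1 c =
  match c with
  | 'a' .. 'z' -> Char.chr (Char.code c - 32)
  | '\xE0' .. '\xF6' | '\xF8' .. '\xFE' when latin1 ->
    Char.chr (Char.code c - 32)
  | _ -> c

(* Defaulting to [true] preserves the legacy semantics for existing callers. *)
let uppercase ?(latin1 = true) s = String.map (uppercase_char ~latin1) s

(* Existing call sites silently keep the Latin-1 behaviour: "\xC9T\xC9". *)
let default_call = uppercase "\xE9t\xE9"

(* Callers must opt in to the ASCII-only behaviour: "\xE9T\xE9". *)
let explicit_call = uppercase ~latin1:false "\xE9t\xE9"
```

Every caller that wants the safe behaviour has to remember to pass `~latin1:false`, which is the objection raised in the comment above.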

dbuenzli · 2014-12-12T19:48:17Z

@gasche If that's the bad value for the optional parameter, it means that you are ready to break compatibility?

If it's ok to break backward compatibility then why not move the old functions with the same name to a deprecated String.Latin1 module and keep the rest as is but now only acting on the ASCII set. Code relying on the old behaviour will only have to add Latin1 in front of a bunch of identifiers.

Another alternative (which I'm not very fond of) would be to have a full copy of the old String module in the deprecated String.Latin1 module and tell people to open String.Latin1 if they need the deprecated behaviour. The advantage and disadvantage of this solution is that backward compatibility can be maintained at the build system level with -open String.Latin1. But I'm not very fond of that command line option...
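A rough sketch of the first alternative, assuming a stand-in module named `String_sketch` (hypothetical, so that the real `String` module stays untouched): the existing name becomes ASCII-only and the old behaviour moves into a `Latin1` submodule.

```ocaml
(* Hypothetical stand-in for the stdlib String module under this proposal. *)
module String_sketch = struct
  (* Existing name kept, but it now only maps 'a'..'z'. *)
  let uppercase s = String.uppercase_ascii s

  (* Old Latin-1 behaviour parked in a submodule meant to be deprecated. *)
  module Latin1 = struct
    let uppercase s =
      String.map
        (function
          | ('a' .. 'z' | '\xE0' .. '\xF6' | '\xF8' .. '\xFE') as c ->
            Char.chr (Char.code c - 32)
          | c -> c)
        s
  end
end

(* Same identifier as before, new semantics: "CAF\xE9". *)
let new_default = String_sketch.uppercase "caf\xE9"

(* Code relying on the old behaviour adds the Latin1 prefix: "CAF\xC9". *)
let legacy = String_sketch.Latin1.uppercase "caf\xE9"
```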

dbuenzli · 2014-12-12T19:51:47Z

Oh btw. adding optional parameters to an API is an interface breaking change anyways.

whitequark · 2014-12-12T19:52:41Z

I think the required change is nowhere near invasive enough to warrant a full copy. In any case it is never worse than adding an ascii affix, and it is simple enough that you could even apply it with sed. (Not that you should.)

Personally I prefer changing the behavior of existing functions, given that it seems that breaking backwards compatibility is inevitable.

lpw25 · 2014-12-12T19:58:19Z

If we're going to break backwards compatibility, why not break it in the same way it was broken for Bytes. This would mean adding a Latin1 module to stdlib with the old implementations, and a latin1 type to the predefined types. Then a compiler option would decide whether string was equal to latin1 and whether latin1 characters were accepted in string literals.

We could also add a "foo"l literal form to create latin1 strings (which could be a first step towards allowing Unicode literals as expressions one day).

The benefit of this approach is that we use types to properly separate latin1 encoded strings from ascii encoded strings -- which should hopefully make bugs around incorrect use of the encodings less likely.
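The type-separation idea can be approximated in user code with an abstract type. This is only a library-level sketch under that assumption: the proposal itself involves a predefined type, a compiler option and a literal syntax, none of which are modelled here, and the `Latin1` module below is hypothetical.

```ocaml
(* Hypothetical library-level approximation of a distinct latin1 type. *)
module Latin1 : sig
  type t
  val of_string : string -> t   (* caller asserts the bytes are Latin-1 *)
  val to_string : t -> string
  val uppercase : t -> t
end = struct
  type t = string
  let of_string s = s
  let to_string s = s
  let uppercase s =
    String.map
      (function
        | ('a' .. 'z' | '\xE0' .. '\xF6' | '\xF8' .. '\xFE') as c ->
          Char.chr (Char.code c - 32)
        | c -> c)
      s
end

let name = Latin1.of_string "\xE9cole"

(* Only Latin1 functions accept a Latin1.t; passing [name] directly to
   String.uppercase_ascii would be rejected by the type checker. *)
let upper = Latin1.to_string (Latin1.uppercase name)
```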

whitequark · 2014-12-12T20:04:03Z

@lpw25 I think this is the best solution. We also know migration pitfalls reasonably well.

dbuenzli · 2014-12-12T20:23:22Z

@whitequark this as in this PR or this as in @lpw25 ? I personally think that @lpw25's solution is overkill for four functions acting on a legacy charset. If backward compatibility should not be broken on that I prefer deprecation of the current functions and the _ascii suffix way.

whitequark · 2014-12-12T20:25:33Z

@dbuenzli I was originally talking about @lpw25 but after consideration I realized I do not understand the ramifications of either breaking or not breaking compatibility well enough to have an opinion, so I'll let someone else figure out how it should be done.

damiendoligez · 2014-12-12T22:34:33Z

@dbuenzli Indeed, @lpw25's solution would be fine if Latin1 was here to stay, but we're trying to get rid of it so I don't see much point in adding that much complexity to the stdlib. Like you, I prefer the "deprecation and _ascii suffix" solution.

gasche · 2014-12-13T10:08:54Z

Thirded. I think the current patch is optimal (I would maybe add latin1_* functions as well, but it's a detail).


whitequark · 2014-12-14T08:06:06Z

@gasche I've updated the patch to include @planar's suggestion on naming. Otherwise it is unchanged. Does anything else need to be done?

gasche · 2014-12-14T09:47:10Z

If you also marked the legacy functions as deprecated this patch would solve Mantis PR#6695 as well.
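As a sketch of what marking a legacy function as deprecated can look like, the `ocaml.deprecated` attribute may be attached to a value in a signature. The module name `Case` and the message text are assumptions made for this example, not the wording used in the stdlib.

```ocaml
(* Hypothetical module showing a deprecated legacy function next to
   its ASCII-only replacement. *)
module Case : sig
  val uppercase : string -> string
  [@@ocaml.deprecated "Use Case.uppercase_ascii instead."]

  val uppercase_ascii : string -> string
end = struct
  (* Legacy Latin-1 behaviour, kept only for backward compatibility. *)
  let uppercase s =
    String.map
      (function
        | ('a' .. 'z' | '\xE0' .. '\xF6' | '\xF8' .. '\xFE') as c ->
          Char.chr (Char.code c - 32)
        | c -> c)
      s

  let uppercase_ascii s = String.uppercase_ascii s
end

(* Using the non-deprecated variant compiles without a warning. *)
let shouted = Case.uppercase_ascii "filename.ml"
```

Any remaining use of `Case.uppercase` outside the module triggers a deprecation warning at compile time, which is how stray uses of the legacy functions can be found.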

whitequark · 2014-12-14T10:53:43Z

@gasche Did you mean PR6694?

gasche · 2014-12-14T10:58:46Z

Yes indeed.


whitequark · 2014-12-14T11:14:34Z

Updated. The deprecation warnings actually uncovered a bug (a stray lowercase in Filename), and added some fallout, mainly in Str.

gasche · 2014-12-21T11:53:25Z

Merged in trunk. Thanks!

dbuenzli · 2014-12-21T12:11:12Z

@gasche given the recent #131 you may want to add @since directives to {Char,String}.{uppercase,lowercase}_ascii now.

whitequark · 2014-12-21T12:14:02Z

@dbuenzli Already done in trunk.

whitequark mentioned this pull request Dec 12, 2014

PR6692: Support for UTF-8 identifiers #125

Closed

PR6695: Add ASCII counterparts to case-mapping functions.

681af58

This updates Char, String, Bytes in the stdlib. For now, they are hidden from documentation and are only for internal compiler use.

whitequark added 5 commits December 14, 2014 14:12

PR6695: Make sure the compiler only uses ASCII string functions.

ea1bce2

This should cover all places involving filenames in the compiler. There are a few more paths still using Latin-1 in other ways, e.g. in ocamldoc.

PR6695: Make Filename use only US-ASCII functions.

19ac5b7

The only place that includes changes is the code for checking the suffix. It is highly unlikely that the change has any impact at all.

PR6694: Deprecate Latin-1 string manipulation functions.

97e8df0

Also, add documentation for the US-ASCII variants.

PR6694: Un-warn-error deprecation warnings in Str to preserve legacy …

7b21629

…behavior.

Update Changes.

58ed8bc

gasche closed this Dec 21, 2014

whitequark deleted the utf8-paths branch December 21, 2014 11:56

whitequark restored the utf8-paths branch December 21, 2014 11:56

whitequark deleted the utf8-paths branch December 21, 2014 11:56

vicuna mentioned this pull request Mar 14, 2019

Do not treat paths as encoded in ISO-8859-1 #6695

Closed

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024

Remove Blog section (ocaml#124)

54c825c
