Convert functions to use code points#15
Convert functions to use code points#15thomashoneyman merged 34 commits intopurescript-contrib:mainfrom
Conversation
|
@MonoidMusician Thanks for this PR! To be honest, I'm not super knowledgable about unicode. Would you be able to link me to something explaining the difference between Javascript Before I merge this, could you do the following things:
Also, purescript-parsing depends on purescript-unicode. After this is merged, would you be willing to send a small PR to purescript-parsing updating their uses of purescript-unicode? |
|
@michaelficarra, would it be possible for you to take a quick glance at this PR and make sure everything looks sane? It looks like you're the author of In case you haven't see |
|
Also, I wonder if it makes sense to change the module name from |
src/Data/Char/Unicode.purs
Outdated
| toUpper :: Char -> Char | ||
| toUpper = fromCharCode <<< uTowupper <<< toCharCode | ||
| toUpper :: CodePoint -> CodePoint | ||
| toUpper = modify uTowupper |
There was a problem hiding this comment.
toUpper can't be CodePoint -> CodePoint, it must be CodePoint -> String. But maybe that's something we should address separately, since I see it's currently broken anyway.
There was a problem hiding this comment.
Okay ... Is this because a some precomposed characters don't have direct upper case equivalents and would need to be decomposed?
There was a problem hiding this comment.
There was a problem hiding this comment.
Okay, cool ... I'm not familiar with the representation of the rules, so I wouldn't know how to fix that. Also, would we want to return String or Array CodePoint? (maybe the latter for the the Internal module?)
There was a problem hiding this comment.
Array CodePoint sounds good.
There was a problem hiding this comment.
currently it looks like toUpper does not change the eszett.
There was a problem hiding this comment.
@MonoidMusician could you also add some documentation to toUpper explaining why the type is CodePoint -> Array CodePoint?
Also, I'm wondering if it would be helpful to have a function like unsafeToUpper :: CodePoint -> CodePoint for people who want to easily map over a list of codepoints, without worrying about being 100% correct. Its use should be discouraged of course.
|
I don't like the use of |
|
@cdepillabout Thanks for the suggestions! According the the documentation for Prim.Char [1], it represents one UTF-16 code unit, i.e. all non-astral code points. But CodePoint is a newtype around Int so it can represent any Unicode value.
I should make a PR for |
After thinking about it some more, I think this might be a good change to make. It would make it possible for us to go back and re-add a The documentation for |
|
Why even have the |
Does "code unit" mean PureScript's I was thinking that there would be a lot of [beginner?] PureScript programmers that are using However, there will certainly be people that want to handle unicode correctly. They can use the However, if the rest of the PureScript community is moving to completely use |
7a544c3 to
99e1549
Compare
|
I will update this PR once this is merged and released: purescript/purescript-strings#92 It looks like the mapping of “ß” I've started my own standard parsing library so if I could probably attempt to convert our parser here to PS from awk: https://github.com/MonoidMusician/purescript-uievents-key/blob/master/test/Generate.purs Also should we generate it from the latest 10.0.0 instead of 6.0.0 linked in the README? I'm not sure how significant the changes are (more emoji anyone?? 😄) And my last point: I renamed it to Thanks for the feedback from both of you! |
Yes, if you want to provide case mapping functions, you should base them on that table.
Yes, but stay consistent within major releases of purescript-unicode.
No, see purescript/purescript-strings#95. |
|
It looks like there has been some discussion about moving to CodePoints on purescript/purescript-strings#95, so it seems like it might be a good time to revisit this issue! |
|
I'm going to need a response to this "quickly" or I'll make a 0.12 release with the old API and we'll need to make another breaking release to incorporate this PR. |
|
@MonoidMusician Does this PR need to be updated? Or can it be merged in as-is? |
|
I am starting to work on updating it :) Special casing might be a separate PR though ... maybe we should keep |
|
@MonoidMusician For anything that you do, make sure you do it against the compiler/0.12 branch. This package-set should have all the dependencies you need: https://github.com/kRITZCREEK/package-sets/tree/test-core EDIT: fromCharCode returns a Maybe now, but since this library is pretty low-level you might want to redefine a non-Maybe version with the FFI and use that whenever you're sure you'll stay within bounds. |
|
Hm, the dev dependency |
This is more accurate because`CodePoint` represents any unicode character, whereas `Char` only works with non-astral characters, due to its JavaScript representation.
They work with local modifications ...
|
I had to fix up the testing libraries locally, but the tests pass here now. @kritzcreek suggested using |
There was a problem hiding this comment.
This is really impressive, @MonoidMusician, and thank you for your work. I only have minor comments to make; on the whole it's good to see this modernized and ready for Unicode 13.0.0, as well as on a sounder footing with the switch to code points.
Admittedly, I'm not particularly well-versed in the subtleties of unicode, so perhaps if @michaelficarra has time for another look that would be helpful as well.
| @@ -1,10 +1,10 @@ | |||
| ----------------------------------------------------------- | |||
| -- This is an automatically generated file: do not edit | |||
| -- Generated by ubconfc at Fri Nov 10 20:05:16 PST 2017 | |||
| -- Generated by ubconfc at Tue Jan 12 18:57:20 CST 2021 | |||
There was a problem hiding this comment.
I don't have a good way to verify this file other than seeing that the tests still pass. From a scan through it looks fine to me. If you have any suggestions on verifying this, @MonoidMusician, I'm all ears!
There was a problem hiding this comment.
Also, we have some documentation that describes generating these files anew:
You've made some changes to how these are generated; if there's any information you think would be helpful to maintainers working on this in the future, please consider adding it to that documentation file as well. Thanks!
There was a problem hiding this comment.
Thanks, I missed that it was moved to there, will update.
| import Data.CodePoint.Unicode.Internal (bsearch, uTowlower, uTowtitle, uTowupper) | ||
| import Data.Maybe (Maybe(..)) | ||
|
|
||
| type CaseRec = |
There was a problem hiding this comment.
Could we add documentation comments to this and the declarations throughout the file, as this is a public module? Simple ones are fine, but they can help guide maintainers in the future as well.
There was a problem hiding this comment.
I moved it to a new Internal folder – users should not use these functions on Ints but instead the ones on CodePoints.
Co-authored-by: Thomas Honeyman <[email protected]>
Co-authored-by: Thomas Honeyman <[email protected]>
And update documentation for generating internal modules
Instead of hypothetical performance benefits
|
Okay, thanks for the feedback! I think I addressed what was addressable. |
thomashoneyman
left a comment
There was a problem hiding this comment.
👍 Looks good to me! Thanks for your work! If no other maintainer notices an issue before this must be merged in order to make the 0.14 release, then I will merge this PR. I'm going to leave it open in the meantime.
michaelficarra
left a comment
There was a problem hiding this comment.
Generally LGTM. What do you think about also exporting the Unicode version as a String?
Co-authored-by: Michael Ficarra <[email protected]>
|
Okay, I think I addressed your feedback, thanks! I don't think we need to export the Unicode version from the library, I just want to make sure it's available for documentation when someone goes looking for it. Unicode seems to be pretty stable so I don't think it affects most users. |
|
I just noticed that this fixes #28 (I used hexadecimal escapes). |
|
Fantastic work, @MonoidMusician! Thank you, and if anyone does happen to notice an issue we can merge a fix. |
This is more accurate because
CodePointrepresents any unicode character, whereasCharonly works with non-astral characters, due to its JavaScript representation. Right?I'll probably submit a PR to
-stringsto add an equivalent ofcharPointdirectly to Data.String.CodePoints, which will clean this up a bit.