Generalize StringLike to StreamLike fix #58#62
Generalize StringLike to StreamLike fix #58#62safareli wants to merge 26 commits intopurescript-contrib:mainfrom
Conversation
|
I was thinking on how to solve the issue but decided to actually open pr instead of writing this as comment. API of whitespace has changed. We can rename let me know what you think @paf31 |
src/Text/Parsing/Parser/String.purs
Outdated
| -- | | ||
| class StreamLike f c | f -> c where | ||
| uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: (Position -> Position) } | ||
| drop :: Prefix f -> f -> Maybe { rest :: f, updatePos :: (Position -> Position) } |
There was a problem hiding this comment.
We can name it stripPrefix
src/Text/Parsing/Parser/String.purs
Outdated
| uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: (Position -> Position) } | ||
| drop :: Prefix f -> f -> Maybe { rest :: f, updatePos :: (Position -> Position) } | ||
|
|
||
| instance stringLikeString :: StreamLike String Char where |
There was a problem hiding this comment.
stringLikeString here should be streamLikeString? Same below.
src/Text/Parsing/Parser/String.purs
Outdated
| instance listcharLikeString :: (Eq a, HasUpdatePosition a) => StreamLike (L.List a) a where | ||
| uncons f = L.uncons f <#> \({ head, tail}) -> | ||
| { head: head, updatePos: (_ `updatePos` head), tail} | ||
| drop (Prefix p') s' = case (tailRecM3 go p' s' id) of -- no MonadRec for Maybe |
There was a problem hiding this comment.
Maybe it's worth adding stripPrefix to Data.List?
There was a problem hiding this comment.
Should we define it like in String ? (ie add Pattern type)?
There was a problem hiding this comment.
I'm not sure. I'm tempted to say no, but if you'd like to open a PR, we can discuss it.
There was a problem hiding this comment.
src/Text/Parsing/Parser/String.purs
Outdated
| -- | Instances must satisfy the following laws: | ||
| -- | | ||
| class StreamLike f c | f -> c where | ||
| uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: (Position -> Position) } |
There was a problem hiding this comment.
Parens are redundant here around the type of updatePos.
|
Sorry for the delay. I really like this, once the tailrec and list dependencies are updated, I'll merge this and make a major release. Thanks! |
src/Text/Parsing/Parser/String.purs
Outdated
| cs <- many $ satisfy \c -> c == '\n' || c == '\r' || c == ' ' || c == '\t' | ||
| pure $ fromCharArray cs | ||
| -- | Match many whitespace characters. | ||
| whiteSpace :: forall f m g. StreamLike f Char => Unfoldable g => Monoid f => Monad m => ParserT f m (g Char) |
There was a problem hiding this comment.
are you fine with the signature?
src/Text/Parsing/Parser/String.purs
Outdated
|
|
||
| -- | Match a whitespace characters but returns them as Array. | ||
| whiteSpace' :: forall f m. StreamLike f Char => Monad m => ParserT f m (Array Char) | ||
| whiteSpace' = many $ satisfy \c -> c == '\n' || c == '\r' || c == ' ' || c == '\t' |
There was a problem hiding this comment.
Are we sure Data.Array.many is fine here? maybe we should use catenable list or something similar.
src/Text/Parsing/Parser/String.purs
Outdated
|
|
||
| -- | Match end-of-file. | ||
| eof :: forall s m. StringLike s => Monad m => ParserT s m Unit | ||
| -- | |
There was a problem hiding this comment.
this description is outdated
There was a problem hiding this comment.
are we fine with description and the law?
|
we should update descriptions here and possibly remove some of combinators from Token |
|
👍 Looks good to me, thanks! I'd be fine with removing overlapping functionality from @garyb Any comments before I merge this? Obviously it'll be a breaking change. |
|
Looks like it's conflicting currently? |
|
LGTM if it LGTY though! |
|
@safareli Can you please merge with what's on master? |
|
@paf31 Will tackle this on weekend. |
instead String{anyChar,satisfy,char} chould be used
src/Text/Parsing/Parser/String.purs
Outdated
| -- | | ||
| class StreamLike f c | f -> c where | ||
| uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: Position -> Position } | ||
| drop :: Prefix f -> f -> Maybe { rest :: f, updatePos :: Position -> Position } |
There was a problem hiding this comment.
Are we fine with this name? stipPrefix could be used instead too
src/Text/Parsing/Parser/String.purs
Outdated
|
|
||
| -- | Match the specified string. | ||
| string :: forall s m. StringLike s => Monad m => String -> ParserT s m String | ||
| -- | Match the specified stream. |
There was a problem hiding this comment.
Are we fine wis descriptions?
We can use function names from Token instead of char and anyChar.
There was a problem hiding this comment.
Yes, I think that's better. So maybe prefix or something?
There was a problem hiding this comment.
ok will use:
matchforchartokenforanyCharprefixforstring
Will also rename drop from StreamLike to stipPrefix so it lines up with prefix.
src/Text/Parsing/Parser/String.purs
Outdated
| cs <- many $ satisfy \c -> c == '\n' || c == '\r' || c == ' ' || c == '\t' | ||
| pure $ fromCharArray cs | ||
| -- | Match many whitespace character in some Unfoldable. | ||
| whiteSpace :: forall f m g. StreamLike f Char => Unfoldable g => Monoid f => Monad m => ParserT f m (g Char) |
There was a problem hiding this comment.
We can remove this function as it's still braking change
There was a problem hiding this comment.
Remind me again why we'd need to remove it?
There was a problem hiding this comment.
If you are operating on list of some tokens you most likely are not gonna use it.
Major use case of this would be to get String as result, but String is not Unfoldable, so you would still need to map over it with stringFromChars.
I think we can just returning Array Char is fine, and if client wants a string they can map over it (as they would need to do it any ways).
If you agree i would remove this function and rename whitespace' to whitespace (this way we wouldn't have two whitespace functions)
|
@paf31 can you take a look, I have renamed StreamLike to Stream and added |
src/Text/Parsing/Parser/Stream.purs
Outdated
| class StreamLike f c | f -> c where | ||
| uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: Position -> Position } | ||
| stripPrefix :: Prefix f -> f -> Maybe { rest :: f, updatePos :: Position -> Position } | ||
| class StreamLike s m t | s -> t where |
There was a problem hiding this comment.
I'm wondering if we should add s -> m here too.
There was a problem hiding this comment.
not sure, in parsec this is how it looks class (Monad m) => Stream s m t | s -> t
|
Looks good to me! Just one comment about that potential fundep. What do you think? Also, could you please add a small test for a custom stream type which involves a non-trivial monad - |
|
Will try to add tests using |
|
@safareli Any more thoughts on testing? If you don't have time, I'm happy to merge this and leave it until later. |
|
this comment made me think that maybe if you think current signature is fine we can merge it. |
|
@paf31 I have addressed last issue I had can you take a look? |
s -> m (Maybe { head :: t, tail :: s, updatePos :: Position -> Position })
instead of having updatePos as a result of uncons or stripPrefix now this operations take position with input which is part of a parser state. this way we should allocation less of intermediate objects.
|
Closing due to lack of activity. |
|
Also, it sounded like the design needed to be thought through more to account for |
It's exactly what was stated in the last comment of this PR #62 (comment) |
I skimmed through I'll keep this PR open. |
|
I'm not actively doing PS any more (for now), so this could be closed. (if anyone is up for taking this on they can just checkout from this branch or copy paste code) |
Correctly handle UTF-16 surrogate pairs in `String`s. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. Move the character class parsers from `Text.Parsing.Parser.Token` module into the `Text.Parsing.Parser.String` module. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
Correctly handle UTF-16 surrogate pairs in `String`s. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. Move the character class parsers from `Text.Parsing.Parser.Token` module into the `Text.Parsing.Parser.String` module. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
Correctly handle UTF-16 surrogate pairs in `String`s. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. Move the character class parsers from `Text.Parsing.Parser.Token` module into the `Text.Parsing.Parser.String` module. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
Correctly handle UTF-16 surrogate pairs in `String`s. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Split dev dependencies into spago-dev.dhall. Add benchmark suite. Add astral UTF-16 test. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
Correctly handle UTF-16 surrogate pairs in `String`s. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Split dev dependencies into spago-dev.dhall. Add benchmark suite. Add astral UTF-16 test. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
|
In v7.0.0 we eliminated the |
fix #58