Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

preg_match missing array shape for pattern #11323

Closed
Tracked by #12045
staabm opened this issue Jul 11, 2024 · 26 comments
Closed
Tracked by #12045

preg_match missing array shape for pattern #11323

staabm opened this issue Jul 11, 2024 · 26 comments
Labels

Comments

@staabm
Copy link
Contributor

staabm commented Jul 11, 2024

Bug report

for some reason the snippet does not get a proper array shape.

hoa reports

Unexpected token "?" (zero_or_one) at line 1 and column 10:/Price: (?P<currency>£|€)\d+/

while it works at php runtime: https://3v4l.org/rugXe

grafik

Code snippet that reproduces the problem

https://phpstan.org/r/3a6f6eef-9ecc-4304-b707-b9a688d7f48c

Expected output

expectations in the snippet

Did PHPStan help you today? Did it make you happy in any way?

No response

@staabm
Copy link
Contributor Author

staabm commented Jul 11, 2024

@mvorisek could you have a look?

@Seldaek
Copy link
Contributor

Seldaek commented Jul 11, 2024

Note that both (?P<groupname>pattern) and without P (?<groupname>pattern) are both supported https://3v4l.org/qJI26

@mvorisek
Copy link
Contributor

@mvorisek could you have a look?

I can help with grammar, but my time is very limited and creating unit tests would help me - phpstan/phpstan-src#2589 (comment)

@staabm
Copy link
Contributor Author

staabm commented Jul 11, 2024

@mvorisek put the contents of https://phpstan.org/r/3a6f6eef-9ecc-4304-b707-b9a688d7f48c
into a file, e.g. test.php in the root of the phpstan-src repo and invoke php bin/phpstan analyze test.php --debug --xdebug
and put a breakpoint into RegexArrayShapeMatcher->parseGroups

I can build a proper unit test when the fix is made

@mvorisek
Copy link
Contributor

mvorisek commented Jul 11, 2024

I will look into the hoa regex grammar fixes.

Is (?P currently the only known unsupported grammar?

@Seldaek
Copy link
Contributor

Seldaek commented Jul 11, 2024

@staabm
Copy link
Contributor Author

staabm commented Jul 11, 2024

@staabm
Copy link
Contributor Author

staabm commented Jul 11, 2024

@staabm
Copy link
Contributor Author

staabm commented Jul 11, 2024

@mvorisek thanks for looking into it. the suggested patch leads to errors (regressions), see phpstan/phpstan-src#3228

@staabm
Copy link
Contributor Author

staabm commented Jul 11, 2024

@mvorisek I might have spoken to fast and needs another look from my side. I mixed something up

@Seldaek
Copy link
Contributor

Seldaek commented Jul 11, 2024

Found another grammar issue @mvorisek https://phpstan.org/r/94070d37-7a16-4487-9098-f2a5677dfd43

The issue is that () are probably seen as a match group here when they are not escaped, but they are within a character class [] so escaping them is not needed (only - if it is between characters, and ], need to be escaped in character classes, no other character has special meaning).

See https://3v4l.org/m0Vrv for runtime proof :)

@mvorisek
Copy link
Contributor

I will look into that.

@Seldaek
Copy link
Contributor

Seldaek commented Jul 12, 2024

@staabm I think it might be fixable by changing this line https://github.com/hoaproject/Regex/blob/master/Source/Grammar.pp#L168

to this:

    ( <class_> | range() | literal() | ::capturing_:: | ::_capturing:: )+ <range>?

@mvorisek
Copy link
Contributor

mvorisek commented Jul 12, 2024

@phpstan-bot
Copy link
Contributor

@Seldaek After the latest push in 1.12.x, PHPStan now reports different result with your code snippet:

@@ @@
  9: Dumped type: array<string>
-13: Dumped type: array{0: string, name: string, 1: string, email?: string, 2?: string}
+13: Dumped type: array<string>
 18: Dumped type: array<string>
 22: Dumped type: array{string}
Full report
Line Error
9 Dumped type: array<string>
13 Dumped type: array<string>
18 Dumped type: array<string>
22 Dumped type: array{string}

@Seldaek
Copy link
Contributor

Seldaek commented Jul 12, 2024

From PCRE:

   Part  of  a  pattern  that is in square brackets is called a "character
   class". In a character class the only metacharacters are:

     \      general escape character
     ^      negate the class, but only if the first character
     -      indicates character range
     [      POSIX character class (only if followed by POSIX
              syntax)
     ]      terminates the character class
  • \ is handled already
  • ^ is handled too
    • is handled
  • [ for POSIX is not handled it seems but I'd say irrelevant, these are [:digit:] which are not so widely used these days I think
  • ] is handled

Everything else should be just parsed as literal in there.

So probably the better option is to add a new token to parse all the leftover characters:

%token class_literal [^\\\[\]-]

Then again change line 168 to include that as last resort:

    ( <class_> | range() | literal() | ::class_literal:: )+ <range>?

I am not super familiar with the syntax here so hopefully there is no mistake, but I think if this passes tests and it can parse /[[()+|.*?]/ as a character class matching any of [()+|.*? characters then it should be pretty good.

@Seldaek
Copy link
Contributor

Seldaek commented Jul 12, 2024

@mvorisek
Copy link
Contributor

As noted above, https://regex101.com/r/pOq5Jj/1 should be parsed as inner class, otherwise we might be limited in the future when we will analyse classes (like numerical string).

@Seldaek
Copy link
Contributor

Seldaek commented Jul 12, 2024

Where did you note this above? And well, right now [:digit:] already causes the parser to crap out, so my patch wouldn't break it any further.. It doesn't fix that though indeed. You could fix it by adding another %token posix_class \[^?:[a-z]+:\] and include that too in ( ::posix_class:: | <class_> | range() | literal() | ::class_literal:: )+ <range>? I think it has to go first before class_ there because that matches a single [ and it'd fail to consume the full posix class then.

Anyway if you want a fully compliant parser it might be better to go with https://github.com/bkiers/pcre-parser/tree/master/src/main/antlr4/nl/bigo/pcreparser + https://packagist.org/packages/antlr/antlr4-php-runtime or something else than retrofitting everything that's missing onto hoa/regex? But in the meantime I'd still be happy if this additional fix can be applied if it indeed works, as it should solve the last chunk of problem I've encountered so far.

@staabm
Copy link
Contributor Author

staabm commented Jul 14, 2024

I have no idea what to make from your comments.

If there is a grammar change I should apply please advise and give me a test-sample to verify the change

Thank you both

@Seldaek
Copy link
Contributor

Seldaek commented Jul 14, 2024

@staabm what I'd do is add this below https://github.com/hoaproject/Regex/blob/master/Source/Grammar.pp#L45-L49 :

%token class_literal [^\\\[\]-]
%token posix_class \[^?:[a-z]+:\]

then change https://github.com/hoaproject/Regex/blob/master/Source/Grammar.pp#L168 to:

    ( ::posix_class:: | <class_> | range() | literal() | ::class_literal:: )+ <range>?

Or this, I am not sure what the difference between <x> and ::x:: is there:

    ( <posix_class> | <class_> | range() | literal() | <class_literal> )+ <range>?

That should make it so that patterns like /([*|+?{}()]+)([^*|+[:digit:]?{}()]+)/ which fail completely to be parsed right now outputs an array{string, string, string} type.

If the patch fails maybe you can create a draft PR with instructions how to run it then I can try to figure it out here?

@staabm
Copy link
Contributor Author

staabm commented Jul 15, 2024

@Seldaek I have put the suggested change with a unit test into phpstan/phpstan-src#3241

atm it doesn't work though

@phpstan-bot
Copy link
Contributor

@Seldaek After the latest push in 1.12.x, PHPStan now reports different result with your code snippet:

@@ @@
 10: Expected type array{0: string, currency: string, 1: string}, actual: array<string>
-14: Expected type array{}|array{0: string, currency: string, 1: string}, actual: array<string>
+14: Expected type array{}|array{0: string, currency: string, 1: string}, actual: array<string>
+17: Expected type array{0: string, currency: string, 1: string}, actual: array{0: string, currency: non-empty-string, 1: non-empty-string}
+21: Expected type array{}|array{0: string, currency: string, 1: string}, actual: array{}|array{0: string, currency: non-empty-string, 1: non-empty-string}
Full report
Line Error
10 Expected type array{0: string, currency: string, 1: string}, actual: array<string>
14 `Expected type array{}
17 Expected type array{0: string, currency: string, 1: string}, actual: array{0: string, currency: non-empty-string, 1: non-empty-string}
21 `Expected type array{}

@staabm
Copy link
Contributor Author

staabm commented Jul 18, 2024

I think this one is good enough for now

@staabm staabm closed this as completed Jul 18, 2024
Copy link

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants