feat: Introduce a DialectProviderInterface matching the modern cucumber API by stof · Pull Request #350 · Behat/Gherkin

stof · 2025-05-23T11:51:42Z

~~When configuring the Lexer with a dialect provider, invalid language tags in the parsed files will fail instead of silently using English keywords, matching the cucumber behavior. Refs #156~~
~~When configuring the Lexer with a KeywordsInterface, the existing silent usage of English will still be done (as the Keywords implementation does that internally).~~
Failing for invalid languages during parsing will be implemented in a follow-up PR based on the compatibility mode.

An implementation is provided based on the cucumber gherkin-languages.json file, which is now shipped in the package
unmodified (but is considered an internal implementation detail of the new class as we don't expose the path).
No configurable implementation is provided in the core. The DialectProviderInterface is simple enough to implement fully based on your needs. I expect that reading translations from a file with the same structure as the cucumber gherkin-languages.json is not the use case for custom dialect providers, and even in that case it is dead simple to implement.
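
To illustrate how small a custom provider can be, here is a minimal sketch. It does not use the real classes: `SketchDialect`, `SketchDialectProvider` and `InMemoryDialectProvider` are hypothetical stand-ins defined inline so the example runs on its own, and the two method names are an assumption about the shape of the interface, not a copy of it. Check the actual `DialectProviderInterface` before adapting this.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a dialect: only step keywords are modelled here.
final class SketchDialect
{
    /** @param list<string> $stepKeywords */
    public function __construct(
        public readonly string $language,
        public readonly array $stepKeywords,
    ) {
    }
}

// Hypothetical stand-in for the provider contract (method names are assumed).
interface SketchDialectProvider
{
    public function getDialect(string $language): SketchDialect;

    public function getDefaultDialect(): SketchDialect;
}

// A provider backed by a plain array instead of a JSON file.
final class InMemoryDialectProvider implements SketchDialectProvider
{
    /** @param array<string, list<string>> $stepKeywordsByLanguage */
    public function __construct(
        private readonly array $stepKeywordsByLanguage,
        private readonly string $defaultLanguage = 'en',
    ) {
    }

    public function getDialect(string $language): SketchDialect
    {
        // Fail loudly for unknown languages instead of falling back to English.
        if (!isset($this->stepKeywordsByLanguage[$language])) {
            throw new InvalidArgumentException(sprintf('Language "%s" is not known.', $language));
        }

        return new SketchDialect($language, $this->stepKeywordsByLanguage[$language]);
    }

    public function getDefaultDialect(): SketchDialect
    {
        return $this->getDialect($this->defaultLanguage);
    }
}

$provider = new InMemoryDialectProvider([
    'en' => ['Given ', 'When ', 'Then '],
    'fr' => ['Soit ', 'Quand ', 'Alors '],
]);
```

Throwing for an unknown language is a design choice of this sketch; it mirrors the stricter behavior discussed above rather than the silent English fallback of the Keywords API.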

Closes #203

This PR does not provide a replacement for the KeywordsDumper feature for now (which is more a dumper for an example feature). This will be handled in a separate PR.
This PR also does not mark the KeywordsInterface as deprecated yet, but this is expected to happen once the replacement for the KeywordsDumper is available.

codecov · 2025-05-23T11:52:26Z

Codecov Report

Attention: Patch coverage is 97.58454% with 5 lines in your changes missing coverage. Please review.

Project coverage is 96.10%. Comparing base (fad95fa) to head (ba7f752).
Report is 10 commits behind head on master.

✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/Dialect/GherkinDialect.php	94.73%	2 Missing ⚠️
src/Keywords/DialectKeywords.php	94.59%	2 Missing ⚠️
src/Lexer.php	98.70%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master     #350      +/-   ##
============================================
- Coverage     96.82%   96.10%   -0.72%     
- Complexity      581      631      +50     
============================================
  Files            37       42       +5     
  Lines          1730     1875     +145     
============================================
+ Hits           1675     1802     +127     
- Misses           55       73      +18

Flag	Coverage Δ
php8.1	`96.10% <97.58%> (-0.72%)`	⬇️
php8.1--with=symfony/yaml:^5.4	`96.10% <97.58%> (-0.72%)`	⬇️
php8.1--with=symfony/yaml:^6.4	`96.10% <97.58%> (-0.72%)`	⬇️
php8.2	`96.10% <97.58%> (-0.72%)`	⬇️
php8.3	`96.10% <97.58%> (-0.72%)`	⬇️
php8.4	`96.10% <97.58%> (-0.72%)`	⬇️


stof · 2025-05-23T11:53:46Z

.github/workflows/update.yml

      - run: composer update
        if: steps.cucumber.outputs.cucumber_updated == 'yes'

+      - run: cp vendor/cucumber/gherkin-monorepo/gherkin-languages.json resources/


I intentionally kept this copy step separate from the bin/update_i18n script so that we can drop bin/update_i18n when we drop the i18n.php file (when removing the Keywords API in the next major version).

stof · 2025-05-23T14:31:33Z

Due to the Lexer still being extensible, I kept 2 unused protected methods (marking them deprecated) in case a child class uses them. As the Lexer never uses them anymore, I've excluded them from code coverage.

stof · 2025-05-23T19:46:57Z

The code coverage reduction is because ArrayKeywords::getStepKeywords becomes uncovered, as it was only ever called by the Lexer (which now relies on the Dialect API instead).

src/Dialect/KeywordsDialectProvider.php

stof

ParserTest and ParserExceptionsTest are still using the Keywords API as they were using locally-defined translations, not the i18n.php containing the official gherkin translations. As we are in the process of deleting those tests to replace them with compatibility tests contributed to cucumber, I decided it was not worth validating that the translations were still in sync.

src/Lexer.php

stof · 2025-05-23T20:14:28Z

src/Lexer.php

+
+        $this->allowFeature = false;
+        $this->allowLanguageTag = false;
+        $this->allowMultilineArguments = false;


Unlike the old scanInputForKeywords, which contained a bunch of conditions on the type to decide which allow flag should be toggled, I moved this to each of the specific methods. I personally find this clearer.

Much clearer, thanks

src/Dialect/GherkinDialect.php

src/Keywords/DialectKeywords.php

src/Lexer.php

acoulton · 2025-05-24T09:09:00Z

src/Lexer.php

+        $trimmedLine = $this->getTrimmedLine();
+        $matchedKeyword = null;
+
+        foreach ($this->currentDialect->getStepKeywords() as $keyword) {


Is there a potential performance hit here, compared to the previous regex approach? There are quite a lot of step keyword variants to check for every line of every feature.

This is exactly the way cucumber/gherkin implements it, but I haven't done a benchmark yet.
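
For reference, the loop-based matching under discussion can be sketched as a standalone function. This is only an illustration of the technique, not the Lexer's code: `matchStepKeyword` is a hypothetical helper, and preferring the longest matching keyword is an assumption made here so that a keyword which is a prefix of another one does not win.

```php
<?php

declare(strict_types=1);

/**
 * Returns the step keyword that starts the line, or null when none matches.
 * Assumption of this sketch: the longest matching keyword is preferred.
 *
 * @param list<string> $stepKeywords
 */
function matchStepKeyword(string $line, array $stepKeywords): ?string
{
    $trimmedLine = ltrim($line);
    $matchedKeyword = null;

    foreach ($stepKeywords as $keyword) {
        // One cheap prefix check per keyword instead of one large regex.
        if (!str_starts_with($trimmedLine, $keyword)) {
            continue;
        }

        if ($matchedKeyword === null || strlen($keyword) > strlen($matchedKeyword)) {
            $matchedKeyword = $keyword;
        }
    }

    return $matchedKeyword;
}

$keywords = ['* ', 'Given ', 'When ', 'Then ', 'And ', 'But '];
$matched = matchStepKeyword('    Given I am on the homepage', $keywords);
$unmatched = matchStepKeyword('Feature: Login', $keywords);
```

The cost is linear in the number of step keywords for each line, which is the concern raised above; whether that is slower in practice than the previous regex would need a benchmark.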

acoulton · 2025-05-24T09:11:59Z

@stof this is broadly looking great so far, I ran out of time to get to the end of it just now but have left some initial small comments / thoughts.

src/Lexer.php

tests/Cucumber/CompatibilityTest.php

acoulton · 2025-05-24T21:29:31Z

Looked over the rest of this now @stof, this is looking really good thanks.

stof · 2025-05-26T10:54:16Z

tests/TranslationTest.php

+        try {
+            $parsed = $parser->parse($source, __DIR__ . DIRECTORY_SEPARATOR . $language . '_' . ($num + 1) . '.feature');
+        } catch (\Throwable $e) {
+            throw new \RuntimeException($e->getMessage() . ":\n" . $source, 0, $e);


PHPUnit truncates the test parameters, so the source (which is far longer than the truncation limit) cannot be read from the failure output.
Anyway, that's what we already do in our KeywordsTestCase (from which I copied a big part of the code).
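
The wrapping pattern in the hunk above can be shown in isolation. This is a minimal sketch, assuming a hypothetical helper `parseWithSourceOnFailure` that takes any callable in place of the real parser; it only demonstrates how the full source ends up in the exception message while the original exception is kept as the previous one.

```php
<?php

declare(strict_types=1);

/**
 * Runs $parse on $source and, on failure, rethrows with the full source
 * appended to the message so it is visible in the test output.
 * Hypothetical helper: the callable stands in for the real parser.
 */
function parseWithSourceOnFailure(callable $parse, string $source): mixed
{
    try {
        return $parse($source);
    } catch (\Throwable $e) {
        // Keep the original exception as "previous" to preserve its trace.
        throw new \RuntimeException($e->getMessage() . ":\n" . $source, 0, $e);
    }
}

// Stand-in parser that always fails, to show the resulting message.
$failingParser = static function (string $source): never {
    throw new \LogicException('Unexpected token');
};

$message = null;
$previous = null;
try {
    parseWithSourceOnFailure($failingParser, "Feature: Demo\n  Scenario: Broken");
} catch (\RuntimeException $e) {
    $message = $e->getMessage();
    $previous = $e->getPrevious();
}
```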

tests/TranslationTest.php

src/Dialect/KeywordsDialectProvider.php

src/Exception/NoSuchLanguageException.php

acoulton

Thanks @stof, this is looking great. My only outstanding question is re a legacy class calling the now-deprecated getKeywords('Step') - I am possibly misunderstanding, but I think the regex that method returns has changed in a way that would change behaviour of the extension class (which will not necessarily be calling ltrim on the part after the matched keyword)?

stof · 2025-05-26T15:21:53Z

Good catch indeed. I'll fix that.

ArrayKeywords makes its language default to `en` and assumes that there is data available for it without checking. When this is not the case, calling its methods before setting another language will trigger a PHP warning and return `null` (in methods where the phpdoc says it must return a string). The ArrayKeywords test is currently passing only because the current way the parser works avoids this broken state for the particular case being used in the test, but this is not guaranteed to stay true.

This test is similar to the CachedArrayKeywords test but relies on the new Dialect API and covers translations defined in the gherkin-languages.json file.

Code coverage does not count code executed in data providers as being covered.

acoulton

Great, thanks @stof

stof requested a review from acoulton May 23, 2025 11:51

This was referenced May 23, 2025

Improve type-hinting of the lexer and parser #344

Merged

[RFC] Deprecate extending the lexer and parser #347

Closed

stof force-pushed the dialect_api branch 6 times, most recently from c58140e to 5495d19 Compare May 23, 2025 14:29

stof force-pushed the dialect_api branch 2 times, most recently from 2d98f28 to 265b9e8 Compare May 23, 2025 19:44

stof force-pushed the dialect_api branch from 265b9e8 to dbbbb83 Compare May 23, 2025 20:30