Commit fcf8d2a

Erik Corry authored and V8 LUCI CQ committed
[regexp] Improvements to Unicode case independent.
Case-desugar the regexp later for better performance.

We used to expand into case variants at a very early stage, in the parser. This has two disadvantages. Firstly, it means we do all the case work even when just checking the syntax of a regexp, which happens in multiple places in order to give early errors. Secondly, it disables some of the optimizations that Irregexp can do, because now instead of literal strings we have sequences of character classes, which are not treated in the same way.

For character classes we have to desugar early to get the specified semantics, but for literal texts outside of character classes we can mostly go back to the old behaviour that is used without the /u and /v flags. The exception is for surrogate pairs (code points 0x10000 and above) where we still desugar early. Luckily there are no alpha letters below 0x10000 that have case equivalents above 0x10000.

Running https://gist.github.com/erikcorry/dd5b08dd5abdf4f592628dd08db17701 I get the following output:

Before: With 7000 terms, took 1674ms, 253ms, 28ms, 26.857142857142858ms per iteration
After: With 7000 terms, took 332ms, 145ms, 0ms, 0ms per iteration

Bug: 40261789
Change-Id: I16dca4830c3145f436f537c1db1a0de6def9045c
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/5643119
Reviewed-by: Patrick Thier <[email protected]>
Commit-Queue: Erik Corry <[email protected]>
Reviewed-by: Toon Verwaest <[email protected]>
Cr-Commit-Position: refs/heads/main@{#94770}
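
To make the trade-off in the commit message concrete, here is a minimal toy sketch of the two strategies. It is not V8 or Irregexp code: the function names (DesugarEarly, MatchClassesAt, MatchLiteralAt) are hypothetical, and it handles ASCII only, using std::tolower as a stand-in for real Unicode canonicalization. Early desugaring turns each literal character into a small character class, while the late approach keeps the literal string intact and canonicalizes only when comparing.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model (ASCII only), not V8 code.

// "Early" strategy: expand every literal character into a class holding
// all of its case variants, e.g. "ab1" becomes {"aA", "bB", "1"}.
std::vector<std::string> DesugarEarly(const std::string& literal) {
  std::vector<std::string> classes;
  for (unsigned char c : literal) {
    std::string cls(1, static_cast<char>(c));
    if (std::isalpha(c)) {
      int other = std::islower(c) ? std::toupper(c) : std::tolower(c);
      cls.push_back(static_cast<char>(other));
    }
    classes.push_back(cls);
  }
  return classes;
}

// Matches a sequence of character classes against subject at pos.
bool MatchClassesAt(const std::vector<std::string>& classes,
                    const std::string& subject, std::size_t pos) {
  if (pos > subject.size() || subject.size() - pos < classes.size()) {
    return false;
  }
  for (std::size_t i = 0; i < classes.size(); ++i) {
    if (classes[i].find(subject[pos + i]) == std::string::npos) return false;
  }
  return true;
}

// "Late" strategy: the literal stays a plain string, and both sides are
// canonicalized (here: lowercased) only at comparison time.
bool MatchLiteralAt(const std::string& literal, const std::string& subject,
                    std::size_t pos) {
  if (pos > subject.size() || subject.size() - pos < literal.size()) {
    return false;
  }
  for (std::size_t i = 0; i < literal.size(); ++i) {
    int a = std::tolower(static_cast<unsigned char>(literal[i]));
    int b = std::tolower(static_cast<unsigned char>(subject[pos + i]));
    if (a != b) return false;
  }
  return true;
}
```

Both functions accept the same inputs in this toy model, but the late variant never builds the per-character classes, so a syntax-only check would pay nothing for case handling, and the literal remains available as a string for literal-specific optimizations.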
1 parent 5ad5602 commit fcf8d2a

7 files changed: +145 -93 lines changed

src/regexp/regexp-compiler-tonode.cc

Lines changed: 6 additions & 5 deletions
@@ -26,7 +26,6 @@ using namespace regexp_compiler_constants; // NOLINT(build/namespaces)
 constexpr base::uc32 kMaxCodePoint = 0x10ffff;
 constexpr int kMaxUtf16CodeUnit = 0xffff;
 constexpr uint32_t kMaxUtf16CodeUnitU = 0xffff;
-constexpr int32_t kMaxOneByteCharCode = unibrow::Latin1::kMaxChar;

 // -------------------------------------------------------------------
 // Tree to graph conversion
@@ -424,6 +423,7 @@ RegExpNode* UnanchoredAdvance(RegExpCompiler* compiler,
 } // namespace

 // static
+// Only for /ui and /vi, not for /i regexps.
 void CharacterRange::AddUnicodeCaseEquivalents(ZoneList<CharacterRange>* ranges,
                                                Zone* zone) {
 #ifdef V8_INTL_SUPPORT
@@ -1450,6 +1450,7 @@ void CharacterRange::AddClassEscape(StandardCharacterSet standard_character_set,
 }

 // static
+// Only for /i, not for /ui or /vi.
 void CharacterRange::AddCaseEquivalents(Isolate* isolate, Zone* zone,
                                         ZoneList<CharacterRange>* ranges,
                                         bool is_one_byte) {
@@ -1465,8 +1466,8 @@ void CharacterRange::AddCaseEquivalents(Isolate* isolate, Zone* zone,
     // Nothing to be done for surrogates.
     if (from >= kLeadSurrogateStart && to <= kTrailSurrogateEnd) continue;
     if (is_one_byte && !RangeContainsLatin1Equivalents(range)) {
-      if (from > kMaxOneByteCharCode) continue;
-      if (to > kMaxOneByteCharCode) to = kMaxOneByteCharCode;
+      if (from > String::kMaxOneByteCharCode) continue;
+      if (to > String::kMaxOneByteCharCode) to = String::kMaxOneByteCharCode;
     }
     others.add(from, to);
   }
@@ -1508,8 +1509,8 @@ void CharacterRange::AddCaseEquivalents(Isolate* isolate, Zone* zone,
     // Nothing to be done for surrogates.
     if (bottom >= kLeadSurrogateStart && top <= kTrailSurrogateEnd) continue;
     if (is_one_byte && !RangeContainsLatin1Equivalents(range)) {
-      if (bottom > kMaxOneByteCharCode) continue;
-      if (top > kMaxOneByteCharCode) top = kMaxOneByteCharCode;
+      if (bottom > String::kMaxOneByteCharCode) continue;
+      if (top > String::kMaxOneByteCharCode) top = String::kMaxOneByteCharCode;
     }
     unibrow::uchar chars[unibrow::Ecma262UnCanonicalize::kMaxWidth];
     if (top == bottom) {
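
The two hunks above apply the same range-trimming step when the subject string is one-byte. As a standalone illustration, it might be sketched as follows. This is not the V8 implementation: ClampToOneByte is a hypothetical helper, the constant is assumed to be 0xff (the Latin-1 maximum that String::kMaxOneByteCharCode stands for), the RangeContainsLatin1Equivalents check that guards this step in the real code is left out, and std::optional requires C++17.

```cpp
#include <cstdint>
#include <optional>
#include <utility>

// Assumption: mirrors String::kMaxOneByteCharCode, i.e. the Latin-1 maximum.
constexpr std::uint32_t kMaxOneByteCharCode = 0xff;

// Hypothetical helper, not part of V8. Given an inclusive range [from, to],
// returns std::nullopt when the range lies entirely above the one-byte
// limit (the `continue` in the diff); otherwise returns the range with its
// upper bound clamped to the limit.
std::optional<std::pair<std::uint32_t, std::uint32_t>> ClampToOneByte(
    std::uint32_t from, std::uint32_t to) {
  if (from > kMaxOneByteCharCode) return std::nullopt;
  if (to > kMaxOneByteCharCode) to = kMaxOneByteCharCode;
  return std::make_pair(from, to);
}
```

For example, a range such as [0x41, 0x1ff] is trimmed to [0x41, 0xff], while [0x100, 0x200] is dropped because no one-byte subject character can fall inside it.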
