
Linter plugins: Rustify token methods #16207

@overlookmotel

Description


Continuation of #14829.

All token APIs are now implemented (#15861, and many later PRs - thanks @lilnasy!).

We are currently using TS-ESLint to re-parse the source and generate tokens. This works, but it's very slow, and it bloats the size of the oxlint package. We should Rustify it!

I'm copying the content from #14829 into this new issue, to avoid needing to wade through the whole thread there.


It's going to be challenging.

Altering the lexer

Oxc's lexer generates a stream of Tokens; however, they're not stored anywhere. We'd need to adapt it to also store all the Tokens in an oxc_allocator::Vec in the arena. That Vec can then be shared with JS through the same "raw transfer" mechanism as is used for the AST.

One difficulty is that the lexer sometimes has to rewind where syntax is ambiguous when reading forwards through the file. e.g. is (a, b) a SequenceExpression, or the start of an arrow function (a, b) => ...? When rewinding occurs, we'd need to also rewind the stream of Tokens, so they don't get included twice in these cases.

Avoiding perf hit

Oxc's parser is not only used in the linter - it's the first step of the parse-transform-minify-print and parse-format pipelines. Its performance in those pipelines is critical, and hurting perf while implementing what we need for the linter would be unacceptable.

We'll need some kind of conditional compilation, so the linter uses a different version of the parser from the other pipelines.

Cargo features are problematic, due to feature unification in tests, so I would propose an approach using generics:

// ----- Traits -----

trait ParserConfig {
    type Tokens<'a>: TokenStore<'a>;
}

trait TokenStore<'a> {
    type Checkpoint;

    fn new(allocator: &'a Allocator) -> Self;
    fn push(&mut self, token: Token);
    fn checkpoint(&self) -> Self::Checkpoint;
    fn rewind(&mut self, checkpoint: Self::Checkpoint);
}

// ----- parse-transform-minify-print pipeline -----

struct StandardParserConfig;

impl ParserConfig for StandardParserConfig {
    type Tokens<'a> = NoopTokenStore;
}

struct NoopTokenStore;

impl<'a> TokenStore<'a> for NoopTokenStore {
    type Checkpoint = ();

    fn new(_allocator: &'a Allocator) -> Self { Self }
    fn push(&mut self, _token: Token) {}
    fn checkpoint(&self) {}
    fn rewind(&mut self, _checkpoint: ()) {}
}

// ----- Oxlint -----

struct OxlintParserConfig;

impl ParserConfig for OxlintParserConfig {
    type Tokens<'a> = RealTokenStore<'a>;
}

struct RealTokenStore<'a> {
    tokens: Vec<'a, Token>,
}

impl<'a> TokenStore<'a> for RealTokenStore<'a> {
    type Checkpoint = u32;

    fn new(allocator: &'a Allocator) -> Self {
        Self { tokens: Vec::new_in(allocator) }
    }
    fn push(&mut self, token: Token) {
        self.tokens.push(token);
    }
    fn checkpoint(&self) -> u32 {
        self.tokens.len() as u32
    }
    fn rewind(&mut self, checkpoint: u32) {
        self.tokens.truncate(checkpoint as usize);
    }
}

// Parser becomes generic over `ParserConfig`

pub struct Parser<'a, Config: ParserConfig> {
    // ...
    marker: PhantomData<Config>,
}

struct ParserImpl<'a, Config: ParserConfig> {
    // ...
    tokens: Config::Tokens<'a>,
}

impl<'a, Config: ParserConfig> ParserImpl<'a, Config> {
    fn somewhere_in_parser_or_lexer(&mut self, token: Token) {
        // This is a complete no-op when `Config` is `StandardParserConfig`
        self.tokens.push(token);
    }
}

Token type

Important note: We do not want to increase the size of the Rust Token type unless absolutely unavoidable. Keeping Token as a single u128 is critical for performance.
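As a sanity check, a 16-byte (u128-sized) token can be asserted directly. The field layout below is purely illustrative - it is not Oxc's actual `Token` definition - but it shows that start/end offsets, a kind, and flags comfortably fit in 16 bytes:

```rust
// Illustrative layout only - not Oxc's actual `Token`. The point is that
// start/end offsets, kind, and flags all fit in a single u128 (16 bytes).
#[allow(dead_code)]
#[derive(Clone, Copy)]
struct Token {
    start: u32,         // byte offset of token start
    end: u32,           // byte offset of token end
    kind: u8,           // token kind (type)
    flags: u8,          // misc flags (e.g. "contains escape")
    _reserved: [u8; 6], // padding up to 16 bytes
}

fn main() {
    // Growing past 16 bytes would hurt lexer throughput and cache locality.
    assert_eq!(std::mem::size_of::<Token>(), 16);
}
```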

Why the additional ParserConfig abstraction?

The above code example would work without ParserConfig. Parser and ParserImpl could both be generic over TokenStore, and we could "cut out the middleman" of ParserConfig - it adds no further value.

Reason for ParserConfig is that I think it could be useful for other purposes further down the line. e.g. we'll want at some point fairly soon to move construction of the UTF-8 to UTF-16 translation table into lexer. We'd want that to be generic too, so the cost of that is only paid by the linter. We can add that to ParserConfig to avoid 2 generic params everywhere Parser<T: TokenStore, U: UtfTranslationTable>.

It'd be nice if we can make that change later on without loads of code churn - introducing ParserConfig now enables that.

Translation to ESLint Token format

Similar to how we had to build a translation mechanism from Oxc's Rust AST to ESTree standard, we'll likely need to translate from Oxc's Token format to ESLint's format.

Our current Token type doesn't include the content of the tokens - only their start and end positions, the token's Kind (type), and some flags. It looks like we can avoid altering the Token type, and get the token value (e.g. the content of a string literal token) on the JS side by slicing the source text using start and end.
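A sketch of that idea (the helper name is made up; in practice the slicing would happen on the JS side, but the logic is the same):

```rust
// Sketch: a token's `value` can be recovered by slicing the source text
// with its start/end offsets, rather than storing content in the token.
fn token_value(source: &str, start: usize, end: usize) -> &str {
    &source[start..end]
}

fn main() {
    let source = "123;";
    // Numeric token spans [0, 3]; punctuator spans [3, 4].
    assert_eq!(token_value(source, 0, 3), "123");
    assert_eq!(token_value(source, 3, 4), ";");
}
```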

There is one oddity (escaped identifiers), discussed below.

This source:

123;
/_\n_/u;
"_\n_";
`_\n_${x}_\n_`;
_\u{24}_;

Produces tokens:

[
  { type: 'Numeric', value: '123', range: [ 0, 3 ] },
  { type: 'Punctuator', value: ';', range: [ 3, 4 ] },
  { type: 'RegularExpression', value: '/_\\n_/u', range: [ 5, 12 ], regex: { flags: 'u', pattern: '_\\n_' } },
  { type: 'Punctuator', value: ';', range: [ 12, 13 ] },
  { type: 'String', value: '"_\\n_"', range: [ 14, 20 ] },
  { type: 'Punctuator', value: ';', range: [ 20, 21 ] },
  { type: 'Template', value: '`_\\n_${', range: [ 22, 29 ] },
  { type: 'Identifier', value: 'x', range: [ 29, 30 ] },
  { type: 'Template', value: '}_\\n_`', range: [ 30, 36 ] },
  { type: 'Punctuator', value: ';', range: [ 36, 37 ] },
  { type: 'Identifier', value: '_$_', range: [ 38, 46 ] },
  { type: 'Punctuator', value: ';', range: [ 46, 47 ] }
]

value is always a source slice, except for escaped Identifiers; the only other oddity is the regex property on RegularExpression tokens.

Regexps

Slice the source text, find the last /, and split into pattern and flags.
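A sketch of that split (hypothetical helper; assumes the token's source slice is a well-formed regex literal):

```rust
// Sketch: derive the `regex` property from a RegularExpression token's
// source slice. Flags can't contain `/`, so the last `/` closes the pattern.
fn split_regex(token_value: &str) -> (&str, &str) {
    let last_slash = token_value.rfind('/').expect("regex literal has a closing /");
    let pattern = &token_value[1..last_slash]; // between opening and closing `/`
    let flags = &token_value[last_slash + 1..]; // everything after closing `/`
    (pattern, flags)
}

fn main() {
    // `/_\n_/u` from the example above: pattern `_\n_`, flags `u`.
    let (pattern, flags) = split_regex(r"/_\n_/u");
    assert_eq!(pattern, r"_\n_");
    assert_eq!(flags, "u");
}
```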

Escaped identifiers

It seems the TS-ESLint parser diverges from ESLint in its tokens for escaped identifiers:

Source:

\u{41}

Espree (ESLint's parser):

{
  "type": "Identifier",
  "value": "A",
  "range": [0, 6]
}

AST explorer

TS-ESLint parser:

{
  "type": "Identifier",
  "value": "\\u{41}",
  "range": [0, 6]
}

AST explorer

TS-ESLint have decided this is a case of "won't fix".

I think we can align with TS-ESLint, so we don't have to worry about unescaping identifiers. It's a crazy edge case anyway.

Labels: A-linter-plugins (Area - Linter JS plugins), C-performance (Category - Solution not expected to change functional behavior, only performance)