@slotThe slotThe commented Oct 7, 2024

EDIT: Apologies for the somewhat messy PR; perhaps conceptually some of these should go into different PRs. However, the changes are so small that I figured it was perhaps best to include them here as well.

Commit summary

Fix BQN lexer treating special characters as word chars

\w matches all alphanumeric Unicode characters, including ones (e.g., 𝕊) that BQN treats specially. This is especially troublesome for variables; previously, something like

𝕊i

would have returned

(Token.Operator, '𝕊'),
(Token.Error, 'i'),

instead of

(Token.Operator, '𝕊'),
(Token.Name.Variable, 'i').

This extends to special sequences like \b, which care about the difference between \w and \W.
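The behaviour described above can be reproduced with Python's `re` module alone. This is only a minimal sketch: the patterns `NAIVE_NAME` and `EXPLICIT_NAME` are hypothetical illustrations and are not the patterns used in the actual lexer.

```python
import re

SPECIAL = "\U0001D54A"      # 𝕊, MATHEMATICAL DOUBLE-STRUCK CAPITAL S
SOURCE = SPECIAL + "i"      # the BQN snippet 𝕊i

# 𝕊 is a Unicode letter, so the default (Unicode-aware) \w matches it.
special_is_word_char = re.fullmatch(r"\w", SPECIAL) is not None

# Hypothetical name rule built on \w: it swallows 𝕊 together with i.
NAIVE_NAME = re.compile(r"\w+")
naive_match = NAIVE_NAME.match(SOURCE).group()

# \b only matches between a \w and a \W character (or at the string edge).
# Both 𝕊 and i count as \w, so there is no boundary in front of i.
boundary_before_i = re.search(r"\bi", SOURCE) is not None

# Hypothetical alternative: spell out the allowed characters explicitly,
# so that 𝕊 is no longer treated as part of a name.
EXPLICIT_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
explicit_match = EXPLICIT_NAME.search(SOURCE).group()

print(special_is_word_char, repr(naive_match), boundary_before_i, repr(explicit_match))
```

Under these assumptions, any rule that relies on `\b` in front of a name cannot fire directly after 𝕊, which is consistent with the `Token.Error` reported for `i`.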


There are still some \b's in the code, but from what I can tell these don't produce any undesirable effects. E.g., for 2-modifiers, the BQN grammar already disallows writing something like F_𝕣_G instead of F _𝕣_ G. No guarantees that I caught everything, of course.

The paragraph above is, of course, nonsense: BQN accepts plenty of situations where this could arise, e.g., _F ← {𝔽_𝕣}, _D ← {𝔽_F𝕩}, etc. See the fixup commit.

BQN lexer: Parse _𝕣 and 𝕣 correctly

BQN Lexer: Replace function * with ⋆

BQN does not actually use * (ASTERISK) anywhere, but there is a primitive function ⋆ (STAR OPERATOR).
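The two characters look alike in many fonts, so it can help to check their code points directly. The helper `describe` below is a hypothetical name used only for this illustration.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

asterisk = describe("*")        # the character BQN does not use
star = describe("\u22c6")       # ⋆, the BQN primitive function

print(asterisk)
print(star)
```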

BQN Lexer: Allow underscores in numbers

As per the spec.
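As a rough illustration of what accepting underscores means for a number rule, here is a simplified sketch. The pattern `SIMPLE_NUMBER` and the helper `parse_simple_number` are hypothetical; they cover only an optional `¯` sign, digits, an optional fractional part, and underscores, and they assume underscores are simply ignored when reading the value. They are not the lexer's actual pattern, and they omit the rest of BQN's numeric syntax.

```python
import re

# Hypothetical, simplified number pattern: optional high minus (¯),
# a leading digit, then digits or underscores, and an optional fraction.
SIMPLE_NUMBER = re.compile(r"¯?[0-9][0-9_]*(?:\.[0-9][0-9_]*)?")

def parse_simple_number(text: str) -> float:
    """Convert a simplified BQN-style literal to a float, dropping underscores."""
    if SIMPLE_NUMBER.fullmatch(text) is None:
        raise ValueError(f"not a simple number literal: {text!r}")
    return float(text.replace("_", "").replace("¯", "-"))

print(parse_simple_number("1_000"))
print(parse_simple_number("¯3.14_15"))
```

A literal such as `_1` is rejected by this sketch because it requires a leading digit.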

@slotThe slotThe changed the title Fix BQN lexer treating special characters as word chars Various improvements to the BQN lexer Nov 15, 2024
@Anteru Anteru merged commit 15b0231 into pygments:master Jan 5, 2025
@Anteru Anteru added this to the 2.19.0 milestone Jan 5, 2025
@Anteru Anteru added the A-lexing area: changes to individual lexers label Jan 5, 2025
Anteru commented Jan 5, 2025

A single PR is fine for this kind of grouped change. Merged!

@slotThe slotThe deleted the bqn/inter-word-chars branch January 5, 2025 16:00