Skip to content

fix(sql): handle quoted column names with dots in JOIN metadata#6590

Merged
bluestreak01 merged 1 commit intomasterfrom
jh_join_bad_aliasing
Jan 2, 2026
Merged

fix(sql): handle quoted column names with dots in JOIN metadata#6590
bluestreak01 merged 1 commit intomasterfrom
jh_join_bad_aliasing

Conversation

@jerrinot
Copy link
Copy Markdown
Contributor

@jerrinot jerrinot commented Jan 2, 2026

fixes #6407

@jerrinot jerrinot added the Bug Incorrect or unexpected behavior label Jan 2, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 2, 2026

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 1 / 1 (100.00%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/engine/join/JoinRecordMetadata.java 1 1 100.00%

@jerrinot
Copy link
Copy Markdown
Contributor Author

jerrinot commented Jan 2, 2026

dear @coderabbitai please review this. one rule for this review: question everything!

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 2, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch PR details
gh pr view 6590 --json title,body,files,additions,deletions

# Fetch issue #6407 details
gh issue view 6407 --json title,body,comments

# Get the full diff of the PR
gh pr diff 6590

# Find test files related to JoinRecordMetadata
fd -e java -e groovy JoinRecordMetadata | head -20

# Search for existing tests that might cover this scenario
rg -n "quoted.*column.*dot|dot.*quoted.*column" --type java -C2

# Look for the implementation file
cat core/src/main/java/io/questdb/griffin/engine/join/JoinRecordMetadata.java

Length of output: 16606


🏁 Script executed:

#!/bin/bash
# Search for other uses of Chars.indexOf in JoinRecordMetadata
rg -n "Chars\.indexOf" core/src/main/java/io/questdb/griffin/engine/join/JoinRecordMetadata.java -C3

# Look at the Chars utility to understand indexOfLastUnquoted
rg -n "indexOfLastUnquoted" core/src/main/java/io/questdb/std/Chars.java -A10 | head -50

# Check if there are other metadata classes that might have similar issues
fd -e java RecordMetadata | xargs rg -l "Chars\.indexOf.*\." | head -10

# Let's look at the test file to see if there are other related tests
cat core/src/test/java/io/questdb/test/griffin/engine/join/JoinTest.java | grep -A30 "testJoiningSubqueryWithDotInColumnName"

# Search for tests that actually test json_extract with joins
rg -n "json_extract.*join|join.*json_extract" --type java -i -C2 | head -30

# Let's see what other similar pattern changes were made historically
git log --all --oneline --grep="indexOfLastUnquoted" | head -10

Length of output: 2964


🏁 Script executed:

#!/bin/bash
# Let's try to reproduce the EXACT bug from the issue
# First, check if there's a test that creates a table with dashes in the name
rg -n "\"[a-zA-Z]+-[a-zA-Z]+\"" core/src/test/java/io/questdb/test/griffin/engine/join/JoinTest.java -C2 | head -20

# Look at how the optimizer generates aliases - this might be the root cause
rg -n "generateWildAlias|wild.*alias" --type java -C3

# Check if there are escape sequences or edge cases in column name handling
rg -n "escaped.*quote|quote.*escape" core/src/main/java/io/questdb/std/Chars.java -C3

# Let's see if indexOf is used elsewhere in metadata-related code that might have similar bugs
fd -e java Metadata | xargs rg -n "Chars\.indexOf\(" | grep -v "indexOfLastUnquoted" | head -20

# Check if there's any performance testing or documentation about when to use indexOf vs indexOfLastUnquoted
rg -n "indexOf.*performance|performance.*indexOf" --type java -C2 | head -20

Length of output: 4583


🏁 Script executed:

#!/bin/bash
# Let's look at the actual implementation of the Chars utility methods to understand the difference
cat core/src/main/java/io/questdb/std/Chars.java | grep -A20 "public static int indexOf"

# Let's look more at the test - does it reproduce the original ArrayIndexOutOfBoundsException?
# The original bug had json_extract with '$.' - let's see if the test covers this
rg -n "json_extract" core/src/test/java/io/questdb/test/griffin/engine/join/JoinTest.java -C5

# Check the git history of this file to see when indexOf was first used vs indexOfLastUnquoted
git log --oneline -p --all -- core/src/main/java/io/questdb/griffin/engine/join/JoinRecordMetadata.java | grep -B5 -A5 "indexOf" | head -50

# Let's understand the optimization path that creates these wild aliases
rg -n "_xQdbA1|generateColumnAlias" --type java -C3 | head -30

# Check if there are edge cases with multiple dots in quoted names
ast-grep --pattern 'indexOfLastUnquoted($_, $_)' | head -30

Length of output: 14200


🏁 Script executed:

#!/bin/bash
# Let's see when getColumnIndexQuiet was changed to use indexOfLastUnquoted
git log --all --oneline -p -- core/src/main/java/io/questdb/griffin/engine/join/JoinRecordMetadata.java | grep -B10 -A10 "getColumnIndexQuiet" | head -60

# Let's test if the simple test really reproduces the bug - look for the actual error code path
rg -n "ArrayIndexOutOfBoundsException.*Index -1" --type java -C5

# Let's understand why LAST unquoted is used instead of FIRST
# Check if there's documentation or comments about this design decision
rg -n "last.*dot|dot.*last" core/src/main/java/io/questdb/griffin/engine/join/JoinRecordMetadata.java -C5

# Look at edge cases - what if there are multiple dots like "table.schema.column"?
rg -n "schema\\..*\\." --type java -C2 | head -20

# Check if there's handling for escaped quotes in the indexOfLastUnquoted method
cat core/src/main/java/io/questdb/std/Chars.java | grep -B5 -A15 "indexOfLastUnquoted" | grep -A15 "public static int indexOfLastUnquoted"

# See if there are any other tests with json_extract to understand the pattern better
rg -n "json_extract.*\\\$\." --type java -C3

Length of output: 15707


🏁 Script executed:

#!/bin/bash
# Let's create a more comprehensive test to see if the simple test is enough
# First, let's understand the exact scenario from the bug report

# Check what commit fixed getColumnIndexQuiet - was there discussion about addAlias?
git log --all --oneline --grep="5702" | head -10
git show 6c7cf98c --stat

# Let's see if there are any TODOs or FIXMEs related to this
rg -n "TODO|FIXME" core/src/main/java/io/questdb/griffin/engine/join/JoinRecordMetadata.java

# Check for double-quote escaping handling - SQL uses "" not \"
rg -n "\"\"" core/src/main/java/io/questdb/std/Chars.java -C3

# Look for performance benchmarks or concerns about indexOf vs indexOfLastUnquoted
rg -n "benchmark|performance" core/src/main/java/io/questdb/std/Chars.java -C2

# Let's verify the fix by looking at how column names are constructed in the optimizer
rg -n "generateColumnAlias" --type java -A10 | head -50

# Check if there are other places that might need similar fixes
rg -n "Chars\.indexOf.*columnName.*\." --type java -C2 | head -30

Length of output: 7506


@jerrinot - Alright, I'll question everything as requested. 🔍

Critical Analysis

1. Inconsistency Fix, But Why Was This Missed?

Looking at the git history, getColumnIndexQuiet was fixed in commit 6c7cf98 (PR #5702) to use indexOfLastUnquoted, but addAlias was left using plain indexOf. Why?

  • Was the original fix incomplete?
  • Is there any shared logic these methods should use to prevent such inconsistencies?
  • Are there OTHER places in the codebase still using indexOf that should use indexOfLastUnquoted?

2. Why LAST Unquoted, Not FIRST Unquoted?

The method uses indexOfLastUnquoted which finds the rightmost dot outside quotes. But why?

  • In alias."column.name", we want the dot BEFORE the quotes
  • What happens with "schema.table"."column.name" - multiple quoted sections with dots?
  • Wouldn't a proper left-to-right parse be more correct than relying on "last" occurrence?
  • This feels like it could break on edge cases - what's the reasoning for using LAST?

3. Test Doesn't Reproduce Original Bug

Your test uses:

SELECT x as "foo.bar" FROM long_sequence(5)

But the original issue #6407 was:

SELECT json_extract(msg, '$.') FROM "a-b-c" LEFT JOIN ...

Questions:

  • Does your simple test actually trigger the same code path as the original bug?
  • Why not add a test that uses json_extract with '$.' to match the actual bug report?
  • Have you manually verified that your fix resolves the exact error from sketchy json_extract plus join leading to 500 #6407?

4. Quote Escaping - Is This a Bug Waiting to Happen?

Looking at the indexOfLastUnquoted implementation:

if (seq.charAt(i) == '\"') {
    inQuotes = !inQuotes;
}

This simply toggles on " characters. But SQL uses "" to escape quotes inside identifiers!

  • What happens with a column named "foo""bar.baz" (contains an escaped quote and a dot)?
  • The current implementation would toggle on the first ", toggle on the second ", thinking quotes are closed, then treat .baz as unquoted
  • Is this a latent bug?

5. Performance Implications

indexOfLastUnquoted scans the entire string every time, maintaining quote state. Questions:

  • Is this called in a hot path during query execution?
  • What's the performance impact compared to simple indexOf?
  • Have you benchmarked this on complex queries with many columns?

6. Root Cause vs Symptom

The issue mentions "the optimizer generates a wild alias like _xQdbA1."json_extract(msg, '$.') "". Is this fix addressing a symptom rather than the root cause?

  • Should the optimizer be generating cleaner aliases?
  • Is creating aliases with dots inside quoted sections a code smell?
  • Would fixing the optimizer avoid needing these workarounds?

7. Edge Cases Not Covered

What about:

  • Empty quotes: ""
  • Multiple consecutive dots: "a...b"
  • Unmatched quotes: "foo (malformed SQL)
  • Extremely long column names with many dots
  • Unicode characters or special chars near dots
  • Null or empty column names?

8. The Assertion That Was Removed

I notice in the git history that an assertion was removed earlier:

- assert dot != -1 || tableAlias != null;

Why was this removed? Was it triggering false positives? Does your fix now make this assertion valid again?

9. Coverage: Is 100% Enough?

The coverage report shows 1/1 lines covered (100%). But:

  • Does line coverage == branch coverage?
  • The method has conditional logic (dot == -1 checks) - are all branches tested?
  • What about the dual-key insertion logic in addToMap?

10. Documentation Gap

Neither addAlias nor getColumnIndexQuiet have comments explaining:

  • Why they use indexOfLastUnquoted instead of indexOf
  • What the "last" semantics mean for the column name structure
  • What assumptions are made about table.column naming

Should this be documented to prevent future regressions?


Recommended Actions

  1. Add a test that reproduces the exact scenario from sketchy json_extract plus join leading to 500 #6407 with json_extract
  2. Verify escaped quote handling - add test with ""
  3. Document why LAST is used vs proper parsing
  4. Audit codebase for other indexOf usages that should use indexOfLastUnquoted
  5. Consider refactoring to share logic between getColumnIndexQuiet and addAlias

What do you think? Should I dig deeper into any of these concerns? 🐰


🏁 Script executed:

#!/bin/bash
# Let's verify if escaped quotes are handled in tests anywhere
rg -n '""' core/src/test/java/io/questdb/test/griffin/engine/join/ --type java -C2

# Check if there are any validation tests for malformed quoted identifiers
rg -n "unmatched.*quote|malformed.*identifier" --type java -C2

# See if there's any documentation about the table.column naming convention
rg -n "table.*column.*name|column.*name.*format" core/src/main/java/io/questdb/griffin/engine/join/JoinRecordMetadata.java -C3

Length of output: 293275


@jerrinot Thanks — I questioned everything. Here’s the focused review.

What I verified

  • The root cause matches sketchy json_extract plus join leading to 500 #6407: addAlias() used indexOf, which “saw” the dot inside a quoted identifier and split alias/column incorrectly. getColumnIndexQuiet() was already using indexOfLastUnquoted(...) since 6c7cf98, so addAlias() was the odd one out.
  • The change to indexOfLastUnquoted in addAlias() brings it in line with the rest of the pipeline (SqlOptimiser, SqlCodeGenerator, CreateMatViewOperationImpl).
  • The new test covers a quoted alias with a dot ("foo.bar") coming from a subquery and verifies JOIN metadata resolution.

Concerns and questions

  1. Escaped quotes inside identifiers

    • Chars.indexOfLastUnquoted(...) flips state on every '"'. SQL identifiers escape quotes with "" inside quotes. With a name like "a""b.c", the current algorithm will toggle off/on and treat the dot as unquoted. That seems wrong and could reintroduce the same class of bug. Should we harden indexOfLastUnquoted to handle ""?
  2. LAST-unquoted semantics are fine, but let’s document why

    • Using the last unquoted dot is consistent with table[.schema].column and with quoted identifiers that may themselves contain dots. A short comment in addAlias() and in Chars.indexOfLastUnquoted explaining this would prevent future regressions.
  3. Test coverage: good, but does not match sketchy json_extract plus join leading to 500 #6407 exactly

    • The added test uses "foo.bar". The original failure was for an auto-generated alias "_xQdbA1."json_extract(msg, '$.')"" and a join. I suggest adding a test that reproduces the exact pattern to guard against regressions in alias formatting paths.
  4. Minor robustness nit in getColumnIndexQuiet()

    • Else branch uses columnName.length() instead of hi. Given callers currently pass hi=length, this is harmless, but using hi is safer and intention-revealing.

Concrete proposals

A) Harden Chars.indexOfLastUnquoted to support "" inside quoted identifiers
Java (core/src/main/java/io/questdb/std/Chars.java):

public static int indexOfLastUnquoted(@NotNull CharSequence seq, char ch, int seqLo, int seqHi) {
    boolean inQuotes = false;
    int last = -1;
    for (int i = seqLo; i < seqHi; i++) {
        char c = seq.charAt(i);
        if (c == '\"') {
            if (inQuotes && i + 1 < seqHi && seq.charAt(i + 1) == '\"') {
                // Escaped quote inside a quoted identifier: consume both quotes, stay in quotes.
                i++; 
                continue;
            }
            inQuotes = !inQuotes;
            continue;
        }
        if (!inQuotes && c == ch) {
            last = i;
        }
    }
    return last;
}

B) Make getColumnIndexQuiet() respect the hi bound
Java (core/src/main/java/io/questdb/griffin/engine/join/JoinRecordMetadata.java):

@Override
public int getColumnIndexQuiet(CharSequence columnName, int lo, int hi) {
    final MapKey key = map.withKey();
    final int dot = Chars.indexOfLastUnquoted(columnName, '.', lo, hi);
    if (dot == -1) {
        key.putStr(null);
        key.putStrLowerCase(columnName, lo, hi);
    } else {
        key.putStrLowerCase(columnName, 0, dot);
        // use hi instead of columnName.length() to honor caller-provided bounds
        key.putStrLowerCase(columnName, dot + 1, hi);
    }
    MapValue value = key.findValue();
    return value != null ? value.getInt(0) : -1;
}

C) Add tests to pin the original and adjacent scenarios
Java (core/src/test/java/io/questdb/test/griffin/engine/join/JoinTest.java):

@Test
public void testJoiningSubqueryWithJsonExtractDotInAlias() throws Exception {
    assertMemoryLeak(() -> {
        // The alias will be something like: _xQdbA1."json_extract(msg, '$.')"
        execute("create table t(msg string)");
        execute("insert into t values ('{\"k\":1}')");
        assertQueryNoLeakCheck(
                // We only need to check that it compiles and executes without internal errors;
                // expected output: one row with the extracted object and "1" from the RHS.
                "json_extract\t1\n" +
                "{\"k\":1}\t1\n",
                "SELECT * FROM ( SELECT json_extract(msg, '$.') FROM t ) LEFT JOIN (select 1) ON 1 = 1",
                null,
                false
        );
    });
}

@Test
public void testQualifiedAliasWithQuotedDots() throws Exception {
    assertMemoryLeak(() -> {
        // Ensure we can qualify a quoted identifier with internal dots and still resolve correctly.
        assertQueryNoLeakCheck(
                "\"a.b.c\"\t1\n" +
                "42\t1\n",
                "SELECT * FROM (SELECT 42 as \"a.b.c\") l LEFT JOIN (select 1) r ON true",
                null,
                false
        );
    });
}

@Test
public void testQuotedIdentifierWithEscapedQuoteAndDot() throws Exception {
    assertMemoryLeak(() -> {
        // If Chars.indexOfLastUnquoted handles "" correctly, this remains unqualified.
        assertQueryNoLeakCheck(
                "\"a\"\"b.c\"\t1\n" +
                "7\t1\n",
                "SELECT * FROM (SELECT 7 as \"a\"\"b.c\") l LEFT JOIN (select 1) r ON true",
                null,
                false
        );
    });
}

Notes:

  • The exact expected headers/labels may need to match your test utilities’ formatting. If preferred, we can instead assert row counts or use existing helpers that ignore column labels.

D) Tiny docs note (inline comments)

  • In addAlias(...): add a comment explaining the “last unquoted dot” choice and the quoting behavior.
  • In Chars.indexOfLastUnquoted(...): mention SQL-style escaping with "" is handled.

Summary stance

  • The one-line fix is correct and aligns addAlias() with the rest of the planner/generator. I recommend merging it.
  • I also recommend the small follow-ups above (A–D) to close the “escaped-quote” hole, tighten bounds handling, and pin the original regression with an exact test.

Would you like me to open a follow-up issue for A/B/C/D so we can land this PR cleanly and track the hardening separately?

@bluestreak01 bluestreak01 merged commit c252e05 into master Jan 2, 2026
43 checks passed
@bluestreak01 bluestreak01 deleted the jh_join_bad_aliasing branch January 2, 2026 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Incorrect or unexpected behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sketchy json_extract plus join leading to 500

3 participants