Skip to content

feat(sql): add length_bytes() function for varchars#6685

Merged
bluestreak01 merged 3 commits intomasterfrom
puzpuzpuz_length_bytes
Jan 23, 2026
Merged

feat(sql): add length_bytes() function for varchars#6685
bluestreak01 merged 3 commits intomasterfrom
puzpuzpuz_length_bytes

Conversation

@puzpuzpuz
Copy link
Copy Markdown
Contributor

Includes the following changes:

  • Adds length_bytes() SQL function. It returns the number of bytes present in the varchar argument
  • Reduces disk reads/memory accesses for min(varchar) and max(varchar) group-by functions by avoiding data vector access when prefixes differ:
    • Prefix-first comparison: Compares the 6-byte prefix (stored in aux memory) before accessing the data vector.
    • 8-bytes-at-a-time comparison: When prefixes match, compares remaining bytes using longAt() instead of byte-by-byte.
    • GroupByUtf8Sink: Added zeroPaddedSixPrefix(), longAt(), intAt() methods and ensured proper zero-padding for short strings.

Benchmarks

Queries run on clickbench's hits table, on Ryzen 7900x box:

-- master (length):      283ms
-- patch (length_bytes): 147ms
SELECT * FROM (SELECT CounterID, AVG(length_bytes(URL)) AS l, COUNT(*) AS c FROM hits WHERE URL IS NOT NULL GROUP BY CounterID) WHERE c > 100000 ORDER BY l DESC LIMIT 25;

-- master: 226ms
-- patch:   69ms
SELECT MIN(Title) FROM hits;

@puzpuzpuz puzpuzpuz self-assigned this Jan 20, 2026
@puzpuzpuz puzpuzpuz added Enhancement Enhance existing functionality SQL Issues or changes relating to SQL execution labels Jan 20, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 20, 2026

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR introduces a new length_bytes function for VARCHAR types and optimizes UTF-8 string comparison by leveraging 6-byte inline prefixes and 8-byte chunk comparisons. Changes include adding @Transient annotations to function factory parameters, creating a new LengthBytesVarcharFunctionFactory, enhancing GroupByUtf8Sink with primitive value accessors, and refactoring comparison logic in Utf8s for performance gains.

Changes

Cohort / File(s) Summary
Length Function Factories
core/src/main/java/io/questdb/griffin/engine/functions/str/LengthBinFunctionFactory.java, LengthVarcharFunctionFactory.java
Added @Transient annotation import and annotated args and argPositions parameters in newInstance methods with reformatting.
New Length Bytes Factory
core/src/main/java/io/questdb/griffin/engine/functions/str/LengthBytesVarcharFunctionFactory.java
New factory class implementing length_bytes(Ø) function; returns varchar byte size via getVarcharSize(Record). Includes nested Func class extending IntFunction and implementing UnaryFunction.
UTF-8 Buffer Utilities
core/src/main/java/io/questdb/griffin/engine/groupby/GroupByUtf8Sink.java
Added static import for VARCHAR_INLINED_PREFIX_MASK. Introduced minimum capacity constraint, zero-initialization of first 8 bytes on clear/allocation. Added public methods intAt(int), longAt(int), and zeroPaddedSixPrefix() to expose prefix values and primitive reads from internal buffer. Adjusted pointer-based state checks in writeTo and growth paths.
UTF-8 Comparison Optimization
core/src/main/java/io/questdb/std/str/Utf8s.java
Optimized compare() to first check 6-byte inline prefixes; if equal, compares 8-byte chunks using longAt before falling back to byte-by-byte. Updated startsWith() signature to accept precomputed six-prefix parameters, eliminating redundant prefix computation.
ClickBench Tests
core/src/test/java/io/questdb/test/griffin/ClickBenchTest.java
Updated queries Q27 and Q28 to use length_bytes(URL) and length_bytes(Referer) instead of length(), with corresponding expected plan adjustments.
Length Function Tests
core/src/test/java/io/questdb/test/griffin/engine/functions/str/LengthFunctionFactoryTest.java
Reformatted expected string literals to multiline text blocks. Added new test method testVarcharLengthBytesSimple() to verify length_bytes() results for VARCHAR values.
UTF-8 Comparison Tests
core/src/test/java/io/questdb/test/std/str/Utf8sTest.java
Extensive new test coverage for Utf8s.compare() including null/identity cases, empty/short strings, boundary cases at 6-byte threshold, long strings with varying prefixes/suffixes, unicode handling, unsigned-byte behavior, and fuzz tests validating alignment with Java comparator ordering.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

New feature, Performance

Suggested reviewers

  • bluestreak01
  • nwoolmer
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(sql): add length_bytes() function for varchars' accurately describes the main feature addition in the changeset.
Description check ✅ Passed The description is directly related to the changeset, detailing the new length_bytes() function, optimization improvements for min/max group-by operations, and relevant performance benchmarks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@puzpuzpuz
Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
core/src/test/java/io/questdb/test/std/str/Utf8sTest.java (1)

303-315: Fix codepoint iteration in the fuzz comparator.
The comparator advances by UTF‑16 code units, so surrogate pairs can be misordered relative to Utf8s.compare. Iterate by code point and compare codepoint counts for the length tie‑break.

🐛 Proposed fix
-        Arrays.sort(strings, (l, r) -> {
-            int len = Math.min(l.length(), r.length());
-            for (int i = 0; i < len; i++) {
-                int lCodepoint = l.codePointAt(i);
-                int rCodepoint = r.codePointAt(i);
-                int diff = lCodepoint - rCodepoint;
-                if (diff != 0) {
-                    return diff;
-                }
-            }
-            return l.length() - r.length();
-        });
+        Arrays.sort(strings, (l, r) -> {
+            int i = 0;
+            int j = 0;
+            while (i < l.length() && j < r.length()) {
+                int lCodepoint = l.codePointAt(i);
+                int rCodepoint = r.codePointAt(j);
+                int diff = lCodepoint - rCodepoint;
+                if (diff != 0) {
+                    return diff;
+                }
+                i += Character.charCount(lCodepoint);
+                j += Character.charCount(rCodepoint);
+            }
+            return l.codePointCount(0, l.length()) - r.codePointCount(0, r.length());
+        });
🤖 Fix all issues with AI agents
In `@core/src/main/java/io/questdb/griffin/engine/groupby/GroupByUtf8Sink.java`:
- Around line 155-159: The method zeroPaddedSixPrefix() can call Unsafe.getLong
when ptr == 0 causing invalid memory reads; update zeroPaddedSixPrefix() to
first check if ptr == 0 and return 0L (or an appropriate neutral masked value)
before calling Unsafe.getUnsafe().getLong(ptr + HEADER_SIZE), then apply the
existing VARCHAR_INLINED_PREFIX_MASK as before so the unsafe read is only
executed when ptr is non-zero.

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 28 / 28 (100.00%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/engine/groupby/GroupByUtf8Sink.java 8 8 100.00%
🔵 io/questdb/std/str/Utf8s.java 11 11 100.00%
🔵 io/questdb/griffin/engine/functions/str/LengthBytesVarcharFunctionFactory.java 9 9 100.00%

@bluestreak01 bluestreak01 merged commit 667eefc into master Jan 23, 2026
43 checks passed
@bluestreak01 bluestreak01 deleted the puzpuzpuz_length_bytes branch January 23, 2026 01:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Enhancement Enhance existing functionality SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants