feat(sql): Add array_build() function for dynamic array construction #6794
bluestreak01 merged 5 commits into master
Conversation
Introduces array_build(nDims, size, filler1, ...) that creates DOUBLE[] or DOUBLE[][] arrays with controlled shape and fill values. The size parameter accepts a scalar integer or a DOUBLE[] (using its cardinality). Each filler can be a scalar (repeated for all elements) or a DOUBLE[] (copied element-by-element, NaN-padded if shorter, truncated if longer). nDims is validated as a compile-time constant restricted to 1 or 2. NULL scalar size returns a null array. Multi-dimensional array arguments (DOUBLE[][] etc.) are rejected. Bulk copy via appendDataToMem is used when source and target sizes match. Includes 34 unit tests covering all code paths (scalar/array fillers, 1D/2D, NaN padding, truncation, null handling, error conditions, large arrays, multi-dim rejection) and 4 fuzz tests (1D/2D scalar fill, 1D/2D array copy identity) with 100 iterations each.
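The filler semantics described above can be sketched as a small standalone model (plain Java, illustrative only; this is not the QuestDB implementation and the names are invented for the sketch):

```java
import java.util.Arrays;

// Illustrative model of array_build() filler semantics (not QuestDB code):
// a scalar filler repeats for all elements; an array filler is copied
// element-by-element, NaN-padded if shorter and truncated if longer.
public class ArrayBuildModel {

    static double[] fillScalar(int size, double value) {
        double[] out = new double[size];
        Arrays.fill(out, value); // scalar repeated for all elements
        return out;
    }

    static double[] fillFromArray(int size, double[] src) {
        double[] out = new double[size];
        int copyLen = Math.min(src.length, size); // truncate if src is longer
        System.arraycopy(src, 0, out, 0, copyLen);
        for (int j = copyLen; j < size; j++) {
            out[j] = Double.NaN; // NaN-pad if src is shorter
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fillScalar(3, 1.5)));
        System.out.println(Arrays.toString(fillFromArray(4, new double[]{1.0, 2.0})));      // padded
        System.out.println(Arrays.toString(fillFromArray(2, new double[]{1.0, 2.0, 3.0}))); // truncated
    }
}
```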
@coderabbitai, please review critically. Thanks
### Critical Review of array_build() Implementation

#### 🔴 CRITICAL: Integer Overflow Vulnerability (Lines 215-224)

**Issue:** for 2D arrays, the total element count is `nDims * size`, applied with no overflow protection:

```java
// Lines 215-224: No overflow protection
if (nDims == 1) {
    array.setDimLen(0, size);
} else {
    array.setDimLen(0, nDims); // Always 2
    array.setDimLen(1, size);  // Could be huge
}
array.applyShape(position); // Total = 2 * size, unchecked
```

**Attack vector:**

```sql
SELECT array_build(2, 200_000_000, 1.0, 2.0) -- 400M elements = 3.2GB, no validation
```

**Required fix:**

```java
if (nDims == 2 && size > ArrayView.DIM_MAX_LEN / nDims) {
    throw CairoException.nonCritical().position(sizePos)
            .put("2D array size would exceed maximum element count");
}
```
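The guard in the suggested fix divides the limit rather than multiplying the operands, so the comparison itself can never overflow. A standalone sketch (the 268_435_455 limit is taken from the tests later in this review as a stand-in for `ArrayView.DIM_MAX_LEN`; the class and method names are illustrative):

```java
public class OverflowGuard {
    // Stand-in for ArrayView.DIM_MAX_LEN; the value 268_435_455 matches the
    // limit exercised in the review's suggested tests (assumption).
    static final int DIM_MAX_LEN = 268_435_455;

    // Comparing size against DIM_MAX_LEN / nDims avoids computing
    // nDims * size, so the check cannot itself overflow an int.
    static boolean fits(int nDims, int size) {
        return size <= DIM_MAX_LEN / nDims;
    }

    public static void main(String[] args) {
        System.out.println(fits(1, 268_435_455)); // true: exactly at the limit
        System.out.println(fits(2, 134_217_727)); // true: 2 * 134_217_727 <= limit
        System.out.println(fits(2, 134_217_728)); // false: would exceed the limit
    }
}
```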
#### 🟡 PERFORMANCE: Unoptimized Scalar Fill (Lines 249-252)

**Issue:** scalar fill uses a naive element-by-element loop instead of a vectorized bulk operation:

```java
// Current implementation - lines 249-252
double val = fillerFunc.getDouble(rec);
for (int j = 0; j < size; j++) {
    mem.putDouble(val); // Element-by-element
}
```

**Performance impact:** for large arrays, per-element `putDouble()` calls are far slower than a single bulk fill.

**Required fix:**

```java
double val = fillerFunc.getDouble(rec);
if (size > 0) {
    long addr = mem.getAddress() + mem.getAppendOffset();
    Vect.setMemoryDouble(addr, val, size);
    mem.skip(size * Double.BYTES);
}
```

**Benchmark requirement:** measure the performance delta for arrays with 10K, 100K, and 1M elements.

#### 🟡 DESIGN FLAW: Unnecessary setType() Overhead (Line 218)

**Issue:** `setType()` is called on every row even though the type never changes:

```java
// Line 218: Called per-row
array.setType(type); // type never changes!
```

The comment says "setType reinitializes the shape list after a prior ofNull", but this is inefficient.

**Performance impact:** for queries processing millions of rows, this adds unnecessary overhead.

**Required investigation:**
#### 🟠 DESIGN LIMITATION: Arbitrary Restrictions Not Justified

**Issues:** the 1-2D cap on `nDims` and the DOUBLE-only element type are imposed without stated justification.

**Questions requiring answers:** see the follow-up question at the end of this review.

**Recommendation:** either justify these restrictions in code comments, or track lifting them as follow-up work.
#### 🟠 MISSED OPTIMIZATION: Bulk Copy for Truncation (Lines 237-247)

**Issue:** bulk copy is used only when `srcLen == size`; truncation falls back to the element-by-element path:

```java
// Current - lines 237-247
if (srcLen == size) {
    src.appendDataToMem(mem); // Fast path
} else {
    int copyLen = Math.min(srcLen, size);
    for (int j = 0; j < copyLen; j++) { // Slow path even for truncation
        mem.putDouble(src.getDouble(j));
    }
    // ...
}
```

**Fix:**

```java
int copyLen = Math.min(srcLen, size);
if (copyLen > 0) {
    // Bulk copy the available portion
    long srcAddr = /* get src address */;
    long dstAddr = mem.getAddress() + mem.getAppendOffset();
    Unsafe.getUnsafe().copyMemory(srcAddr, dstAddr, copyLen * Double.BYTES);
    mem.skip(copyLen * Double.BYTES);
}
// NaN-pad remainder
for (int j = copyLen; j < size; j++) {
    mem.putDouble(Double.NaN);
}
```

#### 🟢 TEST COVERAGE: Gaps in Edge Cases

87.7% coverage (107/122 lines) = 15 uncovered lines. Based on analysis, missing test cases:
**Required additions:**

```java
@Test
public void testErrorOverflowIn2DArray() throws Exception {
    assertException(
            "SELECT array_build(2, 134_217_728, 1.0, 2.0)", // 2 * 134M > 268M
            7,
            "would exceed maximum element count"
    );
}

@Test
public void testMaxSizeArray1D() throws Exception {
    // Should succeed at exact limit
    assertSql("cnt\n268435455\n",
            "SELECT array_count(array_build(1, 268_435_455, 0.0)) AS cnt");
}
```

#### 🟢 MINOR: Thread Safety Documentation

Line 262 returns `false` from `isThreadSafe()`.

**Question:** does the query engine guarantee function instance isolation per thread, or can this cause issues in parallel scans?

**Required:** document the threading model assumption in the class javadoc.

#### Summary of Required Actions

**Before merge:**
Follow-up issues:
What's your justification for the 1-2D limitation and DOUBLE-only restriction? If these are MVP scope decisions, they should be documented in code comments and tracked as follow-up work.
### Critical Design Issues

**1. Arbitrary Type Restriction** (entire implementation)

**2. Artificial Dimensionality Limit** (Line 77)

```java
if (nDimsLong < 1 || nDimsLong > 2) {
```

QuestDB supports higher-dimensional arrays per the learnings. Why arbitrarily cap at 2D? This seems like an artificial limitation without technical justification.
**3. Redundant nDims Parameter** (Lines 67-79)

```sql
-- User error: says 2D but only provides 1 filler
array_build(2, 10, 1.0) -- Should fail but caught late
```
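The "caught late" complaint above could be addressed by a compile-time check that the filler count matches the declared shape. A minimal sketch of that idea (the class, method, and exact rule are illustrative, not QuestDB API; it assumes a 2D build requires one filler per row, matching the review's example):

```java
public class FillerCountCheck {
    // Sketch of the reviewer's suggestion: reject a filler-count mismatch at
    // query compile time instead of failing during row evaluation.
    static void validate(int nDims, int fillerCount) {
        if (nDims == 2 && fillerCount != 2) {
            throw new IllegalArgumentException(
                    "array_build: 2D build expects 2 fillers, got " + fillerCount);
        }
    }

    public static void main(String[] args) {
        validate(2, 2); // ok: two rows, two fillers
        try {
            validate(2, 1); // the array_build(2, 10, 1.0) case from the review
        } catch (IllegalArgumentException e) {
            System.out.println("rejected early: " + e.getMessage());
        }
    }
}
```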
### Performance & Memory Issues

**4. Unnecessary Argument Copying** (Lines 105-106)

```java
ObjList<Function> allArgs = new ObjList<>(args);
IntList allArgPositions = new IntList(argPositions);
```

The parameters are already marked …

**5. Inefficient Type Reset** (Line 217)

```java
array.setType(type);
```

You call `setType()` on every row even though the type never changes.
**6. Unoptimized Scalar Fill** (Lines 249-252)

```java
double val = fillerFunc.getDouble(rec);
for (int j = 0; j < size; j++) {
    mem.putDouble(val);
}
```

For large arrays, element-by-element `putDouble()` calls are far slower than a single bulk fill.
**7. Missing Integer Overflow Protection**

**8. No OOM Handling**

### Test Coverage Gaps

**9. 15 Uncovered Lines** (87.70% coverage)
### API & Thread Safety Concerns

**10. Thread Safety: False** (Line 263)

```java
public boolean isThreadSafe() {
    return false;
}
```

This limits parallelization in multi-core query execution.
**11. Hardcoded 2D Shape** (Lines 219-222)

```java
if (nDims == 1) {
    array.setDimLen(0, size);
} else {
    array.setDimLen(0, nDims); // Always 2 for 2D
    array.setDimLen(1, size);
}
```

For 2D arrays, you hardcode the first dimension to `nDims`.
### Comparison with Existing Code

Looking at existing array functions:

```java
final var memory = array.copyShapeAndStartMemoryA(arr);
```

This is cleaner than your manual shape setup. Why not follow established patterns?

### Specific Requests
The implementation is solid for its scope, but the design decisions seem overly conservative and the performance characteristics are unverified. I'd like to see justification for the limitations or a roadmap to lift them.

**12. No Performance Benchmarks**
**13. Coverage Details Unavailable**

### Edge Cases & Error Handling

**14. Filler Evaluation Order**

**15. Size Validation Asymmetry** (Lines 197-211)

```java
size = sizeArr.getCardinality(); // No bounds check!
```

What if a corrupted array returns an invalid cardinality?
**16. NaN Semantics** (Lines 232-234, 244-246)

```java
for (int j = 0; j < size; j++) {
    mem.putDouble(Double.NaN);
}
```

You use NaN for padding shorter arrays and for null array fillers. This conflates "missing value" with "padding". What if the source array legitimately contains NaN?
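The ambiguity the reviewer raises can be shown in a few lines (plain Java, illustrative only; `copyWithPadding` is an invented helper mirroring the padding behavior under review, not QuestDB code):

```java
public class NanPaddingDemo {
    // Copy src into a length-`size` array, NaN-padding the tail - mirrors the
    // padding behavior under review (illustrative, not QuestDB code).
    static double[] copyWithPadding(double[] src, int size) {
        double[] dst = new double[size];
        int copyLen = Math.min(src.length, size);
        System.arraycopy(src, 0, dst, 0, copyLen);
        for (int j = copyLen; j < size; j++) {
            dst[j] = Double.NaN; // padding
        }
        return dst;
    }

    public static void main(String[] args) {
        double[] dst = copyWithPadding(new double[]{1.0, Double.NaN}, 4);
        // dst[1] is a NaN from the source data, dst[2] is a NaN from padding;
        // a consumer cannot tell them apart:
        System.out.println(Double.isNaN(dst[1]) && Double.isNaN(dst[2])); // prints true
    }
}
```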
### Code Quality Issues

**17. Inconsistent Naming** (Lines 95, 98)

```java
int sizeType = sizeFunc.getType();
boolean sizeIsArray = ColumnType.isArray(sizeType);
```

vs.

```java
int fillerType = fillerFunc.getType();
```

Use consistent naming.

**18. Magic Number 2** (Lines 84, 111)

```java
private static final int MAX_DIMENSIONS = 2;
private static final int FIRST_FILLER_ARG_INDEX = 2;
```

### Summary: Show Me The Evidence

You've built a functional implementation with good test coverage for happy paths. But:
Concrete requests before merge:
This is a good foundation but feels rushed. The design constraints need justification, the performance characteristics need measurement, and the edge cases need hardening.
- Remove the nArrays <= 2 restriction so array_build() can combine N arrays into a single DOUBLE[][] (e.g., 4 order book columns into one 2D array with 4 rows). nArrays >= 2 always returns DOUBLE[][] regardless of how many rows.
- Rename nDims -> nArrays throughout for clarity: the parameter controls how many arrays are combined, not dimensionality.
- Add upper-bound validation on nArrays before the int cast to prevent overflow/truncation with huge literals.
- Add DIM_MAX_LEN bounds check on array-derived size (when size comes from an array filler's cardinality).
- Add tests for NaN-padding, truncation, element count overflow, null array fillers from columns, the nArrays=4 order book use case, nArrays overflow, and N-dimensional fuzz testing.
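The combined-array behavior this commit describes (N fillers becoming the N rows of a DOUBLE[][]) can be modeled in plain Java; this is an illustrative sketch of the semantics, not the QuestDB implementation, and the data below is invented:

```java
import java.util.Arrays;

public class CombineModel {
    // Model of the post-rename semantics: each filler becomes one row of a
    // 2D array; array fillers are NaN-padded/truncated to the target size.
    static double[][] combine(int size, double[]... fillers) {
        double[][] out = new double[fillers.length][size];
        for (int i = 0; i < fillers.length; i++) {
            int copyLen = Math.min(fillers[i].length, size);
            System.arraycopy(fillers[i], 0, out[i], 0, copyLen);
            for (int j = copyLen; j < size; j++) {
                out[i][j] = Double.NaN; // pad short rows
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. four order book columns combined into one 4-row 2D array
        double[][] book = combine(2,
                new double[]{100.1, 100.2}, // bid price (illustrative data)
                new double[]{5, 7},         // bid size
                new double[]{100.4, 100.5}, // ask price
                new double[]{4});           // ask size, NaN-padded to length 2
        System.out.println(Arrays.deepToString(book));
    }
}
```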
@coderabbitai, addressed your comments in latest commit

### CodeRabbit review response

**Integer overflow (CRITICAL)**

Already handled. Additionally, this push adds an upper-bound check on …

**Vect.setMemoryDouble() for scalar fill**

…

**setType() per row**

Required for correctness. After …

**nDims restricted to 1-2**

Addressed in this push. The cap is removed. True 3D+ would require a different function signature or dynamic return types, which QuestDB's function framework doesn't support.

**DOUBLE only**

Only …

**Bulk copy optimization for truncation**

Would need to reach into …

**Thread safety**

Standard pattern. The engine creates per-worker function instances via …

**Test coverage**

This push adds 7 new tests and converts 1 error test to a success test:
### Verified ✅

**1. Integer overflow protection** The additional safeguards you added strengthen this further.

**2. Vect.setMemoryDouble() optimization**

**3. setType() per row**

**4. nArrays restriction lifted**

**5. DOUBLE-only & bulk copy optimization**

**6. Thread safety**

**7. Test coverage enhanced**

### Outstanding Questions

### Final Assessment

My critical overflow concern was wrong - the protection was already robust. The enhancements you've added (bounds checks, comprehensive tests, nArrays flexibility) significantly strengthen the implementation. The remaining performance optimization opportunities (vectorized scalar fill, bulk truncation copy) are reasonable to defer given API constraints. Is this ready to merge from your perspective? The implementation is solid and well-tested.
Add two tests for gaps found during code review:

- test2dVariableSize: 2D array_build with a column-driven size that changes per row, verifying DirectArray resizes correctly.
- test2dZeroSizeArrayFiller: 2D array_build with size=0 and an array filler, verifying filler evaluation is skipped gracefully.

Also re-sort test methods alphabetically per project convention.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
[PR Coverage check] 😍 pass: 109 / 127 (85.83%)
Introduces array_build(nDims, size, filler1, ...) that creates DOUBLE[] or DOUBLE[][] arrays with controlled shape and fill values. The size parameter accepts a scalar integer or a DOUBLE[] (using its cardinality). Each filler can be a scalar (repeated for all elements) or a DOUBLE[] (copied element-by-element, NaN-padded if shorter, truncated if longer).
Docs at https://preview-374--questdb-documentation.netlify.app/docs/query/functions/array/#array_build
Includes 34 unit tests covering all code paths (scalar/array fillers, 1D/2D, NaN padding, truncation, null handling, error conditions, large arrays, multi-dim rejection) and 4 fuzz tests (1D/2D scalar fill, 1D/2D array copy identity) with 100 iterations each.