Port remaining parallelizable aggregate functions to off-heap data structures #4120

@puzpuzpuz

Is your feature request related to a problem?

#4097 ported min(str), max(str), as well as count_distinct() for long, int, and IPv4 types to parallel GROUP BY, but some functions remain unported. Namely:

  • count_distinct(uuid): requires a new long128 hash set, similar to the GroupByLongHashSet one
  • count_distinct(long256): requires a new long256 hash set, similar to the GroupByLongHashSet one
  • approx_percentile(double): this one is tricky as we'll have to port HdrHistogram to become off-heap and flyweight
  • all first/last and first_not_null/last_not_null functions: to port them, we'll have to access and store row ids in the group by map
  • isOrdered(IPv4)/isOrdered(long) functions: again, we need to track row ids
  • ksum/nsum
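As a rough illustration of what the long128 hash set mentioned for count_distinct(uuid) might look like, here is a minimal sketch. It is not QuestDB code and does not mirror the GroupByLongHashSet API: the class name, the method names, and the use of a plain long[] as a stand-in for off-heap memory are all assumptions made for the example. It also assumes that the pair (Long.MIN_VALUE, Long.MIN_VALUE) can serve as the empty-slot marker, which means that pair itself is never counted.

```java
import java.util.Arrays;

// Hypothetical sketch, not QuestDB code: an open-addressing set of 128-bit keys
// stored as (lo, hi) pairs in one flat long[] (a stand-in for off-heap memory).
final class Long128HashSetSketch {
    // Assumption: (EMPTY, EMPTY) marks a free slot, so that pair cannot be stored.
    private static final long EMPTY = Long.MIN_VALUE;
    private long[] slots;  // slot i occupies slots[2 * i] (lo) and slots[2 * i + 1] (hi)
    private int capacity;  // number of slots, always a power of two
    private int size;      // number of distinct keys stored

    Long128HashSetSketch(int initialCapacity) {
        // Round up to a power of two, with a minimum of 4 slots.
        capacity = Integer.highestOneBit(Math.max(4, initialCapacity) - 1) << 1;
        slots = new long[2 * capacity];
        Arrays.fill(slots, EMPTY);
    }

    // Returns true if the key was absent and has been added.
    boolean add(long lo, long hi) {
        if (lo == EMPTY && hi == EMPTY) {
            return false; // reserved marker, skipped
        }
        if (2 * (size + 1) > capacity) {
            rehash(); // keep the load factor at or below 0.5
        }
        int i = probe(slots, capacity, lo, hi);
        if (slots[2 * i] == lo && slots[2 * i + 1] == hi) {
            return false; // already present
        }
        slots[2 * i] = lo;
        slots[2 * i + 1] = hi;
        size++;
        return true;
    }

    // Merges another worker's set into this one.
    void addAll(Long128HashSetSketch other) {
        for (int i = 0; i < other.capacity; i++) {
            long lo = other.slots[2 * i];
            long hi = other.slots[2 * i + 1];
            if (lo != EMPTY || hi != EMPTY) {
                add(lo, hi);
            }
        }
    }

    int size() {
        return size;
    }

    // Linear probing: returns the slot holding the key, or the first free slot.
    private static int probe(long[] table, int cap, long lo, long hi) {
        int mask = cap - 1;
        int i = hash(lo, hi) & mask;
        while (true) {
            long sLo = table[2 * i];
            long sHi = table[2 * i + 1];
            if ((sLo == EMPTY && sHi == EMPTY) || (sLo == lo && sHi == hi)) {
                return i;
            }
            i = (i + 1) & mask;
        }
    }

    private static int hash(long lo, long hi) {
        long h = (lo * 0x9E3779B97F4A7C15L) ^ Long.rotateLeft(hi * 0xC2B2AE3D27D4EB4FL, 32);
        return (int) (h ^ (h >>> 32));
    }

    // Doubles the table and reinserts every stored key.
    private void rehash() {
        int newCapacity = capacity * 2;
        long[] newSlots = new long[2 * newCapacity];
        Arrays.fill(newSlots, EMPTY);
        for (int i = 0; i < capacity; i++) {
            long lo = slots[2 * i];
            long hi = slots[2 * i + 1];
            if (lo != EMPTY || hi != EMPTY) {
                int j = probe(newSlots, newCapacity, lo, hi);
                newSlots[2 * j] = lo;
                newSlots[2 * j + 1] = hi;
            }
        }
        slots = newSlots;
        capacity = newCapacity;
    }
}
```

In a parallel GROUP BY, each worker would fill its own set and the results would be combined with something like addAll(), after which size() gives the distinct count. A long256 variant would presumably follow the same layout with four longs per slot.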

There is also count_distinct(symbol), but we have early exit logic in that function (see #3974), so we don't want to port it, at least for now.
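Returning to the first/last and isOrdered items in the list above, the reason row ids matter is that workers see rows out of order, so a partial result has to remember which row it came from before two partial results can be merged. The following is a minimal sketch of that idea for first(long)/last(long). It is not QuestDB code: the class, its fields, and the assumption that row ids are non-negative and increase with table order are all hypothetical.

```java
// Hypothetical sketch, not QuestDB code: per-group partial state for
// first(long) and last(long) that stays mergeable across workers.
final class FirstLastSketch {
    long firstRowId = Long.MAX_VALUE; // no row seen yet
    long firstValue;
    long lastRowId = -1;              // assumption: real row ids are >= 0
    long lastValue;

    // Called by a worker for every row of the group it processes, in any order.
    void accept(long rowId, long value) {
        if (rowId < firstRowId) {
            firstRowId = rowId;
            firstValue = value;
        }
        if (rowId > lastRowId) {
            lastRowId = rowId;
            lastValue = value;
        }
    }

    // Combines another worker's partial state into this one.
    void merge(FirstLastSketch other) {
        if (other.firstRowId < firstRowId) {
            firstRowId = other.firstRowId;
            firstValue = other.firstValue;
        }
        if (other.lastRowId > lastRowId) {
            lastRowId = other.lastRowId;
            lastValue = other.lastValue;
        }
    }
}
```

Without the stored row ids, merge() would have no way to decide which worker's value is the true first or last one.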

Describe the solution you'd like.

No response

Describe alternatives you've considered.

No response

Full Name:

Andrei Pechkurov

Affiliation:

QuestDB

Additional context

No response

Labels

  • Enhancement: Enhance existing functionality
  • Help wanted: Assistance or additional information is wanted
  • Performance: Performance improvements
  • SQL: Issues or changes relating to SQL execution
  • hacktoberfest: A good issue for Hacktoberfest 2025 contributors. No AI-driven commits, please
