Skip to content

Performance issue with multiple levels of nesting of if statements #72721

@njcstreet

Description

@njcstreet

Hi,

We are having a performance issue with queries which contain complex columns (with if statements or any long expressions) that reference each other. Columns can be constructed from a number of if statements, and then we have dependent columns which re-use these columns. I believe the issue could be due to the way that alias names are constructed - it seems that the alias names are growing exponentially as we chain together more dependencies. If you take the example table / query below (sample generated by ChatGPT), and run in https://fiddle.clickhouse.com/, (current version 24.11.1.2557) the query takes 9 seconds even though the table contains only one row, and it is necessary to increase max_expanded_ast_elements to allow it to run. This is a heavily simplified example - we are looking at options such as materialising some columns but it is not always possible.

CREATE TABLE sample_table (
    id UInt64,
    value UInt64
) 
ENGINE = MergeTree()
ORDER BY id;

-- Insert sample data
INSERT INTO sample_table VALUES (1, 10);

--explain plan actions = 1
WITH
    -- Deeply nested IF statements for col1
    if(value = 10, 'A',
        if(value = 20, 'B',
            if(value = 30, 'C',
                if(value = 40, 'D',
                    if(value = 50, 'E', 'F')
                )
            )
        )
    ) AS col1,

    -- Another deeply nested IF statements relying on col1
    if(col1 = 'A', 'Alpha',
        if(col1 = 'B', 'Beta',
            if(col1 = 'C', 'Gamma',
                if(col1 = 'D', 'Delta',
                    if(col1 = 'E', 'Epsilon', 'Other')
                )
            )
        )
    ) AS col2,

    -- Deeply nested IF statements relying on col2
    if(col2 = 'Alpha', 1,
        if(col2 = 'Beta', 2,
            if(col2 = 'Gamma', 3,
                if(col2 = 'Delta', 4,
                    if(col2 = 'Epsilon', 5, 0)
                )
            )
        )
    ) AS col3,

    -- Deeply nested IF statements relying on col3
    if(col3 = 1, 'One',
        if(col3 = 2, 'Two',
            if(col3 = 3, 'Three',
                if(col3 = 4, 'Four',
                    if(col3 = 5, 'Five', 'Zero')
                )
            )
        )
    ) AS col4,

    -- Deeply nested IF statements relying on col4
    if(col4 = 'One', 'Uno',
        if(col4 = 'Two', 'Dos',
            if(col4 = 'Three', 'Tres',
                if(col4 = 'Four', 'Cuatro',
                    if(col4 = 'Five', 'Cinco', 'Cero')
                )
            )
        )
    ) AS col5,

    -- Deeply nested IF statements relying on col5
    if(col5 = 'Uno', 'I',
        if(col5 = 'Dos', 'II',
            if(col5 = 'Tres', 'III',
                if(col5 = 'Cuatro', 'IV',
                    if(col5 = 'Cinco', 'V', 'Other')
                )
            )
        )
    ) AS col6,

    -- Deeply nested IF statements relying on col6
    if(col6 = 'I', 'Primero',
        if(col6 = 'II', 'Segundo',
            if(col6 = 'III', 'Tercero',
                if(col6 = 'IV', 'Cuarto',
                    if(col6 = 'V', 'Quinto', 'Otro')
                )
            )
        )
    ) AS col7,

    -- Deeply nested IF statements relying on col7
    if(col7 = 'Primero', 'First',
        if(col7 = 'Segundo', 'Second',
            if(col7 = 'Tercero', 'Third',
                if(col7 = 'Cuarto', 'Fourth',
                    if(col7 = 'Quinto', 'Fifth', 'Other')
                )
            )
        )
    ) AS col8

SELECT 
    id, 
    col1, 
    col2, 
    col3, 
    col4, 
    col5, 
    col6, 
    col7, 
    col8
FROM sample_table
settings max_expanded_ast_elements  = 5000000

If we run the query, increasing the number of columns each time, we can see performance getting much worse with each new column.

5 cols - 194ms
6 cols - 463ms
7 cols - 1.854s
8 cols - 9.181s

If we run the query with explain plan actions = 1, the aliases at the bottom are absolutely huge - my guess is that this is the cause of the poor performance. My question is, is there a setting which can be applied to prevent this behaviour, or a way to re-write the query to avoid this excessive alias naming? ChatGPT has suggested one approach which does actually work, which involves chaining CTEs together. If there is no better solution we will consider this approach, but I am wondering if there is a better way to do it. The query below runs in 143ms with all 8 columns, but it is obviously a bit cumbersome to write it like this. Thanks for any help you can provide!

CREATE TABLE sample_table (
    id UInt64,
    value UInt64
) 
ENGINE = MergeTree()
ORDER BY id;

-- Insert sample data
INSERT INTO sample_table VALUES (1, 10);

WITH
    -- Step 1: Compute col1
    col1_cte AS (
        SELECT 
            id,
            if(value = 10, 'A',
                if(value = 20, 'B',
                    if(value = 30, 'C',
                        if(value = 40, 'D',
                            if(value = 50, 'E', 'F')
                        )
                    )
                )
				
            ) AS col1
        FROM sample_table
    ),

    -- Step 2: Compute col2 based on col1
    col2_cte AS (
        SELECT 
            id,
            col1,
            if(col1 = 'A', 'Alpha',
                if(col1 = 'B', 'Beta',
                    if(col1 = 'C', 'Gamma',
                        if(col1 = 'D', 'Delta',
                            if(col1 = 'E', 'Epsilon', 'Other')
                        )
                    )
                )
            ) AS col2
        FROM col1_cte
    ),

    -- Step 3: Compute col3 based on col2
    col3_cte AS (
        SELECT 
            id,
            col1,
            col2,
            if(col2 = 'Alpha', 1,
                if(col2 = 'Beta', 2,
                    if(col2 = 'Gamma', 3,
                        if(col2 = 'Delta', 4,
                            if(col2 = 'Epsilon', 5, 0)
                        )
                    )
                )
            ) AS col3
        FROM col2_cte
    ),

    -- Step 4: Compute col4 based on col3
    col4_cte AS (
        SELECT 
            id,
            col1,
            col2,
            col3,
            if(col3 = 1, 'One',
                if(col3 = 2, 'Two',
                    if(col3 = 3, 'Three',
                        if(col3 = 4, 'Four',
                            if(col3 = 5, 'Five', 'Zero')
                        )
                    )
                )
            ) AS col4
        FROM col3_cte
    ),

    -- Step 5: Compute col5 based on col4
    col5_cte AS (
        SELECT 
            id,
            col1,
            col2,
            col3,
            col4,
            if(col4 = 'One', 'Uno',
                if(col4 = 'Two', 'Dos',
                    if(col4 = 'Three', 'Tres',
                        if(col4 = 'Four', 'Cuatro',
                            if(col4 = 'Five', 'Cinco', 'Cero')
                        )
                    )
                )
            ) AS col5
        FROM col4_cte
    ),

    -- Step 6: Compute col6 based on col5
    col6_cte AS (
        SELECT 
            id,
            col1,
            col2,
            col3,
            col4,
            col5,
            if(col5 = 'Uno', 'I',
                if(col5 = 'Dos', 'II',
                    if(col5 = 'Tres', 'III',
                        if(col5 = 'Cuatro', 'IV',
                            if(col5 = 'Cinco', 'V', 'Other')
                        )
                    )
                )
            ) AS col6
        FROM col5_cte
    ),

    -- Step 7: Compute col7 based on col6
    col7_cte AS (
        SELECT 
            id,
            col1,
            col2,
            col3,
            col4,
            col5,
            col6,
            if(col6 = 'I', 'Primero',
                if(col6 = 'II', 'Segundo',
                    if(col6 = 'III', 'Tercero',
                        if(col6 = 'IV', 'Cuarto',
                            if(col6 = 'V', 'Quinto', 'Otro')
                        )
                    )
                )
            ) AS col7
        FROM col6_cte
    ),

    -- Step 8: Compute col8 based on col7
    col8_cte AS (
        SELECT 
            id,
            col1,
            col2,
            col3,
            col4,
            col5,
            col6,
            col7,
            if(col7 = 'Primero', 'First',
                if(col7 = 'Segundo', 'Second',
                    if(col7 = 'Tercero', 'Third',
                        if(col7 = 'Cuarto', 'Fourth',
                            if(col7 = 'Quinto', 'Fifth', 'Other')
                        )
                    )
                )
            ) AS col8
        FROM col7_cte
    )

-- Final Query to Get All Columns
SELECT 
    id, 
    col1, 
    col2, 
    col3, 
    col4, 
    col5, 
    col6, 
    col7, 
    col8
FROM col8_cte

Metadata

Metadata

Assignees

Labels

analyzerIssues and pull-requests related to new analyzerperformance

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions