bug: upper and lower not compatible with Spark for international character sets #483

@andygrove

Description

Describe the bug

I discovered this bug using Comet Fuzz.

SELECT lower(c2) FROM test0

[ERROR] Spark and Comet produced different results.

Spark Plan

*(1) Project [lower(c2#2) AS lower(c2)#1546]
+- *(1) ColumnarToRow
   +- FileScan parquet [c2#2] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/andy/git/apple/comet-fuzz/test0.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c2:string>

Comet Plan

*(1) ColumnarToRow
+- CometProject [lower(c2)#1550], [lower(c2#2) AS lower(c2)#1550]
   +- CometScan parquet [c2#2] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/Users/andy/git/apple/comet-fuzz/test0.parquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c2:string>

Results

Spark and Comet both produced 200 rows.
First difference at row 91:
Spark: [ሽ븃靛脉项ໍ绹跅]
Comet: [ძ訆喀ꓚ疟⓰⡃婳]
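The exact mechanism behind the mismatch isn't clear from the output alone, but a common source of `lower`/`upper` divergence between engines is the case-mapping semantics applied to non-ASCII input: an ASCII-only fast path produces different results than full Unicode case mapping. The sketch below (illustrative only, not Comet's actual implementation) shows how the two approaches diverge on international character sets:

```python
# Hedged sketch: contrasts a naive ASCII-only lower() with
# Unicode-aware lowercasing. Not Comet's actual code -- just an
# illustration of how case-mapping semantics can diverge.

def ascii_lower(s: str) -> str:
    # Only maps A-Z, leaving all other characters untouched --
    # a typical fast path in native string kernels.
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in s
    )

samples = ["HELLO", "ÄÖÜ", "ΣΙΓΜΑ", "İstanbul"]
for s in samples:
    if ascii_lower(s) != s.lower():
        print(f"{s!r}: ascii={ascii_lower(s)!r} unicode={s.lower()!r}")
```

On `"ÄÖÜ"`, for example, the ASCII-only version returns the input unchanged while Unicode-aware lowercasing produces `"äöü"`. Any fix would need Comet's native kernel to match the Unicode case mapping that Spark (via the JVM) applies.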

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

Metadata

Labels

bug (Something isn't working), good first issue (Good for newcomers), help wanted (Extra attention is needed)
