Skip to content

Incorrect results returned for TPC-H Query 8 #6794

@osawyerr

Description

@osawyerr

Describe the bug

Datafusion gives incorrect results when running TPC-H Query 8 with parquet files.

To Reproduce

  1. Generate TPC-H parquet files for scale factor 10
  2. Open datafusion-cli and create external tables pointing to files
create external table lineitem stored as parquet location '/path/to/lineitem/lineitem_1687987398_default/';
create external table customer stored as parquet location '/path/to/customer/customer_1687987384_default/';
create external table nation stored as parquet location '/path/to/nation/nation_1687988005_default/';
create external table orders stored as parquet location '/path/to/orders/orders_1687988005_default/';
create external table region stored as parquet location '/path/to/region/region_1687988223_default/';
create external table supplier stored as parquet location '/path/to/supplier/supplier_1687988223_default/';
create external table part stored as parquet location '/path/to/part/part_1687988133_default/';
  1. Run TPC-H Query 8
select
  o_year, sum(case when nation = 'BRAZIL' then volume else 
0
 end) / sum(volume) as mkt_share
from
  (
    select
      extract(year from o_orderdate) as o_year,
      l_extendedprice * (
1
 - l_discount) as volume,
      n2.n_name as nation
    from part, supplier, lineitem, orders, customer, nation n1, nation n2, region
    where
      p_partkey = l_partkey
      and s_suppkey = l_suppkey
      and l_orderkey = o_orderkey
      and o_custkey = c_custkey
      and c_nationkey = n1.n_nationkey
      and n1.n_regionkey = r_regionkey
      and r_name = 'AMERICA'
      and s_nationkey = n2.n_nationkey
      and o_orderdate between date '1995-01-01' and date '1996-12-31'
      and p_type = 'ECONOMY ANODIZED STEEL'
  ) as all_nations
group by o_year
order by o_year;
  1. Incorrect results displayed below
+--------+-------------------------------------------+
| o_year | mkt_share                                 |
+--------+-------------------------------------------+
| 1995.0 | -0.00000000000011380044067220119495060732 |
| 1996.0 | 0.00000000000019588288500717383285218261  |
+--------+-------------------------------------------+
2 rows in set. Query took 1.131 seconds.

Expected behavior

The correct results should be:

  1. From Postgres:
 o_year |       mkt_share        
--------+------------------------
   1995 | 0.03882014251433219622
   1996 | 0.03948968749183991638
(2 rows)

Time: 11925.416 ms (00:11.925)
  1. From DuckDb (with same parquet files):
┌────────┬─────────────────────┐
│ o_year │      mkt_share      │
│ int64  │       double        │
├────────┼─────────────────────┤
│   1995 │  0.0388201425143322 │
│   1996 │ 0.03948968749183992 │
└────────┴─────────────────────┘
Run Time (s): real 1.328 user 8.095730 sys 0.358842

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions