Update DataFusion and change order of optimization rules #825

ayushdg merged 26 commits into dask-contrib:main
Conversation
@@ -167,18 +167,17 @@ def convert(
    "TimestampNanosecond",
}:
    unit_mapping = {
It looks like we were not previously exercising this code path, since there were some obvious bugs in here.
Codecov Report
@@ Coverage Diff @@
## main #825 +/- ##
==========================================
+ Coverage 77.04% 77.43% +0.39%
==========================================
Files 71 71
Lines 3594 3599 +5
Branches 632 634 +2
==========================================
+ Hits 2769 2787 +18
+ Misses 696 679 -17
- Partials 129 133 +4
literal_value, timezone = rex.getTimestampValue()
if timezone and timezone != "UTC":
    raise ValueError("Non UTC timezones not supported")
literal_type = SqlTypeName.TIMESTAMP
after this if block, you can have:

    elif timezone is None:
        literal_value = datetime.fromtimestamp(literal_value // 10**9)

(make sure to do a `from datetime import datetime`)
btw, I chose 10**9 to convert to seconds, as suggested by https://stackoverflow.com/questions/45423917/pandas-errno-75-value-too-large-for-defined-data-type, but maybe milliseconds (or something else) would be generally better.
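As an aside, the nanoseconds-to-seconds step in the suggestion can be sketched in isolation; the epoch value below is illustrative, not taken from the PR:

```python
from datetime import datetime

# Illustrative nanosecond epoch value, similar to what a
# TimestampNanosecond literal carries (not a value from the PR).
literal_value = 1_609_459_200_000_000_000  # 2021-01-01 00:00:00 UTC in ns

# Integer-divide by 10**9 to reduce nanoseconds to whole seconds, the
# unit datetime.fromtimestamp expects.
seconds = literal_value // 10**9
converted = datetime.fromtimestamp(seconds)
```

If sub-second precision mattered, one could pass the float `literal_value / 10**9` instead of using integer division, at the cost of floating-point rounding in the fractional part.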
Thanks @sarahyurick that fixes the regression locally for me. I have pushed this change.
There is a GPU CI failure: @sarahyurick do you know what we need to do to fix this?
Hmm, maybe you could try adding a
select count(*) from (
-   select * from df_simple
+   select a, b from df_simple
    intersect
-   select * from df_simple
+   select a, b from df_simple
    intersect
-   select * from df_wide
+   select a, b from df_wide
) hot_item
limit 100
This query was invalid: `intersect` requires that all relations have the same number of columns. For example, Postgres would fail with `ERROR: each INTERSECT query must have the same number of columns`. DataFusion now has a check for this.
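The column-count rule is easy to demonstrate outside DataFusion; this sketch uses `sqlite3` (the table names mirror the test tables, but the schemas here are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df_simple (a INTEGER, b INTEGER)")
con.execute("CREATE TABLE df_wide (a INTEGER, b INTEGER, c INTEGER)")

# SELECT * yields 2 columns on one side and 3 on the other, so the
# engine rejects the INTERSECT at prepare time.
try:
    con.execute("SELECT * FROM df_simple INTERSECT SELECT * FROM df_wide")
except sqlite3.OperationalError as exc:
    print("rejected:", exc)

# Projecting matching columns on both sides is accepted.
rows = con.execute(
    "SELECT a, b FROM df_simple INTERSECT SELECT a, b FROM df_wide"
).fetchall()
```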
if timezone and timezone != "UTC":
    raise ValueError("Non UTC timezones not supported")
elif timezone is None:
    literal_value = datetime.fromtimestamp(literal_value // 10**9)
If gpuCI still fails, you can try adding literal_value = str(literal_value) after this line (but still in the elif block).
Thanks Sarah. I tried that but this still fails. Here is more info on the failure:
self = <dask_sql.physical.rex.core.call.ReduceOperation object at 0x7f4717a097c0>
operands = ('2001-03-09', datetime.timedelta(days=90)), kwargs = {}
def reduce(self, *operands, **kwargs):
if len(operands) > 1:
if any(
map(
lambda op: is_frame(op) & pd.api.types.is_datetime64_dtype(op),
operands,
)
):
operands = tuple(map(as_timelike, operands))
> return reduce(partial(self.operation, **kwargs), operands)
E TypeError: can only concatenate str (not "datetime.timedelta") to str
dask_sql/physical/rex/core/call.py:137: TypeError
Perhaps `def reduce` or `def as_timelike` need updating to support the interval type that the Rust code is returning?
It looks like @charlesbluca may be familiar with these methods and maybe has some insight here.
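For what it's worth, the failure reduces to a two-line repro, and parsing the date literal into a `datetime` first is one possible direction (a sketch, not the actual dask-sql fix):

```python
from datetime import datetime, timedelta
from functools import reduce
import operator

# The operand tuple from the traceback above.
operands = ("2001-03-09", timedelta(days=90))

# Reducing with + over a string date and a timedelta reproduces the error.
try:
    reduce(operator.add, operands)
except TypeError as exc:
    print(exc)  # can only concatenate str (not "datetime.timedelta") to str

# Parsing the date literal first makes the addition well-defined.
parsed = (datetime.fromisoformat(operands[0]), operands[1])
print(reduce(operator.add, parsed))  # 2001-06-07 00:00:00
```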
Curious if this change is still needed
I don't think we need the string casting, but it seems like `datetime.fromtimestamp` is necessary to get the proper timezone for the datetime.
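On the timezone point: `datetime.fromtimestamp` interprets the epoch in the machine's local timezone unless a `tz` argument is given, which a small sketch (illustrative value, not from the PR) makes visible:

```python
from datetime import datetime, timezone

ts = 1_609_459_200  # 2021-01-01 00:00:00 UTC, in whole seconds

# Without tz, fromtimestamp converts to local time and returns a naive
# datetime; with tz=timezone.utc the result is pinned to UTC.
local_naive = datetime.fromtimestamp(ts)
utc_aware = datetime.fromtimestamp(ts, tz=timezone.utc)
print(utc_aware.isoformat())  # 2021-01-01T00:00:00+00:00
```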
ayushdg left a comment
Minor question but otherwise lgtm!
There have been a lot of improvements to the optimization rules in DataFusion recently, particularly related to type coercion. This PR picks up those improvements.
Closes #844