cross join, #1717 #4544

jangorecki · 2020-06-14T00:49:40Z

On hold till #4370 merged, so we can re-use dtmerge here.

cross join short circuit in bmerge by on=character() after base::merge

NEWS.md

tlapak · 2020-06-14T19:22:48Z

Cool, this really has been missing!

I'll leave my overall comments here and a couple more specific ones in appropriate places:

Also with an eye towards #852, I think it would be better to check for a cross join right at the top of the is.data.table(i) branch and explicitly set the ans variables (like f__ etc.) because those are explicitly known at that point. Then you can skip all the rest of that branch, including bmerge. I also don't think you gain anything much by doing this in C since this step isn't all that complex. The only thing you may need to keep or duplicate is the is.na(which) bit, although I'm not sure it makes sense to allow which at a non-default value.

I'm not sold on the interface. Passing character() to on seems to be more likely to be a bug rather than intended. For example, when constructing the on argument like intersect(cols1, cols2) this can unintentionally return character() if there is a typo in one of the column names. (That's what happened to me, anyway.) I do see the logic but I think it's better to be explicit here, e.g.: on=.CROSS, on=TRUE (less preferred imo) or mult="cross" or even crossjoin=TRUE. Compatibility with base::merge is pretty broken anyway.

This should then also be available through merge.data.table.

jangorecki · 2020-06-15T08:47:56Z

@tlapak Thanks for comments

Then you can skip all the rest of that branch, including bmerge.

What about any other code that calls bmerge? It won't support cross join anymore, it will need same code from [ being duplicated there. My motivation to put it inside bmerge was exactly that. So whenever we want to use bmerge we don't have to escape cross join before calling bmerge. One example off the top of my head is mergelist #4370.

when constructing the on argument like intersect(cols1, cols2) this can unintentionally return character()

This is a perfect example, agree it make sense to use on=.CROSS, this will be aligned to a .NATURAL keyword we now use. I checked how dplyr handles that and starting from 0.9.0 it uses by=character(), same as base R merge.

I will wait for some more feedback before making decision on API.

MichaelChirico · 2020-06-15T10:27:05Z

I agree with Václav on=character() seems error prone. Does on=NULL map to anything? Is it less error-prone?

on=TRUE or on=.(1==1) might be more SQL-like, though I'm not an immediate fan of either.

on=.CROSS works but I'm wary of a new keyword if we can agree on something more direct.

on=⊗ for chaotic good...

tlapak

I hadn't realized that I hadn't submitted these comments...

R/data.table.R

tlapak · 2020-06-18T21:28:10Z

Then you can skip all the rest of that branch, including bmerge.

What about any other code that calls bmerge? It won't support cross join anymore, it will need same code from [ being duplicated there. My motivation to put it inside bmerge was exactly that. So whenever we want to use bmerge we don't have to escape cross join before calling bmerge. One example off the top of my head is mergelist #4370.

I looked at mergelist and you have it separate there at the moment, right? I prefer to be more explicit and branch early when the branches have no real code in common. Wrapping the bit that returns the parameters for the cross join in it's own function would be nice. And then a cross join is so different from an inner join and bmerge.c says right at the top "Implements binary search". So I wouldn't expect bmerge to give me a cross join as well.

I think your suggestion from mergelist to use dtmerge and refactor the relevant [.data.table part is a good idea. It's also fairly self contained. Would be a bit broader scope of course.

Just my thoughts on the point.

This is a perfect example, agree it make sense to use on=.CROSS, this will be aligned to a .NATURAL keyword we now use. I checked how dplyr handles that and starting from 0.9.0 it uses by=character(), same as base R merge.

I will wait for some more feedback before making decision on API.

I've been ruminating on the interface a bit and am fully in favor of the chaotic good option...
I'm not a fan of on=.(1==1) (that would be lawful evil, I guess). I'd just like to throw out nomatch="cross" for good measure because nomatch kind of controls join type between right outer and inner. And I would also like to think that in SQL CROSS JOIN is preferred over JOIN [...] ON 1=1

Does on=NULL map to anything? Is it less error-prone?

That's the default value. It's not tested for NULL but I would always expect that explicitly passing the default value can't change the behaviour.

ColeMiller1 · 2020-06-19T02:11:30Z

As Jan mentions, on = .NATURAL is out there. To continue with this API, on = .CROSS is a good compliment.

The SQL 1 == 1 I've used in the first WHERE condition to make it easier to comment out other predicate conditions - I have never seen it to indicate a CROSS JOIN.

I also like the idea of separating the cross join from bmerge(). It seems like we could reuse CJ()for this to generate indices. This FR is the fastest of the methods for small datasets, using CJ() is faster for bigger datasets although allowing for CJ() would require some extra adjustments.

dt1 = data.table(x1 = sample(10))
dt2 = data.table(y1 = sample(20))
bench::mark(
  new = dt1[dt2, on = character()],
  using_cj = {
    inds = CJ(seq_len(nrow(dt2)), seq_len(nrow(dt1)))
    setDT(c(dt1[inds$V2], dt2[inds$V1]))
  },
  using_base = setDT(c(dt1[rep.int(seq_len(nrow(dt1)), nrow(dt2))], dt2[rep(seq_len(nrow(dt2)), each = nrow(dt1))]))
)

jangorecki · 2020-08-25T16:06:10Z

I updated branch little bit and addressed some of feedback provided. I am leaving it now till we will have mergelist branch merged to master, so we can re-use dtmerge rather than duplicating same functionality.

cross join, #1717

18c53d2

jangorecki linked an issue Jun 14, 2020 that may be closed by this pull request

allow cross join in [.data.table #1717

Open

news and manual

a491bff

jangorecki requested a review from tlapak June 14, 2020 00:55

MichaelChirico reviewed Jun 14, 2020

View reviewed changes

NEWS.md Outdated Show resolved Hide resolved

tlapak reviewed Jun 18, 2020

View reviewed changes

R/data.table.R Outdated Show resolved Hide resolved

R/data.table.R Outdated Show resolved Hide resolved

jangorecki added the WIP label Jun 18, 2020

jangorecki added 4 commits August 25, 2020 17:09

Merge branch 'master' into crossjoin

f67d1c8

news feedback

04ee0d5

mult arg handling and tests

d313679

cross join flag, feedback

273a5fa

mattdowle added this to the 1.13.1 milestone Aug 25, 2020

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022

Darxor mentioned this pull request Oct 17, 2022

Support data.frame cross-join markfairbanks/tidytable#655

Closed

jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023

MichaelChirico removed the WIP label Feb 19, 2024

MichaelChirico marked this pull request as draft February 19, 2024 02:44

MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024

MichaelChirico removed this from the 1.17.0 milestone Dec 3, 2024

MichaelChirico added this to the 1.18.0 milestone Dec 3, 2024

jangorecki modified the milestones: 1.18.0, 1.19.0 Nov 30, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cross join, #1717 #4544

cross join, #1717 #4544

Uh oh!

jangorecki commented Jun 14, 2020 •

edited

Loading

Uh oh!

Uh oh!

tlapak commented Jun 14, 2020

Uh oh!

jangorecki commented Jun 15, 2020

Uh oh!

MichaelChirico commented Jun 15, 2020

Uh oh!

tlapak left a comment

Uh oh!

Uh oh!

Uh oh!

tlapak commented Jun 18, 2020 •

edited

Loading

Uh oh!

ColeMiller1 commented Jun 19, 2020

Uh oh!

jangorecki commented Aug 25, 2020 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

cross join, #1717 #4544

Are you sure you want to change the base?

cross join, #1717 #4544

Uh oh!

Conversation

jangorecki commented Jun 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

tlapak commented Jun 14, 2020

Uh oh!

jangorecki commented Jun 15, 2020

Uh oh!

MichaelChirico commented Jun 15, 2020

Uh oh!

tlapak left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

tlapak commented Jun 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ColeMiller1 commented Jun 19, 2020

Uh oh!

jangorecki commented Aug 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

jangorecki commented Jun 14, 2020 •

edited

Loading

tlapak commented Jun 18, 2020 •

edited

Loading

jangorecki commented Aug 25, 2020 •

edited

Loading