Skip to content

Conversation

@jangorecki
Copy link
Member

@jangorecki jangorecki commented Jun 14, 2020

On hold till #4370 merged, so we can re-use dtmerge here.


cross join short circuit in bmerge by on=character() after base::merge

@jangorecki jangorecki linked an issue Jun 14, 2020 that may be closed by this pull request
@jangorecki jangorecki requested a review from tlapak June 14, 2020 00:55
@tlapak
Copy link
Contributor

tlapak commented Jun 14, 2020

Cool, this really has been missing!

I'll leave my overall comments here and a couple more specific ones in appropriate places:

Also with an eye towards #852, I think it would be better to check for a cross join right at the top of the is.data.table(i) branch and explicitly set the ans variables (like f__ etc.) because those are explicitly known at that point. Then you can skip all the rest of that branch, including bmerge. I also don't think you gain anything much by doing this in C since this step isn't all that complex. The only thing you may need to keep or duplicate is the is.na(which) bit, although I'm not sure it makes sense to allow which at a non-default value.

I'm not sold on the interface. Passing character() to on seems to be more likely to be a bug rather than intended. For example, when constructing the on argument like intersect(cols1, cols2) this can unintentionally return character() if there is a typo in one of the column names. (That's what happened to me, anyway.) I do see the logic but I think it's better to be explicit here, e.g.: on=.CROSS, on=TRUE (less preferred imo) or mult="cross" or even crossjoin=TRUE. Compatibility with base::merge is pretty broken anyway.

This should then also be available through merge.data.table.

@jangorecki
Copy link
Member Author

@tlapak Thanks for comments

Then you can skip all the rest of that branch, including bmerge.

What about any other code that calls bmerge? It won't support cross join anymore, it will need same code from [ being duplicated there. My motivation to put it inside bmerge was exactly that. So whenever we want to use bmerge we don't have to escape cross join before calling bmerge. One example off the top of my head is mergelist #4370.

when constructing the on argument like intersect(cols1, cols2) this can unintentionally return character()

This is a perfect example, agree it make sense to use on=.CROSS, this will be aligned to a .NATURAL keyword we now use. I checked how dplyr handles that and starting from 0.9.0 it uses by=character(), same as base R merge.

I will wait for some more feedback before making decision on API.

@MichaelChirico
Copy link
Member

I agree with Václav on=character() seems error prone. Does on=NULL map to anything? Is it less error-prone?

on=TRUE or on=.(1==1) might be more SQL-like, though I'm not an immediate fan of either.

on=.CROSS works but I'm wary of a new keyword if we can agree on something more direct.

on=⊗ for chaotic good...

Copy link
Contributor

@tlapak tlapak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't realized that I hadn't submitted these comments...

@tlapak
Copy link
Contributor

tlapak commented Jun 18, 2020

Then you can skip all the rest of that branch, including bmerge.

What about any other code that calls bmerge? It won't support cross join anymore, it will need same code from [ being duplicated there. My motivation to put it inside bmerge was exactly that. So whenever we want to use bmerge we don't have to escape cross join before calling bmerge. One example off the top of my head is mergelist #4370.

I looked at mergelist and you have it separate there at the moment, right? I prefer to be more explicit and branch early when the branches have no real code in common. Wrapping the bit that returns the parameters for the cross join in it's own function would be nice. And then a cross join is so different from an inner join and bmerge.c says right at the top "Implements binary search". So I wouldn't expect bmerge to give me a cross join as well.

I think your suggestion from mergelist to use dtmerge and refactor the relevant [.data.table part is a good idea. It's also fairly self contained. Would be a bit broader scope of course.

Just my thoughts on the point.

This is a perfect example, agree it make sense to use on=.CROSS, this will be aligned to a .NATURAL keyword we now use. I checked how dplyr handles that and starting from 0.9.0 it uses by=character(), same as base R merge.

I will wait for some more feedback before making decision on API.

I've been ruminating on the interface a bit and am fully in favor of the chaotic good option...
I'm not a fan of on=.(1==1) (that would be lawful evil, I guess). I'd just like to throw out nomatch="cross" for good measure because nomatch kind of controls join type between right outer and inner. And I would also like to think that in SQL CROSS JOIN is preferred over JOIN [...] ON 1=1

Does on=NULL map to anything? Is it less error-prone?

That's the default value. It's not tested for NULL but I would always expect that explicitly passing the default value can't change the behaviour.

@jangorecki jangorecki added the WIP label Jun 18, 2020
@ColeMiller1
Copy link
Contributor

As Jan mentions, on = .NATURAL is out there. To continue with this API, on = .CROSS is a good compliment.

The SQL 1 == 1 I've used in the first WHERE condition to make it easier to comment out other predicate conditions - I have never seen it to indicate a CROSS JOIN.

I also like the idea of separating the cross join from bmerge(). It seems like we could reuse CJ()for this to generate indices. This FR is the fastest of the methods for small datasets, using CJ() is faster for bigger datasets although allowing for CJ() would require some extra adjustments.

dt1 = data.table(x1 = sample(10))
dt2 = data.table(y1 = sample(20))
bench::mark(
  new = dt1[dt2, on = character()],
  using_cj = {
    inds = CJ(seq_len(nrow(dt2)), seq_len(nrow(dt1)))
    setDT(c(dt1[inds$V2], dt2[inds$V1]))
  },
  using_base = setDT(c(dt1[rep.int(seq_len(nrow(dt1)), nrow(dt2))], dt2[rep(seq_len(nrow(dt2)), each = nrow(dt1))]))
)

@jangorecki
Copy link
Member Author

jangorecki commented Aug 25, 2020

I updated branch little bit and addressed some of feedback provided. I am leaving it now till we will have mergelist branch merged to master, so we can re-use dtmerge rather than duplicating same functionality.

@mattdowle mattdowle added this to the 1.13.1 milestone Aug 25, 2020
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@MichaelChirico MichaelChirico marked this pull request as draft February 19, 2024 02:44
@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@MichaelChirico MichaelChirico removed this from the 1.17.0 milestone Dec 3, 2024
@MichaelChirico MichaelChirico added this to the 1.18.0 milestone Dec 3, 2024
@jangorecki jangorecki modified the milestones: 1.18.0, 1.19.0 Nov 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

allow cross join in [.data.table

5 participants