Merged
157 commits
3e72c8d
cbindlist
jangorecki Apr 10, 2020
a915832
add cbind by reference, timing
jangorecki Apr 10, 2020
05dd562
R prototype of mergelist
jangorecki Apr 10, 2020
cba5bc1
wording
jangorecki Apr 10, 2020
1edf4d3
use lower overhead funs
jangorecki Apr 10, 2020
36bbd25
stick to int32 for now, correct R_alloc
jangorecki Apr 16, 2020
7d51dd6
bmerge C refactor for codecov and one loop for speed
jangorecki Apr 16, 2020
0437da5
address revealed codecov gaps
jangorecki Apr 16, 2020
e287213
refactor vecseq for codecov
jangorecki Apr 16, 2020
5dc07bd
seqexp helper, some alloccol export on C
jangorecki Apr 17, 2020
a4d124e
bmerge codecov, types handled in R bmerge already
jangorecki Apr 17, 2020
40d3bfe
better comment seqexp
jangorecki Apr 17, 2020
beffe39
bmerge mult=error #655
jangorecki Apr 17, 2020
4e211a1
multiple new C utils
jangorecki Apr 17, 2020
fbddcd6
swap if branches
jangorecki Apr 17, 2020
01b2f9d
explain new C utils
jangorecki Apr 17, 2020
c8e070b
comments mostly
jangorecki Apr 17, 2020
3004748
reduce conflicts to PR #4386
jangorecki Apr 18, 2020
cf73fcf
comment C code
jangorecki Apr 19, 2020
b64c0c3
address multiple matches during update-on-join #3747
jangorecki Apr 19, 2020
348d5b7
Revert "address multiple matches during update-on-join #3747"
jangorecki Apr 19, 2020
df0c11a
merge.dt has temporarily mult arg, for testing
jangorecki Apr 24, 2020
5793508
minor changes to cbindlist c
jangorecki Apr 24, 2020
6017eac
dev mergelist, for single pair now
jangorecki Apr 24, 2020
f88e0de
add quiet option to cc()
jangorecki Apr 25, 2020
2387f09
mergelist tests
jangorecki Apr 25, 2020
5ae7d4d
add check for names to perhaps.dt
jangorecki Apr 25, 2020
d0b2af8
rm mult from merge.dt method
jangorecki Apr 25, 2020
7e51189
rework, clean, polish multer, fix right and full joins
jangorecki Apr 25, 2020
ea77bce
make full join symmetric
jangorecki Apr 26, 2020
06a1ae8
mergepair inner function to loop on
jangorecki Apr 26, 2020
a942940
extra check for symmetric
jangorecki Apr 26, 2020
dc5f263
mergelist manual
jangorecki Apr 26, 2020
bc17057
ensure no df-dt passed where list expected
jangorecki Apr 26, 2020
db30e44
comments and manual
jangorecki Apr 26, 2020
0dd82c3
handle 0 cols tables
jangorecki Apr 26, 2020
9fe7f55
more tests
jangorecki Apr 26, 2020
113f688
more tests and debugging
jangorecki Apr 26, 2020
9bcb814
move more logic closer to bmerge, simplify mergepair
jangorecki Apr 26, 2020
a7f39c9
more tests
jangorecki Apr 26, 2020
b1f39a6
revert not used changes
jangorecki Apr 26, 2020
29bd438
reduce not needed checks, cleanup
jangorecki Apr 26, 2020
ca0d76a
copy arg behavior, manual, no tests yet
jangorecki Apr 26, 2020
9ac7a89
cbindlist manual, export both
jangorecki Apr 27, 2020
384396b
cleanup processing bmerge to dtmatch
jangorecki Apr 28, 2020
11974f0
test function match order for easier preview
jangorecki Apr 28, 2020
de48d2d
vecseq gets short-circuit
jangorecki Apr 28, 2020
66e7d53
batch test allow browser
jangorecki Apr 28, 2020
25d0633
big cleanup
jangorecki Apr 29, 2020
fee063b
remove unneeded stuff, reduce diff
jangorecki Apr 29, 2020
84d7146
more cleanup, minor manual fixes
jangorecki Apr 29, 2020
d78d136
add proper test scripts
jangorecki Apr 29, 2020
2b1795b
Merge branch 'master' into cbind-merge-list
jangorecki Apr 29, 2020
dabb55c
comment out not used code for coverage
jangorecki Apr 29, 2020
3ca7d4a
more tests, some nocopy opts
jangorecki Apr 30, 2020
e4b14e6
rename sql test script, should fix codecov
jangorecki Apr 30, 2020
b1fce17
simplify dtmatch inner branch
jangorecki Apr 30, 2020
50f9e89
more precise copy, now copy only T or F
jangorecki Apr 30, 2020
d43be04
unused arg not yet in api, wording
jangorecki Apr 30, 2020
4580dd4
comments and refer issues
jangorecki Apr 30, 2020
5d0e991
codecov
jangorecki Apr 30, 2020
03aa427
hasindex coverage
jangorecki Apr 30, 2020
b15ab93
codecov gap
jangorecki Apr 30, 2020
17d2fa8
tests for join using key, cols argument
jangorecki Apr 30, 2020
492d3b5
fix missing import forderv
jangorecki Apr 30, 2020
a5c4a26
more tests, improve missing on handling
jangorecki May 1, 2020
426e187
more tests for order of inner and full join for long keys
jangorecki May 1, 2020
c8ded9c
new allow.cartesian option, #4383, #914
jangorecki May 3, 2020
674bff8
reduce diff, improve codecov
jangorecki May 3, 2020
0a483c2
reduce diff, comments
jangorecki May 3, 2020
db3249a
need more DT, not lists, mergelist 3+ tbls
jangorecki May 3, 2020
a573286
proper escape heavy check
jangorecki May 3, 2020
78123b0
unit tests
jangorecki May 4, 2020
9273212
more tests, address overalloc failure
jangorecki May 6, 2020
c5df010
mergelist and cbindlist retain index
jangorecki May 8, 2020
64d4f5e
manual, examples
jangorecki May 8, 2020
211da09
fix manual
jangorecki May 8, 2020
4275e8c
minor clarify in manual
jangorecki May 8, 2020
102de68
retain keys, right outer join for snowflake schema joins
jangorecki May 8, 2020
4487360
duplicates in cbindlist
jangorecki May 8, 2020
2923719
recycling in cbindlist
jangorecki May 9, 2020
658410b
escape 0 input in copyCols
jangorecki May 9, 2020
b708507
empty input handling
jangorecki May 9, 2020
1b9f913
closing cbindlist
jangorecki May 9, 2020
d70cd41
vectorized _on_ and _join.many_ arg
jangorecki May 10, 2020
a5179b7
rename dtmatch to dtmerge
jangorecki May 10, 2020
c86c9ad
vectorized args: how, mult
jangorecki May 10, 2020
b89b6f8
full join, reduce overhead for mult=error
jangorecki May 10, 2020
249e09a
mult default value dynamic
jangorecki May 11, 2020
1fa9b40
fix manual
jangorecki May 11, 2020
f889b0e
add "see details" to Rd
MichaelChirico May 11, 2020
f35a555
mention shared on in arg description
MichaelChirico May 11, 2020
52d4f9f
amend feedback from Michael
jangorecki May 11, 2020
2884b29
semi and anti joins will not reorder x columns
jangorecki May 11, 2020
3f3f9de
Merge branch 'master' into cbind-merge-list
jangorecki Dec 9, 2023
6df88bc
spelling, thx to @jan-glx
jangorecki Dec 9, 2023
060610b
check all new funs used and add comments
jangorecki Dec 9, 2023
db58b6c
bugfix, sort=T needed for now
jangorecki Dec 9, 2023
53b9b0d
Merge branch 'master' into cbind-merge-list
MichaelChirico Feb 19, 2024
c6add42
Update NEWS.md
MichaelChirico Feb 19, 2024
ec1973f
Merge branch 'master' into cbind-merge-list
MichaelChirico Feb 19, 2024
d26265f
Merge branch 'master' into cbind-merge-list
MichaelChirico Aug 28, 2024
115b1eb
NEWS placement
MichaelChirico Aug 28, 2024
e2ae4d0
numbering
MichaelChirico Aug 28, 2024
9a1d7db
ascArg->order
MichaelChirico Aug 28, 2024
3ead046
Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-mer…
MichaelChirico Aug 28, 2024
d579af4
attempt to restore from master
MichaelChirico Aug 28, 2024
9a51230
Update to stopf() error style
MichaelChirico Aug 28, 2024
1b363ad
Need isFrame for now
MichaelChirico Aug 28, 2024
e9387d2
More quality checks: any(!x)->!all(x); use vapply_1{b,c,i}
MichaelChirico Aug 28, 2024
b30437b
really restore from master
MichaelChirico Aug 28, 2024
6b9aa6c
try to PROTECT() before duplicate()
MichaelChirico Aug 28, 2024
71bb8b1
update error message in test
MichaelChirico Aug 28, 2024
40191d7
appease the rchk gods
MichaelChirico Aug 29, 2024
3758316
extraneous space
MichaelChirico Aug 29, 2024
e4e5d8c
missing ';'
MichaelChirico Aug 29, 2024
338711a
use catf
MichaelChirico Aug 29, 2024
008abef
simplify perhapsDataTableR
MichaelChirico Aug 29, 2024
854d35e
move sqlite.Rraw.manual into other.Rraw
MichaelChirico Aug 29, 2024
c975c14
simplify for loop
MichaelChirico Aug 29, 2024
5952dd8
Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-mer…
MichaelChirico Aug 29, 2024
9fc109e
Merge branch 'master' into cbind-merge-list
MichaelChirico Jul 1, 2025
a5a1e7e
first pass at publishable NEWS
MichaelChirico Jul 1, 2025
d2b9c52
ws
MichaelChirico Jul 1, 2025
15d8526
failed merge
MichaelChirico Jul 1, 2025
c506fbe
failed merge pt ii
MichaelChirico Jul 1, 2025
5bb3845
shrink diff
MichaelChirico Jul 1, 2025
58a2c8e
pass at style
MichaelChirico Jul 1, 2025
8b5027e
Ditch mergelist(copy=) for setmergelist
MichaelChirico Jul 1, 2025
4b58a7a
Put cols=NULL default into the signature to avoid missing() quirks
MichaelChirico Jul 2, 2025
394f90b
Explain 'NULL' in cols= in Rd
MichaelChirico Jul 2, 2025
24f4549
First pass on grammar for \arguments
MichaelChirico Jul 2, 2025
80f23d8
finish style+grammar pass
MichaelChirico Jul 2, 2025
2e228a4
restore 'join.many' to signature
MichaelChirico Jul 2, 2025
46dbcdd
use 'try' for known error in example
MichaelChirico Jul 2, 2025
7e042a8
tweak examples
MichaelChirico Jul 2, 2025
11b3d7b
Add \references for star/snowflake schema terminology
MichaelChirico Jul 2, 2025
e5709a1
fix test error messages, remove extra '[]' from brackify errors
MichaelChirico Jul 2, 2025
b7a17d5
rm unreachable error
MichaelChirico Jul 2, 2025
facfcc4
coverage
MichaelChirico Jul 2, 2025
ccae5d4
first pass at local() style tests
MichaelChirico Jul 2, 2025
e83138e
linted style
MichaelChirico Jul 2, 2025
2a35e7d
semicolons, spacing
MichaelChirico Jul 2, 2025
271c5c0
rearrange tests using options to be in nested local() calls
MichaelChirico Jul 2, 2025
2707d2c
restore new 'l' for rearranged tests; re-capture test using 'l' in lo…
MichaelChirico Jul 2, 2025
8b74cc6
Jan's clarifying comment
MichaelChirico Jul 2, 2025
35e311b
Another pass at style, annotation; remove some duplicate tests
MichaelChirico Jul 2, 2025
82642d1
more refinement of test structure, comments
MichaelChirico Jul 2, 2025
48aa0e3
finished mergelist.Rraw
MichaelChirico Jul 2, 2025
0f82b32
more whitespace in constructed SQL queries
MichaelChirico Jul 2, 2025
d769cba
style, continued
MichaelChirico Jul 2, 2025
9bafc08
more formal styling with lintr
MichaelChirico Jul 2, 2025
c1e33b9
update reference to other.Rraw tests
MichaelChirico Jul 6, 2025
2866182
return output invisibly for set* functions
MichaelChirico Jul 6, 2025
3a037bd
mention setmergelist in NEWS
MichaelChirico Jul 14, 2025
9f97e10
Merge branch 'master' into cbind-merge-list
MichaelChirico Jul 14, 2025
bcc3dd1
numbering
MichaelChirico Jul 14, 2025
1 change: 1 addition & 0 deletions NAMESPACE
@@ -59,6 +59,7 @@ export(nafill)
 export(setnafill)
 export(.Last.updated)
 export(fcoalesce)
+export(mergelist, setmergelist)
 export(cbindlist, setcbindlist)
 export(substitute2)
 #export(DT) # mtcars |> DT(i,j,by) #4872 #5472
4 changes: 4 additions & 0 deletions NEWS.md
@@ -48,6 +48,10 @@

 11. New `frev(x)` as a faster analogue to `base::rev()` for atomic vectors/lists, [#5885](https://github.com/Rdatatable/data.table/issues/5885). Twice as fast as `base::rev()` on large inputs, and faster with more threads. Thanks to Benjamin Schwendinger for suggesting and implementing.

+12. New `cbindlist()` and `setcbindlist()` for concatenating a `list` of data.tables column-wise, evocative of the analogous `do.call(rbind, l)` <-> `rbindlist(l)`, [#2576](https://github.com/Rdatatable/data.table/issues/2576). `setcbindlist()` does so without making any copies. Thanks @MichaelChirico for the FR, @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning.
+
+13. New `mergelist()` and `setmergelist()` similarly work _a la_ `Reduce()` to recursively merge a `list` of data.tables, [#599](https://github.com/Rdatatable/data.table/issues/599). Different join modes (_left_, _inner_, _full_, _right_, _semi_, _anti_, and _cross_) are supported through the `how` argument; duplicate handling goes through the `mult` argument. `setmergelist()` carefully avoids copies where one is not needed, e.g. in a 1:1 left join. Thanks Patrick Nicholson for the FR (in 2013!), @jangorecki for the PR, and @MichaelChirico for extensive reviews and fine-tuning.
+
 ### BUG FIXES

 1. `fread()` no longer warns on certain systems on R 4.5.0+ where the file owner can't be resolved, [#6918](https://github.com/Rdatatable/data.table/issues/6918). Thanks @ProfFancyPants for the report and PR.
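The `do.call()` and `Reduce()` analogies drawn in NEWS items 12 and 13 can be sketched with base R data.frames. This is only an illustration of the idioms the new functions are compared to, not how `cbindlist()` or `mergelist()` are implemented; the tables `a`, `b`, `d` and the key column `id` are made up, and base `merge()` stands in for the data.table join.

```r
# Hypothetical toy tables; names and values are illustrative only.
a = data.frame(id = 1:3, x = c(10, 20, 30))
b = data.frame(id = c(1L, 3L), y = c("p", "q"))
d = data.frame(id = 2:3, z = c(TRUE, FALSE))
l = list(a, b, d)

# Column-wise concatenation of a list of equal-height tables:
# the base idiom that NEWS item 12 compares cbindlist(l) to.
wide = do.call(cbind, list(a["x"], data.frame(w = 4:6)))

# Recursive left join over the list: the Reduce() pattern that NEWS item 13
# compares mergelist(l) to. Base merge() is an assumption used for illustration.
left = Reduce(function(lhs, rhs) merge(lhs, rhs, by = "id", all.x = TRUE), l)
```

Each step of the `Reduce()` call joins the accumulated result with the next table, so a list of `n` tables implies `n - 1` pairwise joins.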
3 changes: 2 additions & 1 deletion R/data.table.R
@@ -221,7 +221,7 @@ replace_dot_alias = function(e) {
     }
     return(x)
   }
-  if (!mult %chin% c("first", "last", "all")) stopf("mult argument can only be 'first', 'last' or 'all'")
+  if (!mult %chin% c("first", "last", "all", "error")) stopf("mult argument can only be 'first', 'last', 'all' or 'error'")
   missingroll = missing(roll)
   if (length(roll)!=1L || is.na(roll)) stopf("roll must be a single TRUE, FALSE, positive/negative integer/double including +Inf and -Inf or 'nearest'")
   if (is.character(roll)) {
@@ -520,6 +520,7 @@ replace_dot_alias = function(e) {
 }
 i = .shallow(i, retain.key = TRUE)
 ans = bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult, ops, verbose=verbose)
+if (mult == "error") mult = "all" ## error should have been raised inside bmerge() call above already, if it wasn't continue as mult="all"
 xo = ans$xo ## to make it available for further use.
 # temp fix for issue spotted by Jan, test #1653.1. TODO: avoid this
 # 'setorder', as there's another 'setorder' in generating 'irows' below...
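What `mult="error"` is meant to guard against can be illustrated without `bmerge()`. The base R sketch below raises an error when any lookup key matches more than one element of the keys being joined to. The helper `assert_single_match` and the sample keys are hypothetical; according to the comment in the hunk above, the real check happens inside `bmerge()`, whose error message and exact conditions are not shown in this diff.

```r
# Hypothetical helper (not part of data.table): stop if any key in `i_keys`
# matches more than one element of `x_keys`, otherwise return match positions.
assert_single_match = function(i_keys, x_keys) {
  n_matches = vapply(i_keys, function(k) sum(x_keys == k), integer(1L))
  if (any(n_matches > 1L))
    stop("multiple matches found for key(s): ", toString(i_keys[n_matches > 1L]))
  match(i_keys, x_keys)
}

# Every key matches at most once, so positions are returned.
ok = assert_single_match(c("a", "c"), c("a", "b", "c"))

# "b" matches twice, so the helper raises an error, captured here for inspection.
dup = tryCatch(assert_single_match("b", c("a", "b", "b")), error = function(e) "error")
```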
106 changes: 102 additions & 4 deletions R/mergelist.R
@@ -9,7 +9,7 @@ cbindlist_impl_ = function(l, copy) {
 }

 cbindlist = function(l) cbindlist_impl_(l, copy=TRUE)
-setcbindlist = function(l) cbindlist_impl_(l, copy=FALSE)
+setcbindlist = function(l) invisible(cbindlist_impl_(l, copy=FALSE))

 # when 'on' is missing then use keys, used only for inner and full join
 onkeys = function(x, y) {
@@ -157,9 +157,9 @@ mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=name
       stopf("'on' is missing and necessary key is not present")
     }
     if (any(bad.on <- !on %chin% names(lhs)))
-      stopf("'on' argument specifies columns to join [%s] that are not present in %s table [%s]", brackify(on[bad.on]), "LHS", brackify(names(lhs)))
+      stopf("'on' argument specifies columns to join %s that are not present in %s table %s", brackify(on[bad.on]), "LHS", brackify(names(lhs)))
     if (any(bad.on <- !on %chin% names(rhs)))
-      stopf("'on' argument specifies columns to join [%s] that are not present in %s table [%s]", brackify(on[bad.on]), "RHS", brackify(names(rhs)))
+      stopf("'on' argument specifies columns to join %s that are not present in %s table %s", brackify(on[bad.on]), "RHS", brackify(names(rhs)))
   } else if (is.null(on)) {
     on = character() ## cross join only
   }
@@ -203,7 +203,7 @@ mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=name
     copy_x = TRUE
     ## ensure no duplicated column names in merge results
     if (any(dup.i <- names(out.i) %chin% names(out.x)))
-      stopf("merge result has duplicated column names [%s], use 'cols' argument or rename columns in 'l' tables", brackify(names(out.i)[dup.i]))
+      stopf("merge result has duplicated column names %s, use 'cols' argument or rename columns in 'l' tables", brackify(names(out.i)[dup.i]))
   }

   ## stack i and x
@@ -257,6 +257,104 @@ mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=name
   setDT(out)
 }

+mergelist_impl_ = function(l, on, cols, how, mult, join.many, copy) {
+  verbose = getOption("datatable.verbose")
+  if (verbose)
+    p = proc.time()[[3L]]
+
+  if (!is.list(l) || is.data.frame(l))
+    stopf("'%s' must be a list", "l")
+  if (!all(vapply_1b(l, is.data.table)))
+    stopf("Every element of 'l' list must be data.table objects")
+  if (!all(idx <- lengths(l) > 0L))
+    stopf("Tables in 'l' must all have columns, but these entries have 0: %s", brackify(which(!idx)))
+  if (any(idx <- vapply_1i(l, function(x) anyDuplicated(names(x))) > 0L))
+    stopf("Column names in individual 'l' entries must be unique, but these have some duplicates: %s", brackify(which(idx)))
+
+  n = length(l)
+  if (n < 2L) {
+    out = if (n) l[[1L]] else as.data.table(l)
+    if (copy) out = copy(out)
+    if (verbose)
+      catf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p)
+    return(out)
+  }
+
+  if (!is.list(join.many))
+    join.many = rep(list(join.many), n - 1L)
+  if (length(join.many) != n - 1L || !all(vapply_1b(join.many, isTRUEorFALSE)))
+    stopf("'join.many' must be TRUE or FALSE, or a list of such whose length must be length(l)-1L")
+
+  if (missing(mult))
+    mult = NULL
+  if (!is.list(mult))
+    mult = rep(list(mult), n - 1L)
+  if (length(mult) != n - 1L || !all(vapply_1b(mult, function(x) is.null(x) || (is.character(x) && length(x) == 1L && !anyNA(x) && x %chin% c("error", "all", "first", "last")))))
+    stopf("'mult' must be one of [error, all, first, last] or NULL, or a list of such whose length must be length(l)-1L")
+
+  if (!is.list(how))
+    how = rep(list(how), n-1L)
+  if (length(how)!=n-1L || !all(vapply_1b(how, function(x) is.character(x) && length(x)==1L && !anyNA(x) && x %chin% c("left", "inner", "full", "right", "semi", "anti", "cross"))))
+    stopf("'how' must be one of [left, inner, full, right, semi, anti, cross], or a list of such whose length must be length(l)-1L")
+
+  if (is.null(cols)) {
+    cols = vector("list", n)
+  } else {
+    if (!is.list(cols))
+      stopf("'%s' must be a list", "cols")
+    if (length(cols) != n)
+      stopf("'cols' must be same length as 'l' (%d != %d)", length(cols), n)
+    skip = vapply_1b(cols, is.null)
+    if (!all(vapply_1b(cols[!skip], function(x) is.character(x) && !anyNA(x) && !anyDuplicated(x))))
+      stopf("'cols' must be a list of non-zero length, non-NA, non-duplicated, character vectors, or eventually NULLs (all columns)")
+    if (any(mapply(function(x, icols) !all(icols %chin% names(x)), l[!skip], cols[!skip])))
+      stopf("'cols' specify columns not present in corresponding table")
+  }
+
+  if (missing(on) || is.null(on)) {
+    on = vector("list", n - 1L)
+  } else {
+    if (!is.list(on))
+      on = rep(list(on), n - 1L)
+    if (length(on) != n-1L || !all(vapply_1b(on, function(x) is.character(x) && !anyNA(x) && !anyDuplicated(x)))) ## length checked in dtmerge
+      stopf("'on' must be non-NA, non-duplicated, character vector, or a list of such which length must be length(l)-1L")
+  }
+
+  l.mem = lapply(l, vapply, address, "")
+  out = l[[1L]]
+  out.cols = cols[[1L]]
+  for (join.i in seq_len(n - 1L)) {
+    rhs.i = join.i + 1L
+    out = mergepair(
+      lhs = out, rhs = l[[rhs.i]],
+      on = on[[join.i]],
+      how = how[[join.i]], mult = mult[[join.i]],
+      lhs.cols = out.cols, rhs.cols = cols[[rhs.i]],
+      copy = FALSE, ## avoid any copies inside, will copy once below
+      join.many = join.many[[join.i]],
+      verbose = verbose
+    )
+    out.cols = copy(names(out))
+  }
+  out.mem = vapply_1c(out, address)
+  if (copy)
+    .Call(CcopyCols, out, colnamesInt(out, names(out.mem)[out.mem %chin% unique(unlist(l.mem, recursive=FALSE))]))
+  if (verbose)
+    catf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]] - p)
+  out
+}
+
+mergelist = function(l, on, cols=NULL, how=c("left", "inner", "full", "right", "semi", "anti", "cross"), mult, join.many=getOption("datatable.join.many")) {
+  if (missing(how) || is.null(how))
+    how = match.arg(how)
+  mergelist_impl_(l, on, cols, how, mult, join.many, copy=TRUE)
+}
+setmergelist = function(l, on, cols=NULL, how=c("left", "inner", "full", "right", "semi", "anti", "cross"), mult, join.many=getOption("datatable.join.many")) {
+  if (missing(how) || is.null(how))
+    how = match.arg(how)
+  invisible(mergelist_impl_(l, on, cols, how, mult, join.many, copy=FALSE))
+}
+
 # Previously, we had a custom C implementation here, which is ~2x faster,
 # but this is fast enough we don't bother maintaining a new routine.
 # Hopefully in the future rep() can recognize the ALTREP and use that, too.
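The handling of `how`, `mult` and `join.many` in `mergelist_impl_()` follows one pattern: a non-list value is recycled into a list with one entry per pairwise join (`length(l) - 1`), while a list is accepted only if it already has that length and every entry is valid. A minimal base R sketch of that pattern follows; the helper `recycle_per_join` is hypothetical and does not exist in data.table, and base `%in%` and `stop()` stand in for `%chin%` and `stopf()`.

```r
# Hypothetical helper illustrating scalar-or-list argument recycling.
recycle_per_join = function(arg, n_tables, choices) {
  n_joins = n_tables - 1L
  # A single value applies to every pairwise join.
  if (!is.list(arg))
    arg = rep(list(arg), n_joins)
  # Each entry must be a single non-NA string drawn from `choices`.
  valid = vapply(
    arg,
    function(x) is.character(x) && length(x) == 1L && !anyNA(x) && x %in% choices,
    logical(1L)
  )
  if (length(arg) != n_joins || !all(valid))
    stop("argument must be one of [", toString(choices), "], or a list of such of length ", n_joins)
  arg
}

modes = c("left", "inner", "full", "right", "semi", "anti", "cross")
same = recycle_per_join("inner", 4L, modes)                         # one mode for all 3 joins
mixed = recycle_per_join(list("left", "inner", "anti"), 4L, modes)  # one mode per join
bad = tryCatch(recycle_per_join(list("left", "inner"), 4L, modes), error = function(e) "error")
```

Normalizing every per-join argument to a list of the same length is what lets the main loop index each one with the same join counter.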