Commit 98e2257

cbindlist
add cbind by reference, timing
R prototype of mergelist
wording
use lower overhead funs
stick to int32 for now, correct R_alloc
bmerge C refactor for codecov and one loop for speed
address revealed codecov gaps
refactor vecseq for codecov
seqexp helper, some alloccol export on C
bmerge codecov, types handled in R bmerge already
better comment seqexp
bmerge mult=error #655
multiple new C utils
swap if branches
explain new C utils
comments mostly
reduce conflicts to PR #4386
comment C code
address multiple matches during update-on-join #3747
Revert "address multiple matches during update-on-join #3747" This reverts commit b64c0c3.
merge.dt has temporarily mult arg, for testing
minor changes to cbindlist c
dev mergelist, for single pair now
add quiet option to cc()
mergelist tests
add check for names to perhaps.dt
rm mult from merge.dt method
rework, clean, polish multer, fix right and full joins
make full join symmetric
mergepair inner function to loop on
extra check for symmetric
mergelist manual
ensure no df-dt passed where list expected
comments and manual
handle 0 cols tables
more tests
more tests and debugging
move more logic closer to bmerge, simplify mergepair
more tests
revert not used changes
reduce not needed checks, cleanup
copy arg behavior, manual, no tests yet
cbindlist manual, export both
cleanup processing bmerge to dtmatch
test function match order for easier preview
vecseq gets short-circuit
batch test allow browser
big cleanup
remove unneeded stuff, reduce diff
more cleanup, minor manual fixes
add proper test scripts
Merge branch 'master' into cbind-merge-list
comment out not used code for coverage
more tests, some nocopy opts
rename sql test script, should fix codecov
simplify dtmatch inner branch
more precise copy, now copy only T or F
unused arg not yet in api, wording
comments and refer issues
codecov
hasindex coverage
codecov gap
tests for join using key, cols argument
fix missing import forderv
more tests, improve missing on handling
more tests for order of inner and full join for long keys
new allow.cartesian option, #4383, #914
reduce diff, improve codecov
reduce diff, comments
need more DT, not lists, mergelist 3+ tbls
proper escape heavy check
unit tests
more tests, address overalloc failure
mergelist and cbindlist retain index
manual, examples
fix manual
minor clarify in manual
retain keys, right outer join for snowflake schema joins
duplicates in cbindlist
recycling in cbindlist
escape 0 input in copyCols
empty input handling
closing cbindlist
vectorized _on_ and _join.many_ arg
rename dtmatch to dtmerge
vectorized args: how, mult
push down input validation
add support for cross join, semi join, anti join
full join, reduce overhead for mult=error
mult default value dynamic
fix manual
add "see details" to Rd
mention shared on in arg description
amend feedback from Michael
semi and anti joins will not reorder x columns
Merge branch 'master' into cbind-merge-list
spelling, thx to @jan-glx
check all new funs used and add comments
bugfix, sort=T needed for now
Merge branch 'master' into cbind-merge-list
Update NEWS.md
Merge branch 'master' into cbind-merge-list
Merge branch 'master' into cbind-merge-list
NEWS placement
numbering
ascArg->order
Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list
attempt to restore from master
Update to stopf() error style
Need isFrame for now
More quality checks: any(!x)->!all(x); use vapply_1{b,c,i}
really restore from master
try to PROTECT() before duplicate()
update error message in test
appease the rchk gods
extraneous space
missing ';'
use catf
simplify perhapsDataTableR
move sqlite.Rraw.manual into other.Rraw
simplify for loop
Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list
1 parent ab9b50e commit 98e2257

5 files changed: +583 -1 lines changed

NAMESPACE

Lines changed: 1 addition & 0 deletions
@@ -60,6 +60,7 @@ export(setnafill)
 export(.Last.updated)
 export(fcoalesce)
 export(cbindlist)
+export(mergelist)
 export(substitute2)
 #export(DT) # mtcars |> DT(i,j,by) #4872 #5472

NEWS.md

Lines changed: 2 additions & 0 deletions
@@ -65,6 +65,8 @@ rowwiseDT(
 
 4. `patterns()` in `melt()` combines correctly with user-defined `cols=`, which can be useful to specify a subset of columns to reshape without having to use a regex, for example `patterns("2", cols=c("y1", "y2"))` will only give `y2` even if there are other columns in the input matching `2`, [#6498](https://github.com/Rdatatable/data.table/issues/6498). Thanks to @hongyuanjia for the report, and to @tdhock for the PR.
 
+5. (add example here?) New functions `cbindlist` and `mergelist` have been implemented and exported. Works like `cbind`/`merge` but takes `list` of data.tables on input. `merge` happens in `Reduce` fashion. Supports `how` (_left_, _inner_, _full_, _right_, _semi_, _anti_, _cross_) joins and `mult` argument, closes [#599](https://github.com/Rdatatable/data.table/issues/599) and [#2576](https://github.com/Rdatatable/data.table/issues/2576).
+
 ## BUG FIXES
 
 1. Using `print.data.table()` with character truncation using `datatable.prettyprint.char` no longer errors with `NA` entries, [#6441](https://github.com/Rdatatable/data.table/issues/6441). Thanks to @r2evans for the bug report, and @joshhwuu for the fix.
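
The NEWS entry describes `mergelist` as merging "in `Reduce` fashion". A minimal base R sketch of that idea follows. It is only an analogy using `merge()` and `cbind()` on plain data frames, not the data.table implementation, and the tables (`orders`, `customers`, `regions`) are hypothetical and do not come from the commit.

```r
# Hypothetical example tables sharing an 'id' key (not taken from the commit).
orders    = data.frame(id = c(1L, 2L, 3L), qty = c(10L, 20L, 30L))
customers = data.frame(id = c(1L, 2L), name = c("ann", "bob"))
regions   = data.frame(id = c(2L, 3L), region = c("east", "west"))
tables = list(orders, customers, regions)

# Fold the list pairwise from the left: merge(merge(orders, customers), regions).
# all.x = TRUE keeps every row of the accumulated left-hand table (a left join).
left_join_all = Reduce(
  function(lhs, rhs) merge(lhs, rhs, by = "id", all.x = TRUE),
  tables
)

# Column-binding a list of equal-length tables, similar in spirit to cbindlist.
bound = do.call(cbind, list(orders["qty"], data.frame(flag = c(TRUE, FALSE, TRUE))))

print(left_join_all)
print(bound)
```

Unmatched rows get `NA` in the columns contributed by the right-hand tables, as expected for a left join.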

R/mergelist.R

Lines changed: 96 additions & 0 deletions
@@ -235,6 +235,102 @@ mergepair = function(lhs, rhs, on, how, mult, lhs.cols=names(lhs), rhs.cols=name
   setDT(out)
 }
 
+mergelist = function(l, on, cols, how=c("left","inner","full","right","semi","anti","cross"), mult, copy=TRUE, join.many=getOption("datatable.join.many")) {
+  verbose = getOption("datatable.verbose")
+  if (verbose)
+    p = proc.time()[[3L]]
+  {
+    if (!is.list(l) || is.data.frame(l))
+      stopf("'l' must be a list")
+    if (!all(vapply_1b(l, is.data.table)))
+      stopf("Every element of 'l' list must be data.table objects")
+    if (!all(lengths(l)))
+      stopf("Tables in 'l' argument must be non-zero columns tables")
+    if (any(vapply_1i(l, function(x) anyDuplicated(names(x)))))
+      stopf("Some of the tables in 'l' have duplicated column names")
+  } ## l
+  if (!isTRUEorFALSE(copy))
+    stopf("'%s' must be TRUE or FALSE", "copy")
+  n = length(l)
+  if (n<2L) {
+    out = if (!n) as.data.table(l) else l[[1L]]
+    if (copy) out = copy(out)
+    if (verbose)
+      catf("mergelist: merging %d table(s), took %.3fs\n", n, proc.time()[[3L]]-p)
+    return(out)
+  }
+  {
+    if (!is.list(join.many))
+      join.many = rep(list(join.many), n-1L)
+    if (length(join.many)!=n-1L || !all(vapply_1b(join.many, isTRUEorFALSE)))
+      stopf("'join.many' must be TRUE or FALSE, or a list of such which length must be length(l)-1L")
+  } ## join.many
+  {
+    if (missing(mult))
+      mult = NULL
+    if (!is.list(mult))
+      mult = rep(list(mult), n-1L)
+    if (length(mult)!=n-1L || !all(vapply_1b(mult, function(x) is.null(x) || (is.character(x) && length(x)==1L && !anyNA(x) && x %chin% c("error","all","first","last")))))
+      stopf("'mult' must be one of [error, all, first, last] or NULL, or a list of such which length must be length(l)-1L")
+  } ## mult
+  {
+    if (missing(how) || is.null(how))
+      how = match.arg(how)
+    if (!is.list(how))
+      how = rep(list(how), n-1L)
+    if (length(how)!=n-1L || !all(vapply_1b(how, function(x) is.character(x) && length(x)==1L && !anyNA(x) && x %chin% c("left","inner","full","right","semi","anti","cross"))))
+      stopf("'how' must be one of [left, inner, full, right, semi, anti, cross], or a list of such which length must be length(l)-1L")
+  } ## how
+  {
+    if (missing(cols) || is.null(cols)) {
+      cols = vector("list", n)
+    } else {
+      if (!is.list(cols))
+        stopf("'%s' must be a list", "cols")
+      if (length(cols) != n)
+        stopf("'cols' must be same length as 'l'")
+      skip = vapply_1b(cols, is.null)
+      if (!all(vapply_1b(cols[!skip], function(x) is.character(x) && !anyNA(x) && !anyDuplicated(x))))
+        stopf("'cols' must be a list of non-zero length, non-NA, non-duplicated, character vectors, or eventually NULLs (all columns)")
+      if (any(mapply(function(x, icols) !all(icols %chin% names(x)), l[!skip], cols[!skip])))
+        stopf("'cols' specify columns not present in corresponding table")
+    }
+  } ## cols
+  {
+    if (missing(on) || is.null(on)) {
+      on = vector("list", n-1L)
+    } else {
+      if (!is.list(on))
+        on = rep(list(on), n-1L)
+      if (length(on)!=n-1L || !all(vapply_1b(on, function(x) is.character(x) && !anyNA(x) && !anyDuplicated(x)))) ## length checked in dtmerge
+        stopf("'on' must be non-NA, non-duplicated, character vector, or a list of such which length must be length(l)-1L")
+    }
+  } ## on
+
+  l.mem = lapply(l, vapply, address, "")
+  out = l[[1L]]
+  out.cols = cols[[1L]]
+  for (join.i in seq_len(n-1L)) {
+    rhs.i = join.i + 1L
+    out = mergepair(
+      lhs = out, rhs = l[[rhs.i]],
+      on = on[[join.i]],
+      how = how[[join.i]], mult = mult[[join.i]],
+      lhs.cols = out.cols, rhs.cols = cols[[rhs.i]],
+      copy = FALSE, ## avoid any copies inside, will copy once below
+      join.many = join.many[[join.i]],
+      verbose = verbose
+    )
+    out.cols = copy(names(out))
+  }
+  out.mem = vapply_1c(out, address)
+  if (copy)
+    .Call(CcopyCols, out, colnamesInt(out, names(out.mem)[out.mem %chin% unique(unlist(l.mem, recursive=FALSE))]))
+  if (verbose)
+    catf("mergelist: merging %d tables, took %.3fs\n", n, proc.time()[[3L]]-p)
+  out
+}
+
 # Previously, we had a custom C implementation here, which is ~2x faster,
 # but this is fast enough we don't bother maintaining a new routine.
 # Hopefully in the future rep() can recognize the ALTREP and use that, too.
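
The new `mergelist` recycles scalar settings such as `how` to one entry per pairwise join (`length(l)-1L` of them), validates each entry, and then loops over the pairs while accumulating the result. The sketch below imitates that pattern in base R under loud assumptions: `expand_per_join`, `merge_pair` and `merge_all` are hypothetical names, base `merge()` stands in for `mergepair`, and only four of the seven join types are covered. It does not reproduce the `mult`, `cols`, `join.many` or copy-avoidance logic.

```r
# Hypothetical helper (not part of data.table): recycle a scalar setting to one
# entry per pairwise join, then validate every entry.
expand_per_join = function(x, n_tables, allowed) {
  n_joins = n_tables - 1L
  if (!is.list(x)) x = rep(list(x), n_joins)
  valid = vapply(
    x,
    function(v) is.character(v) && length(v) == 1L && !is.na(v) && v %in% allowed,
    logical(1L)
  )
  if (length(x) != n_joins || !all(valid))
    stop("setting must be one of [", paste(allowed, collapse = ", "),
         "], or a list of such of length n_tables-1")
  x
}

# Hypothetical base R stand-in for the pairwise step; maps 'how' onto merge()'s
# all.x / all.y flags (left, inner, full, right only).
merge_pair = function(lhs, rhs, on, how) {
  merge(lhs, rhs, by = on,
        all.x = how %in% c("left", "full"),
        all.y = how %in% c("right", "full"))
}

# Accumulate the result table by table, using the join type chosen for each pair.
merge_all = function(tables, on, how) {
  how = expand_per_join(how, length(tables), c("left", "inner", "full", "right"))
  out = tables[[1L]]
  for (i in seq_len(length(tables) - 1L))
    out = merge_pair(out, tables[[i + 1L]], on, how[[i]])
  out
}

a = data.frame(id = 1:3, x = c(1, 2, 3))
b = data.frame(id = 2:4, y = c("p", "q", "r"))
d = data.frame(id = c(3L, 5L), z = c(TRUE, FALSE))

# Inner join of a and b first, then a left join of that result with d.
res = merge_all(list(a, b, d), on = "id", how = list("inner", "left"))
print(res)
```

Passing a single string such as `how = "left"` applies the same join type to every pair, while a list lets each pair use its own type, which mirrors the vectorized `how` argument in the diff.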
