58 changes: 32 additions & 26 deletions man/mergelist.Rd
@@ -24,32 +24,38 @@
\details{
Note: these functions should be considered experimental. Users are encouraged to provide feedback in our issue tracker.

Merging is performed sequentially from "left to right", so that for \code{l} of 3 tables, it will do something like \code{merge(merge(l[[1L]], l[[2L]]), l[[3L]])}. \emph{Non-equi joins} are not supported. Column names to merge on must be common in both tables on each merge.

Arguments \code{on}, \code{how}, \code{mult}, \code{join.many} could be lists as well, each of length \code{length(l)-1L}, to provide argument to be used for each single tables pair to merge, see examples.

The terms \emph{join-to} and \emph{join-from} indicate which in a pair of tables is the "baseline" or "authoritative" source -- this governs the ordering of rows and columns.
Whether each refers to the "left" or "right" table of a pair depends on the \code{how} argument:
\enumerate{
\item \code{how \%in\% c("left", "semi", "anti")}: \emph{join-to} is \emph{RHS}, \emph{join-from} is \emph{LHS}.
\item \code{how \%in\% c("inner", "full", "cross")}: \emph{LHS} and \emph{RHS} tables are treated equally, so that the terms are interchangeable.
\item \code{how == "right"}: \emph{join-to} is \emph{LHS}, \emph{join-from} is \emph{RHS}.
}

Using \code{mult="error"} will throw an error when multiple rows in \emph{join-to} table match to the row in \emph{join-from} table. It should not be used just to detect duplicates, which might not have matching row, and thus would silently be missed.

When not specified, \code{mult} takes its default depending on the \code{how} argument:
\enumerate{
\item When \code{how \%in\% c("left", "inner", "full", "right")}, \code{mult="error"}.
\item When \code{how \%in\% c("semi", "anti")}, \code{mult="last"}, although this is equivalent to \code{mult="first"}.
\item When \code{how == "cross"}, \code{mult="all"}.
}

When the \code{on} argument is missing, it will be determined based on the \code{how} argument:
\enumerate{
\item When \code{how \%in\% c("left", "right", "semi", "anti")}, \code{on} becomes the key column(s) of the \emph{join-to} table.
\item When \code{how \%in\% c("inner", "full")}, if only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to shorter key.
}
Merging is performed sequentially from "left to right", so that for \code{l} of 3 tables, it will do something like \code{merge(merge(l[[1L]], l[[2L]]), l[[3L]])}. \emph{Non-equi joins} are not supported. Column names to merge on must be common to both tables at each merge.
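
The left-to-right order can be sketched with base R alone. The snippet below is only an illustration of the nesting described above, not the implementation used by these functions; the toy tables and the column name \code{id} are assumptions made up for the example.

\preformatted{
# Three toy tables sharing the (hypothetical) merge column "id"
l <- list(
  data.frame(id = 1:3, a = c("x", "y", "z")),
  data.frame(id = 1:3, b = c(10, 20, 30)),
  data.frame(id = 1:3, c = c(TRUE, FALSE, TRUE))
)

# Explicit nesting: merge the first two tables, then merge in the third
nested <- merge(merge(l[[1L]], l[[2L]], by = "id"), l[[3L]], by = "id")

# The same left-to-right fold, written for a list of any length
folded <- Reduce(function(lhs, rhs) merge(lhs, rhs, by = "id"), l)

stopifnot(identical(nested, folded))
}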

Arguments \code{on}, \code{how}, \code{mult}, and \code{join.many} can also be lists, each of length \code{length(l)-1L}, providing the argument to be used at each merge; see Examples.

\tabular{lcrccrcrcr}{
\strong{\code{how}} \tab \tab
\strong{\emph{join-from}} \tab \tab \tab
\strong{\emph{join-to}} \tab \tab
\strong{default \code{mult}} \tab \tab
\strong{output key} \cr
\code{"left"} \tab \tab \emph{LHS} \tab \tab \tab \emph{RHS} \tab \tab \code{"error"} \tab \tab \emph{LHS} \cr
\code{"right"} \tab \tab \emph{RHS} \tab \tab \tab \emph{LHS} \tab \tab \code{"error"} \tab \tab \emph{RHS} \cr
\code{"inner"} \tab \tab both \tab \tab \tab both \tab \tab \code{"error"} \tab \tab \emph{LHS} \cr
\code{"full"} \tab \tab both \tab \tab \tab both \tab \tab \code{"error"} \tab \tab \code{NULL} \cr
\code{"semi"}, \code{"anti"} \tab \tab \emph{LHS} \tab \tab \tab \emph{RHS} \tab \tab \code{"last"} \tab \tab \emph{LHS} \cr
\code{"cross"} \tab \tab n/a \tab \tab \tab n/a \tab \tab \code{"all"} \tab \tab \emph{LHS}
}

The roles of \emph{join-from} and \emph{join-to} are as follows:

\itemize{
\item If \code{on} is not provided, it is taken to be \emph{join-to}'s key column(s).
\item When a row of \emph{join-from} finds multiple matches in \emph{join-to}, \code{mult} defines which of them to select.
}

When \code{how \%in\% c("inner", "full")}, \emph{LHS} and \emph{RHS} are treated symmetrically, so that each is both \emph{join-from} and \emph{join-to}:
\itemize{
\item If \code{on} is not provided and only one table has a key, then that key defines the match column(s). If both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, with the order of the key columns aligned with the shorter key.
\item When \code{mult \%in\% c("first", "last", "error")}, then (respectively) the first, last, or only matching row on each side are merged. \code{mult} is satisfied mutually and the merge is one-to-one.
}

Using \code{mult="error"} throws an error when a row of \emph{join-from} finds multiple matches in \emph{join-to}. When \code{how \%in\% c("semi", "anti")}, \code{mult="last"} (the default) and \code{mult="first"} are equivalent; \code{mult="all"} is not allowed. When \code{how == "cross"}, only \code{mult="all"} is allowed.
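
What counts as "multiple matches" can be sketched in base R by counting, for each row of \emph{join-from}, how many rows of \emph{join-to} share its value in the match column. This is a hedged illustration of the definition only, not how the check is implemented; the tables \code{from} and \code{to} and the column \code{id} are hypothetical.

\preformatted{
from <- data.frame(id = c(1, 2))               # plays the join-from role
to   <- data.frame(id = c(2, 2, 3, 3), v = 1:4) # plays the join-to role

# Number of rows in `to` matched by each row of `from`
n_matches <- vapply(from$id, function(k) sum(to$id == k), integer(1L))

# Rows of `from` with more than one match are the case mult="error" is about.
# Note that id 3 is duplicated in `to` but matched by no row of `from`,
# so it does not contribute to any count here.
multi <- from$id[n_matches > 1L]
stopifnot(identical(n_matches, c(0L, 2L)), identical(multi, 2))
}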

When joining tables that are not directly linked to a single table, e.g. a snowflake schema (see References), a \emph{right} outer join can be used to optimize the sequence of merges; see Examples.
}