Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
33 commits
Select commit Hold shift + click to select a range
b972e13
init topn
ben-schwen Sep 11, 2021
a8f4096
make stable
ben-schwen Sep 13, 2021
1767174
add complex
ben-schwen Sep 13, 2021
7c9ce8b
rename index array
ben-schwen Sep 15, 2021
43cc962
ISNAN_COMPLEX
ben-schwen Sep 17, 2021
1687cb6
added tests
ben-schwen Sep 17, 2021
fc7b948
added man
ben-schwen Sep 17, 2021
90a652f
fix man
ben-schwen Sep 17, 2021
c513efb
finish tests
ben-schwen Sep 17, 2021
ab25d3f
add coverage
ben-schwen Sep 17, 2021
6785cf5
typo
ben-schwen Sep 17, 2021
78502c7
add NEWS
ben-schwen Sep 17, 2021
c9b7a07
add sorted argument
ben-schwen Jan 10, 2024
87aa734
update tests
ben-schwen Jan 10, 2024
446ca7f
add CODEOWNERS
ben-schwen Jan 10, 2024
db055f5
update NEWS
ben-schwen Jan 10, 2024
2b16d6a
fix docs
ben-schwen Jan 10, 2024
13caa28
add arg to doc
ben-schwen Jan 10, 2024
0fba260
unname args
ben-schwen Sep 12, 2024
ff541fd
remove parentheses from return
ben-schwen Sep 12, 2024
fe0e6f9
change index a,b with i,j
ben-schwen Sep 12, 2024
e1d037e
remove parentheses from return
ben-schwen Sep 12, 2024
d8cf910
remove unnecessary const
ben-schwen Sep 12, 2024
5e547df
add parentheses for op precedence
ben-schwen Sep 12, 2024
134d0a3
clarify comment
ben-schwen Sep 12, 2024
37efbf3
add error for malloc fail
ben-schwen Sep 12, 2024
538d2f6
Merge branch 'topn_heap' of github.com:Rdatatable/data.table into top…
ben-schwen Sep 12, 2024
c70fbc1
add quickselect support
ben-schwen Sep 16, 2024
f8ff588
add string support for quickselect
ben-schwen Sep 16, 2024
2dfb740
use memcpy instead of assignment
ben-schwen Sep 16, 2024
0604714
update NEWS
ben-schwen Nov 10, 2024
6a162de
Merge branch 'master' into topn_heap
ben-schwen Dec 21, 2025
e9688fb
make linter happy
ben-schwen Dec 21, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions CODEOWNERS
Original file line number Diff line number Diff line change
Expand Up @@ -87,3 +87,7 @@
# C code tricks
/src/chmatch.c @aitap
/src/fread.c @aitap

# ordering
/src/topn.c @ben-schwen
/man/topn.Rd @ben-schwen
2 changes: 2 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -217,6 +217,8 @@ S3method(format_list_item, data.frame)

export(fdroplevels, setdroplevels)
S3method(droplevels, data.table)

export(topn)
export(frev)
export(.selfref.ok)

Expand Down
21 changes: 20 additions & 1 deletion NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -296,7 +296,26 @@ See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. T
# user system elapsed
# 0.028 0.000 0.005
```
20. `fread()` now supports the `comment.char` argument to skip trailing comments or comment-only lines, consistent with `read.table()`, [#856](https://github.com/Rdatatable/data.table/issues/856). The default remains `comment.char = ""` (no comment parsing) for backward compatibility and performance, in contrast to `read.table(comment.char = "#")`. Thanks to @arunsrinivasan and many others for the suggestion and @ben-schwen for the implementation.

20. `fread()` now supports the `comment.char` argument to skip trailing comments or comment-only lines, consistent with `read.table()`, [#856](https://github.com/Rdatatable/data.table/issues/856). The default remains `comment.char = ""` (no comment parsing) for backward compatibility and performance, in contrast to `read.table(comment.char = "#")`. Thanks to @arunsrinivasan and many others for the suggestion and @ben-schwen for the implementation.

21. New function `topn(x,n)` [#3804](https://github.com/Rdatatable/data.table/issues/3804). It returns the indices of the `n` smallest/largest values of a vector `x`. Previously, one had to use `order(x)[1:n]` which produced a full sorting of `x`. Usage of `topn` is advised for large vectors where sorting takes long time. Thanks to Michael Chirico for requesting, and Benjamin Schwendinger for the PR.

```R
set.seed(123)
x = rnorm(1e8)
bm = bench::mark(check=FALSE, min_iterations=10,
topn(x, 5L, sorted=TRUE),
topn(x, 5L, sorted=FALSE),
order(x)[1:5]
)
setDT(bm)[, .(expression, min, median, lapply(time, max), mem_alloc)]
# expression min median V4 mem_alloc
# <bench_expr> <bench_time> <bench_time> <list> <bench_bytes>
# 1: topn(x, 5L, sorted = TRUE) 151.65ms 155.21ms 171ms 0B
# 2: topn(x, 5L, sorted = FALSE) 151.59ms 158.77ms 176ms 0B
# 3: order(x)[1:5] 2.69s 2.77s 2.84s 381MB
```

### BUG FIXES

Expand Down
2 changes: 2 additions & 0 deletions R/wrappers.R
Original file line number Diff line number Diff line change
Expand Up @@ -22,5 +22,7 @@ fitsInInt64 = function(x) .Call(CfitsInInt64R, x)

coerceAs = function(x, as, copy=TRUE) .Call(CcoerceAs, x, as, copy)

topn = function(x, n, na.last=TRUE, decreasing=FALSE, sorted=FALSE) .Call(Ctopn, x, as.integer(n), na.last, decreasing, sorted)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think this will be most useful when equipped to sort multiple vectors, so my sense is the signature should be topn(n, x, ...) or topn(n, ...).

Here's code that looks topn()-able:

https://github.com/search?q=lang%3AR+%2Forder%5C%28.*%5C%29%5C%5B1%3A%5B0-9%5D%2F&type=code

That said, I could only really find one usage that needs multiple vectors (could be I have a bad regex)!

https://github.com/search?q=lang%3AR+%2F%5B%5Ea-z0-9._%5Dorder%5C%28%5B%5E%28%29%3D%5D*%2C%5B%5E%3D%5Cn%5C%5B%5C%5D%29%5D*%5C%29%5C%5B1%3A%5B0-9%5D%2F&type=code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But n will be most often scalar, no? Then it is not clear why it would be first argument. Then it is also not pipeable for data input.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of adding ... support, however, being pipeable seems more advantageous nowadays.

quickn = function(x, n, na.last=TRUE, decreasing=FALSE) .Call(Cquickn, x, as.integer(n), na.last, decreasing)
frev = function(x) .Call(Cfrev, x, TRUE)
setfrev = function(x) invisible(.Call(Cfrev, x, FALSE))
48 changes: 48 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) {
.shallow = data.table:::.shallow
stopf = data.table:::stopf
test = data.table:::test
topn = data.table:::topn
uniqlengths = data.table:::uniqlengths
uniqlist = data.table:::uniqlist
warningf = data.table:::warningf
Expand Down Expand Up @@ -17938,6 +17939,7 @@ DT = data.table(A=letters[1:3], B=4:6, key="A")
test(2215.1, DT["b", B], 5L) # has worked forever
test(2215.2, DT[factor("b"), B], 5L) # now works too, joining fact/fact, char/fact and fact/char have plenty of tests


# segfault on merge keyed all-NA_character_ due to is.sorted, #5070
DT1 = data.table(x1 = rep(letters[1:4], each=3), x2=NA_character_, key="x2")
DT2 = data.table(x1 = letters[1:3])
Expand Down Expand Up @@ -21901,3 +21903,49 @@ rm(DT, strings)
DT = unserialize(serialize(as.data.table(mtcars), NULL))
test(2351, DT[,carb:=NULL], as.data.table(mtcars)[,carb:=NULL])
rm(DT)

# topn retrieving which.max/which.min for more than 1 element #3804
set.seed(373L)
x=list(sample(c(-1:1, NA), 10, TRUE),
sample(c(rnorm(3), NA), 10, TRUE),
sample(c(letters[1:3],NA), 10, TRUE),
c(NaN, Inf, -Inf),
c(-Inf, 0, Inf),
c(-Inf, Inf),
c(Inf, -Inf),
c(0, NaN),
c(NaN, 0),
c(3+9i, 10+5i, 8+2i, 10+4i, 3+3i, 1+2i, 5+1i, 8+1i, 8+2i, 10+6i),
c(3, 10+5i, 8, 10+4i, NA, 3, 1+2i, 5+1i, 8+1i, 8, NA))
testnum = 2352.0
b = CJ(c(TRUE,FALSE), c(TRUE,FALSE))
for (i in seq.int(x)) {
l = length(x[[i]])
s = sample(l-1L, 1L)
for (j in nrow(b)) {
test(testnum, topn(x[[i]], l, na.last=b[[j,1]], decreasing=b[[j,2]], sorted=TRUE),
order(x[[i]], na.last=b[[j,1]], decreasing=b[[j,2]]))
testnum = testnum + 0.01
test(testnum, topn(x[[i]], s, na.last=b[[j,1]], decreasing=b[[j,2]], sorted=TRUE),
order(x[[i]], na.last=b[[j,1]], decreasing=b[[j,2]])[1L:s])
testnum = testnum + 0.01
test(testnum, sort(topn(x[[i]], s, na.last=b[[j,1]], decreasing=b[[j,2]], sorted=FALSE)),
sort(order(x[[i]], na.last=b[[j,1]], decreasing=b[[j,2]])[1L:s]))
testnum = testnum + 0.01
}
}
test(2352.79, topn(as.raw(1), 1), error="Type 'raw' not supported by topn.")
test(2352.80, topn(1L, 2L), 1L, warning="n should be smaller or equal than length(x) but provided n=2 and length(x)=1.\n Coercing n to length(x).")
test(2352.81, topn(1L, -1L), error="topn(x,n) only implemented for n > 0.")
if (test_bit64) {
x=as.integer64(c(-5, 5, 0, NA, -5, 5, 1e18, NA, 5, 1e-18))
test(2352.82, topn(x, 10L, na.last=TRUE, decreasing=FALSE, sorted=TRUE), c(1L, 5L, 3L, 10L, 2L, 6L, 9L, 7L, 4L, 8L))
test(2352.83, topn(x, 10L, na.last=FALSE, decreasing=FALSE, sorted=TRUE), c(4L, 8L, 1L, 5L, 3L, 10L, 2L, 6L, 9L, 7L))
test(2352.84, topn(x, 10L, na.last=TRUE, decreasing=TRUE, sorted=TRUE), c(7L, 2L, 6L, 9L, 3L, 10L, 1L, 5L, 4L, 8L))
test(2352.85, topn(x, 10L, na.last=FALSE, decreasing=TRUE, sorted=TRUE), c(4L, 8L, 7L, 2L, 6L, 9L, 3L, 10L, 1L, 5L))
# answers are sort(head(x, 2)) of answer before
test(2352.86, sort(topn(x, 2L, na.last=TRUE, decreasing=FALSE, sorted=FALSE)), c(1L, 5L))
test(2352.87, sort(topn(x, 2L, na.last=FALSE, decreasing=FALSE, sorted=FALSE)), c(4L, 8L))
test(2352.88, sort(topn(x, 2L, na.last=TRUE, decreasing=TRUE, sorted=FALSE)), c(2L, 7L))
test(2352.89, sort(topn(x, 2L, na.last=FALSE, decreasing=TRUE, sorted=FALSE)), c(4L, 8L))
}
45 changes: 45 additions & 0 deletions man/topn.Rd
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
\name{topn}
\alias{topn}
\alias{which.max}
\alias{which.min}
\title{ Top n values index }
\description{
\code{topn} returns the indices of the smallest (resp. largest) \code{n} values of x. This is an extension \code{\link{which.min}} (\code{\link{which.max}}) which only return the index of the minimum (resp. maximum).

Especially, for large vectors this method will be faster and less memory intensive than \code{order} since no full sort is performed.

\code{bit64::integer64} type is also supported.
}

\usage{
topn(x, n, na.last=TRUE, decreasing=FALSE, sorted=FALSE)
}
\arguments{
\item{x}{ A numeric, complex, character or logical vector. }
\item{n}{ A numeric vector length 1. How many indices to select. }
\item{na.last}{ Control treatment of \code{NA}s. If \code{TRUE}, missing values in the data are put last; if \code{FALSE}, they are put first. }
\item{decreasing}{ Logical. Default is \code{FALSE}. Indicating whether the order should be increasing or decreasing. }
\item{sorted}{ Logical. Default is \code{FALSE}. Indicating whether order should be sorted with respect to decreasing. }
}

\value{
An integer vector giving the indicies of the \code{n} smallest (largest) for \code{decreasing=FALSE (TRUE)} elements of \code{x}.
}

\examples{
x = c(1:4, 0:5, 11)
# indices of smallest 3 values
topn(x, 3)
# indices of largest 3 values
topn(x, 3, decreasing = TRUE)

## NA's can be put to front or back
x = c(NA, 1:4)
topn(x, 5)
topn(x, 5, na.last=FALSE)

}
\seealso{
\code{\link{data.table}}, \code{\link{order}}, \code{\link{which.max}}, \code{\link{which.min}}
}
\keyword{ data }
4 changes: 4 additions & 0 deletions src/data.table.h
Original file line number Diff line number Diff line change
Expand Up @@ -363,6 +363,10 @@ SEXP substitute_call_arg_namesR(SEXP expr, SEXP env);
//negate.c
SEXP notchin(SEXP x, SEXP table);

// topn.c
SEXP topn(SEXP, SEXP, SEXP, SEXP, SEXP);
SEXP quickn(SEXP, SEXP, SEXP, SEXP);

// hash.c
typedef struct {
SEXP prot; // make sure to PROTECT() while the table is in use
Expand Down
2 changes: 2 additions & 0 deletions src/init.c
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,8 @@ R_CallMethodDef callMethods[] = {
{"Cdt_has_zlib", (DL_FUNC)&dt_has_zlib, -1},
{"Csubstitute_call_arg_namesR", (DL_FUNC) &substitute_call_arg_namesR, -1},
{"CstartsWithAny", (DL_FUNC)&startsWithAny, -1},
{"Ctopn", (DL_FUNC)&topn, -1},
{"Cquickn", (DL_FUNC)&quickn, -1},
{"CconvertDate", (DL_FUNC)&convertDate, -1},
{"Cnotchin", (DL_FUNC)&notchin, -1},
{"Ccbindlist", (DL_FUNC) &cbindlist, -1},
Expand Down
199 changes: 199 additions & 0 deletions src/topn.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,199 @@
#include "data.table.h"

static inline void swap(int *a, int *b) { int tmp=*a; *a=*b; *b=tmp; }

static inline bool icmp(const int *x, int i, int j, bool min, bool nalast) {
if (x[i]==x[j]) return i > j;
if (x[i]==NA_INTEGER) return nalast;
if (x[j]==NA_INTEGER) return !nalast;
return min ? x[i] < x[j] : x[i] > x[j];
}

static inline bool dcmp(const double *x, int i, int j, bool min, bool nalast) {
if (x[i]==x[j] || (isnan(x[i]) && isnan(x[j]))) return i > j;
if (isnan(x[i])) return nalast;
if (isnan(x[j])) return !nalast;
return min ? x[i] < x[j] : x[i] > x[j];
}

static inline bool i64cmp(const int64_t *x, int i, int j, bool min, bool nalast) {
if (x[i]==x[j]) return i > j;
if (x[i]==NA_INTEGER64) return nalast;
if (x[j]==NA_INTEGER64) return !nalast;
return min ? x[i] < x[j] : x[i] > x[j];
}

static inline bool scmp(const SEXP *restrict x, int i, int j, bool min, bool nalast) {
if (strcmp(CHAR(x[i]), CHAR(x[j])) == 0) return i > j;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since CHAR(NA_STRING) returns "NA", this will compare NA_STRING and mkChar("NA") as "equal".

Might it help to call strcmp(x[i], x[j]) only once? The compiler may be already optimising this.

if (x[i]==NA_STRING) return nalast;
if (x[j]==NA_STRING) return !nalast;
return min ? strcmp(CHAR(x[i]),CHAR(x[j]))<0 : strcmp(CHAR(x[i]),CHAR(x[j]))>0;
}

static inline bool ccmp(const Rcomplex *x, int i, int j, bool min, bool nalast) {
if (ISNAN_COMPLEX(x[i]) && ISNAN_COMPLEX(x[j])) return i > j;
if (x[i].r==x[j].r) {
if (x[i].i==x[j].i) return i > j;
return min ? x[i].i < x[j].i : x[i].i > x[j].i;
}
if (ISNAN_COMPLEX(x[i])) return nalast;
if (ISNAN_COMPLEX(x[j])) return !nalast;
return min ? x[i].r < x[j].r : x[i].r > x[j].r;
}

// compare value of node with values of left/right child nodes and sift value down if value of child node is smaller than parent (for minheap)
#undef SIFT
#define SIFT(CMP) { \
int smallest, l, r; \
while(true) { \
smallest = k; \
l = (k << 1) + 1; \
r = l+1; \
if (l < len && CMP(VAL,INDEX[l],INDEX[smallest],min,nalast)) \
smallest = l; \
if (r < len && CMP(VAL,INDEX[r],INDEX[smallest],min,nalast)) \
smallest = r; \
if (smallest != k) { \
swap(&INDEX[k], &INDEX[smallest]); \
k = smallest; \
} else { \
break; \
} \
} \
}

// for finding decreasing topn build minheap and add values if they exceed
// minimum by overwriting minimum and following down sifting
#undef HEAPN
#define HEAPN(CTYPE, RTYPE, CMP, SORTED) { \
const CTYPE *restrict VAL = (const CTYPE *)RTYPE(x); \
for (int i=n/2; i>=0; --i) { k=i; len=n; SIFT(CMP); } \
for (int i=n; i<xlen; ++i) { \
if (CMP(VAL,INDEX[0],i,min,nalast)) { \
INDEX[0] = i; \
k=0; len=n; SIFT(CMP); \
} \
} \
if (SORTED) { \
for (int i=0; i<n; ++i) { \
swap(&INDEX[0], &INDEX[n-1-i]); \
k=0; len=n-1-i; SIFT(CMP); \
ians[n-1-i] = INDEX[n-1-i]+1; \
} \
} else { \
for (int i=0; i<n; ++i) { \
ians[i] = INDEX[i]+1; \
} \
} \
free(INDEX); \
}

SEXP topn(SEXP x, SEXP nArg, SEXP naArg, SEXP ascArg, SEXP sortedArg) {
if (!isInteger(nArg) || LENGTH(nArg)!=1 || INTEGER(nArg)[0]<=0 || INTEGER(nArg)[0]==NA_INTEGER) error(_("topn(x,n) only implemented for n > 0."));
if (!IS_TRUE_OR_FALSE(ascArg)) error(_("%s must be TRUE or FALSE"), "decreasing");
if (!IS_TRUE_OR_FALSE(naArg)) error(_("%s must be TRUE or FALSE"), "na.last");
if (!IS_TRUE_OR_FALSE(sortedArg)) error(_("%s must be TRUE or FALSE"), "sorted");

const int xlen = LENGTH(x);
int n = INTEGER(nArg)[0];
if (n > xlen) {
warning(_("n should be smaller or equal than length(x) but provided n=%d and length(x)=%d.\n Coercing n to length(x)."), n, xlen);
n = xlen;
}

const bool min = LOGICAL(ascArg)[0];
const bool nalast = LOGICAL(naArg)[0];
const bool sorted = LOGICAL(sortedArg)[0];

SEXP ans;
int k, len;
ans = PROTECT(allocVector(INTSXP, n));
int *restrict ians = INTEGER(ans);
int *restrict INDEX = malloc(n*sizeof(int));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not (int *)R_alloc(n, sizeof(int))? R will check the allocation and unprotect it when .Call(Ctopn, ...) returns.

if (!INDEX) error(_("Internal error: Couldn't allocate memory for heap indices.")); // # nocov
for (int i=0; i<n; ++i) INDEX[i] = i;
switch(TYPEOF(x)) {
case LGLSXP: case INTSXP: { HEAPN(int, INTEGER, icmp, sorted); } break;
case REALSXP: {
if (INHERITS(x, char_integer64)) { HEAPN(int64_t, REAL, i64cmp, sorted); }
else { HEAPN(double, REAL, dcmp, sorted); } break; }
case CPLXSXP: { HEAPN(Rcomplex, COMPLEX, ccmp, sorted); } break;
case STRSXP: { HEAPN(SEXP, STRING_PTR, scmp, sorted); } break;
Comment on lines +116 to +121
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case LGLSXP: case INTSXP: { HEAPN(int, INTEGER, icmp, sorted); } break;
case REALSXP: {
if (INHERITS(x, char_integer64)) { HEAPN(int64_t, REAL, i64cmp, sorted); }
else { HEAPN(double, REAL, dcmp, sorted); } break; }
case CPLXSXP: { HEAPN(Rcomplex, COMPLEX, ccmp, sorted); } break;
case STRSXP: { HEAPN(SEXP, STRING_PTR, scmp, sorted); } break;
case LGLSXP: case INTSXP: { HEAPN(int, INTEGER_RO, icmp, sorted); } break;
case REALSXP: {
if (INHERITS(x, char_integer64)) { HEAPN(int64_t, REAL_RO, i64cmp, sorted); }
else { HEAPN(double, REAL_RO, dcmp, sorted); } break; }
case CPLXSXP: { HEAPN(Rcomplex, COMPLEX_RO, ccmp, sorted); } break;
case STRSXP: { HEAPN(SEXP, STRING_PTR_RO, scmp, sorted); } break;

default:
free(INDEX); error(_("Type '%s' not supported by topn."), type2char(TYPEOF(x)));
}
UNPROTECT(1);
return(ans);
}

#undef QUICKN
#define QUICKN(CTYPE, RTYPE, CMP, SWAP) \
CTYPE *ix = (CTYPE *)RTYPE(x); \
CTYPE *ians = (CTYPE *)RTYPE(ans); \
unsigned long l = 0, ir = xlen - 1; \
for (;;) { \
if (ir <= l + 1) { \
if (ir == l + 1 && CMP(ix,l,ir,min,nalast)) { \
SWAP(ix+l, ix+ir); \
} \
break; \
} else { \
unsigned long mid = (l + ir) >> 1; \
SWAP(ix+mid, ix+l + 1); \
if (CMP(ix,l,ir,min,nalast)) { \
SWAP(ix+l, ix+ir); \
} \
if (CMP(ix,l+1,ir,min,nalast)) { \
SWAP(ix+l+1, ix+ir); \
} \
if (CMP(ix,l,l+1,min,nalast)) { \
SWAP(ix+l, ix+l+1); \
} \
unsigned long i = l + 1, j = ir; \
for (;;) { \
do i++; while (CMP(ix,l+1,i,min,nalast)); \
do j--; while (CMP(ix,j,l+1,min,nalast)); \
if (j < i) break; \
SWAP(ix+i, ix+j); \
} \
SWAP(ix+l+1, ix+j); \
if (j >= n) ir = j - 1; \
if (j <= n) l = i; \
} \
} \
memcpy(ians, ix, n * sizeof(CTYPE))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat scary for character vectors, but I'm not seeing anything that would break right now.

From the GC generations viewpoint, x and ans are likely from the same GC generation; x is possibly older. There shouldn't be any problem with elements of newer, more-frequently-sweeped ans pointing to values from an older, less-frequently-sweeped GC generation. (It's the opposite that causes use-after-frees.)

From the reference counts viewpoint, it'll be one less than what it should be for elements of ans, but CHARSXPs are cached and immutable anyway.


static inline void iswap(int *a, int *b) {int tmp=*a; *a=*b; *b=tmp;}
static inline void dswap(double *a, double *b) {double tmp=*a; *a=*b; *b=tmp;}
static inline void i64swap(int64_t *a, int64_t *b) {int64_t tmp=*a; *a=*b; *b=tmp;}
static inline void cswap(Rcomplex *a, Rcomplex *b) {Rcomplex tmp=*a; *a=*b; *b=tmp;}
static inline void sswap(SEXP *a, SEXP *b) {SEXP tmp=*a; *a=*b; *b=tmp;}

SEXP quickn(SEXP x, SEXP nArg, SEXP naArg, SEXP ascArg) {
if (!isInteger(nArg) || LENGTH(nArg)!=1 || INTEGER(nArg)[0]<=0 || INTEGER(nArg)[0]==NA_INTEGER) error(_("topn(x,n) only implemented for n > 0."));
if (!IS_TRUE_OR_FALSE(ascArg)) error(_("%s must be TRUE or FALSE"), "decreasing");
if (!IS_TRUE_OR_FALSE(naArg)) error(_("%s must be TRUE or FALSE"), "na.last");

const int xlen = LENGTH(x);
int n = INTEGER(nArg)[0];
x = PROTECT(duplicate(x));

const bool min = LOGICAL(ascArg)[0];
const bool nalast = LOGICAL(naArg)[0];

SEXP ans;
ans = PROTECT(allocVector(TYPEOF(x), n));
switch(TYPEOF(x)) {
case LGLSXP: case INTSXP: { QUICKN(int, INTEGER, icmp, iswap); } break;
case REALSXP: {
if (INHERITS(x, char_integer64)) { QUICKN(int64_t, REAL, i64cmp, i64swap); }
else { QUICKN(double, REAL, dcmp, dswap); } break; }
case CPLXSXP: { QUICKN(Rcomplex, COMPLEX, ccmp, cswap); } break;
case STRSXP: { QUICKN(SEXP, STRING_PTR, scmp, sswap); } break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case STRSXP: { QUICKN(SEXP, STRING_PTR, scmp, sswap); } break;
case STRSXP: { QUICKN(SEXP, STRING_PTR_RO, scmp, sswap); } break;

Could also be DATAPTR_RO. If only used for swapping elements in place, this is not any worse than reorder:

data.table/src/reorder.c

Lines 116 to 117 in 2654599

// Unique and somber line. Not done lightly. Please read all comments in this file.
memcpy((char*)DATAPTR_RO(v) + size*start, TMP, size*nmid);

default:
error(_("Type '%s' not supported by quickn."), type2char(TYPEOF(x)));
}
copyMostAttrib(x, ans);
UNPROTECT(2);
return(ans);
}
Loading