@ben-schwen
Member

@ben-schwen ben-schwen commented Sep 17, 2021

Closes #3804

  • R/C code
  • tests
  • man page
  • news

I see the general use case of topn as arrays where a full sort is expensive and we want good performance with as little additional memory as possible.

Benchmarks

Integer

Worst case

The array is sorted ascending and we want the n largest values, so the heap has to be updated at every step after the first n elements.
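
To make the update pattern concrete, here is a minimal heap-select sketch in C. It is an illustration only, not the code in src/topn.c: the names sift_down and heap_topn_max are hypothetical, it handles plain int input without NA, and it returns 0-based indices in heap order rather than sorted order. The return value counts how often the root is replaced, which is the quantity that separates the worst case from the best case.

```c
#include <assert.h>
#include <stddef.h>

/* Restore the min-heap property below position i.
   heap[] holds indices into x; the root indexes the smallest kept value. */
static void sift_down(const int *x, size_t *heap, size_t size, size_t i) {
  for (;;) {
    size_t l = 2 * i + 1, r = l + 1, m = i;
    if (l < size && x[heap[l]] < x[heap[m]]) m = l;
    if (r < size && x[heap[r]] < x[heap[m]]) m = r;
    if (m == i) return;
    size_t tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
    i = m;
  }
}

/* Hypothetical helper: write the 0-based indices of the n largest values of
   x[0..len-1] into heap[] (heap order, not sorted) and return how many times
   the root had to be replaced. */
size_t heap_topn_max(const int *x, size_t len, size_t *heap, size_t n) {
  assert(n >= 1 && n <= len);
  /* Seed the heap with the first n indices, then heapify bottom-up. */
  for (size_t i = 0; i < n; i++) heap[i] = i;
  for (size_t i = n / 2; i-- > 0; ) sift_down(x, heap, n, i);
  size_t updates = 0;
  for (size_t i = n; i < len; i++) {
    /* Only a value larger than the smallest kept value enters the heap. */
    if (x[i] > x[heap[0]]) {
      heap[0] = i;
      sift_down(x, heap, n, 0);
      updates++;
    }
  }
  return updates;
}
```

On ascending input every element after the first n replaces the root, so the update count is len - n; on descending input it is zero and the scan costs one comparison per element.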

library(data.table)
setDTthreads(1L)
x = seq.int(1e8)
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=TRUE),
    quickn(x,n,decreasing=TRUE),
    kit::topn(x,n,decreasing=TRUE),
    data.table:::forder(x,decreasing=TRUE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)   150.43ms 162.42ms     6.17    381.5MB    0    
#> 2 quickn(x, n, decreasing = TRUE) 332.06ms 370.74ms     2.70    381.5MB    2.70 
#> 3 kit::topn(x, n, decreasing = T… 415.41ms 431.38ms     2.32     39.8KB    0    
#> 4 data.table:::forder(x, decreas…    1.83s    1.83s     0.547   381.5MB    0.547

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      2.67s    2.67s     0.375        0B    0    
#> 2 quickn(x, n, decreasing = TRUE) 253.91ms 263.16ms     3.80      381MB    3.80 
#> 3 kit::topn(x, n, decreasing = T… 833.57ms 833.57ms     1.20         0B    0    
#> 4 data.table:::forder(x, decreas…    1.65s    1.65s     0.607     381MB    0.607

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      4.45s    4.45s     0.225      448B    0    
#> 2 quickn(x, n, decreasing = TRUE) 269.91ms 271.52ms     3.68      381MB    3.68 
#> 3 kit::topn(x, n, decreasing = T…     7.3s     7.3s     0.137      448B    0    
#> 4 data.table:::forder(x, decreas…    1.65s    1.65s     0.605     381MB    0.605

Best case

The array is sorted ascending and we want the n smallest values, so the heap never needs to be updated after the first n elements.
(kit is not benchmarked here since it errors from n=1e4 onwards.)

library(data.table)
setDTthreads(1L)
x = seq.int(1e8)
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=FALSE),
    quickn(x,n,decreasing=FALSE),
    data.table:::forder(x,decreasing=FALSE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       107ms  109ms      9.20     381MB     0   
#> 2 quickn(x, n, decreasing = FALSE)     245ms  259ms      3.86     381MB     3.86
#> 3 data.table:::forder(x, decreasing =… 837ms  837ms      1.19     382MB     1.19

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       125ms  126ms      7.88        0B     0   
#> 2 quickn(x, n, decreasing = FALSE)     249ms  256ms      3.91     381MB     3.91
#> 3 data.table:::forder(x, decreasing =… 971ms  971ms      1.03     381MB     1.03

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       124ms  127ms      7.76      448B     0   
#> 2 quickn(x, n, decreasing = FALSE)     252ms  256ms      3.91     381MB     3.91
#> 3 data.table:::forder(x, decreasing =… 889ms  889ms      1.12     381MB     1.12

n = 1e3
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       122ms  123ms      8.11    3.95KB     0   
#> 2 quickn(x, n, decreasing = FALSE)     245ms  257ms      3.89  381.47MB     3.89
#> 3 data.table:::forder(x, decreasing =… 911ms  911ms      1.10  381.48MB     1.10

n = 1e4
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       118ms  120ms      8.32    39.1KB     0   
#> 2 quickn(x, n, decreasing = FALSE)     248ms  250ms      4.00   381.5MB     4.00
#> 3 data.table:::forder(x, decreasing =… 876ms  876ms      1.14   381.5MB     1.14

n = 1e5
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       119ms  124ms      8.13     391KB     0   
#> 2 quickn(x, n, decreasing = FALSE)     254ms  280ms      3.57     382MB     3.57
#> 3 data.table:::forder(x, decreasing =… 924ms  924ms      1.08     382MB     1.08

n = 1e6
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)  133.82ms 136.46ms     7.28     3.81MB    0    
#> 2 quickn(x, n, decreasing = FALS… 272.88ms 279.49ms     3.58   385.29MB    3.58 
#> 3 data.table:::forder(x, decreas…    1.04s    1.04s     0.964   389.1MB    0.964

n = 1e7
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       203ms  209ms      4.74    38.1MB     0   
#> 2 quickn(x, n, decreasing = FALSE)     265ms  280ms      3.57   419.6MB     3.57
#> 3 data.table:::forder(x, decreasing =… 987ms  987ms      1.01   457.8MB     1.01

Random permutation (mimicking average case)

library(data.table)
setDTthreads(1L)
set.seed(373)
x = sample(seq.int(1e8))
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=TRUE),
    quickn(x,n,decreasing=TRUE),
    kit::topn(x,n,decreasing=TRUE),
    data.table:::forder(x,decreasing=TRUE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     83.98ms 85.38ms    11.7      3.19KB    0    
#> 2 quickn(x, n, decreasing = TRUE)     1.04s   1.04s     0.964  381.47MB    0.964
#> 3 kit::topn(x, n, decreasing = TRU…  45.9ms 46.12ms    21.6     39.77KB    0    
#> 4 data.table:::forder(x, decreasin…   2.26s   2.26s     0.443  381.55MB    0.443

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      81.7ms 82.12ms    12.1          0B    0    
#> 2 quickn(x, n, decreasing = TRUE)     1.09s   1.09s     0.916     381MB    0.916
#> 3 kit::topn(x, n, decreasing = TRUE) 46.3ms 46.52ms    21.3          0B    0    
#> 4 data.table:::forder(x, decreasing…  2.25s   2.25s     0.444     381MB    0.444

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     82.62ms 84.17ms    11.9        448B    0    
#> 2 quickn(x, n, decreasing = TRUE)     1.07s   1.07s     0.931     381MB    0.931
#> 3 kit::topn(x, n, decreasing = TRU… 46.58ms 47.92ms    20.2        448B    0    
#> 4 data.table:::forder(x, decreasin…   2.42s   2.42s     0.414     381MB    0.414

n = 1e3
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     95.41ms 98.98ms     9.98     3.95KB    0    
#> 2 quickn(x, n, decreasing = TRUE)     1.13s   1.13s     0.884  381.47MB    0.884
#> 3 kit::topn(x, n, decreasing = TRU… 58.84ms 60.07ms    16.5      3.95KB    0    
#> 4 data.table:::forder(x, decreasin…   2.51s   2.51s     0.398  381.48MB    0.398

n = 1e4
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)   118.19ms 119.21ms     8.33     39.1KB    0    
#> 2 quickn(x, n, decreasing = TRUE)    1.03s    1.03s     0.967   381.5MB    0.967
#> 3 kit::topn(x, n, decreasing = T…     2.1s     2.1s     0.477   381.5MB    0.477
#> 4 data.table:::forder(x, decreas…    2.25s    2.25s     0.445   381.5MB    0.445

n = 1e5
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)   519.26ms 519.26ms     1.93      391KB    0    
#> 2 quickn(x, n, decreasing = TRUE)    1.07s    1.07s     0.930     382MB    0.930
#> 3 kit::topn(x, n, decreasing = T…    1.91s    1.91s     0.523     382MB    0.523
#> 4 data.table:::forder(x, decreas…    2.35s    2.35s     0.426     382MB    0.426

n = 1e6
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)        5.44s  5.44s     0.184    3.81MB    0    
#> 2 quickn(x, n, decreasing = TRUE)      1.02s  1.02s     0.983  385.29MB    0.983
#> 3 kit::topn(x, n, decreasing = TRUE)   1.94s  1.94s     0.517  385.29MB    0.517
#> 4 data.table:::forder(x, decreasing =… 2.27s  2.27s     0.441   389.1MB    0.441

Double

Worst case

library(data.table)
setDTthreads(1L)
x = as.double(seq.int(1e7))
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=TRUE),
    quickn(x,n,decreasing=TRUE),
    kit::topn(x,n,decreasing=TRUE),
    data.table:::forder(x,decreasing=TRUE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      24.2ms  26.3ms     37.2     76.3MB     0   
#> 2 quickn(x, n, decreasing = TRUE)    53.6ms  54.6ms     17.9     76.3MB    17.9 
#> 3 kit::topn(x, n, decreasing = TRU…  51.9ms    55ms     17.7     39.8KB     0   
#> 4 data.table:::forder(x, decreasin… 449.7ms 456.1ms      2.19    38.2MB     1.10

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     337.5ms 342.8ms      2.92        0B     0   
#> 2 quickn(x, n, decreasing = TRUE)    54.3ms  56.4ms     17.4     76.3MB    17.4 
#> 3 kit::topn(x, n, decreasing = TRU… 158.3ms 167.4ms      5.83        0B     0   
#> 4 data.table:::forder(x, decreasin… 386.9ms 424.8ms      2.35    38.1MB     1.18

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)   584.17ms 584.17ms     1.71       448B     0   
#> 2 quickn(x, n, decreasing = TRUE)   52.1ms  53.86ms    18.6      76.3MB    18.6 
#> 3 kit::topn(x, n, decreasing = T…    1.94s    1.94s     0.515      448B     0   
#> 4 data.table:::forder(x, decreas… 367.69ms 383.86ms     2.61     38.1MB     1.30

n = 1e3
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     873.2ms 873.2ms    1.15      3.95KB     0   
#> 2 quickn(x, n, decreasing = TRUE)    50.9ms  51.5ms   19.3       76.3MB    19.3 
#> 3 kit::topn(x, n, decreasing = TRU…   18.3s   18.3s    0.0546    3.95KB     0   
#> 4 data.table:::forder(x, decreasin…   358ms 358.3ms    2.79     38.16MB     1.40

Best case

library(data.table)
setDTthreads(1L)
x = as.double(seq.int(1e7))
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=FALSE),
    quickn(x,n,decreasing=FALSE),
    data.table:::forder(x,decreasing=FALSE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     18.8ms  19.2ms     51.3     76.3MB     0   
#> 2 quickn(x, n, decreasing = FALSE)   54.7ms  55.8ms     17.9     76.3MB    17.9 
#> 3 data.table:::forder(x, decreasin… 179.5ms 209.6ms      4.44    38.2MB     1.48

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     20.2ms  21.3ms     46.7         0B     0   
#> 2 quickn(x, n, decreasing = FALSE)   56.5ms  59.9ms     16.2     76.3MB    16.2 
#> 3 data.table:::forder(x, decreasin…   188ms 190.5ms      5.11    38.1MB     1.70

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     18.5ms  19.8ms     49.8       448B     0   
#> 2 quickn(x, n, decreasing = FALSE)     53ms  54.7ms     18.0     76.3MB    18.0 
#> 3 data.table:::forder(x, decreasin… 178.6ms 187.8ms      5.30    38.1MB     1.77

n = 1e3
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     18.5ms  18.9ms     51.6     3.95KB     0   
#> 2 quickn(x, n, decreasing = FALSE)   54.2ms  56.1ms     16.7     76.3MB    16.7 
#> 3 data.table:::forder(x, decreasin… 237.2ms 243.6ms      4.11   38.16MB     1.37

n = 1e4
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     20.6ms  21.7ms     45.8     39.1KB     0   
#> 2 quickn(x, n, decreasing = FALSE)   58.2ms  59.3ms     16.4     76.4MB    16.4 
#> 3 data.table:::forder(x, decreasin… 180.7ms 199.2ms      5.15    38.2MB     1.72

n = 1e5
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     19.5ms  20.1ms     48.3    390.7KB     2.01
#> 2 quickn(x, n, decreasing = FALSE)   61.2ms  61.2ms     16.3     77.1MB   114.  
#> 3 data.table:::forder(x, decreasin… 171.7ms 185.8ms      5.38    38.9MB     2.69

n = 1e6
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       27ms    28ms     34.3     3.81MB     0   
#> 2 quickn(x, n, decreasing = FALSE)   52.8ms  56.7ms     15.5    83.92MB    15.5 
#> 3 data.table:::forder(x, decreasin… 179.2ms 180.7ms      5.54   45.78MB     1.85

n = 1e7
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)    139.7ms 139.9ms      6.97    38.1MB     2.32
#> 2 quickn(x, n, decreasing = FALSE)   63.1ms  63.1ms     15.9    152.6MB    79.3 
#> 3 data.table:::forder(x, decreasin… 200.9ms 202.3ms      4.94   114.4MB     2.47

Random permutation

library(data.table)
setDTthreads(1L)
set.seed(373)
x = sample(as.double(seq.int(1e7)))
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=TRUE),
    quickn(x,n,decreasing=TRUE),
    kit::topn(x,n,decreasing=TRUE),
    data.table:::forder(x,decreasing=TRUE)[1:n]
  )
}

n = 1e0
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)    15.76ms  16.95ms     58.8     3.19KB     0   
#> 2 quickn(x, n, decreasing = TRUE) 123.92ms 125.14ms      8.01    76.3MB     8.01
#> 3 kit::topn(x, n, decreasing = T…   9.45ms   9.69ms    103.     39.77KB     0   
#> 4 data.table:::forder(x, decreas… 457.36ms 487.79ms      2.05   38.22MB     1.03

n = 1e1
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      18.4ms  19.1ms     51.8         0B     0   
#> 2 quickn(x, n, decreasing = TRUE)   143.7ms 145.9ms      6.80    76.3MB     6.80
#> 3 kit::topn(x, n, decreasing = TRU…  10.6ms  10.7ms     91.7         0B     0   
#> 4 data.table:::forder(x, decreasin… 521.7ms 521.7ms      1.92    38.1MB     0

n = 1e2
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)     16.43ms  17.1ms     57.4       448B     0   
#> 2 quickn(x, n, decreasing = TRUE)  120.77ms 126.8ms      7.97    76.3MB     7.97
#> 3 kit::topn(x, n, decreasing = TR…   9.89ms    11ms     90.9       448B     0   
#> 4 data.table:::forder(x, decreasi… 471.95ms 484.7ms      2.06    38.1MB     1.03

n = 1e3
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      19.6ms  20.2ms     49.0     3.95KB     0   
#> 2 quickn(x, n, decreasing = TRUE)   128.9ms   133ms      7.36    76.3MB     7.36
#> 3 kit::topn(x, n, decreasing = TRU…  31.9ms  33.3ms     29.9     3.95KB     0   
#> 4 data.table:::forder(x, decreasin… 510.5ms 510.5ms      1.96   38.16MB     0

n = 1e4
b()
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)      38.4ms  40.6ms     24.7     39.1KB     0   
#> 2 quickn(x, n, decreasing = TRUE)   140.3ms 140.3ms      7.13    76.4MB    21.4 
#> 3 kit::topn(x, n, decreasing = TRU… 326.9ms 326.9ms      3.06    38.2MB     3.06
#> 4 data.table:::forder(x, decreasin… 530.2ms 530.2ms      1.89    38.2MB     0

n = 1e5
b()
#> # A tibble: 4 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)        271ms  275ms      3.63   390.7KB     0   
#> 2 quickn(x, n, decreasing = TRUE)      143ms  143ms      7.01    77.1MB    21.0 
#> 3 kit::topn(x, n, decreasing = TRUE)   443ms  443ms      2.26    38.5MB     2.26
#> 4 data.table:::forder(x, decreasing =… 510ms  510ms      1.96    38.9MB     0

n = 1e6
b()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = TRUE)        3.8s    3.8s     0.263    3.81MB     0   
#> 2 quickn(x, n, decreasing = TRUE)   129.2ms   139ms     6.67    83.92MB     6.67
#> 3 kit::topn(x, n, decreasing = TRU… 344.9ms 378.2ms     2.64    41.96MB     1.32
#> 4 data.table:::forder(x, decreasin… 530.2ms 530.2ms     1.89    45.78MB     0

Strings

Random strings

library(data.table)
setDTthreads(1L)
x = stringi::stri_rand_strings(1e6, 10)
b <- function() {
  bench::mark(check=FALSE,
    topn(x,n,decreasing=FALSE),
    quickn(x,n,decreasing=FALSE),
    data.table:::forder(x,decreasing=FALSE)[1:n]
  )
}

n = 1e0
b()
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)    8.94ms   9.22ms    108.      3.19KB     0   
#> 2 quickn(x, n, decreasing = FALS…  41.45ms  42.73ms     23.2     7.63MB    23.2 
#> 3 data.table:::forder(x, decreas… 208.66ms 209.69ms      4.77    3.89MB     2.38

n = 1e1
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)       10ms  10.3ms     96.7         0B      0  
#> 2 quickn(x, n, decreasing = FALSE)   47.4ms  47.7ms     20.7     7.63MB     13.8
#> 3 data.table:::forder(x, decreasin… 210.6ms 216.9ms      4.63    3.81MB      0

n = 1e2
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     10.2ms  10.6ms     94.0       448B     0   
#> 2 quickn(x, n, decreasing = FALSE)   46.7ms  47.7ms     20.9     7.63MB     5.96
#> 3 data.table:::forder(x, decreasin… 215.8ms 217.6ms      4.60    3.82MB     0

n = 1e3
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     12.2ms  12.7ms     78.1     3.95KB     0   
#> 2 quickn(x, n, decreasing = FALSE)     44ms  45.2ms     22.1     7.64MB     5.53
#> 3 data.table:::forder(x, decreasin… 206.4ms 212.8ms      4.63    3.82MB     0

n = 1e4
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     35.7ms  36.7ms     27.1    39.11KB     0   
#> 2 quickn(x, n, decreasing = FALSE)   43.6ms  44.3ms     22.4     7.71MB     5.61
#> 3 data.table:::forder(x, decreasin… 206.1ms 206.5ms      4.77    3.89MB     0

n = 1e5
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)    233.2ms 233.7ms      4.27  390.67KB      0  
#> 2 quickn(x, n, decreasing = FALSE)   44.1ms  45.9ms     21.9     8.39MB     11.0
#> 3 data.table:::forder(x, decreasin… 202.5ms 205.3ms      4.80    4.58MB      0

n = 1e6
b()
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 topn(x, n, decreasing = FALSE)     54.1ms  56.1ms     17.6     3.81MB     5.88
#> 2 quickn(x, n, decreasing = FALSE)   24.9ms  25.8ms     38.8    15.26MB   116.  
#> 3 data.table:::forder(x, decreasin… 206.6ms 209.2ms      4.71   11.44MB     0

@codecov

codecov bot commented Sep 17, 2021

Codecov Report

❌ Patch coverage is 62.16216% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.87%. Comparing base (b0b8b23) to head (e9688fb).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/topn.c 62.50% 27 Missing ⚠️
R/wrappers.R 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5167      +/-   ##
==========================================
- Coverage   99.04%   98.87%   -0.17%     
==========================================
  Files          87       88       +1     
  Lines       16678    16752      +74     
==========================================
+ Hits        16518    16564      +46     
- Misses        160      188      +28     

@mattdowle
Member

mattdowle commented Sep 24, 2021

Very nice! This is a common use case and would be great to get in. No problems about the code. Just thinking about API.

  1. I agree with this comment that the word 'top' doesn't convey min or max. How about minn/maxn, or min_n/max_n?

  2. This comment made a good point that maybe it should be topn(n, ...), with passing multiple columns in future in mind, iiuc, which followed from @MichaelChirico's good point.

  3. A concept in SQL land is LIMIT. Whether SQL engines typically know that LIMIT is set and optimize accordingly, I don't know. Regardless, a function or parameter to limit the number of rows returned could apply to and optimize other operations too; e.g. X[Y,limit=10] could return the first 10 rows of the X[Y] result without computing it all. One use case that springs to mind would be testing a join on large data to check it's returning the expected result before removing the limit to get the full result. But that's just an example; really any query could be limited, e.g. X[,j,by,limit=3] could return the first 3 groups, say, if each group took a long time because j was costly. I was just about to write that X[Y][1:10] could be optimized to X[Y, limit=10] but it's hard to see how to optimize across two [...][...] calls unless we make [...] lazy (which isn't impossible). Anyway, X[order(col), limit=10] could do what X[topn(col, 10)] is proposed to do and would avoid needing to discuss 1 and 2 above. It wouldn't change this PR much since the meat is in the C code, just the API to call that C code.

@Kamgang-B Kamgang-B requested a review from mattdowle September 25, 2021 10:19
@Kamgang-B

This comment was marked as outdated.

@ben-schwen
Member Author

ben-schwen commented Sep 25, 2021

Regarding API:
What about nmin and nmax, respectively? This would go nicely with nmin(n, ...). Thinking about the API and the future, it is easy to just return the root and basically cover nth(n, ...) with the same code.

Regarding Functionality:
Should the indices always be returned in the "right" order as specified by decreasing and na.last, or would it make sense to add a sorted argument? This would speed up the runtime by k * log(n) for topn(x,k) with n = length(x).

Regarding implementation:
The current binary heap could be replaced by a d-ary heap. However, this results in a slightly slower running time for lower k and only seems to overtake the binary heap for k >= 1e4.
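
For reference, the structural difference between the two heaps is confined to the sift-down step: in a d-ary heap node i has d children instead of two. The sketch below is a minimal illustration under assumptions, not the PR's code: dary_sift_down is a hypothetical name, and it works on a min-heap of double values rather than on indices to keep it short. It returns the number of comparisons so the per-level cost is visible.

```c
#include <assert.h>
#include <stddef.h>

/* Sift-down for a d-ary min-heap stored in an array: the children of node i
   sit at positions d*i + 1 .. d*i + d. Returns the number of comparisons. */
size_t dary_sift_down(double *heap, size_t size, size_t d, size_t i) {
  assert(d >= 2);
  size_t comparisons = 0;
  for (;;) {
    size_t first = d * i + 1, m = i;
    /* Find the smallest among node i and its (up to d) children. */
    for (size_t c = first; c < first + d && c < size; c++) {
      comparisons++;
      if (heap[c] < heap[m]) m = c;
    }
    if (m == i) return comparisons;  /* heap property holds again */
    double tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
    i = m;
  }
}
```

Each level costs up to d comparisons, but the heap has only about log_d(k) levels instead of log_2(k), so the shallower tree can only pay off once the heap is large.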

I like the idea of a versatile LIMIT for prototyping. However, my most common use case for this feature is just head(X)[,Y], and I'm not sure I would really switch to X[,Y, limit=6L] for that.

@ben-schwen ben-schwen mentioned this pull request Jan 2, 2022
@jangorecki
Member

I wonder if possibly https://github.com/Rdatatable/data.table/blob/master/src/quickselect.c could be reused?
Or maybe benchmark against that implementation?

I used it in naive rolling median algorithm to find partial (half) ordering.

@ben-schwen
Member Author

> I wonder if possibly https://github.com/Rdatatable/data.table/blob/master/src/quickselect.c could be reused? Or maybe benchmark against that implementation?
>
> I used it in naive rolling median algorithm to find partial (half) ordering.

Possibly, but quickselect returns values of x, not of order(x).
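
One way to get indices out of a quickselect is to partition an index array and compare through it, leaving x untouched. The following is a rough sketch under assumptions, not the code in src/quickselect.c: quickselect_idx and swap_idx are hypothetical names, the partition scheme is Lomuto with a middle pivot, and NA/NaN handling is ignored.

```c
#include <assert.h>
#include <stddef.h>

static void swap_idx(size_t *a, size_t *b) { size_t t = *a; *a = *b; *b = t; }

/* Rearrange idx[0..len-1] (indices into x) so that idx[0..k-1] index the k
   smallest values of x, in no particular order. x itself is never modified. */
void quickselect_idx(const double *x, size_t *idx, size_t len, size_t k) {
  assert(k >= 1 && k <= len);
  size_t lo = 0, hi = len - 1, target = k - 1;
  while (lo < hi) {
    /* Lomuto partition: move the middle element to the end as the pivot. */
    swap_idx(&idx[lo + (hi - lo) / 2], &idx[hi]);
    double pivot = x[idx[hi]];
    size_t store = lo;
    for (size_t i = lo; i < hi; i++) {
      if (x[idx[i]] < pivot) { swap_idx(&idx[i], &idx[store]); store++; }
    }
    swap_idx(&idx[store], &idx[hi]);
    /* idx[store] now sits at its final sorted position. */
    if (store == target) return;
    if (store < target) lo = store + 1;
    else hi = store - 1;  /* store > target >= 0, so no underflow */
  }
}
```

The extra indirection through idx is the main cost compared with selecting values directly, and the index array itself needs length(x) elements of memory, unlike a heap of size k.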

@jangorecki
Member

ah yes, you are correct. Anyway you can compare speed of returning a value vs index, and at least you will know if there is something to improve regarding your current implementation, in case quickselect would be faster

@jangorecki
Member

BTW. those benchmark timings tables are terrible to look at when different rows use different units (ms vs s).

@ben-schwen
Member Author

> ah yes, you are correct. Anyway you can compare speed of returning a value vs index, and at least you will know if there is something to improve regarding your current implementation, in case quickselect would be faster

I will add a version with quickselect, but my guess is that heapselect is faster for small k and quickselect becomes faster as k grows.

@jangorecki
Member

Matt's earlier idea to just wire this into [ as limit= is interesting. My instinct is to continue exporting the function, and revisit a limit= argument later, just making it powered by topn() (or whatever name).

I like topn, but I would look at postgres and duckdb naming here. If they both use limit then it is quite a good reason to consider limit.

@MichaelChirico
Member

> I like topn, but I would look at postgres and duckdb naming here. If they both use limit then it is quite a good reason to consider limit.

I'm not sure SQL will be the best guide here -- here we have a function that returns a variable number of rows, not really an SQL thing. ... ORDER BY <...> LIMIT N is the closest analogue, which is part of the language, not really a function in the same way -- moreover that returns a table, while the proposal here is to return the indices that would enable the table subset. So this is really more like:

SELECT rn
FROM (SELECT row_number() OVER (ORDER BY ...))
ORDER BY rn LIMIT n

In DuckDB, I looked at "affected by ordering" aggregate functions, max_by() and min_by() are the most relevant. I don't really see anything in Postgres.

Another suggestion: porder() with the same signature as order() except it gets limit= too.

DT[order(...), ..., limit = n] and then having this implementation be totally internal (e.g., just a new argument to forder()) is looking a bit more appealing to me too. It's not clear how often the user actually cares about the ordering indices vs. just getting a subset.

@jangorecki
Member

Then topn can be a confusing name, because it is commonly used in MSSQL for what is LIMIT in some other databases.

@MichaelChirico MichaelChirico mentioned this pull request Sep 13, 2024
@ben-schwen
Member Author

ben-schwen commented Sep 16, 2024

I added a quickselect version called quickn. This would make sense if we make topn mostly internal, e.g. DT[order(...), ..., limit = n, method=c("heapselect", "quickselect")]

Will update the benchmarks to make an informed decision.

@ben-schwen ben-schwen closed this Nov 10, 2024
@ben-schwen ben-schwen reopened this Nov 10, 2024
@MichaelChirico
Member

MichaelChirico commented Dec 3, 2024

> Then topn can be confusing name because it is commonly used in MSSQL for what is LIMIT in some other dbses.

Good call-out: https://learn.microsoft.com/en-us/dax/topn-function-dax

Examples there are not all that helpful, but AFAICT it suffers from the same confusing API where you are writing TOPN(..., DESC/ASC) and "top" is no longer the best phrasing.

I am leaning more towards limit= argument to [. It will be a good eventual complement to other FRs e.g. adding having= (#788), where= (#2911), join= (#3946), to make [ queries highly SQL-compatible.

> method=c("heapselect", "quickselect")

I'm not sure a method= argument to [ is warranted, I think options(datatable.query.limit.method) makes more sense.

@github-actions

github-actions bot commented Dec 21, 2025

  • HEAD=topn_heap slower P<0.001 for memrecycle regression fixed in #5463
  • HEAD=topn_heap slower P<0.001 for isoweek improved in #7144
    Comparison Plot

Generated via commit e9688fb

Download link for the artifact containing the test results: ↓ atime-results.zip

Task | Duration
R setup and installing dependencies | 2 minutes and 49 seconds
Installing different package versions | 22 seconds
Running and plotting the test cases | 5 minutes and 8 seconds

@jangorecki jangorecki closed this Dec 21, 2025
@jangorecki jangorecki reopened this Dec 21, 2025
int k, len;
ans = PROTECT(allocVector(INTSXP, n));
int *restrict ians = INTEGER(ans);
int *restrict INDEX = malloc(n*sizeof(int));

Why not (int *)R_alloc(n, sizeof(int))? R will check the allocation and unprotect it when .Call(Ctopn, ...) returns.

Comment on lines +116 to +121
case LGLSXP: case INTSXP: { HEAPN(int, INTEGER, icmp, sorted); } break;
case REALSXP: {
if (INHERITS(x, char_integer64)) { HEAPN(int64_t, REAL, i64cmp, sorted); }
else { HEAPN(double, REAL, dcmp, sorted); } break; }
case CPLXSXP: { HEAPN(Rcomplex, COMPLEX, ccmp, sorted); } break;
case STRSXP: { HEAPN(SEXP, STRING_PTR, scmp, sorted); } break;

Suggested change
- case LGLSXP: case INTSXP: { HEAPN(int, INTEGER, icmp, sorted); } break;
- case REALSXP: {
- if (INHERITS(x, char_integer64)) { HEAPN(int64_t, REAL, i64cmp, sorted); }
- else { HEAPN(double, REAL, dcmp, sorted); } break; }
- case CPLXSXP: { HEAPN(Rcomplex, COMPLEX, ccmp, sorted); } break;
- case STRSXP: { HEAPN(SEXP, STRING_PTR, scmp, sorted); } break;
+ case LGLSXP: case INTSXP: { HEAPN(int, INTEGER_RO, icmp, sorted); } break;
+ case REALSXP: {
+ if (INHERITS(x, char_integer64)) { HEAPN(int64_t, REAL_RO, i64cmp, sorted); }
+ else { HEAPN(double, REAL_RO, dcmp, sorted); } break; }
+ case CPLXSXP: { HEAPN(Rcomplex, COMPLEX_RO, ccmp, sorted); } break;
+ case STRSXP: { HEAPN(SEXP, STRING_PTR_RO, scmp, sorted); } break;

}

static inline bool scmp(const SEXP *restrict x, int i, int j, bool min, bool nalast) {
if (strcmp(CHAR(x[i]), CHAR(x[j])) == 0) return i > j;

Since CHAR(NA_STRING) returns "NA", this will compare NA_STRING and mkChar("NA") as "equal".

Might it help to call strcmp(x[i], x[j]) only once? The compiler may be already optimising this.
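
A comparator can sidestep the NA_STRING vs. "NA" ambiguity by testing for missingness via pointer identity before looking at the bytes, and by calling strcmp only once. A plain C sketch of that idea follows. NA_SENTINEL is a hypothetical stand-in for R's NA_STRING (with R's C API the check would presumably be a pointer comparison against NA_STRING), and na_aware_cmp is not part of the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for R's NA_STRING: one unique object whose text
   happens to be "NA", just like CHAR(NA_STRING). */
const char NA_SENTINEL[] = "NA";

/* Three-way comparison of two string elements: negative if a sorts first,
   positive if b sorts first, zero if they are equal. Missingness is decided
   by pointer identity, never by contents, so an ordinary "NA" string is not
   mistaken for a missing value. */
int na_aware_cmp(const char *a, const char *b, bool nalast) {
    assert(a != NULL && b != NULL);
    bool a_na = (a == NA_SENTINEL), b_na = (b == NA_SENTINEL);
    if (a_na || b_na) {
        if (a_na && b_na) return 0;
        /* exactly one is missing: it goes last iff nalast is true */
        return (a_na == nalast) ? 1 : -1;
    }
    int c = strcmp(a, b);            /* single strcmp call, result reused */
    return (c > 0) - (c < 0);
}
```

A caller can then break ties on the indices when the result is zero, as the return i > j in the line under review does.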

if (INHERITS(x, char_integer64)) { QUICKN(int64_t, REAL, i64cmp, i64swap); }
else { QUICKN(double, REAL, dcmp, dswap); } break; }
case CPLXSXP: { QUICKN(Rcomplex, COMPLEX, ccmp, cswap); } break;
case STRSXP: { QUICKN(SEXP, STRING_PTR, scmp, sswap); } break;

Suggested change
- case STRSXP: { QUICKN(SEXP, STRING_PTR, scmp, sswap); } break;
+ case STRSXP: { QUICKN(SEXP, STRING_PTR_RO, scmp, sswap); } break;

Could also be DATAPTR_RO. If only used for swapping elements in place, this is not any worse than reorder:

data.table/src/reorder.c

Lines 116 to 117 in 2654599

// Unique and somber line. Not done lightly. Please read all comments in this file.
memcpy((char*)DATAPTR_RO(v) + size*start, TMP, size*nmid);

if (j <= n) l = i; \
} \
} \
memcpy(ians, ix, n * sizeof(CTYPE))

This is somewhat scary for character vectors, but I'm not seeing anything that would break right now.

From the GC generations viewpoint, x and ans are likely from the same GC generation; x is possibly older. There shouldn't be any problem with elements of the newer, more frequently swept ans pointing to values from an older, less frequently swept GC generation. (It's the opposite that causes use-after-frees.)

From the reference counts viewpoint, it'll be one less than what it should be for elements of ans, but CHARSXPs are cached and immutable anyway.

Development

Successfully merging this pull request may close these issues.

topn for efficiently doing sorted head/tail

6 participants