-
Notifications
You must be signed in to change notification settings - Fork 1k
Topn doing order |> head/tail #5167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5167 +/- ##
==========================================
- Coverage 99.04% 98.87% -0.17%
==========================================
Files 87 88 +1
Lines 16678 16752 +74
==========================================
+ Hits 16518 16564 +46
- Misses 160 188 +28 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Very nice! This is a common use case and would be great to get in. No problems about the code. Just thinking about API.
|
This comment was marked as outdated.
This comment was marked as outdated.
|
Regarding API: Regarding Functionality: Regarding implementation: I like the idea of a versatile LIMIT in the light of prototyping. However, my most common use case for this feature is only |
|
I wonder if possibly https://github.com/Rdatatable/data.table/blob/master/src/quickselect.c could be reused? I used it in naive rolling median algorithm to find partial (half) ordering. |
Possibly. But quickselect returns values of |
|
ah yes, you are correct. Anyway you can compare speed of returning a value vs index, and at least you will know if there is something to improve regarding your current implementation, in case quickselect would be faster |
|
BTW. those benchmark timings tables are terrible to look at when different rows use different units (ms vs s). |
will add a version with quickselect but my guess is that heapselect is faster for smaller |
I like topn, but I would look at postgres and duckdb naming here. If they both use limit then it is quite a good reason to consider limit. |
I'm not sure SQL will be the best guide here -- here we have a function that returns a variable number of rows, not really an SQL thing. SELECT rn
FROM (SELECT row_number() OVER (ORDER BY ...))
ORDER BY rn LIMIT nIn DuckDB, I looked at "affected by ordering" aggregate functions, Another suggestion:
|
|
Then topn can be confusing name because it is commonly used in MSSQL for what is LIMIT in some other dbses. |
|
I added a quickselect version called Will update the benchmarks to make an informed decision. |
Good call-out: https://learn.microsoft.com/en-us/dax/topn-function-dax Examples there are not all that helpful, but AFAICT it suffers from the same confusing API where you are writing I am leaning more towards
I'm not sure a |
Generated via commit e9688fb Download link for the artifact containing the test results: ↓ atime-results.zip
|
| int k, len; | ||
| ans = PROTECT(allocVector(INTSXP, n)); | ||
| int *restrict ians = INTEGER(ans); | ||
| int *restrict INDEX = malloc(n*sizeof(int)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not (int *)R_alloc(n, sizeof(int))? R will check the allocation and unprotect it when .Call(Ctopn, ...) returns.
| case LGLSXP: case INTSXP: { HEAPN(int, INTEGER, icmp, sorted); } break; | ||
| case REALSXP: { | ||
| if (INHERITS(x, char_integer64)) { HEAPN(int64_t, REAL, i64cmp, sorted); } | ||
| else { HEAPN(double, REAL, dcmp, sorted); } break; } | ||
| case CPLXSXP: { HEAPN(Rcomplex, COMPLEX, ccmp, sorted); } break; | ||
| case STRSXP: { HEAPN(SEXP, STRING_PTR, scmp, sorted); } break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| case LGLSXP: case INTSXP: { HEAPN(int, INTEGER, icmp, sorted); } break; | |
| case REALSXP: { | |
| if (INHERITS(x, char_integer64)) { HEAPN(int64_t, REAL, i64cmp, sorted); } | |
| else { HEAPN(double, REAL, dcmp, sorted); } break; } | |
| case CPLXSXP: { HEAPN(Rcomplex, COMPLEX, ccmp, sorted); } break; | |
| case STRSXP: { HEAPN(SEXP, STRING_PTR, scmp, sorted); } break; | |
| case LGLSXP: case INTSXP: { HEAPN(int, INTEGER_RO, icmp, sorted); } break; | |
| case REALSXP: { | |
| if (INHERITS(x, char_integer64)) { HEAPN(int64_t, REAL_RO, i64cmp, sorted); } | |
| else { HEAPN(double, REAL_RO, dcmp, sorted); } break; } | |
| case CPLXSXP: { HEAPN(Rcomplex, COMPLEX_RO, ccmp, sorted); } break; | |
| case STRSXP: { HEAPN(SEXP, STRING_PTR_RO, scmp, sorted); } break; |
| } | ||
|
|
||
| static inline bool scmp(const SEXP *restrict x, int i, int j, bool min, bool nalast) { | ||
| if (strcmp(CHAR(x[i]), CHAR(x[j])) == 0) return i > j; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since CHAR(NA_STRING) returns "NA", this will compare NA_STRING and mkChar("NA") as "equal".
Might it help to call strcmp(x[i], x[j]) only once? The compiler may be already optimising this.
| if (INHERITS(x, char_integer64)) { QUICKN(int64_t, REAL, i64cmp, i64swap); } | ||
| else { QUICKN(double, REAL, dcmp, dswap); } break; } | ||
| case CPLXSXP: { QUICKN(Rcomplex, COMPLEX, ccmp, cswap); } break; | ||
| case STRSXP: { QUICKN(SEXP, STRING_PTR, scmp, sswap); } break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| case STRSXP: { QUICKN(SEXP, STRING_PTR, scmp, sswap); } break; | |
| case STRSXP: { QUICKN(SEXP, STRING_PTR_RO, scmp, sswap); } break; |
Could also be DATAPTR_RO. If only used for swapping elements in place, this is not any worse than reorder:
Lines 116 to 117 in 2654599
| // Unique and somber line. Not done lightly. Please read all comments in this file. | |
| memcpy((char*)DATAPTR_RO(v) + size*start, TMP, size*nmid); |
| if (j <= n) l = i; \ | ||
| } \ | ||
| } \ | ||
| memcpy(ians, ix, n * sizeof(CTYPE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewhat scary for character vectors, but I'm not seeing anything that would break right now.
From the GC generations viewpoint, x and ans are likely from the same GC generation; x is possibly older. There shouldn't be any problem with elements of newer, more-frequently-sweeped ans pointing to values from an older, less-frequently-sweeped GC generation. (It's the opposite that causes use-after-frees.)
From the reference counts viewpoint, it'll be one less than what it should be for elements of ans, but CHARSXPs are cached and immutable anyway.

Closes #3804
I see the general use case of
topnfor arrays where sorting costs much and using as few additional memory as possible with good performance.Benchmarks
Integer
Worst case
Array is sorted ascending and we want the maximum
topnso we need to update the heap at every step after nBest case
Array is sorted ascending and we want the minimum
topnso we never need to update after n(not benchmarking with kit since it errors from n=1e4 onwards)
Random permutation (mimicking average case)
Double
Worst case
Best case
Random permutation
Strings
Random strings