data.table could collect more statistics about the data while processing it. This would enable optimizations, and not only in internal data.table code: users could use these statistics to speed up their own code and to design more data-driven functions.
List of measures to collect:
- is sorted: `haskey(x)`
- has index: `!is.null(idx<-attr(attr(x, "index"), idx_name))`
- has NA / `anyNA`
- has NaN
- number of groups (uniqueN): `length(attr(idx, "starts"))`
- size of biggest group: `attr(idx, "maxgrpn")`
- is unique (uniqueN == .N): `attr(idx, "maxgrpn")==1L`
- range (min, max): `x[c(idx[1L], idx[length(idx)])]`
- all NA: `{{hasna}} && length(attr(idx, "starts"))==1L`
- is ascii
Optionally, as I don't see obvious optimizations coming from these:
- NA count
- sd, var
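
As a rough illustration of what these per-column statistics could look like from the user side, here is a minimal sketch that computes most of the measures listed above using only exported data.table functions and base R. The helper `col_stats` and its field names are hypothetical and not part of data.table. Unlike the proposal, it recomputes everything from scratch instead of reusing the attributes already stored on an index, so it only shows the shape of the result, not the intended performance.

```r
library(data.table)

# Hypothetical helper (not a data.table function): collect basic
# statistics for one column of a data.table.
col_stats <- function(x, col) {
  v <- x[[col]]
  all_na <- all(is.na(v))
  n_groups <- uniqueN(v)  # NA counts as its own group by default
  list(
    is_sorted = identical(key(x)[1L], col),      # column is the leading key column
    has_index = col %in% indices(x),             # single-column index on `col`
    has_na    = anyNA(v),
    has_nan   = is.double(v) && any(is.nan(v)),  # NaN only exists for doubles
    n_groups  = n_groups,
    max_group = if (length(v)) max(table(v, useNA = "ifany")) else 0L,
    is_unique = n_groups == length(v),
    range     = if (all_na) c(NA, NA) else range(v, na.rm = TRUE),
    all_na    = all_na
  )
}

dt <- data.table(a = c(1, 2, 2, NA), b = letters[1:4])
setkey(dt, a)
stats <- col_stats(dt, "a")
str(stats)
```

A function written against such a list could, for example, skip `na.omit()` when `has_na` is `FALSE`, or skip deduplication when `is_unique` is `TRUE`.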