collect more statistics about the data #2879

@jangorecki

data.table could collect more statistics about the data while processing it. This would enable optimizations, and not only in internal data.table code: users could rely on these statistics to speed up their own code and to design more data-driven functions.
Measures to collect:

  • is sorted: haskey(x)
  • has index: !is.null(idx<-attr(attr(x, "index"), idx_name))
  • has NA / anyNA
  • has NaN
  • number of groups (uniqueN): length(attr(idx, "starts"))
  • size of biggest group: attr(idx, "maxgrpn")
  • is unique (uniqueN == .N): attr(idx, "maxgrpn")==1L
  • range (min, max): x[c(idx[1L], idx[length(idx)])]
  • all NA: {{hasna}} && length(attr(idx, "starts"))==1L
  • is ascii
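
To make the measures above concrete, here is a minimal sketch that computes most of them for a single column using only exported data.table and base R functions. `col_stats` is a hypothetical helper, not part of data.table, and it recomputes everything from scratch, which is exactly the cost that the proposal would avoid by reading the values from the key or index attributes. The "is ascii" check is left out, and the column is assumed to be non-empty.

```r
library(data.table)

# Hypothetical helper (not a data.table function): computes the proposed
# per-column measures directly from the data.
col_stats <- function(DT, col) {
  x <- DT[[col]]
  # Group sizes for this column, computed on a one-column copy so that
  # the grouping does not depend on other column names in DT.
  grp_sizes <- data.table(x = x)[, .N, by = x]$N
  all_na <- all(is.na(x))
  list(
    is_sorted = identical(key(DT)[1L], col),      # column leads the key
    has_index = col %in% indices(DT),             # single-column index exists
    has_na    = anyNA(x),                         # TRUE for NaN as well
    has_nan   = is.double(x) && any(is.nan(x)),
    n_groups  = length(grp_sizes),                # NA counts as a group
    max_group = max(grp_sizes),
    is_unique = max(grp_sizes) == 1L,
    range     = if (all_na) c(NA, NA) else range(x, na.rm = TRUE),
    all_na    = all_na
  )
}

DT <- data.table(a = c(2L, 1L, 2L, NA), b = c(1.5, NaN, 3, 3))
setindex(DT, a)
str(col_stats(DT, "a"))
```

With the statistics stored alongside the index, a helper like this could return most values without scanning the column again.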

Optionally, since I don't see obvious optimizations coming from these:

  • NA count
  • sd, var
