Skip to content

Conversation

@sritchie73
Copy link
Contributor

After waiting several hours for as.matrix.data.table to convert a 3000 row 1.7 million column data.table to a matrix, I had a look around to see if there were faster ways of converting a data.frame/data.table to a matrix. This lead me to the data.frame.to_matrix() function in the Rfast package, which was able to perform the above task in under a minute.

I've set about incorporating this into the as.matrix function in the data.table package, but this has proved less straightforward than first anticipated. A few things I need to follow-up on:

  • data.frame.to_matrix returns an error when columns in the data.table are different types, this is intentional, as the code does not do type conversion/pre checks to enhance speed. We would need to do the type conversion ourselves but I'm not sure how best to detect the type that should be converted to across all columns.

  • data.frame.to_matrix` itself is quite janky, it throws a few warnings when it shouldn't due to some bad if-statement checks on their end (hence the somewhat odd choice I've currently made of setting the rownames to NULL post-conversion if there are no rownames)

  • When supplied integer columns data.frame.to_matrix converts these to numerics, causing many tests to fail. This may need to be fixed upstream, and I also need to check how this function works on other non-numeric types (e.g. character vectors, factors, dates, etc) to make sure it returns the equivalent of as.matrix.

For now, I've simply added an Rfast=FALSE argument to as.matrix. If the above points can be solved we can simply remove this, otherwise we could potentially give the user the option of explicitly setting Rfast=TRUE to get a numeric matrix (this would be useful purely for the rownames functionality of as.matrix.data.frame).

@sritchie73 sritchie73 self-assigned this Dec 20, 2019
@MichaelChirico
Copy link
Member

At a glance I would prefer we just wrote our own method in C rather than add a Suggests for a new library with Rcpp... besides the type bumping bit, it should be pretty straightforward & could parallelize on the longer dimension

@mattdowle mattdowle added the WIP label Dec 20, 2019
@mattdowle mattdowle changed the title [WIP] fast cast to matrix using Rfast fast cast to matrix using Rfast Dec 20, 2019
@jangorecki
Copy link
Member

jangorecki commented Dec 21, 2019

Agree with Michael. Adding a heavy Suggested dep is not much justified for a matrix conversion functions. Why not just use that package instead if someone needs it?
There is an open issue about fast conversion to matrix (but not cast or melt) where Matt gave some hints on how that could be optimised in C. I think this is the way to go. No extra deps.

@MichaelChirico
Copy link
Member

Glancing at the code, there's two bits of complication:

  • Dealing with is.ff columns (not sure how to force evaluation at the C level on those)
  • Dealing with columns with dimension (dim(X[[jj]]) > 1L); easier to do at the C level but requires a first pass over columns to know the output dimension.

@MichaelChirico
Copy link
Member

Some profiling revealed the unlist step is the bottleneck:

Screenshot 2019-12-21 at 6 53 54 PM

With that in mind, I tried to replace X = unlist(X, recursive = FALSE, use.names = FALSE) with X = do.call(cbind, X), but unfortunately it is slower.

@sritchie73
Copy link
Contributor Author

100% agree with the above.

What's the best course of action version control wise for dealing with this branch/commit if I wanted to work on creating a C-version? Should I revert the above commit, or just close this pull request and start a new branch?

@MichaelChirico
Copy link
Member

I would close this & start a new branch. Leaves this idea in the web history at least, without potentially cluttering up the C-level PR. And adding a link to this PR for posterity

@ethanbsmith
Copy link
Contributor

some intersection with this functionality and as.xts.data.table, since xts is really just a numeric matrix with an index attribute. would be nice if as.xts could benefit from this when its ready. in fact, would probably be really nice to support any zoo derivatives instead of just xts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants