Skip to content

Conversation

@nkulkarni
Copy link

@nkulkarni nkulkarni commented Aug 4, 2020

  • Adds the accept_large_sparse argument to BaseSGDRegressor.fit() (and similar methods in subclasses).

  • I'm seeing that in [MRG] Large sparse matrix support #11327, support was added for sparse matrices with 64-bit integer indices, but this support was only plumbed in for sklearn/linear_model/_logistic.py. It's possible there's a deeper reason for why 64-bit support isn't possible (which I am not aware of), but if it's just a matter of adding the right plumbing this should do the trick.

  • This should fix the error: ValueError: Only sparse matrices with 32-bit integer indices are accepted. Got int64 indices. encountered when trying to fit an SGDRegressor model with 64-bit sparse matrices.

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

@rth
Copy link
Member

rth commented Aug 5, 2020

Does SGDRegressor work with accept_large_sparse=True? If so we should just make it the default, there is no need to add an estimator parameter for this in general.

@jnothman
Copy link
Member

jnothman commented Aug 5, 2020

Please add a test with a 64bit sparse matrix. I don't think it's this simple, but I'd happily be proven wrong

@cmarmo
Copy link
Contributor

cmarmo commented Aug 15, 2020

I think this PR partially addresses #11355

Base automatically changed from master to main January 22, 2021 10:52
@nkulkarni
Copy link
Author

nkulkarni commented Jul 27, 2022

@jnothman @cmarmo @rth sorry for going AWOL on this -- I can pick this up during this week! is this still something we'd want to work on?

@thomasjpfan
Copy link
Member

thomasjpfan commented Jul 27, 2022

I do not think SGDRegressor can support large sparse matrices out of the box. Currently this PR, lets accept_large_sparse through, but once it gets into the Cython code here:

cdef int *x_ind_ptr = NULL

The dtype of x_ind_ptr is fixed to int, which is usually 32 bits. Also SequentialDataset, would also need to be updated to template out the pointer which also uses int* for the x_ind_ptr: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_seq_dataset.pxd.tp

TLDR: I think this is a hard task that involves going through the Cython files and making sure that int64 indices work with templating or fused types.

@nkulkarni
Copy link
Author

@thomasjpfan makes sense -- I guess it would be one thing if we'd just need to make x_ind_ptr a 64bit, but looks like there's a lot more to do.

admittedly I don't have a ton of time to work on this so happy to keep slowly playing with this PR for a bit ... but I don't think it's something I have a lot of time to dedicate to closing out, so feel free to close this

@thomasjpfan
Copy link
Member

makes sense -- I guess it would be one thing if we'd just need to make x_ind_ptr a 64bit, but looks like there's a lot more to do.

Most of the complication comes from needing to support 32bit and 64bit indices.

but I don't think it's something I have a lot of time to dedicate to closing out, so feel free to close this

I'm okay with closing. Thank you for working on this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants