Upgrade DictVectorizer to use 64 bit indices instead of 32 bit #18403

@E-Aho

Describe the workflow you want to enable

Currently, the _transform function in DictVectorizer uses 32-bit indices, which caps the resulting matrix at roughly 2.1 billion (2^31 - 1) rows or columns.

This issue proposes switching those indices to 64-bit values so that DictVectorizer can be used on larger datasets.
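
To illustrate where the ceiling comes from, here is a minimal sketch using only the standard library `array` module. It is not the scikit-learn code itself, and it assumes a platform where a C `int` is 4 bytes, which holds on mainstream builds:

```python
from array import array

# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1

# array("i") stores C signed ints (assumed to be 4 bytes here).
indices = array("i")
indices.append(INT32_MAX)  # the largest index that still fits

try:
    indices.append(INT32_MAX + 1)  # one past the 32-bit limit
    overflowed = False
except OverflowError:
    overflowed = True

print(indices.itemsize, overflowed)
```

Any index past that limit cannot be stored at all, so the failure is an error rather than a silent wraparound in this sketch.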

Describe your proposed solution

Update the dtypes for the indices in _dict_vectorizer.py as follows:

  • from np.int32 to np.int64
  • from np.intc to np.int_
  • from array("i") to array("l"), to get a signed long instead of a signed int (64 bit on most Linux and macOS builds, though the width of a C long is platform dependent)
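
As a quick way to check the last item on a given machine, the sketch below inspects the width of each `array` typecode. The fallback to `"q"` (signed long long, at least 8 bytes per the Python documentation) is my own assumption for platforms where `"l"` is only 32 bits; it is not part of the proposal, and `itemsize_bits` and `index_typecode` are hypothetical names:

```python
from array import array


def itemsize_bits(typecode):
    """Return the width in bits of one element of array(typecode)."""
    return array(typecode).itemsize * 8


# Width of signed int, signed long, and signed long long on this platform.
sizes = {code: itemsize_bits(code) for code in ("i", "l", "q")}

# Hypothetical choice: prefer "l" when it is already 64 bits wide,
# otherwise fall back to "q", which is at least 64 bits everywhere.
index_typecode = "l" if sizes["l"] >= 64 else "q"

# A value just past the signed 32-bit range now fits.
big = array(index_typecode, [2**31])

print(sizes, index_typecode, big[0])
```

The printed sizes differ between platforms, which is why no expected output is shown here.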
