
EmbeddingBag to support mini-batches with offsets #93843

@alexshtf

Description


🚀 The feature, motivation and pitch

Currently, the forward method of EmbeddingBag supports only 1D inputs when offsets are passed. Hence, training or inference on mini-batches of data isn't supported with offsets.
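A minimal sketch of the limitation, using made-up toy sizes. The 1D call with offsets works; passing a batched (2D) input together with offsets is rejected by PyTorch:

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes: vocabulary of 10 ids, embedding dim 4.
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="sum")

# 1D input with offsets: three bags starting at positions 0, 2, 4.
flat = torch.tensor([1, 2, 4, 5, 4, 3])
offsets = torch.tensor([0, 2, 4])
out = bag(flat, offsets)  # shape (3, 4): one pooled vector per bag

# A 2D (mini-batched) input together with offsets raises an error:
batched = flat.view(3, 2)
try:
    bag(batched, offsets)
except ValueError:
    pass  # offsets are only supported for 1D input
```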

Offsets are very useful when training on tabular datasets with "multi-valued" cells, such as movie genres, since we may want to sum or average the embeddings associated with several genres into a single vector. Multi-valued cells can also be weighted, for example when the values are produced by an auxiliary model and the weights represent the model's confidence in each prediction; consider, for instance, automatic extraction of movie genres from titles and descriptions.
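A sketch of the weighted multi-valued case in the 1D form that EmbeddingBag supports today, via per_sample_weights (which requires mode="sum"). All ids, sizes, and confidence values below are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical genre vocabulary of 20 ids, pooled into 8-dim vectors.
genre_emb = nn.EmbeddingBag(num_embeddings=20, embedding_dim=8, mode="sum")

# Flattened genre ids for two movies: movie 0 has genres [3, 7],
# movie 1 has genres [1, 7, 9]; offsets mark where each movie starts.
genres = torch.tensor([3, 7, 1, 7, 9])
offsets = torch.tensor([0, 2])

# Confidence scores from an auxiliary genre extractor (made-up values).
confidence = torch.tensor([0.9, 0.6, 0.8, 0.3, 0.5])

# One confidence-weighted sum of genre embeddings per movie.
pooled = genre_emb(genres, offsets, per_sample_weights=confidence)  # (2, 8)
```

The feature request amounts to supporting this same call when genres/offsets describe a mini-batch rather than a single flattened sequence.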

Alternatives

Two possible alternatives:

  1. Use a regular torch.nn.Embedding class: extract the embedding vectors, multiply by the weights manually, and aggregate them. In this case we lose the efficiency of EmbeddingBag, which avoids materializing the full embedding tensor. This idea works only if every mini-batch item has the same number of features.
  2. Use an EmbeddingBag in the model, decompose the mini-batch into its constituent items, and compute the model's output for each item using a for-loop.
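The two workarounds above can be sketched as follows; all sizes and ids are hypothetical, and the loop in alternative 2 is exactly the per-item overhead the feature would remove:

```python
import torch
import torch.nn as nn

D = 8  # hypothetical embedding dim
emb = nn.Embedding(20, D)
bag = nn.EmbeddingBag(20, D, mode="sum")

# Alternative 1: fixed number of features per item -> plain Embedding
# plus a manual weighted reduction. This materializes a (B, n, D)
# intermediate tensor that EmbeddingBag would avoid.
ids = torch.randint(0, 20, (4, 3))          # 4 items, 3 features each
w = torch.rand(4, 3)                        # per-feature weights
out1 = (emb(ids) * w.unsqueeze(-1)).sum(1)  # (4, D)

# Alternative 2: variable-length items -> Python for-loop over the
# mini-batch, calling the bag once per item.
batch = [torch.tensor([3, 7]), torch.tensor([1, 7, 9])]
out2 = torch.cat([bag(item.unsqueeze(0)) for item in batch])  # (2, D)
```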

Additional context

No response

cc @cpuhrsch @jbschlosser @bhosmer @drisspg @mikaylagawarecki


Labels

    enhancement — Not as big of a feature, but technically not a bug. Should be easy to fix
    module: nestedtensor — NestedTensor tag, see issue #25032
    triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
