This repository was archived by the owner on Apr 8, 2025. It is now read-only.

S3E pooling of embeddings #286

Merged
tholor merged 37 commits into master from s3e_pooling
Apr 27, 2020

@tholor
Member

@tholor tholor commented Mar 19, 2020

Implementing a new pooling method for embeddings, as proposed by Wang et al. in their paper "Efficient Sentence Embedding via Semantic Subspace Analysis".

It could be a useful alternative to sentence embeddings generated via naive pooling such as reduce_mean.
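
For contrast, the naive mean pooling mentioned above can be sketched roughly as follows. This is a plain-Python illustration, not FARM's implementation; `mean_pool` and its arguments are hypothetical names chosen for this example.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              padding_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, padding_mask) if keep]
    if not kept:
        raise ValueError("cannot pool a sentence without unmasked tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens plus one padding position (hypothetical toy data).
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)  # [2.0, 3.0]
```

Every token contributes equally here, which is the main thing a corpus-fitted pooling method such as S3E tries to improve on.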

@tholor tholor added the "enhancement" (New feature or request) and "part: inference" (Inferencer) labels Mar 19, 2020
@tholor tholor self-assigned this Mar 19, 2020
@tholor tholor changed the title WIP S3E pooling WIP S3E pooling of embeddings Mar 19, 2020
@tholor tholor changed the title WIP S3E pooling of embeddings S3E pooling of embeddings Apr 24, 2020
@tholor
Member Author

tholor commented Apr 24, 2020

How to use it?

  • Call fit_s3_on_corpus() to fit S3E on your corpus (or a subset of it).
  • Save the results, or pass them directly into the Inferencer to get your sentence embeddings:
    inferencer = Inferencer(model=model, processor=processor, task_type="embeddings",
                            gpu=use_gpu, batch_size=batch_size,
                            extraction_strategy="s3e", extraction_layer=-1,
                            s3e_stats=s3e_stats)
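
A minimal sketch of the "save results" step, assuming the fitted statistics are an ordinary picklable Python object. The dictionary below is a made-up placeholder; the real structure of `s3e_stats` and the file name are assumptions, not taken from this PR.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder for whatever the fitting step returns; these keys are made up.
s3e_stats = {"centroids": [[0.1, 0.2], [0.3, 0.4]], "token_weights": {"farm": 0.5}}

with tempfile.TemporaryDirectory() as tmp_dir:
    stats_path = Path(tmp_dir) / "s3e_stats.pkl"

    # Save once after fitting ...
    with open(stats_path, "wb") as f:
        pickle.dump(s3e_stats, f)

    # ... and reload later, e.g. right before creating the Inferencer.
    with open(stats_path, "rb") as f:
        loaded_stats = pickle.load(f)
```

Persisting the stats this way means the comparatively slow fitting step only has to run once per corpus.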

See example script

Known limitations:

  • Speed of fit_s3_on_corpus:
    -- no multiprocessing for tokenization during fitting
    -- the original implementation for removing PCA components is slow and has not been optimized yet
  • Minor differences from the original method (handling of UNK tokens)
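
To illustrate what "removing PCA components" means, here is a rough plain-Python sketch that estimates the single dominant direction of a set of vectors and projects it out. It is not the code from this PR or from the original S3E implementation; `top_direction` and `remove_direction` are hypothetical helpers, and the sketch assumes the vectors are already mean-centered.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_direction(vectors: List[List[float]], n_iter: int = 100) -> List[float]:
    """Estimate the dominant direction via power iteration on X^T X."""
    dim = len(vectors[0])
    u = [1.0 / math.sqrt(dim)] * dim  # arbitrary unit-length start vector
    for _ in range(n_iter):
        # Compute X^T (X u) without materialising X^T X.
        projections = [dot(x, u) for x in vectors]
        new_u = [sum(p * x[i] for p, x in zip(projections, vectors))
                 for i in range(dim)]
        norm = math.sqrt(dot(new_u, new_u))
        if norm == 0.0:
            break  # start vector was orthogonal to the data; keep previous u
        u = [v / norm for v in new_u]
    return u


def remove_direction(vectors: List[List[float]], u: List[float]) -> List[List[float]]:
    """Subtract from each vector its projection onto the unit vector u."""
    return [[x_i - dot(x, u) * u_i for x_i, u_i in zip(x, u)] for x in vectors]


# Toy, already mean-centered embeddings that vary mostly along the first axis.
embeddings = [[2.0, 0.1], [1.0, -0.2], [-3.0, 0.1]]
u = top_direction(embeddings)
cleaned = remove_direction(embeddings, u)
```

A vectorized version (for example an SVD over the whole embedding matrix) would be the obvious route to speeding this step up.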

@tholor tholor requested a review from Timoeller April 24, 2020 14:04
Contributor

@Timoeller Timoeller left a comment

Looking good. The PR only really touches existing FARM code in Language_Model.formatted_preds, and your additions do not seem to change old behaviour.

I will test saving with larger word embedding models and get back to you.

@tholor tholor merged commit f8d0744 into master Apr 27, 2020
@tholor tholor deleted the s3e_pooling branch April 28, 2020 07:30