[BC breaking] Add Sentencepiece torchscript Extension #755
mthrok merged 1 commit into pytorch:master
```
# Extension info
ext_modules=setup_helpers.get_ext_modules(),
cmdclass={
    'build_ext': setup_helpers.BuildExtension.with_options(no_python_abi_suffix=True),
```
Why is this abi option necessary?
Also, I'd add a comment that references the issue around dropping symbols we recently referenced (I can dig it up if this doesn't ring a bell).
Ah, it's not strictly necessary. It's a convention I carried over from torchvision (the `_C.so` naming).
I will remove it.
Actually, it turned out that we need to provide the full path to the `.so` file, so removing the ABI suffix makes it easier to locate the library file.
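As a concrete illustration (a minimal sketch; the `_torchtext` name comes from this PR, and the exact suffix varies by platform and Python build), the ABI suffix is what turns a short, predictable filename into one that must be computed at runtime:

```python
import sysconfig

# The default setuptools build appends the Python ABI tag to the extension
# filename, e.g. "_torchtext.cpython-38-x86_64-linux-gnu.so".
abi_suffix = sysconfig.get_config_var("EXT_SUFFIX")
with_abi = "_torchtext" + abi_suffix

# With no_python_abi_suffix=True the file is always "_torchtext.so", so code
# that must pass the full path of the library can construct it without
# querying the interpreter's ABI tag.
without_abi = "_torchtext.so"

print(with_abi)
print(without_abi)
```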
```
public:
  std::string content_;
  explicit SentencePiece(const std::string &content) : content_(content) {
```
This could be changed to take a ::sentencepiece:: via move constructor and then effectively represent a shell that forwards methods to the underlying object.
Do you need this string constructor? Does this match previous behavior? Usually I'd make this a separate factory function and then construct a SentencePiece object that is initialized with whatever processor_.LoadFromSerializedProto(content_); returns.
This class has to provide a serialization mechanism so that a scripted module can be saved.
Ideally, the underlying SentencePieceProcessor class would provide such a mechanism, but it does not expose an interface to serialize the model.
Note 1. The lifecycle of the official SentencePieceProcessor is: 1. train and save to a file, then 2. load from the saved file. It has no functionality to save a loaded model.
Note 2. If you look at the source code of SentencePieceProcessor, it almost looks like it provides a serialization interface, but ModelProto and all the protobuf-related functionality are not present in the library file.
To work around this, we need to keep the content of the saved model file (the model data as a serialized protobuf) when loading a model from a file.
So, it is simpler to pass the model content as a string to the constructor and instantiate the underlying SentencePieceProcessor there, because you still need to pass the model content separately for the later serialization.
The serialization here is necessary to support pickle? As far as I know, that's what we use to save models. We could look into adding custom pickle support, as pybind11 does. This might also enable us, if there's a need, to apply compression algorithms on the fly and save a smaller model.
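To make the workaround above concrete, here is a minimal Python sketch of the idea (using a fake processor class, since the real `SentencePieceProcessor` cannot re-serialize a loaded model): the wrapper keeps the raw serialized-proto bytes alongside the processor, and serialization simply stores those bytes.

```python
import pickle


class _FakeProcessor:
    """Hypothetical stand-in for sentencepiece::SentencePieceProcessor."""
    def LoadFromSerializedProto(self, content):
        self.loaded = content


class SentencePiece:
    """Keeps the serialized model content so the object can be saved again,
    since the underlying processor offers no way to dump a loaded model."""
    def __init__(self, content):
        self.content_ = content            # retained purely for serialization
        self.processor_ = _FakeProcessor()
        self.processor_.LoadFromSerializedProto(content)

    def __getstate__(self):
        return self.content_               # store only the raw model bytes

    def __setstate__(self, content):
        self.__init__(content)             # rebuild the processor on load


model = SentencePiece(b"serialized-model-proto")
restored = pickle.loads(pickle.dumps(model))
print(restored.content_)
```

The same shape applies whether the state is saved via pickle or via TorchScript's serialization hooks: only the original model bytes need to survive the round trip.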
```
std::vector<std::string> Encode(const std::string &input) const {
```
If you want, you could play with iterables here: pass an object that yields strings and return an object that yields iterators over strings. That could be quite powerful for avoiding the overhead of writing out a lot of text during processing.
Just a reminder: iterators are not supported by TorchScript, but we expect these transforms to be jitable.
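For illustration, a hedged sketch of the distinction (plain Python, no TorchScript involved; `encode_iter` and `encode_list` are hypothetical names, and the whitespace split is a stand-in for real tokenization):

```python
from typing import Iterator, List


def encode_iter(text: str) -> Iterator[str]:
    """Generator style: lazy, but generators are not scriptable."""
    for piece in text.split():
        yield piece


def encode_list(text: str) -> List[str]:
    """Eager style: returns List[str], a type TorchScript does support."""
    return text.split()


print(list(encode_iter("hello world")))
print(encode_list("hello world"))
```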
Addressed your feedback.
cpuhrsch left a comment:
Looks good, but please document the BC-breaking changes and changed semantics so they can be referenced.
Also, maybe it makes sense to wait for `__len__` and `__getitem__` to land.
@cpuhrsch I squashed the commits and added the list of methods available before and after the change to the commit message. EDIT: also added the same message to this PR description.
I prefer to merge this one first, then work on that separately.
Ok, sounds good. Please add a [BC breaking] tag and then this should be good to go. Also, please do an import into fbcode after this.
This PR adds a `torchscript` extension, `_torchtext.so`, which contains a simple interface to `SentencePiece`.
- SentencePiece `v0.1.86` is used.
- `libsentencepiece.a` is built right before `_torchtext.so` is compiled.
The logic for triggering this build from `setuptools` can be found under `build_tools/setup_helpers`.
- `_torchtext.so` provides an interface to train a SentencePiece model and to load a model from a file.
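The build ordering described above can be sketched as a `build_ext` subclass (a loose sketch, not the actual `build_tools/setup_helpers` code; `_build_sentencepiece` is a hypothetical hook that would shell out to the real static-library build):

```python
from setuptools.command.build_ext import build_ext
from setuptools.dist import Distribution


class BuildExtension(build_ext):
    """Builds libsentencepiece.a immediately before compiling the extension."""

    def build_extensions(self):
        self._build_sentencepiece()      # produce the static lib first...
        super().build_extensions()       # ...then compile/link _torchtext.so

    def _build_sentencepiece(self):
        # Placeholder: the real helper would run cmake/make for SentencePiece.
        self.built_third_party = True


cmd = BuildExtension(Distribution())
cmd._build_sentencepiece()
print(cmd.built_third_party)
```

Hooking the third-party build into `build_extensions` keeps the ordering guarantee inside `setuptools` itself, so `pip install` needs no separate pre-build step.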
Breaking change:
Previously, `torchtext.data.functional.load_sp_model` returned a `sentencepiece.SentencePieceProcessor` object, which supported the following methods, in addition to `__len__` and `__getitem__`:
```
$ grep '$self->' third_party/sentencepiece/python/sentencepiece.i
return $self->Load(filename);
return $self->LoadFromSerializedProto(filename);
return $self->SetEncodeExtraOptions(extra_option);
return $self->SetDecodeExtraOptions(extra_option);
return $self->SetVocabulary(valid_vocab);
return $self->ResetVocabulary();
return $self->LoadVocabulary(filename, threshold);
return $self->EncodeAsPieces(input);
return $self->EncodeAsIds(input);
return $self->NBestEncodeAsPieces(input, nbest_size);
return $self->NBestEncodeAsIds(input, nbest_size);
return $self->SampleEncodeAsPieces(input, nbest_size, alpha);
return $self->SampleEncodeAsIds(input, nbest_size, alpha);
return $self->DecodePieces(input);
return $self->DecodeIds(input);
return $self->EncodeAsSerializedProto(input);
return $self->SampleEncodeAsSerializedProto(input, nbest_size, alpha);
return $self->NBestEncodeAsSerializedProto(input, nbest_size);
return $self->DecodePiecesAsSerializedProto(pieces);
return $self->DecodeIdsAsSerializedProto(ids);
return $self->GetPieceSize();
return $self->PieceToId(piece);
return $self->IdToPiece(id);
return $self->GetScore(id);
return $self->IsUnused(id);
return $self->IsControl(id);
return $self->IsUnused(id);
return $self->GetPieceSize();
return $self->PieceToId(key);
```
The new C++ Extension provides the following methods
```
Encode(input)
EncodeAsIds(input)
EncodeAsPieces(input)
```
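To make the reduced surface concrete, here is a stub with the same three-method shape (a hypothetical Python stand-in; the tokenization is a fake whitespace split and the vocab lookup is illustrative, not real SentencePiece segmentation):

```python
from typing import List


class SentencePieceStub:
    """Mimics the new extension's surface: only the three Encode methods."""

    def __init__(self, vocab: List[str]):
        self._vocab = vocab

    def Encode(self, input: str) -> List[str]:
        return input.split()             # fake segmentation

    def EncodeAsPieces(self, input: str) -> List[str]:
        return self.Encode(input)

    def EncodeAsIds(self, input: str) -> List[int]:
        return [self._vocab.index(p) for p in self.Encode(input)]


sp = SentencePieceStub(["hello", "world"])
print(sp.EncodeAsPieces("hello world"))
print(sp.EncodeAsIds("hello world"))
```

Code that relied on any of the removed methods (vocabulary lookups, sampling, proto round-trips) has no equivalent in the new extension and must be migrated.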
We need to update the README file with instructions for building torchtext from the master branch.
This PR adds:
- A `torchscript` extension, `_torchtext.so`, which contains a simple interface to `SentencePiece`.
  - SentencePiece `v0.1.86` is used.
  - `libsentencepiece.a` is built right before `_torchtext.so` is compiled. The logic for triggering this build from `setuptools` can be found under `build_tools/setup_helpers`.
  - `_torchtext.so` provides an interface to train a SentencePiece model and to load a model from a file.
- The `SentencePiece` class has the methods `Encode`, `EncodeAsIds`, and `EncodeAsPieces`, and is pickle-able (TODO: add a pickle/un-pickle test). This is not for simple pickling, but for `torchscript` serialization.
- The `SentencePiece` class replaces the original `SentencePieceProcessor` class, which is the official Python binding generated with Swig.
  - For `torchscript`, our custom `SentencePiece` class does not implement the special methods `__len__` and `__getitem__`.
  - The other methods of `SentencePieceProcessor` are not implemented.
- `generate_sp_model` and `load_sp_model` are replaced with equivalent C++ implementations that handle our custom `SentencePiece` class. `load_sp_model` was previously returning `sentencepiece.SentencePieceProcessor`. (BC-breaking.)
- `sentencepiece_tokenizer` and `sentencepiece_numericalizer` are not yet backed by the C++ extension, as the C++ extension does not handle generators.