[BC breaking] Add Sentencepiece torchscript Extension #755
mthrok merged 1 commit into pytorch:master
```
# Extension info
ext_modules=setup_helpers.get_ext_modules(),
cmdclass={
    'build_ext': setup_helpers.BuildExtension.with_options(no_python_abi_suffix=True),
```
Why is this abi option necessary?
Also, I'd add a comment that references the issue around dropping symbols we recently referenced (I can dig it up if this doesn't ring a bell).
Ah, it's not strictly necessary. It's a convention I carried over from torchvision (the `_C.so` naming).
I will remove it.
Actually, it turned out that we need to provide the full path to the `.so` file, so removing the ABI suffix makes it easier to locate the library file.
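As a concrete illustration (a minimal sketch; the `_torchtext` name comes from this PR, and the exact suffix varies by platform and Python build), the ABI suffix is what turns a short, predictable filename into one that must be computed at runtime:

```python
import sysconfig

# The default setuptools build appends the Python ABI tag to the extension
# filename, e.g. "_torchtext.cpython-38-x86_64-linux-gnu.so".
abi_suffix = sysconfig.get_config_var("EXT_SUFFIX")
with_abi = "_torchtext" + abi_suffix

# With no_python_abi_suffix=True the file is always "_torchtext.so", so code
# that must pass the full path of the library can construct it without
# querying the interpreter's ABI tag.
without_abi = "_torchtext.so"

print(with_abi)
print(without_abi)
```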
```
public:
  std::string content_;
  explicit SentencePiece(const std::string &content) : content_(content) {
```
This could be changed to take a ::sentencepiece:: via move constructor and then effectively represent a shell that forwards methods to the underlying object.
Do you need this string constructor? Does this match previous behavior? Usually I'd make this a separate factory function and then construct a SentencePiece object that is initialized with whatever processor_.LoadFromSerializedProto(content_); returns.
This class has to provide a serialization mechanism so that a scripted module can be saved.
Ideally, the underlying SentencePieceProcessor class would provide such a mechanism, but it does not expose an interface to serialize the model.
Note 1. The lifecycle of the official SentencePieceProcessor is: 1. train and save to a file, then 2. load from the saved file. It has no functionality to save a loaded model.
Note 2. If you look at the source code of SentencePieceProcessor, it almost looks like it provides a serialization interface, but ModelProto and all the protobuf-related functionality are not present in the library file.
To work around this, we need to keep the content of the saved model file (the model data as a serialized protobuf) when loading a model from a file.
So, it is simpler to pass the model content as a string to the constructor and instantiate the underlying SentencePieceProcessor there, because you still need to pass the model content separately for the later serialization.
The serialization here is necessary to support pickle? As far as I know, that's what we use to save models. We could look into adding custom pickle support, as pybind11 does. This might also enable us, if there's a need, to apply compression algorithms on the fly and save a smaller model.
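To make the workaround above concrete, here is a minimal Python sketch of the idea (using a fake processor class, since the real `SentencePieceProcessor` cannot re-serialize a loaded model): the wrapper keeps the raw serialized-proto bytes alongside the processor, and serialization simply stores those bytes.

```python
import pickle


class _FakeProcessor:
    """Hypothetical stand-in for sentencepiece::SentencePieceProcessor."""
    def LoadFromSerializedProto(self, content):
        self.loaded = content


class SentencePiece:
    """Keeps the serialized model content so the object can be saved again,
    since the underlying processor offers no way to dump a loaded model."""
    def __init__(self, content):
        self.content_ = content            # retained purely for serialization
        self.processor_ = _FakeProcessor()
        self.processor_.LoadFromSerializedProto(content)

    def __getstate__(self):
        return self.content_               # store only the raw model bytes

    def __setstate__(self, content):
        self.__init__(content)             # rebuild the processor on load


model = SentencePiece(b"serialized-model-proto")
restored = pickle.loads(pickle.dumps(model))
print(restored.content_)
```

The same shape applies whether the state is saved via pickle or via TorchScript's serialization hooks: only the original model bytes need to survive the round trip.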
```
std::vector<std::string> Encode(const std::string &input) const {
```
If you want, you could play with iterables here: pass an object that yields strings and return an object that yields iterators over strings. That could be quite powerful for avoiding the overhead of writing out a lot of text during processing.
Just a reminder: iterators are not supported by TorchScript, but we expect these transforms to be jitable.
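For illustration, a hedged sketch of the distinction (plain Python, no TorchScript involved; `encode_iter` and `encode_list` are hypothetical names, and the whitespace split is a stand-in for real tokenization):

```python
from typing import Iterator, List


def encode_iter(text: str) -> Iterator[str]:
    """Generator style: lazy, but generators are not scriptable."""
    for piece in text.split():
        yield piece


def encode_list(text: str) -> List[str]:
    """Eager style: returns List[str], a type TorchScript does support."""
    return text.split()


print(list(encode_iter("hello world")))
print(encode_list("hello world"))
```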
Addressed your feedback.
cpuhrsch left a comment:
Looks good, but please document the BC-breaking changes and changed semantics so they can be referenced.
Also, maybe it makes sense to wait for `__len__` and `__getitem__` to land.
@cpuhrsch I squashed the commits and added the list of methods available before and after the change to the commit message. EDIT: also added the same message to this PR description.
I prefer to merge this one first, then work on that separately.
Ok, sounds good. Please add a [BC breaking] tag and then this should be good to go. Also, please do an import into fbcode after this.
This PR adds a `torchscript` extension, `_torchtext.so`, which contains a simple interface to `SentencePiece`.
- SentencePiece `v0.1.86` is used.
- `libsentencepiece.a` is built right before `_torchtext.so` is compiled.
The logic for triggering this build from `setuptools` can be found under `build_tools/setup_helpers`.
- `_torchtext.so` provides an interface to train a SentencePiece model and to load a model from a file.
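The build ordering described above can be sketched as a `build_ext` subclass (a loose sketch, not the actual `build_tools/setup_helpers` code; `_build_sentencepiece` is a hypothetical hook that would shell out to the real static-library build):

```python
from setuptools.command.build_ext import build_ext
from setuptools.dist import Distribution


class BuildExtension(build_ext):
    """Builds libsentencepiece.a immediately before compiling the extension."""

    def build_extensions(self):
        self._build_sentencepiece()      # produce the static lib first...
        super().build_extensions()       # ...then compile/link _torchtext.so

    def _build_sentencepiece(self):
        # Placeholder: the real helper would run cmake/make for SentencePiece.
        self.built_third_party = True


cmd = BuildExtension(Distribution())
cmd._build_sentencepiece()
print(cmd.built_third_party)
```

Hooking the third-party build into `build_extensions` keeps the ordering guarantee inside `setuptools` itself, so `pip install` needs no separate pre-build step.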
Breaking change:
Previously, `torchtext.data.functional.load_sp_model` returned a `sentencepiece.SentencePieceProcessor` object, which supported the following methods, in addition to `__len__` and `__getitem__`:
```
$ grep '$self->' third_party/sentencepiece/python/sentencepiece.i
return $self->Load(filename);
return $self->LoadFromSerializedProto(filename);
return $self->SetEncodeExtraOptions(extra_option);
return $self->SetDecodeExtraOptions(extra_option);
return $self->SetVocabulary(valid_vocab);
return $self->ResetVocabulary();
return $self->LoadVocabulary(filename, threshold);
return $self->EncodeAsPieces(input);
return $self->EncodeAsIds(input);
return $self->NBestEncodeAsPieces(input, nbest_size);
return $self->NBestEncodeAsIds(input, nbest_size);
return $self->SampleEncodeAsPieces(input, nbest_size, alpha);
return $self->SampleEncodeAsIds(input, nbest_size, alpha);
return $self->DecodePieces(input);
return $self->DecodeIds(input);
return $self->EncodeAsSerializedProto(input);
return $self->SampleEncodeAsSerializedProto(input, nbest_size, alpha);
return $self->NBestEncodeAsSerializedProto(input, nbest_size);
return $self->DecodePiecesAsSerializedProto(pieces);
return $self->DecodeIdsAsSerializedProto(ids);
return $self->GetPieceSize();
return $self->PieceToId(piece);
return $self->IdToPiece(id);
return $self->GetScore(id);
return $self->IsUnused(id);
return $self->IsControl(id);
return $self->IsUnused(id);
return $self->GetPieceSize();
return $self->PieceToId(key);
```
The new C++ Extension provides the following methods
```
Encode(input)
EncodeAsIds(input)
EncodeAsPieces(input)
```
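To make the reduced surface concrete, here is a stub with the same three-method shape (a hypothetical Python stand-in; the tokenization is a fake whitespace split and the vocab lookup is illustrative, not real SentencePiece segmentation):

```python
from typing import List


class SentencePieceStub:
    """Mimics the new extension's surface: only the three Encode methods."""

    def __init__(self, vocab: List[str]):
        self._vocab = vocab

    def Encode(self, input: str) -> List[str]:
        return input.split()             # fake segmentation

    def EncodeAsPieces(self, input: str) -> List[str]:
        return self.Encode(input)

    def EncodeAsIds(self, input: str) -> List[int]:
        return [self._vocab.index(p) for p in self.Encode(input)]


sp = SentencePieceStub(["hello", "world"])
print(sp.EncodeAsPieces("hello world"))
print(sp.EncodeAsIds("hello world"))
```

Code that relied on any of the removed methods (vocabulary lookups, sampling, proto round-trips) has no equivalent in the new extension and must be migrated.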
We need to update the README file with instructions for building torchtext from the master branch.
This PR adds:
- A `torchscript` extension, `_torchtext.so`, which contains a simple interface to `SentencePiece`.
  - SentencePiece `v0.1.86` is used.
  - `libsentencepiece.a` is built right before `_torchtext.so` is compiled. The logic for triggering this build from `setuptools` can be found under `build_tools/setup_helpers`.
  - `_torchtext.so` provides an interface to train a SentencePiece model and to load a model from a file.
- The `SentencePiece` class has the methods `Encode`, `EncodeAsIds`, and `EncodeAsPieces`, and is pickle-able (TODO: add a pickle/un-pickle test). This is not for simple pickling, but for `torchscript` serialization.
- The `SentencePiece` class replaces the original `SentencePieceProcessor` class, which is the official Python binding generated with Swig.
  - For `torchscript`, our custom `SentencePiece` class does not implement the special methods `__len__` and `__getitem__`.
  - The other methods of `SentencePieceProcessor` are not implemented.
- `generate_sp_model` and `load_sp_model` are replaced with equivalent C++ implementations that handle our custom `SentencePiece` class. `load_sp_model` was previously returning `sentencepiece.SentencePieceProcessor`. (BC-breaking.)
- `sentencepiece_tokenizer` and `sentencepiece_numericalizer` are not yet backed by the C++ extension, as the C++ extension does not handle generators.