Integrate Bert-like model on Flax runtime. #3722
Conversation
LysandreJik
left a comment
As I understand it, Flax is a neural-network library built on top of Jax, which brings automatic differentiation to Python/NumPy operations.
Do you mind walking me through why we use both here, and, more precisely, why we have a JaxPreTrainedModel as a parent of FlaxXXX models? Is the JaxPreTrainedModel's purpose to accommodate models built on top of it from different Jax-based libraries, or is it that it only depends on Jax operations, so there is no need for it to be Flax-based?
Other than that, it looks like a great first approach! I guess we will need to add a few features as we go, but it's impressive that you got it working so fast, with the same output between PT/TF/Flax!
# Models are loaded from Pytorch checkpoints
BERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
    "bert-base-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin",
Really nice to be able to load models from their PyTorch checkpoints
def __init__(self, config: BertConfig, state: dict, **kwargs):
    self.config = config
    self.key = PRNGKey(0)
What is this used for?
Its usage is mainly related to stochastic operations, such as Dropout, which I didn't put into the model for now but might push soon.
As you said, Jax is a library that interacts with numpy to provide additional features: autodiff, auto-vectorization (vmap) and auto-parallelization (pmap). Jax is essentially stateless, which is reflected here in the fact that the function to differentiate (the model) doesn't hold the parameters. They have to be referenced somewhere else and fed in somehow.
In that regard, @madisonmay is currently working on a Haiku Bert integration in transformers. My hope is to be able to share as many things as possible between the two implementations (but I can't be sure for now).
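For readers unfamiliar with this statelessness, here is a minimal, hypothetical sketch (the `predict` function and parameter pytree are illustrative, not the PR's actual code): the function being differentiated receives its parameters explicitly, and stochastic operations such as dropout consume an explicit PRNGKey.

```python
import jax
import jax.numpy as jnp
from jax.random import PRNGKey

def predict(params, inputs, dropout_key):
    # parameters are passed in, never stored on the function itself
    hidden = jnp.dot(inputs, params["kernel"]) + params["bias"]
    # dropout needs an explicit source of randomness
    keep = jax.random.bernoulli(dropout_key, p=0.9, shape=hidden.shape)
    return jnp.where(keep, hidden / 0.9, 0.0)

key = PRNGKey(0)
params = {"kernel": jnp.ones((4, 2)), "bias": jnp.zeros((2,))}
inputs = jnp.ones((1, 4))

# autodiff is taken w.r.t. the explicitly passed parameter pytree
grads = jax.grad(lambda p: predict(p, inputs, key).sum())(params)
```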
Alright, that makes sense. Thanks for the explanation.
LysandreJik
left a comment
Cool, it's starting to look really good! I just have a few questions regarding conversion/model architecture.
thomwolf
left a comment
OK, this looks really great (Flax is quite pleasant to read).
Added a couple of remarks and questions; happy to discuss them.
src/transformers/file_utils.py
Outdated
USE_TF = os.environ.get("USE_TF", "AUTO").upper()
USE_TORCH = os.environ.get("USE_TORCH", "AUTO").upper()
- if USE_TORCH in ("1", "ON", "YES", "AUTO") and USE_TF not in ("1", "ON", "YES"):
+ if USE_TORCH in ENV_VARS_TRUE_VALUES and USE_TF not in ("1", "ON", "YES"):
Why not the same for the end of the line? (USE_TF)
Actually, why is _torch_available dependent on USE_TF?
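A small, hypothetical sketch of the symmetric form these review comments seem to be asking for (the exact contents of ENV_VARS_TRUE_VALUES in the PR may differ):

```python
import os

# Assumed constant; in the PR it replaces the literal ("1", "ON", "YES", "AUTO").
ENV_VARS_TRUE_VALUES = {"1", "ON", "YES", "AUTO"}

USE_TF = os.environ.get("USE_TF", "AUTO").upper()
USE_TORCH = os.environ.get("USE_TORCH", "AUTO").upper()

# Both sides of each condition reuse the same constant instead of a hard-coded
# tuple; "AUTO" is excluded on the right-hand side to keep the original meaning.
_torch_candidate = USE_TORCH in ENV_VARS_TRUE_VALUES and USE_TF not in ENV_VARS_TRUE_VALUES - {"AUTO"}
_tf_candidate = USE_TF in ENV_VARS_TRUE_VALUES and USE_TORCH not in ENV_VARS_TRUE_VALUES - {"AUTO"}
```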
src/transformers/file_utils.py
Outdated
USE_TORCH = os.environ.get("USE_TORCH", "AUTO").upper()

- if USE_TF in ("1", "ON", "YES", "AUTO") and USE_TORCH not in ("1", "ON", "YES"):
+ if USE_TF in ENV_VARS_TRUE_VALUES and USE_TORCH not in ("1", "ON", "YES"):
Here as well, why not the same for the end of the line?
Same here, why do we test USE_TORCH for _tf_available?
config (:class:`~transformers.PretrainedConfig`):
    The model class to instantiate is selected based on the configuration class:
    - isInstance of `roberta` configuration class: :class:`~transformers.RobertaModel` (RoBERTa model)
FlaxRobertaModel
The model class to instantiate is selected based on the configuration class:
    - isInstance of `roberta` configuration class: :class:`~transformers.RobertaModel` (RoBERTa model)
    - isInstance of `bert` configuration class: :class:`~transformers.BertModel` (Bert model)
FlaxBertModel
self._module = module

# Those are public as their type is generic to every derived classes.
self.key = PRNGKey(0)
I think we should have the PRNGKey seed exposed as a model arg so that users can have (and control) several weight-initialization seeds.
+1. I think it can be really useful to be able to configure the random seed.
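A hypothetical sketch of what exposing the seed could look like (class and argument names are illustrative, not the PR's final API):

```python
from jax.random import PRNGKey

class FlaxPreTrainedSketch:
    def __init__(self, config, module, state, seed: int = 0):
        self._config = config
        self._module = module
        self.params = state
        # the seed is now a constructor argument instead of a hard-coded 0,
        # so users can control weight-initialization / dropout randomness
        self.key = PRNGKey(seed)
```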
super().__init__(config, model_def, state)

@property
def module(self) -> BertModel:
BertModel or FlaxBertModel?
@property
def config(self) -> BertConfig:
    return self._config
Shouldn't these module and config properties be in the base class?
Yes, that was the initial implementation, but in that case we cannot have the correct return type for the config and model, which are model-dependent.
(nit) We could return a PretrainedConfig for the configuration, which would be complete enough imo. This comment does not apply to the module though.
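A self-contained sketch of the trade-off being discussed (placeholder classes stand in for the real transformers ones): keeping the property in the base class gives the generic PretrainedConfig return type, while a subclass override only narrows the annotation.

```python
class PretrainedConfig:  # placeholder for transformers.PretrainedConfig
    pass

class BertConfig(PretrainedConfig):  # placeholder for transformers.BertConfig
    pass

class FlaxPreTrainedBase:
    def __init__(self, config: PretrainedConfig):
        self._config = config

    @property
    def config(self) -> PretrainedConfig:
        # generic return type shared by every derived class
        return self._config

class FlaxBertLike(FlaxPreTrainedBase):
    @property
    def config(self) -> BertConfig:
        # overridden only to expose the narrower, model-specific type
        return self._config
```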
@property
def config(self) -> RobertaConfig:
    return self._config
Same here, shouldn't this be in the base class?
if token_type_ids is None:
    token_type_ids = np.ones_like(input_ids)

if position_ids is None:
    position_ids = np.arange(
        self.config.pad_token_id + 1,
        np.atleast_2d(input_ids).shape[-1] + self.config.pad_token_id + 1
    )
In the FlaxBertModel these parameters are created outside of the jitted predict().
Any reason it's different here?
Should we standardize on one practice to make it easier to read for the user?
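One possible standardization, sketched here with a placeholder forward pass and an example pad_token_id: build the default position_ids before calling the jitted function and pass them in as regular arguments.

```python
import jax
import numpy as np

pad_token_id = 1  # example value only
input_ids = np.array([[5, 6, 7, 8]])

# defaults are materialized outside the traced/jitted function
seq_len = np.atleast_2d(input_ids).shape[-1]
position_ids = np.arange(pad_token_id + 1, seq_len + pad_token_id + 1)
position_ids = np.broadcast_to(position_ids, input_ids.shape)

@jax.jit
def predict(input_ids, position_ids):
    # placeholder for the real forward pass; it only sees ready-made arrays
    return input_ids + position_ids

out = predict(input_ids, position_ids)
```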
config_class = None
pretrained_model_archive_map = {}
base_model_prefix = ""
model_class = None
What is model_class used for?
model_class is the underlying flax.nn.Module class, which is required by Flax/msgpack to allocate all the buffers.
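A hedged sketch of that idea using today's flax.linen and flax.serialization APIs (the PR itself targets the older flax.nn interface, so this is illustrative only): the module class is needed to build a parameter pytree of the right structure, into which the msgpack bytes are restored.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
from flax import serialization

class TinyModule(nn.Module):
    features: int = 4

    @nn.compact
    def __call__(self, x):
        return nn.Dense(self.features)(x)

module = TinyModule()
dummy = jnp.ones((1, 4))
# init() allocates the buffers and defines the pytree structure
template = module.init(jax.random.PRNGKey(0), dummy)

raw = serialization.to_bytes(template)              # roughly what save_pretrained writes
restored = serialization.from_bytes(template, raw)  # restoring needs the template structure
```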
class BertIntermediate(nn.Module):

    def apply(self, hidden_state, output_size: int):
        # TODO: Had ACT2FN reference to change activation function
nit: typo?
vocab_size: int, hidden_size: int, type_vocab_size: int, max_length: int):

    # Embed
    w_emb = BertEmbedding(jnp.atleast_2d(input_ids.astype('i4')), vocab_size, hidden_size, name="word_embeddings")
nit: maybe split these in 2 lines to make it less wide?
def config(self) -> RobertaConfig:
    return self._config

def __call__(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
How often is this called? Right now it looks like you @jit the predict graph each time the function is called. Is that the intention? Can you define & compile the predict function elsewhere?
Force-pushed from b259000 to 27e9bc5
return self._config

def __call__(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
    @jax.jit
It looks like you should move this outside of __call__: define everything that you want @jax.jit'ed once, so that the compiled version is cached when it is called multiple times.
Right now this gets compiled on each call to __call__, which should cause a lot of compilation overhead.
Yeah, you definitely don't want to use jit inside a function like this, unless perhaps it's only called once anyway.
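A minimal sketch of the suggested fix, with hypothetical names (this is not the PR's actual wrapper): jit a module-level predict function once, with the apply function marked static, so repeated calls hit the compilation cache.

```python
from functools import partial

import jax
import jax.numpy as jnp

@partial(jax.jit, static_argnums=0)
def _predict(apply_fn, params, input_ids):
    # compiled once per (apply_fn, input shapes/dtypes) combination, then cached
    return apply_fn(params, input_ids)

class ModelWrapperSketch:
    def __init__(self, apply_fn, params):
        self.apply_fn = apply_fn
        self.params = params

    def __call__(self, input_ids):
        # no @jax.jit defined here, so nothing is re-traced on every call
        return _predict(self.apply_fn, self.params, jnp.asarray(input_ids))

# toy usage
params = {"scale": jnp.asarray(2.0)}
model = ModelWrapperSketch(lambda p, x: x * p["scale"], params)
print(model(jnp.ones((1, 3))))
```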
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Unstale
Codecov Report
@@ Coverage Diff @@
## master #3722 +/- ##
==========================================
+ Coverage 78.32% 80.88% +2.55%
==========================================
Files 187 165 -22
Lines 37162 30383 -6779
==========================================
- Hits 29107 24575 -4532
+ Misses 8055 5808 -2247
Continue to review full report at Codecov.
Force-pushed from a1bedcd to 23703a5
return jax.lax.tanh(out)


class BertModel(nn.Module):
Comparing your implementation to our lm1b example (https://github.com/google/flax/blob/master/examples/lm1b/models.py), it seems your code contains significantly more nn.Module abstractions. Why did you decide to create them? Is this to make the translation from a PyTorch model easier?
I personally prefer fewer abstractions, since I think it would make the code more concise and easier to digest, but I'd be interested in hearing what your thoughts are!
As you guessed, I chose to follow what is already widely adopted in the library regarding module fragmentation.
This is something our users welcome, because they like how easy it is to tweak one module. It also makes an almost 1-to-1 match with both the PyTorch and TensorFlow implementations, so it might be easier for users who would like to give it a try to make the move.
Does that make sense?
BERT implementation using JAX/Flax as backend
"""

model_class = BertModel
I personally think the names FlaxBertModel and BertModel may be a bit confusing, since it seems BertModel is in fact an nn.Module and in that sense more like a Flax model than the other one. Perhaps you could rename BertModel to BertModule, to clarify that this is in fact the Flax module and the other one is a wrapper around it?
cc @levskaya
src/transformers/file_utils.py
Outdated
from jax.config import config
# TODO(marcvanzee): Flax Linen requires JAX omnistaging. Remove this
# once JAX enables it by default.
config.enable_omnistaging()
If we cut a new release of Flax and pin to newer jax/flax versions, this is no longer necessary, as it's now (recently) the default.
Agreed, I'll add a suggestion.
marcvanzee
left a comment
Looks great!!
LysandreJik
left a comment
Great, thanks for iterating @mfuntowicz!
It looks like a file is missing: Shouldn't the CI have caught this?
Looks like a problem in
Nope, both run the same sub-target: This is with the latest master.
PR with the fix: #7914. The question is: why didn't the CI fail? It reports no problem here. Once I got this fixed, 2 more issues came up; fixed in the same PR.
Unless this is actually a problem, this adds `modeling_flax_utils` to the ignore list; otherwise it currently expects a 'tests/test_modeling_flax_utils.py' to exist for it. For context please see: huggingface#3722 (comment)
* [flax] fix repo_check: unless this is actually a problem, this adds `modeling_flax_utils` to the ignore list; otherwise it currently expects a 'tests/test_modeling_flax_utils.py' to exist for it. For context please see: #3722 (comment)
* fix 2 more issues
* merge #7919
This Pull Request attempts to bring support for the Flax framework to transformers.
The main focus has been put on providing BERT-like models, principally by making it possible to load PyTorch checkpoints and doing the (few) necessary conversions directly on the fly. It also supports loading a msgpack-formatted file produced by Flax.
save_pretrained will save the model in msgpack format to avoid a dependency on torch inside Jax code. Targeted models:
If not too hard
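For illustration, here is a hedged usage sketch of the workflow described above, assuming the checkpoint name and call pattern of the eventual API (details may differ from the exact version in this PR): weights are loaded from the PyTorch checkpoint and converted on the fly, then saved back in msgpack format.

```python
from transformers import BertTokenizer, FlaxBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# loads the PyTorch checkpoint and converts the weights on the fly
model = FlaxBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, Flax!", return_tensors="np")
outputs = model(**inputs)

# writes a msgpack-serialized checkpoint, with no torch dependency
model.save_pretrained("./bert-base-uncased-flax")
```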