
Integrate UDify into AllenNLP #5

@Hyperparticle

Description

It would be useful to integrate the UDify model directly into AllenNLP as a PR, as the code merely extends the library to handle a few extra features. Since the release of the UDify code, AllenNLP has also added a multilingual UD dataset reader and a multilingual dependency parser with a corresponding model, which should make things easier.

Here is a list of things that need to be done:

  • Add scripts to download and concatenate the UD data for training/evaluation. Also, add the CoNLL 2018 evaluation script.
  • Create a UDify conllu -> conllu predictor that can handle unseen tokens and multiword ids.
  • Add the square-root learning rate decay scheduler.
  • Add optional dropout to ScalarMix.
  • Modify the multilingual UD dataset reader to handle multiword ids.
  • Add lemmatizer edit script code.
  • Modify the BERT token embedder to be able to return multiple scalar mixes, one per task (or alternatively all the embeddings). Add optional args for internal BERT dropout.
  • Add generic dynamic masking functions.
  • Add the custom sequence tagger and biaffine dependency parser that handles a multi-task setup.
  • Add the UDify main model, wrapping the BERT, dynamic masking, scalar mix, sequence tagger, and dependency parser code. Provide custom metrics for TensorBoard.
  • Add utility code to optionally cache the vocab and grab UD treebank names from files.
  • Add helper script to evaluate conllu predictions and output them to json.
  • Add tests to verify the new UDify model and modules.
  • Add UDify config jsonnet file.
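For the square-root decay scheduler, a minimal sketch of the intended schedule (linear warmup followed by inverse-square-root decay; the function and parameter names here are illustrative, not the final AllenNLP registrable API):

```python
import math

def sqrt_decay_lr(step, base_lr=1e-3, warmup_steps=8000):
    """Hypothetical sketch: linear warmup to `base_lr` over `warmup_steps`,
    then decay proportional to 1/sqrt(step).  At step == warmup_steps the
    two branches meet and the learning rate equals base_lr."""
    step = max(step, 1)
    scale = min(step / warmup_steps, math.sqrt(warmup_steps / step))
    return base_lr * scale
```

In AllenNLP this would likely be wrapped as a `LearningRateScheduler` subclass so it can be referenced from the config jsonnet.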
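The ScalarMix dropout item amounts to layer dropout on the mixing weights: with some probability a layer's weight is masked to negative infinity before the softmax, so that layer is excluded from the mix for that forward pass. A simplified, tensor-free sketch (the real module operates on torch tensors; all names here are illustrative):

```python
import math
import random

def scalar_mix(layers, weights, gamma=1.0, layer_dropout=0.0, training=True):
    """Weighted average of per-layer representations (plain lists of floats
    for illustration).  With layer dropout enabled during training, each
    mixing weight is independently masked to -inf before the softmax."""
    logits = list(weights)
    if training and layer_dropout > 0:
        logits = [w if random.random() >= layer_dropout else float("-inf")
                  for w in logits]
        # Ensure at least one layer survives the dropout mask.
        if all(w == float("-inf") for w in logits):
            logits[random.randrange(len(logits))] = 0.0
    exps = [math.exp(w) if w != float("-inf") else 0.0 for w in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    dim = len(layers[0])
    return [gamma * sum(p * layer[i] for p, layer in zip(probs, layers))
            for i in range(dim)]
```

Dropping whole layers (rather than individual units) encourages the model not to rely on any single BERT layer, which matters when one mix is learned per task.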
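For the lemmatizer edit scripts, the core idea is to predict a transformation from form to lemma as a class label rather than generating the lemma character by character. A deliberately simplified suffix-rule version (UDify's actual edit scripts are richer, handling prefixes and casing; these helper names are hypothetical):

```python
def encode_edit(form, lemma):
    """Encode the lemma as a rule relative to the form: the number of
    trailing characters to cut, plus a suffix to append.  E.g.
    ("running", "run") -> (4, "") and ("was", "be") -> (3, "be")."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (len(form) - i, lemma[i:])

def apply_edit(form, rule):
    """Apply a (cut, suffix) rule to a form to recover the lemma."""
    cut, suffix = rule
    return form[:len(form) - cut] + suffix
```

Because the rules generalize across words, the tagger can lemmatize unseen tokens by predicting a rule it has seen on other forms.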
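The dynamic masking item presumably means re-sampling BERT-style token masks on every batch/epoch rather than fixing them at preprocessing time. A minimal sketch under that assumption (token-level, list-based; the real function would operate on tensors and may also do random replacement):

```python
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Randomly replace each token with `mask_token` with probability
    `mask_prob`.  Called fresh on every batch, so each epoch sees a
    different masking of the same sentence."""
    rng = rng or random
    return [mask_token if rng.random() < mask_prob else t for t in tokens]
```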
