# Training
TEES is trained with the program train.py, which produces a model file that can be used with classify.py to predict events and relations. Before training, it is important to set up TEES with configure.py. Training TEES requires a corpus, divided into training (train), parameter optimization (devel) and performance testing (test) sets. The simplest way to train TEES is to use one of the included corpora. For example, to train a model for the GE corpus, we could use the following command:
python train.py -t GE11 -o [OUTDIR] -c [REMOTE]
In the command, the -t (task) switch defines the GE11 corpus. Using the task switch makes train.py use predefined settings suitable for that known corpus. All of these values can also be overridden with command line parameters, but using a predefined task is an easy way to get started with building a new model.
Alternatively, to train TEES for a new corpus, the following command could be used:
python train.py --trainFile MY_TRAINING_CORPUS.xml --develFile MY_DEVELOPMENT_CORPUS.xml --testFile MY_TEST_CORPUS.xml -o OUTDIR -c REMOTE
Here the --trainFile, --develFile and --testFile switches define the learning, parameter optimization and performance evaluation sets, respectively. When encountering a new corpus, TEES will attempt to learn suitable training settings from the annotations in the learning and parameter optimization sets. As when using a predefined task setting, all of these values can also be overridden with command line parameters.
You always need to define an output directory parameter (-o). This is the directory that train.py will use for its intermediate files, and it also contains "log.txt", which records the whole training process. If the model file switches (--develModel and --testModel) are undefined, the model files will also be created in the output directory. If used, these switches should define the absolute paths of the models to be created. If a model name ends with a ".zip" suffix, the model will be a compressed archive; otherwise it will be a directory.
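The ".zip"-versus-directory rule can be pictured with a small sketch (illustrative only, with hypothetical file names; TEES's own packaging code may differ): the same model content is written either into a compressed archive or into a plain directory, depending on the suffix of the requested name.

```python
import os
import zipfile

def save_model(model_name, files):
    """Illustrative sketch: package model files as a .zip archive or as a
    plain directory, depending on the suffix of the requested model name."""
    if model_name.endswith(".zip"):
        with zipfile.ZipFile(model_name, "w") as archive:
            for name, content in files.items():
                archive.writestr(name, content)
    else:
        os.makedirs(model_name, exist_ok=True)
        for name, content in files.items():
            with open(os.path.join(model_name, name), "w") as f:
                f.write(content)

# A ".zip" name produces a compressed archive, any other name a directory.
save_model("model-devel.zip", {"settings.txt": "c=50000"})
save_model("model-devel", {"settings.txt": "c=50000"})
```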
The optional remote connection switch (-c) allows you to send the time-consuming SVM parameter optimization search to your cluster, enabling fast parallel processing. Remote connections are introduced in the following section.
Other options for train.py largely depend on whether you are training a single-stage or a multi-stage detector. A single-stage detector does one round of machine learning, such as the EdgeDetector, EntityDetector or ModifierDetector. A multi-stage detector consists of several single-stage detectors forming a pipeline: e.g. the EventDetector first detects trigger words with an EntityDetector, then event arguments with an EdgeDetector, and finally builds valid event structures with an UnmergingDetector. The EventDetector can also use a ModifierDetector to predict negation and speculation if these modifiers have been annotated in the training data. The Detector class to use is determined automatically if the task argument is used, or it can be defined manually as the import path of a named Detector class. If a single-stage detector is used, the "example-*" arguments are used for its settings, and if an EventDetector is used, the arguments named after its subcomponents are used.
When training a model with the -t (task) switch, example generation and classifier parameters are defined automatically. It is, however, possible to override all of these settings, or to define them manually without using the -t (task) switch.
The example builders construct the examples (class label + feature vector) used by the classifier. The "*Style" arguments of train.py can be used to toggle specific settings for example generation. In the case of a SingleStageDetector, "--exampleStyle" is used for the only ExampleBuilder the detector uses. In the case of an EventDetector, the "--triggerStyle", "--edgeStyle", "--unmergingStyle" and "--modifierStyle" parameters can be used. Available example generation styles are defined in the init methods of the corresponding example builder classes.
An example of using custom example generation styles can be seen with the DDI11 task. This task has two variants: one where only drug-drug interaction edges are detected, and one where drug name entities must also be detected. By default, the drug name entities in this corpus are marked with the attribute "given", so when analysing the corpus TEES sees that entities don't have to be detected and will use a SingleStageDetector. However, if we wish to train for the task variant with entity detection, we can give the following command:
python train.py -t DDI11 --detector Detectors.EventDetector --triggerStyle names:build_for_nameless:ddi13_features:drugbank_features -o [OUTDIR]
Here we use the task setting DDI11, but override the automatically chosen SingleStageDetector with the "--detector" argument. Since entities in the DDI11 corpus are marked as "given", the EntityExampleBuilder will by default ignore them. Thus, "--triggerStyle" is used to customize example generation. The style argument "names" ensures that the EntityExampleBuilder will generate examples for the "given" entities, and the style argument "build_for_nameless" causes the EntityExampleBuilder to process all sentences. Finally, the "ddi13_features" and "drugbank_features" styles enable task-specific feature generators.
The above example is also available as a predefined -t (task) setting "DDI11-FULL".
As with example generation, classification can also be customized. In the case of a SingleStageDetector, "--exampleParams" is used for its single classifier. In the case of an EventDetector, the "--triggerParams", "--edgeParams", "--unmergingParams" and "--modifierParams" settings can be used.
By default, the classifier parameters apply to the SVM-MultiClass classifier. A typical example of "--exampleParams" would be "c=5000,10000,20000,50000,100000", a grid search for the best regularization parameter. However, the classifier parameters can also be used to select a different classifier. The classifiers available in TEES are:
| Classifier | Purpose |
|---|---|
| SVMMultiClassClassifier | The default support vector machine classifier |
| ScikitClassifier | An interface to the scikit-learn library |
| AllTrueClassifier | A dummy classifier that always predicts the positive class. For testing of binary classifications. |
| AllCorrectClassifier | A dummy classifier that always predicts the correct, gold answer. For testing of classification pipelines. |
Of the alternative classifiers, the ScikitClassifier is probably the most interesting. It enables the use of any scikit-learn classifier with TEES. To use scikit-learn, it must be installed on the system. To use e.g. the scikit-learn SVC classifier with a single example builder, the following parameters can be used:
--exampleParams TEES.classifier=ScikitClassifier:scikit=svm.SVC:C=10,20,50,80,100,120,150,200,500,1000:probability
Here the "TEES.classifier" parameter imports the ScikitClassifier from the Classifiers subdirectory. The "scikit" parameter in turn imports the "SVC" classifier from the scikit-learn module "svm". Finally, the "C" and "probability" parameters are passed on to the scikit-learn classifier object. The "C" parameter has multiple values, so a parameter grid search is performed. The "probability" parameter is given no explicit value, so it defaults to True. With the "probability" setting, confidence estimates are generated for the predicted elements, just like with the SVMMultiClassClassifier.
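The conventions of this parameter string can be summed up with a small parser (a stdlib sketch of the format described above, not TEES's own code): ":" separates settings, "," separates grid-search alternatives, and a bare key defaults to True.

```python
def parse_params(param_string):
    """Sketch of the parameter string conventions: ':' separates settings,
    ',' separates grid-search alternatives, a bare key defaults to True."""
    params = {}
    for part in param_string.split(":"):
        if "=" in part:
            key, value = part.split("=", 1)
            values = value.split(",")
            # A comma-separated value becomes a list of grid-search candidates.
            params[key] = values if len(values) > 1 else values[0]
        else:
            params[part] = True  # e.g. "probability" with no explicit value
    return params

parsed = parse_params(
    "TEES.classifier=ScikitClassifier:scikit=svm.SVC:C=10,20,50:probability")
```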
All scikit-learn classifiers that support sparse feature vectors should be usable with TEES. As another example, to use the RandomForestClassifier, the following parameters can be used:
--exampleParams TEES.classifier=ScikitClassifier:scikit=ensemble.RandomForestClassifier:n_estimators=5,10,50,100
Here the scikit-learn "RandomForestClassifier" is imported from the "ensemble" module and a parameter grid search is performed with four different values of the "n_estimators" parameter.
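What a grid search over comma-separated parameter values enumerates can be illustrated as follows (a stdlib sketch, not TEES's actual implementation): each parameter with several values contributes one axis, and every combination becomes a candidate setting to train and evaluate on the devel set.

```python
from itertools import product

def expand_grid(params):
    """Expand per-parameter value lists into every concrete combination,
    as a parameter grid search would try them one by one."""
    names = sorted(params)
    return [dict(zip(names, combo))
            for combo in product(*(params[name] for name in names))]

# The single-parameter search from the RandomForest example: four candidates.
candidates = expand_grid({"n_estimators": [5, 10, 50, 100]})
```

With two parameters of two values each, `expand_grid` would instead produce four combined candidates, which is why multi-parameter grid searches grow expensive quickly.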
If the remote connection is not set when training, all the SVM models are trained on the local machine, which can be a very time consuming process. If you have access to a cluster computer, using a remote connection can save a lot of time.
If you are running train.py within the cluster environment, all you need to do to use parallel processing is to define your job scheduler type with the -c switch. Supported values are "SLURM", "PBS" and "LSF", referring to those job scheduling systems. For example, in a SLURM environment, use the parameter
-c SLURM
Occasionally you might want to run the actual training process on your local computer, but still use the cluster for fast parallel training of the SVMs. This can be useful, for example, if your cluster environment doesn't support Python, meaning you can't run most of TEES there. In that case, the remote connection must be defined in full. For example, you might give train.py the following -c (connection) parameter:
-c type=SLURM:[email protected]:workdir=/workdir/me:settings=/home/me/TEESLocalSettings.py
Here, as previously, the connection type "SLURM" is used. In addition, the address of the remote machine is defined with the "account" parameter. Passwordless SSH login (with keys) must be enabled for TEES to be able to use a remote connection. The "workdir" parameter defines the remote working directory, where TEES will mirror your local training output directory for the relevant files.
Finally, the TEES settings file, which defines the SVM executable locations etc., must be defined in the remote connection with the "settings" parameter. You have probably defined your local settings file with TEES_SETTINGS on your local computer, but this is a different file: the one for your remote cluster machine.
A remote connection like this is quite a lot to type every time you run train.py, so you can save it into your local TEES settings file. Just define a new variable, e.g.:
MY_CLUSTER = "type=SLURM:[email protected]:workdir=/workdir/me:settings=/home/me/TEESLocalSettings.py"
Now, whenever you give the remote connection switch -c the value "MY_CLUSTER", TEES will read the remote connection parameters from the local settings file.
Training TEES is a long process; for example, training the GE model takes around 3 hours even if a cluster is used to speed up SVM training. Occasionally something can go wrong, especially when developing a new extension. If the training process crashes, all is not lost, as train.py allows you to continue training from a nearby point in the process. This is made possible with the --step switch, which defines a major and minor step. For train.py, the major steps are "TRAIN", "DEVEL", "EMPTY" and "TEST", where "TRAIN" is the actual training process, "DEVEL" and "EMPTY" are classification of the development corpus with the newly trained devel model, and "TEST" is classification of the test corpus with the newly trained test model. The minor step defines the processing step within the Detector object. For example, if you want to restart the same training process mentioned earlier from the parameter grid search, add the --step switch to the call with a MAJOR=MINOR step pair:
python train.py -t GE11 -u -m -o OUTDIR -c REMOTE --step TRAIN=GRID
TEES records in the log file each step it enters or exits, for example:
=== ENTER STEP EventDetector:TRAIN:EXAMPLES ===
or
=== EXIT STEP EXAMPLES time: 0:04:53.251005 ===
Both of the above log messages refer to the "EXAMPLES" minor step. To restart training from this step, use the --step parameter with the value "TRAIN=EXAMPLES". After a crash, the log file shows the step where the error happened, and that is where processing can be continued once the error is fixed.
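Finding the restart point amounts to scanning the log for the last step that was entered but never exited. A minimal sketch of such a scan, assuming only the ENTER/EXIT line format shown above (this is a reading aid, not part of TEES):

```python
import re

def last_unfinished_step(log_lines):
    """Return the detector:MAJOR:MINOR identifier of the last step that was
    entered but not exited, based on ENTER/EXIT STEP log lines."""
    open_steps = []
    for line in log_lines:
        enter = re.match(r"=== ENTER STEP (\S+) ===", line)
        if enter:
            open_steps.append(enter.group(1))
        elif re.match(r"=== EXIT STEP \S+ time: \S+ ===", line) and open_steps:
            open_steps.pop()  # the most recently entered step finished
    return open_steps[-1] if open_steps else None

log = [
    "=== ENTER STEP EventDetector:TRAIN:EXAMPLES ===",
    "=== EXIT STEP EXAMPLES time: 0:04:53.251005 ===",
    "=== ENTER STEP EventDetector:TRAIN:GRID ===",
]
step = last_unfinished_step(log)  # the GRID step was entered but never exited
```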
In addition to continuing a previously stopped training process, it is sometimes useful to create a new copy of the training directory. For example, you might have a successfully trained GE model and want to preserve its output directory as a template, so that you can later modify code affecting a later stage of the process and jump immediately to that point when re-training. This can be done with the --copyFrom switch. For example, to rerun the earlier GE11 training process, using data from the first run but starting only at the grid search, we could use the command:
python train.py -t GE11 -u -m -o NEW_OUTDIR -c REMOTE --step TRAIN=GRID --copyFrom OLD_OUTDIR