Bitextor is a tool to automatically harvest bitexts from multilingual websites.

To run Bitextor, it is necessary to provide:
Bitextor can be installed via Docker, Conda or built from source. See instructions here.
usage: bitextor [-C FILE [FILE ...]] [-c KEY=VALUE [KEY=VALUE ...]]
                [-j JOBS] [-k] [--notemp] [--dry-run]
                [--forceall] [--forcerun [TARGET [TARGET ...]]]
                [-q] [-h]

launch Bitextor

Bitextor config::
  -C FILE [FILE ...], --configfile FILE [FILE ...]
                        Bitextor YAML configuration file
  -c KEY=VALUE [KEY=VALUE ...], --config KEY=VALUE [KEY=VALUE ...]
                        Set or overwrite values for Bitextor config

Optional arguments::
  -j JOBS, --jobs JOBS  Number of provided cores
  -k, --keep-going      Go on with independent jobs if a job fails
  --notemp              Disable deletion of intermediate files marked as temporary
  --dry-run             Do not execute anything and display what would be done
  --forceall            Force rerun every job
  --forcerun TARGET [TARGET ...]
                        List of files and rules that shall be re-created/re-executed
  -q, --quiet           Do not print job information
  -h, --help            Show this help message and exit
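
As an illustration of how the options above combine, the following Python sketch assembles a bitextor command line for a dry run. It uses only flags listed in the usage text (-C, -c, -j, --dry-run). The helper name build_bitextor_command, the file name config.yaml and the path /data/permanent are hypothetical, and passing permanentDir through -c assumes it is an ordinary configuration key.

```python
import shlex


def build_bitextor_command(config_files, overrides=None, jobs=None, dry_run=False):
    """Assemble an argument list for the bitextor CLI (hypothetical helper)."""
    # -C takes one or more YAML configuration files.
    cmd = ["bitextor", "-C", *config_files]
    if overrides:
        # -c takes one or more KEY=VALUE pairs that set or overwrite config values.
        cmd += ["-c", *[f"{key}={value}" for key, value in overrides.items()]]
    if jobs is not None:
        # -j sets the number of provided cores.
        cmd += ["-j", str(jobs)]
    if dry_run:
        # --dry-run only displays what would be done.
        cmd.append("--dry-run")
    return cmd


# Hypothetical file name and path, for illustration only.
command = build_bitextor_command(
    ["config.yaml"],
    overrides={"permanentDir": "/data/permanent"},
    jobs=4,
    dry_run=True,
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run once the dry run shows the expected plan; dropping dry_run=True would then launch the actual pipeline.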
Bitextor uses Snakemake to define its workflow and manage its execution. Snakemake provides a lot of flexibility in configuring how the pipeline is executed. For advanced users who want to make the most of this tool, the bitextor-full command is provided; it calls the Snakemake CLI with Bitextor's workflow and exposes all of Snakemake's parameters.
To run Bitextor on a cluster that uses job-queue management software, it is recommended to use the bitextor-full command together with Snakemake's cluster configuration.
Bitextor uses a configuration file to define the variables required by the pipeline. Depending on the options defined in this configuration file the pipeline can behave differently, running alternative tools and functionalities. For more information consult this exhaustive overview of all the options that can be set in the configuration file and how they affect the pipeline.
Suggestion: A configuration wizard called bitextor-config gets installed with Bitextor to help with this task. Furthermore, a minimalist configuration file sample is provided in this repository. You can take it as a starting point by changing all the paths to match your environment.
Bitextor generates the final parallel corpora in multiple formats. These files are placed in the permanentDir folder and are named {lang1}-{lang2}.{prefix}.gz, where {prefix} is a descriptor of the corresponding format. The files that may be produced are the following:
{lang1}-{lang2}.raw.gz - default (always generated)
{lang1}-{lang2}.sent.gz - default
{lang1}-{lang2}.not-deduped.tmx.gz - generated if tmx: true
{lang1}-{lang2}.deduped.tmx.gz - generated if deduped: true
{lang1}-{lang2}.deduped.txt.gz - generated if deduped: true
{lang1}-{lang2}.not-deduped.roamed.tmx.gz - generated if biroamer: true and tmx: true
{lang1}-{lang2}.deduped.roamed.tmx.gz - generated if biroamer: true and deduped: true
See detailed description of the output files.
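
The naming rules above can be summarised in a short Python sketch that lists the file names to expect in permanentDir for a given combination of options. The function name expected_outputs is hypothetical and not part of Bitextor; the sketch simply encodes the conditions stated in the list, treating the two files marked as default as always present.

```python
def expected_outputs(lang1, lang2, tmx=False, deduped=False, biroamer=False):
    """Return the output file names implied by the options (hypothetical helper)."""
    # raw and sent are both listed as defaults.
    descriptors = ["raw", "sent"]
    if tmx:
        descriptors.append("not-deduped.tmx")
    if deduped:
        descriptors += ["deduped.tmx", "deduped.txt"]
    # The roamed variants additionally require biroamer: true.
    if biroamer and tmx:
        descriptors.append("not-deduped.roamed.tmx")
    if biroamer and deduped:
        descriptors.append("deduped.roamed.tmx")
    return [f"{lang1}-{lang2}.{descriptor}.gz" for descriptor in descriptors]


# Example with English and French, and tmx and biroamer enabled.
for name in expected_outputs("en", "fr", tmx=True, biroamer=True):
    print(name)
```

A list like this can be compared against the contents of permanentDir after a run to check that every expected file was produced.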
Bitextor is a pipeline that runs a collection of scripts to produce a parallel corpus from a collection of multilingual websites. The pipeline is divided into five stages:
The following diagram shows the structure of the pipeline and the different scripts that are used in each stage:


All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.