TEIWorLD (TEI Workflow for Language Data)

Tool to convert several written and spoken language data formats into TEI

Description

TEIWorLD converts a variety of formats for spoken and written language data into the standardised formats TEISpoken and I5, using TEI P5 as an intermediate format. For archiving written data, the pipeline converts TEI P5 to I5, the format used at IDS, which IDS developed on the basis of TEI P5.

Schematic representation of the components of TEIWorLD

Data formats

Input (spoken formats)

eaf (ELAN)
textgrid (Praat)
cha (CHAT/CHILDES)
trs (Transcriber)
qdpx/mx24 (MAXQDA)

Input (written formats)

txt
docx/doc

Output

ISO/TEI Transcriptions of Spoken Language (TEISpoken)
IDS TEI P5 (I5)

Usage

To build the app yourself, download teiworld and run mvn clean package inside the directory teiworld/

Without the build process:

Download the directory teiworld/target/teiworld-1.0-SNAPSHOT and make sure it contains the following:

teiworld-1.0-SNAPSHOT
├── teiworld.jar
└── lib
    ├── teicorpo.jar
    └── commons-io-2.19.0.jar

Inside the directory, run the program with java -jar teiworld.jar [mode] [input directory] [output directory]
To show the help menu, run java -jar teiworld.jar -h
Examples of the different modes are given below.
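
The invocation above can also be scripted. The following is a minimal Python sketch and not part of TEIWorLD: the helper names missing_files and build_command are hypothetical, and the sketch assumes only the directory layout, the mode names, and the argument order described in this README.

```python
from pathlib import Path

# Files the distribution directory is expected to contain (from the listing above).
REQUIRED = ["teiworld.jar", "lib/teicorpo.jar", "lib/commons-io-2.19.0.jar"]

# Modes described in this README.
MODES = {"spoken", "writtenP5", "written", "writtenHierarchical"}


def missing_files(dist_dir):
    """Return the required files that are absent from dist_dir."""
    root = Path(dist_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]


def build_command(mode, input_dir, output_dir):
    """Build the argument list for: java -jar teiworld.jar [mode] [input] [output]."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return ["java", "-jar", "teiworld.jar", mode, str(input_dir), str(output_dir)]
```

If missing_files returns an empty list, the command can be run from inside the distribution directory, for example with subprocess.run(build_command("spoken", "../spoken", "../spoken-output"), cwd=dist_dir, check=True).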

Command spoken:

Converts all files in a spoken data format (see Data formats above) to TEISpoken. If the input directory contains more than one file, each file is converted separately.

Linux: java -jar teiworld.jar spoken ../spoken ../spoken-output
Windows: java -jar teiworld.jar spoken ..\spoken ..\spoken-output

path\to\input\dir\
├── 01.10.07-1_2_transcription.eaf       // After conversion: 01.10.07-1_2_transcription.tei_corpo.xml
├── 011017.cha                           // After conversion: 011017.tei_corpo.xml
├── 3wCno_4b_In.qdpx                     // File name will be different after conversion as only part of the qdpx archive (the transcription) is subject to conversion
├── Max_Mustermann.TextGrid              // After conversion: Max_Mustermann.tei_corpo.xml
├── B060115prs15.trs                     // After conversion: B060115prs15.tei_corpo.xml
├── Notes.docx                           // FILE WILL BE IGNORED as DOCX is not a valid input format for mode spoken
├── paper_publication.pdf                // FILE WILL BE IGNORED as PDF is not a valid input format for mode spoken
├── RDMO.PNG                             // FILE WILL BE IGNORED as PNG is not a valid input format for mode spoken
├── emptyDirectory                       // DIRECTORY WILL BE IGNORED, only files are processed in mode spoken
├── output_9382718e-ad60ce40f155_2.json  // FILE WILL BE IGNORED as JSON is not a valid input format for mode spoken
└── token-alignment.txt                  // FILE WILL BE IGNORED as TXT is not a valid input format for mode spoken
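
The selection shown in the tree above can be approximated before running the tool. This Python sketch is an illustration only, not TEIWorLD's actual logic: the function names are hypothetical, extension matching is assumed to be case-insensitive (suggested by Max_Mustermann.TextGrid being converted), and mx24 files are assumed to behave like qdpx files with respect to output naming. It looks at file names only and does not check whether a path is a file or a directory.

```python
from pathlib import Path

# Extensions listed under "Input (spoken formats)".
SPOKEN_EXTENSIONS = {".eaf", ".textgrid", ".cha", ".trs", ".qdpx", ".mx24"}

# MAXQDA archives: the output name differs because only part of the archive is converted.
ARCHIVE_EXTENSIONS = {".qdpx", ".mx24"}


def is_spoken_input(path):
    """True if the file name carries a spoken-format extension (assumed case-insensitive)."""
    return Path(path).suffix.lower() in SPOKEN_EXTENSIONS


def expected_output_name(path):
    """Predict the output file name, or None where it cannot be derived from the input name."""
    p = Path(path)
    if p.suffix.lower() in ARCHIVE_EXTENSIONS:
        return None
    return p.stem + ".tei_corpo.xml"
```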

Command writtenP5:

Converts all files in a written data format (see Data formats above) to TEI P5. If the input directory contains more than one file, each resulting file is kept separate.

Linux: java -jar teiworld.jar writtenP5 ../writtenP5 ../writtenP5-output
Windows: java -jar teiworld.jar writtenP5 ..\writtenP5 ..\writtenP5-output

path\to\input\dir\
├── fileA.txt                            // After conversion: fileA.tei_garage.xml
├── fileB.txt                            // After conversion: fileB.tei_garage.xml
├── fileC.cha                            // FILE WILL BE IGNORED as CHA is not a valid input format for mode writtenP5
└── fileD.docx                           // After conversion: fileD.tei_garage.xml

Command written:

Converts all files in a written data format (see Data formats above) to TEI I5 and combines them into a single I5 corpus.
The file metadata.json must be in the input directory.
The corpusSigle is taken from metadata.json.
All files are placed under a single dokumentSigle, whose label is also taken from metadata.json.

Linux: java -jar teiworld.jar written ../written ../written-output
Windows: java -jar teiworld.jar written ..\written ..\written-output

path\to\input\dir\
├── pic.PNG                              // FILE WILL BE IGNORED as PNG is not a valid input format for mode written
├── file01.docx                          // 
├── file02.txt                           // 
└── metadata.json                        // MANDATORY file with the corpus metadata

<idsCorpus version="1.0">
    <idsHeader type="corpus" pattern="allesaußerZtg/Zschr" version="1.0"> <!-- contains metadata from metadata.json -->
    <idsDoc type="text" version="1.0"> <!-- dokumentSigle: NOZ/DOK -->
        <idsHeader type="document" pattern="text" version="1.0">
        <idsText version="1.0">          <!-- textSigle: NOZ/DOK.00001 | empty t.title -->
        <idsText version="1.0">          <!-- textSigle: NOZ/DOK.00002 | empty t.title -->
    </idsDoc>
</idsCorpus>
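
The textSigle pattern in the schematic output above (corpusSigle, slash, dokumentSigle label, dot, five-digit running number) can be sketched in Python as follows. The function names are hypothetical, and the order in which TEIWorLD numbers the files is not stated in this README, so the sketch simply numbers them in the order given.

```python
def text_sigle(corpus_sigle, dokument_sigle, index):
    """Format a textSigle such as NOZ/DOK.00001 (index is 1-based, padded to five digits)."""
    if index < 1:
        raise ValueError("index must be at least 1")
    return f"{corpus_sigle}/{dokument_sigle}.{index:05d}"


def text_sigles(corpus_sigle, dokument_sigle, file_names):
    """Pair each written input file with a textSigle, numbering in the order given."""
    return [
        (name, text_sigle(corpus_sigle, dokument_sigle, i))
        for i, name in enumerate(file_names, start=1)
    ]
```

For the example directory, text_sigles("NOZ", "DOK", ["file01.docx", "file02.txt"]) pairs the two files with NOZ/DOK.00001 and NOZ/DOK.00002.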

Command writtenHierarchical:

Converts all files in a written data format (see Data formats above) to TEI I5 and constructs the hierarchical document and text structure of a written corpus.
The input directory must contain the file metadata.json and one or more subdirectories (= idsDoc) that contain the individual texts (= idsText).
In this mode only the corpusSigle is taken from metadata.json.

Linux: java -jar teiworld.jar writtenHierarchical ../writtenHierarchical ../writtenHierarchical-output
Windows: java -jar teiworld.jar writtenHierarchical ..\writtenHierarchical ..\writtenHierarchical-output

The folder structure of the input directory will be reflected in the resulting I5 XML tree:

path\to\input\dir\
├── Directory01                          // After conversion: directory name = dokumentSigle 
│   ├── Kriterien für Datenaufnahme.txt  // After conversion: file name = t.title | textSigle: NOZ/Directory01.00001
│   └── Protokoll Projekttreffen.docx    // After conversion: file name = t.title | textSigle: NOZ/Directory01.00002
├── Directory02                          // After conversion: directory name = dokumentSigle
│   └── Planung Publikation.docx         // After conversion: file name = t.title | textSigle: NOZ/Directory02.00001
├── Directory03                          // After conversion: directory name = dokumentSigle
│   ├── Briefsammlung.txt                // After conversion: file name = t.title | textSigle: NOZ/Directory03.00001
│   ├── Essay.txt                        // After conversion: file name = t.title | textSigle: NOZ/Directory03.00002
│   ├── Essay_Kommentare.pdf             // FILE WILL BE IGNORED as PDF is not a valid input format
│   └── Workshop Korpusaufbau.docx       // After conversion: file name = t.title | textSigle: NOZ/Directory03.00003
├── Directory04                          // DIRECTORY WILL BE IGNORED as there is no file as a direct child
│   └── folderXY                         // DIRECTORY WILL BE IGNORED, only files would be processed
│       ├── dummyFile02.txt              // FILE WILL BE IGNORED
│       └── emptyFolder                  // DIRECTORY WILL BE IGNORED
├── dummyFile01.txt                      // FILE WILL BE IGNORED as it is not inside a directory
└── metadata.json                        // MANDATORY file with the corpus metadata

<idsCorpus version="1.0">
    <idsHeader type="corpus" pattern="allesaußerZtg/Zschr" version="1.0"> <!-- contains metadata from metadata.json -->
    <idsDoc type="text" version="1.0"> <!-- dokumentSigle: NOZ/Directory01 -->
        <idsHeader type="document" pattern="text" version="1.0">
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory01.00001 | t.title: Kriterien für Datenaufnahme -->
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory01.00002 | t.title: Protokoll Projekttreffen -->
    </idsDoc>
    <idsDoc type="text" version="1.0"> <!-- dokumentSigle: NOZ/Directory02 -->
        <idsHeader type="document" pattern="text" version="1.0">
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory02.00001 | t.title: Planung Publikation -->
    </idsDoc>
    <idsDoc type="text" version="1.0"> <!-- dokumentSigle: NOZ/Directory03 -->
        <idsHeader type="document" pattern="text" version="1.0">
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory03.00001 | t.title: Briefsammlung -->
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory03.00002 | t.title: Essay -->
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory03.00003 | t.title: Workshop Korpusaufbau -->
    </idsDoc>
</idsCorpus>
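
To preview how an input directory would map onto this structure, the rules illustrated above can be sketched in Python. This is an approximation under stated assumptions, not the tool's implementation: plan_hierarchy is a hypothetical name, files are assumed to be numbered in alphabetical order (which matches the example), and extension matching is assumed to be case-insensitive.

```python
from pathlib import Path

# Extensions listed under "Input (written formats)".
WRITTEN_EXTENSIONS = {".txt", ".docx", ".doc"}


def plan_hierarchy(input_dir, corpus_sigle):
    """Map each subdirectory (dokumentSigle) to its list of (t.title, textSigle) pairs."""
    plan = {}
    for sub in sorted(p for p in Path(input_dir).iterdir() if p.is_dir()):
        # Only valid written files that are direct children of the subdirectory count.
        texts = sorted(
            f for f in sub.iterdir()
            if f.is_file() and f.suffix.lower() in WRITTEN_EXTENSIONS
        )
        if not texts:
            continue  # like Directory04 above: no file as a direct child
        plan[sub.name] = [
            (f.stem, f"{corpus_sigle}/{sub.name}.{i:05d}")
            for i, f in enumerate(texts, start=1)
        ]
    return plan
```

Files directly inside the input directory, such as dummyFile01.txt and metadata.json, never appear in the result because only subdirectories are visited.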

Components

TEIGarage
TEICORPO
P5ToI5

Publications

Contact

E-Mail
