TEIWrLD/TEIWorLD

TEIWorLD (TEI Workflow for Language Data)

Tool to convert several written and spoken language data formats into TEI

Description

TEIWorLD converts a range of spoken and written language data formats into the standardised formats TEISpoken and I5, using TEI P5 as an intermediate format. For archiving written data, the pipeline converts TEI P5 to I5, the format used at IDS, which IDS developed on the basis of TEI P5.

Schematic representation of the components of TEIWorLD

Data formats

Input (spoken formats)

  • eaf (Elan)
  • textgrid (Praat)
  • cha (chat/childes)
  • trs (transcriber)
  • maxqda (qdpx/mx24)

Input (written formats)

  • txt
  • docx/doc

Output

  • ISO/TEI Transcriptions of Spoken Language (TEISpoken)
  • IDS TEI P5 (I5)

Usage

To build the app yourself, download teiworld and run mvn clean package inside the directory teiworld/

Without the build process:

  1. Download the directory teiworld/target/teiworld-1.0-SNAPSHOT and make sure it contains the following:
teiworld-1.0-SNAPSHOT
├── teiworld.jar
└── lib
    ├── teicorpo.jar
    └── commons-io-2.19.0.jar
  2. Inside the directory, run the program with java -jar teiworld.jar [mode] [input directory] [output directory]
  3. To show the help menu, run java -jar teiworld.jar -h
  4. For examples of the different modes, see below
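
The steps above can be wrapped in a small pre-flight check. The following Python sketch is not part of TEIWorLD; the helper names `missing_files` and `build_command` are hypothetical. It only checks that the three files listed above are present and assembles the documented command line for one of the four modes.

```python
from pathlib import Path

# Files the README lists for the pre-built distribution directory.
REQUIRED_FILES = ("teiworld.jar", "lib/teicorpo.jar", "lib/commons-io-2.19.0.jar")
# Modes documented in this README.
MODES = ("spoken", "writtenP5", "written", "writtenHierarchical")


def missing_files(dist_dir):
    """Return the required files that are absent from dist_dir (hypothetical helper)."""
    root = Path(dist_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def build_command(mode, input_dir, output_dir):
    """Assemble the documented call: java -jar teiworld.jar [mode] [input] [output]."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["java", "-jar", "teiworld.jar", mode, str(input_dir), str(output_dir)]
```

The resulting list can be passed to `subprocess.run(cmd, cwd=dist_dir, check=True)` so that the relative path to teiworld.jar resolves inside the distribution directory.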

Command spoken:

Converts all files in a spoken data format (see Data formats above) to TEISpoken. If the input directory contains more than one file, each file is converted separately.

  • Linux: java -jar teiworld.jar spoken ../spoken ../spoken-output
  • Windows: java -jar teiworld.jar spoken ..\spoken ..\spoken-output
path\to\input\dir\
├── 01.10.07-1_2_transcription.eaf       // After conversion: 01.10.07-1_2_transcription.tei_corpo.xml
├── 011017.cha                           // After conversion: 011017.tei_corpo.xml
├── 3wCno_4b_In.qdpx                     // File name will be different after conversion as only part of the qdpx archive (the transcription) is subject to conversion
├── Max_Mustermann.TextGrid              // After conversion: Max_Mustermann.tei_corpo.xml
├── B060115prs15.trs                     // After conversion: B060115prs15.tei_corpo.xml
├── Notes.docx                           // FILE WILL BE IGNORED as DOCX is not a valid input format for mode spoken
├── paper_publication.pdf                // FILE WILL BE IGNORED as PDF is not a valid input format for mode spoken
├── RDMO.PNG                             // FILE WILL BE IGNORED as PNG is not a valid input format for mode spoken
├── emptyDirectory                       // DIRECTORY WILL BE IGNORED, only files are processed in mode spoken
├── output_9382718e-ad60ce40f155_2.json  // FILE WILL BE IGNORED as JSON is not a valid input format for mode spoken
└── token-alignment.txt                  // FILE WILL BE IGNORED as TXT is not a valid input format for mode spoken
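
The selection and naming rules shown in the tree above can be sketched in a few lines of Python. This is an illustration, not the tool's implementation: the helper name `plan_spoken` is hypothetical, matching extensions case-insensitively is an assumption (suggested by `Max_Mustermann.TextGrid`), and because the README only says that MAXQDA output names differ, the sketch records `None` for them instead of guessing a name.

```python
from pathlib import PurePath

# Spoken input formats listed under "Data formats".
SPOKEN_EXTENSIONS = {".eaf", ".textgrid", ".cha", ".trs", ".qdpx", ".mx24"}
# MAXQDA archives: the output name is not derived from the input name.
MAXQDA_EXTENSIONS = {".qdpx", ".mx24"}


def plan_spoken(file_names):
    """Map each accepted input name to its expected output name (hypothetical helper).

    Names with other extensions, or with no extension at all, are left out.
    """
    plan = {}
    for name in file_names:
        path = PurePath(name)
        suffix = path.suffix.lower()  # assumption: case-insensitive matching
        if suffix not in SPOKEN_EXTENSIONS:
            continue
        if suffix in MAXQDA_EXTENSIONS:
            plan[name] = None  # output name unknown from the README
        else:
            plan[name] = f"{path.stem}.tei_corpo.xml"
    return plan
```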

Command writtenP5:

Converts all files in a written data format (see Data formats above) to TEI P5. If the input directory contains more than one file, each resulting file is kept separate.

  • Linux: java -jar teiworld.jar writtenP5 ../writtenP5 ../writtenP5-output
  • Windows: java -jar teiworld.jar writtenP5 ..\writtenP5 ..\writtenP5-output
path\to\input\dir\
├── fileA.txt                            // After conversion: fileA.tei_garage.xml
├── fileB.txt                            // After conversion: fileB.tei_garage.xml
├── fileC.cha                            // FILE WILL BE IGNORED as CHA is not a valid input format for mode writtenP5
└── fileD.docx                           // After conversion: fileD.tei_garage.xml

Command written:

Converts all files in a written data format (see Data formats above) to TEI I5 and combines them into a single I5 corpus.
The file metadata.json must be in the input directory.
The corpusSigle is taken from metadata.json.
All files are placed under a single dokumentSigle, whose label is also taken from metadata.json.

  • Linux: java -jar teiworld.jar written ../written ../written-output
  • Windows: java -jar teiworld.jar written ..\written ..\written-output
path\to\input\dir\
├── pic.PNG                              // FILE WILL BE IGNORED as PNG is not a valid input format for mode written
├── file01.docx                          // 
├── file02.txt                           // 
└── metadata.json                        // MANDATORY file with the corpus metadata
<idsCorpus version="1.0">
    <idsHeader type="corpus" pattern="allesaußerZtg/Zschr" version="1.0"> <!-- contains metadata from metadata.json -->
    <idsDoc type="text" version="1.0"> <!-- dokumentSigle: NOZ/DOK -->
        <idsHeader type="document" pattern="text" version="1.0">
        <idsText version="1.0">          <!-- textSigle: NOZ/DOK.00001 | empty t.title -->
        <idsText version="1.0">          <!-- textSigle: NOZ/DOK.00002 | empty t.title -->
    </idsDoc>
</idsCorpus>
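
The textSigle pattern visible in the example above (corpusSigle, a slash, the dokumentSigle label, a dot, and a five-digit counter) can be reproduced with a short Python sketch. The helper name `assign_text_sigles` is hypothetical, and numbering the files in sorted name order is an assumption; the README does not state the order the tool uses.

```python
from pathlib import PurePath

# Written input formats listed under "Data formats".
WRITTEN_EXTENSIONS = {".txt", ".docx", ".doc"}


def assign_text_sigles(file_names, corpus_sigle, doc_label):
    """Give each accepted file a textSigle such as NOZ/DOK.00001 (hypothetical helper)."""
    texts = sorted(  # assumption: files are numbered in sorted name order
        name for name in file_names
        if PurePath(name).suffix.lower() in WRITTEN_EXTENSIONS
    )
    return {
        name: f"{corpus_sigle}/{doc_label}.{number:05d}"
        for number, name in enumerate(texts, start=1)
    }
```

In the real tool both `corpus_sigle` and `doc_label` come from metadata.json; the sketch takes them as plain arguments because the structure of that file is not described here.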

Command writtenHierarchical:

Converts all files in a written data format (see Data formats above) to TEI I5 and constructs the hierarchical document and text structure of a written corpus.
The directory needs to contain the file metadata.json and one or more subdirectories (= idsDoc) that contain the individual texts (= idsText).
In this mode only the corpusSigle is taken from metadata.json.

  • Linux: java -jar teiworld.jar writtenHierarchical ../writtenHierarchical ../writtenHierarchical-output
  • Windows: java -jar teiworld.jar writtenHierarchical ..\writtenHierarchical ..\writtenHierarchical-output

The folder structure of the input directory will be reflected in the resulting I5 XML tree:

path\to\input\dir\
├── Directory01                          // After conversion: directory name = dokumentSigle 
│   ├── Kriterien für Datenaufnahme.txt  // After conversion: file name = t.title | textSigle: NOZ/Directory01.00001
│   └── Protokoll Projekttreffen.docx    // After conversion: file name = t.title | textSigle: NOZ/Directory01.00002
├── Directory02                          // After conversion: directory name = dokumentSigle
│   └── Planung Publikation.docx         // After conversion: file name = t.title | textSigle: NOZ/Directory02.00001
├── Directory03                          // After conversion: directory name = dokumentSigle
│   ├── Briefsammlung.txt                // After conversion: file name = t.title | textSigle: NOZ/Directory03.00001
│   ├── Essay.txt                        // After conversion: file name = t.title | textSigle: NOZ/Directory03.00002
│   ├── Essay_Kommentare.pdf             // FILE WILL BE IGNORED as PDF is not a valid input format
│   └── Workshop Korpusaufbau.docx       // After conversion: file name = t.title | textSigle: NOZ/Directory03.00003
├── Directory04                          // DIRECTORY WILL BE IGNORED as there is no file as a direct child
│   └── folderXY                         // DIRECTORY WILL BE IGNORED, only files would be processed
│       ├── dummyFile02.txt              // FILE WILL BE IGNORED
│       └── emptyFolder                  // DIRECTORY WILL BE IGNORED
├── dummyFile01.txt                      // FILE WILL BE IGNORED as it is not inside a directory
└── metadata.json                        // MANDATORY file with the corpus metadata
<idsCorpus version="1.0">
    <idsHeader type="corpus" pattern="allesaußerZtg/Zschr" version="1.0"> <!-- contains metadata from metadata.json -->
    <idsDoc type="text" version="1.0"> <!-- dokumentSigle: NOZ/Directory01 -->
        <idsHeader type="document" pattern="text" version="1.0">
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory01.00001 | t.title: Kriterien für Datenaufnahme -->
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory01.00002 | t.title: Protokoll Projekttreffen -->
    </idsDoc>
    <idsDoc type="text" version="1.0"> <!-- dokumentSigle: NOZ/Directory02 -->
        <idsHeader type="document" pattern="text" version="1.0">
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory02.00001 | t.title: Planung Publikation -->
    </idsDoc>
    <idsDoc type="text" version="1.0"> <!-- dokumentSigle: NOZ/Directory03 -->
        <idsHeader type="document" pattern="text" version="1.0">
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory03.00001 | t.title: Briefsammlung -->
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory03.00002 | t.title: Essay -->
        <idsText version="1.0">          <!-- textSigle: NOZ/Directory03.00003 | t.title: Workshop Korpusaufbau -->
    </idsDoc>
</idsCorpus>
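
The mapping from folder structure to I5 structure can be sketched as a directory walk. The following Python code is a rough illustration, not TEIWorLD's implementation: the helper name `plan_hierarchical` is hypothetical, and sorted name order for directories and files is an assumption that happens to agree with the example above.

```python
from pathlib import Path

# Written input formats listed under "Data formats".
WRITTEN_EXTENSIONS = {".txt", ".docx", ".doc"}


def plan_hierarchical(input_dir, corpus_sigle):
    """Return (textSigle, t.title) pairs for a writtenHierarchical input directory.

    Only files that are direct children of a first-level subdirectory count;
    top-level files, nested folders and unsupported formats are skipped.
    """
    plan = []
    doc_dirs = sorted(p for p in Path(input_dir).iterdir() if p.is_dir())
    for doc_dir in doc_dirs:  # directory name = dokumentSigle
        texts = sorted(
            f for f in doc_dir.iterdir()
            if f.is_file() and f.suffix.lower() in WRITTEN_EXTENSIONS
        )
        # A directory without accepted direct-child files yields no entries.
        for number, text_file in enumerate(texts, start=1):
            sigle = f"{corpus_sigle}/{doc_dir.name}.{number:05d}"
            plan.append((sigle, text_file.stem))  # file name = t.title
    return plan
```

Running it over a directory laid out like the example, with the corpusSigle `NOZ`, yields pairs such as `("NOZ/Directory03.00002", "Essay")`.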

Components

TEIGarage
TEICORPO
P5ToI5

Publications

Contact



© 2025 TEIWorLD
