A specification for datasets of files used to test implementations of data processing algorithms

File Test Suite specification

Rationale

There are various projects related to data processing, such as compression, encryption, hashing, source code transpilation, etc.

Developers of such projects have to test them.

Testing is usually done with pairs of the form <data before processing> - <data after processing>.

Often such data is organized as files on disk, where the test identifier is the name of a file, and the file extension indicates whether the data has been processed.

For example, a.txt is an original file, and a.txt.gz or a.gz is its gzipped counterpart.
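The two naming conventions mentioned above (extension appended, as in a.txt.gz, or extension replaced, as in a.gz) can be sketched in Python. This is only an illustration of the pairing idea: the function name `pair_files` and its `processed_ext` parameter are hypothetical and are not part of the specification or of the libraries in this repository.

```python
from pathlib import PurePath


def pair_files(names, processed_ext=".gz"):
    """Map each original file name to its processed counterpart, if any.

    Hypothetical helper: it only knows the two conventions from the text,
    "a.txt" -> "a.txt.gz" (appended) and "a.txt" -> "a.gz" (replaced).
    """
    names = set(names)
    originals = sorted(n for n in names if not n.endswith(processed_ext))
    pairs = {}
    for orig in originals:
        appended = orig + processed_ext
        replaced = str(PurePath(orig).with_suffix(processed_ext))
        # Prefer the appended form; fall back to the replaced form.
        for candidate in (appended, replaced):
            if candidate in names:
                pairs[orig] = candidate
                break
    return pairs


if __name__ == "__main__":
    print(pair_files(["a.txt", "a.txt.gz", "b.txt", "b.gz", "c.txt"]))
```

Files without a processed counterpart (c.txt here) are simply left out of the result. A real test suite would presumably take the mapping rule from the metadata file rather than hard-coding it as this sketch does.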

Here we:

  • try to unify such attempts
  • provide a specification that encodes the different ways test file sets are organized in the wild
  • provide libraries implementing that specification.

This allows us to share test datasets and code between projects easily.

We can harvest test sets from various projects, merge them into a single repo, and then replace each original project's test set with a git submodule, with almost no changes to the original projects.

Schema

A test set is a standalone repository in a version control system.

It has a root directory containing a ReadMe.md file that describes the whole dataset, plus a subdirectory for each standalone dataset harvested from various projects. Each subdirectory contains:

  • a metadata file, meta.ftsmeta, describing how the filenames of challenge-response pairs are mapped to each other and the parameters used for the algorithms;
  • a ReadMe.md file, describing the origin of the dataset;
  • a license file, if needed;
  • challenge-response pairs, with filenames following the schema described in the metadata file.
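As a rough illustration of this layout, the following Python sketch lists the dataset directories of a checked-out test set by looking for the metadata file in each immediate child of the root. The function name `find_datasets` is hypothetical, and the sketch deliberately does not parse meta.ftsmeta; it only checks that the file exists.

```python
from pathlib import Path

META_NAME = "meta.ftsmeta"  # metadata file name given in the text above


def find_datasets(root):
    """Return the names of subdirectories of `root` that look like datasets.

    Hypothetical helper: a subdirectory counts as a dataset here if it
    contains a meta.ftsmeta file. The file's contents are not inspected.
    """
    root = Path(root)
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / META_NAME).is_file()
    )


if __name__ == "__main__":
    # Assumes the current directory is the root of a test set repository.
    for name in find_datasets("."):
        print(name)
```

Directories without a metadata file, such as .git, are skipped by this check.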

Metadata file

For the format of meta.ftsmeta, its semantics, and the design decisions behind it, see the file_test_suite_metadata.ksy Kaitai Struct spec.