
lemire/unicode_lipsum


unicode_lipsum

Test files encoded in UTF-8, UTF-16LE and UTF-32LE.

By convention, all UTF-8 files end with .utf8.txt, all UTF-16LE files end with .utf16.txt, and all UTF-32LE files end with .utf32.txt.
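The naming convention makes it possible to pick a decoder from the file name alone. The following is a minimal Python sketch, not part of the repository: the names `SUFFIX_TO_CODEC`, `codec_for`, and `read_text` are hypothetical, and it assumes the files carry no byte-order mark (a leading BOM, if present, would be kept as a U+FEFF character).

```python
from pathlib import Path

# Suffix -> Python codec name, following the naming convention above.
# This mapping is an illustration, not something shipped with the dataset.
SUFFIX_TO_CODEC = {
    ".utf8.txt": "utf-8",
    ".utf16.txt": "utf-16-le",
    ".utf32.txt": "utf-32-le",
}


def codec_for(filename: str) -> str:
    """Return the Python codec implied by the file name's suffix."""
    for suffix, codec in SUFFIX_TO_CODEC.items():
        if filename.endswith(suffix):
            return codec
    raise ValueError(f"unrecognized suffix: {filename}")


def read_text(path) -> str:
    """Read a dataset file and decode it according to its suffix."""
    path = Path(path)
    return path.read_bytes().decode(codec_for(path.name))
```

For example, `codec_for("french.utf16.txt")` returns `"utf-16-le"`, while a name with an unknown suffix raises `ValueError`.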

A small number of files in the wikipedia_mars directory are encoded in Latin 1 (ISO-8859-1): esperanto.latin1.txt, french.latin1.txt, german.latin1.txt, and portuguese.latin1.txt. They are not exactly equivalent to the Unicode files: e.g., it is not possible to reproduce the equivalent Unicode files from the Latin 1 files. However, we provide modified Unicode files with the suffixes .utflatin8.txt (UTF-8 recovered from Latin 1), .utflatin16.txt (UTF-16LE recovered from Latin 1), and .utflatin32.txt (UTF-32LE recovered from Latin 1).
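Going from Latin 1 to Unicode is lossless because every Latin 1 byte maps to the Unicode code point with the same value; the reverse direction fails for any character above U+00FF, which is presumably why the original Unicode files cannot be rebuilt. The sketch below illustrates the conversion in Python. It is an assumption about how such files could be produced, not a description of how the repository's files were actually generated, and the function names are hypothetical.

```python
def latin1_to_unicode(data: bytes) -> dict:
    """Re-encode Latin 1 bytes as UTF-8, UTF-16LE and UTF-32LE.

    The keys reuse the dataset's suffixes purely as labels.
    """
    # Every byte value 0x00..0xFF is valid Latin 1, so this never fails.
    text = data.decode("latin-1")
    return {
        ".utflatin8.txt": text.encode("utf-8"),
        ".utflatin16.txt": text.encode("utf-16-le"),
        ".utflatin32.txt": text.encode("utf-32-le"),
    }


def fits_in_latin1(text: str) -> bool:
    """Report whether text can be encoded as Latin 1 without loss."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False
```

With this sketch, `latin1_to_unicode(b"caf\xe9")[".utflatin8.txt"]` is `b"caf\xc3\xa9"`, and `fits_in_latin1("Ω")` is `False` because U+03A9 lies outside the Latin 1 range.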

The wikipedia_mars files are derived from the Wikipedia article on Mars in different languages. Wikipedia is licensed under a Creative Commons license. The html2text Python program is used to convert the articles to text by stripping the HTML markup.

The lipsum files come from the package https://github.com/rusticstuff/simdutf8 by Hans Kratz (licensed under both MIT and Apache).

These files are provided for research purposes.

BibTeX

@misc{lemire_unicode_lipsum,
	author       = {Daniel Lemire},
	title        = {The unicode lipsum dataset},
	year         = {2026},
	howpublished = {\url{https://github.com/lemire/unicode_lipsum}},
	note         = {GitHub repository}
}
