You can download it here: https://twitter.com/theshawwn/status/1301852133319294976?s=21
It contains 18k plain text files. The results are very high quality. I spent about a week fixing the epub2txt script, which you can find at https://github.com/shawwn/scrap under the name “epub2txt-all” (not “epub2txt”).

The new script:
- Correctly preserves structure, matching the table of contents very closely;
- Correctly renders tables of data (by default html2txt produces mostly garbage-looking results for tables);
- Correctly preserves code structure, so that source code and similar things are visually coherent;
- Converts numbered lists from “1\.” to “1.”;
- Runs the full text through ftfy.fix_text() (which is what OpenAI does for GPT), replacing Unicode apostrophes with ASCII apostrophes;
- Expands Unicode ellipses to “...” (three separate ASCII characters).
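
To give a rough idea of the last three cleanup steps, here is a minimal sketch. It is not the code from epub2txt-all: the function name `normalize_text` and the regex are my own assumptions for illustration, and ftfy is treated as optional so the snippet still runs without it installed.

```python
import re

try:
    import ftfy  # third-party package; optional in this sketch
except ImportError:
    ftfy = None


def normalize_text(text: str) -> str:
    """Hypothetical helper illustrating the text cleanup steps listed above."""
    # Un-escape numbered list markers at the start of a line: "1\." -> "1."
    text = re.sub(r"^(\s*\d+)\\\.", r"\1.", text, flags=re.MULTILINE)

    # Let ftfy repair mojibake and straighten curly quotes, if it is available.
    if ftfy is not None:
        text = ftfy.fix_text(text)

    # Explicit replacements, so the result is the same with or without ftfy:
    # Unicode apostrophes become ASCII apostrophes...
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    # ...and the single-character ellipsis becomes three ASCII dots.
    text = text.replace("\u2026", "...")
    return text


if __name__ == "__main__":
    print(normalize_text("1\\. It\u2019s done\u2026"))
```

The explicit `replace` calls are kept even when ftfy is present, since I would not rely on `fix_text()` alone to expand the ellipsis character.
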
The tarball download link (see tweet above) also includes the original ePub URLs, updated for September 2020, which ended up being about 2k more than the URLs in this repo, though they’re hard to crawl. I do have the epub files, but I’m reluctant to distribute them for obvious reasons.