Here’s a download link for all of bookcorpus as of Sept 2020 #27

@shawwn

You can download it here: https://twitter.com/theshawwn/status/1301852133319294976?s=21

It contains 18k plain-text files, and the results are very high quality. I spent about a week fixing the epub2txt script; the fixed version is in https://github.com/shawwn/scrap under the name “epub2txt-all” (not “epub2txt”).

The new script:

  1. Correctly preserves structure, matching the table of contents very closely;

  2. Correctly renders tables of data (by default, html2text produces mostly garbage-looking results for tables);

  3. Correctly preserves code structure, so that source code and similar things are visually coherent;

  4. Converts numbered lists from “1\.” to “1.”;

  5. Runs the full text through ftfy.fix_text() (which is what OpenAI does for GPT), replacing Unicode apostrophes with ASCII apostrophes;

  6. Expands Unicode ellipses to “...” (three separate ASCII characters).
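
For illustration, here is a rough sketch of what steps 4 to 6 amount to. It is not the code from epub2txt-all: the function name `normalize_text` is made up, and to keep the example stdlib-only a small translation table stands in for `ftfy.fix_text()`, which repairs far more than apostrophes.

```python
import re

# Hypothetical stand-in for the apostrophe handling described in step 5.
# The real script runs ftfy.fix_text(), which does much more than this.
_APOSTROPHES = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})

# A line-leading number followed by an escaped period, e.g. "1\."
_ESCAPED_LIST_MARKER = re.compile(r"^(\s*\d+)\\\.", flags=re.MULTILINE)


def normalize_text(text: str) -> str:
    """Apply simplified versions of steps 4, 5 and 6 to one document."""
    # Step 4: turn "1\." at the start of a line back into "1."
    text = _ESCAPED_LIST_MARKER.sub(r"\1.", text)
    # Step 5 (approximation): replace Unicode apostrophes with ASCII ones
    text = text.translate(_APOSTROPHES)
    # Step 6: expand the single ellipsis character into three ASCII periods
    text = text.replace("\u2026", "...")
    return text
```

Calling `normalize_text` on a string such as `"1\\. It\u2019s done\u2026"` would exercise all three steps at once.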

The tarball (see the tweet above for the download link) also includes the original ePub URLs, updated for September 2020. There turned out to be about 2k more of them than there are URLs in this repo, but they’re hard to crawl. I do have the epub files, but I’m reluctant to distribute them for obvious reasons.
