Introduction
At the core of “Closing the Gap” is a database of JSON files, each containing information relating to a research project in the Multilingual/Non-Latin-Script Digital Humanities. The files look like the following (a snippet; see the full file if you’re interested):
{
  "schema_version": "0.2.3",
  "record_metadata": {
    "uuid": "d1e6d69b-5e9a-4b4a-85ad-09aac56ed2d9",
    "record_created_on": "2021-11-08",
    "record_created_by": "Kudela, Xenia Monika",
    "last_edited_on": "2022-02-18"
  },
  "project": {
    "title": "Kalila and Dimna – AnonymClassic",
    "abbr": "",
    "type": "project",
    "ref": [],
    "date": [
      {
        "from": "2018-01-01",
        "to": "2022-12-31"
      }
    ],
    "maintained": null,
    "websites": [
      "https://www.geschkult.fu-berlin.de/en/e/kalila-wa-dimna/index.html",
      "https://kalila-and-dimna.fu-berlin.de/"
    ],
    ...
There are many places in our database schema where URLs are recorded. Projects have websites. We also include information about host institutions, the researchers who lead or are affiliated with each project, etc.—and their website URLs are generally listed.
With our project database already containing over 165 entries, the number of URLs can be expected to be several times larger. This is indeed the case: as will be described below, we currently have over 1,300 unique URLs. A challenge is that these links are scattered across different parts of hundreds of JSON files.
If we want to check the URLs in our database automatically for breakage—which we do—then we need a way of extracting them, sending HTTP requests to all of them, and tracking the responses. That is what this blog post is meant to explain: our use of a tool called lychee to find and check links.
Sidebar: Why do this?
It is worth asking why we go to the trouble of checking hundreds of URLs and replacing them as they break. After all, one of the harsh realities of the Internet is that no URL will remain valid forever. With 1,300+ links to check (and counting), we find at least some breakage on a weekly basis. It often seems to be the case that, within a few years of the end of the funding period of a given DH project, the host university will allow the project website to go offline. With this in mind, why bother struggling against entropy?
Our answer is that part of the mission of “Closing the Gap” is to contribute to the improvement of standards of practice in the Digital Humanities. Research projects, and the institutions that host them, should strive to assign stable URLs and to maintain their validity for as long as is practical. How long is long enough? This is a matter of some subjectivity, but we think it is safe to say that any DH project that was active within the last decade should have an accessible website. (Relatedly, we advocate the use of static websites and web apps, which are easier to keep online over longer periods—and easier for the Internet Archive to snapshot.)
There have been cases in which we noticed that URLs at a given institution were breaking at high rates, and we were able to notify them and to see an actual improvement in the situation. So there is some method to the madness. We go through all the URLs in our database on a regular basis; find broken links; fix/replace them to the best of our ability; and notify site owners when we see larger problems. This allows us to keep our data relatively clean and to perform a kind of community service among DH researchers.
Extracting URLs
lychee is a modern, performant command-line utility for checking links. It is implemented in Rust, a relatively new and popular programming language designed to make it easier for developers to write correct, memory-safe software without sacrificing performance. So many excellent CLI tools are written in Rust that even non-programmers might reasonably take a tool's use of the language as a hint of high quality.
At any rate, the easiest use case for link-checking is an HTML document, since one can at least parse the HTML and look for URLs in <a> elements. lychee does this nicely. It can, in fact, be pointed at a webpage online, whereupon it will find all links and check them. We can use as an example the homepage of the Kalīla and Dimna Project at the Freie Universität Berlin (and you will notice a few broken links if you run this command):
lychee https://www.geschkult.fu-berlin.de/en/e/kalila-wa-dimna/
The lychee documentation explains more about the options that can be set, the file formats supported, etc. But, again, a simple approach will not work for the “Closing the Gap” database. We have links in hundreds of JSON files, sometimes deeply nested. There are also many duplicate URLs. What we need is to iterate over the files, extract links from each, and generate a unified list, which can then be checked.
There would no doubt be many different ways of accomplishing this. The approach that we chose is to couple lychee to another CLI tool (also implemented in Rust), fd-find. The following command, when run in the root of our repository, recursively identifies all JSON files:
fd -e json
The other preparatory command that we need uses lychee itself: it takes a file (JSON or otherwise) and “dumps” a list of all URLs found therein. The list can then be written to an output file, which we will use later to check the links:
lychee --dump [some_file.json] > links_list.md
We can connect the “find” command to the “dump all links” command by using the -x flag in fd-find. That is, we ask for all JSON files, and for the links to be extracted from each, collecting them in a single list:
fd -e json -x lychee --dump > links_list.md
Now, as you can imagine, it will be easy to point lychee’s link-checking function at this list. In the case of the “Closing the Gap” database, the “dump” process yields a list of nearly 2,200 URLs. By the time that duplicates are weeded out and various invalid links are skipped over (see below), we end up with more like 1,325 URLs to check.
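If you want to inspect the deduplication yourself before handing the list to lychee, standard shell tools suffice. A minimal sketch (the printf line fabricates a tiny stand-in for the real dumped list; the URLs are placeholders, not entries from our database):

```shell
# Fabricate a small link list with one duplicate, as a stand-in
# for the output of the fd/lychee --dump pipeline.
printf 'https://a.example/\nhttps://b.example/\nhttps://a.example/\n' > links_list.md

# Sort the list and drop duplicate lines.
sort -u links_list.md > unique_links.md

# Count the unique URLs that remain (2 in this toy example).
wc -l < unique_links.md
```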
Checking Links
This is, in a way, the easy part. Assuming that we’re in the root of the “Closing the Gap” repository and have generated the links_list.md file, we can run the following command (with a few options set, to be explained below):
lychee --max-concurrency 16 -m 3 links_list.md
The option --max-concurrency 16 tells lychee not to attempt to check more than 16 URLs at a time. We set this limit after finding that, by default, lychee would try to work too quickly, generating spurious errors. Feel free to remove this option or adjust the value to something that works on your machine. Just be on the lookout for link-checking errors caused by the submission of too many HTTP requests at once.
As for the option -m 3, it sets a limit on the number of times that lychee will allow a link to be redirected. This is somewhat subjective. It is normal for one URL to redirect to another, and even for a chain of several redirects to occur before the web client receives a final, substantive response. At the same time, we have found that a large number of redirects is sometimes indicative of an actual problem. For example, the website of a DH project may have been taken offline, with the host university configuring its servers so that links to that site are redirected to the university homepage. These are in fact broken links, but lychee would consider them valid, because they eventually lead to a successful response (albeit for a different resource). We can test for problems like this, to an extent, by limiting redirects.
How much redirection is too much? Again, this is subjective, but for the links in our database, we have found that we can fairly easily set a limit of 3. If we lower this to 2, lychee errors on more innocuous redirects—but we can manage this by updating the URLs in question. In fact, we have been using a limit of 2 internally, since we prefer to err on the side of strictness and to accept the occasional tedium that it produces. A limit of 3 is what we can more comfortably recommend to others.
Dealing with Errors
If you manage a website or a database that contains a substantial number of links, then you will soon encounter “errors that aren’t really errors,” or “errors that aren’t our fault.” Pages sometimes go down temporarily. An HTTP request can fail for any of a huge variety of reasons. And, it seems to us, a growing number of servers are configured to blanket-deny requests from command-line tools, presumably for security reasons and/or to make scraping more difficult. (The same libraries that allow lychee to check hundreds or thousands of URLs in a matter of seconds could be used by bad actors to launch DDoS attacks.) So there are links that you will find impractical to check programmatically.
There are ways to mitigate this problem—basically, to avoid sending requests to URLs that are guaranteed to fail, and to spare yourself the hassle of being bombarded with error messages that you have no way of fixing. With lychee, you can specify in a configuration file which links it should skip over. Since this post is already more than long enough, we will not go into the details here; but you can read about this in the lychee documentation, and look at the lychee.toml file in the “Closing the Gap” repository. You will see, for example, that we do not bother checking any links to the website of the British Library, since it denies access to command-line tools.
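To give a hedged sense of what such a configuration can look like, here is a sketch of a lychee.toml. The values and regex patterns below are illustrative assumptions, not our actual configuration; consult the lychee documentation and our repository’s lychee.toml for the real thing:

```toml
# Illustrative lychee.toml sketch; not our actual configuration.
max_concurrency = 16
max_redirects = 3

# Skip URLs matching these regular expressions.
exclude = [
  # Hypothetical pattern for a host that denies CLI clients.
  "https://www\\.bl\\.uk",
]
```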
The inevitability of encountering a large number and variety of errors with link-checking also makes it challenging to automate the process. For the time being, we are checking links in the “Closing the Gap” database manually—that is, manually triggering a command, which then runs automatically and reports a list of errors. As long as we make sure to do this once in a while, we can keep up with actual cases of link breakage, while ignoring errors that are not actionable. We do hope to add, at some point, a further degree of automation to this process: lychee can be run in GitHub Actions. Even then, however, we would have the link-checking workflow run at a certain interval (weekly?), simply to generate a list of errors. It would still be up to us to determine which errors we can fix, and to do so.
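For those who want to experiment with that kind of automation, a minimal sketch of a scheduled workflow using the official lycheeverse/lychee-action is given below. The file path, schedule, and arguments are hypothetical, and this assumes that a links_list.md has already been committed to the repository:

```yaml
# .github/workflows/links.yml (hypothetical path)
name: Check links
on:
  schedule:
    - cron: "0 6 * * 1"  # weekly, Monday morning
  workflow_dispatch:     # allow manual runs as well
jobs:
  link-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: lycheeverse/lychee-action@v1
        with:
          args: --max-concurrency 16 -m 3 links_list.md
```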
Conclusions
As has been explained above, our twofold aim in checking URLs is to keep our database clean and up to date (to the extent feasible), and to contribute to best practices in the community of Multilingual DH researchers. The Internet will always be a chaotic world, and that’s ok. We should just invest a bit of effort so that the projects on which we work remain findable and accessible over a long enough period that the public can benefit from them.
We encourage you to follow the work of “Closing the Gap in Non-Latin-Script Data” via our GitHub repository, our website, and our page under the website of the Seminar for Semitic and Arabic Studies at the Freie Universität Berlin. This post was written by Dr. Theodore Beers, who sometimes discusses topics relevant to Digital Humanities research on X/Twitter.