{"@attributes":{"version":"2.0"},"channel":{"title":"A research (b)log","link":"https:\/\/alix-tz.github.io\/phd\/","description":"A research (b)log created during my phD in Digital Humanities.","language":"en","copyright":"Contents \u00a9 2026 <a href=\"https:\/\/alix-tz.github.io\/phd\/\">Alix Chagu\u00e9<\/a> CC-BY","lastBuildDate":"Fri, 24 Apr 2026 15:25:41 GMT","generator":"Nikola (getnikola.com)","docs":"http:\/\/blogs.law.harvard.edu\/tech\/rss","item":[{"title":"026 -- The Haunting of Reviewer 3 (FR)","link":"https:\/\/alix-tz.github.io\/phd\/posts\/026\/","description":"<p>Il y a quelques ann\u00e9es, j'ai d\u00e9couvert le mythe du <a href=\"https:\/\/www.reddit.com\/r\/labrats\/comments\/el60yo\/reviewer_2\/#lightbox\">Reviewer 2<\/a> \u00e0 l'occasion d'une journ\u00e9e costum\u00e9e organis\u00e9e par les membres de mon \u00e9quipe \u00e0 Inria pour Halloween. Reviewer 2, portraitur\u00e9 dans nos couloirs sous les traits d'un fant\u00f4me impassible, c'est l'\u00e9valuateur\u00b7rice dont le rapport et la note, s\u00e9v\u00e8re, r\u00e9duisent \u00e0 n\u00e9ant les espoirs d'acceptation des (jeunes) chercheur\u00b7ses soumettant \u00e0 des journaux ou des conf\u00e9rences, souvent en d\u00e9pit d'une premi\u00e8re \u00e9valuation positive. <\/p>\n<p>Je repense \u00e0 ce mythe alors que je participe \u00e0 la campagne d'\u00e9valuation des soumissions \u00e0 la <a href=\"https:\/\/dh2026.adho.org\/cfp\/\">conf\u00e9rence annuelle d'ADHO<\/a> qui est organis\u00e9e \u00e0 Daejeon cet \u00e9t\u00e9. Cette ann\u00e9e, l'\u00e9valuation par les paires est organis\u00e9e en double-aveugle, c'est-\u00e0-dire qu'en tant qu'\u00e9valuateur\u00b7rices, nous ne sommes pas cens\u00e9\u00b7es conna\u00eetre l'identit\u00e9 des auteur\u00b7rices de la proposition, et iels ignorent la n\u00f4tre. Pour la premi\u00e8re fois, je remarque qu'apr\u00e8s avoir soumis les \u00e9valuations, il m'est possible de voir le d\u00e9tail des autres \u00e9valuations qui concernent une soumission que j'ai trait\u00e9e. Je trouve que ce format est int\u00e9ressant \u00e0 plusieurs titres.<\/p>\n<p>Premi\u00e8rement, il a un int\u00e9r\u00eat didactique : il contribue potentiellement \u00e0 l'apprentissage de l'art de l'\u00e9valuation. A ma connaissance, savoir faire des \u00e9valuations pour les paires est une comp\u00e9tence compl\u00e8tement absente des cursus de formation des chercheur\u00b7ses. En g\u00e9n\u00e9ral, c'est une expertise qui s'acquiert avec le temps et par l'exposition \u00e0 l'\u00e9valuation, soit lorsque l'on re\u00e7oit des \u00e9valuations, soit quand on est mis en situation d'en faire. Apprendre \u00e0 faire de bonnes \u00e9valuations prend du temps. A mon avis, l'absence de v\u00e9ritable m\u00e9thodologie\/\u00e9thique de l'\u00e9valuation dans la formation des chercheur\u00b7ses est un probl\u00e8me \u00e9norme \u00e9tant donn\u00e9 que l'ensemble de la structure de production du savoir moderne repose sur la validation des travaux des paires par l'\u00e9valuation. Mais c'est une bien trop grande question pour un billet de blog, et je reviens donc \u00e0 l'int\u00e9r\u00eat didactique de pouvoir acc\u00e9der aux \u00e9valuations propos\u00e9es par les autres : si on a la curiosit\u00e9 de les lire (on devrait), \u00e7a permet de confronter notre appr\u00e9ciation avec celle d'autres sur un travail pour lequel on n'a pas d'attachement personnel. 
You can identify the weaknesses of your own reviews, or conversely confirm that flagging a weakness of the submission, as you did, was not a <em>pet peeve<\/em> and that your observation was legitimate.<\/p>\n<p>Second, this format lets you contribute to the evaluation... of the other evaluations. The goal of double-blind review is to reduce the risk of bias (typically: avoiding a negative or positive prejudice caused by the authors' affiliation, nationality, or identity) and to remove the fear of retaliation (for example, not daring to write a negative review of a paper from a lab where you hope to be hired later). But the double-blind setup also has the drawback of reducing the reviewers' <em>accountability<\/em>, with the risk of letting inappropriate behavior develop in the reports. In my opinion, being able to read the reports written by the other reviewers makes it possible to spot such inappropriate behavior and bring it to the attention of the scientific committee, which can lift a reviewer's anonymity if necessary and open a dialogue with them. It is the same logic as peer review as implemented on platforms like <a href=\"https:\/\/pubpeer.com\/\">PubPeer<\/a>: making reviews public helps guarantee their quality and fairness (even though, in that case, the review is completely open).<\/p>\n<p>Third, access to the other review reports lets you take the pulse of the scientific quality of the conference, and of the scientific community concerned more generally. If, after seriously reviewing a submission, I access the other reviews and find that our scores and remarks converge, and that they show that (almost) all the reviewers read the submission as attentively as I did, I am reassured. In the end, it means there is a good chance the conference will offer a program of good, scientifically rigorous quality. This helps build and sustain the conference's legitimacy, beyond the mere quality of the submitted proposals and of the presentations given. It is a disappointment on this last point that leads me to write this post.<sup id=\"fnref:quatrieme-point\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/026\/#fn:quatrieme-point\">1<\/a><\/sup><\/p>\n<p>Unfortunately, after reading the other reviews of the papers I evaluated, I have started dreading a new type of Reviewer. I will call them Reviewer 3. Reviewer 3 is the reviewer who gives a maximum score to a submission they have not read, and who lets a generative AI model do the review and write the comments in their place. Seriously?! 
And yet we know that these models have a strong tendency to reinforce the ideas of the humans who interact with them, which makes them very, very bad tools for critical, scientific evaluation.  <\/p>\n<p>Let me take the example of one of the proposals I reviewed this year. Here are the scores the proposal received:  <\/p>\n<ul>\n<li>Reviewer 1 (me): 65\/100<\/li>\n<li>Reviewer 2 (living up to the name): 35\/100<\/li>\n<li>Reviewer 3: 86\/100<\/li>\n<li>Reviewer 4: 80\/100<\/li>\n<\/ul>\n<p>There is clearly disagreement among peers about the quality of the proposal, so it is interesting to look at the detailed comments. My score broadly reflects the content of my review: the proposal has potential, and I judge that it should be accepted, but in another format and with serious points to improve. Reviewer 2's score also reflects their assessment: the scientific character of the paper is not established, and its foundations are not solid. I was surprised by the scores from Reviewers 3 and 4, which I find high relative to the proposal: being kind is good, but in a review process, being able to situate proposals accurately matters too. Reviewer 4's comments show that they read the submission; their standard is simply different from mine, which is another kind of problem. Reviewer 3's review, on the other hand, is problematic.<\/p>\n<p>Reviewer 3's comments have the typical characteristics of content generated by a product like ChatGPT: uniform paragraph lengths and phrasing, excessive paraphrasing of the submission's content, anecdotal details singled out as if they were central, a lack of critical perspective, and redundant formulations. We read that the proposal is powerful, properly grounded, that it stands out, even that it is revolutionary -- yet I know that it cites only 2 scientific references, poorly defines its disciplinary framework, and presents no exceptional conclusions. We also read that the text is clear and unusually solid for a submission in Digital Humanities -- which I actually find cynical.<\/p>\n<p>Reading this review, I mostly wonder: what's the point? What's the point of taking part in the review process if it is to produce a review like this? What's the point of lying to the submission's authors and flattering them in this way? I would have preferred that Reviewer 3 not write their review at all. 
I would rather they apologize to the organizing committee for their lack of availability and refrain from pretending to have done their work for the scientific community.<\/p>\n<p>In fact, I see two major problems with this mindless use of AI, two fundamental problems: it misunderstands the pedagogical value of reviewing, and it misunderstands the role of the actors of research.<\/p>\n<p>On the pedagogical value: there is almost always something to learn from reading a review report. Relevant scientific work that escaped our radar, new lines of thought that can feed future work, weaknesses we had not clearly identified, good practices to improve, etc. This means that peer review is an integral part of the <strong>scientific dialogue<\/strong> that allows Science to develop. Moreover, the papers submitted to ADHO's conference do not come only from researchers in permanent positions; they also come from young researchers, doctoral students whose training is far from over<sup id=\"fnref:not-blind\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/026\/#fn:not-blind\">2<\/a><\/sup> and for whom responding to ADHO's <em>Call for Proposals<\/em> is a way to learn to do better. Pointing out the weaknesses of a proposal (kindly, of course) instead of describing it as exceptional is a positive contribution to the training of these young researchers. It is damaging that, on the contrary, some reviewers seem not to understand this pedagogical role.<\/p>\n<p>As I wrote in my <a href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\">previous post<\/a>, which dealt with the lack of critical perspective on AI, I worry about the trajectory that an unreasoned use of AI sets us on when it is applied to the intellectual and critical-thinking tasks that are supposed to be at the heart of the scientific approach. Are we cheerfully heading toward a generalized pantomime of research? Research built on pretense: the pretense of knowing one's theoretical framework, the pretense of having designed and conducted an experiment, the pretense of having obtained results, the pretense of having written an article, the pretense of having reviewed an article, the pretense of having read an article? Where does it stop? If every stage of scientific production is sclerotized by mindless uses of generative AI like this one, what is the point of doing research? If we have questions, we might as well just ask Claude or ChatGPT for the answers, right?  
<\/p>\n<p><em>ACKNOWLEDGMENTS: A thousand thanks to Margot Mellet and Mathilde Verstraete, who read me carefully and pointed out some of the far too many typos left in the first version of this post.<\/em><\/p>\n<div class=\"footnote\">\n<hr>\n<ol>\n<li id=\"fn:quatrieme-point\">\n<p>I would gladly add a fourth point: the fact that I could only see the other reviews after submitting my own motivated me to complete all the reviews that had been assigned to me.\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/026\/#fnref:quatrieme-point\" title=\"Jump back to footnote 1 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn:not-blind\">\n<p>In the example I use, it so happens that the author of the submission had poorly concealed their identity, and I was able to confirm that they were enrolled in a doctoral program and had submitted the proposal as sole author. So we are indeed in a case where the review has strong pedagogical potential.\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/026\/#fnref:not-blind\" title=\"Jump back to footnote 2 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<\/ol>\n<\/div>","category":["French blog posts","Generative AI","peer reviewing"],"guid":"https:\/\/alix-tz.github.io\/phd\/posts\/026\/","pubDate":"Wed, 28 Jan 2026 15:46:54 GMT"},{"title":"025 - A Perfect Job is the New Very Good Job","link":"https:\/\/alix-tz.github.io\/phd\/posts\/025\/","description":"<blockquote>\n<p>A little disclaimer: this post is not a personal rant against Dan Cohen. I do not know him, nor his work.<\/p>\n<p>A second disclaimer: I moved the original French version of this post here: <a href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025-fr\">posts\/025-fr.md<\/a>.<\/p>\n<\/blockquote>\n<p>Earlier this week, my colleague Louis-Olivier Brassard asked me for my opinion on the <a href=\"https:\/\/newsletter.dancohen.org\/archive\/the-writing-is-on-the-wall-for-handwriting-recognition\/\">latest post<\/a> by Dan Cohen, which he titled \"<em>The Writing Is on the Wall for Handwriting Recognition<\/em>\", adding a subtitle that sets the tone: \"<em>One of the hardest problems in digital humanities has finally been solved<\/em>\". I wanted to make my critical reading a bit more public, so I'm turning it into a blog post<!--, in French for once-->.<\/p>\n<p>I read this article carefully because the subject is of interest to me (obviously), but I must admit that I usually start this kind of reading with a negative <em>a priori<\/em>. This is the treatment I reserve for all those posts, whether on blogs or social media, announcing left and right that generative AI has revolutionized this or that -- this and that generally being problems that have occupied researchers and engineers for years, and that give rise to debates that are sometimes heated, even intractable. All these posts help fuel the hype around generative AI and undermine our already well-worn collective ability to think critically about it.<\/p>\n<p>Dan Cohen's post follows the release of version 3 of Gemini, Google's generative AI model, publicized as Google's \"most intelligent model yet\". Every time a new model of this type is released, several users share the results of their \"experiments\" with these models. 
Dan Cohen is not the only one; for example, Mark Humphries also posted <a href=\"https:\/\/generativehistory.substack.com\/p\/gemini-3-solves-handwriting-recognition\">a post on the subject<\/a> on the same day, soberly titled \"<em>Gemini 3 Solves Handwriting Recognition and it\u2019s a Bitter Lesson<\/em>\". I saw these two posts widely shared on Bluesky, praised by researchers whom I consider to hold positions of authority in the field of automatic transcription. After reading Dan Cohen's post, I found myself quite annoyed by these shares: I'm not convinced that the text was carefully read by those who shared it on Bluesky.<\/p>\n<p>In my opinion, the problem with Dan Cohen's post is twofold: 1) he develops a universal discourse on a tool that he has only tested on a minimal selection of examples that say almost nothing about the problems encountered by users of automatic transcription on old documents, 2) his demonstration relies on fallacious arguments.<\/p>\n<h3>A matter of scientific rigor<\/h3>\n<p>About the first point: Dan Cohen uses three examples that are not at all representative of the challenges of automatic transcription. Right from the start, this would justify a footnote to his subtitle: he says \"<em>one of the hardest problems in digital humanities has finally been solved<\/em>\", I add \"<em>as far as it concerns epistolary documents written in English during the first half of the 19th century by personalities whose biographies have been written, or whose correspondence has already been edited<\/em>\"<sup id=\"fnref:precision_inedit\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fn:precision_inedit\">1<\/a><\/sup> because that's what he tested. That already reduces the scope of his results quite a bit, doesn't it? Moreover, given that the model fails to transcribe the third example, we could even add that this only concerns documents with a simple layout.<sup id=\"fnref:standard_layout\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fn:standard_layout\">2<\/a><\/sup><\/p>\n<p>This first point is really problematic because this post is a text published by a person who has scientific authority and should therefore demonstrate scientific rigor, even if we are only talking about a newsletter and not an edited article or book. In keeping with this scientific rigor, I would expect us to limit ourselves to drawing conclusions about what has actually been demonstrated instead of prophesying doom with flashy (sub)titles. One can be convinced that Gemini is capable of successfully handling many other cases than those presented by Dan Cohen, but that is a matter of belief, not scientific demonstration. 
I think this is a topic that needs to be discussed more broadly, in a context where AI is messianically served up to us in every possible form, but Marcello Vitali-Rosati discusses it well in <a href=\"https:\/\/blog.sens-public.org\/marcellovitalirosati\/2025-11-htr.html\">his latest post<\/a>; from another angle, outside academic uses, there is also the recent work of <a href=\"https:\/\/www.polytechnique-insights.com\/tribunes\/digital\/comment-se-proteger-du-syndrome-de-stockholm-technologique-face-a-lia\/\">Hamilton Mann<\/a>.<\/p>\n<p>It so happens that on the day Louis-Olivier asked me to read Dan Cohen's text, I had also read the one by <a href=\"https:\/\/digitalorientalist.com\/2025\/11\/25\/teaching-bengali-digital-texts-to-anglophone-undergraduates-what-voyant-reveals-about-the-infrastructural-bias-of-dh-tools\/\">Sunayani Bhattacharya<\/a>, who trained her students at Saint Mary's College of California in text analysis with <a href=\"https:\/\/voyant-tools.org\/\">Voyant Tools<\/a> and who also touched on automatic transcription in passing in her post. She explains that, in order to open her students up to the Global South, she had them work on texts in Bengali (even though none of them speaks or reads Bengali). I find the exercise interesting and promising as she presents it. After getting her students familiar with what properly edited Bengali press texts look like in Voyant Tools, she showed them what you get when you run Voyant Tools on texts taken straight out of OCR software. These texts contain a lot of noise and sometimes do not even use the correct character sets. This allows her to give her students a very concrete example of the limitations of software infrastructures when it comes to processing texts in Indic languages. She concludes by reiterating the usefulness of giving students a better idea of what anglophone biases in technology look like on the ground. In a text like the one I discuss in this post, this anglophone (and, I would add, modernist) bias is blatant.<\/p>\n<h3>A shaky demonstration<\/h3>\n<p>As for the second point, it requires taking a closer look at what Dan Cohen tells us and at the examples he gives. There are inaccuracies that need to be pointed out, but also excerpts that do not match the statements made in the post.<\/p>\n<p>Let's start with an inaccuracy that concerns, fittingly, the question of model accuracy. I have already discussed this in <a href=\"https:\/\/alix-tz.github.io\/phd\/posts\/012\">a previous post<\/a> because it seems to me that this is one of the topics where researchers are laziest: what accuracy are we talking about, and what are the limits of these accuracy measures? Dan Cohen states that \"<em>the best HTR software struggles to reach 80% accuracy<\/em>\". Since he clarifies that this means 2 wrong words out of every 10, we can already see that he is talking about word error rate, not character error rate. Such an error rate, on its own, says nothing about the readability of the text, since a single wrong character is enough for a whole word to be counted as wrong. In a sentence like \"<em>the hardest problem in digtial humaities has finolly beeen sol ved<\/em>\", one word out of two contains a mistake, yet it seems to me that the sentence is perfectly readable.<sup id=\"fnref:lisible\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fn:lisible\">3<\/a><\/sup> To put things into perspective, the character accuracy rate in this same sentence is 90.77% (according to software like <a href=\"https:\/\/huggingface.co\/spaces\/lterriel\/kami-app\">KaMI<\/a>). In addition to this initial imprecision, Dan Cohen's statement about the difficulties of traditional software seems false to me. I do not see what source he bases it on. For documents like those he tests, we are well above 80% accuracy, even at the word level, and this with several models and several pieces of software using RCNNs or Transformers.<\/p>
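<p>To make the difference between the two measures concrete, here is a minimal Python sketch (my own illustration, not KaMI's actual implementation) that computes both accuracy rates for the example sentence, normalizing a plain Levenshtein edit distance by the length of the reference:<\/p>\n<div class=\"code\"><pre class=\"code literal-block\"># Minimal sketch: character- vs word-level accuracy from a plain\n# Levenshtein edit distance normalized by the reference length.\ndef levenshtein(ref, hyp):\n    # Dynamic-programming edit distance; insertions, deletions and\n    # substitutions all cost 1. Works on lists of characters or words.\n    prev = list(range(len(hyp) + 1))\n    for i, r in enumerate(ref, 1):\n        curr = [i]\n        for j, h in enumerate(hyp, 1):\n            cost = 0 if r == h else 1\n            curr.append(min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))\n        prev = curr\n    return prev[-1]\n\ndef accuracy(reference, hypothesis, level='char'):\n    ref = list(reference) if level == 'char' else reference.split()\n    hyp = list(hypothesis) if level == 'char' else hypothesis.split()\n    return 1 - levenshtein(ref, hyp) \/ len(ref)\n\nreference = 'the hardest problem in digital humanities has finally been solved'\nhypothesis = 'the hardest problem in digtial humaities has finolly beeen sol ved'\n\nprint(round(accuracy(reference, hypothesis, 'char') * 100, 2))  # 90.77\nprint(round(accuracy(reference, hypothesis, 'word') * 100, 2))  # 40.0\n<\/pre><\/div>\n<p>The same six edit operations yield a 90.77% character accuracy but only a 40% word accuracy here, which is why a word-level figure quoted on its own can make a perfectly readable transcription sound catastrophic.<\/p>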
<p>Since this initial statement surprised me, I wanted to take a closer look at Transkribus' output to see whether it really made that many errors. Of course, there are errors in Transkribus' transcriptions. Yet, when we look at the source document, we see that some of these errors are understandable in a zero-shot context. When Boole draws two \"l\"s in a row, his second \"l\" looks like an \"e\" with a very, very small loop. This explains why Transkribus' prediction contains errors on \"<em>tell<\/em>\" (read as \"<em>tele<\/em>\") on the left page, and \"<em>All<\/em>\" (read as \"<em>Ale<\/em>\") on the right page. To find out the real extent of Transkribus' errors, I made my own transcription of the double page tested by Dan Cohen, line by line (following the line order taken from the segmentation in Transkribus, and drawing a little on the reading proposed by Gemini<sup id=\"fnref:ordre_lignes\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fn:ordre_lignes\">4<\/a><\/sup>). When I calculate the accuracy rate on this excerpt, I get a character accuracy of about 95% and a word accuracy of 88%.<sup id=\"fnref:precision_error_tkb\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fn:precision_error_tkb\">5<\/a><\/sup> So there is plenty of room for improvement, but we are not in the catastrophic situation the preamble suggests.<\/p>\n<p>If we now turn to the transcription generated by Gemini, we can see that it actually contains some errors as well, whereas Dan Cohen tells us that \"<em>Gemini transcribed the letter perfectly<\/em>\". For example, Gemini transcribes, on the right page, \"<em>occasionally by<\/em>\",<sup id=\"fnref:occasion_by\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fn:occasion_by\">6<\/a><\/sup> generating as additional precision in a notes section that \"<em>On the right page (line 8), the handwriting becomes very scribbled. It appears to say 'take a long walk occasionally try &amp; once or twice...' or possibly 'occasionally by &amp; once or twice...'.<\/em>\" Gemini fails here to read a hyphenation that makes sense and prefers to add a word to its transcription. The problem, of course, is not that Gemini did not produce a perfect transcription, but rather that Dan Cohen claims it did without noting this error.<\/p>\n<p>We have the same issue in the second example, where Gemini formats the word \"transmitted\" to indicate that it is crossed out in the source when it is not. 
The text generated by Gemini leaves no doubt about the appearance of the text in the source, and invents an intention on the part of the author: \"<em>In the second line of the body, the word 'transmitted' is crossed out in the original text, but the sentence is grammatically incomplete without it (or a similar verb). It is likely the author meant to replace it to avoid repetition with the word 'transmitting' appearing a few lines later but forgot to insert the new word.<\/em>\" Even though this error was easier to spot, Dan Cohen once again tells us: \"<em>Another perfect job.<\/em>\"<\/p>\n<p>Then comes the third example. Gemini does not offer a complete transcription of this one, and after a few lines, generates a message indicating that the text is illegible beyond a certain point. This allows Dan Cohen to conclude: \"<em>Gemini does the right thing here: rather than venture a guess like a sycophantic chatbot, it is candid when it can\u2019t interpret a section of the letter.<\/em>\" I personally choke reading that, given the errors already noted in the two previous examples. Contrary to what Dan Cohen claims, there is no candor here, but rather a perverse effect of what I imagine is a calibration of the model based on its perplexity rate. In the first two examples, we can imagine that the model's perplexity over certain difficult passages leads to the generation of a note and\/or an insert in brackets, but does not prevent the generation of a false transcription. It goes all the more unnoticed because the explanations generated in the notes sound good, even though they are false. We are not dealing with a candid robot, but with a scammer chatbot, a presti-generator that finds an escape route when the situation is too big for a subtle feint. And in my opinion, it is high time users of this software took this reality on board, looking all the more closely when they proofread what these tools generate.<\/p>\n<p>I haven't yet read <a href=\"https:\/\/generativehistory.substack.com\/p\/gemini-3-solves-handwriting-recognition\">Mark Humphries' post<\/a> that I mentioned at the beginning, but I might come back to the subject in the future. To be honest, what I find really, really unfortunate about these publications coming from the academic world, which help fuel the hysteria around generative AI, is that they give me the impression that Salvation will decidedly not come from the scientific community. As a citizen and a young researcher, I find this very worrying.<\/p>\n<p><em>EDIT: 2025-12-01: Minor corrections and addition of another footnote.<\/em><\/p>\n<p><em>EDIT: 2025-12-04: Translated the post to English (with the help of Copilot) and moved the French version to another path: <a href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025-fr\">posts\/025-fr.md<\/a>.<\/em><\/p>\n<div class=\"footnote\">\n<hr>\n<ol>\n<li id=\"fn:precision_inedit\">\n<p>I give this detail about published biographies and edited correspondence because it is important: Dan Cohen did not pick documents that we can be sure are unpublished. Given that generative AI models are trained on everything that can be found on the Web, this means that these letters may have, in one way or another, been part of the batches used for training. 
For example, <a href=\"https:\/\/foinse.ucc.ie\/en\/records\/IE\/BL\/PP\/BP\/1\/A\/1\/1\/51?utm_source=dancohen&amp;utm_medium=email&amp;utm_campaign=the-writing-is-on-the-wall-for-handwriting-recognition\">on the website<\/a> of the Archives of University College Cork, from which the digitization of Boole's letter is taken, we find the following text in the description field: \"<em>Boole in Cork to Maryann. He is in a very depressed mood, life has become monotonous with only his work adding interest to the day. He enjoys playing the piano but 'it would be better with someone else to listen and to be listened to'. He is also very annoyed by [Cropers] dedicating his book to him without first asking for permission - 'I cannot help feeling that he has taken a great liberty' - and speaks in strong terms of [Cropers] 'pretensions to high morality'. He invites and urges Maryann to visit him as soon as their mother's health would allow. He feels the climate would do her good.<\/em>\" These are contextual elements that can help a model when transcribing.\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fnref:precision_inedit\" title=\"Jump back to footnote 1 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn:standard_layout\">\n<p>I purposefully use the term \"simple layout\" rather than \"standard layout\" because the phenomenon illustrated by the third example, writing again on the same sheet after turning it 90\u00b0, corresponds to a practice found at least until the mid-20th century.\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fnref:standard_layout\" title=\"Jump back to footnote 2 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn:lisible\">\n<p>By readable, I mean that one does not need to know what the original sentence was in order to understand what should have been read in place of the errors. I admit, however, that this readability may vary depending on familiarity with the text or the language, or on the nature of the errors. If you still find this sentence unreadable, it should be read as follows: \"the hardest problem in digital humanities has finally been solved\". There was one letter inversion in \"<em>digital<\/em>\", one missing letter in \"<em>humanities<\/em>\", one letter substituted for another in \"<em>finally<\/em>\", one extra letter in \"<em>been<\/em>\" and an inappropriate separation in \"<em>solved<\/em>\".\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fnref:lisible\" title=\"Jump back to footnote 3 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn:ordre_lignes\">\n<p>Let me expand briefly on the question of the layout. In Gemini's transcription, there are additional pieces of information suggesting that the model correctly identified which part of the text corresponds to which page. In Transkribus' transcription, this is not the case, but I think that is because Dan Cohen only used Transkribus' basic web page for testing models. If he had used the full version of Transkribus, I'm sure the software would also have perfectly identified the double-page layout. 
As for the line-by-line transcription, we no longer have this information in Gemini's output, which generates the text continuously.\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fnref:ordre_lignes\" title=\"Jump back to footnote 4 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn:precision_error_tkb\">\n<p>Among the errors made by Transkribus, we can also note the use of a \"<a href=\"https:\/\/www.compart.com\/en\/unicode\/U+0432\">\u0432<\/a>\" (the Cyrillic v) to transcribe the \"B\" in the margin of the document, and of a \"<a href=\"https:\/\/www.compart.com\/en\/unicode\/U+0440\">\u0440<\/a>\" (the Cyrillic r) to transcribe the \"P\" that follows. These errors escape us during a quick visual check and do not hinder reading by humans, but they lower the automatically calculated accuracy, since a \u0432 is not a B and a \u0440 is not a P, nor indeed a p (see what I did here?).\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fnref:precision_error_tkb\" title=\"Jump back to footnote 5 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn:occasion_by\">\n<p>Transkribus transcribed it as \"<em>occasion by<\/em>\".\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/025\/#fnref:occasion_by\" title=\"Jump back to footnote 6 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<\/ol>\n<\/div>","category":["evaluation","French blog posts","Generative AI","HTR","Large Language Models","literature review"],"guid":"https:\/\/alix-tz.github.io\/phd\/posts\/025\/","pubDate":"Fri, 28 Nov 2025 21:50:54 GMT"},
{"title":"024 - The messy backstage of a literature review","link":"https:\/\/alix-tz.github.io\/phd\/posts\/024\/","description":"<p>A few weeks ago, I began a thorough review of articles published in four digital humanities venues to track mentions of automatic text recognition and understand how, where, and why scholars use it. Although I wish I had started sooner in my doctoral journey, I stay positive, holding on to the idea that \"it's never too late.\" I am learning a lot about Digital Humanities as a field of research and gaining a better understanding of ATR's presence in the field.<\/p>\n<p>While catching up on our dissertation progress, I was telling Roch Delanney about the survey I'm conducting, my goals for it, and how I selected and sorted the articles. Roch suggested that I share my method more widely. It is a little clumsy at times, but it also draws on many different skills I have learned and sharpened over the years, so I think it is indeed interesting to share a bit of my <em>cuisine<\/em>.<\/p>\n<h3>Perimeter of the literature review<\/h3>\n<p>My literature review focuses on four publication venues. I think they are, collectively, representative of research in the Digital Humanities: <\/p>\n<ul>\n<li>\n<p><a href=\"https:\/\/academic.oup.com\/dsh\"><em>Digital Scholarship in the Humanities<\/em><\/a> (DSH), which is presented by the Alliance of Digital Humanities Organizations (ADHO) as an international, peer-reviewed journal published by Oxford University Press on behalf of ADHO and the European Association for Digital Humanities (EADH). It was published under the title <em>Literary and Linguistic Computing: The Journal of Digital Scholarship in the Humanities<\/em> until 2014. I counted 174 volumes containing a total of 1,741 articles (excluding retracted articles, book reviews, editorials and committee reports) published from 1985 through the first half of 2025.<\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/dhq.digitalhumanities.org\/\"><em>Digital Humanities Quarterly<\/em><\/a> (DHQ) is an open-access peer-reviewed journal, probably more representative of research in North America. It is published by the Association for Computers and the Humanities (ACH). I counted a total of 790 articles published since its first issue in 2007. Most articles are in English.<\/p>\n<\/li>\n<li>\n<p>The <a href=\"https:\/\/jdmdh.episciences.org\/\"><em>Journal of Data Mining and Digital Humanities<\/em><\/a> (JDMDH) has been published by Episciences since 2017. 
In contrast to DHQ, its focus is more European-centric, and it has a special volume dedicated specifically to automatic text recognition (edited by Ariane Pinche and Peter Stokes). I found a total of 162 articles published in JDMDH, including the special volume on ATR.<\/p>\n<\/li>\n<li>\n<p>Lastly, the proceedings from the more recent <em>Computational Humanities Research<\/em> (CHR) conferences (see the <a href=\"https:\/\/2024.computational-humanities-research.org\/\">2024 conference proceedings<\/a> for example) offer a perspective on research focused on more intensively computational methods in the Humanities. The conference has been held annually since 2021. I found a total of 214 articles in the proceedings.<\/p>\n<\/li>\n<\/ul>\n<p>Aside from DSH, which I can access thanks to the library of the University of Montr\u00e9al, all the other journals are in open access. <\/p>\n<h3>Collecting the articles and their metadata<\/h3>\n<p>For JDMDH, articles are not centralized on the journal website but rather published on platforms like <a href=\"https:\/\/hal.archives-ouvertes.fr\/\">HAL<\/a> or <a href=\"https:\/\/arxiv.org\/\">arXiv<\/a> and sometimes <a href=\"https:\/\/zenodo.org\/\">Zenodo<\/a>. Getting an overview of the articles published in JDMDH is not straightforward, but it is possible to browse the articles <a href=\"https:\/\/jdmdh.episciences.org\/browse\/volumes\">volume by volume<\/a>. I opened and downloaded each article in each volume, and collected the article entries in Zotero using the Zotero connector. The process was cumbersome and required many clicks, but the variety of publishing platforms deterred me from writing a script to automate the downloading process.  <\/p>\n<p>CHR, on the other hand, was very easy to scrape, partly because there are only four volumes of proceedings so far. For each set of proceedings, the index of all articles is compatible with the batch-import scenario of the Zotero connector. To collect the PDFs, I used a section of the HTML page and regular expressions to identify the links to the PDF files, creating a list of URLs. Finally, I used a Python script to download the PDFs to my computer.  
<\/p>\n<p>For example, in <a href=\"https:\/\/ceur-ws.org\/Vol-2989\/\">https:\/\/ceur-ws.org\/Vol-2989\/<\/a>, the <code>ul<\/code> contains simple HTML elements pointing to the PDF files, such as: <\/p>\n<div class=\"code\"><pre class=\"code literal-block\"><span class=\"p\">&lt;<\/span><span class=\"nt\">h3<\/span><span class=\"p\">&gt;&lt;<\/span><span class=\"nt\">span<\/span> <span class=\"na\">class<\/span><span class=\"o\">=<\/span><span class=\"s\">\"CEURSESSION\"<\/span><span class=\"p\">&gt;<\/span>Presented papers<span class=\"p\">&lt;\/<\/span><span class=\"nt\">span<\/span><span class=\"p\">&gt;&lt;\/<\/span><span class=\"nt\">h3<\/span><span class=\"p\">&gt;<\/span>\n\n<span class=\"p\">&lt;<\/span><span class=\"nt\">ul<\/span><span class=\"p\">&gt;<\/span>\n  <span class=\"p\">&lt;<\/span><span class=\"nt\">li<\/span> <span class=\"na\">id<\/span><span class=\"o\">=<\/span><span class=\"s\">\"long_paper5\"<\/span><span class=\"p\">&gt;&lt;<\/span><span class=\"nt\">a<\/span> <span class=\"na\">href<\/span><span class=\"o\">=<\/span><span class=\"s\">\"long_paper5.pdf\"<\/span><span class=\"p\">&gt;<\/span>\n      <span class=\"p\">&lt;<\/span><span class=\"nt\">span<\/span> <span class=\"na\">class<\/span><span class=\"o\">=<\/span><span class=\"s\">\"CEURTITLE\"<\/span><span class=\"p\">&gt;<\/span>Entity Matching in Digital Humanities Knowledge\n      Graphs<span class=\"p\">&lt;\/<\/span><span class=\"nt\">span<\/span><span class=\"p\">&gt;&lt;\/<\/span><span class=\"nt\">a<\/span><span class=\"p\">&gt;<\/span>\n    <span class=\"p\">&lt;<\/span><span class=\"nt\">span<\/span> <span class=\"na\">class<\/span><span class=\"o\">=<\/span><span class=\"s\">\"CEURPAGES\"<\/span><span class=\"p\">&gt;<\/span>1-15<span class=\"p\">&lt;\/<\/span><span class=\"nt\">span<\/span><span class=\"p\">&gt;<\/span> <span class=\"p\">&lt;<\/span><span class=\"nt\">br<\/span><span class=\"p\">&gt;<\/span>\n    <span class=\"p\">&lt;<\/span><span class=\"nt\">span<\/span> <span class=\"na\">class<\/span><span class=\"o\">=<\/span><span class=\"s\">\"CEURAUTHOR\"<\/span><span class=\"p\">&gt;<\/span>Juriaan Baas<span class=\"p\">&lt;\/<\/span><span class=\"nt\">span<\/span><span class=\"p\">&gt;<\/span>,\n    <span class=\"p\">&lt;<\/span><span class=\"nt\">span<\/span> <span class=\"na\">class<\/span><span class=\"o\">=<\/span><span class=\"s\">\"CEURAUTHOR\"<\/span><span class=\"p\">&gt;<\/span>Mehdi M. Dastani<span class=\"p\">&lt;\/<\/span><span class=\"nt\">span<\/span><span class=\"p\">&gt;<\/span>,\n    <span class=\"p\">&lt;<\/span><span class=\"nt\">span<\/span> <span class=\"na\">class<\/span><span class=\"o\">=<\/span><span class=\"s\">\"CEURAUTHOR\"<\/span><span class=\"p\">&gt;<\/span>Ad J. Feelders<span class=\"p\">&lt;\/<\/span><span class=\"nt\">span<\/span><span class=\"p\">&gt;<\/span>\n  <span class=\"p\">&lt;\/<\/span><span class=\"nt\">li<\/span><span class=\"p\">&gt;<\/span>\n...\n<\/pre><\/div>\n\n<p>All I had to do was copy and paste this entire list into a text editor (I like to use <a href=\"https:\/\/www.sublimetext.com\/\">Sublime Text<\/a> in such a situation). Then, I used a simple regular expression like <code>href=\".+?\"<\/code> to select the value in the <code>a<\/code> element, which contains the links to the PDF files. I kept only the selected text and then rebuilt the complete URL with a couple of replacements such as <code>href=\"<\/code> -&gt; <code>\"https:\/\/ceur-ws.org\/Vol-2989\/<\/code> and <code>\"\\n<\/code> -&gt; <code>\",\\n<\/code>. 
At this point I just added square brackets around the selection, et voil\u00e0! I had a Python list ready to be passed to a script like the one below to download the files:<\/p>\n<div class=\"code\"><pre class=\"code literal-block\"><span class=\"n\">list_of_urls<\/span> <span class=\"o\">=<\/span> <span class=\"p\">[<\/span><span class=\"s2\">\"https:\/\/ceur-ws.org\/Vol-2723\/short8.pdf\"<\/span><span class=\"p\">,<\/span>\n                <span class=\"s2\">\"https:\/\/ceur-ws.org\/Vol-2723\/long35.pdf\"<\/span><span class=\"p\">,<\/span>\n                <span class=\"s2\">\"https:\/\/ceur-ws.org\/Vol-2723\/long44.pdf\"<\/span><span class=\"p\">,<\/span>\n                <span class=\"c1\">#...<\/span>\n                <span class=\"p\">]<\/span>\n\n<span class=\"kn\">import<\/span><span class=\"w\"> <\/span><span class=\"nn\">requests<\/span>\n<span class=\"kn\">import<\/span><span class=\"w\"> <\/span><span class=\"nn\">os<\/span>\n<span class=\"kn\">import<\/span><span class=\"w\"> <\/span><span class=\"nn\">time<\/span> <span class=\"c1\"># needed for the cool down between requests<\/span>\n<span class=\"kn\">from<\/span><span class=\"w\"> <\/span><span class=\"nn\">tqdm<\/span><span class=\"w\"> <\/span><span class=\"kn\">import<\/span> <span class=\"n\">tqdm<\/span> <span class=\"c1\"># it makes a progress bar so I know how long I can take to make a tea while the script runs<\/span>\n\n<span class=\"k\">for<\/span> <span class=\"n\">url<\/span> <span class=\"ow\">in<\/span> <span class=\"n\">tqdm<\/span><span class=\"p\">(<\/span><span class=\"n\">list_of_urls<\/span><span class=\"p\">):<\/span>\n    <span class=\"n\">r<\/span> <span class=\"o\">=<\/span> <span class=\"n\">requests<\/span><span class=\"o\">.<\/span><span class=\"n\">get<\/span><span class=\"p\">(<\/span><span class=\"n\">url<\/span><span class=\"p\">)<\/span>\n    <span class=\"k\">if<\/span> <span class=\"n\">r<\/span><span class=\"o\">.<\/span><span class=\"n\">status_code<\/span> <span class=\"o\">==<\/span> <span class=\"mi\">200<\/span><span class=\"p\">:<\/span>\n        <span class=\"n\">filename<\/span> <span class=\"o\">=<\/span> <span class=\"sa\">f<\/span><span class=\"s2\">\"<\/span><span class=\"si\">{<\/span><span class=\"n\">url<\/span><span class=\"o\">.<\/span><span class=\"n\">split<\/span><span class=\"p\">(<\/span><span class=\"s1\">'\/'<\/span><span class=\"p\">)[<\/span><span class=\"o\">-<\/span><span class=\"mi\">2<\/span><span class=\"p\">]<\/span><span class=\"si\">}<\/span><span class=\"s2\">-<\/span><span class=\"si\">{<\/span><span class=\"n\">url<\/span><span class=\"o\">.<\/span><span class=\"n\">split<\/span><span class=\"p\">(<\/span><span class=\"s1\">'\/'<\/span><span class=\"p\">)[<\/span><span class=\"o\">-<\/span><span class=\"mi\">1<\/span><span class=\"p\">]<\/span><span class=\"si\">}<\/span><span class=\"s2\">\"<\/span> <span class=\"c1\"># e.g. \"Vol-2723-short8.pdf\"<\/span>\n        <span class=\"k\">with<\/span> <span class=\"nb\">open<\/span><span class=\"p\">(<\/span><span class=\"n\">filename<\/span><span class=\"p\">,<\/span> <span class=\"s2\">\"wb\"<\/span><span class=\"p\">)<\/span> <span class=\"k\">as<\/span> <span class=\"n\">f<\/span><span class=\"p\">:<\/span>\n            <span class=\"n\">f<\/span><span class=\"o\">.<\/span><span class=\"n\">write<\/span><span class=\"p\">(<\/span><span class=\"n\">r<\/span><span class=\"o\">.<\/span><span class=\"n\">content<\/span><span class=\"p\">)<\/span>\n    <span class=\"k\">else<\/span><span class=\"p\">:<\/span>\n        <span class=\"nb\">print<\/span><span class=\"p\">(<\/span><span class=\"sa\">f<\/span><span class=\"s2\">\"Failed to download: <\/span><span 
class=\"si\">{<\/span><span class=\"n\">url<\/span><span class=\"si\">}<\/span><span class=\"s2\">\"<\/span><span class=\"p\">)<\/span>\n    <span class=\"n\">time<\/span><span class=\"o\">.<\/span><span class=\"n\">sleep<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">)<\/span>  <span class=\"c1\"># This cool down is to be polite to the server<\/span>\n<\/pre><\/div>\n\n<p>I used a similar approach for downloading the articles from DHQ because the <a href=\"https:\/\/dhq.digitalhumanities.org\/index\/title.html\">Index of Titles<\/a> lists all of the published articles on a single page. I first downloaded the HTML pages of the articles (DHQ publishes articles in HTML format as well as PDF). I also used regular expressions to extract the list of links and used a Python script to download the files.  <\/p>\n<p>Unfortunately, the Zotero connector only works on each article page individually, but not for batch-import on the index page. I investigated a bit to understand why it was so, and found that in the source code of each article page, there is a <code>span<\/code> element identified with the class <code>Z3988<\/code> that the Zotero connector uses to extract the metadata and create an entry in Zotero. In DHQ, these spans look like this:<\/p>\n<div class=\"code\"><pre class=\"code literal-block\"><span class=\"p\">&lt;<\/span><span class=\"nt\">span<\/span> <span class=\"na\">class<\/span><span class=\"o\">=<\/span><span class=\"s\">\"Z3988\"<\/span> <span class=\"na\">title<\/span><span class=\"o\">=<\/span><span class=\"s\">\"url_ver=Z39.88-2004&amp;amp;ctx_ver=Z39.88-2004&amp;amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;amp;rfr_id=info%3Asid%2Fzotero.org%3A2&amp;amp;rft.genre=article&amp;amp;rft.atitle=Academics%20Retire%20and%20Servers%20Die%3A%20Adventures%20in%20the%20Hosting%20and%20Storage%20of%20Digital%20Humanities%20Projects&amp;amp;rft.jtitle=Digital%20Humanities%20Quarterly&amp;amp;rft.stitle=DHQ&amp;amp;rft.issn=1938-4122&amp;amp;rft.date=2023-05-26&amp;amp;rft.volume=017&amp;amp;rft.issue=1&amp;amp;rft.aulast=Cummings&amp;amp;rft.aufirst=James&amp;amp;rft.au=James%20Cummings\"<\/span><span class=\"p\">&gt;<\/span> <span class=\"p\">&lt;\/<\/span><span class=\"nt\">span<\/span><span class=\"p\">&gt;<\/span>\n<\/pre><\/div>\n\n<p>I understood recently, while discussing with Margot Mellet, that Z3988 is a reference to the <a href=\"https:\/\/groups.niso.org\/higherlogic\/ws\/public\/download\/14833\/z39_88_2004_r2010.pdf\">OpenURL Framework Standard (NISO Z39.88-2004)<\/a>, which is used by the Zotero connector. Also, I should note that such spans are not systematically used in online journals. JDMDH for example doesn't use them, and serves the metadata in a different way.  <\/p>\n<p>Since I had already downloaded all the articles from DHQ as HTML files, I wrote a simple Python script that found all such spans for each downloaded article and aggregated them in a single, very simple HTML file. Then, I simply opened this page in my browser after emulating a local server<sup id=\"fnref:python_server\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/024\/#fn:python_server\">1<\/a><\/sup> (with a command like <code>python -m http.server<\/code>), and I was able to use the Zotero connector to import all the articles in a single click. It was very satisfying! The only downside is that I couldn't collect the articles' abstracts because they weren't included in the spans.  
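<\/p>\n<p>As an illustration, here is a minimal sketch of what such an aggregation script can look like (this is a reconstruction, not the exact script I used; the folder name <code>dhq_html\/<\/code> and the output file name are placeholders):<\/p>\n<div class=\"code\"><pre class=\"code literal-block\"># Sketch: collect the Z3988 metadata spans from locally saved DHQ article\n# pages and aggregate them into a single HTML file for batch import in Zotero.\nimport glob\nimport re\n\nspans = []\nfor path in glob.glob(\"dhq_html\/*.html\"):  # placeholder folder of saved articles\n    with open(path, encoding=\"utf-8\") as f:\n        html = f.read()\n    # The Z3988 spans are empty elements carrying all the metadata in their title attribute\n    spans.extend(re.findall(r'&lt;span class=\"Z3988\".*?&lt;\/span&gt;', html, re.DOTALL))\n\nwith open(\"aggregated.html\", \"w\", encoding=\"utf-8\") as out:\n    out.write(\"&lt;html&gt;&lt;body&gt;\\n\" + \"\\n\".join(spans) + \"\\n&lt;\/body&gt;&lt;\/html&gt;\")\n<\/pre><\/div>\n<p>The same approach should work for any journal that embeds these OpenURL spans in its article pages.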
<\/p>\n<p>DSH was different from the rest of the journals. Because of the longevity of the journal and the amount of articles it published, it was quite overwhelming. Unfortunately, it is a paywalled journal and I couldn't figure out how to make the proxy of the University of Montreal library work with my Python scripts and the command line. As a result, I had to manually download the articles,<sup id=\"fnref:proxy_dsh\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/024\/#fn:proxy_dsh\">2<\/a><\/sup> but only when they were relevant! Since DSH has a fairly good search engine that allows to do multi-keyword searches, I only downloaded articles matching my search criteria (143 in total).<\/p>\n<p>Additionally, I went through each of the 174 issues of DSH to batch-import the article references in Zotero. It was tedious but I figured I might be able to use these metadata for other projects in the future.  <\/p>\n<h3>Filtering the articles<\/h3>\n<p>For DHQ, JDMDH and CHR, I ran a keyword surch using the command <a href=\"https:\/\/www.man7.org\/linux\/man-pages\/man1\/grep.1.html\"><code>grep<\/code><\/a> on the content of the articles. I didn't want to limit my search to the titles, abstract or keywords because I really wanted to include anecdotal mentions of automatic text recognition in my results.  <\/p>\n<p>To use grep, I created a file (pattern.txt) with the keywords I was looking for:  <\/p>\n<div class=\"code\"><pre class=\"code literal-block\">HTR\nOCR\ntext recognition\nATR\nTranskribus\neScriptorium\nautomatic transcription\n<\/pre><\/div>\n\n<p>Then I converted the PDFs into text files using the command <a href=\"https:\/\/man.archlinux.org\/man\/pdftotext.1.en\">pdftotext<\/a>. This was necessary because grep cannot search inside a PDF directly. I didn't need to do this conversion for DHQ, since I had download HTML files from that journal. <\/p>\n<p>The commands to search inside the PDFs of one of the journals would look like this:<\/p>\n<div class=\"code\"><pre class=\"code literal-block\">ls<span class=\"w\"> <\/span>*.pdf<span class=\"w\"> <\/span><span class=\"p\">|<\/span><span class=\"w\"> <\/span>xargs<span class=\"w\"> <\/span>-n1<span class=\"w\"> <\/span>pdftotext<span class=\"w\"> <\/span><span class=\"c1\"># to convert PDFs to text files<\/span>\ngrep<span class=\"w\"> <\/span>-i<span class=\"w\"> <\/span>-w<span class=\"w\"> <\/span>-m5<span class=\"w\"> <\/span>-H<span class=\"w\"> <\/span>-f<span class=\"w\"> <\/span>..\/pattern.txt<span class=\"w\"> <\/span>*.txt<span class=\"w\"> <\/span><span class=\"c1\"># to search for the keywords in the text files and display the first 5 matches<\/span>\n<\/pre><\/div>\n\n<p>After controlling how grep matched the keywords, I used <code>grep -l -f ..\/pattern.txt *.txt<\/code> to list the files that matched the keywords. This list was used to sort the documents into two folders, according to whether or not they matched my research.<\/p>\n<p>In the case of DSH, I directly used the search engine to combine the keywords, using the \"OR\" operator. 
I set the full text of the articles as the scope of my search: <a href=\"https:\/\/academic.oup.com\/dsh\/search-results?allJournals=1&amp;f_ContentType=Journal+Article&amp;fl_SiteID=5447&amp;cqb=[{%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22automatic%20transcription%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22transkribus%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22text%20recognition%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22escriptorium%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22OCR%22}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22HTR%22}]}]&amp;qb={%22_text_1-exact%22:%22automatic%20transcription%22,%22qOp2%22:%22OR%22,%22_text_2-exact%22:%22transkribus%22,%22qOp3%22:%22OR%22,%22_text_3-exact%22:%22text%20recognition%22,%22qOp4%22:%22OR%22,%22_text_4-exact%22:%22escriptorium%22,%22qOp5%22:%22OR%22,%22_text_5%22:%22OCR%22,%22qOp6%22:%22OR%22,%22_text_6%22:%22HTR%22}&amp;page=1\">https:\/\/academic.oup.com\/dsh\/search-results?allJournals=1&amp;f_ContentType=Journal+Article&amp;fl_SiteID=5447&amp;cqb=[{%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22automatic%20transcription%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22transkribus%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22text%20recognition%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22escriptorium%22,%22exactMatch%22:true}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22OCR%22}]},{%22condition%22:%22OR%22,%22terms%22:[{%22filter%22:%22_text_%22,%22input%22:%22HTR%22}]}]&amp;qb={%22_text_1-exact%22:%22automatic%20transcription%22,%22qOp2%22:%22OR%22,%22_text_2-exact%22:%22transkribus%22,%22qOp3%22:%22OR%22,%22_text_3-exact%22:%22text%20recognition%22,%22qOp4%22:%22OR%22,%22_text_4-exact%22:%22escriptorium%22,%22qOp5%22:%22OR%22,%22_text_5%22:%22OCR%22,%22qOp6%22:%22OR%22,%22_text_6%22:%22HTR%22}&amp;page=1<\/a><\/p>\n<p>In both cases, the search was not case sensitive, in order to catch as many occurrences as possible of forms like \"automatic text recognition\" or \"Text Recognition\" or \"text recognition\", etc. However, it meant that sometimes I found false positives: \"democracy\" often matches \"ocr\", and \"theatre\" matches \"atr\". Since DSH's search engine returns the match in context, I was able to ignore these false positives. For the other journals, I had to manually check where the matches were. Usually, I combined this check with the next step of my investigation.  <\/p>\n<h4>Hits per journal<\/h4>\n<ul>\n<li>JDMDH: 47 hits (out of 162 articles)<\/li>\n<li>DHQ: 93 hits (out of 790 articles)<\/li>\n<li>DSH: 143 relevant hits (out of 1741 articles)<\/li>\n<li>CHR: 65 hits (out of 214 articles)<\/li>\n<\/ul>\n<h3><em>D\u00e9pouillement<\/em> and analysis<\/h3>\n<p>To date, I am still in the process of reading the articles and taking notes on the occurrences of my keywords. <\/p>\n<p>I use Zotero to keep track of the articles I read and to confirm whether they are false positives. 
Sometimes, I leave out articles that are irrelevant, even if they mention a keyword I was looking for. For example, <a href=\"https:\/\/doi.org\/10.1093\/llc\/fqac089\">Liu &amp; Zhu (2023)<\/a><sup id=\"fnref:liu_zhu_2023\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/024\/#fn:liu_zhu_2023\">3<\/a><\/sup> contains the string \"OCR\" but it only appears in a title in their bibliography, for work they refer to in a context where OCR is not relevant to their argument. With tags in Zotero, I clearly identify such articles as \"to be left out\" from my analysis, but I don't remove them from the collection.  <\/p>\n<p>I use different tags to identify the various occurrences of the technology in the articles. For example, I distinguish between firsthand applications of ATR and the reuse of data produced by ATR before the experimentation presented by the authors. Typically, there are many mentions of documents that were OCRed by libraries and used by scholars to conduct their research. Overall, with this analysis, I am trying to add more depth to the observations made by <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2304.13530\">Tarride et al (2023)<\/a><sup id=\"fnref:tarride_et_al_2023\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/024\/#fn:tarride_et_al_2023\">4<\/a><\/sup> in which they pragmatically considered three situations leading to the use of ATR: 1) for the production of digital editions; 2) for the production of large searchable text corpora; and 3) for the production of non-comprehensive transcriptions to feed knowledge bases. However, it is difficult to establish definitive categories before I am done processing all the collected articles.  <\/p>\n<p>Due to the large number of articles to be analyzed, I have continued to use the grep command to quickly review the content of articles and speed up my sorting process. For example, since I am more interested in firsthand usages of ATR, I want to be able to quickly identify irrelevant mentions of my keywords, as was the case in Liu &amp; Zhu (2023). The command <code>grep -i -w -C 5 -H -f ..\/pattern.txt *.txt &gt; grep_out<\/code> allows me to generate a file, grep_out, in which, for each time a keyword is matched in a document, five lines of context are displayed before and after the match, as well as the name of the file. I still have to read the abstracts and parts of the articles to clearly understand in which contexts the automatic text recognition technologies are used. However, this is an effective method for quickly sorting through the articles.<\/p>\n<p>I'm looking forward to sharing the results of this analysis in my dissertation! <\/p>\n<!-- FOOTNOTES -->\n\n<div class=\"footnote\">\n<hr>\n<ol>\n<li id=\"fn:python_server\">\n<p>This emulation is necessary to allow the Zotero connector to work properly.\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/024\/#fnref:python_server\" title=\"Jump back to footnote 1 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn:proxy_dsh\">\n<p>I want to specify here that it was not for lack of reading documentation on proxies and requests. 
Unable to find a straightforward solution, unsure if it was even something that the UdeM proxy allowed, and since I would still have needed to write additional scripts afterwards, I decided that it would take just as long to do it manually (about 2-3 hours).\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/024\/#fnref:proxy_dsh\" title=\"Jump back to footnote 2 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn:liu_zhu_2023\">\n<p>Liu, Lei, and Min Zhu. \"Bertalign: Improved Word Embedding-Based Sentence Alignment for Chinese\u2013English Parallel Corpora of Literary Texts.\" <em>Digital Scholarship in the Humanities<\/em> 38, no. 2 (June 1, 2023): 621\u201334. <a href=\"https:\/\/doi.org\/10.1093\/llc\/fqac089\">https:\/\/doi.org\/10.1093\/llc\/fqac089<\/a>.\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/024\/#fnref:liu_zhu_2023\" title=\"Jump back to footnote 3 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn:tarride_et_al_2023\">\n<p>Tarride, Sol\u00e8ne, M\u00e9lodie Boillet, and Christopher Kermorvant. \"Key-Value Information Extraction from Full Handwritten Pages.\" arXiv, April 26, 2023. <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2304.13530\">https:\/\/doi.org\/10.48550\/arXiv.2304.13530<\/a>.\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/024\/#fnref:tarride_et_al_2023\" title=\"Jump back to footnote 4 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<\/ol>\n<\/div>","category":["HTR","literature review","OCR","survey"],"guid":"https:\/\/alix-tz.github.io\/phd\/posts\/024\/","pubDate":"Sat, 21 Jun 2025 19:15:27 GMT"},{"title":"023 - Writing a PhD manuscript with Markdown and Quarto","link":"https:\/\/alix-tz.github.io\/phd\/posts\/023\/","description":"<p>The deadline for finishing the dissertation is approaching. And there is still so much to do! This is one of the main reasons why this research blog has been quiet for the last few months, even though there are many topics I would like to write about. <\/p>\n<p>But I guess I can take a short break from time to time and go with the flow of writing a blog post in one sitting. Who knows, maybe I'll do a few more before it's time to turn in my dissertation. I want to talk about my writing setup because it is something I have thought about a lot, trying to find the best compromise. <\/p>\n<p>Writing my dissertation in Microsoft Word has never been an option, although I do use Google Docs from time to time to get quick feedback from my supervisors. <\/p>\n<p><a href=\"https:\/\/www.latex-project.org\/\">LaTeX<\/a> may seem like an obvious choice to some of my fellow PhD writers, but I usually limit my use of LaTeX to <a href=\"https:\/\/www.overleaf.com\">Overleaf<\/a>, an online LaTeX editor. On the one hand, I didn't necessarily want to install LaTeX locally for the time being, and on the other hand, I couldn't imagine writing a whole dissertation using Overleaf, because working in my browser can be distracting, and because it would require that I always have access to the Internet to work. To be honest, I mostly didn't want to use LaTeX in the first place because I find the syntax too distracting when I'm writing. 
It's super useful for getting good control over the layout of the document for the final version of the manuscript, but it's not convenient to work with while I'm formulating ideas and arguments.<\/p>\n<p>I will probably use LaTeX to prepare the final version of the manuscript, but in the meantime I wanted something lighter to structure my document, yet easily convertible to LaTeX down the road. <\/p>\n<p>And I am a big fan of Markdown.<\/p>\n<p>Markdown has a syntax that is light enough not to be too distracting - I use it all the time when taking notes anyway, so it is fully part of my writing reflexes. Also, in the context of writing my dissertation, I think of Markdown as text that I can easily copy and paste into a Google document when I need feedback, without losing formatting and without compromising readability in Google Docs. I've seen some LaTeX copy-pasted into Google Docs for supervisor feedback, and I don't think it would work for me. <\/p>\n<p>In addition to Markdown, I wanted to be able to use a modular approach to building my manuscript. A modular approach means having several smaller text files that are eventually merged into a single master document. LaTeX also relies on modularity with commands like <code>\\include{}<\/code>. Modularity is important because in a very long text document it is easy to get lost between inline comments, draft passages, and finished paragraphs. There's also the risk of accidentally deleting passages. With a modular structure, it will also be easier to move paragraphs around as I progress. Also, my manuscript is versioned with Git and synchronized with a private GitHub repository, and modularity makes versioning much easier.<\/p>\n<p>Instead of programming my own manuscript builder - yes, that was my first impulse - I took a closer look at the documentation for <a href=\"https:\/\/quarto.org\/\">Quarto<\/a>, which I've been using for a little over a year to create slides and websites for the courses I teach. Quarto offered me a solution on a silver platter, because it supports building <a href=\"https:\/\/quarto.org\/docs\/reference\/projects\/books.html\">books<\/a> with Markdown, which is close enough to a PhD thesis. <\/p>\n<p>Quarto implements a single-source publishing paradigm and acts as a shell around <a href=\"https:\/\/pandoc.org\/\">pandoc<\/a>, which allows for swift conversion from one format to another, including from Markdown to LaTeX. I can split the document into multiple smaller Markdown files, and use my book's config file to specify the order in which the Markdown files are aggregated. Quarto's Markdown implementation includes some cool stuff from pandoc, such as citation and cross-reference management. It's really worth taking a look at the documentation.<\/p>\n<p>So with Quarto, I can write my dissertation as a series of smaller Markdown files, and end up with a master .md file, a .tex file ready to import into Overleaf, or even an already compiled PDF file generated with <a href=\"https:\/\/quarto.org\/docs\/output-formats\/pdf-engine.html\">tinytex<\/a>. <\/p>\n<p>Quarto is not a text editor, it is simply a processor that starts with a set of Markdown files and a config file, and then builds one or more outputs. To write, I use Visual Studio Code and have a <code>quarto preview<\/code> command running in the background. For now, it just produces an HTML preview that I see in my browser. 
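<\/p>\n<p>To give an idea, the book config file I mentioned above is a plain YAML file; a minimal sketch could look like this (the title and chapter file names are invented for the example):<\/p>\n<div class=\"code\"><pre class=\"code literal-block\"># _quarto.yml (illustrative sketch)\nproject:\n  type: book\n\nbook:\n  title: \"My dissertation\"\n  chapters:        # the order of this list is the order of the chapters\n    - index.qmd\n    - 01-introduction.md\n    - 02-state-of-the-art.md\n    - 03-conclusion.md\n\nformat:\n  html: default\n  pdf: default\n<\/pre><\/div>\n<p>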
When I'm closer to a stable version of the manuscript, I'll start working with PDF output.<\/p>\n<p>The syntax for some of the more specific Markdown features in Quarto is more complex than I am used to, so I still have to look at the documentation from time to time. But I am getting the hang of it, and I use a cheat sheet for the features I use most often. <\/p>\n<p>Pandoc's Markdown support lets you apply classes to entire paragraphs or inline portions of text. This is useful because it has allowed me to create some CSS transformations with classes like \"draft\" or \"missing-information\" to keep track of passages I need to rewrite, or blocks where I need to get away from my text editor and go back to my notes (usually in <a href=\"https:\/\/www.zotero.org\/\">Zotero<\/a>). I find it super useful to avoid (at least as much as possible) falling into rabbit holes that distract me from actually writing. It's more efficient for my time management to divide my time between actual writing sessions and other sessions where I work on improving the draft passages or doing the research I'm missing to illustrate an argument. <\/p>\n<p>Another use of inline classes is to keep track of concepts or specific terms that I could include in a glossary or at least a list of acronyms. By keeping track of them directly in the text, I can automate the generation of these sections. Some might say that this is the kind of thing I could do with <a href=\"https:\/\/tei-c.org\/release\/doc\/tei-p5-doc\/en\/html\/index.html\">TEI XML<\/a> - I agree, since this is semantic annotation. But as I said, I wanted a lightweight syntax, and I really like Markdown.  <\/p>\n<p><em><strong>EDIT from June 20, 2025:<\/strong> I feel the need to add a clarification a few months after this original post: while I did like my setup with Markdown and Quarto to get started on writing my dissertation, I eventually switched to good old LaTeX. Quarto\/Markdown simply lacked too many features for what I wanted to do.<\/em> <\/p>\n<p><em>Part of the problem came from the fact that custom annotations that turn into spans with custom classes during a Markdown-to-HTML transformation scenario were not converted into anything in LaTeX and were therefore lost. For example, I would have had to handle the glossary and the list of acronyms afterwards, once I was done with Markdown and had fully switched to LaTeX. Rather than writing my own preprocessing script to find a solution to this problem (as far as I could see, pandoc does not offer any option to map Markdown spans to custom LaTeX commands), I figured switching to writing in LaTeX directly made more sense: there was no point in pushing the complications too far.<\/em><\/p>\n<p><em>Also, I really wanted to be able to use the <code>todo<\/code> package from LaTeX to keep track of feedback, side notes and questions I had for myself while writing. With this package, they are visible in the PDF output, which is also useful when I share my text with other people.<\/em><\/p>\n<p><em>Lastly, Roch Delanney greatly facilitated this switch by sharing his LaTeX template with me. It was easy to start from the setup he created with Robert Alessi and to add my own configuration and customization. Their template was much cleaner than the templates that can be found on Overleaf, on top of being very well documented. 
It was great to keep things simple: I don't import any package that I don't actually need.<\/em><\/p>","category":["markdown","quarto"],"guid":"https:\/\/alix-tz.github.io\/phd\/posts\/023\/","pubDate":"Tue, 18 Feb 2025 05:00:00 GMT"},{"title":"022 - McCATMuS #5 - Training models","link":"https:\/\/alix-tz.github.io\/phd\/posts\/022\/","description":"<p>Last week, I visited Rimouski in the Bas-Saint-Laurent region of Qu\u00e9bec, along the South-eastern bank of the St Laurent river. I was invited to contribute to discussions around the <a href=\"https:\/\/nouvellefrancenumerique.info\/\">Nouvelle-France Num\u00e9rique project<\/a>, and I took this opportunity to <a href=\"https:\/\/inria.hal.science\/hal-04706828\">present<\/a> HTR-United, CATMuS as well as preliminary results on training a McCATMuS model. In preparation for this presentation, I conducted a series of tests on the first two models I trained. Today, this blog post gives me a space to discuss these tests and their results in more detail.<\/p>\n<p>The Kraken McCATMuS models were not directly trained on the HuggingFace dataset I introduced in my <a href=\"https:\/\/alix-tz.github.io\/phd\/posts\/022\/021\/\">previous post<\/a>, but rather on ARROW files created with the same ALTO XML files used to create the HuggingFace dataset. At the beginning of September, I wrote a Python script which reproduces the split of ALTO XML files into the train, validation and test sets, and which applies the same type of filtering of lines and modifications as I previously presented. Instead of generating the PARQUET files for HuggingFace, it simply creates alternative <code>.catmus_arrow.xml<\/code> files and three listings of these files, ready to be served to a <a href=\"https:\/\/kraken.re\/4.3.0\/ketos.html#binary-datasets\"><code>ketos compile<\/code><\/a> command<sup id=\"fnref:compile\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/022\/#fn:compile\">1<\/a><\/sup>.<\/p>\n<p>I used Kraken 4.3.13 to train the models on Inria's computation server because I've had dependency issues with Kraken 5 and haven't fixed them yet. The first model I trained strictly followed the train\/validation split thanks to the <a href=\"https:\/\/github.com\/mittagessen\/kraken\/blob\/cdfb923eba8d7dba10b6f32fb73bdf1e355aaf74\/kraken\/ketos\/recognition.py#L129C16-L129C30\"><code>--fixed-splits<\/code> option<\/a>. After 60 epochs, the model plateaued at 79.9% character accuracy. When applied to the test set, this accuracy remained at 78.06%, a mere two-point drop.<\/p>\n<p>I trained a second model using the same parameters<sup id=\"fnref:params\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/022\/#fn:params\">2<\/a><\/sup> but without the <code>--fixed-splits<\/code> option, allowing Kraken to shuffle the train set and the validation set into a 90\/10 split (the test set was left untouched however). This time, the training lasted 157 epochs before stopping, with the best model scoring an accuracy of 92.8% on the validation set. 
When applied to the test set, however, the model lost 7 points of accuracy (85.24%).<\/p>\n<figure>\n    <img src=\"https:\/\/alix-tz.github.io\/phd\/images\/mccatmus_v1_entra%C3%AEnement_fixedsplits.png\" alt=\"Learning curve for the model trained on the fixed split.\">\n    <figcaption>Learning curve (Character and Word Accuracies) for the model trained on the fixed \"feature\"-based split between train and validation.<\/figcaption>\n<\/figure>\n\n<figure>\n    <img src=\"https:\/\/alix-tz.github.io\/phd\/images\/mccatmus_v1_entra%C3%AEnement.png\" alt=\"Learning curve for the model trained on the non-fixed split.\">\n    <figcaption>Learning curve (Character and Word Accuracies) for the model trained on the random split between train and validation.<\/figcaption>\n<\/figure>\n\n<p>Although disappointing, this was consistent with the observations made when training the CATMuS Medieval model:<\/p>\n<blockquote>\n<p><em>As anticipated, the \"General\" split exhibits lower CER, given the absence of out-of-domain documents, whereas the \"Feature\"-based split surpasses 10%. This higher score presents an intriguing challenge for developing more domain-specific models that consider factors such as script type and language.<\/em> (from <a href=\"https:\/\/univ-paris8.hal.science\/hal-04453952v1\">Thibault Cl\u00e9rice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagu\u00e9, Jean-Baptiste Camps, et al.. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. 2024 International Conference on Document Analysis and Recognition (ICDAR), 2024, Athens, Greece. \u27e8hal-04453952\u27e9<\/a> p. 15)<\/p>\n<\/blockquote>\n<p>So, the drop in accuracy observed on the test set is, as suggested in <em>Cl\u00e9rice et al, 2024<\/em>, likely due to the fact that with a fixed split, the model is both validated and tested against out-of-domain hands and documents (although the documents differ in the two sets). On the other hand, the model trained with a random split is validated against known hands and documents, but tested on out-of-domain examples.<\/p>\n<p>The test set contains transcriptions of printed, typewritten and handwritten texts, covering all centuries. Limiting ourselves to only one accuracy score obtained on the whole test set would tell us very little about the model's capacity and its limitations. This is why I divided the test set into several smaller test sets based on the century of the documents and\/or on the main type of writing present in the documents. For documents spanning several centuries, I used the most represented century.<\/p>\n<p>I only used the McCATMuS model trained on the random split for these tests, because the accuracy of the other one was too low for the results to be meaningful. Instead of only testing McCATMuS, I also ran Manu McFrench V3 and McFondue on the McCATMuS test set. 
They are two generic models trained on similar data (although with no or different normalization approaches).<\/p>\n<table>\n<thead>\n<tr>\n<th style=\"text-align: left;\">Test set<\/th>\n<th style=\"text-align: left;\">McCATMuS<\/th>\n<th style=\"text-align: center;\">Manu McFrench V3<\/th>\n<th style=\"text-align: right;\">McFondue<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left;\">All<\/td>\n<td style=\"text-align: left;\">85.24<\/td>\n<td style=\"text-align: center;\"><strong>91.17<\/strong><\/td>\n<td style=\"text-align: right;\">76.12<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">Handwritten<\/td>\n<td style=\"text-align: left;\">78.72<\/td>\n<td style=\"text-align: center;\"><strong>89.40<\/strong><\/td>\n<td style=\"text-align: right;\">75.17<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">Print<\/td>\n<td style=\"text-align: left;\"><strong>96.37<\/strong><\/td>\n<td style=\"text-align: center;\">94.15<\/td>\n<td style=\"text-align: right;\">78.30<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">Typewritten<\/td>\n<td style=\"text-align: left;\">90.93<\/td>\n<td style=\"text-align: center;\"><strong>92.69<\/strong><\/td>\n<td style=\"text-align: right;\">58.13<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">17th cent.<\/td>\n<td style=\"text-align: left;\"><strong>87.27<\/strong><\/td>\n<td style=\"text-align: center;\">86.39<\/td>\n<td style=\"text-align: right;\">72.81<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">18th cent.<\/td>\n<td style=\"text-align: left;\">88.65<\/td>\n<td style=\"text-align: center;\"><strong>94.21<\/strong><\/td>\n<td style=\"text-align: right;\">81.64<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">19th cent.<\/td>\n<td style=\"text-align: left;\">79.81<\/td>\n<td style=\"text-align: center;\"><strong>93.70<\/strong><\/td>\n<td style=\"text-align: right;\">75.46<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">20th cent.<\/td>\n<td style=\"text-align: left;\">74.92<\/td>\n<td style=\"text-align: center;\"><strong>86.52<\/strong><\/td>\n<td style=\"text-align: right;\">56.74<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">21st cent.<\/td>\n<td style=\"text-align: left;\">73.86<\/td>\n<td style=\"text-align: center;\"><strong>90.20<\/strong><\/td>\n<td style=\"text-align: right;\">68.04<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">(HW) 17th cent.<\/td>\n<td style=\"text-align: left;\">58.69<\/td>\n<td style=\"text-align: center;\"><strong>64.83<\/strong><\/td>\n<td style=\"text-align: right;\"><em>64.26<\/em><\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">(HW) 18th cent.<\/td>\n<td style=\"text-align: left;\">85.38<\/td>\n<td style=\"text-align: center;\"><strong>93.35<\/strong><\/td>\n<td style=\"text-align: right;\">80.47<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">(HW) 19th cent.<\/td>\n<td style=\"text-align: left;\">79.81<\/td>\n<td style=\"text-align: center;\"><strong>93.70<\/strong><\/td>\n<td style=\"text-align: right;\">75.46<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">(HW) 20th cent.<\/td>\n<td style=\"text-align: left;\">63.02<\/td>\n<td style=\"text-align: center;\"><strong>82.23<\/strong><\/td>\n<td style=\"text-align: right;\">55.89<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\">(HW) 21st cent.<\/td>\n<td style=\"text-align: left;\">73.86<\/td>\n<td style=\"text-align: center;\"><strong>90.20<\/strong><\/td>\n<td style=\"text-align: right;\">68.04<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<!-- add plot? -->\n\n<p>I was initially surprised by the consistent margin Manu McFrench had over McCATMuS, considering it was trained on less data (73.9K + 8.8K lines, against 106K + 5.8K lines) which had not been harmonized to follow the same transcription rules. However, these scores are actually biased in favor of Manu McFrench because several of the documents included in the McCATMuS test set were also used in Manu McFrench's train set. Even though this is not true for all documents, it concerns almost half of the test set. It might also be the case for McFondue, but this model scores higher than McCATMuS in only one instance (handwritten documents from the 17th century). Creating a new test set, with documents that are not present in any of the train sets but follow the CATMuS guidelines, would be a good way to confirm this bias.<\/p>\n<p>Additionally, I detected an issue in one of the datasets used in the test set: <code>FoNDUE_Wolfflin_Fotosammlung<\/code> contains some lines of faulty transcriptions, resulting from automatic text recognition, which most certainly cause an inaccurate evaluation of all three models.<\/p>\n<blockquote>\n<p><em>A couple of examples of the faulty transcriptions, along with the CER they generate when compared to what would be a correct transcription (the CER is generated with <a href=\"https:\/\/github.com\/WHaverals\/CERberus\">CERberus<\/a>):<\/em><\/p>\n<table>\n<thead>\n<tr>\n<th style=\"text-align: left;\">Line image<\/th>\n<th style=\"text-align: right;\">Faulty transcription<\/th>\n<th style=\"text-align: right;\">Correct transcription<\/th>\n<th style=\"text-align: center;\">Resulting CER<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td style=\"text-align: left;\"><img alt='text line images reading, in print, \"COLLECTION HANFSTAENGL LONDON\"' src=\"https:\/\/alix-tz.github.io\/phd\/images\/fotosammlung_error_example1.jpg\"><\/td>\n<td style=\"text-align: right;\">\"CSTITHER, KIESERMAEAER AogS.\"<\/td>\n<td style=\"text-align: right;\">\"COLLECTION HANFSTAENGL LONDON\"<\/td>\n<td style=\"text-align: center;\">89.29<\/td>\n<\/tr>\n<tr>\n<td style=\"text-align: left;\"><img alt='text line image reading, in print, \"NATIONAL GALLERY\"' src=\"https:\/\/alix-tz.github.io\/phd\/images\/fotosammlung_error_example2.jpg\"><\/td>\n<td style=\"text-align: right;\">\"PEcLioL.\"<\/td>\n<td style=\"text-align: right;\">\"NATIONAL GALLERY\"<\/td>\n<td style=\"text-align: center;\">175.0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/blockquote>\n<p>I plan to manually check this dataset and update the McCATMuS dataset accordingly. I don't know yet how many lines are affected.<\/p>\n<p>The better accuracy of the Manu McFrench model is not just a product of the biases in the test set. I had the occasion to apply both models to two documents, one from the 17th century and one from the 20th century. In both cases, Manu McFrench's transcription seemed more likely to be correct than McCATMuS's. This has led me to compare the training parameters used for both models and to start a third training experiment using Manu McFrench's parameters. 
In this case, the batch size is reduced to 16 (as opposed to 32) and the Unicode normalization follows <a href=\"https:\/\/unicode.org\/reports\/tr15\/#Compatibility_Composite_Figure\">NFKD instead of NFD<\/a>.<\/p>\n<p>If the results of this third training are consistent with the previous experiments, it will be interesting to see if adding more data to the training set will improve the results. Also, I have yet to test the model in a fine-tuning scenario.<\/p>\n<p>As mentioned at the beginning of this post, these results are preliminary, so I hope to have more to share in the coming weeks.<\/p>\n<!-- footnotes -->\n\n<div class=\"footnote\">\n<hr>\n<ol>\n<li id=\"fn:compile\">\n<p>The command looks like this: <code>cat \".\/list_of_paths.txt\" | xargs -d \"\\n\" ketos compile -o \".\/binary_dataset.arrow\" --random-split .0 .0 1.0 -f alto<\/code>.\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/022\/#fnref:compile\" title=\"Jump back to footnote 1 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<li id=\"fn:params\">\n<p>The configuration of Kraken for training these two models relies on the default network architecture, on NFD Unicode normalization, a learning rate of 0.0001 (1e<sup>-4<\/sup>), a batch size of 32, a padding of 16 (default value), and applies augmentation (<code>--augment<\/code>). The <code>--fixed-splits<\/code> option is used for the first model. Following Kraken's default behavior, the training stops when the validation loss does not decrease for 10 epochs (early stopping); this prevents the model from overfitting, which is confirmed when looking at the accuracy score of the intermediate models on the test set (orange line on the graphs). The training is done on a GPU.\u00a0<a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/022\/#fnref:params\" title=\"Jump back to footnote 2 in the text\">\u21a9<\/a><\/p>\n<\/li>\n<\/ol>\n<\/div>","category":["CATMuS","datasets","HTR"],"guid":"https:\/\/alix-tz.github.io\/phd\/posts\/022\/","pubDate":"Mon, 23 Sep 2024 04:00:00 GMT"},{"title":"021 - McCATMuS #4 - Cleaning data, collection metadata","link":"https:\/\/alix-tz.github.io\/phd\/posts\/021\/","description":"<p>Preparing the data for CATMuS would certainly have taken much more time had I not been able to benefit from Thibault Cl\u00e9rice's experience with CATMuS Medieval. Not only was I able to build on the workflow he set up when he built it, but I also relied heavily on his scripts to parse and build the final dataset into <a href=\"https:\/\/parquet.apache.org\/\">PARQUET<\/a> files that were pushed to HuggingFace. Most of these steps are described in <a href=\"https:\/\/univ-paris8.hal.science\/hal-04453952v1\">Thibault Cl\u00e9rice, Ariane Pinche, Malamatenia Vlachou-Efstathiou, Alix Chagu\u00e9, Jean-Baptiste Camps, et al.. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. 2024 International Conference on Document Analysis and Recognition (ICDAR), 2024, Athens, Greece<\/a>, presented at the <a href=\"https:\/\/icdar2024.net\/\">ICDAR<\/a> conference in Athens in a few days.<\/p>\n<p>For McCATMuS, I started by downloading all the datasets (keeping track of the official releases), then I manually reorganized all the datasets so that the transcription and images were always under <code>{dataset_repo}\/data\/{sub_folder}<\/code>, which made later manipulation easier. 
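<\/p>\n<p>Concretely, the layout I aimed for can be pictured like this (the repository, batch and file names are invented for the illustration):<\/p>\n<div class=\"code\"><pre class=\"code literal-block\">some_dataset_repo\/\n\u2514\u2500\u2500 data\/\n    \u251c\u2500\u2500 batch-01\/\n    \u2502   \u251c\u2500\u2500 page_001.jpg\n    \u2502   \u2514\u2500\u2500 page_001.xml\n    \u2514\u2500\u2500 batch-02\/\n        \u251c\u2500\u2500 page_001.jpg\n        \u2514\u2500\u2500 page_001.xml\n<\/pre><\/div>\n<p>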
Based on the notes I took while filtering the datasets, and after generating a character table for each dataset with <a href=\"https:\/\/github.com\/PonteIneptique\/choco-mufin\">Chocomufin<\/a>, I created several conversion tables to harmonize the transcription. The conversions are a mix of single character or multiple character replacements (<code>[<\/code> and  <code>[[?]]<\/code>) and more or less sophisticated replacements based on regular expressions (<code>#r#\u00ab<\/code>).<sup id=\"fnref:chocomufin\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/021\/#fn:chocomufin\">1<\/a><\/sup><\/p>\n<p>Here is a sample of the Chocomufin conversion table used for the LECTAUREP datasets. If the character is replaced by itself, it remains unchanged in the dataset, while replacing it allows either to remove a character from the dataset (the <code>\u00a5<\/code>) or to harmonize its transcription with the CATMuS guidelines (see <code>\u0153<\/code> and <code>\u00b0<\/code> for example).<\/p>\n<div class=\"code\"><pre class=\"code literal-block\"><span class=\"nc\">char<\/span><span class=\"p\">,<\/span><span class=\"n\">name<\/span><span class=\"p\">,<\/span><span class=\"n\">replacement<\/span><span class=\"p\">,<\/span><span class=\"n\">codepoint<\/span><span class=\"p\">,<\/span><span class=\"n\">mufidecode<\/span><span class=\"p\">,<\/span><span class=\"k\">order<\/span>\n<span class=\"n\">#r<\/span><span class=\"err\">#\u00ab<\/span><span class=\"w\"> <\/span><span class=\"p\">,<\/span><span class=\"n\">Repl<\/span><span class=\"w\"> <\/span><span class=\"n\">extra<\/span><span class=\"w\"> <\/span><span class=\"nf\">space<\/span><span class=\"w\"> <\/span><span class=\"k\">before<\/span><span class=\"w\"> <\/span><span class=\"nf\">LEFT<\/span><span class=\"o\">-<\/span><span class=\"n\">POINTING<\/span><span class=\"w\"> <\/span><span class=\"k\">DOUBLE<\/span><span class=\"w\"> <\/span><span class=\"n\">ANGLE<\/span><span class=\"w\"> <\/span><span class=\"n\">QUOTATION<\/span><span class=\"w\"> <\/span><span class=\"n\">MARK<\/span><span class=\"p\">,<\/span><span class=\"ss\">\"\"\"\"<\/span><span class=\"p\">,<\/span><span class=\"mi\">00<\/span><span class=\"n\">AB<\/span><span class=\"p\">,,<\/span><span class=\"mi\">0<\/span>\n<span class=\"n\">#r<\/span><span class=\"err\">#<\/span><span class=\"w\"> <\/span><span class=\"err\">\u00bb<\/span><span class=\"p\">,<\/span><span class=\"n\">Repl<\/span><span class=\"w\"> <\/span><span class=\"n\">extra<\/span><span class=\"w\"> <\/span><span class=\"nf\">space<\/span><span class=\"w\"> <\/span><span class=\"k\">before<\/span><span class=\"w\"> <\/span><span class=\"nf\">RIGHT<\/span><span class=\"o\">-<\/span><span class=\"n\">POINTING<\/span><span class=\"w\"> <\/span><span class=\"k\">DOUBLE<\/span><span class=\"w\"> <\/span><span class=\"n\">ANGLE<\/span><span class=\"w\"> <\/span><span class=\"n\">QUOTATION<\/span><span class=\"w\"> <\/span><span class=\"n\">MARK<\/span><span class=\"p\">,<\/span><span class=\"ss\">\"\"\"\"<\/span><span class=\"p\">,<\/span><span class=\"mi\">00<\/span><span class=\"n\">BB<\/span><span class=\"p\">,,<\/span><span class=\"mi\">0<\/span>\n<span class=\"o\">[<\/span><span class=\"n\">[?<\/span><span class=\"o\">]<\/span><span class=\"err\">]<\/span><span class=\"p\">,<\/span><span class=\"nf\">replace<\/span><span class=\"w\"> <\/span><span class=\"o\">[<\/span><span class=\"n\">[?<\/span><span class=\"o\">]<\/span><span class=\"err\">]<\/span><span class=\"w\"> <\/span><span 
class=\"k\">with<\/span><span class=\"w\"> <\/span><span class=\"err\">\u27e6\u27e7<\/span><span class=\"p\">,<\/span><span class=\"err\">\u27e6\u27e7<\/span><span class=\"p\">,,,<\/span><span class=\"mi\">0<\/span>\n<span class=\"o\">[<\/span><span class=\"n\">?<\/span><span class=\"o\">]<\/span><span class=\"p\">,<\/span><span class=\"nf\">replace<\/span><span class=\"w\"> <\/span><span class=\"o\">[<\/span><span class=\"n\">?<\/span><span class=\"o\">]<\/span><span class=\"w\"> <\/span><span class=\"k\">with<\/span><span class=\"w\"> <\/span><span class=\"err\">\u27e6\u27e7<\/span><span class=\"p\">,<\/span><span class=\"err\">\u27e6\u27e7<\/span><span class=\"p\">,,,<\/span><span class=\"mi\">0<\/span>\n<span class=\"p\">),<\/span><span class=\"nf\">RIGHT<\/span><span class=\"w\"> <\/span><span class=\"n\">PARENTHESIS<\/span><span class=\"p\">,),<\/span><span class=\"mi\">0029<\/span><span class=\"p\">,),<\/span>\n<span class=\"n\">m<\/span><span class=\"p\">,<\/span><span class=\"n\">LATIN<\/span><span class=\"w\"> <\/span><span class=\"n\">SMALL<\/span><span class=\"w\"> <\/span><span class=\"n\">LETTER<\/span><span class=\"w\"> <\/span><span class=\"n\">M<\/span><span class=\"p\">,<\/span><span class=\"n\">m<\/span><span class=\"p\">,<\/span><span class=\"mi\">006<\/span><span class=\"n\">D<\/span><span class=\"p\">,<\/span><span class=\"n\">m<\/span><span class=\"p\">,<\/span>\n<span class=\"n\">\u00c9<\/span><span class=\"p\">,<\/span><span class=\"n\">LATIN<\/span><span class=\"w\"> <\/span><span class=\"n\">CAPITAL<\/span><span class=\"w\"> <\/span><span class=\"n\">LETTER<\/span><span class=\"w\"> <\/span><span class=\"n\">E<\/span><span class=\"w\"> <\/span><span class=\"k\">WITH<\/span><span class=\"w\"> <\/span><span class=\"n\">ACUTE<\/span><span class=\"p\">,<\/span><span class=\"n\">\u00c9<\/span><span class=\"p\">,<\/span><span class=\"mi\">00<\/span><span class=\"n\">C9<\/span><span class=\"p\">,<\/span><span class=\"n\">E<\/span><span class=\"p\">,<\/span>\n<span class=\"n\">a<\/span><span class=\"p\">,<\/span><span class=\"n\">LATIN<\/span><span class=\"w\"> <\/span><span class=\"n\">SMALL<\/span><span class=\"w\"> <\/span><span class=\"n\">LETTER<\/span><span class=\"w\"> <\/span><span class=\"n\">A<\/span><span class=\"p\">,<\/span><span class=\"n\">a<\/span><span class=\"p\">,<\/span><span class=\"mi\">0061<\/span><span class=\"p\">,<\/span><span class=\"n\">a<\/span><span class=\"p\">,<\/span>\n<span class=\"ss\">\",\"<\/span><span class=\"p\">,<\/span><span class=\"n\">COMMA<\/span><span class=\"p\">,<\/span><span class=\"ss\">\",\"<\/span><span class=\"p\">,<\/span><span class=\"mi\">002<\/span><span class=\"n\">C<\/span><span class=\"p\">,<\/span><span class=\"ss\">\",\"<\/span><span class=\"p\">,<\/span>\n<span class=\"n\">e<\/span><span class=\"p\">,<\/span><span class=\"n\">LATIN<\/span><span class=\"w\"> <\/span><span class=\"n\">SMALL<\/span><span class=\"w\"> <\/span><span class=\"n\">LETTER<\/span><span class=\"w\"> <\/span><span class=\"n\">E<\/span><span class=\"p\">,<\/span><span class=\"n\">e<\/span><span class=\"p\">,<\/span><span class=\"mi\">0065<\/span><span class=\"p\">,<\/span><span class=\"n\">e<\/span><span class=\"p\">,<\/span>\n<span class=\"o\">^<\/span><span class=\"p\">,<\/span><span class=\"n\">CIRCUMFLEX<\/span><span class=\"w\"> <\/span><span class=\"n\">ACCENT<\/span><span class=\"p\">,<\/span><span class=\"o\">^<\/span><span class=\"p\">,<\/span><span class=\"mi\">005<\/span><span class=\"n\">E<\/span><span 
class=\"p\">,<\/span><span class=\"o\">^<\/span><span class=\"p\">,<\/span>\n<span class=\"n\">\u0153<\/span><span class=\"p\">,<\/span><span class=\"n\">LATIN<\/span><span class=\"w\"> <\/span><span class=\"n\">SMALL<\/span><span class=\"w\"> <\/span><span class=\"n\">LIGATURE<\/span><span class=\"w\"> <\/span><span class=\"n\">OE<\/span><span class=\"p\">,<\/span><span class=\"n\">oe<\/span><span class=\"p\">,<\/span><span class=\"mi\">0153<\/span><span class=\"p\">,<\/span><span class=\"n\">oe<\/span><span class=\"p\">,<\/span>\n<span class=\"err\">\u0302<\/span><span class=\"p\">,<\/span><span class=\"n\">COMBINING<\/span><span class=\"w\"> <\/span><span class=\"n\">CIRCUMFLEX<\/span><span class=\"w\"> <\/span><span class=\"n\">ACCENT<\/span><span class=\"p\">,<\/span><span class=\"err\">\u0302<\/span><span class=\"p\">,<\/span><span class=\"mi\">0302<\/span><span class=\"p\">,,<\/span>\n<span class=\"n\">W<\/span><span class=\"p\">,<\/span><span class=\"n\">LATIN<\/span><span class=\"w\"> <\/span><span class=\"n\">CAPITAL<\/span><span class=\"w\"> <\/span><span class=\"n\">LETTER<\/span><span class=\"w\"> <\/span><span class=\"n\">W<\/span><span class=\"p\">,<\/span><span class=\"n\">W<\/span><span class=\"p\">,<\/span><span class=\"mi\">0057<\/span><span class=\"p\">,<\/span><span class=\"n\">W<\/span><span class=\"p\">,<\/span>\n<span class=\"err\">\u00b0<\/span><span class=\"p\">,<\/span><span class=\"n\">DEGREE<\/span><span class=\"w\"> <\/span><span class=\"nf\">SIGN<\/span><span class=\"p\">,<\/span><span class=\"o\">^<\/span><span class=\"n\">o<\/span><span class=\"p\">,<\/span><span class=\"mi\">00<\/span><span class=\"n\">B0<\/span><span class=\"p\">,<\/span><span class=\"o\">*<\/span><span class=\"p\">,<\/span>\n<span class=\"err\">\u00a5<\/span><span class=\"p\">,<\/span><span class=\"n\">YEN<\/span><span class=\"w\"> <\/span><span class=\"nf\">SIGN<\/span><span class=\"p\">,,<\/span><span class=\"mi\">00<\/span><span class=\"n\">A5<\/span><span class=\"p\">,,<\/span>\n<span class=\"n\">\u00bd<\/span><span class=\"p\">,<\/span><span class=\"n\">VULGAR<\/span><span class=\"w\"> <\/span><span class=\"n\">FRACTION<\/span><span class=\"w\"> <\/span><span class=\"n\">ONE<\/span><span class=\"w\"> <\/span><span class=\"n\">HALF<\/span><span class=\"p\">,<\/span><span class=\"mi\">1<\/span><span class=\"o\">\/<\/span><span class=\"mi\">2<\/span><span class=\"p\">,<\/span><span class=\"mi\">00<\/span><span class=\"n\">BD<\/span><span class=\"p\">,<\/span><span class=\"mf\">0.5<\/span><span class=\"p\">,<\/span>\n<span class=\"n\">h<\/span><span class=\"p\">,<\/span><span class=\"n\">LATIN<\/span><span class=\"w\"> <\/span><span class=\"n\">SMALL<\/span><span class=\"w\"> <\/span><span class=\"n\">LETTER<\/span><span class=\"w\"> <\/span><span class=\"n\">H<\/span><span class=\"p\">,<\/span><span class=\"n\">h<\/span><span class=\"p\">,<\/span><span class=\"mi\">0068<\/span><span class=\"p\">,<\/span><span class=\"n\">h<\/span><span class=\"p\">,<\/span>\n<span class=\"n\">r<\/span><span class=\"p\">,<\/span><span class=\"n\">LATIN<\/span><span class=\"w\"> <\/span><span class=\"n\">SMALL<\/span><span class=\"w\"> <\/span><span class=\"n\">LETTER<\/span><span class=\"w\"> <\/span><span class=\"n\">R<\/span><span class=\"p\">,<\/span><span class=\"n\">r<\/span><span class=\"p\">,<\/span><span class=\"mi\">0072<\/span><span class=\"p\">,<\/span><span class=\"n\">r<\/span><span class=\"p\">,<\/span>\n<span class=\"n\">\u00e6<\/span><span class=\"p\">,<\/span><span 
class=\"n\">LATIN<\/span><span class=\"w\"> <\/span><span class=\"n\">SMALL<\/span><span class=\"w\"> <\/span><span class=\"n\">LETTER<\/span><span class=\"w\"> <\/span><span class=\"n\">AE<\/span><span class=\"p\">,<\/span><span class=\"n\">ae<\/span><span class=\"p\">,<\/span><span class=\"mf\">00E6<\/span><span class=\"p\">,<\/span><span class=\"n\">ae<\/span><span class=\"p\">,<\/span>\n<span class=\"n\">\u023c<\/span><span class=\"p\">,<\/span><span class=\"n\">LATIN<\/span><span class=\"w\"> <\/span><span class=\"n\">SMALL<\/span><span class=\"w\"> <\/span><span class=\"n\">LETTER<\/span><span class=\"w\"> <\/span><span class=\"n\">C<\/span><span class=\"w\"> <\/span><span class=\"k\">WITH<\/span><span class=\"w\"> <\/span><span class=\"n\">STROKE<\/span><span class=\"p\">,<\/span><span class=\"n\">c<\/span><span class=\"p\">,<\/span><span class=\"mi\">023<\/span><span class=\"n\">C<\/span><span class=\"p\">,<\/span><span class=\"n\">c<\/span><span class=\"p\">,<\/span>\n<span class=\"err\">\u221f<\/span><span class=\"p\">,<\/span><span class=\"nf\">RIGHT<\/span><span class=\"w\"> <\/span><span class=\"n\">ANGLE<\/span><span class=\"p\">,,<\/span><span class=\"mi\">221<\/span><span class=\"n\">F<\/span><span class=\"p\">,<\/span><span class=\"o\">[<\/span><span class=\"n\">UNKNOWN<\/span><span class=\"o\">]<\/span><span class=\"p\">,<\/span>\n<\/pre><\/div>\n\n<p>It wasn't possible to use a single conversion table for all the datasets because some had different transcription approaches. While replacing  <code>\u00ac<\/code> with <code>-<\/code> could, in principle, be used for each dataset, normalizing the way corrections and uncertainties were transcribed was another story. For example, in some of the CREMMA datasets, <code>&gt;&lt;<\/code> is used to signal a crossed word, while in other datasets <code>&lt;&gt;<\/code> is used. So replacing <code>&gt;<\/code> with <code>\u27e6<\/code> and <code>&lt;<\/code> with <code>\u27e7<\/code> in <code>&gt;hello&lt;<\/code> meant that in some cases we would successfully get <code>\u27e6hello\u27e7<\/code>, while in other cases we would end up with <code>\u27e7hello\u27e6<\/code>. There are a few documents where I had to manually intervene in the XML file to fix the transcription. In such cases, I fork the dataset repository to keep track of the corrected version of the ground truth or I push the correction back into the original dataset to create a new, more consistent version.<\/p>\n<p>In general, the converted dataset is saved as <code>.catmus.xml<\/code> files, which allows us to keep track of the original ground truth and easily adjust the conversion table later if necessary afterwards.<\/p>\n<p>In the <a href=\"https:\/\/alix-tz.github.io\/phd\/posts\/19\/\">second post<\/a> of this series, I mentioned that \"<em>the CATMuS guidelines can (should?) be used as a reference point<\/em>\" and that \"<em>if a project decides to use a special character to mark the end of each paragraph, then in order to create a CATMuS-compatible version of the dataset, I should only have to replace or remove that character. 
<p>In general, the converted dataset is saved as <code>.catmus.xml<\/code> files, which allows us to keep track of the original ground truth and to easily adjust the conversion table later if necessary.<\/p>
<p>In the <a href=\"https:\/\/alix-tz.github.io\/phd\/posts\/19\/\">second post<\/a> of this series, I mentioned that \"<em>the CATMuS guidelines can (should?) be used as a reference point<\/em>\" and that \"<em>if a project decides to use a special character to mark the end of each paragraph, then in order to create a CATMuS-compatible version of the dataset, I should only have to replace or remove that character. In such cases, the special character that was chosen should be unambiguous and the rule should be explicitly presented<\/em>.\" Providing a Chocomufin conversion table along with a dataset that uses project-specific guidelines would be an excellent practice to ensure that the dataset is indeed compatible with CATMuS.<\/p>
<p>Once all the <code>.catmus.xml<\/code> files were ready, I created a new metadata table for McCATMuS listing all the subdirectories under each dataset's \"data\" folder. This table served as a basis for collecting additional metadata at the document level rather than at the dataset level, such as the language of the source or the type of writing (printed, handwritten or typewritten). Working at the document level is important because some datasets contain different types of writing and\/or are multilingual. When a document mixed different languages and\/or different types of writing, and the distinction could be made at the image level, I manually sorted the images into two different subfolders. This is what I did in the \"Memorials for Jane Lathrop Stanford\" dataset, for example: the subfolder \"PageX-LettreX\" mixed typewritten and handwritten letters, so I sorted them into \"PageX-LettreX-handwritten\" and \"PageX-LettreX-typewritten\" in order to have the most accurate metadata possible.<\/p>
<p>Other metadata included the assignment of a call number (or shelf mark) to each source represented in the datasets. In some cases a call number may apply to multiple subfolders, but in most cases each subfolder is de facto a different document. Retrieving the call number is useful for several reasons: it allows for an accurate assessment of the diversity of documents in McCATMuS; it allows a document to be associated with additional metadata found in its holding institution's catalog; and the list of call numbers can be used during benchmarking or in production to check whether a document was already known to the models trained on the dataset, which would explain potentially higher accuracy scores.<\/p>
<p>In the few cases where the source used to build the ground truth did not have a corresponding call number, I simply made one up, keeping \"nobs_\" as a signal that it was a made-up call number. Thus, while \"cph_paris_tissage_1858\/\" in \"timeuscorpus\" is now associated with its corresponding call number at the Paris archive center (Paris, AD75, D1U10 386), CREMMAWiki's \"batch-04\", which is composed of documents we created for the project, is associated with a made-up call number: \"nobs_cremma-wikipedia_b04\".<\/p>
<p>In the end, when the PARQUET files are created, the metadata from the table I just presented is collected, along with information extracted from parsing the contents of the XML files. Each piece of metadata is then represented at the text line level. If you compare <a href=\"https:\/\/huggingface.co\/datasets\/CATMuS\/modern\">McCATMuS<\/a> with <a href=\"https:\/\/huggingface.co\/datasets\/CATMuS\/medieval\">CATMuS Medieval<\/a> using HuggingFace's dataset viewer, you can see that they don't use exactly the same metadata.<\/p>
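<p>The assembly logic is roughly the following (a simplified sketch written for this post, not Thibault's actual scripts; the column names mirror those described above, but the values and the file name are illustrative):<\/p>
<div class=\"code\"><pre class=\"code literal-block\">import pandas as pd

# Document-level metadata, taken from the metadata table described above.
doc_metadata = {
    \"shelfmark\": \"Paris, AD75, D1U10 386\",
    \"language\": \"French\",
    \"writing_type\": \"handwritten\",
    \"not_before\": 1858,
    \"not_after\": 1858,
}

# Text lines extracted by parsing one ALTO or PAGE XML file (illustrative).
lines = [
    {\"text\": \"Première ligne\", \"line_type\": \"DefaultLine\"},
    {\"text\": \"Deuxième ligne\", \"line_type\": \"DefaultLine\"},
]

# Every text line inherits the document-level metadata, so each piece
# of metadata ends up represented at the text line level.
rows = [{**doc_metadata, **line} for line in lines]
pd.DataFrame(rows).to_parquet(\"example.parquet\", index=False)
<\/pre><\/div>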
<p>\"Language\", \"region type\" and \"line type\" (which are based on the SegmOnto classification), \"project\" and \"gen_split\" are common to both datasets, along with the \"shelfmark\" I just described above. They both have a \"genre\" column with similar values (treatise, epistolary, document of practice, etc.). In the case of CATMuS Medieval, \"genre\" is complemented by a \"verse\" column (prose, verse).<\/p>
<p>Following Thibault's advice, I defined the creation date of a text line using two numbers (\"not_before\" and \"not_after\") instead of a single \"century\" value. This makes the dating precise when precision is possible and, conversely, lets it spread over several centuries when that cannot be avoided; in both cases, it is more accurate than a single century value.<\/p>
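<p>In practice, this also makes period-based filtering straightforward once the dataset is loaded, for example with HuggingFace's <code>datasets<\/code> library (a sketch assuming the column names described above and a \"train\" split):<\/p>
<div class=\"code\"><pre class=\"code literal-block\">from datasets import load_dataset

dataset = load_dataset(\"CATMuS\/modern\", split=\"train\")

# Keep only the lines that are dated entirely within the 18th century.
eighteenth_century = dataset.filter(
    lambda row: row[\"not_before\"] &gt;= 1700 and row[\"not_after\"] &lt;= 1800
)
<\/pre><\/div>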
<p>McCATMuS mixes printed, handwritten and typewritten documents, so it was important to have a \"writing type\" column to help filter the dataset based on this information, in cases where one does not want to mix them. This metadata also makes it possible to use McCATMuS to train a classifier capable of distinguishing between the different types of writing. CATMuS Medieval, on the other hand, contains only handwritten sources, so such metadata would be useless; instead, it can rely on paleographic classifications to characterize each text line with a \"script type\" metadata, which includes values such as \"caroline\", \"textualis\", \"hybrida\", etc.<\/p>
<p>McCATMuS also has a \"color\" column that helps sort text lines based on whether the source image is colored (true) or in grayscale (false).<\/p>
<p>Although I reused the scripts developed by Thibault to build this dataset, I had to make several modifications to include the new metadata in the PARQUET files and to apply additional filtering to the text lines. This included updating the mapping to the SegmOnto vocabulary to match what existed in my datasets, or filtering out some types of lines, such as those identified as signatures.<sup id=\"fnref:signatures\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/021\/#fn:signatures\">2<\/a><\/sup> I also included an update of \"writing_type\" at the line level whenever the value in \"line_type\" made it possible to infer it.<\/p>
<div class=\"code\"><pre class=\"code literal-block\"># Infer the writing type from the SegmOnto suffix of the line type,
# then strip the suffix from the line type itself.
if \":handwritten\" in line_type:
    writing_type = \"handwritten\"
    line_type = line_type.replace(\":handwritten\", \"\")
elif \":print\" in line_type:
    writing_type = \"printed\"
    line_type = line_type.replace(\":print\", \"\")
elif \":typewritten\" in line_type:
    writing_type = \"typewritten\"
    line_type = line_type.replace(\":typewritten\", \"\")
else:
    # No suffix: fall back to the document-level metadata.
    writing_type = metadata[\"writing_type\"]
<\/pre><\/div>

<p>In the end, having built such a dataset (the first version of McCATMuS contains 117 text lines!) with such a variety of metadata is very satisfying, although there is room for improvement. I have already mentioned that it would be interesting to have a greater variety of languages in McCATMuS. I also know that some of the values in \"writing_type\" are not completely accurate, so adding a control based on a classifier might be interesting. Finally, I've noticed that some transcriptions in the \"FoNDUE_Wolfflin_Fotosammlung\" dataset are not correct at all, probably due to an automatic transcription that wasn't corrected.<\/p>
<p>However, before we dive into improving McCATMuS, it's important to first examine the accuracy of the models that can be built on top of it! 
This will be the topic of the next and last post in this series!<\/p>
<div class=\"footnote\">
<hr>
<ol>
<li id=\"fn:chocomufin\">
<p>To learn more about how <a href=\"https:\/\/github.com\/PonteIneptique\/choco-mufin?tab=readme-ov-file#commands\"><code>chocomufin convert<\/code><\/a> works, read the software's short documentation. <a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/021\/#fnref:chocomufin\" title=\"Jump back to footnote 1 in the text\">↩<\/a><\/p>
<\/li>
<li id=\"fn:signatures\">
<p>I don't think it makes sense to include signatures in a dataset used to train a generic model, since the transcription of such lines can be very context-specific. <a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/021\/#fnref:signatures\" title=\"Jump back to footnote 2 in the text\">↩<\/a><\/p>
<\/li>
<\/ol>
<\/div>","category":["CATMuS","datasets","HTR"],"guid":"https:\/\/alix-tz.github.io\/phd\/posts\/021\/","pubDate":"Fri, 30 Aug 2024 04:00:00 GMT"},{"title":"020 - McCATMuS #3 - Datasets selection","link":"https:\/\/alix-tz.github.io\/phd\/posts\/020\/","description":"<p>HTR-United made identifying candidate datasets for McCATMuS a piece of cake. Once the rest of the CATMuS community agreed on the period to be covered by a \"modern and contemporary\" dataset, I created a simple script to parse the content of the HTR-United catalog and make a list of existing datasets covering documents written in the Latin alphabet and matching our time criteria.<\/p>
<p>Actually, here is the script!<\/p>
<div class=\"code\"><pre class=\"code literal-block\">import requests
import yaml
import pandas as pd

# Get the latest htr-united.yml from the main repository.
url_latest_htrunited = \"https:\/\/raw.githubusercontent.com\/HTR-United\/htr-united\/master\/htr-united.yml\"
response = requests.get(url_latest_htrunited)
catalog = yaml.safe_load(response.content)

def in_time_scope(dates):
    century_scope_min = 1600
    century_scope_max = 2100
    # Added guard: entries without dates cannot be assessed, skip them.
    if dates.get(\"notBefore\") is None or dates.get(\"notAfter\") is None:
        return False
    # We keep any dataset whose time span intersects with the period.
    if int(dates.get(\"notBefore\")) &lt; century_scope_min and int(dates.get(\"notAfter\")) &lt; century_scope_min:
        return False
    elif int(dates.get(\"notBefore\")) &gt; century_scope_max and int(dates.get(\"notAfter\")) &gt; century_scope_max:
        return False
    return True

filtered_by_date = []
for entry in catalog:
    if in_time_scope(entry.get(\"time\", {})):
        filtered_by_date.append(entry)
print(f\"Found {len(filtered_by_date)} entries matching the time scope.\")

targeted_script = \"Latn\"
filtered_by_script = []
for entry in filtered_by_date:
    if targeted_script in [s.get(\"iso\") for s in entry.get(\"script\")]:
        filtered_by_script.append(entry)
print(f\"Found {len(filtered_by_script)} entries matching the script criteria.\")

cols = [\"Script Type\", \"Time Span\", \"Languages\", \"Repository\", \"Project Name\", \"Dataset Name\"]
metadata_df = pd.DataFrame(columns=cols)

selected_entries = filtered_by_script
for entry in selected_entries:
    row = {k: \"\" for k in cols}
    languages = [l for l in entry.get(\"language\")]
    if len(languages) == 1:
        row[\"Languages\"] = languages[0]
    elif len(languages) &gt; 1:
        row[\"Languages\"] = \", \".join(languages)
    else:
        print(\"Couldn't find a field for language in this repository\")
        row[\"Languages\"] = \"no language\"
    # Build the time span from the catalog's notBefore\/notAfter dates.
    row[\"Time Span\"] = f'{entry.get(\"time\").get(\"notBefore\")}-{entry.get(\"time\").get(\"notAfter\")}'
    row[\"Project Name\"] = entry.get(\"project-name\", \"no project name\")
    repository = entry.get(\"url\", \"no url found\")
    if repository.startswith(\"https:\/\/github.com\/\"):
        row[\"Repository\"] = repository.split(\"https:\/\/github.com\/\")[-1]
    elif repository.startswith(\"https:\/\/zenodo.org\/\"):
        row[\"Repository\"] = repository.replace(\"https:\/\/zenodo.org\/\", \"zenodo:\")
    else:
        row[\"Repository\"] = repository
    row[\"Dataset Name\"] = entry.get(\"title\", \"no title found\")
    script_type = entry.get(\"script-type\")
    if script_type == \"only-typed\":
        row[\"Script Type\"] = \"Print\"
    elif script_type == \"only-manuscript\":
        row[\"Script Type\"] = \"Handwritten\"
    else:
        row[\"Script Type\"] = \"Mixed\"
    metadata_df.loc[len(metadata_df)] = row

metadata_df
<\/pre><\/div>
<p>I saved the output as a CSV and proceeded to go through each of the selected datasets and its metadata. I checked several things:<\/p>
<ul>
<li>I made sure the datasets were available and easy to download. For example, I excluded those requiring manual image retrieval.<\/li>
<li>I checked the format of the data, because I decided to initially focus only on datasets available in ALTO XML and PAGE XML.<\/li>
<li>I checked the overall compatibility between the transcription guidelines used for the dataset and those designed by CATMuS.<\/li>
<li>I also checked the conformity of the dataset by trying to import it into eScriptorium. This import allowed me to detect discrepancies between the names of the image files and the value given for the source image in the XML files, which prevented the import from running successfully.<sup id=\"fnref:images\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/020\/#fn:images\">1<\/a><\/sup><\/li>
<li>Loading a sample of the dataset in eScriptorium also allowed me to visually spot other incompatibilities with CATMuS that may not have been documented by the producers of the data.<sup id=\"fnref:segmentation\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/020\/#fn:segmentation\">2<\/a><\/sup><\/li>
<li>Finally, I considered the structure of the repository and, when necessary, how easy it would be to reorganize it into a single <code>data\/<\/code> folder containing the images and the XML files, which were often distributed among sub-folders.<\/li>
<\/ul>
<p>I assigned each dataset a priority number from 1 to 6. The lowest number was for datasets compatible with CATMuS without any modification (no dataset was given a priority rank of 1...), and 6 was for massive datasets that would require a nerve-racking script to be built correctly. My grading system is shown below.<\/p>
<ul>
<li>1 = ready as is<\/li>
<li>2 = needs to be <a href=\"https:\/\/github.com\/PonteIneptique\/choco-mufin\">chocomufin<\/a>-ed<\/li>
<li>3 = requires manual corrections but the dataset is very small, or the dataset is chocomufin\/catmus compatible but requires a script to build it<\/li>
<li>4 = requires manual corrections but the dataset is relatively big, or requires a script to be fixed<\/li>
<li>5 = requires manual corrections but the dataset is really big<\/li>
<li>6 = requires manual corrections but the dataset is really big and requires a personalized script to be built<\/li>
<\/ul>
<p>For example, <a href=\"https:\/\/htr-united.github.io\/share.html?uri=507bb514d\">\"Notaires de Paris - Bronod\"<\/a> had to be modified to comply with CATMuS requirements. This included replacing <code>[[<\/code> and <code>]]<\/code> <a href=\"https:\/\/catmus-guidelines.github.io\/html\/guidelines\/en\/corrections_and_others.html\">with <code>⟦<\/code> and <code>⟧<\/code><\/a>, and ignoring lines containing <code>¥<\/code>, a symbol used in LECTAUREP's datasets to transcribe signatures and paraphs. These were straightforward modifications, thanks to Chocomufin (see the sketch below). At the other end of the scale, <a href=\"https:\/\/htr-united.github.io\/share.html?uri=7a99090c5\">\"University of Denver Collections as Data - HTR Train and Validation Set JCRS_2020_5_27\"<\/a> is a massive dataset (2660 XML files), but it contains segmentation errors that produce erroneous transcriptions given the way the lines are drawn, and its annotation of superscripted text is not compatible with CATMuS. To make it compatible, it would be necessary to check and correct each page one by one.<\/p>
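<p>For Bronod, the fix therefore boils down to two substitutions and a line filter. Here is a minimal sketch of the equivalent logic (in practice this was expressed as a Chocomufin conversion table; the function below is hypothetical and written for this post):<\/p>
<div class=\"code\"><pre class=\"code literal-block\">def bronod_to_catmus(line):
    # Lines containing ¥ transcribe signatures or paraphs: drop them.
    if \"¥\" in line:
        return None
    # [[ and ]] become the CATMuS editorial marks ⟦ and ⟧.
    return line.replace(\"[[\", \"⟦\").replace(\"]]\", \"⟧\")
<\/pre><\/div>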
<p>I chose to focus on datasets with priority 2 for the <em>first<\/em> version of McCATMuS. Indeed, it will be possible to add more datasets to CATMuS in later versions, so there was no need to spend too much time manually cleaning datasets for now. I had 23 datasets with priority 2 to go through.<\/p>
<p>Identifying eligible datasets was not as time-consuming as cleaning them and collecting additional metadata turned out to be. However, it gave me a good idea of the challenges I would face when trying to aggregate the datasets. I would have liked to find a greater diversity of languages, but this wasn't possible at this stage, mainly because many non-French datasets require more elaborate corrections than applying Chocomufin and were thus given a priority score higher than 2.<\/p>
<p>The next post will cover the tedious phase of data cleaning and aggregation, along with metadata collection!<\/p>
<div class=\"footnote\">
<hr>
<ol>
<li id=\"fn:images\">
<p>It was the case in \"<a href=\"https:\/\/htr-united.github.io\/share.html?uri=c326a6fee\">Données vérité de terrain HTR+ Annuaire des propriétaires et des propriétés de Paris et du département de la Seine (1898-1923)<\/a>\", where the ALTO XML files are not explicitly linked to their corresponding source images. I believe it could be fixed, but it would require creating a script just for this purpose, and the dataset presented other incompatibilities with CATMuS' guidelines. <a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/020\/#fnref:images\" title=\"Jump back to footnote 1 in the text\">↩<\/a><\/p>
<\/li>
<li id=\"fn:segmentation\">
<p>For example, \"<a href=\"https:\/\/htr-united.github.io\/share.html?uri=43d1c93c7\">Argus des Brevets<\/a>\" contains some segmentation errors that will need to be corrected manually. <a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/020\/#fnref:segmentation\" title=\"Jump back to footnote 2 in the text\">↩<\/a><\/p>
<\/li>
<\/ol>
<\/div>","category":["CATMuS","datasets","HTR"],"guid":"https:\/\/alix-tz.github.io\/phd\/posts\/020\/","pubDate":"Thu, 29 Aug 2024 04:00:00 GMT"},{"title":"019 - McCATMuS #2 - Defining guidelines","link":"https:\/\/alix-tz.github.io\/phd\/posts\/019\/","description":"<p><a href=\"https:\/\/x.com\/JMFradeRue\/status\/1730191566508060883\">Previous experiments<\/a> have shown that conflicting transcription guidelines in training datasets make it less likely that a model will learn to transcribe correctly. This is particularly relevant when it comes to abbreviations, and it's something to keep in mind when merging existing datasets. We didn't really address this when we trained the <a href=\"https:\/\/inria.hal.science\/hal-04094241\">Manu McFrench model<\/a>, because it's difficult to retroactively align datasets to follow the same transcription rules. Unless you can afford to manually check every line, of course. In the case of Manu McFrench, however, we only merged datasets that didn't resolve abbreviations, which ensured a minimum of cohesion.<\/p>
<p>CATMuS was built on the foundation laid by CREMMALab and the <a href=\"https:\/\/hal.science\/hal-03716526\">annotation guidelines<\/a> developed by Ariane Pinche at the end of a seminar organized in 2021. These guidelines are intended to be generic, meaning they should be compatible with most transcription situations and are not project-specific. Following them helps data producers create ground truth that is compatible with data from other projects. It also helps those projects save time by not having to create transcription rules from scratch. 
In my experience, it is indeed easy for the members of a project discovering HTR to get caught up in the specifics of their own project and to lose sight of what is and is not relevant (or even a complication) in the transcription phase.<\/p>
<blockquote>
<p><em>It's worth mentioning that a project can choose to follow some of the CATMuS guidelines, while maintaining more specific rules for certain cases. If that's the case, the CATMuS guidelines can (should?) be used as a reference point. Ideally, the specific rules defined by a project should be retro-compatible with CATMuS. For example, if a project decides to use a special character to mark the end of each paragraph, then in order to create a CATMuS-compatible version of the dataset, I should only have to replace or remove that character. In such cases, the special character that was chosen should be unambiguous and the rule should be explicitly presented.<\/em><\/p>
<\/blockquote>
<p>As CREMMALab focused on the transcription of medieval manuscripts, so did the first CATMuS dataset and guidelines. As I said in my <a href=\"https:\/\/alix-tz.github.io\/phd\/posts\/018\/\">previous post<\/a>, I focused on data covering the modern and contemporary periods, for which there was no equivalent to the CREMMALab guidelines. So, when extending CATMuS to these periods, I started by collecting existing guidelines and comparing them. I used the <a href=\"https:\/\/hal.science\/hal-03697382\">CREMMA Medieval guidelines<\/a>, the <a href=\"https:\/\/gist.github.com\/alix-tz\/6f89444521bf1cab0522da520f7e4ff4\">CREMMA guidelines for modern and contemporary documents<\/a>, <a href=\"https:\/\/hal.science\/hal-04281804\">SETAF's guidelines<\/a> and <a href=\"https:\/\/hal.science\/hal-04557457\">CATMuS Print's guidelines<\/a> as a basis to elaborate the transcription rules for McCATMuS.<\/p>
<p>For each rubric, I <a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/1bFE-rRk6ZwgIHqXAOgwPo1s1zwQ-UPTLPnzjaRmTMsk\/edit?usp=sharing\">compared<\/a> what each set of rules recommended, when it covered the rubric at all. It was rare for all the guidelines to align, but some cases were easy to settle. For example, all the guidelines recommended not differentiating between regular s (<code>⟨s⟩<\/code>) and long s (<code>⟨ſ⟩<\/code>), except for the rules I had set for the modern and contemporary sources transcribed by CREMMA in 2021, before the CREMMALab seminar. It was thus decided that McCATMuS would make no distinction between the different types of s.<\/p>
<p>Some rubrics needed to be discussed to figure out why a given rule had been chosen in the first place by some of the projects, and to decide which one to keep for McCATMuS. In February, I met with Ariane Pinche and Simon Gabay to go over the rubrics that still needed to be settled. One example of a rule we discussed is how hyphenations are handled. CATMuS Medieval and the two CREMMA guidelines say to always use the same symbol (<code>⟨-⟩<\/code>), whereas for the SETAF and CATMuS Print datasets, inline hyphenations (<code>⟨-⟩<\/code>) are differentiated from hyphenations at the end of a line (<code>⟨¬⟩<\/code>). Other symbols, like <code>⟨⸗⟩<\/code>, were unanimously rejected.<\/p>
<p>Two factors were considered when making those decisions: the feasibility of a retro-conversion of the existing datasets, and the compatibility of the rule with as many projects as possible. In the case of hyphenations, I eventually decided to follow the same rule as CATMuS Medieval and CREMMA. On top of simplifying the compatibility of McCATMuS with CATMuS Medieval, replacing all <code>⟨¬⟩<\/code> with <code>⟨-⟩<\/code> proved much more straightforward than retroactively placing <code>⟨¬⟩<\/code> wherever there was indeed a hyphenation at the end of a line<sup id=\"fnref:hyphen\"><a class=\"footnote-ref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/019\/#fn:hyphen\">1<\/a><\/sup>.<\/p>
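<p>Concretely, the retained retro-conversion is a one-line substitution, whereas the reverse direction cannot be automated reliably (see the footnote below about decorative hyphens). A hypothetical sketch, written for this post:<\/p>
<div class=\"code\"><pre class=\"code literal-block\">def to_mccatmus_hyphens(line):
    # The retained rule: every end-of-line hyphen ⟨¬⟩ becomes a plain ⟨-⟩.
    # Going the other way (deciding which ⟨-⟩ marks a true end-of-line
    # hyphenation) would require checking each line manually.
    return line.replace(\"¬\", \"-\")
<\/pre><\/div>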
<p>Once the set of rules was fixed, I used it to sort the different datasets I had identified (I'll discuss this in the next post) and to decide which ones would be retained for McCATMuS v1. I also defined the transformation scenarios necessary to turn each of these datasets into a CATMuS-compatible version. Then, once McCATMuS v1 was ready, I integrated the modern and contemporary guidelines into the <a href=\"https:\/\/catmus-guidelines.github.io\/\">CATMuS website<\/a>, where the transcription guidelines for CATMuS Medieval were already published.<\/p>
<p>Now that I am done integrating the rules set for McCATMuS into the website, I am confident that we have successfully designed rules that are compatible across the medieval, modern and contemporary periods, despite some unavoidable exceptions. Two good examples of the impossibility of covering a whole millennium of document production with a single rule are <a href=\"https:\/\/catmus-guidelines.github.io\/html\/guidelines\/en\/abbreviations.html\">abbreviations<\/a> and <a href=\"https:\/\/catmus-guidelines.github.io\/html\/guidelines\/en\/punctuation.html\">punctuation signs<\/a>.<\/p>
<p>I've now explained how the transcription guidelines were established for McCATMuS. Next, I'll cover how they were applied to existing datasets to create the first version of the McCATMuS dataset.<\/p>
<div class=\"footnote\">
<hr>
<ol>
<li id=\"fn:hyphen\">
<p>You can't assume that every instance of <code>⟨-⟩<\/code> at the end of a line must be replaced with a <code>⟨¬⟩<\/code>. In many cases, it can be a simple typographic decoration marking the end of a paragraph or the end of a title. <a class=\"footnote-backref\" href=\"https:\/\/alix-tz.github.io\/phd\/posts\/019\/#fnref:hyphen\" title=\"Jump back to footnote 1 in the text\">↩<\/a><\/p>
<\/li>
<\/ol>
<\/div>","category":["CATMuS","guidelines","HTR"],"guid":"https:\/\/alix-tz.github.io\/phd\/posts\/019\/","pubDate":"Tue, 20 Aug 2024 04:00:00 GMT"},{"title":"018 - McCATMuS #1 - Overview","link":"https:\/\/alix-tz.github.io\/phd\/posts\/018\/","description":"<p>Last week, I attended <a href=\"https:\/\/dh2024.adho.org\/\">ADHO's annual conference<\/a> in Washington DC. I presented a short paper, co-authored with Floriane Chiffoleau and Hugo Scheithauer, about the documentation we wrote for eScriptorium (I wrote <a href=\"https:\/\/alix-tz.github.io\/phd\/posts\/018\/010\">a post<\/a> about it last year, and you can also find our presentation <a href=\"https:\/\/inria.hal.science\/hal-04594142\">here<\/a>). 
I was also a co-author on a long paper presented by Ariane Pinche about the <a href=\"https:\/\/inria.hal.science\/hal-04346939\">CATMuS Medieval dataset<\/a>.<\/p>
<p>CATMuS, which stands for \"Consistent Approach to Transcribing ManuScripts\", is a collective initiative and a framework for aggregating ground truth datasets using compatible <a href=\"https:\/\/catmus-guidelines.github.io\/\">transcription guidelines<\/a> for documents from different periods written in Romance languages. It started with <a href=\"https:\/\/huggingface.co\/datasets\/CATMuS\/medieval\">CATMuS Medieval<\/a>, but since January this year, I have been working on a version of CATMuS for the modern and contemporary periods.<\/p>
<p>While I should (and will) try to publish a data paper on CATMuS Modern &amp; Contemporary (I'll call it McCatmus from now on), I figured I could start with a series of blog posts here. I want to describe the various steps I followed to eventually release <a href=\"https:\/\/huggingface.co\/datasets\/CATMuS\/modern\">a dataset on HuggingFace<\/a> and, hopefully soon, the corresponding transcription model.<\/p>
<p>I started working on McCatmus in January, but because of a major personal event (I moved to Canada!), it took seven months of stop-and-go before the release of the V1. This was particularly challenging due to the scale of the project and its technicality: it was hard to get back into McCatmus after several weeks of interruption, which happened several times.<\/p>
<p>To add to this complexity, McCatmus was also a multi-front operation. Indeed, to create McCatmus, it was necessary to:<\/p>
<ul>
<li>define transcription guidelines in collaboration with other data producers,<\/li>
<li>identify datasets compatible with the guidelines and set priorities,<\/li>
<li>actually make all the datasets compatible with each other and clean some of the data,<\/li>
<li>model and collect metadata that made sense for this dataset,<\/li>
<li>release the dataset and fix the issues that came up.<\/li>
<\/ul>
<p>To date, two tasks remain on my to-do list for McCatmus: training a transcription model corresponding to this dataset and comparing it with other existing ones, and publishing a paper describing this dataset and its usefulness.<\/p>
<p>My plan is to dedicate one post to the creation of the guidelines for the dataset, then a post about the identification and collection of the datasets used in McCatmus v1, and then I'll wrap up with a post about the process of creating the dataset, the metadata and the release. Stay tuned!<\/p>","category":["CATMuS","HTR"],"guid":"https:\/\/alix-tz.github.io\/phd\/posts\/018\/","pubDate":"Wed, 14 Aug 2024 04:00:00 GMT"}]}}