Re: [dm-l] Plain text access to medieval manuscripts: 147 volumes indexed in the HIMANIS project

10 Jul 2017

      Dear Dominique,
Thanks for getting back to me!
I look forward to reading the papers relating to the Handwritten 
Text Recognition you used. Can you please point me to them (or let 
me know where they'll be published when they're ready, if they 
aren't yet available)? OCR is a field I've been involved in for a 
while, though I'm only just starting to learn about handwriting 
recognition. Does the A2iA approach use neural networks, and 
leverage some open source software within it, I wonder? I look 
forward to finding out from the journal articles!
Excellent news that you'll be releasing the plain text in due
course, under an appropriately non-restrictive license. Out of
curiosity, how is the output from the text recognition engine being
stored and used at the moment? Is it in some XML format listing
locations and alternatives for words, or is it in a database, or
something else? For bonus credits you could always release that too.
It would probably be a bit niche for anyone else to reuse, but who
knows, and it could serve as an exemplar of good practice for
others to follow or investigate when designing their own projects.
Great project, I've enjoyed reading about it, and keep up the good 
work.
Nick
On Thu, Jun 29, 2017 at 12:37:25PM +0200, Dominique Stutzmann wrote:
...
Dear Nick,
thank you for your encouraging message.
As for the first question, I understand your concern about large publicly
funded projects. Alas, there is no plan to release the code under a free
software license: this is part of the funding mechanism, which integrates a
private company, in the present case A2iA, the worldwide leader in Handwritten
Text Recognition. The methods, however, are published in conference proceedings
and journals and can be replicated.
I am glad to announce, as for the second question, that we will release the
plain text under a CC0 license. Please note that the search interface gives
access to more than the plain text, since it gives access to indexes in which
each "word on the page" is transcribed by many "word hypotheses". A "plain
text" would only be the sequence of the "best word hypothesis" for each "word
on the page", while it often happens that the correct reading is in the second
or third hypothesis. So, plain text will be much poorer, even if, of course, it
has great potential for further analysis.
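The difference between the index and the plain text can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data layout, the function names (best_reading_text, index_hits), the sample readings and the confidence values are all invented here, and none of it reflects the actual HIMANIS or A2iA storage format.

```python
from typing import List, Tuple

# Hypothetical layout: one ranked list of (reading, confidence 0-100)
# per "word on the page". All values are invented for illustration.
Hypotheses = List[Tuple[str, int]]


def best_reading_text(page: List[Hypotheses]) -> str:
    """Build the 'plain text': only the top-scoring hypothesis per word."""
    return " ".join(max(hyps, key=lambda rc: rc[1])[0] for hyps in page if hyps)


def index_hits(page: List[Hypotheses], query: str, min_conf: int = 0) -> List[int]:
    """Return word positions where ANY hypothesis matches the query
    with a confidence of at least min_conf."""
    return [
        i
        for i, hyps in enumerate(page)
        if any(reading == query and conf >= min_conf for reading, conf in hyps)
    ]


page = [
    [("scriptor", 62), ("scriptum", 30)],
    [("regit", 48), ("regis", 41), ("regi", 11)],
]

plain = best_reading_text(page)
print(plain)                         # scriptor regit
print("regis" in plain.split())      # False: lost in the plain text
print(index_hits(page, "regis", 25)) # [1]: still found through the index
```

In this toy page the second-ranked reading "regis" disappears from the plain text but remains searchable in the index, and raising the confidence threshold trims weaker hits.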
Before releasing the "plain (automated) text", we plan to structure the corpus
into single "documents" (here, typically, each royal charter recorded in the
registers) and to attribute each part of the plain text to those documents, for
which we have (more or less) metadata. The issue of granularity is of core
importance to the HIMANIS project, because network analysis (esp. on persons
and places) will have to take into account the "document" granularity and not
the "page" or the "volume". So this is our priority for now.
Best regards,
Dominique
––
M. Dominique Stutzmann
Chargé de recherche à l'Institut de Recherche et d'Histoire des Textes (CNRS,
UPR 841)
2017-06-29 11:54 GMT+02:00 Nick White <nick.white@ell.ox.ac.uk>:
Dear Dominique,

This does indeed look very interesting. I have a couple of
questions.

First, is the software (or some part of it) used to extract the text
from the manuscripts going to be released under a free software
license? While I'm sure there's plenty of domain-specific stuff
there, I'm also sure it could be very useful to other projects which
are doing similar things.

I find it so frustrating when large publicly funded projects
funnel their money into proprietary software that can't be further
developed and built upon by others, cf. Transkribus. Here's hoping
you're planning to release the code soon!

Second question, is there a way to access the plain text directly,
or is there only a search interface at the moment? Having direct
plain text access can be useful for others to do various further
analysis on the corpus.

Anyway, great work, looks exciting, congratulations!

Nick White

On Tue, Jun 20, 2017 at 03:12:12PM +0200, Dominique Stutzmann wrote:
> Dear all,
>
> Within the HIMANIS project, funded by the “Joint Programming Initiative on
> Cultural Heritage and Global Change” (JPI-CH) of the European Union, the
> partners are developing cost-effective solutions for querying large sets of
> handwritten document images. With IRHT and A2iA (France), the Universities of
> Valencia (Spain) and Groningen (Netherlands) as well as the French National
> Archive, it gathers Computer Science, Humanities and Cultural Heritage
> institutions in order to produce technology to generate new, research-based
> knowledge from historical manuscripts. As a challenging and particularly
> interesting case study, we have indexed the large collection of the Trésor des
> Chartes’ registers produced by the French royal chancery (Paris, Archives
> Nationales, JJ7 – JJ209).
>
> Now we are proud to announce that you can search the plain text in the Trésor
> des Chartes’ registers and provide feedback: It is ready to be used and tested
> by all interested users worldwide!
> http://prhlt-kws.prhlt.upv.es/himanis/
>
>
> This is a prototype and beta version, which will be amended and will change
> over the next months, with new functionalities (navigate through hits, display
> of abstracts and editions) and with additional volumes to be indexed from the
> French National Library and the National Archive.
>
> The project website is: http://www.himanis.org/
> The search interface into the corpus: http://prhlt-kws.prhlt.upv.es/himanis/
> Additional explanations about the interface: https://himanis.hypotheses.org/105
>
> You can search with boolean operators and word sequences (for the syntax, check
> on https://himanis.hypotheses.org/105)
>
> You can help us measure the precision of our results:
> - please click on highlighted hits to confirm whether the word is correctly
>   spotted or not;
> - please double click on a missed hit if you see it on the page (it will be
>   added to the index for all users to search from the next day)
>
> Two simple examples as a beginning:
> - "scriptor" within the whole corpus:
>   http://prhlt-kws.prhlt.upv.es/himanis/index.php?q=scriptor&t=10&feedback=1
> - "pelerinage" on one page:
>   http://prhlt-kws.prhlt.upv.es/himanis/index.php/ui/show/chancery/147/853?q=pelerinage&t=50&feedback=1
>
> The complete indexing results from an automated, image analysis process. You
> may find unexpected or false hits: for example, abbreviations are expanded
> automatically and it is needless to say that they are error-prone; likewise
> place and person names are slightly less well spotted. You can enhance the hit
> list by setting the "confidence" rate (between 0 and 100).
>
> We hope that you will be as thrilled as we are to present these results and we
> invite you to test, give feedback and send further comments, criticism and
> suggestions to himanis@irht.cnrs.fr!
>
> Best regards
>
> Dominique Stutzmann
> ––
> M. Dominique Stutzmann
> Chargé de recherche à l'Institut de Recherche et d'Histoire des Textes (CNRS,
> UPR 841)

> Digital Medievalist --  http://www.digitalmedievalist.org/
> Journal: http://www.digitalmedievalist.org/journal/
> Journal Editors: editors _AT_ digitalmedievalist.org
> News: https://digitalmedievalist.wordpress.com/news/
> Twitter: http://twitter.com/digitalmedieval
> Facebook: http://www.facebook.com/group.php?gid=49320313760
> Discussion list: dm-l@uleth.ca
> Change list options: http://listserv.uleth.ca/mailman/listinfo/dm-l
