Dear Nick,
thank you for your encouraging message.
As for the first question, I understand your concern about large publicly funded projects. Alas, there is no plan to release the code under a free software license: this is part of the funding mechanism, which integrates a private company, in the present case A2iA, the worldwide leader in Handwritten Text Recognition. The methods, however, are published in conferences and journals and can be replicated.
I am glad to announce, as for the second question, that we will release the plain text under a CC0 license. Please note that the search interface gives access to more than the plain text, since it gives access to indexes in which each "word on the page" is transcribed by many "word hypotheses". A "plain text" would only be the sequence of the "best word hypothesis" for each "word on the page", while it often happens that the correct reading is in the second or third hypothesis. So the plain text will be much poorer, even if, of course, it has great potential for further analysis.
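To make the difference concrete, here is a minimal sketch in Python. The words, scores, and names (`Hypothesis`, `plain_text`, `index_hits`) are invented for illustration; this is not the project's actual data format or code.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    confidence: float  # hypothetical score between 0 and 100


# Each inner list holds the ranked hypotheses for one "word on the page".
# All values are made up for the example.
page = [
    [Hypothesis("scriptor", 80.0), Hypothesis("scriptos", 12.0)],
    [Hypothesis("regie", 45.0), Hypothesis("regis", 40.0), Hypothesis("regio", 5.0)],
]


def plain_text(words):
    """Keep only the best hypothesis for each word position."""
    return " ".join(max(hyps, key=lambda h: h.confidence).text for hyps in words)


def index_hits(words, query, min_confidence=0.0):
    """Return the positions where any hypothesis matches the query."""
    return [
        i
        for i, hyps in enumerate(words)
        if any(h.text == query and h.confidence >= min_confidence for h in hyps)
    ]
```

In this toy example, `plain_text(page)` keeps "regie" for the second word, so a search for "regis" in the plain text finds nothing, whereas `index_hits(page, "regis")` still finds the word through its second hypothesis.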
Before releasing the "plain (automated) text", we plan to structure the corpus into single "documents" (here, typically, each royal charter recorded in the registers) and to attribute each part of the plain text to those documents, for which we have (more or less) metadata. The issue of granularity is of core importance to the HIMANIS project, because network analysis (esp. on persons and places) will have to take into account the "document" granularity and not the "page" or the "volume". So this is our priority for now.
Best regards, Dominique
–– M. Dominique Stutzmann Chargé de recherche à l'Institut de Recherche et d'Histoire des Textes (CNRS, UPR 841)
2017-06-29 11:54 GMT+02:00 Nick White nick.white@ell.ox.ac.uk:
Dear Dominique,
This does indeed look very interesting. I have a couple of questions.
First, is the software (or some part of it) used to extract the text from the manuscripts going to be released under a free software license? While I'm sure there's plenty of domain-specific stuff there, I'm also sure it could be very useful to other projects which are doing similar things.
I find it so frustrating when large publicly funded projects funnel their money into proprietary software that can't be further developed and built upon by others, cf. Transkribus. Here's hoping you're planning to release the code soon!
Second question: is there a way to access the plain text directly, or is there only a search interface at the moment? Having direct plain text access can be useful for others to do further analysis on the corpus.
Anyway, great work, looks exciting, congratulations!
Nick White
On Tue, Jun 20, 2017 at 03:12:12PM +0200, Dominique Stutzmann wrote:
Dear all,
Within the HIMANIS project, funded by the “Joint Programming Initiative on Cultural Heritage and Global Change” (JPI-CH) of the European Union, the partners are developing cost-effective solutions for querying large sets of handwritten document images. With IRHT and A2iA (France), the Universities of Valencia (Spain) and Groningen (Netherlands) as well as the French National Archive, it gathers Computer Science, Humanities and Cultural Heritage institutions in order to produce technology to generate new, research-based knowledge from historical manuscripts. As a challenging and particularly interesting case study, we have indexed the large collection of the Trésor des Chartes’ registers produced by the French royal chancery (Paris, Archives Nationales, JJ7 – JJ209).
Now we are proud to announce that you can search the plain text in the Trésor des Chartes’ registers and provide feedback: It is ready to be used and tested by all interested users worldwide! http://prhlt-kws.prhlt.upv.es/himanis/
This is a prototype and beta version, which will be amended and will change over the next months, with new functionalities (navigate through hits, display of abstracts and editions) and with additional volumes to be indexed from the French National Library and the National Archive.
The project website is: http://www.himanis.org/
The search interface into the corpus: http://prhlt-kws.prhlt.upv.es/himanis/
Additional explanations about the interface: https://himanis.hypotheses.org/105
You can search with boolean operators and word sequences (for the syntax, check on https://himanis.hypotheses.org/105)
You can help us measure the precision of our results:
- please click on highlighted hits to confirm whether the word is correctly spotted or not;
- please double click on a missed hit if you see it on the page (it will be added to the index for all users to search from the next day)
Two simple examples as a beginning:
- "scriptor" within the whole corpus: http://prhlt-kws.prhlt.upv.es/himanis/index.php?q=scriptor&t=10&feedback=1
- "pelerinage" on one page: http://prhlt-kws.prhlt.upv.es/himanis/index.php/ui/show/chancery/147/853?q=pelerinage&t=50&feedback=1
The complete index results from an automated image analysis process. You may find unexpected or false hits: for example, abbreviations are expanded automatically, and it is needless to say that these expansions are error-prone; likewise, place and person names are slightly less well spotted. You can refine the hit list by setting the "confidence" rate (between 0 and 100).
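For those who prefer to build search links in a script, the two example URLs above can be reproduced with a small Python sketch. The parameter names `q`, `t` and `feedback` are taken from those URLs; reading `t` as the confidence rate is an assumption based on the examples (t=10, t=50), and the helper `search_url` is hypothetical, not part of the interface.

```python
from urllib.parse import urlencode

# Base address taken from the example URLs in this message.
BASE = "http://prhlt-kws.prhlt.upv.es/himanis/index.php"


def search_url(query, threshold=10, feedback=True, page_path=""):
    """Build a search link shaped like the two examples.

    threshold is assumed to be the "confidence" rate (0 to 100);
    page_path optionally restricts the search to one page,
    e.g. "/ui/show/chancery/147/853".
    """
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    params = {"q": query, "t": threshold}
    if feedback:
        params["feedback"] = 1
    return f"{BASE}{page_path}?{urlencode(params)}"
```

For instance, `search_url("scriptor")` gives the first example link, and `search_url("pelerinage", 50, page_path="/ui/show/chancery/147/853")` gives the second.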
We hope that you will be as thrilled as we are to present these results, and we invite you to test, give feedback and send further comments, criticism and suggestions to himanis@irht.cnrs.fr!
Best regards
Dominique Stutzmann
–– M. Dominique Stutzmann Chargé de recherche à l'Institut de Recherche et d'Histoire des Textes (CNRS, UPR 841)
Digital Medievalist -- http://www.digitalmedievalist.org/
Journal: http://www.digitalmedievalist.org/journal/
Journal Editors: editors _AT_ digitalmedievalist.org
News: https://digitalmedievalist.wordpress.com/news/
Twitter: http://twitter.com/digitalmedieval
Facebook: http://www.facebook.com/group.php?gid=49320313760
Discussion list: dm-l@uleth.ca
Change list options: http://listserv.uleth.ca/mailman/listinfo/dm-l