Re: [dm-l] Plain text access to medieval manuscripts: 147 volumes indexed in the HIMANIS project

29 Jun 2017

Dear Nick,
Thank you for your encouraging message.
As for the first question, I understand your concern about large publicly
funded projects. Alas, there is no plan to release the code under a free
software license: this is part of the funding mechanism, which integrates a
private company, in the present case A2iA, the worldwide leader in
Handwritten Text Recognition. The methods, however, are published in
conference proceedings and journals and can be replicated.
As for the second question, I am glad to announce that we will release the
plain text under a CC0 license. Please note that the search interface gives
access to more than the plain text: it draws on indexes in which each
"word on the page" is transcribed by many "word hypotheses". A "plain text"
would only be the sequence of the "best word hypothesis" for each "word on
the page", while it often happens that the correct reading is the second or
third hypothesis. So the plain text will be much poorer, even if, of course,
it has great potential for further analysis.
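To illustrate the difference between the index and a plain text, here is a
minimal Python sketch. The data layout, the names (`WordSlot`, `plain_text`,
`search`) and the confidence values are assumptions made for illustration
only; they are not the HIMANIS data format or the software used in the
project.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordSlot:
    """One "word on the page" with its ranked hypotheses (hypothetical layout)."""
    hypotheses: List[Tuple[str, float]]  # (reading, confidence 0-100), best first


def plain_text(slots):
    """Keep only the best hypothesis of each slot."""
    return " ".join(slot.hypotheses[0][0] for slot in slots)


def search(slots, query, min_confidence=0.0):
    """Return the positions of slots where any hypothesis matches the query."""
    return [
        i for i, slot in enumerate(slots)
        if any(word == query and conf >= min_confidence
               for word, conf in slot.hypotheses)
    ]


# Invented example: the correct reading "scriptor" is only the second hypothesis.
line = [
    WordSlot([("per", 92.0), ("pro", 5.0)]),
    WordSlot([("scriptur", 48.0), ("scriptor", 41.0), ("scriptum", 6.0)]),
]
```

In this invented example, `plain_text(line)` keeps "scriptur" for the second
word, so a search of the plain text for "scriptor" would miss it, whereas
`search(line, "scriptor")` still finds it at position 1.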
Before releasing the "plain (automated) text", we plan to structure the
corpus into single "documents" (here, typically, each royal charter recorded
in the registers) and to attribute each part of the plain text to those
documents, for which we have (more or less) metadata. The issue of
granularity is of core importance to the HIMANIS project, because network
analysis (esp. on persons and places) will have to take into account the
"document" granularity and not the "page" or the "volume". So this is our
priority for now.
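A small Python sketch of why the granularity matters for network analysis.
The records, field names and the helper `cooccurrence_edges` are invented for
illustration; they do not reflect the project's metadata or methods.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_edges(mentions, key):
    """Link every pair of names that share the same unit (page or document)."""
    groups = defaultdict(set)
    for mention in mentions:
        groups[key(mention)].add(mention["name"])
    edges = set()
    for names in groups.values():
        edges.update(combinations(sorted(names), 2))
    return edges


# Invented example: two charters copied on the same register page.
mentions = [
    {"page": 12, "document": "charter-A", "name": "Jean"},
    {"page": 12, "document": "charter-A", "name": "Pierre"},
    {"page": 12, "document": "charter-B", "name": "Guillaume"},
]

by_page = cooccurrence_edges(mentions, key=lambda m: m["page"])
by_document = cooccurrence_edges(mentions, key=lambda m: m["document"])
```

At page level the sketch links Guillaume to Jean and Pierre although they
appear in different charters; at document level only Jean and Pierre remain
linked.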
Best regards,
Dominique
––
M. Dominique Stutzmann
Chargé de recherche à l'Institut de Recherche et d'Histoire des Textes
(CNRS, UPR 841)
2017-06-29 11:54 GMT+02:00 Nick White nick.white@ell.ox.ac.uk:
Dear Dominique,
This does indeed look very interesting. I have a couple of
questions.
First, is the software (or some part of it) used to extract the text
from the manuscripts going to be released under a free software
license? While I'm sure there's plenty of domain-specific stuff
there, I'm also sure it could be very useful to other projects which
are doing similar things.
I find it so frustrating when large publicly funded projects
funnel their money into proprietary software that can't be further
developed and built upon by others, cf. Transkribus. Here's hoping
you're planning to release the code soon!
Second question, is there a way to access the plain text directly,
or is there only a search interface at the moment? Having direct
plain text access can be useful for others to do various further
analysis on the corpus.
Anyway, great work, looks exciting, congratulations!
Nick White
On Tue, Jun 20, 2017 at 03:12:12PM +0200, Dominique Stutzmann wrote:
Dear all,
Within the HIMANIS project, funded by the “Joint Programming Initiative on
Cultural Heritage and Global Change” (JPI-CH) of the European Union, the
partners are developing cost-effective solutions for querying large sets of
handwritten document images. With IRHT and A2iA (France), the Universities
of Valencia (Spain) and Groningen (Netherlands) as well as the French
National Archive, it gathers Computer Science, Humanities and Cultural
Heritage institutions in order to produce technology to generate new,
research-based knowledge from historical manuscripts. As a challenging and
particularly interesting case study, we have indexed the large collection
of the Trésor des Chartes’ registers produced by the French royal chancery
(Paris, Archives Nationales, JJ7 – JJ209).
Now we are proud to announce that you can search the plain text in the
Trésor des Chartes’ registers and provide feedback: it is ready to be used
and tested by all interested users worldwide!
http://prhlt-kws.prhlt.upv.es/himanis/
This is a prototype and beta version, which will be amended and will change
over the next months, with new functionalities (navigating through hits,
display of abstracts and editions) and with additional volumes to be
indexed from the French National Library and the National Archive.
The project website is: http://www.himanis.org/
The search interface into the corpus: http://prhlt-kws.prhlt.upv.es/himanis/
Additional explanations about the interface: https://himanis.hypotheses.org/105
You can search with boolean operators and word sequences (for the syntax,
check https://himanis.hypotheses.org/105).
You can help us measure the precision of our results:
- please click on highlighted hits to confirm whether the word is correctly
  spotted or not;
- please double click on a missed hit if you see it on the page (it will be
  added to the index for all users to search from the next day).
Two simple examples as a beginning:
- "scriptor" within the whole corpus:
  http://prhlt-kws.prhlt.upv.es/himanis/index.php?q=scriptor&t=10&feedback=1
- "pelerinage" on one page:
  http://prhlt-kws.prhlt.upv.es/himanis/index.php/ui/show/chancery/147/853?q=pelerinage&t=50&feedback=1
The complete indexing results from an automated image analysis process.
You may find unexpected or false hits: for example, abbreviations are
expanded automatically and, needless to say, this expansion is error-prone;
likewise, place and person names are slightly less well spotted. You can
refine the hit list by setting the "confidence" rate (between 0 and 100).
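The two example links above share the same query parameters, so such links
can also be built programmatically. The Python sketch below only reproduces
the pattern visible in those links; that `t` carries the confidence rate is
an assumption, and `search_url` is a hypothetical helper, not part of the
HIMANIS interface.

```python
from urllib.parse import urlencode

BASE = "http://prhlt-kws.prhlt.upv.es/himanis/index.php"


def search_url(query, threshold=10, feedback=True, path=""):
    """Build a search link; `threshold` is assumed to map to the `t` parameter."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    params = urlencode({"q": query, "t": threshold, "feedback": int(feedback)})
    return f"{BASE}{path}?{params}"
```

For instance, `search_url("scriptor")` rebuilds the first example link, and
`search_url("pelerinage", 50, path="/ui/show/chancery/147/853")` rebuilds the
second.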
We hope that you will be as thrilled as we are to present these results,
and we invite you to test, give feedback and send further comments,
criticism and suggestions to himanis@irht.cnrs.fr!
Best regards
Dominique Stutzmann
––
M. Dominique Stutzmann
Chargé de recherche à l'Institut de Recherche et d'Histoire des Textes
(CNRS, UPR 841)
Digital Medievalist --  http://www.digitalmedievalist.org/
Journal: http://www.digitalmedievalist.org/journal/
Journal Editors: editors _AT_ digitalmedievalist.org
News: https://digitalmedievalist.wordpress.com/news/
Twitter: http://twitter.com/digitalmedieval
Facebook: http://www.facebook.com/group.php?gid=49320313760
Discussion list: dm-l@uleth.ca
Change list options: http://listserv.uleth.ca/mailman/listinfo/dm-l
