Dear Dominique,
Thanks for getting back to me!
I look forward to reading the papers relating to the Handwritten Text Recognition you used. Can you please point me to them (or let me know where they'll be published when they're ready, if they aren't yet available)? OCR is a field I've been involved in for a while, though I'm only just starting to learn about handwriting recognition. Does the A2iA approach use neural networks, and leverage some open source software within it, I wonder? I look forward to finding out from the journal articles!
Excellent news that you'll be releasing the plain text in due course, under an appropriately non-restrictive license. Out of curiousity, how is the output from the text recognition engine being stored and used at the moment? Is it in some XML format listing locations and alternatives for words, or is it in a database, or something else? For bonus credits you could always release that too; though probably it'd be a bit niche for anyone else to reuse, who knows, and it could provide a good exemplar for good practise for others to follow or investigate when designing their own project.
Great project, I've enjoyed reading about it, and keep up the good work.
Nick
On Thu, Jun 29, 2017 at 12:37:25PM +0200, Dominique Stutzmann wrote:
Dear Nick,
thank you for your encouraging message.
As for the first question, I understand your concern about large publically funded projects. Alas, there is no plan to release the code under a free software license: This is part of the funding mechanism integrating a private company, in the present case A2iA, which is the leader worldwide in Handwritten Text Recognition. The methods, however, are published in conference and journals and can be replicated.
I am glad to announce, as for the second question, that we will release the plain text under CC0 license. Please note the search interface gives access to more than the plain text, since it gives access to indexes in which each "word on the page" is transcribed by many "word hypotheses". A "plain text" would only be the sequence of "best word hypothese" for the "word on the page", while it often happens that the correct reading is in the second or third hypothese. So, plain text will be much poorer, even if, of course, it has a great potential for further analysis.
Before releasing the "plain (automated) text", we plan to structure the corpus in single "documents" (here, typically, each royal charter recorded in the registers) and to attribute each part of the plain text to those documents, for which we have (more or less) metadata. The issue of granularity is of core importance to HIMANIS project, because network analysis (esp. on persons and places) will have to take into account the "document" granularity and not the "page" or the "volume". So, this is our priority by now.
Best regards, Dominique
–– M. Dominique Stutzmann Chargé de recherche à l'Institut de Recherche et d'Histoire des Textes (CNRS, UPR 841)
2017-06-29 11:54 GMT+02:00 Nick White nick.white@ell.ox.ac.uk:
Dear Dominique, This does indeed look very interesting. I have a couple of questions. First, is the software (or some part of it) used to extract the text from the manuscripts going to be released under a free software license? While I'm sure there's plenty of domain-specific stuff there, I'm also sure it could be very useful to other projects which are doing similar things. I find it so frustrating when large publically funded projects funnel their money into proprietary software that can't be further developed and built upon by others. c.f. Transkribus. Here's hoping you're planning to release the code soon! Second question, is there a way to access the plain text directly, or is there only a search interface at the moment? Having direct plain text access can be useful for others to do various further analysis on the corpus. Anyway, great work, looks exciting, congratulations! Nick White On Tue, Jun 20, 2017 at 03:12:12PM +0200, Dominique Stutzmann wrote: > Dear all, > > Within the HIMANIS project, funded by the Joint Programming Initiative on > Cultural Heritage and Global Change” (JPI-CH) of the European Union, the > partners are developing cost-effective solutions for querying large sets of > handwritten document images. With IRHT and A2iA (France), the Universities of > Valencia (Spain) and Groningen (Netherlands) as well as the French National > Archive, it gathers Computer Science, Humanities and Cultural Heritage > institutions in order to produce technology to generate new, research-based > knowledge from historical manuscripts. As a challenging and particularly > interesting case study, we have indexed the large collection of the Trésor des > Chartes’ registers produced by the French royal chancery (Paris, Archives > Nationales, JJ7 – JJ209). > > Now we are proud to announce that you can search the plain text in the Trésor > des Chartes’ registers and provide feedback: It is ready to be used and tested > by all interested users worldwide! > http://prhlt-kws.prhlt.upv.es/himanis/ > > > This is a prototype and beta version, which will be amended and will change > over the next months, with new functionalities (navigate through hits, display > of abstracts and editions) and with additional volumes to be indexed from the > French National Library and the National Archive.. > > The project website is: http://www.himanis.org/ > The search interface into the corpus: http://prhlt-kws.prhlt.upv.es/ himanis/ > Additional explanations about the interface: https://himanis.hypotheses. org/ > 105 > > You can search with boolean operators and word sequences (for the syntax, check > on https://himanis.hypotheses. org/105) > > You can help us measuring the precision of our results: > - please click on highlighted hits to confirm whether the word is correctly > spotted or not; > - please double click on a missed hit if you see it on the page (it will be > added to the index for all users to search from the next day) > > Two simple examples as a beginning: > - "scriptor" within the whole corpus: http://prhlt-kws.prhlt.upv.es/ himanis/ > index.php?q=scriptor& t=10&feedback=1 > - "pelerinage" on one page : http://prhlt-kws.prhlt.upv.es/ himanis/ index.php/ > ui/show/ chancery/147/853?q=pelerinage& t=50&feedback=1 > > The complete indexing results from an automated, image analysis process. You > may find unexpected or false hits: for example, abbreviations are expanded > automatically and it is needless to say that they are error-prone; likewise > place and person names are slightly less well spotted. You can enhance the hit > list by setting the "confidence" rate (between 0 and 100). > > We hope that you will be as thrilled as we are to present these results and we > invite you to test, give feedback and send further comments, critics and > suggestions to himanis@irht.cnrs.fr! > > Best regards > > Dominique Stutzmann > –– > M. Dominique Stutzmann > Chargé de recherche à l'Institut de Recherche et d'Histoire des Textes (CNRS, > UPR 841) > Digital Medievalist -- http://www.digitalmedievalist.org/ > Journal: http://www.digitalmedievalist.org/journal/ > Journal Editors: editors _AT_ digitalmedievalist.org > News: https://digitalmedievalist.wordpress.com/news/ > Twitter: http://twitter.com/digitalmedieval > Facebook: http://www.facebook.com/group.php?gid=49320313760 > Discussion list: dm-l@uleth.ca > Change list options: http://listserv.uleth.ca/mailman/listinfo/dm-l