Dear List,
I've received a query from a medievalist who is interested in applying OCR to manuscripts. I'm not really aware of recent work in this area and I'm wondering what, if anything, is being done at this time or has been done in the recent past. Last time I looked into it, good OCR from handwritten texts was a long way off, even for nicely written, straight English text, to say nothing of heavily abbreviated medieval Latin or Old English writing. But I'd be delighted to be proven wrong.
Thanks! Dot
--
Dot Porter, University of Kentucky
Program Coordinator
Collaboratory for Research in Computing for Humanities http://www.rch.uky.edu
Center for Visualization and Virtual Environments http://www.vis.uky.edu
dporter@uky.edu 859-257-1257 x.82115
Melissa Terras is the great expert on this, of course, along with our own Arianna Ciula and Paul Stokes. My understanding from a talk I heard her give last year in London is that it is a holy grail of OCR companies, but nowhere near practical.
I'm not an expert on the current state of this field, but I believe the approach at present is, at best, to assist humans in interpreting characters by looking for patterns that can be highlighted, rather than to attempt to replace humans in interpreting strings of manuscript text. In this regard it is something like how we used to use photocopiers to increase contrast, or different types of light to emphasise different types of strokes or ink, but it now goes further by looking for patterns that can then be approved or interpreted by a human.
So my sense is that if the question is "can I use a computer to avoid keying the content of this manuscript into a computer?" the answer is no. But if the question is "can I use a computer to help me recognise, interpret, or classify aspects or characters in this script?" the answer is yes, depending on the specific issues you are looking to analyse.
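As a rough illustration of the "help a human read" kind of assistance mentioned above (raising contrast so faint strokes stand out), here is a minimal sketch in plain Python. It is not taken from any of the projects mentioned in this thread; the function name `stretch_contrast`, the percentile defaults, and the representation of a page as a grid of 0-255 grey levels are all assumptions made for the example.

```python
def stretch_contrast(pixels, low_pct=2.0, high_pct=98.0):
    """Linearly rescale grey levels so the chosen low/high percentiles map to 0/255.

    `pixels` is a list of rows, each a list of integer grey levels (0-255).
    This is an illustrative helper, not part of any named tool.
    """
    flat = sorted(v for row in pixels for v in row)
    if not flat:
        return []

    def percentile(p):
        # Nearest-rank style lookup into the sorted grey levels.
        return flat[round((p / 100.0) * (len(flat) - 1))]

    lo, hi = percentile(low_pct), percentile(high_pct)
    if hi == lo:
        # Flat image: nothing to stretch, return an unchanged copy.
        return [list(row) for row in pixels]

    scale = 255.0 / (hi - lo)
    # Clamp so values outside the percentile range stay within 0-255.
    return [[max(0, min(255, round((v - lo) * scale))) for v in row]
            for row in pixels]


# A faint, low-contrast patch spread over the full grey range.
patch = [[100, 110], [120, 130]]
enhanced = stretch_contrast(patch)
```

The output is still an image for a person to interpret; nothing here attempts to recognise characters.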
Good, convenient starting places on the current state of the field are the articles in DM by Paul Stokes (newly published in DM 3: http://www.digitalmedievalist.org/journal/3/stokes/) and Arianna Ciula (DM 1: http://www.digitalmedievalist.org/journal/1.1/ciula/) [1]. We have a review coming of Terras's book _Image to Interpretation: An intelligent system to aid historians in reading the Vindolanda texts_ (Oxford: 2006).
-dan
[1] If you haven't been to the DM site in a while, you'll see it has been improved and reorganised. Old URLs for journal articles should still work however.
On Fri, 2008-03-21 at 13:58 -0400, Dot Porter wrote:
[snip]
Maybe this is interesting: the Handwriting Retrieval Demonstrations of the Center for Intelligent Information Retrieval (University of Massachusetts Amherst): http://ciir.cs.umass.edu/irdemo/hw-demo/ Not medieval, however.
See http://www.ai.rug.nl/alice/nwo-catch-scratch/index_english.html for another approach. From that page:

"For one thing, the massive amount of text images of handwritten pages will allow for the exploitation of modern statistical techniques. Data mining techniques such as clustering will uncover regularities in handwritten shapes that may act as the bridge to document retrieval and text analysis. In this project we will not aim for a veridical left-to-right transcription of handwritten documents: this would be an unrealistic target. Alternatively, at the end of the project, we will have delivered tools for keyword-based search in handwritten archives, akin to existing flat-text search methods ('Googling')."
Peter
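To make the idea of keyword-based search without transcription a little more concrete, here is a minimal sketch of query-by-example retrieval over word images. It is not the method of the project quoted above, which speaks of clustering; this example swaps in a much simpler nearest-neighbour ranking. The feature (ink density per vertical strip), the function names `column_profile` and `rank_by_similarity`, and the tiny binary images are all assumptions for illustration.

```python
import math


def column_profile(word_image, bins=8):
    """Fraction of ink in each of `bins` vertical strips of a binary word image.

    `word_image` is a list of equal-length rows with 1 for ink and 0 for background.
    """
    height = len(word_image)
    width = len(word_image[0])
    profile = []
    for b in range(bins):
        start = b * width // bins
        # Guarantee every strip is at least one pixel wide.
        end = max(start + 1, (b + 1) * width // bins)
        ink = sum(row[x] for row in word_image for x in range(start, end))
        profile.append(ink / (height * (end - start)))
    return profile


def rank_by_similarity(query_image, candidates, bins=8):
    """Return candidate labels ordered from most to least similar to the query."""
    q = column_profile(query_image, bins)
    scored = []
    for label, image in candidates.items():
        distance = math.dist(q, column_profile(image, bins))
        scored.append((distance, label))
    return [label for _, label in sorted(scored)]


# Hypothetical cropped word images: "a" resembles the query, "b" does not.
query = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
candidates = {
    "a": [[1, 1, 0, 0],
          [1, 0, 0, 0]],
    "b": [[0, 0, 1, 1],
          [0, 0, 1, 1]],
}
ranking = rank_by_similarity(query, candidates, bins=4)
```

A searcher would pick one occurrence of a word on a page and ask for the most similar shapes elsewhere, which sidesteps full transcription. Real handwriting would need far more robust features than this toy profile.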
If anyone wants to amuse themselves, go to http://demo.iupr.org/layout/layout.php and feed in your favourite manuscript page. This is Google's 'ocropus' software, the latest wonder from Google. This demo shows the system trying to find the text on the page, that is, to identify the lines and columns, etc. It does not do too badly at that. We have also played with Gamera for the same purpose: http://ldp.library.jhu.edu/projects/gamera/

all the best
Peter

On 21 Mar 2008, at 22:11, Peter Boot wrote:
[snip]
Digital Medievalist -- http://www.digitalmedievalist.org/ Journal: http://www.digitalmedievalist.org/journal/ Journal Editors: editors@digitalmedievalist.org News: http://www.digitalmedievalist.org/news/ Wiki: http://www.digitalmedievalist.org/wiki/ Discussion list: dm-l@uleth.ca Change list options: http://listserv.uleth.ca/mailman/listinfo/dm-l
Peter Robinson
Institute for Textual Scholarship and Electronic Editing
Elmfield House, Selly Oak Campus
University of Birmingham
Edgbaston B29 6LG
P.M.Robinson@bham.ac.uk
p. +44 (0)121 4158441, f. +44 (0) 121 415 8376
www.itsee.bham.ac.uk