I think in the end that the answer to all these questions depends on how important the result is to you and what type of result you are looking for.
James's answer to the question of OCR is bang on, so I've nothing to add there. And as those who've worked with me know, I'm a huge believer in wikis and use them all the time for various, sometimes unorthodox, things. Crowdsourcing the proofreading of flat texts à la Gutenberg is a pretty good example of what they might be good at. Though in the case of unnegotiated crowdsourcing I'd be worried about incremental version control--i.e. not that you couldn't see what changes had been made and reverse damage, but that the workflow would become so cumbersome if people started undoing each other's corrections that you'd lose any possible efficiencies.
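That said, the audit trail itself is easy enough to keep an eye on programmatically. As a minimal sketch (Python, against the standard MediaWiki API; the endpoint URL and page title below are placeholders, not a real project), one could pull the recent revision history of a transcription page and watch for exactly the kind of churn I'm worried about:

    # Minimal sketch: list recent revisions of a wiki transcription page
    # via the standard MediaWiki API. The URL and title are placeholders.
    import requests

    API = "https://example.org/w/api.php"  # hypothetical wiki endpoint

    params = {
        "action": "query",
        "prop": "revisions",
        "titles": "Transcription:MS_Example_f1r",  # hypothetical page
        "rvprop": "ids|user|timestamp|comment",
        "rvlimit": 20,
        "format": "json",
    }

    data = requests.get(API, params=params).json()
    for page in data["query"]["pages"].values():
        for rev in page.get("revisions", []):
            print(rev["revid"], rev["timestamp"], rev["user"],
                  rev.get("comment", ""))

If the same passage keeps changing hands back and forth, that is the cumbersome workflow showing up in the data.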
I find the belief in the power of agreed-upon conventions quite touching, if somewhat otherworldly. My experience as an author, editor, journal editor, and scholar is that consistency of application in the absence of validation is impossible. Even in single-author works you tend to forget exactly the format you used earlier for, for example, bibliographic references, and I've seen too much minor variation among authors who were trying to follow, say, the Chicago style to think that any convention not subject to validation will be implemented consistently.
However, this is only important if you plan to do something other than print your texts to the screen. And it is also possible to retrofit markup over proofread flat text.
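To make concrete what I mean by validation, here is a minimal sketch (Python with lxml; the schema and directory names are placeholders--any RELAX NG schema, for instance one generated from a TEI ODD, would serve) of the kind of mechanical check that catches the variation humans miss:

    # Minimal sketch: validate retrofitted XML files against a RELAX NG
    # schema so inconsistent conventions surface mechanically.
    import glob
    from lxml import etree

    schema = etree.RelaxNG(etree.parse("tei_all.rng"))  # placeholder schema

    for path in glob.glob("transcriptions/*.xml"):
        doc = etree.parse(path)
        if not schema.validate(doc):
            for err in schema.error_log:  # every rule the file breaks
                print(f"{path}:{err.line}: {err.message}")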
One last observation: you can get double-keyed text at a guaranteed accuracy of 99.5% or higher for probably about $2-3 a page if you have sufficient volume (99.995% and about $1.50/page for modern print). I'd now consider that route before constructing any particular transcription scheme (there is some hope that the TEI will be offering keying at these prices for smaller jobs as a membership benefit later this fall, so I've been paying attention to prices lately).
-----------
Daniel O'Donnell
University of Lethbridge
(From my mobile telephone)
--- original message ---
From: "Buckner" <d3uckner(a)btinternet.com>
Subject: Re: [dm-l] Use of wikis for transcription
Date: July 28, 2009
Time: 5:04:51
Thanks for these replies. From work on other wikis, in particular
Wikipedia, I think
1. Crowdsourcing is very poor at anything involving summarisation, synthesis
and so on. Hence Wikipedia is good at biographies (which have a set format
and usually follow the progress of someone's life in the obvious order) but
very poor at high-level subjects like 'History', 'Philosophy', or the 'Roman
Empire', where 95% of the work is sourcing the relevant and important facts
and so on.
2. There is no problem with conventions - co-editors are generally quick to
absorb relevant policy, house style, and so on (overly so, in my view).
3. For these reasons, wikis are well suited to translation work (which makes
absolutely no demands on organisation or synthesis).
4. For similar reasons, transcription would also be well suited to wiki
work.
5. What originally drew me to the idea was finding an important medieval
work (a critical edition from the 1960s) in a London library where the
basement had clearly flooded at some time. The volumes were out of order,
there were missing leaves, and one volume was missing entirely. Many
important works are not critical editions but simply transcriptions made by
dedicated enthusiasts. These are published in obscure journals like CIMAGL,
in Courier font, generally not checked by others (in my view - it is easy to
locate mistakes), and generally not accessible to the outside world.
6. Thus, publication on a wiki would ensure much better access to important
works, and also the opportunity for others to check.
7. Some here have commented on the use of character recognition, which I
find bizarre. I studied optical pattern recognition in the 1980s, and it
was accepted then - and is, I think, still true - that machines cannot
understand human speech or writing unless they also grasp the semantics. I
can work through a text without concentrating on the meaning and I can get
probably a 90% success rate. Then I go through again, this time translating
as I go along and get a 98% success rate. Finally I go one level higher (it
is philosophy I usually translate) and try to understand not just what the
writer is literally saying in their language, but what they actually mean,
the argument they are making. This gets me to 99% but I am still learning.
It is very difficult to transcribe medieval texts without a deep
understanding of the *kind* of thing the writer is trying to say. That is
because the writer was communicating with his or her (usually his) audience
knowing the assumptions they would make and which would not need to be
clarified.
8. To give an example, some years ago I hired a Cambridge PhD to help me
brush up my Latin. We worked through some medieval texts and we got stuck
at 'Minor patet'. He thought this meant 'it is less clear'. In fact, as I
soon found out, 'Minor' in this context means 'the minor proposition' (of a
syllogism).
9. I did try out my OCR on a manuscript, but it was completely hopeless.
Only humans will ever be able to read these things.
10. Thanks for the tips about XML. I do work with XML and indeed I have
made many experiments with trying to present images of manuscripts together
with the Latin transcript and then an English translation. Another reason
for presenting the material like this is that we should no longer be hostage
to the person making a transcription, who often construes the Latin in
a way that suits their own interpretation of grammar and meaning. It was not
until I started reading manuscripts that I realised how much of the printed
material we read is simply a typographer's invention. For example, medieval
texts do not generally use the honorific capital: they write 'aristotle'
and even 'god', rather than 'Aristotle' or 'God'. Actually, they don't even
write the full word. There are standard abbreviations for all the commonly
used words, such as Aristotle, Priscianus and so on. The only way to
present this material is to give the original, a transcript in the original
language, and a translation into a modern language.
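In my own XML experiments, the tidiest way I have found to carry both layers in a single file is the TEI <choice> mechanism, pairing <abbr> with <expan>. A sketch (Python with lxml; the sample clause is invented for illustration) of how one encoding yields both the diplomatic and the normalised reading:

    # Sketch: extract either the abbreviated (diplomatic) or expanded
    # (normalised) reading from a TEI-style <choice> encoding.
    from lxml import etree

    TEI = "http://www.tei-c.org/ns/1.0"

    # Invented sample: a scribal abbreviation, no honorific capital.
    sample = """<p xmlns="http://www.tei-c.org/ns/1.0">ut dicit
      <choice><abbr>ar.</abbr><expan>aristoteles</expan></choice></p>"""

    root = etree.fromstring(sample)

    def reading(root, keep):
        # keep="abbr" for the diplomatic text, "expan" for the normalised one
        drop = "expan" if keep == "abbr" else "abbr"
        texts = root.xpath(f".//text()[not(ancestor::tei:{drop})]",
                           namespaces={"tei": TEI})
        return " ".join(" ".join(texts).split())

    print(reading(root, "abbr"))   # -> ut dicit ar.
    print(reading(root, "expan"))  # -> ut dicit aristoteles

The same file can then drive the facing transcript, the reading text, and the translation.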
Edward
Hello,
I am thinking about setting up a wiki for the purposes of transcribing medieval manuscripts. One such experiment is here
http://www.mywikibiz.com/User:Ockham/sandbox
Has anyone here heard of a similar project? The advantages of wikis are that many people can work on them, increasing the accuracy of the transcription, and that there is an audit trail in case changes need to be reviewed or reversed.
Edward Buckner
Hi all,
I'm putting together a list of background reading for a very compressed
introductory workshop. Does anybody know of a good, gentle introduction
to XSL, CSS, and/or stylesheets generally that I could refer students to?
Obviously there are hundreds on the web, so what I'm looking for is a
battle-tested recommendation.
-dan
--
Daniel Paul O'Donnell
Associate Professor of English
University of Lethbridge
Chair and CEO, Text Encoding Initiative (http://www.tei-c.org/)
Co-Chair, Digital Initiatives Advisory Board, Medieval Academy of America
President-elect (English), Society for Digital Humanities/Société pour l'étude des médias interactifs (http://sdh-semi.org/)
Founding Director (2003-2009), Digital Medievalist Project (http://www.digitalmedievalist.org/)
Vox: +1 403 329-2377
Fax: +1 403 382-7191 (non-confidential)
Home Page: http://people.uleth.ca/~daniel.odonnell/
In loose relation to Dot's posting the other day...
[Forgive cross posting]
As part of a project for which we are seeking funding
(http://www.visionarycross.org/), we are looking into using ontologies
as the basis for building a generalisable platform for connecting
representations of Anglo-Saxon cultural objects, tropes, texts, and the
like (the specific details of this approach, which we've been developing
over the last year, are still too nascent to be reflected in the
website). The idea would be to see if there might not be a way of
building a common, discipline-wide set of minimal ontological
distinctions that museums, literary and historical scholars,
archaeologists, etc. could then use to place their particular objects of
study in the larger context of the work of everybody else who has used
the same ontology.
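To give a flavour of the sort of thing I mean - and this is a sketch only, with the namespace, class names, and example object invented for illustration rather than taken from our actual scheme - a minimal shared distinction might look like this in RDF, built here with Python's rdflib:

    # Sketch: a tiny shared "ontology" of two classes, plus one object
    # placed within it. All names are illustrative placeholders.
    from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef

    ASO = Namespace("http://example.org/anglo-saxon-ontology#")  # invented

    g = Graph()
    g.bind("aso", ASO)

    # One minimal discipline-wide distinction: cultural objects, some of
    # which bear texts.
    g.add((ASO.CulturalObject, RDF.type, RDFS.Class))
    g.add((ASO.InscribedObject, RDFS.subClassOf, ASO.CulturalObject))

    cross = URIRef("http://example.org/objects/ruthwell-cross")
    g.add((cross, RDF.type, ASO.InscribedObject))
    g.add((cross, RDFS.label, Literal("Ruthwell Cross")))

    print(g.serialize(format="turtle"))

Anyone using the same handful of classes could then query across museum records, editions, and site reports alike.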
If any other groups are working on the use of ontologies to represent
any aspect of the study of Anglo-Saxon England, I'd very much like to
hear from you. I suspect at the moment the people working on this are most
likely to be in museums, libraries, or archaeology, but I could be wrong.
I'm considering putting together a grant application that would help
fund the development of common standards and systems. Of course, with
ISAS coming up, their might be some opportunities to meet in the next
fortnight as well. Obviously at this stage, the idea is still fairly
exploratory.
-Dan O'Donnell
--
Daniel Paul O'Donnell
Associate Professor of English
University of Lethbridge
Chair and CEO, Text Encoding Initiative (http://www.tei-c.org/)
Co-Chair, Digital Initiatives Advisory Board, Medieval Academy of America
President-elect (English), Society for Digital Humanities/Société pour
l'étude des médias interactifs (http://sdh-semi.org/)
Founding Director (2003-2009), Digital Medievalist Project
(http://www.digitalmedievalist.org/)
Vox: +1 403 329-2377
Fax: +1 403 382-7191 (non-confidential)
Home Page: http://people.uleth.ca/~daniel.odonnell/
Hello everyone,
I'm looking for examples of projects in the digital humanities that
use RDF for storing metadata. Does anyone on the list have examples of
projects in digital medieval studies that are using RDF?
Many thanks,
Dot
--
*~*~*~*~*~*~*~*~*~*~*
Dot Porter (MA, MSLS) Metadata Manager
Digital Humanities Observatory (RIA), Regus House, 28-32 Upper
Pembroke Street, Dublin 2, Ireland
-- A Project of the Royal Irish Academy --
Phone: +353 1 234 2444 Fax: +353 1 234 2400
http://dho.ie Email: dot.porter(a)gmail.com
*~*~*~*~*~*~*~*~*~*~*
For those who don't know about Junicode, it is an effort to provide a
font that 1.) is free; 2.) is reasonably attractive; and 3.) contains
characters of interest to medievalists that are typically not included
in the products of commercial type foundries. There are several other
very fine fonts for medievalists (notably Andron Scriptor and Cardo),
but few offer matching bold and italic faces and numerous advanced
typographical features.
This version of Junicode includes substantial additions, bug fixes and
design improvements. All of the recent medievalist additions to Unicode
have now been added in the regular and italic faces. Almost all Medieval
Unicode Font Initiative characters are now available in the regular
face, and most of them in italic. Obsolete characters (those formerly
encoded in the Private Use Area which have been assigned Unicode
encodings) have been marked with an x to remind users to use the new
encodings; a file called "replacements" has been provided to help users
write scripts to automate the updating of their files. As an
alternative, a new OpenType feature (ss03) has been added to make the
necessary substitutions on the fly.
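By way of illustration, the updating script can be very small. The sketch below (Python) assumes - and this is only an assumption; check the file's real layout and adjust the parsing - that each line of "replacements" gives the old and the new codepoint as two hexadecimal values:

    # Sketch: remap obsolete Private Use Area codepoints to their new
    # Unicode encodings. ASSUMES "replacements" holds two hex codepoints
    # per line (old new); adjust the parsing to the file's real format.
    import sys

    table = {}
    with open("replacements", encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2:
                old, new = (int(x, 16) for x in fields)
                table[old] = chr(new)

    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        with open(path, "w", encoding="utf-8") as f:
            f.write(text.translate(table))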
This release is a big advance over the previous one, and I urge all who
use Junicode to upgrade. Get it at http://junicode.sourceforge.net/.
Best wishes to all,
Peter Baker
Hi all,
I thought I'd pass this on to our list and to Digital Classicist, since
some of us may have had experience with similar techniques (I'm
relatively sure I saw a talk on the question recently). Please make sure
you cc Dr. Wisnicki, since he may not hang out in our circles!
I’ve come across a textual issue that I’m not sure how to resolve, and
I’m hoping that someone on the list might have some suggestions or
even the answer. I’m currently doing some research on the final
African diaries of David Livingstone, the missionary and explorer.
While keeping these diaries, Livingstone was often short of paper and,
as a result, resorted to various expedients to keep the diaries going.
One of these expedients was to take printed pages from books and
newspapers, and, by turning the pages 90 degrees, to write his diary
over the printed text, but perpendicular to it. Although perhaps
legible at the time, these diary entries now are often difficult to
decipher: Livingstone’s ink has faded and the printed text obscures
what remains.
So, in other words, the diary entries have two layers of text: printed
matter which runs horizontally across the page, and Livingstone’s
entries which run vertically. I’ve scanned some of these pages and was
wondering if there’s a way (or, perhaps, a program) by which I might
remove the printed layer so as to make the handwritten layer
freestanding and so more legible. Has anyone on the list dealt with
(and resolved) a similar issue? If so, please email me at
awisnicki(a)yahoo.com -- any suggestions would be very much appreciated.
Dr. Adrian S. Wisnicki
Honorary Research Fellow
School of English and Humanities
Birkbeck College, University of London
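One possibility that occurs to me, offered only as an untested sketch rather than a known solution: since the two layers run perpendicular to one another, morphological filtering with a long, thin horizontal kernel gives a rough estimate of the printed layer, which can then be subtracted. In Python with OpenCV (the file name and kernel size are guesses that would need tuning, and the method crudely treats the print as long horizontal runs of ink):

    # Sketch: suppress horizontal printed text to favour perpendicular
    # handwriting. File name and kernel size are placeholders to tune.
    import cv2

    img = cv2.imread("diary_page.png", cv2.IMREAD_GRAYSCALE)

    # Invert so ink is bright and the page background is dark.
    inv = cv2.bitwise_not(img)

    # A long, thin horizontal kernel responds to the lines of print but
    # not to handwriting running at ninety degrees to them.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (41, 1))
    printed = cv2.morphologyEx(inv, cv2.MORPH_OPEN, kernel)

    # Subtract the estimated printed layer, leaving mostly handwriting.
    handwriting = cv2.subtract(inv, printed)
    cv2.imwrite("handwriting_only.png", cv2.bitwise_not(handwriting))

If the two inks differ in colour, separating colour channels before filtering might do better still.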
--
Daniel Paul O'Donnell
Associate Professor of English
University of Lethbridge
Chair and CEO, Text Encoding Initiative (http://www.tei-c.org/)
Co-Chair, Digital Initiatives Advisory Board, Medieval Academy of America
President-elect (English), Society for Digital Humanities/Société pour l'étude des médias interactifs (http://sdh-semi.org/)
Founding Director (2003-2009), Digital Medievalist Project (http://www.digitalmedievalist.org/)
Vox: +1 403 329-2377
Fax: +1 403 382-7191 (non-confidential)
Home Page: http://people.uleth.ca/~daniel.odonnell/