Hello,
I am thinking about setting up a wiki for the purpose of transcribing medieval manuscripts. One such experiment is here:
http://www.mywikibiz.com/User:Ockham/sandbox
Has anyone here heard of a similar project? The advantage of wikis is that many people can work on them, increasing the accuracy of the transcription, and there is an audit trail in case changes need to be reviewed or reversed.
Edward Buckner
I've not heard of others doing this, but I use wikis for lots of scratch collaboration projects. So in principle, this is a reasonable way to work informally among a small group on a first pass.
My main caveat about this is that this is probably only suitable for a first pass at character capture rather than any serious editorial work, and even then it might prove problematic. The main problem is that there is no way of automatically enforcing a policy on what is to be captured or how these details are to be recorded: if you have multiple people working on a project, and you have no control over what they enter, they will invariably diverge in practice: one scholar might expand abbreviations, another prefer to leave them unexpanded; one might record multiple options for reading difficult passages, another just choose the most likely, and yet others use different symbols for indicating uncertainty. The same is true of damage or codicological features: one scholar might want to record information about the damage at around line 38 and following; another might just transcribe around the cut and damage.
A way around this might be to insist that encoders use something like the Leiden system of diplomatic transcription symbols, although you won't have any way of enforcing correct use.
XML and XML tools like the OxygenXML editor are designed precisely to give you this kind of control. There was a time when they could seem quite intimidating. Nowadays, however, they are becoming ever more user friendly. So one slightly more formal way of setting up a project so that more than one person could transcribe texts might be to ask everybody to get a copy of something like OxygenXML (although in principle it wouldn't matter what editor they used), and then to store their common transcriptions online in a version-control repository like Subversion. People would work on transcriptions on their home computers and then check the files back into the common repository when they were finished. Subversion logs changes and lets more than one person work on the same file at the same time. And you can show the world what you are doing by also publishing the repository (this is how Digital Medievalist works, in fact: we have a Subversion repository with all the XML files. This repository is copied to a non-public website (so we can check our work) once every minute or two and then to the public site once a day. Individual editors download files to edit from the central repository).
The above assumes that your plan is to have a group of previously identified editors work on the same project (negotiated collaboration). If your goal instead is crowdsourcing (i.e. just putting MS images and/or transcriptions up and letting anybody transcribe or edit them), then you'll need some other solution. But planning for that depends on what you see as the goal of the project: if it is really just to get the text out of the manuscript photos and into Unicode characters, then a wiki might work, but I'd say ask people only to record the letters in front of them, perhaps making a policy of expanding abbreviations, and not to record any paleographic features, uncertainty, etc. Other solutions might depend on what you want to crowdsource. For example, if you have an XML-encoded document that you are displaying on the web, it would be possible to code the page so that people could click on a word they want to correct, and then get the content of the word (but not the tags) displayed to them in a little form window for correction. And then you could go up to much more heavily programmed solutions.
Bottom line: a wiki is a good informal way of sharing work (with loss of policy control) or for crowdsourcing very simple questions (are these the right letters? what letters does this image contain?). But for really encoding expert knowledge or doing anything complicated with the text at all, you are going to want to use XML. There are very robust ways of doing that in a distributed fashion.
"Daniel Paul O'Donnell" daniel.odonnell@gmail.com
My main caveat about this is that this is probably only suitable for a first pass at character capture rather than any serious editorial work, and even then it might prove problematic.
Surely not. These texts were designed to be read. Most of the stuff is pretty obvious; the biggest problem is the use of shorthand terms, which requires some knowledge of the technical terms. This is where 'knowledge of the crowd' can be useful.
The main problem is that there is no way of automatically enforcing a policy on what is to be captured
or how these details are to be recorded: if you have multiple people working on a project, and you have no control over what they enter, they will invariably diverge in practice: one scholar might expand abbreviations, another prefer to leave them unexpanded; one might record multiple options for reading difficult passages, another just choose the most likely, and yet others use different symbols for indicating uncertainty.
Why? This is an issue about conventions.
The same is true of damage or codicological features: one scholar might want to record information about the damage at around line 38 and following; another might just transcribe around the cut and damage.
Conventions again. Most of the texts I look at don't have any damage.
XML and XML tools like the OxygenXML editor are designed precisely to
give you this kind of control. There was a time when they could seem quite intimidating.
I find software pretty intimidating. How does this work?
Nowadays, however, they are becoming ever more user
friendly. So one slightly more formal way of setting up a project so that more than one person could transcribe texts might be to ask everybody to get a copy of something like OxygenXML (although in principle it wouldn't matter what editor they used), and then to store their common transcriptions online in a version-control repository like Subversion. People would work on transcriptions on their home computers and then check the files back into the common repository when they were finished. Subversion logs changes and lets more than one person work on the same file at the same time. And you can show the world what you are doing by also publishing the repository (this is how Digital Medievalist works, in fact: we have a Subversion repository with all the XML files. This repository is copied to a non-public website (so we can check our work) once every minute or two and then to the public site once a day. Individual editors download files to edit from the central repository).
OK
The above assumes that your plan is to have a group of previously
identified editors work on the same project (negotiated collaboration).
Yes.
If your goal instead is crowdsourcing (i.e. just putting MS images and/or transcriptions up and letting anybody transcribe or edit them), then you'll need some other solution.
Also interesting. My experience is that those who can read these manuscripts are limited to less than a crowd.
Bottom line: a wiki is a good informal way of sharing work (with loss of
policy control) or for crowdsourcing very simple questions (are these the right letters? what letters does this image contain?). But for really encoding expert knowledge or doing anything complicated with the text at all, you are going to want to use XML. There are very robust ways of doing that in a distributed fashion.
I suspect that crowdsourcing may work for stuff like translation and particularly for image recognition. As mentioned above, reading the manuscripts is very easy with a limited amount of practice.
Hi Edward (and Dan, and all). Just a few thoughts on what Dan's already written...
On Sun, Jul 26, 2009 at 5:56 PM, Buckner <d3uckner@btinternet.com> wrote:
"Daniel Paul O'Donnell" daniel.odonnell@gmail.com The main problem is that there is no way of automatically enforcing a policy on what is to be captured
or how these details are to be recorded: if you have multiple people working on a project, and you have no control over what they enter, they will invariably diverge in practice: one scholar might expand abbreviations, another prefer to leave them unexpanded; one might record multiple options for reading difficult passages, another just choose the most likely, and yet others use different symbols for indicating uncertainty.
Why? This is an issue about conventions.
Setting up a wiki for public transcription would be all about conventions, I think. You would need to make a clear set of conventions available to the transcribers so they know exactly how to note ... anything that you might want them to note (abbreviations, expansions if you want them to expand them, unclear letters, suggestions for editorial corrections if you want them included as well). But if you could supply clear guidelines I think a wiki would be a fine place for sourcing transcriptions, and much simpler in the first instance than using XML.
XML and XML tools like the OxygenXML editor are designed precisely to
give you this kind of control. There was a time when they could seem quite intimidating.
I find software pretty intimidating. How does this work?
I think what Dan is talking about is the ability of XML to be defined by a schema, basically a set of rules that limits what tags you can use, and where you can put them. So you can have tags <abbreviation>, <unclear>, whatever, and just place those around the text you want to set apart from the surrounding text:
<abbreviation>abbreviated text</abbreviation> <unclear>unclear text</unclear>
They function the same way as things like brackets and special formatting to indicate specific editorial elements, but are rather more clear (and can be processed in any number of ways - abbreviations could be rendered in italics, or in another colour, for example).
What this means is that if someone is creating a transcription and marking bits of it in XML (abbreviations, expansions, unclear letters, whatever) there is some guarantee that everyone following that schema will mark the same thing in the same way. There is still the possibility that those tags will be misused (<abbreviation> to mark something other than abbreviated text), but perhaps less likely to be misused than formatting such as special brackets, italics, etc.
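To make that a bit more concrete, here is a rough sketch of the kind of rule a schema states (a tiny hand-written DTD, purely for illustration - a real project would more likely use a ready-made TEI schema, and the element names here are just the made-up ones from above):

    <!-- purely illustrative: a transcription may contain plain text plus <abbreviation> and <unclear>, and nothing else -->
    <!ELEMENT transcription (#PCDATA | abbreviation | unclear)*>
    <!ELEMENT abbreviation (#PCDATA)>
    <!ELEMENT unclear (#PCDATA)>

An editor like Oxygen checks the file against rules like these as you type, so stray or misplaced tags get flagged immediately.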
The most widely-used schema for humanities text encoding is the Text Encoding Initiative (TEI): http://www.tei-c.org/
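For example (just a sketch using standard TEI elements - the exact elements and attributes would be up to the project's own guidelines, and the readings here are invented), an expanded abbreviation, a doubtful word and a damaged spot might be encoded like this:

    <!-- sketch only: invented readings, standard TEI element names -->
    <choice><abbr>ar.</abbr><expan>aristoteles</expan></choice> dicit quod
    <unclear reason="faded">omnis</unclear> homo
    <gap reason="damage" quantity="1" unit="word"/>

Because the markup is explicit and uniform, it can later be displayed, searched or converted however you like.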
Bottom line: a wiki is a good informal way of sharing work (with loss of
policy control) or for crowdsourcing very simple questions (are these the right letters? what letters does this image contain?). But for really encoding expert knowledge or doing anything complicated with the text at all, you are going to want to use XML. There are very robust ways of doing that in a distributed fashion.
I suspect that crowdsourcing may work for stuff like translation and particularly for image recognition. As mentioned above, reading the manuscripts is very easy with a limited amount of practice.
I snipped a bit of what Dan said about using a system of distributed XML. I think that is a fine idea if you have some IT support and transcribers who are ready and willing to learn how to tag using XML (I've worked on a project like this, so I know it can be done - it also helps if you have funding to pay all these people). But I think that Dan is writing off wikis a bit too easily. For a project with limited or no funding, and one that does not want to take on the learning curve of XML/TEI, the Oxygen software, setting up files, etc. (not a small order), a wiki would be a really good way to get started. Assuming you could establish strong conventions that are tied to accepted editorial (non-digital) conventions (use italics for one thing, square brackets for another, three dots for unclear text, do or do not expand abbreviations, etc.), and assuming everyone followed them well, you would be able to convert that text to XML later, which would give you the ability to add more expert knowledge if you wanted to (including notes, or additional information that would be difficult to put in a wiki system). I'm not necessarily talking about translating the code that the wiki creates behind the scenes, because that might be more difficult than taking the surface text and translating that (I've converted "traditional" transcriptions but not wiki code, so I can't really judge).
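To give a purely invented illustration of that later conversion: a wiki transcription that uses round brackets for expanded abbreviations and square brackets for unclear text, say

    d(omi)nus dixit [quod] omnis homo

could be turned more or less mechanically into TEI along the lines of

    <!-- invented example; <expan>, <ex> and <unclear> are standard TEI elements -->
    <expan>d<ex>omi</ex>nus</expan> dixit <unclear>quod</unclear> omnis homo

provided everyone has applied the brackets consistently.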
Another thought - if you want to apply for funding at some later point, a wiki would give you something to point to as evidence that you already have community involvement, and any transcriptions completed could be the basis for an XML-based digital edition or project. These are things that funding agencies like to see.
And, really, if all you want is an interactive space for looking at images and viewing transcriptions, the only problem I see with having that *only* in a wiki is that of long-term preservation. If you export to XML (even if you don't actually use it), the transcriptions are likely to last longer, and they also become available to be taken and used by other scholars (at least more easily than they would be from a wiki alone). But if you just want a simple collaborative transcription space, you want to make things simple for your transcribers, you aren't very concerned that those transcriptions be around in 20 years, and you don't care that those transcriptions are available to be incorporated into other projects, I think a wiki would be fine. It would certainly be easier for your average manuscript transcriber to learn than XML (although, from experience, I can say that with a good tutorial and a lot of time and support, medievalists really can learn TEI! But having been that support, I can also say it's a lot of work for all involved).
Hope this helps.
Dot
Hi All,
I agree with Dot that any collaborative transcription (which is slightly different from collaborative editing) is all about adhering to mutually agreed conventions.
A good model for this is Distributed Proofreaders, which crowdsources correction of dirty OCR for Project Gutenberg. If I'm not mistaken, each volume has a responsible person who checks things and maintains any specialised guidelines, and anyone is allowed to go and proofread a few pages, depending on the perceived level of difficulty of the volume. As you proofread more, you graduate to being allowed to do volumes with a higher level of perceived difficulty, and presumably eventually to oversee volumes.
I feel that kind of graduated responsibility in crowdsourcing techniques is probably a good idea. But it all comes down to the guidelines for the average user being clear and easy to understand, or them being able to flag potential problems easily.
With collaborative editing, there are lots of models with slightly different methodologies and technologies. I, of course, agree with Dot and Dan that XML as a preservation format is pretty much a requirement these days, and with Dot that medievalists can learn (and in my experience quite frequently are learning) TEI XML quite quickly. After all, as medievalists we learn skills like palaeography and languages like medieval Latin, never mind all the other things, so by comparison learning a human-readable markup format is really a lot easier than most of the things we do.
-James
Thanks for these replies. From work on other wikis, in particular Wikipedia, I think:
1. Crowdsourcing is very poor at anything involving summarisation, synthesis and so on. Hence Wikipedia is good at biographies (which have a set format, and usually follow the progress of someone's life in the obvious order), but very poor at high-level subjects like 'History', 'Philosophy', 'Roman Empire' and that sort of thing, where 95% of the work is sourcing the relevant and important facts and so on.
2. There is no problem with conventions - co-editors are generally quick to absorb relevant policy, house style and so on (overmuch, in my view).
3. For these reasons, wikis are well suited to translation work (which makes absolutely no demands on organisation or synthesis).
4. For similar reasons, transcription would also be well suited to wiki work.
5. What originally drew me to the idea was finding an important medieval work (a critical edition from the 1960s) in a London library where the basement had clearly flooded at some time. The volumes were out of order, leaves were missing, and one volume was even lost. Many important works are not critical editions at all but simply transcriptions made by dedicated enthusiasts. These are published in obscure journals like CIMAGL, in Courier font, generally not checked by others (in my view - it is easy to locate mistakes), and generally not accessible to the outside world.
6. Thus, publication on a wiki would ensure much better access to important works, and would also give others the opportunity to check them.
7. Some here have commented on the use of character recognition, which I find bizarre. I studied optical pattern recognition in the 1980s and it was accepted then, and it is still true I think, that machines cannot understand human speech or writing unless they also grasp the semantics. I can work through a text without concentrating on the meaning and get probably a 90% success rate. Then I go through again, this time translating as I go along, and get a 98% success rate. Finally I go one level higher (it is philosophy I usually translate) and try to understand not just what the writer is literally saying in their language, but what they actually mean - the argument they are making. This gets me to 99%, but I am still learning. It is very difficult to transcribe medieval texts without a deep understanding of the *kind* of thing the writer is trying to say. That is because the writer was communicating with his or her (usually his) audience knowing the assumptions they would make and what would not need to be spelled out.
8. To give an example, some years ago I hired a Cambridge PhD to help me brush up my Latin. We worked through some medieval texts and we got stuck at 'Minor patet'. He thought this meant 'it is less clear'. In fact, as I soon found out, 'Minor' in this context means 'the minor proposition' (of a syllogism).
9. I did try out my OCR on a manuscript, but it was completely hopeless. Only humans will ever be able to read these things.
10. Thanks for the tips about XML. I do work with XML, and indeed I have made many experiments with trying to present images of manuscripts together with the Latin transcript and then an English translation. Another reason for presenting the material like this is that we should no longer be hostage to the person making a transcription, who is often interpreting the Latin in a way that suits their own reading of grammar and meaning. It was not until I started reading manuscripts that I realised how much of the printed material we read is simply a typographer's invention. For example, medieval texts do not generally use the honorific capital: they write 'aristotle' and even 'god', rather than 'Aristotle' or 'God'. Actually they don't even write the full word; there are standard abbreviations for all the commonly used words, such as Aristotle, Priscianus and so on. The only way to present this material is to give the original, a transcript in the original language, and a translation into a modern language.
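For what it is worth, TEI can carry all three layers in a single file. A very rough sketch (the image file name and the division types are my own invention, just to show the shape):

    <facsimile>
      <surface xml:id="f12r">
        <graphic url="ms-f12r.jpg"/> <!-- invented file name -->
      </surface>
    </facsimile>
    <text>
      <body>
        <div type="transcription" xml:lang="la">
          <p facs="#f12r">sicut dicit <choice><abbr>ar.</abbr><expan>aristoteles</expan></choice> ...</p>
        </div>
        <div type="translation" xml:lang="en">
          <p>as aristotle says ...</p>
        </div>
      </body>
    </text>

The transcription keeps the manuscript's lowercase 'aristoteles' and its abbreviation, while the reader can still be shown the expansion, the image and the translation side by side.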
Edward
On Tue, Jul 28, 2009 at 20:34, Buckner <d3uckner@btinternet.com> wrote:
- Some here have commented on the use of character recognition, which I
find bizarre. I studied optical pattern recognition in the 1980s and it was accepted then, and it is still true I think, that machines cannot understand human speech or writing unless they also grasp the semantics.
<snip/>
- I did try out my OCR on a manuscript, but it was completely hopeless.
Only humans will ever be able to read these things.
<snip/>
I believe you may be thinking about this the wrong way. Yes, straightforward boundary-mapping OCR is almost always going to be doomed to failure on the wide and varied nature of human handwriting. But what does work is machine-assisted transcription, where technology similar to that in OCR looks through the images and finds fragments (words, parts of words, ligatures, etc.) that it thinks broadly similar, and displays them (with a bit of context) for a human either to agree or disagree that they are the right reading, or to provide a reading for all the ones the human ticks off a list as being the same letters. Rather than machine transcription (which is what OCR is), this is machine-assisted transcription, and it is much more plausible because the limitation in OCR isn't the pattern matching of 'this looks like this' but the disconnect between the matched graphical component and the idealised character transcription. If a computer can provide a list of 100 fragments of 'th' that look similar (some maybe from 'the' and some from 'with') for a human to confirm, then that is a big step forward in transcription. (Especially if it then occasionally mixes back in ones you've already approved but which look slightly different from most, just to double-check.)
I remember seeing a project that was doing just this for a particular manuscript several years ago at the DRH (now DRHA) conference I think.
-James
Regarding crowdsourcing and XML, I once put a dozen students to work on transcriptions of some very formulaic accounting documents. It was enough work to get them through the medieval paleography and language that I didn't want them spending a lot of time mastering XML or new programs. So, I created a simplified schema that they could easily memorize and let them insert tags in their generic word processor.
The tags (e.g. <id>...</id> for the proper name of a person, <amt>...</amt> for an amount, etc.) are easily converted into TEI-compliant XML with a simple search-and-replace macro. I wanted two sets of eyes on the transcriptions, so I had each text transcribed by two different students, but only one copy included tags. I created a couple of tag validators so they could easily check their markup. Then I designed an editorial review utility that allows me to highlight differences between the dual transcriptions and incorporate any of my modifications (to transcriptions or tags) into the final transcription.
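To illustrate with an invented entry (and with a TEI mapping that is just one plausible choice, not necessarily the exact elements we settled on), a student line such as

    Item paid to <id>John Carpenter</id> for nails <amt>ij s. vj d.</amt>

comes out of the search-and-replace pass as something like

    <!-- one plausible TEI rendering of the invented entry above -->
    Item paid to <persName>John Carpenter</persName> for nails <measure type="currency">ij s. vj d.</measure>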
The grad students who were in charge of supervising teams of undergrads worked pretty hard at removing ambiguity in our editorial conventions through lots of communication. I've still got plenty of master editing to do, but we cranked through an awful lot of pages this way, and the students loved the experience.
Jesse jesse_hurlbut@byu.edu