Dear all,
I have a question about which tools and techniques people are using for data mining in medieval studies, especially for authorship attribution and scribal identification. If you use them, which systems do you use? How do you verify the results? Do you trust the system to work and not simply to reproduce your own views? If you don't use these systems, why not? I'm thinking of a recent article in LLC 23 (2008) by Sculley and Pasanek, which seems to me to raise some important questions about the validity of data mining, but I'm also asking as a designer myself. If I'm producing software for scribal identification or authorship attribution, what can I do (if anything) that will convince you to use and trust my system?
Peter
--
Dr Peter Stokes
Leverhulme Early Career Fellow
Dept. of Anglo-Saxon, Norse and Celtic
The University of Cambridge
9 West Rd, Cambridge, CB3 9DP
Tel: +44 1223 767314
Fax: +44 1223 335092
On Mon, Sep 28, 2009 at 12:57, Peter Stokes pas53@cam.ac.uk wrote:
> [...] If I'm producing software for scribal identification or authorship attribution, what can I do (if anything) that will convince you to use and trust my system?
Hi Peter,
I think the only things one can do to create trust in data-mining tools are to be open about the process and to document it thoroughly. If your documentation spells out exactly how the tool makes an identification, and what its limitations and assumptions are, then I understand the software better and am likely to trust it more.

If the source code is released under an open licence, I can go and see how it works. Perhaps more importantly for this kind of software, I can fiddle with the rules it uses to ascertain a particular scribe or author, re-run it, and see the effect on a test suite of sample texts. That might make me trust it more as well. So for me, comprehensive user and technical documentation, open source code and open sample data would make it more trustworthy and usable.

Because I'm a naturally skeptical sort of person, what wouldn't make me trust it more would be high-profile journal articles or well-respected luminaries going on about how great it was. (In fact, that would probably make me more suspicious and curious as to how it worked underneath...) Not sure if that was the kind of answer you were looking for.
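To make the "fiddle with the rules and re-run it" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names (profile, attribute, run_test_suite), the choice of function-word frequencies with a nearest-profile comparison, and the toy texts are all invented, and none of it describes any existing attribution system. The point is only that the "rule" (here, the list of words counted) is an ordinary parameter a user can change and then check against labelled samples.

```python
from collections import Counter
import math


def profile(text, function_words):
    """Relative frequency of each chosen function word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in function_words]


def attribute(text, references, function_words):
    """Return the candidate whose reference profile is closest (Euclidean)."""
    target = profile(text, function_words)

    def distance(name):
        return math.dist(target, profile(references[name], function_words))

    return min(references, key=distance)


def run_test_suite(test_suite, references, function_words):
    """Fraction of labelled samples attributed to their known author."""
    hits = sum(
        attribute(text, references, function_words) == author
        for author, text in test_suite
    )
    return hits / len(test_suite)


# Invented toy data: one reference text per hypothetical author "A" and "B".
references = {
    "A": "the king and the bishop and the abbot went to the hall",
    "B": "of gold of silver of land he gave to his men of war",
}
# Labelled samples whose "true" author is known in advance.
test_suite = [
    ("A", "the monk and the scribe and the reeve"),
    ("B", "of bread of wine of ale he took"),
]

# Same data, two different rule sets: re-run and compare the scores.
print(run_test_suite(test_suite, references, ["the", "and", "of"]))
print(run_test_suite(test_suite, references, ["to"]))
```

On this toy data the first rule set attributes both samples correctly, while the second, which counts only "to", gets one of the two right. Being able to see that kind of change for myself on open sample data is what I mean.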
-James