Hi Jim!

I don't think the computational linguistics has to get too involved when you're dealing with scholastic Latin. If -nt[vowel]- gets converted to -nc[vowel]- and -ae- gets converted to -e-, for example, then you're well on the way. You lose some fine granularity in your search, but that's going to happen anyway when you're dealing with heterogeneous materials like what you describe. I worked out a few rules like this for a processing routine in Pascal years ago for work on 14th century Latin preaching (Bromyard), and I'd be happy to dig out the code and send the relevant bits. If you do it with hashes and you don't have rules like this, won't you have to create the association between scientia and sciencia (and all other pairs) explicitly?

Peter

From: dm-l-bounces@uleth.ca [mailto:dm-l-bounces@uleth.ca] On Behalf Of James R. Ginther
Sent: Thursday, June 16, 2005 12:43 PM
To: dm-l@uleth.ca
Subject: [dm-l] Searching Latin texts and orthographical variants

I am the project director of the Electronic Grosseteste, a research resource that provides access to electronic medieval Latin texts and an integrated bibliography. The textbase is composed of a variety of Latin texts (most of them under copyright but still searchable). Right now the search engine is pretty primitive, and one enhancement I would like to make is to account for orthographical variants in the texts. Some texts were classicized, while other editors followed either the orthography of a single manuscript or attempted to follow some sort of convention based generally on Latin texts in later medieval England (these are the facts, and this post is not about the joy of debating editorial practice). Ideally, I would like to allow searches to include returns for classical and "medieval" spellings. For example, if a user queried "scientia" the engine would return matches for "scientia" and "sciencia". (wildcards are permitted, btw).

Now I work in Perl5, and so my initial thought was to create a set of hash tables that would map these variants since hashes would allow for more than one variant per entity, and the engine would then perform a lookup for each query element. Now I suppose coding into the engine the "orthographical rules" is another option, but I'll be honest and admit that computational linguistics has never been my thing. And, the beauty of hashes in Perl is that they are compiled very quickly, and don't eat too much memory.

Now before I go and reinvent the wheel with these hash tables, does anyone know of an open-source method or resource that addresses this kind of problem (I know that Brepols--pardon me, Brepolis...yeesh---has this all figured out but they don't play will with others, so that's a closed door.). My limited scouring of the web has yielded no joy, and so I seek the sage advice of this community.

Many thanks

Jim

--------------------
Dr James R. Ginther, PhD
Assoc. Professor of Medieval Theology
& Director of Graduate Studies
Dept of Theological Studies
St Louis University
ginthej@slu.edu
---------------------------------
dept: http://theology.slu.edu/
research: http://www.grosseteste.com/