I am the project director of the Electronic
Grosseteste, a research resource that provides access to electronic medieval
Latin texts and an integrated bibliography. The textbase is composed of
a variety of Latin texts (most of them under copyright but still
searchable). Right now the search engine is pretty primitive, and one
enhancement I would like to make is to account for orthographical variants in
the texts. Some texts were classicized, while other editors followed
either the orthography of a single manuscript or attempted to follow some
sort of convention based generally on Latin texts in later medieval England
(these are the facts, and this post is not about the joy of debating editorial
practice). Ideally, I would like to allow searches to include returns
for classical and "medieval" spellings. For example, if a user queried
"scientia" the engine would return matches for "scientia" and "sciencia".
(wildcards are permitted, btw).
Now I work in Perl5, and so my initial thought
was to create a set of hash tables that would map these variants since hashes
would allow for more than one variant per entity, and the engine would then
perform a lookup for each query element. Now I suppose coding into the
engine the "orthographical rules" is another option, but I'll be honest and
admit that computational linguistics has never been my
thing. And, the beauty of hashes in Perl
is that they are compiled very quickly, and don't eat too much
memory.
Now before I go and reinvent the wheel with these
hash tables, does anyone know of an open-source method or
resource that addresses this kind of problem (I know that Brepols--pardon
me, Brepolis...yeesh---has this all figured out but they don't play will with
others, so that's a closed door.). My limited scouring of the web has
yielded no joy, and so I seek the sage advice of this
community.
Many thanks
Jim