[apologies for the slightly redundant subject line. It's not even very funny.]
I'm looking for a search engine to handle what I guess is termed "fuzzy searching" across a corpus of Latin legal texts.
Essentially, what we will have are TEI tagged transcriptions, but we will not have detailed parts of speech encoding (and I don't believe it's realistic to add such encoding), so the search could not rely on tags. Variant spellings are a huge issue, so we would like a search that is "smart" in the sense that it will have some kind of algorithmic approach to finding potential variant spellings (as opposed to relying on a list of known variant spellings). We do not want to rely on any kind of Boolean searching (commas, curly brackets, etc.). We want a search where the user will discover the variants *after the fact* (once the search is done), rather than having to make a determination about what those variants might be ahead of time. Finally, the search will need to work both within single transcriptions, and across multiple transcriptions (potentially across the entire corpus).
Is anyone on-list familiar with any existing search engines or frameworks that suit our needs, or that might be modified to suit them?
Thanks, Dot
In haste...
Lucene will do fuzzy matching based on Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance ), which might work for you. You'd have to figure out whether the defaults work for you or whether you'd have to figure out your own.
You might look for any work on phonetic algorithms (http://en.wikipedia.org/wiki/Phonetic_algorithm ) for Latin. I'm not aware of any.
H
On Sep 1, 2008, at 8:01 AM, Dot Porter wrote:
[apologies for the slightly redundant subject line. It's not even very funny.]
I'm looking for a search engine to handle what I guess is termed "fuzzy searching" across a corpus of Latin legal texts.
Essentially, what we will have are TEI tagged transcriptions, but we will not have detailed parts of speech encoding (and I don't believe it's realistic to add such encoding), so the search could not rely on tags. Variant spellings are a huge issue, so we would like a search that is "smart" in the sense that it will have some kind of algorithmic approach to finding potential variant spellings (as opposed to relying on a list of known variant spellings). We do not want to rely on any kind of Boolean searching (commas, curly brackets, etc.). We want a search where the user will discover the variants *after the fact* (once the search is done), rather than having to make a determination about what those variants might be ahead of time. Finally, the search will need to work both within single transcriptions, and across multiple transcriptions (potentially across the entire corpus).
Is anyone on-list familiar with any existing search engines or frameworks that suit our needs, or that might be modified to suit them?
Thanks, Dot
Digital Medievalist -- http://www.digitalmedievalist.org/ Journal: http://www.digitalmedievalist.org/journal/ Journal Editors: editors _AT_ digitalmedievalist.org News: http://www.digitalmedievalist.org/news/ Wiki: http://www.digitalmedievalist.org/wiki/ Discussion list: dm-l@uleth.ca Change list options: http://listserv.uleth.ca/mailman/listinfo/dm-l