Here's some more detail on full-text searching using Lucene, which I promised several weeks ago. I've tried to focus on the aspects relevant to digital medievalists. Links to code are at the end. The Lucene FAQ provides technical detail on object models and APIs more accurately and succinctly than I probably could:
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi
First, here's a demo version of our Lucene-based search engine working on a collection of Classical Latin:
http://vermont.perseus.tufts.edu/hopper/searchresults.jsp?q=amo&target=l...
Lucene runs in two modes: indexing and searching. The indexing phase takes a set of "documents", such as a list of web pages or a directory of text files. This happens "off-line", every day or so. The searching phase occurs when someone types in a query. Lucene consults the index files and returns a list of documents that match the query.
Order of Search Results
By default, Lucene ranks matching documents using the following formula: the percentage of terms in the document that were search terms. This tends to favor short documents. It looks like it's possible to specify a sort order at index-time, but I haven't experimented with that feature yet.
Lemmatization
One factor that is very important for full-text searching languages like Latin, Old English, and Old Norse is word inflections. You need to be able to connect "amo" with "amabitur" and "amantes". We have approached this problem by caching a morphological analysis of every word that appears in the language corpus. The result is a mapping of inflected forms to dictionary lemmas.
There are several ways to use this linguistic data, but the most effective (I've found) is to index the documents in their original, inflected form, and expand the query. In other words, if someone searches for the string "amo", the search wrapper program first lemmatizes the query ("amo#1"), then gathers all the known inflected forms of "amo#1" and joins them into one new query. You can see the effective query at the bottom of the search results page linked to above.
The query expansion approach turns out to be very efficient. One not-terribly- fast desktop can, in one query, search for every form of three Greek verbs and return results within seconds. Homonyms are a problem, but I've found that they are fairly easy to weed out from looking at search results.
Linguistic Data for Old English?
I currently have inflected form-lemma mappings for Latin. I am in contact with some folks at UCLA who have made great progress working on morphological analysis for Old Norse. Does anyone know of other resources for that language? Finally, does anyone have a resource of this sort for Old English? It seems like the data could be extracted from the parsed corpora of OE, but I don't know if the (rather restrictive) licenses would cover that sort of use.
Proposal: A Medieval Google
Is there any interest in a general search engine for web-accessible medieval texts? If there were a list of URLs to indexable pages (provided by OAI or something even simpler), it would just be a matter of occasionally indexing them. The ability to search all surviving Old English probably exists for CDROMs and private databases, but is it available freely and publicly?
Code Samples
A wrapper for the indexing functions: http://vermont.perseus.tufts.edu/public/search/Indexer.java
A wrapper for query expansion: http://vermont.perseus.tufts.edu/public/search/SearchResults.java
A wrapper for the actual search results: http://vermont.perseus.tufts.edu/public/search/SearchResults.java
And JSP code to create a search results page: http://vermont.perseus.tufts.edu/public/search/searchresults.html