I confess I haven't tried Anastasia yet, through lack of time and not of
interest. Glancing at the documentation, I see that Anastasia appears to
do lookups in a book file, which has (previously?) been compiled from
the TEI. ("This must be an Anastasia book compiled from the XML using
the Anastasia GroveMaker tool. This is designed to permit fast,
real-time access to any part of the underlying XML.") Anastasia's speed
is attributed to the fact that it is "interrogating not the source XML
but a set of binary files (an "AnaGrove") which are optimized to permit
the fastest access to information about every aspect of the source XML."
GroveMaker is written in C.
It would appear, then, that Anastasia would have the same synchronization
requirements as the kind of pre-transformed XSL-based solution that
James is talking about: if you change the TEI, you've got to remember to
recompile the AnaGrove. If I've understood this correctly, then it seems
to me that the correct object of comparison for Anastasia's ability to
present an arbitrary page would not be a single-pass XSL against a large
TEI file, but rather, say, a Cocoon pipeline using XSL to present
arbitrary XML page files which have been pre-extracted from the TEI file
and optimized for presentation by an XSL process along the lines of the
one I hacked. The TCL scripts that Peter R. sent would be the equivalent
of the matcher and generator in the Cocoon pipeline:
<map:match pattern="*/*.html">
  <map:generate src="path-to-texts/{1}/{2}.xml"/>
  ...
</map:match>
(where a url like "hengwrt/23.html" would be mapped to an extracted page
file "path-to-texts/hengwrt/23.xml"). The output of this generator, as
of the TCL scripts, is the marked-up text of a given page, ready to be
styled and presented to the reader.
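The rest of the pipeline would simply run each extracted page file
through a presentation stylesheet and serialize the result; something
along these lines, where the transform and serialize steps are standard
Cocoon sitemap components and the stylesheet name is just a placeholder
of my own:
<map:match pattern="*/*.html">
  <map:generate src="path-to-texts/{1}/{2}.xml"/>
  <map:transform src="stylesheets/page-to-html.xsl"/>
  <map:serialize type="html"/>
</map:match>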
Anastasia may well still win on speed, of course, if not in the
presentation then in the preprocessing. An XSL-based extraction process
will have a hard time keeping up with a custom-written C indexer. But it
will be a lot easier to maintain, I submit (with apologies if I've
misunderstood what Anastasia is doing).
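For what it's worth, here is a rough sketch of what that extraction
step might look like in XSLT 1.0 if one went the "stripped fragment"
route: grab every text node whose nearest preceding pb is the page you
want. The parameter name and the wrapper element are my own inventions,
it assumes (as in Peter R.'s sample) no namespace on the TEI elements,
and of course it throws away the surrounding markup:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- hypothetical parameter: the @n value of the page to extract -->
  <xsl:param name="page" select="'1'"/>

  <xsl:template match="/">
    <page n="{$page}">
      <!-- every text node whose nearest preceding pb is the requested
           page; the element structure around it is stripped -->
      <xsl:copy-of select="//text()[preceding::pb[1]/@n = $page]"/>
    </page>
  </xsl:template>

</xsl:stylesheet>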
Has anyone looked at teiPublisher yet, by the way?
(http://teipublisher.sourceforge.net/docs/index.php) Judging from the
texts in the "Documenting the American South" project, they seem to have
sidestepped the whole problem of presenting arbitrary pages, by only
offering the complete text in a single HTML file. I haven't looked into
it to see if there are other options, though.
Final note regarding the sample text: again, I don't understand what the
problem is with the last page, even with the appendix. How can any
process know that that last div is an appendix and not a new section of
the main text, beginning and ending on page 5? Whatever marker
GroveMaker uses to make this distinction would also be available to XSL,
presumably. (Perhaps my ignorance of TEI is showing...)
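For instance, if the appendix div carried something like a type
attribute (a guess on my part, not anything in Peter R.'s sample), the
same sort of XSL extraction could exclude it with one more predicate:
<!-- hypothetical marker on the appendix division -->
<div type="appendix">
  <p>Now here we have an appendix and some more text after that</p>
</div>

<!-- and the corresponding exclusion in the extraction step -->
<xsl:copy-of select="//text()[preceding::pb[1]/@n = $page]
                             [not(ancestor::div[@type = 'appendix'])]"/>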
Peter Binkley
-----Original Message-----
From: dm-l-bounces@uleth.ca [mailto:dm-l-bounces@uleth.ca] On
Behalf Of Peter Robinson
Sent: Friday, April 08, 2005 12:11 AM
To: dm-l@uleth.ca
Subject: RE: [dm-l] what xslt can't do..?
Peter (Binkley) has already pointed out the problems with the
XSLT approach he took.
I think even the fix he suggests, of returning a stripped
node fragment, would still lead to performance problems (for
instance, to pull a single page out of the Hengwrt Chaucer
you would need, every time, to read the WHOLE document and
create the node sets for ALL pages, just so you could
extract the node set for the one you want). There is also a
problem with the last page. Presume that we have, as we
always do, a whole bunch of other text in separate divs in
the document after the last page: in the revised document at
the end of this message, a new div containing an appendix.
Peter B's approach would include that text, wrongly, in the
node set for the last page. I'm sure this could be fixed
(get the node sets only for the pbs in the div holding them)
but it is just yet another complication. (Incidentally, the
best solution to the 'missing last end of page' problem, and
indeed to various other problems, is what Steve De Rose
describes as 'trojan milestones': see
http://www.mulberrytech.com/Extreme/Proceedings/html/2004/DeRose01/EML2004DeRose01.html).
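For those who have not read that paper: the trick, very
roughly, is to replace a single empty milestone with a
matched pair of empty elements whose attributes mark where
the virtual element starts and ends, so that the last page
gets an explicit end just like every other page. A sketch of
the general idea (the sID/eID attribute names are DeRose's;
the 'page' element and its placement here are only
illustrative):
<div>
  <head>Second text</head>
  ...
  <!-- explicit start of the last page -->
  <page sID="page5"/>
  <p n="2">here my new text on the next page etc etc</p>
  <!-- explicit end of the last page: nothing after this
       point belongs to it -->
  <page eID="page5"/>
</div>
<div>
  <p>Now here we have an appendix and some more text after
     that</p>
</div>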
I think too there is some misconception about just how the
system I use (Anastasia) would cope with this problem. Peter
B(inkley) suggests it is some kind of 'coded project',
apparently using Java, while Peter B(aker..this is
ridiculous) seems to think I would be using raw C. For the
many who do not seem to have looked yet at Anastasia: it
provides a tcl scripting environment which lets you
manipulate XML easily in some ways (including ways very
important to us) which XSLT finds difficult. Particularly --
the whole focus of this discussion -- it is straightforward
in Anastasia to manipulate a document according to
alternative hierarchies implicit in the element relations:
so you can show just one column or one page of a text
otherwise structured in hierarchical divisions.
Thus, the whole Anastasia code to pull out the first two
pages of this document looks like this:
proc begin {book me stylename} {
    global pagecounter startEl
    # start with no pages counted yet
    set pagecounter 0
    # find the first <pb/> (the one with n="1") and start output there;
    # braces keep the inner quotes of the query string from confusing Tcl
    set startEl [findSGElement {stag("pb") with attvalue("1")}]
}

proc pb {me context} {
    global pagecounter finish
    # once two full pages have been read, the next pb stops the output
    if {$pagecounter == "2"} {set finish 1}
    incr pagecounter
}
Folks, that is all there is. The 'begin' function finds the
first page by its attribute value (there are other ways one
could do it, to be sure that this is the first page) and
sets Anastasia to start reading the document right there.
The 'pb' function counts a page every time it hits a pb:
when the pagecounter reaches '2' it has read two full pages,
and setting the 'finish' variable to 1 stops the output.
One could include a few more proc functions to format the
headings, paragraphs, etc., or one could identify the last
page by attribute value, but that is all.
And it does this just as quickly for a document with 100,000
pages as for one with a single page.
I don't have a prejudice against XSLT, which seems a fine
tool for doing most of the things one wants to do with XML.
But that does not mean it can do *everything* equally easily
(which is how this whole thread started, with musings on its
limits as a typesetting language). Some things Anastasia
certainly does much more easily, and that is no surprise --
it was designed just for that. She is open source. Go look
at her on http://anastasia.sourceforge.net
Peter R(obinson)
***revised document with an appendix after the last page...
<div>
  <head>The whole text and all the texts</head>
  <div>
    <pb n="1"/>
    <head>First text</head>
    <p n="1">some text starts here and goes ita<hi rend="italic">lic an<pb n="2"/>d then</hi> we get a pagebreak</p>
    <p n="2">so the text finishes</p>
    <p n="3"> with yet another page <pb n="3"/> and another page start </p>
  </div>
  <div>
    <head>Second text</head>
    <pb n="4"/>
    <p n="1">here my new text on the next page etc etc</p>
    <pb n="5"/>
    <p n="2">here my new text on the next page etc etc</p>
  </div>
  <div>
    <p>Now here we have an appendix and some more text after that</p>
  </div>
</div>