I confess I haven't tried Anastasia yet, through lack of time and not of
interest. Glancing at the documentation, I see that Anastasia appears to
do lookups in a book file, which has (previously?) been compiled from
the TEI. ("This must be an Anastasia book compiled from the XML using
the Anastasia GroveMaker tool. This is designed to permit fast,
real-time access to any part of the underlying XML.") Anastasia's speed
is attributed to the fact that it is "interrogating not the source XML
but a set of binary files (an "AnaGrove") which are optimized to permit
the fastest access to information about every aspect of the source XML."
GroveMaker is written in C.
It would appear, then, that Anastasia would have the same synchronization
requirements as the kind of pre-transformed XSL-based solution that
James is talking about: if you change the TEI, you've got to remember to
recompile the AnaGrove. If I've understood this correctly, then it seems
to me that the correct object of comparison for Anastasia's ability to
present an arbitrary page would not be a single-pass XSL against a large
TEI file, but rather, say, a Cocoon pipeline using XSL to present
arbitrary XML page files which have been pre-extracted from the TEI file
and optimized for presentation by an XSL process along the lines of the
one I hacked. The TCL scripts that Peter R. sent would be the equivalent
of the matcher and generator in the Cocoon pipeline:
<map:match pattern="*/*.html">
  <map:generate src="path-to-texts/{1}/{2}.xml"/>
  ...
</map:match>
(where a url like "hengwrt/23.html" would be mapped to an extracted page
file "path-to-texts/hengwrt/23.xml"). The output of this generator, as
of the TCL scripts, is the marked-up text of a given page, ready to be
styled and presented to the reader.
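The rest of the pipeline would simply run each extracted page file
through a presentation stylesheet and serialize the result; something
along these lines, where the transform and serialize steps are standard
Cocoon sitemap components and the stylesheet name is just a placeholder
of my own:
<map:match pattern="*/*.html">
  <map:generate src="path-to-texts/{1}/{2}.xml"/>
  <map:transform src="stylesheets/page-to-html.xsl"/>
  <map:serialize type="html"/>
</map:match>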
Anastasia may well still win on speed, of course, if not in the
presentation then in the preprocessing. An XSL-based extraction process
will have a hard time keeping up with a custom-written C indexer. But it
will be a lot easier to maintain, I submit (with apologies if I've
misunderstood what Anastasia is doing).
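For what it's worth, here is a rough sketch of what that extraction
step might look like in XSLT 1.0 if one went the "stripped fragment"
route: grab every text node whose nearest preceding pb is the page you
want. The parameter name and the wrapper element are my own inventions,
it assumes (as in Peter R.'s sample) no namespace on the TEI elements,
and of course it throws away the surrounding markup:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- hypothetical parameter: the @n value of the page to extract -->
  <xsl:param name="page" select="'1'"/>

  <xsl:template match="/">
    <page n="{$page}">
      <!-- every text node whose nearest preceding pb is the requested
           page; the element structure around it is stripped -->
      <xsl:copy-of select="//text()[preceding::pb[1]/@n = $page]"/>
    </page>
  </xsl:template>

</xsl:stylesheet>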
Has anyone looked at teiPublisher yet, by the way?
(http://teipublisher.sourceforge.net/docs/index.php) Judging from the
texts in the "Documenting the American South" project, they seem to have
sidestepped the whole problem of presenting arbitrary pages, by only
offering the complete text in a single HTML file. I haven't looked into
it to see if there are other options, though.
Final note regarding the sample text: again, I don't understand what the
problem is with the last page, even with the appendix. How can any
process know that that last div is an appendix and not a new section of
the main text, beginning and ending on page 5? Whatever marker
GroveMaker uses to make this distinction would also be available to XSL,
presumably. (Perhaps my ignorance of TEI is showing...)
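For instance, if the appendix div carried something like a type
attribute (a guess on my part, not anything in Peter R.'s sample), the
same sort of XSL extraction could exclude it with one more predicate:
<!-- hypothetical marker on the appendix division -->
<div type="appendix">
  <p>Now here we have an appendix and some more text after that</p>
</div>

<!-- and the corresponding exclusion in the extraction step -->
<xsl:copy-of select="//text()[preceding::pb[1]/@n = $page]
                             [not(ancestor::div[@type = 'appendix'])]"/>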
Peter Binkley
-----Original Message-----
From: dm-l-bounces@uleth.ca [mailto:dm-l-bounces@uleth.ca] On
Behalf Of Peter Robinson
Sent: Friday, April 08, 2005 12:11 AM
To: dm-l@uleth.ca
Subject: RE: [dm-l] what xslt can't do..?
Peter (Binkley) has already pointed out the problems with the
XSLT approach he took.
I think even the fix he suggests, of returning a stripped
node fragment, would still lead to performance problems (for
instance, to pull a single page out of the Hengwrt Chaucer
you would need, every time, to read the WHOLE document and
create the node sets for ALL pages, just so you could
extract the node set for the one you want). There is also a
problem with the last page. Presume that we have, as we
always do, a whole bunch of other text in separate divs in
the document after the last page: in the revised document at
the end of this message, a new div containing an appendix.
Peter B's approach would include that text, wrongly, in the
node set for the last page. I'm sure this could be fixed
(get the node sets only for the pbs in the div holding them)
but it is just yet another complication. (Incidentally, the
best solution to the 'missing last end of page' problem, and
indeed to various other problems, is what Steve De Rose
describes as 'trojan milestones': see
http://www.mulberrytech.com/Extreme/Proceedings/html/2004/DeRose01/EML2004DeRose01.html).
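For those who have not read that paper: the trick, very
roughly, is to replace a single empty milestone with a
matched pair of empty elements whose attributes mark where
the virtual element starts and ends, so that the last page
gets an explicit end just like every other page. A sketch of
the general idea (the sID/eID attribute names are DeRose's;
the 'page' element and its placement here are only
illustrative):
<div>
  <head>Second text</head>
  ...
  <!-- explicit start of the last page -->
  <page sID="page5"/>
  <p n="2">here my new text on the next page etc etc</p>
  <!-- explicit end of the last page: nothing after this
       point belongs to it -->
  <page eID="page5"/>
</div>
<div>
  <p>Now here we have an appendix and some more text after
     that</p>
</div>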
I think too there is some misconception about just how the
system I use (Anastasia) would cope with this problem. Peter
B(inkley) suggests it is some kind of 'coded project',
apparently using Java, while Peter B(aker..this is
ridiculous) seems to think I would be using raw C. For the
many who do not seem to have looked yet at Anastasia: it
provides a tcl scripting environment which lets you
manipulate XML easily in some ways (including ways very
important to us) which XSLT finds difficult. Particularly --
the whole focus of this discussion -- it is straightforward
in Anastasia to manipulate a document according to
alternative hierarchies implicit in the element relations:
so you can show just one column or one page of a text
otherwise structured in hierarchical divisions.
Thus, the whole Anastasia code to pull out the first two
pages of this document looks like this:
proc begin {book me stylename} {
    global pagecounter startEl
    # start with no pages counted yet
    set pagecounter 0
    # find the first <pb/> (the one with n="1") and start output there;
    # braces keep the inner quotes of the query string from confusing Tcl
    set startEl [findSGElement {stag("pb") with attvalue("1")}]
}

proc pb {me context} {
    global pagecounter finish
    # once two full pages have been read, the next pb stops the output
    if {$pagecounter == "2"} {set finish 1}
    incr pagecounter
}
Folks, that is all there is. The 'begin' function finds the
first page by its attribute value (there are other ways one
could do it, to be sure that this is the first page) and
sets Anastasia to start reading the document right there.
The 'pb' function counts a page every time it hits a pb:
when the pagecounter reaches '2' it has read two full pages,
and setting the 'finish' variable to 1 stops the output.
One could include a few more proc functions to format the
headings, paragraphs, etc., or one could identify the last
page by attribute value, but that is all.
And it does this just as quickly for a document with 100,000
pages as for one with a single page.
I don't have a prejudice against XSLT, which seems a fine
tool for doing most of the things one wants to do with XML.
But that does not mean it can do *everything* equally easily
(which is how this whole thread started, with musings on its
limits as a typesetting language). Some things Anastasia
certainly does much more easily, and that is no surprise --
it was designed just for that. She is open source. Go look
at her on http://anastasia.sourceforge.net
Peter R(obinson)
***revised document with an appendix after the last page...
<div>
  <head>The whole text and all the texts</head>
  <div>
    <pb n="1"/>
    <head>First text</head>
    <p n="1">some text starts here and goes ita<hi rend="italic">lic an<pb n="2"/>d then</hi> we get a pagebreak</p>
    <p n="2">so the text finishes</p>
    <p n="3"> with yet another page <pb n="3"/> and another page start </p>
  </div>
  <div>
    <head>Second text</head>
    <pb n="4"/>
    <p n="1">here my new text on the next page etc etc</p>
    <pb n="5"/>
    <p n="2">here my new text on the next page etc etc</p>
  </div>
  <div>
    <p>Now here we have an appendix and some more text after that</p>
  </div>
</div>