Peter (Binkley) has already pointed out the problems with the XSLT approach he took. I think even the fix he suggests, of returning a stripped node fragment, would still lead to performance problems: to pull a single page out of, say, the Hengwrt Chaucer, you would have to read the WHOLE document to create the node sets for ALL pages, just so you could extract the node set for the one you want, every single time. There is also a problem with the last page. Presume that we have, as we always do, a whole bunch of other text in separate divs in the document after the last page, as in the revised document at the end of this message, which adds a new div containing an appendix. Peter B's approach would include that text, wrongly, in the node set for the last page. I'm sure this could be fixed (get the node sets only for the pbs in the div holding them) but it is just yet another complication. (Incidentally, the best solution to the 'missing last end of page' problem, and indeed to various other problems, is what Steve De Rose describes as 'trojan milestones': see http://www.mulberrytech.com/Extreme/Proceedings/html/2004/DeRose01/EML2004De...).
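To make the cost concrete, here is a minimal XSLT sketch of the kind of selection involved (my own illustration, not Peter B's actual stylesheet; it simply pulls the stripped text of page 2 of the revised sample document given at the end of this message):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- copy every text node that falls after <pb n="2"/> and before <pb n="3"/> -->
      <xsl:template match="/">
        <xsl:copy-of select="//text()[preceding::pb[@n='2'] and not(preceding::pb[@n='3'])]"/>
      </xsl:template>
    </xsl:stylesheet>

It works, but every request still parses the whole document and evaluates preceding:: across the entire tree, which is exactly what hurts with something the size of the Hengwrt manuscript.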
I think too there is some misconception about just how the system I use (Anastasia) would cope with this problem. Peter B(inkley) suggests it is some kind of 'coded project', apparently using Java, while Peter B(aker..this is ridiculous) seems to think I would be using raw C. For the many who do not seem to have looked yet at Anastasia: it provides a Tcl scripting environment which lets you manipulate XML easily in some ways (including ways very important to us) which XSLT finds difficult. Particularly -- the whole focus of this discussion -- it is straightforward in Anastasia to manipulate a document according to the alternative hierarchies implicit in the element relations: so you can show just one column or one page of a text otherwise structured in hierarchical divisions.
Thus, the whole Anastasia code to pull out the first two pages of this document looks like this:
proc begin {book me stylename} {
    global pagecounter startEl
    set pagecounter 0
    # find the first pb element by its attribute value and start reading there
    set startEl [findSGElement {stag("pb") with attvalue("1")}]
}
proc pb {me context} {
    global pagecounter finish
    if {$pagecounter == "2"} {set finish 1}
    incr pagecounter
}
Folks, that is all there is. The 'begin' function finds the first page by its attribute value (there are other ways one could do it, to be sure that this is the first page) and sets Anastasia to start reading the document right there. The 'pb' function increments the page counter every time it hits a pb: when the counter reaches 2 it has read two full pages, and setting the 'finish' variable to 1 stops the output. One could include a few more proc functions to format the headings, paragraphs, etc., or one could identify the last page by attribute value, but that is all. And it does this just as quickly for a document with 100,000 pages as for one with a single page.
I don't have a prejudice against XSLT, which seems a fine tool for doing most of the things one wants to do with XML. But that does not mean it can do *everything* equally easily (which is how this whole thread started, with musings on its limits as a typesetting language). Some things Anastasia certainly does much more easily, and that is no surprise -- it was designed for just that. She is open source. Go look at her at http://anastasia.sourceforge.net

Peter R(obinson)
***revised document with an appendix after the last page...

<div>
  <head>The whole text and all the texts</head>
  <div>
    <pb n="1"/>
    <head>First text</head>
    <p n="1">some text starts here and goes ita<hi rend="italic">lic an<pb n="2"/>d then</hi> we get a pagebreak</p>
    <p n="2">so the text finishes</p>
    <p n="3"> with yet another page <pb n="3"/> and another page start </p>
  </div>
  <div>
    <head>Second text</head>
    <pb n="4"/>
    <p n="1">here my new text on the next page etc etc</p>
    <pb n="5"/>
    <p n="2">here my new text on the next page etc etc</p>
  </div>
  <div>
    <p>Now here we have an appendix and some more text after that</p>
  </div>
</div>
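For anyone curious what the 'trojan milestone' fix mentioned above would look like here, a rough sketch (the sID/eID attribute names follow my recollection of De Rose's convention and are illustrative only): each page gets an explicit, empty start marker and end marker, so the last page can be closed before the appendix begins:

    <pb sID="page5"/>
    <p n="2">here my new text on the next page etc etc</p>
    <pb eID="page5"/>
    <div>
      <p>Now here we have an appendix and some more text after that</p>
    </div>

Because the markers are empty elements they can cross the div and p hierarchy freely, and a processor can recover a page as everything between the matching sID and eID.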
Peter Robinson wrote:
Peter (Binkley) has already pointed out the problems with the XSLT approach he took. I think even the fix he suggests, of returning a stripped node fragment, would still lead to performance problems: to pull a single page out of, say, the Hengwrt Chaucer, you would have to read the WHOLE document to create the node sets for ALL pages, just so you could extract the node set for the one you want, every single time.
Is this necessarily the case? Couldn't you simply output these fragments as individual page files - so only having to do the process once? I agree with you whole-heartedly that XSLT is problematic for choosing to display from one arbitrary point to another arbitrary point in the XML, but your points aren't arbitrary. Page breaks will be the same each time, so I don't see the need to do this dynamically. Store your full versions of Hengwrt, and also store files created for each page, searching the former, but displaying the latter. Just because we *can* dynamically pull out a single page from the whole of a large XML file doesn't mean we *have* to.
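A rough XSLT 2.0 sketch of that 'split once, serve static pages' idea, run against Peter's sample document (the file names and the text-only output are my own simplifications, not anything from Peter Binkley's stylesheet):

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
        <!-- run once: write the stripped text of each page to its own file -->
        <xsl:for-each select="//pb">
          <xsl:variable name="this" select="."/>
          <xsl:result-document href="page-{@n}.txt" method="text">
            <!-- every text node whose nearest preceding pb is this one -->
            <xsl:copy-of select="//text()[preceding::pb[1] is $this]"/>
          </xsl:result-document>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>

The expensive tree-walking then happens once, at publication time, rather than on every request.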
I don't have a prejudice against XSLT, which seems a fine tool for doing most of the things one wants to do with XML. But that does not mean it can do *everything* equally easily (which is how this whole thread started, with musings on its limits as a typesetting language). Some things Anastasia certainly does much more easily, and that is no surprise -- it was designed for just that. She is open source. Go look at her at http://anastasia.sourceforge.net
I'd of course encourage people to do so as well; Anastasia is a great tool and it is wonderful that it has been made open source.
And it's the subject of a wiki stub in our "tools" category, which of course might be filled in by users, perhaps using some of the discussion here, as this is the type of problem that helps people understand what a tool is and why it is useful: http://sql.uleth.ca/dmorgwiki/index.php/Anastasia.
If anybody does do that, we're currently restricted on embedding images (e.g. screen shots) in the wiki. If you put them up somewhere and link to them from the wiki page, I'll embed them as soon as I can.
James Cummings wrote (re Peter Robinson's remarks on dynamic extraction):
Is this necessarily the case? Couldn't you simply output these fragments as individual page files - so only having to do the process once? I agree with you whole-heartedly that XSLT is problematic for choosing to display from one arbitrary point to another arbitrary point in the XML, but your points aren't arbitrary. Page breaks will be the same each time, so I don't see the need to do this dynamically. Store your full versions of Hengwrt, and also store files created for each page, searching the former, but displaying the latter. Just because we *can* dynamically pull out a single page from the whole of a large XML file doesn't mean we *have* to.
--------
Maybe we don't *have* to, but we certainly should. If we cannot handle such a task dynamically, then we are forced into adopting one of several clunky solutions. James suggests that we introduce redundancy into the mix. But in so doing, we are now faced with issues of synchronization. What happens when changes are made to the document? Any change made to the "master" must be reflected in all derived versions. So one must either edit each of the generated versions to include the change (which will almost guarantee the loss of synchronization) or re-generate the reduced, and redundant, versions of the document after any change. Again, loss of synchronization is bound to result (as it relies on a human being to remember to do this). Indeed, any system that relies on the proper functioning of human beings will surely fail at some point.
I agree with Peter Robinson. A dynamic approach is always best.
Michael L. Norton, Ph.D.
Computer Science Dept., ISAT/CS #209, MSC 4103
James Madison University
Harrisonburg, VA 22807
(540) 568-2777
nortonml@jmu.edu
Michael L. Norton wrote:
Just because we *can* dynamically pull out a single page from the whole of a large XML file doesn't mean we *have* to.
Maybe we don't *have* to, but we certainly should. If we cannot handle such a task dynamically, then we are forced into adopting one of several clunky solutions. James suggests that we introduce redundancy into the mix. But in so doing, we are now faced with issues of synchronization. What happens when changes are made to the document? Any change made to the "master" must be reflected in all derived versions. So one must either edit each of the generated versions to include the change (which will almost guarantee the loss of synchronization) or re-generate the reduced, and redundant, versions of the document after any change. Again, loss of synchronization is bound to result (as it relies on a human being to remember to do this). Indeed, any system that relies on the proper functioning of human beings will surely fail at some point.
I agree with Peter Robinson. A dynamic approach is always best.
I understand, and agree, that in theory a dynamic approach from a single source is preferable to creating static files. I don't think there is really such a problem of synchronization as you suggest, however. When changes are made in the 'master' document, this isn't going to be the copy of the document that is currently being used to dynamically create pages on the production server. The copying of this document to a production server is a single step -- whether that step involves a single process or a whole set of them, it is not really fraught with problems. If I type 'make site' I don't need to know that my Makefile (or whatever technology you prefer) is first making a backup of the current site, then copying any changed files into CVS for the production server, then recreating a whole series of expanded page files, or whatever. It is all still a single step.
But that is really tangential; the problem I was really envisioning is with really, really large XML files, perhaps gigabytes in size. It doesn't matter that it is possible to produce single pages dynamically from such a file using XSLT, because the time and resources to do so reliably on a high-traffic website will make it painful for your users. In such cases a solution like Peter's is better for dynamic creation of pages. However, I still argue that it isn't necessary to have this perceived division between 'master' file and individual page files. What about such things as xinclude? So you have your master file, but it is only filled with xinclude pointers to each of your individual page files. You make all changes on the individual files; if you need to search the whole thing or whatever, you use the master file with xincludes expanded. (Or, my preferred way would be the reverse: you make all the changes in a single master file which is then used to generate the site, consisting of an xinclude-based master file and the individual page files.) Since the master file doesn't itself contain all the individual page files, there is no redundancy. (Except, of course, between your working copy and production copy and backup copies, but that just makes good sense!)
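For concreteness, a minimal sketch of what such an xinclude-driven master might look like (the file and wrapper-element names are invented for illustration, and I gloss over the fact that page boundaries cut across the div hierarchy, so in practice each page file would need some wrapper element of its own):

    <?xml version="1.0"?>
    <text xmlns:xi="http://www.w3.org/2001/XInclude">
      <xi:include href="page1.xml"/>
      <xi:include href="page2.xml"/>
      <xi:include href="page3.xml"/>
      <xi:include href="page4.xml"/>
      <xi:include href="page5.xml"/>
    </text>

An XInclude-aware processor expands this to the full document for searching or indexing, while the web server only ever touches the small page files.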
-James