TAN: Unicode, XML, and LaTeX


Daniel O'Donnell

3 Apr 2005

3:12 p.m.

Hi all, I've been playing with LaTeX (actually MiKTeX) as a TEI2PDF conversion method, and I have a question about Unicode support for LaTeX that members of this list might have a better handle on than the TUG group seems to. What is the best method of handling UTF-8 XML code in LaTeX2e? Recent discussions (i.e. Feb 2005) on the various mailing lists archived at TUG seem to suggest this is something that is difficult to do and won't really be resolved until LaTeX 3. I'm sure members here have some practical experience. Thanks. -dan

P.S. I'm cross-posting this to tei-l; apologies for the double messages.

--
Daniel Paul O'Donnell, PhD
Associate Professor of English
University of Lethbridge
Lethbridge AB T1K 3M4
Tel. (403) 329-2377
Fax. (403) 382-7191
E-mail daniel.odonnell@uleth.ca
Home Page http://people.uleth.ca/~daniel.odonnell/
The Digital Medievalist Project: http://www.digitalmedievalist.org/


John McChesney-Young

3 Apr

7:26 p.m.

Daniel Paul O'Donnell wrote in part:

...

I've been playing with LaTeX (actually MiKTeX) as a TEI2PDF conversion method, and I have a question about Unicode support for LaTeX that members of this list might have a better handle on than the TUG group seems to. What is the best method of handling UTF-8 XML code in LaTeX2e? ...

I was quite distressed last fall when I opened the brand-new 4th edition of Kopka and Daly and found no entry in the index under Unicode. The even more recent second edition of _The LaTeX Companion_ is a little better (check the index under "UTF-8"; there's still no "Unicode" entry), but the situation is still pretty sketchy from what I understand. LaTeX & Unicode will in theory work with the characters in [inputenc] encodings (e.g., Latin 1 and Latin 2), and there's support for CJK with the difficult-to-set-up [ucs] package, but it's neither as easy nor as complete as it seems like it should be after all these years - a constant refrain about fonts in general on the TeX lists, by the way.
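To make the [inputenc] route concrete, here is a minimal sketch in Python (not something from the thread) of turning UTF-8 XML into a LaTeX2e file that declares UTF-8 input. The element name `p`, the function names, and the sample text are assumptions for illustration. The `utf8` option of inputenc only covers characters that the loaded font encodings provide, so text outside the Latin ranges would probably still need [ucs] or a different engine.

```python
import xml.etree.ElementTree as ET

# Characters that LaTeX treats as markup, with their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape LaTeX special characters one at a time; other characters pass through."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def xml_to_latex(xml_source):
    """Convert every <p> element (an assumed, un-namespaced element name)
    into a LaTeX paragraph inside a document that declares UTF-8 input."""
    root = ET.fromstring(xml_source)
    paragraphs = [
        escape_latex("".join(p.itertext()).strip()) for p in root.iter("p")
    ]
    lines = [
        r"\documentclass{article}",
        r"\usepackage[utf8]{inputenc}",  # tell LaTeX2e the source is UTF-8
        r"\usepackage[T1]{fontenc}",     # font encoding with accented Latin glyphs
        r"\begin{document}",
        "\n\n".join(paragraphs),
        r"\end{document}",
    ]
    return "\n".join(lines) + "\n"
```

The result should be written out with `open(path, "w", encoding="utf-8")` so that the bytes on disk match what the preamble declares.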

Progress does seem to be being made, though; see e.g. the TUGboat EuroTeX 2003 Proceedings Table of Contents:

http://www.tug.org/TUGboat/Contents/contents24-3.html

You might also find parts of the 2005 Conference Preprints .pdf useful (4.6 MB), linked at:

http://www.dante.de/dante/events/eurotex/

I'm far from an expert, but you might want to look into (and ask on the lists for) Aleph and Omega:

http://www.ntg.nl/mailman/listinfo/aleph

http://tug.org/mailman/listinfo/omega

If by chance you haven't tried it already, the newsgroup news:comp.text.tex might be worth approaching. Searching its archive at Google Groups:

http://groups-beta.google.com/

using the search string:

unicode xml group:comp.text.tex

might prove useful. The first result that comes up looks at first glance particularly promising.

If you're not wedded to LaTeX2e, you might consider one of its cousins, the cutting-edge cross-platform implementation of TeX called ConTeXt:

http://www.pragma-ade.com/

http://contextgarden.net/Main_Page

See for the Unicode side:

http://contextgarden.net/Encodings_and_Regimes

and for the XML side:

http://contextgarden.net/XML

ConTeXt also has a very active mailing list:

http://www.ntg.nl/mailman/listinfo/ntg-context

From what I've gathered from discussions on that mailing list, almost anything you can do in LaTeX you can also do in ConTeXt; often, if something is not currently possible, the developers will come up with a solution and incorporate it in the next minor update a month or two down the road. The user (and developer) base is much smaller than for LaTeX, there are no (!) books, and the documentation is not always as helpful as one might want, but the recent founding of ConTeXt Garden has helped make the implementation a good deal more intelligible than it was, say, a year ago.

Alternatively, if switching to Mac OS X is an option, bleeding-edge (so-called by its main developer, Jonathan Kew) XeTeX:

http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=xetex

is platform-specific but allows you to use any fonts in the system without having to generate new files, name them in special & esoteric ways, and put them in the right place - the kind of thing that tends to drive new users of LaTeX back to WYSIWYG. Note also the link to the active XeTeX mailing list at the above page.

Sorry not to be of more specific help, but I hope these will provide some leads to those who can offer better assistance.

John

-- *** John McChesney-Young ** panis~at~pacbell.net ** Berkeley, California, U.S.A. ***

Peter Baker

8:11 p.m.

Dan,

I won't try to cover any of the ground that John McChesney-Young has covered in his post, except to say that that is also my sense of the state of LaTeX right now: support for UTF-8 is still clunky, and Unicode/OpenType fonts are not natively supported. To judge from its website the promising Omega project seems to have stalled.

I'm wondering if you've considered XSL-FO as a way to get from TEI to PDF. Apache's FOP (see http://xml.apache.org/fop/) is supposed to be a good free implementation. If I recall correctly, earlier versions of FOP used TeX or LaTeX to output PDF files, but now it outputs directly to PDF. There's a great advantage to staying within XML to do a job like producing a PDF: you use an XSLT script to output an FO file, and then send that to FOP.
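A minimal sketch of the "XML in, FO file out" step described above, written in Python as a stand-in for the XSLT script (the Python standard library has no XSLT processor). The input element name `p`, the function names, and the A4 page geometry are assumptions for illustration, not anything specific to TEI or FOP.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)  # serialize with the usual fo: prefix


def fo(tag):
    """Return the namespace-qualified name of an XSL-FO element."""
    return f"{{{FO_NS}}}{tag}"


def tei_to_fo(tei_source):
    """Build an XSL-FO document with one fo:block per <p> in the input.
    <p> is an assumed, un-namespaced element name."""
    tei = ET.fromstring(tei_source)

    root = ET.Element(fo("root"))
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "page",
        "page-width": "210mm",   # A4, chosen only for the example
        "page-height": "297mm",
        "margin-top": "25mm",
        "margin-bottom": "25mm",
        "margin-left": "25mm",
        "margin-right": "25mm",
    })
    ET.SubElement(page, fo("region-body"))

    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "page"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})

    for p in tei.iter("p"):
        block = ET.SubElement(flow, fo("block"), {"space-after": "6pt"})
        block.text = "".join(p.itertext()).strip()

    return ET.tostring(root, encoding="unicode")
```

The returned string would be saved as a UTF-8 `.fo` file and handed to the formatter; with a FOP installation that is typically something like `fop -fo out.fo -pdf out.pdf`.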

There are also commercial products for working with FO (here's one with a free academic license: http://www.renderx.com/download/academic.html), and some of them sound far easier to use than FOP. Being rather perverse, I'd probably stick with FOP myself. I'll bet there's some experience getting TEI to FOP on this list.

Peter


Daniel O'Donnell

9:44 p.m.

What scared me off XSL-FO was a recurring theme on TEI-L that suggested that XSL-FO engines are too expensive or too poor to be used very much by academics. In particular, it is claimed that the free ones don't do tables very well, and the ones that do do tables well are expensive. I'll take a look at the ones you suggest; I'm also looking at ConTeXt, though, as somebody noted, it ain't consumer friendly in its setup. On TEI-L, it has often been suggested that XML-ers can see LaTeX largely as a PDF engine.

I must say that two days' intensive work with LaTeX has left me with mixed feelings. On the one hand, I can produce very neat-looking articles suitable for immediate publication in Phys Rev Letters. On the other hand, the whole system seems pretty old fashioned. It claims to separate content and form, but it doesn't really: all markup is basically format oriented. It looks like the system the original HTML design was probably based on. SGML and XML are a conceptual leap forward, although SGML suffered from a terrible lack of delivery mechanisms, and XML seems to be suffering from the lack of a consumer oriented Print/PDF mechanism.

So in short, I think as an old SGML/XMLer, I'd rather stick with a concept I understand (trees and XPath) than convert a tree model to a process model, which is what LaTeX seems to be; but I'd also prefer my output to be cheap and easy to produce, without limitations (for example) on the type of tables I can produce. My immediate goal for learning some kind of XML > typesetting language is to see if I can automate the production of Brodart call-number stickers for my personal library (I'm a dork, as my neighbour has repeatedly noted), something that requires me ultimately to set a very unusual page size. This is not hard to do in XSL-FO, as far as I can tell, but quite difficult in LaTeX: another reason to stick with FO, if I can.
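As a rough illustration of the unusual-page-size point, the sketch below (Python, standard library only) generates an XSL-FO document whose page master is the size of a single sticker, with one call number per page. The 25mm by 38mm dimensions, the 2mm margins, the sample call numbers, and the function names are all hypothetical; they are not Brodart's actual specifications.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name of an XSL-FO element."""
    return f"{{{FO_NS}}}{tag}"


def labels_to_fo(call_numbers, width="25mm", height="38mm"):
    """Emit one sticker-sized page per call number.
    The default width and height are made-up values for the example."""
    root = ET.Element(fo("root"))
    masters = ET.SubElement(root, fo("layout-master-set"))

    # The unusual page size is just two attributes on the page master.
    attrs = {"master-name": "label", "page-width": width, "page-height": height}
    attrs.update({f"margin-{side}": "2mm"
                  for side in ("top", "bottom", "left", "right")})
    page = ET.SubElement(masters, fo("simple-page-master"), attrs)
    ET.SubElement(page, fo("region-body"))

    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "label"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})

    for number in call_numbers:
        # Each call number starts on a fresh page, i.e. a fresh sticker.
        label = ET.SubElement(flow, fo("block"),
                              {"break-before": "page", "text-align": "center"})
        # Put each whitespace-separated part of the call number on its own line.
        for part in number.split():
            ET.SubElement(label, fo("block")).text = part

    return ET.tostring(root, encoding="unicode")
```

Calling `labels_to_fo(["PR 1585 .B4"])` yields a one-page FO document; changing the sticker size only means passing different `width` and `height` strings.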

Jargon Watch:
XSL: eXtensible Stylesheet Language (a language for converting XML to output formats)
XSL-FO: a language for converting XML to (amongst other things) print and PDF
TeX and LaTeX: a typesetting language (and variant) used by natural scientists (for the most part) to typeset articles and books
XML: an HTML-like language used for encoding text structurally
SGML: the predecessor of XML
HTML: the language used to encode web pages
XPath: an XML standard for addressing parts of a document tree
TEI-L: mailing list of the Text Encoding Initiative

-dan

