[dm-l] Character reference conversions: help?

Dieter Köhler d.k at philo.de
Wed Oct 6 16:56:48 MDT 2004


Hello Dan,

You may use my Open XML Editor ("http://www.philo.de/xmledit/") to 
transcode XML documents from UCS-4 to UTF-8.  The editor currently supports 
a couple of input formats, but only UTF-8 as output format.  Loading the 
document and using the "save as ..." function should do the job.  Be 
careful: just saving the document will replace the original with a UTF-8 
version.

Note also that this is not a simple character by character 
transformation.  Instead, a model of the XML document is internally built 
and this model is then serialized.  This implies that
- the transformation might not work if the input XML document is not 
wellformed,
- some additional normalization of the XML document takes place (in 
particular: whitespace between markup in the XML prolog, i.e. before the 
first tag, is normalized to line breaks, empty elements are expanded, e.g. 
<foo/> becomes <foo></foo>, and character references are expanded),
- external entities are not transcoded.

A step-by-step explanation in German how the transformation from UCS-4 into 
UTF-8 works is included in my draft about "Zahlen- und Zeichencodierung", 
available at "http://philo.de/lv/multimedia/codierung.html".  Non-German 
speakers may find some useful links to English Web-pages about this topic 
there as well.

Dieter Köhler



At 12:53 05.10.2004 -0600, Daniel O'Donnell wrote:
>Digital Medievalist Journal (Inaugural Issue Fall 2004). Call for papers: 
>http://www.digitalmedievalist.org/cfp.htm
>----------------
>Hello all,
>         I need some advice on converting Unicode character references. 
> Currently, am encoding character references in what I believe is UCS-4 
> format (Universal Character Set). This means they look like this in my 
> source files:
>
>&#x1090;
>
>I want to import xhtml documents into Open Office, which seems to need 
>UTF-8 encoding (I don't know what UTF stands for). Does anybody know of a 
>filter that might do the conversions for me? Or have advice on using open 
>office (Windows version) with UCS-4 encoding?
>
>-dan
>--
>Daniel Paul O'Donnell, PhD
>Associate Professor of English
>University of Lethbridge
>Lethbridge AB T1K 3M4
>Tel. (403) 329-2377
>Fax. (403) 382-7191
>E-mail <daniel.odonnell at uleth.ca>
>Home Page <http://people.uleth.ca/~daniel.odonnell/>
>
>
>_______________________________________________
>Project web site: http://www.digitalmedievalist.org/
>dm-l mailing list
>dm-l at uleth.ca
>http://listserv.uleth.ca/mailman/listinfo/dm-l





More information about the dm-l mailing list