[dm-l] Character reference conversions: help?
Dieter Köhler
d.k at philo.de
Wed Oct 6 16:56:48 MDT 2004
Hello Dan,
You may use my Open XML Editor ("http://www.philo.de/xmledit/") to
transcode XML documents from UCS-4 to UTF-8. The editor currently supports
a couple of input formats, but only UTF-8 as output format. Loading the
document and using the "save as ..." function should do the job. Be
careful: just saving the document will replace the original with a UTF-8
version.
Note also that this is not a simple character by character
transformation. Instead, a model of the XML document is internally built
and this model is then serialized. This implies that
- the transformation might not work if the input XML document is not
well-formed,
- some additional normalization of the XML document takes place (in
particular: whitespace between markup in the XML prolog, i.e. before the
first tag, is normalized to line breaks, empty elements are expanded, e.g.
<foo/> becomes <foo></foo>, and character references are expanded),
- external entities are not transcoded.
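
The parse-then-serialize behaviour described above can be reproduced in a
small sketch. It uses Python's standard xml.etree.ElementTree module, not
the Open XML Editor, so it only illustrates the general idea; the function
name transcode_to_utf8 and the sample document are made up for this example.

```python
import xml.etree.ElementTree as ET


def transcode_to_utf8(xml_text: str) -> bytes:
    """Parse XML into an in-memory tree, then serialize the tree as UTF-8.

    Hypothetical helper for illustration only. Parsing raises
    ET.ParseError if the input is not well-formed.
    """
    root = ET.fromstring(xml_text)
    # short_empty_elements=False writes <foo></foo> instead of <foo/>.
    return ET.tostring(root, encoding="utf-8", short_empty_elements=False)


# Sample input: an empty element followed by a numeric character reference.
sample = "<doc><foo/>&#x1090;</doc>"
result = transcode_to_utf8(sample)
print(result)

# Input that is not well-formed cannot be modelled, so it cannot be converted.
try:
    transcode_to_utf8("<doc><foo></doc>")
except ET.ParseError as err:
    print("not well-formed:", err)
```

In the output the character reference is gone: the parser has expanded it,
and the serializer writes the character itself as the three UTF-8 bytes
E1 82 90. The empty element is written with separate start and end tags.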
A step-by-step explanation, in German, of how the transformation from UCS-4
into UTF-8 works is included in my draft about "Zahlen- und
Zeichencodierung" ("Number and Character Encoding"), available at
"http://philo.de/lv/multimedia/codierung.html". Non-German speakers may
also find some useful links there to English web pages about this topic.
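
For readers who prefer code, the conversion of a single code point can be
sketched as follows. This follows the standard UTF-8 definition (RFC 3629,
code points up to U+10FFFF) and is not taken from the draft mentioned
above; the function name ucs4_to_utf8 is chosen only for this example.

```python
def ucs4_to_utf8(code_point: int) -> bytes:
    """Encode one code point (a UCS-4 value) as a UTF-8 byte sequence."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# U+1090 needs three bytes.
print(ucs4_to_utf8(0x1090).hex())
```

The payload bits of the code point are spread over one to four bytes; the
leading bits of the first byte tell a decoder how many bytes belong to the
sequence.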
Dieter Köhler
At 12:53 05.10.2004 -0600, Daniel O'Donnell wrote:
>Digital Medievalist Journal (Inaugural Issue Fall 2004). Call for papers:
>http://www.digitalmedievalist.org/cfp.htm
>----------------
>Hello all,
> I need some advice on converting Unicode character references.
> Currently, I am encoding character references in what I believe is UCS-4
> format (Universal Character Set). This means they look like this in my
> source files:
>
>႐
>
>I want to import xhtml documents into Open Office, which seems to need
>UTF-8 encoding (I don't know what UTF stands for). Does anybody know of a
>filter that might do the conversions for me? Or have advice on using open
>office (Windows version) with UCS-4 encoding?
>
>-dan
>--
>Daniel Paul O'Donnell, PhD
>Associate Professor of English
>University of Lethbridge
>Lethbridge AB T1K 3M4
>Tel. (403) 329-2377
>Fax. (403) 382-7191
>E-mail <daniel.odonnell at uleth.ca>
>Home Page <http://people.uleth.ca/~daniel.odonnell/>
>
>
>_______________________________________________
>Project web site: http://www.digitalmedievalist.org/
>dm-l mailing list
>dm-l at uleth.ca
>http://listserv.uleth.ca/mailman/listinfo/dm-l