Hello all, I need some advice on converting Unicode character references. Currently, I am encoding character references in what I believe is UCS-4 format (Universal Character Set). This means they look like this in my source files:
&#x1090;
I want to import xhtml documents into Open Office, which seems to need UTF-8 encoding (I don't know what UTF stands for). Does anybody know of a filter that might do the conversions for me? Or have advice on using open office (Windows version) with UCS-4 encoding?
-dan
On Tue, 05-10-2004 at 12:53 -0600, Daniel O'Donnell wrote:
Digital Medievalist Journal (Inaugural Issue Fall 2004). Call for papers: http://www.digitalmedievalist.org/cfp.htm
Can't you just copy and paste your documents from Mozilla/Firefox/whatever into OOo? I know, this looks too simple to be true ... but I just tried[1] and it works!
Ciao
[1] Picked up an xhtml file, inserted random decimal entities, loaded it in Epiphany (based on Mozilla's engine), copied text and pasted it into a unicode text editor: I ended up with unicode characters.
That's what occurred to me immediately. It should work with any Unicode-capable browser. UTF-8 is the more compact 8-bit Unicode Transformation Format, which encodes each Unicode character in one or more octets, using a single octet (identical to ASCII) for the most common Latin characters, and so saving space.
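Murray's point about octet counts is easy to verify; the following is a small illustrative check (not from the original thread), using Python purely as a calculator:

```python
# ASCII letters occupy a single octet in UTF-8; characters further up
# the Unicode range need two or more octets.
assert len("A".encode("utf-8")) == 1       # U+0041, plain ASCII
assert len("\u03A9".encode("utf-8")) == 2  # U+03A9, Greek capital Omega
assert len("\u1090".encode("utf-8")) == 3  # U+1090, the character in Dan's files
print("all octet counts as expected")
```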
Murray
Good idea. That solves the UTF problem, but it causes me to lose all the stylesheet information. And unless I'm nuts, I can't see a way of loading a CSS sheet in to format a document (not that it would work, probably, since the coding would be gone).
Perhaps there is a way to automate the conversion in Emacs. -dan
Project web site: http://www.digitalmedievalist.org/ dm-l mailing list dm-l@uleth.ca http://listserv.uleth.ca/mailman/listinfo/dm-l
Dan,
If you process the file with an XSLT script that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"
      doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
then your entities should all get converted to UTF-8 automagically.
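As an aside for readers without an XSLT processor to hand: this works because any conforming XML parser expands numeric character references on input, so a plain parse-and-reserialize round trip has the same effect. A minimal sketch in Python (an illustration of the principle, not Peter's pipeline):

```python
import xml.etree.ElementTree as ET

# A toy fragment containing the numeric character reference from Dan's files.
source = "<p>&#x1090;</p>"

# The parser expands the reference to the actual character on input...
root = ET.fromstring(source)

# ...and serializing with UTF-8 output writes the literal UTF-8 bytes.
utf8_bytes = ET.tostring(root, encoding="utf-8")
assert "\u1090".encode("utf-8") in utf8_bytes
print(utf8_bytes)
```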
Peter
For those of us not comfortable with XSLT, any editor designed to work with Unicode text (e.g. UniPad, BabelPad) should be able to convert to and from numeric character references (decimal or hex), UTF-8, UTF-16 (little- or big-endian), and (my favorite) 'java unicode escapes' (aka UCNs). UniPad is pricey; BabelPad, I think, is free. Ideally also from and to ISO standard SGML SDATA character entities. Such editors, however, usually work on one file at a time, rather than as a batch-mode filter.
For the latter purpose, Perl should do the trick: the only filters I use locally go the wrong direction (either from named entities to UTF-8 or from UTF-8 to UCNs), but there are a couple of scripts posted at http://www.mail-archive.com/linux-utf8@nl.linux.org/msg02741.html that look like they should work; if not, there are undoubtedly others out there that will. I think all it has to do is parse &...; strings and convert them to UTF-8 characters, but what would I know?
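For what it's worth, the core of such a filter is only a few lines. A minimal sketch in Python (my own illustration, not one of the scripts from the linked archive; it naively rewrites every &#...; sequence wherever it occurs):

```python
import re

def decode_char_refs(text):
    """Replace XML numeric character references, decimal (&#937;) or
    hexadecimal (&#x3A9;), with the literal Unicode characters they name."""
    def expand(match):
        num = match.group(1)
        # A leading 'x' marks a hexadecimal reference; otherwise decimal.
        code_point = int(num[1:], 16) if num.startswith(("x", "X")) else int(num)
        return chr(code_point)
    return re.sub(r"&#([0-9]+|[xX][0-9a-fA-F]+);", expand, text)

print(decode_char_refs("Omega: &#937; or &#x3A9;"))  # Omega: Ω or Ω
```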
pfs -- -------------------------------------------------------------------- Paul Schaffner | pfs@umich.edu | http://www-personal.umich.edu/~pfs/ 316 Hatcher Library N, Univ. of Michigan, Ann Arbor MI 48109-1205 --------------------------------------------------------------------
According to a Wiki article (http://en.wikipedia.org/wiki/Universal_character_set):
"The Universal Character Set is a character encoding that is defined by the international standard ISO/IEC 10646. It maps hundreds of thousands of abstract characters, each identified by an unambiguous name, to integers, called numeric code points.
Since 1991, the Unicode Consortium has been working with ISO to develop the Unicode Standard and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Version 2.0 of the Unicode Standard are identical to those of ISO/IEC 10646-1:1993 with its first seven published amendments. After Unicode 3.0 was published in February 2000, the new and updated characters were brought into the UCS via ISO/IEC 10646-1:2000."
It seems to me (as is implied by the second paragraph above) that entity code values in UCS should refer to the same character position as in Unicode. So, for example, if &#x1090; is the "Z" character in UCS-4 (I know, it's not!) then it should also be the "Z" character in UTF-8 (assuming the same font is used). If so, no conversion of numerically encoded entities should be needed. Of course, I could be wrong!
Since you ask the question, I assume the transfer isn't working as expected. What exactly is happening?
David
David Badke
It seems to me (as is implied by the second paragraph above) that entity code values in UCS should refer to the same character position as in Unicode. So, for example, if &#x1090; is the "Z" character in UCS-4 (I know, it's not!) then it should also be the "Z" character in UTF-8 (assuming the same font is used). If so, no conversion of numerically encoded entities should be needed. Of course, I could be wrong!
In Unicode a "Unicode scalar value" aka "ISO/IEC 10646 code point" is a number which stands for an "abstract character", which is something like the Platonic idea of a character. This number must be distinguished from its representation in a specific stream of bits used by a computer. The way a sequence of numbers is represented in a sequence of bits is called a "serialization". UCS-2 and UCS-4 ("UCS" stands for "Universal Character Set") are two different ways to serialize Unicode. The various UTF ("UCS Transformation Format") versions (like UTF-8 and UTF-16) were introduced later for the same purpose. UCS-2 covers only a subset of all possible Unicode scalar values; UCS-4 can serialize larger numbers than are actually used as Unicode scalar values. UTF-8 and UTF-16 cover exactly the range of possible Unicode scalar values.
The mapping from a character set definition to the actual code units used to represent the data is called a "character encoding form". A character encoding form plus a serialization is called a "character encoding scheme". In fact UCS-2, UCS-4, UTF-8, and UTF-16 are all character encoding schemes for Unicode.
XML character references like "&#937;" allow for the representation of Unicode scalar values which are not supported by the character encoding scheme used to serialize an XML document. They are a sort of meta-serialization. The character reference "&#937;" advises an XML processor to replace the reference with a representation of the Unicode scalar value 937 (which stands for the Greek upper-case Omega abstract character). This is equivalent to the XML character reference "&#x3A9;" (and even "&#x03A9;" or "&#x0000003A9;", as leading zeros may be added ad libitum). The 'x' in this character reference indicates that the number is hexadecimal (also called 'sedecimal' by those who do not like the mixture of Latin and Greek), while in the absence of an 'x' it is decimal. Numbers in the hexadecimal system use 16 basic symbols (0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F) representing the numbers from 0 to 15. The decimal number '16' is represented by hexadecimal '10', decimal '17' by hexadecimal '11', decimal '255' is hexadecimal 'FF', decimal '256' is hexadecimal '100', and decimal '937' is hexadecimal '3A9'.
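The decimal/hexadecimal correspondences above are easy to double-check; a few lines of Python (used here purely as a calculator, not part of Dieter's post):

```python
# Hexadecimal 3A9 and decimal 937 are the same number...
assert int("3A9", 16) == 937
# ...and that number names GREEK CAPITAL LETTER OMEGA.
assert chr(937) == "\u03A9"

# The other equivalences from the text:
assert int("10", 16) == 16
assert int("11", 16) == 17
assert int("FF", 16) == 255
assert int("100", 16) == 256
print("all conversions check out")
```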
"Fonts" are again from a different level of abstraction. A font is a set of so-called "glyph images" used for the visualization of characters. A character may be represented by different glyph images, and the same glyph image might represent different characters.
Dieter Köhler
At 10:56 PM 05/10/2004, you wrote:
XML character references like "&#937;" allow for the representation of Unicode scalar values which are not supported by the character encoding scheme used to serialize an XML document. They are a sort of meta-serialization. The character reference "&#937;" advises an XML processor to replace this reference with a representation of the Unicode scalar value 937 (which stands for the Greek upper-case Omega abstract character). This is equivalent to the XML character reference "&#x3A9;"
Are you saying that the entity "&#937;", which is the Greek upper-case Omega abstract character in Unicode, is not the Greek upper-case Omega abstract character in UCS-4? In other words, that the 937 code point refers to different abstract characters in Unicode vs. UCS, that the two standards map the same code point to different characters? If so, it appears (to me, at least) to be a very bad design choice; conversion would certainly be needed in this case. If not, then UCS "&#937;" is exactly the same as Unicode "&#937;", and no conversion of the entity coding is needed.
There must be a character chart somewhere comparing UCS and Unicode code points...
"Fonts" are again from a different level of abstraction. A font is a set of so-called "glyph images" used for the visualization of characters. A character may be represented by different glyph images, and the same glyph image might represent different characters.
Exactly. So if the font in use had no glyph at code point 937, or had some other glyph there, you would not see the Greek upper-case Omega with that font. My badly-stated point was that while the entity &#937; always means the Greek upper-case Omega abstract character in the Unicode standard, the glyph you actually see depends on the font in use. A Unicode-compliant font should show the Omega glyph, but whether it does or not is up to the font creator.
David
The TEI Guidelines (P5 edition) contain two chapters about character encoding issues, specifically revised to address the use of Unicode.
You can read the current drafts at http://www.tei-c.org/Activities/CE/FASC-ch.pdf (overview of character encoding issues) and http://www.tei-c.org/Activities/CE/FASC-wd.pdf (recommendations for encoding non-Unicode characters and glyphs).
Comments on these preliminary drafts from the digiterati on this list would be particularly welcome.
Also, if you have specific proposals for changes or additions to the proposals, please visit http://tei.sourceforge.net and make a "feature request"!
Lou Burnard
Are you saying that the entity "&#937;", which is the Greek upper-case Omega abstract character in Unicode, is not the Greek upper-case Omega abstract character in UCS-4? In other words, that the 937 code point refers to different abstract characters in Unicode vs. UCS, that the two standards map the same code point to different characters? If so, it appears (to me, at least) to be a very bad design choice; conversion would certainly be needed in this case. If not, then UCS "&#937;" is exactly the same as Unicode "&#937;", and no conversion of the entity coding is needed.
UCS-4 describes one way to serialize Unicode scalar values into a stream of bits to be written to a file, sent over the Internet, passed to another application, etc. It says that each Unicode scalar value must be encoded in 32 bits, i.e. four bytes of 8 bits each (the reason for the '4' in the name). Different ways to order these bytes are allowed.
For example, the decimal number 937 is 1110101001 in binary notation. "UCS-4 Big Endian" prefixes this with zeros to reach the required 32 bits. Thus, serializing Greek upper-case Omega in UCS-4 Big Endian results in 00000000000000000000001110101001 being written to the data stream. Other byte orders rearrange the bytes (groups of 8 bits) being serialized, e.g. Greek upper-case Omega in UCS-4 Little Endian is 10101001000000110000000000000000.
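Python's utf-32-be / utf-32-le codecs produce exactly these byte sequences (for code points in the Unicode range, UTF-32 and UCS-4 coincide), so the arithmetic above can be checked directly; an illustrative sketch, not part of Dieter's mail:

```python
omega = "\u03A9"  # Unicode scalar value 937

# Big-endian: the number zero-padded to four bytes, most significant first.
assert omega.encode("utf-32-be").hex() == "000003a9"

# Little-endian: the same four bytes in the opposite order.
assert omega.encode("utf-32-le").hex() == "a9030000"
print("byte orders match the worked example")
```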
"Ω" serialized in UCS-4 Big Endian is just a concatenation of the individual characters '&' , '#' , '9', '3', '7', ';' (binary 100110, 100011, 111001, 110011, 110111, 111011):
000000000000000000000000001001100000000000000000000000000010001100000000000000000000000000111001000000000000000000000000001100110000000000000000000000000011011100000000000000000000000000111011
That this represents Greek upper-case Omega is opaque to UCS-4. UCS-4 treats it just as the sequence of individual characters that constitute the character reference. It is a matter of a higher level protocol to interpret this as Greek upper-case Omega, for XML this is outlined in the XML 1.0 specification sec. 4.1 ("http://www.w3.org/TR/2004/REC-xml-20040204/#sec-references"):
CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
[...]
If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with "&#", the digits up to the terminating ; provide a decimal representation of the character's code point.
(The first line in this quotation is a formal description how to generate a character reference. For more information see "http://www.w3.org/TR/2004/REC-xml-20040204/#sec-notation".)
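The point that the reference is opaque at the serialization level can be demonstrated concretely (a Python sketch, with utf-32-be standing in for UCS-4 Big Endian; not part of the original mail):

```python
import xml.etree.ElementTree as ET

ref = "&#937;"

# At the serialization level: six ordinary characters, six 4-byte units.
assert len(ref.encode("utf-32-be")) == 24
assert "\u03A9" not in ref  # no Omega here, just '&', '#', digits, ';'

# Only a higher-level protocol such as an XML processor expands it:
assert ET.fromstring("<x>&#937;</x>").text == "\u03A9"
print("the reference is opaque until the XML layer expands it")
```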
In practice, XML character references would be superfluous if only UCS-4, UTF-16, or UTF-8 were used for serializing XML documents. The purpose of character references is to represent Unicode characters in an XML document serialized with a character encoding scheme, say US-ASCII, that does not include all Unicode characters. One of the larger of XML's many design flaws is, in my opinion, that it provides such a mechanism only for the character data of an XML document, but not for element and attribute names. For example, an XML document may use Greek upper-case Omega in the name of an element, but it is then impossible to serialize this document using US-ASCII.
Dieter Köhler
Do you actually need to convert at all? I'm assuming you are using a PC or Unix to prepare your text file.
Your file is in a particular character set. If you have only typed in unaccented characters and haven't used things like directional quotes then you will have only used 7-bit ASCII (which is a subset of many other character sets).
The file is intended to be XML. That implies that certain sequences of typed characters have a particular meaning to an XML interpreter. For example, < indicates the start of a tag. An XML interpreter needs to know the character set a file is in so that it can understand all the characters in it. There are many different character sets in the world and it is unlikely that an interpreter can understand them all. The default character set for XML is UTF-8 (Unicode Transformation Format using sequences of 8 bits). If you want to use a different character set you have to declare it in the XML header, and the interpreter must support it.
UTF-8 was picked as the default for XML because by design the first 128 characters in Unicode have exactly the same byte values in UTF-8 as are used in 7-bit ASCII. In other words if you type in a file on most systems which doesn't use character codes over 127, not only is it an ASCII file, it also a UTF-8 file.
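Tim's observation can be stated as a one-line check (illustrative Python, not part of the original mail):

```python
s = "An XHTML file typed with plain characters, codes below 128."

# By design, the first 128 Unicode code points have identical byte
# values in UTF-8 and 7-bit ASCII, so the two encodings coincide here.
assert s.encode("ascii") == s.encode("utf-8")
print("this ASCII text is already valid UTF-8")
```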
Since not all character sets include all the characters in the UCS, XML provides an alternative way of representing them rather than just typing them in. This is the &#x1090; format you mention above. So instead of typing "the" I could type "th&#101;" instead, which is the same but more long-winded. This is not UCS-4 encoding, simply a way of indicating the abstract numbers assigned to the characters.
The byte sequence for &#x1090; in your file is 26 23 78 31 30 39 30 3B. If your file were stored in UCS-4, that same sequence of characters would be the byte sequence 00 00 00 26 00 00 00 23 00 00 00 78 00 00 00 31 00 00 00 30 00 00 00 39 00 00 00 30 00 00 00 3B. If instead of typing the &#x1090; reference you had just typed the character directly from your keyboard, then it would be 00 00 10 90 in that hypothetical UCS-4 file.
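The byte sequences Tim lists can be reproduced mechanically (a Python sketch; utf-32-be plays the role of the hypothetical UCS-4 big-endian file):

```python
ref = "&#x1090;"

# The reference as stored in an ASCII (hence also UTF-8) file:
assert ref.encode("ascii").hex(" ") == "26 23 78 31 30 39 30 3b"

# The same eight characters in a UCS-4 big-endian file:
print(ref.encode("utf-32-be").hex(" "))  # each character padded to 4 bytes

# Typing the character itself instead of the reference:
assert "\u1090".encode("utf-32-be").hex(" ") == "00 00 10 90"
print("byte sequences match Tim's worked example")
```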
I am wondering if Open Office is simply telling you that the only character set supported for the file is UTF-8. If so you may not have any worries - just try using your file as is.
Regards,
Tim