Text that is repeatedly put through a web-based content management system can get badly mutilated. If that text is treated as XML, the mutilation can be fatal.

While investigating this I established the best encoding (numeric entities) to use and discovered a few problems that I hadn't been aware of.

This article is derived from internal notes and mentions named character entities, literal Unicode glyphs, UTF-8 encoding, numerically encoded characters (decimal and hexadecimal), TrueType fonts, XML, XHTML, Content Management...

Encoding Markup for XML Content Managers

First published 10 September 2004


I have been working on a content management system. It takes web pages encoded as XHTML, lets users edit the content, then publishes the content back to the web.

I used third-party tools, expecting to simply plug them in and ignore the internal plumbing. It didn't work out that way. Even though I severely limited the editing, the XML was getting mangled and needed rehabilitation. Among the problems, some characters were getting lost and others were transformed beyond recognition.

Characters can be represented in XML (or XHTML) in several ways, and not all of them survive repeated processing.
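
A minimal sketch of the effect, using Python's standard xml.etree.ElementTree as a stand-in (the third-party tools used in the project are not named here, and they may behave differently). The helper names are made up for illustration. A single parse and re-serialise is enough to change how a character is written, even though the character itself is unchanged.

```python
import xml.etree.ElementTree as ET

# The same character (e with acute accent, code point 233) written two ways.
as_numeric = "<p>caf&#233;</p>"
as_literal = "<p>caf\u00e9</p>"

def round_trip_ascii(markup: str) -> str:
    """Parse, then serialise with ElementTree's default US-ASCII output."""
    return ET.tostring(ET.fromstring(markup)).decode("ascii")

def round_trip_unicode(markup: str) -> str:
    """Parse, then serialise to a string of literal characters."""
    return ET.tostring(ET.fromstring(markup), encoding="unicode")

# With the ASCII serialiser both inputs come back as a numeric entity.
print(round_trip_ascii(as_numeric))
print(round_trip_ascii(as_literal))
# With the Unicode serialiser the numeric entity comes back as a literal.
print(round_trip_unicode(as_numeric))
```

Which form comes out depends on the serialiser settings, not on what the author typed, so a chain of tools can silently swap one representation for another.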

See for yourself

I looked at three of these representations:

  1. Literal characters. (The characters I am looking at are generally stored as a single character in the computer; some characters require more storage space. If the file that saves them is encoded as UTF-8, a lot of content will also work in older editors.)
  2. Named entities. These look like &eacute; and are familiar to web page authors.
  3. Numeric entities (decimal). These look like &#160;. They are similar to named entities but, unfortunately, make text difficult to read.
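
As a rough illustration of the three forms, here is a sketch in Python. It relies on the standard library's html.entities table, which holds the HTML 4 entity names; the function name three_forms is hypothetical, and the byte count is included only to show that some characters need more storage in UTF-8.

```python
import html
from html.entities import codepoint2name

def three_forms(ch: str) -> dict:
    """Return the literal, named and decimal numeric forms of one character."""
    code = ord(ch)
    name = codepoint2name.get(code)  # None if HTML 4 defines no name
    return {
        "literal": ch,
        "named": f"&{name};" if name else None,
        "numeric": f"&#{code};",
        "utf8_bytes": len(ch.encode("utf-8")),
    }

forms = three_forms("\u00e9")  # e with acute accent
print(forms)

# The named and numeric forms both decode to the same literal character.
decoded_named = html.unescape(forms["named"])
decoded_numeric = html.unescape(forms["numeric"])
print(decoded_named == decoded_numeric == forms["literal"])
```

A character without an HTML 4 name (for example code point 256) gets None in the named slot, which is one reason a numeric form is the only one that works for every character.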

The table below shows how these three different forms look. It contains characters which have named entities and which tend to be the ones most commonly needed in English, western European languages, mathematics... Even though the three columns look the same, under the skin they are written differently and they don't always behave the same way. Some of these differences are described in the notes.

The same page will also look different if displayed with different fonts. In some fonts characters may show up while in others they may go missing. (Missing characters usually show as a hollow box.) I have picked one font that shows a lot of the characters which are missing on this page. It may be seen on this version of the page. If you have the font called Arial Unicode MS installed on your computer, the missing characters should show up. If not, it'll look the same.

This displays using Arial, if you have it installed. Failing that it may display in Syntax, Helvetica or your default sans-serif font.

  3 Representations  
Name Description Number Numeric Named Literal Notes
nbsp Non-breaking space 160       Named The named entity (&nbsp;) is not legal in plain XML, though it is legal XHTML. This makes named entities unsuitable for processing as XML.
iexcl Inverted exclamation 161 ¡ ¡ ¡ Literal This literal and all other Latin literals were lost in a processing test.
cent Cent sign 162 ¢ ¢ ¢ Literal
pound Pound sterling sign 163 £ £ £ Literal
curren General currency sign 164 ¤ ¤ ¤ Literal
yen Yen sign 165 ¥ ¥ ¥ Literal
brvbar Broken vertical bar 166 ¦ ¦ ¦ Literal
sect Section sign 167 § § § Literal
uml Umlaut (diaeresis) 168 ¨ ¨ ¨ Literal
copy Copyright sign 169 © © © Literal
ordf Feminine ordinal 170 ª ª ª Literal
laquo Left angle quote, guillemet left 171 « « « Literal
not Not sign 172 ¬ ¬ ¬ Literal
shy Soft hyphen 173 ­ ­ ­ Literal
reg Registered trademark 174 ® ® ® Literal
macr Macron accent 175 ¯ ¯ ¯ Literal
deg Degree sign 176 ° ° ° Literal
plusmn Plus or minus sign 177 ± ± ± Literal
sup2 Superscript two 178 ² ² ² Literal
sup3 Superscript three 179 ³ ³ ³ Literal
acute Acute accent 180 ´ ´ ´ Literal
micro Micro sign 181 µ µ µ Literal
para Paragraph sign (pilcrow) 182 ¶ ¶ ¶ Literal
middot Middle dot 183 · · · Literal
cedil Cedilla 184 ¸ ¸ ¸ Literal
sup1 Superscript one 185 ¹ ¹ ¹ Literal
ordm Masculine ordinal 186 º º º Literal
raquo Right angle quote, guillemet right 187 » » » Literal
frac14 One Quarter (vulgar fraction) 188 ¼ ¼ ¼ Literal
frac12 One Half (vulgar fraction) 189 ½ ½ ½ Literal
frac34 Three Quarters (vulgar fraction) 190 ¾ ¾ ¾ Literal
iquest Inverted question mark 191 ¿ ¿ ¿ Literal
Agrave Capital A, grave 192 À À À Literal
Aacute Capital A, acute 193 Á Á Á Literal
Acirc Capital A, circumflex 194 Â Â Â Literal
Atilde Capital A, tilde 195 Ã Ã Ã Literal
Auml Capital A, umlaut (diaeresis) 196 Ä Ä Ä Literal
Aring Capital A, ring 197 Å Å Å Literal
AElig Capital AE diphthong (ligature) 198 Æ Æ Æ Literal
Ccedil Capital C, cedilla 199 Ç Ç Ç Literal
Egrave Capital E, grave 200 È È È Literal
Eacute Capital E, acute 201 É É É Literal
Ecirc Capital E, circumflex 202 Ê Ê Ê Literal
Euml Capital E, umlaut (diaeresis) 203 Ë Ë Ë Literal
Igrave Capital I, grave 204 Ì Ì Ì Literal
Iacute Capital I, acute 205 Í Í Í Literal
Icirc Capital I, circumflex 206 Î Î Î Literal
Iuml Capital I, umlaut (diaeresis) 207 Ï Ï Ï Literal
ETH Capital Eth, Icelandic 208 Ð Ð Ð Literal
Ntilde Capital N, tilde 209 Ñ Ñ Ñ Literal
Ograve Capital O, grave 210 Ò Ò Ò Literal
Oacute Capital O, acute 211 Ó Ó Ó Literal
Ocirc Capital O, circumflex 212 Ô Ô Ô Literal
Otilde Capital O, tilde 213 Õ Õ Õ Literal
Ouml Capital O, umlaut (diaeresis) 214 Ö Ö Ö Literal
times Multiplication sign 215 × × × Literal
Oslash Capital O, slash 216 Ø Ø Ø Literal
Ugrave Capital U, grave 217 Ù Ù Ù Literal
Uacute Capital U, acute 218 Ú Ú Ú Literal
Ucirc Capital U, circumflex 219 Û Û Û Literal
Uuml Capital U, umlaut (diaeresis) 220 Ü Ü Ü Literal
Yacute Capital Y, acute 221 Ý Ý Ý Literal
THORN Capital Thorn, Icelandic 222 Þ Þ Þ Literal
szlig Small sharp s, German (sz ligature) 223 ß ß ß Literal
agrave Small a, grave 224 à à à Literal
aacute Small a, acute 225 á á á Literal
acirc Small a, circumflex 226 â â â Literal
atilde Small a, tilde 227 ã ã ã Literal
auml Small a, umlaut (diaeresis) 228 ä ä ä Literal
aring Small a, ring 229 å å å Literal
aelig Small ae diphthong (ligature) 230 æ æ æ Literal
ccedil Small c, cedilla 231 ç ç ç Literal
egrave Small e, grave 232 è è è Literal
eacute Small e, acute 233 é é é Literal
ecirc Small e, circumflex 234 ê ê ê Literal
euml Small e, umlaut (diaeresis) 235 ë ë ë Literal
igrave Small i, grave 236 ì ì ì Literal
iacute Small i, acute 237 í í í Literal
icirc Small i, circumflex 238 î î î Literal
iuml Small i, umlaut (diaeresis) 239 ï ï ï Literal
eth Small eth, Icelandic 240 ð ð ð Literal
ntilde Small n, tilde 241 ñ ñ ñ Literal
ograve Small o, grave 242 ò ò ò Literal
oacute Small o, acute 243 ó ó ó Literal
ocirc Small o, circumflex 244 ô ô ô Literal
otilde Small o, tilde 245 õ õ õ Literal
ouml Small o, umlaut (diaeresis) 246 ö ö ö Literal
divide Division sign 247 ÷ ÷ ÷ Literal
oslash Small o, slash 248 ø ø ø Literal
ugrave Small u, grave 249 ù ù ù Literal
uacute Small u, acute 250 ú ú ú Literal
ucirc Small u, circumflex 251 û û û Literal
uuml Small u, umlaut (diaeresis) 252 ü ü ü Literal
yacute Small y, acute 253 ý ý ý Literal
thorn Small thorn, Icelandic 254 þ þ þ Literal
yuml Small y, umlaut (diaeresis) 255 ÿ ÿ ÿ Literal
OElig Latin Capital OE (ligature) 338 Œ Œ Œ Literal
oelig Latin Small OE (ligature) 339 œ œ œ Literal
Scaron Capital S with caron 352 Š Š Š Literal
scaron Small s with caron 353 š š š Literal
Yuml Capital Y, umlaut (diaeresis) 376 Ÿ Ÿ Ÿ Literal
fnof florin (latin small f with hook) 402 ƒ ƒ ƒ Literal
circ Circumflex accent 710 ˆ ˆ ˆ Literal
tilde Small tilde 732 ˜ ˜ ˜ Literal
Alpha Capital Greek Alpha 913 Α Α Α
Beta Capital Greek Beta 914 Β Β Β
Gamma Capital Greek Gamma 915 Γ Γ Γ
Delta Capital Greek Delta 916 Δ Δ Δ
Epsilon Capital Greek Epsilon 917 Ε Ε Ε
Zeta Capital Greek Zeta 918 Ζ Ζ Ζ
Eta Capital Greek Eta 919 Η Η Η
Theta Capital Greek Theta 920 Θ Θ Θ
Iota Capital Greek Iota 921 Ι Ι Ι
Kappa Capital Greek Kappa 922 Κ Κ Κ
Lambda Capital Greek Lambda 923 Λ Λ Λ
Mu Capital Greek Mu 924 Μ Μ Μ
Nu Capital Greek Nu 925 Ν Ν Ν
Xi Capital Greek Xi 926 Ξ Ξ Ξ
Omicron Capital Greek Omicron 927 Ο Ο Ο
Pi Capital Greek Pi 928 Π Π Π
Rho Capital Greek Rho 929 Ρ Ρ Ρ
Sigma Capital Greek Sigma 931 Σ Σ Σ
Tau Capital Greek Tau 932 Τ Τ Τ
Upsilon Capital Greek Upsilon 933 Υ Υ Υ
Phi Capital Greek Phi 934 Φ Φ Φ
Chi Capital Greek Chi 935 Χ Χ Χ
Psi Capital Greek Psi 936 Ψ Ψ Ψ
Omega Capital Greek Omega 937 Ω Ω Ω
alpha Small Greek Alpha 945 α α α
beta Small Greek Beta 946 β β β
gamma Small Greek Gamma 947 γ γ γ
delta Small Greek Delta 948 δ δ δ
epsilon Small Greek Epsilon 949 ε ε ε
zeta Small Greek Zeta 950 ζ ζ ζ
eta Small Greek Eta 951 η η η
theta Small Greek Theta 952 θ θ θ
iota Small Greek Iota 953 ι ι ι
kappa Small Greek Kappa 954 κ κ κ
lambda Small Greek Lambda 955 λ λ λ
mu Small Greek Mu 956 μ μ μ
nu Small Greek Nu 957 ν ν ν
xi Small Greek Xi 958 ξ ξ ξ
omicron Small Greek Omicron 959 ο ο ο
pi Small Greek Pi 960 π π π
rho Small Greek Rho 961 ρ ρ ρ
sigmaf Small Greek final Sigma 962 ς ς ς
sigma Small Greek Sigma 963 σ σ σ
tau Small Greek Tau 964 τ τ τ
upsilon Small Greek Upsilon 965 υ υ υ
phi Small Greek Phi 966 φ φ φ
chi Small Greek Chi 967 χ χ χ
psi Small Greek Psi 968 ψ ψ ψ
omega Small Greek Omega 969 ω ω ω
thetasym Small Greek theta symbol 977 ϑ ϑ ϑ Common This glyph (character) was not present in the common fonts tested.
upsih Greek Upsilon with hook 978 ϒ ϒ ϒ Common
piv Greek Pi symbol 982 ϖ ϖ ϖ Common
ensp En space 8194
emsp Em space 8195
thinsp Thin space 8201
zwnj Zero width non-joiner 8204
zwj Zero width joiner 8205
lrm Left-to-right mark 8206
rlm Right-to-left mark 8207
ndash En dash 8211 Literal
mdash Em dash 8212 Literal
lsquo Left single quotation mark 8216 Literal
rsquo Right single quotation mark 8217 Literal
sbquo Single low-9 quotation mark 8218 Literal
ldquo Left double quotation mark 8220 Literal
rdquo Right double quotation mark 8221 Literal
bdquo Double low-9 quotation mark 8222 Literal
dagger Dagger 8224 Literal
Dagger Double Dagger 8225 Literal
bull Bullet / Small black circle 8226 Literal
hellip Horizontal Ellipsis 8230 Literal
permil Per mille (thousand) sign 8240 Literal
prime Prime / Minutes / Feet 8242
Prime Double prime 8243
lsaquo Single left-pointing angle quotation mark 8249 Literal
rsaquo Single right-pointing angle quotation mark 8250 Literal
oline Overline / Spacing overscore 8254
frasl Fraction Slash 8260
euro Euro sign* 8364 Literal
image Blackletter capital I (imaginary part) 8465 Common
weierp Script capital P / Weierstrass p 8472 Common
real Blackletter capital R (real part) 8476 Common
trade Trademark symbol 8482 Literal
alefsym Alef symbol / First transfinite 8501 Common
larr Leftwards arrow 8592
uarr Upwards arrow 8593
rarr Rightwards arrow 8594
darr Downwards arrow 8595
harr Left Right arrow 8596
crarr Downwards arrow with corner leftwards 8629 Common
lArr Leftwards double arrow 8656 Common
uArr Upwards double arrow 8657 Common
rArr Rightwards double arrow 8658
dArr Downwards double arrow 8659 Common
hArr Left Right double arrow 8660
forall For All 8704
part Partial Differential 8706
exist There exists 8707 Common
empty Empty Set 8709 Common
nabla Nabla / Backward difference 8711
isin Element Of... 8712
notin Not an element of 8713 Common
ni Contains as member 8715
prod n-ary product / product sign 8719
sum n-ary summation 8721
minus Minus sign 8722
lowast Asterisk operator 8727 Common
radic Square root / Radical sign 8730
prop Proportional to 8733 Common
infin Infinity symbol 8734
ang Angle 8736
and Logical And / Wedge 8743
or Logical Or / Vee 8744
cap Intersection 8745
cup Union / Cup 8746 Common
int Integral 8747
there4 Therefore 8756
sim Tilde operator 8764
cong Approximately equal to 8773 Common
asymp Almost equal to / Asymptotic 8776
ne Not equal to 8800
equiv Identical to / Equivalent 8801
le Less than or equal to 8804
ge Greater than or equal to 8805
sub Subset of 8834
sup Superset of 8835
nsub Not a subset of 8836 Common
sube Subset of or equal to 8838
supe Superset of or equal to 8839
oplus Circle plus 8853
otimes Circled times 8855 Common
perp Orthogonal / Perpendicular to / Up tack 8869
sdot Dot operator 8901 Common
lceil Left ceiling 8968 Common
rceil Right ceiling 8969 Common
lfloor Left floor 8970 Common
rfloor Right floor 8971 Common
lang Left pointing angle bracket 9001 Common
rang Right pointing angle bracket 9002 Common
loz Lozenge 9674
spades Black Spade suit 9824
clubs Black Clubs suit 9827
hearts Black Hearts suit 9829
diams Black Diamonds suit 9830 Common
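
The note on the nbsp row can be checked with any plain XML parser. The sketch below uses Python's standard xml.etree.ElementTree, which reads no XHTML DTD; the helper name parses_as_xml is made up for this example, and other parsers (or parsers given the XHTML DTD) may behave differently.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(markup: str) -> bool:
    """Return True if a plain XML parser (no DTD loaded) accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

named_ok = parses_as_xml("<p>a&nbsp;b</p>")      # named entity
numeric_ok = parses_as_xml("<p>a&#160;b</p>")    # decimal numeric entity
literal_ok = parses_as_xml("<p>a\u00a0b</p>")    # literal character

print(named_ok, numeric_ok, literal_ok)
```

Only the named form is rejected, because without a DTD the parser has no definition for the name nbsp.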

What is happening, and what does it mean?

The table suggests a few things:

Notes:

  1. Many invisible characters in the literal column are encoded as numeric entities (not literals).
  2. Some characters have different glyphs in different fonts. This includes differences between my copies of Arial and "Arial Unicode MS". Changes include rotation of glyphs.
  3. Some fonts display missing characters inconsistently.
  4. For the common-fonts test, these Windows fonts were used: Times New Roman, Georgia, Courier New, Arial, Tahoma and Verdana. Less common fonts were also tested.
  5. For another view of this situation see the evolt article.
  6. For the technical references see this W3C link.
  7. For web old-timers: Stephen le Hunte's HTML Reference Library had some coverage of this topic.
  8. The numeric entities can also be written in base 16 (hex): &#160; in decimal becomes &#xA0; in hex. This is not discussed here.
  9. In the real world all Unicode glyphs (not just those with a name) need to be handled.
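
One way to act on these notes is to normalise everything to numeric entities before the text goes into the XML tools. The following is only a sketch under that assumption, not the code used in the project: the function name to_numeric_entities and its use_hex flag are hypothetical, and the entity names come from Python's html.entities table (HTML 4 names). Because it handles every non-ASCII character, it also covers glyphs that have no name.

```python
import re
from html.entities import name2codepoint

# The five entities that XML itself predefines; these are left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def to_numeric_entities(text: str, use_hex: bool = False) -> str:
    """Rewrite named entities and non-ASCII literals as numeric entities."""
    def fmt(code: int) -> str:
        return f"&#x{code:X};" if use_hex else f"&#{code};"

    def replace_named(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or XML-predefined: leave untouched
        return fmt(name2codepoint[name])

    text = NAMED_ENTITY.sub(replace_named, text)
    # Any remaining character outside ASCII becomes a numeric entity as well.
    return "".join(ch if ord(ch) < 128 else fmt(ord(ch)) for ch in text)

print(to_numeric_entities("caf&eacute; \u00a3 5 &amp; up"))
print(to_numeric_entities("a&nbsp;b", use_hex=True))
```

The result is pure ASCII, so it should pass unchanged through tools that mishandle literals or do not know the XHTML entity names.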
