Arc Forum
3 points by dpkendal 5200 days ago | link | parent

It's not that they're troublesome to render in simple circumstances, but consider:

1. Alice is writing a blog entry declaring her unrequited love for Bob.

2. Alice includes a quote from Bob's blog by copying and pasting. Bob's blog is encoded in UTF-8, and the copied section includes some Unicode curly quotes, because Bob is typographically sensitive.

3. Alice is using a backwards text editor which refuses to believe in anything other than 1 byte = 1 character, with plain ASCII in the lower half. On screen, the characters look exactly as Alice and Bob intended (such a third-rate text editor is sure to be using the operating system's standard text editing control, which will know about such tricks), but internally it has crapped up the representation, and lo:

4. When Alice posts this to her ISO-8859-1-encoded blog, she has mixed character sets and invalid XML, and all of a sudden her well-written love letter is smeared with �s where there ought to be “s and ”s.
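To make the failure concrete, here is a rough Python sketch of one way a page can end up with mixed encodings. The sample strings and variable names are made up for illustration, and the exact damage in Alice's case would depend on her editor and her readers' browsers; this only shows that no single decoding of such a page comes out clean.

```python
# Hypothetical sample text: Alice's own words, saved as ISO-8859-1,
# followed by Bob's quote, pasted in as raw UTF-8 bytes.
alice_text = "Bob wrote in a café: "
bob_quote = "\u201chello\u201d"  # curly quotes around "hello"

page = alice_text.encode("iso-8859-1") + bob_quote.encode("utf-8")

# Read as ISO-8859-1: each UTF-8 byte of a curly quote becomes its own
# character, so the quotes turn into mojibake starting with "â".
as_latin1 = page.decode("iso-8859-1")

# Read as UTF-8: Bob's quotes survive, but Alice's lone 0xE9 byte ("é")
# is not valid UTF-8 and is replaced with U+FFFD.
as_utf8 = page.decode("utf-8", errors="replace")

print(repr(as_latin1))
print(repr(as_utf8))
```

Either way, something on the page is damaged, which is the argument for sending only ASCII through the editor in the first place.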

Further reading:

- http://textism.com/article/663/pomegranate

- http://daringfireball.net/2003/02/short_and_curlies

- http://www.alistapart.com/articles/emen/



1 point by aw 5199 days ago | link

Ah, I understand. Your goal is to encode Unicode text in HTML which is itself encoded in ASCII only, so it can be transmitted through channels (such as bad editors) which mess up non-ASCII text.

To encode any Unicode text this way we'd also want to encode HTML special characters such as <, >, and &; but that could be done before calling your routine.
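A minimal Python sketch of that two-step idea, not the routine discussed in this thread: the function name `to_ascii_html` is made up, and it leans on the standard library's `html.escape` and the `xmlcharrefreplace` error handler. The special characters are escaped first, so the `&` in the numeric references added afterwards is left alone.

```python
import html


def to_ascii_html(text: str) -> str:
    """Return ASCII-only HTML text for an arbitrary Unicode string."""
    # Step 1: escape <, > and & (quote=False leaves " and ' untouched).
    escaped = html.escape(text, quote=False)
    # Step 2: turn every non-ASCII character into a numeric character
    # reference such as &#8220; so only ASCII bytes remain.
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


result = to_ascii_html("\u201cTom & Jerry\u201d <3")
print(result)  # → &#8220;Tom &amp; Jerry&#8221; &lt;3
```

Doing the steps in the other order would escape the ampersands of the numeric references themselves and show them literally on the page.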

-----