RSS, OPML and the XML platform.
Copyright 2012 World Readable
The most confusing part of RSS is understanding how to encode textual elements other than item/description. You see, item/description is special. It's encoded as HTML, which means certain characters end up escaped twice: once for HTML and once more for XML. The problem is all the other textual elements, which by definition should not be HTML encoded; they should be plain text. Encoding plain text doesn't sound like much of a problem, and most of RSS really is simple, but encoding plain text turns out to be difficult. Why? Because many RSS publishers have encoded these other elements as HTML. This, in turn, has led many RSS parsers to double decode these elements.
The entire problem can be described with a few simple examples of how you might encode and decode an ampersand symbol.
|Text to Encode||Single Encoded||Double Encoded|
|This & That||This &amp; That||This &amp;amp; That|
|This &amp; That||This &amp;amp; That||This &amp;amp;amp; That|
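The two encoding cases can be sketched in a few lines of Python. This is only an illustration, assuming the standard library's `xml.sax.saxutils.escape` as the encoder; the helper names `encode_once` and `encode_twice` are made up for this example.

```python
from xml.sax.saxutils import escape


def encode_once(text: str) -> str:
    """Single encode: replace &, < and > with entity references."""
    return escape(text)


def encode_twice(text: str) -> str:
    """Double encode: escape the already-escaped text a second time."""
    return escape(escape(text))


if __name__ == "__main__":
    # The two inputs from the table: a bare ampersand, and text that
    # already contains the literal characters "&amp;".
    for text in ("This & That", "This &amp; That"):
        print(text, "|", encode_once(text), "|", encode_twice(text))
```

Each pass adds exactly one more `amp;` after every ampersand, which is why the encoding side never fails.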
Encoding is the easy part. There are only really two cases to consider when encoding. In decoding, we have to address all three cases produced by the encodings.
|Text to Decode||Single Decoded||Double Decoded|
|This &amp; That||This & That||error|
|This &amp;amp; That||This &amp; That||This & That|
|This &amp;amp;amp; That||This &amp;amp; That||This &amp; That|
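Decoding is where the error case shows up. The sketch below assumes a strict decoder that treats a bare ampersand as an error, the way an XML parser would; the names `decode_once` and `decode_twice` are hypothetical and it only knows the five predefined XML entities.

```python
import re

# The five entities predefined by XML.
_ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# Matches "&name;" or, failing that, a lone "&" (group 1 is then None).
_TOKEN = re.compile(r"&(?:(\w+);)?")


def decode_once(text: str) -> str:
    """Strictly decode one level of entity references."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in _ENTITIES:
            # Bare "&" or an unknown entity: refuse to guess.
            raise ValueError(f"malformed reference at offset {match.start()}")
        return _ENTITIES[name]

    return _TOKEN.sub(replace, text)


def decode_twice(text: str) -> str:
    """Decode two levels; raises if the first pass leaves a bare '&'."""
    return decode_once(decode_once(text))


if __name__ == "__main__":
    for text in ("This &amp; That", "This &amp;amp; That", "This &amp;amp;amp; That"):
        try:
            twice = decode_twice(text)
        except ValueError:
            twice = "error"
        print(text, "|", decode_once(text), "|", twice)
```

A lenient decoder such as `xml.sax.saxutils.unescape` would silently pass a bare ampersand through instead of raising, so the error case only exists if the decoder is strict.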
There are actually more than two ways to encode. Beyond single and double encoding, you can also choose between entity, decimal numeric and hexadecimal numeric encoding.
|Text||Entity Encoding||Decimal Numeric Encoding||Hexadecimal Numeric Encoding|
|This & That||This &amp; That||This &#38; That||This &#x26; That|
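For the curious, here is a small sketch showing that the entity, decimal and hexadecimal forms are interchangeable. It assumes Python's `html.unescape` as the decoder; the helper names `to_decimal` and `to_hex` are made up for this example.

```python
import html


def to_decimal(ch: str) -> str:
    """Decimal numeric character reference for a single character."""
    return f"&#{ord(ch)};"


def to_hex(ch: str) -> str:
    """Hexadecimal numeric character reference for a single character."""
    return f"&#x{ord(ch):X};"


if __name__ == "__main__":
    forms = [
        "This &amp; That",               # entity encoding
        f"This {to_decimal('&')} That",  # decimal numeric encoding
        f"This {to_hex('&')} That",      # hexadecimal numeric encoding
    ]
    for form in forms:
        # Every form decodes back to the same plain text.
        print(form, "->", html.unescape(form))
```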
Using combinations of each, you can actually encode the same string 12 different ways. The current dispute (problem) has nothing to do with entity or numeric encoding, so I'll use entity encoding for the remainder of this document and I'll never mention numeric encodings again, to save you a bit of confusion.
As I described above, the item/description element is double encoded and everything else is single encoded. Following are example RSS feeds with proper encoding for the title and description elements.
|Test Case||Text to Encode||RSS|
|Test Case #1||This & That||
|Test Case #2||This &amp; That||
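To make the rule concrete, here is a rough sketch of how a publisher might produce a properly encoded item. It is not the markup of the test feeds themselves, just an illustration using Python's `xml.etree.ElementTree`; the function name `build_item` is hypothetical and the item is shown without its surrounding channel.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_item(text: str) -> str:
    """Build an <item> whose title is plain text and whose description is HTML."""
    item = ET.Element("item")
    # Title: hand over the plain text; the serializer escapes it once.
    ET.SubElement(item, "title").text = text
    # Description: HTML-encode first, then the serializer escapes again,
    # which gives the double encoding.
    ET.SubElement(item, "description").text = escape(text)
    return ET.tostring(item, encoding="unicode")


if __name__ == "__main__":
    print(build_item("This & That"))
    print(build_item("This &amp; That"))
```

Parsing the result with an XML parser gives back the original text for the title and the HTML-encoded text for the description, which is what a reader should see.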
You can download copies of thisandthat.1.xml and thisandthat.2.xml. You can also subscribe to them using your RSS reader and tell me the results. I tested them in IE7 and both worked as per the RSS spec.
This is all nice, but we don't live in a perfect world and, as noted, some developers are double encoding their titles. Here are some examples of feeds that are not properly encoded.
|Test Case||Text to Encode||RSS|
|Test Case #3||This & That||<title>This &amp;amp; That</title>|
|Test Case #4||This &amp; That||
Some have decided that, since some people are doing the right thing and some are doing the wrong thing, they could try to predict which is which and present the end user with a best guess. The strategy here is to double decode every textual element and, if that results in an error condition, single decode it instead. This strategy will actually compensate for some feeds that are improperly encoded. In fact, both test case #3 and #4 will work using this compromise. The problem? Test case #2 will fail using this compromise: its title is properly single encoded, but it also double decodes without an error, so the reader shows the wrong text.
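Here is a minimal sketch of that best-guess strategy, run against the title of each of the four test cases. It assumes a strict decoder that raises on a bare ampersand; the names `decode_once` and `best_guess` are made up for this example, and the raw titles are what the encoding rules above imply rather than text copied from the feeds.

```python
import re

_ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}
_TOKEN = re.compile(r"&(?:(\w+);)?")


def decode_once(text: str) -> str:
    """Strictly decode one level; a bare '&' or unknown entity is an error."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in _ENTITIES:
            raise ValueError("malformed reference")
        return _ENTITIES[name]

    return _TOKEN.sub(replace, text)


def best_guess(raw_title: str) -> str:
    """Double decode; fall back to a single decode if that fails."""
    try:
        return decode_once(decode_once(raw_title))
    except ValueError:
        return decode_once(raw_title)


# (test case, intended text, title as it appears in the feed's XML)
CASES = [
    (1, "This & That", "This &amp; That"),              # proper
    (2, "This &amp; That", "This &amp;amp; That"),      # proper
    (3, "This & That", "This &amp;amp; That"),          # double encoded
    (4, "This &amp; That", "This &amp;amp;amp; That"),  # double encoded
]

if __name__ == "__main__":
    for number, intended, raw in CASES:
        shown = best_guess(raw)
        verdict = "ok" if shown == intended else "WRONG"
        print(f"Test Case #{number}: shows {shown!r} ({verdict})")
```

Notice that cases #2 and #3 have the same raw title but different intended text, so no decoder looking at the title alone can get both right.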
In the next few days, I want to make a push to correct vendor bugs in this area. There are two tests. I'm going to ask my readers to publish two blog entries (case #1 and #2 above), point to the entries and tell me what the RSS item fragment looks like. I'm also going to ask my readers to subscribe to case #1 and #2 above and tell me which RSS reader they use and how the items look in that reader. For every test case that succeeds, the reader earns 1 point. For every test case that fails, the reader earns 2 points. Don't worry about duplicate entries; they count, as long as you actually did the work. Whoever gets the most points wins ($30 of whatever they want from Amazon). I'm also gonna do a random draw of all the participants for another prize ($20 of whatever they want from Amazon).
Once the results start coming in, I'll publish them and file bug reports with the vendors. The goal, for me, is to improve the encoding for RSS.
I may have typos or mistakes above. If you point them out, I'll award you another bonus point.
Update: As James and Chip noted in comments, the links to test case #3 and #4 were wrong. James points out that I missed CDATA encoding in my enumeration of encoding methods (that was a glaring oops). Two points for James and 1 to Chip for typo corrections.