This & That Contest - The RSS Blog

Sun, 10 Jun 2007 15:18:12 GMT

The most confusing part of RSS is understanding how to encode textual elements other than the item/description. You see, item/description is special. It's encoded as HTML, which means you double escape certain characters. The problem is all the other textual elements which by definition should not be HTML encoded, but rather they should be plain text. Encoding plain text doesn't sound like much of a problem, but in RSS, where everything is supposed to be really simple, encoding plain text is really difficult. Why? Because many RSS publishers have encoded these other elements as HTML. This, in turn, has led many RSS parsers to double-decode these elements.

The Many Ways of Encoding

The entire problem can be described with a few simple examples of how you might encode and decode an ampersand symbol.

Text to Encode	Single Encoded	Double Encoded
This & That	This &amp; That	This &amp;amp; That
This &amp; That	This &amp;amp; That	This &amp;amp;amp; That
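As a rough illustration of the two encoding cases, here's a minimal Python sketch. It assumes the standard library's xml.sax.saxutils.escape is an acceptable stand-in for whatever escaping routine a publisher actually uses; the names encode_once and encode_twice are made up for this example.

```python
from xml.sax.saxutils import escape


def encode_once(text: str) -> str:
    """Escape &, < and > one time (plain text placed in an XML element)."""
    return escape(text)


def encode_twice(text: str) -> str:
    """Escape as HTML first, then escape again for the XML container."""
    return escape(escape(text))


if __name__ == "__main__":
    for text in ("This & That", "This &amp; That"):
        # Print the text, its single encoding and its double encoding.
        print(repr(text), repr(encode_once(text)), repr(encode_twice(text)))
```

Note that the double encoding of "This & That" and the single encoding of "This &amp; That" come out as the same string, which is where the ambiguity starts.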

Encoding is the easy part. There are only really two cases to consider when encoding. In decoding, we have to address all three cases produced by the encodings.

Text to Decode	Single Decoded	Double Decoded
This &amp; That	This & That	error
This &amp;amp; That	This &amp; That	This & That
This &amp;amp;amp; That	This &amp;amp; That	This &amp; That
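The decoding side, including the error case, can be sketched the same way. This is only an illustration: it assumes a strict XML parser (Python's xml.etree.ElementTree) does the decoding, so a bare ampersand is rejected, and it only handles text without markup. The helper names decode_once and decode_twice are hypothetical.

```python
import xml.etree.ElementTree as ET


def decode_once(text: str) -> str:
    """Decode one level of XML escaping; raise ValueError on a bare ampersand."""
    try:
        # Wrap the text in a throwaway element and let the parser decode it.
        return ET.fromstring(f"<t>{text}</t>").text or ""
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed: {text!r}") from exc


def decode_twice(text: str) -> str:
    """Decode two levels; fails if the first pass leaves a bare ampersand."""
    return decode_once(decode_once(text))


if __name__ == "__main__":
    print(decode_once("This &amp; That"))
    print(decode_twice("This &amp;amp; That"))
    try:
        decode_twice("This &amp; That")
    except ValueError as err:
        print("error:", err)
```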

Entity and Numeric Encoding

There are actually more than 2 ways to encode. Beyond single and double encoding, you can also use entity, decimal numeric and hexadecimal numeric encoding.

Text	Entity Encoding	Decimal Numeric Encoding	Hexadecimal Numeric Encoding
This & That	This &amp; That	This &#38; That	This &#x26; That
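A quick way to convince yourself that the three forms are equivalent is to decode each one. The sketch below uses Python's html.unescape as the decoder, which is my choice for illustration and not something the RSS spec prescribes.

```python
import html

# The same ampersand written as a named entity, a decimal reference
# and a hexadecimal reference.
forms = {
    "entity": "This &amp; That",
    "decimal": "This &#38; That",
    "hexadecimal": "This &#x26; That",
}

decoded = {name: html.unescape(value) for name, value in forms.items()}

if __name__ == "__main__":
    for name, value in decoded.items():
        print(f"{name}: {value}")  # each line ends with "This & That"
```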

Using combinations of each, you can actually encode the same string 12 different ways. The current dispute (problem) has nothing to do with entity or numeric encoding, so I'll use entity encoding for the remainder of this document and I'll never mention numeric encodings again, to save you a bit of confusion.

The Right Way

As I described above, the item/description element is double encoded and everything else is single encoded. Following are example RSS feeds with proper encoding for the title and description elements.

Test Case	Text to Encode	RSS
Test Case #1	This & That	<rss version="2.0"> <channel> <title>This &amp; That</title> <description>This &amp; That</description> <link>https://rssweblog.com/</link> <item> <title>This &amp; That</title> <description>This &amp;amp; That</description> <link>https://rssweblog.com/</link> </item> </channel> </rss>
Test Case #2	This &amp; That	<rss version="2.0"> <channel> <title>This &amp;amp; That</title> <description>This &amp;amp; That</description> <link>https://rssweblog.com/</link> <item> <title>This &amp;amp; That</title> <description>This &amp;amp;amp; That</description> <link>https://rssweblog.com/</link> </item> </channel> </rss>

You can download copies of thisandthat.1.xml and thisandthat.2.xml. You can also subscribe to them using your RSS reader and tell me the results. I tested them in IE7 and both worked as per the RSS spec.
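If you'd rather generate a feed like this than hand-write it, here's a minimal Python sketch. It assumes xml.etree.ElementTree does the XML-level escaping on output, so plain-text elements get their text as-is and only item/description is HTML-escaped first. The function name build_feed is hypothetical, and the feed is deliberately bare-bones.

```python
import html
import xml.etree.ElementTree as ET


def build_feed(text: str, link: str = "https://rssweblog.com/") -> str:
    """Build a tiny RSS 2.0 document with the encoding rules described above."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Plain-text elements: the serializer applies the single XML encoding.
    ET.SubElement(channel, "title").text = text
    ET.SubElement(channel, "description").text = text
    ET.SubElement(channel, "link").text = link
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = text
    # item/description holds HTML, so escape once here; the serializer
    # then adds the second level of encoding.
    ET.SubElement(item, "description").text = html.escape(text, quote=False)
    ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(build_feed("This & That"))
```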

The Wrong Way

This is all nice, but we don't live in a perfect world and, as noted, some developers are double encoding their titles. Here are some examples of feeds that are not properly encoded.

Test Case	Text to Encode	RSS
Test Case #3	This & That	<rss version="2.0"> <channel> <title>This &amp;amp; That</title> <description>This &amp;amp; That</description> <link>https://rssweblog.com/</link> <item> <title>This &amp;amp; That</title> <description>This &amp;amp; That</description> <link>https://rssweblog.com/</link> </item> </channel> </rss>
Test Case #4	This &amp; That	<rss version="2.0"> <channel> <title>This &amp;amp;amp; That</title> <description>This &amp;amp;amp; That</description> <link>https://rssweblog.com/</link> <item> <title>This &amp;amp;amp; That</title> <description>This &amp;amp;amp; That</description> <link>https://rssweblog.com/</link> </item> </channel> </rss>

You can download copies of thisandthat.3.xml and thisandthat.4.xml. You can also subscribe to them using your RSS reader and tell me the results. I tested them in IE7 and both fail as expected.

The Compromise

Some have decided that, since some people are doing the right thing and some people are doing the wrong thing, they could try to predict which is which and present the end user with a best guess. The strategy here is to double decode every textual element and, if that results in an error condition, single decode it instead. This strategy will actually compensate for some feeds that are improperly encoded. In fact, both test case #3 and #4 will actually work using this compromise. The problem? Test case #2 will actually fail using this compromise.
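Here's a rough Python sketch of the compromise as I've described it, so you can see where it breaks. It assumes a strict XML parser signals the "error condition" by rejecting a bare ampersand; the function names are hypothetical and no particular reader is claimed to work exactly this way.

```python
import xml.etree.ElementTree as ET


def decode_once(text: str) -> str:
    """Decode one level of XML escaping; raise ValueError on a bare ampersand."""
    try:
        return ET.fromstring(f"<t>{text}</t>").text or ""
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed: {text!r}") from exc


def compromise_decode(encoded: str) -> str:
    """Try to double decode; fall back to single decode if the second pass fails."""
    once = decode_once(encoded)
    try:
        return decode_once(once)
    except ValueError:
        return once


if __name__ == "__main__":
    # Properly single-encoded title for the text "This & That": comes out right.
    print(compromise_decode("This &amp; That"))
    # Ambiguous input. If the publisher wrongly double encoded "This & That",
    # the result is the intended text. If the publisher correctly single
    # encoded the literal text "This &amp; That", the result is wrong,
    # because the visible entity is lost.
    print(compromise_decode("This &amp;amp; That"))
```

The second call is the whole dispute in one line: the parser can't tell a wrongly double-encoded title from a correctly encoded title that happens to contain an entity.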

The Contest

In the next few days, I want to make a push to correct vendor bugs in this area. There are two tests. First, I'm going to ask my readers to publish two blog entries (cases #1 and #2 above), point me to the entries and tell me what the RSS item fragment looks like. Second, I'm going to ask my readers to subscribe to cases #1 and #2 above and tell me which RSS reader they use and how the feeds look in it. For every test case that succeeds, the reader earns 1 point. For every test case that fails, the reader earns 2 points. Don't worry about duplicate entries; they count, as long as you actually did the work. Whoever gets the most points wins ($30 of whatever they want from Amazon). I'm also gonna do a random draw of all the participants for another prize ($20 of whatever they want from Amazon).

The Goal

Once the results start coming in, I'll publish them and file bug reports with the vendors. The goal, for me, is to improve the encoding for RSS.

Corrections

I may have typos or mistakes above. If you point them out, I'll award you another bonus point.

Update: As James and Chip noted in comments, the links to test case #3 and #4 were wrong. James points out that I missed CDATA encoding in my enumeration of encoding methods (that was a glaring oops). Two points for James and 1 to Chip for typo corrections.

Categories: rss, xml

Reader Comments

Sun, 10 Jun 2007 20:21:04 GMT

User comment

Your links for 3 and 4 link to 1 and 2. --Chip Camden

Sun, 10 Jun 2007 20:23:08 GMT

User comment

It looks like FeedDemon does the double-decoding for titles in the reading pane, but only single-decoding for the blog title in the feeds pane.

Within the text of the entry, FeedDemon only does single-decoding.

Mon, 11 Jun 2007 14:04:48 GMT

User comment

MyYahoo Beta seems to use the compromise strategy.

Randy

Mon, 11 Jun 2007 14:09:05 GMT

User comment

Google Reader uses the compromise strategy for item/title, but not channel/title.

Randy

Mon, 11 Jun 2007 14:11:17 GMT

User comment

NetVibes seems to use the compromise strategy for channel/title, but not item/title.

Randy

Mon, 11 Jun 2007 14:13:16 GMT

User comment

iGoogle seems to use the compromise strategy.

Randy

Tue, 12 Jun 2007 04:15:57 GMT

User comment

Bloglines seems to use the compromise strategy.
4 Jason Schramm

Randy

Tue, 13 Jan 2009 16:13:36 GMT

User comment

I would interpret "The problem is all the other textual elements which by definition should not be HTML encoded, but rather they should be plain text." to mean that text formatting/markup is not allowed in any field other than item/description. Is that right?