The RSS Blog

News and commentary from the RSS and OPML community.

The most confusing part of RSS is understanding how to encode textual elements other than item/description. You see, item/description is special. It's encoded as HTML, which means certain characters get escaped twice: once as HTML and once more as XML. The problem is all the other textual elements which by definition should not be HTML encoded, but rather they should be plain text. Encoding plain text doesn't sound like much of a problem, and most of RSS really is simple, but in practice encoding plain text has turned out to be difficult. Why? Because many RSS publishers have encoded these other elements as HTML. This, in turn, has led many RSS parsers to double decode these elements.

The Many Ways of Encoding

The entire problem can be described with a few simple examples of how you might encode and decode an ampersand symbol.

Text to Encode | Single Encoded | Double Encoded
This & That | This &amp; That | This &amp;amp; That
This &amp; That | This &amp;amp; That | This &amp;amp;amp; That
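The two encoding cases in the table above can be sketched in a few lines of Python. This is only an illustration: it assumes the standard library's `xml.sax.saxutils.escape` as the encoder, and the helper names `encode_once` and `encode_twice` are made up for this example.

```python
from xml.sax.saxutils import escape


def encode_once(text: str) -> str:
    """Single encode: escape &, < and > one time."""
    return escape(text)


def encode_twice(text: str) -> str:
    """Double encode: escape the already escaped text a second time."""
    return escape(escape(text))


# The two rows of the table: a bare ampersand, and text that already
# contains the literal characters "&amp;".
for text in ("This & That", "This &amp; That"):
    print(repr(text), "->", repr(encode_once(text)), "->", repr(encode_twice(text)))
```

Note that the single encoding of the second row and the double encoding of the first row come out as the same characters, which is where the trouble on the decoding side starts.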

Encoding is the easy part. There are only really two cases to consider when encoding. In decoding, we have to address all three cases produced by the encodings.

Text to Decode | Single Decoded | Double Decoded
This &amp; That | This & That | error
This &amp;amp; That | This &amp; That | This & That
This &amp;amp;amp; That | This &amp;amp; That | This &amp; That
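Here is a rough Python sketch of the three decoding cases. It assumes a strict decoder that treats a bare ampersand as an error, which is what produces the "error" cell in the table; `decode_once`, `decode_twice` and `BARE_AMPERSAND` are names invented for this example, and only the `&amp;`, `&lt;` and `&gt;` entities are handled.

```python
import re
from xml.sax.saxutils import unescape

# An ampersand that does not start one of the entities handled here.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt);)")


def decode_once(text: str) -> str:
    """Single decode; raise ValueError if the text holds a bare ampersand."""
    if BARE_AMPERSAND.search(text):
        raise ValueError(f"bare ampersand in {text!r}")
    return unescape(text)


def decode_twice(text: str) -> str:
    """Double decode: run the strict single decode two times."""
    return decode_once(decode_once(text))


for text in ("This &amp; That", "This &amp;amp; That", "This &amp;amp;amp; That"):
    try:
        twice = repr(decode_twice(text))
    except ValueError:
        twice = "error"
    print(repr(text), "->", repr(decode_once(text)), "->", twice)
```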

Entity and Numeric Encoding

There are actually more than two ways to encode. Beyond single and double encoding, you can also use entity, decimal numeric, and hexadecimal numeric encoding.

Text | Entity Encoding | Decimal Numeric Encoding | Hexadecimal Numeric Encoding
This & That | This &amp; That | This &#38; That | This &#x26; That
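As a quick check that the three forms are interchangeable, here is a small Python sketch using the standard library's `html.unescape`, which understands named, decimal and hexadecimal references. The `forms` dictionary is just an illustrative container for the three spellings.

```python
from html import unescape

# Three spellings of the same ampersand.
forms = {
    "entity": "This &amp; That",
    "decimal": "This &#38; That",
    "hexadecimal": "This &#x26; That",
}

# All three decode to the same plain text.
decoded = {name: unescape(value) for name, value in forms.items()}
for name, value in decoded.items():
    print(f"{name:12} {forms[name]!r} -> {value!r}")
```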
 

Using combinations of each, you can actually encode the same string 12 different ways. The current dispute (problem) has nothing to do with entity or numeric encoding, so I'll use entity encoding for the remainder of this document and I'll never mention numeric encodings again, to save you a bit of confusion.

The Right Way

As I described above, the item/description element is double encoded and everything else is single encoded. Following are example RSS feeds with proper encoding for the title and description elements.

Test Case #1
Text to encode: This & That
RSS:

<rss version="2.0">
<channel> 
  <title>This &amp; That</title>
  <description>This &amp; That</description> 
  <link>https://rssweblog.com/</link> 
  <item> 
    <title>This &amp; That</title>
    <description>This &amp;amp; That</description> 
    <link>https://rssweblog.com/</link> 
  </item>
</channel>
</rss> 

Test Case #2
Text to encode: This &amp; That
RSS:

<rss version="2.0">
<channel> 
  <title>This &amp;amp; That</title>
  <description>This &amp;amp; That</description> 
  <link>https://rssweblog.com/</link> 
  <item> 
    <title>This &amp;amp; That</title>
    <description>This &amp;amp;amp; That</description> 
    <link>https://rssweblog.com/</link> 
  </item>
</channel>
</rss> 

 

You can download copies of thisandthat.1.xml and thisandthat.2.xml. You can also subscribe to them using your RSS reader and tell me the results. I tested them in IE7 and both worked as per the RSS spec.
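For publishers, one way to get this right is to let an XML library do the XML-level escaping and to add the HTML-level escaping yourself only for item/description. The following Python sketch rebuilds test cases #1 and #2 that way. It is an illustration under that assumption, not the code used to produce the downloadable files; `build_feed` is a hypothetical helper, and the output is not pretty-printed.

```python
import xml.etree.ElementTree as ET
from html import escape


def build_feed(text: str) -> str:
    """Return an RSS 2.0 document in which every element is plain text
    except item/description, which is treated as HTML."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = text
    ET.SubElement(channel, "description").text = text
    ET.SubElement(channel, "link").text = "https://rssweblog.com/"
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = text
    # Escape once as HTML here; the serializer adds the XML-level escape.
    ET.SubElement(item, "description").text = escape(text, quote=False)
    ET.SubElement(item, "link").text = "https://rssweblog.com/"
    return ET.tostring(rss, encoding="unicode")


print(build_feed("This & That"))      # test case #1
print(build_feed("This &amp; That"))  # test case #2
```

The point of the design is that nobody escapes by hand for XML: the plain-text elements get exactly one level of encoding from the serializer, and item/description gets exactly two.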

The Wrong Way

This is all nice, but we don't live in a perfect world and, as noted, some developers are double encoding their titles. Here are some examples of feeds that are not properly encoded.

Test Case #3
Text to encode: This & That
RSS:

<rss version="2.0">
<channel>

  <title>This &amp;amp; That</title>
  <description>This &amp;amp; That</description> 
  <link>https://rssweblog.com/</link> 
  <item> 
    <title>This &amp;amp; That</title>
    <description>This &amp;amp; That</description> 
    <link>https://rssweblog.com/</link> 
  </item>
</channel>
</rss> 

Test Case #4
Text to encode: This &amp; That
RSS:

<rss version="2.0">
<channel> 
  <title>This &amp;amp;amp; That</title>
  <description>This &amp;amp;amp; That</description> 
  <link>https://rssweblog.com/</link> 
  <item> 
    <title>This &amp;amp;amp; That</title>
    <description>This &amp;amp;amp; That</description> 
    <link>https://rssweblog.com/</link> 
  </item>
</channel>
</rss> 

You can download copies of thisandthat.3.xml and thisandthat.4.xml. You can also subscribe to them using your RSS reader and tell me the results. I tested them in IE7 and both fail as expected.

The Compromise

Some have decided that, since some publishers are doing the right thing and some are doing the wrong thing, they can try to predict which and present the end user with a best guess. The strategy here is to double decode every textual element and, if that results in an error condition, to single decode it instead. This strategy compensates for some feeds that are improperly encoded. In fact, test cases #3 and #4 both work using this compromise. The problem? Test case #2 fails using this compromise, because its correctly encoded title is character-for-character identical to the improperly encoded title of test case #3.
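The compromise is easy to sketch in Python, and running it against the four test-case titles shows where it breaks. This is a guess at the general shape of the strategy, not any vendor's actual code. The names `strict_decode` and `compromise_decode` are invented for this example, and a bare ampersand is assumed to be the error condition.

```python
import re
from xml.sax.saxutils import unescape

# An ampersand that does not start one of the entities handled here.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt);)")


def strict_decode(text: str) -> str:
    """Decode one level; raise ValueError on a bare ampersand."""
    if BARE_AMPERSAND.search(text):
        raise ValueError(f"bare ampersand in {text!r}")
    return unescape(text)


def compromise_decode(text: str) -> str:
    """Double decode, falling back to single decode if the second pass fails."""
    once = strict_decode(text)
    try:
        return strict_decode(once)
    except ValueError:
        return once


# (test case, title as it appears in the feed, text the publisher meant)
cases = [
    ("#1", "This &amp; That", "This & That"),
    ("#2", "This &amp;amp; That", "This &amp; That"),
    ("#3", "This &amp;amp; That", "This & That"),
    ("#4", "This &amp;amp;amp; That", "This &amp; That"),
]
for name, encoded, intended in cases:
    shown = compromise_decode(encoded)
    print(name, repr(shown), "ok" if shown == intended else "WRONG")
```

Cases #2 and #3 feed the decoder the same input but expect different output, so no guessing strategy can satisfy both.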

The Contest

In the next few days, I want to make a push to correct vendor bugs in this area. There are two tests. First, I'm going to ask my readers to publish two blog entries (case #1 and #2 above), point me to the entries and tell me what the RSS item fragment looks like. Second, I'm going to ask my readers to subscribe to case #1 and #2 above and tell me which RSS reader they use and how the feeds look in that reader. For every test case that succeeds, the reader earns 1 point. For every test case that fails, the reader earns 2 points. Don't worry about duplicate entries; they count, as long as you actually did the work. Whoever gets the most points wins ($30 of whatever they want from Amazon). I'm also gonna do a random draw of all the participants for another prize ($20 of whatever they want from Amazon).

The Goal

Once the results start coming in, I'll publish them and file bug reports with the vendors. The goal, for me, is to improve the encoding for RSS.

Corrections

I may have typos or mistakes above. If you point them out, I'll award you another bonus point.

Update: As James and Chip noted in comments, the links to test case #3 and #4 were wrong. James points out that I missed CDATA encoding in my enumeration of encoding methods (that was a glaring oops). Two points for James and 1 to Chip for typo corrections.

Reader Comments
Your links for 3 and 4 link to 1 and 2.  --Chip Camden
It looks like FeedDemon does the double-decoding for titles in the reading pane, but only single-decoding for the blog title in the feeds pane.


Within the text of the entry, FeedDemon only does single-decoding.

MyYahoo Beta seems to use the compromise strategy.

Randy

Google Reader uses the compromise strategy for item/title, but not channel/title.

Randy

NetVibes seems to use the compromise strategy for channel/title, but not item/title.

Randy

iGoogle seems to use the compromise strategy.

Randy

Bloglines seems to use the compromise strategy.
4 Jason Schramm

Randy

I would interpret "The problem is all the other textual elements which by definition should not be HTML encoded, but rather they should be plain text." to mean that text formatting/markup is not allowed in any field other than item/description.  Is that right?
