RSS, OPML and the XML platform.
Copyright 2012 World Readable
This past week, I pointed to some flawed thinking about RSS, in particular about how to determine whether an <item> in an RSS feed is new content. Today I realized that I pointed out the flaw but didn't tell you how to do it better. The author suggested using <pubDate> as an indicator of new content. This doesn't always work, because some feeds update an item's pubDate when the item is modified. The best approach is to use the <guid> element, but even <guid> is optional. With all these optional fields, plus RSS 1.0 and multiple versions of Atom to worry about, just what is a developer to do? What I did was write a small function that either returns the guid from an RSS item or derives one from the remaining content. The algorithm is really simple; this is Really Simple Syndication, after all. Here's the pseudocode:
if (exist(item.guid)) return item.guid.text;
if (exist(item.link)) return item.link.text;
if (exist(item.pubDate)) return item.pubDate.text;
if (exist(item.title)) return item.title.text;
if (exist(item.description)) return hash(item.description.text);
if (exist(rss10:item.link)) return rss10:item.link.text;
if (exist(atom03:entry.id)) return atom03:entry.id.text;
if (exist(atom10:entry.id)) return atom10:entry.id.text;
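For anyone who wants something closer to running code, here is a minimal Python sketch of the same fallback chain, assuming the item has already been parsed with the standard library's xml.etree.ElementTree. The names item_id and _text are made up for this example, SHA-1 is just one possible choice for hash(), and the namespace URIs are the usual ones for RSS 1.0, Atom 0.3 and Atom 1.0. It follows the order of the pseudocode above rather than claiming to be the exact function I use.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace prefixes in ElementTree's "{uri}tag" notation.
RSS10 = "{http://purl.org/rss/1.0/}"
ATOM03 = "{http://purl.org/atom/ns#}"
ATOM10 = "{http://www.w3.org/2005/Atom}"


def _text(element, tag):
    """Return the stripped text of a direct child, or None if absent or empty."""
    child = element.find(tag)
    if child is None or child.text is None:
        return None
    return child.text.strip() or None


def item_id(item):
    """Return an id for an RSS <item> or Atom <entry>, or None if nothing usable."""
    # RSS 2.0 elements carry no namespace; try them in the pseudocode's order.
    for tag in ("guid", "link", "pubDate", "title"):
        value = _text(item, tag)
        if value:
            return value

    # The description can be long, so hash it instead of returning it whole.
    description = _text(item, "description")
    if description:
        return hashlib.sha1(description.encode("utf-8")).hexdigest()

    # Fall back to the namespaced RSS 1.0 and Atom elements.
    for tag in (RSS10 + "link", ATOM03 + "id", ATOM10 + "id"):
        value = _text(item, tag)
        if value:
            return value

    return None


# Example: an item without a guid falls back to its link.
example = ET.fromstring(
    "<item><title>Hello</title><link>http://example.com/a</link></item>"
)
print(item_id(example))
```

Because ElementTree matches tags by their full namespaced name, the un-namespaced lookups simply miss on RSS 1.0 and Atom documents and the function falls through to the namespaced checks at the end.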
Since an RSS 2.0 item must have at least one of <title> or <description>, you can always derive an id for the items in an RSS feed. Now it's simply a matter of comparing the result of this function to the results for previously sent items, and you know whether this item is new content or not.
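The comparison step might look something like the sketch below. It is only an illustration: new_items, seen_ids and key are hypothetical names, key stands for whatever function derives the id, and a real application would persist the set of ids between runs instead of keeping it in memory.

```python
def new_items(items, seen_ids, key):
    """Yield the items whose derived id is not yet in seen_ids.

    Each new id is added to seen_ids, so a duplicate later in the same
    batch is also filtered out.
    """
    for item in items:
        ident = key(item)
        # Assumption: an item with no derivable id is skipped, not reported.
        if ident is None or ident in seen_ids:
            continue
        seen_ids.add(ident)
        yield item


# Example with plain dicts standing in for parsed feed items.
seen = {"http://example.com/a"}
batch = [
    {"id": "http://example.com/a"},  # already sent
    {"id": "http://example.com/b"},  # new
    {"id": "http://example.com/b"},  # duplicate within the batch
]
fresh = list(new_items(batch, seen, key=lambda item: item["id"]))
print(fresh)
```

A set keeps each lookup cheap, which matters once the list of previously sent items grows.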
Please leave a comment if you have ideas on how to make this better. I'm gonna try to write more informational articles like this one going forward.