RSS Feed State Draft - The RSS Blog
RSS, OPML and the XML platform.
 
Copyright 2003-5 Randy Charles Morin
The RSS Blog
Sun, 14 Jan 2007 01:28:42 GMT
RSS Feed State Draft

This is a first draft of an RSS Feed State specification that I intend to present to the RSS Advisory Board for adoption as a specification. Any feedback would be great. Typo checkers would be very welcome. Grammar checkers would be adored. It's based on an older RSS feed state document I wrote several years back.

RSS Feed State

by Randy Charles Morin

Introduction #

The RSS Spec describes an XML format, but does not describe how RSS feeds change or when and how RSS clients should be checking for updates to the RSS file. This specification attempts to fill this gap.

RSS is just XML #

It's very important to remember that RSS is just structured XML; that is, the elements, attributes and their order are defined by a specification. There are three widely used RSS formats: RSS 0.91, RSS 1.0 and RSS 2.0. I suggest that RSS publishers use RSS 2.0, as it's the fastest growing of the formats and the most widely supported by RSS readers. Because it's just XML, most of this document is about XML feed state and applies equally well to other XML formats, like CDF and Atom. In fact, many of the techniques described herein apply equally well to other non-XML formats. It's not like we're inventing the wheel here; these techniques have been used for years by Web clients and servers to interchange documents.

Syndication Hints #

RSS itself specifies several techniques for guiding the RSS reader in pulling RSS feeds over the Web. These techniques are quite often neglected by both RSS publishers and RSS readers. In order for these syndication hints to be effective, both the RSS publisher and the RSS reader must respect them.

skipHours and skipDays #

Who should implement?
RSS readers MUST and RSS publishers CAN.

How useful is this mechanism?
Very useful in specific circumstances.
Not widely implemented.

Many of us sleep and during those sleeping hours, we rarely blog. Many of us work and during those working hours, we rarely blog. So, why then are RSS readers pulling our feeds during those down hours? Well, truth is, they don't have to. RSS 2.0 and 0.91 both implement a great syndication hint that tells RSS readers when to avoid reading the RSS feed. By adding these elements to our RSS feeds, we're telling RSS readers to stop polling during certain hours or even days. This can have a very positive effect on the bandwidth requirements of your Weblog. The following is an example RSS 2.0 feed that tells the RSS reading client not to poll the RSS feed during the six hours from 6AM GMT until noon GMT, nor to poll at all on Sundays.

<rss version="2.0">
   <channel>
      <description>News and commentary from the cross-platform scripting community.</description>
      <link>http://www.scripting.com/</link>
      <title>Scripting News</title>
      <skipHours>
         <hour>6</hour>
         <hour>7</hour>
         <hour>8</hour>
         <hour>9</hour>
         <hour>10</hour>
         <hour>11</hour>
      </skipHours>
      <skipDays>
         <day>Sunday</day>
      </skipDays>
      <item>
         <title>stuff</title>
      </item>
   </channel>
</rss>
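To make the hint concrete, here's a minimal Python sketch of the reader-side check. The function name and shape are my own illustration, not taken from any particular RSS reader; times are evaluated in GMT, as the spec requires.

```python
from datetime import datetime, timezone

def should_poll(skip_hours, skip_days, now=None):
    """Return False when the current GMT hour or day appears in the
    feed's skipHours/skipDays lists, True otherwise."""
    now = now or datetime.now(timezone.utc)
    if now.strftime("%A") in skip_days:   # e.g. "Sunday"
        return False
    if now.hour in skip_hours:            # e.g. 6 through 11
        return False
    return True

# Values from the example feed above: skip 6AM-noon GMT and all of Sunday.
skip_hours = {6, 7, 8, 9, 10, 11}
skip_days = {"Sunday"}
```

A reader would run this test just before each scheduled poll and simply skip the fetch when it returns False.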

This would reduce the bandwidth required to serve the feed by about one third. Of course, this depends entirely on whether your readers use well-behaved RSS readers and on the times of day they read your blog. Of course, if you blog all days of the week and all hours of the day, then this syndication hint won't be of much help.

A technique I once used to reduce bandwidth during rare blogging hours was to list every second hour in skipHours. This allowed RSS readers to poll my feed every second hour during my non-blogging hours and every hour otherwise. I didn't want a six-hour polling gap, just in case I was awake at 3AM and wanted to get my message out as quickly as possible.
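For illustration, skipping alternate overnight hours looks like this (the particular hours here are hypothetical):

```xml
<skipHours>
   <hour>1</hour>
   <hour>3</hour>
   <hour>5</hour>
</skipHours>
```

A well-behaved reader then polls at most every second hour overnight, but hourly the rest of the day.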

TTL #

Who should implement?
RSS readers SHOULD, centralized RSS
aggregators MUST and RSS publishers CAN.

How useful is this mechanism?
Somewhat useful.

TTL or time to live is another great syndication hint available in RSS 2.0. It's defined as "a number of minutes that indicates how long a channel can be cached before refreshing from the source." It's a hint telling you how long you can cache the RSS feed. An RSS reader could use this hint to automatically set the polling interval for the RSS feed. The following is an example RSS feed that sets the refresh hint to two hours.

<rss version="2.0">
   <channel>
      <description>News and commentary from the cross-platform scripting community.</description>
      <link>http://www.scripting.com/</link>
      <title>Scripting News</title>
      <ttl>120</ttl>
      <item>
         <title>stuff</title>
      </item>
   </channel>
</rss>

Most RSS readers poll the source RSS feeds once per hour by default . If you don't blog very often and are not concerned with how quickly your message is read by your readers, then a larger TTL value can significantly reduce the bandwidth requirements of your RSS feed. On the other hand, if you want to get your message out there quickly and are not worried about the bandwidth consumption, then a lower TTL can get RSS clients to pull your RSS feed more often.

It's very important to note that nobody is suggesting an RSS reader can't poll the RSS feed more frequently than the TTL value indicates. Rather, the TTL value is telling the RSS reader that the feed data is good for so many minutes and that it only needs to refresh from the source when the TTL is exceeded. This is a very small distinction, but an important one, because there's no contract that says an RSS client can't poll an RSS feed every five minutes regardless of the TTL value.
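A reader-side sketch of turning the hint into a polling interval might look like this. The function, the hourly default and the five-minute floor are my own assumptions, not anything mandated by the RSS 2.0 spec:

```python
def polling_interval(ttl_text, default=60, floor=5):
    """Derive a polling interval in minutes from a feed's <ttl> element.
    ttl_text is the element's text content, or None when the feed has
    no <ttl>. Malformed values fall back to the default."""
    if ttl_text is None:
        return default
    try:
        ttl = int(ttl_text)
    except ValueError:
        return default
    return max(ttl, floor)  # be polite: never poll faster than the floor
```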

Syndication Module #

Who should implement?
RSS readers SHOULD and RSS publishers CAN.

How useful is this mechanism?
Somewhat useful, but not widely implemented.

RSS 1.0 implements a mechanism similar to TTL called the RDF Site Summary Syndication Module. This module is a bit more flexible than TTL, but rarely used. Again, the technique is not a contract telling RSS readers to limit polling to this interval, but rather a hint to the RSS reader as to how often the feed is generally updated. Although the syndication module is intended for RSS 1.0, the extensibility of RSS 2.0 allows you to use it, but you might find it's not well supported.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
   xmlns="http://purl.org/rss/1.0/">
   <channel rdf:about="http://meerkat.oreillynet.com/?_fl=rss1.0">
      <title>Meerkat</title>
      <link>http://meerkat.oreillynet.com</link>
      <description>Meerkat: An Open Wire Service</description>
      <sy:updatePeriod>hourly</sy:updatePeriod>
      <sy:updateFrequency>2</sy:updateFrequency>
      <sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
   </channel>
</rdf:RDF>

The precision of this RSS extension would normally allow you to control the polling of the RSS feed exactly, but both the RSS publisher and the RSS reader must implement the mechanism, and it is sparsely implemented.

HTTP is not stateless #

To this point, I've identified a few RSS mechanisms for controlling the state of RSS feeds. The next few sections describe the state of your RSS feed as a Web document. Remember that your RSS feed is really just another object that is transferred over HTTP, the protocol of the Web. As such, it takes on all the attributes of HTTP objects and, contrary to what we'd like to believe, HTTP is not stateless. Whenever you make an HTTP request and whenever you receive an HTTP response, the HTTP package contains a set of headers that are easily extended to provide HTTP with all sorts of state data. Many of these state attributes and other HTTP attributes are described in the following sections.

Cacheability #

Who should implement?
RSS readers MUST and RSS publishers CAN.

How useful is this mechanism?
Somewhat useful.

HTTP implements a very elaborate mechanism for increasing performance called response caching. The algorithm is described in the HTTP/1.1 RFC in various sections. The RFC sections on Caching in HTTP and Cache-Control describe most of what you would require in order to implement an HTTP response cache, but the protocol is quite extensive and many HTTP libraries have these mechanisms built-in. I suggest using one of the existing libraries.

The advantages of cacheability in the arena of RSS only arise when you use shared caches, that is, when several RSS readers are behind the same Web proxy. In that case, the feed can be cached by the Web proxy and served to more than one user.

The question always arises of how to handle HTTP/1.0 cacheability. First, any RSS readers or publishers that implement HTTP/1.0 are simply wrong. HTTP/1.1 is widely implemented and every RSS reader should be making HTTP/1.1 requests. That said, there are a lot of RSS readers that have chosen, for some ridiculous reason, to implement HTTP/1.0. I almost feel like telling you to ignore HTTP/1.0 requests, but you might not like this response. Rather, I suggest that all RSS readers make HTTP/1.1 requests, knowing that all RSS publishers have implemented this version of the protocol, and that RSS publishers handle HTTP/1.1 requests as best possible and HTTP/1.0 requests as minimally as possible. In other words, do the least work possible when handling cacheability and the other HTTP attributes of HTTP/1.0 requests. Again, that said, if you find your bandwidth is out of control, implementing HTTP/1.0 attributes for cacheability and compression can help, but not much.

If you need more help implementing Cacheability, then Mark Nottingham has a great article on HTTP Caches.

Entity Tags #

Who should implement?
RSS readers MUST and RSS publishers SHOULD.

How useful is this mechanism?
Very useful in reducing bandwidth requirements.

Entity tags, or ETags, are a hash of the response content, a.k.a. a cache validator, passed in the HTTP response headers. The client saves the ETag and, the next time it requests the same URL, includes the ETag in the If-None-Match header. If the ETag matches the current representation, then the Web server responds with the HTTP 304 status code and no content. This tells the RSS reader that the content has not changed since the previous request. If the ETag doesn't match the current representation, then the RSS feed is returned in the response content, as usual.

Last Modified #

Who should implement?
RSS readers MUST and RSS publishers CAN.
RSS publishers SHOULD use Entity Tags.

How useful is this mechanism?
Somewhat useful in reducing bandwidth requirements.

Another form of cache validation is the Last-Modified header. It works similarly to the ETag, except that it's based on time, rather than some sort of content hash. That's not to say that ETags can't also be dates, but ETags are not limited to dates, whereas Last-Modified headers are dates and only dates.

ETags are often referred to as the strong cache validator; that's because ETags are not an artificial hash of the content, whereas dates are. To explain: if you have a hit counter on a page, then two pulls within the same second would produce the same Last-Modified date, but two different ETags. An insignificant difference, but one nonetheless. This is not to say that an ETag must be a strong cache validator; it may also be weak.

Last-Modified works in the same manner as the Entity Tags, except that the Last-Modified header value returned in an HTTP response is passed as the If-Modified-Since HTTP header in future HTTP requests.
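Both validators follow the same save-then-echo pattern, which a reader might implement with a pair of helpers like these (the function names and the cache dict are my own sketch, not any particular library's API):

```python
def conditional_headers(cached):
    """Build request headers from a previously cached response so the
    server can answer 304 Not Modified instead of resending the feed.
    `cached` holds whatever validators we saved from the last 200."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers

def remember_validators(response_headers, cached):
    """Save the validators from a fresh 200 response for next time."""
    cached["etag"] = response_headers.get("ETag")
    cached["last_modified"] = response_headers.get("Last-Modified")
    return cached
```

On a 304 response the reader keeps its stored copy of the feed; on a 200 it replaces the copy and calls remember_validators again.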

Gzip #

Who should implement?
RSS readers SHOULD and RSS publishers CAN.

How useful is this mechanism?
Very useful in reducing bandwidth requirements.

HTTP also provides a mechanism for compressing the response content. The RSS reader can pass either gzip or compress in the Accept-Encoding HTTP header to tell the Web server that it is capable of understanding compressed responses. Gzip works the same for both HTTP/1.0 and HTTP/1.1.
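On the reader side, this amounts to advertising gzip support and undoing the encoding on the way in. A minimal sketch, assuming the response headers are available as a plain dict:

```python
import gzip

# Sent with every feed request to tell the server we accept gzip.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}

def decode_body(response_headers, body):
    """Undo Content-Encoding on a fetched feed body (bytes in, bytes out)."""
    if response_headers.get("Content-Encoding") == "gzip":
        return gzip.decompress(body)
    return body
```

Note that most mature HTTP libraries already do this transparently; the sketch just shows what happens under the hood.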

Gzip, ETag and cacheability are not yet widely supported by RSS publishers and readers, but the goal of this document is to change that. RSS software developers, let's get busy.

Redirect #

Who should implement?
RSS readers MUST implement 302 redirect.
RSS readers SHOULD implement 301 redirects.
RSS readers SHOULD implement XML redirects.
RSS publishers CAN implement any.

How useful is this mechanism?
Very useful in maintaining subscriptions.

Sometimes you need to move a feed from one URL to another. For example, services like FeedBurner host your RSS on your behalf. After hosting with FeedBurner you might decide to move your feed back to a URL within your own domain. FeedBurner implements an HTTP 301 Permanent Redirect for a ten day period after you decide to move your feed URL.

Most RSS readers currently treat HTTP 301 permanent redirects as temporary redirects and don't update their database. This is a good start, but it causes an extra, redundant network cycle forever going forward. Rather, RSS clients SHOULD update their database, replacing the original RSS feed URL with the new RSS feed URL returned by the HTTP 301 response.

RSS feed publishers may also temporarily redirect your RSS feed URL by returning an HTTP 302 Temporary Redirect. RSS clients MUST NOT update their database when they receive an HTTP 302 response.

A last type of redirect is an XML-level redirect. These should be treated the same as an HTTP 301 response.

<?xml version="1.0"?>
<redirect>
   <newLocation>http://feeds.feedburner.com/TheRssBlog</newLocation>
</redirect>
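The HTTP-level rules above boil down to a small piece of client logic. Here's a sketch, where a subscription is modeled as a plain dict with a 'url' key (my own simplification, not any reader's real schema):

```python
def handle_redirect(status, location, subscription):
    """Apply the redirect rules to a stored subscription and return
    the URL to fetch right now."""
    if status == 301 and location:
        subscription["url"] = location   # permanent: update the database
        return location
    if status == 302 and location:
        return location                  # temporary: fetch, but don't save
    return subscription["url"]
```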

Gone #

Who should implement?
RSS readers MUST and RSS publishers CAN.

How useful is this mechanism?
Somewhat useful.

Finally, when an RSS feed is over, when its life has been served and you no longer want to incur the bandwidth of clients repeatedly requesting a stale RSS feed for the rest of time, how do you tell the client to stop requesting it? This is simple. If you respond with an HTTP 410 status code, then you are "notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed." That's a quote from the HTTP RFC.

But not everybody has enough control of their Web server to respond 410 to requests. Especially if you are on a shared Web server, this may not be possible at all. In this case, I suggest you respond with the following tidbit of XML, which tells the RSS reader that you are no longer servicing this request.

<?xml version="1.0"?>
<redirect>
   <newLocation/>
</redirect>
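A reader can recognize both the moved and gone forms of this document with one small parse. This is an illustrative helper of my own, built on the two XML snippets shown above:

```python
import xml.etree.ElementTree as ET

def parse_xml_redirect(document):
    """Classify a response body as ('moved', url), ('gone', None), or
    ('feed', None) when it isn't a <redirect> document at all."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return ("feed", None)
    if root.tag != "redirect":
        return ("feed", None)
    loc = root.find("newLocation")
    if loc is not None and loc.text and loc.text.strip():
        return ("moved", loc.text.strip())
    return ("gone", None)   # empty <newLocation/> means stop polling
```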

User-Agent #

Who should implement?
Centralized RSS readers that poll on behalf of multiple
end-users SHOULD and RSS publishers CAN.

How useful is this mechanism?
Somewhat useful.

RSS feed services like FeedBurner and Fedafi use the User-Agent header to determine which RSS clients are polling their feeds. They can even use the volume and IP address of polls to determine how many clients are actually polling a feed. When trying to compile accurate reader stats, a problem occurs when a single RSS client is polling a feed on behalf of more than one end-user. In this case, many RSS clients have adopted a mechanism of overloading the User-Agent in the RSS request to pass this meta-data. A sample HTTP User-Agent header is shown with the overloaded data.

User-Agent: kb.Rmail (http://www.r-mail.org; 500 subscribers)

Pinging #

Who should implement?
RSS readers and RSS publishers CAN.

How useful is this mechanism?
Somewhat useful.

Most of this document has been about reducing bandwidth by controlling when an RSS client polls your feed. Pinging has the exact opposite purpose. If you want an RSS client to immediately poll and index your feed, then you can use an HTTP GET or XML-RPC ping to tell many RSS clients that you have new or updated blog entries that are ready for indexing. The ping has long been a part of the blogosphere infrastructure and was started by Dave Winer on a website called weblogs.com, which he later sold to Verisign.

HTTP Ping #

An HTTP ping is just a plain HTTP GET request. For example, if I wanted to tell Weblogs.com that The RSS Blog has been updated, then I would issue the following HTTP GET request.

 http://rpc.weblogs.com/pingSiteForm?name=The+RSS+Blog&url=http%3A%2F%2Fwww.kbcafe.com/rss

Simply change the name and url parameters to URI-encoded values for your blog to ping the Weblogs.com website. If the ping request succeeds, the response should have the HTTP 200 status code and the content should be a human-readable HTML page indicating this status. If the ping request fails, the response should have an appropriate non-200 HTTP status code and the content should be a human-readable HTML page informing the user why the request failed.
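Building that URL with proper URI encoding is a one-liner in most languages; here's a Python sketch (the helper name is mine):

```python
from urllib.parse import urlencode

def ping_url(name, url):
    """Build the Weblogs.com HTTP GET ping URL with URI-encoded
    name and url parameters."""
    query = urlencode({"name": name, "url": url})
    return "http://rpc.weblogs.com/pingSiteForm?" + query
```

Fetching the returned URL with any HTTP client completes the ping.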

XML-RPC Ping #

Who should implement?
RSS aggregators CAN and RSS publishers SHOULD.

How useful is this mechanism?
Very useful in specific circumstances.

The XML-RPC ping is a lot more complicated. In fact, there are two different kinds of XML-RPC pings: simple and extended.

Simple Ping Request #

In the simple XML-RPC ping, the request contains two parameters, the name and url of the website or blog that has changed. The url can either be the HTML homepage of the site or the RSS feed for the site. The XML-RPC method is 'weblogUpdates.ping' (without the quotes). A sample simple XML-RPC ping request is shown.

POST /RPC2 HTTP/1.0
Host: rpc.weblogs.com
Content-Type: text/xml

<?xml version="1.0"?>
<methodCall>
 <methodName>weblogUpdates.ping</methodName>
 <params>
  <param>
   <value>The RSS Blog</value>
  </param>
  <param>
   <value>http://www.therssweblog.com/</value>
  </param>
 </params>
</methodCall>
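Python's standard library can serialize exactly this request body. The snippet below only builds the payload; it does not send it:

```python
import xmlrpc.client

# Serialize a weblogUpdates.ping call: params are (name, url).
payload = xmlrpc.client.dumps(
    ("The RSS Blog", "http://www.therssweblog.com/"),
    methodname="weblogUpdates.ping")
```

To actually send it, you would POST that payload to the ping server's endpoint, or more simply call xmlrpc.client.ServerProxy("http://rpc.weblogs.com/RPC2").weblogUpdates.ping(name, url).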

Extended Ping Request #

In the extended XML-RPC ping, the request contains four or five parameters. The first two parameters are still the name and url of the website or blog. The second parameter in the extended ping is always the url of the homepage and not the url of the RSS feed. The third parameter is the url of the webpage that has changed. The fourth parameter is the url of the RSS feed. The fifth parameter, and the only optional parameter, is a delimited list of categories or tags associated with the blog. The delimiter is the '|' character. The XML-RPC method is 'weblogUpdates.extendedPing' (without the quotes). A sample extended XML-RPC ping request is shown.

POST /RPC2 HTTP/1.0
Host: rpc.weblogs.com
Content-Type: text/xml

<?xml version="1.0"?>
<methodCall>
 <methodName>weblogUpdates.extendedPing</methodName>
 <params>
  <param>
   <value>The RSS Blog</value>
  </param>
  <param>
   <value>http://www.therssweblog.com/</value>
  </param>
  <param>
   <value>http://www.therssweblog.com/</value>
  </param>
  <param>
   <value>http://feeds.feedburner.com/TheRssBlog</value>
  </param>
  <param>
   <value>rss|opml</value>
  </param>
 </params>
</methodCall>

Ping Response #

Whether you issue a simple or extended ping request, the response has a similar form. The response is a struct with two named members, flerror and message. The flerror is a boolean. If the request succeeds, flerror is 0 (false); otherwise it is 1 (true). The message describes the success or failure of the request in human-readable language and can be used to explain the reason for any failure. A sample response is shown.

HTTP/1.1 200 OK
Connection: close
Content-Type: text/xml
Date: Sun, 30 Sep 2001 20:02:30 GMT

<?xml version="1.0"?>
<methodResponse>
 <params>
  <param>
   <value>
    <struct>
     <member>
      <name>flerror</name>
      <value>
       <boolean>0</boolean>
      </value>
     </member>
     <member>
      <name>message</name>
      <value>Your ping succeeded!</value>
     </member>
    </struct>
   </value>
  </param>
 </params>
</methodResponse>
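The standard library can also pull flerror and message back out of the response body; here's a small helper of my own making:

```python
import xmlrpc.client

def parse_ping_response(body):
    """Extract (flerror, message) from an XML-RPC ping response body."""
    (result,), _method = xmlrpc.client.loads(body)
    return bool(result["flerror"]), result["message"]
```

Note that xmlrpc.client.loads raises a Fault exception for fault responses, so a robust client would wrap this call in a try/except.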

Who to ping? #

There are many different ping sinks on the Internet that want your pings, too many. I'll list a few of the major ping servers.

Weblogs.com
   Instructions: http://www.weblogs.com/api.html
   HTTP GET: http://rpc.weblogs.com/pingSiteForm?name={0}&url={1}
   XML-RPC: http://rpc.weblogs.com/RPC2

Yahoo!
   Instructions: http://publisher.yahoo.com/submit.php
   HTTP GET: http://api.my.yahoo.com/rss/ping?u={0}
   XML-RPC: http://api.my.yahoo.com/RPC2

Bloglines
   Instructions: http://www.bloglines.com/services/api/
   HTTP GET: http://www.bloglines.com/ping?url={0}
   XML-RPC: http://www.bloglines.com/ping

IceRocket
   Instructions: http://www.icerocket.com/c?p=ping
   HTTP GET: http://www.icerocket.com/c?p=ping&url={0}
   XML-RPC: http://rpc.icerocket.com:10080/

Technorati
   Instructions: http://technorati.com/ping
   HTTP GET: http://technorati.com/ping?url={0}
   XML-RPC: http://rpc.technorati.com/rpc/ping

FeedBurner
   Instructions: http://www.feedburner.com/fb/a/ping
   HTTP GET: http://www.feedburner.com/fb/a/pingSubmit?bloglink={0}
   XML-RPC: http://ping.feedburner.com

Note that you wouldn't normally want to ping FeedBurner unless they were hosting your RSS feed.

Retrieving Pings #

One unfortunate part of the pinging infrastructure is that the ping sinks get advance notification of changes to most blogs. This makes it very difficult for new blogosphere services to participate on an equal footing with the established services. Helping those new services compete is the weblogs.com changes.xml infrastructure. As weblogs.com receives new pings, it updates two files: changes.xml [http://rpc.weblogs.com/changes.xml] and shortChanges.xml [http://rpc.weblogs.com/shortChanges.xml]. The first contains all the pings received in the last hour and the second contains all the pings received in the last 5 minutes. Further, weblogs.com archives a list of all the pings it receives in each hour. All pings received in the first hour of January 1st 2007 would be archived in the file http://www.weblogs.com/2007/01/01/01.xml.

The format of these XML files is pretty straightforward. A very small sample of the XML file is shown.

<weblogUpdates version="2" updated="Sat, 13 Jan 2007 19:08:00 GMT" count="2000001">
 <weblog name="The RSS Blog" url="http://www.kbcafe.com/rss" when="1"/>
 <weblog name="iBLOGthere4iM" url="http://www.kbcafe.com/iBLOGthere4iM" when="2"/>
</weblogUpdates>

The root element has three largely unimportant attributes: version, updated and count. The updated attribute is a formatted date representing the time this file was last updated. The count attribute is the total number of pings received by weblogs.com.

The important part of this document is the weblog elements contained within the weblogUpdates root. For each ping received, a new record is created with three attributes: name, url and when. The name and url attributes are the name and url submitted in the ping request above. The when attribute is the number of minutes since the ping was received.

Note that you shouldn't pull the changes.xml file too frequently, but should rather use the shortChanges.xml when you need updates as frequently as every 5 minutes.
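Consuming the file takes only a few lines; this sketch (my own helper, using the attribute names described above) works on changes.xml, shortChanges.xml and the hourly archives alike:

```python
import xml.etree.ElementTree as ET

def recent_pings(changes_xml):
    """Yield (name, url, minutes_ago) for each weblog element in a
    weblogs.com changes.xml document."""
    root = ET.fromstring(changes_xml)
    for weblog in root.iter("weblog"):
        yield (weblog.get("name"), weblog.get("url"),
               int(weblog.get("when")))
```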
