February 27, 2015

Crawling websites using RSS feeds

A lot of websites have RSS feeds, especially those that are updated regularly with new content, such as news websites and blogs. Using these feeds, it is possible to crawl the sites (notably news sites) both more effectively (getting more good content) and more efficiently (getting less redundant content). This is because the RSS feeds shine a light on the most critical datum we need: what has changed.

To take advantage of this, I have developed an add-on for Heritrix, called simply CrawlRSS.

The way it works is that for any given site a number of RSS feeds are identified. For each feed a number of "implied pages" are defined. When a new item is detected in a feed, that is taken to imply that the feed's implied pages (such as a category front page) have also changed, causing them to be recrawled.
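To make the "implied pages" idea concrete, here is a minimal sketch in Python. It is not the actual CrawlRSS code (which is a Heritrix add-on); the function names and the `seen` set are hypothetical, and it assumes a plain RSS 2.0 feed without XML namespaces.

```python
import xml.etree.ElementTree as ET


def new_item_links(rss_xml, seen):
    """Return the links of feed items not seen before, and remember them.

    `seen` is a set of links from earlier rounds (hypothetical bookkeeping).
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


def urls_to_crawl(rss_xml, seen, implied_pages):
    """New items plus the feed's implied pages, or nothing if the feed is unchanged."""
    fresh = new_item_links(rss_xml, seen)
    # A new item is taken to imply that the implied pages have changed too.
    return fresh + list(implied_pages) if fresh else []
```

Calling `urls_to_crawl` twice with the same feed content and the same `seen` set yields the new items and implied pages the first time and an empty list the second time, since nothing has changed in between.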

The feeds for a given site are emitted all at the same time. If there is duplication amongst them (a news story is both under domestic and financial news, for example), the URL is only enqueued once, but the implied pages for both feeds are enqueued. All of the discovered URLs are then crawled, along with embedded items such as images and stylesheets that are discovered via regular link extraction.
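The handling of duplicates across a site's feeds can be sketched like this. Again, this is only an illustration in Python under my own assumptions, not the add-on's code; `merge_feed_results` and its input shape are made up for the example.

```python
def merge_feed_results(feed_results):
    """Build one crawl queue for a site from all of its feeds.

    `feed_results` is a list of (new_item_urls, implied_pages) pairs,
    one pair per feed (a hypothetical structure for this sketch).
    """
    queue = []
    queued = set()

    def enqueue(url):
        # Each URL is enqueued at most once per round.
        if url not in queued:
            queued.add(url)
            queue.append(url)

    for new_items, implied_pages in feed_results:
        for url in new_items:
            enqueue(url)
        # Implied pages are enqueued for every feed that reported a new
        # item, even if the item itself was already queued via another feed.
        if new_items:
            for page in implied_pages:
                enqueue(page)
    return queue
```

A story that appears in both the domestic and the financial feed ends up in the queue once, while the front pages of both categories are queued.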

The scoping here is very important. Aside from URLs deriving from the feed, only a very narrow range of embedded content is allowed to be crawled. This ensures that the crawler is able to complete a crawl round in a reasonable amount of time. Obviously, this means that an RSS triggered crawl will not capture, for example, old news stories. It is not a replacement for traditional snapshot crawls, but a complementary technique that can help capture the minute-to-minute changes in a website.
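As a rough illustration of such a narrow scope, consider the following Python sketch. The `Candidate` type, the hop notation ("E" for an embed, "L" for an ordinary link) and the limit of two embed hops are all assumptions made for this example; the real rules are configured in the crawler and may differ.

```python
from dataclasses import dataclass

MAX_EMBED_HOPS = 2  # assumed limit, e.g. a stylesheet that pulls in an image


@dataclass(frozen=True)
class Candidate:
    url: str
    from_feed: bool  # True for feed items and implied pages
    hops: str        # hops since the feed-derived page: "E" embed, "L" link


def in_scope(candidate):
    """Accept feed-derived URLs and a short chain of embeds, nothing else."""
    if candidate.from_feed:
        return True
    only_embeds = set(candidate.hops) == {"E"}
    return only_embeds and len(candidate.hops) <= MAX_EMBED_HOPS
```

Under these assumptions an image embedded in a new story is in scope, but an older story that the new one merely links to is not.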

Once all discovered URLs deriving from one site's feeds have been crawled, the site becomes eligible for a feed update, i.e. the feeds may be crawled again. A minimum amount of wait between crawling the feeds can be specified. However, this is just a minimum wait time (so we aren't constantly updating a feed that hasn't changed). No maximum limit is imposed. If there is a lot of content to download, it will simply take the time it needs. Indeed, during the first round (when all the layout images, stylesheets etc. are crawled) it can take a couple of hours for all derived URLs to be crawled.
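The eligibility rule boils down to two conditions, sketched here in Python. The function and parameter names are hypothetical, and I am assuming for the example that the minimum wait is measured from the moment the feeds were last fetched.

```python
def eligible_for_feed_update(pending_urls, last_feed_fetch, now, min_wait):
    """Decide whether a site's feeds may be fetched again.

    pending_urls    -- number of feed-derived URLs still waiting to be crawled
    last_feed_fetch -- time of the previous feed fetch, in seconds
    now             -- current time, in seconds
    min_wait        -- configured minimum wait between feed fetches, in seconds
    """
    round_finished = pending_urls == 0
    waited_long_enough = (now - last_feed_fetch) >= min_wait
    # There is no maximum: a long round simply delays the next update.
    return round_finished and waited_long_enough
```

Note that a site with a large backlog stays ineligible no matter how much time has passed, which is what allows the first round to take as long as it needs.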

Typically, the crawler will quickly reach the point where any given site is mostly just waiting for the next feed update. Of course, this depends mostly on the nature of the sites being monitored.

The quality of the RSS feeds varies immensely. Some sites have very comprehensive feeds for each category on the site. Others have only one overarching feed. A problematic aspect is that some websites only update their RSS feeds at fixed intervals, not as new items are added to the site. This means that the feed may be up to an hour out of date.

It is necessary to examine and evaluate the RSS feeds of prospective sites. This approach is best suited to high-value targets where it is reasonable to invest a degree of effort into capturing them as well as possible.

When done correctly, on a site with well configured feeds, this enables a very detailed capture of the site and how it changes. And this is accomplished with a fairly moderate volume of data. For example, we have been crawling several news websites weekly. This has not given a very good view of day-to-day (let alone minute-to-minute) changes but has produced 15-20 GiB of data weekly. Similar crawling via RSS has given us much better results at a fraction of the data (less than 0.5 GiB a day). Additionally, the risk of scope leakage is greatly reduced.

Where before we got 1 capture a week, we now get up to 500 captures of the front pages per week. This is done while reducing the overall amount of data gathered! For websites that update their RSS feeds when stories are updated, we also capture the changes in individual stories.

This is, overall, a giant success. While it is true that not all websites provide a usable RSS feed, where one is present, this approach can make a huge difference. Look, for example, at the search results in our wayback for the front page of RUV, the state broadcaster here in Iceland. Before we started using their RSS feed, we would capture their front page about 15-30 times a month. After, it was more like 1500-1800 times a month. That is roughly a hundredfold increase, with a very small storage footprint.

As I stated before, this doesn't replace conventional snapshot crawling. But for high value, highly dynamic sites, this can, very cheaply, improve their capture by a staggering amount.

Next week I'll do another blog post with the more technical aspects of this.