For your stats folder.
A post on the Topix.net blog lets us know that the amount of RSS content from news organizations is not as great as some might believe.
Rich Skrenta reports that of the 7000+ news sources Topix crawls, only 7% have feeds. He goes on to say that even when a site has a feed, Topix usually crawls the HTML content.
"Even for sites which offer feeds, we'll generally continue to crawl the human-readable version. We've seen sites where the RSS broke but no one at the paper seemed to notice, or cases where the RSS was out of sync with the human-viewable web content."
What about search tools that focus on weblog content?
The Waypath numbers include the following:
+ Only 63% of the weblogs Waypath crawls have feeds; only 22% have full-text in their feeds.
These Waypath numbers were a bit surprising to me. I had thought the penetration of RSS/XML feeds in the blogosphere was greater, especially when it comes to blogs offering full-text feeds.
From the searcher's perspective, it's worth remembering that an RSS search might not be the same thing as a full-text search.
Thanks to G.L. for the news tip.