15,000 Blogs Added to Topix.net Database

Material from 15,000 blog sources has been added to the Topix.net database. Topix.net already contains material from 12,000 mainstream media sources. Items from blogs and mainstream sources are mixed on topical “feed” pages and search results pages. Topix CEO Rich Skrenta has the details (including some great charts and stats) on the company blog.

If you’ve never visited or used Topix.net, it’s more than worth a look. I use it many times each day (it was one of my top new resources for 2004), either as a news search tool or by browsing some of the more than 300,000 topical “feeds” and 30,000 local feeds that are constantly updated. Btw, Topix also does a great job of separating press releases from other content (look for the PR Scan link in the left column of every page). Channels are available for every Zip Code in the U.S. (and most postal codes in Canada) as well as celebrities, industries, and much more. I find material via Topix that I either don’t see elsewhere or see on Topix first. Every channel can be viewed on the Topix site or read via RSS.

So, let’s get to today’s news from Topix.net about the addition of content from more than 15,000 blogs to their crawl of more than 12,000 news sources.

+ Blog posts are currently highlighted in a tan/manila box to separate them from mainstream media. This is most likely a beta and will not be the final UI.

+ Topix crawls both RSS and HTML. However, Rich Skrenta tells us that it’s an RSS crawl for most of the blog content.

+ “Posts should show up on our site and search index within 1-3 minutes of being crawled.” Note: Our blog and the DocuTicker site I edit were fortunate enough to be two of about 500 blogs that were already in the Topix index before today. I can say that many times I was able to find something I had posted in Topix within a very (and I mean very) few minutes.

+ The Topix blog post offers a pie chart comparing the number of posts (by topic) from weblogs versus what Topix calls “mainstream media.” Interesting. The only thing I’m unclear about is what precisely counts as a blog, and whether the definition varies from blog to blog. For example, does a “blog” from the BBC, Washington Post or MSNBC count as a blog or a mainstream source? I’ll admit that this is a gray area as blogs become more mainstream. Just how a blog is defined these days is very debatable.

+ The numbers. Topix.net CEO Rich Skrenta offers some insights and numbers on the “real” number of blogs out there versus the number of spam blogs that exist. Very interesting and, some might say, amazing numbers that will surely have people talking. I’ll leave it at that for now. Tag the following numbers: wow. (-:

While the total number of unique feeds that have ever existed, or blogging accounts that have ever been signed up can certainly be counted, what is far more relevant to us is the composition of the daily posting stream. [My emphasis] What we’re seeing is that 85-90% of the daily posts hitting ping services such as weblogs.com are spam (take a look for yourself). Of well-ranked non-spam blogs that we’ve discovered, we’ve found about half haven’t been updated in the past 60 days. Our filters sift through what’s left, which even after discarding 95%, is still a great deal of good material.

Why 15,000 Blogs? Who Made the Selections?
So, how did Topix choose the 15,000 blogs that are now in the database? Skrenta explains that more than 1 million blogs were crawled and then ranked using Topix’s NewsRank algorithm, which looked at blog posting frequency, writing style, type of reference, popularity, etc. We also learn that 15,000 is an arbitrary number and that Topix hopes to add more (lots more) moving forward.
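
To put those figures in perspective, here is a quick back-of-the-envelope sketch. It uses only the numbers quoted in this post; the variable and function names are my own, and since the post says "more than" 1 million blogs were crawled, the selection rate it computes is an upper bound rather than an exact figure.

```python
# Figures quoted in the post. "More than" values are treated as lower bounds,
# so the selection rate below is an upper bound, not an exact number.
BLOGS_CRAWLED = 1_000_000     # "more than 1 million blogs were crawled"
BLOGS_SELECTED = 15_000       # blogs added to the database
MAINSTREAM_SOURCES = 12_000   # mainstream media sources already indexed


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Fraction of crawled blogs that made the cut.
selection_rate = share(BLOGS_SELECTED, BLOGS_CRAWLED)

# Fraction of all Topix sources (blogs plus mainstream) that are blogs.
blog_share_of_sources = share(BLOGS_SELECTED, BLOGS_SELECTED + MAINSTREAM_SOURCES)

print(f"Selected at most {selection_rate:.1f}% of crawled blogs")      # 1.5%
print(f"Blogs are about {blog_share_of_sources:.1f}% of all sources")  # 55.6%
```

In other words, no more than about 1.5% of the crawled blogs were selected, yet blogs now make up a little over half of the sources by count. Source count says nothing about post volume, of course.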

Adding Your Blog
If your blog isn’t included in the Topix crawl, you can submit it (and give feedback on the service) here.

This is all very new and I look forward to seeing how useful the blog content is versus what I’ve been finding from Topix over the past year. One feature that would be good to have is an option to toggle either blog content or mainstream media content on or off, on both topical pages and the advanced search interface.

More later.

See Also:
An OJR interview from earlier this year with Rich Skrenta and Chris Tolles from Topix.net
