In May, SearchDay published part one of an interview with Dan Chan, founder and sole proprietor of Daypop, a specialized search engine focusing on weblog and news content. Today, we present the second part of our conversation with Dan.
Q. Dan, how does Daypop define a weblog? How do you add material to the crawl? Are there weblogs you will not add?
A. My definition of a weblog hasn't changed too much since Daypop started crawling. In the early days, it was easy to spot a blog. It was reverse-chronological posts of interesting links or journal entries or both. It was generally hosted by Blogger, Pitas, or Diaryland. These days there are tons of blog hosting services and there are a lot more collaborative blogs. The format has expanded to include categories and feature articles and photo albums, but the central concept is still the same.
For the first year, I surfed blogs. This happened in batches. When I had a free weekend, I would hop from blog to blog and check out what was out there. I found that there was an amazing number of very good, very well written blogs that didn't get any attention.
I'd add all of the blogs I found to Daypop's blog list. I also take site submissions and every once in a while I would work through them, reviewing each one before adding them to the index. This kind of strategy was just not scalable.
Recently, I added about 19,000 blogs to the index from data gathered from weblogs.com's update file. I did this without reviewing the blogs. There have been a couple of isolated cases of spamming, but it hasn't been too bad.
Daypop accepts all weblogs. I faced the question of blog censorship very early on. A Daypop user had pointed out a site in the index that was racist and anti-Semitic. I decided to leave the site in the index.
I think I've explained it best here:
"We all act as our own editors and filters to the information we're presented with. Daypop gives you news and views. I feel it's up to you to do the rest. I don't think it would be right for me to take away that role of editor from you."
Q. Is the Daypop database refreshed daily? Are certain sites recrawled more than others?
A. Blogs are crawled at most once every 12 hours. If a blog is one that is infrequently updated, Daypop adjusts to that blog's approximate update schedule and crawls it less frequently. The big international news sites like CNN and NY Times are crawled every 15 minutes. Other important news sites get crawled every hour or several hours. The remaining news sites are crawled every 24 hours.
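The schedule Dan describes could be sketched roughly as follows. This is only an illustration, not Daypop's actual code: the tier names, the one-hour value for "important" news sites, and the idea of averaging observed gaps between updates are all assumptions made for the example.

```python
from datetime import timedelta

# Hypothetical tiers; the intervals follow the figures quoted in the interview.
# "important_news" is set to one hour, although the interview says
# "every hour or several hours".
TIER_INTERVALS = {
    "major_news": timedelta(minutes=15),
    "important_news": timedelta(hours=1),
    "other_news": timedelta(hours=24),
}
MIN_BLOG_INTERVAL = timedelta(hours=12)


def blog_crawl_interval(observed_update_gaps):
    """Return a crawl interval for a blog, never shorter than 12 hours.

    observed_update_gaps: list of timedelta values between detected updates.
    Averaging the gaps is an assumed stand-in for "approximate update schedule".
    """
    if not observed_update_gaps:
        return MIN_BLOG_INTERVAL
    average_gap = sum(observed_update_gaps, timedelta()) / len(observed_update_gaps)
    return max(MIN_BLOG_INTERVAL, average_gap)
```

Under these assumptions, a blog that updates every few hours is still crawled only every 12 hours, while one that updates every three days or so is revisited about that often.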
Q. How long does Daypop keep material in its database? In other words, could I use Daypop to search for a weblog's postings back in 2000? Do you see a need for a weblog archive search tool?
A. Daypop is a current events search engine, so it only searches back one week. My goal was to eventually provide a way to search archived weblog pages back to 2001, when Daypop started. Unfortunately, most of the raw page data was lost in one of my server hard drive mishaps.
Q. Will you give us some background into the rankings available on Daypop? Do you have plans to create others?
A. I separate what I call Daypop's Trend Analysis into four categories. There's Link Analysis, Word Analysis, Wishlist Analysis, and Authority Analysis.
Link Analysis started with the creation of the Top 40 page shortly after Daypop launched. The Top 40 page ranks links much like the way a football team is ranked. A team's standing is determined by its number of wins, with its most recent wins counting more.
The Top 40 gives more weight to links that have recently been created. This means only fresh, newly discovered links make it to the Top 40. The Top News and Top Posts pages are just filtered versions of the Top 40 that only give you news articles and weblog posts, respectively.
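One plausible way to give recent links more weight is exponential decay, where each citation counts for less as it ages. The sketch below is an assumption for illustration only; the interview does not say which weighting Daypop uses, and the function name, the 24-hour half-life, and the input format are hypothetical.

```python
from collections import defaultdict


def top_links(citations, now, half_life_hours=24.0, limit=40):
    """Rank URLs by a recency-weighted citation count.

    citations: iterable of (url, seen_at) pairs, with seen_at in hours.
    Each citation contributes 0.5 ** (age / half_life_hours), so a link
    cited one half-life ago counts half as much as one cited just now.
    """
    scores = defaultdict(float)
    for url, seen_at in citations:
        age = max(0.0, now - seen_at)
        scores[url] += 0.5 ** (age / half_life_hours)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

With this weighting, a link cited twice today outranks a link cited three times four days ago, which matches the "most recent wins count more" analogy.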
Word Analysis measures word occurrences and changes over time to determine the "burstiness" of certain words. Words that have experienced a burst in usage in blogs in the last few days are listed in the Word Bursts page.
News Bursts measures bursts in usage on front pages of news sites. Link analysis alone doesn't catch all the memes that are going around. Sometimes there are no authoritative links to anchor the meme. That's the purpose of Word Analysis -- to catch these memes.
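A simple way to approximate "burstiness" is to compare how often a word appears in recent posts against an older baseline. The following is a minimal sketch under that assumption; the smoothing constant, the minimum count, and the function name are hypothetical and not taken from Daypop.

```python
from collections import Counter


def word_bursts(recent_docs, baseline_docs, min_count=3, smoothing=1.0):
    """Score words by how much their relative frequency rose recently.

    recent_docs, baseline_docs: lists of token lists.
    Returns (word, ratio) pairs sorted by descending ratio, limited to
    words that occur at least min_count times in the recent documents.
    """
    recent = Counter(word for doc in recent_docs for word in doc)
    baseline = Counter(word for doc in baseline_docs for word in doc)
    recent_total = sum(recent.values())
    baseline_total = sum(baseline.values())

    bursts = []
    for word, count in recent.items():
        if count < min_count:
            continue
        # Additive smoothing keeps unseen baseline words from dividing by zero.
        recent_rate = (count + smoothing) / (recent_total + smoothing)
        baseline_rate = (baseline[word] + smoothing) / (baseline_total + smoothing)
        bursts.append((word, recent_rate / baseline_rate))
    return sorted(bursts, key=lambda item: item[1], reverse=True)
```

A word that suddenly shows up across many posts gets a high ratio even when no single link anchors the discussion, which is the gap Word Analysis is meant to fill.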
Wishlist Analysis measures how popular certain books, videos and music are on bloggers' wishlists. It does this using similar algorithms to the Top 40.
Authority Analysis gives a global ranking of blogs, ranked by Citations and also by Daypop Score. Daypop Score takes into account blog importance and weights citations using this. High Daypop Scoring weblogs confer more weight or importance to weblogs that they link to.
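The description of Daypop Score, where high-scoring weblogs pass more weight to the weblogs they link to, resembles link-based authority measures such as PageRank. The sketch below uses a standard PageRank-style iteration as a stand-in; it is an assumption, not Daypop's published formula, and the damping factor and function name are illustrative.

```python
def authority_scores(links, damping=0.85, iterations=50):
    """Compute PageRank-style scores for blogs.

    links: dict mapping each blog to the list of blogs it links to.
    Returns a dict of blog -> score; the scores sum to roughly 1.
    """
    blogs = set(links) | {t for targets in links.values() for t in targets}
    if not blogs:
        return {}
    n = len(blogs)
    scores = {blog: 1.0 / n for blog in blogs}
    for _ in range(iterations):
        new_scores = {blog: (1.0 - damping) / n for blog in blogs}
        for source in blogs:
            targets = links.get(source, [])
            if targets:
                # A blog splits its weighted score among the blogs it cites.
                share = damping * scores[source] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A blog with no outgoing links spreads its score evenly.
                share = damping * scores[source] / n
                for blog in blogs:
                    new_scores[blog] += share
        scores = new_scores
    return scores
```

For example, with `links = {"a": ["c"], "b": ["c"], "c": ["a"]}`, blog `c` ends up with the highest score, and `a` outranks `b` because its single citation comes from the high-scoring `c`.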
There are plans to roll out some new services. One of them is in beta right now and it pertains directly to the next question about personalization.
Q. Do you think web search "personalization" will be the next big thing? If not, what will be?
A. Contextual weighting can be seen as a form of Search Personalization. In this case, the context of the word is determined by your interests. The search engine would know that you are a wildlife fanatic and most likely mean Jaguar, the feline, and not Jaguar, the car.
How do you determine "interest?"
One good way is to use a person's blog, if it exists. The blog is a goldmine of information about a person's interests. Analyzed for word content, you could potentially categorize someone as say, a Mac fanatic who lives in Los Angeles. Then, using the outgoing links from a blog and the citations to the blog, you could determine a more generalized "neighborhood" of interests.
Multi-term searches probably give enough usable contextual information in the absence of "interest" data.
You could also use a blog as the starting point for determining Daypop Scores, giving that blog's links the most weight. That way, every page in the index has a "personalized" Daypop Score for every blog in the index. Combining all these strategies would lead to more relevant search results.
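Using one blog as the starting point for scoring is similar in spirit to personalized PageRank, where the random restart always returns to a chosen node. The sketch below assumes that interpretation; it is not a description of the beta feature Dan mentions, and the function name and damping factor are hypothetical.

```python
def personalized_scores(links, start_blog, damping=0.85, iterations=50):
    """Score pages relative to one starting blog.

    links: dict mapping each page to the list of pages it links to.
    All restart weight goes to start_blog, so pages near it in the
    link graph score highest and unreachable pages score zero.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    nodes.add(start_blog)
    scores = {node: 0.0 for node in nodes}
    scores[start_blog] = 1.0
    for _ in range(iterations):
        new_scores = {node: 0.0 for node in nodes}
        new_scores[start_blog] = 1.0 - damping
        for source in nodes:
            targets = links.get(source, [])
            if targets:
                share = damping * scores[source] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # Pages without outgoing links hand their score back to the start.
                new_scores[start_blog] += damping * scores[source]
        scores = new_scores
    return scores
```

Running this once per blog would yield a separate set of scores for each starting point, which is one way to read "a personalized Daypop Score for every blog in the index".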
There's a feature that's in beta right now that's related to this concept of Personalization. It's something to look out for in the next month.
This interview with Dan Chan concludes with part three.