How the wheel turns. Back in the 1990s, portals gave away free home pages and with them came a huge amount of search engine spam. Today, portals like Google, Yahoo, MSN and AOL give away free blog space -- and lo and behold, we have blog spam that apparently hit a new high with a blog spam emergency this weekend, as Tim Bray writes. The blogosphere has been buzzing with discussion on the problem.
As for myself, I just continue to shake my head that these types of spam issues with blogs simply weren't expected. The solution? It's likely going to be just like what happened with free web space -- free blog space will get ignored by search engines.
Come along, and we'll do a tour of past and present, plus a look at the issues you get when you try to maintain quality while also ranking search results by time.
1997: Free Home Pages & Spam
First the past. It was around 1997 when free web space for personal home pages seemed to become more accessible to many people. I remember it well, because soon after, I started getting complaints from those making use of these services. They found that search engines weren't finding all of their pages. Some discovered that none of their pages got indexed at all.
It got so bad that eventually, I had to do an article about it. Search Engines And Free Web Pages still floats around in the SEW Archives area for our SEW members. Here's the top of it, which will sound really familiar when I start talking about blog spam later on:
Many people take advantage of free web space provided by their internet access providers. What they don't realize is that search engines have shown a tendency to miss or even ignore certain sites. Complaints have been heard from those using space provided by America Online, CompuServe and other places.
Indeed, AltaVista no longer even accepts submissions from Tripod, a popular web service that provides free space. Why? Search engine spammers were using free space there as a base of operations. It's easy to open up a new account, hit the search engine with bogus pages, then move on once the spamming attempt is detected.
At that time, it was more the internet access providers giving away space, rather than portals. But not long after, portals jumped in themselves, culminating I'd say with Yahoo's $3.6 billion acquisition of GeoCities in 1999.
Search spam hosted on free web space had died down as a problem by that point, however. Why? Search engines were largely ignoring these areas of the web, and because these areas were ignored, they were no longer attractive magnets for spammers. No one wants space that can't be seen.
2005: Free Blogs & Splogs
The blogosphere was hit by a blogspot.com splogbomb. Someone did the inevitable and wrote a script that created blog after blog and post after post.
I'm not talking 100 blogs with a 100 posts each. Im talking what could easily turn into 10s of THOUSANDS of blogs pinging out millions of posts!
Do a search for HDNet on Icerocket.com or any of the other engines and look at all the Splogs there are. And they have URLs like this So google, at least for the time being, we shut out adding new blogspot posts to our index until we clean all the bullshit you dumped on us out of our indexes.
Sound familiar? I mean, just change the names, and the result is the same. Blogs are simply more sophisticated home pages, for many, as I've written. And splogs (spam+blogs) are just more sophisticated home page spamming attempts.
Allow anyone to create content for free or with no real barriers, and surprise, a few people will go to extremes and be abusive. Result? IceRocket no longer trusts the free blog space that Google offers through Blogger/Blogspot, in exactly the same way that many search engines stopped trusting the free web space of the GeoCities of the past.
Google's Failure To Police Or Post Barriers
Google: Kill Blogspot Already!!! from Chris Pirillo also went up Sunday. As with Mark Cuban, Chris finds Blogger's Blogspot-hosted blogs the chief culprit.
I don't know what's (specifically) making it so insanely easy for these spammers to get signed into your system, but you need to change that....
Suggestion, Google? As bold as this might sound, you should institute an authentication system - a captcha of sorts - for every single post that gets sent through your Blogger service. This means that there's no more easy rides for the idiots out there who are killing your baby and the blogosphere.
Fair enough -- some barriers to entry would help, either in setting up the free space (captchas, charging a token fee, whatever) in the first place or perhaps even in how people are allowed to post.
Google's certainly winning no points with me on this front. Back in June, I wrote My Encounter With Search Spam On Blogger, where I talked about someone who lifted a description from the Search Engine Watch web site in a misleading manner, the same person who had lots of other splogs going, as well. In addition to writing about it, I also went through the formal reporting channels. Nevertheless, it still sits there.
But Barriers Won't Solve Issues For Blog Owners
Of course, it won't help if only Google cleans things up. I haven't checked, so apologies if I'm mistaken, but I'm pretty sure that I can get going with free space over at MSN Spaces and Yahoo 360. Google's Blogger is simply a better-known service. Closing down abuse at Blogger would be great, but I suspect that just means the abuse will move elsewhere.
For potential bloggers, I'm afraid my advice about free home pages from back in 1997 will become just as applicable to free blogging space:
That may seem unfair [search engines ignoring free web pages], but when you use free web space, it's as if you have hundreds of roommates. They can get the entire domain in trouble, and the police, or the search engines in this case, may not care that you are innocent.
Ask your provider if there have been any problems with search engines visiting free web pages. They should know if there are complaints, and they should also be able to help resolve any problems. They have the ability to direct large numbers of people toward the search engines, so it's to the advantage of the search engines to work with the providers.
If it's crucial to be indexed, you may want to consider leaving the free web space and going with a commercial hosting service.
In other words, get your own domain name. It has never, ever, ever, ever, ever, ever been a good idea from a search marketing perspective to make use of someone else's domain name, as you are not in control of your own destiny. The risks include:
- Bad "neighbors" also sharing the domain name causing you trouble with search engines
- Your domain name landlord down the line potentially taking away the house where you live
Don't trust me? Don't trust this fundamental bit of advice that I and other search marketers have been giving for years, to have your own domain name? Then consider that usability expert Jakob Nielsen just said the same thing today in Weblog Usability: The Top Ten Design Mistakes. Tip number 10 is not to use a domain name owned by another service. He talks about the issue of controlling your own destiny, as well as being seen as an amateur and the problems of moving over to your own domain name down the line.
Splogs & Searching Issues
How about the searcher side of things? Tara Calishain found Google and Feedster most impacted by splogs, Technorati seeming more resistant (probably in part, I suspect, because it actually spiders pages rather than relying on feeds) and Yahoo getting by primarily because of the limited feeds it covers.
Russell Beattie, like Chris Pirillo, found his PubSub feeds getting washed out with spam. I thought the comments below his post were especially interesting, looking at fighting back on the Google AdSense front. It's an issue that's come up before. Not only does Google host a bunch of this junk content, but it also helps fuel it by letting people earn through AdSense.
Ranking By Time Magnifies Spam
Back to Mark Cuban: his post highlights one of the key issues that blog search faces. Time ranking magnifies the spam problem.
The major search engines have plenty of spam in their indexes. You simply don't see this as much because searches are sorted by relevancy. What are deemed the best pages across the entire web? Links are used to help calculate this, but textual data on the page and in the links, along with many other factors, also come into play.
In contrast, blog search is largely ranked by time. Post something, send it out in your feed, and boom -- you're at the top of the list! That is, until someone else posts and pushes you back down.
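To make the contrast concrete, here is a minimal sketch of the two orderings. Everything in it is an assumption for illustration: the Post fields, the example URLs and the relevancy scores are made up, and no real search engine computes relevancy from a single stored number like this.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Post:
    url: str              # hypothetical example URL
    published: datetime   # when the post went out in the feed
    relevancy: float      # hypothetical quality score, 0.0 (junk) to 1.0 (best)


def rank_by_time(posts):
    """Newest first, with no regard for quality (blog search style)."""
    return sorted(posts, key=lambda p: p.published, reverse=True)


def rank_by_relevancy(posts):
    """Highest score first, with no regard for age (web search style)."""
    return sorted(posts, key=lambda p: p.relevancy, reverse=True)


posts = [
    Post("trusted-blog.example/review", datetime(2005, 10, 14, 9, 0), 0.92),
    Post("splog-1.example/post", datetime(2005, 10, 16, 12, 0), 0.03),
    Post("splog-2.example/post", datetime(2005, 10, 16, 12, 5), 0.02),
]

time_ranked = [p.url for p in rank_by_time(posts)]
relevancy_ranked = [p.url for p in rank_by_relevancy(posts)]

print(time_ranked)
print(relevancy_ranked)
```

Under time ranking the two freshly posted splogs sit above the older legitimate post. Under relevancy ranking they are still in the index, just buried at the bottom where few searchers will see them.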
How About Some Authority Mixed In?
Solution? How about ranking by time and also limiting matches to only quality blogs? Ah, but you see, that's what PubSub is supposed to be able to do. When you create a feed over there, you can use the Filtering By LinkRank feature to limit matches to the top 1 percent, 2 percent, 5 percent, 10 percent or 25 percent of blogs (or technically, feeds).
I've played with it a bit, and haven't been impressed. I got a feed for [google] and know I've limited it past the default (PubSub unfortunately doesn't show your setting after a feed is made). Nevertheless, most of the current matches are coming from one site simply because the word "google" appears in the "Ads By Google" links it carries.
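The general idea of mixing time ranking with an authority cutoff can be sketched as follows. This is not PubSub's LinkRank, whose workings I don't know. The authority numbers here are hypothetical inbound link counts, and the feed names and function names are invented for the example.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FeedPost:
    feed: str             # hypothetical feed name
    published: datetime


def top_percent_feeds(authority, percent):
    """Return the feeds in the top `percent` of all feeds by authority score."""
    ranked = sorted(authority, key=authority.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return set(ranked[:keep])


def time_ranked_with_authority(posts, authority, percent):
    """Drop posts from feeds below the cutoff, then rank the rest newest first."""
    allowed = top_percent_feeds(authority, percent)
    kept = [p for p in posts if p.feed in allowed]
    return sorted(kept, key=lambda p: p.published, reverse=True)


# Hypothetical authority scores (for example, inbound link counts).
authority = {"a.example": 950, "b.example": 400, "c.example": 12, "splog.example": 0}

posts = [
    FeedPost("splog.example", datetime(2005, 10, 16, 12, 0)),
    FeedPost("b.example", datetime(2005, 10, 16, 8, 0)),
    FeedPost("a.example", datetime(2005, 10, 15, 8, 0)),
    FeedPost("c.example", datetime(2005, 10, 16, 10, 0)),
]

results = time_ranked_with_authority(posts, authority, percent=50)
result_feeds = [p.feed for p in results]
print(result_feeds)
```

A filter like this only screens out low-authority sources. It does nothing about a well-linked site that matches a query through boilerplate text, which is the kind of match described above with the "Ads By Google" links.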
News Search Is Great Because They Limit Sources
Over at Robert Scoble's blog, his post from last week, The race to time-based and blog search, touches on exactly the problem of mixing time and relevancy together. His view is that search engines in general suck on the time-based aspect:
Let's look at Yahoo, Google, and MSN first so you can see just how bad those three are if you want to find something that was added to the Web yesterday.
We have a great case study. Yesterday Microsoft and Real settled their anti-trust case and announced a new partnership. It was written about on hundreds of blogs and hundreds of "pro" news sources.
We also have today's Apple announcements. So, let's search on both of those...
Robert goes on to be unimpressed at finding new stuff. But the reality is that search engines are great at finding new stuff. That's called news search. And news search is great because the sources are limited. Not everyone gets in. It may be that for blog search to be great, you have to have that same type of limitation. More on this in my response to Robert's post, which I've reprinted below:
Let's qualify. You mean how bad they are if you only look at the web search results and ignore the onebox/shortcut displays they have.
In other words, do [video ipod] on Google or Yahoo, and at the top of the pages, they show you plenty of news results. They aren't behind in gathering fresh data. They're simply segregating it into the news area and giving you a heads-up that it is there.
You're either missing it or ignoring it because those top of the page segments don't feel "normal" to you. All I can say is that the search engines are aware of that issue.
If you look at my Invisible Tabs article it talks about how at some point, the search engines need to automatically push the right button or tab or link for you, to give you 10 news results for queries that obviously are news related. Or you do a shopping search and you get all shopping results automatically.
Remember, web search is NOT a time based activity. Honestly. Think about it. The last time you did a web search for something new, you weren't looking for the best overall site on the subject, were you? No, you wanted the latest, timely information. You wanted news. They give you excellent news through news search engines. And Yahoo, among the majors, as you know just started incorporating blogs as a news source, as well.
Overall, Robert, I think the posts you are doing on search are great in raising the issues out there and helping push for further UI changes that need to happen. But I think it would also help to point out some of the features that do exactly what you want, when they exist. IE -- everyone, you want timely info? news.google.com, news.yahoo.com are great places to go.
As for your blog search problem, yeah, I know that well. It's why I don't depend on blog search much. I get timely, but I also get all the crud. PubSub tries to solve this by picking the most authoritative blogs, but I haven't found that's really solved the problem much.
Ultimately, it will probably come down to blog search further refining this, letting you search by default against a set of hand selected or some other method filtered blogs, to cut out all the spam -- and you can go further across all the blogs if you want. But when there are simply so many blogs out there, a good chunk of them splogs and so on, you've got to have some filtering. THAT's why news search works so well, because the vertical sites allowed in there are reviewed.
FYI, Technorati's out with some handy fresh numbers, finding that two to eight percent of new blogs are spam but noting that this weekend's problems may have been perceived as worse simply because the spam is targeting the names of people. Bloggers are big ego searchers, so if someone targets your name, blog spam can seem worse.