Problems With Splogs & Time-Based Searching
How the wheel turns. Back in the 1990s, portals gave away free home pages and
with them came a huge amount of search engine spam. Today, portals like Google,
Yahoo, MSN and AOL give away free blog space — and lo and behold, we have blog
spam that apparently hit a new high with a blog spam emergency this weekend, as
Tim Bray
writes. The blogosphere has been buzzing with discussion on the problem.
As for myself,
I just continue to shake my head that these types of spam issues with blogs
simply weren’t expected. The solution? It’s likely going to be just like what
happened with free web space — free blog space will get ignored by search
engines.
Come along, and we’ll do a tour of past and present, plus a look at the
issues you get when you try to maintain quality when also ranking search results
by time.
1997: Free Home Pages & Spam
First the past. It was around 1997 when free web space for personal home
pages seemed to become more accessible to many people. I remember it well,
because soon after, I started getting complaints from those making use of these
services. They found that search engines weren’t finding all of their pages. Some
discovered that none of their pages got indexed at all.
It got so bad that eventually, I had to do an article about it.
Search Engines And Free Web Pages still floats around in the SEW Archives
area for our SEW members.
Here’s the top to it, which will sound really familiar when I start talking
about blog spam later on:
Many people take advantage of free web space provided by their internet
access providers. What they don’t realize is that search engines have shown a
tendency to miss or even ignore certain sites. Complaints have been heard from
those using space provided by America Online, CompuServe and other places.
Indeed, AltaVista no longer even accepts submissions from Tripod, a popular
web service that provides free space. Why? Search engine spammers were using
free space there as a base of operations. It’s easy to open up a new account,
hit the search engine with bogus pages, then move on once the spamming attempt
is detected.
At that time, it was more the internet access providers giving away space,
rather than portals. But not long after, portals jumped in themselves,
culminating I’d say with Yahoo’s $3.6 billion
acquisition
of GeoCities in 1999.
Search spam hosted on free web space had died down as a problem by that
point, however. Why? Search engines were largely ignoring these areas of the
web, and because these areas were ignored, they were no longer attractive
magnets for spammers. No one wants space that can’t be seen.
2005: Free Blogs & Splogs
Now let’s skip ahead to today. Well, more specifically to yesterday, when
Mark Cuban, who backs blog search engine
IceRocket, wrote in his
Get Your Blogspot
Shit Together Google post:
The blogosphere was hit by a blogspot.com splogbomb. Someone did the
inevitable and wrote a script that created blog after blog and post after
post.
I’m not talking 100 blogs with a 100 posts each. Im talking what could
easily turn into 10s of THOUSANDS of blogs pinging out millions of posts!
Do a search for HDNet on Icerocket.com or any of the other engines and look
at all the Splogs there are. And they have URLs like this So google, at least
for the time being, we shut out adding new blogspot posts to our index until
we clean all the bullshit you dumped on us out of our indexes.
Sound familiar? I mean, just change the names, and the result is the same.
Blogs are simply more sophisticated home pages, for many, as I’ve
written.
And splogs
(spam+blogs) are just more sophisticated home page spamming attempts.
Allow anyone to create content for free or with no real barriers, and
surprise, a few people will go to extremes and be abusive. Result? IceRocket no
longer trusts the free blog space that Google offers through
Blogger/Blogspot, in exactly the same way
that many search engines stopped trusting the free web space of the GeoCities of
the past.
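To make the "stop trusting a whole domain" idea concrete, here is a minimal sketch of how an indexer might shut out a free-hosting domain. It is only an illustration under stated assumptions: the names BLOCKED_HOST_SUFFIXES, is_blocked and accept_for_index are hypothetical, and nothing here reflects how IceRocket or any other engine actually implements its filtering.

```python
from urllib.parse import urlparse

# Hypothetical blocklist of free-hosting domains (an assumption for illustration).
BLOCKED_HOST_SUFFIXES = ("blogspot.com",)

def is_blocked(url, suffixes=BLOCKED_HOST_SUFFIXES):
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in suffixes)

def accept_for_index(urls):
    """Keep only the URLs whose hosts are not on the blocklist."""
    return [u for u in urls if not is_blocked(u)]

# Made-up URLs: one on the blocked free host, one on its own domain.
incoming = [
    "http://someblog.blogspot.com/2005/10/post.html",
    "http://example.com/blog/2005/10/post.html",
]
accepted = accept_for_index(incoming)
```

Note how blunt the rule is: every blog on the blocked host is dropped, innocent or not, which is exactly the roommate problem with shared domains.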
Google’s Failure To Police Or Post Barriers
Google: Kill Blogspot Already!!! from Chris Pirillo also went up Sunday. As
with Mark Cuban, Chris finds Blogger’s Blogspot-hosted blogs the chief culprit.
I don’t know what’s (specifically) making it so insanely easy for these
spammers to get signed into your system, but you need to change that….
Suggestion, Google? As bold as this might sound, you should institute an
authentication system – a captcha of sorts – for every single post that gets
sent through your Blogger service. This means that there’s no more easy rides
for the idiots out there who are killing your baby and the blogosphere.
Fair enough — some barriers to entry would help, either in setting up the
free space (captchas,
charging a token fee, whatever) in the first place or perhaps even in how people
are allowed to post.
Google’s certainly winning no points with me on this front. Back in June, I
wrote My
Encounter With Search Spam On Blogger, where I talked about someone who
lifted a description from the Search Engine Watch web site in a misleading
manner, the same person who had lots of other splogs going, as well. In addition
to writing about it, I also went through the formal reporting channels.
Nevertheless, there it sits still.
But Barriers Won’t Solve Issues For Blog Owners
Of course, it won’t help if only Google cleans things up. I haven’t checked,
so apologies if I’m mistaken, but I’m pretty sure that I can get going with free
space over at MSN Spaces and
Yahoo 360. Google’s Blogger is simply a
better-known service. Closing down abuse at Blogger would be great, but I suspect
that just means the abuse will move elsewhere.
For potential bloggers, I’m afraid my advice about free home pages from back
in 1997 will become just as applicable to free blogging space:
That may seem unfair [search engines ignoring free web pages], but when you
use free web space, it’s as if you have hundreds of roommates. They can get
the entire domain in trouble, and the police, or the search engines in this
case, may not care that you are innocent.
Ask your provider if there have been any problems with search engines
visiting free web pages. They should know if there are complaints, and they
should also be able to help resolve any problems. They have the ability to
direct large numbers of people toward the search engines, so it’s to the
advantage of the search engines to work with the providers.
If it’s crucial to be indexed, you may want to consider leaving the free
web space and going with a commercial hosting service.
In other words, get your own domain name. It has never, ever, ever, ever,
ever, ever been a good idea from a search marketing perspective to make use of
someone else’s domain name, as you are not in control of your own destiny.
I don’t care whether it is Blogger, MSN Spaces, Yahoo 360,
Typepad,
WordPress or anyone. If you make use of someone else’s domain name, you are
ultimately leaving yourself open to:
Don’t trust me? Don’t trust this fundamental bit of advice that I and
other search marketers have been giving for years, to have your own domain name?
Then consider that usability expert Jakob Nielsen said the same thing today in
Weblog Usability: The Top
Ten Design Mistakes. Tip number 10 is not to use a domain name owned by
another service. He talks about the issue of controlling your own destiny, as well
as being seen as an amateur and the problems of moving over to your own domain name
down the line.
Splogs & Searching Issues
How about the searcher side of things? Tara Calishain
found Google and Feedster most impacted by splogs, Technorati seeming more
resistant (probably in part, I suspect, because it actually spiders pages rather
than relies on feeds) and Yahoo getting by primarily because of the limited
feeds it covers.
Russell Beattie, like Chris Pirillo,
found his
PubSub feeds getting washed out with spam. I thought the comments below his post
were especially interesting, looking at fighting back on the Google AdSense
front. It’s an issue that’s come up before. Not only does Google host a bunch of
this junk content, but it also helps fuel it by letting people earn through AdSense.
Ranking By Time Magnifies Spam
Back to Mark Cuban: his post highlights one of the key issues that blog
search faces. Time ranking magnifies the spam problem.
The major search engines have plenty of spam in their indexes. You simply
don’t see this as much because searches are sorted by relevancy. What are deemed
the best pages across the entire web? Links are used to help calculate this, but
textual data on the page and in the links, along with many other factors, also
comes into play.
In contrast, blog search is largely ranked by time. Post something, send it
out in your feed, and boom — you’re at the top of the list! That is, until
someone else posts and pushes you back down.
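A small sketch makes the contrast visible. Everything in it is made up for illustration: the posts, the dates and the relevancy scores are hypothetical, and real engines compute relevancy from links, text and many other signals rather than reading it from a field.

```python
from datetime import datetime

# Hypothetical posts with invented relevancy scores (assumptions for illustration).
posts = [
    {"title": "In-depth HDNet review",
     "published": datetime(2005, 10, 15, 9, 0), "relevancy": 0.92},
    {"title": "HDNet schedule notes",
     "published": datetime(2005, 10, 16, 12, 0), "relevancy": 0.71},
    {"title": "hdnet hdnet cheap pills",
     "published": datetime(2005, 10, 17, 8, 30), "relevancy": 0.05},
]

def rank_by_time(items):
    """Newest first, the way blog search largely works."""
    return sorted(items, key=lambda p: p["published"], reverse=True)

def rank_by_relevancy(items):
    """Highest score first, the way web search largely works."""
    return sorted(items, key=lambda p: p["relevancy"], reverse=True)

newest_first = [p["title"] for p in rank_by_time(posts)]
best_first = [p["title"] for p in rank_by_relevancy(posts)]
```

The spam post is still in the index either way. Sorted by relevancy it sinks to the bottom; sorted by time it sits on top simply because it was posted last.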
How About Some Authority Mixed In?
Solution? How about ranking by time and also limiting matches to only quality
blogs? Ah, but you see, that’s what PubSub is supposed to be able to do. When you
create a feed over there, you can use the
Filtering By LinkRank
feature to limit matches to the top 1 percent, 2 percent, 5 percent, 10 percent or 25
percent of blogs (or technically, feeds).
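Here is a rough sketch of what "time plus an authority cutoff" could look like. It is not PubSub's LinkRank, whose workings aren't described here: the authority scores, feed names and the functions top_percent_feeds and search_recent are all hypothetical stand-ins.

```python
import math
from datetime import datetime

def top_percent_feeds(authority, percent):
    """Return the set of feeds in the top `percent` by authority score."""
    ranked = sorted(authority, key=authority.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return set(ranked[:keep])

def search_recent(posts, authority, percent):
    """Drop posts from feeds below the cutoff, then sort the rest newest first."""
    allowed = top_percent_feeds(authority, percent)
    hits = [p for p in posts if p["feed"] in allowed]
    return sorted(hits, key=lambda p: p["published"], reverse=True)

# Invented authority scores per feed (for example, some count of inbound links).
authority = {
    "trusted.example": 950,
    "decent.example": 400,
    "new.example": 12,
    "splog.example": 1,
}
posts = [
    {"feed": "trusted.example", "title": "Analysis", "published": datetime(2005, 10, 15)},
    {"feed": "decent.example", "title": "Follow-up", "published": datetime(2005, 10, 16)},
    {"feed": "splog.example", "title": "Spam", "published": datetime(2005, 10, 17)},
]
results = [p["title"] for p in search_recent(posts, authority, percent=50)]
```

With a 50 percent cutoff, the newest post (the spam) never makes the list, and the remaining matches are still ordered by time. The catch is visible too: the brand-new legitimate feed is shut out along with the splog.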
I’ve played with it a bit, and haven’t been impressed. I got a feed for
[google] and know I’ve limited it past the default (PubSub unfortunately doesn’t
show your setting after a feed is made). Nevertheless, most of the current
matches are coming from one site simply because the word "google"
appears in the "Ads By Google" links it carries.
Still, the idea is good, so perhaps it will improve at PubSub or another
service (Om Malik is
saying the
forthcoming Sphere will do this. Cool if
so, but I haven’t played with it yet, and we’ll all see).
News Search Is Great Because They Limit Sources
Over at Robert Scoble’s blog, his
The race to time-based and blog search post last week touches on exactly the
problem of mixing time and relevancy together. His view is that search engines
in general suck on the time-based aspect:
Let’s look at Yahoo, Google, and MSN first so you can see just how bad
those three are if you want to find something that was added to the Web
yesterday.
We have a great case study. Yesterday Microsoft and Real settled their
anti-trust case and announced a new partnership. It was written about on
hundreds of blogs and hundreds of "pro" news sources.
We also have today’s Apple announcements. So, let’s search on both of
those…
Robert goes on to be unimpressed at finding new stuff. But the reality is
that search engines are great at finding new stuff. That’s called news search.
And news search is great because the sources are limited. Not everyone gets in.
It may be that for blog search to be great, you have to have that same type of
limitation. More on this in my response to Robert’s post, which I’ve reprinted
below:
Let’s qualify. You mean how bad they are if you only look at the web search
results and ignore the onebox/shortcut displays they have.
In other words, do on Google or Yahoo, and at the top of the
pages, they show you plenty of news results. They aren’t behind in gathering
fresh data. They’re simply segregating it into the news area and giving you a
heads-up that it is there.
You’re either missing it or ignoring it because those top of the page
segments don’t feel "normal" to you. All I can say is that the search engines
are aware of that issue.
If you look at my
Invisible
Tabs article it talks about how at some point, the search engines need
to automatically push the right button or tab or link for you, to give you 10
news results for queries that obviously are news related. Or you do a shopping
search and you get all shopping results automatically.
FYI, Technorati’s out with some handy fresh numbers, finding that two to
eight percent of new blogs are
spam and noting that
this weekend’s problems may have been perceived as worse simply because spam is
targeting the names of people. Bloggers are big ego searchers, so if someone
targets your name, blog spam can seem worse.
Want to comment or discuss? Visit our
Search Engine Watch Forums!