Problems With Splogs & Time-Based Searching

How the wheel turns. Back in the 1990s, portals gave away free home pages and
with them came a huge amount of search engine spam. Today, portals like Google,
Yahoo, MSN and AOL give away free blog space — and lo and behold, we have blog
spam that apparently hit a new high with a blog spam emergency this weekend, as
Tim Bray writes. The blogosphere has been buzzing with discussion on the problem.

As for myself, I just continue to shake my head that these types of spam issues
with blogs
simply weren’t expected. The solution? It’s likely going to be just like what
happened with free web space — free blog space will get ignored by search
engines.

Come along, and we’ll do a tour of past and present, plus a look at the
issues you get when you try to maintain quality while also ranking search results
by time.

1997: Free Home Pages & Spam

First the past. It was around 1997 when free web space for personal home
pages seemed to become more accessible to many people. I remember it well,
because soon after, I started getting complaints from those making use of these
services. They found that search engines weren’t finding all of their pages. Some
discovered that none of their pages got indexed at all.

It got so bad that eventually, I had to do an article about it.

Search Engines And Free Web Pages still floats around in the SEW Archives area
for our SEW members. Here’s the top of it, which will sound really familiar when
I start talking about blog spam later on:

Many people take advantage of free web space provided by their internet
access providers. What they don’t realize is that search engines have shown a
tendency to miss or even ignore certain sites. Complaints have been heard from
those using space provided by America Online, CompuServe and other places.

Indeed, AltaVista no longer even accepts submissions from Tripod, a popular
web service that provides free space. Why? Search engine spammers were using
free space there as a base of operations. It’s easy to open up a new account,
hit the search engine with bogus pages, then move on once the spamming attempt
is detected.

At that time, it was more the internet access providers giving away space,
rather than portals. But not long after, portals jumped in themselves,
culminating, I’d say, with Yahoo’s $3.6 billion acquisition of GeoCities in 1999.

Search spam hosted on free web space had died down as a problem by that
point, however. Why? Search engines were largely ignoring these areas of the
web, and because they were ignored, they were no longer attractive magnets for
spammers. No one wants space that can’t be seen.

2005: Free Blogs & Splogs

Now let’s skip ahead to today. Well, more specifically to yesterday, when
Mark Cuban, who backs blog search engine IceRocket, wrote in his Get Your
Blogspot Shit Together Google post:

The blogosphere was hit by a blogspot.com splogbomb. Someone did the
inevitable and wrote a script that created blog after blog and post after
post.

I’m not talking 100 blogs with a 100 posts each. Im talking what could
easily turn into 10s of THOUSANDS of blogs pinging out millions of posts!

Do a search for HDNet on Icerocket.com or any of the other engines and look
at all the Splogs there are. And they have URLs like this So google, at least
for the time being, we shut out adding new blogspot posts to our index until
we clean all the bullshit you dumped on us out of our indexes.

Sound familiar? I mean, just change the names, and the result is the same.
Blogs are simply more sophisticated home pages, for many, as I’ve written. And
splogs (spam+blogs) are just more sophisticated home page spamming attempts.

Allow anyone to create content for free or with no real barriers, and
surprise, a few people will go to extremes and be abusive. Result? IceRocket no
longer trusts the free blog space that Google offers through
Blogger/Blogspot, in the same exact way
that many search engines stopped trusting the free web space of the GeoCities of
the past.

Google’s Failure To Police Or Post Barriers


Google: Kill Blogspot Already!!!
from Chris Pirillo also went up Sunday. As
with Mark Cuban, Chris finds Blogger’s Blogspot-hosted blogs the chief culprit.

I don’t know what’s (specifically) making it so insanely easy for these
spammers to get signed into your system, but you need to change that….

Suggestion, Google? As bold as this might sound, you should institute an
authentication system – a captcha of sorts – for every single post that gets
sent through your Blogger service. This means that there’s no more easy rides
for the idiots out there who are killing your baby and the blogosphere.

Fair enough. Some barriers to entry would help, either in setting up the free
space in the first place (captchas, charging a token fee, whatever) or perhaps
even in how people are allowed to post.

Google’s certainly winning no points with me on this front. Back in June, I
wrote My Encounter With Search Spam On Blogger, where I talked about someone who
lifted a description from the Search Engine Watch web site in a misleading
manner, the same person who had lots of other splogs going, as well. In addition
to writing about it, I also went through the formal reporting channels.
Nevertheless, the splog still sits there.

But Barriers Won’t Solve Issues For Blog Owners

Of course, it won’t help if only Google cleans things up. I haven’t checked,
so apologies if I’m mistaken, but I’m pretty sure that I can get going with free
space over at MSN Spaces and Yahoo 360. Google’s Blogger is simply a more
well-known service. Closing down abuse at Blogger would be great, but I suspect
that just means the abuse will move elsewhere.

For potential bloggers, I’m afraid my advice about free home pages from back
in 1997 will become just as applicable to free blogging space:

That may seem unfair [search engines ignoring free web pages], but when you
use free web space, it’s as if you have hundreds of roommates. They can get
the entire domain in trouble, and the police, or the search engines in this
case, may not care that you are innocent.

Ask your provider if there have been any problems with search engines
visiting free web pages. They should know if there are complaints, and they
should also be able to help resolve any problems. They have the ability to
direct large numbers of people toward the search engines, so it’s to the
advantage of the search engines to work with the providers.

If it’s crucial to be indexed, you may want to consider leaving the free
web space and going with a commercial hosting service.

In other words, get your own domain name. It has never, ever, ever, ever,
ever, ever been a good idea from a search marketing perspective to make use of
someone else’s domain name, as you are not in control of your own destiny.

I don’t care whether it is Blogger, MSN Spaces, Yahoo 360, Typepad, WordPress
or anyone. If you make use of someone else’s domain name, you are
ultimately leaving yourself open to:

  • Bad "neighbors" also sharing the domain name causing you trouble with
    search engines
  • Your domain name landlord down the line potentially taking away the house
    where you live

Don’t trust me? Don’t trust this fundamental bit of advice that I and
other search marketers have been giving for years, to have your own domain name?
Then consider that usability expert Jakob Nielsen said the same thing today in
Weblog Usability: The Top Ten Design Mistakes. Tip number 10 is not to use a
domain name owned by another service. He talks about the issue of controlling
your own destiny, as well as being seen as an amateur and the problems in moving
over to your own domain name down the line.

Splogs & Searching Issues

How about the searcher side of things? Tara Calishain found Google and Feedster
most impacted by splogs, Technorati seeming more resistant (probably in part, I
suspect, because it actually spiders pages rather than relying on feeds) and
Yahoo getting by primarily because of the limited feeds it covers.

Russell Beattie, like Chris Pirillo, found his PubSub feeds getting washed out
with spam. I thought the comments below his post were especially interesting,
looking at fighting back on the Google AdSense front. It’s an issue that’s come
up before. Not only does Google host a bunch of this junk content, but it also
helps fuel it by letting people earn through AdSense.

Ranking By Time Magnifies Spam

Back to Mark Cuban. His post highlights one of the key issues that blog
search faces: time ranking magnifies the spam problem.

The major search engines have plenty of spam in their indexes. You simply
don’t see this as much because searches are sorted by relevancy. What are deemed
the best pages across the entire web? Links are used to help calculate this, but
textual data on the page and in the links, along with many other factors, also
comes into play.

In contrast, blog search is largely ranked by time. Post something, send it
out in your feed, and boom — you’re at the top of the list! That is, until
someone else posts and pushes you back down.
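
To make the contrast concrete, here is a toy sketch in Python. Everything in it is invented for illustration: the URLs, the relevance scores and the timestamps are hypothetical, and no real engine ranks this simply. It only shows how the same four posts come out when sorted by relevancy versus by time.

```python
from dataclasses import dataclass


@dataclass
class Post:
    url: str
    relevance: float  # hypothetical score built from links, text and other signals
    posted_at: int    # hypothetical timestamp; larger means newer
    is_spam: bool


# Invented sample data: two real posts, then two splog posts pinged out later.
posts = [
    Post("example.com/hdnet-review", 0.9, 100, False),
    Post("example.org/hdnet-news", 0.7, 140, False),
    Post("splog1.example.net/hdnet", 0.1, 150, True),
    Post("splog2.example.net/hdnet", 0.1, 160, True),
]


def rank_by_relevance(items):
    """Best-scoring posts first, regardless of age."""
    return sorted(items, key=lambda p: p.relevance, reverse=True)


def rank_by_time(items):
    """Newest posts first, regardless of quality."""
    return sorted(items, key=lambda p: p.posted_at, reverse=True)


top_relevance = rank_by_relevance(posts)[:2]
top_time = rank_by_time(posts)[:2]

print([p.url for p in top_relevance])
print([p.url for p in top_time])
```

The spam is in the index either way. Sorted by relevancy, it sinks out of sight; sorted by time, whoever posted last wins, and a script can always post last.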

How About Some Authority Mixed In?

Solution? How about ranking by time and also limiting matches to only quality
blogs? Ah, but you see, that’s what PubSub is supposed to be able to do. When you
create a feed over there, you can use the Filtering By LinkRank feature to limit
matches to the top 1 percent, 2 percent, 5 percent, 10 percent or 25 percent of
blogs (or technically, feeds).
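
As a rough sketch of the general idea, in Python: keep only posts from the top slice of feeds by some authority score, then sort what survives by time. This is not PubSub’s actual LinkRank, which I haven’t seen the insides of. The authority scores, the feed names and the function top_percent_by_time are all hypothetical.

```python
import math


def top_percent_by_time(posts, authority, percent):
    """Return posts from the top `percent` of feeds, newest first.

    posts:     list of (feed, title, posted_at) tuples
    authority: dict mapping feed -> score (hypothetical, e.g. an inbound link count)
    percent:   share of feeds to keep, such as 1, 2, 5, 10 or 25
    """
    # Rank feeds from most to least authoritative.
    ranked_feeds = sorted(authority, key=authority.get, reverse=True)
    # Always keep at least one feed, rounding the cutoff up.
    keep_count = max(1, math.ceil(len(ranked_feeds) * percent / 100))
    trusted = set(ranked_feeds[:keep_count])
    # Drop posts from untrusted feeds, then rank the rest by time.
    kept = [p for p in posts if p[0] in trusted]
    return sorted(kept, key=lambda p: p[2], reverse=True)


# Invented sample data for illustration only.
authority = {"a": 500, "b": 120, "c": 40, "d": 0}
posts = [
    ("d", "splog post", 300),
    ("a", "real post", 200),
    ("b", "other post", 250),
]

top_quarter = top_percent_by_time(posts, authority, 25)
top_half = top_percent_by_time(posts, authority, 50)
print(top_quarter)
print(top_half)
```

Note the catch: the filter is only as good as the authority score behind it. If a junk feed manages to score well, it sails right through and lands on top again by posting often.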

I’ve played with it a bit, and haven’t been impressed. I got a feed for
[google] and know I’ve limited it past the default (PubSub unfortunately doesn’t
show your setting after a feed is made). Nevertheless, most of the current
matches are coming from one site, simply because the word "google" appears in
the "Ads By Google" links it carries.

Still, the idea is good, so perhaps it will improve at PubSub or another
service (Om Malik’s saying the forthcoming Sphere will do this. Cool if so, but
I haven’t played with it yet, and we’ll all see).

News Search Is Great Because They Limit Sources

Over at Robert Scoble’s blog, his The race to time-based and blog search post
last week touches on exactly the problem of mixing time and relevancy together.
His view is that search engines in general suck on the time-based aspect:

Let’s look at Yahoo, Google, and MSN first so you can see just how bad
those three are if you want to find something that was added to the Web
yesterday.

We have a great case study. Yesterday Microsoft and Real settled their
anti-trust case and announced a new partnership. It was written about on
hundreds of blogs and hundreds of "pro" news sources.

We also have today’s Apple announcements. So, let’s search on both of
those…

Robert goes on to be unimpressed at finding new stuff. But the reality is
that search engines are great at finding new stuff. That’s called news search.
And news search is great because the sources are limited. Not everyone gets in.
It may be that for blog search to be great, you have to have that same type of
limitation. More on this in my response to Robert’s post, which I’ve reprinted
below:

Let’s qualify. You mean how bad they are if you only look at the web search
results and ignore the onebox/shortcut displays they have.

In other words, do those searches on Google or Yahoo, and at the top of the
pages, they show you plenty of news results. They aren’t behind in gathering
fresh data. They’re simply segregating it into the news area and giving you a
heads-up that it is there.

You’re either missing it or ignoring it because those top of the page
segments don’t feel "normal" to you. All I can say is that the search engines
are aware of that issue.

If you look at my Invisible Tabs article, it talks about how at some point, the
search engines need
to automatically push the right button or tab or link for you, to give you 10
news results for queries that obviously are news related. Or you do a shopping
search and you get all shopping results automatically.

The problem is the search engines are frightened about making such a
change. If they get it wrong, they may lose people. So they are slowly letting
vertical listings creep in this way.

Remember, web search is NOT a time based activity. Honestly. Think about
it. The last time you did a web search for something new, you weren’t looking
for the best overall site on the subject, were you? No, you wanted the latest,
timely information. You wanted news. They give you excellent news through news
search engines. And Yahoo, among the majors, as you know just started
incorporating blogs as a news source, as well.

Overall, Robert, I think the posts you are doing on search are great in
raising the issues out there and helping push for further UI changes that need
to happen. But I think it would also help to point out some of the features
that do exactly what you want, when they exist. IE - everyone, you want timely
info? news.google.com, news.yahoo.com are great places to go.

As for your blog search problem, yeah, I know that well. It’s why I
don’t depend on blog search much. I get timely, but I also get all the crud.
PubSub tries to solve this by picking the most authoritative blogs, but I
haven’t found that’s really solved the problem much.

Ultimately, it will probably come down to blog search further refining
this, letting you search by default against a set of hand selected or some
other method filtered blogs, to cut out all the spam - and you can go further
across all the blogs if you want. But when there are simply so many blogs out
there, a good chunk of them splogs and so on, you’ve got to have some
filtering. THAT’s why news search works so well, because the vertical sites
allowed in there are reviewed.

FYI, Technorati’s out with some handy fresh numbers, finding that two to
eight percent of new blogs are spam, but it notes this weekend’s problems may
have been perceived as worse simply because the spam is targeting the names of
people. Bloggers are big ego searchers, so if someone targets your name, blog
spam can seem worse.
