Inktomi Spam Database Left Open To Public

Inktomi suffered an embarrassing lack of security last week, when it was discovered that a copy of its database of spam and porn sites was left open to the public.

Brett Tabke, who runs the Search Engine World web site, found the database through a search at the AllTheWeb search engine. That's ironic, given that AllTheWeb is run by FAST, which competes with Inktomi for search customers.

A search for "inktomi editorial database" at AllTheWeb brought up a link leading to Inktomi's page filtering system. Instructions on the system's home page allowed users to easily set up guest access to explore the system, or even to establish their own accounts.

Once inside, users could access Inktomi's "Hall of Spam," a list of web sites and URLs that Inktomi had banned or penalized for index abuse. The system's documentation also discussed other ways URLs might be classified, such as being on a protected "whitelist," and provided some additional details on what Inktomi considers spam.

Tabke revealed the existence of the database in his quarterly newsletter last week. Not surprisingly, Inktomi removed public access to the database soon after.

"It was not supposed to be open to the public, but obviously, for a period of time, it was. That has now been resolved. It was an internal system that we used for our service," said Troy Toman, Inktomi's general manager of search.

Toman added that the database left open was a six-month-old mirrored copy of the actual database. So, while people could create user accounts and alter information, that had no effect on the live Inktomi database, the company says.

Although the database is now closed to the public, Tabke's article covered enough details about its contents to cause concern among some of his readers, who posted comments at the Webmaster World discussion site that Tabke also operates.

Some of the upset aired over findings from the database is justified, but in other cases, it's hard to be outraged.

For example, there's the suggestion that Inktomi is being abusive by having a large database of spam URLs and a computerized system to track them. Tabke estimated that over 1 million URLs were on the list. So what? Spamming is a full-time, industrially-applied activity for some search engine optimization firms. It would almost be a bigger concern if Inktomi lacked some organized system to monitor and block spam. Nor would anyone in the SEO industry doubt that there is plenty of spam to catch. If anything, 1 million URLs seems low.

Inktomi said that the current system was created in early 2000, in order to comply with the Digital Millennium Copyright Act. That act requires search engines to remove links to illegal content, if they are notified. The system thus can classify a URL to be "legaldelete," meaning that it should be removed for legal reasons. However, the system was also designed to better manage spam control.

"We obviously have many ways of automatically detecting spam, but we realized that they don't capture everything, so this was for human knowledge," Toman said.

Another concern raised in the forums was the discovery that Inktomi will block the IP addresses of some companies suspected of spamming it. This has the effect of removing not just one URL the company may have but all of them.

This sounds like a draconian measure, but reviewing some of the entries in the Hall of Spam leaves one with little sympathy for many of those who were caught. For instance:

  • "AddURL stuffing, hundreds an hour. These pages appear to exist solely to generate traffic for his CD sales affiliate program."
  • "Buys expired domain names. Forwards to...a porn site depending on Referrer"
  • "Massive spamming operation, spamming keywords and mistyped domains of every kind."
  • "dnsspam royale: 20k+++ subdomains from these 94 domains."

The last example, involving DNS spam, refers to the practice of creating large numbers of subdomains. If your domain was site.com, you could establish a variety of subdomains off that domain without having to pay a registration fee for each new subdomain. For example, you could have:

books.site.com
movies.site.com
auctions.site.com

There is nothing wrong with having subdomains like this. In fact, there are good reasons for site owners to split up their sites this way. However, subdomains are also a cheap and easy way for spammers to create individual and distinct web sites, in order to try and dominate a search engine with listings.

Tabke's article said that "most 3rd level domains are considered DNS spam" by Inktomi. That's not the case. The guidelines posted with the spam database were pretty clear in explaining that subdomains were only considered spam if there was an incredibly large number of them AND if human review confirmed they'd been established as spam:

"Review domains with > 1000 subdomains," the database's guidelines stated. Inktomi also confirmed that subdomains, when not used for index stuffing, are perfectly fine.

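The review rule quoted above can be sketched in a few lines of Python. This is only an illustration, not Inktomi's code: the function name, the sample hostnames, and the simplification that a registered domain is the last two labels of a hostname are all assumptions. The sketch produces candidates for human review rather than bans, since the guidelines required a person to confirm the spam.

```python
from collections import defaultdict

# Threshold taken from the quoted guideline: "Review domains with > 1000 subdomains".
REVIEW_THRESHOLD = 1000


def domains_needing_review(hostnames, threshold=REVIEW_THRESHOLD):
    """Return {domain: subdomain_count} for domains above the threshold.

    Assumption: the registered domain is the last two labels of the hostname
    (site.com for books.site.com). Suffixes such as .co.uk would need a
    public-suffix list, which this sketch leaves out.
    """
    subdomains = defaultdict(set)
    for host in hostnames:
        labels = host.lower().strip(".").split(".")
        if len(labels) < 3:
            continue  # bare domain, no subdomain to count
        domain = ".".join(labels[-2:])
        subdomains[domain].add(".".join(labels[:-2]))
    # Flagging only queues a domain for a human reviewer; it is not a ban.
    return {d: len(s) for d, s in subdomains.items() if len(s) > threshold}


# Hypothetical data: three ordinary subdomains and 1,001 generated ones.
normal = ["books.site.com", "movies.site.com", "auctions.site.com"]
stuffed = [f"kw{i}.example.net" for i in range(1001)]
flagged = domains_needing_review(normal + stuffed)
print(flagged)
```

In this made-up data, site.com has only three subdomains and is left alone, while the generated example.net hosts push that domain over the threshold and into the review queue.
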
A more serious concern was the belief that Inktomi might be banning sites in order to force people to pay for listings. However, the first 94 entries in the Hall of Spam, all that could be seen when viewing it via the web, included several references to sites that were already using paid inclusion and were still caught out for spamming, such as:

  • "SEO doing massive spam campaign. 162 URLs in Paid inclusion."
  • "SEO...In BoW, Gigadoc and Addurl. Has thousands of Paid Inclusion URLs"
  • "SEO spammer. dnsspam. Submitting massive urls to addurl and lately to paid inclusion."

These entries don't suggest that spamming will be tolerated by Inktomi, as long as someone is paying to be listed. Quite the contrary, they suggest that paid inclusion is no protection against Inktomi's spam penalties. Inktomi's paid inclusion guidelines also make it pretty clear that content deemed to be spam will be removed.

"We've kicked out pages where people have decided to pay us," confirmed Toman.

But what about cloaking? There were several posts concerned about Inktomi's partnership with search engine optimization firm MediaDNA, which delivers cloaked content. MediaDNA especially focuses on helping sites that have content locked behind password protected areas and which cannot be spidered by search engines. Several were upset that Inktomi tolerates cloaking from MediaDNA but not from other companies.

That sounds pretty damning, but it's not true. Cloaking is not considered spam by Inktomi. The company operates a policy of allowing one cloaked page to link to one "real" page, as long as the cloaked page reflects the content of the real page. Any company can deliver relevant cloaked content to Inktomi through its paid inclusion programs. MediaDNA has no special exclusive.

In fact, a company can even deliver cloaked content without using Inktomi's paid inclusion program. As long as it is one cloaked page per real page, and as long as the content is considered relevant, then Inktomi doesn't view cloaking as spam.
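
The one-to-one part of this policy is easy to express as a check. The Python sketch below is illustrative only: the function name and the sample URLs are hypothetical, and it makes no attempt to judge relevance, which needs a look at the content itself.

```python
from collections import defaultdict


def real_pages_over_limit(cloaked_to_real):
    """Given {cloaked_url: real_url}, return real pages with more than one cloaked page."""
    by_real = defaultdict(list)
    for cloaked, real in cloaked_to_real.items():
        by_real[real].append(cloaked)
    return {real: sorted(pages) for real, pages in by_real.items() if len(pages) > 1}


# Hypothetical submissions: the gadgets page has two cloaked pages pointing at it.
submissions = {
    "http://example.com/c/widgets.html": "http://example.com/members/widgets",
    "http://example.com/c/gadgets-1.html": "http://example.com/members/gadgets",
    "http://example.com/c/gadgets-2.html": "http://example.com/members/gadgets",
}
violations = real_pages_over_limit(submissions)
print(violations)
```

Only the gadgets page would be reported here, because it is the only real page with more than one cloaked page attached.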

"You could take a policy to say cloaking is evil in every case, but there are ways to use cloaking that are actually legitimate, to make sure a search engine gets relevant information," Toman said.

By the way, Google does consider cloaking to be evil, in every case. If it detects it, expect your pages to be removed or penalized.

The Hall of Spam database also had a reference to another Inktomi paid inclusion partner, Position Technologies, seeming to report spam to Inktomi. This raised concerns that Inktomi might be trying to drive search engine optimization customers to particular companies, or that its partners might be using the paid inclusion system to gain a competitive advantage over other SEO companies.

"SEO spammer. Some paid inclusion....Reported to us by positiontech.com, our paid inclusion partner. Is positiontech gaming us?," one entry in the Hall of Spam read.

Tabke didn't speculate on what this entry meant in his article; he only noted it as interesting. However, within the forum, others speculated that this meant Position Tech was reporting SEO companies as spammers, apparently in an attempt to drum up its own business.

The problem with this is that Position Tech isn't an SEO firm. Instead, it creates tools that are used by SEO firms, so alienating them isn't in its interest.

For its part, Position Tech explained within the forum that the reference to "reporting" meant that the URLs came from the automated feed of paid inclusion URLs that Position Tech sends to Inktomi, not that they were reported as spam. People can buy paid inclusion in Inktomi via Position Tech, and those purchased URLs are reported to Inktomi each day. The company added that it doesn't report spam at all to Inktomi nor monitor URLs, unless they specifically make use of Position Tech optimization tools.

As for Inktomi's view:

"At some point, there were some Position Tech entries that caused us to question this and put that in the database," Toman said. "I think it's also a good example of how we are not afraid to question content that comes in from our partners, if they have content that skirts the issue of spam."

Toman also disputed that the database was being used to drive customers to use particular partners.

"We certainly didn't say, 'We have these three people as companies, so let's go get their competitors'," Toman said. "No one got into that database without trying to spam Inktomi."

Toman added that most spam reports came in from Inktomi's search partners, such as MSN Search or HotBot.

"Our partners will submit a query where the results have been spammed very badly," Toman said. "We look at the results, determine if they have been spammed. If that meets the definition, we have an option to demote the pages or just saying we don't want them in the index at all.

Another concern raised was that companies are being banned or penalized without knowing it. This is really only a problem for those who don't use paid inclusion. Those who do use paid inclusion are notified if their URLs are removed, Inktomi says.

As for those with free listings, it is entirely possible they might be banned without being informed. However, Inktomi is hardly unique in doing this. All the major crawlers ban sites for spamming, and none of them generally inform people when this happens nor post a list of sites that have been banned.

That always sounds frightening, but the reality is that the vast majority of site owners do nothing to get themselves banned. You have to really work hard to get a search engine to target your listings. That's why those that do get banned usually understand that they've done something wrong. They don't bother following up with the search engines about their drop in traffic because they know exactly why it has happened.

Should you think you've been banned for spam, the solution is to get in contact with the search engines. At Inktomi, you can do this via the new spamcrusader@inktomi.com address that I wrote about last issue. If you're innocent, you should have your pages restored. Indeed, one entry from Inktomi's Hall of Spam reflects this:

"This is a list of spammers' CUSTOMERS. Either they paid an SEO or they encourage affiliates to spam. Let's have a heart; if they call us mystified why they're not in the index, un-blacklist them."

Inktomi said it doesn't make its spam list public because the comments haven't been cleaned up to make sense to the average person.

"If I had the resources to fully document it and provide all the instructions, I could make that public," Toman said.

I suggested that perhaps a simpler Hall of Spam should be published, with just company names. That way, SEO firms that think they've done something wrong can easily confirm this. More importantly, potential SEO customers could examine the list to see exactly who is on it.

That's especially important, because there were at least two prominent names on the list that I reviewed, which list companies such as British Airways, Amazon, LL Bean and Lucent Technologies as their clients. Tabke, who managed to download a much longer list of the Hall of Spam, claimed that it read like a Who's Who of the SEO industry.

This leads into the bigger issue of what spam is. Comments in the Webmaster World forum have led to the often-repeated request from search engine optimization specialists that search engines spell out exactly what is acceptable, so that they can play by the rules.

Perhaps we'll get there, but as I've written before, the barrier to getting really detailed rules is the fear search engines have that doing so simply provides too much information about how they operate.

Of all the information I reviewed within the filter database, there was only one thing that really concerned me: a URL definition called "track-monetization." It was defined as:

"We want to know how much money we can make from a potential monetization customer before we appoach them for a potential monetization deal."

Using this classification, Inktomi will tag existing, freely listed URLs from major web sites that they feel could be paid inclusion partners, then approach them about entering the program, the company says. The pitch is then that by using paid inclusion, they could get even more listings or have their listings updated more frequently, thus getting good traffic.

Should the company not wish to participate, Inktomi says that their sites will continue to be indexed via the regular crawling mechanisms.

Here are some other things from the database that may be of interest.

Inktomi also maintains a "whitelist" of URLs. These are not URLs that can get a "free pass" for spam, as was suggested in the forum. Well, not entirely. These are URLs that were reviewed as potentially being spam and then determined not to be. They have been placed on the list so that they don't accidentally get banned by the automatic spam detection system. Similarly, entire sites such as CNN might be on the whitelist. This is to ensure that if they have content mentioning things such as "breasts" in an article, they aren't accidentally tagged as porn.

While whitelist URLs are protected from automatic spam filters, that protection is not eternal or sacrosanct.

"You can never do anything you want to, because that list is regularly reviewed and updated," Toman said. "If someone has been identified as not being spam and then later tries to spam us, we're going to find that out."

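The way a whitelist can shield a page from an automatic filter, yet stop shielding it after a later review, can be sketched as follows. This is a hypothetical Python illustration, not Inktomi's system: the function name, the "keep" and "penalize" labels, and the URLs are all assumptions.

```python
def spam_decision(url, whitelist, auto_flagged):
    """Return 'keep' or 'penalize' for a URL.

    whitelist: URLs a human reviewed and cleared.
    auto_flagged: URLs an automatic filter (hypothetical here) marked as spam.
    """
    if url in whitelist:
        return "keep"  # human clearance overrides the automatic filter
    return "penalize" if url in auto_flagged else "keep"


article = "http://news.example.com/health/breast-cancer.html"
whitelist = {article}
auto_flagged = {article, "http://spam.example.net/doorway1.html"}

before = spam_decision(article, whitelist, auto_flagged)

# A later review decides the page no longer deserves protection and removes it.
whitelist.discard(article)
after = spam_decision(article, whitelist, auto_flagged)
print(before, after)
```

The point of the sketch is the ordering: the whitelist is consulted first, so protection lasts only as long as the entry survives review.
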
Inktomi also makes reference to "LinkFlux" spam. Toman says LinkFlux is simply Inktomi's internal name for the algorithm that determines the weight of links, sort of its equivalent of Google's PageRank. LinkFlux spam refers to those who create artificial links in hopes of manipulating Inktomi's link analysis system in their favor.

Finally, Inktomi's not the only search engine to have private information leak out recently. Last month, confidential protocols explaining how Google's partners can control Google's search results were posted online. Linked from the Need To Know site, Google's search results protocols are no longer available to the public.

The Inktomi Papers
Search Engine World Quarterly, Sept. 12, 2001
http://www.searchengineworld.com/newsletter/2001/

Tabke's article appeared at the URL above but has been pulled. He originally wrote it without any comments from Inktomi, not wanting to contact them before publishing, for fear they'd pull the database. Now that the database has been closed, Tabke has pulled his article until Inktomi's responses can be added.

Webmaster World Discussion Thread
http://www.webmasterworld.com/forum5/829.htm

This is where the Inktomi article has been discussed by Tabke's readers and others.

The ambiguous Inktomi anti-spam policy
Pandia, Sept. 14, 2001
http://www.pandia.com/sw-2001/54-mediadna.html

Covers Tabke's article and touches on some of the ways discussed in the database about how spam is tracked down, including by monitoring SEO discussion forums.

How Inktomi Works
http://searchenginewatch.com/subscribers/inktomi.html

Explains how the Inktomi search engine operates, including details on paid inclusion programs and partners, available to Search Engine Watch members.

Pay For Placement?
http://searchenginewatch.com/resources/paid-listings.html

Extensive past coverage on paid inclusion issues can be found here.

Inktomi Content Policy - Spam Removal Guidelines
http://www.inktomi.com/products/search/content_policy.html

Inktomi posted these guidelines fairly recently, and they make clear that cloaking is OK.

FAQs about Inktomi Search/Submit
http://www.positiontech.com/inktomi/faq.htm

Item 21 explains that Inktomi allows one doorway page per "real" page and doesn't ban them from being cloaked.

Need To Know #20, Aug. 3, 2001
http://www.ntk.net/index.cgi?back=2001/now0803.txt

The reference to Google's confidential information is listed in the Anti-News section.