Image Credit: Simon Heseltine
Designing a search engine is no easy task.
Should a search engine filter content such as violent images, pornography, malware, spam, personally identifiable information, hate content, hacking instructions, bomb-making instructions, pro-anorexia sites, Satanism and witchcraft, necrophilia, content farms, and black hat SEO firms from its search results?
Welcome to the slippery slope Google and many other search engines must navigate when creating algorithms to crawl the web. With 60 trillion URLs on the web, a search engine needs to make many tough calls on content to ensure users get what they’re looking for.
SES San Francisco kicked off today with a keynote featuring Google’s Distinguished Engineer Matt Cutts and Patrick Thomas, a specialist on Google’s User Policy team, who walked through several difficult decisions Google’s search team has faced recently and through the years.
The keynote began with some news about backlinks in Google Webmaster Tools. You can read more about that here.
As Thomas explained, there is plenty of great content. While some decisions are straightforward, every search engine must determine where to draw the line.
In Google’s case, the search team has a cheat sheet of sorts: keep results as comprehensive as possible, keep removals to a minimum, rely on algorithms over manual action, help users avoid identity theft, and avoid pushing offensive content to users who didn’t specifically search for it.
“We want to be a little careful not to shock and offend you,” Thomas said. “We try to balance what our users search for and what’s on web, but not offend users.”
The web is the sum of human knowledge and experience. You’re going to come across a lot of controversial content, Thomas said.
Every website believes it should rank number one, so Google realized that it must minimize the manual influence, Cutts said. For instance, everyone will disagree on the definition of spam. After all, one man’s spam is another’s great marketing strategy, just as one man’s censorship is another’s law, and one man’s censorship is someone else’s corporate policy.
What exactly is a content farm? Different people have different lines.
In the case of removing pages from its index due to defamatory content, Google must deal with issues of he said, she said, or even this country said, that country said. When everyone is alleging everyone is doing something wrong, what do you do?
Discussion then turned to the worst of the worst: Explicitly sexual content, hate speech, and violent content.
One sticky issue is, what exactly is pornography? Is it a woman giving a self-breast exam? Or a painting of a nude woman? Or is it more “edgy” than this?
In the case of filtering porn, Cutts offered a test: if a 10-year-old Cub Scout were searching Google while his mom was watching, would she be shocked by Google’s results? Maybe a bikini wouldn’t be the end of the world, but Google definitely doesn’t want to show explicit sexual content for terms such as “breast,” as most searchers might want results about breast cancer.
Then there are hate sites. If you remove a hate site from the search results, it’s still on the web. And if you remove hate sites with an algorithm, what happens to the anti-hate sites, most of which share much of the same language as the hate sites?
Then there’s violent content. They cited an example, which ultimately wasn’t censored, of a picture from Syria featuring piles of dead bodies. Thomas said it was “extremely newsworthy, probably coming from primary sources on ground. But you probably wouldn’t see this on nightly news broadcast.”
Another controversial topic Google deals with involves autocomplete search suggestions. Cutts said the raw material for Google’s suggestions is what people are typing in, combined with content on the web that supports such terms, as in the case of a “Bernie Madoff (or Allen Stanford) Ponzi scheme” search suggestion.
Most recently, Google found itself internally debating censoring autocomplete suggestions when the Boston bombing happened earlier this year. “Used pressure cooker bomb” became a top suggestion.
Google as a search engine indexes the web, and sometimes influences the web, whether it’s pushing the importance of fast page speed or quality content through algorithmic updates like Panda. There may never be easy answers or universal agreement on the best way to design a search engine; as this keynote session highlighted, the choices are rarely black and white.
After the session, I interviewed Cutts. Here’s what he had to say about the new Webmaster Tools data and more: