Word or Phrase Co-Occurrence within Particular Industries

This week at SES San Jose I had several great conversations with delegates and fellow speakers. One hot topic of discussion was the occurrence percentage of specific words for specific industries. A document about auto racing, for example, should typically be expected to include the word “tire” a certain number of times.

Back in the summer of 2004, Search Engine Watch Forums moderator Dr. Edel “Orion” Garcia introduced himself to the SEW community with an excellent discussion about keyword co-occurrence and semantic connectivity. A more recent thread revisits the subject at Webmaster World Forums. Marcia Welter succinctly defines what search engines are trying to do when analyzing thousands of semantically-related documents:

“…statistics on co-occurrence patterns are used to relate clusters of terms/phrases into coherent ‘themes’ and make predictions based on those statistics.”
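
To make the idea of co-occurrence statistics concrete, here is a minimal Python sketch. It is an illustration only, not a description of how any search engine actually works: the function names, the simple tokenizer, and the three sample documents are all assumptions made up for this example. It counts how many documents contain both terms of each word pair.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic words only.
    return re.findall(r"[a-z']+", text.lower())

def cooccurrence_counts(documents):
    """For each unordered pair of terms, count the documents containing both."""
    pair_counts = Counter()
    for doc in documents:
        # Sort the unique terms so each pair is always stored in the same order.
        terms = sorted(set(tokenize(doc)))
        pair_counts.update(combinations(terms, 2))
    return pair_counts

# Made-up sample documents: two about racing, one about baking.
docs = [
    "the driver changed a tire during the race",
    "a flat tire cost the driver the race",
    "the recipe calls for flour and butter",
]

counts = cooccurrence_counts(docs)
print(counts[("race", "tire")])   # 2
print(counts[("flour", "tire")])  # 0
```

Pairs that show up together across many documents, such as "race" and "tire" here, are the raw material for grouping terms into themes. A real system would presumably work at a vastly larger scale and with far more sophisticated weighting.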

Are Search Engines Smarter Now than 4 Years Ago?

I certainly hope so. Many people have maligned search engines as being either unwilling to really tackle latent semantic indexing because of the processor-intensive requirement, or unable to because of many technological and taxonomical limitations. There are too many words that could mean multiple things, and so many Web sites have at least some content that is visible to humans but not to text browsers.

Look at the contextual system that allows Google and Yahoo to place ads on sites dynamically. Thanks to a brief analysis of the content, the search engines clearly have the ability to map ads to topics. The real question is how the engines use this type of data for their organic algorithms.

If two documents have the same number of links pointed to them, and have relatively equal content and trust value, could the addition of particular words that are more likely to occur in topical content be the difference-maker that drives a higher ranking?

Spammers are Testing this Every Day

What got me thinking more about this was the continued appearance within blog search engines of very peculiar content when I searched for my name. There are some other folks named Chris Boggs in the world, so I don’t mean content about them. Rather, I find that my name is often used in strings of complete gibberish that looks like it was created by software that just throws strings of unrelated words into paragraphs.

Many of you have probably seen auto-generated content before, either in blogs or even e-mails that slip through spam filters. There are some recent studies as to why this helps to evade the filtering mechanisms employed by e-mail protection software, and Tulsa World recently described the new tactic of using more benign titles and content leading to higher open rates for e-mail spammers. So for example, an e-mail with the subject line “Oil Drops to $100 A Barrel” is opened much more often than “Cheap Viagra Online.”

The e-mail tactic is closely related to search engine spammers trying to get content ranked organically. There are many reasons to be ranked for long-tail terms as well as one-word terms, but there are probably two main reasons this tactic is being employed: to drive traffic to hosted ads such as Google AdWords, or to act as an authority site and pass link value.

When analyzing the strange content for my name, I found that most of the sites were search marketing-related and included contextual advertising. One particular site, undergroundtraininglab.info, has about six or seven pages with the same indexed content, each with different page titles related to search. (Warning: Don’t try to research that site unless you have strong software to protect your computer. There may be malware, and in any case there are strange redirects going on when I attempt to click through to the Technorati-indexed pages.)

My initial theory is that there’s a way to leverage specific words within your content that can help you increase the chances of that content ranking well for a semantic family of searches. There may be a tremendous opportunity in researching common industry words specific to whatever site you’re working on to optimize for organic search. If these words or phrases are identified, they should be included within the appropriate pages of optimized content, in a non-obtrusive and still user-friendly manner.
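
One way to start researching common industry words is sketched below in Python. This is a rough illustration under stated assumptions, not a tested ranking technique: the stopword list, the `min_share` threshold, the function names, and the sample racing documents are all hypothetical. It finds terms that appear in a large share of sample documents from an industry, then lists which of those terms a given page does not yet use.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "on", "is", "with"}

def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic words only.
    return re.findall(r"[a-z']+", text.lower())

def common_industry_terms(industry_docs, min_share=0.5):
    """Return terms found in at least min_share of the sample documents."""
    doc_freq = Counter()
    for doc in industry_docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(tokenize(doc)) - STOPWORDS)
    threshold = min_share * len(industry_docs)
    return {term for term, n in doc_freq.items() if n >= threshold}

def missing_terms(page_text, industry_terms):
    """List the industry terms that the page does not use."""
    return sorted(industry_terms - set(tokenize(page_text)))

# Made-up sample of industry documents.
racing_docs = [
    "The driver pitted for a new tire on lap 40.",
    "Tire wear decided the race for every driver.",
    "A late pit stop and a fresh tire won the race.",
]

terms = common_industry_terms(racing_docs, min_share=0.6)
print(sorted(terms))  # ['driver', 'race', 'tire']
print(missing_terms("Our driver finished the race in first place.", terms))  # ['tire']
```

The output is only a list of candidates. Whether adding a missing term makes sense for readers is still an editorial call, and whether it helps rankings is exactly the open question.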

I hope to hear from others who have tested this theory, or who plan to, at the SEW Forums thread dedicated to this topic: Searching for Common Industry Words.

Frank Watson Fires Back

The autobots seem to be taking over. But I think that they force the search engines to improve their spam filters. Now if they only applied that throughout the Web, we’d have much better information.

I’ve had conversations recently with Mike Grehan about the Web and how the crawler is no longer effective. He points out that Google only sees about 30 percent of all new content created each day.

Combine that with the comment Pat Sexton made this week that well over 10 million people have found Facebook’s SuperPoke! app without using Google, and you can start to see a way of communicating information that doesn’t rely on a search engine.

As industry leaders we need to be aware of the implications.
