Bill Slawski put up an interesting post over the weekend titled “Can Web Search Use Wikipedia to Understand References to Names?” Bill references a paper by Microsoft researcher Silviu Cucerzan. The gist of the paper is that search engines can use Wikipedia as a cross-referencing source: when a search engine sees a name like “Bush” in a document, Wikipedia can help it work out which Bush is being referred to (George W. Bush, his father, Reggie Bush, or whatever).
In principle, what the paper discusses is how the context in which a name is used in a web document can be compared to the context in which that name is used on Wikipedia. Simplistically put, if the reference to “Bush” appears on a site about the New Orleans Saints, the likelihood that it’s about Reggie Bush is quite high. The search engine can use an external reference source, such as Wikipedia, as a method of validation, by trying the various Wikipedia pages for people with the last name Bush and noting the references in common.
For example, the Wikipedia page and the web page being analyzed probably both use phrases like New Orleans Saints, football, running back, etc. By developing this sense of context, the web page being analyzed can be more properly classified, even if the page never uses the running back’s full name. So if the user searches on Reggie Bush, the search engine will know that the particular web page can be considered as relevant to the query.
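To make the idea concrete, here is a toy sketch of that kind of context comparison. It is not the method from the paper, just a simple bag-of-words cosine similarity, and everything in it is an assumption made for illustration: the `CANDIDATES` text is a made-up stand-in for the real Wikipedia pages, and the names `bag_of_words`, `cosine`, and `disambiguate` are hypothetical.

```python
import math
import re
from collections import Counter

# Made-up stand-ins for the text of candidate Wikipedia pages (assumption:
# a real system would use the actual page contents, not these snippets).
CANDIDATES = {
    "George W. Bush": "president united states texas governor white house republican",
    "George H. W. Bush": "president united states vice president reagan gulf war",
    "Reggie Bush": "running back football new orleans saints usc heisman nfl",
}

# A tiny stopword list so common words do not count as shared context.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "to", "on", "his", "with"}


def bag_of_words(text):
    """Lowercase the text, split it into words, and count the non-stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(page_text, candidates=CANDIDATES):
    """Return the candidate whose context best matches the page, plus all scores."""
    page = bag_of_words(page_text)
    scores = {
        name: cosine(page, bag_of_words(text))
        for name, text in candidates.items()
    }
    return max(scores, key=scores.get), scores


# A page that only ever says "Bush", never the full name.
page = (
    "Bush rushed for two touchdowns as the Saints won in New Orleans. "
    "The running back looked sharp all afternoon."
)
best, scores = disambiguate(page)
print(best)
```

Here the page shares words like “Saints”, “New Orleans”, and “running back” with only one candidate, so that candidate scores highest. A real search engine would need far richer signals than word overlap, but the shape of the comparison is the same.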
It makes for interesting reading, and provides some insight into the types of analysis that search engines perform. What makes this even more daunting to think about is that this is just one example of the thousands of such scenarios that search engines deal with. It’s a complicated process, indeed.