InsideGoogle points us to the FirstMonday article: Internet time and the reliability of search engines.
From the article, “A large part of the problem comes from the fact that a page might have many dates. Some search engines (AltaVista and Google) can be used to search for information from certain periods of time. However, these “date stamps” are not determined by the first occurrence of these pages in the Web, but by the last date at which a page was updated. The “same” Web page may therefore belong to the year 1995 in a data set collected in 2003, while in a data set collected in 2004 it belongs to the year 2003. If used to search with historical dates, search engines represent the results of interacting frequencies of the updating of Web pages and search engine crawlers, and not necessarily the dates of publication of the documents under study.”
The article concludes, “This has major consequences for the use of search engines in social science research. In short, search engines are unreliable tools for data collection for research that aims to reconstruct the historical record or for research that aims to analyze the structure of information at a particular moment in history.”
Issues surrounding date searching with general purpose web engines are not new. Back in 2002, Genie Tyburski and I co-authored It’s Tough to Get a Good Date with a Search Engine, where we touch on several of them.
Just what the “date” of a web page is remains a major issue. Which date would be of greatest interest to the searcher? Is it the date the page was first crawled? The date the page was first written? First posted? Last updated or changed? What about pages that “disappear” from the index and then get “recrawled” at a “later” date? If a standard were agreed upon, would it be applied properly?
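To make the article's 1995/2003 example concrete, here is a minimal Python sketch. It assumes an engine that stamps each page with the last-modified date its crawler saw, as the article describes. The `Snapshot` class, the `date_stamp_year` function, the URL, and the specific dates are all hypothetical and chosen only for illustration; this is not how any particular engine is implemented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Snapshot:
    """What a hypothetical crawler recorded for one page on one visit."""
    url: str
    crawled_on: date
    last_modified: date  # assumption: the only page date the crawler observes


def date_stamp_year(snapshot: Snapshot) -> int:
    """Year a date-limited search would assign, assuming last-modified is used."""
    return snapshot.last_modified.year


# Hypothetical page: first published in 1995, then edited once in late 2003.
first_published = date(1995, 6, 1)
crawl_2003 = Snapshot("http://example.org/page", date(2003, 3, 1), date(1995, 6, 1))
crawl_2004 = Snapshot("http://example.org/page", date(2004, 3, 1), date(2003, 11, 15))

print(date_stamp_year(crawl_2003))  # 1995
print(date_stamp_year(crawl_2004))  # 2003
```

The same document lands in two different years depending on when the data set was collected, and the original publication date (`first_published`) is never consulted at all.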
From the searcher’s perspective, limiting by date with a general purpose web database is something that should be done only with great care and an understanding of the problems that the article points out.
Finally, limiting to a specific date or range of dates is not an issue with news and discussion databases, since this material has a unique and agreed-upon date stamp associated with every item in the database. For example, the Yahoo News article was published and posted on October 1, 2004, or the Google Group posting became available on September 23, 2004 at 1:04 PM.
Another issue is also in play when it comes to news searching. Most of the major news engines only allow you to limit your search to the last month. In other words, if the article is older than a month, it might not be available. It may be in the main web database, depending on whether the publisher has kept the link live. In some cases the link is still available, but there is a charge to read the full text.
As I’ve pointed out in the past, numerous databases are available that contain deep archives of this type of content. Many times they’re available for free from your local library and also offer subject-based access to the content. Even better, they’re online via the web, so there is no need to visit the library.