Tara over at ResearchBuzz notes that Google seems to have lifted its long-standing 101K limit on indexing HTML files: Has Google Dropped Their 101K Cache Limit? Gary and I played around some more yesterday to test this and briefly found an example showing it's true -- sort of.
I'd love to put up a live link showing this, but it disappeared almost as soon as we found it. Tara updated her blog to note the same strange disappearing act happened to her. But she also noted that the size Google reports for a page in its search results listings may differ from what it has actually cached.
Here's an example to explain more. This search at Google brings up a page I know is larger than 101K, the archive of all blog postings we've done in December:
Search Engine Watch Blog: December 2004 Archives
... I've compiled these lists of search patents and ... My first compilation on the SEW site
was posted on ... Applications Systems and methods for searching using queries ...
blog.searchenginewatch.com/blog/0412 - 101k - 1 Feb 2005 - Cached - Similar pages
Look in the last line of the page's listing, and you'll see that Google says it is 101K long. In reality, it's 633K. That's how big it is if you were to save the file without images to your hard drive. For example, right-click on the link, save the file to your hard drive, and that's how much information is in there. The 101K figure is simply how much of the page Google has actually recorded.
Now let's go to Google's text-only cache of the page. If Google has only indexed 101K of the page, then it should end abruptly about one-sixth of the way down. In this case, it does.
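The "one-sixth" figure is just the reported size divided by the real size. Here's a minimal sketch of that arithmetic, assuming the cutoff falls at exactly 101K; the function name is my own illustration, not anything Google publishes:

```python
def expected_cache_fraction(actual_kb, limit_kb=101):
    """Fraction of a page that should show in the cache if indexing stops at limit_kb.

    limit_kb=101 is the cutoff discussed in this post; treat it as an assumption.
    """
    if actual_kb <= 0:
        raise ValueError("actual_kb must be positive")
    # A page smaller than the limit should be cached in full.
    return min(1.0, limit_kb / actual_kb)


# The 633K December archive page from the example above.
print(f"{expected_cache_fraction(633):.2f}")
```

For the 633K archive page, that works out to roughly 0.16 of the page, or about one-sixth.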
Now here's another example where things get weird:
ResourceShelf, ... ResourceShelf is Compiled & Edited By Gary Price, MLIS Gary Price Library & Internet Research Consulting gary@ resourceshelf.com Gary's Bio ...
tinyurl.com/jnpm - 101k - Cached - Similar pages
That page is actually 226K. And when Gary checked it yesterday, Google briefly reported nearly the same size (actually slightly larger), as the screenshot below shows:
Now back to the cached-text version Google has of the page. If only 101K is actually indexed, then only about half of the page's content should show in the cache. Instead, the content of the actual page looks to be the same as the cached page.
One more test. I looked for a string of text that only appears on this page and also near the bottom of the page. If Google is indexing all the text, it should have brought the page up for the query. That didn't happen. It found a page from Gary's research news site, but not the same one.
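If you want to try the same test on a page of your own, here's a rough sketch of how you might pick a test phrase. It assumes you've already saved the page's HTML as bytes and that the cutoff is 101 x 1,024 bytes (the post only says "101K"). The function name is hypothetical, and it only checks that the phrase sits past the cutoff within that one page. Whether the phrase is unique across the web is something you'd still have to check by searching.

```python
LIMIT_BYTES = 101 * 1024  # assumed cutoff; the post only says "101K"


def tail_only_phrase(page: bytes, limit: int = LIMIT_BYTES, n_words: int = 5):
    """Return a phrase of n_words found only after the cutoff, or None."""
    # Collapse whitespace so phrases match regardless of line breaks.
    head = " ".join(page[:limit].decode("utf-8", errors="ignore").split())
    tail = " ".join(page[limit:].decode("utf-8", errors="ignore").split())
    words = tail.split()
    # Walk backwards from the end so the phrase sits near the bottom of the page.
    for start in range(len(words) - n_words, -1, -1):
        phrase = " ".join(words[start:start + n_words])
        if phrase not in head and tail.count(phrase) == 1:
            return phrase
    return None  # page is under the cutoff, or nothing distinctive was found
```

Search for the returned phrase in quotes. If the page doesn't come up, that suggests the text past the cutoff isn't in the index, whatever the cache shows.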
So...something's going on, but exactly what isn't clear. I did ask Google about it but haven't gotten back a formal answer yet. Instead, I got an informal "isn't it interesting what you can spot" type of thing that typically means Google is doing something but isn't sure if they want to come right out and say it, because it might not last.
Examples? Barry notes that some are seeing the return of a Google "Search Harder" button that the company has never announced, or the Google Frequent Searcher counter feature that rolled out quietly in limited form, then disappeared.
For a rundown on how much of a web page each major search engine officially says it indexes, see my Search Engine Size Wars V Erupts. Note that for some other file types, indexing might be deeper. Google indexes PDF files up to 2MB, if I recall correctly. Fair to say they all should index web pages in full, or at least much more of each page than they currently do.
Want to discuss or comment? Visit our forum thread, Has Google Dropped Their 101K Cache Limit?