Google Upping 101K Page Index Limit?

Author

Danny Sullivan

Date published February 2, 2005

Categories

Industry

Tara over at ResearchBuzz notes that Google seems to have lifted its long-standing 101K limit on indexing HTML files:
Has Google Dropped Their 101K Cache Limit? Gary and I played around some more
yesterday to test this and briefly found an example showing it’s true, sort of.

I’d love to put up a live link showing this, but it disappeared almost as soon as we found it. Tara updated her blog to note that the same strange disappearing act happened to her. But she also
noted that the page size Google reports in its search results listings may differ from what it has actually cached.

Here’s an example to explain more. This search at Google brings up a page
I know is larger than 101K, the archive of all the blog postings we did in December:

Search Engine Watch Blog: December 2004 Archives
… I’ve compiled these lists of search patents and … My first compilation on the SEW site
was posted on … Applications Systems and methods for searching using queries …
blog.searchenginewatch.com/blog/0412 – 101k – 1 Feb 2005 – Cached – Similar pages

Look at the last line of the page’s listing, and you’ll see that Google says it is 101K long. In reality, it’s 633K. That’s how big the file is if you save it
to your hard drive without images: right-click on the link, save the file, and that’s how much information is in there. The 101K figure is
simply how much of the page Google has actually recorded.

Now let’s go to Google’s text-only cache of the page. If Google has only indexed 101K of the page, then it should end abruptly about one-sixth of the way down. In this case, it does.
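The arithmetic behind that one-sixth estimate can be sketched in a few lines of Python. This is only a rough illustration, not anything Google has published: the function name `indexed_fraction` is made up for this example, and it assumes the 101K figure in the listing and the 633K saved-file size are measured the same way.

```python
LIMIT_K = 101  # size Google shows in the listing, in K


def indexed_fraction(page_size_k, limit_k=LIMIT_K):
    """Return the share of a page that fits under the limit, capped at 1.0."""
    if page_size_k <= 0:
        raise ValueError("page size must be positive")
    return min(1.0, limit_k / page_size_k)


# The December 2004 archive page is 633K when saved without images.
share = indexed_fraction(633)
print(f"About {share:.0%} of the page fits under the limit")
```

A page smaller than the limit simply returns 1.0, meaning all of it would fit.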

Now here’s another example where things get weird:

ResourceShelf
ResourceShelf, … ResourceShelf is Compiled & Edited By Gary Price, MLIS Gary Price Library & Internet Research Consulting
gary@ resourceshelf.com Gary’s Bio …
tinyurl.com/jnpm – 101k – Cached – Similar pages

The page itself is actually 226K. And when Gary checked it yesterday, Google briefly reported nearly the same size (actually slightly larger), as the screenshot below shows:

Now back to the cached-text version Google has of the page. If only 101K is actually indexed, then only about half of the page’s content should show in the cache.
Instead, the content of the cached page looks to be the same as the actual page.

One more test. I looked for a string of text that appears only on this page and only near the bottom of it. If Google is indexing all the text, it should have brought the page up for the query.
That didn’t happen. It found a page from Gary’s research news site, but not the same one.
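The logic of that test can be sketched as a small check on a saved copy of a page. This is a minimal sketch under stated assumptions: the helper name `phrase_within_limit` is hypothetical, it treats 101K as 101 × 1,024 bytes of raw HTML, and it says nothing about how Google itself counts or truncates a page.

```python
LIMIT_BYTES = 101 * 1024  # assumption: "101K" means 101 * 1,024 bytes of raw HTML


def phrase_within_limit(html: bytes, phrase: str, limit: int = LIMIT_BYTES) -> bool:
    """Return True if the first occurrence of phrase ends within the byte limit."""
    needle = phrase.encode("utf-8")
    start = html.find(needle)
    if start == -1:
        raise ValueError("phrase not found in page")
    return start + len(needle) <= limit


# Synthetic stand-ins for a saved page: filler bytes followed by a unique phrase.
short_page = b"x" * 1_000 + b"unique footer phrase"
long_page = b"x" * 200_000 + b"unique footer phrase"

print(phrase_within_limit(short_page, "unique footer phrase"))  # phrase sits inside the limit
print(phrase_within_limit(long_page, "unique footer phrase"))   # phrase sits past the limit
```

If the phrase falls past the limit and a search for it still returns the page, that suggests more than 101K is being indexed. If the search misses the page, as happened here, the old limit may still apply to the index even when the cache shows more.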

So… something’s going on, but exactly what isn’t clear. I did ask Google about it but haven’t gotten a formal answer back yet. Instead, I got an informal “isn’t it
interesting what you can spot” type of response, which typically means Google is doing something but isn’t sure it wants to come right out and say so, because it might not
last.

Examples? Barry notes that some are seeing the return of a Google “Search Harder” button that the company has
never announced. There’s also the Google Frequent Searcher counter feature, which rolled out quietly in limited
form, then disappeared.

For a rundown on how much of a web page each major search engine officially says it indexes, see my Search Engine Size Wars V Erupts.
Note that for some other file types, indexing might be deeper. Google indexes PDF files up to 2MB, if I recall correctly. It’s fair to say they
all should index web pages in full, or at least to a much higher limit than they currently do.

Want to discuss or comment? Visit our forum thread, Has Google Dropped Their 101K Cache Limit?
