How to Get More Pages into Google’s Index

Date published: 1 August 2007

Author: Jonathan Hochman

Categories: Industry, SEO

Many people obsess about every word Matt Cutts says, but there are plenty of other Googlers who can teach us a thing or two about Google’s inner workings. At Search Engine Strategies Chicago 2006, I was on a panel with one of them: Dan Crow, who is part of Google’s search quality group and is the Product Manager for the crawl infrastructure group.

When Jill Whalen and Pauline Kerbici of High Rankings started a local organization called Search Engine Marketing New England (SEMNE), I suggested that they invite Dan to speak. Besides having a charming British accent, Dan is a great speaker because he knows everything about Googlebots and indexing.

Last month in Providence, nearly 100 SEMNE members and guests showed up to meet Dan. To learn about the official presentation, you can read Jill’s summary, “Getting into Google,” and Rand Fishkin’s post, “Dan Crow of Google on Crawling, Indexing & Ranking.”

Instead of yet another summary, here I will cover the unofficial story: the conversations I had with Dan before and after the main event.

Dan Crow’s Advice to Webmasters

Dan started our conversation by saying that the World Wide Web is very large, and Google is not even sure how large. They can only index a fraction of it. Google has plenty of capital to buy more computers, but there just isn’t enough bandwidth and electricity available in the world to index the entire Internet. Google’s crawling and indexing programs are believed to be the largest computations ever.

Googlebots fetch pages, and then an indexing program analyzes each page and stores a representation of it in Google’s index. The index is an incomplete model of the Web. From there, PageRank is calculated and secret algorithms generate the search results. The only pages that can show up in Google’s search results are pages included in the index. If your page isn’t indexed, it will never rank for any keywords.

Because the Web is so much larger than the index, Google has to make decisions about what to spider and what to index. Dan told me that Google doesn’t spider every page it knows about, nor does it add every spidered page to the index. Two thoughts flashed through my mind at that moment: (1) I need to buy Dan a drink, and (2) what can I do to make sure my pages get indexed?

Bandwidth and electricity are the constraining resources at Google. On some level it has to allocate those resources among all the different Web sites: Google isn’t going to index Web sites A-G and then ignore H-Z. Dan suggested that each day Google has a large but limited number of URLs it can spider, so for large sites it’s in the site owner’s interest to help the indexing process run more efficiently, because that may lead to more pages being indexed.

How much effort Google decides to put into spidering a site is a secret, but it’s influenced by PageRank. If your site has relatively few pages with high PageRank, they’ll all get into the index without a problem, but if you have a large number of pages with low PageRank, you may find that some of them don’t make it into Google’s index.
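To make the idea of a limited daily crawl concrete, here is a toy sketch. It is not Google's method, which is secret; it simply assumes each known URL has some importance score (standing in for PageRank) and that a crawler with a fixed budget fetches the highest-scoring URLs first. The function name, the scores, and the budget are all hypothetical.

```python
def select_for_crawl(page_scores, daily_budget):
    """Pick the URLs a budget-limited crawler would fetch first.

    page_scores: dict mapping URL -> importance score (a hypothetical
                 stand-in for PageRank; the real allocation is not public).
    daily_budget: maximum number of URLs that can be fetched.
    """
    # Rank URLs from most to least important.
    ranked = sorted(page_scores.items(), key=lambda item: item[1], reverse=True)
    # Keep only as many as the budget allows; the rest are left uncrawled.
    return [url for url, _score in ranked[:daily_budget]]


# Illustrative data only: a few strong pages and a long tail of weak ones.
pages = {"/": 0.9, "/products": 0.5, "/archive/page-41": 0.1, "/archive/page-42": 0.05}
crawled = select_for_crawl(pages, daily_budget=2)
skipped = [url for url in pages if url not in crawled]
```

Under these assumptions, the low-scoring archive pages are the ones left out, which mirrors the point above about large numbers of low-PageRank pages.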

Clean Code Matters

What can we do to get more pages indexed? I’ve always suspected that streamlining HTML code is a good way to facilitate indexing. Reducing code bloat helps pages load faster and use less bandwidth. I asked if it would help to move JavaScript and CSS definitions to external files, and clean up tag soup. Dan’s answer was refreshingly clear. “Those would be very good ideas,” he said.
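One rough way to see how much a page might gain from moving scripts and styles to external files is to measure how much of its HTML is inline JavaScript and CSS. The sketch below uses Python's standard `html.parser` module. The class name, function name, and report keys are my own illustrative choices, and the character counts are only an approximation of bandwidth.

```python
from html.parser import HTMLParser


class InlineBloatParser(HTMLParser):
    """Counts characters inside inline <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag we are currently inside, if any
        self.inline_chars = {"script": 0, "style": 0}

    def handle_starttag(self, tag, attrs):
        if tag in self.inline_chars:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        # Only text found inside <script> or <style> counts as inline code.
        if self._current is not None:
            self.inline_chars[self._current] += len(data)


def inline_bloat_report(html):
    """Return the inline script/style character counts and their share of the page."""
    parser = InlineBloatParser()
    parser.feed(html)
    parser.close()
    inline = sum(parser.inline_chars.values())
    return {
        "total_chars": len(html),
        "inline_script_chars": parser.inline_chars["script"],
        "inline_style_chars": parser.inline_chars["style"],
        "inline_fraction": inline / len(html) if html else 0.0,
    }


sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x=1;</script></head>"
    "<body><p>Hi</p></body></html>"
)
report = inline_bloat_report(sample)
```

A script referenced with a `src` attribute has no inline body, so it contributes nothing to the count. A high `inline_fraction` suggests the page is a candidate for the kind of cleanup discussed above.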

SEOs pay a lot of attention to issues like duplicate content, link building to increase PageRank, and link structure to move PageRank throughout the site. However, I haven’t seen many SEO articles about the importance of proper Web development methodology. All too often when I look at a new site, I am appalled at the sloppy coding. The typical site could be streamlined significantly.

Yes, you should try to increase the PageRank of your pages, and you should design your link structure so that PageRank is distributed throughout your site in a way that makes sense. You should provide unique and valuable content. Those tactics will help your indexing, but you also need to pay attention to the dirty details of how your pages are put together. If everybody served clean code, Google would be able to index significantly more pages.

Why doesn’t Google do more to educate webmasters about the efficient use of bandwidth and computing power? Perhaps it would look bad for Google to ask webmasters to recode their sites to make Google’s job easier. Nonetheless, if Google can tell me how to get more of my pages into the index, I’m ready to listen and cooperate.

Clean HTML is good not just for getting indexed, but also because it means more people can read your site. The cleaner and more compatible your code, the wider the range of browsers it will work with. This is especially important for users with screen readers and those using mobile devices such as cell phones.

Jonathan Hochman has two computer science degrees from Yale. He runs an Internet marketing consultancy and a web development shop.

Search Headlines

We report the top search marketing news daily at the Search Engine Watch Blog. You’ll find more news from around the Web below.

12 Ways to Butcher Your SEO Campaign, Fathom SEO
MSN, DoubleClick, and SPOCK, DART Search Blog
Robots Exclusion Protocol: now with even more flexibility, Google Blog
The Semantic Web & Its Implications on Search Marketing, Search Engine Journal
iCrossing Buys Proxicom for Site SEO from the Get-Go, ClickZ
Why Wal-Mart Is Going Social Media, ClickZ
Google, Yahoo, Microsoft: Year-To-Date PPC Report Card, Search Engine Land
Microformats in Google Maps, Google Maps API Blog
Social Networking Goes Global Major, comScore
The Emotions that Make Us Link, SEOmoz
