Submitting And Encouraging Crawlers

Explains how to ensure that more of the content within your web site is indexed by crawler-based search engines.

In an ideal world, you would never need to submit your web site to crawler-based search engines. Instead, they would automatically come to your site, locate all of your pages by following links, then list each of these pages. That doesn’t mean you would rank well for every important term, but at least all the content within your web site would be fully represented. Think of it like playing a lottery. Each page represented in a search engine is like a ticket in the lottery. The more tickets you have, the more likely you are to win something.

In the real world, search engines miss pages. There are several reasons why this may happen. For instance, consider a brand new web site. If no one links to this web site, then search engines may not locate it during their normal crawls of the web. The site essentially remains invisible to them.

This is why Add URL forms exist. Search engines have operated them so they could be notified of pages that should be considered for indexing. Submitting via Add URL doesn’t mean a page will automatically be listed, but it does bring the page to the search engine’s attention.

Add URL forms have long been an important tool for webmasters looking to increase their representation through “deep submitting,” which I’ll cover below. But it is also important that you consider site architectural changes that can encourage “deep crawling.” These architectural changes should keep your site better represented in the long run.

Paid Inclusion

Several search engines offer “paid inclusion” programs where, for a fee, you are guaranteed to have the pages you want listed. These programs are the best way to ensure that your important pages are listed. The downside, of course, is that you have to pay. The Crawler Submission Chart lists who offers paid inclusion programs, and you can follow links from that page to get even more information.

Deep Submitting

In the past, there was a strong relationship between using Add URL forms and pages getting listed. Pages submitted via Add URL would tend to get listed and listed more quickly than pages that were not submitted using the forms. For this reason, people often did “deep submits.” They would submit many pages at the same time, hoping to get them all listed quickly.

Those days are essentially gone. In my opinion, there is very little value to most people spending much time on deep submits. That is because many search engines have altered the behavior of their Add URL forms in response to spamming attempts. The vast majority of submissions to Add URL systems are junk pages not worth listing. Because of this, the major crawlers are far more likely to rely on link crawling than Add URL submissions.

Deep submits may still be effective in some places. For instance, AltaVista tends to list any page submitted within four to six weeks. Therefore, submitting to AltaVista directly may increase the representation or freshness of your listings.

Aside from AltaVista, I don’t suggest bothering with a deep submission elsewhere. However, should you still wish to do this, there are some deep submission tools listed on the Toolbox page that you might find helpful. The Crawler Submission Chart also provides an at-a-glance guide to submission limits and timings for major crawler-based search engines.

Link Crawling

Building links can help improve the chances of your pages getting indexed. This is because the more links point at your pages, the more likely a crawler is to come across them and add them for free.

Think of it like asking people for recommendations. If you ask 10 different people for a good doctor and several of them suggest the same person, you are more likely to contact that doctor for help. In the same way, having many links pointing at your pages (or a few links from important web sites) makes it more likely crawlers from major search engines will pick up your pages.

The More About Link Analysis page has more tips on building links, plus it discusses how building links can also help your pages rank better.

Dealing With Doorways

It’s unusual for web pages within a site not to have links pointing at them. Typically, sites are built so that users can navigate their way to any page within the site from the home page. Similarly, search engines can also follow links and locate material.

An exception to this rule is with doorway pages (see What Are Doorway Pages, for an introduction to the concept). These pages do not typically have links pointing at them, since they are designed to be “entrances” to the web site that appear in response to particular searches on different search engines.

Because doorway pages lack inbound links, crawlers have no chance of finding them on their own. That means if you use doorway pages, you’ll have to submit them to crawlers via their free Add URL services. Of course, this doesn’t guarantee that they’ll get listed. An alternative is to submit via paid inclusion systems, for the crawlers that offer this. However, you may find that there’s a limit on how many doorway pages you can submit.

Alternatively, you can also create what is often called a “hallway” page. This is simply a list of all the doorway pages in your site. You submit the hallway page to a crawler, then the crawler — if it reads the hallway page — MAY decide to index some of the doorway pages on the list.

Hallway pages are essentially site maps of just your doorway pages. You wouldn’t put a hallway page out for your human visitors to see as you would a site map, because you don’t want to send them back to the various “entrances” to your site.
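If you have many doorway pages, you don’t have to build the hallway page by hand. Below is a minimal sketch, in Python, of one way to generate such a page; the doorway URLs and the “hallway.html” file name are hypothetical examples, not a required format.

# Sketch: generate a plain "hallway" page that links to doorway pages.
# The doorway URLs below are hypothetical examples.
doorway_urls = [
    "http://site.com/doorway-widgets.html",
    "http://site.com/doorway-gadgets.html",
    "http://site.com/doorway-gizmos.html",
]

links = "\n".join(
    '  <li><a href="%s">%s</a></li>' % (url, url) for url in doorway_urls
)

page = """<html>
<head><title>Hallway Page</title></head>
<body>
<ul>
%s
</ul>
</body>
</html>""" % links

# Write the hallway page to disk; this single URL is what you would
# submit via a crawler's Add URL form.
with open("hallway.html", "w") as f:
    f.write(page)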

Encouraging Deep Crawling

It’s important to remember that even if you don’t submit each and every one of your pages, search engines may still add some of them anyway. Crawlers follow links — if you have good internal linking within the pages of your web site, then you increase the odds that even pages you’ve never submitted may still get listed.

A site map can be especially helpful here. It is simply a list of all the pages in your site, such as this one for Search Engine Watch. If a crawler comes to it, then it easily learns of other pages it might wish to visit. Your human visitors may appreciate a site map, also.

Some search engines routinely do “deep crawls” of web sites. None of them will list all of your pages, but they will gather a good amount beyond those you actually submit. Even non-deep crawlers will still tend to gather some pages beyond those actually submitted, assuming they find links to these pages from somewhere within your site.
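To see why internal links and a site map matter so much, consider this toy model of how a spider works. The link graph below is entirely hypothetical; the point is that a crawler given only the home page still reaches every page connected to it by links, while an unlinked “orphan” page is never found.

from collections import deque

# Hypothetical internal link graph: page -> pages it links to.
links = {
    "/": ["/products.html", "/sitemap.html"],
    "/products.html": ["/books.html"],
    "/sitemap.html": ["/books.html", "/fiction.html", "/harrypotter.html"],
    "/books.html": [],
    "/fiction.html": [],
    "/harrypotter.html": [],
    "/orphan.html": [],  # nothing links here, so a crawler never finds it
}

def crawl(start):
    # Breadth-first walk of the link graph, the way a spider follows links.
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Every page except /orphan.html is discovered from the home page alone.
print(sorted(crawl("/")))

Note how the site map page does most of the work here: a single link to it from the home page exposes three deeper pages at once.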

Shallow Vs. Deep Structure

One thing that can help with crawlers is to have a “shallow” rather than a “deep” web site structure. For example, look at these URLs, all from the same example web site:

http://site.com/
http://site.com/products/
http://site.com/products/books/
http://site.com/products/books/fiction/
http://site.com/products/books/fiction/childrens/
http://site.com/products/books/fiction/childrens/harrypotter/

In this example, the web site sells different products, and all the product information is kept within the /products/ area of the site. Below that is the /books/ section, then below that is the /fiction/ section, then /childrens/ and finally /harrypotter/ has information about Harry Potter books. As you can see, the Harry Potter books area is buried deep in the structure. It is in the “basement” of this web site, so to speak.

In general, search engines pay more attention to the upper portions of your site. The assumption is that the good stuff will not be buried. This does NOT mean that content deep in a web site will not get listed. In fact, if people link deep within your site, that can even help the pages to get indexed by crawlers. However, if you go to a more shallow structure, you may find more of your pages do get included. What’s a shallow structure like? Take a look:

http://site.com/
http://site.com/products.html
http://site.com/books.html
http://site.com/fiction.html
http://site.com/childrens.html
http://site.com/harrypotter.html

In the examples above, all of the pages now reside on the “top floor” of your web site. Rather than your pages being buried deep, they are instead spread widely across the upper levels in a shallow format.
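One rough way to compare the two layouts is to count the directory levels in each URL’s path. Here is a small sketch using Python’s standard urlparse; the counting rule is only an illustration, not how any particular search engine actually measures depth.

from urllib.parse import urlparse

def depth(url):
    # Count directory levels in a URL's path; 0 means the "top floor".
    path = urlparse(url).path.strip("/")
    parts = path.split("/") if path else []
    # A trailing file name such as "products.html" is not a directory level.
    if parts and "." in parts[-1]:
        parts = parts[:-1]
    return len(parts)

print(depth("http://site.com/products/books/fiction/childrens/harrypotter/"))  # 5
print(depth("http://site.com/harrypotter.html"))                               # 0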

Multiple Web Sites

Even the best deep crawlers may have problems with large web sites. This is because crawlers try to be “polite” when they visit sites and not request so many pages that they might overwhelm a web server. For instance, they might request a page every 30 seconds over the course of an hour, which works out to only about 120 pages. Obviously, this won’t allow them to view much of a large site. Other crawlers are simply not interested in gathering every single page you have. They’ll get a good chunk, then move on to other sites.

For this reason, you might want to consider breaking up your site into smaller web sites. For instance, consider a typical shopping site that might have sections like this:

http://site.com/
http://site.com/books/
http://site.com/movies/
http://site.com/music/

The first URL is the home page, which talks about books, movies and music available within this site. The second URL is the book section, which contains information about all the books on sale. The third URL is the movie section, and the fourth is the music section.

Now imagine that the three main sections have 500 pages of product information each. Altogether, that gives the site 1,500 pages available for spidering. Next, let’s assume that the best deep crawler tends to only pick up about 250 pages from each site it visits — this number is completely made up, but it will serve to illustrate the point. This would mean that only 250 pages out of 1,500 are spidered, or about 17 percent of all those available.

Now it is time to consider subdomains. Any domain that you register, such as “site.com,” can have an endless number of “subdomains” that make use of the core domain. All you do is add a word to the left of the core domain, separated by a dot, such as “subdomain.site.com.” These subdomains can then be used as the web addresses of additional web sites. So returning to our example, let’s say we create three subdomains and use them as the addresses of three new web sites, like so:

http://books.site.com
http://movies.site.com
http://music.site.com

Now we move all the book content from our “old” web site into the new “books.site.com” site, and do the same for our movies and music content. Each site stands independent of the others. That means when our deep crawler comes, it gathers up 250 pages from one site, moves to the next to gather another 250, then does the same thing with the third. In all, 750 pages of the 1,500 are gathered — 50 percent of all those available. That’s a huge increase over the 17 percent that were gathered when you operated one big web site.
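Here is the arithmetic behind those percentages, spelled out with the same made-up figure of 250 pages gathered per site:

# Made-up numbers from the example: 1,500 product pages in total, and a
# deep crawler that picks up at most 250 pages from any one site.
total_pages = 1500
per_site_limit = 250

# One big site: the crawler takes its 250 pages and moves on.
one_site = min(total_pages, per_site_limit)
print("One site: %d of %d (%.0f%%)" % (one_site, total_pages,
                                       100.0 * one_site / total_pages))

# Three subdomain sites of 500 pages each: 250 gathered from each one.
split_total = 3 * min(500, per_site_limit)
print("Three sites: %d of %d (%.0f%%)" % (split_total, total_pages,
                                          100.0 * split_total / total_pages))

This prints roughly 17 percent for the single site and 50 percent for the three smaller sites.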

If you decide to go the subdomain route, you’ll need to talk with your server administrator about establishing the new domains. There is no registration fee involved, but the server company might charge a small administrative fee to establish the new addresses. Of course, you may also have to pay an additional monthly charge for each site you operate.

You could also register entirely new domains. However, I suggest subdomains for a variety of reasons. First, there’s no registration fee to pay. Second, it’s nice to see the branding of your core URL replicated in the subdomains. On the downside, subdomains have been abused, so entirely new domains may be a safer option, as described next.

Be Careful!

You shouldn’t break up a site as described above just to please search engines. Break up a site because it makes sense to do it for your human visitors. Having all your content about a particular product or subject under a nice, short URL is a usability feature for humans. They can easily remember the location. They may also be inclined to bookmark or link to the specific area, the latter of which can help you when it comes to link analysis.

Also, some people have abused subdomains. This in turn may cause search engines to be suspicious of them. Because of this, you might consider having separate domains, such as:

http://bookssite.com
http://moviessite.com
http://musicsite.com

However, having separate domains is no magic protection against being considered a search engine spammer, if there’s no legitimate reason for breaking up your site.

Ask yourself these questions: Do you have lots of content, perhaps hundreds of pages, for a particular product, service or subject? Would moving this content into its own web site make sense for your human visitors? If both answers are yes, you’re probably breaking up your site for the right reason.

Even if the answers to both questions are yes, be aware that when the same company starts running many web sites, even for the right reasons, this might be viewed with suspicion. How many is too many? There’s no right answer, but I would take a more critical eye when advancing beyond 10.

Finally, some people break up sites because they mistakenly believe that search engines will read all the pages within a web site and rank sites with pages all on the same “theme” more highly. This doesn’t actually happen, and the first question in the article below examines the issue in more depth.

Reader Q&A
The Search Engine Update, June 17, 2002
https://www.searchenginewatch.com/_subscribers/articles/02/article.php/2152491

Finally, also see the ABCs and URLs page, which provides further discussion about having multiple web sites and domain names.

Root Page Advantage

Breaking up a web site into multiple sites gives you more “root” pages, which tend to be more highly ranked than any other page you will have. That’s both due to search engine algorithms and because root pages tend to attract the majority of links from other web sites.

The root page is whatever page appears when you just enter the domain name of a site. Usually, this is the same as your home page. For instance, if you enter this into your browser:

searchenginewatch.com

The page that loads is both the Search Engine Watch home page and the “root” page for the Search Engine Watch web server. However, if you have a site within someone else’s web server, such as this…

http://www.server.com/mysite/

…then your home page is not also the root page. That’s because the server has only one root page, whatever loads when you enter “server.com” into your browser.
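In code terms, a root page is simply whatever is served when the URL’s path is empty or “/”. A quick sketch:

from urllib.parse import urlparse

def is_root_page(url):
    # True if the URL points at the root of its server.
    return urlparse(url).path in ("", "/")

print(is_root_page("http://searchenginewatch.com"))   # True
print(is_root_page("http://books.site.com/"))         # True
print(is_root_page("http://www.server.com/mysite/"))  # False: a subdirectory site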

So in our example, there used to be only one root page, the one that appeared when someone went to “site.com,” and that single page had to be focused around all the different product terms. Now, each of the new sites also has a root page — and each page can be specifically about a particular product type.

Breaking up a large site might also help you with directories. Editors tend to prefer listing root URLs rather than long addresses that lead to pages buried within a web site. So to some degree, breaking up your site into separate sites should give each site more respect.
