Although sitemaps are simply lists of canonical URLs submitted to search engines, it's surprisingly rare to come across a perfect one. Issues often arise when owners of large sites use sitemap auto-generation tools that aren't configured properly. These sites typically pose challenges for search engine crawlers, such as pagination and URLs generated by faceted navigation.
Spiders decide which pages to crawl based on URLs queued from previous crawls, and that list is augmented with URLs from XML sitemaps. Sitemaps can therefore be a key factor in ensuring search crawlers access and assess the content most eligible to be seen in search engine results.
The following is a quick overview of search engine sitemap guidelines and limitations, followed by a technique to help identify crawling and indexation issues using multiple sitemaps in Google Webmaster Tools.
Bing & Google Guideline & Limitation Overview
The sitemap protocol has been a standard since search engines adopted it in 2006. Since then, Bing and Google have developed useful Webmaster Tools dashboards to help site owners identify and fix errors.
Of the two search engines, Bing has a particularly low threshold, or at least it publicly states that it begins devaluing sitemaps if 1 percent of the URLs result in an error (that is, return anything but a 200 status code).
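To make the 1 percent figure concrete, here is a minimal Python sketch that computes the error rate for a sitemap. It assumes the URLs have already been crawled and their status codes collected; the `crawl_results` data and the `sitemap_error_rate` function are hypothetical names for illustration, not part of any search engine tool.

```python
def sitemap_error_rate(status_by_url):
    """Return the fraction of URLs whose status code is anything but 200."""
    if not status_by_url:
        return 0.0
    errors = sum(1 for status in status_by_url.values() if status != 200)
    return errors / len(status_by_url)


# Hypothetical crawl results: URL -> HTTP status code.
crawl_results = {
    "https://example.com/": 200,
    "https://example.com/blog/": 200,
    "https://example.com/old-page": 301,
    "https://example.com/missing": 404,
}

ERROR_THRESHOLD = 0.01  # the 1 percent figure cited above

rate = sitemap_error_rate(crawl_results)
print(f"Error rate: {rate:.1%}, over threshold: {rate > ERROR_THRESHOLD}")
```

Redirects count as errors here because the rule described above treats anything other than a 200 as a problem.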
Google provides clear guidelines and limitations, along with a more robust error reporting system in its webmaster dashboard. In addition to submitting quality sitemaps, ensure that files stay within the following hard limits applicable to Google:
- Limit sitemaps to 50,000 URLs
- File size should be under 50MB
- 500 sitemaps per account
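The first two limits are easy to check before submitting a file. The following is a rough sketch using only the Python standard library; it assumes the 50MB limit applies to the uncompressed file, and `check_sitemap_limits` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50MB, assumed to mean the uncompressed size


def check_sitemap_limits(xml_bytes):
    """Return (url_count, size_in_bytes, problems) for one sitemap file."""
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{SITEMAP_NS}url"))
    size = len(xml_bytes)
    problems = []
    if url_count > MAX_URLS:
        problems.append(f"{url_count} URLs exceeds {MAX_URLS}")
    if size > MAX_BYTES:
        problems.append(f"{size} bytes exceeds {MAX_BYTES}")
    return url_count, size, problems


# A tiny sample sitemap with placeholder URLs.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
</urlset>"""

print(check_sitemap_limits(sample))
```

In practice the bytes would come from reading the sitemap file from disk rather than from an inline sample.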
Both search engines support sitemap index files. Rather than submitting multiple sitemap files individually, the sitemap index file makes it easier to submit several sitemap files of any type all at once.
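As an illustration, a sitemap index file is just a short XML document that lists the location of each sitemap. The sketch below builds one with the Python standard library; the sitemap URLs are placeholders and `build_sitemap_index` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_sitemap_index(sitemap_urls):
    """Return sitemap index XML (as a string) listing each sitemap URL."""
    root = ET.Element(
        "sitemapindex",
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9",
    )
    for url in sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = url
    return ET.tostring(root, encoding="unicode")


# Placeholder sitemap locations.
index_xml = build_sitemap_index([
    "https://example.com/sitemap-posts.xml",
    "https://example.com/sitemap-products.xml",
])
print(index_xml)
```

The resulting string would be saved as a single file and submitted in place of the individual sitemaps.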
Basic Sitemap Optimization
Basic sitemap optimization should include checking for pages that are:
- Duplicated (the same URL appearing in different sitemaps is OK)
- Returning status code errors - 3XX, 4XX, and 5XX
And any pages that specify:
- Rel canonical tags that are not self-referential
- Noindex meta robots tags
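The two on-page checks can be sketched with Python's built-in HTML parser. This is a simplified illustration that assumes the page HTML has already been fetched; `IndexabilityParser` and `indexability_issues` are hypothetical names, and a real crawler would also need to handle relative canonical URLs and HTTP header directives.

```python
from html.parser import HTMLParser


class IndexabilityParser(HTMLParser):
    """Collect the rel=canonical href and the meta robots content of a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = (attrs.get("content") or "").lower()


def indexability_issues(url, html):
    """Return a list of reasons the URL should not be in a sitemap."""
    parser = IndexabilityParser()
    parser.feed(html)
    issues = []
    if parser.canonical and parser.canonical != url:
        issues.append(f"canonical points to {parser.canonical}")
    if "noindex" in parser.robots:
        issues.append("noindex meta robots tag")
    return issues


# Placeholder page with both problems present.
page = """<html><head>
<link rel="canonical" href="https://example.com/blog/post">
<meta name="robots" content="noindex, follow">
</head><body></body></html>"""

print(indexability_issues("https://example.com/blog/post?utm=1", page))
```

Any URL that returns a non-empty list is a candidate for removal from the sitemap.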
Tools such as the Screaming Frog SEO crawler can quickly parse the URLs contained within XML files and surface this information.
Using Google Webmaster Tools
Once comprehensive, high-quality XML sitemaps have been submitted to Google and Bing, breaking up sitemaps into categories can provide further insight into crawling and indexation issues.
A great place to start is by breaking up sitemaps by page type. Sitemaps can be diced up in any way that makes sense to provide feedback, the main goal being to expose any areas of a site with a low indexation rate.
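One simple way to split a URL list by page type is to group on the path prefix. The sketch below assumes page types map cleanly to directories; the `PAGE_TYPES` mapping and the URLs are hypothetical and would need to be adjusted to the site's own structure.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical mapping of path prefix -> sitemap name.
PAGE_TYPES = {
    "/blog/": "posts",
    "/products/": "products",
    "/category/": "categories",
}


def group_by_page_type(urls):
    """Group URLs into one list per page type, based on the path prefix."""
    groups = defaultdict(list)
    for url in urls:
        path = urlparse(url).path
        page_type = next(
            (name for prefix, name in PAGE_TYPES.items() if path.startswith(prefix)),
            "other",  # anything that matches no known prefix
        )
        groups[page_type].append(url)
    return dict(groups)


urls = [
    "https://example.com/blog/sitemap-tips",
    "https://example.com/products/widget",
    "https://example.com/about",
]
for page_type, members in group_by_page_type(urls).items():
    print(page_type, len(members))
```

Each group can then be written to its own sitemap file and submitted separately.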
Once an area has been identified, finding the source of the issue can begin. Using Fetch as Googlebot to identify uncrawlable content and links is often very helpful.
Another particularly useful technique is to use multiple sitemap indexation identification (or MSII, an acronym I just made up...) in combination with advanced search operators in an attempt to find excess indexation.
For example, if your website is having trouble getting posts indexed, it would be helpful to create an XML sitemap containing only blog posts. From this, an indexation rate (indexed URLs divided by submitted URLs) can be calculated, and an advanced search can be used to see what other pages might be diluting the crawling and indexation of post pages.
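The indexation rate is simple arithmetic on the submitted and indexed counts reported for each sitemap. A minimal sketch, using made-up numbers in place of real report data:

```python
def indexation_rate(submitted, indexed):
    """Return indexed / submitted, or 0.0 when nothing was submitted."""
    if submitted == 0:
        return 0.0
    return indexed / submitted


# Hypothetical (submitted, indexed) counts per sitemap.
reports = {
    "posts": (1200, 300),
    "products": (5000, 4700),
}

# List sitemaps from lowest to highest indexation rate.
for name, counts in sorted(reports.items(), key=lambda item: indexation_rate(*item[1])):
    print(f"{name}: {indexation_rate(*counts):.0%} indexed")
```

Sorting from lowest to highest puts the area of the site most in need of investigation at the top.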
The search below shows that tag and author pages could be the hypothetical hindrance.
Additional advanced Google searches like site:example.com/blog/ inurl:tag OR inurl:author could then be run to determine the scale of potential excess crawling and indexation. The same concept can be applied to dynamic parameters contained within URLs generated by faceted navigation, pagination, product sorting, etc.