Meet The Crawlers

Representatives of Yahoo, Google, Ask Jeeves and LookSmart offer an inside glimpse of recent developments at the major search engines and how they operate under the hood.

A special report from the Search Engine Strategies 2004 Conference, March 1-4, New York City.

Always a favorite, the NYC Search Engine Strategies session of “Meet the Crawlers” was packed, as usual. This session lived up to its reputation of providing information straight from the source and granted the audience direct access to representatives of the big engines.

Yahoo Technology Changes

Tim Mayer, Director of Product Management at Yahoo Search, provided a brief overview of recent technology activities at Yahoo. Here are a few highlights from his presentation.

With the release of its new search engine, Yahoo now powers more than half of US web searches, a dramatic shift in market share within the industry. Tim stated that Yahoo now has 260 million users worldwide and 100 million registered users.

A concerted effort has been made to increase the size of the Yahoo index. Tim stated that Yahoo “grew the index over 50% from what it was on Yahoo.com before.” Tim also mentioned that the Yahoo index is more than 99% populated through the free crawl process.

In addition to its paid inclusion SiteMatch program, Tim announced Yahoo was bringing back a free Add URL program. Although sites submitted through this free program are not guaranteed inclusion into the Yahoo index, it does provide a mechanism for webmasters to inform Yahoo of quality pages.

Tim described more than a dozen daily tests aimed at improving search quality and the user interface. The Yahoo team is extremely focused on improving the user experience and providing relevant results. For users, this means that results or the interface may change on any given day. To ensure fresh content, Tim mentioned that Yahoo has added a daily crawl for updating documents that it knows change frequently.

Toolbars provide search engines with valuable feedback, so to go along with its new search engine, Yahoo is now offering its own toolbar called Yahoo Companion.

Really Simple Syndication, or RSS, is an XML-based format for distributing content such as news on the web. The new Yahoo RSS feature allows a content provider to make an RSS feed available, and Yahoo will index it as it crawls the web.
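For readers unfamiliar with the format, here is a minimal sketch of what an RSS 2.0 feed looks like and how a program might read the items in it. The feed contents, the URLs, and the helper name `list_items` are made up for illustration; nothing here describes how Yahoo itself processes feeds.

```python
import xml.etree.ElementTree as ET

# A tiny hypothetical RSS 2.0 feed; titles and URLs are placeholders.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Sample feed for illustration</description>
    <item>
      <title>First story</title>
      <link>http://www.example.com/story1.html</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/story2.html</link>
    </item>
  </channel>
</rss>"""


def list_items(feed_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


ITEMS = list_items(FEED)
for title, link in ITEMS:
    print(title, link)
```

Each item carries a title and a link, which is what lets a crawler discover new pages from the feed.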

Yahoo sees personalized search as the future focus. The goal of personalization is to better understand the user’s intent. Currently, people have to type extra words into their query to be specific enough to get the results they want. With personalized search, the search engine delivers relevant results with fewer words. For example, if people want a haircut and Yahoo knows that they live in midtown New York, Yahoo would be able to automatically supply haircutters in that area.

Google Time

The second speaker was Craig Nevill-Manning, Senior Research Scientist at Google. Nevill-Manning provided an overview of Google’s ranking process and elaborated on a few points to help webmasters.

Nevill-Manning stated that the PageRank of a page depends on the aggregate importance of all the pages pointing to that page. He said that this is one significant factor in the ranking “so that for the same query, the different pages with essentially the same content we chose the one that has the best reputation in terms of the best reputation of others linking to it.”

Nevill-Manning went on to explain that on the other side of the ranking function is the text analysis. “Google looks at the words on the page, the links, the text of the links pointing to that page, and various other items on the page like proximity of adjacent words and so on.” Nevill-Manning stated that there are about 100 additional factors considered and those factors are constantly being tweaked to improve the ranking to make it more relevant.
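To make the idea of “aggregate importance of all the pages pointing to that page” concrete, here is a rough sketch of the simplified, publicly described PageRank iteration. It is an illustration only, not Google’s production ranking: the damping value of 0.85, the iteration count, the function name `pagerank`, and the four-page graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of scores.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a, b and d all link to c; c links back to a.
GRAPH = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
RANKS = pagerank(GRAPH)
for page in sorted(RANKS, key=RANKS.get, reverse=True):
    print(page, round(RANKS[page], 3))
```

In this toy graph, page c ends up with the highest score because three pages point to it, and page a outranks b and d because its only inbound link comes from the important page c.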

Like Yahoo, Google updates its index frequently. Google looks for content that has changed recently or that changes regularly over time. For news updates, Google has developed a separate news crawl that can update on a minute-by-minute basis.

Google Webmaster Tips

Nevill-Manning devoted a large portion of his briefing to providing tips specifically oriented toward webmasters. Here’s a top-level summary:

1. Create good content. Nevill-Manning encouraged webmasters to create sites with appropriate and relevant content. The Google linking theory maintains that if other people like your site, they will link to you organically and you will attract just the kind of PageRank that Google values.

2. Get links from relevant sources. Nevill-Manning discouraged webmasters from getting everyone to link to their site (random links) and instead encouraged webmasters to get related sites to link to their site. He said the focus should be on users and how useful these links will be to them.

3. Get proper directory representation. Nevill-Manning recommended that webmasters submit their site to web directories and make sure their site is represented in appropriate places within those directories.

4. Use 301 redirects when moving content. Nevill-Manning emphasized that if your pages or your site are moving, you should use a 301 redirect (permanent redirect) rather than a 302. According to Nevill-Manning, Google will interpret the 301 more appropriately.

5. Protect your bandwidth. Nevill-Manning mentioned that if you find that Googlebot is using too much of your server’s bandwidth, you can tell Google to back off. Nevill-Manning recommended supporting the standard HTTP If-Modified-Since header, which lets your server tell Googlebot whether a page has changed since it was last fetched.

Here’s how this helps. When Googlebot comes to your site, it asks your server “has this page been modified since a particular date?” If the server supports the If-Modified-Since header, it can quickly tell Googlebot “no” (an HTTP 304 Not Modified response). Since most sites don’t change that often, using this header can reduce the load on your server because Google won’t fetch the page again. Nevill-Manning stated that this is a standard feature of many web servers, so it may just be a matter of turning it on.

6. Isolate “no index” pages. Webmasters often have pages they don’t want indexed by Google (these might include cgi scripts, web logs, or pages that are duplicates of existing site pages).

Nevill-Manning mentioned several methods for telling Google not to index certain pages. The most common method is the robots.txt file. Google is a “polite” robot and respects robots.txt. Additionally, Nevill-Manning mentioned that there are meta tags you can embed in your HTML that tell Google not to index that page. Remember that you can’t rely on robots.txt to keep your data secure. The file only asks well-behaved crawlers such as Googlebot not to crawl those pages; it does not restrict access to them.
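As a small illustration of tip 6, the sketch below uses Python’s standard `urllib.robotparser` module to show how a polite crawler would read a robots.txt file. The rules, the example.com URLs, and the helper name `may_crawl` are hypothetical; this shows the general protocol, not Googlebot’s actual implementation. For a single page, the meta tag alternative is `<meta name="robots" content="noindex">` in the page’s HTML head.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, served from the root of a site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /logs/
"""

# Parse the rules the way a well-behaved crawler would before fetching pages.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, agent="Googlebot"):
    """Return True if the robots.txt rules above allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


print(may_crawl("http://www.example.com/index.html"))
print(may_crawl("http://www.example.com/cgi-bin/search.cgi"))
```

Note that the check happens entirely on the crawler’s side, which is why robots.txt cannot be relied on as a security measure.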

On a related note, Nevill-Manning offered a cautionary comment. He mentioned that people build a page but don’t link to it and expect the page not to be found by Google. Nevill-Manning said that because Googlebot uses different methods of getting to content, a page can have no links to it, yet still be known by Googlebot (the leading theory on how this occurs is that Google learns of these pages via the Google toolbar).
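Tips 4 and 5 above both come down to the status code a server returns. The following sketch shows the decision logic in plain Python, under stated assumptions: the page paths, the dates, and the function name `respond` are hypothetical, and a real site would normally get this behavior from its web server configuration rather than from hand-written code.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Hypothetical site data: pages that have moved, and last-modified times.
MOVED = {"/old-page.html": "/new-page.html"}
LAST_MODIFIED = {
    "/new-page.html": datetime(2004, 3, 1, 12, 0, 0, tzinfo=timezone.utc),
}


def respond(path, if_modified_since=None):
    """Return (status_code, headers) for a GET request to `path`."""
    # Tip 4: a moved page answers with a permanent (301) redirect.
    if path in MOVED:
        return 301, {"Location": MOVED[path]}

    modified = LAST_MODIFIED.get(path)
    if modified is None:
        return 404, {}

    headers = {"Last-Modified": format_datetime(modified, usegmt=True)}

    # Tip 5: if the crawler sent If-Modified-Since and nothing has changed,
    # answer 304 Not Modified so the page body is not sent again.
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable date: fall through to a full response.
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if modified <= since:
                return 304, headers

    return 200, headers


print(respond("/old-page.html"))
print(respond("/new-page.html", "Mon, 01 Mar 2004 12:00:00 GMT"))
```

The 304 response carries no page body, which is where the bandwidth saving comes from.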

Nevill-Manning also warned webmasters of a few practices not to follow.

1. Don’t cloak. Nevill-Manning described cloaking as giving Google different information than you’d give a web surfer.

2. Don’t do automated queries. Another Google no-no is using programs that automate queries. Google considers such programs a violation of its terms of service and has devised methods to block offenders.

3. Don’t hide text or links on a page. Nevill-Manning reminded webmasters not to use any method that subverts the way Google ranks pages, such as hiding text or links on a page.

Google offers more helpful webmaster information on its site.

Google Advertising Update

Nevill-Manning attacked one conspiracy theory head-on when he described how Google tries to keep “a strong distinction between content where money changes hands and content where money doesn’t change hands.” This separation ensures that Google search results are unbiased and objective. Nevill-Manning stated that there is a set of engineers who work on ranking functions and they essentially don’t talk to any of the salespeople or anyone who interacts with the advertising side of the business.

Nevill-Manning described the AdSense program as a technology that places ads not just on web search results but also on non-search content pages. According to Nevill-Manning, the technology can analyze your page to figure out its main topic and then find, through Google’s network of advertisers, which ad would be best suited to show on your page. He said this method ensures a very high click-through rate compared with many other forms of advertising on the web. Nevill-Manning also mentioned that many people are making significant amounts of money through content that they didn’t previously monetize.

Ask Jeeves: What’s New

Michael Palka, Director of Search, said that Ask Jeeves was now the number two pure search engine and the number five overall search engine player.

Palka described “subject specific popularity” as the feature that makes Ask Jeeves unique. This feature allows Ask “to analyze the entire web link graph and then break it down into subject specific communities.” Once the communities are identified, they can further classify the communities that are on the same topic which allows Ask to identify the authorities. The final step in validating authorities is actual editor review.

Palka identified Ask Jeeves’ “smart search” as taking search beyond temporal links. Smart search features provide fast access to weather forecasts, stock quotes, news headlines and other related areas that might be helpful to the user.

LookSmart’s Recent Focus

The final speaker in this session was Kevin Berk, VP of Advertiser Solutions at LookSmart/WiseNut. Berk stated that he wanted to answer the question he had been asked most often at this conference: “What is going on at LookSmart?”

According to Berk, LookSmart is alive and kicking and involved in many activities. One area with a considerable focus is improving the user experience and increasing relevancy in search results.

Berk mentioned that WiseNut is the search engine for LookSmart.com and that Zeal is the place to go within the LookSmart family to submit your site.

Berk showed a short demonstration of the distributed crawling capability of Grub. This new technology “lets individuals, businesses and organizations donate their computers’ otherwise unused processing power to run software programs that continually crawl the Net, indexing websites and other documents. This data is gathered into the first comprehensive, daily-updated registry of websites, which will be used to provide accurate, up to the minute results for search engines.”

Question and Answer Time

The presentations were followed by a short question and answer session where audience members were able to throw questions to the search engine representatives.

The first question asked what would happen to the Yahoo properties AltaVista and AllTheWeb.

Tim Mayer assured the audience that AltaVista and AllTheWeb would remain as search destinations. However, both properties will be migrating to Yahoo’s search technology platform (this occurred shortly after the conference).

Another audience member had concerns with the push toward Yahoo’s Personalization and whether there was a way to turn it off.

Yahoo’s Mayer stated that because of all the privacy issues, personalization at Yahoo will always be opt-in. He assured the audience that even if you actively opted in, there will also be the capability to turn personalization off.

One question asked how spiders handled URL rewriting (such as you might do if you had a dynamic site).

The engines were in strong agreement on this question. They all stated that simple URLs are more attractive to crawlers and may even encourage spiders to crawl deeper into sites. Other advice was to get rid of session IDs and to avoid cookie-driven sites. Both of these practices could keep your site from being indexed.
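One way to act on the session ID advice is to make sure the links a site publishes never carry session parameters. The sketch below shows the idea using Python’s standard `urllib.parse` module; the parameter names in `SESSION_PARAMS` and the function name `strip_session_ids` are assumptions for illustration, since real sites use many different names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; adjust to match your own site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_ids(url):
    """Return `url` with any session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_session_ids("http://www.example.com/products?id=42&PHPSESSID=abc123"))
```

Every visitor and every crawler then sees the same URL for the same page, instead of a new variant per session.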

Another question was about using keyword phrases in domains and filenames. The word from the engines was to use caution here. Yahoo’s Tim Mayer was blunt about the danger, advising webmasters not to focus on this or they risk falling into over-optimization. He stated that “when I see 4 hyphens in a name that is a spam indicator. That is a low quality URL.” He also warned against creating a separate subdomain for each page on a site. Mayer said, “When you get into that area, you’re pushing too hard.”

The best advice is still to make filenames intuitive to the user and focus on building great content for your site.

One audience member asked about the difference between paid inclusion and free crawl.

Tim Mayer from Yahoo mentioned that there are certain benefits to having a structured relationship with the service provider. For example, with paid inclusion you get customer service and better refresh cycles. Additionally, you can get content into the index, such as dynamic content, that may not be crawlable. Mayer reminded the audience that the paid inclusion program exerts no influence on ranking and that participating URLs won’t get any special treatment.

Tim clarified one area related to free crawl. If your URL has been discovered in the free crawl and you stop subscribing to SiteMatch (Yahoo’s paid inclusion program), your URL will still be in the index. However, it won’t be on the same refresh schedule that it had previously with SiteMatch. He also mentioned that if you submit URLs and they are reviewed by the editorial team and deemed “High quality,” other pages from your site may also be added.

Summary

The major search engines continue to focus on the user experience. Whether it is Google’s algorithm tweaks to improve relevancy or Yahoo’s and AllTheWeb’s changes to the user interface to explore improvements in customer interaction, the search engines continue to strive to maximize the usability, relevancy, and accuracy of the search experience.

Christine Churchill is President of KeyRelevance.com, a full service search engine marketing firm. She is also on the Board of Directors of the Search Engine Marketing Professional Organization (SEMPO) and serves as co-chair of the SEMPO Technical Committee.
