Google Does PDF & Other Changes

Google now includes listings of Adobe PDF files from across the web, a first for any major search engine and a long-overdue feature. PDF, for Portable Document Format, is a popular means for researchers, among many others, to publish information. By including PDF content in its listings, Google makes its service even more useful for those trying to get into the nooks and crannies of the web.

Not all of Google's computers have been updated with the PDF information, which means that it is pretty random as to whether you'll encounter the PDF-enhanced listings.

"If you did a query, depending on the luck of the draw and how busy the data centers are, you might not get PDF search," said Craig Silverstein, Google's director of technology. All the computers should be updated by the end of this week, he said.

When that happens, users will have access to the full text of 13 million PDF files. They will appear mixed among the normal listings of HTML documents, when relevant for a particular query. However, PDF files will be prefaced by a "[pdf]" label next to their title. This is to warn users before they click on these documents, as they can sometimes be quite large to download. They also require the Adobe Acrobat reader to view them. If you don't have Acrobat, don't worry -- clicking on the "text only" link below the listing will let you view a text-only version of the files.

Want to restrict your search to just PDF files at Google? Make use of the inurl: command AFTER your search words. For instance, a search for "colleges inurl:pdf" tells Google to find documents containing the word "colleges" and which have pdf in their URL. So far, this seems to pretty much ensure that you only get back PDF files. Multiword queries also seem to work -- "amazon ebay inurl:pdf" brings back PDF files mentioning both of those companies, for example.
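For those who like to script their searches, here is a minimal Python sketch of the tip above: it places inurl:pdf after the search words and turns the result into a search URL. The helper names pdf_query and search_url are made up for illustration, and the base address and "q" parameter are assumptions about how the query is passed, not something Google has documented here.

```python
from urllib.parse import urlencode


def pdf_query(words):
    """Join the search words and place inurl:pdf AFTER them."""
    return " ".join(list(words) + ["inurl:pdf"])


def search_url(query, base="https://www.google.com/search"):
    # Assumption: the search page takes the query in a "q" parameter.
    return base + "?" + urlencode({"q": query})


query = pdf_query(["amazon", "ebay"])
print(query)              # amazon ebay inurl:pdf
print(search_url(query))  # https://www.google.com/search?q=amazon+ebay+inurl%3Apdf
```

Note that the colon is percent-encoded in the URL; the query itself is unchanged.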

That inurl: command look interesting? Here are some other power commands you can add to your Google arsenal:

allinurl: is supposed to tell Google to find ALL the words you specify after it, within the URL of a web page. In contrast, inurl: is supposed to find ANY of the words you specify after it in the URL, rather than all of them, Google says.

You can also try allintitle: to find all the words you specify within the title of a document, while intitle: is supposed to find ANY of the words.
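The ALL versus ANY distinction is easy to mix up, so here is a small Python sketch of the difference as Google describes it. It is only a local simulation: the matches function is hypothetical, the example URL is invented, and plain substring matching is an assumption, since Google has not said how it actually compares words against URLs or titles.

```python
def matches(text, words, require_all):
    """Check whether the words occur in text: all of them, or any of them."""
    text = text.lower()
    hits = [word.lower() in text for word in words]
    return all(hits) if require_all else any(hits)


# Invented URL, for illustration only.
url = "http://example.com/reports/colleges-2001.pdf"

print(matches(url, ["colleges", "pdf"], require_all=True))   # allinurl: style -> True
print(matches(url, ["colleges", "doc"], require_all=True))   # allinurl: style -> False
print(matches(url, ["colleges", "doc"], require_all=False))  # inurl: style -> True
```

The same function applied to a page title instead of a URL would mimic allintitle: and intitle:.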

There's also a site: command, which is useful for finding pages just from a particular site -- "site:searchenginewatch.com google" would bring up all the pages from Search Engine Watch that mention Google, for instance.

Unfortunately, using site: with your domain alone won't work to bring up all of your site's pages. Instead, you'll need to enter at least one other word -- try for something you know appears on every page but which isn't a stop word, like "the." You may find that repeating the words in your main URL will do the trick, such as "site:searchenginewatch.com searchenginewatch." It looks odd, but it usually works.
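The rule about needing at least one extra, non-stop word can be captured in a few lines of Python. This is a sketch only: site_query is a hypothetical helper, and the STOP_WORDS set is a short illustrative list, not Google's actual stop word list.

```python
# Illustrative stop words only; Google's real list is not published here.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def site_query(domain, *words):
    """Build a site: query, insisting on at least one non-stop word."""
    useful = [word for word in words if word.lower() not in STOP_WORDS]
    if not useful:
        raise ValueError("site: needs at least one word that is not a stop word")
    return "site:" + domain + " " + " ".join(useful)


print(site_query("searchenginewatch.com", "google"))
# site:searchenginewatch.com google
print(site_query("searchenginewatch.com", "searchenginewatch"))
# site:searchenginewatch.com searchenginewatch
```

Calling site_query with only the domain, or with only a stop word such as "the," raises an error rather than producing a query that would come back empty.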

Of course, rather than trying to remember these commands, you can also make use of Google's advanced search page (listed below).

Google now also supports the Boolean OR command -- sort of....

"To say we support the OR command would be a great overstatement. People have discovered if you type in a capital OR, Google does something with it," Silverstein said.

What's happened is that Google has added some behind-the-scenes logic to catch variations of words written with diacritical marks, such as accented words like pêches (French for peaches), even when they are written without the accents, Silverstein said. This logic means that if you do a search for two words, with a capital OR in the middle, you may get something similar to a Boolean OR working for you.
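To get a feel for what matching a word with or without its accents involves, here is a minimal Python sketch of one common approach, Unicode decomposition followed by dropping the combining marks. This is an assumption chosen for illustration; Silverstein did not describe how Google's own logic works, and fold_accents is a made-up name.

```python
import unicodedata


def fold_accents(word):
    """Strip diacritical marks by decomposing and dropping combining characters."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


accented = "p\u00eaches"  # the French word for peaches, with a circumflex on the e
print(fold_accents(accented))                            # peches
print(fold_accents(accented) == fold_accents("peches"))  # True
```

With both forms folded to the same string, a search for either spelling can be treated as a search for both, which is roughly the OR-like behavior described above.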

In addition to crawling PDF files, Google has also been indexing some dynamically generated content since the end of last year.

"We've been expanding the kinds of dynamic content that we crawl, developing mechanisms so we can tell if we are getting trapped. We're starting to crawl more, and we're very excited about it," Silverstein said.

Google has also grown its index of WML and HDML pages, designed for WAP browsers, to 2.5 million pages. An option to search these is supposed to be available when visiting the Google home page using a WAP browser. Be sure NOT to place a slash after the .com in Google's address -- otherwise, Google will fail to detect that you are a mobile user.

Google is also offering a new "University Search" program to universities, where it will allow them to make their sites searchable for free. See below for a link to more information.

In other developments, some WebPosition users have found in past months that they are blocked from running queries at Google. This is by no means a universal problem and seems mostly to affect those running a large number of queries from a fixed IP address. Google has no plans to remove this blocking and considers any robotic requests, no matter how small, to be a violation of its terms of service.

"Our philosophy is that we are happy to have people look at our results, in particular if we have the ability to serve ads up to them. Part of our business model is advertising," Silverstein said. "An automated system, which strips our results... we do not look too kindly on that."

Chances are, ordinary individuals will not find themselves blocked. It's the larger drains on Google that light up the radar screen. "When people start using resources that are noticeable to us, then certainly that's an issue," Silverstein said.

Google has also expanded its AdWords program so that up to eight paid listings per page may appear along the right side of the main results. Previously, this was limited to three paid listings.

Finally, some reports have come in of pages disappearing from Google around mid-November. Google attributes these to temporary crawling glitches and says they should be fixed in the most recently released version of its index.


Google Advanced Search

Google Toolbar

Want to search a specific site from Google? That's built into this nifty toolbar, as well as the ability to search the entire web with Google.

Google Ventures into the Invisible Web
About Web Search Guide, Jan. 31, 2001

Chris Sherman has an excellent in-depth look at PDF search at Google.

Adobe Acrobat Home Page

Learn more about PDFs and download the Acrobat software that lets you read PDF files from here.

Adobe PDF Search Engine

Adobe's own PDF-specific search engine can be found here. Until Google, this was the best we had for locating PDF files. Now, its 1 million documents look small compared to Google's 13 million PDF files.

Google Adds Title Syntax (Finally) And URL Syntax
ResearchBuzz, Dec. 6, 2000

Tara Calishain provides a close-up look at playing with the new Google power commands.

Boolean Searching on Google
Search Engine Showdown, Nov. 3, 2000

Guide from Greg Notess about trying to make use of OR at Google.

Google University Search Sign Up

Does your university need a site specific search capability? You can sign up for Google's free offering from here.

Up Close With Google AdWords
The Search Engine Update, Oct. 16, 2000

Earlier article covering the basics of Google's AdWords paid links program.

The Google Interview
Search Engine Matrix Live Chat, Jan. 31, 2001

Google was recently featured in a live chat session for site owners that you will find archived here -- use the drop down box to find the session.