Google Scholar Documentation and Large PDF Files

Google would be doing themselves a favor by offering better documentation and disclosure about what Google Scholar does and doesn’t offer. Yes, it’s only a beta, but this type of info should have been available from day one.

First, Google should let users know which publishers they’re working with and also remind people that some material in Google Scholar is culled directly from the open web. Having some idea of what is and is not available would save a great deal of time. How much is full text vs. links to citations/purchase options? What criteria do they use to decide what open web content makes it into Google Scholar?

Second, they need to offer the Google definition of what is and isn’t “scholarly” material. While many definitions exist, Google should define and disclose what they consider scholarly and how they select it for the index. That said, I don’t think citations to press releases, resumes, and links to bibliographic records for cookbooks would be considered scholarly using most definitions.

Third, for several years it’s been documented (not directly by Google) that they don’t crawl more than 101KB of an HTML web page. It’s also been noted since Google began crawling PDF files that they might not be crawling the full text of larger files. In 2002, Greg Notess wrote that many files larger than 120KB were not fully indexed. Although Google seems to be fully indexing larger PDF files these days, I’ve noticed that some large reports and papers discovered via Google Scholar are NOT fully indexed. Here’s an example:

Title: A Resource Handbook On DOE Transportation Risk Assessment
Live Version
A 280-page document

However, the Google Scholar cached version shows that only about half of the document has been indexed and is searchable.
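
For anyone who wants to put a rough number on "about half," here is a minimal sketch in Python. It assumes you have already saved the text of the live document and the text of the cached copy yourself, and that the cached copy is simply a truncated beginning of the live one. The function name indexed_fraction is hypothetical and is not anything Google provides.

```python
def indexed_fraction(live_text: str, cached_text: str) -> float:
    """Estimate the share of live_text that appears in cached_text.

    Assumption: the cached copy is a truncated prefix of the live
    document, so we count matching words from the start and stop at
    the first mismatch.
    """
    live_words = live_text.split()
    cached_words = cached_text.split()
    if not live_words:
        return 0.0

    matched = 0
    for live_word, cached_word in zip(live_words, cached_words):
        if live_word != cached_word:
            break
        matched += 1

    return matched / len(live_words)
```

A result near 0.5 would match what the cached version of the handbook appears to show. Text extracted from a PDF rarely lines up word for word with a cached rendering, so treat the figure as a rough estimate at best.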

Document size might not be an issue that will come up with every search. However, how often it’s an issue is really beside the point. It illustrates that Google needs to improve their documentation and disclosure, especially since many people consider Google a full text index.

Two more bits of Google Scholar news. While preparing a review of Google Scholar (it should be available soon), librarian and legendary reference reviewer Peter Jacso has built a resource that allows you to compare Google Scholar results with various native search databases.

+ Side-by-Side Native Search Engines vs Google Scholar
He also offers this early commentary:
Preliminary tests have shown that Google Scholar often retrieves far fewer unique items than the native search engines of the publishers. On the positive side, Google Scholar links to citing references if the document was cited by journals indexed in Google Scholar, and provides the immensely useful citedness score of the documents. When Google Scholar has more “hits” for a query, they often turn out to be duplicates and triplicates (not always displayed adjacently) with a separate hit for the TOC entry, the abstract, the PDF file and (if available) the HTML file. Although their URLs are slightly different, they take you to the same spot in the archive. These are redundant and confusing.

Finally, other search engines may also fail to index the full text of documents. Danny gives a recent rundown on that here.
