Google Scholar Documentation and Large PDF Files

Date published: 1 December 2004

Author: Gary Price

Categories: Industry

Google would be doing themselves a favor by offering better documentation and disclosure about what Google Scholar does and doesn’t offer. Yes, it’s only a beta, but this type of info should have been available from day one.

First, Google should let users know which publishers they’re working with and also remind people that some material in Google Scholar is culled directly from the open web. Users would save a great deal of time if they had some idea of what is and is not available. How much is full text vs. links to citations/purchase options? What criteria do they use to decide what open web content makes it into Google Scholar?

Second, they need to offer Google’s definition of what is and isn’t “scholarly” material. While many definitions exist, Google should define and disclose what they consider scholarly and how they select it for the index. That said, I don’t think citations to press releases, resumes, and links to bibliographic records for cookbooks would be considered scholarly under most definitions.

Third, for several years it’s been documented (not directly by Google) that they don’t crawl more than 101KB of an HTML web page. It’s also been noted since Google began crawling PDF files that they might not be crawling the full text of larger files. In 2002, Greg Notess wrote that many files larger than 120KB were not fully indexed. Although Google seems to be fully indexing larger PDF files these days, I’ve noticed that some large reports and papers discovered via Google Scholar are NOT fully indexed. Here’s an example:

Title: A Resource Handbook On DOE Transportation Risk Assessment
Live Version
A 280-page document

However, the Google Scholar cached version shows that only about half of the document has been indexed and is searchable.
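For readers who want to run a similar check on their own documents, here is a rough sketch in Python. It is not how Google measures anything; it simply assumes you have already extracted the plain text of both the live document and the cached copy, picks short phrases at evenly spaced points in the live text, and reports how far into the document the last phrase found in the cached copy sits. The function names and the default probe settings are hypothetical, chosen only for illustration.

```python
def sample_probe_phrases(text, num_probes=5, words_per_phrase=6):
    """Pick short phrases at evenly spaced positions in a document's text.

    Returns a list of (position, phrase) pairs, where position is the
    phrase's starting point as a fraction of the document's word count.
    """
    words = text.split()
    if len(words) < words_per_phrase or num_probes < 1:
        return []
    last_start = len(words) - words_per_phrase
    probes = []
    for i in range(num_probes):
        # Spread start positions from the first word to the last possible start.
        if num_probes == 1:
            start = 0
        else:
            start = (i * last_start) // (num_probes - 1)
        phrase = " ".join(words[start:start + words_per_phrase])
        probes.append((start / len(words), phrase))
    return probes


def estimate_indexed_fraction(full_text, cached_text, **probe_options):
    """Return the position of the furthest probe phrase found in the cached text.

    This is only a rough estimate: a phrase that happens to repeat early in
    the document would be counted as found even if its later copy is missing.
    """
    # Normalize whitespace so line breaks in the cached copy do not hide matches.
    cached_normalized = " ".join(cached_text.split())
    furthest = 0.0
    for position, phrase in sample_probe_phrases(full_text, **probe_options):
        if phrase in cached_normalized:
            furthest = max(furthest, position)
    return furthest
```

With more probes the estimate gets finer. The same phrases can also be pasted into a search box as quoted queries to test a live index instead of a cached copy.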

Document size might not be an issue that comes up with every search. However, how often it’s an issue is really beside the point. It illustrates that Google needs to improve their documentation and disclosure, especially since many people consider Google a full text index.

Two more bits of Google Scholar news. While preparing a review of Google Scholar (it should be available soon), librarian and legendary reference reviewer Peter Jacso has built a resource that allows you to compare Google Scholar results with various native search databases.

+ Side-by-Side Native Search Engines vs Google Scholar

He also offers this early commentary:
Preliminary tests have shown that Google Scholar often retrieves far fewer unique items than the native search engines of the publishers. On the positive side, Google Scholar links to citing references if the document was cited by journals indexed in Google Scholar, and provides the immensely useful citedness score of the documents. When Google Scholar has more “hits” for a query, they often turn out to be duplicates and triplicates (not always displayed adjacently) with a separate hit for the TOC entry, the abstract, the PDF file and (if available) the HTML file. Although their URLs are slightly different, they take you to the same spot in the archive. These are redundant and confusing.

Finally, other search engines may not index the full text of documents either. Danny gives a recent rundown on that here.
