DeepDyve - the 'deep web' search startup - announced Steve Wozniak has joined its advisory board, the Industry Standard reported.
"DeepDyve, formerly known as Infovell, switched its name and business model last year. Instead of trying to sell itself as a business utility, DeepDyve now plans to build a consumer-focused, advertising-driven business," the Industry Standard noted.
"DeepDyve is a research engine for the Deep Web or invisible web which is that part of the Web that is not index-able by search engines like Google," WebGuild stated.
DeepDyve goes beyond keyword indexing used by most current search engines. As the site explains, "DeepDyve's KeyPhrase technology extracts substantially more information from documents than typical keywords. It indexes every word, as well as every phrase in each document, and weighs their informational impact using advanced statistical computation."
Given that the deep web holds much more information than is available through the main search engines, this area offers a new way for people to access information.
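To make the idea of indexing "every word, as well as every phrase" more concrete, here is a minimal sketch in Python. It is not DeepDyve's KeyPhrase technology, whose details are not public here. The function names, the three-word phrase limit, and the use of a plain TF-IDF weight as the "statistical computation" are all assumptions made for illustration.

```python
import math
from collections import Counter

def extract_terms(text, max_len=3):
    """Return every word and every contiguous phrase of up to max_len words."""
    words = text.lower().split()
    terms = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            terms.append(" ".join(words[i:i + n]))
    return terms

def weigh_terms(documents, max_len=3):
    """Score each term in each document with a simple TF-IDF weight (an assumed stand-in)."""
    term_counts = [Counter(extract_terms(doc, max_len)) for doc in documents]
    # Count how many documents each term appears in.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n_docs = len(documents)
    weights = []
    for counts in term_counts:
        weights.append({
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        })
    return weights

# Hypothetical toy documents.
docs = [
    "deep web search engine",
    "web search engine marketing",
    "deep sea diving",
]
weights = weigh_terms(docs)
```

In this toy setup a phrase that appears in only one document, such as "deep web", gets a higher weight than a word like "deep" that appears in two, which is the rough intuition behind weighing the informational impact of phrases.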
Posted by Frank Watson at 7:16 PM | Permalink | Comments (2)
Tools to monitor your brand or your competition have become an important part of an SEM's arsenal. Search Monitor has been working hard to provide the latest tools for our industry and just announced the launch of geo-targeted monitoring.
"Geo-targeted monitoring enables companies to watch their competition and affiliates, and to manage their reputation and brand use locally. With this launch, interactive agencies, marketers, affiliate managers, and compliance teams are able to monitor paid search, blogs, forums, news, and web sites around the world from a local point of view in countries including: USA, UK, Canada, Australia, Argentina, Chile, France, Germany, Italy, Belgium, Luxembourg, Mexico, Spain, the Netherlands, New Zealand, and many others. Monitoring can be conducted in English or in the language native to each country," their press release stated.
Search Monitor covers:
1. Competitor Monitor gives insights into competitive bidding strategies, competitor market share and visibility, ranking on sponsored search, ad copy strategies, and promotions like free shipping, trials, or sales.
2. Trademark Monitor eases the tasks associated with reputation management by auto-detecting advertisers sponsoring branded keywords, use of trademarks and slogans in ad copy and display urls, and negative, positive, and neutral brand buzz on blogs, news, and web sites.
3. Affiliate Monitor simplifies oversight of affiliate programs by auto-identification of affiliates using sponsored search to detect violations of rank requirements, keyword restrictions, ad copy requirements or restrictions, and landing page copy requirements or restrictions.
Posted by Frank Watson at 10:59 AM | Permalink | Comments (1)
Cuil, pronounced "cool," has officially launched.
All the kool kids are talking about it. The question is whether anyone will use it.
As a new search engine, Cuil is a longshot. It's no Google Killer.
Check out Cuil.com. Google needs the competition. But don't expect a revolutionary search experience. The results page looks very much like Guy Kawasaki's Alltop.com.
Cuil was created by former Google engineers Anna Patterson, Russell Power and Louis Monier, who picked up $33 million in venture capital to launch the search engine.
So how is Cuil different than Google? They're claiming bragging rights for search index size: 120 billion Web pages. While Patterson says that's 3X the size of Google's index, most people acknowledge that size doesn't matter.
As Google's official blog notes, many pages not indexed either point to similar content or would diminish the quality of its search results in some other way.
Of course, Cuil can't use PageRank to organize results. So Cuil apparently assesses the actual content of a page.
Cuil's results are most similar to universal search, displaying photos horizontally across the page. Sidebars can be clicked on to learn more about related topics.
In a nod to privacy, Cuil promises not to retain users' search histories or surfing patterns.
Posted by Kevin Heisler at 7:52 AM | Permalink | Comments (16)
For years SEOs have been complaining about the inability of search engines to crawl Flash pages. But now Adobe is making an effort to keep Flash in the web development toolbox. They've announced the provision of Flash technology to Google and Yahoo in order to facilitate the indexing of sites and pages created with Flash.
“Until now it has been extremely challenging to search the millions of RIAs and dynamic content on the Web, so we are leading the charge in improving search of content that runs in Adobe Flash Player,” said David Wadhwani, general manager and vice president of the Platform Business Unit at Adobe. “We are initially working with Google and Yahoo! to significantly improve search of this rich content on the Web, and we intend to broaden the availability of this capability to benefit all content publishers, developers and end users.”
Over at the Google Webmaster Central Blog, an FAQ was posted offering up more details about the update. Here are some highlights:
Google says it can't crawl images, videos or FLV files because they do not contain text content.
What do you think about search engines crawling Flash? Are you more inclined to use Flash on your sites now? Leave your reaction in the comments!
Posted by Nathania Johnson at 9:54 AM | Permalink | Comments (2)
My home PC has started running, to paraphrase an Election Night Dan Rather, slower than a lame horse in molasses on a January morning. So I need to reinstall, prompting quite a bit of soul searching on my part: what operating system (OS) do I install; what programs do I really need; what do I need on my computer to be good at my job?
I've come up with a list of about 15 programs I really need, but so as not to make this post go on forever, I've divided the list into categories. Let's start with those programs that save you time.
Top 5 Timesavers
1. Launchy - If you have Launchy, you understand why I can't live without it. If you don't have it, prepare for your life to change. Launchy is a keystroke launcher, which basically means it's a better way of doing everything on your computer: launching programs, finding files on your desktop, performing web searches, visiting web sites. It can even give you local weather and perform calculations. All you need to do is press whatever shortcut key you've assigned to Launchy, start typing, and whatever program or file you want comes up. Forget the Start Menu; forget Windows Explorer; forget your browser. Launchy will change all that--and save you a considerable amount of time while doing it.
2. X1 - A few of these tools, like Launchy above and Ergo below, perform desktop search. But none of them do it as well as X1, which includes live searching abilities, numerous advanced search options, extensive previewing tools, and active email abilities. I'm organized (on the computer at least), but with hundreds of emails daily, along with reports for numerous clients, desktop search is a must-have time saver. No one does it better than X1. And, believe it or not, X1 is still free! You may not know that going to their site, as they only offer a preview of the newest version, but you can download older versions, which still work better than any other option, in the X1 Forums.
3. Ergo - The last search tool I use is Ergo. It's a cool visual search engine that combines a bunch of web search options with desktop search. What I really use it for is the annotation tools it has to mark up and share websites, and the cool grouping options it has to parse or organize search results. Truly smart search may still be a dream (especially for us SEOs, as it would mean the end of keyword research), but visual search tools like Ergo and SearchMe and clustering tools like Vivisimo's Clusty provide the next best thing: the ability to find what you are actually looking for before you go through results. Trust me; when you can search without browsing, you'll find the site you need in half the time.
4. Snag-it - Last but not least, Snag-it has proved invaluable for me when it comes to reporting. If you take as many screenshots as I do--of great search results, YouTube honors, social bookmarking and networking standings and occasional snafus--you know the hassle of trimming shots in Word or PhotoShop. And if you need to blur something out or add any effects, a 1-minute task blooms into a 10-minute endeavor. If you work with more than one monitor, double those estimates. Snag-it solves all that; copy only what you want from the screen and add effects on the fly. It's the only piece of software on this list that isn't free, but it's worth it.
Posted by at 4:24 AM | Permalink
Widemile Inc. has announced the beta availability of its third-generation optimization and multi-variate testing platform. The new technology allows users to simultaneously test a variety of offers, text, images, and other key variables.
The announcement coincides with 13 partnerships with companies participating in the initial platform. Those companies include Ascentium, Avenue A | Razorfish, Brand Digital, Closed Loop Marketing, DDB in Seattle, Palazzo Intercreative, POP, Portent Interactive, Red Bricks Media, SolutionSet, Stratigent, TMP Directional Marketing, and ZeroDash1.
"Widemile's third-generation software-as-a-service (SaaS) multivariate optimization system was specifically designed using open software and systems to meet enterprise standards for security, stability and performance," said Dean Kimball, Widemile co-founder and CTO. "Developed with partners in mind, the Widemile optimization system contains a wide range of testing, reporting and client management capabilities within an easy-to-use browser-based application, and provides a level of performance and interactivity that has previously only been possible with desktop applications."
Those partners are lining up to sing Widemile's praises. Randy Barney, Director of Site Optimization, Avenue A | Razorfish said, "We're excited about Widemile's approach and toolset, which is structured to scale with our business and client needs," while Lance Loveday, CEO of Closed Loop Marketing articulated that "Widemile is positioned well to enable us to seamlessly provide optimization services to our clients."
Posted by Nathania Johnson at 10:55 AM | Permalink
The New York Times, CNET, InformationWeek, and 52 other Google News sources missed the significance of Microsoft's new Research Lab in Cambridge, Mass., headed by Jennifer Chayes and her husband, Christian Borgs. The Times implied that Chayes and Borgs work in an ivory tower where basic research doesn't have a business imperative.
Nothing could be further from the truth in the online world.
Jennifer Tour Chayes, PhD in mathematical physics, led the highly esteemed Theory Group specializing in theoretical computer science. She's the co-author of almost 100 scientific papers and co-inventor of more than 20 patents. The New York Times only mentions her work in developing simple models of liquids and solids and the development of some exceedingly fast networking algorithms. Hunh?
Their groundbreaking work in search engine algorithms and social search may be the foundation of a successful Microsoft-Yahoo merger.
Chayes is one of the world's experts in the modeling and analysis of random, dynamically growing graphs (social graph, social search, Facebook, MySpace) – which are used to model the Internet, the World Wide Web and social networks.
One of the papers the couple co-authored, "Bid optimization in online advertisement auctions", details the ways paid search campaigns can be optimized by advertisers and search engines. "Multi-unit auctions with budget-constrained bidders", written by Borgs, Chayes, Nicole Immorlica (MIT), Mohammad Mahdian, and Amin Saberi (published in June 2005), discusses ways to optimize revenue for search engines given the fixed budgets of search marketers.
Their recent work provides a tutorial on search engine optimization and PageRank, before delving deep into algorithms few search marketers (myself included) understand.
Search engine optimization lives and dies by PageRank. Here's what you need to know about their research into PageRank.
Borgs and Chayes go beyond where a Web page ranks and explore the pages or sets of pages that contribute most to its rank. That's the foundation of link building. With the exception of link farms, link building has largely been a manual effort, somewhat arcane, but vital to SEO. PageRank contributions have been used for link spam detection and in the classification of web pages.
Chayes and Borgs note that a set of pages contributing significantly to the PageRank of a page is often called a "contribution set" or "supporting set" of the page. Their work goes a long way toward solving the mysteries of Google PageRank -- and fighting the spam that threatens to degrade the relevancy of all search engine results pages.
Link spam can be detected in many ways besides the SpamRank-type algorithms: applying machine learning to link-based features, the analysis of page content, TrustRank, and Anti-TrustRank, and statistical analysis of various page features. Chayes, Borgs and their research associates use the local algorithm developed here to design several locally computable page features for link spam detection, and evaluate these features experimentally.
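For readers who want to see what a "contribution" to PageRank looks like in practice, here is a small Python sketch. It is not the local algorithm from the Borgs and Chayes research. It relies on a standard property of PageRank: the score is linear in the teleport distribution, so the share of a page's rank that comes from one source page can be computed by running PageRank with all teleport probability placed on that source. The toy link graph, the page names, and the 0.85 damping factor are assumptions for illustration.

```python
def personalized_pagerank(links, teleport, damping=0.85, iterations=100):
    """Power iteration for PageRank with an arbitrary teleport distribution.

    links maps each page to the list of pages it links to.
    teleport maps each page to its teleport probability (values sum to 1).
    """
    pages = list(links)
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1 - damping) * teleport[p] for p in pages}
        for page in pages:
            # A page with no outlinks spreads its rank evenly over all pages.
            targets = links[page] or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

def contributions_to(links, target, damping=0.85):
    """Return how much each page contributes to the PageRank of target."""
    pages = list(links)
    n = len(pages)
    result = {}
    for source in pages:
        # Put all teleport probability on one source page.
        teleport = {p: 1.0 if p == source else 0.0 for p in pages}
        ppr = personalized_pagerank(links, teleport, damping)
        result[source] = ppr[target] / n
    return result

# Hypothetical site: two "spam" pages exist only to link to the home page.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "spam1": ["home"],
    "spam2": ["home"],
}
uniform = {p: 1.0 / len(links) for p in links}
pagerank = personalized_pagerank(links, uniform)
contrib = contributions_to(links, "home")
```

The values in `contrib` add up to the ordinary PageRank of "home", so sorting them shows which pages form its supporting set. A spam detector in this spirit would look for pages whose rank is propped up mostly by pages with no legitimate standing of their own.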
Chayes' contributions to Microsoft technologies include the development of methods to analyze network structure and behavior, auction algorithm design (i.e. paid search auctions), and online business model design and analysis.
She's famous for her work on phase transitions in problems in discrete mathematics and theoretical computer science. The result? The rise of some of the fastest known algorithms for fundamental problems in combinatorial optimization, the intersection of artificial intelligence, mathematics and software engineering. That would be search engine algorithms, paid search auctions and search engine revenue optimization.
Algorithms fuel search engines, spam filters, online advertising engines, social networks, machine translation and most of the online world. Social sciences - economics, psychology and sociology - analyze how and why people value things and study how people interact with each other. That's why, for example, Hal Varian plays a key role in Google's success as the company's chief economist.
That's why Google's Marissa Mayer says social search is the future of Google.
That's the core of Search Engine WarGames.
Posted by Kevin Heisler at 3:32 PM | Permalink
A number of the leading online news publishers are looking to gain greater control over which of their news stories get listed in the search results of the various search engines, and how, according to a report by the Associated Press.
"Currently, Google Inc., Yahoo Inc. and other top search companies voluntarily respect a Web site's wishes as declared in a text file known as 'robots.txt,' which a search engine's indexing software, called a crawler, knows to look for on a site," AP noted.
The individual engines also have their own proprietary code, though, and the publishers want greater influence on how it is developed and would like to see a unified methodology, the article reported.
The current system doesn't give sites "enough flexibility to express our terms and conditions on access and use of content," said Angela Mills Wade, executive director of the European Publishers Council, one of the organizations behind the proposal. "That is not surprising. It was invented in the 1990s and things move on," Wade told AP.
Robots.txt files were first developed in 1994 and have been the standard method webmasters use to block spiders (the crawlers search engines use to go through websites' content). However, there has been much conversation online over the past 5-6 years that some crawlers ignore the robots.txt file.
The publishers' "proposed extensions, known as Automated Content Access Protocol, partly grew out of those disputes. Leading the ACAP effort were groups representing publishers of newspapers, magazines, online databases, books and journals. The AP is one of dozens of organizations that have joined ACAP," AP noted.
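For anyone who has not looked at robots.txt from the crawler's side, here is a short Python sketch using the standard library's `urllib.robotparser` module. The robots.txt content, the crawler name "ExampleBot", and the example.com URLs are made up for illustration. Note that this only shows how a well-behaved crawler checks the file; nothing in the file itself forces a crawler to obey it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a news site: keep every crawler out of the
# archive, and keep one specific crawler out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/

User-agent: ExampleBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(user_agent, url):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)

print(allowed("OtherBot", "http://example.com/news/story.html"))
print(allowed("OtherBot", "http://example.com/archive/2007.html"))
print(allowed("ExampleBot", "http://example.com/news/story.html"))
```

The rules are limited to allow or disallow by path and user agent, which gives a sense of why publishers describe the format as not flexible enough to express terms of access and use.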
Posted by Frank Watson at 1:10 PM | Permalink
No way I could pass on pointing to the Debby Richman blockbuster post.
Wharton says: Online recommendation engines may chop off Long Tail of Search.
Prick up your ears, Chris Anderson. Your Long Tail doberman (below) is under attack:
> is in the Page Title…nice trick) www.pandia.com/sew/169-duplicate-content.html
Duplicate Content - Get it right or perish www.webmasterworld.com/google/3060898.htm
Regular Google results:
Duplicate Content Issues www.seroundtable.com/archives/003398.html
Avoiding Duplicate Content Penalties www.elixirsystems.com/seo_tips/avoiding-duplicate-content-penalty.php
Official Google Webmaster Central Blog: Deftly dealing with ... (ahem perhaps Webmaster Central Blog should get shorter Titles too?) googlewebmastercentral.blogspot.com/2006/12/deftly-dealing-with-duplicate-content.html
I would give this one a tie. It is probably due to the fact that the term “duplicate content” is so often used by bloggers and in forums that the top three are all SEO-related in the regular results of each of these searches. It seems interesting and also a good sign that Google's Webmaster Central Blog does not show in the top three on the SEM-search feature (they are on the list). Obviously Google is not trying to manipulate any results in favor of their own blog.
I feel very confident that I will be using the SEM search blog almost exclusively to search for SEO and Paid Search topics in the near future. Thanks for this great tool, Alister, Lee and Google – I am pretty certain that you better get ready for lots of traffic from search marketers and students interested in the subject as well. In fact, if I was to advertise, student-populated sites would probably be my first target. After all, this will probably end up getting blogged fairly heavily so those in the SEM community should find out very rapidly.
Please share your thoughts on this search function at the thread at SEW Forums, Google Custom Search For Search Marketers and Search Students
Posted by Chris Boggs at 10:13 AM | Permalink
Update: Liana Evans of Search Marketing Gurus has done a great job of journalism and corrected some of the errors that I and others posted about this story. You can read about it here.
Wikipedia founder Jimmy Wales plans, in partnership with Amazon, on launching a search engine early next year, according to the London Times.
Wales contends that Google has developed flaws as it has grown, and believes he can use his wiki methodology to compete with Google, Yahoo and MSN.
He told the Times that computer algorithms do not make as good selections as humans and if people get to use his alternative they may prefer it.
“But we have a really great method for doing that ourselves,” Wales told The Times. “We just look at the page. It usually only takes a second to figure out if the page is good, so the key here is building a community of trust that can do that.”
"The reputation already fostered by his Wikipedia community and the transparency of his technology will build sufficient trust in his search engine to bring in advertising revenue and make the Wikiasari venture profitable," The Times reported.
The project has been called Wikiasari - a combination of Hawaiian and Japanese for "quick" and "rummaging search". How this plays out should provide entertainment in the new year.
Update: Michael Arrington at TechCrunch has come up with a screen shot of the new engine. Looks like sponsored listings at the right, related links at the top and organic results where they normally appear.
There is also more detail and comments from Wales.
Posted by Frank Watson at 6:47 PM | Permalink
This weekend The Register published an article named Google developing eavesdropping software. The article describes how Google would use existing PC microphones and sound fingerprinting technology to show ads that are more relevant to you. The article goes on to explain how the sound fingerprinting works; it "breaks sound into a five-second snippets to pick out audio from a TV, reducing the snippet to a digital "fingerprint", which it matches on an internet server." Privacy folks are worried about the repercussions of such software.
Postscript Barry: I should link to Google Paper Explains Listening To Your TV Can Help It Put Ads & Info On Your Computer, which we covered back on Jun. 9, 2006.
Posted by Barry Schwartz at 10:50 AM | Permalink
A New York Times article has a detailed analysis of Google's infrastructure and discussion with Urs Hölzle, senior vice president for operations at Google. Here are some of the key points I pulled from that article.
+ Google tends to build from the ground up versus buying.
+ Google's computing costs are half those of other large Internet companies and a tenth those of traditional corporate technology users.
+ Critics call Google's philosophy "unnecessary and inefficient."
+ "Google is reducing cost while maintaining performance by shifting the burden of reliability from hardware to software; individual hardware components can fail, but software automatically shifts the local task and the data to other machines."
+ Google is among Advanced Micro's five largest clients.
Posted by Barry Schwartz at 9:51 AM | Permalink
There are many people discussing a recent patent Google was awarded for picking up on ambient audio from your TV and pairing those sounds to your computer to serve up ads based on what you are watching (or something like that). Google Research Scientists Michele Covell & Shumeet Baluja described the technology as follows:
"We showed how to sample the ambient sound emitted from a TV and automatically determine what is being watched from a small signature of the sound -- all with complete privacy and minuscule effort. The system could keep up with users while they channel surf, presenting them with a real-time forum about a live political debate one minute and an ad-hoc chat room for a sporting event in the next. And, all of this would be done without users ever having to type or to even know the name of the program or channel being viewed. Taking this further, we could collect snippets from the web describing the actors appearing in a movie or present maps of locales within the movie as it takes place (no matter if users are watching it as a live broadcast or as a recorded broadcast)."
There are two additional articles that I am aware of that have good coverage of this. The first is at Small Biz Pipeline and the second is at TechCrunch. I particularly like how TechCrunch pulled out the four main points of the paper, as follows:
+ Personalized information layers: Here's what Tom Cruise is wearing in the show you are watching and here's where you can buy the same clothes in your zip code.
+ Ad hoc social peer communities: If you would like to chat about this show, ten of your college friends are watching it right now as well.
+ Real-time popularity ratings: Nielsen requires hardware and the results aren't available in real-time. You might want to know if there is a spike in viewers watching the show on channel 9 right now. Advertisers might want to know that too.
+ TV-based bookmarks: Click to save a show or clip into your video library and there will be more than just a few shows available for watching later.
Posted by Barry Schwartz at 8:43 AM | Permalink
Ever wonder how spiders/bots/crawlers behave? Well, if you did, a new analysis, "On Bots," was released at http://drunkmenworkhere.org/219. The article has an analysis and visualization of the behavior of search robots. The analysis covers Yahoo Slurp, Googlebot and MSNbot crawling 2 billion pages structured in a binary tree over 1 year. The study was conducted on a single site, so I am not sure how statistically valid it is over all sites on the Web. Just take a look at the overall results to see how much of a hog Yahoo is.
Posted by Barry Schwartz at 2:56 PM | Permalink
Microsoft's Camera Phone Search Project and Other Camera Phone Search Tech from ResourceShelf covers a new Microsoft Research project allowing you to take pictures of things in order to get search results back about them.
Snap something with your camera phone, then that goes into an image search database, which identifies the object or type of object in order to run other types of searches about it. Or that's the idea. You can't try it yet, and Microsoft isn't even certain what they may do with it.
How about searching by taking pictures of bar codes? Completely different idea than this project, but thanks for asking! The ResourceShelf post gives you resources on the whole Amazon bar code searching in Japan thing, for the curious. And Frucall, mentioned yesterday by Brian, deals with bar code searching as well. The downside is you have to key in the numbers.
Posted by Danny Sullivan at 7:31 AM | Permalink
Back in September, SEW Forums moderator Edel "Orion" Garcia posted a thread about a new search technology under development. It was coincidentally called the "Orion Search Engine" but not connected with our moderator. Instead, it was developed by a university student who now, according to news reports out this weekend, works for Google. Google's also acquired his search technology.
How great this search engine was is impossible to say. The press release that inventor Ori Allon put out last September was full of excitement, but so are plenty of releases trying to attract the attention of investors and the media. The search engine itself was never available for the public to use.
It sounds like Allon mainly developed an algorithm useful in pulling out better summaries of web pages. In other words, if you did a search, you'd be likely to get back extracted sections of pages most relevant to your query. From the release:
"The results to the query are displayed immediately in the form of expanded text extracts, giving you the relevant information without having to go the website."
Such extraction could work well with moves by Google to expand direct answers that it offers, something all search engines are doing. Of course, the more Google and other search engines extract heavily from web pages without sending them actual traffic, the more likely they'll come under legal pressure for stepping over the fair use line.
Via Threadwatch, Google buys search algorithm invented by Israeli student from Haaretz has more details on Google getting the rights to the Orion algorithm and confirmation that Allon now works for Google. His university says that Yahoo and Microsoft were also in negotiations for the technology.
Google wins rights to Aussie algorithm from The Age reports that Allon's been with Google for about six weeks. However, Microsoft chairman Bill Gates never commented on the technology, to my knowledge. The Age just seems confused that Allon's press release mentioned public comments by Gates that there's room for improvement generally in search.
Google does deal for Aussie program from the Daily Telegraph pitches that the technology will revolutionize the way we search. Ho hum. Reality check, OK? When Google acquired the three people from Kaltix along with their search technology back in 2003, it hardly created a revolutionary change for us soon after.
By revolutionary, I mean a radical shake-up of how we search or a major leap-frogging past other players. That didn't happen post-Kaltix. We did indeed see better personalized search come from Google, what I find one of its most impressive features. But that's an evolutionary change. It works on top of other things Google has built. It doesn't overturn and throw out the base technology.
So my reality check alarm is mainly for anyone who thinks Google's going to suddenly change because Allon and this extraction algorithm are now at Google. He gives Google another good employee, and the technology will probably give Google another evolutionary change that may improve things over time, rather than instantly.
Want to comment or discuss? Visit our Search Engine Watch Forums thread, The Orion Search Engine.
Posted by Danny Sullivan at 7:56 AM | Permalink
Posted by Barry Schwartz at 8:38 AM | Permalink
The second issue of Google's Newsletter for Librarians is now available. It features an article by Karen Schneider, the director of the Librarians' Internet Index, the wonderful and important searchable directory of high quality web resources that I've mentioned on the blog and in SearchDay many times.
Schneider focuses on some of the critical information judgments needed in determining the trustworthiness of a site and the info that it contains. Those of us who attended library school are aware of many of these concepts. I hope Karen's article reaches more than information professionals, including students, to whom these ideas should be taught and reinforced from the earliest grades forward.
Next, Matt "Jagger" Cutts is back with a look at how Google determines what sites are "most trusted." His article talks about the hundreds of factors (including some traditional info retrieval metrics) that Google looks at in addition to PageRank.
For more of an in-depth discussion of this you might want to pick up a copy of Chris Sherman's (yes SearchDay's Chris Sherman) book, Google Power. You can preview the title via Amazon's Search Inside the Book. I was unable to find it using Google Book Search.
Matt's article, written primarily for librarians and other information professionals, explains that Google, like other engines, analyzes the actual content.
He points out that, "this [analysis] goes beyond scanning page-based text, which webmasters can easily manipulate through meta-tags."
While it's true that Google and other engines look to some degree at the meta-description tag, he doesn't mention that although the meta-keyword tag is still used by some, its value is not as great as it once was. Danny points this fact out in a 2002 article. You'll also find meta tags listed in this post from Barry.
Cutts goes on to write: "We also look at factors like fonts and the placement of words on a page. And we examine the content of neighboring pages, which can provide more clues as to whether the page we're looking at is trusted and will be relevant to users."
It would have been useful, particularly to the readers of this article, if Matt had explained that the factors listed above and many others can also be manipulated or what others have termed "gamed."
As I've pointed out in many presentations to librarians, this is not a good or bad thing but simply the way large general-purpose web engines work. For the librarian, a knowledge and understanding of this is important and useful.
After reading both Karen's article and Matt's piece we see somewhat of a disconnect between trustworthiness in terms of inclusion and good placement on a results page versus the trustworthiness concepts that a human might use to judge not only the quality of a web page itself but the data it contains. Yes, I'll readily admit to being a bit prejudiced here but I think Karen's article also illustrates the value of just one of the many skills a well-trained librarian can offer.
Matt concludes with links to a few more excellent papers.
Btw, many of the same concepts (what Google calls and has patented as PageRank) are in place at just about every other major web engine. In other places, the concept is referred to as link analysis.
As a librarian I would have loved it if Matt had thrown a "shout out" to Dr. Eugene Garfield, the father of citation analysis. Citation analysis has been around since the 1950's and librarians have been using it since day one. The relationship between citation analysis (something librarians understand) and link analysis (PageRank) is strong and is even noted in Brin and Page's seminal paper. One of the biggest differences is that web link analysis is much more open than traditional citation analysis and thereby harder to game (although to some degree it's also possible).
Yes, the concepts used in citation analysis are really what drive link analysis.
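For readers who want to see the idea in miniature, here is a rough sketch of link analysis in the spirit of citation counting: each page passes a share of its score to the pages it links to, and the process repeats until the scores settle. This is a textbook simplification, not the formula any engine actually uses. The function name, the damping value of 0.85, and the tiny link graph are all assumptions made up for illustration.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scoring (illustrative only).

    links maps a page name to the list of pages it links to.
    damping=0.85 is a conventional textbook value, assumed here.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score among the pages it cites.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical four-page web: three pages "cite" page c, and c cites a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_scores(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph the most-cited page ends up on top, and the page it links to inherits much of that standing, which is the same intuition behind counting citations to a journal article.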
If you want to learn more, this post has tons of links and interviews about citation analysis. It also includes a link to Garfield's paper, "Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas."
Finally, although this Scientific American article was written in 1999, I still think it's one of the best, especially for non-geeks, about web link analysis. It was written by members of IBM's Clever team.
Clever was a web search engine (never publicly released) by IBM. More about it here. Members of the Clever team read like a "who's who" of web search including Jon Kleinberg, Soumen Chakrabarti, and Prabhakar Raghavan, who is now the head of Yahoo Research.
As you review the article, take special note of the section where Clever and Google are compared. While Clever never made a public appearance, many of the concepts it offers are what power the Teoma/Ask Jeeves search technology.
Postscript: Yahoo's Prabhakar Raghavan offers archived materials from his Stanford classes on text and information retrieval online. Must-have content for those interested in the subject.
Posted by Gary Price at 11:58 AM | Permalink
In October, Danny blogged about Google's "coming soon" quarterly newsletter for librarians. Today, the first issue went live. It's available here.
I've posted a bit more on ResourceShelf.
The highlight of this issue is the Matt Cutts-authored article on how Google crawls content and ranks results, with a very nice explanation of how an inverted index works. For some librarians and many readers of this blog, it will be familiar material. Nevertheless, when Mr. Cutts writes, it's always a great read and in this case an excellent review.
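If the inverted index is new to you, a minimal sketch may help. This is not taken from the newsletter article and says nothing about how Google builds its index; the function names and the three sample documents are made up for illustration. The idea is simply that instead of scanning every document for a word, you keep a list of documents for every word, much like the index at the back of a book.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the documents for the first word, then narrow the set.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical three-document collection.
docs = {
    1: "librarians use citation analysis",
    2: "search engines use link analysis",
    3: "librarians teach search skills",
}
index = build_inverted_index(docs)
print(sorted(search(index, "search analysis")))
```

A real engine would also store word positions, handle punctuation and stemming, and rank the matches, but the lookup-by-word structure is the core of it.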
Posted by Gary Price at 5:36 PM | Permalink
News.com has a nice mention of long-time search watcher Stephen Arnold having compiled more than 120 patents he believes belong to Google on a CD. Want to get them in one go? Visit his site, pay your $50, and there you go. Gary, of course, regularly posts here about patents and links to where you can download them for free (use that Legal: Patents link below this post if you are an SEW member for a fast way to see his past posts). But if you want to save yourself some time and love reading patents, this looks like an easy way to go.
Posted by Danny Sullivan at 8:26 AM | Permalink
Convera, a well-known name in enterprise search technology, posted a bit of news today that their web index continues to grow. So? It's been reported that Convera will enter the public web search space with a release of a web search tool by the end of this year. The company also announced that they've added 100 million images to their web index. Convera recently announced that they've just licensed use of their web database by an undisclosed U.S. Government organization.
Posted by Gary Price at 4:51 PM | Permalink
If you're in need of a couple of roundups that look at various search tools and services, two are online today: one in Time magazine and the other in The Boston Globe. It's likely that the services and companies mentioned in these articles will be "new" to many readers of these publications. However, we've discussed and linked to many of them on the SEW Blog during the past year.
Time magazine's: On the Frontier of Search, and the Boston Globe article: Cutting through search-engine clutter, include mentions of:
Time NOTE: I was interviewed by a Time reporter for this story.
Boston Globe
Quotes
"Search will ultimately be as good as having 1,000 human experts who know your tastes scanning billions of documents within a split second. It will model the human brain." Gary Flake, Distinguished Engineer at Microsoft
Note: Dr. Flake is the former head of Yahoo Research Labs. Here's an interview that I did with Dr. Flake in 2004.
Posted by Gary Price at 10:54 AM | Permalink
Forbes is out with their 2005 "E-Gangs" list of important people in the "e" world. This year's list is subtitled: Eight Masters Of Information, and focuses on members of the search and IR community. Many of the eight honorees are people from companies/services we mention regularly on the blog.
The list includes:
This page contains links to short profiles of each list member.
Posted by Gary Price at 2:49 PM | Permalink
Developers and other search geeks out there might find this draft document that lists and discusses a number of search interface protocols and specifications worthy of a read/bookmark. The paper comes from the CORDRA project at Carnegie Mellon University. Thanks to Puzzlepieces for the tip. Btw, Michael Fagan (publisher of Puzzlepieces) also notes several other protocols/specifications that aren't listed in the paper.
Posted by Gary Price at 6:25 PM | Permalink
All of the major search engines, to one degree or another, provide insights into what they're working on in their research and development labs. The quality and quantity of what's shared varies widely, but you can get a good sense of what to expect in the future by spending some time with what's available. Today's SearchDay article, What's Cooking in Search Engine Labs shows you where to find the best sources of inside information, both official and unofficial.
Posted by Chris Sherman at 10:28 AM | Permalink
Wouldn't it be cool if search engines were so smart that they'd implicitly understand even our most complex information needs and provide spot-on results just from the spare two-or-three word queries that most of us rely on? It'll happen, though not in the near term. But you can get a tantalizing glimpse of how applied artificial intelligence will likely change our search experience by visiting a web site that's up and running today.
In today's SearchDay article, If Search Engines Could Read Your Mind, I take a look at 20Q.net, a site that lets you play a version of the popular children's game "twenty questions," and uses a neural network to guess what you're thinking about. 20Q.net is surprisingly good, and foreshadows some technologies that will likely be commonplace in our search tools some years down the road.
Posted by Chris Sherman at 10:18 AM | Permalink
Spotted via Geeking with Greg, is this Seattle Times article: Entrepreneurs seek new ways to mine Web, that takes a look at a few Seattle area search companies including:
+ Nervana (enterprise search and a great name for a Seattle company)
+ SingingFish
+ Findory
+ Infospace
+ The work of Professor Oren Etzioni at the University of Washington
Dr. Etzioni was the co-developer of Metacrawler and is now working on the Know-It-All project.
To address the problem of accumulating large collections of facts, we are developing KnowItAll --- a domain-independent system that extracts massive amounts of information from the Web in an autonomous, scalable manner.
More about KnowItAll here (a paper from the WWW2005 conference; PDF) and here (from WWW2004; PDF).
Posted by Gary Price at 1:15 PM | Permalink
The ZDNet "Behind the Lines" blog has a look at a search panel that took place at PC Forum last week. The panel included Marissa Mayer from Google, Udi Manber from A9, Alain Rappaport, CEO of Medstory; and Arkady Volozh, CEO of Yandex, the leading Russian search engine and portal.
Here are a few key items (from my point of view)
+ Google's Marissa Mayer: "We don't know how to do [personalization] well, so we are starting with baby steps, such as knowing where you are as a context," Mayer said.
It will be interesting to see if Google will eventually request more info from users to help with personalization. Another question is, will Google users want to provide the info?
She [Marissa Mayer] said, "We need to get better not at doing searches, but at providing answers people are looking for. There will be a day when ten HTML links regardless of who you are is not the answer any more." She also said that the idea of everybody getting the same search result isn't reasonable.
100% agree on this one. Large engines like Ask Jeeves and MSN are already doing work in this area. Yahoo is also offering many shortcuts that in some cases place an answer on the results page. Google, too! Companies like Kozoru are also doing work in this area. I've said numerous times that for certain "ready reference" queries, search engines will become answer engines. Answers instead of links will also be important for mobile web search to grow.
+ Udi Manber from A9 "In general, people will learn to use search better but have to invest the thinking--we are not in the mind reading business."
Way to go Udi! I'm glad, no thrilled, to read this. A little (like a few minutes) of explanation or training can go a long way. It has been my experience that with a little education users not only leave the session having a couple of new skills but also get excited to go out and learn more on their own.
People can't use what they don't know about and, unlike those of us who follow the search space closely, no one has told them what search tools can do with just a small amount of knowledge. As web engines grow larger, searching skills and knowledge about a variety of tools will become even more important. Yet, according to this study, search skills haven't really changed in the past seven years. Not every good answer can get into the Top 10 results when a searcher enters two or three keywords.
I'm not only talking about advanced searching skills like placing phrases in quotation marks (-:, but just showing people that large engines offer many services (news, images, shortcuts, etc) beyond the web search box. Also, this training time can be used to share info about specialized search tools (aka verticals) that might be able to save the searcher time, provide them with better results, and allow them to do more with the results.
Education about search and info retrieval should be a part of the curriculum from first grade on. If this is the "info age" (pardon the cliche), shouldn't info retrieval skills along with critical info skills (ability to judge the quality of the content) be crucial? Unfortunately, in many cases, they're not.
Posted by Gary Price at 1:59 PM | Permalink
Claria, the company behind the Gator eWallet software, has released new search relevancy ratings today examining how the top search listings on Google, MSN and Yahoo compare to pages the company says its research shows are actually most relevant. More important, the ratings mark the first use of technology Claria hopes will let it improve the results of major search engines or perhaps offer its own improved search engine.
You'll find the ratings in this company press release, and I examine them more in the Claria Unveils Behaviorial-Based Search Ranking article now posted for Search Engine Watch members. In short, this isn't a battery of tests that you can take to the bank to know who is best.
Instead, it's really meant to showcase the bigger point Claria wants to make. It's now going public with its RelevancyRank system that uses behavioral data to determine what it believes are the best pages on the web for any particular term.
Claria computes this by both monitoring the activities of web surfers and searchers through its own software applications and with partnerships it has with publishers. The company's plan is that the technology will either be licensed to search providers looking to use its data or it may release its own search engine powered by clicktracking and behavioral data itself.
More on this "third generation" of clicktracking in this article for SEW members, Claria Unveils Behaviorial-Based Search Ranking.
Posted by Danny Sullivan at 3:43 PM | Permalink
A Glimpse of the Soul of the Machine
Lucene is a popular, open-source search engine that's freely available for anyone to download and use. Lucene is a basic platform, and because it's written in Java it's highly extensible and can be adapted to many purposes.
If you're a proficient programmer, or simply are curious about what the code looks like for a world-class search engine, check out today's SearchDay article, How to Hack Your Own Search Engine, which is a review of a new book, Lucene in Action. The book offers detailed instructions for downloading and customizing Lucene yourself, and the companion site also makes all of the source code examples available for your perusal.
Posted by Chris Sherman at 9:27 AM | Permalink
In the past few weeks, a couple of members of Google's leadership have been sharing their thoughts and views in public forums.
First, Google's VP of Engineering Adam Bosworth, spoke to The Gillmor Gang (you can listen online) about future search engine architecture, personalization, and RSS. Findory's Greg Linden responds to some of Bosworth's comments with his take on the value of personalization.
Second, Google Blogoscoped points us to a transcript of a presentation by Peter Norvig, Google's Director of Search Quality. Norvig discusses semantic web ontologies, automation, and other issues.
Posted by Gary Price at 1:43 PM | Permalink
Microsoft: No Plans to Integrate Desktop Search into OS
During a panel about search at the Harvard Business School Cyberposium, Mark Kroese, general manager of information services and merchant platform product marketing for MSN, told the audience that MS doesn't plan to integrate desktop search into the operating system.
"'...there's no immediate plan to do that as far as I know,' Kroese said. 'That would have to be a Bill G. [Microsoft chairman and chief software architect Bill Gates] and the lawyers' decision.'"
The remainder of the eWeek article, Microsoft Won't Bundle Desktop Search with Windows, offers more coverage of Cyberposium (http://www.cyberposium.com/index.asp), with comments from Yahoo!, Google, and Xerox representatives. Topics include local, paid, desktop, and enterprise search.
Here are a couple of key quotes from the article: "At Yahoo, we think of local search as an extension of vertical search," [Bradley] Horowitz said. "It reaches into a different business model and provides a tremendous amount of value."
Microsoft's approach is a bit different, Kroese said. "At Microsoft our heritage is being a platform and our approach to search will not be a lot different."
"Today, paid [search] is a great business model," said Microsoft's Kroese. "But we're also pursuing other business models."
Google's [Deep] Nishar emphasized that "advertising is not necessarily evil." He noted that 40 percent of Internet search queries are commerce-specific queries. Charging advertisers for placement is not unethical, he said.
For additional coverage of Cyberposium, see the News.com article: Future of search rides on relevance.
Posted by Gary Price at 9:22 AM | Permalink
From SearchEngineBlog.com, Rich Skrenta Interview has Rich Skrenta sharing thoughts on how he went from being a founder of the Open Directory to starting up news search site Topix.
He considers the Open Directory (or any web directory) no longer necessary, given how the web has evolved and grown, saying:
It achieved these goals and has fulfilled its mission of becoming the largest human-edited directory of the web. But the web moved on, and while directories were very interesting in the mid '90's, keyword search has eclipsed them as the main ways consumers find information on the Internet.
Skrenta also provides some background on automatically classifying the news and on finding copies of stories where registration is not required, when possible.
Some time ago, I wanted to do my own long piece on the decline of directories but never got to it. So I'll dive in a bit here.
Back in 1999, it seemed directories had won in search, making up the majority of services when compared to crawler-based search engines. News.com has a nice piece on this from back then, Web search results still have human touch.
So what happened? I use a library metaphor to explain the decline. In the early days of crawlers, it was like walking into a library, asking for information about cars and being given thousands of matching pages from within the various books to sort through.
Sure, some of the good pages might be on top. But it was overwhelming to get so much junk and other information as well. In contrast, a directory was like using a card catalog. It helped you locate a few books on the topic of cars, a much more manageable list to deal with.
Crawler-technology has improved since 1999, of course. Google led the way and gave us the ability to search on every page of every book in the library AND largely get some very good matches right up front. The need for directories as a filtering device has diminished.
Again from News.com, The changing face of search engines from 2003 looks at some of this flip-flop, including the idea of Yahoo losing its directory "religion" when it put crawler-results first and the challenges the ODP has faced.
Even if directories are in decline, humans still have a role. In our recent coverage of AOL's search engine changes, it was heartening to hear that the company has about 60 people working to actively shape, refine and customize some of the results of its service. That's 60 more people than Google has actively intervening in keyword-specific results.
There are times when it is helpful for human beings to review results and do hand manipulation of them, to "program" the most important queries to help ensure quality. As I've written before, sadly most of the major search engines have abandoned such review.
You can't intervene in every result, and doing so raises other issues, such as when AOL removed the George W. Bush home page from coming up tops for a search on miserable failure. But it can also ensure that your users are not being served solely by an automated process that can and will make mistakes.
Sure -- airplanes can take off, fly and land themselves. But it's nice to have a pilot there as a double-check and vice versa.
Posted by Danny Sullivan at 9:50 AM | Permalink
Robert Weisman's article in the Boston Globe, In the shadow of Google, takes a look at a few Boston-area search companies including:
+ Eliyon
Note: This tool uses open web data (be careful) and AI to build profiles about people. Although Eliyon still has a ways to go, I've seen a great deal of improvement over the past few months. Some of their services are free, while others are fee-based.
+ EasyAsk
+ Dotomi
Endeca, Fast, Northern Light, and iPhrase are also briefly mentioned.
Key Quotes: "There's a lot of business to be had in search in the next few years, and it's not all going to Google," said Susan Aldrich, senior vice president at Patricia Seybold Group, a technology research and consulting firm in Boston. "I think several of the local companies have the potential to become wildly successful and get very big."
"Some of the things that Google does so well, like page rankings, are irrelevant in the enterprise," said Sue Feldman, vice president and search analyst at International Data Corp., a research firm in Framingham.
Posted by Gary Price at 4:18 PM | Permalink
Nathan Enns of FyberSearch dropped me an email to say he saw my proposal for search engines to consider an ignore tag and implemented it for his own FyberSearch search engine. More details and instructions in the press release at his site. OK, so FyberSearch is a tiny search engine, and the command is specific to it. This action isn't going to stop the problems bloggers and other publishers have. But it's a nice start!
For more background on the call for search engines to consider new tools for publishers, see my Comment Spam? How About An Ignore Tag? How About An Indexing Summit! post. Discussion is also ongoing in this forum thread: Time For An Indexing Summit? I share within it that I'll likely set up a summit-like panel for our next SES show in New York.
Posted by Danny Sullivan at 7:56 AM | Permalink
Bloggers seem increasingly upset at the comment spam they have to deal with, something driven primarily by those who seek higher search rankings by posting links to their sites into comment areas.
To me, the solution seems simple. Why not give designers a tag telling search engines to ignore portions of a web page? Or better yet, how about a coordinated summit among search engines and webmasters to advance the state of site indexing overall?
The solution would help more than bloggers. That's good, because more than bloggers need it. The problem bloggers face has already been an issue for those who run forums, guest books or any other type of venue allowing public contributions. All are -- and have been -- targets of those who want to promote web sites.
For a non-blogger perspective on the problem, check out Mike Grehan's Google PageRank Lunacy article we ran last year in SearchDay. It discusses how guest book spam spoiled a memorial site for a good friend of his. Just like bloggers, people with guest books need help too.
I take my inspiration for an ignore tag primarily from Bruce Clay, who proposed a somewhat similar idea for <ad> tags to Google informally earlier last year. Bruce's concern was that if he or others want to purchase links, they don't want those links to harm them somehow in search engines.
Believe it or not, there are some people who buy links because of the traffic the links themselves may drive. Bruce's thought was that if publishers such as Search Engine Watch's own JupiterMedia could surround paid links they sell with an ad tag, then search engines could discount those links for ranking purposes.
Interesting idea. I also like the idea for another reason. Since we've operated our Search Engine Watch Forums, we've been liberal about allowing people to link out to resources as relevant. But this can and has been abused. Not much, fortunately, but we occasionally have to police out the irrelevant link or the link hidden in a period or comma.
One solution would be an <ignore> tag. Using this, we could surround any posted links with the tag to prevent them from being indexed. If that became commonplace on forums, it might reduce the attraction for link spam to them.
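To make the proposal concrete, here is a rough sketch of how a crawler might honor such a tag. To be clear, the `<ignore>` element is only the idea floated above, not a tag any search engine is known to support, and the class name and sample markup are invented for illustration. The sketch uses the HTML parser from Python's standard library and simply refuses to record any link that appears inside an ignore block.

```python
from html.parser import HTMLParser


class IgnoreAwareLinkCollector(HTMLParser):
    """Collect link targets, skipping links inside a hypothetical <ignore> element."""

    def __init__(self):
        super().__init__()
        self.ignore_depth = 0  # how many <ignore> blocks we are currently inside
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "ignore":
            self.ignore_depth += 1
        elif tag == "a" and self.ignore_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "ignore" and self.ignore_depth > 0:
            self.ignore_depth -= 1


# Made-up forum page: one editorial link, one link posted inside an ignore block.
page = (
    '<p><a href="http://example.com/editorial">A resource we chose</a></p>'
    '<ignore><p><a href="http://example.com/posted">A link a visitor posted</a></p></ignore>'
    '<p><a href="http://example.com/footer">Footer link</a></p>'
)
collector = IgnoreAwareLinkCollector()
collector.feed(page)
print(collector.links)
```

Only the links outside the ignore block are collected, so a link dropped into a forum post would earn nothing from the crawl.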
That leads to another inspiration. Six Apart/Movable Type's Brad Choate wished for some type of page-based ignore feature last July in his Restricting Google on my terms post (something he originally asked for back in Feb. 2002). His solution, which he didn't realize when doing it (check out the comments of that post) was to cloak his pages using user agent detection.
Google, of course, doesn't like cloaking. But since Brad's intent isn't to deceive Google, chances are he's not going to get busted. But even more to the point, as he says, he wouldn't have to do such a thing if Google gave him some alternative.
More broadly, lots of people beyond bloggers in lots of situations wouldn't have to do such things if search engines gave us more options. It's not a Google thing. It's not a blogger thing. It's a search indexing thing.
I mentioned the ignore idea to Yahoo at our SES Chicago show and got some interest, so maybe there's hope. It poses problems, of course. An ignore tag could be abused. An ignore tag also means that some good content that's marked as "ignore" might not get indexed. But perhaps we might also have levels. How about a <content> tag authors can use to denote the key body content, a <nav> tag to highlight navigation search engines might not want to index or weight as heavily or a <public> tag to denote publicly-contributed content that might deserve less weighting?
There are lots of possibilities. What I know is that the last time the search engines came together to help provide coordinated assistance to web site owners on indexing was May 1996, when we got agreement on the meta description and meta robots tags, along with some additional talk on new support for the robots.txt convention.
Since then, we've had unilateral advances from AltaVista (new image indexing tags), Google (robots.txt expansion, no archiving tags) and others, but nothing coordinated to involve web site owners or the search industry as a whole. After nearly 10 years, surely the time is ripe for that type of cooperation now.
At the very least, it might help get some bloggers off Google's back who blame it for the problem. A sampling of blame and other looks at the problem and solutions:
So what do you think? Time for an indexing summit? Are there indexing changes you'd like to see? Comments of any type? Come discuss in our forum thread: Time For An Indexing Summit?
Postscript: Support has now been officially announced for an ignore-like nofollow attribute. See the Google, Yahoo, MSN Unite On Support For Nofollow Attribute For Links post for more.
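For the curious, here is a small sketch of what a link marked this way looks like and how a program could separate marked links from unmarked ones. The markup form, rel="nofollow" on a link, is the attribute itself; everything else (the class name, the sample comment, and the choice to simply set such links aside) is an assumption for illustration and not a description of how any engine actually weighs them.

```python
from html.parser import HTMLParser


class NofollowAwareCollector(HTMLParser):
    """Sort links into those marked rel="nofollow" and those that are not."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel can hold several space-separated values, so check each token.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


# Made-up blog post with one author link and one link left in a comment.
post = (
    '<p>See <a href="http://example.com/source">the source</a>.</p>'
    '<div class="comment">Nice post! '
    '<a href="http://example.com/pills" rel="nofollow">visit my site</a></div>'
)
collector = NofollowAwareCollector()
collector.feed(post)
print("counted:", collector.followed)
print("set aside:", collector.nofollowed)
```

Blog software would add the attribute automatically to links in comments, so the publisher does not have to mark each one by hand.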
Posted by Danny Sullivan at 5:58 AM | Permalink
The magic that makes Google tick from ZDNet has a look at technical details behind delivering Google searches. But, I've got a few quibbles:
OK, enough with the quibbles, which in fairness I could also raise with Google's competitors. See the rest of the article for some technical details on Google data centers, the fact there's not been a complete system failure since February 2000 and more.
Posted by Danny Sullivan at 12:56 PM | Permalink
Greg Linden from Findory alerts us to a presentation by Jeff Dean from Google. It took place earlier this week at the University of Washington in Seattle.
The presentation is titled: Google: A Behind-the-scenes Look and an archived version is now viewable (Real or MS) on the web. Greg's review of the presentation is also online.
Here are a few other presentations that might be of interest.
+ Seminar Presentation: Challenges in Running a Commercial Search Engine (3.5 MB; PDF) From the IR perspective, interesting! A presentation by Amit Singhal, Senior Research Scientist at Google. It was the keynote address at IBM's Second Search and Collaboration Seminar 2004 in Haifa.
+ View a Presentation by Google CEO Eric Schmidt Eric Schmidt delivered this presentation (it runs about one hour, Windows Media) at UC Berkeley during the EECS Annual Research Symposium in February.
+ And one from a non-Googler. Udi Manber, top guy at A9, spoke at the University of Washington last November. You can watch an archived version of his lecture here (RealVideo). It's titled, "The World's Information at Everyone's Fingertips."
Posted by Gary Price at 6:18 PM | Permalink | Comments (0)
OK, we've had Google Labs launched in 2002, a place where Google rolls out beta projects to the public. Overture then unveiled Overture Labs in 2003, later rebranded as Yahoo Research Labs at the beginning of this year. Now there's Yahoo Next, which has just taken over as the place to watch for Yahoo technology demos. The My Yahoo Search Beta is currently featured there. By the way, MSN's lab area is the MSN Sandbox.
Posted by Danny Sullivan at 7:51 AM | Permalink | Comments (0)
A presentation (PowerPoint slides) titled, Challenges in Running a Commercial Search Engine (3.5 MB; PDF) might be of interest to some of you.
The slides come from a keynote presentation by Amit Singhal, a Senior Research Scientist at Google.
The presentation was given in Israel on February 16th at IBM's Second Search and Collaboration Seminar 2004.
Posted by Gary Price at 8:51 AM | Permalink | Comments (0)