Yes, I said sole proprietor. Gigablast is entirely Matt’s work. From coding to crawling to marketing, it’s all Matt Wells.
Recent weeks have brought about a great deal of development at Gigablast. In August, the engine began offering access to Adobe PDF and Microsoft Word, PowerPoint and Excel content. Recently, Gigablast began offering Boolean search capabilities.
Q. Matt, can you share with us a few career highlights and a brief bio? Also, so we can get up to speed with Gigablast, can you share some basic statistics about the database?
Around 1988, when in high school, I designed and wrote a graphic adventure game called “Revenge of the Germs” for Radio Shack’s Color Computer 2. I sold about 20 copies through Rainbow magazine. In college I developed the Artists’ Den web site, a searchable, open database that allowed artists to freely add pictures and descriptions of their artwork. This project led to my employment at Infoseek where I developed core search technologies for their world-famous search engine. After leaving Infoseek in August of 2000 I began working on Gigablast as an independent project.
Gigablast currently serves about half a million queries per day, mostly to external sites. The index size is almost 200 million pages and changing daily.
Q. When did you start Gigablast? Did you write the software code yourself?
I wrote it all from scratch in C++. It has been almost 3 years since its inception. It uses zlib for doing compression and it uses a plotting library to make administrative graphs, but other than that, I custom coded everything.
Q. Did you see a hole in the web search market that Gigablast could fill? In other words, what does Gigablast do that the other engines don’t? Why should a searcher use it?
The hole that I saw was a performance related hole, not a search results quality hole. I found a way to scale search more efficiently than everybody else. Theoretically, Gigablast can get by with ten times – even a hundred times – less hardware than other engines and achieve the same performance.
But as far as what is different from the user’s perspective, I think Gigablast’s scoring algorithms give less emphasis to links than other engines do. I did this on purpose so new sites are not at as much of a disadvantage as more established sites that have a large number of incoming links. This is why I chose the catch-phrase “Search the web from a different angle” to be displayed on the front page.
Another major difference is that Gigablast is the only large engine, to my knowledge, to ever do continuous updating and refreshing of the index in real-time. You can also add and update your URLs in real-time, too. I think Gigablast is the only large engine right now that allows that. Everybody else charges money and calls it “paid inclusion.”
Q. Can you tell us about the Gigablast crawler, Gigabot? How often does it visit a site for new content? I see you have a page submission feature, how do you handle spam? Has this been a problem? How frequently is the entire Gigablast database updated?
Gigabot uses a bisection method to determine the best times to spider a URL. If Gigabot finds that the content is unchanged every time it visits a page, then it will wait longer before its next visit. Likewise, if it finds that the content has changed every time it visits the page, then it will visit that page more frequently.
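As a rough illustration of the revisit schedule described above, here is a minimal Python sketch. It is not Gigabot's code. The interview gives no details of the bisection method, so the halving and doubling rule, the one-hour and thirty-day bounds, and the names `CrawlSchedule` and `record_visit` are all assumptions made for this example.

```python
from dataclasses import dataclass

# Assumed bounds on the revisit interval; the interview does not give any.
MIN_INTERVAL_HOURS = 1.0
MAX_INTERVAL_HOURS = 24.0 * 30


@dataclass
class CrawlSchedule:
    """Hypothetical per-URL revisit schedule."""

    interval_hours: float = 24.0  # assumed starting interval

    def record_visit(self, content_changed: bool) -> float:
        """Update the interval after a visit and return the new value."""
        if content_changed:
            # The page changed since the last visit: come back sooner.
            self.interval_hours /= 2.0
        else:
            # The page was unchanged: wait longer before the next visit.
            self.interval_hours *= 2.0
        # Keep the interval inside the assumed bounds.
        self.interval_hours = max(
            MIN_INTERVAL_HOURS, min(MAX_INTERVAL_HOURS, self.interval_hours)
        )
        return self.interval_hours


if __name__ == "__main__":
    schedule = CrawlSchedule()
    for changed in (False, False, True):
        print(changed, schedule.record_visit(changed))
```

Under these assumptions, a page that keeps changing drifts toward the minimum interval and a static page drifts toward the maximum, which matches the behavior described in the answer.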
Unfortunately, due to bandwidth constraints, I cannot update the entire index in a month, but I hope to fix this before year’s end.
Yes, there’s a lot of spam. I have custom algorithms for dealing with various types of spam, including link farms, but I, just like the other engines, still partially rely on manual intervention. If I see someone submitting a lot of pages in a short amount of time I investigate their content. If it’s spam I ban their site. I also rely on user feedback to help me identify spam in the index.
Q. Running a search engine in the period where many people think Google is IT must be difficult. How are you handling competing with Google and other large engines? On your site you mention that a portion of traffic comes from feeds from other sites – is this a big portion of your traffic?
I’ve been steadily improving Gigablast’s relevance. Yes, it is sometimes a bummer always working in the shadow of Google, so it really makes my day when somebody tells me Gigablast gives better results. I think for a lot of queries it does, not for all of course, but those instances are almost always because of my hardware budget.
I have $8,000 of hardware and Google has maybe around $50 million. Go figure. Since I don’t have the money or resources to compete with Google I more or less rely on the dropped crumbs. I rely on the difference in my scoring algorithms, my real-time indexing features and my dirt-bottom pricing structures to differentiate Gigablast and its products from Google. Yes, the large majority of the queries I serve come from the search feeds I supply to clients.
Q. A few months back you started something called Gigaboost that gives pages linking to Gigablast a higher relevancy ranking. Could you explain your rationale about starting this program?
Yes, I’m not charging money like everybody else, and I wanted to receive something in return. After all, a lot of people find this service useful. I look at webmasters as my partners. Granted, there are a few that are evil spammers, but, by and large, the webmaster community has decent values and plays by the rules. I don’t let the few ruin it for the majority.
Q. Have you run into any problems with caching web pages? I would imagine you observe the Robots Exclusion Protocol.
I haven’t run into any problems yet. I’ve gotten some emails from someone claiming to work for the KGB telling me I need to remove a particular page from the index because it might endanger somebody. In those seemingly urgent cases I try to remove the page right away, but I’m also aware that it could be a competitor of the page being removed. The large majority of the time, however, everybody is being honest. Yes, Gigablast does follow robots.txt.
Q. From a business perspective, Gigablast carries no advertising? Is this a decision you plan to keep? How does Gigablast make money?
Money is derived from selling search services on my products page. At this point I don’t think I’ll put up advertisements unless I need the revenue to support Gigablast or myself.
Q. Several months ago you began a Scandinavian site that does carry advertising from e-Spotting. What is your relationship with BOP Interactive, the Scandinavian company that you work with?
We have a contract with each other. I can’t get into too much detail about it.
Q. As someone with operational experience and programming skills, would you care to comment about what’s wrong with web search today?
There’s an incredible amount of room for web search improvement today. Search is just beginning. In a few more years I think search has the potential to displace operating systems as the most complicated program space in the market. There’s still a good amount of innovation in the operating system space, but it is pretty well tilled soil. I think the search sector is just beginning and has much more room to grow than the operating system sector. An operating system will allow you to write a report; the search engine of the future will write the report for you.
Q. What are your long-term goals for Gigablast? Do you see it more as a testbed for new ideas that you’ll sell to others or market as enterprise search technology?
Once I finish my spell checker and some other things I really want to continue my work on some new and experimental algorithms that bring a fresh perspective to search. That is something that really interests me. My to-do list is literally a quarter of a megabyte.
To keep current with the latest changes and updates at Gigablast, read Matt’s tech blog, where he posts the latest goings-on with the engine.