What does it take to power a social search index with over a trillion total posts and hundreds of terabytes of data? Facebook search quality and ranking engineer Ashoat Tevosyan shared a peek under the Graph Search hood in a post on the Facebook Engineering page yesterday.
Tevosyan highlighted the challenges facing Facebook as they gave users the ability to search posts in Graph Search, a feature added last week:
“Facebook’s underlying data schema reflects the needs of a rapidly iterated web service. New features often demand changes to data schemas, and our culture aims to make these changes easy for engineers. However, these variations make it difficult to sort posts by time, location, and tags, as wall posts, photos, and check-ins all store this information differently.”
Facebook sorts and indexes over 70 different types of data kept in a production SQL database. Their search engine, Unicorn, is an inverted index framework with capabilities including index building and data retrieval; raw data is converted and separated into two parts to work within it. Document data contains the post data Facebook uses to rank results, while the inverted index is more typical of a traditional search index, in that it goes through each post to determine which hypothetical search filters match.
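To make the two-part split concrete, here is a minimal sketch in Python of how raw posts might be separated into ranking-oriented document data and a term-to-posts inverted index. This is a toy illustration of the general technique, not Unicorn's actual data structures; the field names (`author`, `time`, `text`) are assumptions for the example.

```python
def build_index(posts):
    """Split raw posts into two parts: document data (kept for ranking)
    and an inverted index mapping each term to a sorted list of post IDs.

    `posts` is a dict of post_id -> {"author": ..., "time": ..., "text": ...}.
    """
    document_data = {}   # post_id -> metadata used at ranking time
    inverted_index = {}  # term -> sorted list of matching post IDs

    for post_id, post in posts.items():
        # Ranking side: keep only the fields a scorer would consult.
        document_data[post_id] = {"author": post["author"], "time": post["time"]}

        # Retrieval side: record which posts each term appears in.
        for term in set(post["text"].lower().split()):
            inverted_index.setdefault(term, []).append(post_id)

    # Sorted posting lists allow efficient intersection of multiple terms.
    for postings in inverted_index.values():
        postings.sort()

    return document_data, inverted_index
```

A query then becomes a lookup in the inverted index to retrieve candidate post IDs, after which the document data is consulted to score and rank them.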
The Graph Search posts index is much larger than any other at Facebook, Tevosyan wrote. They had to move from RAM (which worked well for the smaller indices) to solid-state flash memory to accommodate the more than 700 terabytes of data in the posts index.
For a bit of perspective, consider that Amazon, with over 59 million active customers, has about 42 terabytes of data to deal with. YouTube, where over 100 million videos are watched daily, holds at least 45 terabytes of video in their database. Google is mum on the size of their database, though we know they answer over a billion queries daily. Each query is stored, and over the course of just a year Google packs away data for over 365 billion queries. Even back in 2008, they were processing 20,000 terabytes of data daily.
The ability to search posts was born out of a company Hackathon, he explained.
“My second day as a Facebook intern coincided with a company-wide Hackathon, and I spent the night aiming to implement a way for my friends and me to find old posts we had written. I quickly discovered that the project was much more challenging than I had first anticipated. However, the engineering culture at Facebook meant that I was supported and encouraged to continue working on it, despite the scope of the project. The majority of the work – infrastructure, ranking, and product – has been accomplished in the past year by a few dozen engineers on the Graph Search team,” Tevosyan wrote.
Tevosyan also noted that Facebook uses over 100 distinct ranking features in their post result scoring to serve up the most relevant Graph Search content to users. Before a query reaches that ranking system, though, it is rewritten, which “involves tacking on optional clauses to search queries that bias the posts we retrieve towards results that we think will be more valuable to the user.”
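The idea of "tacking on optional clauses" can be sketched as follows: the user's terms stay mandatory, while extra clauses boost (but never exclude) results thought to be more valuable, such as posts by the searcher's friends. The clause structure below is hypothetical, for illustration only, and is not Unicorn's actual query syntax.

```python
def rewrite_query(query_terms, friend_ids):
    """Rewrite a search query by adding optional bias clauses.

    Required clauses ("must") come from the user's own terms;
    optional clauses boost candidate posts without filtering any out.
    Here the single optional clause biases toward friends' posts,
    a plausible signal of value to the searcher.
    """
    return {
        "must": [{"term": t} for t in query_terms],
        "optional": [{"author_in": sorted(friend_ids), "boost": 2.0}],
    }
```

A post matching only the "must" clauses is still retrieved; one that also matches an optional clause simply scores higher, which is the biasing effect Tevosyan describes.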