Tim Berners-Lee first spoke of a Semantic Web in his address at the first World Wide Web Conference in 1994. Given the technical level of the audience, his presentation was, for the most part, met with excited nods.
The Web Berners-Lee described was a far cry from the library-style repository of the Web at that time, but the concept wasn’t so far-fetched, at least to the listeners with a more visionary nature.
“Semantic”, however, is a qualifier that means a great deal in this context. It demands that a machine, or more accurately, the software that drives that machine, understand the information in the way it was intended. Let’s face it: most of us know a handful of human beings who are challenged in that regard.
Indeed, for a machine to comprehend the meaning behind what a human has put to text requires a certain amount of artificial intelligence. Humor, irony, and emotion certainly seemed to be beyond the conceivable limits of a computer program in 1994. Even in 2012, there are still some who doubt that such comprehension will be possible in the near future.
The Other Approach
Looking at the issue strictly from the standpoint of achieving a semantic search capability, it seemed that rather than trying to teach a computer how to think like a human, it would probably be much easier to teach humans how to present data in a format that a machine could understand. Take Muhammad to the mountain…
Enter: semantic mark-up, such as RDFa, microformats and microdata, along with shared vocabularies such as schema.org… structured data.
Many might be surprised to know how far some search engine algorithms have progressed in being able to parse text and discern what a document deals with, or even a specific sentence within that document. Certainly, there is still a ways to go, but more progress has been made than is obvious to the casual observer.
Syntactic and Semantic Graphs
Would you be surprised to realize that your elementary school teacher was introducing you to syntactic graphs when she had you do sentence diagrams? You remember these, right?
You would diagram the simple sentence “The two of them were lost in the cave” as:
Rudimentary, perhaps, but still pertinent. You were actually learning to parse the words in a sentence, determine their relationship to each other and then graph those relationships. (If you had been learning English as a second language, this exercise would have been a great help in understanding what the sentence was trying to say.)
At the point where you grasp the meaning of the sentence, you have left the purely syntactic mode and entered the realm of semantic relationships. In the context of search, the meaning of that sentence can then be aggregated with that of the surrounding sentences, to give a deeper understanding of the document as a whole.
In your school classroom, you had a dictionary and lists of common verbs, adverbs, adjectives, prepositions, etc. These tools could be loosely compared to the entity classes of semantic mark-up. And interestingly, the subject, predicate and object of an RDF triple (the kind of statement that RDFa embeds in a page) can be compared to the subject, verb and object of a simple sentence.
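The comparison between a simple sentence and a triple can be sketched in a few lines of Python. This is only an illustration, not how any search engine actually stores its data: the `Triple` type, the `about` helper, and all of the entity and relationship names are invented for the example.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One statement, shaped like a simple sentence."""
    subject: str    # who or what the statement is about
    predicate: str  # the relationship, playing the role of the verb
    object: str     # what the subject is related to


# Hypothetical statements; every name here is made up for illustration.
statements = [
    Triple("Bob", "is_uncle_of", "You"),
    Triple("Alice", "is_wife_of", "Bob"),
    Triple("Bob", "lives_in", "Anytown"),
]


def about(subject: str, triples: List[Triple]) -> List[Triple]:
    """Return every statement whose subject matches."""
    return [t for t in triples if t.subject == subject]


for t in about("Bob", statements):
    print(t.subject, t.predicate, t.object)
```

Each statement reads much like a diagrammed sentence, and because they all share the same three-part shape, a program can collect everything said about one entity without understanding English at all.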
Just as your elementary English teacher taught you, there are rules that need to be applied to semantic parsers, as well. For instance, using an example that Tim Berners-Lee presented in 2001 in The Semantic Web, a genealogy system might include a rule such as “a wife of an uncle is an aunt”. Such a rule would often apply only within that knowledge-representation system, even if the database could be used by a number of other systems.
Similarly, many rules (a great many!) must be included, behind the scenes, in semantic search algorithms. These help the algorithm recognize Uncle Bob’s wife as your aunt, even though she’s not specifically identified as such.
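The “wife of an uncle is an aunt” rule can be sketched as a tiny inference step over a set of facts. This is a minimal sketch under stated assumptions, not a description of any real search algorithm: the facts, the relationship names (`uncle_of`, `wife_of`, `aunt_of`) and the `apply_aunt_rule` function are all hypothetical.

```python
from typing import Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relationship, object)

# Hypothetical family facts; nobody has stated that Alice is an aunt.
facts: Set[Fact] = {
    ("Bob", "uncle_of", "You"),
    ("Alice", "wife_of", "Bob"),
}


def apply_aunt_rule(known: Set[Fact]) -> Set[Fact]:
    """Apply the rule 'a wife of an uncle is an aunt' once."""
    inferred: Set[Fact] = set()
    for wife, rel1, husband in known:
        if rel1 != "wife_of":
            continue
        for uncle, rel2, relative in known:
            # The husband must be the same person as the uncle.
            if rel2 == "uncle_of" and uncle == husband:
                inferred.add((wife, "aunt_of", relative))
    return known | inferred


all_facts = apply_aunt_rule(facts)
print(("Alice", "aunt_of", "You") in all_facts)  # True
```

The new fact appears only because the rule was applied. A real system would need a great many such rules, and would have to decide which ones are valid for which data.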
The Future of Semantic Search
What does this mean for search in the future? As the knowledge-representation graph continues to grow, linking more and more entities and establishing their relationships to each other, more connections will be available. And since the algorithms can learn, those connections will be more easily mapped.
Search engines will be able to dig deeper and deeper into the graph to find information about an entity. For instance, today, if you search on Google for “what does Bill Slawski say about patents?” (without the quotes), you’ll predictably get thousands of results. But those results will simply be documents that contain terms such as “Bill Slawski”, “patents” and “say”.
Eventually, a semantic search engine will be able to get past that limitation, understand the meaning of your question, understand the meaning of those thousands of documents, and then return only results that contain statements Bill has made about patents.
Ask a question like “Does Bill Cosby come from a large family?” and the search engine of 2025 should be able to analyze all the portions of the graph that contain connections that pertain to Bill Cosby, and quantify the number of people who qualify as members of his family.
Can they do this now? To an extent, yes. Certainly, I believe, to a greater extent than most people realize. But since the vast majority of the data on the web is unstructured, they’re somewhat limited. As structured data becomes more prevalent, and as the algorithms learn more, it will become more obvious to what extent they can make those connections.
Imagine that you want to be able to ask how Bill Cosby’s wife’s next-door neighbor voted in the 1948 presidential election. Today, no search engine is likely to be able to help you much.
But in 2025? Let’s examine that possibility.
If enough structured data has been made available, the engine could detect that Bill Cosby’s wife is Camille (Hanks), who was 3 years old in 1948, and that her parents, Guy and Catherine Hanks, lived at 317 C St. in Anytown, Mass., a house which, at the time, had a hardware store on one side and a private residence on the other.
It could determine who the resident of that house was in 1948, and see that he worked from 1943 to 1950 as a volunteer for the Democratic Party. It could then return a result with that information, perhaps even extrapolating a 98.5 percent probability that the man had voted for Harry Truman.
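The chain of hops in that scenario can be sketched as a walk across a small graph. Everything below is illustrative: the graph contents simply restate the hypothetical scenario above, and the edge labels, the placeholder `neighbor` node, and the `follow` function are assumptions made for the example. The sketch covers only the lookup, not the probability estimate.

```python
from typing import Dict, List, Optional, Tuple

# Toy graph restating the hypothetical scenario; none of this is real data.
# Keys are (entity, edge label); values are the entity the edge points to.
graph: Dict[Tuple[str, str], str] = {
    ("Bill Cosby", "spouse"): "Camille Hanks",
    ("Camille Hanks", "home_in_1948"): "317 C St., Anytown, Mass.",
    ("317 C St., Anytown, Mass.", "residential_neighbor_in_1948"): "neighbor",
    ("neighbor", "volunteered_for_1943_1950"): "Democratic Party",
}


def follow(start: str, path: List[str]) -> Optional[str]:
    """Follow a chain of edge labels from start; None if any link is missing."""
    node = start
    for label in path:
        nxt = graph.get((node, label))
        if nxt is None:
            return None  # the data needed for this hop is not in the graph
        node = nxt
    return node


question = [
    "spouse",
    "home_in_1948",
    "residential_neighbor_in_1948",
    "volunteered_for_1943_1950",
]
print(follow("Bill Cosby", question))                # Democratic Party
print(follow("Bill Cosby", ["spouse", "employer"]))  # None
```

The second call shows the weak point: a single missing link is enough to leave the question unanswered, however much other data the graph holds.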
Do you really think that current technology isn’t capable of such a thing? Of course it’s capable! Do you think the data doesn’t exist? Sorry – it exists – and what’s more, much of it has already been uploaded.
The problem is that so much data is still unstructured that the instances in which such a query could be accurately answered are too few and far between to make it viable. So where does that leave us?
It leaves us holding the bag. We need to provide more structured data, not just going forward but retroactively, as well. Fortunately, the search algorithms can fill in a lot of blanks, so uploaded data in an accessible digital format will take care of the majority of that gap. The question, of course, is how long that will take.
For my part, I’m betting that semantic search will be substantially enabled by 2025. Sooner would be better.