Stories, Semantics and the Web of Data

As a computer scientist I have spent hours talking to designers, architects and engineers to capture their domain knowledge and model it in a computer, with the end goal of helping them do their jobs better. It isn’t always straightforward to perform knowledge elicitation with people who have been doing complex tasks very well for a long time. Often, they can no longer articulate why or how they do things. They behave intuitively, or so it seems. So, I listen to them as they tell me their stories. Everyone has a story. Everyone! It is how we communicate. We tell stories to make sense of ourselves and the world around us.

As Brené Brown says in her extraordinary TED talk on vulnerability:

…Stories are just data with a soul…

Up until now, stories have been the most effective way of transferring information but once we involve a computer, we become very aware of how clever and complex we humans are. With semiotics, we study how humans construct meaning from stories; with semantics, we are looking at what the meaning actually is. That is to say, when we link words and phrases together, we are creating relationships between them. What do they stand for? What do they mean?

Semantics

Marshall McLuhan, the English professor who coined the phrase “the medium is the message”, described reading as rapid guessing. I see a lot of rapid guessing when my daughter reads aloud to me. Sometimes she says sentences which are semantically correct and representative of what happens in the story, but they are not necessarily the sentences which are written down. She is basically giving me the gist. And that is what our semantic memory does – it preserves the gist or the meaning of whatever it is we want to remember.

Understanding the gist, or constructing meaning, relies on the context of a given sentence, and on causality – one thing leads to another – something humans, even young ones like my daughter, can infer easily. But this is incredibly difficult for a computer, even a clever one steeped in artificial intelligence and linguistics. The classic example of ambiguity in a sentence is “Fruit flies like a banana”, which is quite funny until you extend the problem to a whole model such as our legal system, expressed as it is in natural language. Then it is easy to see how all types of misunderstandings are created, as our law courts, which debate loopholes and interpretations, demonstrate daily.

Added to the complexities of natural language, humans are reasoning in a constantly changing open world, in which new facts and rules are added all the time. The closed-world, limited-memory capacity of the computer can’t really keep up. One of the reasons I moved out of the field of artificial intelligence and into human-computer interaction was that I was interested in opening up the computer to human input. The human is the expert, not the computer. Ultimately, we don’t want our computers to behave like experts; we want them to behave like computers and calculate the things we cannot. We want to choose the outcome, and we want transparency to see how the computer arrived at that solution, so that we trust it to be correct. We want to be augmented by computers, not dictated to by them.

Modelling: Scripts and Frames

We can model context and causality, as Marvin Minsky’s frames first suggested. We frame everything in terms of what we have done and our experiences, as sociologist Lucy Suchman proposed with her plans and situated actions.

For example, when we go to the supermarket, we follow a script at the checkout with the checkout operator (or self-service machine):

a) the goods are scanned, b) the final price is calculated, c) we pay, d) our clubcard is scanned, and e) we might buy a carrier bag.

Unless we know the person on the cash desk, or we run into difficulties with the self-service checkout and need help in the form of human intervention, the script is unlikely to deviate from the a) to e) steps above.
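One way to picture such a script is as an ordered list of steps that observed events either follow or deviate from. The sketch below is only an illustration under my own assumptions: the step names, the choice of which steps count as optional, and the function follows_script are all hypothetical, not part of any established scripts-and-frames formalism.

```python
# Hypothetical sketch: the checkout "script" as an ordered list of steps.
CHECKOUT_SCRIPT = [
    "scan goods",             # a)
    "calculate final price",  # b)
    "pay",                    # c)
    "scan clubcard",          # d)
    "buy carrier bag",        # e)
]

# Assumption for illustration: these steps may be skipped without
# breaking the script.
OPTIONAL_STEPS = {"scan clubcard", "buy carrier bag"}


def follows_script(observed, script=CHECKOUT_SCRIPT, optional=OPTIONAL_STEPS):
    """Return True if the observed events match the script in order,
    allowing optional steps to be skipped."""
    position = 0
    for event in observed:
        # Move past optional steps that did not happen.
        while position < len(script) and script[position] != event:
            if script[position] not in optional:
                return False  # a required step was skipped or reordered
            position += 1
        if position == len(script):
            return False  # the event is not in the script at all
        position += 1
    # Whatever is left of the script must be optional.
    return all(step in optional for step in script[position:])
```

A visit that stops after paying still follows the script, whereas an event such as "ask for help" is flagged as a deviation, which is the point at which a human would normally step in.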

This modelling approach recognises the cognitive processes needed to construct semantic models (or ontologies) to communicate, explain, and make predictions in a given situation. It differs from a formal model, which uses mathematical proofs. In these human-centred situations, a formal proof model can be inappropriate.

Either approach was always carried out inside one computer until Tim Berners-Lee found a way of linking many computers together with the World Wide Web (WWW). Berners-Lee realised that having access to potentially endless amounts of information in a collaborative medium, a place where we all meet and read and write, was much more empowering than working alone, each of us with a separate model.

And once we are online, social models become interesting: informal community tagging, for example, improves Flickr and del.icio.us. Popular tags get used and unpopular ones don’t, rather like evolution. In contrast, formal models use proofs to make predictions, so we lose human input and the interesting social dynamic.

Confabulation and conspiracy

But it is data we are interested in. Without enough data points in the data set to which we apply a model, we make links and jumps from point to point until we create a different story, which might or might not be accurate. This is how a conspiracy theory gets started. And if we don’t have enough data at all, we speculate and may end up telling a lie as if it were the truth, which is known as confabulation. Ultimately, having lots of data and the correct links gives us knowledge and power, and the WWW gives us that.

Freeing the data

Throughout history we have often confused the medium with the message. We have taken our most precious stories and built institutions to protect the containers – the scrolls and books – which hold the stories, whilst limiting who can access them, in order to preserve them for posterity.

Now, we have freed the data and it is potentially available to everyone. The WWW has changed publishing, journalism and the music industry forever. We have never lived in a more exciting time.

At first we weren’t too bothered about how we were sharing data, pictures and PDFs, because humans could understand them. But since computers are much better at dealing with large data sets, it makes sense for them to interpret data and help us find everything we need. And so the idea of the semantic web was born.

Semantic Web

Berners-Lee suggested the term semantic web in 1999 to describe a web in which computers can interpret data and its relationships, and even create relationships between data on the WWW, in a way that only humans can do currently.

For example, if we are doing a search about a person, humans can easily make links between the data they find: where the person lives, with whom, their job, their past work experience, their ex-colleagues. A computer might have difficulty making the connections. However, if we add data descriptions and declare relationships between the data to allow reasoning and inference, the computer might be able to pull together all that data in a useful, coherent manner for a human to read.
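To make the idea of declared relationships and inference a little more concrete, here is a minimal sketch in Python. It assumes facts are stored as (subject, predicate, object) triples; the people, the predicate names such as worksFor and colleagueOf, and the two helper functions are all invented for illustration and do not belong to any real vocabulary or library.

```python
# Facts as (subject, predicate, object) triples. All names are made up.
triples = {
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("alice", "livesIn", "london"),
    ("carol", "workedFor", "acme"),
}


def infer_colleagues(facts):
    """Derive colleagueOf triples: two different people who currently
    work for the same employer are colleagues of each other."""
    employees = {}
    for subject, predicate, obj in facts:
        if predicate == "worksFor":
            employees.setdefault(obj, set()).add(subject)
    inferred = set()
    for staff in employees.values():
        for a in staff:
            for b in staff:
                if a != b:
                    inferred.add((a, "colleagueOf", b))
    return inferred


def describe(person, facts):
    """Collect everything stated or inferred about one person,
    sorted so the result is stable for a human to read."""
    return sorted((p, o) for s, p, o in facts if s == person)


# Pull the stated and the inferred facts together into one profile.
profile = describe("alice", triples | infer_colleagues(triples))
```

Nobody stated that alice and bob are colleagues; the relationship appears only because the worksFor links were declared in a form the rule could use, which is the kind of connection a computer cannot make from unstructured text alone.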

Originally the semantic web idea included software agents, like virtual personal assistants, which would help us with our searches and link together to share data with other agents in order to perform functions for us, such as organising our day, getting more milk in the fridge, and paying our taxes. But due to the limitations of intelligent agents, this turned out to be much harder than expected. So the emphasis shifted from computers doing the work to the semantic web becoming a dynamic system through which data flows with human intervention, especially where the originator of the data can say, by adding machine-friendly markup: “Here, machine, interpret this data this way.”

Cooperation without coordination

It seems strange to contemplate now, but originally no one believed that people would voluntarily spend time putting data online, in the style of distributed authorship. Yet we have Wikipedia, DBpedia and GeoNames, to name but a few places where data is trustworthy. And we have the W3C, which recommends the best ways to share data online.

The BBC uses websites like the ones above and curates the information there to ensure the integrity of the data. That is to say, the BBC works with these sites, to fact check the data, rather than trying to collect the data by itself. So, it cooperates with other sites but does not coordinate the output. It just goes along and gets what it needs, and so the BBC now has a content management system which is potentially the whole of the WWW. This approach of cooperation without coordination is part of what has become known as linked data, and the WWW is becoming the Web of Data.

Linked Data and the Web of Data

Linked data is a set of techniques for the publication of data on the web using standard formats and interfaces so that we can gather any data we need in a single step on the fly and combine it to form new knowledge. This can be done online or behind enterprise firewalls on private networks, or both.

We can then link our data to other data that is relevant and related, whilst declaring meaningful relationships between otherwise arbitrary data elements (which, as we have seen, a computer couldn’t figure out by itself).
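As a rough sketch of what gathering and combining can look like, the Python below merges two made-up sources that use the same identifier for the same thing, then follows a declared relationship across them. Everything here is an assumption for illustration: the example.org identifiers, the town and country, the population figure and the helper names merge and follow are placeholders, not real data or a real linked data API.

```python
# A shared identifier lets two independent sources talk about the same thing.
TOWN = "http://example.org/id/exampleton"
COUNTRY = "http://example.org/id/exampleland"

# Hypothetical source A: names and a declared "country" relationship.
source_a = [
    (TOWN, "name", "Exampleton"),
    (TOWN, "country", COUNTRY),
]

# Hypothetical source B: different facts, same identifiers.
source_b = [
    (TOWN, "population", 12345),  # placeholder figure, not real data
    (COUNTRY, "name", "Exampleland"),
]


def merge(*sources):
    """Combine triples from several sources into one graph:
    subject -> predicate -> set of objects."""
    graph = {}
    for source in sources:
        for subject, predicate, obj in source:
            graph.setdefault(subject, {}).setdefault(predicate, set()).add(obj)
    return graph


def follow(graph, subject, *path):
    """Follow a chain of predicates from a subject and return
    the set of values reached at the end of the chain."""
    current = {subject}
    for predicate in path:
        reached = set()
        for node in current:
            reached |= graph.get(node, {}).get(predicate, set())
        current = reached
    return current


graph = merge(source_a, source_b)
country_names = follow(graph, TOWN, "country", "name")
```

Neither source alone can answer "what is the name of the country this town is in?", but the merged graph can, because both sources agreed on identifiers without otherwise coordinating.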

Google rich snippets and Facebook likes use the same approach of declaring relationships between data in order to share more effectively.

Trust: Data in the wild, dirty data, data mashups

It all sounds brilliant. However, it can be very hard to get your data mashup right when the data comes from different sources which all have different formats. This conundrum is known as data in the wild. For example, there is lots of raw data on www.gov.uk which is not yet in the recommended format.

Then, there is the problem of dirty data. How can we trust the data we are getting if anyone can put it online? We can go to the sites we trust, but what if they are not collecting the data we need? What if we don’t trust the data? What if we use the data anyway? What will happen? These are things we will find out.

How can we ensure that we are all using the same vocabularies? What if we are not? Again, we will find a way.

Modelling practice: extendable, reusable, discoverable

The main thing to do when putting up your data and developing models is to name things as meaningfully as you can. And whilst thinking about reuse, design for yourself: do not include everything and the kitchen sink. As with all good design, if your model is well designed for you, then even if you leave specific instructions, someone will find a new way to extend and use it. This is the no-function-in-structure principle. Someone will always discover something new in anything you design.

So what’s next?

Up until now, search engines have worked on matching words and phrases, not on what terms actually mean. But with our ability to link data together, Google is already using its knowledge graph to help build the next generation of search engine. Facebook is building on its open graph protocol whilst harvesting and analysing its data to help advertisers find their target audience.

Potentially we have the whole world at our fingertips: we have freed the data, and we are sharing our stories. It may be written in Ecclesiastes that there is nothing new under the sun, but it is also written in the same place: Everything is meaningless. I think it is wrong on both counts. With this amount of data mashup and collaboration, I like to believe instead: Everything is new under the sun and nothing is meaningless. We live in the most interesting of times.