Text Analysis: Linked Data

Basic Linked Data Concepts

The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web. These best practices were introduced by Tim Berners-Lee in his Web architecture note Linked Data and have become known as the Linked Data principles. These principles are:

Use URIs as names for things
Use HTTP URIs, so that people can look up those names
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
Include links to other URIs, so that they can discover more things
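
As a rough illustration of the second and third principles, the sketch below prepares an HTTP lookup for a URI and asks for an RDF description instead of an HTML page. It is only a sketch under stated assumptions: the URI is the example used later in this text and is not assumed to resolve, the helper name build_lookup_request is made up for this illustration, and the request is built but never sent.

```python
from urllib.request import Request

# Example URI borrowed from this text; it is not assumed to resolve.
RESOURCE_URI = "http://macarter.org/person/macarter"


def build_lookup_request(uri: str) -> Request:
    """Prepare (but do not send) an HTTP GET that asks for RDF/XML.

    The Accept header tells the server which content format the client
    prefers; application/rdf+xml is the media type of RDF/XML.
    """
    return Request(uri, headers={"Accept": "application/rdf+xml"}, method="GET")


request = build_lookup_request(RESOURCE_URI)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would perform the actual lookup, provided a server answers at that address.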

The basic idea of Linked Data is to apply the general architecture of the World Wide Web to the task of sharing structured data on a global scale. In order to understand these Linked Data principles, it is important to understand the architecture of the classic document Web. The document Web is built on a small set of simple standards: Uniform Resource Identifiers (URIs) as a globally unique identification mechanism, the Hypertext Transfer Protocol (HTTP) as a universal access mechanism, and the Hypertext Markup Language (HTML) as a widely used content format. In addition, the Web is built on the idea of setting hyperlinks between Web documents that may reside on different Web servers. The development and use of standards enables the Web to transcend different technical architectures. Hyperlinks enable users to navigate between different servers. They also enable search engines to crawl the Web and to provide sophisticated search capabilities on top of crawled content. Hyperlinks are therefore crucial in connecting content from different servers into a single global information space. Linked Data builds directly on this Web architecture and applies it to the task of sharing data on a global scale.

To publish data on the Web, the items in a domain of interest must first be identified. These are the things whose properties and relationships will be described in the data, and may include Web documents as well as real-world entities and abstract concepts. As Linked Data builds directly on Web architecture, the Web architecture term resource is used to refer to these things of interest, which are, in turn, identified by HTTP URIs.

Resource Description Framework (RDF)

In order to enable a wide range of different applications to process Web content, it is important to agree on standardized content formats. When publishing Linked Data on the Web, data is represented using the Resource Description Framework (RDF). RDF provides a data model that is extremely simple on the one hand but strictly tailored towards Web architecture on the other hand. To be published on the Web, RDF data can be serialized in different formats. The two RDF serialization formats most commonly used to publish Linked Data on the Web are RDF/XML and RDFa.

The RDF data model represents information as node-and-arc-labeled directed graphs. The data model is designed for the integrated representation of information that originates from multiple sources, is heterogeneously structured, and is represented using different schemata. RDF aims at being employed as a lingua franca, capable of moderating between other data models that are used on the Web. The RDF data model is described in detail as part of the W3C RDF Primer. In RDF, a description of a resource is represented as a number of triples. The three parts of each triple are called its subject, predicate, and object. A triple mirrors the basic structure of a simple sentence, such as:

Mark Carter	has a	website
Subject	Predicate	Object

Expressed with URIs, a triple that states my name could look like:

http://macarter.org/person/macarter http://xmlns.com/foaf/0.1/name "Mark Carter"
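
The example triple above can be sketched in code as a small record with three parts. This is a minimal illustration, not a full RDF implementation: the Triple class and the to_ntriples helper are names made up here, and the output uses N-Triples, a line-based RDF serialization that this text does not otherwise cover.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement. Illustrative type, not part of any RDF library."""
    subject: str          # URI of the described resource
    predicate: str        # URI naming the relation
    obj: str              # URI of another resource, or a literal value
    obj_is_literal: bool = False


def to_ntriples(triple: Triple) -> str:
    """Render a triple as one N-Triples line (plain string literals only)."""
    if triple.obj_is_literal:
        # Escape backslashes and double quotes inside the literal.
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


name_triple = Triple(
    subject="http://macarter.org/person/macarter",
    predicate="http://xmlns.com/foaf/0.1/name",
    obj="Mark Carter",
    obj_is_literal=True,
)
print(to_ntriples(name_triple))
```

The flag obj_is_literal is a simplification chosen for this sketch; it only distinguishes plain strings from URIs and ignores datatypes and language tags.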

The subject of a triple is the URI identifying the described resource. The object can either be a simple literal value, like a string, number, or date; or the URI of another resource that is somehow related to the subject. The predicate, in the middle, indicates what kind of relation exists between subject and object. The predicate is also identified by a URI. These predicate URIs come from vocabularies, collections of URIs that can be used to represent information about a certain domain. One way to think of a set of RDF triples is as an RDF graph. The URIs occurring as subject and object are the nodes in the graph, and each triple is a directed arc that connects the subject and the object. As Linked Data URIs are globally unique and can be dereferenced into sets of RDF triples, it is possible to imagine all Linked Data as one giant global graph. Linked Data applications operate on top of this giant global graph and retrieve parts of it by dereferencing URIs as required.
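
The graph view described above can be sketched with plain Python data structures. This is an illustration under assumptions: the second triple (a foaf:homepage link to http://macarter.org/) is hypothetical and added only so that the graph has more than one arc, and the helper names build_graph and collect_nodes are made up for this example. Literal values are treated as nodes here alongside URIs.

```python
from collections import defaultdict

FOAF = "http://xmlns.com/foaf/0.1/"
ME = "http://macarter.org/person/macarter"

# Each triple is (subject, predicate, object).
triples = [
    (ME, FOAF + "name", "Mark Carter"),
    (ME, FOAF + "homepage", "http://macarter.org/"),  # hypothetical triple
]


def build_graph(triples):
    """Map each subject node to its outgoing arcs as (predicate, object) pairs."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


def collect_nodes(triples):
    """Return every value that occurs as a subject or an object."""
    found = set()
    for subject, _predicate, obj in triples:
        found.add(subject)
        found.add(obj)
    return found


graph = build_graph(triples)
nodes = collect_nodes(triples)

# Print each arc as: subject --predicate--> object
for subject, arcs in graph.items():
    for predicate, obj in arcs:
        print(f"{subject} --{predicate}--> {obj}")
```

Merging triples obtained from different sources into the same list extends the graph without any schema change, which is the property the global graph idea relies on.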