Friday

BBC World Cup 2010 dynamic semantic publishing

The World Cup 2010 website is a significant step change in the way that content is published. From first using the site, the most striking changes are the horizontal navigation and the larger-format, high-quality video. As you navigate through the site it becomes apparent that this is a far deeper and richer use of content than can be achieved through traditional CMS-driven publishing solutions.

The site features 700-plus team, group and player pages, which are powered by a high-performance dynamic semantic publishing framework. This framework facilitates the publication of automated metadata-driven web pages that are light-touch, requiring minimal journalistic management, as they automatically aggregate and render links to relevant stories.

eng_595.jpg

Dynamic aggregation examples include:

The underlying publishing framework does not author content directly; rather it publishes data about the content: metadata. The published metadata describes the World Cup content at a fairly low level of granularity, providing rich content relationships and semantic navigation. By querying this published metadata we are able to create dynamic page aggregations for teams, groups and players.

The foundation of these dynamic aggregations is a rich ontological domain model. The ontology describes entity existence, groups and relationships between the things/concepts that describe the World Cup. For example, "Frank Lampard" is part of the "England Squad" and the "England Squad" competes in "Group C" of the "FIFA World Cup 2010".
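As a minimal sketch, the relationships in that example can be written as subject-predicate-object statements and looked up directly. The predicate names (`memberOf`, `competesIn`, `partOf`) and the helper `objects_of` are assumptions made for illustration; the actual terms used by the World Cup ontology are not given in this post.

```python
# Hypothetical predicate names; the real ontology terms are not given in this post.
DOMAIN_MODEL = {
    ("Frank Lampard", "memberOf", "England Squad"),
    ("England Squad", "competesIn", "Group C"),
    ("Group C", "partOf", "FIFA World Cup 2010"),
}


def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}


if __name__ == "__main__":
    print(objects_of(DOMAIN_MODEL, "Frank Lampard", "memberOf"))
```

Holding the model as plain statements, rather than as columns in a fixed schema, is what lets new kinds of relationship be added without restructuring existing data.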

The ontology also describes journalist-authored assets (stories, blogs, profiles, images, video and statistics) and enables them to be associated with concepts within the domain model. Thus a story with an "England Squad" concept relationship provides the basis for a dynamic query aggregation for the England Squad page: "All stories tagged with England Squad".
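The "all stories tagged with a concept" aggregation can be sketched in a few lines. This is an in-memory illustration only, not the SPARQL the site runs; the `Asset` class, the `aggregate` function and the second sample asset are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    title: str
    kind: str  # e.g. "story", "blog", "video"
    concepts: set = field(default_factory=set)


def aggregate(assets, concept, kind="story"):
    """Return the titles of assets of `kind` that are tagged with `concept`."""
    return [a.title for a in assets if a.kind == kind and concept in a.concepts]


# Illustrative sample data; the second asset is invented for the example.
ASSETS = [
    Asset("Goal re-ignites technology row", "story", {"Frank Lampard", "England Squad"}),
    Asset("Hypothetical Group C preview", "blog", {"Group C"}),
]

if __name__ == "__main__":
    print(aggregate(ASSETS, "England Squad"))
```

Because the page is defined by a query rather than by a hand-maintained list, a newly tagged story appears on every relevant page without further editorial work.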

This diagram gives a high-level overview of the main architectural components of this domain-driven, dynamic rendering framework.

diagram_595.png

The journalists use a web tool, called 'Graffiti', for the selective association - or tagging - of concepts to content. For example, a journalist may associate the concept "Frank Lampard" with the story "Goal re-ignites technology row".

In addition to the manual selective tagging process, journalist-authored content is automatically analysed against the World Cup ontology. A natural language and ontological determiner process automatically extracts World Cup concepts embedded within a textual representation of a story. The concepts are moderated and, again, selectively applied before publication. Moderated, automated concept analysis improves the depth, breadth and quality of metadata publishing.
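The suggest-then-moderate flow might look roughly like the sketch below. It is a deliberately naive stand-in: simple whole-phrase matching replaces the natural language analysis described above, and the label list and function names are hypothetical.

```python
import re

# Hypothetical labels; a real system would load these from the ontology.
CONCEPT_LABELS = ["Frank Lampard", "England Squad", "Group C"]


def suggest_concepts(text, labels=CONCEPT_LABELS):
    """Return the ontology labels that appear as whole phrases in `text`."""
    found = []
    for label in labels:
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(label)
    return found


def moderate(suggestions, approved):
    """Keep only the suggestions that a journalist has approved."""
    return [c for c in suggestions if c in approved]


if __name__ == "__main__":
    story = "Frank Lampard's shot crossed the line in the Group C decider."
    suggested = suggest_concepts(story)
    print(moderate(suggested, approved={"Frank Lampard"}))
```

Keeping moderation as a separate step means automated suggestions never reach the published metadata without a journalist's decision.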

Journalist-published metadata is captured and made persistent for querying using the Resource Description Framework (RDF) metadata representation and triple store technology. An RDF triplestore and SPARQL approach was chosen over traditional relational database technologies because the metadata must be interpreted with respect to an ontological domain model. The high-level goal is that the domain ontology allows for intelligent mapping of journalist assets to concepts and queries. The chosen triplestore provides reasoning following the forward-chaining model, so implied (inferred) statements are automatically derived from the explicitly applied journalist metadata concepts. For example, if a journalist selects and applies the single concept "Frank Lampard", then the framework infers and applies concepts such as "England Squad", "Group C" and "FIFA World Cup 2010" (as generated triples within the triple store).
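Forward chaining can be illustrated with a small fixed-point loop: start from the explicit tags and keep deriving implied concepts until nothing new appears. This is a simplified sketch, not the triplestore's OWL reasoner; the `IMPLIES` mapping and `forward_chain` function are assumptions made for the example.

```python
# Hypothetical "implies" relation distilled from the example above.
IMPLIES = {
    "Frank Lampard": ["England Squad"],
    "England Squad": ["Group C"],
    "Group C": ["FIFA World Cup 2010"],
}


def forward_chain(explicit_tags, implies=IMPLIES):
    """Materialise every concept implied by the explicit tags.

    Newly derived concepts are fed back in until no new concept
    is produced, i.e. until a fixed point is reached.
    """
    inferred = set(explicit_tags)
    frontier = list(explicit_tags)
    while frontier:
        concept = frontier.pop()
        for parent in implies.get(concept, ()):
            if parent not in inferred:
                inferred.add(parent)
                frontier.append(parent)
    return inferred


if __name__ == "__main__":
    print(sorted(forward_chain({"Frank Lampard"})))
```

The derived statements are computed when data is written rather than when it is queried, which is why a page query for "Group C" can stay a simple lookup.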

This inference capability makes both the journalist tagging and the triple-store-powered SPARQL queries simpler and indeed quicker than a traditional SQL approach. Dynamic aggregations based on inferred statements increase the quality and breadth of content across the site. The RDF triple approach also facilitates agile modeling, whereas traditional relational schema modeling is less flexible and also increases query complexity.


Our triple store is deployed across multiple data centers in a resilient, clustered, performant and horizontally scalable fashion, allowing future expansion for additional ontologies and indeed linked open data (LOD) sets.

The triple store is abstracted via a Java/Spring/CXF JSR 311-compliant REST service. The REST API is accessible via HTTPS with an appropriate certificate. The API is designed as a generic façade onto the triplestore, allowing RDF data to be re-purposed and re-used pan-BBC. This service orchestrates SPARQL queries and ensures that results are dynamically cached across data centers using memcached, with a low 'time-to-live' (TTL) expiry of 1 minute.
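The effect of a one-minute TTL in front of the query layer can be sketched as follows. This is an in-process stand-in for memcached, written to show the expiry logic only; the `TTLCache` class and its method names are hypothetical.

```python
import time


class TTLCache:
    """In-process stand-in for memcached: entries expire after `ttl` seconds."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get_or_compute(self, key, compute):
        """Return the cached value for `key`, recomputing it once it has expired."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh
        value = compute()  # e.g. run the SPARQL query
        self._store[key] = (now + self.ttl, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl=60.0)
    print(cache.get_or_compute("england-squad", lambda: ["story-1", "story-2"]))
```

With this arrangement each distinct query reaches the triple store at most once per minute per cache, however many page requests arrive in between.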

All RDF metadata transactions sent to the API for CRUD operations are validated against associated ontologies before any persistence operations are invoked. This validation process ensures that RDF conforms to underlying ontologies and ensures data consistency. The validation libraries used include Jena Eyeball. The API also performs content transformations between the various flavors of RDF such as N3 or XML RDF. Example RDF views on the data include:
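The validate-before-persist rule can be sketched as below. This is a toy check of predicate domains and ranges, not Jena Eyeball; the `SCHEMA` and `TYPES` tables and both function names are hypothetical.

```python
# Hypothetical schema: predicate -> (expected subject type, expected object type).
SCHEMA = {
    "memberOf": ("Player", "Squad"),
    "competesIn": ("Squad", "Group"),
}
TYPES = {"Frank Lampard": "Player", "England Squad": "Squad", "Group C": "Group"}


def validate(triples, schema=SCHEMA, types=TYPES):
    """Return a list of problems; an empty list means the triples conform."""
    problems = []
    for s, p, o in triples:
        if p not in schema:
            problems.append(f"unknown predicate: {p}")
            continue
        domain, range_ = schema[p]
        if types.get(s) != domain:
            problems.append(f"{s} is not a {domain}")
        if types.get(o) != range_:
            problems.append(f"{o} is not a {range_}")
    return problems


def persist(triples, store):
    """Add triples to `store` only if validation finds no problems."""
    problems = validate(triples)
    if problems:
        raise ValueError("; ".join(problems))
    store.update(triples)


if __name__ == "__main__":
    store = set()
    persist([("Frank Lampard", "memberOf", "England Squad")], store)
    print(store)
```

Rejecting the whole transaction on the first invalid statement keeps the store consistent, at the cost of requiring clients to fix and resubmit.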

Automated XML sports stats feeds from various sources are delivered and processed by the BBC. These feeds are now also transformed into an RDF representation. The transformation process maps feed supplier ids onto corresponding ontology concepts and thus aligns external provider data with the RDF ontology representation within the triple store. Sports stats for matches, teams and players are aggregated inline and served dynamically from the persistent triple store.
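The id-mapping step of that transformation might look roughly like this. The feed layout, the supplier ids and the statistics are all invented for the example, since the real feed formats are not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical supplier ids and feed layout; the numbers are invented sample data.
SUPPLIER_ID_TO_CONCEPT = {"p-1001": "Frank Lampard"}

FEED = """
<stats>
  <player id="p-1001"><shots>5</shots><passes>42</passes></player>
  <player id="p-9999"><shots>1</shots></player>
</stats>
"""


def feed_to_triples(xml_text, id_map=SUPPLIER_ID_TO_CONCEPT):
    """Map each known supplier id to its concept and emit one triple per stat."""
    triples = []
    for player in ET.fromstring(xml_text).iter("player"):
        concept = id_map.get(player.get("id"))
        if concept is None:
            continue  # unmapped supplier id: skip rather than guess
        for stat in player:
            triples.append((concept, stat.tag, int(stat.text)))
    return triples


if __name__ == "__main__":
    for triple in feed_to_triples(FEED):
        print(triple)
```

Once the supplier's identifier has been replaced by an ontology concept, the statistics can be queried alongside journalist-tagged stories for the same player.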

The following "Frank Lampard" player page includes dynamic sports stats data served via SPARQL queries from the persistent triple store:

frank_595.jpg


The dynamic aggregation and publishing page-rendering layer is built using a Zend PHP and memcached stack. The PHP layer requests an RDF representation of a particular concept or concepts from the REST service layer based on the audience's URL request. If an "England Squad" page request is received by the PHP code, several RDF queries will be invoked over HTTPS to the REST service layer below.

The render layer will then dynamically aggregate several asset types (stories, blogs, feeds, images, profiles and statistics) for a particular concept such as "England Squad". The resultant view and RDF is cached with a low TTL (1 minute) at the render layer for subsequent requests from the audience. The PHP layer dynamically renders views based on HTTP headers providing content negotiated HTML and/or RDF for each and every page.
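Choosing between HTML and RDF from the request headers can be sketched as a small function over the `Accept` header. This is a simplified illustration in Python rather than the site's PHP; it ignores wildcards, and the `negotiate` function and the two supported media types are assumptions.

```python
def negotiate(accept_header):
    """Pick "html" or "rdf" from an HTTP Accept header, defaulting to "html"."""
    supported = {"text/html": "html", "application/rdf+xml": "rdf"}
    best, best_q = "html", 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0].lower(), 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed quality value as "not acceptable"
        if media_type in supported and q > best_q:
            best, best_q = supported[media_type], q
    return best


if __name__ == "__main__":
    print(negotiate("application/rdf+xml"))
    print(negotiate("text/html,application/rdf+xml;q=0.9"))
```

Serving both representations from the same URL means a browser and a data consumer can share links to the same page.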

To make use of the significant amount of existing static news kit and architecture (Apache servers, HTTP load balancers and gateway architecture), all HTTP responses are annotated with appropriately low (1 minute) cache expiry headers. This HTTP caching increases the scalability of the platform and also allows content delivery network (CDN) caching if demand requires.
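A one-minute expiry could be expressed with standard HTTP caching headers along these lines. The exact headers the site sends are not stated in this post, so the use of both `Cache-Control` and `Expires`, and the `cache_headers` helper itself, are assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def cache_headers(ttl_seconds=60, now=None):
    """Build response headers that let HTTP caches keep a page for `ttl_seconds`."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(seconds=ttl_seconds)
    return {
        "Cache-Control": f"public, max-age={ttl_seconds}",
        # usegmt=True needs a UTC-aware datetime and renders the zone as "GMT"
        "Expires": format_datetime(expires, usegmt=True),
    }


if __name__ == "__main__":
    for name, value in cache_headers().items():
        print(f"{name}: {value}")
```

Because the headers are part of the response, any standards-compliant cache between the render layer and the audience can honour the same one-minute window without extra configuration.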

This dynamic semantic publishing architecture has been serving millions of page requests a day throughout the World Cup with continually changing OWL-reasoned semantic RDF data. The platform currently serves an average of a million SPARQL queries a day, with a peak RDF transaction rate of hundreds of player statistics per minute. Cache expiry at all layers within the framework is 1 minute, providing a dynamic, rapidly changing, domain- and statistic-driven user experience.

The development of this new high-performance dynamic semantic publishing stack is a great innovation for the BBC as we are the first to use this technology on such a high-profile site. It also puts us at the cutting edge of development for the next phase of the Internet, Web 3.0.

So what's next for the platform after the World Cup? There are many engaging expansion possibilities: extending the World Cup approach throughout the sport site, making BBC assets geographically 'aware', and aligning news stories to BBC programs. This is all still to be decided, but one thing we are certain of is that this technological approach will play a key role in the creation, navigation and management of over 12,000 athletes and index pages for the London 2012 Olympics.


Jem Rayfield is Senior Technical Architect, BBC News and Knowledge. Read the previous post on the Internet blog that covers the BBC World Cup website, The World Cup and a call to action around Linked Data.


Metadata is data about data: it describes other data. In this instance, it provides information about the content of a digital asset. For example, a World Cup story may include metadata that describes which football players are mentioned within the text of a story. The metadata may also describe the team, group or organization associated with the story.

IBM LanguageWare: Language and ontological linguistic platform.

RDF is based upon the idea of making statements about concepts/resources in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology. The subject denotes the resource; the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, to represent the notion "Frank Lampard plays for England" in RDF as a triple, the subject is "Frank Lampard", the predicate is "plays for" and the object is "England Squad".

SPARQL (pronounced "sparkle") is an RDF query language. Its name is a recursive acronym (i.e. an acronym that refers to itself) that stands for SPARQL Protocol and RDF Query Language.

BigOWLIM: A high-performance, scalable, resilient triplestore with robust OWL reasoning support.

LOD: The term Linked Open Data is used to describe a method of exposing, sharing and connecting data via dereferenceable URIs on the Web.

Java: Object-oriented programming language developed by Sun Microsystems.

Spring: Rich Java framework for managing POJOs, providing facilities such as inversion of control (IoC) and aspect-oriented programming.

Apache CXF: Java web services framework for JAX-WS and JAX-RS.

JSR 311: Java standard specification API for RESTful web services.

Memcached: Distributed memory caching system (deployed across multiple data centers).

Jena Eyeball: Java RDF validation library for checking ontological issues with RDF.

N3: Shorthand textual representation of RDF designed with human readability in mind.

XML RDF: XML representation of an RDF graph.

XML (Extensible Markup Language) is a set of rules for encoding documents and data in machine-readable form.

Zend: Open source scripting virtual machine for PHP, facilitating common programming patterns such as model view controller.

PHP: Hypertext Preprocessor, a general-purpose dynamic web scripting language used to create dynamic web pages.

CDN: A content delivery network or content distribution network (CDN) is a collection of computers usually hosted within Internet Service Provider hosting facilities. The CDN servers cache local copies of content to maximize bandwidth and reduce requests to origin servers.

OWL: Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies.

Courtesy: BBC, Jem Rayfield
