“Despite man's ability to generate information he is limited in his ability to link the information he generates to the information he needs in his daily conflicts, his strivings and aspirations, his welfare and, in a very direct way, his humanity. Indeed, if he were to bridge the gap between the generation and utilization of information, the horizon would be more promising in terms of his ability to reduce injustice and to provide a medium for constructive human development and enterprise.”
Anthony Debons 1974
Introduction
Scientific activity generates and uses a massive volume of many kinds of digital information: raw data, analyzed data, simulations, notes, letters, reports as well as the more traditional published journal articles. Through the World Wide Web and URLs, scientific information has become a rich web of connected digital information. However, as McGrath, Futrelle, and Plante (1999: 2) point out ‘there remains a significant and lasting challenge for humans to exploit this richness, to discover, access, and understand the knowledge that may reside or be created from digital resources.’ This essay examines advances in digital technology that might enable scientists to overcome some of these challenges and provide the ‘Linked Open Science’ (LOS) called for by Kauppinen and Mira de Espindola (2011) that is needed to tackle some of the greatest problems of our time.
Using the internet to publish information in accessible ways: the semantic web
Linked Open Science (LOS) is an approach that ‘interconnects scientific assets to enable transparent, reproducible and transdisciplinary research’ (Kauppinen 2011). It can only be achieved by publishing data using semantic web technology. But what is the semantic web? The concept was first introduced by the web's inventor, Tim Berners-Lee, in the 1990s. Since then it has attracted a lot of excited attention, and some outlandish promises have been made for it. Semantic web technologies, we are told, will provide computers with the ability to perform abstract cognitive processes previously the preserve of human beings. The literature doesn't shy away from referring to computers that can 'understand' and 'reason' (Byrne & Goddard 2010), 'manipulate meaningfully', 'choose', 'infer' and 'comprehend' (Berners-Lee et al 2001). In addition to the hype, the sheer number of different approaches to the semantic web also muddies the water. Each new application emphasises a different feature of the technology, so there are a lot of competing definitions. Wikipedia's entry on the semantic web had no fewer than 1,701 edits at the time of writing. In order to avoid vacuous hyperbole and cut a channel through the multitude of competing versions, this essay will focus on what is perhaps one of the most conservative, and hence most realisable, definitions of the semantic web: that of the World Wide Web Consortium (W3C), the international web standards body founded in 1994 by Berners-Lee.
For the W3C the semantic web is first and foremost the 'web of data'. By data they mean the raw information that populates governmental databases, personal bank statements and scientific reports. By web they mean its essential interconnectedness. Although the web of data builds upon and evolves the extant web of documents, it also contrasts with it quite markedly. It relies upon two essential innovations. The first is the introduction of a common format for data. Currently data is stored in a variety of formats by different databases and computer programs. A common format would allow data pulled from radically different sources to be 'integrated' and 'combined' in interesting and productive ways. The second innovation is the provision of a language that enables data to be related to things in the actual world. Such a language would enable users to discover and use databases and information resources on the basis that they contain data pertaining to one and the same thing. The term 'thing' is used here in a very broad way. It might refer to a person, a city, a historical date, a protein or an object in the traditional sense, like an IKEA bookshelf; essentially, anything that data can be published about is a thing in the semantic web. (Ref: W3C 2001) It’s no wonder then that science is one of the leading areas developing semantic web technology. (Berners-Lee 2009) It has a lot to gain, not least an efficient and standardised means of distributing scientific data globally via the internet. We now need to take a closer look at semantic web technology.
Identifying appropriate and innovative methods of digital data representation and organisation: RDF, RDFS and OWL
Semantic web technology rests on three pillars: RDF, RDFS and OWL. In this section we’ll look at each of them in turn in relation to the goal of a truly Linked Open Science. The Resource Description Framework (RDF) is the common format on which semantic web technology is founded. It is a W3C standard for web metadata or, in other words, a set of rules for providing simple descriptions of webbed data: the title, author, modification date and so on. The very term ‘resource’ suggests a greater generality than the traditional term ‘web page’. This is intended: data described in this way need not be formatted for the web and can be accessed and used by applications other than the one it was intended for. RDF is based upon the idea that things have properties and that these properties can be described by making statements about them. Such statements are structured as triples consisting of subject, predicate and object. In the example statement ‘the web page http://www.example.com/fictional.html was created by Tom Barker’, the web page is the subject about which something is said; the predicate is the property or relation of the subject, in this case its having been ‘created by’ someone; and the object is the value of that property, Tom Barker. RDF formalises such statements to make them ‘machine-readable’ by using Uniform Resource Identifiers (URIs). Unlike URLs, URIs can pertain to anything that an RDF statement may need to refer to. So the person Tom Barker will have a URI, and so too will the abstract concept of ‘creation’ and the web page itself. It is important to note that although these URIs take the form of URLs they needn’t refer to actual pages on the web; they are simply unique identifiers. Our example above could be presented in an RDF statement thus:
Subject: http://www.example.com/fictional.html
Predicate: http://purl.org/dc/elements/1.1/creator
Object: http://www.somegreatorganisation.com/futureemployee.html
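The most common serialisation syntax for RDF is XML (RDF/XML), in which each statement is written as an XML element nested inside a description of its subject.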
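As a rough illustration of what such a serialisation involves, the following Python sketch writes the example triple above as RDF/XML using only the standard library. The helper names split_uri and triple_to_rdfxml are our own inventions for this essay, not part of any RDF toolkit, and a real project would more likely rely on a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The example triple from the text.
SUBJECT = "http://www.example.com/fictional.html"
PREDICATE = "http://purl.org/dc/elements/1.1/creator"
OBJECT = "http://www.somegreatorganisation.com/futureemployee.html"


def split_uri(uri):
    """Split a predicate URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


def triple_to_rdfxml(subject, predicate, obj):
    """Hypothetical helper: serialise one triple as an RDF/XML string."""
    # Ask ElementTree to use the conventional prefixes in its output.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    namespace, local_name = split_uri(predicate)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # rdf:Description names the subject of the statement.
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": subject}
    )
    # The predicate becomes a child element pointing at the object URI.
    ET.SubElement(
        description, f"{{{namespace}}}{local_name}", {f"{{{RDF_NS}}}resource": obj}
    )
    return ET.tostring(root, encoding="unicode")


xml_text = triple_to_rdfxml(SUBJECT, PREDICATE, OBJECT)
print(xml_text)
```

The output is a single rdf:RDF element containing one rdf:Description for the web page, with a dc:creator child that points at the URI standing for Tom Barker.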
This example is simple, but RDF graphs can be more complicated, with many predicates and objects related to a single subject. (Ref: W3C 2004) An LOS graph would link all scientific researchers, research institutions, publications and data sets in this simple way. A large part of the work that lies ahead in LOS is designing these complex graphs and building these webs of linked scientific data.
As we saw, RDF is a standard for the syntax of descriptive sentences pertaining to things, their properties and relations; it is not a vocabulary of descriptive terms. Where RDF provides the grammar, it is RDF Schema (RDFS) that provides the words. RDFS lets users define technical vocabularies or taxonomies that describe the types of things their statements refer to. A schema is a formal hierarchy consisting of ‘resources’ (in the semantic web every thing is a resource), types of resources or ‘classes’ (e.g. Research Institution), and single ‘instances’ of those classes (e.g. City University London) (Ref: W3C 2004). This idea is taken a step further with the Web Ontology Language (OWL). OWL includes the taxonomy of RDFS, that is, the descriptive terms for classes of things and their relations, together with a set of ‘axioms’: logical statements about those classes and relations (for example, that two classes share no members) from which software can infer new facts and detect contradictions. There are a number of extant ontologies we could use to deliver LOS:
• NASA’s Semantic Web for Earth and Environmental Terminology (NASA 2011)
• An Ontology for Engineering Mathematics (Gruber & Olsen 1994)
• The Open Biological and Biomedical Ontologies (OBO 2012)
Other ontologies are urgently required to cover the entirety of scientific effort and deliver LOS.
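The relationship between classes and instances described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the example.com URIs and the intermediate class ‘University’ are hypothetical, while rdf:type and rdfs:subClassOf are the standard RDF and RDFS terms. The sketch shows the kind of simple inference a schema makes possible: an instance of a class is also an instance of every class above it in the hierarchy.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Hypothetical namespace and class names, invented for illustration only.
EX = "http://www.example.com/los#"

TRIPLES = {
    # Schema: every University is a Research Institution.
    (EX + "University", RDFS_SUBCLASS_OF, EX + "ResearchInstitution"),
    # Data: one instance of the class University.
    (EX + "CityUniversityLondon", RDF_TYPE, EX + "University"),
}


def superclasses(cls, triples):
    """Return cls plus every class reachable from it via rdfs:subClassOf."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == RDFS_SUBCLASS_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def classes_of(instance, triples):
    """Return every class the instance belongs to, stated or inferred."""
    result = set()
    for s, p, o in triples:
        if s == instance and p == RDF_TYPE:
            result |= superclasses(o, triples)
    return result


inferred = classes_of(EX + "CityUniversityLondon", TRIPLES)
print(sorted(inferred))
```

Nothing in the data says directly that City University London is a Research Institution; that membership follows from the schema, which is the sense in which RDFS and OWL add meaning on top of bare RDF statements.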
Utilizing recent advances in information and communications technology to support completion of a wide range of information related tasks: LOS and the semantic web
As Bechhofer et al (2011: 1) point out, the “simple move from paper-based to electronic publication does not necessarily make a scientific output decomposable. Nor does it guarantee that outputs, results or methods are reusable.” In this section we will look at the ways that scientists might wish to reuse scientific output and how semantic web technology can enable this.
The first reuse is validation. Peer review is integral to any scientific discovery: colleagues working in the field need to critically assess research data and methodology before committing to the research conclusions. A digitally published scientific paper structured according to RDF, RDFS and OWL standards could be linked as a subject to the following objects – original research data, research methodology, context notes, and all other studies of the same subject. This would greatly reduce the time taken to acquire this data, as the reviewer could access it by clicking on a link rather than by making a formal request to the originating colleague and their research institution. Ultimately it will be possible to analyse semantic and statistical similarities in linked data sets automatically to detect plagiarism and copyright issues. (Ref: Kauppinen and Mira de Espindola 2011)
The second reuse is in planning new research. Suppose a scientist has been given funding to perform some research into sleep deprivation. She is planning her research and wants to know what other studies have been done into sleep deprivation, what data they generated and how it was analysed. In a fully linked open science, semantic web technology would be able to locate all of this simply and accurately. A software agent would send out a request for the URI that identifies sleep deprivation as a topic of scientific study. The first search would look for statements in which that URI is the subject; later the search could be extended to statements in which the same URI appears as the object or predicate. Because the search matches an unambiguous identifier rather than a keyword, information retrieval is more accurate with semantic web technology.
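The staged search described in this scenario can be sketched as follows. All of the URIs and predicate names here are hypothetical stand-ins, and a real software agent would query published linked data rather than a small in-memory list; the point is only to show matching a URI first in the subject position and then in the other positions of a triple.

```python
# Hypothetical namespace and data, invented for illustration only.
EX = "http://www.example.com/los#"
SLEEP = EX + "SleepDeprivation"

TRIPLES = [
    (EX + "study1", EX + "hasTopic", SLEEP),
    (SLEEP, EX + "studiedBy", EX + "study2"),
    (EX + "dataset7", EX + "generatedBy", EX + "study1"),
]

POSITIONS = {"subject": 0, "predicate": 1, "object": 2}


def find(triples, uri, position):
    """Return the triples in which uri occupies the given position."""
    index = POSITIONS[position]
    return [t for t in triples if t[index] == uri]


def staged_search(triples, uri):
    """Search for uri as subject first, then widen to object and predicate."""
    hits = find(triples, uri, "subject")
    for position in ("object", "predicate"):
        for t in find(triples, uri, position):
            if t not in hits:
                hits.append(t)
    return hits


results = staged_search(TRIPLES, SLEEP)
for subject, predicate, obj in results:
    print(subject, predicate, obj)
```

The data set generated by study1 is not returned by this first pass, because its triple does not mention the topic URI at all; an agent would reach it with a follow-up search on the URI of study1.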
The third reuse is in research itself. Here’s another hypothetical scenario: a climate science researcher requires geographical data sets to use in a paper she is putting together on climate change in north-east India. These data sets exist, are structured according to the standards of the semantic web, and are accordingly easy to find. This significant innovation means that she can simply link to these data sets rather than downloading them to her local hard drive, and thus ease pressure on local storage. The Open Provenance Model Vocabulary (Zhao 2010) attaches important provenance metadata to data sets like the one in our example. Our researcher could find out who created the geographical data set, who published it, who has transformed it, and even who else has used it.
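A provenance lookup of this kind might be sketched as below. The predicate names (createdBy, publishedBy, transformedBy, used) and all URIs are hypothetical placeholders chosen to mirror the questions in the scenario; they are not the actual terms of the Open Provenance Model Vocabulary, which a real data set would use instead.

```python
from collections import defaultdict

# Hypothetical namespace, predicates and data, invented for illustration only.
EX = "http://www.example.com/provenance#"
DATASET = EX + "geoDataset"

TRIPLES = [
    (DATASET, EX + "createdBy", EX + "surveyTeam"),
    (DATASET, EX + "publishedBy", EX + "dataCentre"),
    (DATASET, EX + "transformedBy", EX + "analystA"),
    (EX + "climatePaper", EX + "used", DATASET),
    (EX + "floodStudy", EX + "used", DATASET),
]


def provenance_of(dataset, triples):
    """Group everything stated about the data set by predicate."""
    record = defaultdict(list)
    for s, p, o in triples:
        if s == dataset:
            record[p].append(o)
    return dict(record)


def users_of(dataset, triples):
    """Return the resources that state they have used the data set."""
    return [s for s, p, o in triples if p == EX + "used" and o == dataset]


record = provenance_of(DATASET, TRIPLES)
users = users_of(DATASET, TRIPLES)
print(record)
print(users)
```

Who created, published and transformed the data set is read from statements in which it is the subject, whereas who else has used it is read from statements in which it is the object.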
Conclusion
Semantic web technology makes science more efficient. Although it is only one part of delivering Linked Open Science, it is perhaps the most important part. The technology and standards are now in place; however, a great deal of work remains to be done in converting existing scientific data into the RDF format and in building scientific RDFS vocabularies and OWL ontologies. Information specialists have an important role to play in this work.
Bibliography
Beckett, Dave (2005) Dave Beckett’s Resource Description Framework (RDF) Resource Guide [online] Available at: http://planetrdf.com/guide/ [accessed 02/01/2011]
Bechhofer, Sean et al (2011) Why Linked Data is Not Enough for Scientists, Preprint submitted to Elsevier [online] Available at: http://users.ox.ac.uk/~oerc0033/preprints/research-objects.pdf [accessed 07/01/2012]
Berners-Lee, Tim, Hendler, James, and Lassila, Ora (2001) The Semantic Web, Scientific American 284 34-43 [online] Available at: www.sciam.com/article.cfm?id=the-semantic-web [accessed 02/01/2012]
Berners-Lee, Tim (2009) Tim Berners-Lee On The Next Web TED [online] Available at: http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html [accessed 05/01/2011]
Berners-Lee, Tim (2011) Sir Tim Berners-Lee on Open Data [online] Available at: http://www.youtube.com/watch?v=ppRzJW0FDwk [accessed 05/01/2012]
Berners-Lee, Tim (2010) Open Linked Data For a Global Community [online] Available at: http://www.youtube.com/watch?v=ga1aSJXCFe0 [accessed 05/01/2012]
Byrne, Gillian & Goddard, Lisa (2010) The Strongest Link: Libraries and Linked Data, D-Lib Magazine
Floridi, Luciano (2011) Web 2.0 vs. The Semantic Web: A Philosophical Assessment [online] Available at: http://www.philosophyofinformation.net/publications/pdf/w2vsw.pdf [accessed 27/12/2011]
Gruber, Thomas R. and Olsen, Gregory R. (1994) An Ontology for Engineering Mathematics [online] Available at: http://www-ksl.stanford.edu/knowledge-sharing/papers/engmath.html [accessed 07/01/2012]
JISC (2010) ACRID: Advanced Climate Research Infrastructure for Data [online] Available at: http://www.jisc.ac.uk/whatwedo/programmes/mrd/clip/acrid.aspx [accessed 07/01/2012]
Kauppinen, Tomi (2010) About (Linked Open Science) [online] Available at: http://linkedscience.org/about/ [accessed 06/01/2012]
Kauppinen, Tomi, Mira de Espindola, Giovanna (2011) Linked Open Science—Communicating, Sharing and Evaluating Data, Methods and Results for Executable Papers [online] Available at: http://kauppinen.net/tomi/linked-open-science-camera-ready-2011-03-28.pdf [accessed 05/01/2012]
McGrath, Robert E., Futrelle, Joe and Plante, Ray (1999) Digital Library Technology for Locating and Accessing Scientific Data [online] Available at: http://arxiv.org/PS_cache/cs/pdf/9902/9902012v1.pdf [accessed 06/01/2012]
NASA (2011) Semantic Web for Earth and Environmental Terminology [online] Available at: http://sweet.jpl.nasa.gov/sweet/ [accessed 07/01/2012]
OBO (2012) The Open Biological and Biomedical Ontologies [online] Available at: http://open-biomed.sourceforge.net/opmv/ns.html [accessed 07/01/2012]
W3C (2011) W3C Semantic Web Activity [online] Available at: http://www.w3.org/2001/sw/ [accessed 01/01/2012]
W3C (2004) World Wide Web Consortium Issues RDF and OWL Recommendations [online] Available at: http://www.w3.org/2004/01/sws-pressrelease [accessed 02/01/2012]
W3C (2004) Resource Description Framework RDF [online] Available at: http://www.w3.org/RDF/ [accessed 02/01/2012]
W3C (2004) RDF Primer [online] Available at: http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ [accessed 02/01/2012]
W3C (2004) RDF Vocabulary Description Language 1.0: RDF Schema [online] Available at: http://www.w3.org/TR/rdf-schema/ [accessed 04/01/2012]
Zhao, Jun (2010) Open Provenance Model Vocabulary Specification [online] Available at: http://open-biomed.sourceforge.net/opmv/ns.html [accessed 07/01/2012]