Friday, June 11, 2010

Fixing academic literature with HTML5 and the semantic web

Academic literature is broken. So says a recent article in The Biochemist - "Calling International Rescue, Knowledge lost in data landslide!". It's both a review and a call to arms, goading the bioscience community into seeing that traditional research publishing is no longer fit for purpose. The problem is the sheer volume of discovery by researchers across the globe leading to a profusion of journals, articles, databases, nomenclature and supporting evidence that is impossible to discover or digest. Knowledge is 'sequestered' in static articles in obscure journals without the data needed to allow verification or re-analysis, so that increasingly 'we don't know what we know'. Precious research funding is then wasted on confusion, endless literature searches and unknowing rediscovery.

The article goes on to suggest that the semantic web offers hints for the future, and illustrates this through a collaboration with Portland Press to annotate the static PDF articles of a whole Biochemical Journal volume. The annotations enable 'live' nomenclature database web searches, integrated data visualisers, end user commenting and so on, all of which become available when the article is viewed in a special PDF reader from "Utopia Documents". It acknowledges, however, that a much wider standardisation effort is needed to establish the domain specific data formats and ontologies that are required to really harness our collective knowledge using the power of the semantic web.

This is a huge, and old, problem. Informatics is an entire field, encompassing the semantic web and distributed repositories, in which I don't profess expertise. My only thoughts here are twofold:

Firstly, special readers utilising proprietary markup on PDF files will never catch on. If this is to work, I think PDF itself must be superseded by HTML as the medium of choice for academic journal articles. Handing around web pages as documents may not feel right today, but with the addition of embedded fonts in HTML5 (see Scribd's demo of this), better CSS support for printing, and more browser support for offline use so that your favourite reader app could be a web app itself, I believe this is coming. If journal articles worked natively in any browser, so could annotations and the mashed services invoked by them, with all the richness that the AJAX web world provides. Research articles would be easier to search for, and more accessible. They could be packaged together with supporting data, or linked to it in an institutional repository with appropriate release mechanisms. It's the old story - standards, standards, standards - and there's no better information standard globally than HTML itself.
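
To make the idea of annotations living natively in an HTML article a little more concrete, here is a minimal sketch in Python using only the standard library's html.parser. It is an illustration under assumptions, not a proposal for a real format: the `data-term-id` attribute, the identifiers in it and the `extract_terms` helper are all invented for this example, and a real effort would presumably settle on an agreed vocabulary instead.

```python
from html.parser import HTMLParser

# Hypothetical article fragment. The data-term-id attribute and the
# identifiers inside it are invented for illustration only.
ARTICLE_HTML = """
<article>
  <p>We measured <span data-term-id="protein:EX0001">example kinase</span>
  activity in the presence of
  <span data-term-id="compound:EX0002">example inhibitor</span>.</p>
</article>
"""


class TermExtractor(HTMLParser):
    """Collect (identifier, visible text) pairs from annotated spans."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._current_id = None  # identifier of the span being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        term_id = dict(attrs).get("data-term-id")
        if tag == "span" and term_id:
            self._current_id = term_id
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits inside an annotated span.
        if self._current_id is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current_id is not None:
            # Collapse internal whitespace and line breaks to single spaces.
            text = " ".join("".join(self._buffer).split())
            self.terms.append((self._current_id, text))
            self._current_id = None


def extract_terms(html):
    """Return the annotated terms found in an HTML string, in document order."""
    parser = TermExtractor()
    parser.feed(html)
    parser.close()
    return parser.terms


if __name__ == "__main__":
    for term_id, text in extract_terms(ARTICLE_HTML):
        print(term_id, "->", text)
```

The point of the sketch is only that once the article is plain HTML, any browser extension, search indexer or reader app can pick up such annotations with ordinary tools and hand the identifiers to a database lookup or visualiser. It does not handle nested annotated spans, which a real parser would need to.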

Secondly, I can't help but be dubious about hoping that the semantic web alone will make all this information discoverable and digestible. Just because an article outlining frontier research is joined up to other articles (and visible to clever semantic web queries) through nifty bits of common language, that doesn't make the whole a joined up thought. Research, more than anything else, is more complicated than that. What this highlights to me is the increased importance of the well-referenced review article. A colleague neatly characterised the difference between blogs and wikis for me recently, which is relevant here. Research letters are like blog posts in that their purpose is analysis, allowing others to criticise their work. Review articles, by contrast, are like wikis in that they focus on synthesis, drawing together multiple elements for a less controversial and more lasting picture of the state of a field. A semantic web of research letters alone can't replace an expert synthesis. It could contribute to serendipitous discovery, though as the excellent partner article in the same Biochemist issue, "Designing for (un)serendipity - computing and chance", discusses, much more is needed than happening upon connected research to spark a new thought. As Louis Pasteur asserted, "chance favours the prepared mind".

Having said all that, a semantic web of syntheses (review articles, that is), with supporting databases where needed, does have a chance of evolving the standards required, assuming publishers and archivists push for their wider use. Here's hoping we can drive progress in this area to improve the lot of the submerged researcher.


