Saturday, July 04, 2009

Beyond PDF. XML is better.

Many discussions of scholarly communication focus on versions, with the assumption that the best or authoritative version is the publisher's PDF. But is this really the best version for the future? There are arguments that XML is both more usable and better suited to preservation. An author's final manuscript in XML in PubMed Central, for example, is more searchable, generally more accessible for the print disabled, and in better shape for long-term preservation than any PDF version.
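To make the searchability point concrete, here is a minimal sketch in Python using only the standard library. The sample document, its tag names (loosely modelled on journal article markup) and the helper functions `summarize` and `sections_mentioning` are all hypothetical illustrations, not PubMed Central's actual schema or tooling. The point is only that, once the structure is explicit in the markup, questions like "who are the authors?" or "which section mentions this word?" become simple lookups rather than guesswork over page layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample article; tag names are illustrative assumptions only.
SAMPLE = """<article>
  <front>
    <article-title>Beyond PDF</article-title>
    <contrib><surname>Doe</surname><given-names>Jane</given-names></contrib>
    <contrib><surname>Roe</surname><given-names>Richard</given-names></contrib>
  </front>
  <body>
    <sec><title>Methods</title><p>We compared XML and PDF.</p></sec>
    <sec><title>Results</title><p>XML was easier to search.</p></sec>
  </body>
</article>"""


def summarize(xml_text):
    """Return (title, author surnames, section titles) from the markup."""
    root = ET.fromstring(xml_text)
    title = root.findtext("front/article-title")
    authors = [c.findtext("surname") for c in root.iter("contrib")]
    sections = [s.findtext("title") for s in root.iter("sec")]
    return title, authors, sections


def sections_mentioning(xml_text, word):
    """Return titles of sections whose text contains `word` (case-insensitive)."""
    root = ET.fromstring(xml_text)
    hits = []
    for sec in root.iter("sec"):
        # itertext() walks all text inside the section, including its title.
        text = " ".join(sec.itertext())
        if word.lower() in text.lower():
            hits.append(sec.findtext("title"))
    return hits


if __name__ == "__main__":
    print(summarize(SAMPLE))
    print(sections_mentioning(SAMPLE, "search"))
```

Doing the same with a PDF would mean first extracting the text and then guessing which lines are titles, author names or section headings, which is the "scraping" the quotes below complain about.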

Three great quotes on PDF, from a presentation by Alma Swan (thanks to Peter Suber and Charles Bailey on the Open Access Tracking Project):

John Wilbanks (on screen scraping): "Scraping is the right word, because having to work with PDF is really scraping the bottom of the barrel"

Clifford Lynch: "PDF is evil".

Peter Murray-Rust: "Getting to XML from PDF is like starting with the burger and trying to get back to the cow".

Comment: do we need to start writing and publishing in XML in the first place? At the very least, it seems to me that we should be asking ourselves this kind of question, and definitely questioning claims that the current best version is a publisher's PDF. There are efforts in the publishing and word processing industries to facilitate this shift, for example by Charlesworth Group, Nature Publishing Group, the Public Knowledge Project (Open Journal Systems / Open Conference Systems) and Microsoft, that I know of. This is something to watch for, applaud and support.