We need a Web for data
Author(s) - John Wilbanks
Publication year - 2010
Publication title - Learned Publishing
Language(s) - English
eISSN - 1741-4857
pISSN - 0953-1513
DOI - 10.1087/20100410
Robert Metcalfe, co-inventor of Ethernet and the founder of 3Com, is often credited with the observation that the value of a telecommunications network is proportional to the square of the number of connected users of the system. This so-called Metcalfe's Law goes a long way towards explaining why we can create and realize so much value from the Web. As more users get online, the network gets more valuable, spurring more users to get online, and so on.

But we do not have anything like the Web for data, especially scientific and scholarly data. Our personal data is mined by Facebook, by Twitter, by Google, to serve us with relevant advertisements and to underpin many of the 'free' services we access via smartphones and browsers. Yet in scholarship those network effects – Metcalfe's Law for data – remain elusive. We do not yet have the users to reach the point on the Metcalfe curve where value generation breaks above the cost line – and the reality is that, given the small number of scholars and scientists, if we depend on more people being trained, we might never make it. That is why we need a Web of data that includes machines as users, not just people. Software has to be able to consume data in a meaningful way if we are going to realize for science the kind of value generation we know from our social and consumer lives.

The Web of data would be a tremendously powerful tool for research. It would allow us to link together disparate information from unrelated disciplines, run powerful queries, and get precise answers to complex, data-driven questions. It is an undoubtedly desirable extension of the way that the existing networks increase the value of documents and computers through connectivity.

However, making the Web of data turns out to be a deeply complex endeavor. Data – for our purposes a catch-all word covering databases and datasets, and meaning here information gathered in the sciences through either experimental work or environmental observation – requires a much more robust and complete set of standards to achieve the same 'web' capabilities we take for granted in commerce and culture.

Unlike documents, the ultimate intended reader of most data is a machine – a piece of software that will process data and transform it into results that are comprehensible and meaningful to a human. Classic examples include search engines, analytic software, visualization tools, and database back ends. There is simply too much data in production to place people on the front lines of analysis: when data generation scales into petabytes per day, we just cannot keep up using the existing systems.

This machine-readability requirement is very different from the Web of documents, which was designed to standardize the way information is displayed by machines to people. Your browser does not understand whether the web page you are reading is about physics or biology – but it knows precisely how to display the text on that page, whether it's blinking (bad form!) or formatted using precise HTML (good form!). Machine readability for data means that the software running across data needs to be able to place that data into context: to know what other kinds of data it links to, which other software systems might be able to use it, and so forth. And that information needs to travel alongside the data itself, either integrated directly into the data or via a persistent URL that contains a stable description of the data.
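To make the idea of context that travels alongside the data concrete, here is a minimal sketch, not drawn from the article, of the kind of descriptive record software could read before it ever opens the data itself. The field names, values, and URLs are illustrative assumptions rather than any particular metadata standard.

```python
# A minimal sketch (assumed, not from the article) of machine-readable context
# that travels alongside a dataset, either packaged with the data file or
# published at a persistent URL. All field names and URLs are illustrative.
import json

dataset_description = {
    "title": "Hourly air-temperature observations, station X (hypothetical)",
    "creator": "Example Research Group",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "format": "text/csv",
    "variables": [
        {"name": "timestamp", "unit": "ISO 8601 date-time"},
        {"name": "temperature", "unit": "degrees Celsius"},
    ],
    # Links that let software discover related datasets and documentation.
    "related": ["https://example.org/data/station-x"],
}

# Serialized, this description can be read by search engines, visualization
# tools, or analysis pipelines before they ever open the data itself.
print(json.dumps(dataset_description, indent=2))
```

The point is not this particular format but that the context is expressed in a form software can act on, without a human having to read the accompanying paper.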
Machine readability means we have to think, early and often, about the level of interoperability in any given chunk of data. 'How "connectable" is it to other data?' should be the first question we ask of new data, because the level of effort required to make data connectable post hoc is significant – frequently unbearable. Another way to think about this problem is as an interoperability issue: the connectability question creates significant pressure to build data that interoperates with other data ex ante, not post hoc. Interoperability implies a level of rigor in the design of data that understands the intended use of that data.
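As a rough illustration of the ex ante versus post hoc distinction, the sketch below contrasts two hypothetical records of the same measurement: one designed up front to share an identifier scheme and units with other data, and one that has to be reconciled by hand afterwards. The identifiers, field names, and unit conversion are assumptions made for this example, not anything specified in the article.

```python
# Illustrative only: two hypothetical records describing the same measurement.
# The "ex ante" record was designed to share identifiers and units with other
# data; the "post hoc" record needs ad hoc reconciliation before it can link.

interoperable_record = {          # designed for connection up front
    "station_id": "station:0001",       # hypothetical shared identifier scheme
    "temperature_c": 21.5,              # agreed unit: degrees Celsius
}

legacy_record = {                 # connectable only with extra effort
    "site": "Central station (obs.)",   # free-text name, no shared identifier
    "temp": "70.7 F",                   # unit buried in a string
}

def reconcile(record: dict) -> dict:
    """Post hoc cleanup: map the identifier by hand and convert the unit."""
    fahrenheit = float(record["temp"].rstrip(" F"))
    return {
        "station_id": "station:0001",   # manual mapping, done per source
        "temperature_c": round((fahrenheit - 32) * 5 / 9, 1),
    }

# The interoperable record links immediately; the legacy record needs
# hand-written, per-source code like reconcile() before it can join the web.
print(interoperable_record)
print(reconcile(legacy_record))
```

Multiply the second path by thousands of data sources and the pressure to design for interoperability ex ante becomes obvious.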