Data Mapping and Matching: Languages for Scientific Datasets
Author(s) -
Paris C. Kanellakis
Publication year - 1995
Publication title -
Electronic Workshops in Computing
Language(s) - English
Resource type - Conference proceedings
ISSN - 1477-9358
DOI - 10.14236/ewic/dbpl1995.15
Subject(s) - theme (computing) , matching (statistics) , computer science , panel discussion , order (exchange) , period (music) , library science , world wide web , mathematics , art , linguistics , statistics , philosophy , aesthetics , finance , economics
This is a report on a panel at the Fifth International Workshop on Database Programming Languages, September 6–8, 1995, Gubbio, Umbria, Italy. The panel was well attended and there was a fair amount of interaction with an audience of about 50 people. The panelists spoke for about 10 minutes each. Then there was a 30-minute period of questions and discussion with the audience. The order of the panel presentations was: Paris Kanellakis (panel chair) of Brown, David Maier of OGI, Peter Buneman of UPenn, Stan Zdonik of Brown, and Sophie Cluet of INRIA. We present the general theme of the panel, summaries of the panelists' remarks, and a summary of the general discussion.

Introduction

The panel chair introduced the panelists and presented the panel theme. He set the stage for the discussion by contrasting commercial relational database technology with scientific data management. Database management systems have been very successful at providing efficient access to the large databases of business applications. This success has been achieved for highly structured, record-oriented data by combining an elegant formalism (logic-based languages and algebras) with efficient implementation. New data-intensive applications, such as those of the scientific community, require efficient access to massive amounts of data that differ in their semantics and organization from business data: (i) the data structure is more complex (complex objects, extensibility, heterogeneity, a fair amount of metadata); (ii) the time and space dimensions are essential (although poorly captured by the relational model); (iii) the querying of data often involves data mining for similarities.

Panel Topic: A primary motivation for new database technology is to facilitate classification and exploratory search of the broad spectrum of multimedia data, available both at a user's site and through network access.
Many of the available datasets are scientific, residing either in conventional databases or in the more common and more general data exchange (DX) formats. The impact of current database technology (both object-oriented and relational) on managing scientific datasets is limited by a lack of interoperation with the growing variety of heterogeneous DX formats. Another significant problem of current database systems is insufficient modeling support for metadata as well as for spatial and temporal features, which are present in the majority of scientific applications. This panel will discuss these limitations of existing database languages and explore fresh approaches towards information integration (data mapping) and manipulation (data matching).

[Paris Kanellakis, his wife and their two children died unexpectedly and tragically on December 20, 1995. This is a tremendous loss to the DBPL community. An obituary is included in the preface to these proceedings. P.A. & V.T.]

Declarative Languages: From Relations to Constraints

Paris Kanellakis also briefly commented on the evolution of declarative data models from relational to constraint-based. Constraint databases are a candidate formalism for expressing, in a declarative fashion, queries on spatial and temporal data. The tuples of the relational model are generalized to conjunctions of constraints. The underlying principle is to use, in database languages, data types that are closer to the natural-language specification of many applications. The motivation for more general data models (such as constraint-based models) has been increased functionality. Similarity queries over time-series data are a good example of this needed functionality. The queries that one would like to express here involve detecting similar sequence patterns.
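As a rough illustration of what "detecting similar sequence patterns" can mean, the sketch below treats two equal-length sequences as similar when their Euclidean distance, taken after removing offset and scale, falls within a tolerance. The report does not fix a similarity measure, so this measure, the tolerance `eps`, and every function name here are assumptions made for illustration only.

```python
from math import sqrt


def normalize(seq):
    """Shift a sequence to zero mean and scale it to unit standard deviation."""
    n = len(seq)
    mean = sum(seq) / n
    std = sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0:
        # A constant sequence has no shape; map it to all zeros.
        return [0.0] * n
    return [(x - mean) / std for x in seq]


def distance(a, b):
    """Euclidean distance between two sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar(a, b, eps):
    """True if a and b have the same shape up to offset and scale, within eps."""
    return distance(normalize(a), normalize(b)) <= eps


def find_similar_windows(series, pattern, eps):
    """Start indices of all windows of `series` that are similar to `pattern`."""
    m = len(pattern)
    return [
        i
        for i in range(len(series) - m + 1)
        if similar(series[i:i + m], pattern, eps)
    ]


if __name__ == "__main__":
    # Hypothetical readings; the first four values repeat the pattern at 10x scale.
    readings = [10.0, 20.0, 30.0, 20.0, 7.0, 7.0, 3.0, 9.0]
    print(find_similar_windows(readings, [1.0, 2.0, 3.0, 2.0], eps=0.1))
```

A declarative query language would let the user state the tolerance and the allowed transformations, and leave the search strategy to the system; the linear scan above is only the simplest possible evaluation.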
For example, an exact match between two sequences is rare, but sequences may almost match or they may present similarities. The detection of these similarities has many applications, from finance (e.g., stock prices) to earth science (e.g., temperature readings).

Data Exchange Formats and 3-Level Architecture

David Maier considered data exchange (DX) formats. The majority of scientific datasets do not reside in conventional databases but rather in DX formats (e.g., CDF, HDF, CIF, FITS, ASN.1, Express). These self-describing formats were developed to allow programs to exchange data. This is possibly the fastest-growing form of network-accessible data. Some DX formats are now also being used for logical data definition and as the primary form of data storage. They are usually equipped with application program interfaces (APIs). Not surprisingly, data management functionality such as access methods (e.g., records in netCDF), catalogs, and query facilities is being added to these APIs. DX formats offer a number of advantages that are making them popular, the most crucial probably being the existence of link libraries specific to scientific domains. Indeed, DX formats are becoming standard in scientific communities. However, they are missing many features commonly found in database management systems. In particular, they do not scale up and their query facilities are very primitive.

David Maier advocated the use of an object-oriented database system as a Hybrid Data Manager, that is, as middleware between the applications and the data sources (databases or files). This leads to a 3-level architecture with heterogeneous external sources at the bottom level, the object database acting as a mediator, and a homogeneous domain schema at the top level. David Maier reported some experiments in Materials Science with the Gemstone object-oriented database management system and five sources (two databases and three DX formats).
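The 3-level architecture can be sketched roughly as follows: wrappers at the bottom translate each heterogeneous source into objects of one domain schema, and a mediator in the middle answers application queries over all of them. This is not the system reported at the panel; the `Sample` schema, the two wrapper classes, the unit metadata key, and the toy records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Protocol, Tuple


@dataclass(frozen=True)
class Sample:
    """Homogeneous domain-schema object seen by applications (hypothetical)."""
    sample_id: str
    material: str
    temperature_k: float


class Source(Protocol):
    """Anything that can present its contents as domain-schema objects."""
    def samples(self) -> Iterator[Sample]: ...


class TableSource:
    """Wrapper for a relational-style source: tuples, temperature in kelvin."""
    def __init__(self, rows: Iterable[Tuple[str, str, float]]):
        self._rows = list(rows)

    def samples(self) -> Iterator[Sample]:
        for sample_id, material, temp_k in self._rows:
            yield Sample(sample_id, material, float(temp_k))


class SelfDescribingFileSource:
    """Wrapper for a DX-style source: records plus a header carrying metadata."""
    def __init__(self, header: Dict[str, str], records: Iterable[dict]):
        self._header = header
        self._records = list(records)

    def samples(self) -> Iterator[Sample]:
        # Use the file's own metadata to convert into the domain schema's units.
        celsius = self._header.get("temperature_units") == "celsius"
        offset = 273.15 if celsius else 0.0
        for rec in self._records:
            yield Sample(rec["id"], rec["material"], rec["temperature"] + offset)


class Mediator:
    """Middle level: one query interface over all wrapped sources."""
    def __init__(self, sources: Iterable[Source]):
        self._sources = list(sources)

    def query(self, predicate: Callable[[Sample], bool]) -> List[Sample]:
        return [s for src in self._sources for s in src.samples() if predicate(s)]


if __name__ == "__main__":
    mediator = Mediator([
        TableSource([("t1", "steel", 300.0)]),
        SelfDescribingFileSource(
            {"temperature_units": "celsius"},
            [{"id": "f1", "material": "steel", "temperature": 26.85},
             {"id": "f2", "material": "copper", "temperature": 100.0}],
        ),
    ])
    for sample in mediator.query(lambda s: s.material == "steel"):
        print(sample)
```

The design point is that applications depend only on the domain schema; adding a new DX format means writing one more wrapper, not changing the applications.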
Genome Databases and Database Languages

Peter Buneman considered the case of genome data (e.g., ASN.1), an excellent demonstration of the adoption of DX formats despite their drawbacks. As in other fields, there are many reasons for this: (i) genome data is not adequately modeled using traditional database models, (ii) data descriptions/schemas are enormous and change very rapidly, (iii) interoperability with special-purpose algorithms (e.g., Blast or Fasta) is crucial. So, when the first genome banks were developed, long-range concerns, such as the transaction-oriented support offered by database systems, were often overshadowed by economic as well as scientific pressure to get sequencing information into electronic form very quickly. To answer the needs of genome databases, database systems have to offer better linguistic support for collection types and for other types (e.g., variants) that are encountered in DX formats. It is important to be able to ask complex queries spanning multiple databases. (For example: find the information on the DNA sequences known to be on Chromosome 22 between locations 22p11.2 and q12.1 and, for each sequence, identify similar sequences from other organisms.)
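The shape of such a query, a selection on one database followed by a similarity lookup in another, can be sketched as below. Everything here is a toy under stated assumptions: the two in-memory "databases" hold made-up records (not real genome data), integer positions stand in for map locations, and Python's `difflib.SequenceMatcher` stands in for a special-purpose algorithm such as Blast or Fasta.

```python
from difflib import SequenceMatcher

# Hypothetical map database: which sequence sits where. Positions are
# made-up integers, not real cytogenetic locations.
MAP_DB = [
    {"seq_id": "s1", "chromosome": 22, "position": 10},
    {"seq_id": "s2", "chromosome": 22, "position": 55},
    {"seq_id": "s3", "chromosome": 21, "position": 12},
]

# Hypothetical sequence database: seq_id -> (organism, sequence).
SEQ_DB = {
    "s1": ("human", "ACGTACGTAC"),
    "s2": ("human", "TTTTGGGGCC"),
    "s3": ("human", "ACGTTCGTAC"),
    "m1": ("mouse", "ACGTACGTAA"),
    "m2": ("mouse", "CCCCCCCCCC"),
}


def similarity(a, b):
    """Toy similarity score in [0, 1]; a stand-in for a real alignment tool."""
    return SequenceMatcher(None, a, b).ratio()


def sequences_with_similar(chromosome, lo, hi, threshold):
    """For each sequence mapped to `chromosome` within [lo, hi], list the ids
    of sufficiently similar sequences that belong to other organisms."""
    results = {}
    for entry in MAP_DB:
        if entry["chromosome"] != chromosome:
            continue
        if not lo <= entry["position"] <= hi:
            continue
        organism, seq = SEQ_DB[entry["seq_id"]]
        results[entry["seq_id"]] = [
            other_id
            for other_id, (other_org, other_seq) in SEQ_DB.items()
            if other_org != organism and similarity(seq, other_seq) >= threshold
        ]
    return results


if __name__ == "__main__":
    print(sequences_with_similar(22, 0, 20, threshold=0.8))
```

Written by hand like this, the join across the two sources is buried in loops; the linguistic support called for above would let the same query be stated declaratively over collection types.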