Almost every organization manages data in some way. These data serve many purposes and subject areas, including human resources, sales, manufacturing, medical care, environment, aeronautics, basic science, engineering, and socio-economics. Sometimes, and again for many purposes, organizations need to share these data with other organizations or individuals. Several conferences would be needed to discuss all the requirements for data interchange, so this conference will focus on what is necessary to achieve a common understanding of data, i.e., on data sharing.
What is required to understand data? The term metadata is used for data that describe some resource. Here, the resources are other data. Metadata are effective if they convey a complete understanding of the data they describe. In practice, a complete description is not possible, but a good approximation of one is. Additionally, a description may be complete in a limited sense if it says all that is necessary for some system to perform its task. If such a description exists, then we say those data are interoperable.
This means that interoperability is a relative term. It is achieved within the context of a set of requirements. Data interoperability is achieved if the receiver of the data can interpret and process those data in the same way as the sender intended. The context, or the set of requirements, is the intention of the sender.
Here is a set of graduated examples that will make this idea clear. Start with a CD and ask if the data it contains are shareable. It depends on what we want to do with the data, and that depends on how much we know, as follows:
- To copy the CD to another, we must know the starting track or sector on the disk
- The data may be part of a file system or they may be used for streaming, such as for audio
- In a file system, the data may be coded in ASCII, UTF-8, or UTF-16
- The ASCII data might be organized as XML or comma delimited text
- The XML might be organized by some XML-Schema or DTD
- In a Schema, an element exists whose tag is "income", and the meaning of that must be known (wages only versus wages, dividends, and interest)
- Finally, how the income data are represented must be specified: the datatype, units of measure, and allowed values (say, real, dollars, 0 to 10**9 versus integer, pesos, 0 to 10**10)
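The last two steps can be made concrete with a minimal Python sketch. It assumes a hypothetical `ValueDescription` record and `interpret` function (neither comes from any standard) and reuses the two "income" descriptions from the list above. It shows only that the same tag cannot be processed correctly until its metadata are known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueDescription:
    """Hypothetical metadata record for one data element (illustrative only)."""
    name: str
    definition: str
    datatype: type
    unit: str
    minimum: float
    maximum: float


def interpret(raw: str, description: ValueDescription):
    """Convert a raw string using its metadata and reject out-of-range values."""
    value = description.datatype(raw)
    if not description.minimum <= value <= description.maximum:
        raise ValueError(f"{description.name}={value} is outside the allowed range")
    return value, description.unit


# Two senders use the same tag, "income", but mean different things by it.
income_a = ValueDescription("income", "wages only", float, "dollars", 0, 10**9)
income_b = ValueDescription(
    "income", "wages, dividends, and interest", int, "pesos", 0, 10**10
)

result_a = interpret("52000.50", income_a)
result_b = interpret("52000", income_b)
```

Without the two descriptions, a receiver could not tell that these values differ in definition, datatype, and unit, even though both arrive under the tag "income".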
How does one describe data? The simple answer is to write down the metadata. But this raises an interesting problem: what does the description look like? The data each organization produces are defined differently from the data in another organization. If we aren't careful, the same will be true of the metadata. In other words, we allow each organization to define its data its own way, but that freedom should stop at the metadata.
Why? Say two organizations are comparing each other's data. To do this, each needs the other's metadata. If the metadata are organized in different ways, the comparison problem is just moved to the metadata level, and the organizations must start over again. Therefore, it makes sense to use the same description scheme for describing data.
This is where standards come into play. A standard is a document containing a set of provisions (requirements, recommendations, instructions, and statements) developed through an open, consensus-building process or through edict. In either case, the standard represents a recognized agreement. The requirements needed to achieve interoperability might come from a standard, for instance. This conference will address standards needed for data interoperability.
We aren't going to standardize the way we define data in each organization, but we can describe data by using the same techniques. That is, we wish to define metadata in a standard way, in order to help with the interoperability problem.
Standards themselves are designed to solve specific kinds of problems. For instance, ISO/IEC 11179 (Metadata registries) contains provisions for describing data in a general way. However, one wouldn't normally use it to define and export the schema for a relational database application, though one could. A major goal of the conference is to describe a set of standards and show how these standards can be made to interoperate (i.e., how to map between them).
Another goal of data sharing is discovery. A person or organization may want to find data based on some particular concepts. There are two main ways of searching for such data: through a search engine and through a registry.
Search engines are both very good and very bad. They can be used to find a resource quickly, but one must often dig hard to find the right match among many thousands of possibilities. This is not efficient.
The other way is through the use of registries. A registry is akin to a catalog of resources that are available from some repository or a set of repositories. Registries might contain only basic information about the resources, but they also have direct links to them, speeding the acquisition process. The conference will describe these differences in detail.
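As a rough illustration of the registry idea, the sketch below builds a toy catalog in Python. The class names, entry titles, and locations are all hypothetical and are not taken from ISO/IEC 11179 or any other standard. Each entry holds basic information plus a direct link, and entries are found by concept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    """Basic information about one resource (all fields are illustrative)."""
    title: str
    concepts: frozenset
    location: str  # direct link to the resource in some repository


class Registry:
    """A toy catalog that indexes entries by the concepts they describe."""

    def __init__(self):
        self._by_concept = {}

    def register(self, entry):
        # File the entry under every concept it is described by.
        for concept in entry.concepts:
            self._by_concept.setdefault(concept, []).append(entry)

    def find(self, concept):
        """Return every entry registered under the given concept."""
        return list(self._by_concept.get(concept, []))


registry = Registry()
registry.register(
    RegistryEntry(
        "Household survey",
        frozenset({"income", "household"}),
        "repository-a/household-survey",  # made-up location
    )
)
registry.register(
    RegistryEntry(
        "Orchard yields",
        frozenset({"apple", "yield"}),
        "repository-b/orchard-yields",  # made-up location
    )
)

matches = registry.find("income")
locations = [entry.location for entry in matches]
```

A lookup by concept returns only the entries registered under that concept, each with the link needed to retrieve the resource, rather than thousands of loosely matching results.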
Terminology is the study of concepts, the objects that correspond to those concepts, and the terms used to designate them. In particular, a concept is a unique combination of characteristics, and characteristics, in turn, are abstractions of properties.
A property is the result of an observation of an object. Let's say you notice that the apple you took to lunch today is colored red. Red color is a property of that apple. For all apples we make the abstraction that they all have some color. They are not all red, for some are yellow, for instance. Color is a characteristic of apples.
Why is this important? In any data model, the entities or classes in that model represent concepts. The attributes or columns in an entity or class are characteristics of the concept, and the values the attributes are allowed to take are the properties associated with those characteristics. The rows in an entity (a table in a relational database) or the instances of a class are the objects corresponding to the concept.
Going back to the apple example, suppose we have an entity in a database schema called "apple". "Apple" is the term used to designate the concept of apples. One of the columns in this entity is called "color". For the row representing the apple you took to lunch today, the color column has the value "red". Therefore, data are terminological things.
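The same correspondence can be sketched with a class instead of a table. This is a minimal Python illustration, not a prescribed modeling method; the `ALLOWED_COLORS` set is an assumed list of permitted values added for the example.

```python
from dataclasses import dataclass, fields

# Assumed set of allowed values: the properties the characteristic may take.
ALLOWED_COLORS = {"red", "yellow", "green"}


@dataclass
class Apple:
    """The class represents the concept; its name, "Apple", is the term."""
    color: str  # an attribute is a characteristic of the concept

    def __post_init__(self):
        if self.color not in ALLOWED_COLORS:
            raise ValueError(f"{self.color!r} is not an allowed color")


# An instance is an object corresponding to the concept;
# "red" is the property observed for this particular apple.
lunch_apple = Apple(color="red")

term = Apple.__name__
characteristics = [f.name for f in fields(Apple)]
```

Here `term` holds the designation, `characteristics` lists the attributes of the concept, and `lunch_apple.color` holds one observed property.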
There is a deep connection between the language (concepts and terms) we use to describe some subject field and the data collected, manipulated, stored, and described in some (computer) system about that subject field. The conference is intended to illuminate this connection.