Definition of the CIDOC object-oriented Conceptual Reference Model and Crossreference Manual

This page is the introductory page of Definition of CIDOC object-oriented Conceptual Reference Model and Crossreference Manual

Contents:

Initial Page
Preface to version 3.3.2.

Introduction

Intended Scope

Applied Form

Terminology

Cardinality Constraints

Naming Rules

Modelling principles

Minimality

Shortcuts

Disjointness

About Types

Completeness

Extensions

Examples

The Entity and Property List

Introduction

This document is a formal definition of the object-oriented CIDOC Conceptual Reference Model (referred to in the following as the “CRM”). It is the result of work done by the CIDOC Documentation Standards Group from 1994 to 2000 and by the CIDOC CRM Special Interest Group from 2000 to 2002. The work arose from an initiative to define the underlying semantics of database schemata and document structures needed in museum documentation, in support of good practice in conceptual modelling, data transformation, data exchange, information integration and mediation of heterogeneous sources.

Intended Scope

The intended scope should be understood as the domain that the CRM would ideally aim to cover, given sufficient time and resources, and is expressed as a definition of principles. The practical scope is, necessarily, a subset of the intended scope. The intended scope is difficult to define with the same degree of precision as the practical scope, since it depends on concepts such as "cultural heritage" which are themselves complex and difficult to define. The objectives provided by the intended scope are important, however, since they allow appropriate sources to be selected for inclusion in the practical scope. The practical scope is expressed in terms of the reference documents and sources that have been used in its elaboration. The CRM covers the same domain as these reference sources ([x]…[y]). This means that data encoded following one of those sources can be transformed or integrated into a CRM-compatible form without loss of meaning, as long as the meaning remains within the intended scope of the CIDOC CRM.

The intended scope of the CRM may be defined as all information required for the scientific documentation of cultural heritage collections, with a view to enabling wide area information exchange and integration of heterogeneous sources. This definition requires some explanation:

· The term scientific documentation is intended to convey the requirement that the depth and quality of descriptive information which can be handled by the CRM should be sufficient for serious academic research into a given field and not merely that required for casual browsing. This does not mean that information intended for presentation to members of the general public is excluded, but rather that the CRM is intended to provide the level of detail and precision expected and required by museum professionals and researchers in the field.

· The term cultural heritage collections is intended to cover all types of material collected and displayed by museums and related institutions, as defined by ICOM [1]. This includes collections, sites and monuments relating to natural history, ethnography, archaeology, historic monuments, as well as collections of fine and applied arts. The exchange of relevant information with libraries and archives, and the harmonisation of the CRM with their models, fall within the CRM's intended scope.

· The documentation of collections is intended to encompass the detailed description both of individual items within collections as well as groups of items and collections as a whole. The scope of the CRM is the curated knowledge of museums. Information required solely for the administration and management of cultural heritage institutions, such as information relating to personnel, accounting, and visitor statistics, falls outside the intended scope.

· The CRM is specifically intended to cover contextual information: the historical, geographical and theoretical background in which individual items are placed and which gives them much of their significance and value.

· The goal of enabling information exchange and integration between heterogeneous sources determines the constructs and level of detail of the CRM. It also determines its perspective, which is necessarily supra-institutional and abstracted from any specific local context.

· The CRM aims to leverage contemporary technology while enabling communication with legacy systems.

Applied Form

The CRM is a domain ontology in the sense used in computer science[]. As such, the model is designed to be explanatory and extensible rather than prescriptive and restrictive. Currently, no specific formalism for semantic models has been widely accepted as a standard; nevertheless, the semantic deviations between the various available models are minimal. Consequently, the model has been formulated as an object-oriented semantic model[], which can easily be converted into other object-oriented models. It is our intention that this presentation format should be both natural and expressive for domain experts, and easily converted to other machine-readable formats such as RDF and XML. The definition itself, as presented here, is not enough to fully comprehend the model and its application. The use of rich specialization hierarchies produces a complex set of inherited properties and cross-references, so this relatively compact definition of about 220 elements corresponds to several thousand properties of the declared classes. A full set of direct and inherited declarations can be generated automatically from this definition and made available as a separate document.

Terminology

From the various terminologies in use for object-oriented models, we have selected the following for ease of understanding by non-computer experts. The terms are motivated by the terminology of RDF (Resource Description Framework), a recommendation of the W3C. We use a slightly more constrained semantic model than RDF, a subset of TELOS [], which is compatible with many formalisms and can be implemented on a large variety of tools. We use:

“class” for a concept that may be called “class”, “individual class”, “entity” or “node”, in contrast to properties. It denotes a category of items whose definition does not depend on the existence of other categories, such as “physical object”, in contrast to “carried out”, which requires persons and activities to make sense. A class plays a role similar to a grammatical noun. A class is characterized by an intension, which is conveyed by a scope note.

“scope note” is not a definition in absolute terms but a text making clear to the user the relation of the class to known concepts of the domain. A class is associated in real life with a set of instances, the extension. This set is open and unknown – we do not know all instances of a concept in the current world or the past, nor can we foresee the future. No concept in the CRM is defined by its extension (such as enumeration sets); however, references to good examples are used to clarify the intension.

“property” for a concept that may be called “attribute”, “reference”, “link”, “role” or “property”. A property denotes a category of items whose definition does depend on the existence of two other items, such as “carried out”, which requires some persons and an action to make sense, in contrast to “physical object”, which is defined on its own. It plays a role similar to a verb, which stands between a grammatical subject and an object. In the manner of semantic networks, we do not distinguish between internal attributes and references. Rather, every value associated with an instance of a class is regarded as a reference to that value, which in turn is an instance of another class. The property is characterized by an intension, which is conveyed by a scope note. It is also associated with an extension. For instance, “my role in the writing of this document” is regarded as an instance of “carried out”. Properties may themselves have properties, which point to other entities. In the CRM this is used for dynamically specializing properties, such as in the case of multiple potential kinds of roles.

“domain” for the class a property is defined for – like a grammatical subject. It is the most specialized class meaningful to the expert that comprises all potential uses of the property. We allow only one domain per property. Note that it may always contain instances for which the property is not applicable.

“range” for the class a property refers to – like a grammatical object. It is the most specialized class meaningful to the expert that comprises all potential values of the property. We allow only one range per property. Note that it may always contain instances for which the property is not applicable. Note also that the difference between range and domain is not substantial but conventional. We always give in parentheses the name the property has if domain and range are interchanged, frequently simply the passive voice. Hence which one is the domain and which one the range depends only on the name of the property, such as “carried out” versus “was carried out by”.

“Superclass-Subclass” for “IsA” relations, which may be called “superclass – subclass”, “parent class – derived class”, “generalization – specialization”, “genus – species”, “subsumes – is subsumed by”. An IsA relation between classes, such as “a birth IsA event”, denotes that the subclass has all properties of its superclasses (strict inheritance), and some more of its own, and that the superclass comprises all instances of all subclasses and some more of its own. As domain and range are interchangeable, inheritance holds for both. An instance of a subclass of the domain of a property can use this property, and any property instance can refer to an instance of a subclass of the range of this property. A class may have more than one immediate superclass (multiple inheritance). The “IsA” relation also holds for properties. E.g. “carried out” is a subproperty of “participated”. If a property A is a subproperty of a property B, we require that the domain of A is a subclass of the domain of B, and the range of A is a subclass of the range of B.
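The terms above can be illustrated by a minimal sketch in Python. It is not part of the CRM definition: the names CrmClass and CrmProperty, the small hierarchy, and the inverse name “had participant” are assumptions made for illustration only, and the IsA test is treated as reflexive (a class subsumes itself).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CrmClass:
    """A class with zero or more immediate superclasses (multiple inheritance)."""
    name: str
    superclasses: list[CrmClass] = field(default_factory=list)

    def is_a(self, other: CrmClass) -> bool:
        # Assumption: a class is subsumed by itself and by every ancestor.
        return self is other or any(s.is_a(other) for s in self.superclasses)


@dataclass
class CrmProperty:
    """A property with exactly one domain and exactly one range."""
    name: str
    inverse_name: str
    domain: CrmClass
    range: CrmClass
    superproperty: CrmProperty | None = None

    def is_consistent(self) -> bool:
        # Subproperty rule: domain and range must specialize
        # those of the superproperty.
        if self.superproperty is None:
            return True
        return (self.domain.is_a(self.superproperty.domain)
                and self.range.is_a(self.superproperty.range))

    def applies_to(self, instance_class: CrmClass) -> bool:
        # Inheritance: instances of any subclass of the domain may use the property.
        return instance_class.is_a(self.domain)


# Illustrative hierarchy; the class names are hypothetical, not CRM declarations.
entity = CrmClass("Entity")
actor = CrmClass("Actor", [entity])
person = CrmClass("Person", [actor])
event = CrmClass("Event", [entity])
activity = CrmClass("Activity", [event])

participated = CrmProperty("participated", "had participant", actor, event)
carried_out = CrmProperty("carried out", "was carried out by",
                          actor, activity, participated)

print(carried_out.is_consistent())    # subproperty rule holds for this sketch
print(carried_out.applies_to(person))  # Person inherits the property from Actor
```

The single domain and range fields mirror the rule that only one domain and one range are allowed per property; the two names per property mirror the convention of labelling each property in both directions.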

Cardinality Constraints

We use cardinality constraints of properties (e.g. one-to-many). They are not implementation recommendations; they serve only semantic clarification. As the model is designed to compile alternative opinions and incomplete information, all properties should be implemented as multi-valued and optional, unless more complex relationships to information sources are used. Possible values are:

“1:many”: an individual domain instance can have zero or more such properties, but each value can be referred to by at most one domain instance (“fan-out”). E.g. one Formation Event may form many Groups, but one Group is formed by only one Formation Event (P95).

“many:1”: an individual domain instance can have zero or one such property, but each value can be referred to by more than one domain instance (“fan-in”). E.g. one Birth can be by only one mother, but a mother may have had many births (P96).

“many:many”: unconstrained.
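The three values can be read as simple counting rules over property instances. The following Python sketch is an illustration under assumptions: the function name violations and the instance identifiers are hypothetical. In keeping with the remark that the constraints are not implementation recommendations, the function only reports deviations and never rejects data.

```python
from collections import Counter


def violations(cardinality, pairs):
    """Return the instances that deviate from a cardinality constraint.

    pairs is an iterable of (domain_instance, range_instance) tuples.
    """
    pairs = set(pairs)  # repeated identical statements are not a deviation
    if cardinality == "1:many":
        # fan-out: each value may be referred to by at most one domain instance
        counts = Counter(value for _, value in pairs)
    elif cardinality == "many:1":
        # fan-in: each domain instance may have at most one value
        counts = Counter(subject for subject, _ in pairs)
    elif cardinality == "many:many":
        return set()
    else:
        raise ValueError(f"unknown cardinality: {cardinality!r}")
    return {item for item, n in counts.items() if n > 1}


# Hypothetical instance data for a 1:many property (formation event -> group).
formed = [("formation_1", "group_a"), ("formation_1", "group_b")]
print(violations("1:many", formed))
print(violations("1:many", formed + [("formation_2", "group_a")]))
```

A second formation event for the same group is reported as a deviation for “group_a”; it may still be stored, for instance as a competing opinion from another source.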

Naming Rules

We have applied the following naming rules:

· Classes are named using initial capitals, preceded by “E” (like “entity”), and an identification number.

· Classes are named using noun phrases (nominal groups).

· Properties are named using lower case letters and are labelled in both directions. They are preceded by “P” (like “property”), and an identification number.

· The direction of properties, and hence their names, are in accordance with the following priority list:

· Events

· Objects

· Actors

· Other

· Property names are to be read from left to right for the domain – range direction, and, in brackets, from right to left for the range – domain direction. Implementers can choose the appropriate name according to the orientation of their property or field attachment.

· Properties are named using verbal phrases. Properties with the character of states are named in present tense, such as “has type”, whereas properties related to events are named in past tense, such as “carried out”.
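The naming rules lend themselves to a mechanical check. The sketch below, in Python, parses labels of the form described above. The regular expressions, the function names and the inverse label “is type of” are assumptions for illustration; the exact layout of labels in the entity and property lists may differ.

```python
import re

# "E" + number, then words with initial capitals (hyphens allowed, e.g. Time-Span).
CLASS_RE = re.compile(r"^E(\d+) ([A-Z][\w-]*(?: [A-Z][\w-]*)*)$")
# "P" + number, a lower-case forward name, optionally the inverse name in brackets.
PROPERTY_RE = re.compile(r"^P(\d+) ([a-z][a-z ]*?)(?: \(([a-z][a-z ]*)\))?$")


def parse_class(label):
    """Return (number, name) or None if the label breaks the class naming rules."""
    m = CLASS_RE.match(label)
    return (int(m.group(1)), m.group(2)) if m else None


def parse_property(label):
    """Return (number, forward_name, inverse_name_or_None), or None on mismatch."""
    m = PROPERTY_RE.match(label)
    return (int(m.group(1)), m.group(2), m.group(3)) if m else None


print(parse_class("E55 Type"))
print(parse_property("P2 has type (is type of)"))  # inverse label is assumed
print(parse_property("P125 used general object"))
```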

Modelling principles

The purpose of the CRM is not to analyse the philosophical substance of the concepts it defines, nor to provide a formal account of whether an item is an instance of one of its classes. Rather, it is to provide a core language that allows for integrating the semantics of heterogeneous data structures, or for developing data structures. The expert must be able to comprehend the meaning of a CRM concept and to decide which data structure elements, or intended meanings in a planned system, are compatible with a CRM concept. We try to restrict the CRM to minimal notions that can safely be standardized.

Minimality

As a model for information integration, the CRM tries to be monotonic under increase of knowledge in an “Open World”: no construct should become invalid if knowledge increases. The CRM does not provide any constraints to “improve” the quality of data produced by scholars and scientists, nor to enforce a certain “truth” for data about the past, such as requiring people to have one father only.

Consequently, there are no properties in the CRM that help to justify classification by one of its classes, in the way that “human DNA” may justify being a “person”. No definition of a CRM class is based on the existence of an instance of some CRM property. For instance, even an “information carrier” may not carry information, such as an empty diskette.

CRM concepts are “primitive”; they cannot be logically derived from other CRM classes and properties.

Siblings of CRM classes and properties under the same superclass are non-exclusive by default. E.g., an object may be a “biological object” and “man-made”. We do not declare complements, such as “former owner”, once we have declared a “current owner”.

Shortcuts

Some properties are declared as “shortcuts” of a path that connects the same domain and range as the respective property, but leads through multiple properties and classes (normally one intermediate class). The declaration denotes that all instances of this path can be seen as instances of the “shortcut” property. The opposite is normally not true: it may not be possible to infer the path from the existence of an instance of the shortcut property. In some cases, it may be possible to infer a path with a hypothetical intermediate node which is uniquely defined by a property, domain and range instance.
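The one-way inference from a path to its shortcut can be sketched as follows in Python. The triple representation, the function name infer_shortcuts and the property names p_first, p_second and p_short are hypothetical; only paths through a single intermediate node are handled.

```python
def infer_shortcuts(triples, shortcut, path):
    """Derive shortcut statements from every complete instance of a two-step path.

    triples is a collection of (subject, property, object) statements;
    path is a pair (first_property, second_property).
    """
    first, second = path
    derived = set()
    for subject, prop, intermediate in triples:
        if prop != first:
            continue
        for subject2, prop2, obj in triples:
            # The second step must start at the intermediate node of the first.
            if prop2 == second and subject2 == intermediate:
                derived.add((subject, shortcut, obj))
    return derived


# Hypothetical data: one complete path and one bare shortcut statement.
data = {
    ("item_1", "p_first", "node_x"),
    ("node_x", "p_second", "item_2"),
    ("item_3", "p_short", "item_4"),
}
print(infer_shortcuts(data, "p_short", ("p_first", "p_second")))
```

The sketch deliberately offers no function for the opposite direction: the statement about item_3 and item_4 does not reveal which intermediate node, if any, connects them.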

Disjointness

Disjoint sets are sets that share no instance. We call two concepts A and B “disjoint” if their extensions should not share instances in any possible world. We have carefully studied which concepts may be disjoint. The possible combinations of CRM concepts are many, the decision is often not possible, and the practical use of such a statement is questionable when the fact is obvious to any expert. There are however two non-obvious cases that are fundamental to the comprehension of the CRM:

· E2 Temporal Entity is disjoint from E77 Persistent Item. Instances of E2 are also called “perdurants”, and instances of E77 “endurants” []. Even though “persistent items” have a limited existence in time, we regard them as fundamentally different, because they preserve their identity between events, such as in the phrase “it is still there”. This position fits the distinctions made in real data structures.

· E18 Physical Stuff is disjoint from E28 Conceptual Object. The distinction is between material and immaterial items, the latter being exclusively man-made. They differ in the way they are produced – incorporating material or not; in the way they participate in events – in one at a time, or in many at a time via multiple physical carriers; in the way they perish – by destruction, or by loss of the last carrier or forgetting.
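The two declared cases can be checked on instance data with a short sketch in Python. The four class labels are taken from the text above; the classes prefixed with “X”, the instance identifiers and the function names are hypothetical and serve only as illustration.

```python
# The two disjointness declarations stated above.
DISJOINT = [
    ("E2 Temporal Entity", "E77 Persistent Item"),
    ("E18 Physical Stuff", "E28 Conceptual Object"),
]


def ancestors(cls, superclasses):
    """Return cls together with all of its direct and indirect superclasses."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(superclasses.get(current, ()))
    return seen


def disjointness_conflicts(instance_classes, superclasses):
    """List (instance, class_a, class_b) for instances under two disjoint classes."""
    conflicts = []
    for instance, classes in sorted(instance_classes.items()):
        closure = set()
        for cls in classes:
            closure |= ancestors(cls, superclasses)
        for a, b in DISJOINT:
            if a in closure and b in closure:
                conflicts.append((instance, a, b))
    return conflicts


# Hypothetical extension classes (prefix "X") and instance data.
superclasses = {
    "X1 Ceremony": ["E2 Temporal Entity"],
    "X2 Statue": ["E18 Physical Stuff"],
}
instances = {
    "item_1": ["X2 Statue"],
    "item_2": ["X1 Ceremony", "E77 Persistent Item"],
}
print(disjointness_conflicts(instances, superclasses))
```

Classes that are not listed in DISJOINT may freely share instances, in line with the rule that sibling classes are non-exclusive by default.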

About Types

Virtually any cultural data record begins with an object identifier and the “type” of the described item. Often such a field is analysed into “Classification”, “Category”, “Object Name”, “Role” etc. These terms all refer to classes or categories of items on different levels of specialization and in different contexts, i.e. they declare the respective item to be an instance of this class. In the CRM we found that we do not create any ambiguity if we describe them all by one term: E55 Type. So E55 Type is actually a class of classes, a metaclass.

On the other hand, creating a record in a table “Object”, for instance, also declares the item to be an instance of a class “Object”. The practical difference is that the declared type has no implications for the data structure used, whereas the creation of a record has. The CRM describes data structure semantics. Therefore we follow this practice in the CRM, declaring as CRM concepts only those classes that have a declared relationship (property) to another CRM class, except if the class is needed to group or link other CRM classes (such as E13 Attribute Assignment or E21 Person). Consequently we endow all CRM classes with the property “P2 has type”, which allows the classification of any item to be refined to any level of detail. This is the link of the CRM to terminological systems, frequently also provided in the form of thesauri or ontologies. Those are not the target of the CRM.

In an isolated relational database, the table is always the most general class that is assigned to an item. The declared types form an IsA (subclass) hierarchy below the table level, so there is no conflict between the types and the table. In object-oriented systems like the CRM, the classes (corresponding to tables) form an IsA hierarchy themselves. Therefore any type hierarchy used for CRM-compatible data must be an extension of the IsA hierarchy of CRM classes. This applies equally to types of properties.

E55 Type is also the range of many properties, such as “P125 used general object”. These properties, except for “P2 has type”, declare a kind of general knowledge about an object that is quite frequent in cultural data, such as “this object was produced by a mold”, meaning that there has been an instance of “mold” that was actually used. This information allows the object to be connected to all those that are of type (“P2 has type”) “mold”. This consistent treatment of general (metaclass) knowledge gives the CRM a particular power and is one of the keys to integrating cultural knowledge. However, in order not to overload this standard with a complex theory, we do not formally express a constraint like “the range of P125 is restricted to types of E70 Stuff”, even though it is understood that each CRM class corresponds to a respective subclass of E55 Type.

Finally, types play an extraordinary role in the history of the human mind. They are intellectual products, objects of our discourse, and their history and justification by physical evidence is a target of documentation, particularly in archaeology and Natural History. Therefore the CRM regards them as “conceptual objects”, parallel to their structural role. The CRM elegantly integrates both aspects in a way adequate to cultural data and Natural History documentation.

Completeness

Of necessity, some concepts are less thoroughly elaborated than others: “E39 Actor” and “E30 Right”, for example. This is a natural consequence of focussing on specific functionality in an intrinsically unlimited field. These ‘underdeveloped’ concepts can be considered as hook-in points for extensions compatible with the model. However, even without these extensions, the CRM is ‘complete’ in that, through the use of free text fields (“has note”), it allows information to be captured that is not modelled explicitly. Indeed, some information has deliberately not been developed into formal properties. This approach is preferable when detailed, targeted queries are not expected: a good text description, a drawing or a diagram often provides a better source of information than formally encoded knowledge. In general, only those concepts on which formal querying is required need to be made explicit, rather than all the information which needs to be stored and retrieved.

Extensions

The CRM has been designed to be extensible, acknowledging the fact that the intended scope of the CIDOC CRM is not finite. This only makes sense in conjunction with a notion of compatibility, such that data described by an extension of the CRM can still be regarded as valid instances of the CRM. In practical terms this means that queries which the CRM concepts allow to be answered on a set of CRM instance data can also be answered on the extension data (query enclosure, []). Note that we talk about semantics, not about formalisms. For instance, a query “list all events” is only correctly answered if it returns everything that the CRM experts regard as an event, not only what an extension may have put under “event”.

A sufficient condition for compatibility of an extension is that CRM classes subsume all classes of the extension, and that all properties of the extension are either subsumed by CRM properties or are part of a path for which a CRM property is a shortcut. Obviously, whether such a condition holds can only be judged intellectually. The user, who is satisfied with the answer or not, has the last word.
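The structural part of this condition can be sketched in Python, with the caveat stated above that the real judgement is intellectual: the code only inspects what an extension declares, not whether the declarations are semantically justified. All names below (unsubsumed_classes, unanchored_properties, the example classes and properties) are hypothetical, and subsumption of properties is checked one level deep only.

```python
def unsubsumed_classes(crm_classes, ext_superclasses):
    """Return extension classes that no CRM class subsumes.

    crm_classes: set of CRM class names.
    ext_superclasses: extension class name -> list of declared superclasses.
    """
    def reaches_crm(name, seen=frozenset()):
        if name in crm_classes:
            return True
        if name in seen:  # guard against cyclic declarations
            return False
        return any(reaches_crm(s, seen | {name})
                   for s in ext_superclasses.get(name, ()))

    return sorted(n for n in ext_superclasses if not reaches_crm(n))


def unanchored_properties(crm_properties, ext_properties):
    """Return extension properties with neither a CRM superproperty nor a shortcut.

    ext_properties: name -> (declared superproperty or None,
                             CRM shortcut whose path contains it, or None).
    """
    return sorted(
        name for name, (superproperty, shortcut) in ext_properties.items()
        if superproperty not in crm_properties and shortcut not in crm_properties
    )


# Hypothetical CRM fragment and extension.
crm_classes = {"Event", "Actor"}
crm_properties = {"participated"}
ext_classes = {
    "Excavation": ["Event"],
    "Trench Opening": ["Excavation"],
    "Soil Sample": [],
}
ext_properties = {
    "supervised": ("participated", None),
    "sponsored": (None, None),
}
print(unsubsumed_classes(crm_classes, ext_classes))
print(unanchored_properties(crm_properties, ext_properties))
```

An empty result from both functions indicates only that the declarations satisfy the sufficient condition in form. In this example “Soil Sample” and “sponsored” would need to be anchored before a query such as “list all events” could be expected to enclose the extension data.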

Examples

Fig. 1 reasoning about spatial information

The diagram above shows a partial view of the CRM representing spatial information. Five of the main hierarchy branches are included in this view: Actor, Contact Point, Appellation, Place, and Physical Stuff. The relationships between these main classes and their subclasses are shown as branching lines. Properties between classes are shown as green ovals. A ‘shortcut’ property is included in this view: has section (is located on or within) between Place and Physical Object is a shortcut of the path through Section Definition. In some cases the order of priority for property names has been modified in order to facilitate reading the model from left to right.

As can be seen, a Place is identified by a Place Appellation, which may be an Address, Spatial Coordinates, a Place Name, or a Section Definition such as ‘basement’, ‘prow’, or ‘lower left-hand corner’. A Place may consist of or form part of another place, thereby allowing a hierarchy of physical ‘containers’ to be constructed.

An Address can be considered both as a Place Appellation – a way of referring to a place – and as a Contact Point for an Actor. An Actor may have any number of Contact Points.

An interesting aspect of the model is the defines section property between Section Definition and Physical Object (and the corresponding shortcut from Place to Physical Object). This effectively means that a section of a physical object is the reference for a place. We may know, for example, that Nelson died on a particular spot on the Victory, without being able to locate the exact position of the vessel in geospatial terms. Similarly, a signature or inscription can be located ‘on the lower right-hand corner of’ a painting, regardless of where the painting is hanging.

Fig. 2 reasoning about temporal information

This second example shows how the model handles temporal information. Four of the main hierarchy branches are included in this view: Temporal Entity, Time-Span, Appellation and Place. The Temporal Entity class serves to group together all classes which have a temporal component, such as historical Periods, Events and Condition States. Typically, Periods and Events are identified by a name or Period Appellation. A Time-Span is simply a temporal interval that does not make any reference to cultural or geographical contexts, unlike Periods, which take place at a particular Place. Time-Spans are sometimes named, generally by reference to Dates. Time Appellations differ from Period Appellations in that the latter refer to a Period within a geo-cultural context while the former are purely temporal, a distinction which is often hard to recognise in natural language. Both Time-Span and Period have the reflexive properties consists of and falls within. Both of these allow part-whole hierarchies to be constructed. The distinction between the two types of property is that in the first case the whole is thought to be composed of or defined by its parts, whereas in the second the relationship is merely contingent. An example might be a period of national celebration, which could be said to be composed of the individual events, whereas the construction of a building might simply fall within the period of a particular government.

The Entity and Property List

The following is the list of all entities and properties contained in the model. It consists of an index of entities and an index of properties, followed by the complete list of entity declarations and the complete list of property declarations. The list is ordered by the unique identifiers, which have been assigned in historical order from version 2 on.

The entity index has the following format:

· Unique identifier consisting of the letter “E” for “entity” and a number.

· A series of “-” (minus) symbols indicating the depth in the IsA hierarchy.

· The English name of the entity itself.

· The index is ordered by hierarchic level, in a “depth first” manner, from the smaller to the larger subhierarchies, and alphabetically between equal siblings.

· Entities that reappear at another position in the hierarchy due to multiple inheritance are marked using italics.
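The ordering rules of the entity index can be sketched as a depth-first traversal in Python. Everything specific here is an assumption for illustration: the function name entity_index, the identifiers E901 to E905 and their names, the exact spacing of the “-” symbols, and the use of asterisks in place of italics for entities that reappear.

```python
def entity_index(names, children, root):
    """Return index lines in depth-first order.

    names: identifier -> English name.
    children: identifier -> list of immediate subclass identifiers.
    """
    def size(node):
        # Number of entries in the subhierarchy rooted at node.
        return 1 + sum(size(c) for c in children.get(node, ()))

    lines, seen = [], set()

    def visit(node, depth):
        label = names[node]
        if node in seen:
            label = f"*{label}*"  # stands in for italics in plain text
        seen.add(node)
        lines.append(f"{node} {'- ' * depth}{label}")
        # Smaller subhierarchies first, then alphabetically by name.
        ordered = sorted(children.get(node, ()),
                         key=lambda c: (size(c), names[c]))
        for child in ordered:
            visit(child, depth + 1)

    visit(root, 0)
    return lines


# Hypothetical hierarchy; E905 has two superclasses (multiple inheritance).
names = {"E901": "Thing", "E902": "Event", "E903": "Object",
         "E904": "Birth", "E905": "Relic"}
children = {"E901": ["E902", "E903"], "E902": ["E904", "E905"],
            "E903": ["E905"]}
for line in entity_index(names, children, "E901"):
    print(line)
```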

The property index has the following format:

· Unique identifier consisting of the letter “P” for “property” and a number.

· A series of “-” (minus) symbols indicating the depth in the IsA hierarchy.

· The English name of the property itself, followed in parentheses by its name for reading it in the inverse direction.

· The “domain” entity in which it is declared.

· The “range” entity to which it points.

· The index is ordered by hierarchic level, in a “depth first” manner, from the smaller to the larger subhierarchies, and by property number between equal siblings.

· Properties that reappear at another position in the hierarchy due to multiple inheritance are marked using italics.

Entity declarations use the following format:

· Entity names (terms) are presented as headings in bold face, preceded by the unique identifier.

· The line “Subclass of:” declares the superclass of the entity, from which it inherits properties.

· The line “Superclass of:” is a cross-reference to the following subclasses of this entity.

· The line “Scope note” contains the textual definition of the concept the entity represents.

· The title “Properties” announces the list of properties.

· Each property is represented by its unique identifier, its forward and backward name, and the entity it links to, separated by colons.

· Inherited properties are not represented.

· Properties of properties are given in an indented position in parentheses under the respective property.

Property declarations use the following format:

· Property names are presented as headings in bold face, preceded by the unique identifier.

· The line “Domain:” declares the entity for which this property is defined.

· The line “Range:” declares the entity to which this property points, or which provides the values for this property.

· The line “Superclass of:” is a cross-reference to the following subproperties of this property.

· The line “Cardinality:” declares the possible number of occurrences for an individual entity. Possible values are: 1:many, many:many, many:1.

· The line “Scope note” contains the textual definition of the concept the property represents.

· The title “Properties” announces the list of properties of properties.

· Each property of a property is represented by a unique identifier relative to the property for which it is defined, its forward and backward name, and the entity it links to, separated by colons.

· Inherited properties are not represented.