Definition of the CIDOC object-oriented Conceptual Reference Model and Crossreference Manual
This page is the introductory page of Definition of CIDOC object-oriented Conceptual Reference Model and Crossreference Manual
This document is a formal definition of the object-oriented CIDOC Conceptual
Reference Model (referred to in the following as the “CRM”). It is the
result of work done by the CIDOC Documentation Standards Group from 1994 to 2000
and by the CIDOC CRM Special Interest Group from 2000 to 2002, following an
initiative to define the underlying semantics of database schemata and document
structures needed in museum documentation for the support of good practice in
conceptual modelling, data transformation, data exchange, information
integration and mediation of heterogeneous sources.
The intended scope should be understood as the
domain that the CRM would ideally aim to cover, given sufficient time and
resources, and is expressed as a definition of principles. The practical scope
is, necessarily, a subset of the intended scope. The intended scope is
difficult to define with the same degree of precision as the practical scope
since it depends on concepts such as "cultural heritage" which are
themselves complex and difficult to define. The objectives provided by the
intended scope are important, however, since they allow appropriate sources to
be selected for inclusion in the practical scope. The practical scope is
expressed in terms of the reference documents and sources that have been used
in its elaboration. The CRM covers the same domain as these reference sources
([x]…[y]). This means that data encoded following one of those sources can be
transformed or integrated into a CRM-compatible form without loss of meaning,
as long as the meaning remains within the intended scope of the CIDOC CRM.
The intended scope of the CRM may be defined as all information required
for the scientific documentation of cultural heritage collections, with a view
to enabling wide area information exchange and integration of heterogeneous
sources. This definition requires some explanation:
·
The term scientific documentation is intended to convey the requirement
that the depth and quality of descriptive information which can be handled
by the CRM should be sufficient for serious academic research into a given
field and not merely that required for casual browsing. This does not mean
that information intended for presentation to members of the general public
is excluded, but rather that the CRM is intended to provide the level of detail
and precision expected and required by museum professionals and researchers
in the field.
·
The term cultural heritage collections
is intended to cover all types of material collected and displayed by museums
and related institutions, as defined by ICOM [1]. This includes collections,
sites and monuments relating to natural history, ethnography, archaeology,
historic monuments, as well as collections of fine and applied arts. The exchange
of relevant information with libraries and archives, and the harmonisation
of the CRM with their models, fall within the CRM's intended scope.
·
The documentation of collections is intended to encompass the detailed
description both of individual items within collections and of groups
of items and collections as a whole.
The scope of the CRM is the curated knowledge of museums. Information required
solely for the administration and management of cultural heritage institutions, such as information relating to
personnel, accounting, and visitor statistics, falls outside the intended
scope.
·
The CRM is specifically intended to cover contextual information: the historical,
geographical and theoretical background in which individual items are placed
and which gives them much of their
significance and value.
·
The goal of enabling information exchange and integration between heterogeneous
sources determines the constructs and level of detail of the CRM. It also
determines its perspective, which is necessarily supra-institutional and abstracted
from any specific local context.
·
The CRM aims to leverage contemporary technology while enabling communication
with legacy systems.
The CRM is a domain ontology in the sense used in computer science[]. As
such, the model is designed to be explanatory and extensible rather than
prescriptive and restrictive. Currently, no specific formalism for semantic
models has been widely accepted as standard; nevertheless, the semantic
deviations between the various available models are minimal. Consequently, the
model has been formulated as an object-oriented semantic model[], which can
easily be converted into other object-oriented models. It is our intention that
this presentation format should be both natural and expressive for domain
experts, and easily converted to other machine-readable formats such as RDF
and XML. The definition itself, as presented here, is not enough to fully
comprehend the model and its application. The use of rich specialization
hierarchies gives rise to a complex set of inherited properties and cross-references.
Thus this relatively compact definition of about 220 elements corresponds to
several thousand properties of the declared classes. A full set of direct and
inherited declarations can be generated automatically from this definition and
made available as a separate document.
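The expansion from direct to inherited declarations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hierarchy and most property labels below are hypothetical toy data (only “has type” and “has note” are labels that occur in this document), and the helper names `all_superclasses` and `all_properties` are invented for the example.

```python
# Toy IsA hierarchy: class name -> immediate superclasses.
# Hypothetical example data, not the actual CRM hierarchy.
SUPERCLASSES = {
    "Entity": [],
    "Temporal Entity": ["Entity"],
    "Event": ["Temporal Entity"],
    "Birth": ["Event"],
}

# Properties declared directly on each class (illustrative labels).
DIRECT_PROPERTIES = {
    "Entity": ["has type", "has note"],
    "Temporal Entity": ["has time-span"],
    "Event": ["took place at"],
    "Birth": ["by mother"],
}


def all_superclasses(cls):
    """Return the set of direct and indirect superclasses of cls."""
    found = set()
    pending = list(SUPERCLASSES[cls])
    while pending:
        current = pending.pop()
        if current not in found:
            found.add(current)
            pending.extend(SUPERCLASSES[current])
    return found


def all_properties(cls):
    """Return direct plus inherited properties of cls, sorted by label."""
    labels = set(DIRECT_PROPERTIES.get(cls, []))
    for ancestor in all_superclasses(cls):
        labels.update(DIRECT_PROPERTIES.get(ancestor, []))
    return sorted(labels)
```

Calling `all_properties("Birth")` collects the one label declared on the toy class itself together with the four declared further up, which is how a compact set of declarations grows into a much larger set of usable properties.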
From the various terminologies in use for object-oriented models, we
have selected the following for ease of understanding by non-computer experts.
They are motivated by the terminology of RDF (Resource Description Framework),
a recommendation of the W3C. We use a slightly more constrained semantic model than
RDF, a subset of TELOS [], which is compatible with many formalisms and can be
implemented on a large variety of tools. We use:
“class” for a concept
that may be called “class”, “individual class”, “entity” or “node”, in contrast
to properties. It denotes a category of items whose definition does not
depend on the existence of other categories, such as “physical object”, in
contrast to “carried out”, which requires persons and activities to make sense.
A class plays a role similar to a grammatical noun. A class is characterized by
an intension, which is conveyed by a scope note.
A “scope note” is not a definition in absolute terms but a text
making clear to the user the relation of the class to known concepts of the
domain. A class is associated in real life with a set of instances, the extension.
This set is open and unknown – we do not know all instances of a concept in the
current world or the past, nor can we foresee the future. No
concept in the CRM is defined by its extension (such as enumeration sets);
however, references to good examples are used to clarify the intension.
“property” for a concept that may be
called “attribute”, “reference”, “link”, “role” or “property”. A property denotes
a category of items whose definition depends on the existence of two
other items, such as “carried out”, which requires some persons and an action
to make sense, in contrast to “physical object”, which is defined on its own.
It plays a role similar to a verb, which stands between a grammatical subject
and an object. In the manner of semantic networks, we do not distinguish
between internal attributes and references. Rather, every value associated with
an instance of a class is regarded as a reference to that value, which in turn
is an instance of another class. The property is characterized by an intension,
which is conveyed by a scope note. It is also associated with an extension.
For instance, “my role in the writing of this document” is regarded as an instance
of “carried out”. Properties may themselves have properties, which point to other
entities. In the CRM this is used for dynamically specializing properties, such
as in the case of multiple potential kinds of roles.
“domain” for the class a property is defined for – like a
grammatical subject. It is the most specialized class meaningful to the expert
that comprises all potential uses of the property. We allow only one
domain per property. Note that it may always contain instances for which the
property is not applicable.
“range” for the class a property
refers to – like a grammatical object. It is the most specialized class meaningful
to the expert that comprises all potential values of the property. We allow
only one range per property. Note that it may always contain
instances for which the property is not applicable. Note also that the difference
between range and domain is not substantial but conventional. We always give in
parentheses the name the property has if domain and range are interchanged,
frequently simply the passive voice. Hence which one is the domain and which one the
range depends only on the name of the property, such as “carried out” versus
“was carried out by”.
“Superclass-Subclass” for “IsA” relations, which may be called “superclass – subclass”,
“parent class - derived class”, “generalization - specialization”,
“genus-species”, “subsumes – is subsumed by”. An IsA relation between classes,
such as “a birth IsA event”, denotes that the subclass has all properties
of its superclasses (strict inheritance), and some more of its
own, and that the superclass comprises all instances of all subclasses and some
more of its own. As domain and range are interchangeable,
inheritance holds for both. An instance of a subclass of a domain of a property
can use this property, and any property instance can refer to an instance of a
subclass of the range of this property. A
class may have more than one immediate superclass (multiple
inheritance). The “IsA” relation holds also for properties. E.g.
“carried out” is a subproperty of “participated”. If a property A is a subproperty
of a property B, we require that the domain of A is a subclass of the domain of B, and
the range of A is a subclass of the range of B.
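The requirement on subproperties can be checked mechanically once classes and properties are held as data. The following Python sketch is an illustration only. The class hierarchy and the domains and ranges assigned to “participated” and “carried out” are assumptions made for the example, the names `Property`, `is_subclass` and `respects_subproperty_rule` are hypothetical, and a class is treated here as a subclass of itself.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative IsA links: class name -> immediate superclasses.
ISA = {
    "Entity": [],
    "Event": ["Entity"],
    "Activity": ["Event"],
    "Actor": ["Entity"],
    "Person": ["Actor"],
}


def is_subclass(sub, sup):
    """True if sub is sup itself or reaches sup through IsA links."""
    if sub == sup:
        return True
    return any(is_subclass(parent, sup) for parent in ISA[sub])


@dataclass(frozen=True)
class Property:
    name: str
    domain: str
    range_: str
    parent: Optional["Property"] = None  # the superproperty, if any


def respects_subproperty_rule(prop):
    """Check the domain and range of prop against its superproperty."""
    if prop.parent is None:
        return True
    return (is_subclass(prop.domain, prop.parent.domain)
            and is_subclass(prop.range_, prop.parent.range_))


# Assumed domains and ranges, chosen only to exercise the rule.
participated = Property("participated", "Actor", "Event")
carried_out = Property("carried out", "Person", "Activity", parent=participated)
too_wide = Property("too wide", "Entity", "Event", parent=participated)
```

Here `carried_out` passes the check, while `too_wide` fails because its domain is more general than the domain of its superproperty.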
We use cardinality constraints of properties (e.g. one-to-many).
They are not implementation recommendations. They only serve semantic
clarification. As the model is designed to compile alternative
opinions and incomplete information, all properties should be
implemented as multi-valued and optional, unless more complex relationships to
information sources are used. Possible values are:
“1:many”: an individual domain
instance can have zero or more such properties, but no value can be
referred to by more than one domain instance (“fan-out”). E.g. one Formation Event may form
many Groups, but one Group is formed by only one Formation Event (P95).
“many:1”: an individual domain
instance can have zero or one such property, but one value can be
referred to by more than one domain instance (“fan-in”). E.g. one Birth can be by only one
mother, but a mother may have had many births (P96).
“many:many”: unconstrained.
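Because the constraints serve semantic clarification only, a tool would report deviations rather than reject data. The Python sketch below illustrates that reading. The function name `cardinality_warnings` and the instance labels are hypothetical, and property instances are assumed to be available as simple (domain instance, range instance) pairs.

```python
from collections import Counter

CARDINALITIES = {"1:many", "many:1", "many:many"}


def cardinality_warnings(cardinality, pairs):
    """Report instances that do not fit the declared cardinality.

    pairs is an iterable of (domain_instance, range_instance) tuples,
    one per property instance. Nothing is rejected; the result is only
    a sorted list of instances worth a second look.
    """
    if cardinality not in CARDINALITIES:
        raise ValueError(f"unknown cardinality: {cardinality!r}")
    unique_pairs = set(pairs)
    if cardinality == "1:many":
        # Fan-out: a value may be referred to by at most one domain instance.
        counts = Counter(value for _, value in unique_pairs)
    elif cardinality == "many:1":
        # Fan-in: a domain instance may have at most one value.
        counts = Counter(subject for subject, _ in unique_pairs)
    else:
        return []
    return sorted(item for item, n in counts.items() if n > 1)


# Hypothetical data: two formation events both claim to have formed group 1.
formed = [("formation A", "group 1"), ("formation A", "group 2"),
          ("formation B", "group 1")]
# Hypothetical data: two births by the same mother, which many:1 permits.
by_mother = [("birth 1", "mother X"), ("birth 2", "mother X")]
```

With this data, checking `formed` against “1:many” flags “group 1”, which may simply record two competing opinions about the group's formation, while `by_mother` raises no warning under “many:1”.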
We have applied the following naming rules:
·
Classes are named using initial capitals, preceded by “E” (like “entity”),
and an identification number.
·
Classes are named using noun phrases (nominal groups).
·
Properties are named using lower case letters and are labelled in both
directions. They are preceded by “P” (like “property”), and an identification
number.
·
The direction of properties, and hence their names, are in accordance with
the following priority list:
·
Events
·
Objects
·
Actors
·
Other
·
Property names are to be read from left to right for the domain – range
direction, and, in brackets, from right to left, for the range – domain direction.
Implementers can choose the appropriate name according to the orientation
of their property of field attachment.
·
Properties are named using verbal phrases. Properties with the character
of states are named in present tense, such as “has type”, whereas properties
related to events are named in past tense, such as “carried out”.
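These naming rules are regular enough to be parsed. The Python sketch below is one possible reading of them, not a normative grammar: the two regular expressions and the function names are assumptions, and the identifier “P0” in the usage note is a placeholder rather than a real CRM property number.

```python
import re

# "E" + number, then a name that starts with a capital letter.
CLASS_LABEL = re.compile(r"E(?P<number>\d+) (?P<name>[A-Z].*)")

# "P" + number, a lower-case forward name, and optionally the name for the
# inverse direction in parentheses (brackets).
PROPERTY_LABEL = re.compile(
    r"P(?P<number>\d+) (?P<forward>[a-z][^()]*?)(?: \((?P<inverse>[^()]+)\))?"
)


def parse_class(label):
    """Return (number, name) or None if the label breaks the naming rules."""
    match = CLASS_LABEL.fullmatch(label)
    if match is None:
        return None
    return int(match["number"]), match["name"]


def parse_property(label):
    """Return (number, forward_name, inverse_name) or None.

    inverse_name is None when the label gives no name in parentheses.
    """
    match = PROPERTY_LABEL.fullmatch(label)
    if match is None:
        return None
    return int(match["number"]), match["forward"], match["inverse"]
```

For example, `parse_class("E21 Person")` yields the number and name of the class, and `parse_property("P0 carried out (was carried out by)")` separates the forward name from the name to use when the property is read from range to domain.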
The purpose of the CRM is not to analyse the philosophical substance of
the concepts it defines, nor to provide a formal account of whether an item is
an instance of one of its classes. Rather, it is to provide a core language that
allows for integrating the semantics of heterogeneous data structures, or for
developing data structures. The expert must be able to comprehend the meaning of a
CRM concept and decide which of their data structure elements or intended
meanings in a planned system are compatible with a CRM concept. We try to
restrict the CRM to minimal notions that can safely be standardized.
As a model for information integration, the CRM tries to be monotonic
under increase of knowledge in an “Open World”: no construct
should become invalid if knowledge increases. The CRM does not provide any
constraints to “improve” the quality of data produced by scholars and scientists,
nor to enforce a certain “truth” for data about the past, such as requiring
people to have one father only.
Consequently, there are no properties in the CRM that help to justify
classification by one of its classes, in the way that “human DNA” may justify being a
“person”. No definition of a CRM class is based on the existence of an instance
of some CRM property. For instance, even an “information carrier” may not carry
information, such as an empty diskette.
CRM concepts are “primitive”; they cannot be logically
derived from other CRM classes and properties.
Siblings of CRM classes and properties under the same superclass are non-exclusive
by default. E.g., an object may be a “biological object” and “man-made”. We do
not declare complements, such as “former owner”,
once we have declared a “current owner”.
Some properties are declared as “shortcuts” of a path that
connects the same domain and range as the respective property, but leads
through multiple properties and classes (normally one intermediate class). The
declaration denotes that all instances of this path can be seen as instances
of the “shortcut” property. The opposite is normally not true: it may not be
possible to infer the path from the existence of an instance of the shortcut
property. In some cases, it may be possible to infer a path with a hypothetical
intermediate node which is uniquely defined by a property, domain and range
instance.
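The one-directional nature of a shortcut can be illustrated by composing the two steps of a path. The Python sketch below assumes that property instances are held as pairs. The function name `derive_shortcut` and the instance labels are hypothetical, and the path used (from a Place through a Section Definition to a Physical Object) is only one reading of the spatial example given later in this document.

```python
def derive_shortcut(first_step, second_step):
    """Compose a two-step path into instances of the shortcut property.

    first_step:  set of (domain_instance, intermediate_instance) pairs
    second_step: set of (intermediate_instance, range_instance) pairs
    """
    targets_by_intermediate = {}
    for intermediate, target in second_step:
        targets_by_intermediate.setdefault(intermediate, set()).add(target)
    return {
        (source, target)
        for source, intermediate in first_step
        for target in targets_by_intermediate.get(intermediate, ())
    }


# Hypothetical instance data for a Place, a Section Definition
# and a Physical Object.
place_to_section = {("place 1", "section definition: prow")}
section_to_object = {("section definition: prow", "ship 1")}

shortcut_instances = derive_shortcut(place_to_section, section_to_object)
```

Every complete path yields a shortcut instance, but the function has no inverse: a pair such as `("place 1", "ship 1")` on its own does not say which intermediate node it passed through.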
Disjoint sets are sets that share no instance. We call two concepts A
and B “disjoint” if their extensions should not share instances in any
possible world. We have carefully studied which concepts may be disjoint. The
possible combinations of CRM concepts are many, the decision is often not
possible, and the practical use of such a statement is questionable when the
fact is obvious to any expert. There are, however, two non-obvious cases that
are fundamental to the comprehension of the CRM:
·
E2 Temporal Entity is
disjoint from E77 Persistent Item. Instances of E2 are also called “perdurants”,
and instances of E77 “endurants” []. Even though “persistent items” have
a limited existence in time, we regard them as fundamentally different, because
they preserve their identity between events, such as in the phrase “it is still
there”. This position fits the distinctions made in real data structures.
·
E18 Physical Stuff is
disjoint from E28 Conceptual Object. The distinction is between material and
immaterial items, the latter being exclusively man-made. They differ in the way
they are produced – incorporating material or not; in the way they participate
in events – in one at a time, or in many at a time via multiple physical
carriers; in the way they perish – by destruction, or by loss of the last
carrier or forgetting.
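The two disjointness statements can be used as a consistency check on instance data. The Python sketch below is an illustration under assumptions: the IsA links are a simplified excerpt with intermediate classes omitted, and the names `member_classes` and `disjointness_conflicts` are hypothetical.

```python
# The two disjoint pairs named in the text.
DISJOINT_PAIRS = [
    ("E2 Temporal Entity", "E77 Persistent Item"),
    ("E18 Physical Stuff", "E28 Conceptual Object"),
]

# Simplified IsA links (class -> immediate superclasses); intermediate
# classes of the real hierarchy are left out for brevity.
ISA = {
    "E2 Temporal Entity": [],
    "E77 Persistent Item": [],
    "E18 Physical Stuff": ["E77 Persistent Item"],
    "E28 Conceptual Object": ["E77 Persistent Item"],
    "E55 Type": ["E28 Conceptual Object"],
}


def member_classes(asserted_classes):
    """All classes an item belongs to, given its directly asserted ones."""
    result = set()
    pending = list(asserted_classes)
    while pending:
        cls = pending.pop()
        if cls not in result:
            result.add(cls)
            pending.extend(ISA[cls])
    return result


def disjointness_conflicts(asserted_classes):
    """Return the disjoint pairs that the asserted classes violate."""
    members = member_classes(asserted_classes)
    return [pair for pair in DISJOINT_PAIRS
            if pair[0] in members and pair[1] in members]
```

An item asserted to be both an “E55 Type” and an “E18 Physical Stuff” is reported against the second pair, because strict inheritance makes every type a conceptual object.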
Virtually any cultural data record begins with an
object identifier and the “type” of the described item. Often such a field is
analysed into “Classification”, “Category”, “Object Name”, “Role” etc. These
terms all refer to classes or categories of items on different levels of
specialization and in different contexts. I.e., they declare the respective
item to be an instance of this class. In the CRM we found that we do not
create any ambiguity if we describe them all by one term: E55 Type. So E55 Type
is actually a class of classes, a metaclass.
On the other hand, creating a record in a table
“Object”, for instance, also declares the item to be an instance of a class
“Object”. The practical difference is that the declared type has no
implications for the data structure used, whereas the creation of a record has.
The CRM describes data structure semantics.
Therefore we follow this practice in the CRM, declaring as CRM concepts only
those classes that have a declared relationship (property) to another CRM class,
except if the class is needed to
group or link other CRM classes (such as E13 Attribute Assignment or E21
Person). Consequently we endow all CRM classes with the property “P2 has type”,
which allows for refining the classification of any item to any level of detail.
This is the link of the CRM to terminological systems, frequently
provided in the form of thesauri or ontologies as well. Those are not
the target of the CRM.
In an isolated relational database, the table is
always the most general class that is assigned to an item. The declared types
form an IsA (subclass) hierarchy below the table level. So there is no conflict
between the types and the table. In object-oriented systems like the CRM, the
classes (corresponding to tables) form an IsA hierarchy themselves. Therefore
any type hierarchy used for CRM-compatible data
must be an extension of the IsA hierarchy of CRM classes. This
applies equally to types of properties.
E55 Type is also a range of many properties, such as
“P125 used general object”. These properties, except for “P2 has type”, declare
a kind of general knowledge about an object, quite frequent in cultural data,
such as “this object was produced by a mold”, meaning that there has been an
instance of “mold” that was actually used. This information allows the object
to be connected to all those that are of type (“P2 has type”) “mold”. This
consistent treatment of general (metaclass) knowledge gives the CRM a
particular power, one of the keys to integrating cultural knowledge. However, in order
not to overload this standard with a complex theory, we do not express formally
a constraint like: “the range of P125 is restricted to types of E70 Stuff”,
even though it is understood that each CRM class corresponds to a respective
subclass of E55 Type.
Finally, types play an extraordinary role in the
history of the human mind. They are intellectual products, objects of our
discourse, and their history and justification by physical evidence is a target
of documentation, particularly in archaeology and Natural History. Therefore
the CRM regards them as “conceptual objects”, parallel to their structural
role. The CRM elegantly integrates both aspects in a way adequate to cultural
data and Natural History documentation.
Of necessity, some concepts are less thoroughly elaborated than others:
“E39 Actor” and “E30 Right”, for example. This is a natural consequence of
focussing on specific functionality in an intrinsically unlimited field. These
‘underdeveloped’ concepts can be considered as hook-in points for extensions
compatible with the model. However, even without these extensions, the CRM is
nevertheless ‘complete’ in that, through the use of free text fields (“has
note”), it allows information to be captured that is not modelled explicitly. Indeed,
some information has deliberately not been developed into formal
properties. This approach is preferable when detailed, targeted queries are not
expected: a good text description, a drawing or diagram often provides a better
source of information than formally encoded knowledge. In general, only those
concepts on which formal querying is required need to be made explicit -
rather than all the information which needs to be stored and retrieved.
The CRM has been designed to be extensible, in recognition of the fact
that the intended scope of the CIDOC CRM is not finite. This only makes sense
in conjunction with a notion of compatibility, such that data described
by an extension of the CRM can still be regarded as valid instances of the CRM.
In practical terms this means that queries which the CRM concepts allow to be answered on
a set of CRM instance data can also be answered on the extension data (query
enclosure, []). Note that we talk about semantics, not about formalisms.
For instance, a query “list all events” is only correctly answered if it
returns everything that the CRM experts regard as an event, not only what an
extension may have put under “event”.
A sufficient condition for compatibility of an extension
is that CRM classes subsume all classes of the extension, and all properties
of the extension are either subsumed by CRM properties or are part of a path
for which a CRM property is a shortcut. Obviously, whether such a condition holds
can only be judged intellectually. The user, who is satisfied with the answer or
not, has the last word.
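The class half of this condition, and the “list all events” query, can be illustrated on declared IsA links. The Python sketch below is an assumption-laden illustration: the core excerpt, the extension classes and all function names are hypothetical, and a passing check shows only that the declarations line up, not that the semantics do, which remains an intellectual judgement.

```python
# Hypothetical excerpt of the core model: class -> immediate superclasses.
CORE = {"Entity": [], "Event": ["Entity"], "Actor": ["Entity"]}

# Classes declared by a hypothetical local extension.
EXTENSION = {
    "Excavation": ["Event"],
    "Survey Campaign": ["Excavation"],
    "Budget Line": [],  # not attached to any core class
}


def superclasses(cls, hierarchy):
    """Return all direct and indirect superclasses of cls."""
    found = set()
    pending = list(hierarchy[cls])
    while pending:
        current = pending.pop()
        if current not in found:
            found.add(current)
            pending.extend(hierarchy[current])
    return found


def unsubsumed_classes(core, extension):
    """Extension classes that no core class subsumes."""
    merged = {**core, **extension}
    return sorted(cls for cls in extension
                  if not superclasses(cls, merged) & set(core))


def instances_of(cls, typed_items, hierarchy):
    """Answer 'list all cls': items whose class is cls or a subclass of it."""
    return sorted(item for item, item_cls in typed_items.items()
                  if item_cls == cls or cls in superclasses(item_cls, hierarchy))


MERGED = {**CORE, **EXTENSION}
ITEMS = {"dig 1": "Excavation", "battle 1": "Event", "curator 1": "Actor"}
```

Asking `instances_of("Event", ITEMS, MERGED)` returns the item typed by the extension class as well as the one typed directly as an event, while `unsubsumed_classes(CORE, EXTENSION)` points at the one class that would break compatibility.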
Fig. 1: reasoning about spatial information
The diagram above shows a partial view of the CRM
representing spatial information. Five of the main hierarchy branches are
included in this view: Actor, Contact Point, Appellation, Place, and Physical
Stuff. The relationships between these main classes and their subclasses are
shown as branching lines. Properties between classes are shown as green ovals.
A ‘shortcut’ property is included in this view: has section (is located on or
within) between Place and Physical Object is a shortcut of the path through
Section Definition. In some cases the order of priority for property names has
been modified in order to facilitate reading the model from left to right.
As can be seen, a Place is identified by a
Place Appellation, which may be an Address, Spatial Coordinates, a Place Name,
or a Section Definition such as ‘basement’, ‘prow’, or ‘lower left-hand
corner’. A Place may consist of or form part of another place, thereby allowing
a hierarchy of physical ‘containers’ to be constructed.
An Address can be considered both as a Place
Appellation – a way of referring to a place – and as a Contact Point for an
Actor. An Actor may have any number of Contact Points.
An interesting aspect of the model is the defines section property between Section
Definition and Physical Object (and the corresponding shortcut from Place to
Physical Object). This effectively means that a section of a physical object is
the reference for a place. We may know, for example, that
Nelson died on a particular spot on the Victory, without being able to locate
the exact position of the vessel in geospatial terms. Similarly, a signature or
inscription can be located ‘on the lower right-hand corner of’ a painting,
regardless of where the painting is hanging.
Fig. 2: reasoning about temporal information
This second example shows how the model handles temporal information.
Four of the main hierarchy branches are included in this view: Temporal Entity,
Time-Span, Appellation and Place. The Temporal Entity class serves to group
together all classes which have a temporal component, such as historical
Periods, Events and Condition States. Typically, Periods and Events are
identified by a name or Period Appellation. A Time-Span is simply a temporal
interval that does not make any reference to cultural or geographical contexts,
unlike Periods, which take place at a particular Place. Time-Spans
are sometimes named, generally by reference to Dates. Time Appellations differ
from Period Appellations in that one refers to a Period within a geo-cultural
context while the other is purely temporal - a distinction which is often hard
to recognise in natural language. Both Time-Span and Period have the reflexive
properties consists of and falls within. Both of these allow
part-whole hierarchies to be constructed. The distinction between the two types
of property is that in the first case the whole is thought to be composed of
or defined by its parts whereas in the second the relationship is merely
contingent. An example might be a period of national celebration, which could
be said to be composed of the individual events, whereas the construction of a
building might simply fall within the period of a particular government.
The following is the list of all entities and
properties contained in the model. It consists of an index of entities
and an index of properties, followed by the complete list of entity
declarations and the complete list of property declarations. The list is
ordered by the unique identifiers, which have been assigned in historical order
from version 2 on.
The entity index has the following format:
·
Unique identifier consisting of the letter “E” for “entity” and a number
·
A series of “-” (minus) symbols indicating the depth in the IsA hierarchy.
·
The English name of the entity itself.
·
The index is ordered by hierarchic level, in a “depth first” manner,
from the smaller to the larger subhierarchies, and alphabetically between equal
siblings.
·
Entities that reappear at another position in the hierarchy due to
multiple inheritance are marked using italics.
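The layout of the entity index can be reproduced from a subclass table. The Python sketch below is one interpretation of the rules above, with several assumptions: the hierarchy is a simplified excerpt (intermediate classes are omitted, and “E20 Biological Object” is supplied from outside this page), “smaller subhierarchy” is taken to mean fewer nodes, alphabetical order is applied to the name without its identifier, and asterisks stand in for italics in plain text.

```python
# Simplified excerpt: class -> immediate subclasses (not the full hierarchy).
SUBCLASSES = {
    "E77 Persistent Item": ["E28 Conceptual Object", "E18 Physical Stuff", "E39 Actor"],
    "E28 Conceptual Object": ["E55 Type"],
    "E18 Physical Stuff": ["E20 Biological Object"],
    "E20 Biological Object": ["E21 Person"],
    "E39 Actor": ["E21 Person"],  # E21 Person has two superclasses
}


def split_label(label):
    """Split 'E21 Person' into ('E21', 'Person')."""
    identifier, name = label.split(" ", 1)
    return identifier, name


def subtree_size(node):
    """Number of nodes in the subhierarchy rooted at node."""
    return 1 + sum(subtree_size(child) for child in SUBCLASSES.get(node, []))


def entity_index(root):
    """Return the index lines for root, depth first."""
    lines = []
    seen = set()

    def visit(node, depth):
        identifier, name = split_label(node)
        if node in seen:
            name = f"*{name}*"  # reappears through multiple inheritance
        seen.add(node)
        parts = (identifier, "-" * depth, name)
        lines.append(" ".join(part for part in parts if part))
        # Smaller subhierarchies first, then alphabetically by name.
        children = sorted(SUBCLASSES.get(node, []),
                          key=lambda c: (subtree_size(c), split_label(c)[1]))
        for child in children:
            visit(child, depth + 1)

    visit(root, 0)
    return lines
```

Calling `entity_index("E77 Persistent Item")` lists “E39 Actor” and “E28 Conceptual Object” before the larger “E18 Physical Stuff” branch, and marks the second appearance of “E21 Person”.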
The property index has the following format:
·
Unique identifier consisting of the letter “P” for “property” and a
number
·
A series of “-” (minus) symbols indicating the depth in the IsA hierarchy.
·
The English name of the property itself, followed by its name in
parentheses for reading it in the inverse direction.
·
The “domain” entity in which it is declared
·
The “range” entity where it points to.
·
The index is ordered by hierarchic level, in a “depth first” manner,
from the smaller to the larger subhierarchies, and by property number between
equal siblings.
·
Properties that reappear at another position in the hierarchy due to
multiple inheritance are marked using italics.
Entity declarations use the following format:
·
Entity names (terms) are presented as headings in bold face, preceded by
the unique identifier.
·
The line “Subclass of:” declares the superclass of the entity, from
which it inherits properties.
·
The line “Superclass of:” is a cross-reference to the following
subclasses of this entity.
·
The line “Scope note” contains the textual definition of the concept the
entity represents.
·
The title “Properties” announces the list of properties.
·
Each property is represented by its unique identifier, its forward and
backward name, and the entity it links to, separated by a colon.
·
Inherited properties are not represented.
·
Properties of properties are given in an indented position in
parentheses under the respective property.
Property declarations use the following format:
·
Property names are presented as
headings in bold face, preceded by the unique identifier.
·
The line “Domain:” declares the entity, for which this property is
defined.
·
The line “Range:” declares the entity, to which this property points, or
which provides the values for this property.
·
The line “Superclass of:” is a cross-reference to the following
subproperties of this property.
·
The line “Cardinality:” declares the possible number of occurrences for
an individual entity. Possible values are: 1:many, many:many, many:1.
·
The line “Scope note” contains the textual definition of the concept the
property represents.
·
The title “Properties” announces the list of properties of properties.
·
Each property of a property is represented by a unique identifier
relative to the property for which it is defined, its forward and backward
name, and the entity it links to, separated by a colon.
·
Inherited properties are not represented.