Definition of the CIDOC object-oriented Conceptual Reference Model
and Crossreference Manual
This document is a formal definition of the object-oriented CIDOC Conceptual Reference
Model (referred to in the following as the “CRM”). It is the result of
work done by the CIDOC Documentation Standards Group from 1994 to 2000 and
the CIDOC CRM Special Interest Group from 2000 to 2002, following an initiative
to define the underlying semantics of database schemata and document structures
needed in museum documentation for the support of good practice in conceptual
modelling, data transformation, data exchange, information integration and
mediation of heterogeneous sources.
The intended scope should be understood as the
domain that the CRM would ideally aim to cover, given sufficient time and
resources, and is expressed as a definition of principles. The practical scope
is, necessarily, a subset of the intended scope. The intended scope is difficult
to define with the same degree of precision as the practical scope since it
depends on concepts such as "cultural heritage" which are themselves
complex and difficult to define. The objectives provided by the intended scope
are important, however, since they allow appropriate sources to be selected
for inclusion in the practical scope. The practical scope is expressed in
terms of the reference documents and sources that have been used in its elaboration.
The CRM covers the same domain as these reference sources ([x]…[y]). This
means that data encoded following one of those sources can be transformed
or integrated into a CRM compatible form without loss of meaning, as long
as the meaning remains within the intended scope of the CIDOC CRM.
The intended scope of the CRM may be defined as all information required
for the scientific documentation of cultural heritage collections, with a
view to enabling wide area information exchange and integration of heterogeneous
sources. This definition requires some explanation:
·
The term scientific documentation is intended to convey the requirement
that the depth and quality of descriptive information which can be handled
by the CRM should be sufficient for serious academic research into a given
field and not merely that required for casual browsing. This does not mean
that information intended for presentation to members of the general public
is excluded, but rather that the CRM is intended to provide the level of detail
and precision expected and required by museum professionals and researchers
in the field.
·
The term cultural heritage collections
is intended to cover all types of material collected and displayed by museums
and related institutions, as defined by ICOM [1]. This includes collections,
sites and monuments relating to natural history, ethnography, archaeology,
historic monuments, as well as collections of fine and applied arts. The exchange
of relevant information with libraries and archives, and the harmonisation
of the CRM with their models, fall within the CRM's intended scope.
·
The documentation of collections is intended to encompass the detailed
description both of individual items within collections as well as groups
of items and collections as a whole.
The scope of the CRM is the curated knowledge of museums. Information required
solely for the administration and management of cultural heritage institutions, such as information relating to
personnel, accounting, and visitor statistics, falls outside the intended
scope.
·
The CRM is specifically intended to cover contextual information: the historical,
geographical and theoretical background in which individual items are placed
and which gives them much of their
significance and value.
·
The goal of enabling information exchange and integration between heterogeneous
sources determines the constructs and level of detail of the CRM. It also
determines its perspective, which is necessarily supra-institutional and abstracted
from any specific local context.
·
The CRM aims to leverage contemporary technology while enabling communication
with legacy systems.
The CRM is a domain ontology in the sense used in computer science[]. As
such, the model is designed to be explanatory and extensible rather than prescriptive
and restrictive. Currently, no specific formalism for semantic models has
been widely accepted as standard; nevertheless, the semantic deviations between
the various available models are minimal. Consequently, the model has been
formulated as an object-oriented semantic model[], which can easily be converted
into other object-oriented models. It is our intention that this presentation
format should be both natural and expressive for domain experts, and easy
to convert to other machine-readable formats such as RDF and XML. The definition
itself, as presented here, is not enough to fully comprehend the model and
its application. The use of rich specialization hierarchies produces a complex
set of inherited properties and cross-references, so this relatively compact
definition of about 220 elements corresponds to several thousand properties
of the declared classes. A full set of direct and inherited declarations can
be generated automatically from this definition and made available as
a separate document.
From the various terminologies in use for object-oriented models, we have
selected the following for ease of understanding by non-computer experts.
They are motivated by the terminology of RDF (Resource Description Framework),
a recommendation of W3C. We use a slightly more constrained semantic model
than RDF, a subset of TELOS [], which is compatible with many formalisms and
can be implemented on a large variety of tools. We use:
“class” for a concept
that may be called “class”, “individual class”, “entity” or “node”, in contrast
to properties. It denotes a category of items whose definition does
not depend on the existence of other categories, such as “physical object”,
in contrast to “carried out”, which requires persons and activities to make
sense. A class plays a role similar to a grammatical noun. A class is characterized
by an intension, which is conveyed by a scope note.
A “scope note” is not a definition in absolute terms but a text making
clear to the user the relation of the class to known concepts of the domain.
A class is associated in real life with a set of instances, the extension.
This set is open and unknown – we do not know all instances of a concept in
the current world or the past, nor can we foresee the future. No
concept in the CRM is defined by its extension (such as enumeration sets);
however, references to good examples are used to clarify the intension.
“property” for a concept that may be called
“attribute”, “reference”, “link”, “role” or “property”. A property denotes
a category of items whose definition depends on the existence of
two other items, such as “carried out”, which requires some persons and an
action to make sense, in contrast to “physical object”, which is defined on
its own. It plays a role similar to a verb, which stands between a grammatical
subject and an object. In the manner of semantic networks, we do not distinguish
between internal attributes and references. Rather, every value associated
with an instance of a class is regarded as a reference to that value, which
in turn is an instance of another class. A property is characterized by an
intension, which is conveyed by a scope note. It is also
associated with an extension. For instance, “my role in the
writing of this document” is regarded as an instance of “carried out”. Properties may themselves have properties, which point to other entities.
In the CRM this is used for dynamically specializing properties, such as in
the case of multiple potential kinds of roles.
“domain” for the class a property is defined for – like a grammatical
subject. It is the most specialized class meaningful to the expert that comprises
all potential uses of the property. We allow only one domain
per property. Note that it may always contain instances for which the property
is not applicable.
“range” for the class a property refers to
– like a grammatical object. It is the most specialized class meaningful to
the expert that comprises all potential values of the property. We allow only
one range per property. Note that it may always contain instances
for which the property is not applicable. The difference between
range and domain is not substantial but conventional. We always give in parentheses
the name the property has if domain and range are interchanged, frequently
simply the passive voice. Hence which one is the domain and which one the range depends
only on the name of the property, such as “carried out” versus “was carried
out by”.
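The conventional nature of domain and range can be illustrated with a minimal Python sketch. The `Property` record, its field names and the class names “Person” and “Activity” are assumptions made for illustration only; they are not part of the CRM definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    """A property declaration with one domain, one range and two labels."""
    name: str          # label read from domain to range
    inverse_name: str  # label read from range to domain
    domain_cls: str    # name of the single domain class
    range_cls: str     # name of the single range class

    def inverted(self) -> "Property":
        # Interchanging domain and range only swaps the two labels.
        return Property(self.inverse_name, self.name,
                        self.range_cls, self.domain_cls)


# Hypothetical declaration; the class names are illustrative only.
carried_out = Property("carried out", "was carried out by",
                       "Person", "Activity")
was_carried_out_by = carried_out.inverted()

print(was_carried_out_by.domain_cls, "->", was_carried_out_by.range_cls)
```

Inverting twice gives back the original declaration, which reflects the statement that the choice of direction depends only on the name.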
“Superclass-Subclass” for “IsA” relations, which may be called “superclass – subclass”, “parent
class - derived class”, “generalization - specialization”, “genus-species”,
“subsumes – is subsumed by”. An IsA relation between classes, such as “a birth
IsA event”, denotes that the subclass has all properties of
its superclasses (strict inheritance), and some more of its
own, and that the superclass comprises all instances of all subclasses and
some more of its own. As domain and range are interchangeable,
inheritance holds for both. An instance of a subclass of the domain of a property can
use this property, and any property instance can refer to an instance of a
subclass of the range of this property. A class
may have more than one immediate superclass (multiple
inheritance). The “IsA” relation also holds for properties. E.g. “carried
out” is a subproperty of “participated”. If a property A is a subproperty of a
property B, we require that the domain of A is a subclass of the domain of B, and the range
of A is a subclass of the range of B.
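The requirement on the domain and range of a subproperty can be sketched as a simple check over an IsA hierarchy. The hierarchy fragment, the (domain, range) pairs and the function names below are hypothetical. The sketch also assumes that a class counts as a subclass of itself, which the text above does not state.

```python
# Hypothetical fragment of an IsA hierarchy: class -> immediate superclasses.
SUPERCLASSES = {
    "Entity": [],
    "Event": ["Entity"],
    "Activity": ["Event"],
    "Birth": ["Event"],
    "Actor": ["Entity"],
}


def is_a(cls: str, ancestor: str) -> bool:
    """True if cls equals ancestor or reaches it through IsA links."""
    if cls == ancestor:
        return True
    return any(is_a(parent, ancestor) for parent in SUPERCLASSES[cls])


def valid_subproperty(sub: tuple, sup: tuple) -> bool:
    """Check the rule above; each argument is a (domain, range) pair."""
    return is_a(sub[0], sup[0]) and is_a(sub[1], sup[1])


# Illustrative (domain, range) pairs, not taken from the CRM declarations.
PARTICIPATED = ("Actor", "Event")
CARRIED_OUT = ("Actor", "Activity")

print(valid_subproperty(CARRIED_OUT, PARTICIPATED))
print(valid_subproperty(PARTICIPATED, CARRIED_OUT))
```

Because a class may have several immediate superclasses, `SUPERCLASSES` maps each class to a list rather than to a single parent.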
We use cardinality constraints of properties (e.g. one-to-many).
They are not implementation recommendations;
they only serve semantic clarification.
As the model is designed to compile alternative opinions and
incomplete information, all properties should be implemented
as multi-valued and optional, unless more complex relationships to information
sources are used. Possible values are:
“1:many”: an individual domain instance
can have zero or more such properties, but each value can be referred
to by no more than one domain instance (“fan-out”). E.g. one Formation Event may form many Groups,
but one Group is formed by only one Formation Event (P95).
“many:1”: an individual domain instance
can have zero or one such property, but each value can be referred
to by more than one domain instance (“fan-in”). E.g. one Birth can be by only one mother,
but a mother may have had many births (P96).
“many:many”: unconstrained.
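The three cardinality values can be read as conditions on a set of (domain instance, range instance) pairs. The following Python sketch only illustrates that reading. It is not a recommendation to enforce the constraints in an implementation, and the function name and instance identifiers are hypothetical.

```python
from collections import Counter


def satisfies(cardinality: str, pairs: set) -> bool:
    """Check (domain instance, range instance) pairs against a cardinality."""
    per_domain = Counter(d for d, _ in pairs)
    per_range = Counter(r for _, r in pairs)
    if cardinality == "1:many":
        # Each value is referred to by at most one domain instance.
        return all(n <= 1 for n in per_range.values())
    if cardinality == "many:1":
        # Each domain instance has at most one value.
        return all(n <= 1 for n in per_domain.values())
    if cardinality == "many:many":
        return True
    raise ValueError(f"unknown cardinality: {cardinality}")


# One formation event forming two groups (illustrative identifiers).
formed = {("formation_1", "group_a"), ("formation_1", "group_b")}

print(satisfies("1:many", formed))
print(satisfies("many:1", formed))
```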
We have applied the following naming rules:
·
Classes are named using initial capitals, preceded by “E” (like “entity”),
and an identification number.
·
Classes are named using noun phrases (nominal groups).
·
Properties are named using lower case letters and are labelled in both
directions. They are preceded by “P” (like “property”), and an identification
number.
·
The direction of properties, and hence their names, are in accordance with
the following priority list:
·
Events
·
Objects
·
Actors
·
Other
·
Property names are to be read from left to right for the domain – range
direction, and, in brackets, from right to left, for the range – domain direction.
Implementers can choose the appropriate name according to the orientation
of their property of field attachment.
·
Properties are named using verbal phrases. Properties with the character
of states are named in present tense, such as “has type”, whereas properties
related to events are named in past tense, such as “carried out”.
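The naming rules above are regular enough to be checked mechanically. The sketch below is one possible reading of them as regular expressions. The patterns, the function names and the identifier number in “P999” are assumptions for illustration, and the sketch expects the identifier and the name to be separated by a single space.

```python
import re
from typing import Optional, Tuple

# "E" + number + words with initial capitals, e.g. "E21 Person".
CLASS_RE = re.compile(r"^E(\d+) ((?:[A-Z][\w-]*)(?: [A-Z][\w-]*)*)$")
# "P" + number + lower-case name, optionally followed by the inverse name
# in brackets, e.g. "P999 carried out (was carried out by)".
PROPERTY_RE = re.compile(r"^P(\d+) ([a-z][a-z ]*?)(?: \(([a-z][a-z ]*)\))?$")


def parse_class(label: str) -> Optional[Tuple[int, str]]:
    """Return (number, name) for a well-formed class label, else None."""
    m = CLASS_RE.match(label)
    return (int(m.group(1)), m.group(2)) if m else None


def parse_property(label: str) -> Optional[Tuple[int, str, Optional[str]]]:
    """Return (number, name, inverse name or None), else None."""
    m = PROPERTY_RE.match(label)
    return (int(m.group(1)), m.group(2), m.group(3)) if m else None


print(parse_class("E21 Person"))
print(parse_property("P2 has type"))
# "P999" is a placeholder number, not a real CRM identifier.
print(parse_property("P999 carried out (was carried out by)"))
```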
The purpose of the CRM is not to analyse the philosophical substance
of the concepts it defines, nor to provide a formal account of whether an item
is an instance of one of its classes. Rather, it is to provide a core language
that allows the semantics of heterogeneous data structures to be integrated,
or new data structures to be developed. The expert must be able to comprehend the meaning
of a CRM concept and decide which data structure elements or intended
meanings in a planned system are compatible with a CRM concept. We try to restrict
the CRM to minimal notions that can safely be standardized.
As a model for information integration, the CRM tries to be
monotonic under increase of knowledge in an “Open World”:
no construct should become invalid if knowledge increases. The CRM does not
provide any constraints to “improve” the quality of data produced by scholars
and scientists, nor to enforce a certain “truth” for data about the past,
such as requiring people to have one father only.
Consequently, there are no properties in the CRM that help to justify
classification by one of its classes, in the way that “human DNA” may justify being
a “person”. No definition of a CRM class is based on the existence of an instance
of some CRM property. For instance, even an “information carrier” may not
carry information, such as an empty diskette.
CRM concepts are “primitive”; they cannot be logically
derived from other CRM classes and properties.
Siblings of CRM classes and properties under the same superclass
are non-exclusive by default. E.g., an object may be both a “biological
object” and “man-made”. We do not declare complements,
such as “former owner”, once we have declared a “current owner”.
Some properties are declared as “shortcuts” of a path
that connects the same domain and range as the respective property, but leads
through multiple properties and classes (normally one intermediate class).
The declaration denotes that all instances of this path can be seen as instances
of the “shortcut” property. The opposite is normally not true: it may not
be possible to infer the path from the existence of an instance of the shortcut
property. In some cases, it may be possible to infer a path with a hypothetical
intermediate node which is uniquely defined by a property, domain and range
instance.
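The one-way nature of a shortcut can be sketched over subject-property-object triples. In the Python sketch below, the function, the property names and the instance identifiers are all hypothetical. It derives shortcut statements from two-step paths and makes no attempt at the reverse inference.

```python
def expand_shortcut(triples: set, first: str, second: str, shortcut: str) -> set:
    """Derive a shortcut statement from every two-step path first/second."""
    derived = set()
    for subject, prop, middle in triples:
        if prop != first:
            continue
        for subject2, prop2, obj in triples:
            if subject2 == middle and prop2 == second:
                derived.add((subject, shortcut, obj))
    return derived


# Illustrative data: an object, one intermediate event and a place.
data = {
    ("vase_1", "was moved by", "move_1"),
    ("move_1", "moved to", "gallery_3"),
    ("vase_2", "has location", "store_1"),  # shortcut only, no known path
}

print(expand_shortcut(data, "was moved by", "moved to", "has location"))
```

For `vase_2` only the shortcut statement is known. The sketch leaves it as it is, since the path through an event cannot in general be reconstructed from the shortcut.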
Disjoint sets are sets that share no instance. We call two concepts
A and B “disjoint” if their extensions should not share instances
in any possible world. We have carefully studied which concepts may be disjoint.
The possible combinations of CRM concepts are many, the decision is often not
possible, and the practical use of such a statement is questionable when the
fact is obvious to any expert. There are, however, two non-obvious cases that
are fundamental to the comprehension of the CRM:
·
E2 Temporal Entity is disjoint from E77 Persistent Item. Instances of E2 are
also called “perdurants”, and instances of E77 “endurants” [].
Even though “persistent items” have a limited existence in time, we regard
them as fundamentally different, because they preserve their identity between
events, such as in the phrase “it is still there”. This position fits the
distinctions made in real data structures.
·
E18 Physical Stuff is disjoint from E28 Conceptual Object. The distinction is
between material and immaterial items, the latter being exclusively man-made.
They differ in the way they are produced – incorporating material or not;
in the way they participate in events – in one at a time, or in many at a
time via multiple physical carriers; in the way they perish – by destruction,
or by loss of the last carrier or forgetting.
Virtually any cultural data record begins with an object
identifier and the “type” of the described item. Often such a field is analysed
into “Classification”, “Category”, “Object Name”, “Role” etc. These terms
all refer to classes or categories of items on different levels of specialization
and in different contexts, i.e. they declare the respective item to be an
instance of this class. In the CRM we found that we do not create any ambiguity
if we describe them all by one term: E55 Type. So E55 Type is in fact a
class of classes, a metaclass.
On the other hand, creating a record in a table “Object”,
for instance, also declares the item to be an instance of a class “Object”. The
practical difference is that the declared type has no implications for the
data structure used, whereas the creation of a record has. The CRM describes
data structure semantics. Therefore
we follow this practice in the CRM, declaring as CRM concepts only those classes
that have a declared relationship (property) to another CRM class, except where the class is needed to
group or link other CRM classes (such as E13 Attribute Assignment or E21 Person).
Consequently we endow all CRM classes with the property “P2 has type”, which
allows the classification of any item to be refined to any level of detail.
This is the link from the CRM to terminological systems, which are frequently provided
in the form of thesauri or ontologies as well. Those are not
the target of the CRM.
In an isolated relational database, the table is always
the most general class that is assigned to an item. The declared types form
an IsA (subclass) hierarchy below the table level. So there is no conflict
between the types and the table. In object-oriented systems like the CRM,
the classes (corresponding to tables) form an IsA hierarchy themselves. Therefore
any type hierarchy used for CRM compatible data
must be an extension of the IsA hierarchy of CRM classes. This
applies equally to types of properties.
E55 Type is also a range of many properties, such as
“P125 used general object”. These properties, except for “P2 has type”, declare
a kind of general knowledge about an object, quite frequent in cultural data,
such as “this object was produced by a mold”, meaning that there has been
an instance of “mold” that was actually used. This information allows the object
to be connected to all those that are of type (“P2 has type”) “mold”. This consistent
treatment of general (metaclass) knowledge gives the CRM a particular power,
one of the keys to integrate cultural knowledge. However, in order not to
overload this standard with a complex theory, we do not express formally a
constraint like: “the range of P125 is restricted to types of E70 Stuff”,
even though it is understood that each CRM class corresponds to a respective
subclass of E55 Type.
Finally, types play an extraordinary role in the history
of the human mind. They are intellectual products, objects of our discourse,
and their history and justification by physical evidence is a target of documentation,
particularly in archaeology and Natural History. Therefore the CRM regards
them as “conceptual objects”, parallel to their structural role. The CRM elegantly
integrates both aspects in a way adequate to cultural data and Natural History
documentation.
Of necessity, some concepts are less thoroughly elaborated than others:
“E39 Actor” and “E30 Right”, for example. This is a natural consequence of
focussing on specific functionality in an intrinsically unlimited field. These
‘underdeveloped’ concepts can be considered as hook-in points for extensions
compatible with the model. However, even without these extensions, the CRM
is nevertheless ‘complete’ in that, through the use of free text fields (“has
note”), it allows information to be captured that is not modelled explicitly.
Indeed, some information has deliberately not been developed into formal
properties. This approach is preferable when detailed, targeted queries are
not expected: a good text description, drawing or diagram often provides
a better source of information than formally encoded knowledge. In general,
only those concepts on which formal querying is required need to be
made explicit - rather than all the information which needs to be stored and
retrieved.
The CRM has been designed to be extensible, acknowledging
the fact that the intended scope of the CIDOC CRM is not finite. This only
makes sense in conjunction with a notion of compatibility, such that
data described by an extension of the CRM can still be regarded as valid instances
of the CRM. In practical terms this means that queries which the CRM concepts allow
to be answered on a set of CRM instance data can also be answered on the extension
data (query enclosure, []). Note that we talk about semantics, not
about formalisms. For instance, a query “list all events” is only correctly
answered if it returns everything that the CRM experts regard as an event,
not only what an extension may have put under “event”.
A sufficient condition for compatibility of an
extension is that CRM classes subsume all classes of the extension, and all
properties of the extension are either subsumed by CRM properties, or are
part of a path for which a CRM property is a shortcut. Obviously, whether such a
condition holds can only be decided intellectually. The user, who is (or is not)
satisfied with the answer, has the last word.
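The class part of this sufficient condition can be approximated mechanically, as long as the extension declares its IsA links explicitly. The sketch below is a hypothetical aid only: it inspects declared links in an acyclic hierarchy and cannot replace the intellectual judgement described above. The class names, the dictionary layout and the function names are assumptions.

```python
# Illustrative CRM classes and a hypothetical extension:
# extension class -> declared immediate superclasses.
CRM_CLASSES = {"Event", "Actor"}
EXTENSION_PARENTS = {
    "Excavation": ["Event"],
    "Trench Opening": ["Excavation"],
    "Sponsor": [],  # no declared link to any CRM class
}


def is_subsumed(cls: str, parents: dict, crm_classes: set) -> bool:
    """True if cls is a CRM class or reaches one through declared IsA links."""
    if cls in crm_classes:
        return True
    return any(is_subsumed(p, parents, crm_classes)
               for p in parents.get(cls, []))


def unsubsumed_classes(parents: dict, crm_classes: set) -> list:
    """Extension classes that no CRM class subsumes, in alphabetical order."""
    return sorted(c for c in parents
                  if not is_subsumed(c, parents, crm_classes))


print(unsubsumed_classes(EXTENSION_PARENTS, CRM_CLASSES))
```

An empty result would indicate that the declared links satisfy the class part of the condition. The property part, including paths for which a CRM property is a shortcut, is not covered by this sketch.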
Fig. 1 reasoning about spatial information
The diagram above shows a partial view of the
CRM representing spatial information. Five of the main hierarchy branches
are included in this view: Actor, Contact Point, Appellation, Place, and Physical
Stuff. The relationships between these main classes and their subclasses are
shown as branching lines. Properties between classes are shown as green ovals.
A 'shortcut' property is included in this view: has section (is located on
or within) between Place and Physical Object is a shortcut of the path through
Section Definition. In some cases the order of priority for property names
has been modified in order to facilitate reading the model from left to right.
As can be seen, a Place is identified by a Place Appellation, which may be
an Address, Spatial Coordinates, a Place Name, or a Section Definition such
as 'basement', 'prow', or 'lower left-hand corner'. A Place may consist of
or form part of another place, thereby allowing a hierarchy of physical 'containers'
to be constructed.
An Address can be considered both as a Place Appellation - a way of referring
to a place - and as a Contact Point for an Actor. An Actor may have any number
of Contact Points. Physical Stuff is found at locations as a consequence of
being created there or being moved there. Therefore the properties is former
or current location of and currently holds are regarded as shortcuts of the paths
through the respective events. Currently holds is a subproperty of is former
or current location of. The latter is a container for location information
without any knowledge about time of validity and related events.
An interesting aspect of the model is the defines section property between
Section Definition and Physical Stuff (and the corresponding shortcut from
Place to Physical Object). This effectively means that a section of a Physical
Object is the reference for a Place. We may know, for example, that Nelson
died on a particular spot on the Victory, without being able to locate the
exact position of the vessel in geospatial terms. Similarly, a signature or
inscription can be located 'on the lower right hand corner of' a painting,
regardless of where the painting is hanging.
Fig. 2 reasoning about temporal information
This second example shows how the model handles temporal information. Four
of the main hierarchy branches are included in this view: Temporal Entity,
Time-Span, Appellation and Place. The Temporal Entity class serves to group
together all classes which have a temporal component, such as historical Periods,
Events and Condition States. Typically, Periods and Events are identified
by a name or Period Appellation. A Time-Span is simply a temporal interval
that does not make any reference to cultural or geographical contexts, unlike
Periods, which take place at a particular Place. Time-Spans are sometimes
named, generally by reference to Dates. Time Appellations differ from Period
Appellations in that one refers to a Period within a geo-cultural context
while the other is purely temporal - a distinction which is often hard to
recognise in natural language. Time-Span has the reflexive property falls
within, a purely incidental inclusion, and Period also has the reflexive property
consists of, which allows part-whole hierarchies to be constructed. The distinction
between the two types of property is that in the first case the relationship is
merely contingent, whereas in the second the whole is thought to be composed
of or defined by its parts. An example might be a period of national celebration,
which could be said to be composed of the individual phases, whereas the construction
of a building might simply fall within the period of a particular government.
Time-Spans can be approximated by outer bounds (indeterminacy interval) via
the property at some time within and by inner bounds via ongoing throughout,
where Time Primitive refers to an interval of dates that is better provided
as a basic type (Primitive Value) by a database system to support the suitable
query operations.
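The approximation of a Time-Span by outer and inner bounds can be sketched as follows. The `TimeSpan` record, its method names and the use of whole years as the primitive value are assumptions chosen to keep the example short. A real system would use whatever date interval type its database provides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Interval = Tuple[int, int]  # (first year, last year), both inclusive


@dataclass
class TimeSpan:
    at_some_time_within: Interval                  # outer bounds
    ongoing_throughout: Optional[Interval] = None  # inner bounds, if known

    def is_consistent(self) -> bool:
        """The inner bounds, if given, must lie within the outer bounds."""
        if self.ongoing_throughout is None:
            return True
        outer_begin, outer_end = self.at_some_time_within
        inner_begin, inner_end = self.ongoing_throughout
        return outer_begin <= inner_begin <= inner_end <= outer_end

    def possibly_covers(self, year: int) -> bool:
        """The year is not excluded by the outer bounds."""
        begin, end = self.at_some_time_within
        return begin <= year <= end

    def certainly_covers(self, year: int) -> bool:
        """The year lies within the inner bounds."""
        if self.ongoing_throughout is None:
            return False
        begin, end = self.ongoing_throughout
        return begin <= year <= end


# Illustrative values: a construction some time within 1850-1870,
# known to be ongoing throughout 1855-1860.
construction = TimeSpan((1850, 1870), (1855, 1860))
print(construction.possibly_covers(1852), construction.certainly_covers(1852))
```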
The following is the list of all entities and
properties contained in the model. It consists of an index of entities
and an index of properties, followed by the complete list of entity declarations
and the complete list of property declarations. The list is ordered by the
unique identifiers, which have been assigned in historical order from version
2 on.
The entity index has the following format:
·
Unique identifier consisting of the letter “E” for “entity” and a number
·
A series of “-” (minus) symbols indicating the depth in the IsA hierarchy.
·
The English name of the entity itself.
·
The index is ordered by hierarchic level, in a “depth first” manner, from
the smaller to the larger subhierarchies, and alphabetically between equal
siblings.
·
Entities that reappear at another position in the hierarchy due to multiple
inheritance are marked using italics.
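The ordering and indentation rules of the entity index can be sketched as a depth-first traversal. The hierarchy below is invented for illustration (identifiers from E900 upwards are not CRM identifiers), and the sketch ignores multiple inheritance and the italic marking.

```python
# Hypothetical hierarchy fragment; identifiers and names are invented.
NAMES = {
    "E900": "Thing",
    "E901": "Happening",
    "E902": "Item",
    "E903": "Material Item",
    "E904": "Idea",
}
CHILDREN = {"E900": ["E902", "E901"], "E902": ["E903", "E904"]}


def size(ident: str) -> int:
    """Number of entities in the subhierarchy rooted at ident."""
    return 1 + sum(size(child) for child in CHILDREN.get(ident, []))


def index_lines(ident: str, depth: int = 0) -> list:
    """Depth-first index: smaller subhierarchies first, then by name."""
    parts = [ident, "-" * depth, NAMES[ident]]
    lines = [" ".join(part for part in parts if part)]
    siblings = sorted(CHILDREN.get(ident, []),
                      key=lambda c: (size(c), NAMES[c]))
    for child in siblings:
        lines.extend(index_lines(child, depth + 1))
    return lines


print("\n".join(index_lines("E900")))
```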
The property index has the following format:
·
Unique identifier consisting of the letter “P” for “property” and a number
·
A series of “-” (minus) symbols indicating the depth in the IsA hierarchy.
·
The English name of the property itself, followed by its name in parentheses
for reading it in the inverse direction.
·
The “domain” entity in which it is declared
·
The “range” entity to which it points.
·
The index is ordered by hierarchic level, in a “depth first” manner, from
the smaller to the larger subhierarchies, and by property number between equal
siblings.
·
Properties that reappear at another position in the hierarchy due to multiple
inheritance are marked using italics.
Entity declarations use the following format:
·
Entity names (terms) are presented as headings in bold face, preceded by
the unique identifier.
·
The line “Subclass of:” declares the superclass of the entity, from which
it inherits properties.
·
The line “Superclass of:” is a cross-reference to the following subclasses
of this entity.
·
The line “Scope note” contains the textual definition of the concept the
entity represents.
·
The title “Properties” announces the list of properties.
·
Each property is represented by its unique identifier, its forward and
backward name, and the entity it links to, separated by a colon.
·
Inherited properties are not represented.
·
Properties of properties are given in an indented position in parentheses
under the respective property.
Property declarations use the following format:
·
Property names are presented as
headings in bold face, preceded by the unique identifier.
·
The line “Domain:” declares the entity, for which this property is defined.
·
The line “Range:” declares the entity, to which this property points, or
which provides the values for this property.
·
The line “Superclass of:” is a cross-reference to the subproperties
of this property.
·
The line “Cardinality:” declares the possible number of occurrences for
an individual entity. Possible values are: 1:many, many:many, many:1.
·
The line “Scope note” contains the textual definition of the concept the
property represents.
·
The title “Properties” announces the list of properties of properties.
·
Each property of a property is represented by a unique identifier relative
to the property for which it is defined, its forward and backward name, and
the entity it links to, separated by a colon.
·
Inherited properties are not represented.