...
- All entities (resources in the linked data jargon) in the data are named with URIs (Uniform Resource Identifiers).
- The names should in most cases be HTTP(S) URIs, as this allows a standardized way to resolve the names (i.e. access the resources).
- When a client resolves the name, relevant information about the resource should be provided. This means, for example, that a human user receives an easily readable representation of the resource, while a machine receives a machine-readable representation of it.
- The resources should refer (be linked) to other resources when it aids in discoverability, contextualizing, validating, or otherwise improving the usability of the data.
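One common way to serve different representations from the same name is HTTP content negotiation, where the client states what it wants in the `Accept` header. The following is a minimal sketch that only prepares the requests without sending them; the helper name `build_request` is hypothetical, and whether a given server honours these media types is an assumption.

```python
from urllib.request import Request

# Media types a client might ask for. Which ones a particular server
# actually supports is an assumption of this sketch.
HUMAN_READABLE = "text/html"
MACHINE_READABLE = "text/turtle"


def build_request(uri: str, machine: bool) -> Request:
    """Prepare (but do not send) an HTTP request for a resource URI."""
    accept = MACHINE_READABLE if machine else HUMAN_READABLE
    return Request(uri, headers={"Accept": accept})


# Same name, two different clients asking for two different representations.
browser_request = build_request("https://finland.fi/", machine=False)
crawler_request = build_request("https://finland.fi/", machine=True)

print(browser_request.get_header("Accept"))
print(crawler_request.get_header("Accept"))
```

The point is that the URI stays the same in both cases; only the requested representation differs.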
It is crucial to understand that, by nature, linked data is atomic and always in the form of a graph. What the resources are depends entirely on the use case at hand. There is, though, a deep philosophical distinction between linked data and traditional data modeling. Traditionally, data modeling is done in a siloed way with an emphasis on modeling records: records of income taxes, places of residence, medical history. Through unique identifiers these records can be connected to the individual in question, but there is a clear separation of concerns: the data model is typically somewhat denormalized and serves a process or system view of the domain at hand.
With linked data, the records view of the world is certainly possible to model and in some cases the appropriate solution. In principle, however, because linked data is in graph format, it allows statements about the world to be expressed in a much more natural way, particularly in RDF (Resource Description Framework), which is the lingua franca of linked data. With RDF, the data is expressed as triples, which you can think of simply as rows of data in three columns. The first column (subject) determines the perspective, i.e. from what point of view we are talking. The second column (predicate) determines the context or theme we are talking about. The third column (object) determines the target to which the context is applied. As an example:
Naming Things
As mentioned, all resources are named ("minted") with URI identifiers, which we can then use to refer to them when needed. URLs and URNs are subsets of URIs, so any URL, be it for an image, web site, REST endpoint address or whatever, is already ready to be incorporated into the linked data ecosystem. URNs (e.g. urn:isbn:0-123-456-789-123) can be used as well, but unlike the aforementioned URLs they can't be directly resolved.
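The difference between the two kinds of names shows up in the URI scheme. As a rough sketch, a client could check the scheme before trying to resolve a name; the helper `is_directly_resolvable` is a hypothetical name, and treating only `http` and `https` as resolvable is a simplifying assumption.

```python
from urllib.parse import urlparse


def is_directly_resolvable(uri: str) -> bool:
    """Return True for HTTP(S) URIs, which a client can dereference as-is."""
    return urlparse(uri).scheme in {"http", "https"}


examples = [
    "https://finland.fi/",
    "urn:isbn:0-123-456-789-123",
]
for uri in examples:
    # Print each name with its scheme and whether it can be fetched directly.
    print(uri, "->", urlparse(uri).scheme, is_directly_resolvable(uri))
```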
There are deep philosophical differences, and reasons behind them, in how and what things are named in linked data compared to traditional data modeling (e.g. with UML), but covering them requires going through the elementary principles first.
Linked Data is Atomic
It is crucial to understand that, by nature, linked data is atomic and always in the form of a graph. The lingua franca of linked data is RDF (Resource Description Framework), which allows for a very intuitive and natural way of representing information. In RDF, everything is expressed as triples (3-tuples): statements consisting of three components (resources). You can think of triples as simply rows of data in a three-column data structure: the first column represents the subject resource (from whose point of view the statement is made), the second column represents the predicate resource (the context of what is being stated about the subject), and the third column represents the object or value resource of the statement. Simplified to the extreme, "Finland (subject) is a (predicate) country (object)" is a statement in this form.
If all data is explicitly in RDF, it means we have a fully atomic dataset where everything from the types of entities down to their attributes exists as individual resources (nodes) connected by associations (edges). If we expand the example above to also include statements about the number of lakes and Finland's capital, we could end up with the following dataset:
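One rough way to picture such a dataset is as a list of plain Python 3-tuples. The labels below are informal stand-ins rather than real identifiers, Helsinki is filled in as the capital for illustration, and the helper `statements_about` is a hypothetical name.

```python
# Each statement is one (subject, predicate, object) row.
triples = [
    ("Finland", "is a", "country"),
    ("Finland", "number of lakes", 187888),
    ("Finland", "has capital", "Helsinki"),
]

# There is no Finland "record" with fixed fields: describing the world
# further just means appending another statement.
triples.append(("Helsinki", "is a", "city"))


def statements_about(subject):
    """Collect every (predicate, object) pair stated about a subject."""
    return [(p, o) for s, p, o in triples if s == subject]


print(statements_about("Finland"))
```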
As you can see, there is no traditional class/instance structure with inner fields. Finland as an entity does not have fixed attribute slots inside it for the number of lakes or its capital: everything is expressed by simply adding more associations between individual nodes. As mentioned above, everything is named with a URI, so a more realistic example would actually look like this:
The triples in this dataset can be serialized very simply, or stored e.g. in a three-column tabular structure:
subject resource | predicate resource | object resource |
---|---|---|
<https://finland.fi/> | <https://foobar/isA> | |
<https://finland.fi/> | <https://foobar/numberOfLakes> | 187888 |
<https://finland.fi/> | <https://foobar/hasCapital> | |
A small exception to the naming rule is that the literal integer value of 187888 does not have an identity (nor do any other literal values).
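To make the serialization concrete, here is a minimal sketch that writes such rows out in the N-Triples line format, where URIs are wrapped in angle brackets and the integer becomes a typed literal. The subject and predicate URIs are the ones from the table above; the two object URIs (`https://foobar/Country` and `https://foobar/Helsinki`) and the helper names are hypothetical placeholders, not part of the original dataset.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Subject and predicate URIs follow the table above; the two object URIs
# are hypothetical placeholders invented for this sketch.
triples = [
    ("https://finland.fi/", "https://foobar/isA", "https://foobar/Country"),
    ("https://finland.fi/", "https://foobar/numberOfLakes", 187888),
    ("https://finland.fi/", "https://foobar/hasCapital", "https://foobar/Helsinki"),
]


def term(value):
    """Format an object: integers become typed literals, strings are URIs."""
    if isinstance(value, int):
        return f'"{value}"^^<{XSD_INTEGER}>'
    return f"<{value}>"


def to_ntriples(rows):
    """Serialize rows as one N-Triples statement per line."""
    return "\n".join(f"<{s}> <{p}> {term(o)} ." for s, p, o in rows)


print(to_ntriples(triples))
```

Note how the literal is the only term written without angle brackets: it carries a value and a datatype, but no name of its own.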
You might have already guessed that this kind of data structure becomes cumbersome when it is used for example to store lists or arrays. Both are possible in RDF, but the flexibility of linked data
Everything Has an Identity
Another
The literal numeric value 187888 is also a resource (node), but it does not have an identity.
What the resources are depends entirely on the use case at hand. There is, though, a deep philosophical distinction between linked data and traditional data modeling. Traditionally, data modeling is done in a siloed way with an emphasis on modeling records, in other words data structures that describe a set of data for a specific use case. As an example, different information systems might hold data about an individual's income taxes, medical history etc. These data sets relate to the individual indirectly via some kind of permanent identifier, such as the Finnish Personal Identity Code, but neither the identifier nor the records are meant to represent the concepts
With linked data, the records view of the world is certainly possible to model and in some cases the appropriate solution, but in principle linked data, being in graph format, allows statements about the world to be expressed in a much more natural way. As an example:
subject resource | predicate resource | object resource |
---|---|---|
<> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <https://schema.org/LandmarksOrHistoricalBuildings> |

subject | predicate | object |
---|---|---|
Matti Meikäläinen | is a | Finnish citizen. |
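In RDF the "is a" relationship is commonly written with the `rdf:type` predicate. A minimal sketch of the statement above, using that predicate and a simple lookup of class members, could look like the following; the two `https://example.org/...` identifiers and the helper `instances_of` are hypothetical and minted only for this illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical identifiers minted for this sketch only.
MATTI = "https://example.org/person/matti-meikalainen"
FINNISH_CITIZEN = "https://example.org/class/FinnishCitizen"

triples = [
    (MATTI, RDF_TYPE, FINNISH_CITIZEN),
]


def instances_of(cls):
    """Return every subject stated to be a member of the given class."""
    return [s for s, p, o in triples if p == RDF_TYPE and o == cls]


print(instances_of(FINNISH_CITIZEN))
```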
As a graph, this data would look like:
...
In this example, "Matti Meikäläinen" is an individual ("instance"), and "Finnish citizen" is a class to which all individuals classified as Finnish citizens belong. "Is a" functions as an association denoting the class membership. But as stated earlier - the
Core Vocabularies (Ontologies)
...