How to Organise Your Data
Please read this chapter carefully; it is essential for understanding the rest of the Databus documentation.
Databus can be seen as a data repository (analogous to software repositories, e.g. Maven or Pip). A key difference between Databus and a typical software repository is that Databus does not store the actual data (i.e. files on disk) but focuses solely on capturing metadata describing your data.
To use the Databus, you need data that you intend to publish. The minimal requirement is a single file containing your data, and that file must be accessible via HTTP.
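Before publishing, it can be useful to check that the data file really is reachable over HTTP. The following is a minimal sketch using only Python's standard library. The function name `is_http_accessible` and its `opener` parameter are illustrative assumptions and not part of Databus.

```python
import urllib.error
import urllib.request


def is_http_accessible(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if a HEAD request to `url` succeeds with a 2xx status.

    `opener` defaults to urllib's urlopen and can be replaced, e.g. for
    offline testing. It is an illustrative hook, not a Databus API.
    """
    if not url.startswith(("http://", "https://")):
        return False  # Only HTTP(S) URLs qualify.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, DNS failures and timeouts.
        return False
```

Calling `is_http_accessible` with the URL of your own data file gives a quick yes/no answer. Note that some servers do not answer HEAD requests, so a `False` result may deserve a manual check.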
In the Databus model, a dataset corresponds to an artifact. Artifacts are versioned and organised into groups. Groups in the DBpedia Databus bring together related datasets, allowing users to navigate and discover datasets associated with a specific topic of interest.
By organizing dataset artifacts into groups, the DBpedia Databus provides a structured and coherent way to manage and publish data files, making it easier for users to find and access datasets that are relevant to their needs.
The relationships among the group, artifact, version and file are depicted in the following diagram.
Metadata for your datasets is made available under a specific URI identifier. Datasets that are related through a group share certain path segments of their URIs (see the example below). The identifiers reflect the above-mentioned group-artifact-version-file structure.
Note: Choosing descriptive (i.e. human-readable) names for your identifiers and putting thought into the partitioning of your metadata entries can greatly improve the understandability, findability and usefulness of your data.
The identifier URI hierarchy is organized as follows:
Account (Username)
Group
Artifact
Version
Distribution (File)
Thus, the identifiers of your metadata entries on the Databus are composed of the following parts:
The Databus base URI (e.g. https://databus.dbpedia.org): the URI under which your Databus instance is deployed.
Your account name for that Databus (e.g. janfo): the username of the dataset owner on the particular Databus instance.
The group name (e.g. animals): the id/name of the published group.
The artifact name (e.g. cats): the id/name of the published artifact.
The version (e.g. 2023-03-30): the version of the published data.
The distribution name (e.g. cats.ttl.bz): the id of the published data distribution.
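The composition of these parts can be sketched in a few lines of Python. This is only an illustration: the `DatabusIdentifier` class is a hypothetical helper, not a type provided by Databus, and the values are the examples given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabusIdentifier:
    """The parts of a Databus distribution identifier (illustrative type)."""

    base_uri: str      # e.g. "https://databus.dbpedia.org"
    account: str       # e.g. "janfo"
    group: str         # e.g. "animals"
    artifact: str      # e.g. "cats"
    version: str       # e.g. "2023-03-30"
    distribution: str  # e.g. "cats.ttl.bz"

    def uri(self) -> str:
        """Join all parts with '/' into the full distribution URI."""
        parts = [self.base_uri.rstrip("/"), self.account, self.group,
                 self.artifact, self.version, self.distribution]
        return "/".join(parts)


example = DatabusIdentifier(
    base_uri="https://databus.dbpedia.org",
    account="janfo",
    group="animals",
    artifact="cats",
    version="2023-03-30",
    distribution="cats.ttl.bz",
)
print(example.uri())
# prints https://databus.dbpedia.org/janfo/animals/cats/2023-03-30/cats.ttl.bz
```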
An example of a full URI identifier is provided in the following figure.
The identifiers encoded in the example above are as follows:
Account identifier: https://databus.dbpedia.org/janfo
Group identifier: https://databus.dbpedia.org/janfo/animals
Artifact identifier: https://databus.dbpedia.org/janfo/animals/cats
Version identifier: https://databus.dbpedia.org/janfo/animals/cats/2023-03-30
Distribution identifier: https://databus.dbpedia.org/janfo/animals/cats/2023-03-30/cats.ttl.bz
See more on the URI design here.
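Because every level of the hierarchy is a prefix of the next one, the identifiers listed above can be derived from a single distribution URI. The sketch below assumes that the Databus instance is deployed at the root of its host, so that each path segment maps to exactly one level. The function `identifier_levels` is a hypothetical helper written for this example.

```python
from urllib.parse import urlsplit

LEVELS = ("account", "group", "artifact", "version", "distribution")


def identifier_levels(uri):
    """Map each hierarchy level present in `uri` to its own identifier.

    Assumes the Databus instance is deployed at the root of its host,
    so every path segment corresponds to one level.
    """
    parts = urlsplit(uri)
    base = f"{parts.scheme}://{parts.netloc}"
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > len(LEVELS):
        raise ValueError(f"too many path segments in {uri!r}")
    result = {}
    for depth, level in enumerate(LEVELS[:len(segments)], start=1):
        # Each identifier is the base URI plus the first `depth` segments.
        result[level] = base + "/" + "/".join(segments[:depth])
    return result


levels = identifier_levels(
    "https://databus.dbpedia.org/janfo/animals/cats/2023-03-30/cats.ttl.bz"
)
for level, identifier in levels.items():
    print(f"{level}: {identifier}")
```

Passing a shorter URI, such as a group identifier, returns only the levels that are present in it.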
The metadata publisher has complete control over the names of the Databus identifiers. Nevertheless, we recommend the following best practices when designing identifiers:
The account name
The account name is chosen on account creation, i.e. when registering at the particular Databus instance. It is advised to use your personal username or the name of your institution/company. In other words, the account name is the identifier of the data owner/publisher. For example, DBpedia publishes its regular releases under the account name dbpedia.
The group name
A group can be understood as a folder for multiple related artifacts. Generally, it is recommended to create one group per project, the same way you would create a folder on disk for a certain project. However, if a project manages a large number of datasets (and thus artifacts), it is a good idea to use multiple groups. For example, DBpedia introduces groups such as mappings, generic, text, etc., where each group contains the artifacts generated by a particular extraction process. The mappings group contains artifacts generated by the Mapping Extractors, while the generic group contains artifacts generated by the Generic Extractors.
The artifact
An artifact consists of multiple versions of different files, but all the files should be related to each other: an artifact should represent data on one topic or data from one source. For example, under the mappings group DBpedia publishes the artifacts that are generated using the mappings-based extractors.
The version
A version represents the state of the artifact at a particular point in time, so the sequence of versions tracks the evolution of one artifact. Under each version, one or more distributions of the dataset are published, each represented as a file. Read more on the versioning best practices.
The distribution
A distribution represents one specific file belonging to a particular version of an artifact. An artifact can have multiple distributions, which are distinguished via content variants. As the term "content variant" suggests, having an artifact with different content variants means that the files are closely related or even represent the same data in different flavors.
A few examples of content variants:
the same data in different languages or encodings/file types
a file and other files generated from that file
a file and other files describing that file
It is not always feasible, but it is highly recommended to keep the names of content variants the same across different versions of the dataset. This allows easy search and retrieval of the datasets using SPARQL.
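To illustrate why stable content variant names help, the sketch below uses plain Python dictionaries as a stand-in for the metadata that would normally be queried with SPARQL. Everything in it is invented for illustration: the records, the `lang` variant key, the file names and the helper functions `select` and `latest`. It also assumes date-style version strings, which sort correctly as plain text.

```python
# Hypothetical metadata records, one per distribution.
distributions = [
    {"version": "2023-03-30", "variants": {"lang": "en"}, "file": "cats_en.ttl"},
    {"version": "2023-03-30", "variants": {"lang": "de"}, "file": "cats_de.ttl"},
    {"version": "2023-06-30", "variants": {"lang": "en"}, "file": "cats_en.ttl"},
    {"version": "2023-06-30", "variants": {"lang": "de"}, "file": "cats_de.ttl"},
]


def select(records, **wanted):
    """Return records whose content variants match all key=value pairs."""
    return [
        r for r in records
        if all(r["variants"].get(k) == v for k, v in wanted.items())
    ]


def latest(records):
    """Pick the record with the highest version.

    Assumes ISO-style date versions, which sort correctly as strings.
    """
    return max(records, key=lambda r: r["version"])


english = select(distributions, lang="en")
print([r["version"] for r in english])  # prints ['2023-03-30', '2023-06-30']
print(latest(english)["version"])       # prints 2023-06-30
```

Because the variant name `lang` and its values stay the same in both versions, a single filter finds the matching file in every version. If one version had used a different name for the same variant, the filter would silently miss it.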