
NAME

       datalad aggregate-metadata - aggregate metadata of one or more datasets for later query.

SYNOPSIS

       datalad   aggregate-metadata   [-h]   [-d   DATASET]   [-r]   [-R  LEVELS]  [--update-mode  {all|target}]
              [--incremental] [--force-extraction] [--nosave] [PATH [PATH ...]]

DESCRIPTION

       Metadata aggregation refers to a procedure that extracts metadata present in a dataset into a portable
       representation that is stored in a single standardized format. Moreover, metadata aggregation can also
       extract metadata in this format from one dataset and store it in another (super)dataset.  Based  on  such
       collections  of  aggregated metadata it is possible to discover particular datasets and specific parts of
       their content, without having to obtain the target datasets first (see the DataLad 'search' command).
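
       For example, aggregation and subsequent querying could look like this (the dataset location and the
       query term are hypothetical):

          # aggregate metadata of the superdataset and all installed subdatasets
          $ datalad aggregate-metadata -d ~/datasets/collection -r

          # afterwards, discover matching datasets and files without obtaining subdatasets first
          $ datalad search -d ~/datasets/collection sherlock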

       To enable aggregation of metadata that are contained in files of a dataset, one has to enable one or more
       metadata extractors for a dataset. DataLad supports a number of common metadata standards, such as the
       Exchangeable Image File Format (EXIF), Adobe's Extensible Metadata Platform (XMP), and various audio file
       metadata systems like ID3. DataLad extension packages can provide metadata extractors for additional
       metadata sources. For example, the neuroimaging extension provides extractors for  scientific  (meta)data
       standards  like  BIDS,  DICOM,  and  NIfTI1.   Some  metadata  extractors  depend on particular 3rd-party
       software. The list of metadata extractors available to a particular DataLad installation is  reported  by
       the 'wtf' command ('datalad wtf').
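
       For example, the report produced by the following command includes the metadata extractors known to the
       local installation:

          $ datalad wtf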

       Enabling a metadata extractor for a dataset is done by adding its name to the dataset's configuration
       file (.datalad/config), e.g.:

       [datalad "metadata"]
         nativetype = exif
         nativetype = xmp
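
       Because .datalad/config uses the Git configuration file format, the same entries can also be added from
       the command line, for example:

          $ git config -f .datalad/config --add datalad.metadata.nativetype exif
          $ git config -f .datalad/config --add datalad.metadata.nativetype xmp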

       If an enabled metadata extractor is not available in a particular DataLad installation, metadata
       extraction will fail in order to avoid inconsistent aggregation results.

       Enabling multiple extractors is supported. In this case, metadata are extracted by each extractor
       individually, and stored alongside each other. Metadata aggregation will also extract DataLad's own
       metadata via its built-in extractors.

       Metadata  aggregation  can  be  performed  recursively,  in  order  to  aggregate all metadata across all
       subdatasets, for example, to be able to search across  any  content  in  any  dataset  of  a  collection.
       Aggregation  can  also  be  performed  for  subdatasets  that  are  not  available locally. In this case,
       pre-aggregated metadata from the closest available superdataset will be considered instead.
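
       For example, a recursive aggregation across an entire installed dataset hierarchy could look like this
       (the recursion limit is shown for illustration only and can be omitted):

          # aggregate metadata of the current dataset and up to two levels of subdatasets
          $ datalad aggregate-metadata -d . -r -R 2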

       Depending on the variety of the present metadata and the number of datasets or files, aggregated
       metadata can grow prohibitively large. A number of configuration switches are provided to mitigate such
       issues.

       datalad.metadata.aggregate-content-<extractor-name>
               If set to false, content metadata aggregation will not be performed for the named metadata
               extractor (a potential underscore '_' in the extractor name must be replaced by a dash '-').
               This can substantially reduce the runtime for metadata extraction, and also reduce the size of
               the generated metadata aggregate. Note, however, that some extractors may not produce any
               metadata when this is disabled, because their metadata might come from individual file headers
               only. 'datalad.metadata.store-aggregate-content' might be a more appropriate setting in such
               cases.
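
               For example, per-file metadata extraction by the 'xmp' extractor could be disabled like this
               (writing the setting to .datalad/config is just one possible location):

                  $ git config -f .datalad/config datalad.metadata.aggregate-content-xmp false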

       datalad.metadata.aggregate-ignore-fields
               Any metadata key matching any regular expression in this configuration setting is removed prior
               to generating the dataset-level metadata summary (keys and their unique values across all
               dataset content), and from the dataset metadata itself. This switch can also be used to filter
               out sensitive information prior to aggregation.
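
               For example, all keys starting with 'comment' could be excluded like this (the regular
               expression is purely illustrative):

                  $ git config -f .datalad/config --add datalad.metadata.aggregate-ignore-fields 'comment.*'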

       datalad.metadata.generate-unique-<extractor-name>
               If set to false, DataLad will not auto-generate a summary of unique content metadata values for
               a particular extractor as part of the dataset-global metadata (a potential underscore '_' in the
               extractor name must be replaced by a dash '-'). This can be useful if such a summary is bloated
               due to minor uninformative (e.g. numerical) differences, or when a particular extractor already
               provides a carefully designed content metadata summary.
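
               For example, the unique-value summary for the 'exif' extractor could be disabled like this:

                  $ git config -f .datalad/config datalad.metadata.generate-unique-exif false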

       datalad.metadata.maxfieldsize
               Any metadata value that exceeds the size threshold given by this configuration setting (in
               bytes/characters) is removed.
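
               For example, a limit of roughly 100 kB per metadata value could be configured like this (the
               threshold is an arbitrary illustration):

                  $ git config -f .datalad/config datalad.metadata.maxfieldsize 100000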

       datalad.metadata.store-aggregate-content
               If set to false, extracted content metadata are still used to generate a dataset-level summary
               of present metadata (all keys and their unique values across all files in a dataset are
               determined and stored as part of the dataset-level metadata aggregate, see
               datalad.metadata.generate-unique-<extractor-name>), but metadata on individual files are not
               stored. This switch can be used to avoid prohibitively large metadata files. Discovery of
               datasets containing content matching particular metadata properties will still be possible, but
               such datasets would have to be obtained first in order to discover which particular files in
               them match these properties.
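
               For example, storage of per-file metadata records could be disabled like this:

                  $ git config -f .datalad/config datalad.metadata.store-aggregate-content false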

OPTIONS

       PATH   path to datasets that shall be aggregated. When a given path points into a dataset, the metadata
              of the containing dataset will be aggregated. If no paths are given, the metadata of the current
              dataset is aggregated. Constraints: value must be a string

       -h, --help, --help-np
              show  this  help message. --help-np forcefully disables the use of a pager for displaying the help
              message

       -d DATASET, --dataset DATASET
              topmost dataset into which metadata will be aggregated. All datasets between this dataset and any
              given path will receive updated aggregated metadata from all given paths. Constraints: Value must
              be a Dataset or a valid identifier of a Dataset (e.g. a path)

       -r, --recursive
              if set, recurse into potential subdatasets.

       -R LEVELS, --recursion-limit LEVELS
              limit recursion into subdatasets to the given number of levels. Constraints: value must be
              convertible to type 'int'

       --update-mode {all|target}
              which datasets to update with newly aggregated metadata: all datasets from any leaf dataset to the
              top-level target dataset including all intermediate datasets (all), or just the  top-level  target
              dataset (target). Constraints: value must be one of ('all', 'target') [Default: 'target']

       --incremental
              If set, all information on metadata records of subdatasets that have not been (re-)aggregated in
              this run will be kept unchanged. This is useful when (re-)aggregating only a subset of a dataset
              hierarchy, for example, because not all subdatasets are locally available.
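
              For example, re-aggregating a single locally available subdataset while keeping all other
              aggregated records unchanged could look like this ('code/analysis' is a hypothetical subdataset
              path):

                 $ datalad aggregate-metadata -d . --incremental code/analysis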

       --force-extraction
              If  set,  all  enabled extractors will be engaged regardless of whether change detection indicates
              that metadata has already been extracted for a given dataset state.

       --nosave
              by default all modifications to a dataset are immediately saved. Giving this option  will  disable
              this behavior.

AUTHORS

        datalad is developed by The DataLad Team and Contributors <team@datalad.org>.