NAME

       datalad search - search dataset metadata

SYNOPSIS

       datalad  search  [-h]  [-d  DATASET]  [--reindex]  [--max-nresults  MAX_NRESULTS]  [--mode
              {egrep,textblob,autofield}]   [--full-record]    [--show-keys    {name,short,full}]
              [--show-query] [QUERY [QUERY ...]]

DESCRIPTION

       DataLad can search metadata extracted from a dataset and/or aggregated into a superdataset
       (see the AGGREGATE-METADATA command). This makes it  possible  to  discover  datasets,  or
       individual files in a dataset even when they are not available locally.

       Ultimately  DataLad  metadata are a graph of linked data structures. However, this command
       does not (yet) support queries that can exploit all information stored in the metadata. At
       the  moment the following search modes are implemented that represent different trade-offs
       between the expressiveness of a query and the computational and storage resources required
       to execute a query.

       - egrep (default)

       - egrepcs [case-sensitive egrep]

       - textblob

       - autofield

       An alternative default mode can be configured by setting the configuration variable
       'datalad.search.default-mode'::

       [datalad "search"]
         default-mode = egrepcs

       Each search mode has its own default configuration for what kind of  documents  to  query.
       The respective default can be changed via configuration variables::

       [datalad "search"]
         index-<mode_name>-documenttype = (all|datasets|files)
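
       The per-mode document type can also be overridden for a single invocation with
       DataLad's '-c' option, as the examples further down do. The following is a minimal
       sketch that only assembles such a command line; 'build_search_command' is a
       hypothetical helper and not part of DataLad.

```python
import shlex

# Document types accepted by the configuration variable described above.
VALID_DOCTYPES = ("all", "datasets", "files")


def build_search_command(mode, doctype, queries):
    """Assemble (but do not run) a `datalad search` argument list.

    Hypothetical helper for illustration only; it is not part of DataLad.
    """
    if doctype not in VALID_DOCTYPES:
        raise ValueError(f"unsupported document type: {doctype!r}")
    # Variable name follows the documented pattern:
    # datalad.search.index-<mode_name>-documenttype
    setting = f"datalad.search.index-{mode}-documenttype={doctype}"
    return ["datalad", "-c", setting, "search", "--mode", mode, *queries]


cmd = build_search_command("egrep", "all", ["bids.type:T1w"])
# shlex.join renders the list as a shell-safe string for display or logging.
print(shlex.join(cmd))
```

       Building the argument list separately keeps quoting concerns out of the query
       strings; the list can be passed to 'subprocess.run' unchanged.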

   Mode: egrep/egrepcs
       These  search  modes  are  largely  ignorant of the metadata structure, and simply perform
       matching of a search pattern against a flat string-representation  of  metadata.  This  is
       advantageous  when  the  query  is  simple  and  the  metadata structure is irrelevant, or
       precisely known.  Moreover, it does not require a  search  index,  hence  results  can  be
       reported  without  an  initial  latency  for  building  a search index when the underlying
       metadata has changed (e.g. due to a dataset update). By default, these search  modes  only
       consider  datasets  and do not investigate records for individual files for speed reasons.
       Search results are reported in the order in which they were discovered.

       Queries can make use of Python regular expression syntax
       (https://docs.python.org/3/library/re.html). In EGREP mode, matching is case-insensitive
       when the query does not contain upper case characters, but is case-sensitive when it does.
       In EGREPCS mode, matching is always case-sensitive. Expressions will match anywhere in a
       metadata string, not only at the start.
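
       The case rule described above can be sketched in a few lines of Python. This is an
       illustration of the described behavior, not DataLad's actual implementation;
       'egrep_like_match' and its 'case_sensitive' parameter are hypothetical names.

```python
import re


def egrep_like_match(query, text, case_sensitive=False):
    """Return True if `query` (a Python regex) matches anywhere in `text`.

    Hypothetical sketch: `case_sensitive=True` stands in for EGREPCS mode.
    """
    # Assumed rule: ignore case only when the query has no upper case
    # characters and case sensitivity was not forced. Note that regex
    # escapes such as \S also count as upper case under this simple test.
    ignore_case = not case_sensitive and not any(c.isupper() for c in query)
    flags = re.IGNORECASE if ignore_case else 0
    # re.search (not re.match) finds the pattern anywhere in the string.
    return re.search(query, text, flags) is not None


print(egrep_like_match("haxby", "James V. Haxby"))
print(egrep_like_match("Haxby", "james v. haxby"))
```

       The first call ignores case because the query is all lowercase; the second is
       case-sensitive because the query contains an upper case character.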

       When multiple queries are given,  all  queries  have  to  match  for  a  search  hit  (AND
       behavior).

       It is possible to search individual metadata key/value items by prefixing the query with a
       metadata key name, separated by a colon  (':').  The  key  name  can  also  be  a  regular
       expression  to match multiple keys. A query match happens when any value of an item with a
       matching key name matches the query (OR behavior). See examples for more information.
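
       The interplay of AND (across queries) and OR (across the values of matching keys)
       can be sketched as follows. The record layout, the full-match rule for key names,
       and the case-sensitive value matching are assumptions made for illustration; the
       function names are hypothetical and the sample record is invented.

```python
import re


def item_query_matches(query, record):
    """Match one 'key:value' query against a dict of key -> list of values.

    Only keyed queries are handled; the first colon separates key and value.
    """
    key_pattern, _, value_pattern = query.partition(":")
    for key, values in record.items():
        # Assumption: the key expression must match the whole key name.
        if re.fullmatch(key_pattern, key) is None:
            continue
        # OR behavior: any value under any matching key is sufficient.
        if any(re.search(value_pattern, str(value)) for value in values):
            return True
    return False


def record_matches(queries, record):
    """AND behavior: every query has to match the record."""
    return all(item_query_matches(query, record) for query in queries)


# Invented metadata record for a single file.
record = {
    "bids.type": ["T1w"],
    "nifti1.qform_code": ["unknown"],
    "nifti1.sform_code": ["scanner"],
}

print(record_matches(["bids.type:T1w", "nifti1.qform_code:scanner"], record))
print(record_matches(["bids.type:T1w", "nifti1.(q|s)form_code:scanner"], record))
```

       With this record the first query set fails on the qform condition, while the second
       succeeds because the key expression also selects 'nifti1.sform_code'.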

       Examples:

       Query for (what happens to be) an author::

         % datalad search haxby

       Queries are case-insensitive when the query contains no upper case characters, and can be
       regular expressions. Use EGREPCS mode to force case-sensitive matching of an
       all-lowercase query::

         % datalad search --mode egrepcs halchenko.*haxby

       This search mode performs NO analysis of the  metadata  content.   Therefore  queries  can
       easily  fail  to  match.  For example, the above query implicitly assumes that authors are
       listed in alphabetical order.  If that is the case (which may or may  not  be  true),  the
       following query would yield NO hits::

         % datalad search Haxby.*Halchenko

       The TEXTBLOB search mode represents an alternative that is more robust in such cases.

       For more complex queries multiple query expressions can be provided that all have to match
       to be considered a hit  (AND  behavior).  This  query  discovers  all  files  (non-default
       behavior) that match 'bids.type=T1w' AND 'nifti1.qform_code=scanner'::

         % datalad -c datalad.search.index-egrep-documenttype=all search \
             bids.type:T1w nifti1.qform_code:scanner

       Key name selectors can also be expressions, which can be used to select multiple  keys  or
       construct "fuzzy" queries. In such cases a query matches when any item with a matching key
       matches the query (OR behavior).  However, multiple queries are always evaluated using  an
       AND  conjunction.   The  following query extends the example above to match any files that
       have either 'nifti1.qform_code=scanner' or 'nifti1.sform_code=scanner'::

         % datalad -c datalad.search.index-egrep-documenttype=all search \
             bids.type:T1w 'nifti1.(q|s)form_code:scanner'

   Mode: textblob
       This  search  mode  is  very  similar to the EGREP mode, but with a few key differences. A
       search index is built from the string-representation of metadata records. By default, only
       datasets  are included in this index, hence the indexing is usually completed within a few
       seconds, even for hundreds of datasets. This mode uses its own query language (not regular
       expressions) that is similar to other search engines. It supports logical conjunctions and
       fuzzy search terms. More information on this is available from the Whoosh project  (search
       engine implementation):

       - Description of the Whoosh query language:
         http://whoosh.readthedocs.io/en/latest/querylang.html

       - Description of a number of query language customizations that are
         enabled in DataLad, such as fuzzy term matching:
         http://whoosh.readthedocs.io/en/latest/parsing.html#common-customizations

       Importantly,  search  hits are scored and reported in order of descending relevance, hence
       limiting the number of search results is more meaningful than in the 'egrep' mode and  can
       also reduce the query duration.

       Examples:

       Search  for (what happens to be) two authors, regardless of the order in which those names
       appear in the metadata::

         % datalad search --mode textblob halchenko haxby

       Fuzzy search when you only have an approximate idea what you are looking for or how it  is
       spelled::

         % datalad search --mode textblob haxbi~

       Very  fuzzy  search,  when you are basically only confident about the first two characters
       and how it sounds approximately (or more precisely: allow  for  three  edits  and  require
       matching of the first two characters)::

         % datalad search --mode textblob haksbi~3/2

       Combine fuzzy search with logical constructs::

         % datalad search --mode textblob 'haxbi~ AND (hanke OR halchenko)'
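
       The meaning of the fuzzy suffixes used above ('~' and '~3/2') can be illustrated
       with a plain edit-distance check. This is a sketch under assumptions, not Whoosh's
       implementation: it uses plain Levenshtein distance, takes a bare '~' to mean one
       allowed edit, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_term_matches(term, word, max_edits=1, prefix_length=0):
    """Check `word` against `term~max_edits/prefix_length` semantics."""
    # The first `prefix_length` characters have to agree exactly.
    if term[:prefix_length] != word[:prefix_length]:
        return False
    return edit_distance(term, word) <= max_edits


print(fuzzy_term_matches("haxbi", "haxby"))
print(fuzzy_term_matches("haksbi", "haxby", max_edits=3, prefix_length=2))
```

       Requiring an exact prefix narrows the candidate terms considerably, which is why a
       larger edit budget such as three remains practical in the second call.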

   Mode: autofield
       This  mode  is  similar  to  the 'textblob' mode, but builds a vastly more detailed search
       index that represents individual metadata variables as individual fields. By default, this
       search index includes records for datasets and individual files, hence it can grow very
       quickly into a huge structure that can easily take an hour or more to  build  and  require
       more  than  a  GB  of  storage.  However, limiting it to documents on datasets (see above)
       retains the enhanced expressiveness of queries while dramatically  reducing  the  resource
       demands.

       Examples:

       List names of search index fields (auto-discovered from the set of indexed datasets)::

         % datalad search --mode autofield --show-keys name

       Fuzzy  search  for  datasets  with  an  author  that is specified in a particular metadata
       field::

         % datalad search --mode autofield bids.author:haxbi~ type:dataset

       Search for individual files that carry a particular description prefix in  their  'nifti1'
       metadata::

         % datalad search --mode autofield nifti1.description:FSL* type:file

   Reporting
       Search   hits  are  returned  as  standard  DataLad  results.  On  the  command  line  the
       '--output-format' (or '-f') option can be used to tweak results for further processing.

       Examples:

       Format search hits as a JSON stream (one hit per line)::

         % datalad -f json search haxby

       Custom formatting: which terms  matched  the  query  of  particular  results.  Useful  for
       investigating fuzzy search results::

         % datalad -f '{path}: {query_matched}' search --mode autofield bids.author:haxbi~
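
       A JSON stream with one hit per line is straightforward to post-process. The sketch
       below assumes each line is a single JSON object carrying at least the 'path' and
       'type' fields mentioned in this page; the sample lines are invented and
       'parse_result_stream' is a hypothetical helper.

```python
import json


def parse_result_stream(lines):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Invented sample lines for illustration; real records carry more fields.
sample = [
    '{"path": "/data/ds1", "type": "dataset"}',
    '',
    '{"path": "/data/ds1/sub-01/anat.nii.gz", "type": "file"}',
]

hits = list(parse_result_stream(sample))
dataset_paths = [hit["path"] for hit in hits if hit["type"] == "dataset"]
print(dataset_paths)
```

       In practice the lines could come from the standard output of the JSON example
       above, for instance by iterating over 'sys.stdin' in a pipeline.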

OPTIONS

QUERY query string. Supported syntax and features depend on the selected search mode
(see documentation).

-h, --help, --help-np
show this help message. --help-np forcefully disables the use of a pager for
displaying the help message

-d DATASET, --dataset DATASET
specify the dataset to perform the query operation on. If no dataset is given, an
attempt is made to identify the dataset based on the current working directory
and/or the PATH given. Constraints: Value must be a Dataset or a valid identifier
of a Dataset (e.g. a path)

--reindex
force rebuilding the search index, even if no change in the dataset's state has
been detected, for example, when the index documenttype configuration has changed.

--max-nresults MAX_NRESULTS
maximum number of search results to report. Setting this to 0 will report all
search matches. Depending on the mode, this can make the search substantially slower.
If not specified, a mode-specific default setting will be used. Constraints: value
must be convertible to type 'int'

--mode {egrep, textblob, autofield}
Mode of search index structure and content. See the mode descriptions in the
DESCRIPTION section for details.

--full-record, -f
If set, return the full metadata record for each search hit. Depending on the
search mode this might require additional queries. By default, only data that is
available to the respective search modes is returned. This always includes
essential information, such as the path and the type.

--show-keys {name, short, full}
if given, a list of known search keys is shown. If 'name', only the name is
printed, one per line. If 'short' or 'full', statistics (in how many datasets, and
how many unique values) are printed. 'short' truncates the listing of unique
values. No other action is performed (except for reindexing), even if other
arguments are given. Each key is accompanied by a term definition in parentheses
(TODO). In most cases a definition is given in the form of a URL. If an ontology
definition for a term is known, this URL can resolve to a webpage that provides a
comprehensive definition of the term. However, for speed reasons term resolution is
solely done on information contained in a local dataset's metadata, and definition
URLs might be outdated or point to no longer existing resources.

--show-query
if given, the formal query that was generated from the given query string is shown,
but not actually executed. This is mostly useful for debugging purposes.

AUTHORS

        datalad is developed by The DataLad Team and Contributors <team@datalad.org>.