Provided by: python3-intake_0.6.6-1_amd64

NAME

       intake - Intake Documentation

       Taking the pain out of data access and distribution

       Intake  is  a  lightweight  package  for finding, investigating, loading and disseminating
       data. It will appeal to different groups for some of the reasons below, but is useful  for
       all  and acts as a common platform that everyone can use to smooth the progression of data
       from developers and providers to users.

       Intake contains the following main components. You do  not  need  to  use  them  all!  The
       library is modular, only use the parts you need:

       • A set of data loaders (Drivers) with a common interface, so that you can investigate or
         load anything, from local or remote, with the exact same call, turning it into data
         structures that you already know how to manipulate, such as arrays and data-frames.

       • A Cataloging system (Catalogs) for listing data sources, their metadata and parameters,
         and referencing which of the Drivers should load each. The catalogs form a hierarchical,
         searchable structure, which can be backed by files, Intake servers or third-party data
         services.

       • Sets of convenience functions to  apply  to  various  data  sources,  such  as  data-set
         persistence,   automatic  concatenation  and  metadata  inference  and  the  ability  to
         distribute catalogs and data sources using simple packaging abstractions.

       • A GUI layer accessible in the Jupyter notebook  or  as  a  standalone  webserver,  which
         allows  you  to  find  and  navigate catalogs, investigate data sources, and plot either
         predefined visualisations or interactively find the right view yourself

       • A client-server protocol to allow for arbitrary data cataloging services or to serve the
         data itself, with a pluggable auth model.

DATA USER

       • Intake  loads  the  data  for  a  range of formats and types (see Plugin Directory) into
         containers you already use, like Pandas dataframes, Python lists, NumPy arrays, and more

       • Intake loads, then gets out of your way

       • Search and introspect data-sets in Catalogs with the GUI: quickly find what you need to
         do your work

       • Install data-sets and automatically get requirements

       • Leverage cloud resources and distributed computing.

       See the executable tutorial:

       https://mybinder.org/v2/gh/intake/intake-examples/master?filepath=tutorial%2Fdata_scientist.ipynb

DATA PROVIDER

       • Simple spec to define data sources

       • Single point of truth, no more copy&paste

       • Distribute data using packages, shared files or a server

       • Update definitions in-place

       • Parametrise user options

       • Make use of additional functionality like filename parsing and caching.

       See the executable tutorial:

       https://mybinder.org/v2/gh/intake/intake-examples/master?filepath=tutorial%2Fdata_engineer.ipynb

IT

       • Create catalogs out of established departmental practices

       • Provide data access credentials via Intake parameters

       • Use server-client architecture as gatekeeper:

            • add authentication methods

            • add monitoring point; track the data-sets being accessed.

       • Hook Intake into proprietary data access systems.

DEVELOPER

       • Turn boilerplate code into a reusable Driver

       • Pluggable architecture of Intake allows for many points to add and improve

       • Open, simple code-base -- come and get involved on GitHub!

       See the executable tutorial:

       https://mybinder.org/v2/gh/intake/intake-examples/master?filepath=tutorial%2Fdev.ipynb

       The  Start  here  document  contains the sections that all users new to Intake should read
       through. Use Cases - I want to... shows specific problems that Intake solves.  For a brief
       demonstration, which you can execute locally, go to Quickstart.  For a general description
       of all of the components of Intake and how they fit together, go to Overview. Finally, for
       some notebooks using Intake and articles about Intake, go to Examples and intake-examples.
       These and other documentation pages will make reference to concepts that  are  defined  in
       the Glossary.

START HERE

       These documents will familiarise you with Intake, show you some basic usage and examples,
       and describe Intake's place in the wider Python data world.

   Quickstart
       This guide will show you how to get started using Intake to read  data,  and  give  you  a
       flavour  of  how  Intake  feels  to the Data User.  It assumes you are working in either a
       conda or a virtualenv/pip environment. For notebooks with executable code, see the
       Examples. This walk-through can be run from a notebook or interactive Python session.

   Installation
       If you are using Anaconda or Miniconda, install Intake with the following commands:

          conda install -c conda-forge intake

       If you are using virtualenv/pip, run the following command:

          pip install intake

       Note  that this will install with the minimum of optional requirements. If you want a more
       complete install, use intake[complete] instead.

   Creating Sample Data
       Let's begin by creating a sample data set and catalog.   At  the  command  line,  run  the
       intake  example command.  This will create an example data Catalog and two CSV data files.
       These files contain some basic facts about the 50 US states, and the catalog includes a
       specification of how to load them.

   Loading a Data Source
       Data sources can be created directly with the open_*() functions in the intake module.  To
       read our example data:

          >>> import intake
          >>> ds = intake.open_csv('states_*.csv')
          >>> print(ds)
          <intake.source.csv.CSVSource object at 0x1163882e8>

       Each open function has different arguments, specific to the data format or service being
       used.

   Reading Data
       Intake reads data into memory using containers you are already familiar with:

          • Tables: Pandas DataFrames

          • Multidimensional arrays: NumPy arrays

          • Semistructured data: Python lists of objects (usually dictionaries)

       To  find  out  what  kind  of  container a data source will produce, inspect the container
       attribute:

          >>> ds.container
          'dataframe'

       The result will be dataframe, ndarray, or python.  (New container types will be  added  in
       the future.)

       For data that fits in memory, you can ask Intake to load it directly:

          >>> df = ds.read()
          >>> df.head()
                  state        slug code                nickname  ...
          0     Alabama     alabama   AL      Yellowhammer State
          1      Alaska      alaska   AK       The Last Frontier
          2     Arizona     arizona   AZ  The Grand Canyon State
          3    Arkansas    arkansas   AR       The Natural State
          4  California  california   CA            Golden State

       Many data sources will also have quick-look plotting available. The attribute .plot will
       list a number of built-in plotting methods, such as .scatter(); see Plotting.

       Intake data sources can have partitions.  A partition refers to a contiguous chunk of data
       that can be loaded independently of any other partition. The partitioning scheme is
       entirely up to the plugin author.  In the case of the CSV plugin,  each  .csv  file  is  a
       partition.

       To  read data from a data source one chunk at a time, the read_chunked() method returns an
       iterator:

          >>> for chunk in ds.read_chunked(): print('Chunk: %d' % len(chunk))
          ...
          Chunk: 24
          Chunk: 26
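
       To make the idea of one partition per file concrete, here is a minimal sketch that uses
       only the Python standard library. It is not Intake's implementation: the helper names
       read_chunked and write_sample and the tiny data files are hypothetical, chosen only to
       mirror the behaviour described above, where each file matching the pattern is loaded
       independently and handed back one chunk at a time.

```python
import csv
import glob
import os
import tempfile


def read_chunked(pattern):
    """Yield one list of row dicts per file matching ``pattern``.

    Each file is treated as one partition and is only opened when the
    iterator reaches it.
    """
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            yield list(csv.DictReader(f))


def write_sample(directory):
    """Write two tiny CSV files (hypothetical data) and return a glob pattern."""
    rows = {
        "states_1.csv": [("Alabama", "AL"), ("Alaska", "AK")],
        "states_2.csv": [("Arizona", "AZ")],
    }
    for name, records in rows.items():
        with open(os.path.join(directory, name), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["state", "code"])
            writer.writerows(records)
    return os.path.join(directory, "states_*.csv")


with tempfile.TemporaryDirectory() as tmp:
    # One chunk per file: the first file holds two rows, the second one row.
    chunk_sizes = [len(chunk) for chunk in read_chunked(write_sample(tmp))]

print(chunk_sizes)
```

       Because the function is a generator, only one file's rows are held in memory at a time,
       which is the property that makes chunked reading useful for larger data.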

   Working with Dask
       Working with large datasets is much easier with a parallel, out-of-core computing  library
       like Dask.  Intake can create Dask containers (like dask.dataframe) from data sources that
       will load their data only when required:

          >>> ddf = ds.to_dask()
          >>> ddf
          Dask DataFrame Structure:
                      admission_date admission_number capital_city capital_url    code constitution_url facebook_url landscape_background_url map_image_url nickname population population_rank skyline_background_url    slug   state state_flag_url state_seal_url twitter_url website
          npartitions=2
                              object            int64       object      object  object           object       object                   object        object   object      int64           int64                 object  object  object         object         object      object  object
                                  ...              ...          ...         ...     ...              ...          ...                      ...           ...      ...        ...             ...                    ...     ...     ...            ...            ...         ...     ...
                                  ...              ...          ...         ...     ...              ...          ...                      ...           ...      ...        ...             ...                    ...     ...     ...            ...            ...         ...     ...
          Dask Name: from-delayed, 4 tasks

       The Dask containers will be partitioned in  the  same  way  as  the  Intake  data  source,
       allowing  different chunks to be processed in parallel. Please read the Dask documentation
       to understand the differences when working with Dask collections (Bag, Array or
       DataFrame).

   Opening a Catalog
       A  Catalog  is  an  inventory  of data sources, with the type and arguments prescribed for
       each, and arbitrary metadata about each source.  In the simplest case, a  catalog  can  be
       described by a file in YAML format, a "Catalog file". In real usage, catalogs can be
       defined in a number of ways, such as remote files, by connecting  to  a  third-party  data
       service  (e.g.,  SQL server) or through an Intake Server protocol, which can implement any
       number of ways to search and deliver data sources.

       The intake example command, above, created a catalog file with the  following  YAML-syntax
       content:

          sources:
            states:
              description: US state information from [CivilServices](https://civil.services/)
              driver: csv
              args:
                urlpath: '{{ CATALOG_DIR }}/states_*.csv'
              metadata:
                origin_url: 'https://github.com/CivilServiceUSA/us-states/blob/v1.0.0/data/states.csv'

       To load a Catalog from a Catalog file:

          >>> cat = intake.open_catalog('us_states.yml')
          >>> list(cat)
          ['states']

       This catalog contains one data source, called states.  It can be accessed by attribute:

          >>> cat.states.to_dask()[['state','slug']].head()
                  state        slug
          0     Alabama     alabama
          1      Alaska      alaska
          2     Arizona     arizona
          3    Arkansas    arkansas
          4  California  california

       Placing data source specifications into a catalog like this enables declaring data sets in
       a single canonical place, and not having to use boilerplate code in  each  notebook/script
       that  makes  use  of  the  data.  The  catalogs  can also reference one-another, be stored
       remotely, and include extra metadata such as a set of  named  quick-look  plots  that  are
       appropriate for the particular data source. Note that catalogs are not restricted to being
       stored in YAML files; that just happens to be the simplest way to display them.

       Many catalog entries will also contain "user_parameter" blocks, which indicate the
       options explicitly allowed by the catalog author and can be used to validate the values
       passed. The user can customise how a data source is accessed by providing values for the
       user_parameters, overriding the arguments specified in the entry, or passing extra keyword
       arguments through to the driver. The keywords that should be passed are limited to
       the user_parameters defined and the inputs expected by the specific driver; such usage is
       expected only from those already familiar with the specifics of the given format. In the
       following example, the user overrides the "csv_kwargs" keyword, which is described in the
       documentation for CSVSource and gets passed down to the CSV reader:

          # pass extra kwargs understood by the csv driver
          >>> cat.states(csv_kwargs={'header': None, 'skiprows': 1}).read().head()
                     0           1   ...                                17
          0     Alabama     alabama  ...    https://twitter.com/alabamagov
          1      Alaska      alaska  ...        https://twitter.com/alaska

       Note that, if you are creating such catalogs, you may well start by  trying  the  open_csv
       command,  above,  and then use print(ds.yaml()). If you do this now, you will see that the
       output is very similar to the catalog file we have provided.
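
       The validation role of user parameters, mentioned above, can be sketched in plain Python.
       This is an illustration only and does not use Intake's own classes: the function
       resolve_parameter and the dictionary keys "default", "allowed", "min" and "max" are
       assumptions made for this example, standing in for whatever a catalog author declares.

```python
def resolve_parameter(spec, value=None):
    """Return the value to use for one parameter, or raise ValueError.

    ``spec`` is a hypothetical dictionary with a "default" key and
    optional "allowed", "min" and "max" keys.
    """
    if value is None:
        # Nothing supplied by the user: fall back to the author's default.
        value = spec["default"]
    if "allowed" in spec and value not in spec["allowed"]:
        raise ValueError(f"{value!r} is not one of {spec['allowed']}")
    if "min" in spec and value < spec["min"]:
        raise ValueError(f"{value!r} is below the minimum {spec['min']}")
    if "max" in spec and value > spec["max"]:
        raise ValueError(f"{value!r} is above the maximum {spec['max']}")
    return value


# Hypothetical parameter declarations.
year_spec = {"default": 2020, "min": 2000, "max": 2030}
region_spec = {"default": "west", "allowed": ["west", "east"]}

default_year = resolve_parameter(year_spec)
chosen_region = resolve_parameter(region_spec, "east")
```

       A value outside the declared range, such as resolve_parameter(year_spec, 1999), raises
       ValueError before any data is read.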

   Installing Data Source Packages
       Intake makes it possible to create Data packages (pip or conda) that install data  sources
       into  a  global  catalog.   For example, we can install a data package containing the same
       data we have been working with:

          conda install -c intake data-us-states

       Conda      installs      the      catalog      file      in      this      package      to
       $CONDA_PREFIX/share/intake/us_states.yml.   Now,  when  we  import intake, we will see the
       data from this package appear as part of a  global  catalog  called  intake.cat.  In  this
       particular  case  we  use Dask to do the reading (which can handle larger-than-memory data
       and parallel processing), but read() would work also:

          >>> import intake
          >>> intake.cat.states.to_dask()[['state','slug']].head()
                  state        slug
          0     Alabama     alabama
          1      Alaska      alaska
          2     Arizona     arizona
          3    Arkansas    arkansas
          4  California  california

       The global  catalog  is  a  union  of  all  catalogs  installed  in  the  conda/virtualenv
       environment and also any catalogs installed in user-specific locations.

   Adding Data Source Packages using the Intake path
       Intake checks the Intake config file for catalog_path or the environment variable
       "INTAKE_PATH" for a colon-separated list of paths (semicolon on Windows) to search for
       catalog files. When you import intake, you will see all entries from all of the catalogs
       referenced as part of a global catalog called intake.cat.
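
       The platform-dependent separator described above matches Python's os.pathsep, which is
       a colon on POSIX systems and a semicolon on Windows. The sketch below shows how such a
       path list could be split. It is an assumption about the general technique, not a copy of
       Intake's own parsing code, and split_catalog_path and the example directories are
       hypothetical.

```python
import os


def split_catalog_path(value, sep=os.pathsep):
    """Split a path-list string into individual, non-empty search paths."""
    return [part for part in value.split(sep) if part]


# Hypothetical directories, using the POSIX separator explicitly.
posix_paths = split_catalog_path("/opt/catalogs:/home/me/catalogs", sep=":")

# The same idea with the Windows separator.
windows_paths = split_catalog_path(r"C:\catalogs;D:\shared", sep=";")

# With no explicit separator, the platform default (os.pathsep) is used.
# An unset variable yields an empty list of search paths.
env_paths = split_catalog_path(os.environ.get("INTAKE_PATH", ""))
```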

   Using the GUI
       A graphical data browser is available in the Jupyter notebook environment or as a
       standalone web-server. It shows the contents of any installed catalogs, and lets you
       select local and remote catalogs to browse and select entries from them. See GUI.

   Use Cases - I want to...
       Here follows a list of specific things that people may want to get done,  and  details  of
       how  Intake  can help. The details of how to achieve each of these activities can be found
       in the rest of the detailed documentation.

   Avoid copy&paste of blocks of code for accessing data
       A very common pattern, when you want to load some specific data, is to find someone,
       perhaps a colleague, who has accessed it before, and copy that code. Such a practice is
       extremely error prone and causes a proliferation of copies of code, which may evolve over
       time, with various versions simultaneously in use.

       Intake separates the concerns of data-source specification from code. The specs are stored
       separately, and all users can reference the one and only authoritative definition, whether
       in  a  shared file, a service visible to everyone or by using the Intake server. This spec
       can be updated so that everyone gets the current version instead of  relying  on  outdated
       code.

   Version control data sources
       Version  control (e.g., using git) is an essential practice in modern software engineering
       and data science. It ensures that the change history is recorded, with times, descriptions
       and authors along with the changes themselves.

       When data is specified using a well-structured syntax such as YAML, it can be checked into
       a version controlled repository in  the  usual  fashion.  Thus,  you  can  bring  rigorous
       practices to your data as well as your code.

       If  using  conda  packages  to  distribute  data specifications, these come with a natural
       internal version numbering system, such that users need only do conda update  ...  to  get
       the latest version.

   Install data
       Often, finding and grabbing data is a major hurdle to productivity. People may be required
       to download artifacts from various places or search through storage systems  to  find  the
       specific  thing  that  they  are  after.  One-line commands which can retrieve data-source
       specifications or the files themselves can be  a  massive  time-saver.  Furthermore,  each
       data-set will typically need its own code to be able to access it, and probably additional
       software dependencies.

       Intake allows you to build conda packages, which can  include  catalog  files  referencing
       online  resources, or to include data files directly in that package.  Whether uploaded to
       anaconda.org or hosted on a private enterprise channel, getting the data becomes a  single
       conda  install ... command, whereafter it will appear as an entry in intake.cat. The conda
       package brings versioning and dependency declaration for free, and  you  can  include  any
       code that may be required for that specific data-set directly in the package too.

   Update data specifications in-place
       Individual data-sets may often be static, but commonly, the "best" data to get a job done
       changes with time as new facts emerge. Alternatively, the very same data might be better
       stored in a different format which is, for instance, better suited to parallel access in
       the cloud. In such situations, you really don't want the data scientists who rely on it
       to find their code temporarily broken and be forced to change it.

       By  working  with  a  catalog  file/service  in a fixed shared location, it is possible to
       update the data source specs in-place. When users now run their code, they  will  get  the
       latest version. Because all Intake drivers have the same API, the code using the data will
       be identical and not need to be  changed,  even  when  the  format  has  been  updated  to
       something more optimised.
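
       The reason user code survives a format change can be sketched with a small dispatch
       table. This is a standard-library illustration, not Intake's driver machinery: the
       DRIVERS registry, the read function and the "payload" key are hypothetical names, and
       the data is inlined as text so that the example is self-contained.

```python
import csv
import io
import json


def read_csv_text(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_text(text):
    """Parse a JSON array of objects into a list of row dictionaries."""
    return json.loads(text)


# Driver name -> loader; every loader returns the same container type.
DRIVERS = {"csv": read_csv_text, "json": read_json_text}


def read(spec):
    """Load a source from its spec; the caller never names the format."""
    return DRIVERS[spec["driver"]](spec["payload"])


# The curator swaps the spec in place; the analyst's call stays the same.
old_spec = {"driver": "csv", "payload": "state,code\nAlabama,AL\n"}
new_spec = {"driver": "json", "payload": '[{"state": "Alabama", "code": "AL"}]'}

old_rows = read(old_spec)
new_rows = read(new_spec)
```

       Both calls return the same rows, so code written against read() does not need to change
       when the stored format does.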

   Access data stored on cloud resources
       Services  such as AWS S3, GCS and Azure Datalake (or private enterprise variants of these)
       are increasingly popular locations to amass large amounts  of  data.  Not  only  are  they
       relatively cheap per GB, but they provide long-term resilience, metadata services, complex
       access control patterns and can have very large data throughput when accessed in  parallel
       by machines on the same architecture.

       Intake  comes  with  integration  to  cloud-based  storage  out-of-the box for most of the
       file-based data formats, to be able to access the data directly in-place and in  parallel.
       For  the  few  remaining  cases where direct access is not feasible, the caching system in
       Intake allows for download of files on first use, so that  all  further   access  is  much
       faster.

   Work with Big Data
       The  era  of  Big  Data  is here! The term means different things to different people, but
       certainly implies that an individual data-set is too large to fit into  the  memory  of  a
       typical  workstation computer (>>10GB). Nevertheless, most data-loading examples available
       use functions in packages such as pandas and  expect  to  be  able  to  produce  in-memory
       representations  of  the  whole data. This is clearly a problem, and a more general answer
       should be available aside from "get more memory in your machine".

       Intake integrates with Dask and Spark, which both offer out-of-core computation (loading
       the data in chunks which fit in memory and aggregating the results) or can spread their
       work over a cluster of machines, effectively making use of the shared memory resources of
       the whole cluster. Dask integration is built into the majority of the drivers and exposed
       with the .to_dask() method, and Spark integration is available for a small number of
       drivers with a similar .to_spark() method, as well as directly with the intake-spark
       package.

       Intake also integrates with many data  services  which  themselves  can  perform  big-data
       computations, only extracting the smaller aggregated data-sets that do fit into memory for
       further analysis. Services such as SQL systems, solr, elastic-search, splunk, accumulo and
       hbase  all  can  distribute  the  work  required to fulfill a query across many nodes of a
       cluster.

   Find the right data-set
       Browsing for the data-set which will solve a particular problem can be hard, even when the
       data  have been curated and stored in a single, well-structured system. You do not want to
       rely on word-of-mouth to specify which data is right for which job.

       Intake catalogs allow for self-description of data-sets, with simple  text  and  arbitrary
       metadata,  with  a  consistent access pattern. Not only can you list the data available to
       you, but you can find out what exactly that data represents, and the form the  data  would
       take  if  loaded  (table  versus  list of items, for example). This extra metadata is also
       searchable: you can descend through a hierarchy of catalogs with a single search, and find
       all the entries containing some particular keywords.
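
       Descending through a hierarchy with a single search can be pictured as a recursive walk.
       In the sketch below, plain nested dictionaries stand in for catalogs and entries. This is
       an assumption made for illustration and not Intake's search API; the search function and
       the sample catalog contents are hypothetical.

```python
def search(catalog, keyword, prefix=""):
    """Return dotted names of entries whose description mentions ``keyword``.

    ``catalog`` maps names either to an entry (a dict with a "description"
    key) or to a nested catalog (a dict without one).
    """
    keyword = keyword.lower()
    hits = []
    for name, item in catalog.items():
        path = prefix + name
        if "description" in item:
            # A leaf entry: match against its free-text description.
            if keyword in item["description"].lower():
                hits.append(path)
        else:
            # A nested catalog: recurse, extending the dotted path.
            hits.extend(search(item, keyword, prefix=path + "."))
    return hits


catalog = {
    "states": {"description": "US state information"},
    "finance": {
        "sales": {"description": "Monthly sales by US state"},
        "payroll": {"description": "Payroll records"},
    },
}

matches = search(catalog, "state")
```

       Here matches contains both the top-level entry and the nested finance.sales entry, while
       finance.payroll is skipped because its description does not mention the keyword.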

       You can use the Intake GUI to graphically browse through your available data-sets or point
       to catalogs available to you, look through the entries listed there  and  get  information
       about each, or even show a sample of the data or quick-look plots. The GUI is also able to
       execute searches and browse file-systems to find data artifacts  of  interest.  This  same
       functionality is also available via a command-line interface or programmatically.

   Work remotely
       Interacting  with  cloud  storage  resources  is very convenient, but you will not want to
       download large amounts of data to your laptop or workstation for  analysis.  Intake  finds
       itself at home in the remote-execution world of Jupyter and Anaconda Enterprise and other
       in-browser technologies. For instance, you can run the Intake GUI either as a  stand-alone
       application  for  browsing data-sets or in a notebook for full analytics, and have all the
       runtime live on a remote machine, or perhaps a cluster which is co-located with  the  data
       storage.  Together  with  cloud-optimised  data  formats such as parquet, this is an ideal
       set-up for processing data at web scale.

   Transform data to efficient formats for sharing
       A massive amount of data exists in human-readable formats such as JSON, XML and CSV, which
       are  not very efficient in terms of space usage and need to be parsed on load to turn into
       arrays or tables. Much faster processing times can be had with modern  compact,  optimised
       formats, such as parquet.

       Intake  has  a "persist" mechanism to transform any input data-source into the format most
       appropriate for that type of data, e.g., parquet for tabular data. The persisted data will
       be  used  in  preference at analysis time, and the schedule for updating from the original
       source is configurable. The location of these  persisted  data-sets  can  be  shared  with
       others, so they can also gain the benefits, or the "export" variant can be used to produce
       an independent version in the same format, together with a spec to reference  it  by;  you
       would then share this spec with others.

   Access data without leaking credentials
       Security  is  important.  Users'  identity  and  authority to view specific data should be
       established before handing over any sensitive bytes. It is, unfortunately, all too  common
       for  data scientists to include their username, passwords or other credentials directly in
       code, so that it can run automatically, thus presenting a potential security gap.

       Intake does not manage credentials or user identities directly, but does provide hooks for
       fetching details from the environment or another service, and using the values in
       templating at the time of reading the data. Thus, the details are not included in the
       code, but every access still requires them to be present.

       In other cases, you may want to require the user to provide their credentials every time,
       rather than establish them automatically, and "user parameters" can be specified in Intake
       to cover this case.
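
       The general technique of filling credentials in from the environment at read time can be
       sketched with the standard library's string.Template. This is a stand-in for Intake's own
       templating, which is not shown here; render_uri, the variable names DB_USER and
       DB_PASSWORD, and the example URI are all hypothetical.

```python
import os
from string import Template


def render_uri(template, environ=os.environ):
    """Fill ``$NAME`` placeholders from the environment at read time.

    Raises KeyError if a required variable is missing, so access fails
    loudly rather than proceeding without credentials.
    """
    return Template(template).substitute(environ)


# Hypothetical variable names and URI; nothing secret lives in the code.
uri_template = "postgresql://$DB_USER:$DB_PASSWORD@db.example.com/sales"

# A stand-in mapping is used here so the example does not depend on the
# real environment; in practice the default os.environ would be used.
fake_env = {"DB_USER": "analyst", "DB_PASSWORD": "s3cret"}

uri = render_uri(uri_template, environ=fake_env)
```

       The template string can be committed and shared safely, since the secret values only
       exist in the environment of the process that reads the data.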

   Establish a data gateway
       The Intake server protocol gives you fine-grained control over the set of data sources
       that are listed, and exactly what to return to a user when they want to read some of  that
       data. This is an ideal opportunity to include authorisation checks, audit logging, and any
       more complicated access patterns, as required.

       By streaming the data through a single channel on the server, rather than  allowing  users
       direct access to the data storage backend, you can log and verify all access to your data.

   Clear distinction between data curator and analyst roles
       It  is  desirable to separate out two tasks: the definition of data-source specifications,
       and accessing and using data. This lets those who understand the origins of the data
       and the implications of various formats and other storage options (such as chunk-size)
       make those decisions and encode what they have done into specs. It leaves the data
       users,  e.g.,  data  scientists,  free to find and use the data-sets appropriate for their
       work and simply get on with their job - without having  to  learn  about  various  storage
       formats and access APIs.

       This separation is at the very core of what Intake was designed to do.

   Users to be able to access data without learning every backend API
       Data  formats  and  services are a wide mess of many libraries and APIs. A large amount of
       time can be wasted in the life of a data scientist or engineer in finding out the  details
       of  the  ones  required by their work. Intake wraps these various libraries, REST APIs and
       similar, to provide a consistent experience for the data user. source.read()  will  simply
       get  all  of  the  data  into  memory  in  the container type for that source - no further
       parameters or knowledge required.

       Even for the curator of data catalogs or data driver authors, the framework established by
       Intake  provides  a lot of convenience and simplification which allows each person to deal
       with only the specifics of their job.

   Data sources to be self-describing
       Having a bunch of files in some directory is a very common pattern for data storage in the
       wild.  There  may  or  may  not  be  a README file co-located giving some information in a
       human-readable form, but generally not structured - such files are  usually  different  in
       every case.

       When  a data source is encoded into a catalog, the spec offers a natural place to describe
       what that data is, along with the possibility to provide an arbitrary amount of structured
       metadata  and  to  describe  any  parameters  that  are  to  be  exposed  for user choice.
       Furthermore, Intake data sources each have a particular container type, so that users know
       whether to expect a dataframe, array, etc., and offer simple introspection methods like
       describe and discover, which return basic information about the data without having to
       load all of it into memory first.

   A data source hierarchy for natural structuring
       Usually, the data sources held by an organisation have relationships to one another, and
       would be poorly served by a simple flat list of everything available. Intake allows
       catalogs to refer to other catalogs. This means that you can group data sources by
       various facets (type, department, time...) and establish hierarchical data-source trees
       within which to find the particular data most likely to be of interest. Since the
       catalogs live outside and separate from the data files themselves, you can create as many
       hierarchy structures as you find useful.

       For  even  more  complicated  data source meta-structures, it is possible to store all the
       details and even metadata in some external service (e.g.,  traditional  SQL  tables)  with
       which  Intake  can  interact  to  perform  queries  and  return  particular subsets of the
       available data sources.

   Expose several data collections under a single system
       There are already several catalog-like data services in existence in the world, and some
       organisations may have several of these in-house for different purposes. For
       example, an SQL server may hold details of customer lists and transactions, but historical
       time-series  and  reference  data  may  be  held  separately in archival data formats like
       parquet on a file-storage system; while real-time system monitoring is done by  a  totally
       unrelated system such as Splunk or elastic search.

       Of  course,  Intake  can read from various file formats and data services. However, it can
       also interpret the internal conception of data catalogs that some data services may  have.
       For  example, all of the tables known to the SQL server, or all of the pre-defined queries
       in Splunk can be automatically included as  catalogs  in  Intake,  and  take  their  place
       amongst  the  regular  YAML-specified data sources, with exactly the same usage for all of
       them.

       These data sources and their hierarchical structure can then be exposed via the  graphical
       data browser, for searching, selecting and visualising data-sets.

   Modern visualisations for all data-sets
       Intake  is  integrated with the comprehensive holoviz suite, particularly hvplot, to bring
       simple yet powerful data visualisations to any Intake data source by using just one single
       method  for  everything.  These plots are interactive, and can include server-side dynamic
       aggregation of very large data-sets to display more  data  points  than  the  browser  can
       handle.

       You can define specific plot types right in the data source definition, to have these
       customised visualisations available to the user as simple one-liners known to  reveal  the
       content  of  the  data,  or  even view the same visuals right in the graphical data source
       browser application.  Thus,  Intake  is  already  an  all-in-one  data  investigation  and
       dashboarding app.

   Update data specifications in real time
       Intake data catalogs are not limited to reading static specifications from files. They can
       also execute queries on remote data services and return lists of data sources  dynamically
       at  runtime.  New  data  sources may appear, for example, as directories of data files are
       pushed to a storage service, or new tables are created within a SQL server.

   Distribute data in a custom format
       Sometimes, the well-known data formats are just not right for  a  given  data-set,  and  a
       custom-built format is required. In such cases, the code to read the data may not exist in
       any  library.  Intake  allows  for  code  to  be  distributed  along  with   data   source
       specs/catalogs  or  even  files  in  a single conda package.  That encapsulates everything
       needed to describe and use that particular data, and can then be distributed as  a  single
       entity, and installed with a one-liner.

       Furthermore,  should  the  few builtin container types (sequence, array, dataframe) not be
       sufficient, you can supply your own, and then build drivers that use it.  This  was  done,
       for  example,  for  xarray-type data, where multiple related N-D arrays share a coordinate
       system and metadata.  By  creating  this  container,  a  whole  world  of  scientific  and
       engineering data was opened up to Intake. Creating new containers is not hard, though, and
       we foresee more coming, such as machine-learning models and streaming/real-time data.

   Create Intake data-sets from scratch
       If you have a set of files or a data service which you wish to make into  a  data-set,  so
       that  you  can include it in a catalog, you should use the set of functions intake.open_*,
       where you need to pick the function appropriate for your  particular  data.  You  can  use
       tab-completion to list the set of data drivers you have installed, and find others you may
       not yet have installed at Plugin Directory.  Once you have determined  the  right  set  of
       parameters  to  load  the  data  in  the manner you wish, you can use the source's .yaml()
       method to find the spec that describes the source, so you can insert  it  into  a  catalog
       (with  appropriate description and metadata). Alternatively, you can open a YAML file as a
       catalog with intake.open_catalog and use its .add() method to insert the source  into  the
       corresponding file.

       If,  instead,  you  have data in your session in one of the containers supported by Intake
       (e.g., array, data-frame), you can use the intake.upload() function to save it to files in
       an  appropriate  format  and  a  location  you  specify,  and  give you back a data-source
       instance, which, again, you can use with .yaml() or .add(), as above.

   Overview
   Introduction
       This page describes the technical design of Intake, with brief details of the aims of the
       project and components of the library.

   Why Intake?
       Intake solves a related set of problems:

       • Python  API  standards  for  loading  data  (such  as  DB-API  2.0)  are  optimized  for
         transactional databases and query results that are processed one row at a time.

       • Libraries that do load data in bulk tend to each have their own API for doing so,  which
         adds friction when switching data formats.

       • Loading  data  into  a  distributed  data structure (like those found in Dask and Spark)
         often requires writing a separate loader.

       • Abstractions often focus on just  one  data  model  (tabular,  n-dimensional  array,  or
         semi-structured), when many projects need to work with multiple kinds of data.

       Intake  has  the  explicit goal of not defining a computational expression system.  Intake
       plugins load the data into containers (e.g., arrays or  data-frames)  that  provide  their
       data processing features.  As a result, it is very easy to make a new Intake plugin with a
       relatively small amount of Python.

   Structure
       Intake is a Python library for accessing data in a simple and uniform way.  It consists of
       three parts:

       1.  A  lightweight  plugin  system for adding data loader drivers for new file formats and
       servers (like databases, REST endpoints or other cataloging services)

       2. A cataloging system for specifying these sources in simple YAML syntax, or with plugins
       that read source specs from some external data service

       3.  A server-client architecture that can share data catalog metadata over the network, or
       even stream the data directly to clients if needed

       Intake supports loading data into standard Python  containers.  The  list  can  be  easily
       extended, but the currently supported list is:

       • Pandas Dataframes - tabular data

       • NumPy Arrays - tensor data

       • Python lists of dictionaries - semi-structured data

       Additionally,  Intake  can  load  data  into  distributed  data  structures.  Currently it
       supports Dask, a flexible parallel computing  library  with  distributed  containers  like
       dask.dataframe,  dask.array,  and  dask.bag.   In  the future, other distributed computing
       systems could use Intake to create similar data structures.

   Concepts
       Intake is built out of four core concepts:

       • Data Source classes: the "driver" plugins that each implement loading of  some  specific
         type of data into python, with plugin-specific arguments.

       • Data  Source:  An  object  that  represents  a  reference to a data source.  Data source
         instances have methods for loading  the  data  into  standard  containers,  like  Pandas
         DataFrames, but do not load any data until specifically requested.

       • Catalog:  An  inventory of catalog entries, each of which defines a Data Source. Catalog
         objects can be created from local YAML definitions, by connecting to remote servers,  or
         by some driver that knows how to query an external data service.

       • Catalog  Entry:  A  named data source held internally by catalog objects, which generate
         data source instances when accessed.  The catalog  entry  includes  metadata  about  the
         source, as well as the name of the driver and arguments. Arguments can be parameterized,
         allowing one entry to return different subsets of data depending on the user request.

       The business of a plugin is to go from some data format (bunch of  files  or  some  remote
       service) to a "Container" of the data (e.g., data-frame), a thing on which you can perform
       further analysis.  Drivers can be used directly by the user, or  indirectly  through  data
       catalogs.  Data sources can be pickled, sent over the network to other hosts, and reopened
       (assuming the remote system has access to the required files or servers).
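
       As a rough illustration of how these concepts relate, the sketch below uses two
       hypothetical classes, ToySource and ToyEntry, plus a plain dictionary standing in for a
       catalog. None of these names are part of Intake; they only mimic the behaviour described
       above: an entry is a named recipe with default arguments, and the source it produces
       reads nothing until read() is called.

```python
class ToySource:
    """Hypothetical stand-in for a data source: holds arguments, loads lazily."""

    def __init__(self, values):
        self.values = values
        self.loaded = False  # nothing is read at construction time

    def read(self):
        # Data is only materialised when explicitly requested.
        self.loaded = True
        return list(self.values)


class ToyEntry:
    """Hypothetical catalog entry: a named recipe that builds sources on demand."""

    def __init__(self, name, description, make_source, defaults):
        self.name = name
        self.description = description
        self._make_source = make_source
        self._defaults = defaults

    def __call__(self, **overrides):
        # User-supplied parameters take precedence over the entry's defaults.
        params = {**self._defaults, **overrides}
        return self._make_source(**params)


def numbers(start, stop):
    """Toy 'driver': turns two arguments into a source of integers."""
    return ToySource(range(start, stop))


# A plain dict plays the role of the catalog (an inventory of named entries).
catalog = {
    "small_numbers": ToyEntry(
        "small_numbers", "integers in a range", numbers, {"start": 0, "stop": 5}
    )
}

source = catalog["small_numbers"](stop=3)  # parameterized access
loaded_before_read = source.loaded
data = source.read()
```

       The same entry can hand out differently parameterized sources, which is the property
       that lets one catalog entry describe several subsets of a data-set.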

       See also the Glossary.

   Future Directions
       Ongoing work for enhancements, as well as requests for plugins, etc., can be found at  the
       issue tracker. See the Roadmap for general mid- and long-term goals.

   Examples
       Here  we list links to notebooks and other code demonstrating the use of Intake in various
       scenarios. The first section is of general interest to various  users,  and  the  sections
       that follow tend to be more specific about particular features and workflows.

       Many of the entries here include a link to Binder, which is a service that lets you execute
       code live in a notebook environment. This is a great way to experience using Intake. It
       can sometimes take a while for Binder to come up; please have patience.

       See also the examples repository, containing data-sets which can be built and installed as
       conda packages.

   General
       • Basic Data scientist workflow: using Intake [Static] [Executable].

       • Workflow for creating catalogs: a Data Engineer's approach to Intake [Static]
         [Executable]

   Developer
       Tutorials delving deeper into the Internals of Intake, for those who wish to contribute

       • How you would go about writing a new plugin [Static] [Executable]

   Features
       More specific examples of Intake functionality

       • Caching:

            • New-style data package creation [Static]

            • Using automatically cached data-files [Static] [Executable]

            • Earth science demonstration of cached dataset [Static] [Executable]

       • File-name pattern parsing:

            • Satellite imagery, science workflow [Static] [Executable]

            • How to set up pattern parsing [Static] [Executable]

       • Custom catalogs:

            • A custom intake plugin that adapts DCAT catalogs [Static] [Executable]

   Data
       • Anaconda package data, originally announced in this blog

       • Planet Four Catalog, originally from https://www.planetfour.org/results

       • The official Intake examples

   Blogs
       These are Intake-related articles that may be of interest.

       • Discovering and Exploring Data in a Graphical Interface

       • Taking the Pain out of Data Access

       • Caching Data on First Read Makes Future Analysis Faster

       • Parsing Data from Filenames and Paths

       • Intake for cataloguing Spark

       • Intake released on Conda-Forge

   Talks
       • __init__ podcast interview (May 2019)

       • AnacondaCon (March 2019)

       • PyData DC (November 2018)

       • PyData NYC (October 2018)

       • ESIP tech dive (November 2018)

   News
       • See our Wiki page

   Deployment Scenarios
       In  the  following  sections, we will describe some of the ways in which Intake is used in
       real production systems. These go well beyond the typical  YAML  files  presented  in  the
       quickstart  and  examples  sections,  which  are  necessarily short and simple, and do not
       demonstrate the full power of Intake.

   Sharing YAML files
       This is the simplest scenario,  and  amply  described  in  these  documents.  The  primary
       advantage is simplicity: it is enough to put a file in an accessible place (even a gist or
       repo), in order for someone else to be able to discover and load that  data.  Furthermore,
       such files can easily refer to one another, to build up a full tree of data assets with
       minimum pain. Since YAML files are text, this also lends itself to working well with
       version  control  systems.   Furthermore, all sources can describe themselves as YAML, and
       the export and upload commands can produce an efficient format (possibly remote)  together
       with YAML definition in a single step.

   Pangeo
       The  Pangeo  collaboration uses Intake to catalog their data holdings, which are generally
       in various forms of netCDF-compliant formats, massive multi-dimensional arrays  with  data
       relating  to  earth  and  climate  science and meteorology. On their cloud-based platform,
       containers start up jupyter-lab sessions which have Intake installed,  and  therefore  can
       simply  pick  and  load  the  data that each researcher needs - often requiring large Dask
       clusters to actually do the processing.

       A static rendering of the catalog contents is available, so  that  users  can  browse  the
       holdings  without  even starting a python session. This rendering is produced by CI on the
       repo whenever new definitions are added, and it  also  checks  (using  Intake)  that  each
       definition is indeed loadable.

       Pangeo also developed intake-stac, which can talk to STAC servers to make real-time
       queries and parse the results into Intake data sources. STAC is a standard for
       spatio-temporal data assets, and indexes massive amounts of cloud-stored data.

   Anaconda Enterprise
       Intake  will  be  the  basis  of  the  data  access and cataloging service within Anaconda
       Enterprise,  running  as  a  micro-service  in  a  container,  and  offering  data  source
       definitions  to  users. The access control, who gets to see which data-set, and serving of
       credentials to be able to read from the various data storage services, will all be handled
       by the platform and be fully configurable by admins.

   National Center for Atmospheric Research
       NCAR  has  developed  intake-esm,  a mechanism for creating file-based Intake catalogs for
       climate data from project efforts such as the Coupled Model Intercomparison Project (CMIP)
       and  the  Community  Earth  System  Model  (CESM)  Large Ensemble Project.  These projects
       produce a huge amount of climate data persisted on tape and disk storage components across
       multiple (of the order of ~300,000) netCDF files. Finding, investigating and loading these
       files into data array containers such as xarray can be a daunting task due to the large
       number of files a user may be interested in. Intake-esm addresses this issue in three steps:

       • Dataset Catalog Curation in the form of YAML files. These YAML files provide information
         about data locations, access pattern, directory structure, etc. intake-esm uses these
         YAML files in conjunction with file name templates to construct a local database. Each
         row in this database consists of a set of metadata such as experiment, modeling realm,
         frequency corresponding to data contained in one netCDF file.

          cat = intake.open_esm_metadatastore(catalog_input_definition="GLADE-CMIP5")

       • Search  and  Discovery: once the database is built, intake-esm can be used for searching
         and discovering of climate datasets by  eliminating  the  need  for  the  user  to  know
         specific locations (file path) of their data set of interest:

          sub_cat = cat.search(variable=['hfls'], frequency='mon', modeling_realm='atmos', institute=['CCCma', 'CNRM-CERFACS'])

       • Access:  when  the  user  is  satisfied  with  the  results of their query, they can ask
         intake-esm to load the actual netCDF files into xarray datasets:

          dsets = cat.to_xarray(decode_times=True, chunks={'time': 50})
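
       To make the search step more concrete, here is a small stand-alone sketch of keyword
       filtering over a table of per-file metadata. It does not use intake-esm; the rows, the
       file names and the search_rows helper are invented for illustration, and it assumes
       that a list-valued argument means "any of these values".

```python
# Toy metadata table: one row per netCDF file (all values are made up).
rows = [
    {"variable": "hfls", "frequency": "mon", "modeling_realm": "atmos",
     "institute": "CCCma", "path": "a.nc"},
    {"variable": "hfls", "frequency": "day", "modeling_realm": "atmos",
     "institute": "CCCma", "path": "b.nc"},
    {"variable": "tas", "frequency": "mon", "modeling_realm": "atmos",
     "institute": "CNRM-CERFACS", "path": "c.nc"},
    {"variable": "hfls", "frequency": "mon", "modeling_realm": "atmos",
     "institute": "CNRM-CERFACS", "path": "d.nc"},
]


def search_rows(rows, **query):
    """Keep rows where every queried column equals the value or is in the list."""

    def matches(row):
        for column, wanted in query.items():
            # Assumption: a scalar is treated as a one-element list of choices.
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if row.get(column) not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


hits = search_rows(
    rows,
    variable=["hfls"],
    frequency="mon",
    modeling_realm="atmos",
    institute=["CCCma", "CNRM-CERFACS"],
)
paths = [row["path"] for row in hits]
```

       The user never supplies a file path: the query narrows the table, and the surviving rows
       carry the locations to load.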

   Brookhaven Archive
       The Bluesky project uses Intake to dynamically query a MongoDB instance, which  holds  the
       details of experimental and simulation data catalogs, to return a custom Catalog for every
       query. Data-sets can then be loaded into python, or the original raw data can be  accessed
       ...

   Zillow
       Zillow  is  developing  Intake  to meet the needs of their datalake access layer (DAL), to
       encapsulate the highly hierarchical nature of their data. Of particular importance is the
       ability to provide different versions (testing/production, and different storage formats)
       of the same logical dataset, depending on whether it is being read on a laptop versus  the
       production infrastructure ...

   Intake Server
       The server protocol (see Server Protocol) is simple enough that anyone can write their own
       implementation with full customisation and behaviour. In particular, auth  and  monitoring
       would be essential for a production-grade deployment.

USER GUIDE

       More  detailed information about specific parts of Intake, such as how to author catalogs,
       how to use the graphical interface, plotting, etc.

   GUI
   Using the GUI
       Note: the GUI requires panel and bokeh to be available in the current environment.

       The Intake top-level singleton intake.gui gives access to a graphical data browser  within
       the  Jupyter  notebook.  To  expose  it,  simply  enter  it  into  a  code  cell  (Jupyter
       automatically displays the last object in a code cell).  [image]

       New instances of the GUI are also  available  by  instantiating  intake.interface.gui.GUI,
       where you can specify a list of catalogs to initially include.

       The GUI contains three main areas:

       • a  list  of  catalogs.  The  "builtin" catalog, displayed by default, includes data-sets
         installed in the system, the same as intake.cat.

       • a list of sources within the currently selected catalog.

       • a description of the currently selected source.

   Catalogs
       Selecting a catalog from the list will  display  nested  catalogs  below  the  parent  and
       display source entries from the catalog in the list of sources.

       Below  the  lists  of  catalogs is a row of buttons that are used for adding, removing and
       searching-within catalogs:

       • Add: opens a sub-panel for adding catalogs to the interface, by either  browsing  for  a
         local YAML file or by entering a URL for a catalog, which can be a remote file or Intake
         server

       • Remove: deletes the currently selected catalog from the list

       • Search: opens a sub-panel for finding entries in the currently selected catalog (and its
         sub-catalogs)

   Add Catalogs
       The  Add  button  (+)  exposes  a  sub-panel  with  two  main  ways to add catalogs to the
       interface: [image]

       This panel has a tab to load files from local; from  that  you  can  navigate  around  the
       filesystem  using  the  arrow  or by editing the path directly. Use the home button to get
       back to the starting place. Select the catalog file you need. Use the "Add Catalog" button
       to add the catalog to the list above.  [image]

       Another tab loads a catalog from remote. Any URL is valid here, including cloud locations,
       "gcs://bucket/...",  and  intake  servers,  "intake://server:port".  Without  a   protocol
       specifier,  this  can  be  a  local  path.  Again, use the "Add Catalog" button to add the
       catalog to the list above.  [image]

       Finally, you can add catalogs to the interface in code, using the .add() method, which can
       take filenames, remote URLs or existing Catalog instances.

   Remove Catalogs
       The  Remove  button  (-)  deletes  the  currently  selected  catalog  from the list. It is
       important to note that this action does not have any impact on files, it only affects what
       shows up in the list.  [image]

   Search
       The  sub-panel  opened  by  the  Search  button  (🔍) allows the user to search within the
       selected catalog [image]

       From the Search sub-panel the user enters free-form text. Since some catalogs contain
       nested  sub-catalogs,  the  Depth  selector  allows the search to be limited to the stated
       number of nesting levels.  This may be necessary, since, in theory, catalogs  can  contain
       circular references, and therefore allow for infinite recursion.  [image]

       Upon  execution  of  the  search, the currently selected catalog will be searched. Entries
       will be considered to match if any of the entered words is found in the description of the
       entry  (this  is  case-insensitive). If any matches are found, a new entry will be made in
       the catalog list, with the suffix "_search".  [image]
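
       The matching rule and the role of the Depth selector can be sketched in a few lines.
       This is only an illustration, not the GUI's code: catalogs are modelled as nested
       dictionaries, the "entries" key and the search function are hypothetical, and plain
       substring matching is assumed.

```python
def search(catalog, text, depth=2):
    """Return {name: entry} for entries whose description contains any word of text."""
    words = text.lower().split()
    matches = {}
    for name, entry in catalog.items():
        description = entry.get("description", "").lower()
        if any(word in description for word in words):
            matches[name] = entry
        # Descend into nested catalogs only while depth allows; this bounds the
        # recursion even when catalogs refer to each other in a cycle.
        if "entries" in entry and depth > 1:
            nested_hits = search(entry["entries"], text, depth - 1)
            for sub_name, sub_entry in nested_hits.items():
                matches[f"{name}.{sub_name}"] = sub_entry
    return matches


nested = {"description": "Nested catalog of climate data", "entries": {}}
toy_catalog = {
    "sea_ice": {"description": "Arctic/Antarctic Sea Ice"},
    "states": {"description": "US state boundaries"},
    "climate": nested,
}
# A circular reference: the nested catalog points back at the top level.
nested["entries"]["loop"] = {"description": "back to the top", "entries": toy_catalog}
nested["entries"]["ice_cores"] = {"description": "Ice core samples"}

found = sorted(search(toy_catalog, "ICE ocean", depth=2))
shallow = sorted(search(toy_catalog, "ICE ocean", depth=1))
```

       With a depth of 1 only the top-level entry matches; with a depth of 2 the nested
       "ice_cores" entry is found as well, and the circular "loop" entry is never followed
       indefinitely.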

   Sources
       Selecting a source from the list updates the description text on the left-side of the gui.

       Below the list of sources is a row of buttons for inspecting the selected data source:

       • Plot: opens a sub-panel for viewing the pre-defined (specified in the  yaml)  plots  for
         the selected source.

   Plot
       The  Plot  button  (📊)  opens  a  sub-panel  with  an area for viewing pre-defined plots.
       [image]

       These plots are specified in the catalog yaml and that yaml can be displayed  by  checking
       the box next to "show yaml".  [image]

       The     holoviews     object     can     be     retrieved     from     the    gui    using
       intake.interface.source.plot.pane.object, and you can then use it in Python or  export  it
       to a file.

   Interactive Visualization
       If  you  have installed the optional extra packages dfviz and xrviz, you can interactively
       plot your dataframe or array data, respectively.  [image]

       The button "customize" will be available for data sources of the appropriate type.   Click
       this  to  open  the  interactive interface. If you have not selected a predefined plot (or
       there are none), then the interface will start without any prefilled values, but if you do
       first select a plot, then the interface will have its options pre-filled from the options
       of that plot.

       For  specific  instructions  on  how  to  use  the  interfaces  (which  can  also  be used
       independently of the Intake GUI), please navigate to the linked documentation.

       Note that the final parameters that are sent to hvPlot to produce the output each time a
       plot is updated are explicitly available in YAML format, so that you can save the state
       as a "predefined plot" in the catalog. The same set of parameters can also be used in
       code, with datasource.plot(...).  [image]

   Using the Selection
       Once catalogs are loaded and the desired sources have been identified and selected, the
       selected sources will be available at the .sources attribute (intake.gui.sources). Each
       source  entry  has  informational methods available and can be opened as a data source, as
       with any catalog entry:

          In [ ]: source_entry = intake.gui.sources[0]
                  source_entry
          Out   :
          name: sea_ice_origin
          container: dataframe
          plugin: ['csv']
          description: Arctic/Antarctic Sea Ice
          direct_access: forbid
          user_parameters: []
          metadata:
          args:
            urlpath: https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv

          In [ ]: data_source = source_entry()  # may specify parameters here
                  data_source.read()
          Out   : < some data >

          In [ ]: source_entry.plot()  # or skip data source step
          Out   : < graphics>

   Catalogs
       Data catalogs provide an abstraction that allows you to externally define, and  optionally
       share,  descriptions  of  datasets, called catalog entries.  A catalog entry for a dataset
       includes information like:

       • The name of the Intake driver that can load the data

       • Arguments to the __init__() method of the driver

       • Metadata provided by the catalog author (such as field descriptions and types,  or  data
         provenance)

       In  addition,  Intake  allows  the  arguments  to  data  sources to be templated, with the
       variables explicitly expressed as "user parameters". The given arguments are rendered
       using jinja2, the values of named user parameters, and any overrides. The parameters
       also offer validation of the allowed types and values, for both the template values and
       the final arguments passed to the data source. The parameters are named and described, to
       indicate to the user what they are for. This  kind  of  structure  can  be  used  to,  for
       example,  choose between two parts of a given data source, like "latest" and "stable", see
       the entry1_part entry in the example below.

       The user of the catalog can always override any template or argument  value  at  the  time
       that they access a given source.
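
       The interplay of defaults, validation and template rendering can be sketched without
       Intake or jinja2. In the example below, resolve_parameter and render are hypothetical
       helpers, and the regular-expression substitution is only a tiny stand-in for real
       jinja2 rendering; the parameter spec mirrors the "part" parameter of the entry1_part
       entry.

```python
import re

# Assumption: only these simple type names are handled by this sketch.
TYPES = {"str": str, "int": int, "float": float}


def resolve_parameter(spec, value=None):
    """Pick the user's value or the default, then check type and allowed values."""
    if value is None:
        value = spec["default"]
    if not isinstance(value, TYPES[spec["type"]]):
        raise TypeError(f"expected {spec['type']}, got {type(value).__name__}")
    if "allowed" in spec and value not in spec["allowed"]:
        raise ValueError(f"{value!r} is not one of {spec['allowed']}")
    return value


def render(template, context):
    """Replace each {{ name }} with context[name]; a minimal stand-in for jinja2."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}", lambda m: str(context[m.group(1)]), template
    )


part_spec = {
    "description": "section of the data",
    "type": "str",
    "default": "stable",
    "allowed": ["latest", "stable"],
}
template = "{{ CATALOG_DIR }}/entry1_{{ part }}.csv"

default_path = render(
    template, {"CATALOG_DIR": "/data", "part": resolve_parameter(part_spec)}
)
latest_path = render(
    template, {"CATALOG_DIR": "/data", "part": resolve_parameter(part_spec, "latest")}
)

try:
    resolve_parameter(part_spec, "nightly")
    rejected = False
except ValueError:
    rejected = True  # values outside "allowed" are refused before rendering
```

       Validating before rendering means a bad override fails with a clear message instead of
       producing a path that silently points nowhere.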

   The Catalog class
       In  Intake,  a  Catalog instance is an object with one or more named entries.  The entries
       might be read from a static file (e.g., YAML, see the next section), from an Intake server
       or  from  any  other  data  service  that  has a driver. Drivers which create catalogs are
       ordinary DataSource classes, except that they have the container type  "catalog",  and  do
       not return data products via the read() method.

       For  example,  you  might  choose  to  instantiate the base class and fill in some entries
       explicitly in your code:

          from intake.catalog import Catalog
          from intake.catalog.local import LocalCatalogEntry
          mycat = Catalog.from_dict({
              'source1': LocalCatalogEntry(name, description, driver, args=...),
              ...
              })

       Alternatively, subclasses of Catalog can define how entries  are  created  from  whichever
       file   format  or  service  they  interact  with,  examples  including  RemoteCatalog  and
       SQLCatalog. These generate  entries  based  on  their  respective  targets;  some  provide
       advanced search capabilities executed on the server.

   YAML Format
       Intake  catalogs  can most simply be described with YAML files. This is very common in the
       tutorials and this documentation, because it is simple to understand, but demonstrates the
       many features of Intake. Note that YAML files are also the easiest way to share a catalog,
       simply by copying to a publicly-available location such as a cloud storage  bucket.   Here
       is an example:

          metadata:
            version: 1
            parameters:
              file_name:
                type: str
                description: default file name for child entries
                default: example_file_name
          sources:
            example:
              description: test
              driver: random
              args: {}

            entry1_full:
              description: entry1 full
              metadata:
                foo: 'bar'
                bar: [1, 2, 3]
              driver: csv
              args: # passed to the open() method
                urlpath: '{{ CATALOG_DIR }}/entry1_*.csv'

            entry1_part:
              description: entry1 part
              parameters: # User parameters
                part:
                  description: section of the data
                  type: str
                  default: "stable"
                  allowed: ["latest", "stable"]
              driver: csv
              args:
                urlpath: '{{ CATALOG_DIR }}/entry1_{{ part }}.csv'

            entry2:
              description: entry2
              driver: csv
              args:
                # file_name parameter will be inherited from file-level parameters, so will
                # default to "example_file_name"
                urlpath: '{{ CATALOG_DIR }}/entry2/{{ file_name }}.csv'

   Metadata
       Arbitrary extra descriptive information can go into the metadata section. Some fields will
       be claimed for internal use and some fields may be restricted to local  reading;  but  for
       now  the  only  field  that  is expected is version, which will be updated when a breaking
       change is made to the file format. Any catalog will have .metadata and .version attributes
       available.

       Note that each source also has its own metadata.

       The metadata section can also contain parameters which will be inherited by the sources in
       the file (note that these sources can augment these  parameters,  or  override  them  with
       their own parameters).
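
       One way to picture this inheritance is as a dictionary merge in which the source's own
       parameters win. The effective_parameters helper below is hypothetical, and it assumes
       that an overriding parameter replaces the inherited one as a whole rather than field by
       field.

```python
def effective_parameters(file_level, source_level):
    """Merge file-level parameters with a source's own; the source wins on clashes."""
    merged = {name: dict(spec) for name, spec in file_level.items()}
    for name, spec in source_level.items():
        # A source may add a new parameter or replace an inherited one wholesale.
        merged[name] = dict(spec)
    return merged


file_level = {"file_name": {"type": "str", "default": "example_file_name"}}

# A source with no parameters of its own simply inherits.
inherited = effective_parameters(file_level, {})

# A source that overrides file_name and adds a parameter of its own.
customised = effective_parameters(
    file_level,
    {
        "file_name": {"type": "str", "default": "other_name"},
        "part": {"type": "str", "default": "stable"},
    },
)
```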

   Extra drivers
       The  driver:  entry of a data source specification can be a driver name, as has been shown
       in the examples so far.  It can also be an absolute class path to use for the data source,
       in which case there will be no ambiguity about how to load the data. That is the
       preferred way to be explicit, when the  driver  name  alone  is  not  enough  (see  Driver
       Selection, below).

          plugins:
            source:
              - module: intake.catalog.tests.example1_source
          sources:
            ...

       However,  you  do not, in general, need to do this, since the driver: field of each source
       can also explicitly refer to the plugin class.

   Sources
       The majority of a catalog file is composed of data sources, which are named data sets that
       can be loaded for the user. Catalog authors describe the contents of the data set, how to
       load it, and optionally offer some customization of the returned data.  Each  data  source
       has several attributes:

       • name:  The  canonical name of the source.  Best practice is to compose source names from
         valid Python identifiers.  This allows Intake to support things like tab  completion  of
         data  source  names on catalog objects.  For example, monthly_downloads is a good source
         name.

       • description: Human readable description of the source.  To help catalog browsing  tools,
         the description should be Markdown.

       • driver:  Name  of  the  Intake  Driver  to use with this source.  Must either already be
         installed in the current Python environment (i.e. with conda or pip) or  loaded  in  the
         plugin  section  of the file. Can be a simple driver name like "csv" or the full path to
         the implementation class like "package.module.Class".

       • args: Keyword arguments to the init method of the driver.  Arguments  may  use  template
         expansion.

       • metadata:  Any  metadata  keys  that  should be attached to the data source when opened.
         These will be supplemented by additional  metadata  provided  by  the  driver.   Catalog
         authors  can  use  whatever  key  names  they  would  like, with the exception that keys
         starting with a leading underscore are reserved for future internal use by Intake.

       • direct_access: Control whether the data is directly accessed by the client,  or  proxied
         through a catalog server.  See Server Catalogs for more details.

       • parameters: A dictionary of data source parameters.  See below for more details.
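
       Two of the conventions above (identifier-like names, and no metadata keys with a leading
       underscore) are easy to check mechanically. The check_source function below is a
       hypothetical lint helper operating on a plain dictionary; it is not something Intake
       provides.

```python
def check_source(name, spec):
    """Return a list of human-readable problems found in a source specification."""
    problems = []
    if not name.isidentifier():
        # Non-identifier names cannot be tab-completed as attributes.
        problems.append(f"name {name!r} is not a valid Python identifier")
    for key in spec.get("metadata", {}):
        if key.startswith("_"):
            problems.append(f"metadata key {key!r} uses a reserved leading underscore")
    return problems


good = check_source(
    "monthly_downloads",
    {"driver": "csv", "args": {"urlpath": "downloads.csv"}, "metadata": {"owner": "web"}},
)
bad = check_source(
    "monthly-downloads",
    {"driver": "csv", "args": {"urlpath": "downloads.csv"}, "metadata": {"_internal": 1}},
)
```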

   Caching Source Files Locally
       This method of defining the cache with a dedicated block is deprecated; see the Remote
       Access section, below.

       To enable caching on the first read of remote data source files,  add  the  cache  section
       with the following attributes:

       • argkey: The args section key which contains the URL(s) of the data to be cached.

       • type: One of the keys in the cache registry [intake.source.cache.registry], referring to
         an implementation of caching behaviour. The default is "file" for the caching of one  or
         more files.

       Example:

          test_cache:
            description: cache a csv file from the local filesystem
            driver: csv
            cache:
              - argkey: urlpath
                type: file
            args:
              urlpath: '{{ CATALOG_DIR }}/cache_data/states.csv'

       The   cache_dir   defaults  to  ~/.intake/cache,  and  can  be  specified  in  the  intake
       configuration file or INTAKE_CACHE_DIR environment  variable,  or  at  runtime  using  the
       "cache_dir"  key  of  the  configuration.   The special value "catdir" implies that cached
       files will appear in the same directory as the catalog file in which the  data  source  is
       defined, within a directory named "intake_cache". These will not appear in the cache usage
       reported by the CLI.

       Optionally, the cache section can have a regex attribute that modifies the path of the
       cache on disk. By default, the cache path is made by concatenating cache_dir, the dataset
       name, a hash of the URL, and the URL itself (without the protocol). The regex attribute
       removes the part of the URL that it matches.
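       The path construction described above can be sketched in a few lines. This is an
       illustration only: the function name cache_path, the choice of SHA-256 as the hash and
       the example paths are assumptions, not Intake's actual implementation.

```python
import hashlib
import posixpath
import re


def cache_path(cache_dir, dataset, url, regex=None):
    """Illustrative only: join cache_dir, dataset name, URL hash and URL path."""
    # Drop the protocol, e.g. "s3://bucket/a.csv" -> "bucket/a.csv"
    path = url.split("://", 1)[-1]
    if regex:
        # Remove the first part of the path that matches the regex
        path = re.sub(regex, "", path, count=1)
    digest = hashlib.sha256(url.encode()).hexdigest()
    return posixpath.join(cache_dir, dataset, digest, path.lstrip("/"))


url = "s3://bucket/deep/dir/states.csv"
long_path = cache_path("/tmp/cache", "states", url)
short_path = cache_path("/tmp/cache", "states", url, regex=r"bucket/deep/dir/")
```

       Here long_path keeps the whole "bucket/deep/dir/states.csv" tail, while short_path ends in
       just "states.csv" below the hash directory.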

       Caching   can   be  disabled  at  runtime  for  all  sources  regardless  of  the  catalog
       specification:

          from intake.config import conf

          conf['cache_disabled'] = True

       By default, progress bars are shown during downloads if the package tqdm is available, but
       this can be disabled (e.g., for consoles that don't support complex text) with

          conf['cache_download_progress'] = False

       or, equivalently, the environment variable INTAKE_CACHE_PROGRESS.

       The types of caching that are supported are listed in intake.source.cache.registry; see
       the docstrings of each for specific parameters that should appear in the cache block.

       It is possible to work with compressed source files by setting type: compressed in the
       cache specification. By default the compression type is inferred from the file extension;
       otherwise it can be set by assigning the decomp variable to any of the options listed in
       intake.source.decompress.decomp. This will extract all the file(s) in the compressed file
       referenced by urlpath and store them in the cache directory.

       In cases where miscellaneous files are present in the compressed file, a regex_filter
       parameter can be used. Only the extracted filenames that match the pattern will be loaded.
       The cache path is prepended to the filename, so it is necessary to include a wildcard at
       the beginning of the pattern.

       Example:

          test_compressed:
            driver: csv
            args:
              urlpath: 'compressed_file.tar.gz'
            cache:
              - type: compressed
                decomp: tgz
                argkey: urlpath
                regex_filter: '.*data.csv'
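       The need for the leading wildcard in regex_filter can be seen with Python's re module.
       This sketch assumes the pattern is matched from the start of the full extracted path (as
       re.match does); the paths and the helper filter_files are hypothetical.

```python
import re

# Hypothetical extracted file paths; the cache directory comes first.
extracted = [
    "/home/user/.intake/cache/abc123/archive/data.csv",
    "/home/user/.intake/cache/abc123/archive/README.txt",
]


def filter_files(paths, pattern):
    """Keep only the paths that match the pattern from the start of the string."""
    return [p for p in paths if re.match(pattern, p)]


with_wildcard = filter_files(extracted, r".*data.csv")   # keeps the CSV file
without_wildcard = filter_files(extracted, r"data.csv")  # keeps nothing
```

       Without the leading ".*", the pattern is compared against the start of the cache path
       rather than the file name, so no file is selected.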

   Templating
       Intake catalog files support Jinja2 templating for driver arguments. Any occurrence of a
       substring like {{field}} will be replaced by the value of the user parameter with that
       same name, or the value explicitly provided by the user. For how to specify these user
       parameters, see the next section.

       Some additional values are available for templating. The following  is  always  available:
       CATALOG_DIR,  the  full  path  to the directory containing the YAML catalog file.  This is
       especially useful for constructing paths relative to the catalog directory to locate  data
       files  and  custom  drivers.   For  example, the search for CSV files for the two "entry1"
       blocks, above, will happen in the same directory as where the catalog file was found.

       The following functions may be available. Since these execute code, the user of a  catalog
       may decide whether they trust those functions or not.

       • env("USER"): look in the set environment variables for the named variable

       • client_env("USER"):  exactly  the same, except that when using a client-server topology,
         the value will come from the environment of the client.

       • shell("get_login thisuser -t"): execute the command, and use the output  as  the  value.
         The output will be trimmed of any trailing whitespace.

       • client_shell("get_login  thisuser  -t"):  exactly  the  same,  except  that when using a
         client-server topology, the value will come from the system of the client.

       The reason for the "client" versions of the functions is to prevent leakage of potentially
       sensitive  information between client and server by controlling where lookups happen. When
       working without a server, only the ones without "client" are used.

       An example:

          sources:
            personal_source:
              description: This source needs your username
              args:
                url: "http://server:port/user/{{env(USER)}}"

       Here,  if  the   user   is   named   "blogs",   the   url   argument   will   resolve   to
       "http://server:port/user/blogs";  if  the  environment  variable  is  not defined, it will
       resolve to "http://server:port/user/"
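       The behaviour of env() in this example, including the empty-string fallback, can be
       mimicked with the standard library. Intake itself uses Jinja2 for templating; the regex
       stand-in below and the name expand_env are assumptions made purely for illustration.

```python
import os
import re

# Matches {{env(NAME)}} as well as {{env("NAME")}} or {{env('NAME')}}
ENV_CALL = re.compile(r"""\{\{\s*env\(\s*["']?(\w+)["']?\s*\)\s*\}\}""")


def expand_env(template, environ=None):
    """Replace each env(...) call with the variable's value, or '' if unset."""
    environ = os.environ if environ is None else environ
    return ENV_CALL.sub(lambda m: environ.get(m.group(1), ""), template)


url = "http://server:port/user/{{env(USER)}}"
defined = expand_env(url, {"USER": "blogs"})
undefined = expand_env(url, {})
```

       With USER set to "blogs", defined is "http://server:port/user/blogs"; with no such
       variable, undefined is "http://server:port/user/".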

   Parameter Definition
   Source parameters
       A source definition can contain a "parameters" block.  Expressed in YAML, a parameter  may
       look as follows:

          parameters:
            name:
              description: name to use  # human-readable text for what this parameter means
              type: str  # one of bool, str, int, float, list[str | int | float], datetime, mlist
              default: normal  # optional, value to assume if user does not override
              allowed: ["normal", "strange"]  # optional, list of values that are OK, for validation
              min: "n"  # optional, minimum allowed, for validation
              max: "t"  # optional, maximum allowed, for validation

       A parameter, not to be confused with an argument, can have one of two uses:

       • to provide values for variables to be used in templating the arguments. If the pattern
         "{{name}}" exists in any of the source arguments, it will be replaced by the value of
         the parameter. If the user provides a value (e.g., source =
         cat.entry(name="something")), that will be used, otherwise the default value. If there
         is no user input or default, the empty value appropriate for type is used. The default
         field allows for the same function expansion as listed for arguments, above.

       • If an argument with the same  name  as  the  parameter  exists,  its  value,  after  any
         templating, will be coerced to the given type of the parameter and validated against the
         allowed/max/min. It is therefore possible to use the string templating system (e.g.,  to
         get a value from the environment), but pass the final value as, for example, an integer.
         It makes no sense to provide a default for this case (the argument already has a value),
         but providing a default will not raise an exception.

       • The "mlist" type is special: it means that the input must be a list whose values are
         chosen from the allowed list. This is the only type where the parameter value is not the
         same type as the individual values of the allowed list; e.g., if a list of str is set
         for allowed, the final value must also be a list of str.

       Note: the datetime type accepts multiple values: Python datetime,  ISO8601  string,   Unix
       timestamp int, "now" and  "today".
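       The coercion and validation steps described above can be sketched as follows. The
       function validate and its argument names are hypothetical and only mirror the
       type/allowed/min/max fields of a parameter block; Intake's own checks may differ in
       detail.

```python
def validate(value, type_, allowed=None, min_=None, max_=None):
    """Coerce value to type_, then check it against allowed/min/max."""
    coerced = type_(value)
    if allowed is not None and coerced not in allowed:
        raise ValueError(f"{coerced!r} is not one of {allowed!r}")
    if min_ is not None and coerced < min_:
        raise ValueError(f"{coerced!r} is below the minimum {min_!r}")
    if max_ is not None and coerced > max_:
        raise ValueError(f"{coerced!r} is above the maximum {max_!r}")
    return coerced


# A templated string such as "3" can end up as an integer argument
count = validate("3", int, min_=1, max_=5)

# The "name" parameter from the YAML example above
name = validate("normal", str, allowed=["normal", "strange"], min_="n", max_="t")
```

       A value outside the permitted set or range, such as validate("9", int, max_=5), raises
       ValueError in this sketch.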

   Catalog parameters
       You  can  also  define user parameters at the catalog level. This applies the parameter to
       all entries within that catalog, without having to define it for  each  and  every  entry.
       Furthermore, catalogs nested within the catalog will also inherit the parameter(s).

       For example, with the following spec

          metadata:
            version: 1
            parameters:
              bucket:
                type: str
                description: description
                default: test_bucket
          sources:
            param_source:
              driver: parquet
              description: description
              args:
                urlpath: s3://{{bucket}}/file.parquet
            subcat:
              driver: yaml_file
              path: "{{CATALOG_DIR}}/other.yaml"

       If cat is the corresponding catalog instance, the URL of source cat.param_source will
       evaluate to "s3://test_bucket/file.parquet" by default, but the parameter can be
       overridden with cat.param_source(bucket="other_bucket"). Also, any entries of subcat,
       another catalog referenced from here, would also have the "bucket"-named parameter
       attached to all sources. Of course, those sources do not need to make use of the parameter.

       To change the default, we can generate a new instance:

          cat2 = cat(bucket="production")  # sets default value of "bucket" for cat2
          subcat = cat.subcat(bucket="production")  # sets default only for the nested catalog

       Of  course,  in these situations you can still override the value of the parameter for any
       source, or pass explicit values for the arguments of the source, as normal.

       For cases where the catalog is not defined in a YAML spec, the argument user_parameters to
       the constructor takes the same form as parameters above: a dict of user parameters, either
       as UserParameter instances or as a dictionary spec for each one.

   Templating parameters
       Template functions can also be used in parameters (see Templating, above), but you can use
       the available functions directly without the extra {{...}}.

       For  example,  this catalog entry uses the env("HOME") functionality as described to set a
       default based on the user's home directory.

          sources:
            variabledefault:
              description: "This entry leads to an example csv file in the user's home directory by default, but the user can pass root='somepath' to override that."
              driver: csv
              args:
                path: "{{root}}/example.csv"
              parameters:
                root:
                  description: "root path"
                  type: str
                  default: "env(HOME)"

   Driver Selection
       In some cases, it may be possible that multiple backends are capable of loading  from  the
       same  data format or service. Sometimes, this may mean two drivers with unique names, or a
       single driver with a parameter to choose between the different backends.

       However, it is possible that multiple drivers for reading a particular type of  data  also
       share  the  same  driver  name:  for  example,  both the intake-iris and the intake-xarray
       packages contain drivers with the name "netcdf", which are capable  of  reading  the  same
       files,  but  with  different  backends. Here we will describe the various possibilities of
       coping with this situation. Intake's plugin system makes it easy to encode such choices.

       It may be acceptable to use any driver which claims to handle that data type, or  to  give
       the  option  of  which  driver to use to the user, or it may be necessary to specify which
       precise driver(s) are appropriate for that particular data. Intake  allows  all  of  these
       possibilities, even if the backend drivers require extra arguments.

       Specifying  a  single driver explicitly, rather than using a generic name, would look like
       this:

          sources:
            example:
              description: test
              driver: package.module.PluginClass
              args: {}

       It is also possible to describe a list of drivers with the  same  syntax.  The  first  one
       found  will  be  the one used. Note that the class imports will only happen at data source
       instantiation, i.e., when the entry is selected from the catalog.

          sources:
            example:
              description: test
              driver:
                - package.module.PluginClass
                - another_package.PluginClass2
              args: {}

       These alternative plugins can also be given data-source specific names, allowing the  user
       to  choose  at  load  time  with  driver= as a parameter. Additional arguments may also be
       required for each option (which, as usual, may include user parameters); however, the same
       global arguments will be passed to all of the drivers listed.

          sources:
            example:
              description: test
              driver:
                first:
                  class: package.module.PluginClass
                  args:
                    specific_thing: 9
                second:
                  class: another_package.PluginClass2
              args: {}
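       The "first one found" rule for a list of drivers can be pictured as trying each dotted
       path in order and keeping the first class that imports. The helper first_importable is a
       hypothetical stand-in written with importlib, not the lookup code Intake uses, and the
       candidate names are placeholders.

```python
import importlib


def first_importable(candidates):
    """Return the first class among the dotted paths that can be imported."""
    for dotted in candidates:
        module_name, _, class_name = dotted.rpartition(".")
        try:
            module = importlib.import_module(module_name)
            return getattr(module, class_name)
        except (ImportError, AttributeError, ValueError):
            # Missing package, missing class, or malformed path: try the next one
            continue
    raise ImportError(f"none of the candidates could be imported: {candidates}")


# The first entry does not exist, so the second one is chosen
cls = first_importable(["package_that_is_missing.Reader", "collections.OrderedDict"])
```

       Because the import only happens inside the function, nothing is loaded until a choice is
       actually needed, which matches the note above about imports happening at instantiation.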

   Remote Access
       (see also Remote Data for the implementation details)

       Many drivers support reading directly from remote data sources such as HTTP, S3 or GCS. In
       these cases, the path to read from is usually given with a protocol prefix such as gcs://.
       Additional dependencies will typically be required (requests, s3fs, gcsfs, etc.), and any
       data package should specify these. Further parameters may be necessary for communicating with
       the storage backend and, by convention, the driver should take a parameter storage_options
       containing arguments to pass to the backend. Some remote backends may  also  make  use  of
       environment variables or config files to determine their default behaviour.

       The  special template variable "CATALOG_DIR" may be used to construct relative URLs in the
       arguments to a source. In such  cases,  if  the  filesystem  used  to  load  that  catalog
       contained  arguments,  then  the storage_options of that file system will be extracted and
       passed to the source. Therefore, all sources which can accept general  URLs  (beyond  just
       local paths) must make sure to accept this argument.

       As an example of using storage_options, the following two sources would allow for reading
       CSV data from S3 and GCS backends without authentication (anonymous access), respectively:

          sources:
            s3_csv:
              driver: csv
              description: "Publicly accessible CSV data on S3; requires s3fs"
              args:
                urlpath: s3://bucket/path/*.csv
                storage_options:
                  anon: true
            gcs_csv:
              driver: csv
              description: "Publicly accessible CSV data on GCS; requires gcsfs"
              args:
                urlpath: gcs://bucket/path/*.csv
                storage_options:
                  token: "anon"

       Using S3 Profiles

       An AWS profile may be specified as an argument under  storage_options  via  the  following
       format:

          args:
            urlpath: s3://bucket/path/*.csv
            storage_options:
              profile: aws-profile-name

   Caching
       URLs  interpreted  by  fsspec  offer  automatic caching. For example, to enable file-based
       caching for the first source above, you can do:

          sources:
            s3_csv:
              driver: csv
              description: "Publicly accessible CSV data on S3; requires s3fs"
              args:
                urlpath: simplecache::s3://bucket/path/*.csv
                storage_options:
                  s3:
                    anon: true

       Here we have added the "simplecache::" prefix to the URL (this caching backend does not
       store any metadata about the cached file) and specified that the "anon" parameter is meant
       as an argument to s3, not to the caching mechanism. As each file in s3 is accessed, it will
       first be downloaded and then the local version used instead.

       You can tailor how the caching works. In particular, the location of the local storage can
       be set with the cache_storage parameter (under the "simplecache" group of
       storage_options); otherwise the files are stored in a temporary location only for the
       duration of the current Python session. Setting the cache location is particularly useful
       in conjunction with an environment variable, or relative to "{{CATALOG_DIR}}", wherever
       the catalog was loaded from.

       Please see the fsspec documentation for the full set of  cache  types  and  their  various
       options.
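       The grouping of storage_options by protocol can be made explicit with a small helper that
       builds the arguments for a cached source. The function cached_args and the cache
       directory used here are hypothetical; only the "simplecache::" prefix, the per-protocol
       groups and the cache_storage key come from the text above.

```python
def cached_args(urlpath, cache_dir, **remote_options):
    """Build source arguments that wrap urlpath in a simplecache layer."""
    protocol = urlpath.split("://", 1)[0]  # e.g. "s3"
    return {
        "urlpath": f"simplecache::{urlpath}",
        "storage_options": {
            # Options for the remote backend go under its protocol name
            protocol: dict(remote_options),
            # Options for the caching layer go under "simplecache"
            "simplecache": {"cache_storage": cache_dir},
        },
    }


args = cached_args("s3://bucket/path/*.csv", "/tmp/intake-cache", anon=True)
```

       The resulting dictionary has the same shape as the args block of the YAML entry above,
       with an extra "simplecache" group that fixes where the downloaded copies are kept.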

   Local Catalogs
       A  Catalog  can  be  loaded from a YAML file on the local filesystem by creating a Catalog
       object:

          from intake import open_catalog
          cat = open_catalog('catalog.yaml')

       Then sources can be listed:

          list(cat)

       and data sources are loaded via their name:

          data = cat.entry_part1

       and you can optionally configure new instances of the source to define user parameters  or
       override arguments by calling either of:

          data = cat.entry_part1.configure_new(part='1')
          data = cat.entry_part1(part='1')  # this is a convenience shorthand

       Intake also supports loading a catalog from all of the files ending in .yml and .yaml in a
       directory, or by using an explicit glob-string. Note that the URL provided may also refer
       to a remote storage system by using a protocol specifier such as s3:// or gcs://.

          cat = open_catalog('/research/my_project/catalog.d/')

       Intake Catalog objects will automatically reload changes or new additions to catalog files
       and directories on disk.  These changes will not affect already-opened data sources.

   Catalog Nesting
       A catalog is just another type of data source for Intake. For example,  you  can  print  a
       YAML specification corresponding to a catalog as follows:

          cat = intake.open_catalog('cat.yaml')
          print(cat.yaml())

       results in:

          sources:
            cat:
              args:
                path: cat.yaml
              description: ''
              driver: intake.catalog.local.YAMLFileCatalog
              metadata: {}

       The point here is that this can be included in another catalog.  (It would, of course, be
       better to include a description and the full path of the catalog file here.)  If the entry
       above  were  saved  to  another  file,  "root.yaml", and the original catalog contained an
       entry, data, you could access it as:

          root = intake.open_catalog('root.yaml')
          root.cat.data

       It is, therefore, possible to build up a hierarchy of  catalogs  referencing  each  other.
       These  can,  of  course,  include remote URLs and indeed catalog sources other than simple
       files (all the tables on a SQL  server,  for  instance).  Plus,  since  the  argument  and
       parameter  system  also applies to entries such as the example above, it would be possible
       to give the user a runtime choice of multiple catalogs  to  pick  between,  or  have  this
       decision depend on an environment variable.

   Server Catalogs
       Intake  also  includes a server which can share an Intake catalog over HTTP (or HTTPS with
       the help of a TLS-enabled reverse proxy).  From  the  user  perspective,  remote  catalogs
       function identically to local catalogs:

          cat = open_catalog('intake://catalog1:5000')
          list(cat)

       The difference is that operations on the catalog translate to requests sent to the catalog
       server.  Catalog servers provide access to data sources in one of two modes:

       • Direct access: In this mode, the catalog server tells the client how to load  the  data,
         but  the client uses its local drivers to make the connection.  This requires the client
         has the required driver already installed and has direct access to  the  files  or  data
         servers that the driver will connect to.

       • Proxied access: In this mode, the catalog server uses its local drivers to open the data
         source and stream the data over the network to the client.  The client does not need any
         special  drivers to read the data, and can read data from files and data servers that it
         cannot access, as long as the catalog server has the required access.

       Whether a particular catalog entry supports direct or proxied access is determined by  the
       direct_access option:

       • forbid (default): Force all clients to proxy data through the catalog server

       • allow:  If  the  client  has  the required driver, access the source directly, otherwise
         proxy the data through the catalog server.

       • force: Force all clients to access the data directly.  If they do not have the  required
         driver, an exception will be raised.

       Note  that  when the client is loading a data source via direct access, the catalog server
       will need to  send  the  driver  arguments  to  the  client.   Do  not  include  sensitive
       credentials in a data source that allows direct access.
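       The three direct_access settings amount to a small decision table. The function
       access_mode below is a hypothetical restatement of the rules listed above, not the
       server's code; in particular, the exception type is an assumption.

```python
def access_mode(direct_access, client_has_driver):
    """Return 'direct' or 'proxy' for a given setting and client capability."""
    if direct_access == "forbid":
        return "proxy"
    if direct_access == "allow":
        return "direct" if client_has_driver else "proxy"
    if direct_access == "force":
        if not client_has_driver:
            raise RuntimeError("direct access is forced but the driver is missing")
        return "direct"
    raise ValueError(f"unknown direct_access value: {direct_access!r}")


modes = {
    (setting, has_driver): access_mode(setting, has_driver)
    for setting in ("forbid", "allow")
    for has_driver in (True, False)
}
```

       Only the "allow" setting depends on what the client has installed; "force" with a
       missing driver is the one combination that fails outright.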

   Client Authorization Plugins
       Intake servers can check if clients are authorized to access the catalog as a whole, or
       individual catalog entries. Typically a matched pair of a server-side plugin (called an
       "auth plugin") and a client-side plugin (called a "client auth plugin") needs to be enabled
       for authorization checks to work. This feature is still in early development, but see
       module intake.auth.secret for a demonstration pair of server and client classes
       implementing auth via a shared secret. See Authorization Plugins.

   Command Line Tools
       The package installs two executable commands: intake-server, for starting the catalog
       server, and intake, a client for accessing catalogs and manipulating the configuration.

   Configuration
       A  file-based configuration service is available to Intake. This file is by default sought
       at  the  location  ~/.intake/conf.yaml,  but   either   of   the   environment   variables
       INTAKE_CONF_DIR  or  INTAKE_CONF_FILE can be used to specify another directory or file. If
       both are given, the latter takes priority.
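       The lookup order can be written out as a short function. This is a sketch under the
       assumption that a directory given by INTAKE_CONF_DIR is expected to contain a file named
       conf.yaml; the function name config_file is hypothetical.

```python
import os


def config_file(environ=None):
    """Resolve the configuration file path from the environment."""
    environ = os.environ if environ is None else environ
    # An explicit file takes priority over a directory
    if "INTAKE_CONF_FILE" in environ:
        return environ["INTAKE_CONF_FILE"]
    conf_dir = environ.get("INTAKE_CONF_DIR", os.path.expanduser("~/.intake"))
    return os.path.join(conf_dir, "conf.yaml")


from_dir = config_file({"INTAKE_CONF_DIR": "/etc/intake"})
from_both = config_file(
    {"INTAKE_CONF_DIR": "/etc/intake", "INTAKE_CONF_FILE": "/opt/intake.yaml"}
)
default = config_file({})
```

       Passing a plain dictionary instead of os.environ keeps the sketch easy to try without
       changing the real environment.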

       At present, the configuration file might look as follows:

          auth:
            cls: "intake.auth.base.BaseAuth"
          port: 5000
          catalog_path:
            - /home/myusername/special_dir

       These are the defaults, and any parameters not specified will take the values above:

       • the Intake Server will listen on port 5000 (this can be overridden on the command  line,
         see below)

       • and  the  auth system used will be the fully qualified class given (which, for BaseAuth,
         always allows access). For further information on securing the Intake  Server,  see  the
         Authorization Plugins.

       See intake.config.defaults for a full list of keys and their default values.

   Log Level
       The logging level is configurable using Python's built-in logging module.

       The  config  option  'logging' holds the current level for the intake logger, and can take
       values such as 'INFO' or 'DEBUG'. This can be set in the  conf.yaml  file  of  the  config
       directory (e.g., ~/.intake/), or overridden by the environment variable INTAKE_LOG_LEVEL.

       Furthermore, the level and settings of the logger can be changed programmatically in code:

          import logging
          logger = logging.getLogger('intake')
          logger.setLevel(logging.DEBUG)
          logger.addHandler(logging.StreamHandler())  # example handler; any logging.Handler works

   Intake Server
       The  server takes one or more catalog files as input and makes them available on port 5000
       by default.

       You can see the full description of the server command with:

          >>> intake-server --help

          usage: intake-server [-h] [-p PORT] [--list-entries] [--sys-exit-on-sigterm]
                               [--flatten] [--no-flatten] [-a ADDRESS]
                               FILE [FILE ...]

          Intake Catalog Server

          positional arguments:
            FILE                  Name of catalog YAML file

          optional arguments:
            -h, --help            show this help message and exit
            -p PORT, --port PORT  port number for server to listen on
            --list-entries        list catalog entries at startup
            --sys-exit-on-sigterm
                                  internal flag used during unit testing to ensure
                                  .coverage file is written
            --flatten
            --no-flatten
            -a ADDRESS, --address ADDRESS
                                  address to use as a host, defaults to the address in
                                  the configuration file, if provided otherwise localhost

       To start the server with a local catalog file, use the following:

          >>> intake-server intake/catalog/tests/catalog1.yml
          Creating catalog from:
            - intake/catalog/tests/catalog1.yml
          catalog_args ['intake/catalog/tests/catalog1.yml']
          Entries: entry1,entry1_part,use_example1
          Listening on port 5000

       You can then query the running server with the catalog client (described below):

          $ intake list intake://localhost:5000
          entry1
          entry1_part
          use_example1

   Intake Client
       While the Intake data sources will typically be accessed through the Python API,  you  can
       use the client to verify a catalog file.

       Unlike the server command, the client has several subcommands to access a catalog. You can
       see the list of available subcommands with:

          >>> intake --help
          usage: intake {list,describe,exists,get,discover} ...

       We go into further detail in the following sections.

   List
       This subcommand lists the names of all available catalog entries.  This  is  useful  since
       other subcommands require these names.

       If  you  wish  to  see the details about each catalog entry, use the --full flag.  This is
       equivalent to running the intake describe subcommand for all catalog entries.

          >>> intake list --help
          usage: intake list [-h] [--full] URI

          positional arguments:
            URI         Catalog URI

          optional arguments:
            -h, --help  show this help message and exit
            --full

          >>> intake list intake/catalog/tests/catalog1.yml
          entry1
          entry1_part
          use_example1
          >>> intake list --full intake/catalog/tests/catalog1.yml
          [entry1] container=dataframe
          [entry1] description=entry1 full
          [entry1] direct_access=forbid
          [entry1] user_parameters=[]
          [entry1_part] container=dataframe
          [entry1_part] description=entry1 part
          [entry1_part] direct_access=allow
          [entry1_part] user_parameters=[{'default': '1', 'allowed': ['1', '2'], 'type': u'str', 'name': u'part', 'description': u'part of filename'}]
          [use_example1] container=dataframe
          [use_example1] description=example1 source plugin
          [use_example1] direct_access=forbid
          [use_example1] user_parameters=[]

   Describe
       Given the name of a catalog entry, this subcommand lists the  details  of  the  respective
       catalog entry.

          >>> intake describe --help
          usage: intake describe [-h] URI NAME

          positional arguments:
            URI         Catalog URI
            NAME        Catalog name

          optional arguments:
            -h, --help  show this help message and exit

          >>> intake describe intake/catalog/tests/catalog1.yml entry1
          [entry1] container=dataframe
          [entry1] description=entry1 full
          [entry1] direct_access=forbid
          [entry1] user_parameters=[]

   Discover
       Given  the name of a catalog entry, this subcommand returns a key-value description of the
       data source. The exact details are subject to change.

          >>> intake discover --help
          usage: intake discover [-h] URI NAME

          positional arguments:
            URI         Catalog URI
            NAME        Catalog name

          optional arguments:
            -h, --help  show this help message and exit

          >>> intake discover intake/catalog/tests/catalog1.yml entry1
          {'npartitions': 2, 'dtype': dtype([('name', 'O'), ('score', '<f8'), ('rank', '<i8')]), 'shape': (None,), 'datashape':None, 'metadata': {'foo': 'bar', 'bar': [1, 2, 3]}}

   Exists
       Given the name of a catalog entry, this subcommand returns whether or not  the  respective
       catalog entry is valid.

          >>> intake exists --help
          usage: intake exists [-h] URI NAME

          positional arguments:
            URI         Catalog URI
            NAME        Catalog name

          optional arguments:
            -h, --help  show this help message and exit

          >>> intake exists intake/catalog/tests/catalog1.yml entry1
          True
          >>> intake exists intake/catalog/tests/catalog1.yml entry2
          False

   Get
       Given  the  name  of  a  catalog  entry, this subcommand outputs the entire data source to
       standard output.

          >>> intake get --help
          usage: intake get [-h] URI NAME

          positional arguments:
            URI         Catalog URI
            NAME        Catalog name

          optional arguments:
            -h, --help  show this help message and exit

          >>> intake get intake/catalog/tests/catalog1.yml entry1
                 name  score  rank
          0    Alice1  100.5     1
          1      Bob1   50.3     2
          2  Charlie1   25.0     3
          3      Eve1   25.0     3
          4    Alice2  100.5     1
          5      Bob2   50.3     2
          6  Charlie2   25.0     3
          7      Eve2   25.0     3

   Config and Cache
       CLI functions starting with intake cache and intake config are available to provide
       information about the system: the locations and values of configuration parameters, and the
       state of cached files.

   Persisting Data
       (this is an experimental new feature, expect enhancements and changes)

   Introduction
       As defined in the glossary, to Persist is to convert data into  the  storage  format  most
       appropriate  for  the  container  type,  and  save  a copy of this for rapid lookup in the
       future.  This is of great potential benefit where the creation or transfer of the original
       data source takes some time.

       This is not to be confused with the file Cache.

   Usage
       Any Data Source has a method .persist(). The only option that you will need to pick is a
       TTL, the number of seconds that the persisted version lasts before expiry (leave as None
       for no expiry). This creates a local copy in the persist directory, which may be in
       "~/.intake/persist", but can be configured.

       Each  container  type  (dataframe,  array,  ...)  will  have  its  own  implementation  of
       persistence,  and  a particular file storage format associated. The call to .persist() may
       take arguments to tune how the local files are created, and  in  some  cases  may  require
       additional optional packages to be installed.

       Example:

          cat = intake.open_catalog('mycat.yaml')  # load a remote cat
          source = cat.csvsource()  # source pointing to remote data
          source.persist()

          source = cat.csvsource()  # future use now gives local intake_parquet.ParquetSource

       You can control whether a catalog will automatically give you the persisted version of a
       source in this way using the argument persist_mode. For example, to ignore locally
       persisted versions, you could have done:

          cat = intake.open_catalog('mycat.yaml', persist_mode='never')
          # or
          source = cat.csvsource(persist_mode='never')

       Note  that if you give a TTL (in seconds), then the original source will be accessed and a
       new persisted version written transparently when the old persisted version has expired.
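       The expiry rule can be stated as a simple time comparison. The function needs_refresh
       and its arguments are hypothetical; the sketch only illustrates how a TTL in seconds,
       or None for no expiry, would decide whether the original source is read again.

```python
import time


def needs_refresh(persisted_at, ttl, now=None):
    """Return True if a copy persisted at `persisted_at` has outlived its TTL."""
    if ttl is None:
        # No TTL means the persisted version never expires
        return False
    now = time.time() if now is None else now
    return now - persisted_at > ttl


# A copy written at t=1000 with a one-hour TTL
fresh = needs_refresh(persisted_at=1000.0, ttl=3600, now=2000.0)
stale = needs_refresh(persisted_at=1000.0, ttl=3600, now=5000.0)
forever = needs_refresh(persisted_at=1000.0, ttl=None, now=1e12)
```

       In this sketch fresh and forever are False, while stale is True because 4000 seconds have
       passed against a TTL of 3600.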

       Note that after persisting, the original source  will  have  source.has_been_persisted  ==
       True  and  the  persisted  source  (i.e.,  the  one  loaded  from  local  files) will have
       source.is_persisted == True.

   Export
       A similar concept to Persist, Export allows you to make a copy of some data source, in the
       format  appropriate for its container, and place this data-set in whichever location suits
       you, including remote locations. This functionality (source.export()) does not  touch  the
       persist  store;  instead, it returns a YAML text representation of the output, so that you
       can put it into a catalog of your own. It would be this catalog that you share with  other
       people.

       Note  that  "exported"  data-sources  like this do contain the information of the original
       source they were made from in their metadata, so you can recreate the original source,  if
       you want to, and read from there.

   Persisting to Remote
       If  you  are  typically  running your code inside of ephemeral containers, then persisting
       data-sets may be something that you want to do (because the original source  is  slow,  or
       parsing  is  CPU/memory intensive), but the local storage is not useful. In some cases you
       may have access to some shared network storage mounted on the instance, but in other cases
       you will want to persist to a remote store.

       The  config  value  'persist_path',  which  can  also  be  set by the environment variable
       INTAKE_PERSIST_PATH, can be a remote location  such  as  s3://mybucket/intake-persist.  You
       will  need to install the appropriate package to talk to the external storage (e.g., s3fs,
       gcsfs, pyarrow), but otherwise everything should work as before, and you  can  access  the
       persisted data from any container.
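
       As a minimal sketch, the environment variable can be set from Python before Intake is
       imported. The bucket name is the illustrative one used above, and the assumption here is
       that the variable is read when Intake's configuration is loaded, so setting it first is
       the safest order.

```python
import os

# Illustrative location, reusing the example path from the text above.
os.environ["INTAKE_PERSIST_PATH"] = "s3://mybucket/intake-persist"

# Import intake only after this point, so that the value is already present
# when its configuration is loaded (assumption: it is read at load time).
persist_path = os.environ["INTAKE_PERSIST_PATH"]
print(persist_path)
```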

   The Persist Store
       You can interact directly with the class implementing persistence:

          from intake.container.persist import store

       This  singleton  instance,  which acts like a catalog, allows you to query the contents of
       the instance store and to add and remove entries. It also allows you to find the  original
       source for any given persisted source, and refresh the persisted version on demand.

       For   details   on   the  methods  of  the  persist  store,  see  the  API  documentation:
       intake.container.persist.PersistStore(). Sources in the store carry a lot  of  information
       about  the  sources they were made from, so that they can be remade successfully. This all
       appears in the source metadata.  The sources use the "token" of the original  data  source
       as  their  key  in the store, a value which can be found by dask.base.tokenize(source) for
       the original source, or can be taken from the metadata of a persisted source.

       Note that all of the information about persisted sources is held in a single YAML file  in
       the  persist directory (typically /persisted/cat.yaml within the config directory, but see
       intake.config.conf['persist_path']). This file can be edited by hand if you wanted to, for
       example, set some persisted source not to expire. This is only recommended for experts.

   Future Enhancements
       • CLI functionality to investigate and alter the state of the persist store.

       • Time  check-pointing of persisted data, such that you can not only get the "most recent"
         but any version in the time-series.

       • (eventually) pipeline functionality, whereby a persisted data source depends on  another
         persisted data source, and the whole train can be refreshed on a schedule or on demand.

   Plotting
       Intake  provides  a  plotting  API  based on the hvPlot library, which closely mirrors the
       pandas plotting API but generates interactive plots using HoloViews and Bokeh.

       The hvPlot website provides comprehensive documentation  on  using  the  plotting  API  to
       quickly  visualize  and explore small and large datasets. The main features offered by the
       plotting API include:

          • Support for tabular data stored in pandas and dask dataframes

          • Support for gridded data stored in xarray backed nD-arrays

          • Support for plotting large datasets with datashader

       Using Intake alongside  hvPlot  allows  declaratively  persisting  plot  declarations  and
       default options in the regular catalog.yaml files.

   Setup
       For  detailed  installation  instructions  see  the  getting started section in the hvPlot
       documentation.  To start with, install hvplot using conda:

          conda install -c conda-forge hvplot

       or using pip:

          pip install hvplot

   Usage
       The plotting API is designed to work well in and outside the Jupyter notebook; however,
       when using it in JupyterLab, the PyViz lab extension must be installed first:

          jupyter labextension install @pyviz/jupyterlab_pyviz

       For  detailed instructions on displaying plots in the notebook and from the Python command
       prompt see the hvPlot user guide.

   Python Command Prompt & Scripts
       The examples below assume that the US Crime dataset has been installed (from the
       intake-examples repo, or from conda with conda install -c intake us_crime). Once it is
       installed, the plot API can be used via the .plot method on an Intake DataSource:

          import intake
          import hvplot as hp

          crime = intake.cat.us_crime
          columns = ['Burglary rate', 'Larceny-theft rate', 'Robbery rate', 'Violent Crime rate']

          violin = crime.plot.violin(y=columns, group_label='Type of crime',
                                     value_label='Rate per 100k', invert=True)
          hp.show(violin)

   Notebook
       Inside the notebook, plots will display themselves; however, the notebook extension must be
       loaded first. The extension may be loaded by importing the hvplot.intake module, by
       explicitly loading the holoviews extension, or by calling intake.output_notebook():

          # To load the extension run this import
          import hvplot.intake

          # Or load the holoviews extension directly
          import holoviews as hv
          hv.extension('bokeh')

          # convenience function
          import intake
          intake.output_notebook()

          crime = intake.cat.us_crime
          columns = ['Violent Crime rate', 'Robbery rate', 'Burglary rate']
          crime.plot(x='Year', y=columns, value_label='Rate (per 100k people)')

   Predefined Plots
       Some catalogs will define plots appropriate to a  specific  data  source.  These  will  be
       specified  such  that  the  user  gets  the  right view with the right columns and labels,
       without having to investigate the data in detail -- this is ideal for quick-look  plotting
       when browsing sources.

          import intake
          intake.cat.us_crime.plots

       Returns ['example']. This works whether accessing the entry object or the source instance.
       To visualise:

          intake.cat.us_crime.plot.example()

   Persisting metadata
       Intake allows catalog yaml files to declare metadata fields for each data source which are
       made  available  alongside the actual dataset. The plotting API reserves certain fields to
       define default plot options, to label and annotate the data fields in  a  dataset  and  to
       declare pre-defined plots.

   Declaring defaults
       The  first  set  of  metadata  used  by the plotting API is the plot field in the metadata
       section. Any options found in the metadata field will apply to all  plots  generated  from
       that  data source, allowing the definition of plotting defaults. For example when plotting
       a fairly large dataset such as the  NYC  Taxi  data,  it  might  be  desirable  to  enable
       datashader by default ensuring that any plot that supports it is datashaded. The syntax to
       declare default plot options is as follows:

          sources:
            nyc_taxi:
              description: NYC Taxi dataset
              driver: parquet
              args:
                urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
              metadata:
                plot:
                  datashade: true

   Declaring data fields
       The columns of a CSV or parquet file, or the coordinates and data variables in a NetCDF
       file, often have shortened or cryptic names with underscores. They also do not provide
       additional information about the units of the data or the range of values. The catalog
       yaml specification therefore provides the ability to define additional information about
       the fields in a dataset.

       Valid attributes that may be defined for the data fields include:

       • label: A readable label for the field which will be used to label axes and widgets

       • unit: A unit associated with the values inside a data field

       • range: A range associated with a  field  declaring  limits  which  will  override  those
         computed from the data

       Just  like  the default plot options the fields may be declared under the metadata section
       of a data source:

          sources:
            nyc_taxi:
              description: NYC Taxi dataset
              driver: parquet
              args:
                urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
              metadata:
                fields:
                  dropoff_x:
                    label: Longitude
                  dropoff_y:
                    label: Latitude
                  total_fare:
                    label: Fare
                    unit: $

   Declaring custom plots
       As shown in the hvPlot user guide, the plotting API provides  a  variety  of  plot  types,
       which  can  be declared using the kind argument or via convenience methods on the plotting
       API, e.g. cat.source.plot.scatter(). In addition to declaring  default  plot  options  and
       field metadata, data sources may also declare custom plots, which will be made available as
       methods on the plotting API. In this way a catalogue may  declare  any  number  of  custom
       plots alongside a datasource.

       To  make  this  more  concrete consider the following custom plot declaration on the plots
       field in the metadata section:

          sources:
            nyc_taxi:
              description: NYC Taxi dataset
              driver: parquet
              args:
                urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
              metadata:
                plots:
                  dropoff_scatter:
                    kind: scatter
                    x: dropoff_x
                    y: dropoff_y
                    datashade: True
                    width: 800
                    height: 600

       This declarative specification creates a new custom  plot  called  dropoff_scatter,  which
       will  be  available on the catalog under cat.nyc_taxi.plot.dropoff_scatter(). Calling this
       method on the plot API will automatically  generate  a  datashaded  scatter  plot  of  the
       dropoff locations in the NYC taxi dataset.

       Of  course  the three metadata fields may also be used together, declaring global defaults
       under the plot field, annotations for the data fields under  the  fields  key  and  custom
       plots via the plots field.
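
       To picture the three fields side by side, the sketch below mirrors the metadata section of
       a catalog entry as a plain Python dictionary, reusing the NYC taxi values from the examples
       above. The helper declared_plots is hypothetical and is not part of Intake or hvPlot; it
       only lists the names found under the plots key.

```python
# Python mirror of the `metadata` section of a catalog entry
# (values reused from the NYC taxi examples above).
metadata = {
    "plot": {"datashade": True},
    "fields": {
        "dropoff_x": {"label": "Longitude"},
        "dropoff_y": {"label": "Latitude"},
        "total_fare": {"label": "Fare", "unit": "$"},
    },
    "plots": {
        "dropoff_scatter": {
            "kind": "scatter",
            "x": "dropoff_x",
            "y": "dropoff_y",
            "width": 800,
            "height": 600,
        },
    },
}


def declared_plots(meta):
    """Hypothetical helper: names of the custom plots declared in the metadata."""
    return list(meta.get("plots", {}))


print(declared_plots(metadata))
```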

   Plugin Directory
       This  is  a  list of known projects which install driver plugins for Intake, and the named
       drivers each contains in parentheses:

       • builtin to Intake (catalog, csv, intake_remote, ndzarr, numpy, textfiles, yaml_file_cat,
         yaml_files_cat, zarr_cat, json, jsonl)

       • intake-astro: Table and array loading of FITS astronomical data (fits_array, fits_table)

       • intake-accumulo: Apache Accumulo clustered data storage (accumulo)

       • intake-avro: Apache Avro data serialization format (avro_table, avro_sequence)

       • intake-bluesky: search and retrieve data in the bluesky data model

       • intake-dcat: Browse and load data from DCAT catalogs. (dcat)

       • intake-dynamodb: link to Amazon DynamoDB (dynamodb)

       • intake-elasticsearch:  Elasticsearch  search  and  analytics  engine (elasticsearch_seq,
         elasticsearch_table)

       • intake-esm: Plugin for building and loading intake catalogs for earth system data set
         holdings,  such  as CMIP (Coupled Model Intercomparison Project) and CESM Large Ensemble
         datasets.

       • intake-geopandas: load from ESRI Shape Files, GeoJSON,  and  geospatial  databases  with
         geopandas   (geojson,   postgis,  shapefile,  spatialite)  and  regionmask  for  opening
         shapefiles into regionmask.

       • intake-google-analytics: run Google Analytics queries  and  load  data  as  a  DataFrame
         (google_analytics_query)

       • intake-hbase: Apache HBase database (hbase)

       • intake-iris: load netCDF and GRIB files with IRIS (grib, netcdf)

       • intake-metabase:   Generate  catalogs  and  load  tables  as  DataFrames  from  Metabase
         (metabase_catalog, metabase_table)

       • intake-mongo: MongoDB noSQL query (mongo)

       • intake-nested-yaml-catalog: Plugin supporting a  single  YAML  hierarchical  catalog  to
         organize datasets and avoid a data swamp. (nested_yaml_cat)

       • intake-netflow: Netflow packet format (netflow)

       • intake-notebook:  Experimental  plugin  to access parameterised notebooks through intake
         and executed via papermill (ipynb)

       • intake-odbc: ODBC database (odbc)

       • intake-parquet: Apache Parquet file format (parquet)

       • intake-pattern-catalog: Plugin for specifying a file-path pattern which can represent  a
         number of different entries (pattern_cat)

       • intake-pcap: PCAP network packet format (pcap)

       • intake-postgres: PostgreSQL database (postgres)

       • intake-s3-manifests (s3_manifest)

       • intake-salesforce:  Generate  catalogs  and  load  tables  as DataFrames from Salesforce
         (salesforce_catalog, salesforce_table)

       • intake-sklearn: Load scikit-learn models from Pickle files (sklearn)

       • intake-solr: Apache Solr search platform (solr)

       • intake-stac: Intake Driver for SpatioTemporal Asset Catalogs (STAC).

       • intake-stripe:  Generate  catalogs  and  load   tables   as   DataFrames   from   Stripe
         (stripe_catalog, stripe_table)

       • intake-spark: data processed by Apache Spark (spark_cat, spark_rdd, spark_dataframe)

       • intake-sql: Generic SQL queries via SQLAlchemy (sql_cat, sql, sql_auto, sql_manual)

       • intake-sqlite:   Local   caching  of  remote  SQLite  DBs  and  queries  via  SQLAlchemy
         (sqlite_cat, sqlite, sqlite_auto, sqlite_manual)

       • intake-splunk: Splunk machine data query (splunk)

       • intake-streamz: real-time event processing using Streamz (streamz)

       • intake-thredds:   Intake   interface   to   THREDDS    data    catalogs    (thredds_cat,
         thredds_merged_source)

       • intake-xarray: load netCDF, Zarr and other multi-dimensional data (xarray_image, netcdf,
         grib, opendap, rasterio, remote-xarray, zarr)

       The status of these projects is available at Status Dashboard.

       Don't see your favorite format?  See Making Drivers for how to create new plugins.

       Note that if you want your  plugin  listed  here,  open  an  issue  in  the  Intake  issue
       repository  and  add  an  entry  to the status dashboard repository. We also have a plugin
       wishlist Github issue that shows the breadth of plugins we hope to see for Intake.

   Server Protocol
       This page gives deeper details on how the Intake server is implemented. For  those  simply
       wishing to run and configure a server, see the Command Line Tools section.

       Communication between the intake client and server happens exclusively over HTTP, with all
       parameters passed using msgpack UTF8 encoding. The  server  side  is  implemented  by  the
       module intake.cli.server. Currently, only the following two routes are available:

          • http://server:port/v1/info

          • http://server:port/v1/source

       The  server  may  be configured to use auth services, which, when passed the header of the
       incoming call, can determine whether the  given  request  is  allowed.  See  Authorization
       Plugins.

   GET /info
       Retrieve  information  about the data-sets available on this server. The list of data-sets
       may be paginated, in order to avoid excessively long transactions. Notice that the catalog
       for  which  a  listing  is  being requested can itself be a data-source (when source-id is
       passed) - this is how nested sub-catalogs are handled on the server.

   Parameters
       • page_size, int or none (optional): to enable pagination, set this value. The  number  of
         entries  returned  will  be  this  value  at most. If None, returns all entries. This is
         passed as a query parameter.

       • page_offset, int (optional): when paginating, start the list from this numerical offset.
         The  order  of entries is guaranteed if the base catalog has not changed. This is passed
         as a query parameter.

       • source-id, uuid string (optional): when the catalog being  accessed  is  not  the  root
         catalog,  but an open data-source on the server, this is its unique identifier. See POST
         /source for how these IDs are generated.  If the catalog  being  accessed  is  the  root
         Catalog, this parameter should be omitted. This is passed as an HTTP header.
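
       The effect of the two pagination parameters can be sketched as a simple slice. This is an
       illustration only, not the server's code: the function name paginate is hypothetical, and
       the choice to apply page_offset even when page_size is None is an assumption.

```python
def paginate(entries, page_size=None, page_offset=0):
    """Return the part of `entries` selected by the pagination parameters.

    With page_size=None, everything from page_offset onwards is returned
    (all entries, for the default offset of 0).
    """
    if page_size is None:
        return list(entries[page_offset:])
    return list(entries[page_offset:page_offset + page_size])


names = ["a", "b", "c", "d", "e"]
print(paginate(names, page_size=2, page_offset=0))
print(paginate(names, page_size=2, page_offset=4))
print(paginate(names))
```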

   Returns
       • version, string: the server's Intake version

       • sources,  list  of objects: the main payload, where each object contains a name, and the
         result of calling .describe() on the  corresponding  data-source,  i.e.,  the  container
         type, description, metadata.

       • metadata, object: any metadata associated with the whole catalog

   GET /source
       Fetch  information  about  a specific source. This is the random-access variant of the GET
       /info route, by which a particular data-source can be accessed without paginating  through
       all of the sources.

   Parameters
       • name,  string (required): the data source name being accessed, one of the members of the
         catalog. This is passed as a query parameter.

       • source-id, uuid string (optional): when the catalog  being  accessed  is  not  the  root
         catalog,  but an open data-source on the server, this is its unique identifier. See POST
         /source for how these IDs are generated.  If the catalog  being  accessed  is  the  root
         Catalog, this parameter should be omitted. This is passed as an HTTP header.
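
       As a rough sketch of how these parameters travel, the following prepares (but does not
       send) such a request using only the Python standard library. The host and port are
       placeholders and the helper build_source_request is hypothetical; the route, the name
       query parameter and the source-id header follow the description above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_source_request(base_url, name, source_id=None):
    """Prepare (but do not send) a GET request for the /v1/source route."""
    url = f"{base_url}/v1/source?{urlencode({'name': name})}"
    headers = {}
    if source_id is not None:
        # Only sent when addressing a sub-catalog rather than the root catalog.
        headers["source-id"] = source_id
    return Request(url, headers=headers, method="GET")


# Placeholder server address; nothing is sent over the network here.
req = build_source_request("http://localhost:5000", "us_crime")
print(req.get_method(), req.full_url)
```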

   Returns
       Same  as  one  of  the  entries in sources for GET /info: the result of .describe() on the
       given data-source in the server

   POST /source, action="search"
       Searching a Catalog returns search results in the form of a new  Catalog.  This  "results"
       Catalog is cached on the server the same as any other Catalog.

   Parameters
       • source-id,  uuid  string  (optional):  When  the  catalog being searched is not the root
         catalog, but a subcatalog on the server, this is its unique identifier. If  the  catalog
         being  searched is the root Catalog, this parameter should be omitted. This is passed as
         an HTTP header.

       • query: tuple of (args, kwargs): These will be unpacked into Catalog.search on the server
         to create the "results" Catalog. This is passed in the body of the message.

   Returns
       • source_id,  uuid  string:  the  identifier of the results Catalog in the server's source
         cache

   POST /source, action="open"
       This is a more involved processing of a data-source, and, if successful,  returns  one  of
       two possible scenarios:

       • direct-access,  in which all the details required for reading the data directly from the
         client are passed, and the client then creates a local copy of the data source and needs
         no further involvement from the server in order to fetch the data

       • remote-access,  in  which the client is unable or unwilling to create a local version of
         the data-source, and instead creates a remote data-source which will fetch the data  for
         each partition from the server.

       The  set of parameters supplied and the server/client policies will define which method of
       access is employed. In the case of remote-access, the data source is instantiated  on  the
       server,  and  .discover() run on it. The resulting information is passed back, and must be
       enough to instantiate a subclass of intake.container.base.RemoteSource appropriate for the
       container  of  the  data-set in question (e.g., RemoteArray when container="ndarray").  In
       this case, the response also includes a UUID string for the open instance on  the  server,
       referencing the cache of open sources maintained by the server.

       Note  that  "opening" a data entry which is itself a catalog implies instantiating that
       catalog object on the server and returning its UUID, such that a listing can be made using
       GET /info or GET /source.

   Parameters
       • name,  string (required): the data source name being accessed, one of the members of the
         catalog. This is passed in the body of the request.

       • source-id, uuid string (optional): when the catalog  being  accessed  is  not  the  root
         catalog,  but  an  open data-source on the server, this is its unique identifier. If the
         catalog being accessed is the root Catalog, this parameter should be  omitted.  This  is
         passed as an HTTP header.

       • available_plugins, list of string (optional): the set of named data drivers supported by
         the client. If the driver required by the data-source is not supported  by  the  client,
         then the source must be opened remote-access. This is passed in the body of the request.

       • parameters,  object  (optional):  user  parameters  to  pass  to  the  data-source  when
         instantiating. Whether or not direct-access is possible may,  in  principle,  depend  on
         these parameters, but this is unlikely. Note that some parameter default value functions
         are designed to be evaluated on the server, which may have access to, for example,  some
         credentials  service  (see  Parameter  Definition).  This  is  passed in the body of the
         request.

   Returns
       If direct-access, the driver plugin name  and  set  of  arguments  for  instantiating  the
       data-source in the client.

       If  remote-access,  the  data-source container, schema and source-ID so that further reads
       can be made from the server.
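
       The choice between the two outcomes can be sketched as follows. This is a simplification
       under stated assumptions, not the server's actual logic: the function choose_access is
       hypothetical, and the allow_direct flag stands in for whatever server/client policy
       applies. Only the rule that a missing client driver forces remote-access comes from the
       text above.

```python
def choose_access(driver, available_plugins, allow_direct=True):
    """Return "direct" or "remote" for a source that needs `driver`.

    `available_plugins` is the list of driver names the client reports;
    `allow_direct` is a stand-in for server/client policy.
    """
    if allow_direct and driver in set(available_plugins or []):
        return "direct"
    # The client lacks the driver (or policy forbids it): read via the server.
    return "remote"


print(choose_access("csv", ["csv", "parquet"]))
print(choose_access("parquet", ["csv"]))
print(choose_access("csv", ["csv"], allow_direct=False))
```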

   POST /source, action="read"
       This  route  fetches  data  from  the  server  once  a  data-source  has  been  opened  in
       remote-access mode.

   Parameters
       • source-id,  uuid  string  (required):  the identifier of the data-source in the server's
         source cache. This is returned when action="open". This is passed in  the  body  of  the
         request.

       • partition, int or tuple (optional, but necessary for some sources): section/chunk of the
         data to fetch.  In cases where the data-source is partitioned, the client will fetch the
         data  one  partition at a time, so that it will appear partitioned in the same manner on
         the client side for iteration or passing to  Dask.  Some  data-sources  do  not  support
         partitioning, and then this parameter is not required and is ignored. This is passed in
         the body of the request.

       • accepted_formats, accepted_compression, list  of  strings  (required):  to  specify  how
         serialization  of  data  happens.  This  is  an  expert  feature, see docs in the module
         intake.container.serializer. This is passed in the body of the request.
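
       The partition-by-partition pattern described above can be sketched without any network
       code. Here fetch is a hypothetical callable standing in for one POST /source,
       action="read" round trip, and the in-memory dictionary plays the role of the server; none
       of these names come from Intake.

```python
def read_partitions(fetch, npartitions):
    """Fetch each partition in turn and return the pieces in order.

    `fetch` takes a partition index and returns that partition's data.
    """
    return [fetch(i) for i in range(npartitions)]


# Stand-in "server": three partitions held in memory.
server_data = {0: [1, 2], 1: [3, 4], 2: [5]}
parts = read_partitions(server_data.__getitem__, npartitions=3)
print(parts)
```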

   Dataset Transforms
       Also known as derived datasets.

       WARNING:
          experimental feature, the API may change. The data sources in intake.source.derived are
          not yet declared as top-level named drivers in the package entrypoints.

       Intake allows for the definition of data sources which take as their input another source
       in the same catalog, so that you have the opportunity to present processing to the user
       of the catalog.

       The "target" of a derived data source will normally be a string. In the simple case, it is
       the name of a data source in the same catalog. However, we use the syntax "catalog:source"
       to  refer  to  sources  in  other  catalogs,  where  the part before ":" will be passed to
       intake.open_catalog(), together with any keyword arguments from cat_kwargs.
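
       A minimal sketch of how such a target string might be taken apart is shown below. The
       function split_target is hypothetical, and splitting on the last ":" (so that URLs such as
       s3://... stay intact) is an assumption rather than a documented rule.

```python
def split_target(target):
    """Split "catalog:source" into (catalog, source).

    The catalog part is None when the target names a source in the
    same catalog.
    """
    if ":" in target:
        # Split on the LAST colon so that URLs such as s3://... stay intact.
        catalog, source = target.rsplit(":", 1)
        return catalog, source
    return None, target


print(split_target("input_data"))
print(split_target("other_cat.yaml:input_data"))
print(split_target("s3://bucket/cat.yaml:input_data"))
```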

       This    can     be     done     by     defining     classes     which     inherit     from
       intake.source.derived.DerivedSource,  or  using one of the pre-defined classes in the same
       module, which usually need to be passed a reference to a function in a python  module.  We
       will demonstrate both.

   Example
       Consider  the following target dataset, which loads some simple facts about US states from
       a CSV file. This  example is taken from the Intake test suite.

       We now show two ways to apply a super-simple transform to this data, which selects two  of
       the dataframe's columns.

   Class Example
       The  first  version  uses  an  approach in which the transform is derived in a data source
       class, and the parameters passed are specific to the transform type.  Note that the driver
       is referred to by its fully-qualified name in the Intake package.

       The source class for this is included in the Intake codebase, but the important part is:

          class Columns(DataFrameTransform):
              ...

              def pick_columns(self, df):
                  return df[self._params["columns"]]

       We    see    that   this   specific   class   inherits   from   DataFrameTransform,   with
       transform=self.pick_columns. We know that the inputs and outputs are both dataframes. This
       allows for some additional validation and an automated way to infer the output dataframe's
       schema that reduces the number of lines of code required.

       The given method does exactly what you might imagine: it takes an input dataframe and
       applies a column selection to it.

       Running  cat.derive_cols.read()  will  indeed,  as expected, produce a version of the data
       with only the selected columns included. It does this by defining  the  original  dataset,
       applying the selection, and then getting Dask to generate the output. For some datasets,
       this can mean that the selection is pushed down to  the  reader,  and  the  data  for  the
       dropped  columns  is  never  loaded.  The  user  may  choose to do .to_dask() instead, and
       manipulate the lazy dataframe directly, before loading.

   Functional Example
       This  second  version  of  the  same  output  uses   the   more   generic   and   flexible
       intake.source.derived.DataFrameTransform.

          derive_cols_func:
            driver: intake.source.derived.DataFrameTransform
            args:
              targets:
                - input_data
              transform: "intake.source.tests.test_derived._pick_columns"
              transform_kwargs:
                columns: ["state", "slug"]

       In  this  case,  we  pass  a  reference  to  a  function defined in the Intake test suite.
       Normally this would be declared in user modules,  where  perhaps  those  declarations  and
       catalog(s) are distributed together as a package.

          def _pick_columns(df, columns):
              return df[columns]

       This  is,  of  course, very similar to the method shown in the previous section, and again
       applies the selection in the given named argument to the input. Note that Intake does not
       support including actual code in your catalog, since we would not want to allow arbitrary
       execution of code when a catalog is loaded, as opposed to when a source is executed.
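
       Referring to a function by its dotted name, rather than embedding code, can be sketched
       with the standard library alone. The helper resolve below is hypothetical and only
       illustrates the idea; it is not Intake's own lookup routine.

```python
import importlib


def resolve(dotted_name):
    """Import and return the object named by a "package.module.attr" string."""
    module_name, _, attr = dotted_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# A standard-library function is used here in place of a user-defined transform.
func = resolve("builtins.len")
print(func(["state", "slug"]))
```

       Nothing is imported or run until the name is resolved, which is the point of keeping only
       a string in the catalog.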

       Loading this data source proceeds exactly the same way as the class-based approach, above.
       Both  Dask  and in-memory (Pandas, via .read()) methods work as expected.  The declaration
       in YAML, above, is slightly more  verbose,  but  the  amount  of  code  is  smaller.  This
       demonstrates  a  tradeoff between flexibility and concision. If there were validation code
       to add for the arguments or input dataset, it would be less obvious  where  to  put  these
       things.

   Barebone Example
       The  previous  two  examples  both did dataframe to dataframe transforms. However, totally
       arbitrary computations are possible. Consider the following:

          barebones:
            driver: intake.source.derived.GenericTransform
            args:
              targets:
                - input_data
              transform: builtins.len
              transform_kwargs: {}

       This applies len  to  the  input  dataframe.  cat.barebones.describe()  gives  the  output
       container  type  as  "other",  i.e., not specified. The result of read() on this gives the
       single number 50, the number of rows in the input data. This class and DerivedSource are
       included with the intent that they serve as superclasses, and will probably not often be
       used directly.

   Execution engine
       None of the above examples specified explicitly where the compute implied by the
       transformation will take place. Most Intake drivers support in-memory containers and
       Dask, and the input dataset here is a dataframe. However, the behaviour is defined in the
       driver class itself, so it would be fine to write a driver in which we make different
       assumptions. Let's suppose, for instance, that the original source is to be loaded from
       Spark (see the intake-spark package). The driver could then explicitly call .to_spark on
       the original source, and be assured that it has a Spark object to work with. It should,
       of course, explain in its documentation what assumptions are being made and that,
       presumably, the user is expected to also call .to_spark if they wish to directly
       manipulate the Spark object.

   Plugin examples
          • call .sel on xarray datasets xarray-plugin-transform

   API
           ┌──────────────────────────────────────────────┬──────────────────────────────────┐
           │intake.source.derived.DerivedSource(*args,    │ Base   source   deriving    from │
           │...)                                          │ another   source   in  the  same │
           │                                              │ catalog                          │
           ├──────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.source.derived.GenericTransform(...)   │                                  │
           ├──────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.source.derived.DataFrameTransform(...) │ Transform where  the  input  and │
           │                                              │ output  are both Dask-compatible │
           │                                              │ dataframes                       │
           ├──────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.source.derived.Columns(*args,          │ Simple  dataframe  transform  to │
           │**kwargs)                                     │ pick columns                     │
           └──────────────────────────────────────────────┴──────────────────────────────────┘

       class intake.source.derived.DerivedSource(*args, **kwargs)
              Base source deriving from another source in the same catalog

              Target picking and parameter validation are performed here, but you  probably  want
              to subclass from one of the more specific classes like DataFrameTransform.

              __init__(targets,      target_chooser=<function     first>,     target_kwargs=None,
              cat_kwargs=None, container=None, metadata=None, **kwargs)

                     Parameters

                            targets: list of string or DataSources
                                   If string(s), refer to entries of the  same  catalog  as  this
                                   Source

                            target_chooser: function to choose between targets
                                   function(targets,  cat) -> source, or a fully-qualified dotted
                                   string pointing to it

                            target_kwargs: dict of dict with keys matching items of targets

                            cat_kwargs: to pass to intake.open_catalog, if the target is in
                                   another catalog

                            container: str (optional)
                                   Assumed output container, if known/different from input

                            [Note: the exact form of target_kwargs and cat_kwargs may be
                            subject to change]
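
              As an illustration of the target_chooser contract described above,
              function(targets, cat) -> source, the following sketch picks the first target
              that exists in the catalog. A plain dict stands in for the catalog and the name
              first_available is hypothetical; Intake's built-in chooser may accept further
              arguments.

```python
def first_available(targets, cat):
    """Return the first entry of ``targets`` that can be found in ``cat``.

    ``targets`` is a list of entry names; ``cat`` is any mapping of
    name -> source (a plain dict here, standing in for a catalog).
    """
    for name in targets:
        if name in cat:
            return cat[name]
    raise KeyError(f"none of the targets {targets!r} are in the catalog")


# Stand-in catalog: the values would be DataSource instances in practice.
catalog = {"raw_csv": "<source raw_csv>", "raw_parquet": "<source raw_parquet>"}
chosen = first_available(["missing_entry", "raw_parquet", "raw_csv"], catalog)
print(chosen)
```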

       class intake.source.derived.GenericTransform(*args, **kwargs)

              __init__(targets,     target_chooser=<function     first>,      target_kwargs=None,
              cat_kwargs=None, container=None, metadata=None, **kwargs)

                     Parameters

                            targets: list of string or DataSources
                                   If  string(s),  refer  to  entries of the same catalog as this
                                   Source

                            target_chooser: function to choose between targets
                                   function(targets, cat) -> source, or a fully-qualified  dotted
                                   string pointing to it

                            target_kwargs: dict of dict with keys matching items of targets

                            cat_kwargs: to pass to intake.open_catalog, if the target is in
                                   another catalog

                            container: str (optional)
                                   Assumed output container, if known/different from input

                            [Note: the exact form of target_kwargs and cat_kwargs may be
                            subject to change]

       class intake.source.derived.DataFrameTransform(*args, **kwargs)
              Transform where the input and output are both Dask-compatible dataframes

              This  derives  from  GenericTransform,  and  you  must  supply  transform  and  any
              transform_kwargs.

              __init__(targets,     target_chooser=<function     first>,      target_kwargs=None,
              cat_kwargs=None, container=None, metadata=None, **kwargs)

                     Parameters

                            targets: list of string or DataSources
                                   If  string(s),  refer  to  entries of the same catalog as this
                                   Source

                            target_chooser: function to choose between targets
                                   function(targets, cat) -> source, or a fully-qualified  dotted
                                   string pointing to it

                            target_kwargs: dict of dict with keys matching items of targets

                            cat_kwargs: to pass to intake.open_catalog, if the target is in
                                   another catalog

                            container: str (optional)
                                   Assumed output container, if known/different from input

                            [Note: the exact form of target_kwargs and cat_kwargs may be
                            subject to change]

       class intake.source.derived.Columns(*args, **kwargs)
              Simple dataframe transform to pick columns

              Given  as  an example of how to make a specific dataframe transform.  Note that you
              could use DataFrameTransform directly, by writing a function to choose the  columns
              instead of a method as here.

              __init__(columns, **kwargs)

                     columns: list of labels (usually str) or slice
                            Columns to choose from the target dataframe
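
              The note above says that the same effect could be had by passing a plain function
              to DataFrameTransform. A minimal sketch of such a function follows; the name
              pick_columns and the TinyFrame stand-in are hypothetical, and TinyFrame only
              imitates the df[list_of_labels] selection that pandas and Dask dataframes provide.

```python
def pick_columns(df, columns):
    """Select ``columns`` from ``df`` using ordinary label indexing."""
    return df[columns]


class TinyFrame(dict):
    """Stand-in for a dataframe: a dict of column name -> list of values."""

    def __getitem__(self, key):
        if isinstance(key, list):
            # A list of labels selects a sub-frame, as in pandas.
            return TinyFrame({k: dict.__getitem__(self, k) for k in key})
        return dict.__getitem__(self, key)


df = TinyFrame({"state": ["CA", "NY"], "zip": ["90210", "10001"], "rain": [0.1, 0.4]})
subset = pick_columns(df, ["state", "rain"])
print(list(subset))
```

              With DataFrameTransform, a function of this kind would presumably be supplied as
              transform, with the column list passed through transform_kwargs.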

REFERENCE

   API
       Auto-generated reference

   End User
       These are reference class and function definitions likely to be useful to everyone.

         ┌─────────────────────────────────────────────────┬──────────────────────────────────┐
         │intake.open_catalog([uri])                       │ Create a Catalog object          │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.registry                                  │ Dict of driver: DataSource class │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.register_driver(name,                     │ Add runtime driver definition    │
         │value[, ...])                                    │                                  │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.upload(data,        path,                 │ Given  a  concrete  data object, │
         │**kwargs)                                        │ store it at the  given  location │
         │                                                 │ and return a Source              │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.source.csv.CSVSource(*args,               │ Read CSV files into dataframes   │
         │**kwargs)                                        │                                  │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.source.textfiles.TextFilesSource(...)     │ Read  textfiles  as  sequence of │
         │                                                 │ lines                            │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.source.jsonfiles.JSONFileSource(...)      │ Read  JSON  files  as  a  single │
         │                                                 │ dictionary or list               │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.source.jsonfiles.JSONLinesFileSource(...) │ Read           a           JSONL │
         │                                                 │ (https://jsonlines.org/)    file │
         │                                                 │ and  return  a  list of objects, │
         │                                                 │ each being a valid  JSON  object │
         │                                                 │ (e.g. a dictionary or list)      │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.source.npy.NPySource(*args, **kwargs)     │ Read  numpy binary files into an │
         │                                                 │ array                            │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.source.zarr.ZarrArraySource(*args, ...)   │ Read Zarr format files  into  an │
         │                                                 │ array                            │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.catalog.local.YAMLFileCatalog(*args, ...) │ Catalog as described by a single │
         │                                                 │ YAML file                        │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.catalog.local.YAMLFilesCatalog(*args,     │ Catalog as described by multiple │
         │...)                                             │ YAML files                       │
         ├─────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.catalog.zarr.ZarrGroupCatalog(*args, ...) │ A catalog of the  members  of  a │
         │                                                 │ Zarr group.                      │
         └─────────────────────────────────────────────────┴──────────────────────────────────┘

       intake.open_catalog(uri=None, **kwargs)
              Create a Catalog object

              Can  load  YAML catalog files, connect to an intake server, or create any arbitrary
              Catalog subclass instance. In the general case, the user should supply driver= with
              a  value  from  the  plugins  registry  which has a container type of catalog. File
              locations can generally be remote, if specifying a URL protocol.

              The default behaviour if not specifying the driver is as follows:

              • if uri is a single string ending in "yml" or "yaml", open it as a catalog file

              • if uri is a list of strings, a string containing a  glob  character  ("*")  or  a
                string  not  ending  in  "y(a)ml",  open as a set of catalog files. In the latter
                case, assume it is a directory.

              • if uri begins with the protocol "intake:", connect to a remote Intake server

              • if uri is None or missing, create a base Catalog object without entries.
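
              The default rules listed above can be summarised in a few lines. This is only a
              sketch of the documented behaviour, not Intake's own code: the function name
              guess_catalog_kind and the returned labels are hypothetical, and testing for
              "intake:" before the file-suffix rules is an assumption about the ordering.

```python
def guess_catalog_kind(uri=None):
    """Classify ``uri`` following the documented defaults (illustrative only)."""
    if uri is None:
        return "empty catalog"
    if isinstance(uri, (list, tuple)):
        return "set of catalog files"
    uri = str(uri)  # also accept pathlib.Path
    if uri.startswith("intake:"):
        return "remote Intake server"
    if "*" in uri:
        return "set of catalog files"
    if uri.endswith(("yml", "yaml")):
        return "single catalog file"
    # Anything else is assumed to be a directory of catalog files.
    return "set of catalog files"


for example in ["catalog.yml", "cats/*.yaml", "cats/", "intake://host:5000", None]:
    print(repr(example), "->", guess_catalog_kind(example))
```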

              Parameters

                     uri: str or pathlib.Path
                            Designator for the location of the catalog.

                     kwargs:
                            passed to subclass instance,  see  documentation  of  the  individual
                            catalog   classes.   For  example,  yaml_files_cat  (when  specifying
                            multiple uris or  a  glob  string)  takes  the  additional  parameter
                            flatten=True|False, specifying whether all data sources are merged in
                            a single namespace, or each file becomes a sub-catalog.

              SEE ALSO:

                 intake.open_yaml_files_cat, intake.open_yaml_file_cat

                 intake.open_intake_remote

       intake.registry
              Mapping from plugin names to the DataSource classes that implement them. These  are
              the  names  that  should  appear  in the driver: key of each source definition in a
              catalog. See Plugin Directory for more details.

       intake.open_
              Set of functions, one for each plugin, for direct opening of  a  data  source.  The
              names are derived from the names of the plugins in the registry at import time.

       intake.upload(data, path, **kwargs)
              Given a concrete data object, store it at the given location and return a Source

              Use  this  function  to  publicly  share data which you have created in your python
              session. Intake will try each of the container types, to see if  one  of  them  can
              handle  the  input  data,  and write the data to the path given, in the format most
              appropriate for the data type, e.g., parquet for pandas or dask data-frames.

              With the DataSource instance you get back, you can add this to a catalog,  or  just
              get the YAML representation for editing (.yaml()) and sharing.

              Parameters

                     data   instance  The  object to upload and store. In many cases, the dask and
                            in-memory variants are handled equivalently.

                     path   str Location of the output files; can be,  for  instance,  a  network
                            drive for sharing over a VPC, or a bucket on a cloud storage service

                     kwargs passed to the writer for fine control.

              Returns

                     DataSource instance

   Source classes
       class intake.source.csv.CSVSource(*args, **kwargs)
              Read CSV files into dataframes

              Prototype of sources reading dataframe data

              __init__(urlpath,     csv_kwargs=None,     metadata=None,     storage_options=None,
              path_as_pattern=True)

                     Parameters

                            urlpath
                                   str or iterable, location of data. May be a local path,  or  a
                                   remote path if including a protocol specifier such as 's3://'.
                                   May include glob wildcards or format  pattern  strings.   Some
                                   examples:

                                   • {{ CATALOG_DIR }}data/precipitation.csv

                                   • s3://data/*.csv

                                   • s3://data/precipitation_{state}_{zip}.csv

                                   • s3://data/{year}/{month}/{day}/precipitation.csv

                                   • {{ CATALOG_DIR }}data/precipitation_{date:%Y-%m-%d}.csv

                            csv_kwargs
                                   dict Any further arguments to pass to Dask's read_csv (such as
                                   block size) or to the CSV parser  in  pandas  (such  as  which
                                   columns to use, encoding, data-types)

                            storage_options
                                   dict  Any parameters that need to be passed to the remote data
                                   backend, such as credentials.

                            path_as_pattern
                                   bool or str, optional Whether to treat the path as  a  pattern
                                   (e.g.,  data_{field}.csv) and create new columns in the output
                                   corresponding to pattern fields. If a str, it is treated as the
                                   pattern to match on. Default is True.
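
                     To make the idea behind path_as_pattern concrete, the sketch below recovers
                     field values from a path that was built from a format pattern. It is a
                     simplified stand-in rather than Intake's implementation: the function name
                     fields_from_path is hypothetical, only named fields are handled, and format
                     specs such as {date:%Y-%m-%d} are ignored.

```python
import re
from string import Formatter


def fields_from_path(pattern, path):
    """Return {field: value} parsed from ``path``, or None if it does not match.

    Simplification: every field matches any run of characters other than
    "/", and format specs such as ``{date:%Y-%m-%d}`` are ignored.
    """
    regex = ""
    for literal, field, _spec, _conversion in Formatter().parse(pattern):
        regex += re.escape(literal)
        if field is not None:
            regex += f"(?P<{field}>[^/]+?)"
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None


pattern = "s3://data/precipitation_{state}_{zip}.csv"
print(fields_from_path(pattern, "s3://data/precipitation_CA_90210.csv"))
```

                     Each recovered field corresponds to one of the new columns that the
                     parameter description mentions.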

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate for its container, in the
                     location specified (which can be remote, e.g., s3).

                     Returns the resultant source object, so that you can, for instance,  add  it
                     to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source will
                                   be accessed and a new persisted version written  transparently
                                   when more than ttl seconds have passed since the old persisted
                                   version was written.

                            kwargs: passed to the _persist method on the base container.
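
                     The ttl rule described above amounts to a simple age check. The following
                     sketch shows that check in isolation; the function name needs_refresh and
                     its arguments are hypothetical and do not correspond to Intake internals.

```python
import time


def needs_refresh(written_at, ttl, now=None):
    """Return True when a persisted copy written at ``written_at`` has expired.

    ``written_at`` and ``now`` are timestamps in seconds; ``ttl`` is the
    time to live in seconds, or None for a copy that never expires.
    """
    if ttl is None:
        return False
    if now is None:
        now = time.time()
    return now - written_at > ttl


print(needs_refresh(written_at=1000.0, ttl=60, now=1030.0))
print(needs_refresh(written_at=1000.0, ttl=60, now=1061.0))
```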

              read() Load entire dataset into a container and return it

              read_partition(i)
                     Return a part of the data corresponding to i-th partition.

                     By default, assumes i should be an integer  between  zero  and  npartitions;
                     override for more complex indexing schemes.

              to_dask()
                     Return a dask container for this data source

       class intake.source.zarr.ZarrArraySource(*args, **kwargs)
              Read Zarr format files into an array

              Zarr is a numerical array storage format which works particularly well with  remote
              and    parallel    access.     For     specifics     of     the     format,     see
              https://zarr.readthedocs.io/en/stable/

              __init__(urlpath, storage_options=None, component=None, metadata=None, **kwargs)
                     The  parameters  dtype  and shape will be determined from the first file, if
                     not given.

                     Parameters

                            urlpath
                                   str Location of  data  file(s),  possibly  including  protocol
                                   information

                            storage_options
                                   dict Passed on to storage backend for remote files

                            component
                                   str  or  None  If  None, assume the URL points to an array. If
                                   given, assume the URL points to a group, and descend the group
                                   to find the array at this location in the hierarchy.

                            kwargs passed on to dask.array.from_zarr.

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate for its container, in the
                     location specified (which can be remote, e.g., s3).

                     Returns the resultant source object, so that you can, for instance,  add  it
                     to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source will
                                   be accessed and a new persisted version written  transparently
                                   when more than ttl seconds have passed since the old persisted
                                   version was written.

                            kwargs: passed to the _persist method on the base container.

              read() Load entire dataset into a container and return it

              read_partition(i)
                     Return a part of the data corresponding to i-th partition.

                     By default, assumes i should be an integer  between  zero  and  npartitions;
                     override for more complex indexing schemes.

              to_dask()
                     Return a dask container for this data source

       class intake.source.textfiles.TextFilesSource(*args, **kwargs)
              Read textfiles as sequence of lines

              Prototype of sources reading sequential data.

              Takes a set of files, and returns an iterator over the text in each of  them.   The
              files  can  be  local  or  remote.  Extra  parameters  for  encoding, etc., go into
              storage_options.

              __init__(urlpath,    text_mode=True,    text_encoding='utf8',     compression=None,
              decoder=None, read=True, metadata=None, storage_options=None)

                     Parameters

                            urlpath
                                   str  or  list(str) Target files. Can be a glob-path (with "*")
                                   and include a protocol specifier (e.g., "s3://"). Can also  be
                                   a list of absolute paths.

                            text_mode
                                   bool  Whether  to  open the file in text mode, recoding binary
                                   characters on the fly

                            text_encoding
                                   str If text_mode is True, apply this encoding. UTF* is by  far
                                   the most common

                            compression
                                   str or None If given, decompress the file with the given codec
                                   on load. Can be something like "gzip" or "bz2", or "infer"  to
                                   try to guess from the filename.

                            decoder
                                   function,  str  or  None  Use  this  to decode the contents of
                                   files. If None, you will get a list of lines of text/bytes. If
                                   a  function,  it must operate on an open file-like object or a
                                   bytes/str instance, and return a list

                            read   bool If decoder  is  not  None,  this  flag  controls  whether
                                   bytes/str  get  passed to the function indicated (True) or the
                                   open file-like object (False)

                            storage_options: dict
                                   Options  to  pass  to  the  file  reader  backend,   including
                                   text-specific  encoding  arguments, and parameters specific to
                                   the remote file-system driver, if using.
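
                     The interplay of decoder and read described above can be sketched as follows.
                     This is an illustration under stated assumptions, not the driver's code: the
                     helper apply_decoder and the example decoder split_on_commas are hypothetical
                     names, and an in-memory io.StringIO stands in for an opened file.

```python
import io


def apply_decoder(fileobj, decoder=None, read=True):
    """Decode one open file following the decoder/read rules described above."""
    if decoder is None:
        # No decoder: the result is simply the list of lines.
        return fileobj.readlines()
    if read:
        # read=True: the decoder receives the whole contents as str/bytes.
        return decoder(fileobj.read())
    # read=False: the decoder receives the open file-like object itself.
    return decoder(fileobj)


def split_on_commas(text):
    """Example decoder operating on a str; returns a list, as required."""
    return text.strip().split(",")


print(apply_decoder(io.StringIO("a,b,c\n"), decoder=split_on_commas, read=True))
print(apply_decoder(io.StringIO("x\ny\n")))
```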

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate for its container, in the
                     location specified (which can be remote, e.g., s3).

                     Returns  the  resultant source object, so that you can, for instance, add it
                     to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source will
                                   be  accessed and a new persisted version written transparently
                                   when more than ttl seconds have passed since the old persisted
                                   version was written.

                            kwargs: passed to the _persist method on the base container.

              read() Load entire dataset into a container and return it

              read_partition(i)
                     Return a part of the data corresponding to i-th partition.

                     By  default,  assumes  i  should be an integer between zero and npartitions;
                     override for more complex indexing schemes.

              to_dask()
                     Return a dask container for this data source

       class intake.source.jsonfiles.JSONFileSource(*args, **kwargs)
              Read JSON files as a single dictionary or list

              The files can be local or remote. Extra parameters  for  encoding,  etc.,  go  into
              storage_options.

              __init__(urlpath:  str,  text_mode:  bool  =  True,  text_encoding:  str  = 'utf8',
              compression: Optional[str] = None, read: bool = True,  metadata:  Optional[dict]  =
              None, storage_options: Optional[dict] = None)

                     Parameters

                            urlpath
                                   str  Target  file.  Can  include  a protocol specifier  (e.g.,
                                   "s3://").

                            text_mode
                                   bool Whether to open the file in text  mode,  recoding  binary
                                   characters on the fly

                            text_encoding
                                   str  If text_mode is True, apply this encoding. UTF* is by far
                                   the most common

                            compression
                                   str or None If given, decompress the file with the given codec
                                   on load. Can be something like "zip",  "gzip"  or  "bz2",   or
                                   "infer" to try to guess from the filename.

                            storage_options: dict
                                   Options  to  pass  to  the  file  reader  backend,   including
                                   text-specific  encoding  arguments, and parameters specific to
                                   the remote file-system driver, if using.

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate for its container, in the
                     location specified (which can be remote, e.g., s3).

                     Returns  the  resultant source object, so that you can, for instance, add it
                     to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source will
                                   be  accessed and a new persisted version written transparently
                                   when more than ttl seconds have passed since the old persisted
                                   version was written.

                            kwargs: passed to the _persist method on the base container.

              read() Load entire dataset into a container and return it

       class intake.source.jsonfiles.JSONLinesFileSource(*args, **kwargs)
              Read a JSONL (https://jsonlines.org/) file and return a list of objects, each being
              a valid JSON object (e.g. a dictionary or list)

              __init__(urlpath: str,  text_mode:  bool  =  True,  text_encoding:  str  =  'utf8',
              compression:  Optional[str]  =  None, read: bool = True, metadata: Optional[dict] =
              None, storage_options: Optional[dict] = None)

                     Parameters

                            urlpath
                                   str  Target  file.  Can  include  a protocol specifier  (e.g.,
                                   "s3://").

                            text_mode
                                   bool  Whether  to  open the file in text mode, recoding binary
                                   characters on the fly

                            text_encoding
                                   str If text_mode is True, apply this encoding. UTF* is by  far
                                   the most common

                            compression
                                   str or None If given, decompress the file with the given codec
                                   on load. Can be something like "zip",  "gzip"  or  "bz2",   or
                                   "infer" to try to guess from the filename.

                            storage_options: dict
                                   Options   to  pass  to  the  file  reader  backend,  including
                                   text-specific encoding arguments, and parameters  specific  to
                                   the remote file-system driver, if using.

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate for its container, in the
                     location specified (which can be remote, e.g., s3).

                     Returns the resultant source object, so that you can, for instance,  add  it
                     to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

              head(nrows: int = 100)
                     Return the first nrows lines from the file
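
                     Reading only the first few records of a JSON Lines file is straightforward
                     because each line is a complete JSON document. A minimal sketch using only the
                     standard library follows; jsonl_head is a hypothetical name, the in-memory
                     sample replaces a real file, and blank lines are not handled.

```python
import io
import json
from itertools import islice


def jsonl_head(fileobj, nrows=100):
    """Parse at most ``nrows`` lines of an open JSON Lines file."""
    return [json.loads(line) for line in islice(fileobj, nrows)]


sample = io.StringIO('{"city": "Oslo", "mm": 3.2}\n[1, 2, 3]\n{"city": "Lima", "mm": 0.0}\n')
print(jsonl_head(sample, nrows=2))
```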

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source will
                                   be accessed and a new persisted version written  transparently
                                   when more than ttl seconds have passed since the old persisted
                                   version was written.

                            kwargs: passed to the _persist method on the base container.

              read() Load entire dataset into a container and return it

       class intake.source.npy.NPySource(*args, **kwargs)
              Read numpy binary files into an array

              Prototype source showing example of working with arrays

              Each file becomes one or more partitions, but partitioning within a  file  is  only
              along the largest dimension, to ensure contiguous data.

              __init__(path,    dtype=None,    shape=None,   chunks=None,   storage_options=None,
              metadata=None)
                     The parameters dtype and shape will be determined from the  first  file,  if
                     not given.

                     Parameters

                            path: str or list of str
                                   Location of data file(s), possibly including glob and protocol
                                   information

                            dtype: str dtype spec
                                   If known, the dtype (e.g., "int64" or "f4").

                            shape: tuple of int
                                   If known, the length of each axis

                            chunks: int
                                   Size of chunks within a file along biggest  dimension  -  need
                                   not be an exact factor of the length of that dimension

                            storage_options: dict
                                   Passed to file-system backend.
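
                     The chunking rule described above (split only along the largest dimension,
                     with a chunk size that need not divide its length exactly) can be sketched in
                     plain Python. The helper name chunk_bounds is hypothetical and the code is an
                     illustration of the rule, not the driver's implementation.

```python
def chunk_bounds(shape, chunks):
    """Return (axis, [(start, stop), ...]) for splitting along the largest axis.

    ``shape`` is the array shape of one file and ``chunks`` the requested
    chunk length; the final chunk is shorter when ``chunks`` is not an
    exact factor of the axis length.
    """
    axis = max(range(len(shape)), key=lambda i: shape[i])
    length = shape[axis]
    bounds = [(start, min(start + chunks, length)) for start in range(0, length, chunks)]
    return axis, bounds


print(chunk_bounds((3, 10), chunks=4))
```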

              discover()
                     Open resource and populate the source attributes.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate for its container, in the
                     location specified (which can be remote, e.g., s3).

                     Returns the resultant source object, so that you can, for instance,  add  it
                     to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source will
                                   be accessed and a new persisted version written  transparently
                                   when more than ttl seconds have passed since the old persisted
                                   version was written.

                            kwargs: passed to the _persist method on the base container.

              read() Load entire dataset into a container and return it

              read_partition(i)
                     Return a part of the data corresponding to i-th partition.

                     By default, assumes i should be an integer  between  zero  and  npartitions;
                     override for more complex indexing schemes.

              to_dask()
                     Return a dask container for this data source

       class intake.catalog.local.YAMLFileCatalog(*args, **kwargs)
              Catalog as described by a single YAML file

              __init__(path=None, text=None, autoreload=True, **kwargs)

                     Parameters

                            path: str
                                   Location of the file to parse (can be remote)

                            text: str
                                   YAML contents of catalog, takes precedence over path

                            autoreload: bool
                                   Whether to watch the source file for changes; make False if
                                   you want an editable Catalog

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate for its container, in the
                     location specified (which can be remote, e.g., s3).

                     Returns  the  resultant source object, so that you can, for instance, add it
                     to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source will
                                   be  accessed and a new persisted version written transparently
                                   when more than ttl seconds have passed since the old persisted
                                   version was written.

                            kwargs: passed to the _persist method on the base container.

              reload()
                     Reload catalog if sufficient time has passed

              walk(sofar=None, prefix=None, depth=2)
                     Get all entries in this catalog and sub-catalogs

                     Parameters

                            sofar: dict or None
                                   Within recursion, use this dict for output

                            prefix: list of str or None
                                   Names of levels already visited

                            depth: int
                                   Number  of  levels  to  descend;  needed  to truncate circular
                                   references and for cleaner output

                     Returns

                            Dict where the keys are the entry names in dotted syntax, and the
                            values are entry instances.

       class intake.catalog.local.YAMLFilesCatalog(*args, **kwargs)
              Catalog as described by multiple YAML files

              __init__(path, flatten=True, **kwargs)

                     Parameters

                            path: str
                                   Location of the files to  parse  (can  be  remote),  including
                                   possible  glob  (*)  character(s).  Can also be list of paths,
                                   without glob characters.

                            flatten: bool (True)
                                   Whether to list all entries in  the  cats  at  the  top  level
                                   (True) or create sub-cats from each file (False).
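
                     The effect of flatten can be pictured  with  plain  dictionaries.  This  is
                     only a sketch: the function combine, the use of dicts in place of catalogs,
                     and the rule that later files win on a name clash are all assumptions  made
                     for illustration, not Intake's implementation.

```python
def combine(catalogs, flatten=True):
    """Merge several {entry_name: entry} mappings, keyed by a per-file name."""
    if flatten:
        merged = {}
        for entries in catalogs.values():
            merged.update(entries)  # later files win on a name clash (assumption)
        return merged
    # One sub-mapping per file, copied so the inputs are not shared
    return {name: dict(entries) for name, entries in catalogs.items()}


files = {"a": {"x": 1}, "b": {"y": 2}}
flat = combine(files, flatten=True)     # all entries at the top level
nested = combine(files, flatten=False)  # one sub-mapping per file
```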

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate for its container, in the
                     location specified (which can be remote, e.g., s3).

                     Returns the resultant source object, so that you can, for instance,  add  it
                     to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source will
                                   be accessed and a new persisted version written  transparently
                                   when more than ttl seconds have passed since the old persisted
                                   version was written.

                            kwargs: passed to the _persist method on the base container.

              reload()
                     Reload catalog if sufficient time has passed

              walk(sofar=None, prefix=None, depth=2)
                     Get all entries in this catalog and sub-catalogs

                     Parameters

                            sofar: dict or None
                                   Within recursion, use this dict for output

                            prefix: list of str or None
                                   Names of levels already visited

                            depth: int
                                   Number of levels  to  descend;  needed  to  truncate  circular
                                   references and for cleaner output

                     Returns

                            Dict where the keys are the entry names in dotted syntax, and the
                            values are entry instances.

       class intake.catalog.zarr.ZarrGroupCatalog(*args, **kwargs)
              A catalog of the members of a Zarr group.

              __init__(urlpath,      storage_options=None,     component=None,     metadata=None,
              consolidated=False, name=None)

                     Parameters

                            urlpath
                                   str Location of  data  file(s),  possibly  including  protocol
                                   information

                            storage_options
                                   dict, optional Passed on to storage backend for remote files

                            component
                                   str, optional If None, build a catalog from the root group. If
                                   given, build the catalog from the group at  this  location  in
                                   the hierarchy.

                            metadata
                                   dict,  optional  Catalog  metadata.  If  not provided, will be
                                   populated from Zarr group attributes.

                            consolidated
                                   bool,  optional  If  True,  assume  Zarr  metadata  has   been
                                   consolidated.

              export(path, **kwargs)
                     Save this data for sharing with other people

                     Creates a copy of the data in a format appropriate for its container, in the
                     location specified (which can be remote, e.g., s3).

                     Returns the resultant source object, so that you can, for instance,  add  it
                     to a catalog (catalog.add(source)) or get its YAML representation (.yaml()).

              persist(ttl=None, **kwargs)
                     Save data from this source to local persistent storage

                     Parameters

                            ttl: numeric, optional
                                   Time to live in seconds. If provided, the original source will
                                   be accessed and a new persisted version written  transparently
                                   when more than ttl seconds have passed since the old persisted
                                   version was written.

                            kwargs: passed to the _persist method on the base container.

              reload()
                     Reload catalog if sufficient time has passed

              walk(sofar=None, prefix=None, depth=2)
                     Get all entries in this catalog and sub-catalogs

                     Parameters

                            sofar: dict or None
                                   Within recursion, use this dict for output

                            prefix: list of str or None
                                   Names of levels already visited

                            depth: int
                                   Number of levels  to  descend;  needed  to  truncate  circular
                                   references and for cleaner output

                     Returns

                            Dict where the keys are the entry names in dotted syntax, and the
                            values are entry instances.

   Base Classes
       This is a reference API class listing, useful mainly for developers.

           ┌─────────────────────────────────────────────┬──────────────────────────────────┐
           │intake.source.base.DataSourceBase(*args,     │ An object which can produce data │
           │...)                                         │                                  │
           ├─────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.source.base.DataSource(*args,         │ A  Data Source with all optional │
           │**kwargs)                                    │ functionality                    │
           ├─────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.source.base.PatternMixin()            │ Helper    class    to    provide │
           │                                             │ file-name parsing abilities to a │
           │                                             │ driver class                     │
           ├─────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.container.base.RemoteSource(*args,    │ Base  class  for all DataSources │
           │...)                                         │ living on an Intake server       │
           ├─────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.catalog.Catalog(*args, **kwargs)      │ Manages  a  hierarchy  of   data │
           │                                             │ sources as a collective unit.    │
           ├─────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.catalog.entry.CatalogEntry(*args,     │ A single  item  appearing  in  a │
           │...)                                         │ catalog                          │
           ├─────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.catalog.local.UserParameter(*args,    │ A  user-settable  item  that  is │
           │...)                                         │ passed   to  a  DataSource  upon │
           │                                             │ instantiation.                   │
           ├─────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.auth.base.BaseAuth(*args,             │ Base class for authorization     │
           │**kwargs)                                    │                                  │
           ├─────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.source.cache.BaseCache(driver,        │ Provides utilities for  managing │
           │spec)                                        │ cached data files.               │
           ├─────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.source.base.Schema(**kwargs)          │ Holds     details     of    data │
           │                                             │ description  for  any  type   of │
           │                                             │ data-source                      │
           ├─────────────────────────────────────────────┼──────────────────────────────────┤
           │intake.container.persist.PersistStore(*args, │ Specialised     catalog      for │
           │...)                                         │ persisted data-sources           │
           └─────────────────────────────────────────────┴──────────────────────────────────┘

       class intake.source.base.DataSource(*args, **kwargs)
              A Data Source with all optional functionality

              When  subclassed,  child classes will have the base data source functionality, plus
              caching, plotting and persistence abilities.

              plot   Accessor for HVPlot methods.  See Plotting for more details.

       class intake.catalog.Catalog(*args, **kwargs)
              Manages a hierarchy of data sources as a collective unit.

              A catalog is a set of available data  sources  for  an  individual  entity  (remote
              server,  local   file,  or  a  local  directory  of files). This can be expanded to
              include a collection of subcatalogs, which are then managed as a single unit.

              A catalog is created with a single URI or a group of URIs. A URI can  either  be  a
              URL or a file path.

              Each  catalog  in  the hierarchy is responsible for caching the most recent refresh
              time to prevent overeager queries.

              Attributes

                     metadata
                            dict Arbitrary information to carry along with the data source specs.

              configure_new(**kwargs)
                     Create a new instance of this source with altered arguments

                     Enables  the  picking  of  options  and  re-evaluating  templates  from  any
                     user-parameters  associated  with this source, or overriding any of the init
                     arguments.

                     Returns a new data source instance. The instance will be recreated from  the
                     original entry definition in a catalog if this source was originally created
                     from a catalog.

              discover()
                     Open resource and populate the source attributes.

              filter(func)
                     Create a Catalog of a subset of entries based on a condition

                     WARNING:
                        This function operates on CatalogEntry objects not DataSource objects.

                     NOTE:
                        Note that, whatever specific class  this  is  performed  on,  the  return
                        instance  is  a  Catalog. The entries are passed unmodified, so they will
                        still reference the original catalog instance  and  include  its  details
                        such as directory.

                     Parameters

                            func   function  This  should  take a CatalogEntry and return True or
                                   False. Those items returning True will be included in the  new
                                   Catalog, with the same entry names

                     Returns

                            Catalog
                                   New catalog with Entries that still refer to their parents
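
                     The behaviour described for filter can be sketched with a  plain  mapping.
                     The Entry class and filter_entries function below are hypothetical stand-ins
                     rather than Intake objects; they only show that the predicate  receives  an
                     entry and that the kept entries are passed through unmodified.

```python
class Entry:
    """Stand-in for a catalog entry; it only carries a container name."""

    def __init__(self, container):
        self.container = container


def filter_entries(entries, func):
    """Keep the entries for which func(entry) is true, under the same names."""
    return {name: entry for name, entry in entries.items() if func(entry)}


entries = {"table": Entry("dataframe"), "image": Entry("ndarray")}
subset = filter_entries(entries, lambda e: e.container == "dataframe")
```

                     The objects in subset are the same instances as in entries, which  mirrors
                     the note that filtered entries still refer to their original catalog.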

              force_reload()
                     Imperatively reload the data now

              classmethod from_dict(entries, **kwargs)
                     Create Catalog from the given set of entries

                     Parameters

                            entries
                                   dict-like  A  mapping  of  name:entry which supports dict-like
                                   functionality, e.g., is derived from collections.abc.Mapping.

                            kwargs passed on to the constructor. Things like metadata, name; see
                                   __init__.

                     Returns

                            Catalog instance

              get(**kwargs)
                     Create a new instance of this source with altered arguments

                     Enables  the  picking  of  options  and  re-evaluating  templates  from  any
                     user-parameters associated with this source, or overriding any of  the  init
                     arguments.

                     Returns  a new data source instance. The instance will be recreated from the
                     original entry definition in a catalog if this source was originally created
                     from a catalog.

              property gui
                     Source GUI, with parameter selection and plotting

              items()
                     Get an iterator over (key, source) tuples for the catalog entries.

              keys() Entry names in this catalog as an iterator (alias for __iter__)

              pop(key)
                     Remove entry from catalog and return it

                     This  relies  on the _entries attribute being mutable, which it normally is.
                     Note that if a catalog automatically reloads, any  entry  removed  here  may
                     soon reappear

                     Parameters

                            key    str Key of the entry to remove from the catalog

              reload()
                     Reload catalog if sufficient time has passed

              save(url, storage_options=None)
                     Output this catalog to a file as YAML

                     Parameters

                            url    str Location to save to, perhaps remote

                            storage_options
                                   dict Extra arguments for the file-system

              serialize()
                     Produce YAML version of this catalog.

                     Note  that  this  is  not  the  same as .yaml(), which produces a YAML block
                     referring to this catalog.

              values()
                     Get an iterator over the sources for catalog entries.

              walk(sofar=None, prefix=None, depth=2)
                     Get all entries in this catalog and sub-catalogs

                     Parameters

                            sofar: dict or None
                                   Within recursion, use this dict for output

                            prefix: list of str or None
                                   Names of levels already visited

                            depth: int
                                   Number of levels  to  descend;  needed  to  truncate  circular
                                   references and for cleaner output

                     Returns

                            Dict where the keys are the entry names in dotted syntax, and the
                            values are entry instances.
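
                     A minimal sketch of a depth-limited walk producing dotted  names  follows.
                     It uses nested dicts in place of  catalogs  and  sub-catalogs,  which  is  an
                     assumption made so that the example is self-contained; it is  not  Intake's
                     own code.

```python
def walk(entries, sofar=None, prefix=None, depth=2):
    """Collect entries of a nested mapping under dotted names, down to depth levels."""
    out = sofar if sofar is not None else {}
    prefix = prefix or []
    for name, entry in entries.items():
        out[".".join(prefix + [name])] = entry
        # Nested dicts play the role of sub-catalogs in this sketch
        if isinstance(entry, dict) and depth > 1:
            walk(entry, out, prefix + [name], depth - 1)
    return out


tree = {"a": {"b": {"c": 1}}, "d": 2}
found = walk(tree, depth=2)   # stops before reaching "a.b.c"
deeper = walk(tree, depth=3)  # one level further
```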

       class intake.catalog.entry.CatalogEntry(*args, **kwargs)
              A single item appearing in a catalog

              This  is the base class, used by local entries (i.e., read from a YAML file) and by
              remote entries (read from a server).

              describe()
                     Get a dictionary of attributes of this entry.

                     Returns: dict with keys

                            name: str
                                   The name of the catalog entry.

                            container
                                   str kind of container used by this data source

                            description
                                   str Markdown-friendly description of data source

                            direct_access
                                   str Mode of remote access: forbid, allow, force

                            user_parameters
                                   list[dict] List of user parameters defined by this entry

              get(**user_parameters)
                     Open the data source.

                     Equivalent to calling the catalog entry like a function.

                     Note: entry(), entry.attr, entry[item]  check  for  persisted  sources,  but
                     directly  calling  .get() will always ignore the persisted store (equivalent
                     to self._pmode=='never').

                     Parameters

                            user_parameters
                                   dict Values for user-configurable  parameters  for  this  data
                                   source

                     Returns

                            DataSource

              property has_been_persisted
                     For the source created with the given args, has it been persisted?

              property plots
                     List custom associated quick-plots

       class intake.container.base.RemoteSource(*args, **kwargs)
              Base class for all DataSources living on an Intake server

              to_dask()
                     Return a dask container for this data source

       class intake.catalog.local.UserParameter(*args, **kwargs)
              A user-settable item that is passed to a DataSource upon instantiation.

              For  string parameters, default may include special functions func(args), which may
              be expanded from environment variables or by executing a shell command.

              Parameters

                     name: str
                            the key that appears in the DataSource argument strings

                     description: str
                            narrative text

                     type: str
                            one of list (COERCION_RULES)

                     default: type value
                            same type as type. If a  str,  may  include  special  functions  env,
                            shell, client_env, client_shell.

                     min, max: type value
                            for validation of user input

                     allowed: list of type
                            for validation of user input

              describe()
                     Information about this parameter

              expand_defaults(client=False, getenv=True, getshell=True)
                     Compile env, client_env, shell and client_shell commands

              validate(value)
                     Does value meet parameter requirements?
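
                     The roles of type, min, max and allowed can be sketched  as  follows.  The
                     Param class is a hypothetical stand-in written for  this  example;  it  does
                     not reproduce Intake's coercion rules or the env and shell expansion.

```python
class Param:
    """Hypothetical stand-in for a user parameter with simple validation."""

    def __init__(self, name, type_, default, min=None, max=None, allowed=None):
        self.name, self.type_ = name, type_
        self.min, self.max, self.allowed = min, max, allowed
        self.default = self.validate(default)

    def validate(self, value):
        """Coerce value to the declared type, then check bounds and allowed values."""
        value = self.type_(value)
        if self.min is not None and value < self.min:
            raise ValueError(f"{self.name}={value!r} is less than {self.min!r}")
        if self.max is not None and value > self.max:
            raise ValueError(f"{self.name}={value!r} is greater than {self.max!r}")
        if self.allowed is not None and value not in self.allowed:
            raise ValueError(f"{self.name}={value!r} is not in {self.allowed!r}")
        return value


year = Param("year", int, 2000, min=1990, max=2020)
```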

       class intake.auth.base.BaseAuth(*args, **kwargs)
              Base class for authorization

              Subclass this and override the methods to implement a new type of auth.

              This basic class allows all access.

              allow_access(header, source, catalog)
                     Is the given HTTP header allowed to access the given data source

                     Parameters

                            header: dict
                                   The HTTP header from the incoming request

                            source: CatalogEntry
                                   The data source the user wants to access.

                            catalog: Catalog
                                   The catalog object containing this data source.

              allow_connect(header)
                     Is the given request header allowed to talk to the server

                     Parameters

                            header: dict
                                   The HTTP header from the incoming request

              get_case_insensitive(dictionary, key, default=None)
                     Case-insensitive search of a dictionary for key.

                     Returns the value if key match is found, otherwise default.
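
                     The sketch below shows one way the three methods  could  fit  together.  It
                     is deliberately standalone: TokenAuth does not  inherit  from  BaseAuth,  and
                     the header name x-example-token is made  up  for  the  example.  A  real
                     plugin would subclass BaseAuth and override the same two methods.

```python
def get_case_insensitive(dictionary, key, default=None):
    """Return the value whose key matches key ignoring case, else default."""
    lower = key.lower()
    for k, v in dictionary.items():
        if k.lower() == lower:
            return v
    return default


class TokenAuth:
    """Standalone sketch of an auth plugin; not derived from Intake's BaseAuth."""

    def __init__(self, token, key="x-example-token"):
        self.token = token
        self.key = key  # hypothetical header name

    def allow_connect(self, header):
        value = get_case_insensitive(header, self.key)
        return value is not None and value == self.token

    def allow_access(self, header, source, catalog):
        # Same rule for every source in this sketch
        return self.allow_connect(header)


auth = TokenAuth("s3cret")
```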

       class    intake.source.cache.BaseCache(driver,    spec,    catdir=None,    cache_dir=None,
       storage_options={})
              Provides utilities for managing cached data files.

              Providers of caching functionality should derive from this, and appear  as  entries
              in  registry.  The  principal methods to override are _make_files(), _load() and
              _from_metadata().

              clear_all()
                     Clears all cache and metadata.

              clear_cache(urlpath)
                     Clears cache and metadata for a given urlpath.

                     Parameters

                            urlpath: str, location of data
                                   May be a local path, or remote path if  including  a  protocol
                                   specifier such as 's3://'. May include glob wildcards.

              get_metadata(urlpath)

                     Parameters

                            urlpath: str, location of data
                                   May  be  a  local path, or remote path if including a protocol
                                   specifier such as 's3://'. May include glob wildcards.

                     Returns

                            Metadata (dict) about a given urlpath.

              load(urlpath, output=None, **kwargs)
                     Downloads data from a given url, generates a hashed filename, logs metadata,
                     and caches it locally.

                     Parameters

                            urlpath: str, location of data
                                   May  be  a  local path, or remote path if including a protocol
                                   specifier such as 's3://'. May include glob wildcards.

                            output: bool
                                   Whether to show progress bars; turn off for testing

                     Returns

                            List of local cache_paths to be opened instead of the remote file(s).
                            If caching is disabled, the urlpath is returned.
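
                     The idea of a hashed local filename can be sketched as  below.  The  choice
                     of SHA-256, the directory layout and the function name cache_path  are  all
                     assumptions for illustration; Intake's actual naming scheme may differ.

```python
import hashlib
import os


def cache_path(urlpath, cache_dir):
    """Map a URL to a deterministic local path under cache_dir."""
    digest = hashlib.sha256(urlpath.encode("utf-8")).hexdigest()
    # Keep the original base name so the cached file stays recognisable
    return os.path.join(cache_dir, digest, os.path.basename(urlpath))


p1 = cache_path("s3://bucket/data/part-0.csv", "/tmp/cache")
p2 = cache_path("s3://bucket/data/part-0.csv", "/tmp/cache")
```

                     Because the path is derived only from the URL, repeated loads of  the  same
                     urlpath resolve to the same local file.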

       class intake.source.base.PatternMixin
              Helper class to provide file-name parsing abilities to a driver class

       class intake.source.base.Schema(**kwargs)
              Holds details of data description for any type of data-source

              This should always be pickleable, so that it can be sent from a server to a client,
              and contain all information needed to recreate a RemoteSource on the client.

       class intake.container.persist.PersistStore(*args, **kwargs)
              Specialised catalog for persisted data-sources

              add(key, source)
                     Add the persisted source to the store under the given key

                     key    str The unique token of the un-persisted, original source

                     source DataSource instance The thing to  add  to  the  persisted  catalogue,
                            referring to persisted data

              backtrack(source)
                     Given a unique key in the store, recreate original source

              get_tok(source)
                     Get string token from object

                     Strings  are assumed to already be a token; if source or entry, see if it is
                     a persisted thing ("original_tok" is in its metadata), else generate its own
                     token.

              needs_refresh(source)
                     Has the (persisted) source expired in the store

                     Will  return  True  if the source is not in the store at all, if its TTL is
                     set to None, or if more seconds have passed than the TTL.
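
                     The rule as worded above can be sketched with  a  plain  dictionary  store.
                     This follows the description rather than Intake's source; the store layout
                     and the metadata keys timestamp and ttl are assumptions.

```python
import time


def needs_refresh(store, token, now=None):
    """Follow the documented rule: missing entry, ttl of None, or ttl exceeded."""
    now = time.time() if now is None else now
    meta = store.get(token)
    if meta is None:
        return True  # not in the store at all
    if meta["ttl"] is None:
        return True  # TTL set to None
    return now - meta["timestamp"] > meta["ttl"]


store = {"abc": {"timestamp": 1000.0, "ttl": 60}}
```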

              refresh(key)
                     Recreate and re-persist the source for the given unique ID

              remove(source, delfiles=True)
                     Remove a dataset from the persist store

                     source str or DataSource or Lo If a str,  this  is  the  unique  ID  of  the
                            original source, which is the key of the persisted dataset within the
                            store. If a source, can be  either  the  original  or  the  persisted
                            source.

                     delfiles
                            bool Whether to remove the on-disc artifact

   Other Classes
   Cache Types
            ┌────────────────────────────────────────────┬──────────────────────────────────┐
            │intake.source.cache.FileCache(driver,       │ Cache specific set of files      │
            │spec)                                       │                                  │
            ├────────────────────────────────────────────┼──────────────────────────────────┤
            │intake.source.cache.DirCache(driver,        │ Cache a complete directory tree  │
            │spec[, ...])                                │                                  │
            ├────────────────────────────────────────────┼──────────────────────────────────┤
            │intake.source.cache.CompressedCache(driver, │ Cache   files   extracted   from │
            │spec)                                       │ downloaded compressed source     │
            ├────────────────────────────────────────────┼──────────────────────────────────┤
            │intake.source.cache.DATCache(driver, spec[, │ Use   the   DAT   protocol    to │
            │...])                                       │ replicate data                   │
            └────────────────────────────────────────────┴──────────────────────────────────┘

            ┌────────────────────────────────────────────┬──────────────────────────────────┐
            │intake.source.cache.CacheMetadata(*args,    │ Utility   class   for   managing │
            │...)                                        │ persistent  metadata  stored  in │
            │                                            │ the Intake config directory.     │
            └────────────────────────────────────────────┴──────────────────────────────────┘

       class    intake.source.cache.FileCache(driver,    spec,    catdir=None,    cache_dir=None,
       storage_options={})
              Cache specific set of files

              Input  is  a single file URL, URL with glob characters or list of URLs. Output is a
              specific set of local files.

       class    intake.source.cache.DirCache(driver,    spec,    catdir=None,     cache_dir=None,
       storage_options={})
              Cache a complete directory tree

              Input  is  a  directory  root  URL,  plus  a depth parameter for how many levels of
              subdirectories to search. All regular files will be copied. Output is the resultant
              local directory tree.

       class   intake.source.cache.CompressedCache(driver,   spec,  catdir=None,  cache_dir=None,
       storage_options={})
              Cache files extracted from downloaded compressed source

              For one or more remote compressed files,  downloads  to  local  temporary  dir  and
              extracts  all  contained  files  to  local cache. Input is URL(s) (including globs)
              pointing to remote compressed files, plus optional  decomp,  which  is  "infer"  by
              default   (guess   from   file   extension)   or   one   of   the  key  strings  in
              intake.source.decompress.decomp. Optional regex_filter parameter is  used  to  load
              only  the  extracted files that match the pattern.  Output is the list of extracted
              files.
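
              The two options can be pictured with the sketch below. The  extension  table  is
              hypothetical (the real key strings are those in intake.source.decompress.decomp),
              and matching with re.search is an assumption about how the pattern is applied.

```python
import os
import re

# Hypothetical extension table, for illustration only
EXTENSIONS = {".zip": "zip", ".gz": "gz", ".bz2": "bz"}


def infer_decomp(path, decomp="infer"):
    """Return decomp unchanged unless it is "infer", then guess from the extension."""
    if decomp != "infer":
        return decomp
    ext = os.path.splitext(path)[1].lower()
    if ext not in EXTENSIONS:
        raise ValueError(f"cannot infer compression from {path!r}")
    return EXTENSIONS[ext]


def apply_regex_filter(files, regex_filter=None):
    """Keep only the extracted files whose path matches the pattern."""
    if regex_filter is None:
        return list(files)
    pattern = re.compile(regex_filter)
    return [f for f in files if pattern.search(f)]


extracted = ["out/a.csv", "out/b.json", "out/readme.txt"]
kept = apply_regex_filter(extracted, r"\.csv$")
```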

       class    intake.source.cache.DATCache(driver,    spec,    catdir=None,     cache_dir=None,
       storage_options={})
              Use the DAT protocol to replicate data

              For  details  of the protocol, see https://docs.datproject.org/. The executable dat
              must be available.

              Since in this case, it is not possible to access the remote  files  directly,  this
              cache  mechanism takes no parameters. The expectation is that the url passed by the
              driver is of the form:

                 dat://<dat hash>/file_pattern

              where the file pattern will typically be a glob string like "*.json".

       class intake.source.cache.CacheMetadata(*args, **kwargs)
              Utility class  for  managing  persistent  metadata  stored  in  the  Intake  config
              directory.

              keys() -> a set-like object providing a view on D's keys

              pop(k[, d]) -> v, remove specified key and return the corresponding value.
                     If key is not found, d is returned if given, otherwise KeyError is raised.

              update([E], **F) -> None. Update D from mapping/iterable E and F.
                     If E is present and has a .keys() method, does: for k in E: D[k] = E[k].
                     If E is present and lacks a .keys() method, does: for (k, v) in E: D[k] = v.
                     In either case, this is followed by: for k, v in F.items(): D[k] = v.

   Auth
            ┌────────────────────────────────────────────┬──────────────────────────────────┐
            │intake.auth.secret.SecretAuth(*args,        │ A  very  simple  auth  mechanism │
            │**kwargs)                                   │ using a shared secret            │
            ├────────────────────────────────────────────┼──────────────────────────────────┤
            │intake.auth.secret.SecretClientAuth(secret) │ Matching client auth  plugin  to │
            │                                            │ SecretAuth                       │
            └────────────────────────────────────────────┴──────────────────────────────────┘

       class intake.auth.secret.SecretAuth(*args, **kwargs)
              A very simple auth mechanism using a shared secret

              Parameters

                     secret: str
                            The  string  that  must be matched in the requests. If None, a random
                            UUID is generated and logged.

                     key: str
                            Header entry in which to seek the secret

              allow_access(header, source, catalog)
                     Is the given HTTP header allowed to access the given data source

                     Parameters

                            header: dict
                                   The HTTP header from the incoming request

                            source: CatalogEntry
                                   The data source the user wants to access.

                            catalog: Catalog
                                   The catalog object containing this data source.

              allow_connect(header)
                     Is the given request header allowed to talk to the server

                     Parameters

                            header: dict
                                   The HTTP header from the incoming request

       class intake.auth.secret.SecretClientAuth(secret, key='intake-secret')
              Matching client auth plugin to SecretAuth

              Parameters

                     secret: str
                            The string that must be included in requests.

                     key: str
                            HTTP Header key for the shared secret

              get_headers()
                     Returns a dictionary of HTTP headers for the remote catalog request.

   Containers
     ┌──────────────────────────────────────────────────────────┬──────────────────────────────────┐
     │intake.container.dataframe.RemoteDataFrame(...)           │ Dataframe on an Intake server    │
     ├──────────────────────────────────────────────────────────┼──────────────────────────────────┤
     │intake.container.ndarray.RemoteArray(*args,               │ nd-array on an Intake server     │
     │...)                                                      │                                  │
     ├──────────────────────────────────────────────────────────┼──────────────────────────────────┤
     │intake.container.semistructured.RemoteSequenceSource(...) │ Sequence-of-things  source on an │
     │                                                          │ Intake server                    │
     └──────────────────────────────────────────────────────────┴──────────────────────────────────┘

       class intake.container.dataframe.RemoteDataFrame(*args, **kwargs)
              Dataframe on an Intake server

              read() Load entire dataset into a container and return it

              to_dask()
                     Return a dask container for this data source

       class intake.container.ndarray.RemoteArray(*args, **kwargs)
              nd-array on an Intake server

              read() Load entire dataset into a container and return it

              read_partition(i)
                     Return a part of the data corresponding to i-th partition.

                     By default, assumes i should be an integer  between  zero  and  npartitions;
                     override for more complex indexing schemes.

              to_dask()
                     Return a dask container for this data source

       class intake.container.semistructured.RemoteSequenceSource(*args, **kwargs)
              Sequence-of-things source on an Intake server

              read() Load entire dataset into a container and return it

              to_dask()
                     Return a dask container for this data source

   Server
         ┌──────────────────────────────────────────────────┬──────────────────────────────────┐
         │intake.cli.server.server.IntakeServer(catalog)    │ Main    intake-server    tornado │
         │                                                  │ application                      │
         ├──────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.cli.server.server.ServerInfoHandler(...)   │ Basic info about the server      │
         ├──────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.cli.server.server.SourceCache()            │ Stores DataSources requested  by │
         │                                                  │ some user                        │
         ├──────────────────────────────────────────────────┼──────────────────────────────────┤
         │intake.cli.server.server.ServerSourceHandler(...) │ Open or stream data source       │
         └──────────────────────────────────────────────────┴──────────────────────────────────┘

       class intake.cli.server.server.IntakeServer(catalog)
              Main intake-server tornado application

       class   intake.cli.server.server.ServerInfoHandler(application:   tornado.web.Application,
       request: tornado.httputil.HTTPServerRequest, **kwargs: Any)
              Basic info about the server

       class intake.cli.server.server.SourceCache
              Stores DataSources requested by some user

              peek(uuid)
                     Get the source but do not change the last access time

       class  intake.cli.server.server.ServerSourceHandler(application:  tornado.web.Application,
       request: tornado.httputil.HTTPServerRequest, **kwargs: Any)
              Open or stream data source

              The request's "action" field (open|read) specifies what the request wants to do.
              Open caches the source and creates an ID for it; read uses that ID to reference the
              source and read a partition.

              get()  Access one source's info.

                     This is for direct access to an entry by name for random  access,  which  is
                     useful  to  the  client when the whole catalog has not first been listed and
                     pulled locally (e.g., in the case of pagination).

   GUI
   Changelog
   0.6.6
       Released on August 26, 2022.

       • Fixed bug in json and jsonl driver.

       • Ensure description is retained in the catalog.

       • Fix cache issue when running inside a notebook.

       • Add templating parameters.

       • Plotting api keeps hold of hvplot calls to allow other plots to be made.

       • docs updates

       • fix urljoin for server via proxy

   0.6.5
       Released on January 9, 2022.

       • Added link to intake-google-analytics.

       • Add tiled driver.

       • Add json and jsonl drivers.

       • Allow parameters to be passed through catalog.

       • Add mlist type which allows inputs from a known list of values.

   Making Drivers
       The goal of the Intake plugin system is to make it very simple to implement a Driver for a
       new data source, without any special knowledge of Dask or the Intake catalog system.

   Assumptions
       Although  Intake  is  very  flexible  about  data, there are some basic assumptions that a
       driver must satisfy.

   Data Model
       Intake currently supports 3 kinds of containers, representing the most common data models
       used in Python:

       • dataframe

       • ndarray

       • python (list of Python objects, usually dictionaries)

       Although  a  driver  can load any type of data into any container, and new container types
       can be added to the list above, it is reasonable to expect that the  number  of  container
       types  remains  small.  Declaring a container type is only informational for the user when
       read locally, but streaming of data from a server requires  that  the  container  type  be
       known to both server and client.

       A  given  driver  must only return one kind of container.  If a file format (such as HDF5)
       could reasonably be interpreted as two different data models depending on usage (such as a
       dataframe  or  an  ndarray),  then two different drivers need to be created with different
       names.  If a driver returns the python container, it should document what  Python  objects
       will appear in the list.

       The  source  of  data should be essentially permanent and immutable.  That is, loading the
       data should not destroy or modify the data, nor should closing the data source destroy the
       data  either.   When a data source is serialized and sent to another host, it will need to
       be reopened at the destination, which may cause queries to be re-executed and files to  be
       reopened.   Data  sources that treat readers as "consumers" and remove data once read will
       cause erratic behavior, so Intake is not suitable for accessing things like  FIFO  message
       queues.

   Schema
       The  schema  of a data source is a detailed description of the data, which can be known by
       loading only metadata or by loading only some small representative portion of the data. It
       is  information  to  present to the user about the data that they are considering loading,
       and may be important in the case of server-client communication. In  the  latter  context,
       the  contents of the schema must be serializable by msgpack (i.e., numbers, strings, lists
       and dictionaries only).
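
       As a rough illustration of the serializability constraint, the following sketch checks
       that a schema dictionary is built only from numbers, strings, lists and dictionaries
       (plus None for unknown values). The helper is_plain_data is hypothetical and not part of
       Intake; it uses only the standard library rather than msgpack itself, and treating
       tuples like lists is an assumption of this sketch.

```python
def is_plain_data(obj):
    """Return True if obj contains only None, bools, numbers, strings,
    lists/tuples and string-keyed dictionaries (checked recursively)."""
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return True
    if isinstance(obj, (list, tuple)):
        # Assumption of this sketch: tuples are treated like lists.
        return all(is_plain_data(item) for item in obj)
    if isinstance(obj, dict):
        return all(isinstance(k, str) and is_plain_data(v) for k, v in obj.items())
    return False


# A schema in the spirit of the fields discussed in this section.
schema = {
    "dtype": {"x": "int64", "y": "int64"},
    "shape": (None, 2),          # None marks an unknown dimension
    "npartitions": 2,
    "extra_metadata": {"c": 3, "d": 4},
}

# A Python type object is not plain data, so this one would be rejected.
bad_schema = {"dtype": {"x": int}}

print(is_plain_data(schema), is_plain_data(bad_schema))
```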

       There may be unknown parts of the schema before the whole data is read. Drivers may
       require this unknown information in the __init__() method (or the catalog spec), do
       some kind of partial data inspection to determine the schema, or, more simply, report
       the unknown parts as None values. Regardless of the method used, the time spent figuring
       out the schema ahead of time should be short and not scale with the size of the data.

       Typical fields in a schema dictionary are npartitions, dtype, shape, etc., which  will  be
       more appropriate for some drivers/data-types than others.

   Partitioning
       Data  sources  are assumed to be partitionable.  A data partition is a randomly accessible
       fragment of the data.  In the case of sequential and data-frame  sources,  partitions  are
       numbered,  starting  from  zero, and correspond to contiguous chunks of data divided along
       the first dimension of  the  data  structure.  In  general,  any  partitioning  scheme  is
       conceivable, such as a tuple-of-ints to index the chunks of a large numerical array.

       Not  all  data  sources  can be partitioned.  For example, file formats without sufficient
       indexing often can only be read from beginning to end.  In  these  cases,  the  DataSource
       object  should report that there is only 1 partition.  However, it often makes sense for a
       data source to be able to represent a directory of files, in which  case  each  file  will
       correspond to one partition.
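
       The numbering scheme for sequential sources can be sketched as follows. The function
       partition_bounds is a hypothetical helper, not an Intake API; it only shows how
       zero-based partition numbers can map onto contiguous chunks along the first dimension.

```python
def partition_bounds(n_rows, npartitions):
    """Split range(n_rows) into npartitions contiguous (start, stop) pairs."""
    base, extra = divmod(n_rows, npartitions)
    bounds = []
    start = 0
    for i in range(npartitions):
        # The first `extra` partitions receive one additional row each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


rows = list(range(10))
bounds = partition_bounds(len(rows), 3)

# Partition i is simply the slice rows[start:stop]; each one can be
# produced without touching the others (random access).
partitions = [rows[start:stop] for start, stop in bounds]
print(bounds)
```

       For a directory of files, the same idea applies with one file per partition, so that
       partition i corresponds to the i-th file in some fixed ordering.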

   Metadata
       Once  opened,  a  DataSource  object  can have arbitrary metadata associated with it.  The
       metadata for a data source should be a dictionary that can be serialized  as  JSON.   This
       metadata comes from the following sources:

       1. A  data  catalog  entry  can  associate  fixed  metadata with the data source.  This is
          helpful for data formats that do not have any support  for  metadata  within  the  file
          format.

       2. The  driver handling the data source may have some general metadata associated with the
          state of the  system  at  the  time  of  access,  available  even  before  loading  any
          data-specific information.

       3. A driver can add additional metadata when the schema is loaded for the data source.
          This allows metadata embedded in the data source to be exported.

       From the user perspective, all of the metadata should be loaded once the data  source  has
       loaded the rest of the schema (after discover(), read(), to_dask(), etc have been called).
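
       A minimal sketch of how the three sources of metadata might be combined into one
       JSON-serializable dictionary is shown below. The function combine_metadata and the
       example keys are hypothetical; in particular, letting later sources overwrite earlier
       ones is a choice made for this sketch, not a statement about Intake's behavior.

```python
import json


def combine_metadata(*sources):
    """Merge metadata dictionaries and verify the result is valid JSON."""
    combined = {}
    for source in sources:
        combined.update(source)   # assumption: later sources win on key clashes
    json.dumps(combined)          # raises TypeError if something is not serializable
    return combined


catalog_metadata = {"description": "example source"}   # from the catalog entry
driver_metadata = {"driver_version": "0.0.1"}          # known before loading data
schema_metadata = {"c": 3, "d": 4}                     # found while loading the schema

metadata = combine_metadata(catalog_metadata, driver_metadata, schema_metadata)
print(sorted(metadata))
```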

   Subclassing intake.source.base.DataSourceBase
       Every  Intake  driver  class  should  be a subclass of intake.source.base.DataSource.  The
       class should have the following attributes to identify itself:

       • name: The short name of the driver.  This should be  a  valid  python  identifier.   You
         should not include the word intake in the driver name.

       • version:  A  version  string  for the driver.  This may be reported to the user by tools
         based on Intake, but has no semantic importance.

       • container: The container type of data sources created by this object, e.g., dataframe,
         ndarray, or python, one of the keys of intake.container.container_map. For simplicity,
         a driver may only return one type of container. If a particular source of data could
         be used in multiple ways (such as HDF5 files interpreted as dataframes or as ndarrays),
         two drivers must be created. These two drivers can be part of the same Python package.

       • partition_access: Do the data sources returned by this driver have multiple  partitions?
         This may help tools in the future make more optimal decisions about how to present data.
         If in doubt (or the answer depends on  init  arguments),  True  will  always  result  in
         correct behavior, even if the data source has only one partition.

       The __init__() method should always accept a keyword argument metadata, a dictionary of
       metadata from the  catalog  to  associate  with  the  source.   This  dictionary  must  be
       serializable as JSON.

       The  DataSourceBase  class has a small number of methods which should be overridden.  Here
       is an example producing a data-frame:

          import intake
          import pandas as pd

          class FooSource(intake.source.base.DataSource):
              container = 'dataframe'
              name = 'foo'
              version = '0.0.1'
              partition_access = True

              def __init__(self, a, b, metadata=None):
                  # Do init here with a and b
                  super(FooSource, self).__init__(
                      metadata=metadata
                  )

              def _get_schema(self):
                  return intake.source.base.Schema(
                      datashape=None,
                      dtype={'x': "int64", 'y': "int64"},
                      shape=(None, 2),
                      npartitions=2,
                      extra_metadata=dict(c=3, d=4)
                  )

              def _get_partition(self, i):
                  # Return the appropriate container of data here
                  return pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

              def read(self):
                  self._load_metadata()
                  return pd.concat([self.read_partition(i) for i in range(self.npartitions)])

              def _close(self):
                  # close any files, sockets, etc
                  pass

       Most of the work typically happens in the following methods:

       • __init__(): Should be very lightweight and fast.  No files or network  resources  should
         be  opened,  and  no  significant memory should be allocated yet.  Data sources might be
         serialized immediately.  The default implementation of the pickle protocol in  the  base
         class  will  record  all  the arguments to __init__() and recreate the object with those
         arguments when unpickled, assuming the class has no side effects.

       • _get_schema(): May open files and network resources and return as much of the schema as
         possible in a small, approximately constant amount of time. Typically, imports of packages
         needed by the source only happen here.  The npartitions  and  extra_metadata  attributes
         must  be  correct  when  _get_schema  returns.  Further keys such as dtype, shape, etc.,
         should reflect the container type of the data-source, and can  be  None  if  not  easily
         knowable,  or  include  None  for some elements. File-based sources should use fsspec to
         open a local or remote URL, and pass storage_options to it. This  ensures  compatibility
         and  extra  features such as caching. If the backend can only deal with local files, you
         may still want to use fsspec.open_local to allow for caching.

       • _get_partition(self, i): Should return all of the data from partition id i, where  i  is
         typically  an  integer,  but  may  be  something  more  complex.   The  base  class will
         automatically verify that i is in the range [0, npartitions), so no  range  checking  is
         required in the typical case.

       • _close(self):  Close  any network or file handles and deallocate any significant memory.
         Note that these resources may need to be reopened/reallocated if a read is called
         again later.
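
       The idea of recording the arguments to __init__() and replaying them on reconstruction
       can be sketched with the standard library alone. The classes RecordsInitArgs and
       TinySource below are hypothetical and are not Intake's implementation. The sketch uses
       copy.deepcopy, which goes through the same __reduce__ hook that pickle uses.

```python
import copy


def _rebuild(cls, args, kwargs):
    """Recreate an object by calling its constructor with the recorded arguments."""
    return cls(*args, **kwargs)


class RecordsInitArgs:
    """Hypothetical base class that remembers how each instance was constructed."""

    def __new__(cls, *args, **kwargs):
        obj = super().__new__(cls)
        obj._init_args = (args, kwargs)   # captured before __init__ runs
        return obj

    def __reduce__(self):
        # Serialize only the constructor arguments, never the live state.
        args, kwargs = self._init_args
        return (_rebuild, (type(self), args, kwargs))


class TinySource(RecordsInitArgs):
    def __init__(self, urlpath, metadata=None):
        # Lightweight: store arguments only, open nothing.
        self.urlpath = urlpath
        self.metadata = metadata or {}
        self.handle = None


source = TinySource("data/*.csv", metadata={"owner": "me"})
source.handle = "pretend this is an open file"

# The copy is rebuilt from the recorded arguments, so the handle is not carried over.
clone = copy.deepcopy(source)
print(clone.urlpath, clone.handle)
```

       Because only the constructor arguments travel with the object, this scheme relies on
       __init__() being cheap and free of side effects, as described above.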

       The full set of user methods of interest are as follows:

       • discover(self):   Read   the   source   attributes,  like  npartitions,  etc.   As  with
         _get_schema() above, this method is assumed to be fast, and make a best  effort  to  set
         attributes.  The output should be serializable, if the source is to be used on a server;
         the details contained will be used for creating a remote-source on the client.

       • read(self): Return all the data in one in-memory container.

       • read_chunked(self): Return an iterator that returns contiguous chunks of the data.   The
         chunking  is  generally assumed to be at the partition level, but could be finer grained
         if desired.

       • read_partition(self, i): Returns the data for a given partition id.  It is assumed  that
         reading  a  given partition does not require reading the data that precedes it.  If i is
         out of range, an IndexError should be raised.

       • to_dask(self): Return a (lazy) Dask data structure corresponding to  this  data  source.
         It  should  be assumed that the data can be read from the Dask workers, so the loads can
         be done in future tasks.  For further information, see the Dask documentation.

       • close(self): Close network or file handles and deallocate memory.  If other methods  are
         called after close(), the source is automatically reopened.

       • to_*:  for  some  sources,  it makes sense to provide alternative outputs aside from the
         base container (dataframe, array, ...) and Dask variants.

       Note that all of these methods typically call _get_schema, to make sure  that  the  source
       has been initialised.
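
       How the user-facing methods relate to each other can be sketched with a small stand-in
       class. TinyListSource is hypothetical and does not derive from any Intake class; it only
       mirrors the method names listed above to show lazy schema loading, range checking and
       partition-level chunking.

```python
class TinyListSource:
    """Hypothetical stand-in for a data source whose partitions are lists."""

    def __init__(self, chunks):
        self._chunks = chunks        # cheap: nothing is loaded here
        self.npartitions = None      # unknown until the schema is loaded

    def _get_schema(self):
        return {"npartitions": len(self._chunks)}

    def _load_metadata(self):
        # Load the schema once, the first time any user method needs it.
        if self.npartitions is None:
            self.npartitions = self._get_schema()["npartitions"]

    def read_partition(self, i):
        self._load_metadata()
        if not 0 <= i < self.npartitions:
            raise IndexError(i)
        return list(self._chunks[i])

    def read_chunked(self):
        self._load_metadata()
        for i in range(self.npartitions):
            yield self.read_partition(i)

    def read(self):
        # All data in a single in-memory container.
        return [row for chunk in self.read_chunked() for row in chunk]


source = TinyListSource([[1, 2], [3], [4, 5]])
print(source.read_partition(1), source.read())
```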

   Subclassing intake.source.base.DataSource
       DataSource  provides  the  same  functionality  as DataSourceBase, but has some additional
       mixin classes to provide some extras. A developer may choose to derive from DataSource  to
       get all of these, or from DataSourceBase and make their own choice of mixins to support.

       • HoloviewsMixin: provides plotting and GUI capabilities via the holoviz stack

       • PersistMixin:  allows  for  storing  a  local  copy  in  a  default format for the given
         container type

       • CacheMixin: allows for local storage of data files for a source. Deprecated; you should
         use one of the caching mechanisms in fsspec.

   Driver Discovery
       Intake  discovers  available  drivers in three different ways, described below.  After the
       discovery phase, Intake will automatically create open_[driver_name] convenience functions
       under  the  intake  module namespace.  Calling a function like open_csv() is equivalent to
       instantiating the corresponding data-source class.

   Entrypoints
       If you are packaging your driver into an installable package to be shared, you should  add
       the following to the package's setup.py:

          setup(
              ...
              entry_points={
                  'intake.drivers': [
                      'some_format_name = some_package.and_maybe_a_submodule:YourDriverClass',
                      ...
                  ]
              },
          )

       IMPORTANT:
          Some critical details of Python's entrypoints feature:

          • Note  the  unusual  syntax of the entrypoints. Each item is given as one long string,
            with the = as part of the string. Modules are separated by ., and  the  final  object
            name is preceded by :.

          • The  right  hand  side  of the equals sign must point to where the object is actually
            defined.  If  YourDriverClass  is   defined   in   foo/bar.py   and   imported   into
            foo/__init__.py  you  might  expect foo:YourDriverClass to work, but it does not. You
            must spell out foo.bar:YourDriverClass.

       Entry points are a  way  for  Python  packages  to  advertise  objects  with  some  common
       interface.  When  Intake  is  imported, it discovers all packages installed in the current
       environment that advertise 'intake.drivers' in this way.

       Most packages that define intake drivers have a dependency on intake itself,  for  example
       in  order  to  use intake's base classes. This can create a circular dependency: importing
       the package imports intake, which tries  to  discover  and  import  packages  that  define
       drivers.  To  avoid  this pitfall, just ensure that intake is imported first thing in your
       package's __init__.py. This ensures that the driver-discovery code runs first.  Note  that
       you are not required to make your package depend on intake. The rule is that if you import
       intake you must import it  first  thing.  If  you  do  not  import  intake,  there  is  no
       circularity.
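
       To see what an entry point string contains, and which packages in the current
       environment advertise the 'intake.drivers' group, something like the following can be
       used. This is a sketch based on the standard library's importlib.metadata, not Intake's
       own discovery code; the function installed_driver_entry_points is a hypothetical helper.

```python
from importlib.metadata import EntryPoint, entry_points


def installed_driver_entry_points(group="intake.drivers"):
    """List entry points advertised under the given group in this environment."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions return an object with a select() method.
        return list(eps.select(group=group))
    # Older versions return a plain mapping of group name to entry points.
    return list(eps.get(group, []))


# The same string as in the setup.py example, split at "=" into name and value.
ep = EntryPoint(
    name="some_format_name",
    value="some_package.and_maybe_a_submodule:YourDriverClass",
    group="intake.drivers",
)

# Modules are separated by ".", and the object name follows the ":".
module_path, _, object_name = ep.value.partition(":")
print(ep.name, module_path, object_name)
print(len(installed_driver_entry_points()), "driver entry points found")
```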

   Configuration
       The intake configuration file can be used to:

       • Specify precedence in the event of name collisions, for example, if two different csv
         drivers are installed.

       • Disable a troublesome driver.

       • Manually make intake aware of a driver, which can  be  useful  for  experimentation  and
         early development until a setup.py with an entrypoint is prepared.

       • Assign a driver to a name other than the one assigned by the driver's author.

       The commandline invocation

          intake drivers enable some_format_name some_package.and_maybe_a_submodule.YourDriverClass

       is equivalent to adding this to your intake configuration file:

          drivers:
            some_format_name: some_package.and_maybe_a_submodule.YourDriverClass

       You can also disable a troublesome driver:

          intake drivers disable some_format_name

       which is equivalent to

          drivers:
            some_format_name: false

   Deprecated: Package Scan
       When Intake is imported, it will search the Python module path (by default includes
       site-packages and other directories in your $PYTHONPATH) for packages starting with
       intake_ and discover DataSource subclasses inside those packages to register. Drivers
       will be registered based on the name attribute of the object. By convention, drivers
       should have names that are lowercase, valid Python identifiers that do not contain the
       word intake.

       This approach is deprecated because it is limiting (requires the  package  to  begin  with
       "intake_")  and  because  the  package  scan  can  be  slow. Using entrypoints is strongly
       encouraged. The package scan may be disabled by default in some future release of  intake.
       During  the  transition  period,  if a package named intake_* provides an entrypoint for a
       given name, that will take precedence over any  drivers  gleaned  from  the  package  scan
       having  that name. If intake discovers any names from the package scan for which there are
       no entrypoints, it will issue a FutureWarning.

   Python API to Driver Discovery
   Remote Data
       For drivers loading from files, the author should be aware that it is  easy  to  implement
       loading  from  files  stored  in remote services. A simplistic case is demonstrated by the
       included CSV driver, which simply passes a URL to Dask, which in turn  can  interpret  the
       URL  as  a  remote  data  service,  and  use the storage_options as required (see the Dask
       documentation on remote data).

       More advanced usage, where a Dask loader does not already exist, will likely rely on
       fsspec.open_files. Use this function to produce lazy OpenFile objects for local or remote
       data, based on a URL, which will have a protocol designation and possibly contain glob "*"
       characters.   Additional  parameters  may  be  passed  to  open_files,  which  should,  by
       convention, be supplied by a driver argument named storage_options (a dictionary).

       To use an OpenFile object, make it concrete by using a context:

          # at setup, to discover the number of files/partitions
          set_of_open_files = fsspec.open_files(urlpath, mode='rb', **storage_options)

          # when actually loading data; here we loop over all files, but maybe we just do one partition
          for an_open_file in set_of_open_files:
              # `with` causes the object to become concrete until the end of the block
              with an_open_file as f:
                  # do things with f, which is a file-like object
                  f.seek(0); f.read()

       The built-in textfiles driver implements this mechanism, as an example.

   Structured File Paths
       The CSV driver sets up an example of how to gather data which is  encoded  in  file  paths
       like  ('data_{site}_.csv')  and  return that data in the output.  Other drivers could also
       follow the same structure where data is being loaded from a set  of  filenames.  Typically
       this  would apply to data-frame output.  This is possible as long as the driver has access
       to each of the file paths at some point in _get_schema. Once the file paths are known, the
       driver  developer  can  use the helper functions defined in intake.source.utils to get the
       values for each field in the pattern for each file in the list. These values  should  then
       be added to the data, a process which normally would happen within the _get_schema method.
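
       As an illustration of what extracting field values from a path involves, the sketch
       below turns a pattern such as 'data_{site}_.csv' into a regular expression. The function
       extract_fields is a hypothetical stand-in written with the standard library only; the
       helpers in intake.source.utils may behave differently, for example with respect to
       format specifiers.

```python
import re
from string import Formatter


def extract_fields(pattern, path):
    """Return a dict of field values found in path according to pattern."""
    regex = ""
    for literal, field, _spec, _conversion in Formatter().parse(pattern):
        regex += re.escape(literal)           # fixed text must match exactly
        if field is not None:
            regex += f"(?P<{field}>.+?)"      # each {field} becomes a named group
    match = re.fullmatch(regex, path)
    if match is None:
        raise ValueError(f"{path!r} does not match pattern {pattern!r}")
    return match.groupdict()


pattern = "data_{site}_.csv"
file_paths = ["data_paris_.csv", "data_oslo_.csv"]

# One dictionary of field values per file, ready to be attached to that file's data.
values = [extract_fields(pattern, path) for path in file_paths]
print(values)
```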

       The  PatternMixin defines driver properties such as urlpath, path_as_pattern, and pattern.
       The implementation might look something like this:

          import intake
          from intake.source.utils import reverse_formats

          class FooSource(intake.source.base.DataSource, intake.source.base.PatternMixin):
              def __init__(self, a, b, path_as_pattern, urlpath, metadata=None):
                  # Do init here with a and b
                  self.path_as_pattern = path_as_pattern
                  self.urlpath = urlpath

                  super(FooSource, self).__init__(
                      container='dataframe',
                      metadata=metadata
                  )
              def _get_schema(self):
                  # read in the data
                  values_by_field = reverse_formats(self.pattern, file_paths)
                  # add these fields and map values to the data
                  return data

       Since dask already has a specific method for including the file paths in the output
       dataframe, in the CSV driver we set include_path_column=True to get a dataframe where one
       of the columns contains all the file paths. In this case, the "add these fields and map
       values to the data" step is a mapping between the categorical file paths column and the
       values_by_field.

       In other drivers, where each file is read in independently, the driver developer can set
       the new fields on the data from each file before concatenating. This pattern looks more
       like:

          import intake
          from intake.source.utils import reverse_format

          class FooSource(intake.source.base.DataSource):
              ...

              def _get_schema(self):
                  # get list of file paths
                  for path in file_paths:
                      # read in the file
                      values_by_field = reverse_format(self.pattern, path)
                      # add these fields and values to the data
                  # concatenate the datasets
                  return data

       To toggle this path-as-pattern behavior on and off, the CSV and intake-xarray drivers use
       the bool path_as_pattern keyword argument.

   Authorization Plugins
       Authorization plugins are classes that can be used to customize access permissions to  the
       Intake  catalog  server.   The  Intake  server  and  client communicate over HTTP, so when
       security is a concern, the most important step to take is to  put  a  TLS-enabled  reverse
       proxy (like nginx) in front of the Intake server to encrypt all communication.

       Whether  or  not  the  connection  is  encrypted,  the Intake server by default allows all
       clients to list the full catalog, and open any of the entries.  For many use  cases,  this
       is  sufficient, but if the visibility of catalog entries needs to be limited based on some
       criteria, a server- (and/or client-) side authorization plugin can be used.

   Server Side
       An Intake server can have exactly one server side plugin enabled at startup.   The  plugin
       is  activated  using  the  Intake  configuration  file, which lists the class name and the
       keyword arguments it takes.  For example, the "shared secret" plugin would  be  configured
       this way:

          auth:
            cls: intake.auth.secret.SecretAuth
            kwargs:
              secret: A_SECRET_HASH

       This  plugin is very simplistic, and exists as a demonstration of how an auth plugin might
       function for more realistic scenarios.

       For more information about configuring the Intake server, see Configuration.

       The server auth plugin has two methods.  The allow_connect()  method  decides  whether  to
       allow  a  client  to  make any request to the server at all, and the allow_access() method
       decides whether the client is allowed to see a particular catalog entry in the listing and
       whether  they  are  allowed to open that data source.  Note that for catalog entries which
       allow  direct  access  to  the  data  (via  network  or  shared  filesystem),  the  Intake
       authorization  plugins  have  no impact on the visibility of the underlying data, only the
       entries in the catalog.

       The actual implementation of a plugin is very short.  Here is a simplified version of  the
       shared secret auth plugin:

          class SecretAuth(BaseAuth):
              def __init__(self, secret, key='intake-secret'):
                  self.secret = secret
                  self.key = key

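              def allow_connect(self, header):
                  # Allow the connection only if the shared secret header matches.
                  try:
                      return self.get_case_insensitive(header, self.key, '') \
                                  == self.secret
                  except Exception:
                      return False

              def allow_access(self, header, source, catalog):
                  # Same check per entry; source and catalog are not inspected here.
                  try:
                      return self.get_case_insensitive(header, self.key, '') \
                                  == self.secret
                  except Exception:
                      return False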

       The  header  argument  is  a  dictionary  of  HTTP headers that were present in the client
       request.  In this case, the plugin is looking for a  special  intake-secret  header  which
       contains  the  shared secret token.  Because HTTP header names are not case sensitive, the
       BaseAuth  class  provides  a  helper  method  get_case_insensitive(),  which  will   match
       dictionary keys in a case-insensitive way.

       The  allow_access  method also takes two additional arguments.  The source argument is the
       instance of LocalCatalogEntry for the data  source  being  checked.   Most  commonly  auth
       plugins  will  want  to  inspect the _metadata dictionary for information used to make the
       authorization decision.  Note that it is entirely up to the plugin author to  decide  what
       sections  they  want  to  require  in  the  metadata section.  The catalog argument is the
       instance of Catalog that contains the catalog entry.  Typically, plugins will want to  use
       information from the catalog.metadata dictionary to control global defaults, although this
       is also up to the plugin.
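
       To make the role of the source and catalog arguments concrete, here is a minimal sketch
       of a plugin that reads an allow-list from metadata. Everything in it is an assumption
       made for illustration: the GroupAuth class, the intake-group header and the
       allowed_groups metadata key are hypothetical, and the small BaseAuth class is only a
       stand-in for Intake's own base class so that the example runs by itself.

```python
from types import SimpleNamespace


class BaseAuth:
    """Stand-in for Intake's BaseAuth, providing only the helper used below."""

    def get_case_insensitive(self, dictionary, key, default=None):
        lower_key = key.lower()
        for k, v in dictionary.items():
            if k.lower() == lower_key:
                return v
        return default


class GroupAuth(BaseAuth):
    """Hypothetical plugin: each entry lists the groups allowed to see it."""

    def __init__(self, key='intake-group'):
        self.key = key

    def allow_connect(self, header):
        # Any client that names a group may connect at all.
        return bool(self.get_case_insensitive(header, self.key, ''))

    def allow_access(self, header, source, catalog):
        group = self.get_case_insensitive(header, self.key, '')
        # A per-entry setting wins; otherwise use the catalog-wide default.
        default = catalog.metadata.get('allowed_groups', [])
        allowed = source._metadata.get('allowed_groups', default)
        return group in allowed


# Fake catalog and entries, standing in for Catalog and LocalCatalogEntry objects.
auth = GroupAuth()
catalog = SimpleNamespace(metadata={'allowed_groups': ['staff']})
shared_entry = SimpleNamespace(_metadata={'allowed_groups': ['staff', 'guest']})
default_entry = SimpleNamespace(_metadata={})
header = {'Intake-Group': 'guest'}

print(auth.allow_access(header, shared_entry, catalog))   # True
print(auth.allow_access(header, default_entry, catalog))  # False
```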

   Client Side
       Although server side auth plugins can function entirely independently, some  authorization
       schemes  will  require  the client to add special HTTP headers for the server to look for.
       To facilitate this, the Catalog constructor accepts an optional  auth  parameter  with  an
       instance of a client auth plugin class.

       The corresponding client plugin for the shared secret use case described above looks like:

          class SecretClientAuth(BaseClientAuth):
              def __init__(self, secret, key='intake-secret'):
                  self.secret = secret
                  self.key = key

              def get_headers(self):
                  return {self.key: self.secret}

       It  defines  a  single  method,  get_headers(),  which  is  called  to get a dictionary of
       additional headers to add to the HTTP request to the catalog server.  To use this  plugin,
       we would do the following:

          import intake
          from intake.auth.secret import SecretClientAuth

          auth = SecretClientAuth('A_SECRET_HASH')
          cat = intake.Catalog('http://example.com:5000', auth=auth)

       Now all requests made to the remote catalog will contain the intake-secret header.

   Making Data Packages
       Intake can be used to create Data packages, so that you can easily distribute your catalogs -
       others can just "install data". Since you may also want to distribute  custom  catalogues,
       perhaps  with  visualisations, and driver code, packaging these things together is a great
       convenience. Indeed, packaging gives you the opportunity to version-tag your  distribution
       and  to  declare  the  requirements  needed  to  be able to use the data. This is a common
       pattern for distributing code for python and other languages, but not  commonly  seen  for
       data artifacts.

       The  current version of Intake allows making data packages using standard python tools (to
       be installed, for example, using pip).  The previous, now deprecated, technique  is  still
       described below, under Pure conda solution and is specific to the conda packaging system.

   Python packaging solution
       Intake  allows  you to register data artifacts (catalogs and data sources) in the metadata
       of a python package.  This means that when you install that package, intake will
       automatically  know  of  the  registered  items, and they will appear within the "builtin"
       catalog intake.cat.

       Here we assume that you understand what is meant by  a  python  package  (i.e.,  a  folder
       containing  __init__.py  and  other code, config and data files).  Furthermore, you should
       familiarise  yourself  with  what  is  required  for  bundling  such  a  package  into   a
       distributable   package   (one   with  a  setup.py)  by  reading  the  official  packaging
       documentation.

       The intake examples contain a full tutorial for packaging and distributing intake data
       and/or catalogs for pip and conda, see the directory "data_package/".

   Entry points definition
       Intake  uses the concept of entry points to define the entries that are defined by a given
       package. Entry points provide a mechanism to register metadata about a package at  install
       time,  so  that it can easily be found by other packages such as Intake.  Entry points was
       originally a separate package, but is included in the standard library as  of  python  3.8
       (you will not need to install it, as Intake requires it).

       All you need to do to register an entry in intake.cat is:

       • define a data source somewhere in your package. This object can be of any type that
         makes sense to Intake, including Catalogs, and sources that have drivers defined in  the
         very  same  package. Obviously, if you can have catalogs, you can populate these however
         you wish, including with more catalogs.  You need not be restricted to simply loading in
         YAML files.

       • include a block in your call to setup in setup.py with code something like

            entry_points={
                'intake.catalogs': [
                    'sea_cat = intake_example_package:cat',
                    'sea_data = intake_example_package:data'
                ]
            }

          Here only the lines with "sea_cat" and "sea_data" are specific to the example
          package, the rest is required boilerplate. Each of those two lines defines a name
          for the data entry (before the "=" sign) and the location to load from, in
          module:object format.

       • install the package using pip, python setup.py, or package it for conda
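
       Once such a package is installed, the names it registered can be checked with the
       standard library alone. The following is a minimal sketch, not part of Intake:
       registered_catalogs is a hypothetical helper, and the branch on select() is there
       because the return type of entry_points() differs between Python versions.

```python
from importlib.metadata import entry_points


def registered_catalogs(group="intake.catalogs"):
    """Return {name: 'module:object'} for every entry point in the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later: a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9: a mapping from group name to entry points.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


for name, location in registered_catalogs().items():
    print(name, "->", location)
```

       With the example package above installed, the output would be expected to include the
       sea_cat and sea_data names; with no data packages installed, nothing is printed.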

   Intake's process
       When   Intake   is  imported,  it  investigates  all  registered  entry  points  with  the
       "intake.catalogs" group. It will go through and assign each name to the given location  of
       the  final  object.  In the above example, intake.cat.sea_cat would be associated with the
       cat object in the intake_example_package package, and so on.

       Note that Intake does not immediately import the given package or module, because  imports
       can  sometimes  be  expensive,  and  if  you have a lot of data packages, it might cause a
       slow-down every time that Intake is imported. Instead, a placeholder entry is created, and
       whenever the entry is accessed, that's when the particular package will be imported.

          In [1]: import intake

          In [2]: intake.cat.sea_cat  # does not import yet
          Out[2]: <Entry containing Catalog named sea_cat>

          In [3]: cat = intake.cat.sea_cat()  # imports now

          In [4]: cat   # this data source happens to be a catalog
          Out[4]: <Intake catalog: sea>

       (note  here  the  parentheses  -  this explicitly initialises the source, and normally you
       don't have to do this)
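
       The deferred import described above can be pictured with a small stand-alone class.
       This is only a sketch of the idea under simplifying assumptions, not Intake's
       implementation: LazyEntry is a hypothetical name, and a standard library object is
       used as the target so that the example runs anywhere.

```python
import importlib


class LazyEntry:
    """Placeholder that imports 'module:object' only when first called."""

    def __init__(self, name, location):
        self.name = name
        self.location = location
        self._obj = None  # nothing has been imported yet

    def __call__(self):
        if self._obj is None:
            module_name, _, attr = self.location.partition(":")
            module = importlib.import_module(module_name)  # the import happens here
            self._obj = getattr(module, attr)
        return self._obj

    def __repr__(self):
        return f"<Entry named {self.name} at {self.location}>"


entry = LazyEntry("pi", "math:pi")
print(entry)    # <Entry named pi at math:pi>
print(entry())  # 3.141592653589793
```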

   Pure conda solution
       This packaging method is deprecated, but still available.

       Combined with the Conda Package Manager, Intake makes it possible to create Data packages
       which  can  be  installed  and  upgraded just like software packages.  This offers several
       advantages:

          • Distributing Catalogs and Drivers becomes as easy as conda install

          • Data packages can be versioned, improving reproducibility in some cases

          • Data packages can depend on the libraries required for reading

          • Data packages can be self-describing using Intake catalog files

          • Applications that need certain Catalogs can include data packages in their dependency
            list

       In  this  tutorial,  we  give  a  walk-through to enable you to distribute any Catalogs to
       others, so that they can access the data using Intake  without  worrying  about  where  it
       resides or how it should be loaded.

   Implementation
       The  function intake.catalog.default.load_combo_catalog searches for YAML catalog files in
       a number of places at import. All entries in these catalogs are flattened and placed in the
       "builtin" intake.cat.

       The places searched are:

          • a platform-specific user directory as given by the appdirs package

          • in  the  environment's  "/share/intake"  data  directory,  where  the location of the
            current environment is found from virtualenv or conda environment variables

          • in directories listed in the "INTAKE_PATH"  environment  variable  or  "catalog_path"
            config parameter
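
       The last two rules can be approximated in a few lines. This is a sketch under stated
       assumptions rather than the code Intake runs: catalog_search_dirs is a hypothetical
       helper, it leaves out the appdirs user directory and the "catalog_path" config
       parameter, and it assumes that "INTAKE_PATH" entries are separated by the platform's
       usual path separator.

```python
import os


def catalog_search_dirs(environ=os.environ):
    """List candidate catalog directories derived from environment variables."""
    dirs = []
    # The environment's share/intake directory, for conda or virtualenv.
    for var in ("CONDA_PREFIX", "VIRTUAL_ENV"):
        prefix = environ.get(var)
        if prefix:
            dirs.append(os.path.join(prefix, "share", "intake"))
    # Extra directories named explicitly by the user.
    extra = environ.get("INTAKE_PATH", "")
    dirs.extend(p for p in extra.split(os.pathsep) if p)
    return dirs


print(catalog_search_dirs())
```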

   Defining a Package
       The steps involved in creating a data package are:

       1. Identifying  a  dataset, which can be accessed via a URL or included directly as one or
          more files in the package.

       2. Creating a package containing:

          • an intake catalog file

          • a meta.yaml file (description of the data, version, requirements, etc.)

          • a script to copy the data

       3. Building the package using the command conda build.

       4. Uploading the package to a package repository  such  as  Anaconda  Cloud  or  your  own
          private repository.

       Data  packages  are  standard  conda packages that install an Intake catalog file into the
       user's  conda  environment  ($CONDA_PREFIX/share/intake).   A  data   package   does   not
       necessarily  imply there are data files inside the package.  A data package could describe
       remote data sources (such as files in S3) and take up very little space on disk.

       These packages are considered noarch packages, so that one package can be installed on any
       platform,  with  any  version  of Python (or no Python at all).  The easiest way to create
       such a package is using a conda build recipe.

       Conda-build recipes are stored in a directory that contains files like:

          • meta.yaml - description of package metadata

          • build.sh - script for building/installing package contents (on Linux/macOS)

          • other files needed by the package (catalog files and data files for data packages)

       An example that packages up data from a Github repository would look like this:

          # meta.yaml
          package:
            version: '1.0.0'
            name: 'data-us-states'

          source:
            git_rev: v1.0.0
            git_url: https://github.com/CivilServiceUSA/us-states

          build:
            number: 0
            noarch: generic

          requirements:
            run:
              - intake
            build: []

          about:
            description: Data about US states from CivilServices (https://civil.services/)
            license: MIT
            license_family: MIT
            summary: Data about US states from CivilServices

       The key part of a data package recipe (different from typical conda recipes) is the build
       section:

          build:
            number: 0
            noarch: generic

       This  will  create  a  package  that  can  be installed on any platform, regardless of the
       platform where the package is built.  If you need to rebuild a package, the  build  number
       can be incremented to ensure users get the latest version when they conda update.

       The corresponding build.sh file in the recipe looks like this:

          #!/bin/bash

          mkdir -p $PREFIX/share/intake/civilservices
          cp $SRC_DIR/data/states.csv $PREFIX/share/intake/civilservices
          cp $RECIPE_DIR/us_states.yaml $PREFIX/share/intake/

       The  $SRC_DIR  variable  refers  to  any  source  tree  checked  out (from Github or other
       service), and the $RECIPE_DIR refers to the directory where the meta.yaml is located.

       Finishing out this example, the catalog file for this data source looks like this:

          sources:
            states:
              description: US state information from [CivilServices](https://civil.services/)
              driver: csv
              args:
                urlpath: '{{ CATALOG_DIR }}/civilservices/states.csv'
              metadata:
                origin_url: 'https://github.com/CivilServiceUSA/us-states/blob/v1.0.0/data/states.csv'

       The {{ CATALOG_DIR }} Jinja2 variable is used to construct a path relative  to  where  the
       catalog file was installed.

       To build the package, you must have conda-build installed:

          conda install conda-build

       Building the package requires no special arguments:

          conda build my_recipe_dir

       Conda-build will display the path of the built package, which you will need in order to
       upload it.

       If  you want your data package to be publicly available on Anaconda Cloud, you can install
       the anaconda-client utility:

          conda install anaconda-client

       Then you can register your Anaconda Cloud credentials and upload the package:

          anaconda login
          anaconda upload /Users/intake_user/anaconda/conda-bld/noarch/data-us-states-1.0.0-0.tar.bz2

   Best Practices
   Versioning
       • Versions for data packages should be used to indicate changes  in  the  data  values  or
         schema.  This allows applications to easily pin to the specific data version they depend
         on.

       • Putting data files into a package ensures reproducibility by allowing a  version  number
         to be associated with files on disk.  This can consume quite a bit of disk space for the
         user, however. Large data files are not generally included in pip or conda packages  so,
         if possible, you should reference the data assets in an external place where they can be
         loaded.

   Packaging
       • Packages that refer to remote data sources (such as databases and  REST  APIs)  need  to
         think  about  authentication.   Do  not include authentication credentials inside a data
         package.  They should be obtained from the environment.

       • Data packages should depend on the Intake plugins required to read the data,  or  Intake
         itself.

       • You may well want to break any driver code out into a separate package so that it
         can be updated independent of the data. The data package would then depend on the driver
         package.
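
       As an illustration of obtaining credentials from the environment, driver or catalog
       code might do something like the following. This is a minimal sketch: the function
       name and the MYDATA_USER and MYDATA_PASSWORD variable names are hypothetical and would
       be chosen by the package author.

```python
import os


def credentials_from_env(prefix="MYDATA"):
    """Read a username and password from environment variables, never from the package."""
    user = os.environ.get(f"{prefix}_USER")
    password = os.environ.get(f"{prefix}_PASSWORD")
    if user is None or password is None:
        # Fail early with a clear message rather than connecting anonymously.
        raise RuntimeError(
            f"Set {prefix}_USER and {prefix}_PASSWORD before loading this data source"
        )
    return {"username": user, "password": password}
```

       The returned dictionary can then be passed to whatever client the data source uses,
       so that no secret is ever written into the distributed catalog.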

   Nested catalogs
       As  noted above, entries will appear in the users' builtin catalog as intake.cat.*. In the
       case that the catalog has multiple entries, it may be desirable to put the entries below a
       namespace  as  intake.cat.data_package.*.  This  can  be  achieved  by  having one catalog
       containing the (several) data sources, with only a single top-level entry pointing to  it.
       This  catalog  could be defined in a YAML file, created using any other catalog driver, or
       constructed in the code, e.g.:

          from intake.catalog import Catalog
          from intake.catalog.local import LocalCatalogEntry as Entry
          # descr is a description string; my_input_list holds (name, url) pairs
          cat = Catalog()
          cat._entries = {name: Entry(name, descr, driver='package.module.driver',
                                        args={"urlpath": url})
                                    for name, url in my_input_list}

       If your package contains many sources of different types, you may even nest the  catalogs,
       i.e., have a top-level whose contents are also catalogs.

          e = Entry('first_cat', 'sample', driver='catalog')
          e._default_source = cat
          top_level = Catalog()
          top_level._entries = {'first_cat': e, ...}

       where  your  entry  point  might look something like: "my_cat = my_package:top_level". You
       could achieve the same with multiple YAML files.

ROADMAP

       Some high-level work that we expect to be achieved on the time-scale of months. This  list
       is  not  exhaustive,  but  rather  aims to whet the appetite for what Intake can be in the
       future.

       Since Intake aims to be a community of data-oriented pythoneers, nothing written  here  is
       laid in stone, and users and devs are encouraged to make their opinions known!

   Broaden the coverage of formats
       Data-type  drivers  are  easy  to  write,  but  still  require  some effort, and therefore
       reasonable impetus to get the work done. Conversations over the  coming  months  can  help
       determine  the  drivers that should be created by the Intake team, and those that might be
       contributed by the community.

       The next type that we would specifically  like  to  consider  is  machine  learning  model
       artifacts.   EDIT see https://github.com/AlbertDeFusco/intake-sklearn , and hopefully more
       to come.

   Streaming Source
       Many data sources are inherently time-sensitive and event-wise. These are not covered well
       by  existing  Python  tools, but the streamz library may present a nice way to model them.
       From the Intake point of view, the task would be to develop a streaming type, and at least
       one data driver that uses it.

       The most obvious place to start would be to read a file: every time a new line appears in the
       file, an event is emitted. This is appropriate, for instance, for watching the  log  files
       of a web-server, and indeed could be extended to read from an arbitrary socket.

       EDIT see: https://github.com/intake/intake-streamz
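
       The file-watching idea can be sketched with a plain generator. This is not how
       intake-streamz or streamz work; it is a standard library illustration in which follow,
       its polling interval and the max_idle_polls stop condition are all assumptions chosen
       to keep the example short and finite.

```python
import os
import tempfile
import time


def follow(handle, poll_interval=0.5, max_idle_polls=None):
    """Yield each complete line appended to an open text file, as one event per line."""
    idle = 0
    buffer = ""
    while max_idle_polls is None or idle < max_idle_polls:
        chunk = handle.readline()
        if chunk:
            buffer += chunk
            if buffer.endswith("\n"):
                # A full line has arrived: emit it as an event.
                yield buffer.rstrip("\n")
                buffer = ""
            idle = 0
        else:
            # Nothing new yet: wait and try again.
            idle += 1
            time.sleep(poll_interval)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "access.log")
    with open(log_path, "w") as f:
        f.write("GET /\nGET /about\n")
    with open(log_path) as f:
        events = list(follow(f, poll_interval=0, max_idle_polls=1))

print(events)  # ['GET /', 'GET /about']
```

       Leaving max_idle_polls as None makes the generator run indefinitely, which is the
       behavior wanted when watching a live log file.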

   Server publish hooks
       To  add  API  endpoints to the server, so that a user (with sufficient privilege) can post
       data specifications to a  running  server,  optionally  saving  the  specs  to  a  catalog
       server-side.  Furthermore, we will consider the possibility of being able to upload and/or
       transform data (rather than refer to it in a third-party location), so that you would have
       a one-line "publish" ability from the client.

       The  server,  in  general,  could  do  with  a lot of work to become more than the current
       demonstration/prototype. In particular, it should be able to be performant  and  scalable,
       meaning that the server implementation ought to keep as little local state as possible.

   Simplify dependencies and class hierarchy
       We would like to make it easier to write Intake drivers which don't need any persist or
       GUI functionality, and to be able to install Intake core functionality  (driver  registry,
       data loading and catalog traversal) without needing many other packages at all.

       EDIT this has been partly done, you can derive from DataSourceBase and not have to use the
       full set of Intake's features for simplicity. We have also gone some distance to  separate
       out  dependencies  for  parts  of the package, so that you can install Intake and only use
       some of the subpackages/modules - imports don't happen until those parts of the  code  are
       used.  We  have  not  yet  split  the intake conda package into, for example, intake-base,
       intake-server, intake-gui...

   Reader API
       For those that wish to provide Intake's data source API, and make data  sources  available
       to  Intake  cataloguing, but don't wish to take Intake as a direct dependency.  The actual
       API of DataSources is rather simple:

       • __init__: collect arguments, minimal IO at this point

       • discover(): get metadata from the source, by querying the files/service itself

       • read(): return in-memory version of the data

       • to_*: return reference objects for the given compute engine, typically Dask

       • read_partition(...): read part of the data into memory, where the argument  makes  sense
         for the given type of data

       • configure_new(): create new instance with different arguments

       • yaml(): representation appropriate for inclusion in a YAML catalogue

       • close(): release any resources

       Of these, only the first three are really necessary for a minimal interface, so Intake
       might do well to publish this protocol specification, so that new drivers can  be  written
       that can be used by Intake but do not need Intake, and so help adoption.
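
       To show how small such a minimal interface is, here is a sketch of a class offering
       only __init__, discover() and read(), with no dependency on Intake. The LinesSource
       class and the keys of the dictionary returned by discover() are hypothetical choices
       for this example, not a published protocol.

```python
import os
import tempfile


class LinesSource:
    """Hypothetical source exposing the three core methods without importing Intake."""

    def __init__(self, path, encoding="utf-8"):
        # Only collect arguments here; no file access yet.
        self.path = path
        self.encoding = encoding

    def discover(self):
        # Query the file itself for basic metadata.
        with open(self.path, encoding=self.encoding) as f:
            n_lines = sum(1 for _ in f)
        return {"path": self.path, "n_lines": n_lines}

    def read(self):
        # Return the whole data-set in memory.
        with open(self.path, encoding=self.encoding) as f:
            return f.read().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\nsecond\n")
    source = LinesSource(path)
    info = source.discover()
    lines = source.read()

print(info["n_lines"], lines)  # 2 ['first', 'second']
```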

GLOSSARY

       Argument
              One  of  a  set  of values passed to a function or class. In the Intake sense, this
              usually is the set of key-value pairs defined in the "args"  section  of  a  source
              definition;  unless  the  user  overrides, these will be used for instantiating the
              source.

       Cache  Local  copies  of  remote  files.  Intake  allows  for  download-on-first-use   for
              data-sources,  so that subsequent access is much faster, see caching. The format of
              the files is unchanged in this case, but may be decompressed.

       Catalog
              An inventory of entries, each of which corresponds to a specific  Data-set.  Within
              these  docs, a catalog is most commonly defined in a YAML file, for simplicity, but
              there are other possibilities, such as connecting to an Intake  server  or  another
              third-party data service, like a SQL database. Thus, catalogs form a hierarchy: any
              catalog can contain other, nested catalogs.

       Catalog file
              A YAML specification file which contains a list of named entries describing how  to
              load data sources. See Catalogs.

       Conda  A  package  and  environment  management  package for the python ecosystem, see the
              conda website. Conda ensures dependencies and correct versions  are  installed  for
              you,   provides  precompiled,  binary-compatible  software,  and  extends  to  many
              languages beyond python, such as R, javascript and C.

       Conda package
              A single installable item which the Conda application can install.  A  package  may
              include  a Catalog, data-files and maybe some additional code. It will also include
              a specification of  the  dependencies  that  it  requires  (e.g.,  Intake  and  any
              additional  Driver), so that Conda can install those automatically. Packages can be
              created locally, or can be found on anaconda.org or other package repositories.

       Container
              One of the supported data formats. Each Driver outputs its data in  one  of  these.
              The  containers  correspond  to  familiar data structures for end-analysis, such as
              list-of-dicts, Numpy nd-array or Pandas data-frame.

       Data-set
              A specific assemblage of data. The type  of  data  (tabular,  multi-dimensional  or
              something else) and the format (file type, data service type) are all attributes of
              the data-set. In addition, in the context of Intake, data-sets are usually  entries
              within  a Catalog with additional descriptive text and metadata and a specification
              of how to load the data.

       Data Source
              An Intake specification for a specific Data-set. In most cases, the two  terms  are
              synonymous.

       Data User
              A  person  who  uses  data to produce models and other inferences/conclusions. This
              person generally uses standard python analysis packages like Numpy, Pandas, SKLearn
              and  may produce graphical output. They will want to be able to find the right data
              for a given job, and for the data to be available in a standard format  as  quickly
              and  easily  as  possible.  In many organisations, the appropriate job title may be
              Data Scientist, but research scientists and BI/analysts also fit this description.

       Data packages
              Data packages are standard conda packages that install an Intake catalog file  into
              the  user’s conda environment ($CONDA_PREFIX/share/intake). A data package does not
              necessarily imply there are data files inside the package.  A  data  package  could
              describe remote data sources (such as files in S3) and take up very little space on
              disk.

       Data Provider
              A person whose main objective is to curate data sources, get them into  appropriate
              formats,  describe the contents, and disseminate the data to those that need to use
              them. Such a person may care about the specifics of the storage format and  backing
              store,  the  right  number of fields to keep and removing bad data. They may have a
              good idea of the best way to visualise any given data-set. In an organisation, this
              job  may  be  known as Data Engineer, but it could as easily be done by a member of
              the IT team. These people are the most likely to author Catalogs.

       Developer
              A person who writes or fixes code. In the context of Intake, a developer  may  make
              new  format  Drivers,  create authentication systems or add functionality to Intake
              itself. They can take existing code for loading data in  other  projects,  and  use
              Intake to add extra functionality to it, for instance, remote data access, parallel
              processing, or file-name parsing.

       Driver The thing that does the work of reading the data for a catalog entry is known as  a
              driver,  often  referred  to using a simple name such as "csv". Intake has a plugin
              architecture,  and  new  drivers  can  be  created  or  installed,   and   specific
              catalogs/data-sets may require particular drivers for their contained data-sets. If
              installed  as  Conda  packages,  then  these  requirements  will  be  automatically
              installed for you. The driver's output will be a Container, and often the code is a
              simpler layer over existing functionality in a third-party package.

       GUI    A Graphical User Interface. Intake comes with  a  GUI  for  finding  and  selecting
              data-sets, see GUI.

       IT     The  Information  Technology team for an organisation. Such a team may have control
              of the computing infrastructure  and  security  (sys-ops),  and  may  well  act  as
              gate-keepers  when  exposing  data  for  use  by other colleagues. Commonly, IT has
              stronger policy enforcement requirements than other groups, for instance requiring
              all data-set copy actions to be logged centrally.

       Persist
              A  process of making a local version of a data-source. One canonical format is used
              for each of the container types, optimised for quick and parallel access.  This  is
              particularly useful if the data takes a long time to acquire, perhaps because it is
              the result of a complex query on a remote service. The resultant output can be  set
              to  expire  and be automatically refreshed, see Persisting Data. Not to be confused
              with the cache.

       Plugin Modular extra functionality for Intake, provided by a  package  that  is  installed
              separately.  The  most  common  type  of  plugin  will be for a Driver to load some
              particular  data  format;  but  other  parts  of  Intake  are  pluggable,  such  as
              authentication mechanisms for the server.

       Server A  remote  source  for  Intake  catalogs.  The  server  will  provide  data  source
              specifications (i.e., a remote Catalog), and may also  provide  the  raw  data,  in
              situations  where  the  client is not able or not allowed to access it directly. As
              such, the server can act as a gatekeeper of the data for  security  and  monitoring
              purposes.  The  implementation  of  the  server  in  Intake  is  accessible  as the
              intake-server command, and acts as a reference: other implementations can easily be
              created for specific circumstances.

       TTL    Time-to-live,  how  long  before  the  given  entity is considered to have expired.
              Usually in seconds.

       User Parameter
              A data source definition can contain a  "parameters"  section,  which  can  act  as
              explicit decision indicators for the user, or as validation and type coercion for
              the definition's Arguments. See Parameter Definition.

       YAML   A text-based format for expressing data with  a  dictionary  (key-value)  and  list
              structure,  with  a limited number of data-types, such as strings and numbers. YAML
              uses indentations to nest objects, making it easy to read  and  write  for  humans,
              compared to JSON. Intake's catalogs and config are usually expressed in YAML files.

COMMUNITY

       Intake  is  used  and  developed  by individuals at a variety of institutions.  It is open
       source (license) and sits within the broader Python numeric ecosystem commonly referred to
       as PyData or SciPy.

   Discussion
       Conversation happens in the following places:

       1. Usage questions are directed to Stack Overflow with the #intake tag.  Intake developers
          monitor this tag.

       2. Bug reports and feature requests are managed on the GitHub  issue  tracker.  Individual
          intake  plugins  are  managed in separate repositories each with its own issue tracker.
          Please consult the Plugin Directory for a list of available plugins.

       3. Chat occurs at gitter.im/ContinuumIO/intake.  Note that because gitter chat is not
          searchable  by future users we discourage usage questions and bug reports on gitter and
          instead ask people to use Stack Overflow or GitHub.

       4. Monthly community meeting happens the first Thursday of the month at  9:00  US  Central
          Time.  See https://github.com/intake/intake/issues/596, with a reminder sent out on the
          gitter channel. Strictly informal chatter.

   Asking for help
       We welcome usage questions and bug reports from all users, even those who are new to using
       the  project.   There  are  a  few  things you can do to improve the likelihood of quickly
       getting a good answer.

       1. Ask questions in the right place:  We strongly prefer the  use  of  Stack  Overflow  or
          GitHub issues over Gitter chat.  GitHub and Stack Overflow are more easily searchable
          by future users, and are therefore a better use of everyone's time.  Gitter chat is
          strictly reserved for developer and community discussion.

          If  you  have a general question about how something should work or want best practices
          then use Stack Overflow.  If you think you have found a bug then use GitHub.

       2. Ask only in one place: Please restrict yourself to posting your question  in  only  one
          place (likely Stack Overflow or GitHub) and don't post in both.

       3. Create  a  minimal  example:   It  is  ideal  to  create  minimal, complete, verifiable
          examples.  This significantly reduces the time that answerers spend understanding  your
          situation, resulting in higher quality answers more quickly.


AUTHOR

       Anaconda

       2022, Anaconda