Provided by: python3-mpi4py-fft-doc_2.0.3-3build2_all

NAME

       mpi4py-fft - mpi4py-fft Documentation

MPI4PY-FFT

       mpi4py-fft  is  a  Python  package  for  computing  Fast  Fourier  Transforms  (FFTs).   Large arrays are
       distributed and communications are handled under the hood by MPI for Python (mpi4py). To distribute large
       arrays  we  are  using  a  new  and  completely  generic  algorithm  that  allows  for any index set of a
       multidimensional array to be distributed. We can distribute just one index (a  slab  decomposition),  two
       index sets (pencil decomposition) or even more for higher-dimensional arrays.

        mpi4py-fft also includes a Python interface to the FFTW library. This interface can be used without
        MPI, much like pyfftw, and it also covers real-to-real transforms, such as discrete cosine or sine
        transforms.

   Introduction
        The Python package mpi4py-fft is a tool primarily for working with Fast Fourier Transforms (FFTs) of
        (large) multidimensional arrays. There is really no limit to how large the arrays can be, as long as
        there is sufficient computing power available. There are also no limits to how the transforms can be
        configured: just about any combination of transforms from the FFTW library is supported. Finally,
        mpi4py-fft can also be used simply to distribute and redistribute large multidimensional arrays with
        MPI, without any transforms at all.

       The main contribution of mpi4py-fft can be found in just a few classes in the main modules:

           • mpifft

           • pencil

           • distarray

           • libfft

           • fftw

        The mpifft.PFFT class is the major entry point for most users. It is a highly configurable class, which
        under the hood distributes large data arrays and performs any type of transform, along any axes of a
        multidimensional array.

       The pencil module is responsible for global redistributions through MPI.  However, this module is  rarely
       used  on its own, unless one simply needs to do global redistributions without any transforms at all. The
       pencil module is used heavily by the PFFT class.

       The distarray module contains classes for simply distributing multidimensional arrays, with no regards to
       transforms.  The  distributed  arrays  created  from  the  classes  here can very well be used in any MPI
       application that requires a large multidimensional distributed array.

       The libfft module provides a common interface to any of the serial transforms in the FFTW  library.

        The fftw module contains wrappers to the transforms provided by the FFTW library. We provide our own
        wrappers mainly because pyfftw does not include support for real-to-real transforms. Through the fftw
        interface we can do, in Python, pretty much everything that can be done in the original FFTW library.

   Global Redistributions
       In  high  performance  computing large multidimensional arrays are often distributed and shared amongst a
       large number of different processors.  Consider a  large  three-dimensional  array  of  double  (64  bit)
       precision  and global shape (512, 1024, 2048). To lift this array into RAM requires 8 GB of memory, which
        may be too large for a single, non-distributed machine. If, however, you have access to a distributed
        architecture, you can split the array up and share it between, e.g., four CPUs (most supercomputers have
        either 2 or 4 GB of memory per CPU), each of which then only needs to hold 2 GB of the global array.
        Moreover, many algorithms with varying degrees of locality can take advantage of the distributed nature
        of the array to compute local array pieces concurrently, effectively exploiting multiple processor
        resources.

        There are several ways of distributing a large multidimensional array. Two such distributions for our
        three-dimensional global array (using 4 processors) are shown below.

          [figure: slab distribution]   [figure: pencil distribution]

       Here each color represents one of the processors. We note that in the first image only one of  the  three
       axes  is distributed, whereas in the second two axes are distributed. The first configuration corresponds
       to a slab, whereas the second corresponds to a pencil distribution. With  either  distribution  only  one
       quarter  of  the  large, global array needs to be kept in rapid (RAM) memory for each processor, which is
       great. However, some operations may then require data that is not available locally in its quarter of the
       total  array.  If  that  is  so,  the  processors  will  need to communicate with each other and send the
       necessary data where it is needed. There are many such MPI routines designed for  sending  and  receiving
       data.

        We are generally interested in algorithms, like the FFT, that work on the global array along one axis at
        a time. To be able to execute such algorithms, we need to make sure that the local arrays have access
        to all of their data along this axis. For the figures above, the slab distribution gives each processor
        data that is fully available along two axes, whereas the pencil distribution only has data fully
        available along one axis. Rearranging data such that it becomes aligned in a different direction is
        usually termed a global redistribution, or a global transpose operation. Note that with mpi4py-fft we
        always require that at least one axis of a multidimensional array remains aligned (non-distributed).

       Distribution and global redistribution is in mpi4py-fft handled by three classes in the pencil module:

           • Pencil

           • Subcomm

           • Transfer

       These  classes  are  the  low-level backbone of the higher-level PFFT and DistArray classes. To use these
       low-level classes directly is not recommended and usually not necessary. However, for clarity we start by
       describing how these low-level classes work together.

        Let's first consider a 2D data array of global shape (8, 8) that will be distributed along axis 0. With a
        high-level API we could then simply do:

          import numpy as np
          from mpi4py_fft import DistArray
          N = (8, 8)
          a = DistArray(N, [0, 1])

       where the [0, 1] list decides that the first axis can be distributed, whereas the second  axis  is  using
       one  processor  only  and  as  such is aligned (non-distributed). We may now inspect the low-level Pencil
       class associated with a:

          p0 = a.pencil

        The p0 Pencil object contains information about the distribution of a 2D data array of global shape (8,
        8). The distributed array a has been created using the information that is in p0, and p0 is used by a to
        look up information about the global array, for example:

          >>> a.alignment
          1
          >>> a.global_shape
          (8, 8)
          >>> a.subcomm
          (<mpi4py.MPI.Cartcomm at 0x10cc14a68>, <mpi4py.MPI.Cartcomm at 0x10e028690>)
          >>> a.commsizes
          [1, 1]

       Naturally, the sizes of the communicators will depend on  the  number  of  processors  used  to  run  the
       program. If we used 4, then a.commsizes would return [1, 4].

       We note that a low-level approach to creating such a distributed array would be:

          import numpy as np
          from mpi4py_fft import Pencil, Subcomm
          from mpi4py import MPI
          comm = MPI.COMM_WORLD
          N = (8, 8)
          subcomm = Subcomm(comm, [0, 1])
          p0 = Pencil(subcomm, N, axis=1)
          a0 = np.zeros(p0.subshape)

       Note  that  this  last  array a0 would be equivalent to a, but it would be a pure Numpy array (created on
       each processor) and it would not contain any of the information about the global array that it is part of
       (global_shape,  pencil, subcomm, etc.). It contains the same amount of data as a though and a0 is as such
       a perfectly fine distributed array. Used together with p0 it contains exactly the same information as a.

       Since at least one axis needs to be aligned (non-distributed), a 2D array can only  be  distributed  with
       one  processor group. If we wanted to distribute the second axis instead of the first, then we would have
       done:

          a = DistArray(N, [1, 0])

       With the low-level approach we would have had to use axis=0 in the creation of p0, as well as [1,  0]  in
       the creation of subcomm.  Another way to get the second pencil, that is aligned with axis 0, is to create
       it from p0:

          p1 = p0.pencil(0)

       Now the p1 object will represent a (8, 8) global array distributed in the second axis.
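        For illustration, a minimal low-level sketch (run with, e.g., 4 processes) that creates both pencils and
        prints the local shapes they prescribe on each rank:

           import numpy as np
           from mpi4py import MPI
           from mpi4py_fft import Pencil, Subcomm
           comm = MPI.COMM_WORLD
           N = (8, 8)
           subcomm = Subcomm(comm, [0, 1])
           p0 = Pencil(subcomm, N, axis=1)  # aligned in axis 1, distributed in axis 0
           p1 = p0.pencil(0)                # aligned in axis 0, distributed in axis 1
           print(comm.Get_rank(), p0.subshape, p1.subshape)  # with 4 ranks: e.g. 0 (2, 8) (8, 2)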

        Let's create a complete script (pencils.py) that fills the array a with the value of each processor's
        rank (note that it would also work to follow the low-level approach and use a0):

          import numpy as np
          from mpi4py_fft import DistArray
          from mpi4py import MPI
          comm = MPI.COMM_WORLD
          N = (8, 8)
          a = DistArray(N, [0, 1])
          a[:] = comm.Get_rank()
          print(a.shape)

       We can run it with:

          mpirun -np 4 python pencils.py

       and obtain the printed results from the last line (print(a.shape)):

          (2, 8)
          (2, 8)
          (2, 8)
          (2, 8)

       The shape of the local a arrays is (2, 8) on all 4 processors. Now assume that we need these data aligned
       in the x-direction (axis=0) instead. For this to happen we need to perform a global  redistribution.  The
       easiest approach is then to execute the following:

          b = a.redistribute(0)
          print(b.shape)

       which would print the following:

          (8, 2)
          (8, 2)
          (8, 2)
          (8, 2)

        Under the hood the global redistribution is executed with the help of the Transfer class, which is
        designed to transfer data between any two sets of pencils, like those represented by p0 and p1. With the
        low-level API, a transfer object may be created using the pencils and the datatype of the array that is
        to be sent:

           transfer = p0.transfer(p1, np.float64)

       Executing the global redistribution is then simply a matter of:

          a1 = np.zeros(p1.subshape)
          transfer.forward(a, a1)

        Now it is important to realise that the global array does not change. The local a1 arrays will now
        contain the same data as a, only aligned differently. However, the exchange is not performed in-place;
        the new array is a copy of the original, aligned differently. The two figures below illustrate this:

          [figure: the original 4 pencils (p0) of shape (2, 8), aligned in the y-direction. Color represents rank.]

          [figure: the 4 pencils (p1) of shape (8, 2), aligned in the x-direction after receiving data from p0.
          The data is the same as in the first figure, only aligned differently.]

        Mathematically, we will denote the entries of a two-dimensional global array as u_{j_0, j_1}, where
        j_0\in \textbf{j}_0=[0, 1, \ldots, N_0-1] and j_1\in \textbf{j}_1=[0, 1, \ldots, N_1-1]. The shape of
        the array is then (N_0, N_1). A global array u_{j_0, j_1} distributed in the first axis (as in the first
        figure above) by processor group P, containing |P| processors, is denoted as

                                                      u_{j_0/P, j_1}

        The global redistribution, from alignment in axis 1 to alignment in axis 0, as illustrated by the two
        figures above, is denoted as

                               u_{j_0, j_1/P} \xleftarrow[P]{1\rightarrow 0} u_{j_0/P, j_1}

        This operation corresponds exactly to the forward transfer defined above:

          transfer.forward(a0, a1)

       If we need to go the other way

                             u_{j_0/P, j_1} \xleftarrow[P]{0\rightarrow 1} u_{j_0, j_1/P}

       this corresponds to:

          transfer.backward(a1, a0)

        Note that the directions (forward/backward) here depend on how the transfer object is created. Under the
        hood all transfers execute calls to MPI.Alltoallw.
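
        For clarity, a small round-trip sketch using the low-level objects defined above: redistribute a0 to a1
        and back again, and verify that the data is unchanged (run with, e.g., 4 processes):

           import numpy as np
           from mpi4py import MPI
           from mpi4py_fft import Pencil, Subcomm
           comm = MPI.COMM_WORLD
           N = (8, 8)
           subcomm = Subcomm(comm, [0, 1])
           p0 = Pencil(subcomm, N, axis=1)
           p1 = p0.pencil(0)
           transfer = p0.transfer(p1, np.float64)
           a0 = np.full(p0.subshape, comm.Get_rank(), dtype=np.float64)
           a1 = np.zeros(p1.subshape)
           transfer.forward(a0, a1)   # align in axis 0
           b0 = np.zeros(p0.subshape)
           transfer.backward(a1, b0)  # back to alignment in axis 1
           assert np.allclose(a0, b0)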

   Multidimensional distributed arrays
        The procedure discussed above remains the same for any type of array, of any dimensionality. With
        mpi4py-fft we can distribute any array of arbitrary dimensionality using any number of processor groups.
        We only require that the number of processor groups is at most one less than the number of dimensions,
        since one axis must remain aligned. Apart from this the distribution is completely configurable through
        the classes in the pencil module.

       We denote a global d-dimensional array as u_{j_0, j_1, \ldots,  j_{d-1}},  where  j_m\in\textbf{j}_m  for
       m=[0, 1, \ldots, d-1].  A d-dimensional array distributed with only one processor group in the first axis
       is denoted as u_{j_0/P, j_1, \ldots, j_{d-1}}. If using more than one processor  group,  the  groups  are
       indexed, like P_0, P_1 etc.

        Let's illustrate using a 4-dimensional array with 3 processor groups. Let the array first be aligned only
        in axis 3 (u_{j_0/P_0, j_1/P_1, j_2/P_2, j_3}), and then redistribute for alignment along axes 2, 1
        and finally 0. Mathematically, we will now be executing the three following global redistributions:

            u_{j_0/P_0, j_1/P_1, j_2, j_3/P_2} \xleftarrow[P_2]{3 \rightarrow 2} u_{j_0/P_0, j_1/P_1, j_2/P_2, j_3}
            u_{j_0/P_0, j_1, j_2/P_1, j_3/P_2} \xleftarrow[P_1]{2 \rightarrow 1} u_{j_0/P_0, j_1/P_1, j_2, j_3/P_2}
            u_{j_0, j_1/P_0, j_2/P_1, j_3/P_2} \xleftarrow[P_0]{1 \rightarrow 0} u_{j_0/P_0, j_1, j_2/P_1, j_3/P_2}

       Note  that in the first step it is only processor group P_2 that is active in the redistribution, and the
       output (left hand side) is now aligned in axis 2. This can be seen since  there  is  no  processor  group
       there to share the j_2 index.  In the second step processor group P_1 is the active one, and in the final
       step P_0.

        Now, it is not necessary to use three processor groups just because we have a four-dimensional array. We
        could just as well have used 2 or 1. The advantage of using more groups is that you can then use more
        processors in total. Assuming N = N_0 = N_1 = N_2 = N_3, you can use a maximum of N^p processors, where p
        is the number of processor groups. So for an array of shape (8, 8, 8, 8) it is possible to use 8, 64 and
        512 processors for 1, 2 and 3 processor groups, respectively. On the other hand, if you can get away with
        it, or if you do not have access to a great number of processors, then fewer groups are usually found to
        be faster for the same total number of processors.

       We can implement the global redistribution using the high-level DistArray class:

          N = (8, 8, 8, 8)
          a3 = DistArray(N, [0, 0, 0, 1])
          a2 = a3.redistribute(2)
          a1 = a2.redistribute(1)
          a0 = a1.redistribute(0)

        Note that the three redistribution steps correspond exactly to the three global redistributions shown
        above.

       Using a low-level API the same can be achieved with a little more elaborate coding. We start by  creating
       pencils for the 4 different alignments:

          subcomm = Subcomm(comm, [0, 0, 0, 1])
          p3 = Pencil(subcomm, N, axis=3)
          p2 = p3.pencil(2)
          p1 = p2.pencil(1)
          p0 = p1.pencil(0)

        Here we have defined 4 different pencil groups, p0, p1, p2, p3, aligned in axes 0, 1, 2 and 3,
        respectively. Transfer objects for arrays of type np.float64 are then created as:

           transfer32 = p3.transfer(p2, np.float64)
           transfer21 = p2.transfer(p1, np.float64)
           transfer10 = p1.transfer(p0, np.float64)

       Note that we can create transfer objects between any two pencils, not just neighbouring axes. We may  now
       perform three different global redistributions as:

          a0 = np.zeros(p0.subshape)
          a1 = np.zeros(p1.subshape)
          a2 = np.zeros(p2.subshape)
          a3 = np.zeros(p3.subshape)
           a3[:] = np.random.random(a3.shape)
          transfer32.forward(a3, a2)
          transfer21.forward(a2, a1)
          transfer10.forward(a1, a0)

       Storing this code under pencils4d.py, we can use 8 processors that will give us 3 processor groups with 2
       processors in each group:

          mpirun -np 8 python pencils4d.py

        Note that with the low-level approach we can now easily go back using the backward method of the
        Transfer objects:

          transfer10.backward(a0, a1)

       A different approach is also possible with the high-level API:

          a0.redistribute(out=a1)
          a1.redistribute(out=a2)
          a2.redistribute(out=a3)

        which corresponds to the backward transfers. However, with the high-level API the transfer objects are
        created (and deleted on exit) during the call to redistribute, so this latter approach may be slightly
        less efficient.

   Discrete Fourier Transforms
       Consider  first  two one-dimensional arrays \boldsymbol{u} = \{u_j\}_{j=0}^{N-1} and \boldsymbol{\hat{u}}
       =\{\hat{u}_k\}_{k=0}^{N-1}. We define  the  forward  and  backward  Discrete  Fourier  transforms  (DFT),
       respectively, as

            \hat{u}_k = \frac{1}{N}\sum_{j=0}^{N-1} u_j e^{-2\pi i j k / N}, \quad \forall \, k \in \textbf{k} = 0, 1, \ldots, N-1,

            u_j = \sum_{k=0}^{N-1} \hat{u}_k e^{2\pi i j k / N}, \quad \forall \, j \in \textbf{j} = 0, 1, \ldots, N-1,

       where i=\sqrt{-1}. Discrete Fourier transforms are computed  efficiently  using  algorithms  termed  Fast
       Fourier Transforms, known in short as FFTs.

       NOTE:
           The index set for wavenumbers \textbf{k} is usually not chosen as [0, 1, \ldots, N-1], but rather
           \textbf{k}=[-N/2, -N/2+1, \ldots, N/2-1] for even N and \textbf{k}=[-(N-1)/2, -(N-1)/2+1, \ldots,
           (N-1)/2] for odd N. See numpy.fft.fftfreq. Also note that it is possible to tweak the default
           normalization used above when calling either forward or backward transforms.
          normalization used above when calling either forward or backward transforms.
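
        For reference, a quick way to see the wavenumber ordering Numpy associates with each DFT coefficient:

           import numpy as np
           N = 8
           k = np.fft.fftfreq(N, d=1.0/N)
           print(k)  # [ 0.  1.  2.  3. -4. -3. -2. -1.]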

       A more compact notation is commonly used for the DFTs, where the 1D forward and backward  transforms  are
       written as

            \boldsymbol{\hat{u}} = \mathcal{F}(\boldsymbol{u}),

            \boldsymbol{u} = \mathcal{F}^{-1}(\boldsymbol{\hat{u}}).

        Numpy, Scipy, and many other scientific software packages contain implementations that make working with
        Fourier series simple and straightforward. These 1D Fourier transforms can be implemented easily with
        just Numpy as, e.g.:

          import numpy as np
          N = 16
          u = np.random.random(N)
          u_hat = np.fft.fft(u)
          uc = np.fft.ifft(u_hat)
          assert np.allclose(u, uc)

        However, there is a minor difference. Numpy performs by default the 1/N scaling with the backward
        transform (ifft) and not with the forward transform, as defined above. These are merely different
        conventions and not important as long as one is aware of them. We use the scaling on the forward
        transform simply because this follows naturally when using the harmonic functions e^{i k x} as basis
        functions for solving PDEs with the spectral Galerkin or spectral collocation methods.
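
        The two conventions differ only by where the 1/N factor is applied; a quick sanity check with Numpy:

           import numpy as np
           N = 16
           u = np.random.random(N)
           u_hat = np.fft.fft(u) / N                          # scaling on the forward transform, as defined above
           assert np.allclose(N * np.fft.ifft(u_hat), u)
           assert np.allclose(np.fft.ifft(np.fft.fft(u)), u)  # Numpy's convention: 1/N applied in ifft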

        With mpi4py-fft the same operations take just a few more steps, because instead of executing the FFTs
        directly, as in the calls to np.fft.fft and np.fft.ifft, we first need to create the objects that will
        perform the transforms. That is, we need to plan the transforms:

          from mpi4py_fft import fftw
           u = fftw.aligned(N, dtype=np.complex128)
          u_hat = fftw.aligned_like(u)
          fft = fftw.fftn(u, flags=(fftw.FFTW_MEASURE,))        # plan fft
          ifft = fftw.ifftn(u_hat, flags=(fftw.FFTW_ESTIMATE,)) # plan ifft
          u[:] = np.random.random(N)
          # Now execute the transforms
          u_hat = fft(u, u_hat, normalize=True)
          uc = ifft(u_hat)
          assert np.allclose(uc, u)

       The  planning of transforms makes an effort to find the fastest possible transform of the given kind. See
       more in The fftw module.

   Multidimensional transforms
        It is for multidimensional arrays that the current software really becomes useful. Multidimensional
        arrays are a bit tedious with notation, though, especially when the number of dimensions grows. We will
        stick with the index notation because it maps most directly onto the implementation.

       We denote the entries of a two-dimensional array as u_{j_0, j_1}, which corresponds to a row-major matrix
       \boldsymbol{u}=\{u_{j_0, j_1}\}_{(j_0, j_1) \in \textbf{j}_0 \times \textbf{j}_1} of size  N_0\cdot  N_1.
       Denoting also \omega_m=j_m k_m / N_m, a two-dimensional forward and backward DFT can be defined as

            \hat{u}_{k_0, k_1} = \frac{1}{N_0}\sum_{j_0 \in \textbf{j}_0}\Big( e^{-2\pi i \omega_0} \frac{1}{N_1}
            \sum_{j_1 \in \textbf{j}_1} \Big( e^{-2\pi i \omega_1} u_{j_0, j_1}\Big) \Big), \quad \forall \, (k_0, k_1) \in \textbf{k}_0 \times \textbf{k}_1,

            u_{j_0, j_1} = \sum_{k_1 \in \textbf{k}_1} \Big( e^{2\pi i \omega_1} \sum_{k_0 \in \textbf{k}_0} \Big(
            e^{2\pi i \omega_0} \hat{u}_{k_0, k_1} \Big) \Big), \quad \forall \, (j_0, j_1) \in \textbf{j}_0 \times \textbf{j}_1.

        Note that the forward transform corresponds to taking the 1D Fourier transform first along axis 1, once
        for each of the indices in \textbf{j}_0. Afterwards the transform is executed along axis 0. The two
        steps are more easily understood if we break things up a little and write the forward transform above in
        two steps as

            \tilde{u}_{j_0, k_1} = \frac{1}{N_1}\sum_{j_1 \in \textbf{j}_1} u_{j_0, j_1} e^{-2\pi i \omega_1}, \quad \forall \, k_1 \in \textbf{k}_1,

            \hat{u}_{k_0, k_1} = \frac{1}{N_0}\sum_{j_0 \in \textbf{j}_0} \tilde{u}_{j_0, k_1} e^{-2\pi i \omega_0}, \quad \forall \, k_0 \in \textbf{k}_0.

        The backward (inverse) transform is performed in the opposite order, axis 0 first and then axis 1. The
        order is actually arbitrary, but this is how it is usually computed. With mpi4py-fft the order of the
        directional transforms can easily be configured.
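
        The two-step structure is easily verified with Numpy, transforming first along axis 1 and then along
        axis 0 and comparing with the full 2D transform (both unnormalized):

           import numpy as np
           u = np.random.random((8, 8)) + 1j*np.random.random((8, 8))
           u_tilde = np.fft.fft(u, axis=1)      # partial transform along axis 1
           u_hat = np.fft.fft(u_tilde, axis=0)  # partial transform along axis 0
           assert np.allclose(u_hat, np.fft.fft2(u))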

       We can write the complete transform on compact notation as

            \boldsymbol{\hat{u}} = \mathcal{F}(\boldsymbol{u}),

            \boldsymbol{u} = \mathcal{F}^{-1}(\boldsymbol{\hat{u}}).

       But if we denote the two partial transforms along each axis as \mathcal{F}_0 and  \mathcal{F}_1,  we  can
       also write it as

            \boldsymbol{\hat{u}} = \mathcal{F}_0(\mathcal{F}_1(\boldsymbol{u})),

            \boldsymbol{u} = \mathcal{F}_1^{-1}(\mathcal{F}_0^{-1}(\boldsymbol{\hat{u}})).

        Extension to multiple dimensions is straightforward. We denote a d-dimensional array as u_{j_0, j_1,
        \ldots, j_{d-1}} and a partial transform of u along axis i is denoted as

            \tilde{u}_{j_0, \ldots, k_i, \ldots, j_{d-1}} = \mathcal{F}_i(u_{j_0, \ldots, j_i, \ldots, j_{d-1}})

        The complete multidimensional transforms can still be written in the compact form above, and with
        partial transforms as

            \boldsymbol{\hat{u}} = \mathcal{F}_0(\mathcal{F}_1( \ldots \mathcal{F}_{d-1}(\boldsymbol{u}))),

            \boldsymbol{u} = \mathcal{F}_{d-1}^{-1}( \mathcal{F}_{d-2}^{-1}( \ldots \mathcal{F}_0^{-1}(\boldsymbol{\hat{u}}))).

        Multidimensional transforms are straightforward to implement in Numpy:

          import numpy as np
          M, N = 16, 16
          u = np.random.random((M, N))
          u_hat = np.fft.rfftn(u)
          uc = np.fft.irfftn(u_hat)
          assert np.allclose(u, uc)
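
        The same result can be obtained one axis at a time, mirroring the partial-transform notation above (a
        real-to-complex transform along the last axis, followed by a complex transform along the first):

           u_hat_steps = np.fft.fft(np.fft.rfft(u, axis=1), axis=0)
           assert np.allclose(u_hat, u_hat_steps)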

   The fftw module
       The fftw module provides an interface to most of the FFTW library. In the fftw.xfftn submodule there  are
       planner functions for:

          • fftn() - complex-to-complex forward Fast Fourier Transforms

          • ifftn() - complex-to-complex backward Fast Fourier Transforms

          • rfftn() - real-to-complex forward FFT

          • irfftn() - complex-to-real backward FFT

          • dctn() - real-to-real Discrete Cosine Transform (DCT)

          • idctn() - real-to-real inverse DCT

          • dstn() - real-to-real Discrete Sine Transform (DST)

          • idstn() - real-to-real inverse DST

          • hfftn() - complex-to-real forward FFT with Hermitian symmetry

          • ihfftn() - real-to-complex backward FFT with Hermitian symmetry

        All these transform functions return instances of one of the classes fftwf_xfftn.FFT, fftw_xfftn.FFT or
        fftwl_xfftn.FFT, depending on the requested precision being single, double or long double, respectively.
        Apart from the precision, the three classes are identical. All transforms are non-normalized by default.
        Note that all these functions are planners: they do not execute the transforms, they simply return an
        instance of a class that can do it (see the docstrings of each function for usage). For quick reference,
        the 2D transform shown for Numpy can be done using fftw as:

          from mpi4py_fft.fftw import rfftn as plan_rfftn, irfftn as plan_irfftn
          from mpi4py_fft.fftw import FFTW_ESTIMATE
          rfftn = plan_rfftn(u.copy(), flags=(FFTW_ESTIMATE,))
          irfftn = plan_irfftn(u_hat.copy(), flags=(FFTW_ESTIMATE,))
          u_hat = rfftn(uc, normalize=True)
          uu = irfftn(u_hat)
          assert np.allclose(uu, uc)

       Note that since all the functions in the above list are planners, an extra step is required in comparison
       with  Numpy.  Also note that we are using copies of the u and u_hat arrays in creating the plans. This is
       done because the provided arrays will be used under the hood as work arrays for the rfftn() and  irfftn()
       functions, and the work arrays may be destroyed upon creation.

        The real-to-real transforms are by FFTW defined as one of the following kinds (see the FFTW documentation
        for definitions and extended definitions):

          • FFTW_REDFT00

          • FFTW_REDFT01

          • FFTW_REDFT10

          • FFTW_REDFT11

          • FFTW_RODFT00

          • FFTW_RODFT01

          • FFTW_RODFT10

          • FFTW_RODFT11

       Different   real-to-real   cosine   and   sine   transforms   may  be  combined  into  one  object  using
       factory.get_planned_FFT() with a list of different transform  kinds.  However,  it  is  not  possible  to
       combine,  in  one  single  object, real-to-real transforms with real-to-complex. For such transforms more
       than one object is required.

   Parallel Fast Fourier Transforms
       Parallel FFTs are computed through a combination of global  redistributions  and  serial  transforms.  In
       mpi4py-fft  the  interface  to performing such parallel transforms is the mpifft.PFFT class. The class is
       highly configurable and best explained through a few examples.

   Slab decomposition
       With slab decompositions we use only one  group  of  processors  and  distribute  only  one  index  of  a
       multidimensional array at the time.

        Consider the complete transform of a three-dimensional array of random numbers of shape (128, 128, 128).
        We can plan the transform of such an array with the following code snippet:

          import numpy as np
          from mpi4py import MPI
          from mpi4py_fft import PFFT, newDistArray
          N = np.array([128, 128, 128], dtype=int)
           fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=float, grid=(-1,))

        Here the signature N, axes=(0, 1, 2), dtype=float, grid=(-1,) tells us that the created fft instance is
        planned so as to slab distribute (along the first axis) and transform any 3D array of shape N and type
        float. Furthermore, we plan to transform axis 2 first, and then 1 and 0, which is exactly the reverse
        order of axes=(0, 1, 2). Mathematically, the planned transform corresponds to

            \tilde{u}_{j_0/P, k_1, k_2} = \mathcal{F}_1( \mathcal{F}_{2}(u_{j_0/P, j_1, j_2})),

            \tilde{u}_{j_0, k_1/P, k_2} \xleftarrow[P]{1\rightarrow 0} \tilde{u}_{j_0/P, k_1, k_2},

            \hat{u}_{k_0, k_1/P, k_2} = \mathcal{F}_0(\tilde{u}_{j_0, k_1/P, k_2}).

        Note that axis 0 is distributed on the input array and axis 1 on the output array. In the first step
        above we compute the transforms along axes 2 and 1 (in that order), but we cannot compute the serial
        transform along axis 0 since the global array is distributed in that direction. We need to perform a
        global redistribution, the middle step, that realigns the global data such that it is aligned in axis 0.
        With data aligned in axis 0, we can perform the final transform \mathcal{F}_{0} and be done with it.

       Assume now that all the code in this section is stored to a file named pfft_example.py, and  add  to  the
       above code:

          u = newDistArray(fft, False)
          u[:] = np.random.random(u.shape).astype(u.dtype)
          u_hat = fft.forward(u, normalize=True) # Note that normalize=True is default and can be omitted
          uj = np.zeros_like(u)
          uj = fft.backward(u_hat, uj)
          assert np.allclose(uj, u)
          print(MPI.COMM_WORLD.Get_rank(), u.shape)

       Running  this  code  with two processors (mpirun -np 2 python pfft_example.py) should raise no exception,
       and the output should be:

          1 (64, 128, 128)
          0 (64, 128, 128)

        This shows that the first index has been shared between the two processors equally. The array u thus
        corresponds to u_{j_0/P,j_1,j_2}. Note that the newDistArray() function returns a DistArray object, which
        in turn is a subclassed Numpy ndarray. The newDistArray() function uses fft to determine the size and
        type of the created distributed array, i.e., (64, 128, 128) and float64 for both processors. The False
        argument indicates that the shape and type should be that of the input array, as opposed to the output
        array type (\hat{u}_{k_0,k_1/P,k_2}, which one gets with True).

        Note that because the input array is of real type, and not complex, the real-to-complex transform along
        the last axis returns only the first N/2+1 complex coefficients, and the output array will be of global
        shape:

           (128, 128, 65)

       The output array will be distributed in axis 1, so the output array shape should be (128, 64, 65) on each
       processor. We check this by adding the following code and rerunning:

          u_hat = newDistArray(fft, True)
          print(MPI.COMM_WORLD.Get_rank(), u_hat.shape)

       leading to an additional print of:

          1 (128, 64, 65)
          0 (128, 64, 65)

        Distributing the first axis first is the default, and it is the most efficient choice for row-major C
        arrays. However, we can easily configure the fft instance by modifying the axes keyword. Changing for
        example to:

           fft = PFFT(MPI.COMM_WORLD, N, axes=(2, 0, 1), dtype=float)

       and axis 1 will be transformed first, such that the global output array will be of shape (128, 65,  128).
       The distributed input and output arrays will now have shape:

          0 (128, 128, 64)
          1 (128, 128, 64)

          0 (64, 65, 128)
          1 (64, 65, 128)

       Note that the input array will be distributed in axis 2 and the output in axis 0.

       Another way to tweak the distribution is to use the Subcomm class directly:

          from mpi4py_fft.pencil import Subcomm
          subcomms = Subcomm(MPI.COMM_WORLD, [1, 0, 1])
           fft = PFFT(subcomms, N, axes=(0, 1, 2), dtype=float)

       Here  the subcomms tuple will decide that axis 1 should be distributed, because the only zero in the list
       [1, 0, 1] is along axis 1. The ones determine that axes 0 and 2 should use one processor each, i.e., they
       should be non-distributed.
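
        A quick way to verify this (a sketch, run with 2 processes) is to create an input array for this fft
        instance and print its local shape, which should then be split along axis 1:

           u = newDistArray(fft, False)
           print(MPI.COMM_WORLD.Get_rank(), u.shape)  # expect (128, 64, 128) on each of the 2 ranks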

        The PFFT class has a few additional keyword arguments that one should be aware of. The default behaviour
        of PFFT is to use one transform object for each axis, and then use these sequentially. Setting
        collapse=True will attempt to minimize the number of transform objects by combining whenever possible.
        In our example, the array u_{j_0/P,j_1,j_2} can be transformed along both axes 1 and 2 simultaneously,
        without any intermediate global redistributions. By setting collapse=True only one object, rfftn(u,
        axes=(1, 2)), will be used instead of two (like fftn(rfftn(u, axes=2), axes=1)). Note that a collapse can
        also be configured through the axes keyword:

           fft = PFFT(MPI.COMM_WORLD, N, axes=((0,), (1, 2)), dtype=float)

        which will collapse axes 1 and 2, just like one would obtain with collapse=True.

       If  serial  transforms  other  than  fftn()/rfftn()  and  ifftn()/irfftn() are required, then this can be
       achieved using the transforms keyword and a dictionary pointing from axes to the type  of  transform.  We
       can for example combine real-to-real with real-to-complex transforms like this:

          from mpi4py_fft.fftw import rfftn, irfftn, dctn, idctn
          import functools
          dct = functools.partial(dctn, type=3)
          idct = functools.partial(idctn, type=3)
          transforms = {(0,): (rfftn, irfftn), (1, 2): (dct, idct)}
          r2c = PFFT(MPI.COMM_WORLD, N, axes=((0,), (1, 2)), transforms=transforms)
          u = newDistArray(r2c, False)
          u[:] = np.random.random(u.shape).astype(u.dtype)
          u_hat = r2c.forward(u)
          uj = np.zeros_like(u)
          uj = r2c.backward(u_hat, uj)
          assert np.allclose(uj, u)

       As  a  more  complex  example  consider  a  5-dimensional array where for some reason you need to perform
       discrete cosine transforms in axes 1 and 2, discrete sine transforms in axes  3  and  4,  and  a  regular
       Fourier  transform  in the first axis.  Here it makes sense to collapse the (1, 2) and (3, 4) axes, which
       leaves only the first axis uncollapsed. Hence we can then  only  use  one  processor  group  and  a  slab
       decomposition,  whereas  without  collapsing we could have used four groups.  A parallel transform object
       can be created and tested as:

          N = (5, 6, 7, 8, 9)
          dctn = functools.partial(fftw.dctn, type=3)
          idctn = functools.partial(fftw.idctn, type=3)
          dstn = functools.partial(fftw.dstn, type=3)
          idstn = functools.partial(fftw.idstn, type=3)
          fft = PFFT(MPI.COMM_WORLD, N, ((0,), (1, 2), (3, 4)), grid=(-1,),
                     transforms={(1, 2): (dctn, idctn), (3, 4): (dstn, idstn)})

          A = newDistArray(fft, False)
          A[:] = np.random.random(A.shape)
          C = fftw.aligned_like(A)
          B = fft.forward(A)
          C = fft.backward(B, C)
          assert np.allclose(A, C)

   Pencil decomposition
        A pencil decomposition uses two groups of processors. Each group is then responsible for distributing
        one index set of a multidimensional array. We can perform a pencil decomposition simply by running the
        first example from the previous section, but now with 4 processors. To remind you, we put this in
        pfft_example.py, where now grid=(-1,) has been removed from the call to PFFT:

          import numpy as np
          from mpi4py import MPI
          from mpi4py_fft import PFFT, newDistArray

          N = np.array([128, 128, 128], dtype=int)
           fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=float)
          u = newDistArray(fft, False)
          u[:] = np.random.random(u.shape).astype(u.dtype)
          u_hat = fft.forward(u)
          uj = np.zeros_like(u)
          uj = fft.backward(u_hat, uj)
          assert np.allclose(uj, u)
          print(MPI.COMM_WORLD.Get_rank(), u.shape)

       The output of running mpirun -np 4 python pfft_example.py will then be:

          0 (64, 64, 128)
          2 (64, 64, 128)
          3 (64, 64, 128)
          1 (64, 64, 128)

        Note that now the first two index sets are both shared, so we have a pencil decomposition. The shared
        input array is now denoted as u_{j_0/P_0, j_1/P_1, j_2} and the complete forward transform performs the
        following 5 steps:

            \tilde{u}_{j_0/P_0, j_1/P_1, k_2} = \mathcal{F}_{2}(u_{j_0/P_0, j_1/P_1, j_2}),

            \tilde{u}_{j_0/P_0, j_1, k_2/P_1} \xleftarrow[P_1]{2\rightarrow 1} \tilde{u}_{j_0/P_0, j_1/P_1, k_2},

            \tilde{u}_{j_0/P_0, k_1, k_2/P_1} = \mathcal{F}_1(\tilde{u}_{j_0/P_0, j_1, k_2/P_1}),

            \tilde{u}_{j_0, k_1/P_0, k_2/P_1} \xleftarrow[P_0]{1\rightarrow 0} \tilde{u}_{j_0/P_0, k_1, k_2/P_1},

            \hat{u}_{k_0, k_1/P_0, k_2/P_1} = \mathcal{F}_0(\tilde{u}_{j_0, k_1/P_0, k_2/P_1}).

       Like for the slab decomposition, the order of the different steps  is  configurable.  Simply  change  the
       value of axes, e.g., as:

           fft = PFFT(MPI.COMM_WORLD, N, axes=(2, 0, 1), dtype=float)

       and the input and output arrays will be of shape:

          3 (64, 128, 64)
          2 (64, 128, 64)
          1 (64, 128, 64)
          0 (64, 128, 64)

          3 (64, 32, 128)
          2 (64, 32, 128)
          1 (64, 33, 128)
          0 (64, 33, 128)

       We see that the input array is aligned in axis 1, because this is the direction transformed first.
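
        If you want to control the shape of the processor grid yourself, you can again pass a Subcomm instance
        to PFFT. As a sketch (assuming, as the slab example above suggests, that nonzero entries in the dims list
        fix the size of the corresponding processor group while zeros are filled in automatically), a 2 x 2
        pencil grid for 4 processes could be requested as:

           from mpi4py_fft.pencil import Subcomm
           subcomms = Subcomm(MPI.COMM_WORLD, [2, 2, 1])  # fix a 2 x 2 grid over axes 0 and 1
           fft = PFFT(subcomms, N, axes=(0, 1, 2), dtype=float)
           u = newDistArray(fft, False)
           print(MPI.COMM_WORLD.Get_rank(), u.shape)      # expect (64, 64, 128) on every rank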

   Convolution
        Working with Fourier series, one sometimes needs to transform the product of two or more functions, like

             \widehat{ab}_k = \int_{0}^{2\pi} a b e^{-i k x} dx, \quad \forall k \in [-N/2, \ldots, N/2-1]

       computed with DFT as

            \widehat{ab}_k = \frac{1}{N}\sum_{j=0}^{N-1} a_j b_j e^{-2\pi i j k / N}, \quad \forall \, k \in [-N/2, \ldots, N/2-1].

       NOTE:
          We are here assuming an even number N and use wavenumbers centered around zero.

        If a and b are two Fourier series with their own coefficients:

            a = \sum_{p=-N/2}^{N/2-1} \hat{a}_p e^{i p x},

            b = \sum_{q=-N/2}^{N/2-1} \hat{b}_q e^{i q x},

        then we can insert these two sums into the integral above and get

            \widehat{ab}_k = \int_{0}^{2\pi} \left( \sum_{p=-N/2}^{N/2-1} \hat{a}_p e^{i p x} \sum_{q=-N/2}^{N/2-1} \hat{b}_q e^{i q x} \right) e^{-i k x} dx, \quad \forall \, k \in [-N/2, \ldots, N/2-1],

            \widehat{ab}_k = \sum_{p=-N/2}^{N/2-1} \sum_{q=-N/2}^{N/2-1} \hat{a}_p \hat{b}_q \int_{0}^{2\pi} e^{i (p+q-k) x} dx, \quad \forall \, k \in [-N/2, \ldots, N/2-1].

        The final integral is 2\pi for p+q=k and zero otherwise. Consequently, we get

            \widehat{ab}_k = 2\pi \sum_{p=-N/2}^{N/2-1}\sum_{q=-N/2}^{N/2-1} \hat{a}_p \hat{b}_{q} \delta_{p+q, k}, \quad \forall \, k \in [-N/2, \ldots, N/2-1].

        Unfortunately, this convolution sum is very expensive to compute directly, and the direct application of
        the discrete product formula above leads to aliasing errors. Luckily there is a fast approach that
        eliminates aliasing as well.

       The fast, alias-free, approach makes use of the FFT and zero-padded coefficient vectors. The idea  is  to
       zero-pad \hat{a} and \hat{b} in spectral space such that we get the extended sums

            A_j = \sum_{p=-M/2}^{M/2-1} \hat{\hat{a}}_p e^{2 \pi i p j/M},

            B_j = \sum_{q=-M/2}^{M/2-1} \hat{\hat{b}}_q e^{2 \pi i q j/M},

       where M>N and where the coefficients have been zero-padded such that

            \hat{\hat{a}}_p = \begin{cases} \hat{a}_p, &\forall \, |p| \le N/2 \\ 0, &\forall \, |p| > N/2 \end{cases}

       Now compute the nonlinear term in the larger physical space and compute the convolution as

            \widehat{ab}_k = \frac{1}{M} \sum_{j=0}^{M-1} A_j B_j e^{-2 \pi i k j/M}, \quad \forall \, k \in [-M/2, \ldots, M/2-1].

        Finally, truncate the vector \widehat{ab}_k to the original range k\in[-N/2, \ldots, N/2-1], simply by
        eliminating all the wavenumbers outside this range.

       With mpi4py-fft we can compute this convolution using the padding keyword of the PFFT class:

          import numpy as np
          from mpi4py_fft import PFFT, newDistArray
          from mpi4py import MPI

          comm = MPI.COMM_WORLD
          N = (128, 128)   # Global shape in physical space
           fft = PFFT(comm, N, padding=[1.5, 1.5], dtype=complex)

          # Create arrays in normal spectral space
          a_hat = newDistArray(fft, True)
          b_hat = newDistArray(fft, True)
          a_hat[:] = np.random.random(a_hat.shape) + np.random.random(a_hat.shape)*1j
          b_hat[:] = np.random.random(a_hat.shape) + np.random.random(a_hat.shape)*1j

          # Transform to real space with padding
          a = newDistArray(fft, False)
          b = newDistArray(fft, False)
          assert a.shape == (192//comm.Get_size(), 192)
          a = fft.backward(a_hat, a)
          b = fft.backward(b_hat, b)

          # Do forward transform with truncation
          ab_hat = fft.forward(a*b)

       NOTE:
           The padded instance of the PFFT class is often used in addition to a regular non-padded class. The
           padded version is then used to handle non-linearities, whereas the non-padded takes care of the rest;
           see the demos in the source repository.

   Storing datafiles
        mpi4py-fft works with regular Numpy arrays. However, since arrays in parallel can become very large, and
        the arrays live on multiple processors, we require parallel IO capabilities that go beyond Numpy's
        regular methods. In the mpi4py_fft.io module there are two helper classes for dumping data arrays to
        either HDF5 or NetCDF format:

           • HDF5File

           • NCFile

       Both classes have one write and one read method that stores or reads data in parallel. A  simple  example
       of usage is:

          from mpi4py import MPI
          import numpy as np
          from mpi4py_fft import PFFT, HDF5File, NCFile, newDistArray
          N = (128, 256, 512)
          T = PFFT(MPI.COMM_WORLD, N)
          u = newDistArray(T, forward_output=False)
          v = newDistArray(T, forward_output=False, val=2)
          u[:] = np.random.random(u.shape)
          # Store by first creating output files
          fields = {'u': [u], 'v': [v]}
          f0 = HDF5File('h5test.h5', mode='w')
          f1 = NCFile('nctest.nc', mode='w')
          f0.write(0, fields)
          f1.write(0, fields)
          v[:] = 3
          f0.write(1, fields)
          f1.write(1, fields)

       Note  that  we  are  here  creating two datafiles h5test.h5 and nctest.nc, for storing in HDF5 or NetCDF4
       formats respectively. Normally, one would be satisfied using  only  one  format,  so  this  is  only  for
       illustration.  We  store  the  fields u and v on three different occasions, so the datafiles will contain
       three snapshots of each field u and v.

       Also note that an alternative and perhaps simpler approach is to  just  use  the  write  method  of  each
       distributed array:

          u.write('h5test.h5', 'u', step=2)
          v.write('h5test.h5', 'v', step=2)
          u.write('nctest.nc', 'u', step=2)
          v.write('nctest.nc', 'v', step=2)

       The two different approaches can be used on the same output files.

       The stored dataarrays can also be retrieved later on:

          u0 = newDistArray(T, forward_output=False)
          u1 = newDistArray(T, forward_output=False)
          u0.read('h5test.h5', 'u', 0)
          u1.read('h5test.h5', 'u', 1)
          # or alternatively for netcdf
          #u0.read('nctest.nc', 'u', 0)
          #u1.read('nctest.nc', 'u', 1)

       Note  that  one  does not have to use the same number of processors when retrieving the data as when they
       were stored.

        It is also possible to store only parts of the, potentially large, arrays. Any chosen slice may be
        stored, using a global view of the arrays. It is possible to store both complete fields and slices in one
        single call by using the following approach:

          f2 = HDF5File('variousfields.h5', mode='w')
          fields = {'u': [u,
                          (u, [slice(None), slice(None), 4]),
                          (u, [5, 5, slice(None)])],
                    'v': [v,
                          (v, [slice(None), 6, slice(None)])]}
          f2.write(0, fields)
          f2.write(1, fields)

       Alternatively, one can use the write method of each field with the global_slice keyword argument:

          u.write('variousfields.h5', 'u', 2)
          u.write('variousfields.h5', 'u', 2, global_slice=[slice(None), slice(None), 4])
          u.write('variousfields.h5', 'u', 2, global_slice=[5, 5, slice(None)])
          v.write('variousfields.h5', 'v', 2)
          v.write('variousfields.h5', 'v', 2, global_slice=[slice(None), 6, slice(None)])

       In the end this will lead to an hdf5-file with groups:

          variousfields.h5/
          ├─ u/
          |  ├─ 1D/
          |  |  └─ 5_5_slice/
          |  |     ├─ 0
          |  |     ├─ 1
           |  |     └─ 2
          |  ├─ 2D/
          |  |  └─ slice_slice_4/
          |  |     ├─ 0
          |  |     ├─ 1
          |  |     └─ 2
          |  ├─ 3D/
          |  |   ├─ 0
          |  |   ├─ 1
          |  |   └─ 2
          |  └─ mesh/
          |      ├─ x0
          |      ├─ x1
          |      └─ x2
          └─ v/
             ├─ 2D/
             |  └─ slice_6_slice/
             |     ├─ 0
             |     ├─ 1
             |     └─ 2
             ├─ 3D/
             |  ├─ 0
             |  ├─ 1
             |  └─ 2
             └─ mesh/
                ├─ x0
                ├─ x1
                └─ x2

       Note that a mesh is stored along with each group of data. This mesh can be given in  two  different  ways
       when creating the datafiles:

          1. A  sequence  of  2-tuples, where each 2-tuple contains the (origin, length) of the domain along its
             dimension. For example, a uniform mesh that originates from the origin,  with  lengths  \pi,  2\pi,
             3\pi, can be given when creating the output file as:

                  f0 = HDF5File('filename.h5', domain=((0, np.pi), (0, 2*np.pi), (0, 3*np.pi)))

                  or, using the write method of the distributed array:

                  u.write('filename.h5', 'u', 0, domain=((0, np.pi), (0, 2*np.pi), (0, 3*np.pi)))

          2. A sequence of arrays giving the coordinates for each dimension. For example:

                  d = (np.arange(N[0], dtype=np.float64)*1*np.pi/N[0],
                       np.arange(N[1], dtype=np.float64)*2*np.pi/N[1],
                       np.arange(N[2], dtype=np.float64)*2*np.pi/N[2])
                 f0 = HDF5File('filename.h5', domain=d)

       With  NetCDF4  the layout is somewhat different. For variousfields above, if we were using NCFile instead
       of HDF5File, we would get a datafile that with ncdump -h variousfields.nc would look like:

          netcdf variousfields {
          dimensions:
                  time = UNLIMITED ; // (3 currently)
                  x = 128 ;
                  y = 256 ;
                  z = 512 ;
          variables:
                  double time(time) ;
                  double x(x) ;
                  double y(y) ;
                  double z(z) ;
                  double u(time, x, y, z) ;
                  double u_slice_slice_4(time, x, y) ;
                  double u_5_5_slice(time, z) ;
                  double v(time, x, y, z) ;
                  double v_slice_6_slice(time, x, z) ;
          }

   Postprocessing
       Dataarrays stored to HDF5 files can be visualized using both Paraview and Visit,  whereas  NetCDF4  files
       can at the time of writing only be opened with Visit.

        To view the HDF5-files we first need to generate some lightweight xdmf-files that can be understood by
        both Paraview and Visit. To generate such files, simply run the io.generate_xdmf function on the
        HDF5-files:

          from mpi4py_fft.io import generate_xdmf
          generate_xdmf('variousfields.h5')

       This will create a number of xdmf-files, one for each group that contains 2D or 3D data:

          variousfields.xdmf
          variousfields_slice_slice_4.xdmf
          variousfields_slice_6_slice.xdmf

       These  files  can  be  opened directly in Paraview. However, note that for Visit, one has to generate the
       files using:

          generate_xdmf('variousfields.h5', order='visit')

       because for some reason Paraview and Visit require the mesh in the xdmf-files to be  stored  in  opposite
       order.

   Installation
       Mpi4py-fft has a few dependencies

           • mpi4py

           • FFTW (serial)

           • numpy

           • cython (build dependency)

           • h5py (runtime dependency, optional)

        that are mostly straightforward to install, or already installed in most Python environments. The first
        two are usually the most troublesome. Basically, for mpi4py you need to have a working MPI installation,
        whereas FFTW is available on most high performance computer systems. If you are using conda, then all
        you need to install a fully functional mpi4py-fft, with all the above dependencies, is

           conda install -c conda-forge mpi4py-fft h5py=*=mpi*

       You probably want to install into a fresh environment, though, which can be achieved with

          conda create --name mpi4py-fft -c conda-forge mpi4py-fft h5py=*=mpi*
          conda activate mpi4py-fft

        Note that this gives you mpi4py-fft with default settings. This means that you will probably get the
        openmpi backend, and it is also likely that conda-forge chooses numpy with the mkl backend.
        Unfortunately, the mkl python package makes adjustments to the FFTW library, and hard-to-resolve bugs
        may arise. For this reason it is advisable to make sure that mkl is not installed. This can be achieved
        with, e.g.,

          conda create --name mpi4py-fft -c conda-forge mpi4py-fft mpich nomkl h5py=*=mpi*

        Note that the nomkl package makes sure that numpy is installed without mkl, whereas mpich here selects
        the MPICH backend for MPI instead of openmpi.

       If  you do not use conda, then you need to make sure that MPI and FFTW are installed by some other means.
       You can then install any version of mpi4py-fft hosted on pypi using pip

          pip install mpi4py-fft

       whereas either one of the following will install the latest version from github

          pip install git+https://bitbucket.org/mpi4py/mpi4py-fft@master
          pip install https://bitbucket.org/mpi4py/mpi4py-fft/get/master.zip

        You can also build mpi4py-fft yourself from the top directory, after cloning or forking

          pip install .

       or using conda-build with the recipes in folder conf/

          conda build -c conda-forge conf/
          conda create --name mpi4py-fft -c conda-forge mpi4py-fft --use-local
          conda activate mpi4py-fft

   Additional dependencies
       For storing and retrieving data you need either HDF5 or netCDF4, compiled with support for MPI.  HDF5  is
       already  available  with parallel support on conda-forge and, if it was not installed at the same time as
       mpi4py-fft, it can be installed (with the mpich backend for MPI) as

          conda install -c conda-forge h5py=*=mpi_mpich_*

       A parallel version of netCDF4 cannot be found on the conda-forge channel, but a precompiled  version  has
       been made available for python 2.7, 3.6 and 3.7 on the spectralDNS channel, for both osx and linux

          conda install -c spectralDNS netcdf4-parallel

       Note  that parallel HDF5 and NetCDF4 often are available as modules on supercomputers. Otherwise, see the
       respective packages for how to install with support for MPI.

   Test installation
        After installing (from source) it may be a good idea to run all the tests located in the tests folder. A
        range of tests may be run using the runtests.sh script

           conda install scipy coverage
           cd tests/
           ./runtests.sh

        This test suite is run automatically on every commit to the repository.

    How to cite?

       Please cite mpi4py-fft using

          @article{jpdc_fft,
              author = {{Dalcin, Lisandro and Mortensen, Mikael and Keyes, David E}},
              year = {{2019}},
              title = {{Fast parallel multidimensional FFT using advanced MPI}},
              journal = {{Journal of Parallel and Distributed Computing}},
              doi = {10.1016/j.jpdc.2019.02.006}
          }
          @electronic{mpi4py-fft,
              author = {{Lisandro Dalcin and Mikael Mortensen}},
              title = {{mpi4py-fft}},
              url = {{https://bitbucket.org/mpi4py/mpi4py-fft}}
          }

   How to contribute?
        Mpi4py-fft is an open source project and anyone is welcome to contribute. An easy way to get started is
        by suggesting a new enhancement on the issue tracker. If you have found a bug, then either report this on
        the issue tracker, or even better, make a fork of the repository, fix the bug and then create a pull
        request to get the fix into the master branch.


AUTHOR

       Mikael Mortensen and Lisandro Dalcin

COPYRIGHT

       2020, Mikael Mortensen and Lisandro Dalcin

                                                  Feb 18, 2020                                     MPI4PY-FFT(1)