Provided by: manpages-dev_6.7-2_all bug

NAME

       madvise - give advice about use of memory

LIBRARY

       Standard C library (libc, -lc)

SYNOPSIS

       #include <sys/mman.h>

       int madvise(void addr[.length], size_t length, int advice);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       madvise():
           Since glibc 2.19:
               _DEFAULT_SOURCE
           Up to and including glibc 2.19:
               _BSD_SOURCE

DESCRIPTION

       The  madvise()  system  call  is used to give advice or directions to the kernel about the
       address range beginning at address addr and with size length.  madvise() only operates  on
       whole  pages, therefore addr must be page-aligned.  The value of length is rounded up to a
       multiple of page size.  In most cases, the goal of such advice is  to  improve  system  or
       application performance.

       Initially, the system call supported a set of "conventional" advice values, which are also
       available on  several  other  implementations.   (Note,  though,  that  madvise()  is  not
       specified  in  POSIX.)   Subsequently,  a number of Linux-specific advice values have been
       added.

   Conventional advice values
       The advice values listed below allow an application to tell the kernel how it  expects  to
       use  some  mapped  or shared memory areas, so that the kernel can choose appropriate read-
       ahead and caching techniques.  These advice values do not influence the semantics  of  the
       application (except in the case of MADV_DONTNEED), but may influence its performance.  All
       of the advice values listed here have  analogs  in  the  POSIX-specified  posix_madvise(3)
       function, and the values have the same meanings, with the exception of MADV_DONTNEED.

       The advice is indicated in the advice argument, which is one of the following:

       MADV_NORMAL
              No special treatment.  This is the default.

       MADV_RANDOM
              Expect page references in random order.  (Hence, read ahead may be less useful than
              normally.)

       MADV_SEQUENTIAL
              Expect page references in sequential order.  (Hence, pages in the given  range  can
              be aggressively read ahead, and may be freed soon after they are accessed.)

       MADV_WILLNEED
              Expect  access  in  the  near future.  (Hence, it might be a good idea to read some
              pages ahead.)

       MADV_DONTNEED
              Do not expect access in the near future.  (For the time being, the  application  is
              finished  with  the  given  range, so the kernel can free resources associated with
              it.)

              After a successful MADV_DONTNEED operation, the semantics of memory access  in  the
              specified  region  are  changed:  subsequent  accesses  of  pages in the range will
              succeed, but will result in either repopulating the memory contents from the up-to-
              date  contents  of  the  underlying  mapped  file (for shared file mappings, shared
              anonymous mappings, and shmem-based techniques  such  as  System  V  shared  memory
              segments) or zero-fill-on-demand pages for anonymous private mappings.

              Note  that,  when  applied  to  shared  mappings,  MADV_DONTNEED  might not lead to
              immediate freeing of the pages in the range.  The kernel is free to  delay  freeing
              the  pages until an appropriate moment.  The resident set size (RSS) of the calling
              process will be immediately reduced however.

              MADV_DONTNEED cannot be applied to locked pages, or VM_PFNMAP pages.  (Pages marked
              with  the  kernel-internal  VM_PFNMAP  flag  are  special memory areas that are not
              managed by the virtual memory subsystem.   Such  pages  are  typically  created  by
              device drivers that map the pages into user space.)

              Support  for  Huge  TLB pages was added in Linux v5.18.  Addresses within a mapping
              backed by Huge TLB pages must be aligned to the underlying Huge TLB page size,  and
              the range length is rounded up to a multiple of the underlying Huge TLB page size.

   Linux-specific advice values
       The  following  Linux-specific  advice  values have no counterparts in the POSIX-specified
       posix_madvise(3), and may  or  may  not  have  counterparts  in  the  madvise()  interface
       available  on  other  implementations.   Note  that  some  of  these operations change the
       semantics of memory accesses.

       MADV_REMOVE (since Linux 2.6.16)
              Free up a given  range  of  pages  and  its  associated  backing  store.   This  is
              equivalent  to punching a hole in the corresponding range of the backing store (see
              fallocate(2)).  Subsequent accesses in the specified address range  will  see  data
              with a value of zero.

              The  specified  address range must be mapped shared and writable.  This flag cannot
              be applied to locked pages, or VM_PFNMAP pages.

              In the initial implementation, only tmpfs(5) supported MADV_REMOVE; but since Linux
              3.5,  any filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode also
              supports MADV_REMOVE.  Filesystems which do not support MADV_REMOVE fail  with  the
              error EOPNOTSUPP.

              Support for the Huge TLB filesystem was added in Linux v4.3.

       MADV_DONTFORK (since Linux 2.6.16)
              Do  not  make the pages in this range available to the child after a fork(2).  This
              is useful to prevent copy-on-write semantics from changing the physical location of
              a  page  if  the parent writes to it after a fork(2).  (Such page relocations cause
              problems for hardware that DMAs into the page.)

       MADV_DOFORK (since Linux 2.6.16)
              Undo the effect of MADV_DONTFORK, restoring the default behavior, whereby a mapping
              is inherited across fork(2).

       MADV_HWPOISON (since Linux 2.6.32)
              Poison  the  pages  in the range specified by addr and length and handle subsequent
              references to those pages like a hardware memory  corruption.   This  operation  is
              available only for privileged (CAP_SYS_ADMIN) processes.  This operation may result
              in the calling process receiving a SIGBUS and the page being unmapped.

              This feature is intended for testing of memory error-handling code; it is available
              only if the kernel was configured with CONFIG_MEMORY_FAILURE.

       MADV_MERGEABLE (since Linux 2.6.32)
              Enable  Kernel  Samepage Merging (KSM) for the pages in the range specified by addr
              and length.  The kernel regularly scans those areas of user memory that  have  been
              marked  as mergeable, looking for pages with identical content.  These are replaced
              by a single write-protected page (which is automatically copied if a process  later
              wants  to update the content of the page).  KSM merges only private anonymous pages
              (see mmap(2)).

              The KSM feature is intended for applications that generate many  instances  of  the
              same  data  (e.g.,  virtualization  systems  such as KVM).  It can consume a lot of
              processing  power;  use  with   care.    See   the   Linux   kernel   source   file
              Documentation/admin-guide/mm/ksm.rst for more details.

              The MADV_MERGEABLE and MADV_UNMERGEABLE operations are available only if the kernel
              was configured with CONFIG_KSM.

       MADV_UNMERGEABLE (since Linux 2.6.32)
              Undo the effect of an earlier MADV_MERGEABLE operation  on  the  specified  address
              range;  KSM unmerges whatever pages it had merged in the address range specified by
              addr and length.

       MADV_SOFT_OFFLINE (since Linux 2.6.33)
              Soft offline the pages in the range specified by addr and length.   The  memory  of
              each  page  in the specified range is preserved (i.e., when next accessed, the same
              content will be visible, but in a new physical page frame), and the  original  page
              is offlined (i.e., no longer used, and taken out of normal memory management).  The
              effect of the MADV_SOFT_OFFLINE operation is invisible to (i.e.,  does  not  change
              the semantics of) the calling process.

              This feature is intended for testing of memory error-handling code; it is available
              only if the kernel was configured with CONFIG_MEMORY_FAILURE.

       MADV_HUGEPAGE (since Linux 2.6.38)
              Enable Transparent Huge Pages (THP) for pages in the range specified  by  addr  and
              length.  The kernel will regularly scan the areas marked as huge page candidates to
              replace them with huge pages.  The kernel will also allocate  huge  pages  directly
              when the region is naturally aligned to the huge page size (see posix_memalign(2)).

              This feature is primarily aimed at applications that use large mappings of data and
              access large regions of that memory at a time (e.g., virtualization systems such as
              QEMU).   It  can  very  easily  waste  memory  (e.g., a 2 MB mapping that only ever
              accesses 1 byte will result in 2 MB of wired memory instead of one 4 KB page).  See
              the  Linux  kernel  source file Documentation/admin-guide/mm/transhuge.rst for more
              details.

              Most common kernels configurations provide MADV_HUGEPAGE-style behavior by default,
              and  thus  MADV_HUGEPAGE  is  normally  not  necessary.   It is mostly intended for
              embedded systems, where MADV_HUGEPAGE-style behavior may not be enabled by  default
              in  the  kernel.   On  such  systems, this flag can be used in order to selectively
              enable THP.  Whenever MADV_HUGEPAGE is used, it should  always  be  in  regions  of
              memory  with  an  access  pattern that the developer knows in advance won't risk to
              increase the memory footprint of the application  when  transparent  hugepages  are
              enabled.

              Since  Linux  5.4,  automatic  scan of eligible areas and replacement by huge pages
              works with private anonymous pages (see  mmap(2)),  shmem  pages,  and  file-backed
              pages.   For  all  memory  types,  memory  may  only  be  replaced by huge pages on
              hugepage-aligned  boundaries.   For  file-mapped  memory  —including   tmpfs   (see
              tmpfs(2))—  the  mapping  must  also be naturally hugepage-aligned within the file.
              Additionally, for file-backed, non-tmpfs memory, the file  must  not  be  open  for
              write and the mapping must be executable.

              The  VMA  must  not  be  marked  VM_NOHUGEPAGE,  VM_HUGETLB,  VM_IO, VM_DONTEXPAND,
              VM_MIXEDMAP, or VM_PFNMAP, nor can it be stack memory or backed  by  a  DAX-enabled
              device (unless the DAX device is hot-plugged as System RAM).  The process must also
              not have PR_SET_THP_DISABLE set (see prctl(2)).

              The MADV_HUGEPAGE, MADV_NOHUGEPAGE, and MADV_COLLAPSE operations are available only
              if the kernel was configured with CONFIG_TRANSPARENT_HUGEPAGE and file/shmem memory
              is only supported if the kernel was configured with CONFIG_READ_ONLY_THP_FOR_FS.

       MADV_NOHUGEPAGE (since Linux 2.6.38)
              Ensures that memory in the address range specified by addr and length will  not  be
              backed by transparent hugepages.

       MADV_COLLAPSE (since Linux 6.1)
              Perform a best-effort synchronous collapse of the native pages mapped by the memory
              range into Transparent Huge Pages (THPs).  MADV_COLLAPSE operates  on  the  current
              state  of  memory  of  the  calling  process  and  makes  no  persistent changes or
              guarantees on how pages will be mapped, constructed, or faulted in the future.

              MADV_COLLAPSE supports private anonymous pages  (see  mmap(2)),  shmem  pages,  and
              file-backed   pages.    See   MADV_HUGEPAGE   for  general  information  on  memory
              requirements for THP.  If the range provided spans multiple VMAs, the semantics  of
              the  collapse over each VMA is independent from the others.  If collapse of a given
              huge page-aligned/sized  region  fails,  the  operation  may  continue  to  attempt
              collapsing the remainder of the specified memory.  MADV_COLLAPSE will automatically
              clamp the provided range to be hugepage-aligned.

              All non-resident pages covered by  the  range  will  first  be  swapped/faulted-in,
              before being copied onto a freshly allocated hugepage.  If the native pages compose
              the same PTE-mapped hugepage,  and  are  suitably  aligned,  allocation  of  a  new
              hugepage  may be elided and collapse may happen in-place.  Unmapped pages will have
              their data directly initialized to 0 in  the  new  hugepage.   However,  for  every
              eligible  hugepage-aligned/sized  region  to  be  collapsed, at least one page must
              currently be backed by physical memory.

              MADV_COLLAPSE  is  independent  of  any  sysfs   (see   sysfs(5))   setting   under
              /sys/kernel/mm/transparent_hugepage,  both in terms of determining THP eligibility,
              and     allocation     semantics.      See     Linux     kernel     source     file
              Documentation/admin-guide/mm/transhuge.rst  for  more  information.   MADV_COLLAPSE
              also ignores huge= tmpfs mount when operating on tmpfs files.  Allocation  for  the
              new  hugepage  may  enter direct reclaim and/or compaction, regardless of VMA flags
              (though VM_NOHUGEPAGE is still respected).

              When the system has multiple NUMA nodes, the hugepage will be  allocated  from  the
              node providing the most native pages.

              If  all  hugepage-sized/aligned  regions  covered by the provided range were either
              successfully collapsed, or were already PMD-mapped THPs,  this  operation  will  be
              deemed  successful.  Note that this doesn't guarantee anything about other possible
              mappings of the memory.  In the event multiple hugepage-aligned/sized areas fail to
              collapse, only the most-recently–failed code will be set in errno.

       MADV_DONTDUMP (since Linux 3.4)
              Exclude  from  a  core  dump those pages in the range specified by addr and length.
              This is useful in applications that have large areas of memory that are  known  not
              to be useful in a core dump.  The effect of MADV_DONTDUMP takes precedence over the
              bit mask that is set via the /proc/pid/coredump_filter file (see core(5)).

       MADV_DODUMP (since Linux 3.4)
              Undo the effect of an earlier MADV_DONTDUMP.

       MADV_FREE (since Linux 4.5)
              The application no longer requires the pages in the range  specified  by  addr  and
              len.   The kernel can thus free these pages, but the freeing could be delayed until
              memory pressure occurs.  For each of the pages that has been marked to be freed but
              has  not  yet  been freed, the free operation will be canceled if the caller writes
              into the page.  After a successful  MADV_FREE  operation,  any  stale  data  (i.e.,
              dirty,  unwritten  pages)  will  be lost when the kernel frees the pages.  However,
              subsequent writes to pages in the range will succeed and then  kernel  cannot  free
              those dirtied pages, so that the caller can always see just written data.  If there
              is no subsequent write, the kernel can free the pages at any time.  Once  pages  in
              the  range  have  been  freed,  the  caller will see zero-fill-on-demand pages upon
              subsequent page references.

              The MADV_FREE operation can  be  applied  only  to  private  anonymous  pages  (see
              mmap(2)).  Before Linux 4.12, when freeing pages on a swapless system, the pages in
              the given range are freed instantly, regardless of memory pressure.

       MADV_WIPEONFORK (since Linux 4.14)
              Present the child process with zero-filled memory in this range  after  a  fork(2).
              This  is  useful  in  forking servers in order to ensure that sensitive per-process
              data (for example, PRNG seeds, cryptographic secrets, and so on) is not  handed  to
              child processes.

              The  MADV_WIPEONFORK  operation can be applied only to private anonymous pages (see
              mmap(2)).

              Within the child created by fork(2), the MADV_WIPEONFORK setting remains  in  place
              on the specified address range.  This setting is cleared during execve(2).

       MADV_KEEPONFORK (since Linux 4.14)
              Undo the effect of an earlier MADV_WIPEONFORK.

       MADV_COLD (since Linux 5.4)
              Deactivate  a  given  range  of  pages.   This  will make the pages a more probable
              reclaim target should there  be  a  memory  pressure.   This  is  a  nondestructive
              operation.   The advice might be ignored for some pages in the range when it is not
              applicable.

       MADV_PAGEOUT (since Linux 5.4)
              Reclaim a given range of pages.  This is done to free up memory occupied  by  these
              pages.   If  a page is anonymous, it will be swapped out.  If a page is file-backed
              and dirty, it will be written back to the backing storage.   The  advice  might  be
              ignored for some pages in the range when it is not applicable.

       MADV_POPULATE_READ (since Linux 5.14)
              "Populate  (prefault) page tables readable, faulting in all pages in the range just
              as if manually reading from each page; however, avoid the actual memory access that
              would have been performed after handling the fault.

              In  contrast  to  MAP_POPULATE,  MADV_POPULATE_READ  does  not  hide errors, can be
              applied to (parts of) existing mappings and will always  populate  (prefault)  page
              tables  readable.   One example use case is prefaulting a file mapping, reading all
              file content from disk; however, pages won't be dirtied and consequently won't have
              to be written back to disk when evicting the pages from memory.

              Depending on the underlying mapping, map the shared zeropage, preallocate memory or
              read the underlying file; files with holes might or might not  preallocate  blocks.
              If  populating  fails,  a  SIGBUS  signal  is  not  generated; instead, an error is
              returned.

              If MADV_POPULATE_READ succeeds, all page tables have  been  populated  (prefaulted)
              readable  once.   If  MADV_POPULATE_READ  fails,  some  page tables might have been
              populated.

              MADV_POPULATE_READ cannot be applied  to  mappings  without  read  permissions  and
              special  mappings,  for example, mappings marked with kernel-internal flags such as
              VM_PFNMAP or VM_IO, or secret memory regions created using memfd_secret(2).

              Note that with MADV_POPULATE_READ, the process can be killed at any moment when the
              system runs out of memory.

       MADV_POPULATE_WRITE (since Linux 5.14)
              Populate  (prefault)  page tables writable, faulting in all pages in the range just
              as if manually writing to each each page; however, avoid the actual  memory  access
              that would have been performed after handling the fault.

              In  contrast  to  MAP_POPULATE,  MADV_POPULATE_WRITE  does  not hide errors, can be
              applied to (parts of) existing mappings and will always  populate  (prefault)  page
              tables  writable.   One  example use case is preallocating memory, breaking any CoW
              (Copy on Write).

              Depending on the underlying mapping, preallocate  memory  or  read  the  underlying
              file;  files  with  holes  will  preallocate blocks.  If populating fails, a SIGBUS
              signal is not generated; instead, an error is returned.

              If MADV_POPULATE_WRITE succeeds, all page tables have been  populated  (prefaulted)
              writable  once.   If  MADV_POPULATE_WRITE  fails,  some page tables might have been
              populated.

              MADV_POPULATE_WRITE cannot be applied to mappings  without  write  permissions  and
              special  mappings,  for example, mappings marked with kernel-internal flags such as
              VM_PFNMAP or VM_IO, or secret memory regions created using memfd_secret(2).

              Note that with MADV_POPULATE_WRITE, the process can be killed at  any  moment  when
              the system runs out of memory.

RETURN VALUE

       On  success, madvise() returns zero.  On error, it returns -1 and errno is set to indicate
       the error.

ERRORS

       EACCES advice is MADV_REMOVE, but the specified address range is  not  a  shared  writable
              mapping.

       EAGAIN A kernel resource was temporarily unavailable.

       EBADF  The map exists, but the area maps something that isn't a file.

       EBUSY  (for MADV_COLLAPSE) Could not charge hugepage to cgroup: cgroup limit exceeded.

       EFAULT advice  is  MADV_POPULATE_READ or MADV_POPULATE_WRITE, and populating (prefaulting)
              page tables failed because a SIGBUS would have  been  generated  on  actual  memory
              access  and  the  reason  is  not  a  HW  poisoned page (HW poisoned pages can, for
              example, be created using the MADV_HWPOISON flag described elsewhere in this page).

       EINVAL addr is not page-aligned or length is negative.

       EINVAL advice is not a valid.

       EINVAL advice is MADV_COLD or  MADV_PAGEOUT  and  the  specified  address  range  includes
              locked, Huge TLB pages, or VM_PFNMAP pages.

       EINVAL advice  is  MADV_DONTNEED  or  MADV_REMOVE and the specified address range includes
              locked, Huge TLB pages, or VM_PFNMAP pages.

       EINVAL advice is MADV_MERGEABLE or MADV_UNMERGEABLE, but the  kernel  was  not  configured
              with CONFIG_KSM.

       EINVAL advice  is  MADV_FREE  or  MADV_WIPEONFORK but the specified address range includes
              file, Huge TLB, MAP_SHARED, or VM_PFNMAP ranges.

       EINVAL advice is MADV_POPULATE_READ or  MADV_POPULATE_WRITE,  but  the  specified  address
              range  includes  ranges  with  insufficient  permissions  or  special mappings, for
              example, mappings marked with kernel-internal flags such a VM_IO or  VM_PFNMAP,  or
              secret memory regions created using memfd_secret(2).

       EIO    (for MADV_WILLNEED) Paging in this area would exceed the process's maximum resident
              set size.

       ENOMEM (for MADV_WILLNEED) Not enough memory: paging in failed.

       ENOMEM (for MADV_COLLAPSE) Not enough memory: could not allocate hugepage.

       ENOMEM Addresses in the specified range are not  currently  mapped,  or  are  outside  the
              address space of the process.

       ENOMEM advice  is  MADV_POPULATE_READ or MADV_POPULATE_WRITE, and populating (prefaulting)
              page tables failed because there was not enough memory.

       EPERM  advice is MADV_HWPOISON, but the caller does not have the CAP_SYS_ADMIN capability.

       EHWPOISON
              advice is MADV_POPULATE_READ or MADV_POPULATE_WRITE, and  populating  (prefaulting)
              page  tables failed because a HW poisoned page (HW poisoned pages can, for example,
              be created using the MADV_HWPOISON flag  described  elsewhere  in  this  page)  was
              encountered.

VERSIONS

       Versions  of this system call, implementing a wide variety of advice values, exist on many
       other implementations.  Other implementations  typically  implement  at  least  the  flags
       listed above under Conventional advice flags, albeit with some variation in semantics.

       POSIX.1-2001     describes     posix_madvise(3)    with    constants    POSIX_MADV_NORMAL,
       POSIX_MADV_RANDOM, POSIX_MADV_SEQUENTIAL,  POSIX_MADV_WILLNEED,  and  POSIX_MADV_DONTNEED,
       and so on, with behavior close to the similarly named flags listed above.

   Linux
       The Linux implementation requires that the address addr be page-aligned, and allows length
       to be zero.  If there are some parts of the specified address range that are  not  mapped,
       the  Linux version of madvise() ignores them and applies the call to the rest (but returns
       ENOMEM from the system call, as it should).

       madvise(0, 0, advice) will return zero iff advice is supported by the kernel  and  can  be
       relied on to probe for support.

STANDARDS

       None.

HISTORY

       First appeared in 4.4BSD.

       Since  Linux  3.18,  support for this system call is optional, depending on the setting of
       the CONFIG_ADVISE_SYSCALLS configuration option.

SEE ALSO

       getrlimit(2), memfd_secret(2),  mincore(2),  mmap(2),  mprotect(2),  msync(2),  munmap(2),
       prctl(2), process_madvise(2), posix_madvise(3), core(5)