Ubuntu Manpage: file_sorter

Provided by: erlang-manpages_16.b.3-dfsg-1ubuntu2.2_all

NAME

       file_sorter - File Sorter

DESCRIPTION

       The functions of this module sort terms on files, merge already sorted files, and check
       files for sortedness. Chunks containing binary terms are read from a sequence of files,
       sorted internally in memory, and written to temporary files, which are then merged to
       produce one sorted file as output. Merging is provided as an optimization; it is faster
       when the files are already sorted, but sorting instead of merging always works.

       On  a  file, a term is represented by a header and a binary. Two options define the format
       of terms on files:

         * {header, HeaderLength}. HeaderLength determines the number  of  bytes  preceding  each
           binary  and  containing  the length of the binary in bytes. Default is 4. The order of
           the header bytes is defined as follows: if B is a binary containing a header only, the
           size Size of the binary is calculated as <<Size:HeaderLength/unit:8>> = B.

         * {format,  Format}.  The  format determines the function that is applied to binaries in
           order to create the terms that will be sorted. The default value is binary_term, which
           is  equivalent  to fun binary_to_term/1. The value binary is equivalent to fun(X) -> X
           end, which means that the binaries will be sorted as they are.  This  is  the  fastest
           format.  If  Format  is term, io:read/2 is called to read terms. In that case only the
           default value of the header option is allowed. The format option also determines  what
           is  written to the sorted output file: if Format is term then io:format/3 is called to
           write each term, otherwise the binary prefixed by a header is written. Note  that  the
           binary  written  is  the same binary that was read; the results of applying the Format
           function are thrown away as soon as the terms have been sorted.  Reading  and  writing
           terms using the io module is very much slower than reading and writing binaries.
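
       The on-file layout described above can be sketched as follows. This is a minimal
       illustration that assumes the default options ({header, 4} and {format, binary_term});
       the module name fs_format_demo and its helper functions are hypothetical and not part
       of file_sorter.

```erlang
-module(fs_format_demo).
-export([write_terms/2, read_terms/1, sort_file/1]).

%% Write each term as a 4-byte length header followed by the external
%% term format, i.e. the layout expected with the default options.
write_terms(FileName, Terms) ->
    Data = [begin
                Bin = term_to_binary(T),
                [<<(byte_size(Bin)):4/unit:8>>, Bin]
            end || T <- Terms],
    file:write_file(FileName, Data).

%% Read the file back, decoding every header-prefixed binary into a term.
read_terms(FileName) ->
    {ok, Bin} = file:read_file(FileName),
    decode(Bin).

decode(<<Size:4/unit:8, Bin:Size/binary, Rest/binary>>) ->
    [binary_to_term(Bin) | decode(Rest)];
decode(<<>>) ->
    [].

%% Sort the file in place; equivalent to sort([FileName], FileName).
sort_file(FileName) ->
    file_sorter:sort(FileName).
```

       A file written with write_terms/2 can be passed to sort_file/1 and then inspected with
       read_terms/1; the terms come back in standard Erlang term order.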

       Other options are:

         * {order,  Order}.  The  default  is  to  sort terms in ascending order, but that can be
           changed by the value descending or by giving an ordering  function  Fun.  An  ordering
           function  is  antisymmetric,  transitive  and total. Fun(A, B) should return true if A
           comes before B in the ordering, false otherwise. An  example  of  a  typical  ordering
           function is less than or equal to, =</2. Using an ordering function will slow down the
           sort considerably. The keysort, keymerge and keycheck functions do not accept ordering
           functions.

         * {unique,  boolean()}.  When  sorting or merging files, only the first of a sequence of
           terms that compare equal (==) is output if this option is set  to  true.  The  default
           value  is  false  which  implies  that  all  terms that compare equal are output. When
           checking files for sortedness, a check that no  pair  of  consecutive  terms  compares
           equal is done if this option is set to true.

         * {tmpdir, TempDirectory}. The directory where temporary files are put can be chosen
           explicitly. The default, implied by the value "", is to put temporary files in the
           same directory as the sorted output file. If output is a function (see below), the
           directory returned by file:get_cwd() is used instead. The names of temporary files are
           derived  from  the  Erlang  nodename  (node()),  the process identifier of the current
           Erlang emulator (os:getpid()), and a timestamp (erlang:now()); a typical name would be
           fs_mynode@myhost_1763_1043_337000_266005.17,  where  17 is a sequence number. Existing
           files will be overwritten. Temporary files  are  deleted  unless  some  uncaught  EXIT
           signal occurs.

         * {compressed,  boolean()}.  Temporary  files and the output file may be compressed. The
           default value false implies that written files are not compressed. Regardless  of  the
           value of the compressed option, compressed files can always be read. Note that reading
           and writing  compressed  files  is  significantly  slower  than  reading  and  writing
           uncompressed files.

         * {size,  Size}.  By  default  approximately  512*1024  bytes read from files are sorted
           internally. This option should rarely be needed.

         * {no_files, NoFiles}. By default 16 files are merged at  a  time.  This  option  should
           rarely be needed.
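
       The effect of the order and unique options can be tried out on a small in-memory list.
       The sketch below assumes the term format together with function input and output so
       that no data files are needed; the module fs_options_demo and the function sort_list/2
       are hypothetical names chosen for illustration.

```erlang
-module(fs_options_demo).
-export([sort_list/2]).

%% Sort an in-memory list with the given extra options, e.g.
%% [{order, descending}] or [{unique, true}].
sort_list(Terms, Options) ->
    Input = fun(read) -> {Terms, fun eoi/1};
               (close) -> ok
            end,
    file_sorter:sort(Input, output([]), [{format, term} | Options]).

%% Second (and last) input function: no more input.
eoi(read) -> end_of_input;
eoi(close) -> ok.

%% Output function that collects the sorted chunks and returns them
%% as one list when called with close.
output(Acc) ->
    fun(close) -> lists:append(lists:reverse(Acc));
       (Terms) -> output([Terms | Acc])
    end.
```

       For example, sort_list([3,1,2], [{order, descending}]) sorts in descending order, and
       sort_list(L, [{order, fun(A, B) -> A >= B end}]) does the same with an ordering function,
       at the cost of a slower sort.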

       As an alternative to sorting files, a function of one argument can be given as input. When
       called with the argument read, the function is assumed to return end_of_input or
       {end_of_input, Value} when there is no more input (Value is explained below), or
       {Objects, Fun}, where Objects is a list of binaries or terms depending on the format and
       Fun is a new input function. Any other value is immediately returned as the value of the
       current call to sort or keysort. Each input function is called exactly once. Should an
       error occur, the last function is called with the argument close, the reply of which is
       ignored.
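
       The chain of input functions described above can be sketched with a list that is handed
       out in chunks of N terms. This is an illustration only: the module fs_input_demo and its
       functions are hypothetical, and the term format is assumed so that Objects are plain
       terms.

```erlang
-module(fs_input_demo).
-export([chunked_input/2, sort_chunked/2]).

%% Return an input function that delivers Terms in chunks of at most N.
%% Every call with read returns a fresh input function for the rest.
chunked_input([], _N) ->
    fun(read) -> end_of_input;
       (close) -> ok
    end;
chunked_input(Terms, N) when is_integer(N), N > 0 ->
    fun(read) ->
            {Chunk, Rest} = take(N, Terms, []),
            {Chunk, chunked_input(Rest, N)};
       (close) ->
            ok
    end.

%% Split off up to N elements from the front of a list.
take(0, Rest, Acc) -> {lists:reverse(Acc), Rest};
take(_N, [], Acc) -> {lists:reverse(Acc), []};
take(N, [T | Ts], Acc) -> take(N - 1, Ts, [T | Acc]).

%% Sort using the chunked input and a list-collecting output function.
sort_chunked(Terms, N) ->
    file_sorter:sort(chunked_input(Terms, N), output([]), {format, term}).

output(Acc) ->
    fun(close) -> lists:append(lists:reverse(Acc));
       (Terms) -> output([Terms | Acc])
    end.
```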

       A function of one argument can be given as output. The results of sorting or merging the
       input are collected in a non-empty sequence of variable-length lists of binaries or terms
       depending on the format. The output function is called with one list at a time, and is
       assumed to return a new output function. Any other return value is immediately returned as
       value of the current call to the sort or merge function. Each output  function  is  called
       exactly once. When some output function has been applied to all of the results or an error
       occurs, the last function is called with the argument close, and the reply is returned  as
       value  of  the current call to the sort or merge function. If a function is given as input
       and the last input function returns {end_of_input, Value}, the function  given  as  output
       will  be  called  with  the  argument  {value,  Value}. This makes it easy to initiate the
       sequence of output functions with a value calculated by the input functions.
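
       The {end_of_input, Value} and {value, Value} hand-over can be sketched as follows. The
       input functions count the terms they deliver, and the output functions return that count
       together with the sorted list. The module fs_value_demo and the function
       sort_with_count/1 are hypothetical, and the term format is assumed.

```erlang
-module(fs_value_demo).
-export([sort_with_count/1]).

sort_with_count(Terms) ->
    file_sorter:sort(input(Terms, 0), output(undefined, []), {format, term}).

%% Deliver one term per call and count the terms; the final count is
%% passed on as Value in {end_of_input, Value}.
input([], Count) ->
    fun(read) -> {end_of_input, Count};
       (close) -> ok
    end;
input([T | Ts], Count) ->
    fun(read) -> {[T], input(Ts, Count + 1)};
       (close) -> ok
    end.

%% Accept the value computed by the input functions, collect the sorted
%% chunks, and return both when called with close.
output(Count, Acc) ->
    fun({value, Value}) -> output(Value, Acc);
       (close) -> {Count, lists:append(lists:reverse(Acc))};
       (Terms) when is_list(Terms) -> output(Count, [Terms | Acc])
    end.
```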

       As an example, consider sorting the terms on a disk log file. A function that reads chunks
       from the disk log and returns a list of terms is used as input (disk_log:chunk/2 returns
       terms, matching the term format used here). The results are collected in a list of terms.

       sort(Log) ->
           {ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
           Input = input(Log, start),
           Output = output([]),
           Reply = file_sorter:sort(Input, Output, {format,term}),
           ok = disk_log:close(Log),
           Reply.

       input(Log, Cont) ->
           fun(close) ->
                   ok;
              (read) ->
                   case disk_log:chunk(Log, Cont) of
                       {error, Reason} ->
                           {error, Reason};
                       {Cont2, Terms} ->
                           {Terms, input(Log, Cont2)};
                       {Cont2, Terms, _Badbytes} ->
                           {Terms, input(Log, Cont2)};
                       eof ->
                           end_of_input
                   end
           end.

       output(L) ->
           fun(close) ->
                   lists:append(lists:reverse(L));
              (Terms) ->
                   output([Terms | L])
           end.

       Further examples of functions as input  and  output  can  be  found  at  the  end  of  the
       file_sorter module; the term format is implemented with functions.

       The possible values of Reason returned when an error occurs are:

         * bad_object,  {bad_object,  FileName}.  Applying  the  format  function failed for some
           binary, or the key(s) could not be extracted from some term.

         * {bad_term, FileName}. io:read/2 failed to read some term.

         * {file_error,  FileName,  file:posix()}.  See  file(3erl)   for   an   explanation   of
           file:posix().

         * {premature_eof, FileName}. End-of-file was encountered inside some binary term.
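
       One possible way of handling these replies is sketched below. The module fs_error_demo,
       the function classify/1, and the result atoms it returns are hypothetical; only the
       shapes of the error tuples come from the list above.

```erlang
-module(fs_error_demo).
-export([classify/1]).

%% Map a reply from file_sorter to a coarse category.
classify(ok) ->
    ok;
classify({error, bad_object}) ->
    {bad_data, unknown_file};
classify({error, {bad_object, File}}) ->
    {bad_data, File};
classify({error, {bad_term, File}}) ->
    {bad_data, File};
classify({error, {premature_eof, File}}) ->
    {truncated, File};
classify({error, {file_error, File, Reason}}) ->
    %% file:format_error/1 turns the reason into a readable string.
    {io_problem, File, lists:flatten(file:format_error(Reason))};
classify(Other) ->
    %% Anything else was returned by a user-supplied input or output function.
    {user_reply, Other}.
```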

DATA TYPES

       file_name() = file:name()

       file_names() = [file:name()]

       i_command() = read | close

       i_reply() = end_of_input
                 | {end_of_input, value()}
                 | {[object()], infun()}
                 | input_reply()

       infun() = fun((i_command()) -> i_reply())

       input() = file_names() | infun()

       input_reply() = term()

       o_command() = {value, value()} | [object()] | close

       o_reply() = outfun() | output_reply()

       object() = term() | binary()

       outfun() = fun((o_command()) -> o_reply())

       output() = file_name() | outfun()

       output_reply() = term()

       value() = term()

       options() = [option()] | option()

       option() = {compressed, boolean()}
                | {header, header_length()}
                | {format, format()}
                | {no_files, no_files()}
                | {order, order()}
                | {size, size()}
                | {tmpdir, tmp_directory()}
                | {unique, boolean()}

       format() = binary_term | term | binary | format_fun()

       format_fun() = fun((binary()) -> term())

       header_length() = integer() >= 1

       key_pos() = integer() >= 1 | [integer() >= 1]

       no_files() = integer() >= 1

       order() = ascending | descending | order_fun()

       order_fun() = fun((term(), term()) -> boolean())

       size() = integer() >= 0

       tmp_directory() = [] | file:name()

       reason() = bad_object
                | {bad_object, file_name()}
                | {bad_term, file_name()}
                | {file_error,
                   file_name(),
                   file:posix() | badarg | system_limit}
                | {premature_eof, file_name()}

EXPORTS

       sort(FileName) -> Reply

              Types:

                 FileName = file_name()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts terms on files. sort(FileName) is equivalent to sort([FileName], FileName).

       sort(Input, Output) -> Reply

       sort(Input, Output, Options) -> Reply

              Types:

                 Input = input()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts terms on files. sort(Input, Output) is equivalent to sort(Input, Output, []).
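
              As an illustration, several text files of Erlang terms can be sorted into one
              output file using the term format. The module fs_sort_demo and the function
              sort_term_files/2 are hypothetical; the sketch assumes that each input file
              contains terms terminated by a full stop, as read by io:read/2.

```erlang
-module(fs_sort_demo).
-export([sort_term_files/2]).

%% Sort the terms found in InFiles into OutFile and read the result back.
sort_term_files(InFiles, OutFile) ->
    case file_sorter:sort(InFiles, OutFile, [{format, term}]) of
        ok ->
            %% The output is written as text terms, so file:consult/1
            %% can read it.
            file:consult(OutFile);
        {error, _Reason} = Error ->
            Error
    end.
```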

       keysort(KeyPos, FileName) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileName = file_name()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts tuples on files. keysort(N, FileName) is equivalent to keysort(N, [FileName],
              FileName).

       keysort(KeyPos, Input, Output) -> Reply

       keysort(KeyPos, Input, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 Input = input()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts tuples on files. The sort is performed on the element(s) mentioned in KeyPos.
              If two tuples compare equal (==) on one element, the next element according to KeyPos
              is compared. The sort is stable.

              keysort(N, Input, Output) is equivalent to keysort(N, Input, Output, []).
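
              A small sketch of key sorting, using function input and output with the term
              format so that no data files are needed. The module fs_keysort_demo and the
              function keysort_list/2 are hypothetical names.

```erlang
-module(fs_keysort_demo).
-export([keysort_list/2]).

%% Sort a list of tuples on KeyPos, which is a position or a list of
%% positions, and return the sorted list.
keysort_list(KeyPos, Tuples) ->
    Input = fun(read) -> {Tuples, fun eoi/1};
               (close) -> ok
            end,
    file_sorter:keysort(KeyPos, Input, output([]), {format, term}).

eoi(read) -> end_of_input;
eoi(close) -> ok.

output(Acc) ->
    fun(close) -> lists:append(lists:reverse(Acc));
       (Tuples) -> output([Tuples | Acc])
    end.
```

              With KeyPos = 2, tuples that have equal second elements keep their input order
              because the sort is stable. With KeyPos = [1,2], the second element is compared
              only when the first elements compare equal.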

       merge(FileNames, Output) -> Reply

       merge(FileNames, Output, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges terms on files. Each input file is assumed to be sorted.

              merge(FileNames, Output) is equivalent to merge(FileNames, Output, []).
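
              A sketch of merging already sorted text files of terms into a list by giving a
              function as output. The module fs_merge_demo and the function merge_term_files/1
              are hypothetical; the term format is assumed.

```erlang
-module(fs_merge_demo).
-export([merge_term_files/1]).

%% Merge sorted files of text terms and return the merged list.
merge_term_files(FileNames) ->
    file_sorter:merge(FileNames, output([]), {format, term}).

%% Collect the merged chunks; return them as one list on close.
output(Acc) ->
    fun(close) -> lists:append(lists:reverse(Acc));
       (Terms) -> output([Terms | Acc])
    end.
```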

       keymerge(KeyPos, FileNames, Output) -> Reply

       keymerge(KeyPos, FileNames, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges tuples on files. Each input file is assumed to be sorted on key(s).

              keymerge(KeyPos, FileNames, Output) is equivalent  to  keymerge(KeyPos,  FileNames,
              Output, []).

       check(FileName) -> Reply

       check(FileNames, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks  files  for  sortedness.  If  a  file  is not sorted, the first out-of-order
              element is returned. The first term on a file has position 1.

              check(FileName) is equivalent to check([FileName], []).
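
              The reply of check can be interpreted as in the following sketch. The module
              fs_check_demo and the function first_unsorted/1 are hypothetical; the file is
              assumed to contain text terms, hence the term format.

```erlang
-module(fs_check_demo).
-export([first_unsorted/1]).

%% Report whether a file of text terms is sorted, and if not, the
%% position and value of the first out-of-order term.
first_unsorted(FileName) ->
    case file_sorter:check([FileName], {format, term}) of
        {ok, []} ->
            sorted;
        {ok, [{_File, Position, Term}]} ->
            {unsorted, Position, Term};
        {error, Reason} ->
            {error, Reason}
    end.
```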

       keycheck(KeyPos, FileName) -> Reply

       keycheck(KeyPos, FileNames, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks files for sortedness. If a  file  is  not  sorted,  the  first  out-of-order
              element is returned. The first term on a file has position 1.

              keycheck(KeyPos, FileName) is equivalent to keycheck(KeyPos, [FileName], []).
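
              As a sketch, keycheck can be used to verify the result of a keymerge. The module
              fs_keycheck_demo and the function keymerge_and_check/3 are hypothetical; the
              files are assumed to contain text tuples, each input file already sorted on
              KeyPos.

```erlang
-module(fs_keycheck_demo).
-export([keymerge_and_check/3]).

%% Merge files sorted on KeyPos into OutFile, then check that OutFile
%% is sorted on the same key(s).
keymerge_and_check(KeyPos, InFiles, OutFile) ->
    case file_sorter:keymerge(KeyPos, InFiles, OutFile, {format, term}) of
        ok ->
            file_sorter:keycheck(KeyPos, [OutFile], {format, term});
        Error ->
            Error
    end.
```

              An empty result list means that the merged file is sorted on the given key(s).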