Ubuntu Manpage: file_sorter

Provided by: erlang-manpages_16.b.3-dfsg-1ubuntu2.2_all

NAME

       file_sorter - File Sorter

DESCRIPTION

       The  functions  of  this  module  sort  terms  on  files, merge already sorted files, and check files for
       sortedness. Chunks containing binary terms are read from a sequence of files, sorted internally in memory
       and written to temporary files, which are merged to produce one sorted output file. Merging is provided
       as an optimization: it is faster when the files are already sorted, but sorting instead of merging
       always works.

       On  a  file,  a  term  is represented by a header and a binary. Two options define the format of terms on
       files:

         * {header, HeaderLength}. HeaderLength determines  the  number  of  bytes  preceding  each  binary  and
           containing  the length of the binary in bytes. Default is 4. The order of the header bytes is defined
           as follows: if B is a binary containing a header only, the size Size of the binary is  calculated  as
           <<Size:HeaderLength/unit:8>> = B.

         * {format,  Format}.  The format determines the function that is applied to binaries in order to create
           the terms that will be sorted.  The  default  value  is  binary_term,  which  is  equivalent  to  fun
           binary_to_term/1.  The  value  binary is equivalent to fun(X) -> X end, which means that the binaries
           will be sorted as they are. This is the fastest format. If Format is term,  io:read/2  is  called  to
           read  terms.  In  that case only the default value of the header option is allowed. The format option
           also determines what is written to the sorted output file: if Format  is  term  then  io:format/3  is
           called to write each term, otherwise the binary prefixed by a header is written. Note that the binary
           written is the same binary that was read; the results of applying the Format function are thrown away
           as  soon  as  the  terms have been sorted. Reading and writing terms using the io module is very much
           slower than reading and writing binaries.
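
       As a minimal sketch of the default file format described above (a 4-byte header followed by a binary
       in the external term format), the following module writes three terms to a file, sorts the file in
       place and decodes the result. The module name, the helpers encode/1 and decode/1, and the scratch file
       name are hypothetical and not part of file_sorter.

```erlang
-module(fs_format_demo).
-export([run/0]).

%% Encode one term as a 4-byte length header followed by the term in
%% the external term format (the defaults of the header and format options).
encode(Term) ->
    Bin = term_to_binary(Term),
    <<(byte_size(Bin)):4/unit:8, Bin/binary>>.

%% Decode the contents of a file written with encode/1 into a list of terms.
decode(<<Size:4/unit:8, Bin:Size/binary, Rest/binary>>) ->
    [binary_to_term(Bin) | decode(Rest)];
decode(<<>>) ->
    [].

run() ->
    File = "fs_format_demo.tmp",          % hypothetical scratch file
    ok = file:write_file(File, [encode(T) || T <- [c, a, b]]),
    ok = file_sorter:sort(File),          % same as sort([File], File)
    {ok, Sorted} = file:read_file(File),
    ok = file:delete(File),
    decode(Sorted).
```

       Under these assumptions, fs_format_demo:run() is expected to return [a,b,c].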

       Other options are:

         * {order, Order}. The default is to sort terms in ascending order, but that can be changed by the value
           descending or by giving an ordering function Fun. An ordering function is  antisymmetric,  transitive
           and  total.  Fun(A,  B)  should  return true if A comes before B in the ordering, false otherwise. An
           example of a typical ordering function is less than or equal to, =</2.  Using  an  ordering  function
           will  slow  down  the  sort  considerably. The keysort, keymerge and keycheck functions do not accept
           ordering functions.

         * {unique, boolean()}. When sorting or merging files, only the  first  of  a  sequence  of  terms  that
           compare  equal (==) is output if this option is set to true. The default value is false which implies
           that all terms that compare equal are output. When checking files for sortedness,  a  check  that  no
           pair of consecutive terms compares equal is done if this option is set to true.

         * {tmpdir,  TempDirectory}.  The  directory where temporary files are put can be chosen explicitly. The
           default, implied by the value "", is to put temporary files in the same directory as the sorted
           output file. If output is a function (see below), the directory returned by file:get_cwd() is used
           instead. The names of temporary files are derived from the  Erlang  nodename  (node()),  the  process
           identifier  of  the  current Erlang emulator (os:getpid()), and a timestamp (erlang:now()); a typical
           name would be fs_mynode@myhost_1763_1043_337000_266005.17, where 17 is a  sequence  number.  Existing
           files will be overwritten. Temporary files are deleted unless some uncaught EXIT signal occurs.

         * {compressed,  boolean()}.  Temporary  files  and the output file may be compressed. The default value
           false implies that written files are not compressed.  Regardless  of  the  value  of  the  compressed
           option,  compressed  files  can  always  be  read.  Note that reading and writing compressed files is
           significantly slower than reading and writing uncompressed files.

         * {size, Size}. By default approximately 512*1024 bytes read from files  are  sorted  internally.  This
           option should rarely be needed.

         * {no_files, NoFiles}. By default 16 files are merged at a time. This option should rarely be needed.
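
       The order and unique options can be combined, as in the sketch below. It uses functions as input and
       output (explained in the following paragraphs) together with {format, term} so that no files need to
       be prepared. The module name and the helpers input/1 and output/1 are hypothetical.

```erlang
-module(fs_options_demo).
-export([run/0]).

%% Input function that delivers all terms in one chunk and then
%% signals that there is no more input.
input(Terms) ->
    fun(close) -> ok;
       (read)  -> {Terms, fun(close) -> ok; (read) -> end_of_input end}
    end.

%% Output function that collects the sorted lists; the reply to close
%% becomes the return value of file_sorter:sort/3.
output(Acc) ->
    fun(close) -> lists:append(lists:reverse(Acc));
       (Terms) -> output([Terms | Acc])
    end.

run() ->
    Options = [{format, term}, {order, descending}, {unique, true}],
    file_sorter:sort(input([2, 1, 3, 2, 1]), output([]), Options).
```

       With duplicates removed and descending order requested, fs_options_demo:run() is expected to return
       [3,2,1].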

       As an alternative to sorting files, a function of one argument can be given as input. When called with
       the argument read, the function is assumed to return end_of_input or {end_of_input, Value} when there is
       no more input (Value is explained below), or {Objects, Fun}, where Objects is a list of binaries or terms
       depending on the format and Fun is a new input function. Any other value is immediately returned as the
       value of the current call to sort or keysort. Each input function is called exactly once. Should an
       error occur, the last function is called with the argument close, the reply of which is ignored.

       A function of one argument can be given as output. The results of sorting or merging the input are
       collected in a non-empty sequence of variable-length lists of binaries or terms, depending on the format.
       The output function is called with one list at a time, and is assumed to return a  new  output  function.
       Any  other  return  value  is  immediately  returned  as  value  of the current call to the sort or merge
       function. Each output function is called exactly once. When some output function has been applied to  all
       of  the results or an error occurs, the last function is called with the argument close, and the reply is
       returned as value of the current call to the sort or merge function. If a function is given as input  and
       the  last  input function returns {end_of_input, Value}, the function given as output will be called with
       the argument {value, Value}. This makes it easy to initiate the sequence of output functions with a value
       calculated by the input functions.
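
       The interplay of {end_of_input, Value} and {value, Value} can be sketched as follows. The input
       functions count the terms they deliver and pass the total to the output functions, which return it
       together with the sorted terms. The module name and all helper functions are hypothetical, and the
       output function accepts {value, Value} at any point so that it does not depend on the exact call order.

```erlang
-module(fs_value_demo).
-export([run/0]).

%% Each input function delivers one chunk and keeps a running count.
%% The last one returns the count as {end_of_input, Count}.
input([], Count) ->
    fun(close) -> ok;
       (read)  -> {end_of_input, Count}
    end;
input([Chunk | Chunks], Count) ->
    fun(close) -> ok;
       (read)  -> {Chunk, input(Chunks, Count + length(Chunk))}
    end.

%% Output function that remembers the value passed on by the input
%% functions and collects the sorted lists.
output(Count, Acc) ->
    fun(close) ->
            {Count, lists:append(lists:reverse(Acc))};
       ({value, Value}) ->
            output(Value, Acc);
       (Terms) when is_list(Terms) ->
            output(Count, [Terms | Acc])
    end.

run() ->
    Input = input([[b, d], [a, c]], 0),
    file_sorter:sort(Input, output(undefined, []), {format, term}).
```

       Under these assumptions, fs_value_demo:run() is expected to return {4,[a,b,c,d]}.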

       As an example, consider sorting the terms on a disk log file. A function that reads chunks from the  disk
       log and returns a list of binaries is used as input. The results are collected in a list of terms.

       sort(Log) ->
           {ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
           Input = input(Log, start),
           Output = output([]),
           Reply = file_sorter:sort(Input, Output, {format,term}),
           ok = disk_log:close(Log),
           Reply.

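       %% Returns an input function that reads the next chunk of the disk log,
       %% starting at the continuation Cont.
       input(Log, Cont) ->
           fun(close) ->
                   ok;
              (read) ->
                   case disk_log:chunk(Log, Cont) of
                       {error, Reason} ->
                           {error, Reason};
                       {Cont2, Terms} ->
                           {Terms, input(Log, Cont2)};
                       {Cont2, Terms, _Badbytes} ->
                           %% _Badbytes is ignored; the terms that could be read are passed on.
                           {Terms, input(Log, Cont2)};
                       eof ->
                           end_of_input
                   end
           end.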

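       %% Returns an output function that accumulates the sorted lists in reverse
       %% order; close flattens them into one list, which becomes the reply.
       output(L) ->
           fun(close) ->
                   lists:append(lists:reverse(L));
              (Terms) ->
                   output([Terms | L])
           end.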

       Further  examples of functions as input and output can be found at the end of the file_sorter module; the
       term format is implemented with functions.

       The possible values of Reason returned when an error occurs are:

         * bad_object, {bad_object, FileName}. Applying the format function  failed  for  some  binary,  or  the
           key(s) could not be extracted from some term.

         * {bad_term, FileName}. io:read/2 failed to read some term.

         * {file_error, FileName, file:posix()}. See file(3erl) for an explanation of file:posix().

         * {premature_eof, FileName}. End-of-file was encountered inside some binary term.
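
       A caller will typically match on these reasons. The sketch below turns a reply into a readable string;
       the module name, the function describe/1, the helper fmt/2 and the message texts are hypothetical, and
       file names are assumed to be plain strings or binaries.

```erlang
-module(fs_error_demo).
-export([describe/1]).

%% Turn a reply from a file_sorter function into a readable string.
describe(ok) ->
    "ok";
describe({error, bad_object}) ->
    "bad object in input";
describe({error, {bad_object, File}}) ->
    fmt("bad object in ~ts", [File]);
describe({error, {bad_term, File}}) ->
    fmt("unreadable term in ~ts", [File]);
describe({error, {premature_eof, File}}) ->
    fmt("unexpected end of file in ~ts", [File]);
describe({error, {file_error, File, Reason}}) ->
    %% file:format_error/1 gives a description of the file error.
    fmt("~ts: ~ts", [File, file:format_error(Reason)]);
describe(Other) ->
    %% Replies produced by input or output functions end up here.
    fmt("other reply: ~p", [Other]).

fmt(Format, Args) ->
    lists:flatten(io_lib:format(Format, Args)).
```

       For example, fs_error_demo:describe(file_sorter:sort("data.bin")) describes the outcome of sorting a
       file named data.bin, a name chosen here only for illustration.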

DATA TYPES

       file_name() = file:name()

       file_names() = [file:name()]

       i_command() = read | close

       i_reply() = end_of_input
                 | {end_of_input, value()}
                 | {[object()], infun()}
                 | input_reply()

       infun() = fun((i_command()) -> i_reply())

       input() = file_names() | infun()

       input_reply() = term()

       o_command() = {value, value()} | [object()] | close

       o_reply() = outfun() | output_reply()

       object() = term() | binary()

       outfun() = fun((o_command()) -> o_reply())

       output() = file_name() | outfun()

       output_reply() = term()

       value() = term()

       options() = [option()] | option()

       option() = {compressed, boolean()}
                | {header, header_length()}
                | {format, format()}
                | {no_files, no_files()}
                | {order, order()}
                | {size, size()}
                | {tmpdir, tmp_directory()}
                | {unique, boolean()}

       format() = binary_term | term | binary | format_fun()

       format_fun() = fun((binary()) -> term())

       header_length() = integer() >= 1

       key_pos() = integer() >= 1 | [integer() >= 1]

       no_files() = integer() >= 1

       order() = ascending | descending | order_fun()

       order_fun() = fun((term(), term()) -> boolean())

       size() = integer() >= 0

       tmp_directory() = [] | file:name()

       reason() = bad_object
                | {bad_object, file_name()}
                | {bad_term, file_name()}
                | {file_error,
                   file_name(),
                   file:posix() | badarg | system_limit}
                | {premature_eof, file_name()}

EXPORTS

       sort(FileName) -> Reply

              Types:

                 FileName = file_name()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts terms on files. sort(FileName) is equivalent to sort([FileName], FileName).

       sort(Input, Output) -> Reply

       sort(Input, Output, Options) -> Reply

              Types:

                 Input = input()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts terms on files. sort(Input, Output) is equivalent to sort(Input, Output, []).

       keysort(KeyPos, FileName) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileName = file_name()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts tuples on files. keysort(N, FileName) is equivalent to keysort(N, [FileName], FileName).

       keysort(KeyPos, Input, Output) -> Reply

       keysort(KeyPos, Input, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 Input = input()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts tuples on files. The sort is performed on the element(s) mentioned in KeyPos. If two tuples
              compare equal (==) on one element, the next element according to KeyPos is compared. The sort is
              stable.

              keysort(N, Input, Output) is equivalent to keysort(N, Input, Output, []).
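
              As a minimal sketch, the module below sorts tuples on their second element, using functions as
              input and output together with {format, term}. The module name and the helpers input/1 and
              output/1 are hypothetical.

```erlang
-module(fs_keysort_demo).
-export([run/0]).

%% Input function that delivers all tuples in one chunk.
input(Tuples) ->
    fun(close) -> ok;
       (read)  -> {Tuples, fun(close) -> ok; (read) -> end_of_input end}
    end.

%% Output function that collects the sorted lists of tuples.
output(Acc) ->
    fun(close)  -> lists:append(lists:reverse(Acc));
       (Tuples) -> output([Tuples | Acc])
    end.

run() ->
    Tuples = [{b, 2}, {a, 1}, {c, 2}, {d, 1}],
    %% Sort on element 2 only; tuples with equal keys keep their input order.
    file_sorter:keysort(2, input(Tuples), output([]), {format, term}).
```

              Because the sort is stable, fs_keysort_demo:run() is expected to return
              [{a,1},{d,1},{b,2},{c,2}].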

       merge(FileNames, Output) -> Reply

       merge(FileNames, Output, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges terms on files. Each input file is assumed to be sorted.

              merge(FileNames, Output) is equivalent to merge(FileNames, Output, []).
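
              The following sketch merges two already sorted files of terms in text form, using
              {format, term} so that the files can be written and read back with ordinary file functions.
              The module name and the scratch file names are hypothetical.

```erlang
-module(fs_merge_demo).
-export([run/0]).

run() ->
    A = "fs_merge_a.tmp",                 % hypothetical scratch files
    B = "fs_merge_b.tmp",
    Out = "fs_merge_out.tmp",
    %% Each input file holds sorted terms, each terminated by a full stop.
    ok = file:write_file(A, "1.\n4.\n"),
    ok = file:write_file(B, "2.\n3.\n"),
    ok = file_sorter:merge([A, B], Out, {format, term}),
    %% The output file is written in text form, so file:consult/1 can read it.
    {ok, Merged} = file:consult(Out),
    lists:foreach(fun(F) -> ok = file:delete(F) end, [A, B, Out]),
    Merged.
```

              Under these assumptions, fs_merge_demo:run() is expected to return [1,2,3,4].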

       keymerge(KeyPos, FileNames, Output) -> Reply

       keymerge(KeyPos, FileNames, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges tuples on files. Each input file is assumed to be sorted on key(s).

              keymerge(KeyPos, FileNames, Output) is equivalent to keymerge(KeyPos, FileNames, Output, []).

       check(FileName) -> Reply

       check(FileNames, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks  files for sortedness. If a file is not sorted, the first out-of-order element is returned.
              The first term on a file has position 1.

              check(FileName) is equivalent to check([FileName], []).
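
              A minimal sketch of checking a file that is not sorted is given below. The file holds the
              terms 1, 3 and 2 in text form, so the term at position 3 is out of order. The module name and
              the scratch file name are hypothetical.

```erlang
-module(fs_check_demo).
-export([run/0]).

run() ->
    File = "fs_check_demo.tmp",           % hypothetical scratch file
    %% The third term, 2, is smaller than its predecessor, 3.
    ok = file:write_file(File, "1.\n3.\n2.\n"),
    Reply = file_sorter:check([File], {format, term}),
    ok = file:delete(File),
    Reply.
```

              The reply is expected to have the form {ok, [{FileName, 3, 2}]}; for a sorted file the list of
              results would be empty.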

       keycheck(KeyPos, FileName) -> Reply

       keycheck(KeyPos, FileNames, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks files for sortedness. If a file is not sorted, the first out-of-order element is  returned.
              The first term on a file has position 1.

              keycheck(KeyPos, FileName) is equivalent to keycheck(KeyPos, [FileName], []).

Ericsson AB                                       stdlib 1.19.4                                file_sorter(3erl)