Ubuntu Manpage: file_sorter

Provided by: erlang-manpages_18.3-dfsg-1ubuntu3.1_all

NAME

       file_sorter - File Sorter

DESCRIPTION

       The  functions  of  this  module  sort  terms  on  files, merge already sorted files, and check files for
       sortedness. Chunks containing binary terms are read from a sequence of files, sorted internally in memory
       and written on temporary files, which are merged producing one sorted file as output. Merging is provided
       as an optimization; it is faster when the files are already sorted, but it always works to  sort  instead
       of merge.

       On  a  file,  a  term  is represented by a header and a binary. Two options define the format of terms on
       files:

         * {header, HeaderLength}. HeaderLength determines  the  number  of  bytes  preceding  each  binary  and
           containing  the length of the binary in bytes. Default is 4. The order of the header bytes is defined
           as follows: if B is a binary containing a header only, the size Size of the binary is  calculated  as
           <<Size:HeaderLength/unit:8>> = B.

         * {format,  Format}.  The format determines the function that is applied to binaries in order to create
           the terms that will be sorted.  The  default  value  is  binary_term,  which  is  equivalent  to  fun
           binary_to_term/1.  The  value  binary is equivalent to fun(X) -> X end, which means that the binaries
           will be sorted as they are. This is the fastest format. If Format is term,  io:read/2  is  called  to
           read  terms.  In  that case only the default value of the header option is allowed. The format option
           also determines what is written to the sorted output file: if Format  is  term  then  io:format/3  is
           called to write each term, otherwise the binary prefixed by a header is written. Note that the binary
           written is the same binary that was read; the results of applying the Format function are thrown away
           as soon as the terms have been sorted. Reading and writing terms using the io module is much
           slower than reading and writing binaries.
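
       The on-disk layout described by the header and format options can be sketched as follows. This is a
       minimal illustration, assuming the default options ({header, 4} and {format, binary_term}); the module
       name fs_format_demo, the helper functions and the file name are hypothetical and not part of
       file_sorter.

```erlang
-module(fs_format_demo).
-export([write_terms/2, read_terms/1, demo/0]).

%% Write each term as a 4-byte length header followed by the
%% external term format binary (assumed default layout).
write_terms(File, Terms) ->
    Data = [begin
                Bin = term_to_binary(T),
                [<<(byte_size(Bin)):4/unit:8>>, Bin]
            end || T <- Terms],
    file:write_file(File, Data).

%% Read the terms back using the same header layout.
read_terms(File) ->
    {ok, Bin} = file:read_file(File),
    decode(Bin).

decode(<<Size:4/unit:8, B:Size/binary, Rest/binary>>) ->
    [binary_to_term(B) | decode(Rest)];
decode(<<>>) ->
    [].

%% Sort a small file in place and return its contents.
demo() ->
    File = "fs_format_demo.bin",   % hypothetical file name
    ok = write_terms(File, [3, 1, 2]),
    ok = file_sorter:sort(File),
    Sorted = read_terms(File),
    ok = file:delete(File),
    Sorted.
```

       The header is written with the same bit syntax expression that the header option uses to decode it,
       so a different HeaderLength would only change the 4 in both helpers.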

       Other options are:

         * {order, Order}. The default is to sort terms in ascending order, but that can be changed by the value
           descending or by giving an ordering function Fun. An ordering function is  antisymmetric,  transitive
           and  total.  Fun(A,  B)  should  return true if A comes before B in the ordering, false otherwise. An
           example of a typical ordering function is less than or equal to, =</2.  Using  an  ordering  function
           will  slow  down  the  sort  considerably. The keysort, keymerge and keycheck functions do not accept
           ordering functions.

         * {unique, boolean()}. When sorting or merging files, only the  first  of  a  sequence  of  terms  that
           compare  equal (==) is output if this option is set to true. The default value is false which implies
           that all terms that compare equal are output. When checking files for sortedness,  a  check  that  no
           pair of consecutive terms compares equal is done if this option is set to true.

         * {tmpdir, TempDirectory}. The directory where temporary files are put can be chosen explicitly. The
           default, implied by the value "", is to put temporary files in the same directory as the sorted
           output file. If output is a function (see below), the directory returned by file:get_cwd() is used
           instead. The names of temporary files are derived from the  Erlang  nodename  (node()),  the  process
           identifier    of    the    current    Erlang   emulator   (os:getpid()),   and   a   unique   integer
           (erlang:unique_integer([positive])); a typical name would be fs_mynode@myhost_1763_4711.17, where  17
           is  a  sequence  number.  Existing files will be overwritten. Temporary files are deleted unless some
           uncaught EXIT signal occurs.

         * {compressed, boolean()}. Temporary files and the output file may be  compressed.  The  default  value
           false  implies  that  written  files  are  not  compressed. Regardless of the value of the compressed
           option, compressed files can always be read. Note  that  reading  and  writing  compressed  files  is
           significantly slower than reading and writing uncompressed files.

         * {size,  Size}.  By  default  approximately 512*1024 bytes read from files are sorted internally. This
           option should rarely be needed.

         * {no_files, NoFiles}. By default 16 files are merged at a time. This option should rarely be needed.
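
       Several of the options above can be combined in one option list. The following sketch sorts a file in
       descending order and drops duplicates; the module name fs_options_demo, the helpers and the file names
       are hypothetical, and the default header and format options are assumed.

```erlang
-module(fs_options_demo).
-export([demo/0]).

demo() ->
    In = "fs_options_demo.in",     % hypothetical file names
    Out = "fs_options_demo.out",
    ok = write_terms(In, [2, 3, 1, 3, 2]),
    %% Descending order; keep only the first of terms comparing equal.
    Options = [{order, descending}, {unique, true}],
    ok = file_sorter:sort([In], Out, Options),
    Result = read_terms(Out),
    ok = file:delete(In),
    ok = file:delete(Out),
    Result.

%% Helpers for the assumed default layout: 4-byte header + term binary.
write_terms(File, Terms) ->
    Data = [begin
                Bin = term_to_binary(T),
                [<<(byte_size(Bin)):4/unit:8>>, Bin]
            end || T <- Terms],
    file:write_file(File, Data).

read_terms(File) ->
    {ok, Bin} = file:read_file(File),
    decode(Bin).

decode(<<Size:4/unit:8, B:Size/binary, Rest/binary>>) ->
    [binary_to_term(B) | decode(Rest)];
decode(<<>>) ->
    [].
```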

       As an alternative to sorting files, a function of one argument can be given as input.  When  called  with
       the argument read, the function is assumed to return end_of_input or {end_of_input, Value} when there is
       no more input (Value is explained below), or {Objects, Fun}, where Objects is a list of binaries or terms
       depending on the format and Fun is a new input function. Any other value is immediately returned as value
       of the current call to sort or keysort. Each input function will be called exactly once,  and  should  an
       error occur, the last function is called with the argument close, the reply of which is ignored.

       A function of one argument can be given as output. The results of sorting or merging the input are
       collected in a non-empty sequence of variable length lists of binaries or terms depending on the  format.
       The  output  function  is called with one list at a time, and is assumed to return a new output function.
       Any other return value is immediately returned as value  of  the  current  call  to  the  sort  or  merge
       function.  Each output function is called exactly once. When some output function has been applied to all
       of the results or an error occurs, the last function is called with the argument close, and the reply  is
       returned  as value of the current call to the sort or merge function. If a function is given as input and
       the last input function returns {end_of_input, Value}, the function given as output will be  called  with
       the argument {value, Value}. This makes it easy to initiate the sequence of output functions with a value
       calculated by the input functions.

       As an example, consider sorting the terms on a disk log file. A function that reads chunks from the disk
       log and returns a list of terms is used as input. The results are collected in a list of terms.

       sort(Log) ->
           {ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
           Input = input(Log, start),
           Output = output([]),
           Reply = file_sorter:sort(Input, Output, {format,term}),
           ok = disk_log:close(Log),
           Reply.

       input(Log, Cont) ->
           fun(close) ->
                   ok;
              (read) ->
                   case disk_log:chunk(Log, Cont) of
                       {error, Reason} ->
                           {error, Reason};
                       {Cont2, Terms} ->
                           {Terms, input(Log, Cont2)};
                       {Cont2, Terms, _Badbytes} ->
                           {Terms, input(Log, Cont2)};
                       eof ->
                           end_of_input
                   end
           end.

       output(L) ->
           fun(close) ->
                   lists:append(lists:reverse(L));
              (Terms) ->
                   output([Terms | L])
           end.

       Further examples of functions as input and output can be found at the end of the file_sorter module;  the
       term format is implemented with functions.
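
       The {end_of_input, Value} and {value, Value} protocol described earlier can be sketched without any
       files at all. The example below is only an illustration: the module name fs_fun_demo and its functions
       are hypothetical, and it assumes that with the default format (binary_term) the objects passed between
       the functions are binaries.

```erlang
-module(fs_fun_demo).
-export([sort_list/1]).

%% Sort an in-memory list; returns {NumberOfTerms, SortedTerms}.
sort_list(Terms) ->
    Bins = [term_to_binary(T) || T <- Terms],
    file_sorter:sort(input(Bins), output(undefined, [])).

%% The first input function hands over all binaries at once.
input(Bins) ->
    fun(close) -> ok;
       (read) -> {Bins, last_input(length(Bins))}
    end.

%% The last input function passes a value on to the output function.
last_input(Count) ->
    fun(close) -> ok;
       (read) -> {end_of_input, Count}
    end.

%% Each call returns a new output function, except for close,
%% whose reply becomes the return value of file_sorter:sort/2.
output(Count, Acc) ->
    fun(close) ->
            Sorted = lists:append(lists:reverse(Acc)),
            {Count, [binary_to_term(B) || B <- Sorted]};
       ({value, Value}) ->
            output(Value, Acc);
       (Bins) when is_list(Bins) ->
            output(Count, [Bins | Acc])
    end.
```

       Each input and output function is used once and replaced by the function it returns, so all state
       (the count and the accumulated lists) is carried in the closures.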

       The possible values of Reason returned when an error occurs are:

         * bad_object,  {bad_object,  FileName}.  Applying  the  format  function failed for some binary, or the
           key(s) could not be extracted from some term.

         * {bad_term, FileName}. io:read/2 failed to read some term.

         * {file_error, FileName, file:posix()}. See file(3erl) for an explanation of file:posix().

         * {premature_eof, FileName}. End-of-file was encountered inside some binary term.
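
       One way to handle these reasons is to match on them and produce a readable message. The sketch below is
       illustrative only: the module name fs_error_demo, the function names and the message texts are
       hypothetical.

```erlang
-module(fs_error_demo).
-export([describe/1, sort_or_report/2]).

%% Run a sort and turn an error reason into a message string.
sort_or_report(Input, Output) ->
    case file_sorter:sort(Input, Output) of
        {error, Reason} -> {error, describe(Reason)};
        Other -> Other
    end.

describe(bad_object) ->
    "format function or key extraction failed";
describe({bad_object, File}) ->
    fmt("~ts: format function or key extraction failed", [File]);
describe({bad_term, File}) ->
    fmt("~ts: could not read a term", [File]);
describe({file_error, File, Error}) ->
    fmt("~ts: ~ts", [File, file:format_error(Error)]);
describe({premature_eof, File}) ->
    fmt("~ts: end of file inside a binary term", [File]);
describe(Other) ->
    %% Input and output functions may return values of their own.
    fmt("unexpected reason: ~p", [Other]).

fmt(Format, Args) ->
    lists:flatten(io_lib:format(Format, Args)).
```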

DATA TYPES

       file_name() = file:name()

       file_names() = [file:name()]

       i_command() = read | close

       i_reply() =
           end_of_input |
           {end_of_input, value()} |
           {[object()], infun()} |
           input_reply()

       infun() = fun((i_command()) -> i_reply())

       input() = file_names() | infun()

       input_reply() = term()

       o_command() = {value, value()} | [object()] | close

       o_reply() = outfun() | output_reply()

       object() = term() | binary()

       outfun() = fun((o_command()) -> o_reply())

       output() = file_name() | outfun()

       output_reply() = term()

       value() = term()

       options() = [option()] | option()

       option() =
           {compressed, boolean()} |
           {header, header_length()} |
           {format, format()} |
           {no_files, no_files()} |
           {order, order()} |
           {size, size()} |
           {tmpdir, tmp_directory()} |
           {unique, boolean()}

       format() = binary_term | term | binary | format_fun()

       format_fun() = fun((binary()) -> term())

       header_length() = integer() >= 1

       key_pos() = integer() >= 1 | [integer() >= 1]

       no_files() = integer() >= 1

       order() = ascending | descending | order_fun()

       order_fun() = fun((term(), term()) -> boolean())

       size() = integer() >= 0

       tmp_directory() = [] | file:name()

       reason() =
           bad_object |
           {bad_object, file_name()} |
           {bad_term, file_name()} |
           {file_error,
            file_name(),
            file:posix() | badarg | system_limit} |
           {premature_eof, file_name()}

EXPORTS

       sort(FileName) -> Reply

              Types:

                 FileName = file_name()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts terms on files. sort(FileName) is equivalent to sort([FileName], FileName).

       sort(Input, Output) -> Reply

       sort(Input, Output, Options) -> Reply

              Types:

                 Input = input()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts terms on files. sort(Input, Output) is equivalent to sort(Input, Output, []).

       keysort(KeyPos, FileName) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileName = file_name()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts tuples on files. keysort(N, FileName) is equivalent to keysort(N, [FileName], FileName).

       keysort(KeyPos, Input, Output) -> Reply

       keysort(KeyPos, Input, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 Input = input()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts tuples on files. The sort is performed on the element(s) mentioned in KeyPos. If two tuples
              compare equal (==) on one element, the next element according to KeyPos is compared. The sort is
              stable.

              keysort(N, Input, Output) is equivalent to keysort(N, Input, Output, []).
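
              The difference between a single key position and a list of positions can be sketched as
              follows. The module name fs_keysort_demo, the helpers and the file name are hypothetical, and
              the default header and format options are assumed.

```erlang
-module(fs_keysort_demo).
-export([demo/0]).

demo() ->
    File = "fs_keysort_demo.bin",  % hypothetical file name
    Tuples = [{2, b}, {1, z}, {1, a}],
    %% Sort on element 1 only; tuples with equal keys keep their order.
    ok = write_terms(File, Tuples),
    ok = file_sorter:keysort(1, File),
    ByFirst = read_terms(File),
    %% Sort on element 1, then on element 2 for equal first elements.
    ok = write_terms(File, Tuples),
    ok = file_sorter:keysort([1, 2], File),
    ByBoth = read_terms(File),
    ok = file:delete(File),
    {ByFirst, ByBoth}.

%% Helpers for the assumed default layout: 4-byte header + term binary.
write_terms(File, Terms) ->
    Data = [begin
                Bin = term_to_binary(T),
                [<<(byte_size(Bin)):4/unit:8>>, Bin]
            end || T <- Terms],
    file:write_file(File, Data).

read_terms(File) ->
    {ok, Bin} = file:read_file(File),
    decode(Bin).

decode(<<Size:4/unit:8, B:Size/binary, Rest/binary>>) ->
    [binary_to_term(B) | decode(Rest)];
decode(<<>>) ->
    [].
```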

       merge(FileNames, Output) -> Reply

       merge(FileNames, Output, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges terms on files. Each input file is assumed to be sorted.

              merge(FileNames, Output) is equivalent to merge(FileNames, Output, []).
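
              A minimal sketch of merging two already sorted files into a third one follows. The module name
              fs_merge_demo, the helpers and the file names are hypothetical, and the default header and
              format options are assumed.

```erlang
-module(fs_merge_demo).
-export([demo/0]).

demo() ->
    A = "fs_merge_demo_a.bin",     % hypothetical file names
    B = "fs_merge_demo_b.bin",
    Out = "fs_merge_demo_out.bin",
    %% Both inputs must already be sorted for merge to be valid.
    ok = write_terms(A, [1, 3, 5]),
    ok = write_terms(B, [2, 4, 6]),
    ok = file_sorter:merge([A, B], Out),
    Result = read_terms(Out),
    [ok = file:delete(F) || F <- [A, B, Out]],
    Result.

%% Helpers for the assumed default layout: 4-byte header + term binary.
write_terms(File, Terms) ->
    Data = [begin
                Bin = term_to_binary(T),
                [<<(byte_size(Bin)):4/unit:8>>, Bin]
            end || T <- Terms],
    file:write_file(File, Data).

read_terms(File) ->
    {ok, Bin} = file:read_file(File),
    decode(Bin).

decode(<<Size:4/unit:8, B:Size/binary, Rest/binary>>) ->
    [binary_to_term(B) | decode(Rest)];
decode(<<>>) ->
    [].
```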

       keymerge(KeyPos, FileNames, Output) -> Reply

       keymerge(KeyPos, FileNames, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges tuples on files. Each input file is assumed to be sorted on key(s).

              keymerge(KeyPos, FileNames, Output) is equivalent to keymerge(KeyPos, FileNames, Output, []).

       check(FileName) -> Reply

       check(FileNames, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks files for sortedness. If a file is not sorted, the first out-of-order element is  returned.
              The first term on a file has position 1.

              check(FileName) is equivalent to check([FileName], []).
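
              The two kinds of reply can be sketched by checking one sorted and one unsorted file. The module
              name fs_check_demo, the helper and the file names are hypothetical, and the default header and
              format options are assumed.

```erlang
-module(fs_check_demo).
-export([demo/0]).

demo() ->
    Sorted = "fs_check_demo_sorted.bin",     % hypothetical file names
    Unsorted = "fs_check_demo_unsorted.bin",
    ok = write_terms(Sorted, [1, 2, 3]),
    ok = write_terms(Unsorted, [1, 3, 2]),
    %% A sorted file gives an empty result list.
    SortedReply = file_sorter:check(Sorted),
    %% An unsorted file gives one {FileName, TermPosition, Term} tuple.
    UnsortedReply = file_sorter:check([Unsorted], []),
    ok = file:delete(Sorted),
    ok = file:delete(Unsorted),
    {SortedReply, UnsortedReply}.

%% Helper for the assumed default layout: 4-byte header + term binary.
write_terms(File, Terms) ->
    Data = [begin
                Bin = term_to_binary(T),
                [<<(byte_size(Bin)):4/unit:8>>, Bin]
            end || T <- Terms],
    file:write_file(File, Data).
```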

       keycheck(KeyPos, FileName) -> Reply

       keycheck(KeyPos, FileNames, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks  files for sortedness. If a file is not sorted, the first out-of-order element is returned.
              The first term on a file has position 1.

              keycheck(KeyPos, FileName) is equivalent to keycheck(KeyPos, [FileName], []).

Ericsson AB                                        stdlib 2.8                                  file_sorter(3erl)