Ubuntu Manpage: file_sorter

name
description
data types
exports

Provided by: erlang-manpages_16.b.3-dfsg-1ubuntu2.2_all

NAME

       file_sorter - File Sorter

DESCRIPTION

       The  functions  of  this  module  sort  terms  on  files, merge already sorted files, and check files for
       sortedness. Chunks containing binary terms are read from a sequence of files, sorted internally in memory
       and written on temporary files, which are merged producing one sorted file as output. Merging is provided
       as an optimization; it is faster when the files are already sorted, but it always works to  sort  instead
       of merge.

       On  a  file,  a  term  is represented by a header and a binary. Two options define the format of terms on
       files:

         * {header, HeaderLength}. HeaderLength determines  the  number  of  bytes  preceding  each  binary  and
           containing  the length of the binary in bytes. Default is 4. The order of the header bytes is defined
           as follows: if B is a binary containing a header only, the size Size of the binary is  calculated  as
           <<Size:HeaderLength/unit:8>> = B.

         * {format,  Format}.  The format determines the function that is applied to binaries in order to create
           the terms that will be sorted.  The  default  value  is  binary_term,  which  is  equivalent  to  fun
           binary_to_term/1.  The  value  binary is equivalent to fun(X) -> X end, which means that the binaries
           will be sorted as they are. This is the fastest format. If Format is term,  io:read/2  is  called  to
           read  terms.  In  that case only the default value of the header option is allowed. The format option
           also determines what is written to the sorted output file: if Format  is  term  then  io:format/3  is
           called to write each term, otherwise the binary prefixed by a header is written. Note that the binary
           written is the same binary that was read; the results of applying the Format function are thrown away
           as  soon  as  the  terms have been sorted. Reading and writing terms using the io module is very much
           slower than reading and writing binaries.

       Other options are:

         * {order, Order}. The default is to sort terms in ascending order, but that can be changed by the value
           descending  or  by giving an ordering function Fun. An ordering function is antisymmetric, transitive
           and total. Fun(A, B) should return true if A comes before B in  the  ordering,  false  otherwise.  An
           example  of  a  typical  ordering function is less than or equal to, =</2. Using an ordering function
           will slow down the sort considerably. The keysort, keymerge and  keycheck  functions  do  not  accept
           ordering functions.

         * {unique,  boolean()}.  When  sorting  or  merging  files,  only the first of a sequence of terms that
           compare equal (==) is output if this option is set to true. The default value is false which  implies
           that  all  terms  that  compare equal are output. When checking files for sortedness, a check that no
           pair of consecutive terms compares equal is done if this option is set to true.

         * {tmpdir, TempDirectory}. The directory where temporary files are put can be  chosen  explicitly.  The
           default,  implied  by  the  value  "",  is to put temporary files on the same directory as the sorted
           output file. If output is a function (see below), the directory returned by  file:get_cwd()  is  used
           instead.  The  names  of  temporary  files are derived from the Erlang nodename (node()), the process
           identifier of the current Erlang emulator (os:getpid()), and a timestamp  (erlang:now());  a  typical
           name  would  be  fs_mynode@myhost_1763_1043_337000_266005.17, where 17 is a sequence number. Existing
           files will be overwritten. Temporary files are deleted unless some uncaught EXIT signal occurs.

         * {compressed, boolean()}. Temporary files and the output file may be  compressed.  The  default  value
           false  implies  that  written  files  are  not  compressed. Regardless of the value of the compressed
           option, compressed files can always be read. Note  that  reading  and  writing  compressed  files  is
           significantly slower than reading and writing uncompressed files.

         * {size,  Size}.  By  default  approximately 512*1024 bytes read from files are sorted internally. This
           option should rarely be needed.

         * {no_files, NoFiles}. By default 16 files are merged at a time. This option should rarely be needed.

       As an alternative to sorting files, a function of one argument can be given as input.  When  called  with
       the  argument read the function is assumed to return end_of_input or {end_of_input, Value}} when there is
       no more input (Value is explained below), or {Objects, Fun}, where Objects is a list of binaries or terms
       depending on the format and Fun is a new input function. Any other value is immediately returned as value
       of the current call to sort or keysort. Each input function will be called exactly once,  and  should  an
       error occur, the last function is called with the argument close, the reply of which is ignored.

       A  function  of  one  argument  can  be  given  as output. The results of sorting or merging the input is
       collected in a non-empty sequence of variable length lists of binaries or terms depending on the  format.
       The  output  function  is called with one list at a time, and is assumed to return a new output function.
       Any other return value is immediately returned as value  of  the  current  call  to  the  sort  or  merge
       function.  Each output function is called exactly once. When some output function has been applied to all
       of the results or an error occurs, the last function is called with the argument close, and the reply  is
       returned  as value of the current call to the sort or merge function. If a function is given as input and
       the last input function returns {end_of_input, Value}, the function given as output will be  called  with
       the argument {value, Value}. This makes it easy to initiate the sequence of output functions with a value
       calculated by the input functions.

       As an example, consider sorting the terms on a disk log file. A function that reads chunks from the  disk
       log and returns a list of binaries is used as input. The results are collected in a list of terms.

       sort(Log) ->
           {ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
           Input = input(Log, start),
           Output = output([]),
           Reply = file_sorter:sort(Input, Output, {format,term}),
           ok = disk_log:close(Log),
           Reply.

       input(Log, Cont) ->
           fun(close) ->
                   ok;
              (read) ->
                   case disk_log:chunk(Log, Cont) of
                       {error, Reason} ->
                           {error, Reason};
                       {Cont2, Terms} ->
                           {Terms, input(Log, Cont2)};
                       {Cont2, Terms, _Badbytes} ->
                           {Terms, input(Log, Cont2)};
                       eof ->
                           end_of_input
                   end
           end.

       output(L) ->
           fun(close) ->
                   lists:append(lists:reverse(L));
              (Terms) ->
                   output([Terms | L])
           end.

       Further  examples of functions as input and output can be found at the end of the file_sorter module; the
       term format is implemented with functions.

       The possible values of Reason returned when an error occurs are:

         * bad_object, {bad_object, FileName}. Applying the format function  failed  for  some  binary,  or  the
           key(s) could not be extracted from some term.

         * {bad_term, FileName}. io:read/2 failed to read some term.

         * {file_error, FileName, file:posix()}. See file(3erl) for an explanation of file:posix().

         * {premature_eof, FileName}. End-of-file was encountered inside some binary term.

DATA TYPES

       file_name() = file:name()

       file_names() = [file:name()]

       i_command() = read | close

       i_reply() = end_of_input
                 | {end_of_input, value()}
                 | {[object()], infun()}
                 | input_reply()

       infun() = fun((i_command()) -> i_reply())

       input() = file_names() | infun()

       input_reply() = term()

       o_command() = {value, value()} | [object()] | close

       o_reply() = outfun() | output_reply()

       object() = term() | binary()

       outfun() = fun((o_command()) -> o_reply())

       output() = file_name() | outfun()

       output_reply() = term()

       value() = term()

       options() = [option()] | option()

       option() = {compressed, boolean()}
                | {header, header_length()}
                | {format, format()}
                | {no_files, no_files()}
                | {order, order()}
                | {size, size()}
                | {tmpdir, tmp_directory()}
                | {unique, boolean()}

       format() = binary_term | term | binary | format_fun()

       format_fun() = fun((binary()) -> term())

       header_length() = integer() >= 1

       key_pos() = integer() >= 1 | [integer() >= 1]

       no_files() = integer() >= 1

       order() = ascending | descending | order_fun()

       order_fun() = fun((term(), term()) -> boolean())

       size() = integer() >= 0

       tmp_directory() = [] | file:name()

       reason() = bad_object
                | {bad_object, file_name()}
                | {bad_term, file_name()}
                | {file_error,
                   file_name(),
                   file:posix() | badarg | system_limit}
                | {premature_eof, file_name()}

EXPORTS

       sort(FileName) -> Reply

              Types:

                 FileName = file_name()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts terms on files. sort(FileName) is equivalent to sort([FileName], FileName).

       sort(Input, Output) -> Reply

       sort(Input, Output, Options) -> Reply

              Types:

                 Input = input()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts terms on files. sort(Input, Output) is equivalent to sort(Input, Output, []).

       keysort(KeyPos, FileName) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileName = file_name()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts tuples on files. keysort(N, FileName) is equivalent to keysort(N, [FileName], FileName).

       keysort(KeyPos, Input, Output) -> Reply

       keysort(KeyPos, Input, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 Input = input()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts  tuples on files. The sort is performed on the element(s) mentioned in KeyPos. If two tuples
              compare equal (==) on one element, next element according to  KeyPos  is  compared.  The  sort  is
              stable.

              keysort(N, Input, Output) is equivalent to keysort(N, Input, Output, []).

       merge(FileNames, Output) -> Reply

       merge(FileNames, Output, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges terms on files. Each input file is assumed to be sorted.

              merge(FileNames, Output) is equivalent to merge(FileNames, Output, []).

       keymerge(KeyPos, FileNames, Output) -> Reply

       keymerge(KeyPos, FileNames, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges tuples on files. Each input file is assumed to be sorted on key(s).

              keymerge(KeyPos, FileNames, Output) is equivalent to keymerge(KeyPos, FileNames, Output, []).

       check(FileName) -> Reply

       check(FileNames, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks  files for sortedness. If a file is not sorted, the first out-of-order element is returned.
              The first term on a file has position 1.

              check(FileName) is equivalent to check([FileName], []).

       keycheck(KeyPos, FileName) -> Reply

       keycheck(KeyPos, FileNames, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks files for sortedness. If a file is not sorted, the first out-of-order element is  returned.
              The first term on a file has position 1.

              keycheck(KeyPos, FileName) is equivalent to keycheck(KeyPos, [FileName], []).