Ubuntu Manpage: file_sorter

Provided by: erlang-manpages_22.2.7+dfsg-1ubuntu0.2_all

NAME

       file_sorter - File sorter.

DESCRIPTION

       This  module  contains functions for sorting terms on files, merging already sorted files,
       and checking files for sortedness. Chunks containing binary terms are read from a sequence
       of  files,  sorted  internally  in memory and written on temporary files, which are merged
       producing one sorted file as output. Merging is provided as an optimization; it is  faster
       when the files are already sorted, but it always works to sort instead of merge.

       On  a  file, a term is represented by a header and a binary. Two options define the format
       of terms on files:

         {header, HeaderLength}:
           HeaderLength determines the number of bytes preceding each binary and  containing  the
           length of the binary in bytes. Defaults to 4. The order of the header bytes is defined
           as follows: if B is a binary containing a header only, size  Size  of  the  binary  is
           calculated as <<Size:HeaderLength/unit:8>> = B.

         {format, Format}:
           Option  Format determines the function that is applied to binaries to create the terms
           to be sorted. Defaults to binary_term, which is equivalent  to  fun  binary_to_term/1.
           Value  binary  is  equivalent  to  fun(X)  -> X end, which means that the binaries are
           sorted as they are. This is the fastest format. If Format is term, io:read/2 is called
           to read terms. In that case, only the default value of option header is allowed.

           Option  format also determines what is written to the sorted output file: if Format is
           term, then io:format/3 is called to write each term, otherwise the binary prefixed  by
           a  header is written. Notice that the binary written is the same binary that was read;
           the results of applying function Format are thrown  away  when  the  terms  have  been
           sorted.  Reading and writing terms using the io module is much slower than reading and
           writing binaries.

       Other options are:

         {order, Order}:
           The default is to sort terms in ascending order, but that  can  be  changed  by  value
           descending  or  by  specifying  an  ordering  function  Fun.  An  ordering function is
           antisymmetric, transitive, and total. Fun(A, B) is to return true if A comes before  B
           in  the  ordering,  otherwise false. An example of a typical ordering function is less
           than or equal to, =</2. Using an ordering function slows down the  sort  considerably.
           Functions keysort, keymerge and keycheck do not accept ordering functions.

         {unique, boolean()}:
           When  sorting  or  merging  files,  only the first of a sequence of terms that compare
           equal (==) is output if this option is set to true. Defaults to false,  which  implies
           that  all  terms  that compare equal are output. When checking files for sortedness, a
           check that no pair of consecutive terms compares equal is done if this option  is  set
           to true.

         {tmpdir, TempDirectory}:
           The  directory  where  temporary  files are put can be chosen explicitly. The default,
           implied by value "", is to put temporary files on the same  directory  as  the  sorted
           output  file.  If  output  is  a  function  (see  below),  the  directory  returned by
           file:get_cwd() is used instead. The names of temporary  files  are  derived  from  the
           Erlang  nodename  (node()),  the  process  identifier  of  the current Erlang emulator
           (os:getpid()), and a unique  integer  (erlang:unique_integer([positive])).  A  typical
           name  is  fs_mynode@myhost_1763_4711.17, where 17 is a sequence number. Existing files
           are overwritten. Temporary files are deleted unless some uncaught EXIT signal occurs.

         {compressed, boolean()}:
           Temporary files and the output file can be compressed. Defaults false,  which  implies
           that  written  files are not compressed. Regardless of the value of option compressed,
           compressed files can always be read. Notice that reading and writing compressed  files
           are significantly slower than reading and writing uncompressed files.

         {size, Size}:
           By  default about 512*1024 bytes read from files are sorted internally. This option is
           rarely needed.

         {no_files, NoFiles}:
           By default 16 files are merged at a time. This option is rarely needed.

       As an alternative to sorting files, a function of one argument can be specified as  input.
       When called with argument read, the function is assumed to return either of the following:

         * end_of_input or {end_of_input, Value}} when there is no more input (Value is explained
           below).

         * {Objects, Fun}, where Objects is a list of binaries or terms depending on the  format,
           and Fun is a new input function.

       Any  other  value is immediately returned as value of the current call to sort or keysort.
       Each input function is called exactly once. If an  error  occurs,  the  last  function  is
       called with argument close, the reply of which is ignored.

       A  function  of one argument can be specified as output. The results of sorting or merging
       the input is collected in a non-empty sequence of variable length  lists  of  binaries  or
       terms  depending on the format. The output function is called with one list at a time, and
       is assumed to return a new output function. Any other return value is immediately returned
       as value of the current call to the sort or merge function. Each output function is called
       exactly once. When some output function has been applied to all of the results or an error
       occurs,  the  last  function  is  called with argument close, and the reply is returned as
       value of the current call to the sort or merge function.

       If a function is specified as input and the last  input  function  returns  {end_of_input,
       Value},  the  function  specified  as  output is called with argument {value, Value}. This
       makes it easy to initiate the sequence of output functions with a value calculated by  the
       input functions.

       As an example, consider sorting the terms on a disk log file. A function that reads chunks
       from the disk log and returns a list of  binaries  is  used  as  input.  The  results  are
       collected in a list of terms.

       sort(Log) ->
           {ok, _} = disk_log:open([{name,Log}, {mode,read_only}]),
           Input = input(Log, start),
           Output = output([]),
           Reply = file_sorter:sort(Input, Output, {format,term}),
           ok = disk_log:close(Log),
           Reply.

       input(Log, Cont) ->
           fun(close) ->
                   ok;
              (read) ->
                   case disk_log:chunk(Log, Cont) of
                       {error, Reason} ->
                           {error, Reason};
                       {Cont2, Terms} ->
                           {Terms, input(Log, Cont2)};
                       {Cont2, Terms, _Badbytes} ->
                           {Terms, input(Log, Cont2)};
                       eof ->
                           end_of_input
                   end
           end.

       output(L) ->
           fun(close) ->
                   lists:append(lists:reverse(L));
              (Terms) ->
                   output([Terms | L])
           end.

       For more examples of functions as input and output, see the end of the file_sorter module;
       the term format is implemented with functions.

       The possible values of Reason returned when an error occurs are:

         * bad_object, {bad_object, FileName} - Applying the  format  function  failed  for  some
           binary, or the key(s) could not be extracted from some term.

         * {bad_term, FileName} - io:read/2 failed to read some term.

         * {file_error,  FileName,  file:posix()}  -  For  an  explanation  of  file:posix(), see
           file(3erl).

         * {premature_eof, FileName} - End-of-file was encountered inside some binary term.

DATA TYPES

       file_name() = file:name()

       file_names() = [file:name()]

       i_command() = read | close

       i_reply() =
           end_of_input |
           {end_of_input, value()} |
           {[object()], infun()} |
           input_reply()

       infun() = fun((i_command()) -> i_reply())

       input() = file_names() | infun()

       input_reply() = term()

       o_command() = {value, value()} | [object()] | close

       o_reply() = outfun() | output_reply()

       object() = term() | binary()

       outfun() = fun((o_command()) -> o_reply())

       output() = file_name() | outfun()

       output_reply() = term()

       value() = term()

       options() = [option()] | option()

       option() =
           {compressed, boolean()} |
           {header, header_length()} |
           {format, format()} |
           {no_files, no_files()} |
           {order, order()} |
           {size, size()} |
           {tmpdir, tmp_directory()} |
           {unique, boolean()}

       format() = binary_term | term | binary | format_fun()

       format_fun() = fun((binary()) -> term())

       header_length() = integer() >= 1

       key_pos() = integer() >= 1 | [integer() >= 1]

       no_files() = integer() >= 1

       order() = ascending | descending | order_fun()

       order_fun() = fun((term(), term()) -> boolean())

       size() = integer() >= 0

       tmp_directory() = [] | file:name()

       reason() =
           bad_object |
           {bad_object, file_name()} |
           {bad_term, file_name()} |
           {file_error,
            file_name(),
            file:posix() | badarg | system_limit} |
           {premature_eof, file_name()}

EXPORTS

       check(FileName) -> Reply

       check(FileNames, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks files for sortedness. If a  file  is  not  sorted,  the  first  out-of-order
              element is returned. The first term on a file has position 1.

              check(FileName) is equivalent to check([FileName], []).

       keycheck(KeyPos, FileName) -> Reply

       keycheck(KeyPos, FileNames, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Options = options()
                 Reply = {ok, [Result]} | {error, reason()}
                 Result = {FileName, TermPosition, term()}
                 FileName = file_name()
                 TermPosition = integer() >= 1

              Checks  files  for  sortedness.  If  a  file  is not sorted, the first out-of-order
              element is returned. The first term on a file has position 1.

              keycheck(KeyPos, FileName) is equivalent to keycheck(KeyPos, [FileName], []).

       keymerge(KeyPos, FileNames, Output) -> Reply

       keymerge(KeyPos, FileNames, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges tuples on files. Each input file is assumed to be sorted on key(s).

              keymerge(KeyPos, FileNames, Output) is equivalent  to  keymerge(KeyPos,  FileNames,
              Output, []).

       keysort(KeyPos, FileName) -> Reply

              Types:

                 KeyPos = key_pos()
                 FileName = file_name()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts tuples on files.

              keysort(N, FileName) is equivalent to keysort(N, [FileName], FileName).

       keysort(KeyPos, Input, Output) -> Reply

       keysort(KeyPos, Input, Output, Options) -> Reply

              Types:

                 KeyPos = key_pos()
                 Input = input()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts tuples on files. The sort is performed on the element(s) mentioned in KeyPos.
              If two tuples compare equal (==) on one element,  the  next  element  according  to
              KeyPos is compared. The sort is stable.

              keysort(N, Input, Output) is equivalent to keysort(N, Input, Output, []).

       merge(FileNames, Output) -> Reply

       merge(FileNames, Output, Options) -> Reply

              Types:

                 FileNames = file_names()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | output_reply()

              Merges terms on files. Each input file is assumed to be sorted.

              merge(FileNames, Output) is equivalent to merge(FileNames, Output, []).

       sort(FileName) -> Reply

              Types:

                 FileName = file_name()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts terms on files.

              sort(FileName) is equivalent to sort([FileName], FileName).

       sort(Input, Output) -> Reply

       sort(Input, Output, Options) -> Reply

              Types:

                 Input = input()
                 Output = output()
                 Options = options()
                 Reply = ok | {error, reason()} | input_reply() | output_reply()

              Sorts terms on files.

              sort(Input, Output) is equivalent to sort(Input, Output, []).