Provided by: libxml-filter-sort-perl_1.01-3_all

NAME

       XML::Filter::Sort - SAX filter for sorting elements in XML

SYNOPSIS

         use XML::Filter::Sort;
         use XML::SAX::Machines qw( :all );

         my $sorter = XML::Filter::Sort->new(
           Record  => 'person',
           Keys    => [
                        [ 'lastname',  'alpha', 'asc' ],
                        [ 'firstname', 'alpha', 'asc' ],
                        [ '@age',      'num',   'desc']
                      ],
         );

         my $filter = Pipeline( $sorter => \*STDOUT );

         $filter->parse_file(\*STDIN);

       Or from the command line:

         xmlsort

DESCRIPTION

       This module is a SAX filter for sorting 'records' in XML documents (including documents
       larger than available memory).  The "xmlsort" utility, which is included with this
       distribution, can be used to sort an XML file from the command line without writing Perl
       code (see "perldoc xmlsort").

EXAMPLES

       These examples assume that you will create an XML::Filter::Sort object and use it in an
       XML::SAX::Machines pipeline (as in the synopsis above).  You could also use the object
       directly by connecting it to a SAX generator and a SAX handler, but those details are
       omitted from the sample code.

       When you create an XML::Filter::Sort object (with the "new()" method), you must use the
       'Record' option to identify which elements you want sorted.  The simplest way to do this
       is to use the element name, eg:

         my $sorter = XML::Filter::Sort->new( Record  => 'colour' );

       Which could be used to transform this XML:

         <options>
           <colour>red</colour>
           <colour>green</colour>
           <colour>blue</colour>
         </options>

       to this:

         <options>
           <colour>blue</colour>
           <colour>green</colour>
           <colour>red</colour>
         </options>

       You can define a more specific path to the record by adding a prefix of element names
       separated by forward slashes, eg:

         my $sorter = XML::Filter::Sort->new( Record  => 'hair/colour' );

       which would only sort <colour> elements contained directly within a <hair> element (and
       would therefore leave our sample document above unchanged).  A path which starts with a
       slash is an 'absolute' path and must specify all intervening elements from the root
       element to the record elements.
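       As a sketch of the three path forms described so far, the constructor calls below differ
       only in the 'Record' value.  The '/options/colour' path is an assumption based on the
       sample document above, whose root element is <options>.

```perl
use strict;
use warnings;
use XML::Filter::Sort;

# Unanchored name: matches <colour> elements at any level.
my $any_colour = XML::Filter::Sort->new( Record => 'colour' );

# Relative path: matches only <colour> elements directly inside a <hair> element.
my $hair_colour = XML::Filter::Sort->new( Record => 'hair/colour' );

# Absolute path (assumed from the sample document): matches only <colour>
# elements that are children of the root <options> element.
my $root_colour = XML::Filter::Sort->new( Record => '/options/colour' );
```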

       A record element may contain other elements.  The order of the record elements may be
       changed by the sorting process but the order of any child elements within them will not.

       The default sort uses the full text of each 'record' element and uses an alphabetic
       comparison.  You can use the 'Keys' option to specify a list of elements within each
       record whose text content should be used as sort keys.  You can also use this option to
       specify whether the keys should be compared alphabetically or numerically and whether the
       resulting order should be ascending or descending, eg:

         my $sorter = XML::Filter::Sort->new(
           Record  => 'person',
           Keys    => [
                        [ 'lastname',  'alpha', 'asc'  ],
                        [ 'firstname', 'alpha', 'asc'  ],
                        [ '@age',      'num',   'desc' ],
                      ]
         );

       Given this record ...

           <person age='35'>
             <firstname>Aardvark</firstname>
             <lastname>Zebedee</lastname>
           </person>

       The above code would use 'Zebedee' as the first (primary) sort key, 'Aardvark' as the
       second sort key and the number 35 as the third sort key.  In this case, records with the
       same first and last name would be sorted from oldest to youngest.

       As with the 'record' path, it is possible to specify a path to the sort key elements (or
       attributes).  To make a path relative to the record element itself, use './' at the start
       of the path.
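       A minimal sketch of the same 'person' sort with every key path anchored at the record
       element.  It assumes that an anchored path such as './lastname' matches only a
       <lastname> that is a direct child of the record, which is how the anchoring rule reads.

```perl
use strict;
use warnings;
use XML::Filter::Sort;

my $sorter = XML::Filter::Sort->new(
    Record => 'person',
    Keys   => [
        [ './lastname',  'alpha', 'asc'  ],   # <lastname> child of the record
        [ './firstname', 'alpha', 'asc'  ],   # <firstname> child of the record
        [ './@age',      'num',   'desc' ],   # age attribute of the record itself
    ],
);
```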

OPTIONS

Record => 'path string'
A simple path string defining which elements should be treated as 'records' to be
sorted (see "PATH SYNTAX"). Elements which do not match this path will not be altered
by the filter. Elements which do match this path will be re-ordered depending on
their contents and the value of the Keys option.

When a record element is re-ordered, it takes its leading whitespace with it.

Only lists of contiguous record elements will be sorted. A list of records which has
a 'foreign body' (a non-record element, non-whitespace text, a comment or a processing
instruction) between two elements will be treated as two separate lists and each will
be sorted in isolation from the other.

Keys => [ [ 'path string', comparator, order ], ... ]
Keys => 'delimited string'
This option specifies which parts of the records should be used as sort keys. The
first form uses a list-of-lists syntax. Each key is defined using a list of three
elements:

1. The 'path string' defines the path to an element or an attribute whose text
contents should be used as the value of the sort key (see "PATH SYNTAX").

2. The 'comparator' defines how these values should be compared. This can be the
string 'alpha' for alphabetic, the string 'num' for numeric or a reference to a
subroutine taking two parameters and returning -1, 0 or 1 (similar to the standard
Perl sort function but without the $a, $b magic).

This item is optional and defaults to 'alpha'.

3. The 'order' should be 'asc' for ascending or 'desc' for descending and if omitted,
defaults to 'asc'.

You may prefer to define the Keys using a delimited string rather than a list of
lists. Keys in the string should be separated by either newlines or semicolons and
the components of a key should be separated by whitespace or commas. It is not
possible to define a subroutine reference comparator using the string syntax.
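
The two forms might look like the sketch below. The by_length comparator and the
%options hash are hypothetical names used only for illustration, and the string form
is written from the separator rules above rather than copied from the module. The
options are held in a plain hash so the fragment runs without the module installed.

```perl
use strict;
use warnings;

# Hypothetical comparator: shorter values sort first, ties are broken alphabetically.
# The two key values arrive as ordinary arguments, not via $a and $b.
sub by_length {
    my ($left, $right) = @_;
    return length($left) <=> length($right) || $left cmp $right;
}

# List-of-lists form, using the subroutine reference as the first key's comparator.
my @keys = (
    [ 'lastname', \&by_length, 'asc'  ],
    [ '@age',     'num',       'desc' ],
);

# Delimited string form: keys separated by semicolons, components by commas.
my $keys_string = 'lastname,alpha,asc;firstname,alpha,asc;@age,num,desc';

# Either value could then be passed as XML::Filter::Sort->new(%options).
my %options = ( Record => 'person', Keys => \@keys );
```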

IgnoreCase => 1
Enabling this option will make sort comparisons case-insensitive (rather than the
default case-sensitive).

NormaliseKeySpace => 1
The sort key values for each record will be the text content of the child elements
specified using the Keys option (above). If you enable this option, leading and
trailing whitespace will be stripped from the keys and each internal run of spaces
will be collapsed to a single space. The default value for this option is off for
efficiency.

Note: The contents of the record are not affected by this setting - merely the copy of
the data that is used in the sort comparisons.
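
The effect on a key value can be pictured with the sketch below. It is an illustration
of the described behaviour, not the module's own code, and normalise_space is a
hypothetical helper. It assumes that any run of whitespace is collapsed, not only
literal spaces.

```perl
use strict;
use warnings;

# Hypothetical helper that mimics the described normalisation of one key value.
sub normalise_space {
    my ($value) = @_;
    $value =~ s/^\s+//;     # strip leading whitespace
    $value =~ s/\s+$//;     # strip trailing whitespace
    $value =~ s/\s+/ /g;    # collapse each internal run to a single space
    return $value;
}

my $key = normalise_space("  Van   der  Berg \n");
```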

KeyFilterSub => coderef
You can also supply your own custom 'fix-ups' by passing this option a reference to a
subroutine. The subroutine will be called once for each record and will be passed a
list of the key values for the record. The routine must return the same number of
elements each time it is called, but this may be fewer than the number of values passed
to it. You might use this option to combine multiple key values into one (eg: using
sprintf).

Note: You can enable both the NormaliseKeySpace and the KeyFilterSub options - space
normalisation will occur first.
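
A minimal sketch of combining keys with sprintf follows. The 'event' record, its
'year', 'month' and 'day' keys and the combine_date_keys routine are all hypothetical.
Every key is declared numeric and ascending, so the example does not depend on how the
shortened key list is matched to the comparators, which is not described here.

```perl
use strict;
use warnings;

# Hypothetical fix-up: fold year, month and day keys into one sortable number.
# It always returns exactly one element, as the option requires a constant count.
sub combine_date_keys {
    my ($year, $month, $day) = @_;
    return ( sprintf('%04d%02d%02d', $year, $month, $day) );
}

# Options as they could be passed to XML::Filter::Sort->new(%options).
my %options = (
    Record       => 'event',
    Keys         => 'year,num,asc;month,num,asc;day,num,asc',
    KeyFilterSub => \&combine_date_keys,
);
```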

TempDir => 'directory path'
This option serves two purposes: it enables disk buffering rather than the default
memory buffering and it allows you to specify where on disk the data should be
buffered. Disk buffering will be slower than memory buffering, so don't ask for it if
you don't need it. For more details, see "IMPLEMENTATION".

Note: It is safe to specify the same temporary directory path for multiple instances
since each will create a uniquely named subdirectory (and clean it up afterwards).

MaxMem => bytes
The disk buffering mode actually sorts chunks of records in memory before saving them
to disk. The default chunk size is 10 megabytes. You can use this option to specify
an alternative chunk size (in bytes) which is more attuned to your available resources
(more is better). A suffix of 'K' or 'M' is recognised as kilobytes or megabytes
respectively.

If you have not enabled disk buffering (using 'TempDir'), the MaxMem option has no
effect. Attempting to sort a large document using only memory buffering may result in
Perl dying with an 'out of memory' error.
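
A sketch of enabling disk buffering with a larger chunk size. Using File::Spec->tmpdir
for the buffer location and '20M' for the chunk size are illustrative choices, not
recommendations from the module.

```perl
use strict;
use warnings;
use File::Spec;
use XML::Filter::Sort;

my $sorter = XML::Filter::Sort->new(
    Record  => 'person',
    Keys    => 'lastname',
    TempDir => File::Spec->tmpdir,   # enables disk buffering under the system temp dir
    MaxMem  => '20M',                # sort chunks of roughly 20 megabytes in memory
);
```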

SkipIgnorableWS
If your SAX parser can do validation and generates ignorable_whitespace() events, you
can enable this option to discard these events. If you leave this option at its
default value (implying you want the whitespace), the events will be translated to
characters() events.

PATH SYNTAX

       A simple element path syntax is used in two places:

       1.  with the 'Record' option to define which elements should be sorted

       2.  with the 'Keys' option to define which parts of each record should be used as sort
           keys.

       In each case you can use just an element name, or a list of element names separated by
       forward slashes, eg:

         Record => 'ul/li',
         Keys   => 'name'

       If a 'Record' path begins with a '/' then it will be anchored at the document root.  If a
       'Keys' path begins with './' then it is anchored at the current record element.
       Unanchored paths can match at any level.

       A 'Keys' path can include an attribute name prefixed with an '@' symbol, eg:

         Keys   => './@href'

       Each element or attribute name can include a namespace URI prefix in curly braces, eg:

         Record => '{http://www.w3.org/1999/xhtml}li'

       If you do not include a namespace prefix, all elements with the specified name will be
       matched, regardless of any namespace URI association they might have.

       If you include an empty namespace prefix (eg: '{}li') then only records which do not have
       a namespace association will be matched.
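
       The three namespace cases can be compared side by side.  This is a sketch using the
       XHTML namespace URI from the example above; the variable names are illustrative only.

```perl
use strict;
use warnings;
use XML::Filter::Sort;

# Only <li> elements in the XHTML namespace.
my $xhtml_items = XML::Filter::Sort->new( Record => '{http://www.w3.org/1999/xhtml}li' );

# Only <li> elements that have no namespace association.
my $plain_items = XML::Filter::Sort->new( Record => '{}li' );

# Any <li> element, whatever its namespace.
my $any_items = XML::Filter::Sort->new( Record => 'li' );
```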

IMPLEMENTATION

       In order to arrange records into sorted order, this module uses buffering.  It does not
       need to buffer the whole document, but for any sequence of records within a document, all
       records must be buffered.  Unless you specify otherwise, the records will be buffered in
       memory.  The memory requirements are similar to DOM implementations - 10 to 50 times the
       character count of the source XML.  If your documents are so large that you would not
       process them with a DOM parser then you should enable disk buffering.

       If you enable disk buffering, sequences of records will be assembled into 'chunks' of
       approximately 10 megabytes (this value is configurable).  Each chunk will be sorted and
       saved to disk.  At the end of the record sequence, all the sorted chunks will be merged
       and written out as SAX events.
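
       The chunk-and-merge idea can be sketched in plain Perl.  This illustrates the general
       technique only and is not the module's implementation: the chunks are kept in memory
       rather than on disk, records are plain strings, and sorted_chunks and merge_chunks are
       hypothetical names.

```perl
use strict;
use warnings;

# Split the records into chunks of at most $chunk_size and sort each chunk.
sub sorted_chunks {
    my ($chunk_size, @records) = @_;
    my @chunks;
    while (@records) {
        my @chunk = splice(@records, 0, $chunk_size);
        push @chunks, [ sort { $a cmp $b } @chunk ];
    }
    return @chunks;
}

# Merge sorted chunks by repeatedly taking the smallest head element.
sub merge_chunks {
    my @chunks = map { [ @$_ ] } @_;       # copy so the inputs are not modified
    my @merged;
    while (1) {
        @chunks = grep { @$_ } @chunks;    # drop exhausted chunks
        last unless @chunks;
        my ($smallest) = sort { $a->[0] cmp $b->[0] } @chunks;
        push @merged, shift @$smallest;
    }
    return @merged;
}

my @merged = merge_chunks( sorted_chunks( 2, qw(red green blue cyan amber) ) );
```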

       The memory buffering mode represents each record as an XML::Filter::Sort::Buffer object and
       uses XML::Filter::Sort::BufferMgr objects to manage the buffers.  For details of the
       internals, see XML::Filter::Sort::BufferMgr.

       The disk buffering mode represents each record as an XML::Filter::Sort::DiskBuffer object
       and uses XML::Filter::Sort::DiskBufferMgr objects to manage the buffers.  For details of
       the internals, see XML::Filter::Sort::DiskBufferMgr.

BUGS

       ignorable_whitespace() events shouldn't be translated to normal characters() events -
       perhaps in a later release they won't be.

COPYRIGHT

       Copyright 2002-2005 Grant McLean <grantm@cpan.org>

       This library is free software; you can redistribute it and/or modify it under the same
       terms as Perl itself.