Provided by: libsereal-encoder-perl_2.03-1_amd64

NAME

       Sereal::Encoder - Fast, compact, powerful binary serialization

SYNOPSIS

         use Sereal::Encoder qw(encode_sereal);

         my $encoder = Sereal::Encoder->new({...options...});
         my $out = $encoder->encode($structure);
         # alternatively:
         $out = encode_sereal($structure, {... options ...});

DESCRIPTION

       This library implements an efficient, compact-output, and feature-rich serializer using a
       binary protocol called Sereal.  Its sister module Sereal::Decoder implements a decoder for
       this format.  The two are released separately to allow for independent and safer
       upgrading.

       By default, this encoder implementation currently emits Sereal protocol version 2.

       The protocol specification and many other bits of documentation can be found in the GitHub
       repository. Right now, the specification is at
       <https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod>, there is a discussion of
       the design objectives in <https://github.com/Sereal/Sereal/blob/master/README.pod>, and
       the output of our benchmarks can be seen at
       <https://github.com/Sereal/Sereal/wiki/Sereal-Comparison-Graphs>.

CLASS METHODS

new
Constructor. Optionally takes a hash reference as first parameter. This hash reference may
contain any number of options that influence the behaviour of the encoder.

Currently, the following options are recognized; none of them is on by default.

snappy

If set, the main payload of the Sereal document will be compressed using Google's Snappy
algorithm. This can yield anywhere from no effect to significant savings on output size at
rather low run time cost. If in doubt, test with your data whether this helps or not.

The decoder (version 0.04 and up) will know how to handle Snappy-compressed Sereal
documents transparently.

Note: The "snappy_incr" and "snappy" options are identical in Sereal protocol V2 (the
default). If using the "use_protocol_v1" option to emit Sereal V1 documents, this emits
non-incrementally decodable documents. See "snappy_incr" in those cases.

snappy_incr

Same as the "snappy" option for default (Sereal V2) operation.

In Sereal V1, enables a version of the snappy protocol which is suitable for incremental
parsing of packets. See also the "snappy" option above for more details.

snappy_threshold

The size threshold (in bytes) of the uncompressed output below which Snappy compression is
not attempted at all, even if enabled. Defaults to one kilobyte (1024 bytes). To always
attempt compression, enable "snappy" and set this option to 0. Note that the document will
not be compressed if the compressed size would be larger than the original size (even if
snappy_threshold is 0).
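
As a rough illustration of the Snappy options described above, the following sketch
compares output sizes with and without compression. The sample data is made up, and the
actual savings depend entirely on your own data.

```perl
use strict;
use warnings;
use Sereal::Encoder;

# Highly repetitive sample data (invented for illustration) compresses well.
my $data = [ map { "the same string again and again" } 1 .. 1000 ];

my $plain  = Sereal::Encoder->new();
my $snappy = Sereal::Encoder->new({ snappy => 1 });

my $plain_len  = length($plain->encode($data));
my $snappy_len = length($snappy->encode($data));
printf "plain: %d bytes, snappy: %d bytes\n", $plain_len, $snappy_len;

# A tiny document stays below the default 1024-byte threshold, so
# compression is only attempted when snappy_threshold is lowered.
my $always = Sereal::Encoder->new({ snappy => 1, snappy_threshold => 0 });
printf "tiny document: %d bytes (default threshold), %d bytes (threshold 0)\n",
    length($snappy->encode([1, 2, 3])), length($always->encode([1, 2, 3]));
```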

croak_on_bless

If this option is set, then the encoder will refuse to serialize blessed references and
throw an exception instead.

This can be important because blessed references can mean executing a destructor on a
remote system or generally executing code based on data.
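
A minimal sketch of the behaviour described above. The class name "My::Point" is
hypothetical, and the exact wording of the exception is not relied upon here.

```perl
use strict;
use warnings;
use Sereal::Encoder;

my $strict = Sereal::Encoder->new({ croak_on_bless => 1 });

# A blessed reference; "My::Point" is a made-up class name.
my $object = bless { x => 1, y => 2 }, 'My::Point';

# encode() throws for blessed input, so wrap the call in eval.
my $blob  = eval { $strict->encode({ point => $object }) };
my $error = $@;
print "refused: $error" if !defined $blob;

# Plain (unblessed) data is still accepted by the same encoder.
my $ok = $strict->encode({ point => { x => 1, y => 2 } });
printf "plain data encoded to %d bytes\n", length($ok);
```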

See also "no_bless_objects" to skip the blessing of objects. When both flags are set,
"croak_on_bless" has a higher precedence than "no_bless_objects".

freeze_callbacks

This option is new in Sereal v2 and needs a Sereal v2 decoder.

If this option is set, the encoder will check for and possibly invoke the "FREEZE" method
on any object in the input data. An object that was serialized using its "FREEZE" method
will have its corresponding "THAW" class method called during deserialization. The exact
semantics are documented below under "FREEZE/THAW CALLBACK MECHANISM".

Beware that using this functionality means a significant slowdown for object
serialization. Even when serializing objects without a "FREEZE" method, the additional
method lookup will cost a small amount of runtime. Yes, "Sereal::Encoder" is so fast
that it may make a difference.

no_bless_objects

If this option is set, then the encoder will serialize blessed references without the
bless information and provide plain data structures instead.

See also the "croak_on_bless" option above for more details.
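
To illustrate, the sketch below encodes an object with "no_bless_objects" and decodes it
again. It assumes Sereal::Decoder is installed; the class name "My::Point" is hypothetical.

```perl
use strict;
use warnings;
use Sereal::Encoder qw(encode_sereal);
use Sereal::Decoder qw(decode_sereal);

# A blessed hash reference; "My::Point" is a made-up class name.
my $object = bless { x => 1, y => 2 }, 'My::Point';

# The class information is dropped at encoding time ...
my $blob = encode_sereal($object, { no_bless_objects => 1 });

# ... so the decoder hands back a plain hash reference.
my $copy = decode_sereal($blob);
printf "original: %s, decoded: %s\n", ref($object), ref($copy);
```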

undef_unknown

If set, unknown/unsupported data structures will be encoded as "undef" instead of throwing
an exception.

Mutually exclusive with "stringify_unknown". See also "warn_unknown" below.

stringify_unknown

If set, unknown/unsupported data structures will be stringified and encoded as that string
instead of throwing an exception. The stringification may cause a warning to be emitted by
perl.

Mutually exclusive with "undef_unknown". See also "warn_unknown" below.

warn_unknown

Only has an effect if "undef_unknown" or "stringify_unknown" are enabled.

If set to a positive integer, any unknown/unsupported data structure encountered will emit
a warning. If set to a negative integer, it will warn for unsupported data structures just
the same as for a positive value with one exception: For blessed, unsupported items that
have string overloading, we silently stringify without warning.
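
The following sketch shows how "undef_unknown" and "stringify_unknown" might be used. It
assumes that a code reference counts as an unsupported data structure and that
Sereal::Decoder is installed; the sample data is made up.

```perl
use strict;
use warnings;
use Sereal::Encoder qw(encode_sereal);
use Sereal::Decoder qw(decode_sereal);

# A code reference is assumed here to be something Sereal cannot serialize.
my $data = { id => 7, callback => sub { 42 } };

# Without either option, encoding throws an exception.
my $default_blob = eval { encode_sereal($data) };
print "default: $@" if !defined $default_blob;

# undef_unknown: the unsupported item is replaced by undef.
my $undef_copy = decode_sereal(encode_sereal($data, { undef_unknown => 1 }));

# stringify_unknown: the unsupported item is replaced by its string form.
my $string_copy = decode_sereal(encode_sereal($data, { stringify_unknown => 1 }));

printf "undef_unknown: %s\n", defined $undef_copy->{callback} ? "defined" : "undef";
printf "stringify_unknown: %s\n", $string_copy->{callback};
```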

max_recursion_depth

"Sereal::Encoder" is recursive. If you pass it a Perl data structure that is deeply
nested, it will eventually exhaust the C stack. Therefore, there is a limit on the depth
of recursion that is accepted. It defaults to 10000 nested calls. You may choose to
override this value with the "max_recursion_depth" option. Beware that setting it too high
can cause hard crashes, so only do that if you KNOW that it is safe to do so.

Do note that the setting is somewhat approximate. Setting it to 10000 may break at
somewhere between 9997 and 10003 nested structures depending on their types.
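
A small sketch of the limit in action. The nesting depth of 50 and the limit of 10 are
arbitrary values chosen so that the limit is exceeded by a wide margin.

```perl
use strict;
use warnings;
use Sereal::Encoder;

# Build an array nested 50 levels deep: [[[ ... ]]]
my $deep   = [];
my $cursor = $deep;
for (1 .. 50) {
    my $next = [];
    push @$cursor, $next;
    $cursor = $next;
}

# With a deliberately low limit, the encoder refuses the structure.
my $limited = Sereal::Encoder->new({ max_recursion_depth => 10 });
my $blob    = eval { $limited->encode($deep) };
print "refused: $@" if !defined $blob;

# The default limit is far higher, so the same data encodes fine.
my $ok = Sereal::Encoder->new()->encode($deep);
printf "default limit: encoded to %d bytes\n", length($ok);
```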

sort_keys

Normally "Sereal::Encoder" will output hashes in whatever order is convenient, generally
that used by perl to actually store the hash, or whatever order was returned by a tied
hash.

If this option is enabled then the Encoder will sort the keys before outputting them. It
uses more memory, and is quite a bit slower than the default.

Generally speaking this should mean that a hash and a copy should produce the same output.
Nevertheless the user is warned that Perl has a way of "morphing" variables on use, and
some of its rules are a little arcane (for instance utf8 keys), and so two hashes that
might appear to be the same might still produce different output as far as Sereal is
concerned.

See "NON-CANONICAL" for why you might want to use this option, and for the various caveats
involved.

Independently of "sort_keys": the encoder object allocated by "new" and its output buffer
are reused between invocations of "encode()", so hold on to the object for an efficiency
gain if you plan to serialize multiple similar data structures, but destroy it to free the
memory if you serialize a single very large data structure just once.
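
As an illustration of "sort_keys", the sketch below encodes two hashes that were filled in
a different order. It deliberately uses simple string keys and integer values; the caveats
discussed under "NON-CANONICAL" still apply to less trivial data.

```perl
use strict;
use warnings;
use Sereal::Encoder;

# Two hashes with the same content, filled in opposite order.
my (%first, %second);
$first{$_}  = 1 for 'a' .. 'z';
$second{$_} = 1 for reverse 'a' .. 'z';

my $sorted = Sereal::Encoder->new({ sort_keys => 1 });

# With sorted keys, both hashes serialize to the same byte string.
my $same = $sorted->encode(\%first) eq $sorted->encode(\%second);
print "identical output with sort_keys: ", ($same ? "yes" : "no"), "\n";
```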

no_shared_hashkeys

When the "no_shared_hashkeys" option is set to a true value, then the encoder will disable
the detection and elimination of repeated hash keys. This only has an effect for
serializing structures containing hashes. By skipping the detection of repeated hash
keys, performance goes up a bit, but the size of the output can potentially be much
larger.

Do not disable this unless you have a reason to.

dedupe_strings

If this option is enabled/true then Sereal will use a hash to encode duplicates of
strings during serialization efficiently using (internal) backreferences. This has a
performance and memory penalty during encoding so it defaults to off. On the other hand,
data structures with many duplicated strings will see a significant reduction in the size
of the encoded form. Currently only strings longer than 3 characters will be deduped,
however this may change in the future.

Note that Sereal will perform certain types of deduping automatically even without this
option. In particular class names and hash keys (see also the "no_shared_hashkeys"
setting) are deduped regardless of this option. Only enable this if you have good reason
to believe that there are many duplicated strings as values in your data structure.

Use of this option does not require an upgraded decoder (this option was added in
Sereal::Encoder 0.32). The deduping is performed in such a way that older decoders should
handle it just fine. In other words, the output of a Sereal decoder should not depend on
whether this option was used during encoding. See also below: aliased_dedupe_strings.
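
The effect on output size can be checked with a sketch like the following. The sample data
is made up and assumes Sereal::Decoder is installed; real savings depend on how many
duplicated string values your data contains.

```perl
use strict;
use warnings;
use Sereal::Encoder;
use Sereal::Decoder qw(decode_sereal);

# Many hashes that all carry the same (longer than 3 characters) string value.
my $data = [ map { +{ status => "temporarily unavailable" } } 1 .. 500 ];

my $plain_blob  = Sereal::Encoder->new()->encode($data);
my $dedupe_blob = Sereal::Encoder->new({ dedupe_strings => 1 })->encode($data);

my $plain_len  = length($plain_blob);
my $dedupe_len = length($dedupe_blob);
printf "plain: %d bytes, dedupe_strings: %d bytes\n", $plain_len, $dedupe_len;

# Decoding yields the same data regardless of the option.
my $copy = decode_sereal($dedupe_blob);
print "last status: $copy->[-1]{status}\n";
```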

aliased_dedupe_strings

This is an advanced option that should be used only after fully understanding its
ramifications.

This option enables a mode of operation that is similar to dedupe_strings and if both
options are set, aliased_dedupe_strings takes precedence.

The behaviour of aliased_dedupe_strings differs from dedupe_strings in that the duplicate
occurrences of strings are emitted as Perl language level aliases instead of as
Sereal-internal backreferences. This means that using this option actually produces a different
output data structure when decoding. The upshot is that with this option, the application
using (decoding) the data may save a lot of memory in some situations but at the cost of
potential action at a distance due to the aliasing.

Beware: The test suite currently does not cover this option as well as it probably should.
Patches welcome.

use_protocol_v1

If set, the encoder will emit Sereal documents following protocol version 1. This is
strongly discouraged except for temporary compatibility/migration purposes.

INSTANCE METHODS

   encode
       Given a Perl data structure, serializes that data structure and returns a binary string
       that can be turned back into the original data structure by Sereal::Decoder.

EXPORTABLE FUNCTIONS

   encode_sereal
       The functional interface that is equivalent to using "new" and "encode".  Expects a data
       structure to serialize as first argument, optionally followed by a hash reference of
       options (see documentation for "new()").

       The functional interface is marginally slower than the OO interface since it cannot reuse
       the encoder object.
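
       As a rough illustration of the two interfaces, the following sketch round-trips a small
       structure through both. It assumes Sereal::Decoder is installed alongside this module;
       the sample data is made up.

```perl
use strict;
use warnings;
use Sereal::Encoder qw(encode_sereal);
use Sereal::Decoder qw(decode_sereal);

my $data = { name => "example", values => [1, 2, 3] };

# OO interface: the encoder object (and its buffer) can be reused.
my $encoder = Sereal::Encoder->new();
my $blob_oo = $encoder->encode($data);

# Functional interface: convenient for one-off calls.
my $blob_fn = encode_sereal($data);

# Either document decodes back to an equivalent structure.
my $copy_oo = decode_sereal($blob_oo);
my $copy_fn = decode_sereal($blob_fn);
print "name: $copy_oo->{name}, last value: $copy_fn->{values}[-1]\n";
```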

PERFORMANCE

       The exact performance in time and space depends heavily on the data structure to be
       serialized. For ready-made comparison scripts, see the author_tools/bench.pl and
       author_tools/dbench.pl programs that are part of this distribution. Suffice it to say that
       this library is easily competitive in both time and space efficiency with the best
       alternatives.

FREEZE/THAW CALLBACK MECHANISM

       This mechanism is enabled using the "freeze_callbacks" option of the encoder.  It is
       inspired by the equivalent mechanism in CBOR::XS and differs only in one minor detail,
       explained below. The general mechanism is documented in the A GENERIC OBJECT SERIALIATION
       PROTOCOL section of Types::Serialiser.  Similar to CBOR using "CBOR", Sereal uses the
       string "Sereal" as a serializer identifier for the callbacks.

       The one difference to the mechanism as supported by CBOR is that in Sereal, the "FREEZE"
       callback must return a single value. That value can be any data structure supported by
       Sereal (hopefully without causing infinite recursion by including the original object).
       But "FREEZE" can't return a list as with CBOR.  This should not be any practical
       limitation whatsoever. Just return an array reference instead of a list.

       Here is a contrived example of a class implementing the "FREEZE" / "THAW" mechanism.

         package
           File;

         use Moo;

         has 'path' => (is => 'ro');
         has 'fh' => (is => 'rw');

         # open file handle if necessary and return it
         sub get_fh {
           my $self = shift;
           # This could also be done with fancier Moo(se) syntax
           my $fh = $self->fh;
           if (not $fh) {
             open $fh, "<", $self->path or die $!;
             $self->fh($fh);
           }
           return $fh;
         }

         sub FREEZE {
           my ($self, $serializer) = @_;
           # Could switch on $serializer here: JSON, CBOR, Sereal, ...
           # But this case is so simple that it will work with ALL of them.
           # Do not try to serialize our file handle! Path will be enough
           # to recreate.
           return $self->path;
         }

         sub THAW {
           my ($class, $serializer, $data) = @_;
           # Turn back into object.
           return $class->new(path => $data);
         }

       Why is the "FREEZE"/"THAW" mechanism important here? Our contrived "File" class may
       contain a file handle which can't be serialized. So "FREEZE" not only returns just the
       path (which is more compact than encoding the actual object contents), but it strips the
       file handle which can be lazily reopened on the other side of the
       serialization/deserialization pipe.  But this example also shows that a naive
       implementation can easily end up with subtle bugs. A file handle itself has state
       (position in file, etc).  Thus the deserialization in the above example won't accurately
       reproduce the original state. It can't, of course, if it's deserialized in a different
       environment anyway.
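
       To show the mechanism end to end, here is a sketch of a full round trip. It uses a
       hypothetical class "Counter" written without Moo so that the example is self-contained,
       and it assumes Sereal::Decoder is installed and invokes "THAW" with its default settings.

```perl
use strict;
use warnings;
use Sereal::Encoder;
use Sereal::Decoder;

{
    package Counter;    # hypothetical class, for illustration only

    sub new {
        my ($class, %args) = @_;
        # "scratch" stands in for state that should not be serialized.
        return bless { count => $args{count} || 0, scratch => {} }, $class;
    }

    # Return a single value that is enough to rebuild the object.
    sub FREEZE {
        my ($self, $serializer) = @_;
        return $self->{count};
    }

    # Class method: rebuild the object from what FREEZE returned.
    sub THAW {
        my ($class, $serializer, $data) = @_;
        return $class->new(count => $data);
    }
}

my $encoder = Sereal::Encoder->new({ freeze_callbacks => 1 });
my $decoder = Sereal::Decoder->new();

my $blob = $encoder->encode(Counter->new(count => 3));
my $copy = $decoder->decode($blob);
printf "class: %s, count: %d\n", ref($copy), $copy->{count};
```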

THREAD-SAFETY

       "Sereal::Encoder" is thread-safe on Perl 5.8.7 and higher. This means "thread-safe" in
       the sense that if you create a new thread, all "Sereal::Encoder" objects will become a
       reference to undef in the new thread. This might change in a future release to become a
       full clone of the encoder object.

NON-CANONICAL

You might want to compare two data structures by comparing their serialized byte strings.
For that to work reliably the serialization must take extra steps to ensure that identical
data structures are encoded into identical serialized byte strings (a so-called "canonical
representation").

Currently the Sereal encoder does not provide a mode that will reliably generate a
canonical representation of a data structure. The reasons are many and sometimes subtle.

Sereal does support some use-cases however. In this section we attempt to outline the
issues well enough for you to decide if it is suitable for your needs.

Sereal doesn't order the hash keys by default.
This can be enabled via "sort_keys", see above.

There are multiple valid Sereal documents that you can produce for the same Perl data
structure.
Just sorting hash keys is not enough. A trivial example is PAD bytes which mean
nothing and are skipped. They mostly exist for encoder optimizations to prevent
certain nasty backtracking situations from becoming O(n) at the cost of one byte of
output. An explicit canonical mode would have to outlaw them (or add more of them) and
thus require a much more complicated implementation of refcount/weakref handling in the
encoder while at the same time causing some operations to go from O(1) to a full
memcpy of everything after the point of where we backtracked to. Nasty.

Another example is COPY. The COPY tag indicates that the next element is an identical
copy of a previous element (which is itself forbidden from including COPY's other than
for class names). COPY is purely internal. The Perl/XS implementation uses it to share
hash keys and class names. One could use it for other strings (theoretically), but
doesn't for time-efficiency reasons. We'd have to outlaw the use of this (significant)
optimization for canonicalization.

Sereal represents a reference to an array as a sequence of tags which, in its simplest
form, reads REF, ARRAY $array_length TAG1 TAG2 .... The separation of "REF" and
"ARRAY" is necessary to properly implement all of Perl's referencing and aliasing
semantics correctly. Quite frequently, however, your array is only referenced once and
plainly so. If it's also at most 15 elements long, Sereal optimizes all of the "REF"
and "ARRAY" tags, as well as the length into a special one byte ARRAYREF tag. This is
a very significant optimization for common cases. This, however, does mean that most
arrays up to 15 elements could be represented in two different, yet perfectly valid
forms. ARRAYREF would have to be outlawed for a properly canonical form. The exact
same logic applies to HASH vs. HASHREF.

Similarly to how arrays and hashes have a full and a compact form, integers have several
possible representations. For small integers (between -16 and +15 inclusive), Sereal emits
only one byte
including the encoding of the type of data. For larger integers, it can use either
varints (positive only) or zigzag encoding, which can also represent negative numbers.
For a canonical mode, the space optimizations would have to be turned off and it would
have to be explicitly specified whether varint or zigzag encoding is to be used for
encoding positive integers.
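
To make the two integer encodings mentioned above more concrete, here is a pure-Perl
sketch of zigzag mapping and of the common 7-bits-per-byte varint scheme. The function
names are made up, and the byte layout is assumed to match the one in the Sereal
specification; consult the specification for the authoritative definition.

```perl
use strict;
use warnings;

# Zigzag maps signed integers to unsigned ones so that values of small
# magnitude stay small: 0 => 0, -1 => 1, 1 => 2, -2 => 3, ...
sub zigzag {
    my ($n) = @_;
    return $n >= 0 ? 2 * $n : -2 * $n - 1;
}

# Varint: 7 payload bits per byte, least significant group first; the
# high bit of each byte signals that another byte follows.
sub varint {
    my ($n) = @_;    # must be a non-negative integer
    my $bytes = '';
    while ($n >= 0x80) {
        $bytes .= chr(($n & 0x7f) | 0x80);
        $n >>= 7;
    }
    return $bytes . chr($n);
}

# A positive integer can be written as a plain varint or as a zigzag
# varint, which is one reason several valid encodings can exist.
for my $n (-300, -1, 0, 1, 300) {
    printf "%5d -> zigzag %4d -> %d byte(s)\n",
        $n, zigzag($n), length(varint(zigzag($n)));
}
```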

Perl may choose to retain multiple representations of a scalar. Specifically, it can
convert integers, floating point numbers, and strings on the fly and will aggressively
cache the results. Normally, it remembers which of the representations can be
considered canonical, that means, which can be used to recreate the others reliably.
For example, 0 and "0" can both be considered canonical since they naturally transform
into each other. Beyond intrinsic ambiguity, there are ways to trick Perl into
allowing a single scalar to have distinct string, integer, and floating point
representations that are all flagged as canonical, but can't be transformed into each
other. These are the so-called dualvars. Sereal cannot represent dualvars (and that's
a good thing).

Floating point values can appear to be the same but serialize to different byte
strings due to insignificant 'noise' in the floating point representation. Sereal
supports different floating point precisions and will generally choose the most
compact that can represent your floating point number correctly.

These issues are especially relevant when considering language interoperability.

Often, people don't actually care about "canonical" in the strict sense required for real
identity checking. They just require a best-effort sort of thing for caching. But it's a
slippery slope!

In a nutshell, the "sort_keys" option may be sufficient for an application which is simply
serializing a cache key, and thus there's little harm in an occasional false-negative, but
think carefully before applying Sereal in other use-cases.

BUGS, CONTACT AND SUPPORT

       For reporting bugs, please use the GitHub bug tracker at
       <http://github.com/Sereal/Sereal/issues>.

       For support and discussion of Sereal, there are two Google Groups:

       Announcements around Sereal (extremely low volume):
       <https://groups.google.com/forum/?fromgroups#!forum/sereal-announce>

       Sereal development list: <https://groups.google.com/forum/?fromgroups#!forum/sereal-dev>

AUTHORS

       Yves Orton <demerphq@gmail.com>

       Damian Gryski

       Steffen Mueller <smueller@cpan.org>

       Rafaël Garcia-Suarez

       Ævar Arnfjörð Bjarmason <avar@cpan.org>

       Tim Bunce

       Daniel Dragan <bulkdd@cpan.org> (Windows support and bugfixes)

       Some inspiration and code was taken from Marc Lehmann's excellent JSON::XS module due to
       obvious overlap in problem domain. Thank you!

ACKNOWLEDGMENT

       This module was originally developed for Booking.com.  With approval from Booking.com,
       this module was generalized and published on CPAN, for which the authors would like to
       express their gratitude.

COPYRIGHT AND LICENSE

       Copyright (C) 2012, 2013, 2014 by Steffen Mueller

       Copyright (C) 2012, 2013, 2014 by Yves Orton

       The license for the code in this distribution is the following, with the exceptions listed
       below:

       This library is free software; you can redistribute it and/or modify it under the same
       terms as Perl itself.

       Except portions taken from Marc Lehmann's code for the JSON::XS module, which is licensed
       under the same terms as this module.

       Also except the code for Snappy compression library, whose license is reproduced below and
       which, to the best of our knowledge, is compatible with this module's license. The license
       for the enclosed Snappy code is:

         Copyright 2011, Google Inc.
         All rights reserved.

         Redistribution and use in source and binary forms, with or without
         modification, are permitted provided that the following conditions are
         met:

           * Redistributions of source code must retain the above copyright
         notice, this list of conditions and the following disclaimer.
           * Redistributions in binary form must reproduce the above
         copyright notice, this list of conditions and the following disclaimer
         in the documentation and/or other materials provided with the
         distribution.
           * Neither the name of Google Inc. nor the names of its
         contributors may be used to endorse or promote products derived from
         this software without specific prior written permission.

         THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
         "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
         LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
         A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
         OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
         SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
         LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
         DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
         THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
         (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
         OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.