Provided by: postgresql-comparator_2.2.2-1_amd64

NAME

       pg_comparator - efficient table content comparison and synchronization

SYNOPSIS

       pg_comparator [options as --help --option --man] conn1 conn2

DESCRIPTION

       This script performs a network and time efficient comparison or synchronization of two possibly large
       tables in PostgreSQL, MySQL or SQLite databases, so as to detect inserted, updated or deleted tuples
       between these tables.  The algorithm is efficient especially if the expected differences are relatively
       small.

       The implementation is quite generic: multi-column keys (but there must be a key!), no assumption about data
       types other than that they can be cast to text, a subset of columns can be used for the comparison, handling
       of NULL values...

       This script focuses on the comparison algorithm, hence the many options.  The fact that it may do
       anything useful, such as checking that a replication tool does indeed replicate your data, or such as
       synchronizing tables, is a mere side effect.

OPTIONS

       Options allow requesting help or adjusting some internal parameters.  Short one-letter options are also
       available, usually with the first letter of the option name.

       "--aggregate=(sum|xor)" or "-a (sum|xor)"
            Aggregation function to be used for summaries, either xor or sum.  It must operate on the result of
            the checksum function.  For PostgreSQL and SQLite, the xor aggregate needs to be loaded.  There is a
            signed/unsigned issue on the key hash when using xor for comparing tables on MySQL or SQLite vs
            PostgreSQL.  We provide a new "ISUM" aggregate for SQLite because both "SUM" and "TOTAL" handle
            integer overflows in ways that are incompatible with the other databases.

           Default is sum because it is available by default and works in mixed mode.

       "--ask-pass"
           Ask for passwords interactively. See also "--env-pass" option below.

           Default is not to ask for passwords.

       "--asynchronous" or "-A", "--no-asynchronous" or "-X"
            Whether to run asynchronous queries. This provides some parallelism; however, the two connections are
            more or less synchronized per query.

           Default is to use asynchronous queries to enable some parallelism.

       "--checksum-computation=(create|insert)" or "--cc=..."
            How to create the checksum table.  Use create to use a "CREATE ... AS SELECT ..." query, or insert to
            use a "CREATE ...; INSERT ... SELECT ..." query.  The former requires an additional count query to
            get the table size, so in the end there are two queries anyway.  There is a type size issue with the
            insert strategy on MySQL: the cumulated key string length must be under 64 bytes.

           Default is create because it always works for both databases.

       "--checksum-function=fun" or "--cf=fun" or "-c fun"
           Checksum function to use, either ck or md5.   For  PostgreSQL,  MySQL  and  SQLite  the  provided  ck
           checksum functions must be loaded into the target databases.  Choosing md5 does not come free either:
           the  provided  cast  functions  must  be loaded into the target databases and the computation is more
           expensive.

           Default is ck, which is faster, especially if  the  operation  is  cpu-bound  and  the  bandwidth  is
           reasonably high.

       "--checksum-size=n" or "--check-size=n" or "--cs=n" or "-z n"
            Tuple checksum size, must be 2, 4 or 8 bytes.  The key checksum is always 4 bytes long.

           Default  is  8,  so  that  the  false negative probability is very low.  There should be no reason to
           change that.

       "--cleanup"
           Drop checksum and summary tables beforehand.  Useful after a run with "--no-temp"  and  "--no-clear",
           typically used for debugging.

           Default is not to drop because it is not needed.
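
            For instance, a debugging session might first keep the comparison tables for inspection and then drop
            them on a later run (a sketch; conn1 and conn2 stand for the two connection URLs of the SYNOPSIS):

              # keep checksum and summary tables after the run
              ./pg_comparator --no-temporary --no-clear conn1 conn2
              # ... inspect the pgc_cmp_* tables by hand ...
              # a later run drops the leftover tables beforehand
              ./pg_comparator --cleanup conn1 conn2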

       "--clear"
            Drop checksum and summary tables explicitly after the computation.  Note that they are dropped
            implicitly by default when the connection is closed as they are temporary, see the "--(no-)temporary"
            option.  This option is useful for debugging.

            Default is not to clear the checksum and summary tables explicitly, as it is not needed.

       "--debug" or "-d"
            Set debug mode.  Repeat for higher debug levels. See also "--verbose".  Beware that some safeguards
            about option settings are skipped under debug so as to allow testing under different conditions.

           Default is not to run in debug mode.

       "--env-pass='var'"
           Take password from environment variables "var1", "var2" or "var" for connection one,  two,  or  both.
           This is tried before asking interactively if "--ask-pass" is also set.

           Default is not to look for passwords from environment variables.
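
            For instance, with "--env-pass=my_pass" (the variable base name is only an illustration),
            per-connection passwords could be supplied as:

              export my_pass1='first-connection-password'
              export my_pass2='second-connection-password'
              ./pg_comparator --env-pass=my_pass conn1 conn2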

       "--expect n" or "-e n"
           Total  number  of differences to expect (updates, deletes and inserts).  This option is only used for
           non regression tests. See the TESTS section.

       "--folding-factor=7" or "-f 7"
            Folding factor: log2 of the number of rows grouped together at each stage, starting from the leaves
            so that the first round always groups as many records as possible. The power of two allows the use of
            masked computations.  The minimum value of 1 builds a binary tree.

           Default folding factor log2 is 7, i.e. size 128 folds.  This default  value  was  chosen  after  some
           basic  tests  on  medium-size  cases  with  medium  or  low bandwidth. Values from 4 to 8 should be a
           reasonable choice for most settings.

       "--help" or "-h"
           Show short help.

       "--key-checksum='kcs'" or "--kcs=..."
            Use the key checksum attribute of this name, which must already be available in the tables to
            compare.  This option also requires option "--tuple-checksum".  See also the EXAMPLES section below
            for how to set a checksum trigger.  Consider "--use-key" instead if you already have a reasonably
            distributed integer primary key.

           Default is to build both key and tuple checksums on the fly.

       "--lock", "--no-lock"
            Whether to lock tables.  Setting the option explicitly overrides the default one way or another.
           For PostgreSQL, this option requires "--transaction", which is enabled by default.

           Default depends on the current operation: the table is not locked for a comparison, but it is  locked
           for a synchronization.

       "--long-read-len=0" or "-L 0"
           Set  max  size  for  fetched  binary  large  objects.   Well,  it seems to be ignored at least by the
           PostgreSQL driver.

           Default is to keep the default value set by the driver.

       "--man" or "-m"
           Show manual page interactively in the terminal.

       "--max-ratio=0.1"
            Maximum relative search effort. The search is stopped if the number of results is above this
            threshold expressed relative to the table size.  Use 2.0 for no limit (all tuples were deleted and
            new ones are inserted).

           Default is 0.1, i.e. an overall 10% difference is allowed before giving up.
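
            For instance, to keep searching even if the tables turn out to be almost completely different
            (conn1 and conn2 stand for the connection URLs):

              ./pg_comparator --max-ratio=2.0 conn1 conn2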

       "--max-report=n"
           Maximum absolute search effort. The search is stopped if the number of differences goes  beyond  this
           threshold.  If  set,  the  previous "--max-ratio" option is ignored, otherwise the effort is computed
           with the ratio once the table size is known.

           Default is to compute the maximum number of reported differences based on the "--max-ratio" option.

       "--max-levels=0"
            Maximum number of levels used. Allows cutting off folding. 0 means no cut-off.  Setting a value of 1
            would only use the checksum table, without summaries.  A value of 3 or 4 would be reasonable, as the
            last levels of the tree are nice for the theoretical complexity formula, but do not improve
            performance in practice.

           Default is 0.

       "--null='text'"
           How  to handle NULL values. Either hash to hash all values, where NULL has one special hash value, or
           text where NULL values are substituted by the "NULL" string.

           Default is text because it is faster.

       "--option" or "-o"
           Show option summary.

       "--prefix='pgc_cmp'"
            Name prefix, possibly schema qualified, used for generated comparison tables by appending numbers to
            it.  Consider changing the prefix if you expect several comparisons to run concurrently against the
            same database.

           Default is "pgc_cmp". Cheksum tables is named "pgc_cmp_1_0" and "pgc_cmp_2_0", and summary tables are
           named by increasing the last number.
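
            For instance, two comparisons running concurrently against the same database could use distinct
            prefixes, here schema-qualified under a hypothetical tmp schema:

              ./pg_comparator --prefix=tmp.pgc_cmp_a conn1 conn2 &
              ./pg_comparator --prefix=tmp.pgc_cmp_b conn3 conn4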

       "--report", "--no-report"
           Report differing keys to stdout as they are found.

           Default is to report.

       "--separator='|'" or "-s '|'"
           Separator string or character used when concatenating key columns for computing checksums.

           Defaults to the pipe '|' character.

       "--size=n"
            Assume this value as the table size. For the algorithm to perform well, it is sufficient that this
            size is of the same order of magnitude as the actual table size.

           Default is to query the table sizes, which is skipped if this option is set.

       "--source-1='DBI:...'", "--source-2='...'" or "-1 '...'", "-2 '...'"
            Take full control of the DBI data source specification and mostly ignore the authentication part of
            the source or target URLs.  One can connect with "DBI:Pg:service=backup", use an alternate driver,
            set any option allowed by the driver...  See the "DBD::Pg" and "DBD::mysql" manuals for the various
            options that can be set through the DBI data source specification.  However, the database server
            specified in the URL must be consistent with this source specification so that the queries' syntax is
            the right one.

           Default is to rely on the two URL arguments.
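
            For instance, the first side could be connected through a PostgreSQL service definition (a sketch;
            the backup service name is illustrative) while the URL still carries the driver, table, key and
            column information:

              ./pg_comparator -1 'DBI:Pg:service=backup' \
                  pgsql:///family/calvin?id:c1,c2 conn2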

       "--stats=(txt|csv)"
            Show various statistics about the comparison performed in this format.  Also, option "--stats-name"
            gives the test a name, useful for generating CSV files that can be processed automatically.

           Default  is  not  to  show  statistics,  because  it  requires additional synchronizations and is not
           necessarily interesting to the user.
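
            For instance, a named run producing CSV statistics (the run name is only an illustration):

              ./pg_comparator --stats=csv --stats-name=nightly_check conn1 conn2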

       "--synchronize" or "-S"
           Actually perform operations to synchronize the second table wrt the first.  Well, not really.  It  is
           only done if you add "--do-it" or "-D".  Save your data before attempting anything like that!

           Default is not to synchronize.

       "--temporary", "--no-temporary"
           Whether  to  use  temporary  tables. If you don't, the tables are kept by default at the end, so they
           will have to be deleted by hand. See "--clear" option to request a cleanup. This option is useful for
           debugging.

           Default is to use temporary tables that are automatically wiped out when the connection is closed.

       "--threads" or "-T", "--no-threads" or "-N"
           Highly EXPERIMENTAL feature.

            Try to use threads to perform computations in parallel, with some hocus-pocus because the Perl
            thread model does not really work well with DBI.  Perl threads are rather heavy and slow, more like
            communicating processes than lightweight threads, really.

           This does NOT work at all with PostgreSQL.  It works partially with MySQL, at the  price  of  turning
           off "--transaction".

           Default is not to use threads, as it does not work for all databases.

       "--timeout n"
           Timeout comparison after "n" seconds.

           Default is no timeout. Be patient.

       "--transaction", "--no-transaction"
           Whether to wrap the whole algorithm in a single transaction.

           Default is to use a wrapping transaction, as it seems to be both faster and safer to do so.

       "--tuple-checksum='tcs'" or "--tcs=..."
            Use the tuple checksum attribute of this name, which must already be available in the tables to
            compare.  This option also requires setting either "--use-key" or "--key-checksum=..." above.  The
            provided checksum attributes must not appear in the lists of key and value columns.  See also the
            EXAMPLES section below for how to set a checksum trigger.

           Default is to build both key and tuple checksums on the fly.

       "--use-key" or "-u"
           Whether to directly use the value of the key to distribute tuples among branches.  The  key  must  be
           simple,  integer,  not NULL, and evenly distributed.  If you have a reasonably spread integer primary
           key, consider using this option to avoid half of the checksum table hash computations.

           Default is to hash the key, so as to handle any type, composition and distribution.

       "--use-null", "--no-use-null"
           Whether to use the information that a column  is  declared  NOT  NULL  to  simplify  computations  by
           avoiding calls to COALESCE to handle NULL values.

           Default is to use this information, at the price of querying table metadata.

       "--verbose" or "-v"
           Be verbose about what is happening. The more you ask, the more verbose.

           Default is to be quiet, so that possible warnings or errors stand out.

       "--version" or "-V"
           Show version information and exit.

       "--where=..."
            SQL boolean condition on table tuples for partial comparison.  Useful for reducing the load if you
            know that expected differences are in some parts of your data, say those time-stamped today...  The
            same condition is passed on both sides, so both tables must be pretty similar for it to work. This is
            usually the case.

           Default is to compare whole tables.
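
            For instance, assuming both tables carry an mtime timestamp column (the column name is illustrative),
            the comparison could be restricted to rows modified today:

              ./pg_comparator --where='mtime >= CURRENT_DATE' conn1 conn2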

ARGUMENTS

       The two arguments describe database connections with the following URL-like syntax, where square brackets
       denote optional parts. Many parts are optional with a default.  The minimum syntactically correct
       specification is "/", but that does not necessarily mean anything useful.

         [driver://][login[:pass]@][host][:port]/[base/[[schema.]table[?key[:cols]]]]

       See the EXAMPLES section below, and also the "--source-*" options above.

       Note that some default values used by DBI drivers may be changed with driver-specific environment
       variables, and that DBI also provides its own defaults and overrides, so what actually happens may not
       always be clear.

       driver
            Database driver to use.  Use pgsql for PostgreSQL, mysql for MySQL, sqlite for SQLite.  Heterogeneous
            databases may be compared and synchronized, however beware that subtle typing, encoding and casting
            issues may prevent heterogeneous comparisons or synchronizations from succeeding.  Default is pgsql
            for the first connection, and same as first for second.

           For  SQLite,  the  authentication  part of the URL (login, pass, host, port) is expected to be empty,
           thus the full URL should look like:

             sqlite:///base.db/table?key,col:other,columns

            Moreover, setting the PGC_SQLITE_LOAD_EXTENSION environment variable to a ":"-separated list of
            shared object files loads these extensions into SQLite.
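
            For instance, assuming the provided SQLite checksum extension has been compiled as a shared object
            (the path is illustrative):

              export PGC_SQLITE_LOAD_EXTENSION=/path/to/sqlite_checksum.so
              ./pg_comparator sqlite:///base.db/table?key,col:other,columns conn2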

       login
           Login to use when connecting to database. Default is username for first connection, and same as first
           connection for second.

       pass
           Password to use when connecting to database.  Note that it is a bad idea  to  put  a  password  as  a
           command  argument.   Default  is  none  for  the first connection, and the same password as the first
           connection for the second if the connection targets the same host, port and uses the same login.  See
           also "--ask-pass" and "--env-pass" options.

       host
           Hostname or IP to connect to.  Default is the empty string, which means connecting to the database on
           localhost with a UNIX socket.

       port
            TCP/IP port to connect to.  Default is 5432 for PostgreSQL and 3306 for MySQL.

       base
           Database catalog to connect to. Default is username for first connection.  Default is same  as  first
           connection  for  second connection.  For SQLite, provide the database file name. The path is relative
           by default, but can be made absolute by prepending an additional '/':

             sqlite:////var/cache/sqlite/base.db/table?...

       schema.table
           The possibly schema-qualified table to use for comparison.  No default for first connection.  Default
           is same as first connection for second connection.

           Note that MySQL does not have schemas, but strangely enough their database concept  is  just  like  a
           schema,  so  MySQL  really  does  not  have databases, although there is something of that name. Am I
           clear?

       keys
           Comma-separated list of key columns.  Default is table primary key for first connection.  Default  is
           same as first connection for second connection.  The key cannot be empty. If you do not have a way of
           identifying your tuples, then there is no point in looking for differences.

       cols
           Comma-separated  list of columns to compare. May be empty.  Default is all columns but keys for first
           connection.  Default is same as first connection for second connection.  Beware that "...?key:" means
           an empty cols, while "...?key" sets the default by querying table metadata.

EXAMPLES

       Compare tables calvin and hobbes in database family on localhost, with key id and columns c1 and c2:

         ./pg_comparator /family/calvin?id:c1,c2 /family/hobbes

       Compare table calvin in the default database on localhost and the same table in the default database on
       sablons, with key id and column data:

         ./pg_comparator localhost/family/calvin?id:data sablons/

       Synchronize "user" table in database "wikipedia" from MySQL on "server1" to PostgreSQL on "server2".

         ./pg_comparator -S -D --ask-pass \
             mysql://calvin@server1/wikipedia/user pgsql://hobbes@server2/

       For PostgreSQL, you may add trigger-maintained key and tuple checksums as:

         -- TABLE Foo(id SERIAL PRIMARY KEY, data ... NOT NULL);
         -- add a key and tuple checksum attributes
         -- the key checksum can be skipped if you use --use-key,
         -- for which the key must be a simple NOT NULL integer.
         ALTER TABLE Foo
           ADD COLUMN key_cs INT4 NOT NULL DEFAULT 0,
           ADD COLUMN tup_cs INT8 NOT NULL DEFAULT 0;
         -- function to update the tuple checksum
         -- if some attributes may be NULL, they must be coalesced
         CREATE FUNCTION foo_cs() RETURNS TRIGGER AS $$
           BEGIN
             -- compute key checksum
             NEW.key_cs = cksum4(NEW.id);
             -- compute tuple checksum
             NEW.tup_cs = cksum8(NEW.id || '|' || NEW.data);
             RETURN NEW;
           END; $$ LANGUAGE plpgsql;
         -- set trigger to call the checksum update function
         CREATE TRIGGER foo_cs_trigger
           BEFORE UPDATE OR INSERT ON Foo
           FOR EACH ROW EXECUTE PROCEDURE foo_cs();
         -- if table Foo is not initially empty,
         -- update its contents to trigger checksum computations
         UPDATE Foo SET id=id;

       Then a fast comparison, which does not need to compute the initial checksum table, can be requested with:

         ./pg_comparator --tcs=tup_cs --kcs=key_cs \
             admin@server1/app/Foo?id:data hobbes@server2/

       As the primary key is a simple integer, the key_cs could be left out and the comparison could be launched
       with:

         ./pg_comparator --tcs=tup_cs --use-key \
             admin@server1/app/Foo?id:data hobbes@server2/

OUTPUT

       The output of the command consists of lines describing the differences found between the two tables. They
       are expressed in terms of insertions, updates or deletes and of tuple keys.

       UPDATE k
           Key k tuple is updated from table 1 to table 2.  It exists in both tables with different values.

       INSERT k
           Key k tuple does not appear in table 2, but only in table 1.  It must  be  inserted  in  table  2  to
           synchronize it wrt table 1.

       DELETE k
           Key  k tuple appears in table 2, but not in table 1.  It must be deleted from 2 to synchronize it wrt
           table 1.

       In case of tuple checksum collisions, false negative results may occur.  Changing the  checksum  function
       would help in such cases.  See the ANALYSIS sub-section.

DEPENDENCIES

       Three support functions are needed on the database:

       1.
         The "COALESCE" function takes care of NULL values in columns.

       2.
          A checksum function must be used to reduce and distribute key and column values. It may be changed
          with the "--checksum-function" option.  Its size can be selected with the "--checksum-size" option
          (currently 2, 4 or 8 bytes). The checksums also require casts to be converted to integers of various
          sizes.

          Suitable implementations are available for PostgreSQL and can be loaded into the server by processing
          "share/contrib/pgc_checksum.sql" and "share/contrib/pgc_casts.sql" (a loading sketch is given after
          this list). New checksums and casts are also available for MySQL, see "mysql_*.sql". A loadable
          implementation of suitable checksum functions is also available for SQLite, see "sqlite_checksum.*".

       3.
         An aggregate function is used to summarize checksums for a range of  rows.   It  must  operate  on  the
         result of the checksum function.  It may be changed with the "--aggregate" option.

          Suitable implementations of an exclusive-or "xor" aggregate are available for PostgreSQL and can be
          loaded into the server by processing "share/contrib/xor_aggregate.sql".

         The "sqlite_checksum.*" file also provides a "xor" and "sum" aggregates for SQLite that are  compatible
         with other databases.
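
        For PostgreSQL, these support functions, casts and aggregates could typically be loaded with psql (a
        sketch, assuming the SQL files are available under share/contrib and that psql can connect to the target
        database base):

          psql -f share/contrib/pgc_checksum.sql  base
          psql -f share/contrib/pgc_casts.sql     base
          psql -f share/contrib/xor_aggregate.sql base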

       Moreover, several Perl modules are useful to run this script:

       •   "Getopt::Long" for option management.

       •   "DBI",  "DBD::Pg"  to  connect  to PostgreSQL, "DBD::mysql" to connect to MySQL, and "DBD::SQLite" to
           connect to SQLite.

       •   "Term::ReadPassword" for "--ask-pass" option.

       •   "Pod::Usage" for doc self-extraction ("--man" "--opt" "--help").

       •   "threads" for the experimental threaded version with option "--threads".

       •   "Digest::MD5" for md5 checksum with SQLite.

       Modules are only loaded by the script if they are actually required.

ALGORITHM

       The aim of the algorithm is to compare the content of two tables, possibly on different  remote  servers,
       with minimum network traffic.  It is performed in three phases.

       1.
         A checksum table is computed on each side for the target table.

       2.
          A first-level summary table is computed on each side by aggregating chunks of the checksum table.
          Further levels of summary aggregation are then performed until there is only one row in the last
          table, which then stores a global checksum for the whole initial target table.

       3.
         Starting  from  the upper summary tables, aggregated checksums are compared from both sides to look for
         differences, down to the initial checksum table.  Keys of differing tuples are displayed.

   CHECKSUM TABLE
        The first phase computes the initial checksum table T(0) on each side.  Assuming that key denotes the
        table key columns, and cols the table data columns that are to be checked for differences, it is
        performed by querying target table T as follows:

         CREATE TABLE T(0) AS
           SELECT key AS pk,                   -- primary key
                  checksum(key) AS kcs,        -- key checksum
                  checksum(key || cols) AS tcs -- tuple checksum
            FROM T;

        The initial key is kept, as it will be used to show differing keys at the end. The rationale for the kcs
        column is to randomize the key value distribution so as to balance aggregates in the next phase.  The
        key must also appear in the tuple checksum, otherwise contents exchanged between two keys would not be
        detected in some cases.

   SUMMARY TABLES
       Now  we  compute  a  set of cascading summary tables by grouping f (folding factor) checksums together at
       each stage. The grouping is based on a mask  on  the  kcs  column  to  take  advantage  of  the  checksum
       randomization. Starting from p=0 we build:

         CREATE TABLE T(p+1) AS
           SELECT kcs & mask(p+1) AS kcs, -- key checksum subset
                  XOR(tcs) AS tcs         -- tuple checksum summary
           FROM T(p)
           GROUP BY kcs & mask(p+1);

        The mask(p) is defined so that it groups on average f checksums together: mask(0) = ceil2(size);
        mask(p) = mask(p-1)/f.  This leads to a hierarchy of tables, each one being a smaller summary of the
        previous one:

       level 0
           checksum table, size rows, i.e. as many rows as the target table.

       level 1
           first summary table, (size/f) rows.

       level p
           intermediate summary table, (size/f**p) rows.

       level n-1
            next-to-last summary table, fewer than f rows.

       level n
           last summary table, mask is 0, 1 row.

        It is important that the very same masks are used on both sides so that aggregations are the same,
        allowing matching contents on both sides to be compared.

   SEARCH FOR DIFFERENCES
        After all these support tables are built on both sides comes the search for differences. When checking
        the checksum summary of the last tables (level n), which hold a single row each, it is basically a
        comparison of the checksums of the whole table contents. If they match, then both tables are equal, and
        we are done.  Otherwise, if these checksums differ, some investigation is needed to detect offending
        keys.

        The investigation is performed by going down the table hierarchy and looking for all kcs for which there
        was a difference in the checksum on the previous level. The same query is performed on both sides at
        each stage:

         SELECT kcs, tcs
         FROM T(p)
         WHERE kcs & mask(p+1) IN (kcs-with-diff-checksums-from-level-p+1)
         ORDER BY kcs [and on level 0: , id];

        And the results from both sides are merged together.  During the merge procedure, four cases can arise:

       1.
         Both kcs and tcs match. Then there is no difference.

       2.
         Although  kcs  does  match, tcs does not. Then this kcs is to be investigated at the next level, as the
         checksum summary differs.  If we are already at the last level, then the offending key can be shown.

       3.
          There is no kcs match, with one supplemental kcs on the first side.  This kcs corresponds to key(s)
          that must be inserted for syncing the second table wrt the first.

       4.
          There is no kcs match, with one supplemental kcs on the second side.  This kcs corresponds to key(s)
          that must be deleted for syncing the second table wrt the first.

       Cases  3 and 4 are simply symmetrical, and it is only an interpretation to decide whether it is an insert
       or a delete, taking the first side as the reference.

   ANALYSIS
        Let n be the number of rows, r the row size, f the folding factor, k the number of differences to be
        detected, and c the checksum size in bits; then the costs to identify differences and the error rate
        are:

       network volume
          is better than k*f*ceil(log(n)/log(f))*(c+log(n)): the contents of k blocks of size f are transferred
          along the depth of the tree, and each block identifier is of size log(n) and contains a checksum c.
          It is independent of r, and you want k<<n.  The volume of the SQL requests is about
          k*log(n)*ceil(log(n)/log(f)), as the list of non-matching checksums k*log(n) may be dragged along the
          tree depth.

       number of requests (on each side, the algorithm is symmetric)
         minimum is 6+ceil(log(n)/log(f)) for equal tables, maximum is 6+2*ceil(log(n)/log(f)).

       disk I/O traffic
         is about n*r+n*ln(n)*(f/(f-1)).

       false negative probability
          i.e. parts of the tables are considered equal although they differ.  With a perfect checksum function,
          this is the probability of a checksum collision at any point where checksums are computed and should
          have been different: about k*ceil(log(n)/log(f))*2**-c.  For a million row table, expecting 1000
          changes with the default algorithm parameter values, this is about 2**10*3/2**64, that is about one
          chance in 2**52 merge runs.

       The lower the folding factor f the better for the network volume, but  the  higher  the  better  for  the
       number of requests and disk I/Os: the choice of f is a tradeoff.

       The  lower  the  checksum size c, the better for the network volume, but the worse for the false negative
       probability.

       If the available bandwidth is reasonable, the comparison will most likely be cpu-bound: the time is spent
       mainly on computing the initial checksum table.  Thus if you are planning to check for differences  quite
       often,  consider  maintaining  a  tuple checksum with a trigger, and possibly a key checksum as well, and
       invoke with "--tuple-checksum" and either "--key-checksum" or "--use-key".

   IMPLEMENTATION ISSUES
       The checksum implementation gives integers, which are constant length and easy to manipulate afterwards.

       The xor aggregate is a good choice because there is no overflow issue with it, it takes into account  all
       bits  of  the input, and it can easily be defined on any binary data. The sum aggregate is also okay, but
       it requires some kind of underlying integer type.

        NULL values must be handled appropriately.

        The folding factor and all moduli are taken as powers of two so as to use masks.

        There is special management of large chunks of deletes or inserts, which is implemented although not
        detailed in the algorithmic overview and complexity analysis.

        Some effort was made to build a PostgreSQL/MySQL compatible implementation of the algorithm, which added
        hacks to deal with type conversions and other issues.

        This script is reasonably tested, but due to its proof-of-concept nature there are many options, not all
        combinations of which can be tested.

   NOTE
        If the tables to compare are in the same database, a simple SQL query can extract the differences.
        Assuming tables T1 and T2 with primary key id and non-null contents data, their differences are
        summarized by the following query:

               SELECT COALESCE(T1.id, T2.id) AS key,
                 CASE WHEN T1.id IS NULL THEN 'DELETE'
                      WHEN T2.id IS NULL THEN 'INSERT'
                      ELSE 'UPDATE'
                 END AS operation
               FROM T1 FULL JOIN T2 USING (id)
               WHERE T1.id IS NULL      -- DELETE
                  OR T2.id IS NULL      -- INSERT
                  OR T1.data <> T2.data -- UPDATE

   REFERENCES
        A paper was presented at a conference about this tool and its algorithm: Remote Comparison of Database
        Tables by Fabien Coelho, In Third International Conference on Advances in Databases, Knowledge, and Data
        Applications (DBKDA), pp 23-28, St Maarten, The Netherlands Antilles, January 2011.  ISBN:
        978-1-61208-002-4.  Copyright IARIA 2011.  Online at Think Mind
        <http://www.thinkmind.org/index.php?view=article&articleid=dbkda_2011_2_10_30021>.

        The algorithm and script were inspired by Taming the Distributed Database Problem: A Case Study Using
        MySQL by Giuseppe Maxia in Sys Admin vol 13 num 8, Aug 2004, pp 29-40.  See Perl Monks
       <http://www.perlmonks.org/index.pl?node_id=381053> for details.  In  this  paper,  three  algorithms  are
       presented.   The first one compares two tables with a checksum technique.  The second one finds UPDATE or
       INSERT differences based  on  a  2-level  (checksum  and  summary)  table  hierarchy.  The  algorithm  is
       asymmetrical, as different queries are performed on the two tables to compare.  It seems that the network
       traffic  volume  is  in k*(f+(n/f)+r), that it has a probabilistically-buggy merge procedure, and that it
       makes assumptions about the distribution of key values.  The third algorithm looks for DELETE differences
       based on counting, with the implicit assumption that there are only such differences.

       In contrast to this approach, our fully symmetrical algorithm implements all three tasks at once, to find
       UPDATE, DELETE and INSERT between the two tables. The checksum and summary  hierarchical  level  idea  is
       reused and generalized so as to reduce the algorithmic complexity.

       From  the implementation standpoint, the script is as parametric as possible with many options, and makes
       few assumptions about table structures, types and values.

SEE ALSO

       Michael Nacos made a robust implementation pg51g <http://pgdba.net/pg51g/> based  on  triggers.  He  also
       noted  that although database contents are compared by the algorithm, the database schema differences can
       also be detected by comparing system tables which describe them.

        Benjamin Mead Vandiver's PhD Thesis Detecting and Tolerating Byzantine Faults in Database Systems,
        Massachusetts Institute of Technology, May 2008 (report number MIT-CSAIL-TR-2008-040).  There is an
       interesting  discussion  in  Chapter  7,  where  experiments  are  presented   with   a   Java/JDBC/MySQL
       implementation of two algorithms, including this one.

        Baron Schwartz discusses comparison algorithms in an online post
        <http://tinyurl.com/mysql-data-diff-algorithm>.

       Some more links:

       • Adept SQL <http://www.adeptsql.com/>

       • Altova Database Spy <http://www.altova.com/databasespy/>

       • AUI Soft SQLMerger <http://auisoft.com/sqlmerger/>

       • Citrus Tech Data Comparison <http://www.citrustechnology.com/solutions/data-comparison>

       • Clever Components dbcomparer <http://www.clevercomponents.com/products/dbcomparer/>

       • Comparezilla <http://comparezilla.sourceforge.net/>

       • Datanamic Datadiff <http://www.datanamic.com/datadiff/>

       • DB Balance <http://www.dbbalance.com/db_comparison.htm>

       • DBSolo datacomp <http://www.dbsolo.com/datacomp.html>

       • dbForge Data Compare <http://www.devart.com/dbforge/sql/datacompare/>

       • DiffKit <http://www.diffkit.org/>

       • DKGAS DBDiff <http://www.dkgas.com/dbdiff.htm>

        • Maatkit mk-table-sync <http://code.google.com/p/maatkit/>

       • MySQL DBCompare <http://dev.mysql.com/doc/workbench/en/mysqldbcompare.html>

       • List of SQL Server Tools <http://www.programurl.com/software/sql-server-comparison.htm>

       • SQL Server tablediff Utility <http://msdn.microsoft.com/en-US/library/ms162843.aspx>

       • Red Gate SQL Data Compare <http://www.red-gate.com/products/sql-development/sql-data-compare/>

        • Spectral Core OmegaSync <http://www.spectralcore.com/omegasync/>

       • SQL Delta <http://www.sqldelta.com/>

       • AlfaAlfa SQL Server Comparison Tool <http://www.sql-server-tool.com/>

       • SQLyog MySQL GUI <http://www.webyog.com/>

       • xSQL Software Data Compare <http://www.xsqlsoftware.com/Product/Sql_Data_Compare.aspx>

TESTS

        The paper reports numerous performance tests with PostgreSQL under various bandwidth constraints.

       Moreover, non regression tests are run over randomly generated tables when the software is upgraded:

       sanity - about 30 seconds & 30 runs
            Run a comparison, synchronization & check for all database combinations and all working asynchronous
            queries and threading options.

       fast - about 5 minutes & 360 runs
            Run 12 tests similar to the previous one with varying options (number of key columns, number of
            value columns, aggregate function, checksum function, null handling, folding factor, table locking
            or not...).

       feature - about 5 minutes & 168 or 474 runs
           Test  various features: cc for checksum computation strategies, auto for trigger-maintained checksums
           on PostgreSQL, empty for corner cases with empty tables, quote for table quoting, engine  for  InnoDB
           vs  MyISAM  MySQL backends, width for large columns, nullkey for possible NULL values in keys, sqlite
           for  SQLite  test,  mylite  for  SQLite/MySQL  mixed  mode  with  some   restrictions,   pglite   for
           SQLite/PostgreSQL mixed mode with some restrictions.

       release - about 20 minutes & 938 runs
           This is feature with two table sizes, fast, and collisions to test possible hash collisions.

       hour - about 1 hour & 2880 runs
            A combination of 8 fast validations with varying table sizes and difference ratios ranging from 0.1%
            to 99.9%.

       full - about 6 hours & 16128 runs... seldom run
           A combinatorial test involving numerous options: aggregation,  checksums,  null  handling,  foldings,
           number of key and value attributes...

BUGS

        All software has bugs. This is software, hence it has bugs.

       Reporting bugs is good practice, so tell me if you find one.  If you have a fix, this is even better!

       The implementation does not do many sanity checks.

       Although  the  algorithm  can  work  with  some  normalized columns (say strings are trimmed, lowercased,
       Unicode normalized...), the implementation may not work at all.

        The script is only really tested with integer and text types; issues may arise with other types.

        The script handles one table at a time. In order to synchronize several linked tables, you must disable
        referential integrity checks, then synchronize each table, then re-enable the checks.

        There is no real attempt at doing sensible identifier quoting, although quotes provided in the
        connection URL are kept, so it may work after all for simple enough cases.

        There is no neat user interface; this is an earthly command-line tool.  This is not a bug, but a
        feature.

       There are too many options.

       Using another language such as Python for this application seems attractive,  but  there  is  no  cleanly
       integrated manual-page style support such as POD, and the documentation is 50% of this script.

       Mixed  SQLite  vs  PostgreSQL  or  MySQL  table comparison may not work properly in all cases, because of
       SQLite dynamic type handling and reduced capabilities.

VERSIONS

        See PG Foundry <http://pgfoundry.org/projects/pg-comparator/> for the latest version, and my web site
        <http://www.coelho.net/pg_comparator/> for the tool.

       version 2.2.2 (r1485 on 2014-01-08)
           Fix  some  warnings  reported  by  Ivan  Mincik.   Minor doc changes.  The release validation was run
           successfully on PostgreSQL 9.3.2 and MySQL 5.5.34.

       version 2.2.1 (r1480 on 2013-05-09)
           Do not die on missing driver in URL, regression reported by Ivan Mincik.  The release validation  was
           run successfully on PostgreSQL 9.2.4 and MySQL 5.5.31.

       version 2.2.0 (r1473 on 2013-03-07)
           Bug  fix  by Robert Coup, which was triggered on hash collisions (again).  This bug was introduced in
           2.1.0 when getting rid of the key separator, and not caught by the validation.  Factor  out  database
           dependencies  in  a  separate  data  structure,  so  that adding new targets should be simpler in the
           future.  Add SQLite support.  Add experimental Firebird support.  Fix some warnings.  Update "cksum8"
           function to propagate the first checksum half into the computation  of  the  second  half.   Improved
           documentation.   Improved  validation,  in  particular  with a collisions test.  The release and hour
           validations were run successfully on PostgreSQL 9.2.3 and MySQL 5.5.29.

       version 2.1.2 (r1402 on 2012-10-28)
           Fix an issue when table names were quoted, raised by Robert Coup.  Improved documentation, especially
           Section "SEE ALSO".  More precise warning.  Improved validation.  The release  and  hour  validations
           were run successfully on PostgreSQL 9.2.1 and MySQL 5.5.27.

       version 2.1.1 (r1375 on 2012-08-20)
           Synchronization  now handles possible NULLs in keys.  Warn if key is nullable or not an integer under
           "--use-key".  Improved documentation, in particular non regression tests are described.  The  release
           and hour validations were run successfully on PostgreSQL 9.1.4 and MySQL 5.5.24.

       version 2.1.0 (r1333 on 2012-08-18)
           Add "--tuple-checksum" and "--key-checksum" options so as to use existing possibly trigger-maintained
           checksums  in  the  target tables instead of computing them on the fly.  Add "--checksum-computation"
           option to control how the checksum table is built, either "CREATE ... AS ..." or "CREATE ...;  INSERT
           ...".   For MySQL, rely directly on the count returned by "CREATE ... AS" if available.  Add "--lock"
           option for locking tables, which is enabled when synchronizing.  Improve asynchronous query handling,
           especially when creating checksum tables and getting initial table counts, and in some  other  cases.
           Remove  redundant  data  transfers  from  checksum  table  under  option "--use-key".  Get rid of the
           separator when retrieving keys of differing tuples.  Note  that  it  is  still  used  when  computing
           checksums.   Fix  bug in bulk insert and delete key recovery under option "--use-key".  Fix potential
           bug in handling complex conditions with "--where".  Change default prefix to pgc_cmp so  that  it  is
           clearer  that  it  belongs  to  pg_comparator.   Fix  initial  count  query  which  was  not  started
           asynchronously under "--tcs".  Ensure that if not null detection is in doubt,  a  column  is  assumed
           nullable  and  thus is coalesced.  Fix query counters so that they are shared under "--threads".  Fix
           threading for explicit cleanup phase.  Warn if nullable key attributes are encountered.  Make default
           driver for second connection be the same as first.  Rename option "--assume-size" as  "--size".   Add
           short documentation about "--debug".  Multiple "--debug" set DBI tracing levels as well.  Improve the
           difference  computation code so that the algorithm is more readable.  Improve documentation.  Add and
           improve comments in the code.  The release and hour validations were run successfully  on  PostgreSQL
           9.1.4 and MySQL 5.5.24.

       version 2.0.1 (r1159 on 2012-08-10)
           Add  "--source-*"  options  to  allow  taking  over  DBI  data  source specification.  Change default
           aggregate to "sum" so that it  works  as  expected  by  default  when  mixing  PostgreSQL  and  MySQL
           databases.  The  results  are okay with "xor", but more paths than necessary were investigated, which
           can unduly trigger the max report  limit.   Improved  documentation.  In  particular  default  option
           settings  are  provided systematically.  The fast validation was run successfully on PostgreSQL 9.1.4
           and MySQL 5.5.24.

       version 2.0.0 (r1148 on 2012-08-09)
           Use asynchronous queries so as to provide some parallelism  to  the  comparison  without  the  issues
           raised  by threads. It is enabled by default and can be switched off with option "--no-asynchronous".
           Allow  empty  hostname  specification  in  connection  URL  to  use  a  UNIX  socket.   Improve   the
           documentation,  in  particular  the analysis section.  Fix minor typos in the documentation.  Add and
           fix various comments in the code.  The fast validation was run successfully on PostgreSQL  9.1.4  and
           MySQL 5.5.24.

       version 1.8.2 (r1117 on 2012-08-07)
           Bug fix in the merge procedure by Robert Coup that could result in some strange difference reports in
           corner  cases,  when  there  were  collisions  on  the kcs in the initial checksum table.  Fix broken
           synchronization with '|' separator, raised by Aldemir Akpinar.  Warn about possible issues with large
           objects.  Add "--long-read-len" option as a possible way to circumvent such issues.   Try  to  detect
           these issues.  Add a counter for metadata queries.  Minor documentation improvements and fixes.

       version 1.8.1 (r1109 on 2012-03-24)
           Change  default separator again, to '|'.  Fix "--where" option mishandling when counting, pointed out
           by Enrique Corona.

           Post release note: the synchronisation is broken with the default separator in 1.8.1, do not use  it,
           or use --separator='%'.

       version 1.8.0 (r1102 on 2012-01-08)
           Change  default  separator  to  '%', which seems less likely, after issues run into by Emanuel Calvo.
           Add more pointers and documentation.

       version 1.7.0 (r1063 on 2010-11-12)
           Improved documentation.  Enhancement and fix by Maxim Beloivanenko: handle quoted table and attribute
           names; Work around bulk inserts and deletes which  may  be  undefined.   More  stats,  more  precise,
           possibly  in  CSV format.  Add timeout and use-null options.  Fix subtle bug which occurred sometimes
           on kcs collisions in table T(0).

       version 1.6.1 (r754 on 2010-04-16)
            Improved documentation.  Key and columns now default to the primary key and all other columns of the
            table in the first connection.  Password can be supplied from the environment.  Default password for
            second connection always set depending on the first.  Add max ratio option to express the relative
            maximum number of differences.  Compute grouping masks by shifting left instead of right by default
            (that is doing a divide instead of a modulo).  Threads now work a little, although it is still quite
            experimental.  Fix a bug that made Perl see differing checksums although they were equal, in some
            unclear conditions.

       version 1.6.0 (r701 on 2010-04-03)
           Add more functions (MD5, SUM) and sizes (2, 4, 8).  Remove template parameterization  which  is  much
           too   fragile  to  expose.   Add  a  wrapping  transaction  which  may  speed  up  things  a  little.
           Implementation for MySQL, including synchronizing heterogeneous databases.   Improved  documentation.
           Extensive validation/non regression tests.

       version 1.5.2 (r564 on 2010-03-22)
           More  documentation.   Improved  connection  parsing  with  more  sensible  defaults.   Make the mask
           computation match its above documentation with a bottom-up derivation, instead of a simpler  top-down
           formula which results in bad performances when a power of the factor is close to the size (as pointed
           out  in  Benjamin Mead Vandiver's PhD).  This bad mask computation was introduced somehow between 1.3
           and 1.4 as an attempt at simplifying the code.

       version 1.5.1 (r525 on 2010-03-21)
           More documentation.  Add "--expect" option for non regression tests.

       version 1.5.0 (r511 on 2010-03-20)
            Add more links.  Fix so that it works with a key only (i.e. without additional columns), although it
            could be optimized further in this case.  Integrate patch by Erik Aronesty: more friendly "connection
            parser".  Add synchronization option to actually synchronize the data.

       version 1.4.4 (r438 on 2008-06-03)
           Manual connection string parsing.

       version 1.4.3 (r424 on 2008-02-17)
           Grumble! wrong tar pushed out.

       version 1.4.2 (r421 on 2008-02-17)
           Minor makefile fix asked for by Roberto C. Sanchez.

       version 1.4.1 (r417 on 2008-02-14)
           Minor fix for PostgreSQL 8.3 by Roberto C. Sanchez.

       version 1.4 (r411 on 2007-12-24)
           Port  to  PostgreSQL  8.2.  Better  documentation.   Fix  mask  bug: although the returned answer was
           correct, the table folding was not.  DELETE/INSERT messages exchanged so as  to  match  a  'sync'  or
           'copy' semantics, as suggested by Erik Aronesty.

       version 1.3 (r239 on 2004-08-31)
           Project moved to PG Foundry <http://pgfoundry.org/>.  Use cksum8 checksum function by default.  Minor
           doc updates.

       version 1.2 (r220 on 2004-08-27)
           Added "--show-all-keys" option for handling big chunks of deletes or inserts.

       version 1.1 (r210 on 2004-08-26)
            Fix algorithmic bug: checksums must also include the key, otherwise exchanged data could go
            undetected if the keys were grouped together.  Algorithmic section added to manual page.  Thanks to
            Giuseppe Maxia who asked for it.  Various code cleanups.

       version 1.0 (r190 on 2004-08-25)
           Initial revision.

COPYRIGHT

       Copyright (c) 2004-2014, Fabien Coelho <pg dot comparator at coelho dot net> <http://www.coelho.net/>

        This software is distributed under the terms of the BSD Licence.  Basically, you can do whatever you
        want, but you have to keep the license and I'm not responsible for any consequences.  Beware, you may
        lose your data, your friends or your hair because of this software!  See the LICENSE file enclosed with
        the distribution for details.

       If you are very happy with this software, I would appreciate a postcard saying so.  See  my  webpage  for
       current address.

perl v5.18.1                                       2014-01-08                                   pg_comparator(1)