
.. program:: pt-table-sync

==========================
 :program:`pt-table-sync`
==========================

.. highlight:: perl


NAME
====

 :program:`pt-table-sync` - Synchronize |MySQL| table data efficiently.


SYNOPSIS
========


Usage
-----

::

   pt-table-sync [OPTION...] DSN [DSN...]

:program:`pt-table-sync` synchronizes data efficiently between |MySQL| tables.

This tool changes data, so for maximum safety, you should back up your data
before you use it.  When synchronizing a server that is a replication slave with
the --replicate or --sync-to-master methods, it \ **always**\  makes the changes on
the replication master, \ **never**\  the replication slave directly.  This is in
general the only safe way to bring a replica back in sync with its master;
changes to the replica are usually the source of the problems in the first
place.  However, the changes it makes on the master should be no-op changes that
set the data to their current values, and actually affect only the replica.
Please read the detailed documentation that follows to learn more about this.

Sync db.tbl on host1 to host2:


.. code-block:: perl

   pt-table-sync --execute h=host1,D=db,t=tbl h=host2


Sync all tables on host1 to host2 and host3:


.. code-block:: perl

   pt-table-sync --execute host1 host2 host3


Make slave1 have the same data as its replication master:


.. code-block:: perl

   pt-table-sync --execute --sync-to-master slave1


Resolve differences that pt-table-checksum found on all slaves of master1:


.. code-block:: perl

   pt-table-sync --execute --replicate test.checksum master1


Same as above but only resolve differences on slave1:


.. code-block:: perl

   pt-table-sync --execute --replicate test.checksum \
     --sync-to-master slave1


Sync master2 in a master-master replication configuration, where master2's copy
of db.tbl is known or suspected to be incorrect:


.. code-block:: perl

   pt-table-sync --execute --sync-to-master h=master2,D=db,t=tbl


Note that in the master-master configuration, the following will NOT do what you
want, because it will make changes directly on master2, which will then flow
through replication and change master1's data:


.. code-block:: perl

   # Don't do this in a master-master setup!
   pt-table-sync --execute h=master1,D=db,t=tbl master2


RISKS
=====


The following section is included to inform users about the potential risks,
whether known or unknown, of using this tool.  The two main categories of risks
are those created by the nature of the tool (e.g. read-only tools vs. read-write
tools) and those created by bugs.

With great power comes great responsibility!  This tool changes data, so it is a
good idea to back up your data.  It is also very powerful, which means it is
very complex, so you should run it with the :option:`--dry-run` option to see what it
will do, until you're familiar with its operation.  If you want to see which
rows are different, without changing any data, use :option:`--print` instead of
:option:`--execute`.

Be careful when using :program:`pt-table-sync` in any master-master setup.  Master-master
replication is inherently tricky, and it's easy to make mistakes.  You need to
be sure you're using the tool correctly for master-master replication.  See the
"SYNOPSIS" for the overview of the correct usage.

Also be careful with tables that have foreign key constraints with \ ``ON DELETE``\
or \ ``ON UPDATE``\  definitions because these might cause unintended changes on the
child tables.

In general, this tool is best suited when your tables have a primary key or
unique index.  Although it can synchronize data in tables lacking a primary key
or unique index, it might be best to synchronize that data by another means.

At the time of this release, there is a potential bug using
:option:`--lock-and-rename` with |MySQL| 5.1, a bug detecting certain differences,
a bug using ROUND() across different platforms, and a bug mixing collations.

The authoritative source for updated information is always the online issue
tracking system.  Issues that affect this tool will be marked as such.  You can
see a list of such issues at the following URL:
`http://www.percona.com/bugs/pt-table-sync <http://www.percona.com/bugs/pt-table-sync>`_.

See also :ref:`bugs` for more information on filing bugs and getting help.


DESCRIPTION
===========

:program:`pt-table-sync` does one-way and bidirectional synchronization of table data.
It does \ **not**\  synchronize table structures, indexes, or any other schema
objects.  The following describes one-way synchronization.
"BIDIRECTIONAL SYNCING" is described later.

This tool is complex and functions in several different ways.  To use it
safely and effectively, you should understand three things: the purpose
of :option:`--replicate`, finding differences, and specifying hosts.  These
three concepts are closely related and determine how the tool will run.
The following is the abbreviated logic:


.. code-block:: perl

    if DSN has a t part, sync only that table:
       if 1 DSN:
          if --sync-to-master:
             The DSN is a slave.  Connect to its master and sync.
       if more than 1 DSN:
          The first DSN is the source.  Sync each DSN in turn.
    else if --replicate:
       if --sync-to-master:
          The DSN is a slave.  Connect to its master, find records
          of differences, and fix.
       else:
          The DSN is the master.  Find slaves and connect to each,
          find records of differences, and fix.
    else:
       if only 1 DSN and --sync-to-master:
          The DSN is a slave.  Connect to its master, find tables and
          filter with --databases etc, and sync each table to the master.
       else:
          find tables, filtering with --databases etc, and sync each
          DSN to the first.
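
The outline above can also be read as a single decision function.  The
following is a minimal sketch in Python, not the tool's own code: the function
name ``plan_sync`` and the dictionary form of a DSN are hypothetical, and the
error raised for a single table DSN without :option:`--sync-to-master` is an
assumption, because the outline does not cover that case.

.. code-block:: python

   def plan_sync(dsns, replicate=False, sync_to_master=False):
       """Describe what the outline above would do for the given arguments.

       dsns is a list of dicts such as {"h": "host1", "D": "db", "t": "tbl"}.
       Hypothetical helper that mirrors the outline; not the tool's code.
       """
       if not dsns:
           raise ValueError("at least one DSN is required")
       if "t" in dsns[0]:
           if len(dsns) == 1:
               if sync_to_master:
                   return "one table: DSN is a slave, sync via its master"
               # Assumption: the outline does not say what happens here.
               raise ValueError("a single table DSN needs --sync-to-master")
           return "one table: first DSN is the source, sync each other DSN"
       if replicate:
           if sync_to_master:
               return "replicate: DSN is a slave, fix differences via its master"
           return "replicate: DSN is the master, fix differences on each slave"
       if len(dsns) == 1 and sync_to_master:
           return "all tables: DSN is a slave, sync each table to its master"
       return "all tables: sync each DSN to the first"


   print(plan_sync([{"h": "slave1"}], sync_to_master=True))
   print(plan_sync([{"h": "master1"}], replicate=True))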

:program:`pt-table-sync` can run in one of two ways: with :option:`--replicate` or without.

The default is to run without :option:`--replicate` which causes :program:`pt-table-sync`
to automatically find differences efficiently with one of several
algorithms (see "ALGORITHMS").  Alternatively, the value of
:option:`--replicate`, if specified, causes :program:`pt-table-sync` to use the differences
already found by having previously run pt-table-checksum with its own
\ ``--replicate``\  option.  Strictly speaking, you don't need to use
:option:`--replicate` because :program:`pt-table-sync` can find differences, but many
people use :option:`--replicate` if, for example, they checksum regularly
using pt-table-checksum then fix differences as needed with :program:`pt-table-sync`.
If you're unsure, read each tool's documentation carefully and decide for
yourself, or consult with an expert.

Regardless of whether :option:`--replicate` is used or not, you need to specify
which hosts to sync.  There are two ways: with :option:`--sync-to-master` or
without.  Specifying :option:`--sync-to-master` makes :program:`pt-table-sync` expect
one and only one slave DSN on the command line.  The tool will automatically
discover the slave's master and sync it so that its data is the same as
its master.  This is accomplished by making changes on the master which
then flow through replication and update the slave to resolve its differences.
\ **Be careful though**\ : although this option specifies and syncs a single
slave, if there are other slaves on the same master, they will receive
via replication the changes intended for the slave that you're trying to
sync.

Alternatively, if you do not specify :option:`--sync-to-master`, the first
DSN given on the command line is the source host.  There is only ever
one source host.  If you do not also specify :option:`--replicate`, then you
must specify at least one other DSN as the destination host.  There
can be one or more destination hosts.  Source and destination hosts
must be independent; they cannot be in the same replication topology.
:program:`pt-table-sync` will die with an error if it detects that a destination
host is a slave because changes are written directly to destination hosts
(and it's not safe to write directly to slaves).  Or, if you specify
:option:`--replicate` (but not :option:`--sync-to-master`) then :program:`pt-table-sync` expects
one and only one master DSN on the command line.  The tool will automatically
discover all the master's slaves and sync them to the master.  This is
the only way to sync several (all) slaves at once (because
:option:`--sync-to-master` only specifies one slave).

Each host on the command line is specified as a DSN.  The first DSN
(or only DSN for cases like :option:`--sync-to-master`) provides default values
for other DSNs, whether those other DSNs are specified on the command line
or auto-discovered by the tool.  So in this example,


.. code-block:: perl

   pt-table-sync --execute h=host1,u=msandbox,p=msandbox h=host2


the host2 DSN inherits the \ ``u``\  and \ ``p``\  DSN parts from the host1 DSN.
Use the :option:`--explain-hosts` option to see how :program:`pt-table-sync` will interpret the DSNs given on the command line.
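
The inheritance of DSN parts can be pictured as a dictionary merge.  This is
only an illustrative sketch: ``parse_dsn`` and ``inherit`` are hypothetical
helpers, and the sketch handles only ``key=value`` parts, not a bare host name.
Use :option:`--explain-hosts` to see what the tool itself decides.

.. code-block:: python

   def parse_dsn(text):
       """Split 'h=host1,u=user' into {'h': 'host1', 'u': 'user'}."""
       parts = {}
       for item in text.split(","):
           key, _, value = item.partition("=")
           parts[key] = value
       return parts


   def inherit(first, other):
       """Fill parts missing from `other` with the values from `first`."""
       merged = dict(first)
       merged.update(other)
       return merged


   first = parse_dsn("h=host1,u=msandbox,p=msandbox")
   second = inherit(first, parse_dsn("h=host2"))
   print(second)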


OUTPUT
======


If you specify the :option:`--verbose` option, you'll see information about the
differences between the tables.  There is one row per table.  Each server is
printed separately.  For example,


.. code-block:: perl

   # Syncing h=host1,D=test,t=test1
   # DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
   #      0       0      3      0 Chunk     13:00:00 13:00:17 2    test.test1


Table test.test1 on host1 required 3 \ ``INSERT``\  statements to synchronize
and it used the Chunk algorithm (see "ALGORITHMS").  The sync operation
for this table started at 13:00:00 and ended 17 seconds later (times taken
from \ ``NOW()``\  on the source host).  Because differences were found, its
"EXIT STATUS" was 2.

If you specify the :option:`--print` option, you'll see the actual SQL statements
that the script uses to synchronize the table if :option:`--execute` is also
specified.

If you want to see the SQL statements that :program:`pt-table-sync` is using to select
chunks, nibbles, rows, etc., then specify :option:`--print` once and :option:`--verbose`
twice.  Be careful though: this can print a lot of SQL statements.

There are cases where no combination of \ ``INSERT``\ , \ ``UPDATE``\  or \ ``DELETE``\
statements can resolve differences without violating some unique key.  For
example, suppose there's a primary key on column a and a unique key on column b.
Then there is no way to sync these two tables with straightforward UPDATE
statements:


.. code-block:: perl

  +---+---+  +---+---+
  | a | b |  | a | b |
  +---+---+  +---+---+
  | 1 | 2 |  | 1 | 1 |
  | 2 | 1 |  | 2 | 2 |
  +---+---+  +---+---+


The tool rewrites queries to \ ``DELETE``\  and \ ``REPLACE``\  in this case.  This is
automatically handled after the first index violation, so you don't have to
worry about it.
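
The unique key collision can be reproduced in a few lines.  The sketch below
uses Python's built-in ``sqlite3`` module as a stand-in for |MySQL| (an
assumption made so the example is self-contained); the table name ``t`` is
hypothetical.  It shows the plain ``UPDATE`` failing and ``REPLACE`` reaching
the target rows.

.. code-block:: python

   import sqlite3

   conn = sqlite3.connect(":memory:")
   conn.execute("CREATE TABLE t (a INTEGER PRIMARY KEY, b INTEGER UNIQUE)")
   conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 2), (2, 1)])

   # A plain UPDATE toward the target rows (1, 1) and (2, 2) collides with
   # the unique key on b, because b = 1 is still held by the row with a = 2.
   try:
       conn.execute("UPDATE t SET b = 1 WHERE a = 1")
       update_failed = False
   except sqlite3.IntegrityError:
       update_failed = True

   # REPLACE removes whichever rows conflict before inserting the new row.
   conn.execute("REPLACE INTO t VALUES (1, 1)")
   conn.execute("REPLACE INTO t VALUES (2, 2)")
   rows = conn.execute("SELECT a, b FROM t ORDER BY a").fetchall()
   print(update_failed, rows)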


REPLICATION SAFETY
==================


Synchronizing a replication master and slave safely is a non-trivial problem, in
general.  There are all sorts of issues to think about, such as other processes
changing data, trying to change data on the slave, whether the destination and
source are a master-master pair, and much more.

In general, the safe way to do it is to change the data on the master, and let
the changes flow through replication to the slave like any other changes.
However, this works only if it's possible to REPLACE into the table on the
master.  REPLACE works only if there's a unique index on the table (otherwise it
just acts like an ordinary INSERT).
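
The difference a unique index makes to ``REPLACE`` is easy to demonstrate.
The following sketch again uses Python's ``sqlite3`` module as a stand-in for
|MySQL| (an assumption for the sake of a runnable example); the table names
``keyed`` and ``unkeyed`` are hypothetical.

.. code-block:: python

   import sqlite3

   conn = sqlite3.connect(":memory:")
   conn.execute("CREATE TABLE keyed (id INTEGER PRIMARY KEY, name TEXT)")
   conn.execute("CREATE TABLE unkeyed (id INTEGER, name TEXT)")

   for table in ("keyed", "unkeyed"):
       conn.execute(f"INSERT INTO {table} VALUES (1, 'old')")
       conn.execute(f"REPLACE INTO {table} VALUES (1, 'new')")

   # With a unique key the old row is replaced; without one it is kept
   # and the REPLACE simply adds a second row.
   keyed = conn.execute("SELECT id, name FROM keyed").fetchall()
   unkeyed = conn.execute("SELECT id, name FROM unkeyed ORDER BY name").fetchall()
   print(keyed)
   print(unkeyed)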

If your table has unique keys, you should use the :option:`--sync-to-master` and/or
:option:`--replicate` options to sync a slave to its master.  This will generally do
the right thing.  When there is no unique key on the table, there is no choice
but to change the data on the slave, and :program:`pt-table-sync` will detect that you're
trying to do so.  It will complain and die unless you specify
\ ``--no-check-slave``\  (see :option:`--[no]check-slave`).

If you're syncing a table without a primary or unique key on a master-master
pair, you must change the data on the destination server.  Therefore, you need
to specify \ ``--no-bin-log``\  for safety (see :option:`--[no]bin-log`).  If you don't,
the changes you make on the destination server will replicate back to the
source server and change the data there!

The generally safe thing to do on a master-master pair is to use the
:option:`--sync-to-master` option so you don't change the data on the destination
server.  You will also need to specify \ ``--no-check-slave``\  to keep :program:`pt-table-sync` from complaining that it is changing data on a slave.


ALGORITHMS
==========

:program:`pt-table-sync` has a generic data-syncing framework which uses different
algorithms to find differences.  The tool automatically chooses the best
algorithm for each table based on indexes, column types, and the algorithm
preferences specified by :option:`--algorithms`.  The following algorithms are
available, listed in their default order of preference:


  * ``Chunk``

 Finds an index whose first column is numeric (including date and time types),
 and divides the column's range of values into chunks of approximately
 :option:`--chunk-size` rows.  Syncs a chunk at a time by checksumming the entire
 chunk.  If the chunk differs on the source and destination, checksums each
 chunk's rows individually to find the rows that differ.

 It is efficient when the column has sufficient cardinality to make the chunks
 end up about the right size.

 The initial per-chunk checksum is quite small and results in minimal network
 traffic and memory consumption.  If a chunk's rows must be examined, only the
 primary key columns and a checksum are sent over the network, not the entire
 row.  If a row is found to be different, the entire row will be fetched, but not
 before.


  * ``Nibble``

 Finds an index and ascends the index in fixed-size nibbles of :option:`--chunk-size`
 rows, using a non-backtracking algorithm (see pt-archiver for more on this
 algorithm).  It is very similar to "Chunk", but instead of pre-calculating
 the boundaries of each piece of the table based on index cardinality, it uses
 \ ``LIMIT``\  to define each nibble's upper limit, and the previous nibble's upper
 limit to define the lower limit.

 It works in steps: one query finds the row that will define the next nibble's
 upper boundary, and the next query checksums the entire nibble.  If the nibble
 differs between the source and destination, it examines the nibble row-by-row,
 just as "Chunk" does.


  * ``GroupBy``

 Selects the entire table grouped by all columns, with a COUNT(\*) column added.
 Compares all columns, and if they're the same, compares the COUNT(\*) column's
 value to determine how many rows to insert into or delete from the destination.
 Works on tables with no primary key or unique index.


  * ``Stream``

 Selects the entire table in one big stream and compares all columns.  Selects
 all columns.  Much less efficient than the other algorithms, but works when
 there is no suitable index for them to use.


  * ``Future Plans``

 Possibilities for future algorithms are TempTable (what I originally called
 bottom-up in earlier versions of this tool), DrillDown (what I originally
 called top-down), and GroupByPrefix (similar to how SqlYOG Job Agent works).
 Each algorithm has strengths and weaknesses.  If you'd like to implement your
 favorite technique for finding differences between two sources of data on
 possibly different servers, I'm willing to help.  The algorithms adhere to a
 simple interface that makes it pretty easy to write your own.
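
The two-level comparison that "Chunk" performs can be sketched as follows.
This is an illustration under stated assumptions, not the tool's
implementation: tables are modelled as Python dictionaries keyed by an integer
primary key, ``zlib.crc32`` stands in for the checksum function, and combining
row checksums with XOR is an assumption made for brevity.

.. code-block:: python

   import zlib


   def row_checksum(row):
       """CRC32 of a row's values joined into one string."""
       return zlib.crc32("|".join(map(str, row)).encode())


   def chunk_checksum(rows):
       """Combine the per-row checksums of a chunk into a single value."""
       total = 0
       for row in rows:
           total ^= row_checksum(row)
       return total


   def differing_keys(source, dest, chunk_size):
       """Yield primary keys whose rows differ, checking chunk by chunk."""
       keys = sorted(set(source) | set(dest))
       if not keys:
           return
       for low in range(keys[0], keys[-1] + 1, chunk_size):
           high = low + chunk_size
           src = {k: v for k, v in source.items() if low <= k < high}
           dst = {k: v for k, v in dest.items() if low <= k < high}
           # Cheap comparison first: one checksum per chunk.
           if chunk_checksum(src.items()) == chunk_checksum(dst.items()):
               continue
           # Only a differing chunk is examined row by row.
           for key in sorted(set(src) | set(dst)):
               if key not in src or key not in dst:
                   yield key
               elif row_checksum(src[key]) != row_checksum(dst[key]):
                   yield key


   source = {1: ("a",), 2: ("b",), 1001: ("c",)}
   dest = {1: ("a",), 2: ("x",)}
   print(list(differing_keys(source, dest, 1000)))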


BIDIRECTIONAL SYNCING
=====================


Bidirectional syncing is a new, experimental feature.  To make it work
reliably there are a number of strict limitations:


.. code-block:: perl

   * only works when syncing one server to other independent servers
   * does not work in any way with replication
   * requires that the table(s) are chunkable with the Chunk algorithm
   * is not N-way, only bidirectional between two servers at a time
   * does not handle DELETE changes


For example, suppose we have three servers: c1, r1, r2.  c1 is the central
server, a pseudo-master to the other servers (that is, r1 and r2 are not slaves
to c1).  r1 and r2 are remote servers.  Rows in table foo are updated and
inserted on all three servers and we want to synchronize all the changes
between all the servers.  Table foo has columns:


.. code-block:: perl

   id    int PRIMARY KEY
   ts    timestamp auto updated
   name  varchar


Auto-increment offsets are used so that new rows from any server do not
create conflicting primary key (id) values.  In general, newer rows, as
determined by the ts column, take precedence when a same but differing row
is found during the bidirectional sync.  "Same but differing" means that
two rows have the same primary key (id) value but different values for some
other column, like the name column in this example.  Same but differing
conflicts are resolved by a "conflict".  A conflict compares some column of
the competing rows to determine a "winner".  The winning row becomes the
source and its values are used to update the other row.

There are subtle differences between three columns used to achieve
bidirectional syncing that you should be familiar with: chunk column
(:option:`--chunk-column`), comparison column(s) (:option:`--columns`), and conflict
column (:option:`--conflict-column`).  The chunk column is only used to chunk the
table; e.g. "WHERE id >= 5 AND id < 10".  Chunks are checksummed and when
chunk checksums reveal a difference, the tool selects the rows in that
chunk and checksums the :option:`--columns` for each row.  If a column checksum
differs, the rows have one or more conflicting column values.  In a
traditional unidirectional sync, the conflict is a moot point because it can
be resolved simply by updating the entire destination row with the source
row's values.  In a bidirectional sync, however, the :option:`--conflict-column`
(in accordance with other \ ``--conflict-\*``\  options listed below) is compared
to determine which row is "correct" or "authoritative"; this row becomes
the "source".

To sync all three servers completely, two runs of :program:`pt-table-sync` are required.
The first run syncs c1 and r1, then syncs c1 and r2 including any changes
from r1.  At this point c1 and r2 are completely in sync, but r1 is missing
any changes from r2 because c1 didn't have these changes when it and r1
were synced.  So a second run is needed which syncs the servers in the same
order, but this time when c1 and r1 are synced r1 gets r2's changes.
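
A small simulation makes the need for the second run concrete.  In this
sketch each server is modelled as a set of row ids and a sync is modelled as a
set union; both are simplifying assumptions that stand in for real conflict
resolution, and ``bidirectional_run`` is a hypothetical name.

.. code-block:: python

   def bidirectional_run(servers):
       """Sync the first server with each later one in turn.

       servers maps a name to a set of row ids; insertion order matters.
       """
       names = list(servers)
       left = names[0]
       for right in names[1:]:
           merged = servers[left] | servers[right]
           servers[left] = set(merged)
           servers[right] = set(merged)


   servers = {"c1": {1}, "r1": {2}, "r2": {3}}
   bidirectional_run(servers)
   after_first = {name: sorted(rows) for name, rows in servers.items()}
   bidirectional_run(servers)
   after_second = {name: sorted(rows) for name, rows in servers.items()}
   print(after_first)   # r1 still lacks the row that originated on r2
   print(after_second)  # all three servers hold the same rows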

The tool does not sync N-ways, only bidirectionally between the first DSN
given on the command line and each subsequent DSN in turn.  So the tool in
this example would be run twice like this:


.. code-block:: perl

   pt-table-sync --bidirectional h=c1 h=r1 h=r2


The :option:`--bidirectional` option enables this feature and causes various
sanity checks to be performed.  You must specify other options that tell :program:`pt-table-sync` how to resolve conflicts for same but differing rows.
These options are:


.. code-block:: perl

   * --conflict-column
   * --conflict-comparison
   * --conflict-value
   * --conflict-threshold
   * --conflict-error  (optional)


Use :option:`--print` to test this option before :option:`--execute`.  The printed
SQL statements will have comments saying on which host the statement
would be executed if you used :option:`--execute`.

Technical side note: the first DSN is always the "left" server and the other
DSNs are always the "right" server.  Since either server can become the source
or destination it's confusing to think of them as "src" and "dst".  Therefore,
they're generically referred to as left and right.  It's easy to remember
this because the first DSN is always to the left of the other server DSNs on
the command line.


EXIT STATUS
===========


The following are the exit statuses (also called return values, or return codes)
when :program:`pt-table-sync` finishes and exits.


.. code-block:: perl

    STATUS  MEANING
    ======  =======================================================
    0       Success.
    1       Internal error.
    2       At least one table differed on the destination.
    3       Combination of 1 and 2.
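
Because status 3 is the combination of 1 and 2, a calling script can treat the
value as a pair of bit flags.  The following is a minimal sketch of that
reading; the constant and function names are hypothetical.

.. code-block:: python

   INTERNAL_ERROR = 1   # hypothetical name for status bit 1
   TABLES_DIFFERED = 2  # hypothetical name for status bit 2


   def describe_exit_status(status):
       """Return the list of conditions encoded in an exit status."""
       meanings = []
       if status & INTERNAL_ERROR:
           meanings.append("internal error")
       if status & TABLES_DIFFERED:
           meanings.append("at least one table differed")
       return meanings or ["success"]


   for status in (0, 1, 2, 3):
       print(status, describe_exit_status(status))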


OPTIONS
=======


Specify at least one of :option:`--print`, :option:`--execute`, or :option:`--dry-run`.

:option:`--where` and :option:`--replicate` are mutually exclusive.

This tool accepts additional command-line arguments.  Refer to the "SYNOPSIS" and usage information for details.


.. option:: --algorithms

 type: string; default: Chunk,Nibble,GroupBy,Stream

 Algorithm to use when comparing the tables, in order of preference.

 For each table, :program:`pt-table-sync` will check if the table can be synced with
 the given algorithms in the order that they're given.  The first algorithm
 that can sync the table is used.  See "ALGORITHMS".
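
 The "first algorithm that can sync the table" rule can be sketched as below.
 The applicability checks are a rough paraphrase of "ALGORITHMS", and the
 dictionary keys describing a table are hypothetical; the tool's real checks
 are more detailed.

 .. code-block:: python

    def can_sync(algorithm, table):
        """Rough applicability checks paraphrased from "ALGORITHMS"."""
        if algorithm == "Chunk":
            return table["first_index_column_is_numeric"]
        if algorithm == "Nibble":
            return table["has_index"]
        # GroupBy and Stream are described as working without an index.
        return algorithm in ("GroupBy", "Stream")


    def choose_algorithm(preferences, table):
        """Return the first preferred algorithm able to sync the table."""
        for algorithm in preferences.split(","):
            if can_sync(algorithm, table):
                return algorithm
        return None


    table = {"has_index": True, "first_index_column_is_numeric": False}
    print(choose_algorithm("Chunk,Nibble,GroupBy,Stream", table))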


.. option:: --ask-pass

 Prompt for a password when connecting to |MySQL|.


.. option:: --bidirectional

 Enable bidirectional sync between first and subsequent hosts.

 See "BIDIRECTIONAL SYNCING" for more information.


.. option:: --[no]bin-log

 default: yes

 Log to the binary log (\ ``SET SQL_LOG_BIN=1``\ ).

 Specifying \ ``--no-bin-log``\  will \ ``SET SQL_LOG_BIN=0``\ .


.. option:: --buffer-in-mysql

 Instruct |MySQL| to buffer queries in its memory.

 This option adds the \ ``SQL_BUFFER_RESULT``\  option to the comparison queries.
 This causes |MySQL| to execute the queries and place them in a temporary table
 internally before sending the results back to :program:`pt-table-sync`.  The advantage of
 this strategy is that :program:`pt-table-sync` can fetch rows as desired without using a
 lot of memory inside the *Perl*  process, while releasing locks on the |MySQL| table
 (to reduce contention with other queries).  The disadvantage is that it uses
 more memory on the |MySQL| server instead.

 You probably want to leave :option:`--[no]buffer-to-client` enabled too, because
 buffering into a temp table and then fetching it all into *Perl*\ 's memory is
 probably a silly thing to do.  This option is most useful for the GroupBy and
 Stream algorithms, which may fetch a lot of data from the server.


.. option:: --[no]buffer-to-client

 default: yes

 Fetch rows one-by-one from |MySQL| while comparing.

 This option enables \ ``mysql_use_result``\  which causes |MySQL| to hold the selected
 rows on the server until the tool fetches them.  This allows the tool to use
 less memory but may keep the rows locked on the server longer.

 If this option is disabled by specifying \ ``--no-buffer-to-client``\  then
 \ ``mysql_store_result``\  is used which causes |MySQL| to send all selected rows to
 the tool at once.  This may result in the results "cursor" being held open for
 a shorter time on the server, but if the tables are large, it could take a long
 time anyway, and use all your memory.

 For most non-trivial data sizes, you want to leave this option enabled.

 This option is disabled when :option:`--bidirectional` is used.


.. option:: --charset

 short form: -A; type: string

 Default character set.  If the value is utf8, sets *Perl*\ 's binmode on
 ``STDOUT`` to utf8, passes the mysql_enable_utf8 option to ``DBD::mysql``, and
 runs SET NAMES UTF8 after connecting to |MySQL|.  Any other value sets
 binmode on ``STDOUT`` without the utf8 layer, and runs SET NAMES after
 connecting to |MySQL|.


.. option:: --[no]check-master

 default: yes

 With :option:`--sync-to-master`, try to verify that the detected
 master is the real master.


.. option:: --[no]check-privileges

 default: yes

 Check that user has all necessary privileges on source and destination table.


.. option:: --[no]check-slave

 default: yes

 Check whether the destination server is a slave.

 If the destination server is a slave, it's generally unsafe to make changes on
 it.  However, sometimes you have to; :option:`--replace` won't work unless there's a
 unique index, for example, so you can't make changes on the master in that
 scenario.  By default :program:`pt-table-sync` will complain if you try to change data on
 a slave.  Specify \ ``--no-check-slave``\  to disable this check.  Use it at your own
 risk.


.. option:: --[no]check-triggers

 default: yes

 Check that no triggers are defined on the destination table.

 Triggers were introduced in |MySQL| v5.0.2, so for older versions this option
 has no effect because triggers will not be checked.


.. option:: --chunk-column

 type: string

 Chunk the table on this column.


.. option:: --chunk-index

 type: string

 Chunk the table using this index.


.. option:: --chunk-size

 type: string; default: 1000

 Number of rows or data size per chunk.

 The size of each chunk of rows for the "Chunk" and "Nibble" algorithms.
 The size can be either a number of rows, or a data size.  Data sizes are
 specified with a suffix of k=kibibytes, M=mebibytes, G=gibibytes.  Data sizes
 are converted to a number of rows by dividing by the average row length.
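
 The conversion from a data size to a row count can be sketched as follows.
 ``chunk_rows`` is a hypothetical helper; rounding down and the minimum of one
 row are assumptions, as the exact rounding the tool uses is not stated here.

 .. code-block:: python

    SUFFIXES = {"k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


    def chunk_rows(chunk_size, avg_row_length):
        """Convert a --chunk-size value to a number of rows."""
        if chunk_size[-1] in SUFFIXES:
            size_bytes = int(chunk_size[:-1]) * SUFFIXES[chunk_size[-1]]
            # Assumption: round down, but never below one row.
            return max(1, size_bytes // avg_row_length)
        return int(chunk_size)


    print(chunk_rows("1000", 128))  # already a row count
    print(chunk_rows("64k", 128))   # 65536 bytes / 128 bytes per row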


.. option:: --columns

 short form: -c; type: array

 Compare this comma-separated list of columns.


.. option:: --config

 type: Array

 Read this comma-separated list of config files; if specified, this must be the
 first option on the command line.


.. option:: --conflict-column

 type: string

 Compare this column when rows conflict during a :option:`--bidirectional` sync.

 When a same but differing row is found the value of this column from each
 row is compared according to :option:`--conflict-comparison`, :option:`--conflict-value`
 and :option:`--conflict-threshold` to determine which row has the correct data and
 becomes the source.  The column can be any type for which there is an
 appropriate :option:`--conflict-comparison` (this is almost all types except, for
 example, blobs).

 This option only works with :option:`--bidirectional`.
 See "BIDIRECTIONAL SYNCING" for more information.


.. option:: --conflict-comparison

 type: string

 Choose the :option:`--conflict-column` with this property as the source.

 The option affects how the :option:`--conflict-column` values from the conflicting
 rows are compared.  Possible comparisons are:


 .. code-block:: perl

    newest|oldest|greatest|least|equals|matches

    COMPARISON  CHOOSES ROW WITH
    ==========  =========================================================
    newest      Newest temporal --conflict-column value
    oldest      Oldest temporal --conflict-column value
    greatest    Greatest numerical --conflict-column value
    least       Least numerical --conflict-column value
    equals      --conflict-column value equal to --conflict-value
    matches     --conflict-column value matching Perl regex pattern
                --conflict-value
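
 The comparisons in the table can be sketched as a single function that picks
 the winning side.  This is an illustration only: ``pick_source`` is a
 hypothetical name, Python's ``re`` module stands in for *Perl* regular
 expressions, and returning ``None`` when neither row wins is an assumption.

 .. code-block:: python

    import re
    from datetime import datetime


    def pick_source(comparison, left, right, conflict_value=None):
        """Return 'left', 'right', or None when neither row wins.

        left and right are the --conflict-column values of the two rows.
        """
        if comparison in ("newest", "greatest"):
            if left == right:
                return None
            return "left" if left > right else "right"
        if comparison in ("oldest", "least"):
            if left == right:
                return None
            return "left" if left < right else "right"
        if comparison == "equals":
            hits = [left == conflict_value, right == conflict_value]
        elif comparison == "matches":
            pattern = re.compile(conflict_value)
            hits = [bool(pattern.search(left)), bool(pattern.search(right))]
        else:
            raise ValueError(f"unknown comparison: {comparison}")
        # Exactly one row must satisfy the condition for it to win.
        if hits[0] != hits[1]:
            return "left" if hits[0] else "right"
        return None


    older = datetime(2009, 12, 1, 12, 0, 0)
    newer = datetime(2009, 12, 1, 12, 5, 0)
    print(pick_source("newest", older, newer))
    print(pick_source("matches", "draft-1", "final-1", "^final"))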


 This option only works with :option:`--bidirectional`.
 See "BIDIRECTIONAL SYNCING" for more information.


.. option:: --conflict-error

 type: string; default: warn

 How to report unresolvable conflicts and conflict errors.

 This option changes how the user is notified when a conflict cannot be
 resolved or causes some kind of error.  Possible values are:


 .. code-block:: perl

    * warn: Print a warning to STDERR about the unresolvable conflict
    * die:  Die, stop syncing, and print a warning to STDERR


 This option only works with :option:`--bidirectional`.
 See "BIDIRECTIONAL SYNCING" for more information.


.. option:: --conflict-threshold

 type: string

 Amount by which one :option:`--conflict-column` must exceed the other.

 The :option:`--conflict-threshold` prevents a conflict from being resolved if
 the absolute difference between the two :option:`--conflict-column` values is
 less than this amount.  For example, if two :option:`--conflict-column` have
 timestamp values "2009-12-01 12:00:00" and "2009-12-01 12:05:00" the difference
 is 5 minutes.  If :option:`--conflict-threshold` is set to "5m" the conflict will
 be resolved, but if :option:`--conflict-threshold` is set to "6m" the conflict
 will fail to resolve because the difference is not greater than or equal
 to 6 minutes.  In this latter case, :option:`--conflict-error` will report
 the failure.
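
 The timestamp example above can be checked with a short sketch.  The helper
 names are hypothetical, and only the ``m`` (minutes) suffix is taken from the
 example; the other suffixes in ``UNITS`` are assumptions.

 .. code-block:: python

    from datetime import datetime, timedelta

    # Only "m" appears in the example above; the rest are assumed.
    UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


    def parse_threshold(text):
        """Turn a value such as '5m' into a timedelta."""
        return timedelta(**{UNITS[text[-1]]: int(text[:-1])})


    def resolvable(left, right, threshold):
        """True when the values differ by at least the threshold."""
        return abs(left - right) >= parse_threshold(threshold)


    a = datetime(2009, 12, 1, 12, 0, 0)
    b = datetime(2009, 12, 1, 12, 5, 0)
    print(resolvable(a, b, "5m"), resolvable(a, b, "6m"))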

 This option only works with :option:`--bidirectional`.
 See "BIDIRECTIONAL SYNCING" for more information.


.. option:: --conflict-value

 type: string

 Use this value for certain :option:`--conflict-comparison`.

 This option gives the value for \ ``equals``\  and \ ``matches``\
 :option:`--conflict-comparison`.

 This option only works with :option:`--bidirectional`.
 See "BIDIRECTIONAL SYNCING" for more information.


.. option:: --databases

 short form: -d; type: hash

 Sync only this comma-separated list of databases.

 A common request is to sync tables from one database with tables from another
 database on the same or different server.  This is not yet possible.
 :option:`--databases` will not do it, and you can't do it with the D part of the DSN
 either because in the absence of a table name it assumes the whole server
 should be synced and the D part controls only the connection's default database.


.. option:: --defaults-file

 short form: -F; type: string

 Only read mysql options from the given file.  You must give an absolute pathname.


.. option:: --dry-run

 Analyze, decide the sync algorithm to use, print and exit.

 Implies :option:`--verbose` so you can see the results.  The results are in the same
 output format that you'll see from actually running the tool, but there will be
 zeros for rows affected.  This is because the tool actually executes, but stops
 before it compares any data and just returns zeros.  The zeros do not mean there
 are no changes to be made.


.. option:: --engines

 short form: -e; type: hash

 Sync only this comma-separated list of storage engines.


.. option:: --execute

 Execute queries to make the tables have identical data.

 This option makes :program:`pt-table-sync` actually sync table data by executing all
 the queries that it created to resolve table differences.  Therefore, \ **the
 tables will be changed!**\   And unless you also specify :option:`--verbose`, the
 changes will be made silently.  If this is not what you want, see
 :option:`--print` or :option:`--dry-run`.


.. option:: --explain-hosts

 Print connection information and exit.

 Print out a list of hosts to which :program:`pt-table-sync` will connect, with all
 the various connection options, and exit.


.. option:: --float-precision

 type: int

 Precision for \ ``FLOAT``\  and \ ``DOUBLE``\  number-to-string conversion.  Causes FLOAT
 and DOUBLE values to be rounded to the specified number of digits after the
 decimal point, with the ROUND() function in |MySQL|.  This can help avoid
 checksum mismatches due to different floating-point representations of the same
 values on different |MySQL| versions and hardware.  The default is no rounding;
 the values are converted to strings by the CONCAT() function, and |MySQL| chooses
 the string representation.  If you specify a value of 2, for example, then the
 values 1.008 and 1.009 will be rounded to 1.01, and will checksum as equal.
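
 The effect of rounding on the checksum can be sketched in a few lines.  The
 sketch uses Python's ``decimal`` module and ``zlib.crc32`` as stand-ins for
 ROUND() and the checksum function in |MySQL|, which is an assumption made to
 keep the example self-contained; ``rounded_text`` is a hypothetical helper.

 .. code-block:: python

    import zlib
    from decimal import Decimal, ROUND_HALF_UP


    def rounded_text(value, digits):
        """Round a decimal string half-up to `digits` places, as text."""
        quantum = Decimal(1).scaleb(-digits)
        return str(Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP))


    checksums = set()
    for text in ("1.008", "1.009"):
        rounded = rounded_text(text, 2)
        checksums.add(zlib.crc32(rounded.encode()))
        print(text, "->", rounded)

    # Both values round to the same text, so they share one checksum.
    print(len(checksums))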


.. option:: --[no]foreign-key-checks

 default: yes

 Enable foreign key checks (\ ``SET FOREIGN_KEY_CHECKS=1``\ ).

 Specifying \ ``--no-foreign-key-checks``\  will \ ``SET FOREIGN_KEY_CHECKS=0``\ .


.. option:: --function

 type: string

 Which hash function you'd like to use for checksums.

 The default is \ ``CRC32``\ .  Other good choices include \ ``MD5``\  and \ ``SHA1``\ .  If you
 have installed the \ ``FNV_64``\  user-defined function,  :program:`pt-table-sync`  will detect
 it and prefer to use it, because it is much faster than the built-ins.  You can
 also use MURMUR_HASH if you've installed that user-defined function.  Both of
 these are distributed with Maatkit.  See pt-table-checksum for more
 information and benchmarks.


.. option:: --help

 Show help and exit.


.. option:: --[no]hex-blob

 default: yes

 \ ``HEX()``\  \ ``BLOB``\ , \ ``TEXT``\  and \ ``BINARY``\  columns.

 When row data from the source is fetched to create queries to sync the
 data (i.e. the queries seen with :option:`--print` and executed by :option:`--execute`),
 binary columns are wrapped in HEX() so the binary data does not produce
 an invalid SQL statement.  You can disable this option but you probably
 shouldn't.


.. option:: --host

 short form: -h; type: string

 Connect to host.


.. option:: --ignore-columns

 type: Hash

 Ignore this comma-separated list of column names in comparisons.

 This option causes columns not to be compared.  However, if a row is determined
 to differ between tables, all columns in that row will be synced, regardless.
 (It is not currently possible to exclude columns from the sync process itself,
 only from the comparison.)
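
 For example, assuming a hypothetical column named ``last_seen`` whose values
 are expected to differ between servers, a comparison that skips it might look
 like this (host, database and table names are placeholders):


 .. code-block:: bash

    # last_seen is a hypothetical column name used only for illustration.
    # Rows that differ in other columns are still synced in full.
    pt-table-sync --print --ignore-columns last_seen h=host1,D=db,t=tbl h=host2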


.. option:: --ignore-databases

 type: Hash

 Ignore this comma-separated list of databases.


.. option:: --ignore-engines

 type: Hash; default: FEDERATED,MRG_MyISAM

 Ignore this comma-separated list of storage engines.


.. option:: --ignore-tables

 type: Hash

 Ignore this comma-separated list of tables.

 Table names may be qualified with the database name.


.. option:: --[no]index-hint

 default: yes

 Add FORCE/USE INDEX hints to the chunk and row queries.

 By default :program:`pt-table-sync` adds a FORCE/USE INDEX hint to each SQL statement
 to coerce |MySQL| into using the index chosen by the sync algorithm or specified
 by :option:`--chunk-index`.  This is usually a good thing, but in rare cases the
 index may not be the best for the query so you can suppress the index hint
 by specifying \ ``--no-index-hint``\  and let |MySQL| choose the index.

 This does not affect the queries printed by :option:`--print`; it only affects the
 chunk and row queries that :program:`pt-table-sync` uses to select and compare rows.


.. option:: --lock

 type: int

 Lock tables: 0=none, 1=per sync cycle, 2=per table, or 3=globally.

 This uses \ ``LOCK TABLES``\ .  This can help prevent tables being changed while
 you're examining them.  The possible values are as follows:


 .. code-block:: perl

    VALUE  MEANING
    =====  =======================================================
    0      Never lock tables.
    1      Lock and unlock one time per sync cycle (as implemented
           by the syncing algorithm).  This is the most granular
           level of locking available.  For example, the Chunk
           algorithm will lock each chunk of N rows, and then
           unlock them if they are the same on the source and the
           destination, before moving on to the next chunk.
    2      Lock and unlock before and after each table.
    3      Lock and unlock once for every server (DSN) synced, with
           FLUSH TABLES WITH READ LOCK.


 A replication slave is never locked if :option:`--replicate` or
 :option:`--sync-to-master` is specified, since in theory locking the table on
 the master should prevent any changes from taking place.  (You are not
 changing data on your slave, right?)  If :option:`--wait` is given, the master
 (source) is locked and then the tool waits for the slave to catch up to the
 master before continuing.

 If \ ``--transaction``\  is specified, \ ``LOCK TABLES``\  is not used.  Instead, lock
 and unlock are implemented by beginning and committing transactions.
 The exception is if :option:`--lock` is 3.

 If \ ``--no-transaction``\  is specified, then \ ``LOCK TABLES``\  is used for any
 value of :option:`--lock`. See :option:`--[no]transaction`.
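
 As a sketch of how these options combine, the following command locks each
 table with \ ``LOCK TABLES``\  for the duration of its sync (level 2) rather
 than using transactions.  The DSNs are placeholders.


 .. code-block:: bash

    # --lock 2: lock and unlock before and after each table.
    # --no-transaction: use LOCK TABLES even for InnoDB tables.
    pt-table-sync --execute --lock 2 --no-transaction h=host1,D=db,t=tbl h=host2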

.. option:: --lock-and-rename

 Lock the source and destination table, sync, then swap names.  This is useful as
 a less-blocking ALTER TABLE, once the tables are reasonably in sync with each
 other (which you may choose to accomplish via any number of means, including
 dump and reload or even something like pt-archiver).  It requires exactly two
 DSNs and assumes they are on the same server, so it does no waiting for
 replication or the like.  Tables are locked with LOCK TABLES.


.. option:: --password

 short form: -p; type: string

 Password to use when connecting.


.. option:: --pid

 type: string

 Create the given PID file.  The file contains the process ID of the script.
 The PID file is removed when the script exits.  Before starting, the script
 checks if the PID file already exists.  If it does not, then the script creates
 and writes its own PID to it.  If it does, then the script checks the following:
 if the file contains a PID and a process is running with that PID, then
 the script dies; or, if there is no process running with that PID, then the
 script overwrites the file with its own PID and starts; else, if the file
 contains no PID, then the script dies.
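
 The checks above can be sketched in shell as follows.  This only illustrates
 the decision logic as described, and is not the tool's own code;
 ``claim_pid_file`` is a hypothetical helper name.


 .. code-block:: bash

    # Sketch of the PID-file checks; claim_pid_file is a hypothetical name.
    claim_pid_file() {
        pid_file=$1
        if [ -e "$pid_file" ]; then
            old_pid=$(cat "$pid_file")
            if [ -z "$old_pid" ]; then
                echo "PID file $pid_file exists but contains no PID" >&2
                return 1
            fi
            # kill -0 sends no signal; it only tests whether the process can
            # be signalled, so a process owned by another user may look absent.
            if kill -0 "$old_pid" 2>/dev/null; then
                echo "PID file $pid_file: process $old_pid is running" >&2
                return 1
            fi
            # Otherwise the file is stale and is overwritten below.
        fi
        echo "$$" > "$pid_file"
    }

 A caller would run ``claim_pid_file /path/to/file.pid || exit 1`` at startup
 and remove the file again when it exits.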


.. option:: --port

 short form: -P; type: int

 Port number to use for connection.


.. option:: --print

 Print queries that will resolve differences.

 If you don't trust :program:`pt-table-sync`, or just want to see what it will do, this
 is a good way to be safe.  These queries are valid SQL and you can run them
 yourself if you want to sync the tables manually.


.. option:: --recursion-method

 type: string

 Preferred recursion method used to find slaves.

 Possible methods are:


 .. code-block:: perl

    METHOD       USES
    ===========  ================
    processlist  SHOW PROCESSLIST
    hosts        SHOW SLAVE HOSTS


 The processlist method is preferred because SHOW SLAVE HOSTS is not reliable.
 However, the hosts method is required if the server uses a non-standard
 port (not 3306).  Usually :program:`pt-table-sync` does the right thing and finds
 the slaves, but you may give a preferred method and it will be used first.
 If it doesn't find any slaves, the other methods will be tried.


.. option:: --replace

 Write all \ ``INSERT``\  and \ ``UPDATE``\  statements as \ ``REPLACE``\ .

 This is automatically switched on as needed when there are unique index
 violations.


.. option:: --replicate

 type: string

 Sync tables listed as different in this table.

 Specifies that :program:`pt-table-sync`  should examine the specified table to find data
 that differs.  The table is exactly the same as the argument of the same name to
 pt-table-checksum.  That is, it contains records of which tables (and ranges
 of values) differ between the master and slave.

 For each table and range of values that shows differences between the master and
 slave, :program:`pt-table-sync` will sync that table, with the appropriate
 \ ``WHERE``\  clause, to its master.

 This automatically sets :option:`--wait` to 60 and causes changes to be made on the
 master instead of the slave.

 If :option:`--sync-to-master` is specified, the tool will assume the server you
 specified is the slave, and connect to the master as usual to sync.

 Otherwise, it will try to use \ ``SHOW PROCESSLIST``\  to find slaves of the server
 you specified.  If it is unable to find any slaves via \ ``SHOW PROCESSLIST``\ , it
 will inspect \ ``SHOW SLAVE HOSTS``\  instead.  You must configure each slave's
 \ ``report-host``\ , \ ``report-port``\  and other options for this to work right.  After
 finding slaves, it will inspect the specified table on each slave to find data
 that needs to be synced, and sync it.

 The tool examines the master's copy of the table first, assuming that the master
 is potentially a slave as well.  Any table that shows differences there will
 \ **NOT**\  be synced on the slave(s).  For example, suppose your replication is set
 up as A->B, B->C, B->D.  Suppose you use this argument and specify server B.
 The tool will examine server B's copy of the table.  If it looks like server B's
 data in table \ ``test.tbl1``\  is different from server A's copy, the tool will not
 sync that table on servers C and D.


.. option:: --set-vars

 type: string; default: wait_timeout=10000

 Set these |MySQL| variables.  Immediately after connecting to |MySQL|, this
 string will be appended to SET and executed.
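
 For instance, to use a shorter timeout than the default, a value could be
 passed as below.  The DSNs are placeholders and the value 600 is arbitrary.


 .. code-block:: bash

    # After connecting, the tool would execute: SET wait_timeout=600
    pt-table-sync --print --set-vars wait_timeout=600 h=host1,D=db,t=tbl h=host2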


.. option:: --socket

 short form: -S; type: string

 Socket file to use for connection.


.. option:: --sync-to-master

 Treat the DSN as a slave and sync it to its master.

 Treat the server you specified as a slave.  Inspect \ ``SHOW SLAVE STATUS``\ ,
 connect to the server's master, and treat the master as the source and the slave
 as the destination.  Causes changes to be made on the master.  Sets :option:`--wait`
 to 60 by default, sets :option:`--lock` to 1 by default, and disables
 :option:`--[no]transaction` by default.  See also :option:`--replicate`, which changes
 this option's behavior.


.. option:: --tables

 short form: -t; type: hash

 Sync only this comma-separated list of tables.

 Table names may be qualified with the database name.


.. option:: --timeout-ok

 Keep going if :option:`--wait` fails.

 If you specify :option:`--wait` and the slave doesn't catch up to the master's
 position before the wait times out, the default behavior is to abort.  This
 option makes the tool keep going anyway.  \ **Warning**\ : if you are trying to get a
 consistent comparison between the two servers, you probably don't want to keep
 going after a timeout.


.. option:: --[no]transaction

 Use transactions instead of \ ``LOCK TABLES``\ .

 The granularity of beginning and committing transactions is controlled by
 :option:`--lock`.  This is enabled by default, but since :option:`--lock` is
 disabled by default, it has no effect.

 Most options that enable locking also disable transactions by default, so if
 you want to use transactional locking (via \ ``LOCK IN SHARE MODE``\  and \ ``FOR
 UPDATE``\ ), you must specify \ ``--transaction``\  explicitly.

 If you don't specify \ ``--transaction``\  explicitly, :program:`pt-table-sync`
 will decide on a per-table basis whether to use transactions or table locks.
 It currently uses transactions on |InnoDB| tables, and table locks on all others.

 If \ ``--no-transaction``\  is specified, then :program:`pt-table-sync` will not use
 transactions at all (not even for |InnoDB| tables) and locking is controlled
 by :option:`--lock`.

 When enabled, either explicitly or implicitly, the transaction isolation level
 is set to \ ``REPEATABLE READ``\  and transactions are started \ ``WITH CONSISTENT
 SNAPSHOT``\ .

.. option:: --trim

 \ ``TRIM()``\  \ ``VARCHAR``\  columns in \ ``BIT_XOR``\  and \ ``ACCUM``\  modes.  Helps when
 comparing |MySQL| 4.1 to >= 5.0.

 This is useful when you don't care about the trailing space differences between
 |MySQL| versions which vary in their handling of trailing spaces. |MySQL| 5.0 and
 later all retain trailing spaces in \ ``VARCHAR``\ , while previous versions would
 remove them.


.. option:: --[no]unique-checks

 default: yes

 Enable unique key checks (\ ``SET UNIQUE_CHECKS=1``\ ).

 Specifying \ ``--no-unique-checks``\  will \ ``SET UNIQUE_CHECKS=0``\ .


.. option:: --user

 short form: -u; type: string

 User for login if not current user.


.. option:: --verbose

 short form: -v; cumulative: yes

 Print results of sync operations.

 See "OUTPUT" for more details about the output.


.. option:: --version

 Show version and exit.


.. option:: --wait

 short form: -w; type: time

 How long to wait for slaves to catch up to their master.

 Make the master wait for the slave to catch up in replication before comparing
 the tables.  The value is the number of seconds to wait before timing out (see
 also :option:`--timeout-ok`).  Sets :option:`--lock` to 1 and
 :option:`--[no]transaction` to 0 by default.  If you see an error such as the
 following, it means the timeout was exceeded and you need to increase it:


 .. code-block:: perl

    MASTER_POS_WAIT returned -1

 The default value of this option is influenced by other options.  To see what
 value is in effect, run with :option:`--help`.

 To disable waiting entirely (except for locks), specify :option:`--wait` 0.  This
 helps when the slave is lagging on tables that are not being synced.
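
 As an illustration, assuming ``slave1`` is a replica as in the synopsis, the
 first command below raises the timeout to 300 seconds and the second disables
 waiting altogether.  The value 300 is arbitrary.


 .. code-block:: bash

    # Wait up to 300 seconds for slave1 to catch up before comparing.
    pt-table-sync --execute --sync-to-master --wait 300 slave1

    # Do not wait for replication at all (locks are still waited for).
    pt-table-sync --execute --sync-to-master --wait 0 slave1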


.. option:: --where

 type: string

 \ ``WHERE``\  clause to restrict syncing to part of the table.
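
 For example, assuming the table has a hypothetical integer column ``id``, the
 sync could be limited to part of the table like this (the DSNs are
 placeholders):


 .. code-block:: bash

    # id is a hypothetical column; quote the clause so the shell passes it
    # to the tool as a single argument.
    pt-table-sync --print --where "id > 1000" h=host1,D=db,t=tbl h=host2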


.. option:: --[no]zero-chunk

 default: yes

 Add a chunk for rows with zero or zero-equivalent values.  This only has an
 effect when :option:`--chunk-size` is specified.  The purpose of the zero chunk
 is to capture a potentially large number of zero values that would imbalance
 the size of the first chunk.  For example, if a lot of negative numbers were
 inserted into an unsigned integer column causing them to be stored as zeros,
 then these zero values are captured by the zero chunk instead of the first
 chunk and all its non-zero values.


DSN OPTIONS
===========


These DSN options are used to create a DSN.  Each option is given like
\ ``option=value``\ .  The options are case-sensitive, so P and p are not the
same option.  There cannot be whitespace before or after the \ ``=``\  and
if the value contains whitespace it must be quoted.  DSN options are
comma-separated.  See the percona-toolkit manpage for full details.


  * ``A``

 dsn: charset; copy: yes

 Default character set.


  * ``D``

 dsn: database; copy: yes

 Database containing the table to be synced.


  * ``F``

 dsn: mysql_read_default_file; copy: yes

 Only read default options from the given file.


  * ``h``

 dsn: host; copy: yes

 Connect to host.


  * ``p``

 dsn: password; copy: yes

 Password to use when connecting.


  * ``P``

 dsn: port; copy: yes

 Port number to use for connection.


  * ``S``

 dsn: mysql_socket; copy: yes

 Socket file to use for connection.


  * ``t``

 copy: yes

 Table to be synced.


  * ``u``

 dsn: user; copy: yes

 User for login if not current user.
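
The ``option=value`` format described at the start of this section can be
illustrated with a small shell sketch that splits a DSN into its parts.  It is
not the tool's parser and does not handle quoted values that contain commas;
``print_dsn_parts`` is a hypothetical helper name and the DSN values are
placeholders.


.. code-block:: bash

    # Sketch only: split a comma-separated DSN into "key -> value" lines.
    # The body runs in a subshell so the IFS and globbing changes stay local.
    print_dsn_parts() (
        IFS=','
        set -f
        for part in $1; do
            # Key: text before the first "="; value: text after it.
            printf '%s -> %s\n' "${part%%=*}" "${part#*=}"
        done
    )

Calling ``print_dsn_parts 'h=host1,P=3306,p=secret'`` lists ``P`` (port) and
``p`` (password) as separate options, which is why case matters.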


ENVIRONMENT
===========


The environment variable \ ``PTDEBUG``\  enables verbose debugging output to ``STDERR``.
To enable debugging and capture all output to a file, run the tool like:


.. code-block:: perl

    PTDEBUG=1 pt-table-sync ... > FILE 2>&1


Be careful: debugging output is voluminous and can generate several megabytes
of output.


SYSTEM REQUIREMENTS
===================


You need *Perl*, ``DBI``, ``DBD::mysql``, and some core packages that ought to be
installed in any reasonably new version of *Perl*.


BUGS
====


For a list of known bugs, see `http://www.percona.com/bugs/pt-table-sync <http://www.percona.com/bugs/pt-table-sync>`_.

Please report bugs at `https://bugs.launchpad.net/percona-toolkit <https://bugs.launchpad.net/percona-toolkit>`_.


DOWNLOADING
===========


Visit `http://www.percona.com/software/percona-toolkit/ <http://www.percona.com/software/percona-toolkit/>`_ to download the
latest release of Percona Toolkit.  Or, get the latest release from the
command line:


.. code-block:: perl

    wget percona.com/get/percona-toolkit.tar.gz

    wget percona.com/get/percona-toolkit.rpm

    wget percona.com/get/percona-toolkit.deb


You can also get individual tools from the latest release:


.. code-block:: perl

    wget percona.com/get/TOOL


Replace \ ``TOOL``\  with the name of any tool.


AUTHORS
=======


*Baron Schwartz*


ACKNOWLEDGMENTS
===============


My work is based in part on Giuseppe Maxia's work on distributed databases,
`http://www.sysadminmag.com/articles/2004/0408/ <http://www.sysadminmag.com/articles/2004/0408/>`_ and code derived from that
article.  There is more explanation, and a link to the code, at
`http://www.perlmonks.org/?node_id=381053 <http://www.perlmonks.org/?node_id=381053>`_.

Another programmer extended Maxia's work even further.  Fabien Coelho changed
and generalized Maxia's technique, introducing symmetry and avoiding some
problems that might have caused too-frequent checksum collisions.  This work
grew into pg_comparator, `http://www.coelho.net/pg_comparator/ <http://www.coelho.net/pg_comparator/>`_.  Coelho also
explained the technique further in a paper titled "Remote Comparison of Database
Tables" (`http://cri.ensmp.fr/classement/doc/A-375.pdf <http://cri.ensmp.fr/classement/doc/A-375.pdf>`_).

This existing literature mostly addressed how to find the differences between
the tables, not how to resolve them once found.  I needed a tool that would not
only find them efficiently, but would then resolve them.  I first began thinking
about how to improve the technique further with my article
`http://tinyurl.com/mysql-data-diff-algorithm <http://tinyurl.com/mysql-data-diff-algorithm>`_,
where I discussed a number of problems with the Maxia/Coelho "bottom-up"
algorithm.  After writing that article, I began to write this tool.  I wanted to
actually implement their algorithm with some improvements so I was sure I
understood it completely.  I discovered it is not what I thought it was, and is
considerably more complex than it appeared to me at first.  Fabien Coelho was
kind enough to address some questions over email.

The first versions of this tool implemented a version of the Coelho/Maxia
algorithm, which I called "bottom-up", and my own, which I called "top-down."
Those algorithms are considerably more complex than the current algorithms and
I have removed them from this tool, and may add them back later.  The
improvements to the bottom-up algorithm are my original work, as is the
top-down algorithm.  The techniques to actually resolve the differences are
also my own work.

Another tool that can synchronize tables is the SQLyog Job Agent from webyog.
Thanks to Rohit Nadhani, SJA's author, for the conversations about the general
techniques.  There is a comparison of :program:`pt-table-sync` and SJA at
`http://tinyurl.com/maatkit-vs-sqlyog <http://tinyurl.com/maatkit-vs-sqlyog>`_.

Thanks to the following people and organizations for helping in many ways:

The Rimm-Kaufman Group `http://www.rimmkaufman.com/ <http://www.rimmkaufman.com/>`_,
|MySQL| AB `http://www.mysql.com/ <http://www.mysql.com/>`_,
Blue Ridge InternetWorks `http://www.briworks.com/ <http://www.briworks.com/>`_,
Percona `http://www.percona.com/ <http://www.percona.com/>`_,
Fabien Coelho,
Giuseppe Maxia and others at |MySQL| AB,
Kristian Koehntopp (|MySQL| AB),
Rohit Nadhani (WebYog),
The helpful monks at Perlmonks,
And others too numerous to mention.

COPYRIGHT, LICENSE, AND WARRANTY
================================


This program is copyright 2007-2011 *Baron Schwartz*, 2011 Percona Inc.
Feedback and improvements are welcome.

VERSION
=======

:program:`pt-table-sync` 1.0.1