Update pt-stalk docs more.

Daniel Nichter
2013-03-04 18:20:20 -07:00
parent 35ab06febe
commit 660a049fa4

@@ -1441,11 +1441,11 @@ pt-stalk - Collect forensic data about MySQL when problems occur.
Usage: pt-stalk [OPTIONS] [-- MYSQL OPTIONS] Usage: pt-stalk [OPTIONS] [-- MYSQL OPTIONS]
pt-stalk watches for a trigger condition to occur, then collects data pt-stalk waits for a trigger condition to occur, then collects data
to help diagnose problems. The tool is designed to run as a daemon with root to help diagnose problems. The tool is designed to run as a daemon with root
privileges, so that you can diagnose intermittent problems that you cannot privileges, so that you can diagnose intermittent problems that you cannot
observe directly. You can also use it to execute a custom command, or to observe directly. You can also use it to execute a custom command, or to
collect data on demand without waiting for the stalk trigger to occur. collect data on demand without waiting for the trigger to occur.
=head1 RISKS =head1 RISKS
@@ -1476,16 +1476,20 @@ chance to see the system when it happens. How do you solve intermittent MySQL
problems when you can't observe them? That's why pt-stalk exists. In addition to problems when you can't observe them? That's why pt-stalk exists. In addition to
using it when there's a known problem on your servers, it is a good idea to run using it when there's a known problem on your servers, it is a good idea to run
pt-stalk all the time, even when you think nothing is wrong. You will pt-stalk all the time, even when you think nothing is wrong. You will
appreciate the data it gathers when a problem occurs, because problems such as appreciate the data it collects when a problem occurs, because problems such as
MySQL lockups or spikes of activity typically leave no evidence to use in root MySQL lockups or spikes in activity typically leave no evidence to use in root
cause analysis. cause analysis.
This tool does two things: it watches a server (typically MySQL) for a trigger pt-stalk does two things: it watches a MySQL server and waits for a trigger
to occur, and it gathers diagnostic data. To use it effectively, you need to condition to occur, and it collects diagnostic data when that trigger occurs.
define a good trigger condition. A good trigger is sensitive enough to fire To avoid false positives caused by short-lived problems, the trigger condition
reliably when a problem occurs, so that you don't miss a chance to solve must be true at least L<"--cycles"> times before a L<"--collect"> is triggered.
problems. On the other hand, a good trigger isn't prone to false positives, so
you don't gather information when the server is functioning normally. To use pt-stalk effectively, you need to define a good trigger. A good trigger
is sensitive enough to fire reliably when a problem occurs, so that you don't
miss a chance to solve problems. On the other hand, a good trigger isn't
prone to false positives, so you don't gather information when the server
is functioning normally.
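
The effect of L<"--cycles"> on false positives can be sketched in a few lines of shell. This is an illustration only, not pt-stalk's actual code: the function name C<first_collect> and its arguments are hypothetical, and it assumes that a single false check resets the count, so the trigger must hold for consecutive checks.

```shell
# first_collect THRESHOLD CYCLES VALUE...
# Hypothetical helper: prints the 1-based index of the sample at which a
# collect would fire, or prints nothing and returns 1 if the trigger never
# holds for CYCLES checks in a row.
first_collect() {
    threshold=$1
    cycles=$2
    shift 2
    streak=0
    index=0
    for value in "$@"; do
        index=$((index + 1))
        if [ "$value" -gt "$threshold" ]; then
            streak=$((streak + 1))
        else
            streak=0   # assumption: one false check resets the count
        fi
        if [ "$streak" -ge "$cycles" ]; then
            echo "$index"
            return 0
        fi
    done
    return 1
}
```

For example, C<first_collect 20 3 25 30 10 22 24 26 28> ignores the short spike at the start (two samples above 20, then a drop) and reports the sixth sample, which is the first point where three checks in a row exceed the threshold.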
The most reliable triggers for MySQL tend to be the number of connections to the The most reliable triggers for MySQL tend to be the number of connections to the
server, and the number of queries running concurrently. These are available in server, and the number of queries running concurrently. These are available in
@@ -1495,14 +1499,15 @@ Threads_running usually is. Your job, as the tool's user, is to define an
appropriate trigger condition for the tool. Choose carefully, because the appropriate trigger condition for the tool. Choose carefully, because the
quality of your results will depend on the trigger you choose. quality of your results will depend on the trigger you choose.
You can define the trigger with the L<"--function">, L<"--variable">, and You define the trigger with the L<"--function">, L<"--variable">,
L<"--threshold"> options, among others. Please read the documentation for L<"--threshold">, and L<"--cycles"> options. The default values
L<"--function"> to learn how to do this. for these options define a reasonable trigger, but you should adjust
or change them to suit your particular system and needs.
The pt-stalk tool, by default, simply watches MySQL repeatedly until the trigger By default, pt-stalk watches MySQL forever until the trigger occurs,
becomes true. It then gathers diagnostics for a while, and sleeps afterwards for then it collects diagnostic data for a while, and sleeps afterwards to avoid
some time to prevent repeatedly gathering data if the condition remains true. repeatedly collecting data if the trigger remains true. The general order of
In crude pseudocode, omitting some subtleties, operations is:
while true; do while true; do
if --variable from --function > --threshold; then if --variable from --function > --threshold; then
@@ -1539,15 +1544,15 @@ In crude pseudocode, omitting some subtleties,
The diagnostic data is written to files whose names begin with a timestamp, so The diagnostic data is written to files whose names begin with a timestamp, so
you can distinguish samples from each other in case the tool collects data you can distinguish samples from each other in case the tool collects data
multiple times. The pt-sift tool is designed to help you browse and analyze the multiple times. The pt-sift tool is designed to help you browse and analyze
resulting samples of data. the resulting data samples.
Although this sounds simple enough, in practice there are a number of Although this sounds simple enough, in practice there are a number of
subtleties, such as detecting when the disk is beginning to fill up so that the subtleties, such as detecting when the disk is beginning to fill up so that the
tool doesn't cause the server to run out of disk space. This tool handles these tool doesn't cause the server to run out of disk space. This tool handles these
types of potential problems, so it's a good idea to use this tool instead of types of potential problems, so it's a good idea to use this tool instead of
writing something from scratch and possibly experiencing some of the hazards writing something from scratch and possibly experiencing some of the hazards
this tool is designed to prevent. this tool is designed to avoid.
=head1 CONFIGURING =head1 CONFIGURING
@@ -1555,15 +1560,15 @@ You can use standard Percona Toolkit configuration files to set command line
options. options.
You will probably want to run the tool as a daemon and customize at least the You will probably want to run the tool as a daemon and customize at least the
diagnostic threshold. Here's a sample configuration file for triggering when L<"--threshold">. Here's a sample configuration file for triggering when
there are more than 20 queries running at once: there are more than 20 queries running at once:
daemonize daemonize
threshold=20 threshold=20
If you're not running the tool as it's designed (as a root user, daemonized) If you don't run the tool as root, then you will need to specify several options,
then you'll need to set several options, such as L<"--dest">, to locations that such as L<"--pid">, L<"--log">, and L<"--dest">, else the tool will probably
are writable by non-root users. fail to start.
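
As a rough sketch of the non-root case, the shell helper below builds values for L<"--pid">, L<"--log">, and L<"--dest"> under one directory that the user can write to. The function name C<nonroot_opts>, the directory layout, and the file names are assumptions made for illustration; only the three option names come from the documentation above.

```shell
# nonroot_opts BASE_DIR
# Hypothetical helper: prints pt-stalk options that point the PID file,
# the log file, and the collection directory at paths under BASE_DIR.
# The file and directory names below are illustrative, not defaults.
nonroot_opts() {
    base=$1
    echo "--pid $base/pt-stalk.pid --log $base/pt-stalk.log --dest $base/collected"
}

# Possible usage (assumes BASE_DIR contains no spaces, because the
# output is word-split by the shell):
#   mkdir -p "$HOME/pt-stalk/collected"
#   pt-stalk --daemonize $(nonroot_opts "$HOME/pt-stalk")
```

Keeping all three paths under a single user-owned directory is one simple way to make sure each location is writable; any other writable paths would work as well.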
=head1 OPTIONS =head1 OPTIONS
@@ -1573,8 +1578,8 @@ are writable by non-root users.
default: yes; negatable: yes default: yes; negatable: yes
Collect diagnostic data when the L<"--stalk"> trigger occurs. Specify Collect diagnostic data when the trigger occurs. Specify C<--no-collect>
C<--no-collect> to make the tool watch the system but not collect data. to make the tool watch the system but not collect data.
See also L<"--stalk">. See also L<"--stalk">.
@@ -1673,23 +1678,23 @@ margins are satisfied.
type: string; default: status type: string; default: status
What to watch for L<"--stalk"> trigger. The default value watches What to watch for the trigger. The default value watches
C<SHOW GLOBAL STATUS>, but you can also watch C<SHOW PROCESSLIST> and specify C<SHOW GLOBAL STATUS>, but you can also watch C<SHOW PROCESSLIST> and specify
a file with your own custom code. This function supplies the value of a file with your own custom code. This function supplies the value of
L<"--variable">, which is then compared against L<"--threshold"> to see if the L<"--variable">, which is then compared against L<"--threshold"> to see if the
L<"--stalk"> trigger condition is met. Additional options may be required as the trigger condition is met. Additional options may be required as
well; see below. Possible values are: well; see below. Possible values are:
=over =over
=item * status =item * status
Watch C<SHOW GLOBAL STATUS> for the L<"--stalk"> trigger. The value of Watch C<SHOW GLOBAL STATUS> for the trigger. The value of
L<"--variable"> then defines which status counter is the trigger. L<"--variable"> then defines which status counter is the trigger.
=item * processlist =item * processlist
Watch C<SHOW FULL PROCESSLIST> for the L<"--stalk"> trigger. The trigger Watch C<SHOW FULL PROCESSLIST> for the trigger. The trigger
value is the count of processes whose L<"--variable"> column matches the value is the count of processes whose L<"--variable"> column matches the
L<"--match"> option. For example, to trigger L<"--collect"> when more than L<"--match"> option. For example, to trigger L<"--collect"> when more than
10 processes are in the "statistics" state, specify: 10 processes are in the "statistics" state, specify:
@@ -1733,14 +1738,14 @@ Print help and exit.
type: int; default: 1 type: int; default: 1
How often to check the L<"--stalk"> trigger, in seconds. How often to check if the trigger is true, in seconds.
=item --iterations =item --iterations
type: int type: int
How many times to L<"--collect"> diagnostic data. By default, the tool How many times to L<"--collect"> diagnostic data. By default, the tool
runs forever and collects data every time the L<"--stalk"> trigger occurs. runs forever and collects data every time the trigger occurs.
Specify L<"--iterations"> to collect data a limited number of times. Specify L<"--iterations"> to collect data a limited number of times.
This option is also useful with C<--no-stalk> to collect data once and This option is also useful with C<--no-stalk> to collect data once and
exit, for example. exit, for example.
@@ -1791,7 +1796,7 @@ Called before stalking.
=item before_collect =item before_collect
Called when the L<"--stalk"> trigger occurs, before running a L<"--collect"> Called when the trigger occurs, before running a L<"--collect">
subprocesses in the background. subprocesses in the background.
=item after_collect =item after_collect
@@ -1857,7 +1862,7 @@ purged.
type: int; default: 30 type: int; default: 30
How long to L<"--collect"> diagnostic data when the L<"--stalk"> trigger occurs. How long to L<"--collect"> diagnostic data when the trigger occurs.
The value is in seconds and should not be longer than L<"--sleep">. It is The value is in seconds and should not be longer than L<"--sleep">. It is
usually not necessary to change this; if the default 30 seconds doesn't usually not necessary to change this; if the default 30 seconds doesn't
collect enough data, running longer is not likely to help because the system collect enough data, running longer is not likely to help because the system