mirror of
https://github.com/percona/percona-toolkit.git
synced 2025-09-11 13:40:07 +00:00
Merge lp:~percona-toolkit-dev/percona-toolkit/pt-stalk-2.0-docs r155.
241
bin/pt-stalk
@@ -1029,7 +1029,7 @@ main() {
|
|||||||
RAN_WITH="--function=$OPT_FUNCTION --variable=$OPT_VARIABLE --threshold=$OPT_THRESHOLD --match=$OPT_MATCH --cycles=$OPT_CYCLES --interval=$OPT_INTERVAL --iterations=$OPT_ITERATIONS --run-time=$OPT_RUN_TIME --sleep=$OPT_SLEEP --dest=$OPT_DEST --prefix=$OPT_PREFIX --notify-by-email=$OPT_NOTIFY_BY_EMAIL --log=$OPT_LOG --pid=$OPT_PID"
|
RAN_WITH="--function=$OPT_FUNCTION --variable=$OPT_VARIABLE --threshold=$OPT_THRESHOLD --match=$OPT_MATCH --cycles=$OPT_CYCLES --interval=$OPT_INTERVAL --iterations=$OPT_ITERATIONS --run-time=$OPT_RUN_TIME --sleep=$OPT_SLEEP --dest=$OPT_DEST --prefix=$OPT_PREFIX --notify-by-email=$OPT_NOTIFY_BY_EMAIL --log=$OPT_LOG --pid=$OPT_PID"
|
||||||
log "Starting $0 $RAN_WITH"
|
log "Starting $0 $RAN_WITH"
|
||||||
|
|
||||||
# Make the collection dir exists.
|
# Make sure the collection dir exists.
|
||||||
if [ ! -d "$OPT_DEST" ]; then
|
if [ ! -d "$OPT_DEST" ]; then
|
||||||
mkdir -p "$OPT_DEST" || die "Cannot make --dest $OPT_DEST"
|
mkdir -p "$OPT_DEST" || die "Cannot make --dest $OPT_DEST"
|
||||||
fi
|
fi
|
||||||
@@ -1136,16 +1136,17 @@ fi
|
|||||||
|
|
||||||
=head1 NAME
|
=head1 NAME
|
||||||
|
|
||||||
pt-stalk - Wait for a condition to occur then begin collecting data.
|
pt-stalk - Gather forensic data about MySQL when a problem occurs.
|
||||||
|
|
||||||
=head1 SYNOPSIS
|
=head1 SYNOPSIS
|
||||||
|
|
||||||
Usage: pt-stalk [OPTIONS] [-- MYSQL OPTIONS]
|
Usage: pt-stalk [OPTIONS] [-- MYSQL OPTIONS]
|
||||||
|
|
||||||
pt-stalk watches for a condition to become true, and when it does, executes
|
pt-stalk watches for a trigger condition to become true, and then collects data
|
||||||
a script. By default it executes L<pt-collect>, but that can be customized.
|
to help in diagnosing problems. It is designed to run as a daemon so that you
|
||||||
This tool is useful for gathering diagnostic data when an infrequent event
|
can diagnose intermittent problems that you cannot observe directly. You can
|
||||||
occurs, so an expert person can review the data later.
|
also use it to execute a custom command, or to gather the data on demand without
|
||||||
|
waiting for the trigger to happen.
|
||||||
|
|
||||||
=head1 RISKS
|
=head1 RISKS
|
||||||
|
|
||||||
@@ -1154,7 +1155,9 @@ whether known or unknown, of using this tool. The two main categories of risks
|
|||||||
are those created by the nature of the tool (e.g. read-only tools vs. read-write
|
are those created by the nature of the tool (e.g. read-only tools vs. read-write
|
||||||
tools) and those created by bugs.
|
tools) and those created by bugs.
|
||||||
|
|
||||||
pt-stalk is a read-only tool. It should be very low-risk.
|
pt-stalk is a read-only tool. It should be very low-risk. Some of the options
|
||||||
|
can cause intrusive data collection to be performed, however, so if you enable
|
||||||
|
any non-default options, you should read their documentation carefully.
|
||||||
|
|
||||||
At the time of this release, we know of no bugs that could cause serious harm
|
At the time of this release, we know of no bugs that could cause serious harm
|
||||||
to users.
|
to users.
|
||||||
@@ -1168,37 +1171,60 @@ See also L<"BUGS"> for more information on filing bugs and getting help.
|
|||||||
|
|
||||||
=head1 DESCRIPTION
|
=head1 DESCRIPTION
|
||||||
|
|
||||||
Although pt-stalk comes pre-configured to do a specific thing, in general
|
Sometimes a problem happens infrequently and for a short time, giving you no
|
||||||
this tool is just a skeleton script for the following flow of actions:
|
chance to see the system when it happens. How do you solve intermittent MySQL
|
||||||
|
problems when you can't observe them? That's why pt-stalk exists. In addition to
|
||||||
|
using it when there's a known problem on your servers, it is a good idea to run
|
||||||
|
pt-stalk all the time, even when you think nothing is wrong. You will
|
||||||
|
appreciate the data it gathers when a problem occurs, because problems such as
|
||||||
|
MySQL lockups or spikes of activity typically leave no evidence to use in root
|
||||||
|
cause analysis.
|
||||||
|
|
||||||
=over
|
This tool does two things: it watches a server (typically MySQL) for a trigger
|
||||||
|
to occur, and it gathers diagnostic data. To use it effectively, you need to
|
||||||
|
define a good trigger condition. A good trigger is sensitive enough to fire
|
||||||
|
reliably when a problem occurs, so that you don't miss a chance to solve
|
||||||
|
problems. On the other hand, a good trigger isn't prone to false positives, so
|
||||||
|
you don't gather information when the server is functioning normally.
|
||||||
|
|
||||||
=item 1.
|
The most reliable triggers for MySQL tend to be the number of connections to the
|
||||||
|
server, and the number of queries running concurrently. These are available in
|
||||||
|
the SHOW GLOBAL STATUS command as Threads_connected and Threads_running.
|
||||||
|
Sometimes Threads_connected is not a reliable indicator of trouble, but
|
||||||
|
Threads_running usually is. Your job, as the tool's user, is to define an
|
||||||
|
appropriate trigger condition for the tool. Choose carefully, because the
|
||||||
|
quality of your results will depend on the trigger you choose.
|
||||||
|
|
||||||
Loop infinitely, sleeping between iterations.
|
You can define the trigger with the L<"--function">, L<"--variable">, and
|
||||||
|
L<"--threshold"> options, among others. Please read the documentation for
|
||||||
|
--function to learn how to do this.
|
||||||
|
|
||||||
=item 2.
|
The pt-stalk tool, by default, simply watches MySQL repeatedly until the trigger
|
||||||
|
becomes true. It then gathers diagnostics for a while, and sleeps afterwards for
|
||||||
|
some time to prevent repeatedly gathering data if the condition remains true.
|
||||||
|
In crude pseudocode, omitting some subtleties,
|
||||||
|
|
||||||
In each iteration, run some command and get the output.
|
while true; do
|
||||||
|
if --variable from --function is greater than --threshold; then
|
||||||
|
observations++
|
||||||
|
if observations is greater than --cycles; then
|
||||||
|
capture diagnostics for --run-time seconds
|
||||||
|
exit if --iterations is exceeded
|
||||||
|
sleep for --sleep seconds
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
clean up data that's older than --retention-time
|
||||||
|
sleep for --interval seconds
|
||||||
|
done
|
||||||
|
|
||||||
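The counting step of the pseudocode above can be sketched as a small shell function. This is only an illustration, not pt-stalk's own code: `first_trigger` is a hypothetical name, and the sampled values arrive on standard input instead of coming from --function. Two details are assumptions: the count resets when a sample is not above the threshold, and collection starts once the count reaches --cycles.

```shell
# Hypothetical helper, not part of pt-stalk.
#   $1 stands in for --threshold, $2 for --cycles.
# Reads one sampled value per line on stdin and prints the 1-based number of
# the sample at which collection would start. Prints nothing and returns 1
# if the trigger never fires.
first_trigger() {
   threshold=$1
   cycles=$2
   observations=0
   n=0
   while read -r value; do
      n=$((n + 1))
      if [ "$value" -gt "$threshold" ]; then
         observations=$((observations + 1))
      else
         # Assumption: a sample at or below the threshold resets the count.
         observations=0
      fi
      # Assumption: fire as soon as the count reaches --cycles.
      if [ "$observations" -ge "$cycles" ]; then
         echo "$n"
         return 0
      fi
   done
   return 1
}
```

For example, `printf '%s\n' 10 30 30 12 30 30 30 | first_trigger 25 3` prints 7: the dip to 12 resets the count, so three consecutive samples above 25 are first seen at the seventh sample.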
=item 3.
|
The diagnostic data is written to files whose names begin with a timestamp, so
|
||||||
|
you can distinguish samples from each other in case the tool collects data
|
||||||
|
multiple times. The pt-sift tool is designed to help you browse and analyze the
|
||||||
|
resulting samples of data.
|
||||||
|
|
||||||
If the command fails or the output is larger than the threshold,
|
Although this sounds simple enough, in practice there are a number of
|
||||||
execute the collection script; but do not execute if the destination disk
|
subtleties, such as detecting when the disk is beginning to fill up so that the
|
||||||
is too full.
|
tool doesn't cause the server to run out of disk space.
|
||||||
|
|
||||||
=back
|
|
||||||
|
|
||||||
By default, the tool is configured to execute mysqladmin extended-status and
|
|
||||||
extract the value of the Threads_running variable; if this is greater than
|
|
||||||
25, it runs the collection script. This is really just placeholder code,
|
|
||||||
and almost certainly needs to be customized!
|
|
||||||
|
|
||||||
If the tool does execute the collection script, it will wait for a while
|
|
||||||
before checking and executing again. This is to prevent a continuous
|
|
||||||
condition from causing a huge number of executions to fire off.
|
|
||||||
|
|
||||||
The name 'stalk' is because 'watch' is already taken, and 'stalk' is fun.
|
|
||||||
|
|
||||||
=head1 CONFIGURING
|
=head1 CONFIGURING
|
||||||
|
|
||||||
@@ -1212,23 +1238,43 @@ TODO
|
|||||||
|
|
||||||
default: yes; negatable: yes
|
default: yes; negatable: yes
|
||||||
|
|
||||||
Collect system information.
|
Collect system information. You can negate this option to make the tool watch
|
||||||
|
the system but not actually gather any diagnostic data.
|
||||||
|
|
||||||
=item --collect-gdb
|
=item --collect-gdb
|
||||||
|
|
||||||
Collect GDB stacktraces.
|
Collect GDB stacktraces. This is achieved by attaching to MySQL and printing
|
||||||
|
stack traces from all threads. This will freeze the server for some period of
|
||||||
|
time, ranging from a second or so to much longer on very busy systems with a lot
|
||||||
|
of memory and many threads in the server. For this reason, it is disabled by
|
||||||
|
default. However, if you are trying to diagnose a server stall or lockup,
|
||||||
|
freezing the server causes no additional harm, and the stack traces can be vital
|
||||||
|
for diagnosis.
|
||||||
|
|
||||||
|
In addition to freezing the server, there is also some risk of the server
|
||||||
|
crashing or performing badly after GDB detaches from it.
|
||||||
|
|
||||||
=item --collect-oprofile
|
=item --collect-oprofile
|
||||||
|
|
||||||
Collect oprofile data.
|
Collect oprofile data. This is achieved by starting an oprofile session,
|
||||||
|
letting it run for the collection time, and then stopping and saving the
|
||||||
|
resulting profile data in the system's default location. Please read your
|
||||||
|
system's oprofile documentation to learn more about this.
|
||||||
|
|
||||||
=item --collect-strace
|
=item --collect-strace
|
||||||
|
|
||||||
Collect strace data.
|
Collect strace data. This is achieved by attaching strace to the server, which
|
||||||
|
will make it run very slowly until strace detaches. The same cautions apply as
|
||||||
|
those listed in --collect-gdb. You should not enable this option together with
|
||||||
|
--collect-gdb, because GDB and strace can't attach to the server process
|
||||||
|
simultaneously.
|
||||||
|
|
||||||
=item --collect-tcpdump
|
=item --collect-tcpdump
|
||||||
|
|
||||||
Collect tcpdump data.
|
Collect tcpdump data. This option causes tcpdump to capture all traffic on all
|
||||||
|
interfaces for the port on which MySQL is listening. You can later use
|
||||||
|
pt-query-digest to decode the MySQL protocol and extract a log of query traffic
|
||||||
|
from it.
|
||||||
|
|
||||||
=item --config
|
=item --config
|
||||||
|
|
||||||
@@ -1241,77 +1287,99 @@ first option on the command line.
|
|||||||
|
|
||||||
type: int; default: 5
|
type: int; default: 5
|
||||||
|
|
||||||
Number of times condition must be met before triggering collection.
|
The number of times the trigger condition must be true before collecting data.
|
||||||
|
This helps prevent false positives and makes the trigger condition less
|
||||||
|
susceptible to firing when the condition recovers quickly.
|
||||||
|
|
||||||
=item --daemonize
|
=item --daemonize
|
||||||
|
|
||||||
Daemonize the tool.
|
Daemonize the tool. This causes the tool to fork into the background and log
|
||||||
|
its output as specified in --log.
|
||||||
|
|
||||||
=item --dest
|
=item --dest
|
||||||
|
|
||||||
type: string; default: ${HOME}/collected
|
type: string; default: ${HOME}/collected
|
||||||
|
|
||||||
Where to store collected data.
|
Where to store the diagnostic data. Each time the tool collects data, it writes
|
||||||
|
to a new set of files, which are named with the current system timestamp.
|
||||||
|
|
||||||
=item --disk-byte-limit
|
=item --disk-byte-limit
|
||||||
|
|
||||||
type: int; default: 100
|
type: int; default: 100
|
||||||
|
|
||||||
Exit if the disk has less than this many MB free.
|
Don't collect data unless the destination disk has this much free space, in MB. This
|
||||||
|
prevents the tool from filling up the disk with diagnostic data.
|
||||||
|
|
||||||
|
If the destination directory contains a previously captured sample of data, the
|
||||||
|
tool will measure its size and use that as an estimate of how much data is
|
||||||
|
likely to be gathered this time, too. It will then be even more pessimistic,
|
||||||
|
and will refuse to collect data unless the disk has enough free space to hold
|
||||||
|
the sample and still have the desired amount of free space. For example, if
|
||||||
|
you'd like 100MB of free space and the previous diagnostic sample consumed
|
||||||
|
100MB, the tool won't collect any data unless the disk has 200MB free.
|
||||||
|
|
||||||
=item --disk-pct-limit
|
=item --disk-pct-limit
|
||||||
|
|
||||||
type: int; default: 5
|
type: int; default: 5
|
||||||
|
|
||||||
Exit if the disk is less than this %full.
|
Don't collect data unless the disk has at least this percent free space. This
|
||||||
|
option works similarly to --disk-byte-limit, but specifies a percentage margin
|
||||||
|
of safety instead of a byte margin of safety. The tool honors both options, and
|
||||||
|
will not collect any data unless both margins are satisfied.
|
||||||
|
|
||||||
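The two safety margins described for --disk-byte-limit and --disk-pct-limit can be sketched as a single check. This is a rough illustration rather than the tool's implementation: `disk_margins_ok` and its argument order are hypothetical, all sizes are taken to be in MB as in the 100MB example, and applying the previous sample's size to the percentage check as well is an assumption.

```shell
# Hypothetical helper, not part of pt-stalk. All sizes are in MB.
#   $1 = free space on the destination disk
#   $2 = total size of the destination disk
#   $3 = size of the previous sample (0 if there is none)
#   $4 = byte margin (stands in for --disk-byte-limit)
#   $5 = percent margin (stands in for --disk-pct-limit)
# Returns 0 if collecting another sample of the same size would still leave
# both margins satisfied, 1 otherwise.
disk_margins_ok() {
   # Estimate the free space left after writing a sample as large as the last one.
   free_after=$(( $1 - $3 ))
   [ "$free_after" -ge "$4" ] || return 1
   # Assumption: the percentage margin is checked against the same estimate.
   [ $(( free_after * 100 / $2 )) -ge "$5" ] || return 1
   return 0
}
```

With a 1000MB disk, `disk_margins_ok 200 1000 100 100 5` succeeds, which matches the example above: a 100MB margin plus a 100MB previous sample needs 200MB free. With only 199MB free the same call fails.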
=item --function
|
=item --function
|
||||||
|
|
||||||
type: string; default: status
|
type: string; default: status
|
||||||
|
|
||||||
Built-in function name or plugin file name which returns the value of C<VARIABLE>.
|
Specifies what to watch for a diagnostic trigger. The default value watches
|
||||||
|
SHOW GLOBAL STATUS, but you can also watch SHOW PROCESSLIST or supply a plugin
|
||||||
Possible values are:
|
file with your own custom code. This function supplies the value of
|
||||||
|
L<"--variable">, which is then compared against L<"--threshold"> to see if the
|
||||||
|
trigger condition is met. Additional options may be required as well; see
|
||||||
|
below. Possible values:
|
||||||
|
|
||||||
=over
|
=over
|
||||||
|
|
||||||
=item * status
|
=item * status
|
||||||
|
|
||||||
Grep the value of C<VARIABLE> from C<mysqladmin extended-status>.
|
This value specifies that the source of data for the diagnostic trigger is SHOW
|
||||||
|
GLOBAL STATUS. The value of L<"--variable"> then defines which status counter
|
||||||
|
is the trigger.
|
||||||
|
|
||||||
=item * processlist
|
=item * processlist
|
||||||
|
|
||||||
Count the number of processes in C<mysqladmin processlist> whose
|
This value specifies that the data for the diagnostic trigger comes from SHOW
|
||||||
C<VARIABLE> column matches C<MATCH>. For example:
|
FULL PROCESSLIST. The trigger value is the count of processes whose
|
||||||
|
L<"--variable"> column matches the L<"--match"> option. For example, to trigger
|
||||||
|
when more than 10 processes are in the "statistics" state, use the following
|
||||||
|
options:
|
||||||
|
|
||||||
TRIGGER_FUNCTION="processlist" \
|
--function processlist --variable State --match statistics --threshold 10
|
||||||
VARIABLE="State" \
|
|
||||||
MATCH="statistics" \
|
|
||||||
THRESHOLD="10"
|
|
||||||
|
|
||||||
The above triggers when more than 10 processes are in the "statistics" state.
|
=back
|
||||||
C<MATCH> must be specified for this trigger function.
|
|
||||||
|
|
||||||
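The counting that the processlist function performs can be approximated outside the tool with awk. The sketch below is an illustration only: `count_matching` is a hypothetical helper, and it assumes tab-separated input whose first row is a header, which is what the mysql client prints in batch mode.

```shell
# Hypothetical helper, not part of pt-stalk.
#   $1 = column name (stands in for --variable), $2 = regex (stands in for --match)
# Reads tab-separated rows with a header line on stdin and prints how many
# rows have a value in the named column that matches the regex.
count_matching() {
   awk -F'\t' -v col="$1" -v pat="$2" '
      # Locate the requested column in the header row.
      NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
      # Count data rows whose value in that column matches the pattern.
      c && $c ~ pat { n++ }
      END { print n + 0 }
   '
}
```

A possible use is `mysql $EXT_ARGV -e "SHOW FULL PROCESSLIST" | count_matching State statistics`; comparing the printed count with 10 corresponds to the threshold in the example above.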
=item * magic
|
In addition, you can specify a file that contains your custom trigger function,
|
||||||
|
written in Unix shell script. This can be a wrapper that executes anything you
|
||||||
|
wish. If the argument to --function is a file, then it takes precedence over
|
||||||
|
builtin functions, so if there is a file in the working directory named "status"
|
||||||
|
or "processlist" then the tool will use that file as a plugin, even though those
|
||||||
|
are otherwise recognized as reserved words for this option.
|
||||||
|
|
||||||
TODO
|
The plugin file works by providing a function called C<trg_plugin>, and the tool
|
||||||
|
simply sources the file and executes the function. For example, the function
|
||||||
=item * plugin file name
|
might look like the following:
|
||||||
|
|
||||||
A plugin file allows you to specify a custom trigger function. The plugin
|
|
||||||
file must contain a function called C<trg_plugin>. For example:
|
|
||||||
|
|
||||||
trg_plugin() {
|
trg_plugin() {
|
||||||
# Do some stuff.
|
mysql $EXT_ARGV -e "SHOW ENGINE INNODB STATUS" | grep -c "has waited at"
|
||||||
echo "$value"
|
|
||||||
}
|
}
|
||||||
|
|
||||||
The last output if the function (its "return value") must be a number.
|
This snippet will count the number of mutex waits inside of InnoDB. It
|
||||||
This number is compared to C<THRESHOLD>. All L<"ENVIRONMENT"> variables
|
illustrates the general principle: the function must output a number, which is
|
||||||
are available to the function.
|
then compared to the threshold as usual. The $EXT_ARGV variable contains the
|
||||||
|
MySQL options mentioned in the L<"SYNOPSIS"> above.
|
||||||
|
|
||||||
Do not alter the tool's existing global variables. Prefix any plugin-specific
|
The plugin should not alter the tool's existing global variables. Prefix any
|
||||||
global variables with "PLUGIN_".
|
plugin-specific global variables with "PLUGIN_" or make them local.
|
||||||
|
|
||||||
=back
|
=back
|
||||||
|
|
||||||
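The advice about plugin variables can be illustrated with a variant of the plugin shown above. This is a sketch under assumptions: `PLUGIN_PATTERN` is an illustrative name chosen to follow the "PLUGIN_" convention, and `$EXT_ARGV` is expected to be set by the tool before the function runs.

```shell
# Hypothetical plugin file. The only global it defines carries the PLUGIN_
# prefix, and the working variable inside the function is local, so none of
# the tool's own globals are touched.
PLUGIN_PATTERN="has waited at"

trg_plugin() {
   local count
   # grep -c exits non-zero when the count is 0, so ignore its exit status.
   count=$(mysql $EXT_ARGV -e "SHOW ENGINE INNODB STATUS" | grep -c "$PLUGIN_PATTERN") || true
   # The function's output must be a number; fall back to 0 if nothing was read.
   echo "${count:-0}"
}
```

Keeping the pattern in a prefixed global makes it easy to adjust without editing the function body.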
@@ -1323,14 +1391,15 @@ Print help and exit.
|
|||||||
|
|
||||||
type: int; default: 1
|
type: int; default: 1
|
||||||
|
|
||||||
Interval between checks.
|
Interval between checks for the diagnostic trigger.
|
||||||
|
|
||||||
=item --iterations
|
=item --iterations
|
||||||
|
|
||||||
type: int
|
type: int
|
||||||
|
|
||||||
Exit after triggering C<pt-collect> this many times. By default, the tool
|
Exit after collecting diagnostics this many times. By default, the tool
|
||||||
will collect as many times as it's triggered.
|
will continue to watch the server forever, but this is useful for scenarios
|
||||||
|
where you want to capture once and then exit, for example.
|
||||||
|
|
||||||
=item --log
|
=item --log
|
||||||
|
|
||||||
@@ -1342,13 +1411,14 @@ Print all output to this file when daemonized.
|
|||||||
|
|
||||||
type: string
|
type: string
|
||||||
|
|
||||||
Match pattern for C<processlist> L<"--function">.
|
The pattern to use when watching SHOW PROCESSLIST. See the documentation for
|
||||||
|
L<"--function"> for details.
|
||||||
|
|
||||||
=item --notify-by-email
|
=item --notify-by-email
|
||||||
|
|
||||||
type: string
|
type: string
|
||||||
|
|
||||||
Send mail to this list of addresses when C<pt-collect> triggers.
|
Send mail to this list of addresses when data is collected.
|
||||||
|
|
||||||
=item --pid
|
=item --pid
|
||||||
|
|
||||||
@@ -1360,42 +1430,47 @@ Create a PID file when daemonized.
|
|||||||
|
|
||||||
type: string
|
type: string
|
||||||
|
|
||||||
Collect file prefix.
|
The filename prefix for diagnostic samples. By default, samples have a timestamp
|
||||||
|
prefix based on the current local time, such as 2011_12_06_14_02_02, which is
|
||||||
If not specified, the current local time is used like C<2011_12_06_14_02_02>,
|
December 6, 2011 at 14:02:02.
|
||||||
which is December 6, 2011 at 14:02:02.
|
|
||||||
|
|
||||||
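A timestamp in the same format as the default prefix can be produced with the standard date utility. The function name `sample_prefix` is hypothetical, and whether the tool builds its prefix in exactly this way is an assumption; the sketch only shows the format.

```shell
# Hypothetical helper: prints the current local time as
# YEAR_MONTH_DAY_HOUR_MINUTE_SECOND, for example 2011_12_06_14_02_02.
sample_prefix() {
   date +%Y_%m_%d_%H_%M_%S
}
```

Because the fields run from most to least significant, sorting file names that start with this prefix also sorts the samples chronologically.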
=item --retention-time
|
=item --retention-time
|
||||||
|
|
||||||
type: int; default: 30
|
type: int; default: 30
|
||||||
|
|
||||||
Remove samples after this many days.
|
Number of days to retain collected samples. Any samples that are older will be
|
||||||
|
purged.
|
||||||
|
|
||||||
=item --run-time
|
=item --run-time
|
||||||
|
|
||||||
type: int; default: 30
|
type: int; default: 30
|
||||||
|
|
||||||
How long to collect statistics data for?
|
How long the tool will collect data when it triggers. This should not be longer
|
||||||
|
than L<"--sleep">. It is usually not necessary to change this; if the default 30
|
||||||
Make sure that this isn't longer than SLEEP.
|
seconds hasn't gathered enough diagnostic data, running longer is not likely to
|
||||||
|
do so. In fact, in many cases a shorter collection period is appropriate.
|
||||||
|
|
||||||
=item --sleep
|
=item --sleep
|
||||||
|
|
||||||
type: int; default: 300
|
type: int; default: 300
|
||||||
|
|
||||||
How long to sleep after collecting?
|
How long to sleep after collecting data. This prevents the tool from triggering
|
||||||
|
continuously, which might be a problem if the collection process is intrusive.
|
||||||
|
It also prevents filling up the disk or gathering too much data to analyze
|
||||||
|
reasonably.
|
||||||
|
|
||||||
=item --threshold
|
=item --threshold
|
||||||
|
|
||||||
type: int; default: 25
|
type: int; default: 25
|
||||||
|
|
||||||
Max number of C<N> to tolerate.
|
The threshold at which the diagnostic trigger should fire. See L<"--function">
|
||||||
|
for details.
|
||||||
|
|
||||||
=item --variable
|
=item --variable
|
||||||
|
|
||||||
type: string; default: Threads_running
|
type: string; default: Threads_running
|
||||||
|
|
||||||
This is the thing to check for.
|
The variable to compare against the threshold. See L<"--function"> for details.
|
||||||
|
|
||||||
=item --version
|
=item --version
|
||||||
|
|
||||||
|
@@ -178,6 +178,7 @@ diag(`cp $ENV{HOME}/.pt-stalk.conf $ENV{HOME}/.pt-stalk.conf.original 2>/dev/nul
|
|||||||
diag(`cp $trunk/t/pt-stalk/samples/config001.conf $ENV{HOME}/.pt-stalk.conf`);
|
diag(`cp $trunk/t/pt-stalk/samples/config001.conf $ENV{HOME}/.pt-stalk.conf`);
|
||||||
|
|
||||||
system "$trunk/bin/pt-stalk --dest $dest --pid $pid_file >$log_file 2>&1 &";
|
system "$trunk/bin/pt-stalk --dest $dest --pid $pid_file >$log_file 2>&1 &";
|
||||||
|
PerconaTest::wait_for_files($pid_file);
|
||||||
sleep 1;
|
sleep 1;
|
||||||
chomp($pid = `cat $pid_file`);
|
chomp($pid = `cat $pid_file`);
|
||||||
$retval = system("kill $pid 2>/dev/null");
|
$retval = system("kill $pid 2>/dev/null");
|
||||||
|