Merge lp:~percona-toolkit-dev/percona-toolkit/pt-stalk-2.0-docs r155.

2025-09-11 13:40:07 +00:00 · 2012-01-24 12:01:47 -07:00
parent 3c97ae27d1 b13ff30bb7
commit fa6a6cb8ff
2 changed files with 159 additions and 83 deletions
--- a/bin/pt-stalk
+++ b/bin/pt-stalk
@@ -1029,7 +1029,7 @@ main() {
   RAN_WITH="--function=$OPT_FUNCTION --variable=$OPT_VARIABLE --threshold=$OPT_THRESHOLD --match=$OPT_MATCH --cycles=$OPT_CYCLES --interval=$OPT_INTERVAL --iterations=$OPT_ITERATIONS --run-time=$OPT_RUN_TIME --sleep=$OPT_SLEEP --dest=$OPT_DEST --prefix=$OPT_PREFIX --notify-by-email=$OPT_NOTIFY_BY_EMAIL --log=$OPT_LOG --pid=$OPT_PID"
   log "Starting $0 $RAN_WITH"
-   # Make the collection dir exists.
+   # Make sure the collection dir exists.
   if [ ! -d "$OPT_DEST" ]; then
      mkdir -p "$OPT_DEST" || die "Cannot make --dest $OPT_DEST"
   fi
@@ -1136,16 +1136,17 @@ fi
 =head1 NAME
-pt-stalk - Wait for a condition to occur then begin collecting data.
+pt-stalk - Gather forensic data about MySQL when a problem occurs.
 =head1 SYNOPSIS
 Usage: pt-stalk [OPTIONS] [-- MYSQL OPTIONS]
-pt-stalk watches for a condition to become true, and when it does, executes
+pt-stalk watches for a trigger condition to become true, and then collects data
-a script.  By default it executes L<pt-collect>, but that can be customized.
+to help in diagnosing problems. It is designed to run as a daemon so that you
-This tool is useful for gathering diagnostic data when an infrequent event
+can diagnose intermittent problems that you cannot observe directly. You can
-occurs, so an expert person can review the data later.
+also use it to execute a custom command, or to gather the data on demand without
 waiting for the trigger to happen.
 =head1 RISKS
@@ -1154,7 +1155,9 @@ whether known or unknown, of using this tool.  The two main categories of risks
 are those created by the nature of the tool (e.g. read-only tools vs. read-write
 tools) and those created by bugs.
-pt-stalk is a read-only tool.  It should be very low-risk.
+pt-stalk is a read-only tool.  It should be very low-risk.  Some of the options
 can cause intrusive data collection to be performed, however, so if you enable
 any non-default options, you should read their documentation carefully.
 At the time of this release, we know of no bugs that could cause serious harm
 to users.
@@ -1168,37 +1171,60 @@ See also L<"BUGS"> for more information on filing bugs and getting help.
 =head1 DESCRIPTION
-Although pt-stalk comes pre-configured to do a specific thing, in general
+Sometimes a problem happens infrequently and for a short time, giving you no
-this tool is just a skeleton script for the following flow of actions:
+chance to see the system when it happens. How do you solve intermittent MySQL
 problems when you can't observe them? That's why pt-stalk exists. In addition to
 using it when there's a known problem on your servers, it is a good idea to run
 pt-stalk all the time, even when you think nothing is wrong.  You will
 appreciate the data it gathers when a problem occurs, because problems such as
 MySQL lockups or spikes of activity typically leave no evidence to use in root
 cause analysis.
-=over
+This tool does two things: it watches a server (typically MySQL) for a trigger
 to occur, and it gathers diagnostic data.  To use it effectively, you need to
 define a good trigger condition. A good trigger is sensitive enough to fire
 reliably when a problem occurs, so that you don't miss a chance to solve
 problems. On the other hand, a good trigger isn't prone to false positives, so
 you don't gather information when the server is functioning normally.
-=item 1.
+The most reliable triggers for MySQL tend to be the number of connections to the
 server, and the number of queries running concurrently. These are available in
 the SHOW GLOBAL STATUS command as Threads_connected and Threads_running.
 Sometimes Threads_connected is not a reliable indicator of trouble, but
 Threads_running usually is.  Your job, as the tool's user, is to define an
 appropriate trigger condition for the tool.  Choose carefully, because the
 quality of your results will depend on the trigger you choose.
-Loop infinitely, sleeping between iterations.
+You can define the trigger with the L<"--function">, L<"--variable">, and
 L<"--threshold"> options, among others.  Please read the documentation for
 --function to learn how to do this.
-=item 2.
+The pt-stalk tool, by default, simply watches MySQL repeatedly until the trigger
 becomes true. It then gathers diagnostics for a while, and sleeps afterwards for
 some time to prevent repeatedly gathering data if the condition remains true.
 In crude pseudocode, omitting some subtleties,
-In each iteration, run some command and get the output.
+  while true; do
    if --variable from --function is greater than --threshold; then
      observations++
      if observations is greater than --cycles; then
        capture diagnostics for --run-time seconds
        exit if --iterations is exceeded
        sleep for --sleep seconds
      fi
    fi
    clean up data that's older than --retention-time
    sleep for --interval seconds
  done
-=item 3.
+The diagnostic data is written to files whose names begin with a timestamp, so
 you can distinguish samples from each other in case the tool collects data
 multiple times.  The pt-sift tool is designed to help you browse and analyze the
 resulting samples of data.
-If the command fails or the output is larger than the threshold,
+Although this sounds simple enough, in practice there are a number of
-execute the collection script; but do not execute if the destination disk
+subtleties, such as detecting when the disk is beginning to fill up so that the
-is too full.
+tool doesn't cause the server to run out of disk space.
 =back
 By default, the tool is configured to execute mysqladmin extended-status and
 extract the value of the Threads_running variable; if this is greater than
 25, it runs the collection script.  This is really just placeholder code,
 and almost certainly needs to be customized!
 If the tool does execute the collection script, it will wait for a while
 before checking and executing again.  This is to prevent a continuous
 condition from causing a huge number of executions to fire off.
 The name 'stalk' is because 'watch' is already taken, and 'stalk' is fun.
 =head1 CONFIGURING
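The watch loop in the DESCRIPTION pseudocode could be sketched in POSIX shell roughly as follows. This is only an illustration, not the tool's actual code: the names stalk_loop, read_value, and collect are hypothetical, and the max_checks argument exists only so the sketch terminates. It also assumes that observations must be consecutive and that collection starts once the count reaches --cycles, which the pseudocode leaves open.

```shell
#!/bin/sh
# Hypothetical sketch of the trigger loop; none of these function names
# are taken from pt-stalk itself.

# Stand-in for the trigger source (assumption: prints one integer).
read_value() { echo "${FAKE_VALUE:-0}"; }

# Stand-in for the data collection step.
collect() { echo "collecting"; }

# Usage: stalk_loop THRESHOLD CYCLES ITERATIONS INTERVAL SLEEP MAX_CHECKS
stalk_loop() {
   threshold=$1 cycles=$2 iterations=$3 interval=$4 sleep_time=$5 max_checks=$6
   observations=0 collected=0 checks=0
   while [ "$checks" -lt "$max_checks" ]; do
      checks=$((checks + 1))
      value=$(read_value)
      if [ "$value" -gt "$threshold" ]; then
         observations=$((observations + 1))
         if [ "$observations" -ge "$cycles" ]; then
            collect
            collected=$((collected + 1))
            observations=0
            # Stop once the requested number of collections is reached.
            if [ "$collected" -ge "$iterations" ]; then
               return 0
            fi
            # Back off so a persistent condition does not re-trigger at once.
            sleep "$sleep_time"
         fi
      else
         # Assumption: a single healthy check resets the streak.
         observations=0
      fi
      sleep "$interval"
   done
}
```

For example, FAKE_VALUE=30 followed by "stalk_loop 25 3 1 0 0 10" collects once, on the third check, and returns.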
@@ -1212,23 +1238,43 @@ TODO
 default: yes; negatable: yes
-Collect system information.
+Collect system information.  You can negate this option to make the tool watch
 the system but not actually gather any diagnostic data.
 =item --collect-gdb
-Collect GDB stacktraces.
+Collect GDB stacktraces.  This is achieved by attaching to MySQL and printing
 stack traces from all threads. This will freeze the server for some period of
 time, ranging from a second or so to much longer on very busy systems with a lot
 of memory and many threads in the server.  For this reason, it is disabled by
 default. However, if you are trying to diagnose a server stall or lockup,
 freezing the server causes no additional harm, and the stack traces can be vital
 for diagnosis.
 In addition to freezing the server, there is also some risk of the server
 crashing or performing badly after GDB detaches from it.
 =item --collect-oprofile
-Collect oprofile data.
+Collect oprofile data.  This is achieved by starting an oprofile session,
 letting it run for the collection time, and then stopping and saving the
 resulting profile data in the system's default location.  Please read your
 system's oprofile documentation to learn more about this.
 =item --collect-strace
-Collect strace data.
+Collect strace data. This is achieved by attaching strace to the server, which
 will make it run very slowly until strace detaches.  The same cautions apply as
 those listed in --collect-gdb.  You should not enable this option together with
 --collect-gdb, because GDB and strace can't attach to the server process
 simultaneously.
 =item --collect-tcpdump
-Collect tcpdump data.
+Collect tcpdump data. This option causes tcpdump to capture all traffic on all
 interfaces for the port on which MySQL is listening.  You can later use
 pt-query-digest to decode the MySQL protocol and extract a log of query traffic
 from it.
 =item --config
@@ -1241,77 +1287,99 @@ first option on the command line.
 type: int; default: 5
-Number of times condition must be met before triggering collection.
+The number of times the trigger condition must be true before collecting data.
 This helps prevent false positives and makes the trigger condition less
 susceptible to firing when the condition recovers quickly.
 =item --daemonize
-Daemonize the tool.
+Daemonize the tool.  This causes the tool to fork into the background and log
 its output as specified in --log.
 =item --dest
 type: string; default: ${HOME}/collected
-Where to store collected data.
+Where to store the diagnostic data.  Each time the tool collects data, it writes
 to a new set of files, which are named with the current system timestamp.
 =item --disk-byte-limit
 type: int; default: 100
-Exit if the disk has less than this many MB free.
+Don't collect data unless the destination disk has this much free space. This
 prevents the tool from filling up the disk with diagnostic data.
 If the destination directory contains a previously captured sample of data, the
 tool will measure its size and use that as an estimate of how much data is
 likely to be gathered this time, too.  It will then be even more pessimistic,
 and will refuse to collect data unless the disk has enough free space to hold
 the sample and still have the desired amount of free space.  For example, if
 you'd like 100MB of free space and the previous diagnostic sample consumed
 100MB, the tool won't collect any data unless the disk has 200MB free.
 =item --disk-pct-limit
 type: int; default: 5
-Exit if the disk is less than this %full.
+Don't collect data unless the disk has at least this percent free space. This
 option works similarly to --disk-byte-limit, but specifies a percentage margin
 of safety instead of a byte margin of safety.  The tool honors both options, and
 will not collect any data unless both margins are satisfied.
 =item --function
 type: string; default: status
-Built-in function name or plugin file name which returns the value of C<VARIABLE>.
+Specifies what to watch for a diagnostic trigger.  The default value watches
-
+SHOW GLOBAL STATUS, but you can also watch SHOW PROCESSLIST or supply a plugin
-Possible values are:
+file with your own custom code.  This function supplies the value of
 L<"--variable">, which is then compared against L<"--threshold"> to see if the
 trigger condition is met.  Additional options may be required as well; see
 below. Possible values:
 =over
 =item * status
-Grep the value of C<VARIABLE> from C<mysqladmin extended-status>.
+This value specifies that the source of data for the diagnostic trigger is SHOW
 GLOBAL STATUS.  The value of L<"--variable"> then defines which status counter
 is the trigger.
 =item * processlist
-Count the number of processes in C<mysqladmin processlist> whose
+This value specifies that the data for the diagnostic trigger comes from SHOW
-C<VARIABLE> column matches C<MATCH>.  For example:
+FULL PROCESSLIST.  The trigger value is the count of processes whose
 L<"--variable"> column matches the L<"--match"> option.  For example, to trigger
 when more than 10 processes are in the "statistics" state, use the following
 options:
-   TRIGGER_FUNCTION="processlist" \
+  --function processlist --variable State --match statistics --threshold 10
   VARIABLE="State"               \
   MATCH="statistics"             \
   THRESHOLD="10"
-The above triggers when more than 10 processes are in the "statistics" state.
+=back
 C<MATCH> must be specified for this trigger function.
-=item * magic
+In addition, you can specify a file that contains your custom trigger function,
 written in Unix shell script.  This can be a wrapper that executes anything you
 wish.  If the argument to --function is a file, then it takes precedence over
 builtin functions, so if there is a file in the working directory named "status"
 or "processlist" then the tool will use that file as a plugin, even though those
 are otherwise recognized as reserved words for this option.
-TODO
+The plugin file works by providing a function called C<trg_plugin>, and the tool
-
+simply sources the file and executes the function.  For example, the function
-=item * plugin file name
+might look like the following:
 A plugin file allows you to specify a custom trigger function.  The plugin
 file must contain a function called C<trg_plugin>.  For example:
   trg_plugin() {
-      # Do some stuff.
+      value=$(mysql $EXT_ARGV -e "SHOW ENGINE INNODB STATUS" | grep -c "has waited at")
      echo "$value"
   }
-The last output if the function (its "return value") must be a number.
+This snippet will count the number of mutex waits inside of InnoDB. It
-This number is compared to C<THRESHOLD>.  All L<"ENVIRONMENT"> variables
+illustrates the general principle: the function must output a number, which is
-are available to the function.
+then compared to the threshold as usual.  The $EXT_ARGV variable contains the
 MySQL options mentioned in the L<"SYNOPSIS"> above.
-Do not alter the tool's existing global variables.  Prefix any plugin-specific
+The plugin should not alter the tool's existing global variables.  Prefix any
-global variables with "PLUGIN_".
+plugin-specific global variables with "PLUGIN_" or make them local.
 =back
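A plugin file along the lines described for --function might be laid out as in the sketch below. Only trg_plugin and $EXT_ARGV come from the documentation; the helper plugin_count_waits is a hypothetical name, split out so the counting step can be exercised without a running MySQL server.

```shell
# Hypothetical plugin file for --function; a sketch, not shipped code.

# Count lines on stdin that mention an InnoDB mutex wait.
# grep -c still prints 0 when nothing matches but exits non-zero,
# so "|| true" keeps the helper's exit status clean.
plugin_count_waits() {
   grep -c "has waited at" || true
}

# Entry point that the tool calls after sourcing this file.
# Its last output must be a single number.
trg_plugin() {
   mysql $EXT_ARGV -e "SHOW ENGINE INNODB STATUS" | plugin_count_waits
}
```

Keeping the parsing in a separate function makes it possible to feed it saved SHOW ENGINE INNODB STATUS output and check the count before relying on it as a trigger.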
@@ -1323,14 +1391,15 @@ Print help and exit.
 type: int; default: 1
-Interval between checks.
+Interval between checks for the diagnostic trigger.
 =item --iterations
 type: int
-Exit after triggering C<pt-collect> this many times.  By default, the tool
+Exit after collecting diagnostics this many times.  By default, the tool
-will collect as many times as it's triggered.
+will continue to watch the server forever, but this is useful for scenarios
 where you want to capture once and then exit, for example.
 =item --log
@@ -1342,13 +1411,14 @@ Print all output to this file when daemonized.
 type: string
-Match pattern for C<processlist> L<"--function">.
+The pattern to use when watching SHOW PROCESSLIST. See the documentation for
 L<"--function"> for details.
 =item --notify-by-email
 type: string
-Send mail to this list of addresses when C<pt-collect> triggers.
+Send mail to this list of addresses when data is collected.
 =item --pid
@@ -1360,42 +1430,47 @@ Create a PID file when daemonized.
 type: string
-Collect file prefix.
+The filename prefix for diagnostic samples. By default, samples have a timestamp
-
+prefix based on the current local time, such as 2011_12_06_14_02_02, which is
-If not specified, the current local time is used like C<2011_12_06_14_02_02>,
+December 6, 2011 at 14:02:02.
 which is December 6, 2011 at 14:02:02.
 =item --retention-time
 type: int; default: 30
-Remove samples after this many days.
+Number of days to retain collected samples.  Any samples that are older will be
 purged.
 =item --run-time
 type: int; default: 30
-How long to collect statistics data for?
+How long the tool will collect data when it triggers.  This should not be longer
-
+than L<"--sleep">. It is usually not necessary to change this; if the default 30
-Make sure that this isn't longer than SLEEP.
+seconds hasn't gathered enough diagnostic data, running longer is not likely to
 do so. In fact, in many cases a shorter collection period is appropriate.
 =item --sleep
 type: int; default: 300
-How long to sleep after collecting?
+How long to sleep after collecting data.  This prevents the tool from triggering
 continuously, which might be a problem if the collection process is intrusive.
 It also prevents filling up the disk or gathering too much data to analyze
 reasonably.
 =item --threshold
 type: int; default: 25
-Max number of C<N> to tolerate.
+The threshold at which the diagnostic trigger should fire.  See L<"--function">
 for details.
 =item --variable
 type: string; default: Threads_running
-This is the thing to check for.
+The variable to compare against the threshold. See L<"--function"> for details.
 =item --version
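The default timestamp prefix described under --prefix (for example 2011_12_06_14_02_02) can be reproduced with date(1). The sketch below is an assumption about how such a prefix could be generated, not a copy of the tool's code; sample_prefix and the "-example" file suffix are hypothetical.

```shell
# Hypothetical helper: print a prefix such as 2011_12_06_14_02_02
# based on the current local time.
sample_prefix() {
   date +%Y_%m_%d_%H_%M_%S
}

# Build a file name for one sample under a destination directory.
dest="${HOME}/collected"
prefix=$(sample_prefix)
echo "$dest/$prefix-example"
```

Because the fields run from year down to second, sorting the file names lexically also sorts the samples chronologically.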
--- a/t/pt-stalk/pt-stalk.t
+++ b/t/pt-stalk/pt-stalk.t
@@ -178,6 +178,7 @@ diag(`cp $ENV{HOME}/.pt-stalk.conf $ENV{HOME}/.pt-stalk.conf.original 2>/dev/nul
 diag(`cp $trunk/t/pt-stalk/samples/config001.conf $ENV{HOME}/.pt-stalk.conf`);
 system "$trunk/bin/pt-stalk --dest $dest --pid $pid_file >$log_file 2>&1 &";
 PerconaTest::wait_for_files($pid_file);
 sleep 1;
 chomp($pid = `cat $pid_file`);
 $retval = system("kill $pid 2>/dev/null");