Merge lp:~percona-toolkit-dev/percona-toolkit/pt-stalk-2.0-docs r155.

2025-09-09 04:59:04 +00:00 · 2012-01-24 12:01:47 -07:00
parent 3c97ae27d1 b13ff30bb7
commit fa6a6cb8ff
2 changed files with 159 additions and 83 deletions
--- a/bin/pt-stalk
+++ b/bin/pt-stalk
@@ -1029,7 +1029,7 @@ main() {
   RAN_WITH="--function=$OPT_FUNCTION --variable=$OPT_VARIABLE --threshold=$OPT_THRESHOLD --match=$OPT_MATCH --cycles=$OPT_CYCLES --interval=$OPT_INTERVAL --iterations=$OPT_ITERATIONS --run-time=$OPT_RUN_TIME --sleep=$OPT_SLEEP --dest=$OPT_DEST --prefix=$OPT_PREFIX --notify-by-email=$OPT_NOTIFY_BY_EMAIL --log=$OPT_LOG --pid=$OPT_PID"
   log "Starting $0 $RAN_WITH"

-   # Make the collection dir exists.
+   # Make sure the collection dir exists.
   if [ ! -d "$OPT_DEST" ]; then
      mkdir -p "$OPT_DEST" || die "Cannot make --dest $OPT_DEST"
   fi
@@ -1136,16 +1136,17 @@ fi

 =head1 NAME

-pt-stalk - Wait for a condition to occur then begin collecting data.
+pt-stalk - Gather forensic data about MySQL when a problem occurs.

 =head1 SYNOPSIS

 Usage: pt-stalk [OPTIONS] [-- MYSQL OPTIONS]

-pt-stalk watches for a condition to become true, and when it does, executes
-a script.  By default it executes L<pt-collect>, but that can be customized.
-This tool is useful for gathering diagnostic data when an infrequent event
-occurs, so an expert person can review the data later.
+pt-stalk watches for a trigger condition to become true, and then collects data
+to help in diagnosing problems. It is designed to run as a daemon so that you
+can diagnose intermittent problems that you cannot observe directly. You can
+also use it to execute a custom command, or to gather the data on demand without
+waiting for the trigger to happen.

 =head1 RISKS

@@ -1154,7 +1155,9 @@ whether known or unknown, of using this tool.  The two main categories of risks
 are those created by the nature of the tool (e.g. read-only tools vs. read-write
 tools) and those created by bugs.

-pt-stalk is a read-only tool.  It should be very low-risk.
+pt-stalk is a read-only tool.  It should be very low-risk.  Some of the options
+can cause intrusive data collection to be performed, however, so if you enable
+any non-default options, you should read their documentation carefully.

 At the time of this release, we know of no bugs that could cause serious harm
 to users.
@@ -1168,37 +1171,60 @@ See also L<"BUGS"> for more information on filing bugs and getting help.

 =head1 DESCRIPTION

-Although pt-stalk comes pre-configured to do a specific thing, in general
-this tool is just a skeleton script for the following flow of actions:
+Sometimes a problem happens infrequently and for a short time, giving you no
+chance to see the system when it happens. How do you solve intermittent MySQL
+problems when you can't observe them? That's why pt-stalk exists. In addition to
+using it when there's a known problem on your servers, it is a good idea to run
+pt-stalk all the time, even when you think nothing is wrong.  You will
+appreciate the data it gathers when a problem occurs, because problems such as
+MySQL lockups or spikes of activity typically leave no evidence to use in root
+cause analysis.

-=over
+This tool does two things: it watches a server (typically MySQL) for a trigger
+to occur, and it gathers diagnostic data.  To use it effectively, you need to
+define a good trigger condition. A good trigger is sensitive enough to fire
+reliably when a problem occurs, so that you don't miss a chance to solve
+problems. On the other hand, a good trigger isn't prone to false positives, so
+you don't gather information when the server is functioning normally.

-=item 1.
+The most reliable triggers for MySQL tend to be the number of connections to the
+server, and the number of queries running concurrently. These are available in
+the SHOW GLOBAL STATUS command as Threads_connected and Threads_running.
+Sometimes Threads_connected is not a reliable indicator of trouble, but
+Threads_running usually is.  Your job, as the tool's user, is to define an
+appropriate trigger condition for the tool.  Choose carefully, because the
+quality of your results will depend on the trigger you choose.

-Loop infinitely, sleeping between iterations.
+You can define the trigger with the L<"--function">, L<"--variable">, and
+L<"--threshold"> options, among others.  Please read the documentation for
+--function to learn how to do this.

-=item 2.
+The pt-stalk tool, by default, simply watches MySQL repeatedly until the trigger
+becomes true. It then gathers diagnostics for a while, and sleeps afterwards for
+some time to prevent repeatedly gathering data if the condition remains true.
+In crude pseudocode, omitting some subtleties,

-In each iteration, run some command and get the output.
+  while true; do
+    if --variable from --function is greater than --threshold; then
+      observations++
+      if observations is greater than --cycles; then
+        capture diagnostics for --run-time seconds
+        exit if --iterations is exceeded
+        sleep for --sleep seconds
+      fi
+    fi
+    clean up data that's older than --retention-time
+    sleep for --interval seconds
+  done

-=item 3.
+The diagnostic data is written to files whose names begin with a timestamp, so
+you can distinguish samples from each other in case the tool collects data
+multiple times.  The pt-sift tool is designed to help you browse and analyze the
+resulting samples of data.

-If the command fails or the output is larger than the threshold,
-execute the collection script; but do not execute if the destination disk
-is too full.
-
-=back
-
-By default, the tool is configured to execute mysqladmin extended-status and
-extract the value of the Threads_running variable; if this is greater than
-25, it runs the collection script.  This is really just placeholder code,
-and almost certainly needs to be customized!
-
-If the tool does execute the collection script, it will wait for a while
-before checking and executing again.  This is to prevent a continuous
-condition from causing a huge number of executions to fire off.
-
-The name 'stalk' is because 'watch' is already taken, and 'stalk' is fun.
+Although this sounds simple enough, in practice there are a number of
+subtleties, such as detecting when the disk is beginning to fill up so that the
+tool doesn't cause the server to run out of disk space.

 =head1 CONFIGURING

@@ -1212,23 +1238,43 @@ TODO

 default: yes; negatable: yes

-Collect system information.
+Collect system information.  You can negate this option to make the tool watch
+the system but not actually gather any diagnostic data.

 =item --collect-gdb

-Collect GDB stacktraces.
+Collect GDB stacktraces.  This is achieved by attaching to MySQL and printing
+stack traces from all threads. This will freeze the server for some period of
+time, ranging from a second or so to much longer on very busy systems with a lot
+of memory and many threads in the server.  For this reason, it is disabled by
+default. However, if you are trying to diagnose a server stall or lockup,
+freezing the server causes no additional harm, and the stack traces can be vital
+for diagnosis.
+
+In addition to freezing the server, there is also some risk of the server
+crashing or performing badly after GDB detaches from it.

 =item --collect-oprofile

-Collect oprofile data.
+Collect oprofile data.  This is achieved by starting an oprofile session,
+letting it run for the collection time, and then stopping and saving the
+resulting profile data in the system's default location.  Please read your
+system's oprofile documentation to learn more about this.

 =item --collect-strace

-Collect strace data.
+Collect strace data. This is achieved by attaching strace to the server, which
+will make it run very slowly until strace detaches.  The same cautions apply as
+those listed in --collect-gdb.  You should not enable this option together with
+--collect-gdb, because GDB and strace can't attach to the server process
+simultaneously.

 =item --collect-tcpdump

-Collect tcpdump data.
+Collect tcpdump data. This option causes tcpdump to capture all traffic on all
+interfaces for the port on which MySQL is listening.  You can later use
+pt-query-digest to decode the MySQL protocol and extract a log of query traffic
+from it.

 =item --config

@@ -1241,77 +1287,99 @@ first option on the command line.

 type: int; default: 5

-Number of times condition must be met before triggering collection.
+The number of times the trigger condition must be true before collecting data.
+This helps prevent false positives and make the trigger condition less
+susceptible to firing when the condition recovers quickly.

 =item --daemonize

-Daemonize the tool.
+Daemonize the tool.  This causes the tool to fork into the background and log
+its output as specified in --log.

 =item --dest

 type: string; default: ${HOME}/collected

-Where to store collected data.
+Where to store the diagnostic data.  Each time the tool collects data, it writes
+to a new set of files, which are named with the current system timestamp.

 =item --disk-byte-limit

 type: int; default: 100

-Exit if the disk has less than this many MB free.
+Don't collect data unless the destination disk has this many MB free. This
+prevents the tool from filling up the disk with diagnostic data.
+
+If the destination directory contains a previously captured sample of data, the
+tool will measure its size and use that as an estimate of how much data is
+likely to be gathered this time, too.  It will then be even more pessimistic,
+and will refuse to collect data unless the disk has enough free space to hold
+the sample and still have the desired amount of free space.  For example, if
+you'd like 100MB of free space and the previous diagnostic sample consumed
+100MB, the tool won't collect any data unless the disk has 200MB free.

 =item --disk-pct-limit

 type: int; default: 5

-Exit if the disk is less than this %full.
+Don't collect data unless the disk has at least this percent free space. This
+option works similarly to --disk-byte-limit, but specifies a percentage margin
+of safety instead of a byte margin of safety.  The tool honors both options, and
+will not collect any data unless both margins are satisfied.

 =item --function

 type: string; default: status

-Built-in function name or plugin file name which returns the value of C<VARIABLE>.
-
-Possible values are:
+Specifies what to watch for a diagnostic trigger.  The default value watches
+SHOW GLOBAL STATUS, but you can also watch SHOW PROCESSLIST or supply a plugin
+file with your own custom code.  This function supplies the value of
+L<"--variable">, which is then compared against L<"--threshold"> to see if the
+trigger condition is met.  Additional options may be required as well; see
+below. Possible values:

 =over

 =item * status

-Grep the value of C<VARIABLE> from C<mysqladmin extended-status>.
+This value specifies that the source of data for the diagnostic trigger is SHOW
+GLOBAL STATUS.  The value of L<"--variable"> then defines which status counter
+is the trigger.

 =item * processlist

-Count the number of processes in C<mysqladmin processlist> whose
-C<VARIABLE> column matches C<MATCH>.  For example:
+This value specifies that the data for the diagnostic trigger comes from SHOW
+FULL PROCESSLIST.  The trigger value is the count of processes whose
+L<"--variable"> column matches the L<"--match"> option.  For example, to trigger
+when more than 10 processes are in the "statistics" state, use the following
+options:

-   TRIGGER_FUNCTION="processlist" \
-   VARIABLE="State"               \
-   MATCH="statistics"             \
-   THRESHOLD="10"
+  --function processlist --variable State --match statistics --threshold 10

-The above triggers when more than 10 processes are in the "statistics" state.
-C<MATCH> must be specified for this trigger function.
+=back

-=item * magic
+In addition, you can specify a file that contains your custom trigger function,
+written in Unix shell script.  This can be a wrapper that executes anything you
+wish.  If the argument to --function is a file, then it takes precedence over
+builtin functions, so if there is a file in the working directory named "status"
+or "processlist" then the tool will use that file as a plugin, even though those
+are otherwise recognized as reserved words for this option.

-TODO
-
-=item * plugin file name
-
-A plugin file allows you to specify a custom trigger function.  The plugin
-file must contain a function called C<trg_plugin>.  For example:
+The plugin file works by providing a function called C<trg_plugin>, and the tool
+simply sources the file and executes the function.  For example, the function
+might look like the following:

   trg_plugin() {
-      # Do some stuff.
-      echo "$value"
+      mysql $EXT_ARGV -e "SHOW ENGINE INNODB STATUS" | grep -c "has waited at"
   }

-The last output if the function (its "return value") must be a number.
-This number is compared to C<THRESHOLD>.  All L<"ENVIRONMENT"> variables
-are available to the function.
+This snippet will count the number of mutex waits inside of InnoDB. It
+illustrates the general principle: the function must output a number, which is
+then compared to the threshold as usual.  The $EXT_ARGV variable contains the
+MySQL options mentioned in the L<"SYNOPSIS"> above.

-Do not alter the tool's existing global variables.  Prefix any plugin-specific
-global variables with "PLUGIN_".
+The plugin should not alter the tool's existing global variables.  Prefix any
+plugin-specific global variables with "PLUGIN_" or make them local.

 =back

@@ -1323,14 +1391,15 @@ Print help and exit.

 type: int; default: 1

-Interval between checks.
+Interval in seconds between checks for the diagnostic trigger.

 =item --iterations

 type: int

-Exit after triggering C<pt-collect> this many times.  By default, the tool
-will collect as many times as it's triggered.
+Exit after collecting diagnostics this many times.  By default, the tool
+will continue to watch the server forever, but this is useful for scenarios
+where you want to capture once and then exit, for example.

 =item --log

@@ -1342,13 +1411,14 @@ Print all output to this file when daemonized.

 type: string

-Match pattern for C<processlist> L<"--function">.
+The pattern to use when watching SHOW PROCESSLIST. See the documentation for
+L<"--function"> for details.

 =item --notify-by-email

 type: string

-Send mail to this list of addresses when C<pt-collect> triggers.
+Send mail to this list of addresses when data is collected.

 =item --pid

@@ -1360,42 +1430,47 @@ Create a PID file when daemonized.

 type: string

-Collect file prefix.
-
-If not specified, the current local time is used like C<2011_12_06_14_02_02>,
-which is December 6, 2011 at 14:02:02.
+The filename prefix for diagnostic samples. By default, samples have a timestamp
+prefix based on the current local time, such as 2011_12_06_14_02_02, which is
+December 6, 2011 at 14:02:02.

 =item --retention-time

 type: int; default: 30

-Remove samples after this many days.
+Number of days to retain collected samples.  Any samples that are older will be
+purged.

 =item --run-time

 type: int; default: 30

-How long to collect statistics data for?
-
-Make sure that this isn't longer than SLEEP.
+How long the tool will collect data when it triggers.  This should not be longer
+than L<"--sleep">. It is usually not necessary to change this; if the default 30
+seconds hasn't gathered enough diagnostic data, running longer is not likely to
+do so. In fact, in many cases a shorter collection period is appropriate.

 =item --sleep

 type: int; default: 300

-How long to sleep after collecting?
+How long to sleep, in seconds, after collecting data.  This prevents the tool
+from triggering continuously, which might be a problem if the collection
+process is intrusive.  It also prevents filling up the disk or gathering too
+much data to analyze reasonably.

 =item --threshold

 type: int; default: 25

-Max number of C<N> to tolerate.
+The threshold at which the diagnostic trigger should fire.  See L<"--function">
+for details.

 =item --variable

 type: string; default: Threads_running

-This is the thing to check for.
+The variable to compare against the threshold. See L<"--function"> for details.

 =item --version