Make --no-stalk _not_ force --iterations and other options. Extensively update the tool's docs.

2025-09-10 21:19:59 +00:00 · 2013-03-04 15:57:52 -07:00
parent 45813e082d
commit 0da15fb083
2 changed files with 124 additions and 91 deletions
--- a/bin/pt-stalk
+++ b/bin/pt-stalk
@@ -1079,7 +1079,7 @@ sleep_ok() {
   local seconds="$1"
   local msg="${2:-""}"
   if oktorun; then
-      [ "$msg" ] && info "$msg"
+      [ "$msg" ] && log "$msg"
      sleep $seconds
   fi
 }
@@ -1333,10 +1333,8 @@ if    [ "${0##*/}" = "$TOOL" ] \
   if [ -z "$OPT_STALK" -a "$OPT_COLLECT" ]; then
      # Not stalking; do immediate collect once.
      OPT_ITERATIONS=1
      OPT_CYCLES=0
-      OPT_SLEEP=0
+      echo "[iter=$OPT_ITERATIONS] [cycle=$OPT_CYCLES] [sleep=$OPT_SLEEP] [interval=$OPT_INTERVAL]"
      OPT_INTERVAL=0
   fi
   usage_or_errors "$0"
@@ -1412,17 +1410,17 @@ fi
 =head1 NAME
-pt-stalk - Gather forensic data about MySQL when a problem occurs.
+pt-stalk - Collect forensic data about MySQL when problems occur.
 =head1 SYNOPSIS
 Usage: pt-stalk [OPTIONS] [-- MYSQL OPTIONS]
-pt-stalk watches for a trigger condition to become true, and then collects data
+pt-stalk watches for a trigger condition to occur, then collects data
-to help in diagnosing problems. It is designed to run as a daemon with root
+to help diagnose problems.  The tool is designed to run as a daemon with root
 privileges, so that you can diagnose intermittent problems that you cannot
-observe directly. You can also use it to execute a custom command, or to gather
+observe directly.  You can also use it to execute a custom command, or to
-the data on demand without waiting for the trigger to happen.
+collect data on demand without waiting for the stalk trigger to occur.
 =head1 RISKS
@@ -1474,25 +1472,45 @@ quality of your results will depend on the trigger you choose.
 You can define the trigger with the L<"--function">, L<"--variable">, and
 L<"--threshold"> options, among others.  Please read the documentation for
--function to learn how to do this.
+L<"--function"> to learn how to do this.
 The pt-stalk tool, by default, simply watches MySQL repeatedly until the trigger
 becomes true. It then gathers diagnostics for a while, and sleeps afterwards for
 some time to prevent repeatedly gathering data if the condition remains true.
 In crude pseudocode, omitting some subtleties,
-  while true; do
+   while true; do
-    if --variable from --function is greater than --threshold; then
+      if --variable from --function > --threshold; then
-      observations++
+         cycles_true++
-      if observations is greater than --cycles; then
+         if cycles_true >= --cycles; then
-        capture diagnostics for --run-time seconds
+            --notify-by-email
-        exit if --iterations is exceeded
+            if --collect; then
-        sleep for --sleep seconds
+               if --disk-bytes-free and --disk-pct-free ok; then
-      done
+                  (--collect for --run-time seconds) &
-    done
+               fi
-    clean up data that's older than --retention-time
+               rm files in --dest older than --retention-time
-    sleep for --interval seconds
+            fi
-  done
+            iter++
            cycles_true=0
         fi
         if iter < --iterations; then
            sleep --sleep seconds
         else
            break
         fi
      else
         if iter < --iterations; then
            sleep --interval seconds
         else
            break
         fi
      fi
   done
   rm old --dest files older than --retention-time
   if --collect processes are still running; then
      wait up to --run-time * 3 seconds
      kill any remaining --collect processes
   fi
 The diagnostic data is written to files whose names begin with a timestamp, so
 you can distinguish samples from each other in case the tool collects data
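The trigger-counting part of the pseudocode can be sketched as a small shell function. This is an illustration only, not the tool's actual code: `stalk_loop` and its arguments are hypothetical, the sample values stand in for what `--function` would report, and the sleeps, disk checks, and background collection are left out. The sketch mirrors the pseudocode as written, so it does not reset the counter when a check is false.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the stalk loop; names are illustrative only.

# stalk_loop THRESHOLD CYCLES ITERATIONS VALUE...
# Walks the given sample values in place of real --function output and
# prints a line each time the value has exceeded THRESHOLD on CYCLES
# checks. Stops once ITERATIONS collections have happened.
stalk_loop() {
   local threshold="$1" cycles="$2" iterations="$3"
   shift 3
   local cycles_true=0 iter=0 value
   for value in "$@"; do
      if [ "$value" -gt "$threshold" ]; then
         cycles_true=$((cycles_true + 1))
         if [ "$cycles_true" -ge "$cycles" ]; then
            echo "collect at value=$value"
            iter=$((iter + 1))
            cycles_true=0
         fi
      fi
      # Both branches of the pseudocode end with this same check;
      # the real tool would sleep here instead of continuing at once.
      if [ "$iter" -ge "$iterations" ]; then
         break
      fi
   done
}

# threshold=25, cycles=2, iterations=1, then the sample values
stalk_loop 25 2 1 10 30 40 50 60
# prints "collect at value=40"
```

With a threshold of 25 and two cycles, the values 30 and 40 are the first two checks above the threshold, so the single allowed collection happens at 40 and the loop ends.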
@@ -1530,8 +1548,8 @@ are writable by non-root users.
 default: yes; negatable: yes
-Collect system information.  You can negate this option to make the tool watch
+Collect diagnostic data when the L<"--stalk"> trigger occurs.  Specify
-the system but not actually gather any diagnostic data.
+C<--no-collect> to make the tool watch the system but not collect data.
 See also L<"--stalk">.
@@ -1581,9 +1599,8 @@ first option on the command line.
 type: int; default: 5
-The number of times the trigger condition must be true before collecting data.
+How many times L<"--variable"> must be greater than L<"--threshold"> before triggering L<"--collect">.  This helps prevent false positives, and makes
-This helps prevent false positives, and makes the trigger condition less likely
+the trigger condition less likely to fire when the problem recovers quickly.
 to fire when the problem recovers quickly.
 =item --daemonize
@@ -1594,14 +1611,15 @@ its output as specified in --log.
 type: string; default: /var/lib/pt-stalk
-Where to store the diagnostic data.  Each time the tool collects data, it writes
+Where to save diagnostic data from L<"--collect">.  Each time the tool
-to a new set of files, which are named with the current system timestamp.
+collects data, it writes to a new set of files, which are named with the
 current system timestamp.
 =item --disk-bytes-free
 type: size; default: 100M
-Don't collect data if the disk has less than this much free space.
+Do not L<"--collect"> if the disk has less than this much free space.
 This prevents the tool from filling up the disk with diagnostic data.
 If the L<"--dest"> directory contains a previously captured sample of data,
@@ -1618,7 +1636,7 @@ Valid size value suffixes are k, M, G, and T.
 type: int; default: 5
-Don't collect data if the disk has less than this percent free space.
+Do not L<"--collect"> if the disk has less than this percent free space.
 This prevents the tool from filling up the disk with diagnostic data.
 This option works similarly to L<"--disk-bytes-free"> but specifies a
@@ -1630,57 +1648,57 @@ margins are satisfied.
 type: string; default: status
-Specifies what to watch for a diagnostic trigger.  The default value watches
+What to watch for the L<"--stalk"> trigger.  The default value watches
-SHOW GLOBAL STATUS, but you can also watch SHOW PROCESSLIST or supply a plugin
+C<SHOW GLOBAL STATUS>, but you can also watch C<SHOW PROCESSLIST> or specify
-file with your own custom code.  This function supplies the value of
+a file with your own custom code.  This function supplies the value of
 L<"--variable">, which is then compared against L<"--threshold"> to see if the
-trigger condition is met.  Additional options may be required as well; see
+L<"--stalk"> trigger condition is met.  Additional options may be required as
-below. Possible values:
+well; see below. Possible values are:
 =over
 =item * status
-This value specifies that the source of data for the diagnostic trigger is SHOW
+Watch C<SHOW GLOBAL STATUS> for the L<"--stalk"> trigger.  The value of
-GLOBAL STATUS.  The value of L<"--variable"> then defines which status counter
+L<"--variable"> then defines which status counter is the trigger.
 is the trigger.
 =item * processlist
-This value specifies that the data for the diagnostic trigger comes from SHOW
+Watch C<SHOW FULL PROCESSLIST> for the L<"--stalk"> trigger.  The trigger
-FULL PROCESSLIST.  The trigger value is the count of processes whose
+value is the count of processes whose L<"--variable"> column matches the
-L<"--variable"> column matches the L<"--match"> option.  For example, to trigger
+L<"--match"> option.  For example, to trigger L<"--collect"> when more than
-when more than 10 processes are in the "statistics" state, use the following
+10 processes are in the "statistics" state, specify:
 options:
-  --function processlist --variable State \
+   --function processlist \
-    --match statistics --threshold 10
+   --variable State       \
   --match statistics     \
   --threshold 10
 =back
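The processlist trigger value described above is a count of matching rows. A minimal sketch of that counting follows, assuming tab-separated input with a header line, which is how `mysql -e` prints results in batch mode. The helper `count_matching` is hypothetical and is not how the tool itself is implemented.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of pt-stalk.

# count_matching COLUMN PATTERN
# Reads tab-separated rows with a header line on stdin and prints how
# many rows have a COLUMN value that matches PATTERN.
count_matching() {
   awk -F'\t' -v col="$1" -v pat="$2" '
      NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
      c && $c ~ pat { n++ }
      END { print n + 0 }
   '
}

# Three sample rows stand in for real SHOW FULL PROCESSLIST output.
printf 'Id\tUser\tState\n1\tapp\tstatistics\n2\tapp\tSending data\n3\tapp\tstatistics\n' \
   | count_matching State statistics
# prints 2
```

The printed number plays the role of the trigger value that is compared with the threshold.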
-In addition, you can specify a file that contains your custom trigger function,
+In addition, you can specify a file that contains your custom trigger
-written in Unix shell script.  This can be a wrapper that executes anything you
+function, written in Unix shell script.  This can be a wrapper that executes
-wish.  If the argument to --function is a file, then it takes precedence over
+anything you wish.  If the argument to L<"--function"> is a file, then it
-builtin functions, so if there is a file in the working directory named "status"
+takes precedence over built-in functions, so if there is a file in the working
-or "processlist" then the tool will use that file as a plugin, even though those
+directory named "status" or "processlist" then the tool will use that file
-are otherwise recognized as reserved words for this option.
+even though those are valid built-in values.
-The plugin file works by providing a function called C<trg_plugin>, and the tool
+The file works by providing a function called C<trg_plugin>, and the tool
-simply sources the file and executes the function.  For example, the function
+simply sources the file and executes the function.  For example, the file
-might look like the following:
+might contain:
   trg_plugin() {
      mysql $EXT_ARGV -e "SHOW ENGINE INNODB STATUS" \
        | grep -c "has waited at"
   }
-This snippet will count the number of mutex waits inside of InnoDB. It
+This snippet will count the number of mutex waits inside InnoDB.  It
 illustrates the general principle: the function must output a number, which is
-then compared to the threshold as usual.  The $EXT_ARGV variable contains the
+then compared to L<"--threshold"> as usual.  The C<$EXT_ARGV> variable
-MySQL options mentioned in the L<"SYNOPSIS"> above.
+contains the MySQL options mentioned in the L<"SYNOPSIS"> above.
-The plugin should not alter the tool's existing global variables.  Prefix any
+The file should not alter the tool's existing global variables.  Prefix any
-plugin-specific global variables with "PLUGIN_" or make them local.
+file-specific global variables with "PLUGIN_" or make them local.
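As a further illustration of the rules above, here is a sketch of a trigger file that does not query MySQL at all. Everything specific in it is an assumption made for the example: the `PLUGIN_LOG_FILE` variable, its default path, and the "Deadlock found" pattern. It follows the two stated conventions: the function prints a single number, and the file's only global variable carries the "PLUGIN_" prefix.

```shell
#!/usr/bin/env bash
# Hypothetical trigger file. The log path and the pattern are assumptions.
PLUGIN_LOG_FILE="${PLUGIN_LOG_FILE:-/var/log/mysql/error.log}"

trg_plugin() {
   local count
   # grep -c prints 0 but exits non-zero when nothing matches, and prints
   # nothing when the file is unreadable, so fall back to 0 in both cases.
   count=$(grep -c "Deadlock found" "$PLUGIN_LOG_FILE" 2>/dev/null || true)
   echo "${count:-0}"
}
```

A count taken over a whole log file only grows, so a real trigger would more likely look at recent lines only. The sketch is meant to show the shape of the file, not a recommended trigger.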
 =item --help
@@ -1690,15 +1708,17 @@ Print help and exit.
 type: int; default: 1
-Interval between checks for the diagnostic trigger.
+How often to check the L<"--stalk"> trigger, in seconds.
 =item --iterations
 type: int
-Exit after collecting diagnostics this many times.  By default, the tool
+How many times to L<"--collect"> diagnostic data.  By default, the tool
-will continue to watch the server forever, but this is useful for scenarios
+runs forever and collects data every time the L<"--stalk"> trigger occurs.
-where you want to capture once and then exit, for example.
+Specify L<"--iterations"> to collect data a limited number of times.
 This option is also useful with C<--no-stalk> to collect data once and
 exit, for example.
 =item --log
@@ -1710,14 +1730,14 @@ Print all output to this file when daemonized.
 type: string
-The pattern to use when watching SHOW PROCESSLIST. See the documentation for
+The pattern to use when watching SHOW PROCESSLIST.  See L<"--function">
-L<"--function"> for details.
+for details.
 =item --notify-by-email
 type: string
-Send mail to this list of addresses when data is collected.
+Send an email to these addresses for every L<"--collect">.
 =item --pid
@@ -1746,7 +1766,7 @@ Called before stalking.
 =item before_collect
-Called when the stalk condition is triggered, before running a collector
+Called when the L<"--stalk"> trigger occurs, before running a L<"--collect">
 process as a backgrounded subshell.
 =item after_collect
@@ -1771,8 +1791,8 @@ this hook is only called if L<"--iterations"> is specified.
 =back
-For example, a very simple plugin that touches a file when a collector
+For example, a very simple plugin that touches a file when L<"--collect">
-process is triggered:
+is triggered:
  before_collect() {
      touch /tmp/foo
@@ -1797,9 +1817,9 @@ be set to indicate why the tool was stopped.
 type: string
-The filename prefix for diagnostic samples. By default, samples have a timestamp
+The filename prefix for diagnostic samples.  By default, all files created
-prefix based on the current local time, such as 2011_12_06_14_02_02, which is
+by the same L<"--collect"> instance have a timestamp prefix based on the current
-December 6, 2011 at 14:02:02.
+local time, like C<2011_12_06_14_02_02>, which is December 6, 2011 at 14:02:02.
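For reference, a timestamp in the same shape as the default prefix can be produced with the standard `date` utility. This is a sketch of the format only. The function name `sample_prefix` and the file name suffix are hypothetical and do not describe how the tool names its files.

```shell
#!/usr/bin/env bash
# sample_prefix: print the current local time as YYYY_MM_DD_HH_MM_SS.
sample_prefix() {
   date +%Y_%m_%d_%H_%M_%S
}

prefix=$(sample_prefix)
# The "-example" suffix is made up for illustration.
echo "${prefix}-example"
```

Because the fields run from year down to second, sorting file names alphabetically also sorts the samples by time.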
 =item --retention-time
@@ -1812,10 +1832,12 @@ purged.
 type: int; default: 30
-How long the tool will collect data when it triggers.  This should not be longer
+How long to L<"--collect"> diagnostic data when the L<"--stalk"> trigger occurs.
-than L<"--sleep">. It is usually not necessary to change this; if the default 30
+The value is in seconds and should not be longer than L<"--sleep">.  It is
-seconds hasn't gathered enough diagnostic data, running longer is not likely to
+usually not necessary to change this; if the default 30 seconds doesn't
-do so. In fact, in many cases a shorter collection period is appropriate.
+collect enough data, running longer is not likely to help because the system
 or MySQL server is probably too busy to respond.  In fact, in many cases a
 shorter collection period is appropriate.
 This value is used two other times.  After collecting, the collect subprocess
 will wait another L<"--run-time"> seconds for its commands to finish.  Some
@@ -1833,8 +1855,8 @@ all of its subprocesses.
 type: int; default: 300
-How long to sleep after collecting data.  This prevents the tool from triggering
+How long to sleep after L<"--collect">.  This prevents the tool
-continuously, which might be a problem if the collection process is intrusive.
+from triggering continuously, which might be a problem if the collection process is intrusive.
 It also prevents filling up the disk or gathering too much data to analyze
 reasonably.
@@ -1842,14 +1864,16 @@ reasonably.
 default: yes; negatable: yes
-Watch the server and wait for the trigger to occur.  You can negate this option
+Watch the server and wait for the trigger to occur.  Specify C<--no-stalk>
-to make the tool immediately gather any diagnostic data once and exit.  This is
+to collect diagnostic data immediately, that is, without waiting for the
-useful if a problem is already happening, but pt-stalk is not running, so
+trigger to occur.  You probably also want to specify values for
-you only want to collect diagnostic data.
+L<"--interval">, L<"--iterations">, and L<"--sleep">.  For example, to
 immediately collect data for 1 minute then exit, specify:
-If this option is negate, L<"--daemonize">, L<"--log">, L<"--pid">, and other
+   --no-stalk --run-time 60 --iterations 1
-stalking-related options have no effect; the tool simply collects diagnostic
+
-data and exits.  Safeguard options, like L<"--disk-bytes-free"> and
+L<"--cycles">, L<"--daemonize">, L<"--log"> and L<"--pid"> have no effect
 with C<--no-stalk>.  Safeguard options, like L<"--disk-bytes-free"> and
 L<"--disk-pct-free">, are still respected.
 See also L<"--collect">.
@@ -1858,14 +1882,18 @@ See also L<"--collect">.
 type: int; default: 25
-The threshold at which the diagnostic trigger should fire.  See L<"--function">
+The maximum acceptable value for L<"--variable">.  L<"--collect"> is
-for details.
+triggered when the value of L<"--variable"> is greater than L<"--threshold">
 L<"--cycles"> times.  Currently, there is no way to define a lower
 threshold to check for a L<"--variable"> value that is too low.
 See also L<"--function">.
 =item --variable
 type: string; default: Threads_running
-The variable to compare against the threshold. See L<"--function"> for details.
+The variable to compare against L<"--threshold">.  See also L<"--function">.
 =item --verbose
@@ -1995,7 +2023,8 @@ Replace C<TOOL> with the name of any tool.
 =head1 AUTHORS
-Baron Schwartz, Justin Swanhart, Fernando Ipar, and Daniel Nichter
+Baron Schwartz, Justin Swanhart, Fernando Ipar, Daniel Nichter,
 and Brian Fraser.
 =head1 ABOUT PERCONA TOOLKIT
--- a/t/pt-stalk/pt-stalk.t
+++ b/t/pt-stalk/pt-stalk.t
@@ -317,7 +317,11 @@ diag(`cp $ENV{HOME}/.pt-stalk.conf.original $ENV{HOME}/.pt-stalk.conf 2>/dev/nul
 cleanup();
-$retval = system("$trunk/bin/pt-stalk --no-stalk --run-time 2 --dest $dest --prefix nostalk --pid $pid_file -- --defaults-file=$cnf >$log_file 2>&1");
+# As of 2.2, --no-stalk means just that: don't stalk, just collect, so
 # we have to specify --iterations=1 else the tool will continue to run,
 # whereas in 2.1 --no-stalk implied/forced "collect once and exit".
 $retval = system("$trunk/bin/pt-stalk --no-stalk --run-time 2 --dest $dest --prefix nostalk --pid $pid_file --iterations 1 -- --defaults-file=$cnf >$log_file 2>&1");
 PerconaTest::wait_until(sub { !-f $pid_file });