Make --no-stalk _not_ force --iterations and other options. Extensively update the tool's docs.

2025-09-09 07:30:02 +00:00 · 2013-03-04 15:57:52 -07:00
parent 45813e082d
commit 0da15fb083
2 changed files with 124 additions and 91 deletions
--- a/bin/pt-stalk
+++ b/bin/pt-stalk
@@ -1079,7 +1079,7 @@ sleep_ok() {
   local seconds="$1"
   local msg="${2:-""}"
   if oktorun; then
-      [ "$msg" ] && info "$msg"
+      [ "$msg" ] && log "$msg"
      sleep $seconds
   fi
 }
@@ -1333,10 +1333,8 @@ if    [ "${0##*/}" = "$TOOL" ] \

   if [ -z "$OPT_STALK" -a "$OPT_COLLECT" ]; then
      # Not stalking; do immediate collect once.
-      OPT_ITERATIONS=1
      OPT_CYCLES=0
-      OPT_SLEEP=0
-      OPT_INTERVAL=0
+      echo "[iter=$OPT_ITERATIONS] [cycle=$OPT_CYCLES] [sleep=$OPT_SLEEP] [interval=$OPT_INTERVAL]"
   fi

   usage_or_errors "$0"
@@ -1412,17 +1410,17 @@ fi

 =head1 NAME

-pt-stalk - Gather forensic data about MySQL when a problem occurs.
+pt-stalk - Collect forensic data about MySQL when problems occur.

 =head1 SYNOPSIS

 Usage: pt-stalk [OPTIONS] [-- MYSQL OPTIONS]

-pt-stalk watches for a trigger condition to become true, and then collects data
-to help in diagnosing problems. It is designed to run as a daemon with root
+pt-stalk watches for a trigger condition to occur, then collects data
+to help diagnose problems.  The tool is designed to run as a daemon with root
 privileges, so that you can diagnose intermittent problems that you cannot
-observe directly. You can also use it to execute a custom command, or to gather
-the data on demand without waiting for the trigger to happen.
+observe directly.  You can also use it to execute a custom command, or to
+collect data on demand without waiting for the stalk trigger to occur.

 =head1 RISKS

@@ -1474,25 +1472,45 @@ quality of your results will depend on the trigger you choose.

 You can define the trigger with the L<"--function">, L<"--variable">, and
 L<"--threshold"> options, among others.  Please read the documentation for
--function to learn how to do this.
+L<"--function"> to learn how to do this.

 The pt-stalk tool, by default, simply watches MySQL repeatedly until the trigger
 becomes true. It then gathers diagnostics for a while, and sleeps afterwards for
 some time to prevent repeatedly gathering data if the condition remains true.
 In crude pseudocode, omitting some subtleties,

-  while true; do
-    if --variable from --function is greater than --threshold; then
-      observations++
-      if observations is greater than --cycles; then
-        capture diagnostics for --run-time seconds
-        exit if --iterations is exceeded
-        sleep for --sleep seconds
-      done
-    done
-    clean up data that's older than --retention-time
-    sleep for --interval seconds
-  done
+   while true; do
+      if --variable from --function > --threshold; then
+         cycles_true++
+         if cycles_true >= --cycles; then
+            --notify-by-email
+            if --collect; then
+               if --disk-bytes-free and --disk-pct-free ok; then
+                  (--collect for --run-time seconds) &
+               fi
+               rm files in --dest older than --retention-time
+            fi
+            iter++
+            cycles_true=0
+         fi
+         if iter < --iterations; then
+            sleep --sleep seconds
+         else
+            break
+         fi
+      else
+         if iter < --iterations; then
+            sleep --interval seconds
+         else
+            break
+         fi
+      fi
+   done
+   rm old --dest files older than --retention-time
+   if --collect process are still running; then
+      wait up to --run-time * 3 seconds
+      kill any remaining --collect processes 
+   fi

 The diagnostic data is written to files whose names begin with a timestamp, so
 you can distinguish samples from each other in case the tool collects data
@@ -1530,8 +1548,8 @@ are writable by non-root users.

 default: yes; negatable: yes

-Collect system information.  You can negate this option to make the tool watch
-the system but not actually gather any diagnostic data.
+Collect diagnostic data when the L<"--stalk"> trigger occurs.  Specify
+C<--no-collect> to make the tool watch the system but not collect data.

 See also L<"--stalk">.

@@ -1581,9 +1599,8 @@ first option on the command line.

 type: int; default: 5

-The number of times the trigger condition must be true before collecting data.
-This helps prevent false positives, and makes the trigger condition less likely
-to fire when the problem recovers quickly.
+How many times L<"--variable"> must be greater than L<"--threshold"> before triggering L<"--collect">.  This helps prevent false positives, and makes
+the trigger condition less likely to fire when the problem recovers quickly.

 =item --daemonize

@@ -1594,14 +1611,15 @@ its output as specified in --log.

 type: string; default: /var/lib/pt-stalk

-Where to store the diagnostic data.  Each time the tool collects data, it writes
-to a new set of files, which are named with the current system timestamp.
+Where to save diagnostic data from L<"--collect">.  Each time the tool
+collects data, it writes to a new set of files, which are named with the
+current system timestamp.

 =item --disk-bytes-free

 type: size; default: 100M

-Don't collect data if the disk has less than this much free space.
+Do not L<"--collect"> if the disk has less than this much free space.
 This prevents the tool from filling up the disk with diagnostic data.

 If the L<"--dest"> directory contains a previously captured sample of data,
@@ -1618,7 +1636,7 @@ Valid size value suffixes are k, M, G, and T.

 type: int; default: 5

-Don't collect data if the disk has less than this percent free space.
+Do not L<"--collect"> if the disk has less than this percent free space.
 This prevents the tool from filling up the disk with diagnostic data.

 This option works similarly to L<"--disk-bytes-free"> but specifies a
@@ -1630,57 +1648,57 @@ margins are satisfied.

 type: string; default: status

-Specifies what to watch for a diagnostic trigger.  The default value watches
-SHOW GLOBAL STATUS, but you can also watch SHOW PROCESSLIST or supply a plugin
-file with your own custom code.  This function supplies the value of
+What to watch for L<"--stalk"> trigger.  The default value watches
+C<SHOW GLOBAL STATUS>, but you can also watch C<SHOW PROCESSLIST> and specify
+a file with your own custom code.  This function supplies the value of
 L<"--variable">, which is then compared against L<"--threshold"> to see if the
-trigger condition is met.  Additional options may be required as well; see
-below. Possible values:
+L<"--stalk"> trigger condition is met.  Additional options may be required as
+well; see below. Possible values are:

 =over

 =item * status

-This value specifies that the source of data for the diagnostic trigger is SHOW
-GLOBAL STATUS.  The value of L<"--variable"> then defines which status counter
-is the trigger.
+Watch C<SHOW GLOBAL STATUS> for the L<"--stalk"> trigger.  The value of
+L<"--variable"> then defines which status counter is the trigger.

 =item * processlist

-This value specifies that the data for the diagnostic trigger comes from SHOW
-FULL PROCESSLIST.  The trigger value is the count of processes whose
-L<"--variable"> column matches the L<"--match"> option.  For example, to trigger
-when more than 10 processes are in the "statistics" state, use the following
-options:
+Watch C<SHOW FULL PROCESSLIST> for the L<"--stalk"> trigger.  The trigger
+value is the count of processes whose L<"--variable"> column matches the
+L<"--match"> option.  For example, to trigger L<"--collect"> when more than
+10 processes are in the "statistics" state, specify:

-  --function processlist --variable State \
-    --match statistics --threshold 10
+   --function processlist \
+   --variable State       \
+   --match statistics     \
+   --threshold 10

 =back

-In addition, you can specify a file that contains your custom trigger function,
-written in Unix shell script.  This can be a wrapper that executes anything you
-wish.  If the argument to --function is a file, then it takes precedence over
-builtin functions, so if there is a file in the working directory named "status"
-or "processlist" then the tool will use that file as a plugin, even though those
-are otherwise recognized as reserved words for this option.
+In addition, you can specify a file that contains your custom trigger
+function, written in Unix shell script.  This can be a wrapper that executes
+anything you wish.  If the argument to L<"--function"> is a file, then it
+takes precedence over built-in functions, so if there is a file in the working
+directory named "status" or "processlist" then the tool will use that file
+even though are valid built-in values.

-The plugin file works by providing a function called C<trg_plugin>, and the tool
-simply sources the file and executes the function.  For example, the function
-might look like the following:
+The file works by providing a function called C<trg_plugin>, and the tool
+simply sources the file and executes the function.  For example, the file
+might contain:

   trg_plugin() {
      mysql $EXT_ARGV -e "SHOW ENGINE INNODB STATUS" \
        | grep -c "has waited at"
   }

-This snippet will count the number of mutex waits inside of InnoDB. It
+This snippet will count the number of mutex waits inside InnoDB.  It
 illustrates the general principle: the function must output a number, which is
-then compared to the threshold as usual.  The $EXT_ARGV variable contains the
-MySQL options mentioned in the L<"SYNOPSIS"> above.
+then compared to L<"--threshold"> as usual.  The C<$EXT_ARGV> variable
+contains the MySQL options mentioned in the L<"SYNOPSIS"> above.

-The plugin should not alter the tool's existing global variables.  Prefix any
-plugin-specific global variables with "PLUGIN_" or make them local.
+The file should not alter the tool's existing global variables.  Prefix any
+file-specific global variables with "PLUGIN_" or make them local.

 =item --help

@@ -1690,15 +1708,17 @@ Print help and exit.

 type: int; default: 1

-Interval between checks for the diagnostic trigger.
+How often to check the L<"--stalk"> trigger, in seconds.

 =item --iterations

 type: int

-Exit after collecting diagnostics this many times.  By default, the tool
-will continue to watch the server forever, but this is useful for scenarios
-where you want to capture once and then exit, for example.
+How many times to L<"--collect"> diagnostic data.  By default, the tool
+runs forever and collects data every time the L<"--stalk"> trigger occurs.
+Specify L<"--iterations"> to collect data a limited number of times.
+This option is also useful with C<--no-stalk> to collect data once and
+exit, for example.

 =item --log

@@ -1710,14 +1730,14 @@ Print all output to this file when daemonized.

 type: string

-The pattern to use when watching SHOW PROCESSLIST. See the documentation for
-L<"--function"> for details.
+The pattern to use when watching SHOW PROCESSLIST.  See L<"--function">
+for details.

 =item --notify-by-email

 type: string

-Send mail to this list of addresses when data is collected.
+Send an email to these addresses for every L<"--collect">.

 =item --pid

@@ -1746,7 +1766,7 @@ Called before stalking.

 =item before_collect

-Called when the stalk condition is triggered, before running a collector
+Called when the L<"--stalk"> trigger occurs, before running a L<"--collect">
 process as a backgrounded subshell.

 =item after_collect
@@ -1771,8 +1791,8 @@ this hook is only called if L<"--iterations"> is specified.

 =back

-For example, a very simple plugin that touches a file when a collector
-process is triggered:
+For example, a very simple plugin that touches a file when L<"--collect">
+is triggered:

   before_colllect() {
      touch /tmp/foo
@@ -1797,9 +1817,9 @@ be set to indicate why the tool was stopped.

 type: string

-The filename prefix for diagnostic samples. By default, samples have a timestamp
-prefix based on the current local time, such as 2011_12_06_14_02_02, which is
-December 6, 2011 at 14:02:02.
+The filename prefix for diagnostic samples.  By default, all files created
+by the same L<"--collect"> instance have a timestamp prefix based on the current
+local time, like C<2011_12_06_14_02_02>, which is December 6, 2011 at 14:02:02.

 =item --retention-time

@@ -1812,10 +1832,12 @@ purged.

 type: int; default: 30

-How long the tool will collect data when it triggers.  This should not be longer
-than L<"--sleep">. It is usually not necessary to change this; if the default 30
-seconds hasn't gathered enough diagnostic data, running longer is not likely to
-do so. In fact, in many cases a shorter collection period is appropriate.
+How long to L<"--collect"> diagnostic data when the L<"--stalk"> trigger occurs.
+The value is in seconds and should not be longer than L<"--sleep">.  It is
+usually not necessary to change this; if the default 30 seconds doesn't
+collect enough data, running longer is not likely to help because the system
+or MySQL server is probably too busy to respond.  In fact, in many cases a
+shorter collection period is appropriate.

 This value is used two other times.  After collecting, the collect subprocess
 will wait another L<"--run-time"> seconds for its commands to finish.  Some
@@ -1833,8 +1855,8 @@ all of its subprocesses.

 type: int; default: 300

-How long to sleep after collecting data.  This prevents the tool from triggering
-continuously, which might be a problem if the collection process is intrusive.
+How long to sleep after L<"--collect">.  This prevents the tool
+from triggering continuously, which might be a problem if the collection process is intrusive.
 It also prevents filling up the disk or gathering too much data to analyze
 reasonably.

@@ -1842,14 +1864,16 @@ reasonably.

 default: yes; negatable: yes

-Watch the server and wait for the trigger to occur.  You can negate this option
-to make the tool immediately gather any diagnostic data once and exit.  This is
-useful if a problem is already happening, but pt-stalk is not running, so
-you only want to collect diagnostic data.
+Watch the server and wait for the trigger to occur.  Specify C<--no-stalk>
+to collect diagnostic data immediately, that is, without waiting for the
+trigger to occur.  You probably also want to specify values for
+L<"--interval">, L<"--iterations">, and L<"--sleep">.  For example, to
+immediately collect data for 1 minute then exit, specify:

-If this option is negate, L<"--daemonize">, L<"--log">, L<"--pid">, and other
-stalking-related options have no effect; the tool simply collects diagnostic
-data and exits.  Safeguard options, like L<"--disk-bytes-free"> and
+   --no-stalk --run-time 60 --iterations 1
+
+L<"--cycles">, L<"--daemonize">, L<"--log"> and L<"--pid"> have no effect
+with C<--no-stalk>.  Safeguard options, like L<"--disk-bytes-free"> and
 L<"--disk-pct-free">, are still respected.

 See also L<"--collect">.
@@ -1858,14 +1882,18 @@ See also L<"--collect">.

 type: int; default: 25

-The threshold at which the diagnostic trigger should fire.  See L<"--function">
-for details.
+The maximum acceptable value for L<"--variable">.  L<"--collect"> is
+triggered when the value of L<"--variable"> is greater than L<"--threshold">
+for L<"--cycles"> many times.  Currently, there is no way to define a lower
+threshold to check for a L<"--variable"> value that is too low.
+
+See also L<"--function">.

 =item --variable

 type: string; default: Threads_running

-The variable to compare against the threshold. See L<"--function"> for details.
+The variable to compare against L<"--threshold">.  See also L<"--function">.

 =item --verbose

@@ -1995,7 +2023,8 @@ Replace C<TOOL> with the name of any tool.

 =head1 AUTHORS

-Baron Schwartz, Justin Swanhart, Fernando Ipar, and Daniel Nichter
+Baron Schwartz, Justin Swanhart, Fernando Ipar, Daniel Nichter,
+and Brian Fraser.

 =head1 ABOUT PERCONA TOOLKIT