It was using maps in each context, which were merged between
contexts, then injected each time we needed to display a message.
It had a limitation on complicated operator setups: historical
information would be overridden by newer associations
(e.g., an IP was for node0 yesterday, now it's for node1, so associations
have been overwritten and are incorrect).
It also introduced complexity: closures had to be defined too
many times, maps had to be merged, it was harder to debug, and every file
started from empty translation maps.
Moreover, map iteration order is unspecified in Go and randomized in
practice, so it could create hard-to-debug output variations in complex cases.
Now it is a singleton in the translate package. It still uses maps, but
each key now maps to an array of "units" that store the timestamp with each piece of information.
It is protected by an RWMutex, because maps are not thread-safe (there is no
parallel processing for now).
No regressions, and it passes "operator_ambiguous_ips_list_all_no_color"
where the old system failed.
It can now also be used as an easy-to-read source of information in
itself.
It existed for non-operator setups, but was not working for operators
because k8s logs do not interpret newlines and tabs.
This operator version reuses the existing regular regex handlers directly.
Tests must be run multiple times to remove doubts.
As the tool reads files and relies on maps, their access order is
random, which can impact some translations.
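For illustration only, one common way to remove this kind of variation in Go is to iterate over sorted keys. This is a general technique, not something this commit claims to do; `sortedKeys` and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in lexical order so that callers can
// iterate deterministically instead of relying on map iteration order.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Hypothetical IP-to-node associations.
	ipToNode := map[string]string{
		"10.0.0.2": "node1",
		"10.0.0.1": "node0",
		"10.0.0.3": "node2",
	}
	// The output order is the same on every run.
	for _, ip := range sortedKeys(ipToNode) {
		fmt.Println(ip, ipToNode[ip])
	}
}
```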
When adding "ownip", it was also propagating the new IP to the old hash.
But with operators, when the IP changes the hash changes too, so
linking the new IP to the old hash is anachronistic. It is not wrong, but
depending on the order of map merges and the order of events, the newest
information could have been overridden.
That situation was producing X(2*number of conflicts) versions of output for operators, with
different md5sums, which could produce false positives in regression tests.
So currently some information is not linked anymore, and some IPs are
not translated even though they could be. This is a limitation of using
maps as the source of truth, as they are not versioned.
It does happen: 2 nodes can join at the same time, with 2 JOINERs and 2
DONORs cluster-wide.
It can happen on operators with 2 garbd joining at the same time.
Before, pt-galera-log-explainer was using SST metadata naively.
Basically, if a node was DONOR and we found a "transfer completed"
message, we assumed the donor name we found was the correct one.
So for concurrent SSTs, donors were swapping names.
Now, it is handled by a map indexed by donor name. To know whether a node
is the actual donor, it now compares the timestamps of events. It assumes
that the "selected donor" and "shifting DONOR" messages happen less than
0.01 seconds apart, to avoid any conflict.
Regression tests are coming in the next commit, with operator logs having
concurrent SSTs. Another conflict was sometimes breaking the test
depending on the order in which we read files, which is why it is not added
here yet.
It was due to a silly regression when reformatting main.go.
The iterating function was doing too many things and returned an error
when nothing was found, and a "continue" was done in the main
"timelineFromPaths" loop.
It is now a simple foreach loop that does not return an error, so we have to
check whether the localtimeline slice is empty.
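A rough sketch of that shape, assuming a much simpler signature than the real one: `event`, `timeline`, and `parse` are hypothetical stand-ins, and only the `timelineFromPaths` and local timeline names are borrowed from the text above.

```go
package main

import "fmt"

// event and timeline are hypothetical stand-ins for the tool's types.
type event struct{ Msg string }
type timeline []event

// parse is a placeholder for reading one file. It returns the events found
// and no error: finding nothing is not treated as a failure.
func parse(path string) timeline {
	if path == "empty.log" {
		return nil
	}
	return timeline{{Msg: "event from " + path}}
}

// timelineFromPaths skips files that produced no events by checking the
// length of the local timeline instead of relying on a returned error.
func timelineFromPaths(paths []string) map[string]timeline {
	out := map[string]timeline{}
	for _, p := range paths {
		localTimeline := parse(p)
		if len(localTimeline) == 0 {
			continue
		}
		out[p] = localTimeline
	}
	return out
}

func main() {
	tl := timelineFromPaths([]string{"node0.log", "empty.log"})
	fmt.Println(len(tl)) // 1
}
```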
- These two files exist for the same purpose but have different content
that could confuse users. Better to keep only CONTRIBUTING.md with all
details related to contributions.
* PT-2168 - PT-OSC shouldn't fail while unable to monitor a replica node
- Proof of concept
- Fixed regular expression in lib/TableParser.pm mistakenly changed in the tool's code
* PT-2168 - PT-OSC shouldn't fail while unable to monitor a replica node
- Added basic test case for PT-2168
- Added more details for replica lag information
- Disconnecting replica if lag is not checked. This prevents "Too many
connections" error
* PT-2168 - PT-OSC shouldn't fail while unable to monitor a replica node
- Implemented option --wait-lost-replicas for pt-osc, added test case
* PT-2168 - PT-OSC shouldn't fail while unable to monitor a replica node
- Added more tests for situations where connection to the replica can
fail
* PT-2168 - PT-OSC shouldn't fail while unable to monitor a replica node
- Removed extra checks for wait_no_die variable
- Added test cases for SQL queries that pt-osc sends to replicas
* PT-2168 - PT-OSC shouldn't fail while unable to monitor a replica node
- Allow reloading the dsns table while waiting for a missing replica if
--recursion-method is dsn
- Fixed logic in replica rediscovery, so it works with replicas on the
same host but with different ports
- Renamed option wait-lost-replicas to fail-on-stopped-replication, so
it is in line with pt-table-checksum
- Adjusted tests
- Removed debug code for PT-1760
- Added test case for PT-1760
- Added exception for variable Open_tables_with_triggers in
lib/bash/collect.sh due to failed test in Percona Server 8.0.34+
- Updated pt-stalk
* PT-2168 - PT-OSC shouldn't fail while unable to monitor a replica node
- Updated modules
- Fixed typo in t/pt-table-sync/bidirectional.t
- Removed trailing whitespace in lib/MasterSlave.pm
* PT-2168 - PT-OSC shouldn't fail while unable to monitor a replica node
- Help for option --fail-on-stopped-replication
* PT-2168 - PT-OSC shouldn't fail while unable to monitor a replica node
- Added check for availability of the simple_rewrite_plugin in t/pt-online-schema-change/pt-2168.t
* PT-2168 - PT-OSC shouldn't fail while unable to monitor a replica node
- Added link to the simple_rewrite_plugin source code
- Removed tests for code that runs only at the beginning of the pt-osc
action, so it should not be affected by the option fail-on-stopped-replication
* PT-2248 - pt-k8s-debug-collector does not run pg_gather with K8SPG 2
- Added check for K8SPG 2, so can run pg_gather for it
- Added new allowed value for option --resource:
-- pgv2 for K8SPG 2
-- auto to auto-detect custom resource
- Option --resource now has the default value "auto"
- Updated documentation
- Added test cases for new options
* PT-2248 - pt-k8s-debug-collector does not run pg_gather with K8SPG 2
- Implemented custom user and secrets handling (for the case where no
default user exists).