TableChunker helps determine how to “chunk” a table. Chunks are pre-determined ranges of rows defined by boundary values (sometimes also called endpoints) on numeric or numeric-like columns, including date/time types. Any numeric column type on which MySQL can do positional comparisons (<, <=, >, >=) works. Chunking on character data is not supported yet (but see issue 568).
Usually chunks range over all rows in a table but sometimes they only range over a subset of rows if an optional where arg is passed to various subs. In either case a chunk is like “`col` >= 5 AND `col` < 10”. If col is of type int and is unique, then that chunk ranges over up to 5 rows.
Chunks are included in WHERE clauses by various tools to do work on discrete chunks of the table instead of trying to work on the entire table at once. Chunks do not overlap, and their size is configurable via the chunk_size arg passed to several subs. The chunk_size can be a number of rows or a size like 1M, in which case it is in estimated bytes of data. Real chunk sizes are usually close to the requested chunk_size, but unless the optional exact arg is passed they are only approximate. Sometimes the distribution of values on the chunk column can skew chunking. If, for example, col has values 0, 100, 101, ... then the zero value skews chunking. The zero_chunk arg handles this.
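As a rough illustration of how a tool might consume chunks, the sketch below wraps a hard-coded list of chunk predicates in one query each. The `chunk_queries` helper and the `db`.`tbl` table name are hypothetical and not part of TableChunker; only the shape of the predicates comes from the description above.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each chunk predicate in its own query so the
# work is done one chunk at a time instead of on the whole table.
sub chunk_queries {
   my ( $db_tbl, @chunks ) = @_;
   return map { sprintf 'SELECT COUNT(*) FROM %s WHERE (%s)', $db_tbl, $_ } @chunks;
}

# Non-overlapping predicates that together cover every value of col.
my @chunks = (
   "`col` < 5",
   "`col` >= 5 AND `col` < 10",
   "`col` >= 10",
);

my @queries = chunk_queries( '`db`.`tbl`', @chunks );
print "$_\n" for @queries;
```

Because the predicates do not overlap, each row is visited by exactly one of the generated queries.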
TableChunker | TableChunker helps determine how to “chunk” a table. |
Functions | |
new | |
find_chunk_columns | Find chunkable columns. |
calculate_chunks | Calculate chunks for the given range statistics. |
_chunk_numeric | Determine how to chunk a numeric column. |
_chunk_char | Determine how to chunk a character column. |
get_first_chunkable_column | Get the first chunkable column in a table. |
size_to_rows | Convert a size in rows or bytes to a number of rows in the table, using SHOW TABLE STATUS. |
get_range_statistics | Determine the range of values for the chunk_col column on this table. |
inject_chunks | Create a SQL statement from a query prototype by filling in placeholders. |
value_to_number | |
range_num | |
range_time | |
range_date | |
range_datetime | |
range_timestamp | |
timestampdiff | |
get_valid_end_points | |
_get_valid_end_point | |
get_first_valid_value | |
_validate_temporal_value | |
get_nonzero_value | |
base_count | Count to any number in any base with the given symbols. |
_d | |
sub find_chunk_columns
Find chunkable columns.
%args | Arguments |
table_struct | Hashref returned from TableParser::parse() |
exact | bool: Try to support exact chunk sizes (may still chunk fuzzily) |
Returns an array: whether the table can be chunked exactly if requested (zero otherwise), and an arrayref of columns that support chunking. Example:
1,
[
  { column => 'id', index => 'PRIMARY' },
  { column => 'i', index => 'i_idx' },
]
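A caller might unpack that return value along the following lines. This is only a sketch: the values are copied from the example above rather than obtained from a real call, and `candidate_for_index` is a hypothetical helper.

```perl
use strict;
use warnings;

# Sample return value copied from the example above; a real call to
# find_chunk_columns() would supply these values instead.
my ( $exactable, $candidates ) = (
   1,
   [
      { column => 'id', index => 'PRIMARY' },
      { column => 'i',  index => 'i_idx' },
   ],
);

# Hypothetical helper: return the first candidate that uses the given
# index, or undef if no candidate uses it.
sub candidate_for_index {
   my ( $candidates, $index ) = @_;
   my ($match) = grep { $_->{index} eq $index } @$candidates;
   return $match;
}

my $by_primary = candidate_for_index( $candidates, 'PRIMARY' );
printf "exact chunking possible: %s\n", $exactable ? 'yes' : 'no';
printf "PRIMARY chunks on column %s\n", $by_primary->{column};
```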
sub calculate_chunks
Calculate chunks for the given range statistics. Args min, max and rows_in_range are returned from get_range_statistics() which is usually called before this sub. Min and max are expected to be valid values (NULL is valid).
%args | Arguments |
dbh | dbh |
db | database name |
tbl | table name |
tbl_struct | retval of TableParser::parse() |
chunk_col | column name to chunk on |
min | min col value, from TableChunker::get_range_statistics() |
max | max col value, from TableChunker::get_range_statistics() |
rows_in_range | number of rows to chunk, from TableChunker::get_range_statistics() |
chunk_size | requested size of each chunk |
exact | Use exact chunk_size? Use approximate sizes if not. |
tries | Fetch up to this many rows to find a non-zero value |
chunk_range | Make chunk range open (default) or openclosed |
Returns an array of WHERE predicates like “`col` >= '10' AND `col` < '20'”, one for each chunk. All values are single-quoted due to issue 1002. Example:
`film_id` < '30',
`film_id` >= '30' AND `film_id` < '60',
`film_id` >= '60' AND `film_id` < '90',
`film_id` >= '90',
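The shape of that result can be reproduced from a list of boundary values. The sketch below is an illustration only, not the module's implementation: `predicates_for` is a hypothetical helper, and returning a single always-true predicate for an empty boundary list is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: turn sorted boundary values into non-overlapping
# WHERE predicates. The first and last chunks are open-ended so that
# values below the first boundary or above the last are still covered.
sub predicates_for {
   my ( $col, @bounds ) = @_;

   # Assumption: with no boundaries, one always-true chunk covers the table.
   return ('1=1') unless @bounds;

   my @chunks = ("$col < '$bounds[0]'");
   for my $i ( 1 .. $#bounds ) {
      push @chunks, "$col >= '$bounds[$i-1]' AND $col < '$bounds[$i]'";
   }
   push @chunks, "$col >= '$bounds[-1]'";
   return @chunks;
}

print "$_\n" for predicates_for( '`film_id`', 30, 60, 90 );
```

With boundaries 30, 60 and 90 this prints the four predicates shown in the example, all with single-quoted values.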
sub _chunk_numeric
Determine how to chunk a numeric column.
%args | Arguments |
dbh | dbh |
db | database name |
tbl | table name |
tbl_struct | retval of TableParser::parse() |
chunk_col | column name to chunk on |
min | min col value, from TableChunker::get_range_statistics() |
max | max col value, from TableChunker::get_range_statistics() |
rows_in_range | number of rows to chunk, from TableChunker::get_range_statistics() |
chunk_size | requested size of each chunk |
exact | Use exact chunk_size? Use approximate sizes if not. |
tries | Fetch up to this many rows to find a non-zero value |
zero_chunk | Add an extra chunk for zero values? (0, 00:00, etc.) |
Returns an array of chunker info that calculate_chunks() uses to create chunks, like:
col => quoted chunk column name
start_point => start value (a Perl number)
end_point => end value (a Perl number)
interval => interval to walk from start_point to end_point (a Perl number)
range_func => coderef to return a value while walking that range
have_zero_chunk => whether to include a zero chunk (col=0)
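One plausible way to derive an interval from these arguments is to scale the value range by the ratio of chunk_size to rows_in_range. The sketch below assumes evenly distributed values and assumes that exact means a fixed width of chunk_size; both `chunk_interval` and `boundaries` are hypothetical helpers, and the actual calculation in the module may differ.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical sketch: width of each chunk so that roughly chunk_size rows
# fall into every interval, assuming evenly distributed values.
sub chunk_interval {
   my (%args) = @_;
   my ( $min, $max, $rows, $size ) = @args{qw(min max rows_in_range chunk_size)};
   return $size if $args{exact};    # assumption: exact means a fixed width
   return $size unless $rows;       # avoid dividing by zero on empty ranges
   my $interval = ceil( $size * ( $max - $min ) / $rows );
   return $interval || $size;       # never return a zero-width interval
}

# Walk from start_point to end_point in steps of the interval and collect
# the boundary values that fall strictly inside the range.
sub boundaries {
   my ( $start, $end, $interval ) = @_;
   my @points;
   for ( my $p = $start + $interval; $p < $end; $p += $interval ) {
      push @points, $p;
   }
   return @points;
}

my $interval = chunk_interval(
   min => 0, max => 1000, rows_in_range => 500, chunk_size => 100,
);
print join( ', ', boundaries( 0, 1000, $interval ) ), "\n";
```

Here 500 rows spread over values 0 to 1000 with a chunk_size of 100 give an interval of 200, so the boundaries are 200, 400, 600 and 800.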
sub _chunk_char
Determine how to chunk a character column.
%args | Arguments |
dbh | dbh |
db | database name |
tbl | table name |
tbl_struct | retval of TableParser::parse() |
chunk_col | column name to chunk on |
min | min col value, from TableChunker::get_range_statistics() |
max | max col value, from TableChunker::get_range_statistics() |
rows_in_range | number of rows to chunk, from TableChunker::get_range_statistics() |
chunk_size | requested size of each chunk |
Returns an array of chunker info that calculate_chunks() uses to create chunks, like:
col => quoted chunk column name
start_point => start value (a Perl number)
end_point => end value (a Perl number)
interval => interval to walk from start_point to end_point (a Perl number)
range_func => coderef to return a value while walking that range
sub get_first_chunkable_column
Get the first chunkable column in a table. Only a “sane” column/index pair is returned: if the preferred chunk column and chunk index make a really bad combination, the first auto-detected chunk column/index is used instead. An example of a bad combination is chunk col=x with chunk index=some index over (y, z), because the index doesn’t include the column; it would also be bad if the column weren’t a left-most prefix of the index.
%args | Arguments |
tbl_struct | Hashref returned by TableParser::parse() |
chunk_column | Preferred chunkable column name |
chunk_index | Preferred chunkable column index name |
exact | bool: passed to find_chunk_columns() |
Returns a list: chunkable column name, chunkable column index
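The “sane” check described above can be sketched as a left-most-prefix test. The `%indexes` map and the `is_sane_pair` helper are hypothetical stand-ins; the real sub works from tbl_struct, whose layout is not shown here.

```perl
use strict;
use warnings;

# Hypothetical index map: index name => ordered list of indexed columns.
my %indexes = (
   PRIMARY => ['id'],
   y_z     => [ 'y', 'z' ],
   x_y     => [ 'x', 'y' ],
);

# A column/index pair counts as sane here only if the index exists and the
# column is its left-most column.
sub is_sane_pair {
   my ( $indexes, $col, $index ) = @_;
   my $cols = $indexes->{$index} or return 0;
   return ( @$cols && $cols->[0] eq $col ) ? 1 : 0;
}

for my $pair ( [ 'x', 'y_z' ], [ 'y', 'x_y' ], [ 'x', 'x_y' ] ) {
   my ( $col, $index ) = @$pair;
   printf "%s on %s: %s\n", $col, $index,
      is_sane_pair( \%indexes, $col, $index ) ? 'sane' : 'not sane';
}
```

Column x with index y_z fails because the index does not include the column, and y with x_y fails because y is not the left-most column.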
sub size_to_rows
Convert a size in rows or bytes to a number of rows in the table, using SHOW TABLE STATUS. If the size is a string with a suffix of M/G/k, interpret it as mebibytes, gibibytes, or kibibytes respectively. If it’s just a number, treat it as a number of rows and return right away.
%args | Arguments |
dbh | dbh |
db | Database name |
tbl | Table name |
chunk_size | Chunk size string like “1000” or “50M” |
Returns an array: number of rows, average row size
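The conversion can be sketched without a database connection by passing the average row size in directly. In the sketch below `size_to_rows_sketch` and its `$avg_row_length` parameter are hypothetical; a real implementation would read the average row size from SHOW TABLE STATUS, and the rounding with ceil is an assumption.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical sketch: $avg_row_length stands in for the average row size
# that would normally come from SHOW TABLE STATUS.
sub size_to_rows_sketch {
   my ( $chunk_size, $avg_row_length ) = @_;
   my ( $num, $suffix ) = $chunk_size =~ m/^(\d+)([MGk])?$/
      or die "Invalid chunk size: $chunk_size\n";

   # A plain number is already a row count, so return right away.
   return ( $num, undef ) unless $suffix;

   # k, M and G are kibibytes, mebibytes and gibibytes respectively.
   my %shift = ( k => 10, M => 20, G => 30 );
   my $bytes = $num * 2**$shift{$suffix};
   my $rows  = $avg_row_length ? ceil( $bytes / $avg_row_length ) : undef;
   return ( $rows, $avg_row_length );
}

my ( $rows, $avg ) = size_to_rows_sketch( '50M', 100 );
print "50M at $avg bytes per row is about $rows rows\n";
```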
sub get_range_statistics
Determine the range of values for the chunk_col column on this table.
%args | Arguments |
dbh | dbh |
db | Database name |
tbl | Table name |
chunk_col | Chunk column name |
tbl_struct | Hashref returned by TableParser::parse() |
where | WHERE clause without “WHERE” to restrict range |
index_hint | “FORCE INDEX (...)” clause |
tries | Fetch up to this many rows to find a valid value |
Returns an array: min row value, max row value, rows in range
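The min and max values could plausibly come from a query like the one built below. This is a sketch under assumptions: `range_query` is a hypothetical helper, the sakila.film names are illustrative, identifiers are quoted naively without escaping, and how the module actually obtains the rows-in-range figure is not covered.

```perl
use strict;
use warnings;

# Hypothetical sketch: build a MIN/MAX query for the chunk column,
# optionally restricted by an index hint and a WHERE clause.
sub range_query {
   my (%args) = @_;
   my $col = "`$args{chunk_col}`";    # naive quoting, no escaping
   my $sql = "SELECT MIN($col), MAX($col) FROM `$args{db}`.`$args{tbl}`";
   $sql .= " $args{index_hint}"    if $args{index_hint};
   $sql .= " WHERE ($args{where})" if $args{where};
   return $sql;
}

print range_query(
   db         => 'sakila',
   tbl        => 'film',
   chunk_col  => 'film_id',
   index_hint => 'FORCE INDEX (`PRIMARY`)',
   where      => 'film_id > 10',
), "\n";
```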
sub inject_chunks
Create a SQL statement from a query prototype by filling in placeholders.
%args | Arguments |
database | Database name |
table | Table name |
chunks | Arrayref of chunks from calculate_chunks() |
chunk_num | Index into chunks to use |
query | Query prototype returned by TableChecksum::make_checksum_query() |
index_hint | “FORCE INDEX (...)” clause |
where | Arrayref of WHERE clauses joined with AND |
Returns a SQL statement
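Placeholder filling of this kind can be sketched with plain substitutions. The placeholder names /*DB_TBL*/, /*INDEX_HINT*/ and /*WHERE*/ used below are assumptions made for illustration, as are the `inject_sketch` helper and the sample prototype; the format actually produced by TableChecksum::make_checksum_query() may differ.

```perl
use strict;
use warnings;

# Hypothetical sketch: fill a query prototype with the table name, an
# optional index hint, and the selected chunk ANDed with extra clauses.
sub inject_sketch {
   my (%args) = @_;
   my @parts = grep { defined($_) && length($_) }
      ( $args{chunks}->[ $args{chunk_num} ], @{ $args{where} || [] } );
   my $where = @parts
      ? 'WHERE ' . join( ' AND ', map { sprintf '(%s)', $_ } @parts )
      : '';
   my $db_tbl = "`$args{database}`.`$args{table}`";
   my $hint   = $args{index_hint} || '';

   my $sql = $args{query};
   $sql =~ s{/\*DB_TBL\*/}{$db_tbl};
   $sql =~ s{/\*INDEX_HINT\*/}{$hint};
   $sql =~ s{/\*WHERE\*/}{$where};
   return $sql;
}

my $sql = inject_sketch(
   database   => 'sakila',
   table      => 'film',
   chunks     => [ "`film_id` < '30'", "`film_id` >= '30' AND `film_id` < '60'" ],
   chunk_num  => 1,
   query      => 'SELECT COUNT(*) FROM /*DB_TBL*/ /*INDEX_HINT*/ /*WHERE*/',
   index_hint => 'FORCE INDEX (`PRIMARY`)',
   where      => ['length > 60'],
);
print "$sql\n";
```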
sub base_count
Count to any number in any base with the given symbols. E.g. if counting to 10 in base 16 with symbols 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f the result is “a”. This is trivial for stuff like base 16 (hex), but far less trivial for arbitrary bases with arbitrary symbols like base 25 with symbols B,C,D,...X,Y,Z. For that, counting to 10 results in “L”. The base and its symbols are determined by the character column. Symbols can be non-ASCII.
%args | Arguments |
count_to | Number to count to |
base | Base of special system |
symbols | Arrayref of symbols for “numbers” in special system |
Returns the “number” (symbol) in the special target base system
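The two worked examples above can be reproduced with repeated division by the base. The sketch below is one straightforward way to do it and is not necessarily how the module implements it; `base_count_sketch` is a hypothetical name.

```perl
use strict;
use warnings;

# Sketch: express count_to in the given base, using the caller-supplied
# symbols as digits (symbol 0 is the first element of the arrayref).
sub base_count_sketch {
   my (%args) = @_;
   my ( $n, $base, $symbols ) = @args{qw(count_to base symbols)};
   return $symbols->[0] if $n == 0;

   my @digits;
   while ( $n > 0 ) {
      unshift @digits, $symbols->[ $n % $base ];    # least significant first
      $n = int( $n / $base );
   }
   return join '', @digits;
}

my $hex    = base_count_sketch( count_to => 10, base => 16, symbols => [ 0 .. 9, 'a' .. 'f' ] );
my $base25 = base_count_sketch( count_to => 10, base => 25, symbols => [ 'B' .. 'Z' ] );
print "$hex $base25\n";    # prints "a L"
```

Numbers at or above the base produce multi-symbol results, for example 255 in base 16 gives “ff”.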