SQLParser

SQLParser parses common MySQL SQL statements into data structures. This parser is MySQL-specific and intentionally handles only “common” cases. Although there are many limitations (such as UNION, CASE, etc.), it handles many complex cases that no other free Perl SQL parser could parse at the time of writing, notably subqueries in all their places and varieties.

This package has not been profiled, and because it relies heavily on mildly complex regexes, do not expect amazing performance.

See SQLParser.t for examples of the various data structures. There are many and they vary a lot depending on the statement parsed, so documentation in this file is not exhaustive.

This package differs from QueryParser because here we parse the entire SQL statement (thus giving access to all its parts), whereas QueryParser extracts just needed parts (and ignores all the rest).

Summary

SQLParser	SQLParser parses common MySQL SQL statements into data structures.
Variables
$quoted_ident
$unquoted_ident
$ident_alias
$table_ident
$column_ident
Functions
new	Create a SQLParser object.
parse	Parse a SQL statement.
_parse_clauses	Parse raw text of clauses into data structures.
clean_query	Remove spaces, flatten, and normalize some patterns for easier parsing.
normalize_keyword_spaces	Normalize spaces around certain SQL keywords.
_parse_query	This sub is called by the parse_TYPE subs except parse_insert.
parse_from	Parse a FROM clause, a.k.a. the table references.
parse_identifiers	Parse an arrayref of identifiers into their parts.
split_unquote	Split and unquote a table name.
is_identifier	Determine if something is a schema object identifier.

Variables

$quoted_ident

my $quoted_ident

$unquoted_ident

my $unquoted_ident

$ident_alias

my $ident_alias

$table_ident

my $table_ident

$column_ident

my $column_ident

Functions

new

sub new

Create a SQLParser object.

Parameters

%args

Arguments

Optional Arguments

Schema

Schema object. Can be set later by calling <set_Schema()>.

Returns

SQLParser object

parse

sub parse

Parse a SQL statement. Only statements of $allowed_types are parsed. This sub recurses to parse subqueries.

Parameters

$query

SQL statement

Returns

A complex hashref of the parsed SQL statement. All keys and almost all values are lowercase for consistency. The struct is roughly:

{
  type       => '',     # one of $allowed_types
  clauses    => {},     # raw, unparsed text of clauses
  <clause>   => struct  # parsed clause struct, e.g. from => [<tables>]
  keywords   => {},     # LOW_PRIORITY, DISTINCT, SQL_CACHE, etc.
  functions  => {},     # MAX(), SUM(), NOW(), etc.
  select     => {},     # SELECT struct for INSERT/REPLACE ... SELECT
  subqueries => [],     # pointers to subquery structs
}

It varies, of course, depending on the query. If something is missing it means the query doesn’t have that part. E.g. INSERT has an INTO clause but DELETE does not, and only DELETE and SELECT have FROM clauses. Each clause struct is different; see their respective parse_CLAUSE subs.
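
As an illustration of how a caller might consume this struct, the sketch below builds a hashref by hand in the shape outlined above and reports which parts are present. It is not real output from parse(): the query, the values, and the helper name describe_struct are assumptions made for the example.

```perl
use strict;
use warnings;

# Hand-built struct shaped like the outline above; NOT actual parse() output.
# Hypothetical query: SELECT DISTINCT id FROM t1 WHERE id > 10
my $struct = {
   type     => 'select',
   clauses  => {
      columns => 'id',
      from    => 't1',
      where   => 'id > 10',
   },
   keywords => { distinct => 1 },
   from     => [ { name => 't1' } ],
};

# Hypothetical helper: summarize which parts a struct has.
# A missing key means the query does not have that part.
sub describe_struct {
   my ( $s ) = @_;
   my @parts = ( "type=$s->{type}" );
   push @parts, 'clauses='    . join(',', sort keys %{ $s->{clauses}  || {} });
   push @parts, 'keywords='   . join(',', sort keys %{ $s->{keywords} || {} });
   push @parts, 'subqueries=' . scalar( @{ $s->{subqueries} || [] } );
   return join('; ', @parts);
}

print describe_struct($struct), "\n";
```

Defaulting absent keys to an empty hashref or arrayref keeps the caller from having to test for every optional part before dereferencing it.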

_parse_clauses

sub _parse_clauses

Parse raw text of clauses into data structures. This sub recurses to parse the clauses of subqueries. The clauses are read from and their data structures saved into the $struct parameter.

Parameters

$struct

Hashref from which clauses are read (%{$struct->{clauses}}) and into which data structs are saved (e.g. $struct->{from}=...).

clean_query

sub clean_query

Remove spaces, flatten, and normalize some patterns for easier parsing.

Parameters

$query

SQL statement

Returns

Cleaned $query

normalize_keyword_spaces

sub normalize_keyword_spaces

Normalize spaces around certain SQL keywords. Spaces are added and removed around certain SQL keywords to make parsing easier.

Parameters

$query

SQL statement

Returns

Normalized $query

_parse_query

This sub is called by the parse_TYPE subs except parse_insert. It does two things: it removes and saves the given keywords, all of which should appear at the beginning of the query; and it saves (but does not remove) the given clauses. The query should start with the values for the first clause because the query’s first word was removed in parse(). So for “SELECT cols FROM ...”, the query given here is “cols FROM ...” where “cols” belongs to the first clause “columns”. Then the query is walked clause-by-clause, saving each.

Parameters

$query	SQL statement with first word (SELECT, INSERT, etc.) removed
$keywords	Compiled regex of keywords that can appear in $query
$first_clause	First clause word to expect in $query
$clauses	Compiled regex of clause words that can appear in $query

Returns

Hashref with raw text of clauses
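
The two steps described above can be sketched roughly as follows. This is a stand-alone illustration, not the module's implementation: the name walk_query, the two-value return, the sample keyword and clause regexes, and the choice to turn “ORDER BY” into the key order_by are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical sketch of the keyword/clause walk; not the module's own code.
sub walk_query {
   my ( $query, $keywords, $first_clause, $clauses ) = @_;

   # Step 1: remove and save keywords at the beginning of the query.
   my %kw;
   while ( $query =~ s/^\s*($keywords)\b\s*//i ) {
      $kw{ lc $1 } = 1;
   }

   # Step 2: walk the query clause by clause. split with a capturing group
   # returns text and clause words alternately: (text, WORD, text, ...).
   my @parts = split /\s*\b($clauses)\b\s*/i, $query;
   my %raw;
   $raw{$first_clause} = shift @parts;   # text before the first clause word
   while ( @parts ) {
      my $clause = lc shift @parts;
      $clause =~ s/\s+/_/g;              # assumption: "ORDER BY" => order_by
      $raw{$clause} = shift @parts;
   }
   return ( \%kw, \%raw );
}

# Sample regexes chosen for this example only.
my $keywords = qr/DISTINCT|SQL_CACHE|HIGH_PRIORITY/i;
my $clauses  = qr/FROM|WHERE|GROUP BY|ORDER BY|LIMIT/i;

my ( $kw, $raw ) = walk_query(
   'DISTINCT id, name FROM t1 WHERE id > 10 ORDER BY name',
   $keywords, 'columns', $clauses,
);
print "$_: $raw->{$_}\n" for sort keys %$raw;
```

A split this naive also matches clause words inside string literals and subqueries, so it only illustrates the walk on a flat query.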

parse_from

Parse a FROM clause, a.k.a. the table references. Does not handle nested joins. See http://dev.mysql.com/doc/refman/5.1/en/join.html

Parameters

$from

FROM clause (with the word “FROM”)

Returns

Arrayref of hashrefs, one hashref for each table in the order that the tables appear, like:

{
  name           => 't2',  -- table's real name
  alias          => 'b',   -- table's alias, if any
  explicit_alias => 1,     -- if explicitly aliased with AS
  join  => {               -- if joined to another table, all but first
                           -- table are because comma implies INNER JOIN
    to        => 't1',     -- table name on left side of join, if this is
                           -- LEFT JOIN then this is the inner table, if
                           -- RIGHT JOIN then this is outer table
    type      => '',       -- left, right, inner, outer, cross, natural
    condition => 'using',  -- on or using, if applicable
    columns   => ['id'],   -- columns for USING condition, if applicable
    ansi      => 1,        -- true if ANSI JOIN, i.e. true if not implicit
                           -- INNER JOIN due to following a comma
  },
},
{
  name => 't3',
  join => {
    to        => 't2',
    type      => 'left',
    condition => 'on',     -- an ON condition is like a WHERE clause so
    where     => [...]     -- this arrayref of predicates appears, see
                           -- <parse_where()> for its structure
  },
},
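
To show how the returned arrayref might be traversed, the sketch below copies the example above into a Perl literal (with a first table added, since all but the first table carry a join) and renders each join as a short string. The helper name describe_joins, the first table, and the output format are assumptions for illustration.

```perl
use strict;
use warnings;

# Hand-copied from the example above; the first table is hypothetical.
my $tables = [
   { name => 't1', alias => 'a' },
   {
      name           => 't2',
      alias          => 'b',
      explicit_alias => 1,
      join           => {
         to        => 't1',
         type      => '',
         condition => 'using',
         columns   => ['id'],
         ansi      => 1,
      },
   },
   {
      name => 't3',
      join => { to => 't2', type => 'left', condition => 'on', where => [] },
   },
];

# Hypothetical helper: one line of text per joined table.
sub describe_joins {
   my ( $tbls ) = @_;
   my @lines;
   foreach my $tbl ( @$tbls ) {
      my $join = $tbl->{join} or next;   # the first table has no join
      my $type = $join->{type} ? uc( $join->{type} ) . ' JOIN' : 'JOIN';
      my $line = "$join->{to} $type $tbl->{name}";
      if ( ( $join->{condition} || '' ) eq 'using' ) {
         $line .= ' USING (' . join( ', ', @{ $join->{columns} } ) . ')';
      }
      push @lines, $line;
   }
   return \@lines;
}

my $lines = describe_joins($tables);
print "$_\n" for @$lines;
```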

parse_identifiers

Parse an arrayref of identifiers into their parts. Identifiers can be column names (optionally qualified), expressions, or constants. GROUP BY and ORDER BY specify a list of identifiers.

Parameters

$idents

Arrayref of identifiers

Returns

Arrayref of hashes with each identifier’s parts, depending on what kind of identifier it is.

split_unquote

Split and unquote a table name. The table name can be database-qualified or not, like `db`.`table`. The table name can be backtick-quoted or not.

Parameters

$db_tbl	Table name
$default_db	Default database name to return if $db_tbl is not database-qualified

Returns

Array: unquoted database (possibly undef), unquoted table
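
A minimal stand-alone sketch of the behavior described above follows. The name split_unquote_sketch is hypothetical; the module's own sub may be implemented and called differently.

```perl
use strict;
use warnings;

# Hypothetical stand-alone sketch; not the module's own sub.
sub split_unquote_sketch {
   my ( $db_tbl, $default_db ) = @_;
   $db_tbl =~ s/`//g;                          # drop backtick quoting
   my ( $db, $tbl ) = split /[.]/, $db_tbl;    # "db.tbl" or just "tbl"
   if ( !defined $tbl ) {
      # Not database-qualified: the only part is the table name.
      $tbl = $db;
      $db  = $default_db;
   }
   return ( $db, $tbl );
}

my @qualified   = split_unquote_sketch('`db`.`table`');
my @unqualified = split_unquote_sketch('tbl', 'test');
my @no_default  = split_unquote_sketch('tbl');
print join( '.', map { defined $_ ? $_ : 'undef' } @$_ ), "\n"
   for \@qualified, \@unqualified, \@no_default;
```

This sketch does not handle quoted names that themselves contain a dot or a backtick.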

is_identifier

Determine if something is a schema object identifier. E.g.: `tbl` is an identifier, but “tbl” is a string and 1 is a number. See http://dev.mysql.com/doc/refman/5.1/en/identifiers.html

Parameters

$thing

Name of something, including any quoting as it appears in a query.

Returns

True if $thing is an identifier, else false.
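
The distinction between identifiers, strings, and numbers can be sketched as below. This is an assumption-laden illustration, not the module's logic: the name looks_like_identifier and the regexes are hypothetical, and double quotes are always treated as string quotes (the ANSI_QUOTES SQL mode is ignored).

```perl
use strict;
use warnings;

# Hypothetical sketch; the module's own checks may differ.
sub looks_like_identifier {
   my ( $thing ) = @_;
   return 0 unless defined($thing) && length($thing);
   return 0 if $thing =~ m/^['"]/;                             # quoted string
   return 0 if $thing =~ m/^[+-]?\d*\.?\d+(?:e[+-]?\d+)?$/i;   # number
   # One to three parts (db.tbl.col), each backtick-quoted or a bare word.
   my $part = qr/(?:`[^`]+`|\w+)/;
   return $thing =~ m/^$part(?:\.$part){0,2}$/ ? 1 : 0;
}

for my $thing ( '`tbl`', '"tbl"', '1', 'db.tbl.col', 'NOW()' ) {
   printf "%-12s %s\n", $thing, looks_like_identifier($thing) ? 'identifier' : 'other';
}
```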