SQLParser

SQLParser parses common MySQL SQL statements into data structures.  This parser is MySQL-specific and intentionally handles only “common” cases.  Although there are many limitations (UNION and CASE, for example, are not handled), it parses many complex cases that no other free Perl SQL parser could parse at the time of writing, notably subqueries in all their places and varieties.

This package has not been profiled, and because it relies heavily on mildly complex regexes, do not expect amazing performance.

See SQLParser.t for examples of the various data structures.  There are many and they vary a lot depending on the statement parsed, so the documentation in this file is not exhaustive.

This package differs from QueryParser because here we parse the entire SQL statement (thus giving access to all its parts), whereas QueryParser extracts just needed parts (and ignores all the rest).

Summary
SQLParser - Parses common MySQL SQL statements into data structures.
Variables
$quoted_ident
$unquoted_ident
$ident_alias
$table_ident
$column_ident
Functions
new - Create a SQLParser object.
parse - Parse a SQL statement.
_parse_clauses - Parse raw text of clauses into data structures.
clean_query - Remove spaces, flatten, and normalize some patterns for easier parsing.
normalize_keyword_spaces - Normalize spaces around certain SQL keywords.
_parse_query - Called by the parse_TYPE subs except parse_insert.
parse_from - Parse a FROM clause, a.k.a. the table references.
parse_identifiers - Parse an arrayref of identifiers into their parts.
split_unquote - Split and unquote a table name.
is_identifier - Determine if something is a schema object identifier.

Variables

$quoted_ident

my $quoted_ident

$unquoted_ident

my $unquoted_ident

$ident_alias

my $ident_alias

$table_ident

my $table_ident

$column_ident

my $column_ident

Functions

new

sub new

Create a SQLParser object.

Parameters

%args - Arguments

Optional Arguments

Schema - Schema object, which encapsulates a data structure representing databases and tables.  Can be set later by calling <set_Schema()>.

Returns

SQLParser object

parse

sub parse

Parse a SQL statement.  Only statements of $allowed_types are parsed.  This sub recurses to parse subqueries.

Parameters

$query - SQL statement

Returns

A complex hashref of the parsed SQL statement.  All keys and almost all values are lowercase for consistency.  The struct is roughly:

{
  type       => '',     # one of $allowed_types
  clauses    => {},     # raw, unparsed text of clauses
  <clause>   => struct  # parsed clause struct, e.g. from => [<tables>]
  keywords   => {},     # LOW_PRIORITY, DISTINCT, SQL_CACHE, etc.
  functions  => {},     # MAX(), SUM(), NOW(), etc.
  select     => {},     # SELECT struct for INSERT/REPLACE ... SELECT
  subqueries => [],     # pointers to subquery structs
}

It varies, of course, depending on the query.  If something is missing it means the query doesn’t have that part.  E.g.  INSERT has an INTO clause but DELETE does not, and only DELETE and SELECT have FROM clauses.  Each clause struct is different; see their respective parse_CLAUSE subs.
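
As a rough illustration of how a caller might inspect such a struct, the sketch below builds one by hand and summarizes it.  The struct is written out for illustration and is not captured parser output; the example query, the values under each key, and the helper name describe_struct are assumptions.

```perl
use strict;
use warnings;

# Hand-written struct in the rough shape described above, for the
# hypothetical query "SELECT DISTINCT a FROM t WHERE b = 1".
# This is NOT captured output from SQLParser; real structs may differ.
my $struct = {
   type     => 'select',
   clauses  => {
      columns => 'a',
      from    => 't',
      where   => 'b = 1',
   },
   keywords => { distinct => 1 },
};

# Hypothetical helper: list the parts that a parsed query has.
# A missing key means the query does not have that part.
sub describe_struct {
   my ( $struct ) = @_;
   my @parts = ( "type=$struct->{type}" );
   foreach my $clause ( sort keys %{ $struct->{clauses} || {} } ) {
      push @parts, "clause=$clause";
   }
   push @parts, 'has subqueries' if @{ $struct->{subqueries} || [] };
   return join(', ', @parts);
}

print describe_struct($struct), "\n";
```

Because absent parts are simply missing keys, the helper defaults each lookup to an empty hashref or arrayref instead of assuming the key exists.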

_parse_clauses

sub _parse_clauses

Parse raw text of clauses into data structures.  This sub recurses to parse the clauses of subqueries.  The clauses are read from and their data structures saved into the $struct parameter.

Parameters

$struct - Hashref from which clauses are read (%{$struct->{clauses}}) and into which data structs are saved (e.g.  $struct->{from}=...).

clean_query

sub clean_query

Remove spaces, flatten, and normalize some patterns for easier parsing.

Parameters

$query - SQL statement

Returns

Cleaned $query
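
A minimal sketch of the whitespace part of this kind of cleaning, assuming only trimming and flattening.  The name clean_query_sketch is hypothetical, and the other pattern normalizations that clean_query performs are not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for clean_query(): only whitespace handling is
# shown; the other pattern normalizations are not reproduced.
sub clean_query_sketch {
   my ( $query ) = @_;
   return unless defined $query;
   $query =~ s/^\s+//;     # remove leading whitespace
   $query =~ s/\s+$//;     # remove trailing whitespace
   $query =~ s/\s+/ /g;    # flatten newlines, tabs and runs of spaces
   return $query;
}

my $cleaned = clean_query_sketch("  SELECT a,\n\tb\n  FROM   t  ");
print "$cleaned\n";
```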

normalize_keyword_spaces

sub normalize_keyword_spaces

Normalize spaces around certain SQL keywords.  Spaces are added and removed around certain SQL keywords to make parsing easier.

Parameters

$query - SQL statement

Returns

Normalized $query
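
The idea can be sketched as a pair of substitutions.  Which keywords are affected is an assumption made for this example (VALUES, ON, USING and a parenthesized SELECT), and normalize_keyword_spaces_sketch is a hypothetical name, not the sub documented here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for normalize_keyword_spaces().  The keyword
# list below is an assumption chosen for illustration.
sub normalize_keyword_spaces_sketch {
   my ( $query ) = @_;
   # Add a space between a keyword and a following open parenthesis.
   $query =~ s/\b(VALUES?|ON|USING)\(/$1 (/gi;
   # Remove spaces between an open parenthesis and SELECT.
   $query =~ s/\(\s+SELECT\s+/(SELECT /gi;
   return $query;
}

print normalize_keyword_spaces_sketch('INSERT INTO t VALUES(1)'), "\n";
print normalize_keyword_spaces_sketch('t1 JOIN t2 USING(id)'), "\n";
print normalize_keyword_spaces_sketch('a IN (  SELECT b FROM t)'), "\n";
```

With spacing made uniform, later parsing steps can match a keyword followed by whitespace without special-casing a missing or extra space.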

_parse_query

This sub is called by the parse_TYPE subs except parse_insert.  It does two things: it removes and saves the given keywords, all of which should appear at the beginning of the query; and it saves (but does not remove) the given clauses.  The query should start with the values for the first clause because the query’s first word was removed in parse().  So for “SELECT cols FROM ...”, the query given here is “cols FROM ...” where “cols” belongs to the first clause “columns”.  Then the query is walked clause-by-clause, saving each.

Parameters

$query - SQL statement with first word (SELECT, INSERT, etc.) removed
$keywords - Compiled regex of keywords that can appear in $query
$first_clause - First clause word to expect in $query
$clauses - Compiled regex of clause words that can appear in $query

Returns

Hashref with raw text of clauses
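
The walk described above can be sketched as follows.  This is a simplified stand-in under stated assumptions, not the sub itself: parse_query_sketch is a hypothetical name, the keyword and clause regexes are examples, clause keys are lowercased with spaces turned into underscores by assumption, and clause words inside string literals are not protected.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _parse_query(), written as a plain function.
sub parse_query_sketch {
   my ( $query, $keywords, $first_clause, $clauses ) = @_;
   my %struct;

   # Remove and save keywords at the beginning of the query.
   while ( $query =~ s/^\s*($keywords)\s+//i ) {
      $struct{keywords}->{lc $1} = 1;
   }

   # Split on clause words; the capture group keeps the words themselves.
   my @parts = split /\s+($clauses)\s+/i, $query;

   # The first value belongs to the first clause, whose word was
   # already removed; the rest are (clause word, value) pairs.
   $struct{clauses}->{$first_clause} = shift @parts;
   while ( @parts ) {
      ( my $clause = lc shift @parts ) =~ s/ /_/g;
      $struct{clauses}->{$clause} = shift @parts;
   }
   return \%struct;
}

# Example regexes; the real lists are assumptions here.
my $keywords = qr/DISTINCT|SQL_CACHE/i;
my $clauses  = qr/FROM|WHERE|GROUP BY|ORDER BY|LIMIT/i;
my $struct   = parse_query_sketch(
   'DISTINCT a, b FROM t WHERE a = 1 ORDER BY b',
   $keywords, 'columns', $clauses,
);
foreach my $clause ( sort keys %{ $struct->{clauses} } ) {
   print "$clause => $struct->{clauses}->{$clause}\n";
}
```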

parse_from

Parse a FROM clause, a.k.a. the table references.  Does not handle nested joins.  See http://dev.mysql.com/doc/refman/5.1/en/join.html

Parameters

$from - FROM clause (with the word “FROM”)

Returns

Arrayref of hashrefs, one hashref for each table in the order that the tables appear, like:

{
  name           => 't2',  # table's real name
  alias          => 'b',   # table's alias, if any
  explicit_alias => 1,     # if explicitly aliased with AS
  join  => {               # if joined to another table, all but first
                           # table are because comma implies INNER JOIN
    to        => 't1',     # table name on left side of join, if this is
                           # LEFT JOIN then this is the inner table, if
                           # RIGHT JOIN then this is outer table
    type      => '',       # left, right, inner, outer, cross, natural
    condition => 'using',  # on or using, if applicable
    columns   => ['id'],   # columns for USING condition, if applicable
    ansi      => 1,        # true if ANSI JOIN, i.e. true if not implicit
                           # INNER JOIN due to following a comma
  },
},
{
  name => 't3',
  join => {
    to        => 't2',
    type      => 'left',
    condition => 'on',     # an ON condition is like a WHERE clause so
    where     => [...]     # this arrayref of predicates appears, see
                           # <parse_where()> for its structure
  },
},
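
A short sketch of walking such an arrayref.  The table structs are written by hand after the example above, with a hypothetical first table t1 added and the join type of t2 filled in as 'inner' for illustration; describe_tables is a hypothetical helper, and nothing here is captured parser output.

```perl
use strict;
use warnings;

# Hand-written structs modeled on the example above.  The first table
# 't1' and the 'inner' join type for 't2' are assumptions.
my $tables = [
   { name => 't1', alias => 'a' },
   {
      name           => 't2',
      alias          => 'b',
      explicit_alias => 1,
      join           => {
         to        => 't1',
         type      => 'inner',
         condition => 'using',
         columns   => ['id'],
         ansi      => 1,
      },
   },
   {
      name => 't3',
      join => { to => 't2', type => 'left', condition => 'on' },
   },
];

# Hypothetical helper: one line of text per table reference.
sub describe_tables {
   my ( $tables ) = @_;
   my @lines;
   foreach my $tbl ( @$tables ) {
      my $line = $tbl->{name};
      $line .= " (alias $tbl->{alias})" if $tbl->{alias};
      if ( my $join = $tbl->{join} ) {
         $line .= " $join->{type} join to $join->{to}";
         if ( ($join->{condition} || '') eq 'using' ) {
            $line .= ' using ' . join(',', @{ $join->{columns} });
         }
      }
      push @lines, $line;
   }
   return @lines;
}

print "$_\n" for describe_tables($tables);
```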

parse_identifiers

Parse an arrayref of identifiers into their parts.  Identifiers can be column names (optionally qualified), expressions, or constants.  GROUP BY and ORDER BY specify a list of identifiers.

Parameters

$idents - Arrayref of identifiers

Returns

Arrayref of hashes with each identifier’s parts, depending on what kind of identifier it is.

split_unquote

Split and unquote a table name.  The table name can be database-qualified or not, like `db`.`table`.  The table name can be backtick-quoted or not.

Parameters

$db_tbl - Table name
$default_db - Default database name to return if $db_tbl is not database-qualified

Returns

Array: unquoted database (possibly undef), unquoted table
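
A minimal sketch of the behavior described above, written as a plain function instead of a method.  The name split_unquote_sketch is hypothetical, and the approach is naive: a quoted name that itself contains a dot would be split incorrectly.

```perl
use strict;
use warnings;

# Hypothetical stand-in for split_unquote(), written as a plain function.
# Naive: a quoted name that itself contains a dot is split incorrectly.
sub split_unquote_sketch {
   my ( $db_tbl, $default_db ) = @_;
   $db_tbl =~ s/`//g;                            # remove backtick quoting
   my ( $db, $tbl ) = split /[.]/, $db_tbl, 2;   # db.tbl or just tbl
   if ( !defined $tbl ) {
      # Not database-qualified: the only part is the table name.
      $tbl = $db;
      $db  = $default_db;
   }
   return ( $db, $tbl );
}

foreach my $args ( [ '`db`.`table`' ], [ 'tbl', 'test' ], [ 'tbl' ] ) {
   my ( $db, $tbl ) = split_unquote_sketch(@$args);
   printf "%s => db=%s tbl=%s\n",
      $args->[0], defined $db ? $db : 'undef', $tbl;
}
```

The third call shows the “possibly undef” case: no database qualifier and no default database.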

is_identifier

Determine if something is a schema object identifier.  E.g.: `tbl` is an identifier, but “tbl” is a string and 1 is a number.  See http://dev.mysql.com/doc/refman/5.1/en/identifiers.html

Parameters

$thing - Name of something, including any quoting as it appears in a query.

Returns

True if $thing is an identifier, else false.
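
The distinction can be sketched with a few regex checks.  This is a simplified illustration, not the sub itself: is_identifier_sketch is a hypothetical name, and the identifier pattern below (up to three dot-separated parts, each backtick-quoted or made of word characters) is an assumption that ignores functions, variables and keywords.

```perl
use strict;
use warnings;

# Hypothetical stand-in for is_identifier(), written as a plain function.
sub is_identifier_sketch {
   my ( $thing ) = @_;
   return 0 unless defined $thing && length $thing;

   # Quoted with ' or ": a string, not an identifier.
   return 0 if $thing =~ m/^\s*['"]/;

   # Integer or float: a number.
   return 0 if $thing =~ m/^\s*-?\d+(?:\.\d+)?\s*$/;

   # One to three parts (db.tbl.col), each backtick-quoted or unquoted.
   my $part = qr/(?:`[^`]+`|\w+)/;
   return 1 if $thing =~ m/^\s*$part(?:\.$part){0,2}\s*$/;

   return 0;
}

foreach my $thing ( '`tbl`', '"tbl"', '1', 'db.tbl' ) {
   printf "%-8s %s\n", $thing,
      is_identifier_sketch($thing) ? 'identifier' : 'not an identifier';
}
```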
