Skip to content

Introspect if table is FTS4 or FTS5 #196

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
simonw opened this issue Nov 5, 2020 · 19 comments
Closed

Introspect if table is FTS4 or FTS5 #196

simonw opened this issue Nov 5, 2020 · 19 comments
Labels
enhancement New feature or request

Comments

@simonw
Copy link
Owner

simonw commented Nov 5, 2020

I want .search() to work against both FTS5 and FTS4 tables - but sort by rank should only work for FTS5.

This means I need to be able to introspect and tell if a table is FTS4 or FTS5.

Originally posted by @simonw in #192 (comment)

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

This is surprisingly difficult. I need to parse the CREATE VIRTUAL TABLE statement, which will look something like this:

CREATE VIRTUAL TABLE "global-power-plants_fts" USING FTS5 ("name", content="global-power-plants")

The problem is I need to be able to handle various different quoting formats for the table name (mytable v.s. "mytable" v.s. [mytable]) plus I need to look out for CREATE TABLE IF NOT EXISTS.

@simonw simonw added the enhancement New feature or request label Nov 5, 2020
@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

This is blocking landing .search() in #195

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

I could use a parsing library like https://parsy.readthedocs.io/en/latest/tutorial.html for this - or pyparsing which has a SQLite example here: https://github.com/pyparsing/pyparsing/blob/master/examples/select_parser.py

I'd rather not add a new dependency for this though so I'm going to see if I can get something that's good-enough just using a regular expression.

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

https://sqlite.org/lang_keywords.html says:

There are four ways of quoting keywords in SQLite:

'keyword' A keyword in single quotes is a string literal.
"keyword" A keyword in double-quotes is an identifier.
[keyword] A keyword enclosed in square brackets is an identifier. This is not standard SQL. This quoting mechanism is used by MS Access and SQL Server and is included in SQLite for compatibility.
`keyword` A keyword enclosed in grave accents (ASCII code 96) is an identifier. This is not standard SQL. This quoting mechanism is used by MySQL and is included in SQLite for compatibility.

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

Then https://sqlite.org/lang_expr.html#literal_values_constants_ says:

A string constant is formed by enclosing the string in single quotes ('). A single quote within the string can be encoded by putting two single quotes in a row - as in Pascal. C-style escapes using the backslash character are not supported because they are not standard SQL.

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

I confirmed all three of these are valid syntax for creating tables:

~ % sqlite3 tmp.db
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
sqlite> create table 'foo''and' (id int);
sqlite> create table "bar""and" (id int);
sqlite> create table [baz] (id int);
sqlite> create table `bant` (id int);
sqlite> .schema
CREATE TABLE IF NOT EXISTS 'foo''and' (id int);
CREATE TABLE IF NOT EXISTS "bar""and" (id int);
CREATE TABLE [baz] (id int);
CREATE TABLE `bant` (id int);
sqlite> select * from sqlite_master;
table|foo'and|foo'and|2|CREATE TABLE 'foo''and' (id int)
table|bar"and|bar"and|3|CREATE TABLE "bar""and" (id int)
table|baz|baz|4|CREATE TABLE [baz] (id int)
table|bant|bant|5|CREATE TABLE `bant` (id int)

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

Good news: I don't think I have to deal with foo.tablename, because that doesn't get reflected in the sqlite_master table:

sqlite> attach 'foo.db' as foo;
sqlite> create table foo.`bant` (id int);
sqlite> select * from foo.sqlite_master;
table|bant|bant|2|CREATE TABLE `bant` (id int)

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

In [8]: r = re.compile(r"""'[^']*(?:''[^']*)*'""")
In [9]: r.match("'fo'o'")
Out[9]: <re.Match object; span=(0, 4), match="'fo'">
In [10]: r.match("'fo''o'")
Out[10]: <re.Match object; span=(0, 7), match="'fo''o'">

'[^']*(?:''[^']*)*'

This matches a single quote, then 0+ not-single-quotes, then 0+ (either 0+ not-single quotes or a double single quote), then a single quote.

Unrolling the loop technique described here: http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

I'm going to try re.VERBOSE to see if I can make this readable with comments. https://docs.python.org/3/howto/regex.html

charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

I think this might be it:

create_virtual_table_re = re.compile(r"""
\s*CREATE\s+VIRTUAL\s+TABLE\s+ # CREATE VIRTUAL TABLE
(
   '(?P<squoted_table>[^']*(?:''[^']*)*)' | # single quoted name
   "(?P<dquoted_table>[^"]*(?:""[^"]*)*)" | # double quoted name
   `(?P<backtick_table>[^`]+)` |            # `backtick` quoted name
   \[(?P<squarequoted_table>[^\]]+)\]       # [...] quoted name
)
\s+(IF\s+NOT\s+EXISTS\s+)?     # IF NOT EXISTS (optional)
USING\s+(?P<using>\w+)
""", re.VERBOSE | re.IGNORECASE)

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

Next step: lots of unit tests.

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

I'm missing the case where a table has no quotes around it at all - create virtual table foo using fts5

So I need to know how to create a regex for a SQLite identifier. https://www.sqlite.org/draft/tokenreq.html seems to be the only available documentation for that.

Identifier tokens

Identifiers follow the usual rules with the exception that SQLite allows the dollar-sign symbol in the interior of an identifier. The dollar-sign is for compatibility with Microsoft SQL-Server and is not part of the SQL standard.

H41130: SQLite shall recognize as an ID token any sequence of characters that begins with an ALPHABETIC character and continue with zero or more ALPHANUMERIC characters and/or "$" (u0024) characters and which is not a keyword token.

Identifiers can be arbitrary character strings within square brackets. This feature is also for compatibility with Microsoft SQL-Server and not a part of the SQL standard.

H41140: SQLite shall recognize as an ID token any sequence of non-zero characters that begins with "[" (u005b) and continuing through the first "]" (u005d) character.

The standard way of quoting SQL identifiers is to use double-quotes.

H41150: SQLite shall recognize as an ID token any sequence of characters that begins with a double-quote (u0022), is followed by zero or more non-zero characters and/or pairs of double-quotes (u0022) and terminates with a double-quote (u0022) that is not part of a pair.

MySQL allows identifiers to be quoted using the grave accent character. SQLite supports this for interoperability.

H41160: SQLite shall recognize as an ID token any sequence of characters that begins with a grave accent (u0060), is followed by zero or more non-zero characters and/or pairs ofgrave accents (u0060) and terminates with a grave accent (u0022) that is not part of a pair.

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

More from that document, describing ALPHANUMERIC:

ALPHABETIC

Any of the characters in the range u0041 through u005a (letters "A" through "Z") or in the range u0061 through u007a (letters "a" through "z") or the character u005f ("_") or any other character larger than u007f.

NUMERIC

Any of the characters in the range u0030 through u0039 (digits "0" through "9")

ALPHANUMERIC

Any character which is either ALPHABETIC or NUMERIC

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

"any other character larger than u007f." Need to figure that out!

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

I think that's \u007F-\uFFFF in regex range speak.

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

So...

ALPHABETIC: [\u0041-\u005a\u0061-\u0071\u007f-\uffff\u005f]

NUMERIC: [\u0030-\u0039]

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

And an identifier is "ALPHABETIC character and continue with zero or more ALPHANUMERIC characters and/or "$" (u0024) characters"

So...

[\u0041-\u005a\u0061-\u0071\u007f-\uffff\u005f][\u0041-\u005a\u0061-\u0071\u007f-\uffff\u005f\u0030-\u0039\u0024]+

@simonw
Copy link
Owner Author

simonw commented Nov 5, 2020

The finished monster:

_virtual_table_using_re = re.compile(r"""
^ # Start of string
\s*CREATE\s+VIRTUAL\s+TABLE\s+ # CREATE VIRTUAL TABLE
(
    '(?P<squoted_table>[^']*(?:''[^']*)*)' | # single quoted name
    "(?P<dquoted_table>[^"]*(?:""[^"]*)*)" | # double quoted name
    `(?P<backtick_table>[^`]+)`            | # `backtick` quoted name
    \[(?P<squarequoted_table>[^\]]+)\]     | # [...] quoted name
    (?P<identifier>                          # SQLite non-quoted identifier
        [A-Za-z_\u0080-\uffff]  # \u0080-\uffff = "any character larger than u007f"
        [A-Za-z_\u0080-\uffff0-9\$]* # zero-or-more alphanemuric or $
    )
)
\s+(IF\s+NOT\s+EXISTS\s+)?      # IF NOT EXISTS (optional)
USING\s+(?P<using>\w+)          # e.g. USING FTS5
""", re.VERBOSE | re.IGNORECASE)

@simonw simonw closed this as completed in 59d8689 Nov 5, 2020
simonw added a commit that referenced this issue Nov 7, 2020
simonw added a commit that referenced this issue Nov 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant