Skip to content

Comments

Introduce new schema introspection API#7153

Merged
morozov merged 5 commits intodoctrine:4.4.xfrom
morozov:schema-introspection
Sep 23, 2025
Merged

Introduce new schema introspection API#7153
morozov merged 5 commits intodoctrine:4.4.xfrom
morozov:schema-introspection

Conversation

@morozov
Copy link
Member

@morozov morozov commented Sep 16, 2025

This is a logical continuation of the effort started in #5268 – it moved some of the introspection SQL statements from platform classes to schema managers. This PR consolidates the logic of building the queries for schema introspection and processing their results, thus providing a cleaner and more robust API.

Problems with the current API

Wrong abstractions

The current API is better than before #5268, but it's still designed around wrong abstractions and thus declares very non-obvious contracts. It makes implementations brittle and hard to statically analyze:

  1. There are AbstractSchemaManager::select*() methods like the following:
    /**
    * Selects definitions of table columns in the specified database. If the table name is specified, narrows down
    * the selection to this table.
    *
    * @throws Exception
    */
    abstract protected function selectTableColumns(string $databaseName, ?string $tableName = null): Result;
    The contract of this interface implies that each implementation will produce an SQL result with a certain set of columns, but the exact shape of the result isn't described anywhere.
  2. As a counterpart to most of such methods, there is an AbstractSchemaManager::_getPortable*Definition() which accepts an associative array representing a row of the result produced by the first method and is supposed to process it and return a schema object. For example:
    /**
    * Gets Table Column Definition.
    *
    * @param array<string, mixed> $tableColumn
    *
    * @throws TypesException
    */
    abstract protected function _getPortableTableColumnDefinition(array $tableColumn): Column;

It would be okay-ish, if the schema manager implementation for each database platform implemented its "select" and "get portable definition" methods, which would assume a certain shape of row per platform. But in reality, there are shared implementations of the "get portable definition" method. For example:

protected function _getPortableTableIndexesList(array $rows, string $tableName): array
{
$result = [];
foreach ($rows as $row) {
$indexName = $keyName = $row['key_name'];
if ($row['primary']) {
$keyName = 'primary';
}

This way, the shared implementation enforces the shape of the result on all implementations. As a result, implementations have to implement some "translation" of their "native" shape of the result into the shape expected by the shared implementation. For example:

SELECT
scm.name AS schema_name,
tbl.name AS table_name,
idx.name AS key_name,
col.name AS column_name,
~idx.is_unique AS non_unique,
idx.is_primary_key AS [primary],
CASE idx.type
WHEN '1' THEN 'clustered'
WHEN '2' THEN 'nonclustered'
ELSE NULL
END AS flags

We are lucky to have built a great integration test suite, but the code itself is very hard to reason about.

array_change_key_case($row, CASE_LOWER) all over the place

The reason for that is that the connection used by AbstractSchemaManager may use a Portability\Middleware. In this case, there is no guarantee that the array keys of fetched rows will be in the same case as in the query (even if the corresponding column names are quoted), so these calls to array_change_key_case() compensate for that.

These calls aren't used consistently across the codebase (e.g. in MySQLSchemaManager). As a result, there may be bugs reproducible in some portability modes.

Proposed design

classDiagram
    class AbstractSchemaManager
    AbstractSchemaManager : +introspectTables()

    interface SchemaProvider
    SchemaProvider : +getAllTables()
    SchemaProvider <-- AbstractSchemaManager

    SchemaProvider <|-- IntrospectingSchemaProvider

    class IntrospectingSchemaProvider
    IntrospectingSchemaProvider : +getAllTables()
    IntrospectingSchemaProvider : -getColumnsForAllTables()
    IntrospectingSchemaProvider : -getIndexesForAllTables()

    interface MetadataProvider
    MetadataProvider: +getTableColumnsForAllTables()
    MetadataProvider: +getIndexColumnsForAllTables()
    MetadataProvider <-- IntrospectingSchemaProvider

    class MySQLMetadataProvider
    MetadataProvider <|-- MySQLMetadataProvider

    class PostgreSQLMetadataProvider
    MetadataProvider <|-- PostgreSQLMetadataProvider

    class IndexColumnMetadataProcessor
    IndexColumnMetadataProcessor <-- IntrospectingSchemaProvider

    class ForeignKeyConstraintColumnMetadataProcessor
    ForeignKeyConstraintColumnMetadataProcessor <-- IntrospectingSchemaProvider
Loading

The AbstractSchemaManager class

The schema manager remains the user-facing API for schema introspection but no longer contains any implementation. In addition to that, its only responsibility is to translate table names from the user-input format (e.g. OptionallyQualifiedName or a unquoted/quoted pair of schema/table name) into the actual schema/table name for the underlying schema provider.

This transformation includes:

  1. Folding of the identifiers according to the semantics of the underlying platform (e.g. unquoted identifiers are upper-cased on Oracle and Db2).
  2. Applying the default schema name (e.g. userspublic.users on Postgres).

Rejected ideas

Originally I thought that users could interact with the schema provider directly. Similar to Connection::createSchemaManager() they could use Connection::createSchemaProvider(). This would provide a nice CQRS-style separation of APIs. However, there is the AbstractSchemaManager::migrateSchema() which internally both introspects the schema and applies the changes, so schema introspection cannot be completely moved out of AbstractSchemaManager.

We could sacrifice AbstractSchemaManager::migrateSchema() (which is just a convenience methods) and separate the APIs cleanly, but I don't know what other side effects it would have.

The SchemaProvider interface

The schema provider provides access to the elements of the underlying database schema in the form of schema objects (e.g. Column) or their collections. With this approach, the way in which a given implementation obtains the information about the underlying schema become an implementation detail.

The IntrospectingSchemaProvider class

This is the only out-of-the box SchemaProvider implementation. It is responsible for providing the schema by introspecting the underlying database. Internally, it interacts with a MetadataProvider. It requests metadata from the metadata provider and builds schema objects from them.

Additionally, it's responsible for unifying the shape of the schema introspection results between the platforms that do and do not support schemas. If the underlying platform supports schemas, and the schema name of an object matches the current schema, the schema name will be omitted to match the behavior of the platforms that do not support schemas. For example, on PostgresSQL, a table public.users will be introspected as users.

The MetadataProvider interface

MetadataProvider is a lower-level interface compared to SchemaProvider. It provides access to the metadata objects, which are the analogs of the array<string, mixed> arrays used by current schema introspection API. Unlike them, they are well-defined and strictly typed.

Each AbstractPlatform implementation is now responsible for instantiating a metadata provider:

public function createMetadataProvider(Connection $connection): MetadataProvider
{
throw NotSupported::new(__METHOD__);
}

With the new API, most of the time, the SELECT statement and the usage of the selected columns in the PHP code are located in the same method. For example:

$sql = <<<'SQL'
SELECT SCHEMA_NAME
FROM information_schema.SCHEMATA
ORDER BY SCHEMA_NAME
SQL;
foreach ($this->connection->iterateColumn($sql) as $databaseName) {

This allows fetching the result using iterateNumeric() or even iterateColumn() and eliminates the need for array_change_key_case().

Access to the objects scoped to a table (e.g. table columns) is usually provided in the form of two methods:

public function getTableColumnsForAllTables(): iterable;
public function getTableColumnsForTable(?string $schemaName, string $tableName): iterable;

This way, depending on whether the consumer needs to introspect the schema of a single table or of all tables in the database, the corresponding method is invoked. Under the hood, these methods usually share an implementation.

MetadataProvider methods return collections in the form of iterable<T> (usually, a Generator<T>), not array<T>. This is not because we want to optimize memory usage. This is because it's more convenient to iterate the query result usingforeach and yeild rows w/o buffering them in an array. The metadata elements aren't supposed to be accessed by index – only iterated and processed one by one. iterable<T> works just fine for that.

Metadata processors

Metadata processors generalize the logic of converting one or many metadata rows into one or more corresponding schema object (e.g. the metadata about foreign key constraint columns into a list of foreign key constraints). This logic is agnostic of the database platform, so there is one implementation per object type (e.g. ForeignKeyConstraintColumnMetadataProcessor).

The exception is table columns. Unlike other metadata row types which contain scalar values, TableColumnMetadataRow contains a full Column object. The reason is that most of the properties the column carries belong to its data type, not the column itself. Therefore, it is more practical to interpret type-specific properties during introspection within a metadata provider and instantiate a complete Column object right away instead defining an abstraction to represent them.

What else has improved?

  1. All schema introspection queries now consistently use prepared statements.
  2. System tables and schemas that don't need to be exposed to the user are consistently handled in each platform's metadata provider.
  3. The reserved column names for schema introspection queries (AbstractSchemaManager::SCHEMA_NAME_COLUMN and AbstractSchemaManager::TABLE_NAME_COLUMN) will be gone.
  4. The collections of objects returned by the new schema manager methods are list<T>, not array<lower-case-name, T>. This will allow to eliminate the conflicts between objects with the names that differ only in case. This makes the API more consistent with Deprecate using keys of table-scoped object collections as names #7102 and other similar changes.

Known limitations

Unlike the current API, the new one doesn't allow introspection of databases other than the current one. The reasons are:

  1. This is not portable. The definition of a "database" varies significanly per supported platform. The PostgreSQL, SQL Server and SQLite schema managers currently just ignore the $databaseName parameter.
  2. This is not tested.
  3. Schema modification is supported only withing the current database on all platforms.

Deprecations and upgrade path

  1. Methods named as list<Object>s() are being deprecated in favor of the corresponding introspect<Object>s(). The methods that used to return a list of strings as names (e.g. listDatabases()) now return a list of quoted names.
  2. Methods named as getTable<Something>() that accept the table name as a god-knows-how formatted string, are being deprecated in favor of a set of introspectTable<Something>(By(Unqualified|Qualified)Name)?(). This way, the user can be explicit about the semantics of the string or can pass the name as object.

The exception is introspectTable(string $name). There's no way to provide an upgrade path in the form of introspectTable(OptionallyQualified $name) because this name is already taken. I expect that most users will use the By(Unqualified|Qualified)Name versions of this method. The signature will be modified in 5.0.x.

Modified tests

Most of the calls to the deprecated APIs in the tests have been replaced with the use of the recommended ones. There are a few exceptions where a given deprecated method is still used in some tests:

  1. listTableColumns() is used by a test that covers the deprecated jsonb column option.
  2. listTableIndexes() is used by the tests that expect the PK constraint to be returned as well (also deprecated).
  3. introspectTable() is used by the tests that:
    1. Use a variable OptionallyQualifiedName table name. Using one of the recommended methods would require breaking it down into the schema name and unqualified name. This won't be a problem in 5.0.x.
    2. Perform assertions using Index::getColumns(), ForeignKeyConstraint::getLocalColumns() and other similar methods. With the new API, the methods return names surrounded by quotes, which breaks the test. They all are deprecated.
    3. Break for other reasons because the names are introspected as quoted (all failures would be expected, but most of them will be compensated by other changes in 5.0.x, so the end result won't change).

Remaining known issues

Unlike announced in #7030 (comment), the migration to the new API alone won't address the problem of introspecting object names containing special characters (e.g. dots). The reason is that, even though the names will be introspected as quoted, AbstractAsset::_setName() will ignore that and parse the dot as a special character:

if (str_contains($name, '.')) {
$parts = explode('.', $name);

This issue should be already addressed in 5.0.x for most if not all the object types.

Future plans

  1. Get rid of @internal platform-specific introspection-specific methods in platform classes:

    public function fetchTableOptionsByTable(bool $includeTableName): string
    public function getDefaultColumnValueSQLSnippet(): string

    The query used to introspect the schema should be part of the corresponding metadata provider implementation. In general, supporting platform versions by extending classes is not the best approach (I can elaborate on that separately).

  2. Introduce an interface for getDoctrineTypeMapping(). This way, combined with the previous step, we can eliminate the dependency of MetadataProvider implementations on AbstractPlatform.

  3. Introduce a better schema filtering API. The current API is error-prone: it filters tables by a vaguely formatted string (<schema-name>.<table-name>), so if the input contains a dot, it cannot tell whether the dot is a separator or part of the schema or table name. Additionally, filtering sequences probably never worked (the entire Sequence object is passed to the filter instead of the name).

@morozov morozov force-pushed the schema-introspection branch from d18986a to 75be0ff Compare September 19, 2025 01:36
@morozov morozov force-pushed the schema-introspection branch from 75be0ff to 68038b8 Compare September 20, 2025 00:51
@morozov morozov marked this pull request as ready for review September 20, 2025 01:10
greg0ire
greg0ire previously approved these changes Sep 22, 2025
Copy link
Member

@greg0ire greg0ire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what's most impressive between the patch or its description… great job!

* Unlike other metadata row types, this one contains a full Column object. The reason is that most of the properties
* the column carries belong to its data type, not the column itself. Therefore, it is more practical to interpret
* type-specific properties during introspection within a metadata provider and instantiate a complete Column object
* right away instead defining an abstraction to represent them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* right away instead defining an abstraction to represent them.
* right away instead of defining an abstraction to represent them.

## Deprecated `AbstractSchemaManager` methods

The following `AbstractSchemaManager` methods have been deprecated:
1. `listDatabases()` - use `introspectDatabaseNames()` instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For methods that appear in the docs, like in this example:

listDatabases()
---------------
Retrieve an array of databases on the configured connection:
.. code-block:: php
<?php
$databases = $sm->listDatabases();

should the corresponding docs be removed, or should a note be added about them being deprecated in favor of another?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can replace them with the recommended ones. I recalled that I need to update the docs like 5 times but still forgot 😓

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of documentation, I learned from a person who professionally writes documentation that wrapping lines in RST/AsciiDoc representing technical documentation is an anti-pattern.

The argument is that, let's say the max line width is 120 characters and you have a paragraph consisting of 1000 characters, so it spans 9 lines. If you add a couple of words in the beginning of the paragraph, it will shift all following words and will require reformatting the entire paragraph and all 9 lines making the diff unreadable.

@greg0ire, @SenseException what do you think about not wrapping lines? On the one hand, it will add quite some horizontal scrolling, but on the other hand, the markdown will not require manual formatting during editing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say the diff only becomes unreadable if it's a line diff, but at least with GitHub, this will result in a big diff with a smaller diff based on word diff inside. Also, if I open the source for technical docs on a wide display, I don't want to have to read a 1000 character long line. If we are that concerned about reformatting, we should IMO consider https://sembr.org/ instead. But I don't think it is that much of a concern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, not a concern. I'm fine with what we have right now.

@morozov morozov force-pushed the schema-introspection branch from d6dc2b0 to 11873bf Compare September 22, 2025 15:56
Comment on lines -21 to -26
Parameters containing identifiers passed to the SchemaManager
methods are *NOT* quoted automatically! Identifier quoting is
really difficult to do manually in a consistent way across
different databases. You have to manually quote the identifiers
when you accept data from user or other sources not under your
control.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer the case 😎

Copy link
Member Author

@morozov morozov Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I realize we need a separate article (and maybe an additional blog post) that explains the meaning of quoted/unquoted identifiers (i.e. that both are safe to use), the API design around them, etc. But I'll postpone this until the hackathon.

---------------

Retrieve an array of databases on the configured connection:
Retrieve a list of the names of available database:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Retrieve a list of the names of available database:
Retrieve a list of the names of available databases:

greg0ire
greg0ire previously approved these changes Sep 23, 2025
Copy link
Member

@greg0ire greg0ire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after fixing the typo

@morozov morozov force-pushed the schema-introspection branch from 11873bf to 75f2cb1 Compare September 23, 2025 12:50
@morozov morozov merged commit 1eccb57 into doctrine:4.4.x Sep 23, 2025
97 checks passed
@morozov morozov deleted the schema-introspection branch September 23, 2025 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants