refactor(language): proposal to remove all usage of schema to refer to hierarchy #8477

@gforsyth

Schema as a word in databases is fraught with dual meanings and is
generally horrible. We do not use it consistently in Ibis: sometimes it
even means different things in the same method on different backends.

I propose a complete elimination of the hierarchical "schema" that Postgres and others have saddled us with.

Glossary

For the purposes of this issue and also moving forward in Ibis:

  • database: a collection of tables
  • catalog: a collection of databases
  • schema: a mapping of column names to dtypes and NOTHING ELSE

The database kwarg in Ibis

Post refactor, database can always be one of:

  • string of "database"
  • dotted string of "catalog.database" (although this might error if you
    pass a "catalog.database" to a backend that only has one level of
    hierarchy)
  • tuple of ("catalog", "database")
  • ibis namespace object
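To make the accepted forms concrete, here is a minimal sketch of how a backend might normalize the database argument into a (catalog, database) pair. The helper name normalize_database and the supports_catalog flag are hypothetical, not existing Ibis API, and the ibis namespace object case is left out because its shape is not specified in this issue.

```python
from typing import Optional, Tuple, Union


def normalize_database(
    database: Union[str, Tuple[str, str], None],
    *,
    supports_catalog: bool = True,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (catalog, database) for any accepted spelling.

    Hypothetical helper for illustration only; not part of Ibis.
    """
    if database is None:
        # Nothing specified: the backend falls back to its current defaults.
        return None, None

    if isinstance(database, tuple):
        if len(database) != 2:
            raise ValueError(f"expected ('catalog', 'database'), got {database!r}")
        catalog, db = database
    elif isinstance(database, str):
        parts = database.split(".")
        if len(parts) == 1:
            # Plain "database": no catalog given.
            catalog, db = None, parts[0]
        elif len(parts) == 2:
            # Dotted "catalog.database".
            catalog, db = parts
        else:
            raise ValueError(f"too many levels of hierarchy in {database!r}")
    else:
        raise TypeError(f"unsupported database argument: {database!r}")

    if catalog is not None and not supports_catalog:
        # Backends with a single level of hierarchy cannot honor a catalog.
        raise ValueError("this backend has only one level of hierarchy")
    return catalog, db
```

Under these assumptions, "db" yields (None, "db"), while both "cat.db" and ("cat", "db") yield ("cat", "db").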

Places where we currently use hierarchical schema and removal proposal

Backend.list_tables

Current: list_tables that only takes schema as kwarg

  • mysql
  • postgres
  • oracle

Future:

These 3 are relatively easy: we can deprecate schema (with a warning)
and add database.

Current: list_tables that takes both database and schema as kwargs

  • trino
  • bigquery
  • duckdb
  • snowflake
  • mssql

Current behavior:

    • user only passes schema, we assume current database
    • user only passes database, we assume current schema (duckdb,
      but this seems like a bug and no one would do this)
    • user only passes database, we assume information_schema as
      schema (mssql)
    • user only passes database, we error (trino, bigquery, snowflake)

Future:

  • if user only passes deprecated schema, warn, treat as new database

  • if user only passes database, treat as new database. This is
    technically a hard break that we can't warn users about (in code).
    It would only impact users of the mssql backend (if anyone) and
    maybe DuckDB (but I doubt it).

  • if user passes both old database and old schema, warn, treat as
    "catalog.database"
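The three rules above can be sketched as a small resolution function. This is an illustration under stated assumptions, not the actual Ibis implementation: the name resolve_namespace is hypothetical, and FutureWarning is just one reasonable choice of warning category (Python shows it to end users by default, unlike DeprecationWarning).

```python
import warnings
from typing import Optional, Tuple


def resolve_namespace(
    database: Optional[str] = None,
    schema: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Map old-style (database, schema) kwargs to new (catalog, database).

    Hypothetical helper for illustration only; not part of Ibis.
    """
    if schema is None:
        # Only `database` (or nothing) was passed: use it with its new
        # meaning. No warning is possible here, hence the "hard break".
        return None, database

    warnings.warn(
        "`schema` is deprecated; pass the value as `database` instead",
        FutureWarning,
        stacklevel=2,
    )
    if database is None:
        # Only the deprecated `schema`: treat it as the new database.
        return None, schema

    # Both were passed: the old database becomes the catalog and the
    # old schema becomes the database, i.e. "catalog.database".
    return database, schema
```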

Backend.table

Current: Backend.table that takes schema and database

  • Default SQL backend behavior
  • bigquery currently errors if database only

Future:

Deprecate schema, warn if schema is passed, make kw-only

In ibis 8, we were creating an ops.Namespace, passing that to
_get_sqla_table, and only using the schema attribute of
ops.Namespace, so I don't think anyone was using only database in a
functional way, even if it wasn't explicitly erroring there.

(sqlalchemy only accepts a schema kwarg when defining a
sqlalchemy table; this is one of the reasons for the current mess.)

While this is a breaking change, I don't believe it will break anyone's
code without warning.

Current: Backend.table that takes schema as mapping

  • polars takes a _schema kwarg that is unused
  • pandas
  • dask

For dask and pandas you can use schema to override the mapping
schema used to create a table, sort of similar to using ibis.table?

Future:

Deprecate schema keyword, offer no replacement for this weird and
inconsistent functionality.

Possible: add the database kwarg for API consistency and no-op if it
gets used (or error?)

Backend.get_schema

This is new since TES and we can rename it to get_database (or
something else)

Backend.list_schemas

We deprecate this, point users at list_databases

Backend.list_databases

Current: was undefined or was returning nothing (because backend has no catalog support)

Future:

This will now return databases, and we add a catalog kwarg so you can
list databases in a given catalog.

Current: returns catalog

Future:

This will just break. This is unfortunate, but I'm cautiously optimistic
that no one is using this programmatically.

Other changes and possibly additional steps in later versions

  • Add Backend.list_catalogs which behaves like the old version of
    list_databases
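As a rough picture of how list_databases and list_catalogs would relate under the glossary above, here is a toy in-memory backend. ToyBackend and its constructor arguments are invented for illustration; only the method names and the catalog kwarg come from this proposal.

```python
from typing import Dict, List, Optional


class ToyBackend:
    """In-memory stand-in for a backend with catalogs; not Ibis code."""

    def __init__(
        self,
        catalogs: Dict[str, Dict[str, List[str]]],
        current_catalog: str,
    ) -> None:
        # Layout: catalog name -> database name -> list of table names.
        self._catalogs = catalogs
        self.current_catalog = current_catalog

    def list_catalogs(self) -> List[str]:
        # The top level of the hierarchy (what some backends used to
        # return from list_databases).
        return sorted(self._catalogs)

    def list_databases(self, *, catalog: Optional[str] = None) -> List[str]:
        # Databases are collections of tables; default to the current catalog.
        catalog = self.current_catalog if catalog is None else catalog
        if catalog not in self._catalogs:
            raise ValueError(f"unknown catalog: {catalog!r}")
        return sorted(self._catalogs[catalog])


con = ToyBackend(
    {"prod": {"sales": ["orders"], "hr": []}, "dev": {"scratch": []}},
    current_catalog="prod",
)
print(con.list_catalogs())                # ['dev', 'prod']
print(con.list_databases())               # ['hr', 'sales']
print(con.list_databases(catalog="dev"))  # ['scratch']
```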

After we remove schema (next major version after deprecating),
we can consider adding an additional catalog kwarg to several of the
above methods. We will continue to allow all the database
behaviors listed above, and we add error handling for the case where
someone specifies catalog and also provides a dotted path as database
(we have this now for overspecifying database.schema).
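The overspecification check could look something like the following sketch. The function name check_catalog_database is hypothetical, and the choice of ValueError is an assumption rather than something this issue specifies.

```python
from typing import Optional, Tuple


def check_catalog_database(
    database: str,
    catalog: Optional[str] = None,
) -> Tuple[Optional[str], str]:
    """Reject a dotted `database` when `catalog` is also given.

    Hypothetical helper for illustration only; not part of Ibis.
    """
    if "." in database:
        if catalog is not None:
            # The catalog was specified twice: once as a kwarg and once
            # inside the dotted path.
            raise ValueError(
                f"cannot pass catalog={catalog!r} together with a dotted "
                f"database {database!r}"
            )
        catalog, _, database = database.partition(".")
    return catalog, database
```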

Metadata

Labels: backends, refactor

Status: done