Skip to content

Conversation

@yzeng1618
Copy link
Contributor

…ssions in PostgreSQL/mysql/oracle/sqlserver, and fix the null pointer issue in the regular expression
#9209

Purpose of this pull request

added contents about multiple tables and regular expressions in PostgreSQL/mysql/oracle/sqlserver, and fix the null pointer issue in the regular expression.

Does this PR introduce any user-facing change?

Yes.
This PR enhances the table_list parameter to directly support regex patterns for table filtering, while maintaining backward compatibility.
Here's the detailed breakdown:

New Feature: Direct Use of Regular Expressions in table_path
Purpose: Allow users to write regular expressions directly in the table_path field within table_list to filter tables.
Example Configuration:
"table_list"=[
{
"table_path"="TEST.TEST_DB_*" # Matches all tables with the "TEST_DB" prefix
}
]
This configuration matches all tables prefixed with TEST_DB (e.g., TEST_DB_2023, TEST_DB_2024).

Improvement and Fix:
Enhanced the robustness of the approximateRowCntStatement method in OracleDialect.
Fixed a null pointer error in Oracle when executing queries with empty or invalid parameters.
Key Notes:
The table_path now supports regex syntax (e.g., TEST.TEST_DB_* → matches all tables under the TEST schema with names starting with TEST_DB_).
The Oracle fix ensures stable execution of row count estimation logic, avoiding crashes due to unhandled edge cases.

How was this patch tested?

Testing Environment
OS: Linux (Ubuntu 20.04)
SeaTunnel Version: 2.3.9
Execution Mode: Flink on YARN (yarn-application)
Databases:
PostgreSQL 16.0(Source)
Iceberg (Sink)

Test Configuration:
Created a configuration file pg2iceberg.conf to read tables matching the regex TEST.TEST_DB_* from PostgreSQL:
{
env {
execution.parallelism = 1
job.mode = "BATCH"
job.name = "seatunnel_batch_job"
}

source配置
source {
JDBC {
url = "jdbc:postgresql://xxxxxxx:xxxxx/xxxxx"
driver = "org.postgresql.Driver"
user = "xxxxxxxx"
password = "xxxxxxx"
"table_list" = [
{
"table_path" = "postgres.public.test_db_2.*"
}
]
split.size = 5000
fetch_size = 2000
}
}

sink配置
sink {
Iceberg {
........
}
}
}

Execution Command:
./bin/start-seatunnel-flink-15-connector-v2.sh --config pg2iceberg.conf --deploy-mode run-application --target yarn-application --name multitable_pg2Iceberg

Check log:
Filtering tables with regex pattern: postgres.public.test_db_2.*
Found regex match table: postgres.public.test_db_20
Found regex match table: postgres.public.test_db_21
Found regex match table: postgres.public.test_db_22
Found regex match table: postgres.public.test_db_20250324
Found regex match table: postgres.public.test_db_202502
Found regex match table: postgres.public.test_db_202501
Total tables matched after filtering: 6

Check list

…ssions in PostgreSQL/mysql/oracle/sqlserver, and fix the null pointer issue in the regular expression

This comment was marked as outdated.

Copy link
Member

@Hisoka-X Hisoka-X left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to add a option to mark whether the TablePath is a regular expression.

@yzeng1618
Copy link
Contributor Author

I think we need to add a option to mark whether the TablePath is a regular expression.

1、Simplified Path Matching Design,If you enter a path like /project/table_123, the system first checks for this exact path,If no exact match exists, it automatically treats your input as a regex pattern.
The main advantages are as follows:

  • Zero configuration: There is no need for marking parameters like "is_regex", and users don't have to switch modes manually.
    Just enter the input and it will be matched, reducing the cognitive burden.
  • Intuitive behavior: It follows a logic similar to that of the grep tool, First, perform a literal match, and then enable pattern matching.

2、Like Flink CDC (where you directly use exact table names OR regex patterns without special parameters), we keep things simple.

@nielifeng nielifeng requested a review from Copilot June 5, 2025 01:37

This comment was marked as outdated.

@yzeng1618 yzeng1618 requested a review from Hisoka-X June 6, 2025 03:01

Examples:
- `testdb.table+` - Matches all tables starting with "table" followed by numbers
- `testdb.*` - Matches all tables in the `testdb` database
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this expression will match database testdb2 too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regular expression testdb.* will only match all the tables in the testdb database and will not match the testdb2 database.

Comment on lines 88 to 108
for (JdbcSourceTableConfig tableConfig : tablePaths) {
List<String> schemaTables = new ArrayList<>();
String tablePath = tableConfig.getTablePath();
String query = tableConfig.getQuery();
LOG.info("Processing table path: {}, custom query: {}", tablePath, query);
String sql;
if (StringUtils.isBlank(query)) {
String schemaName;
if (jdbcDialect.dialectName().startsWith(DatabaseTypeEnum.ORACLE.getValue())) {
schemaName = tablePath.split("\\.")[0];
sql = "SELECT OWNER, TABLE_NAME FROM dba_tables where OWNER=?";
ps = connection.prepareStatement(sql);
ps.setString(1, schemaName);
rs = ps.executeQuery();
while (rs.next()) {
// For Oracle: schema.table
String foundTable =
rs.getString("OWNER") + POINT + rs.getString("TABLE_NAME");
schemaTables.add(foundTable);
LOG.info("Found table in Oracle: {}", foundTable);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current approach of manually implementing table scanning in JdbcSource offers several key advantages:
Precise Control

  1. Custom SQL
  • queries tailored for each database type (Oracle, MySQL, SQLServer, PostgreSQL)
  • Direct access to system tables for accurate table information retrieval
  1. Detailed Logging
  • Comprehensive logging for debugging and troubleshooting
  • Database-specific logging (e.g., LOG.info("Found table in PostgreSQL: {}", foundTable))
  1. Specialized Format Handling
  • Dedicated handling for different database table path formats:

    MySQL: database.table
    PostgreSQL: database.schema.table
    Oracle: schema.table
    SQLServer: database.schema.table

  1. Flexible Pattern Matching
  • Two-step matching strategy in filterCapturedTablesByRegrex:First attempts exact matches
    Falls back to regex pattern matching
  1. Reduced Dependencies
    Functions independently of complete Catalog implementations
    Works even when some databases have incomplete Catalog implementations

  2. Compatibility Assurance
    Explicitly defines supported databases via isSupportedDatabase
    Provides fallback to JdbcCatalogUtils.getTables for unsupported database types

While using the Catalog interface might lead to more unified code, the current implementation provides greater flexibility and adaptability to specific database behaviors, especially when handling differences in system table structures and naming conventions.

@yzeng1618 yzeng1618 requested a review from Hisoka-X June 6, 2025 05:50
@github-actions github-actions bot added the api label Jun 10, 2025
@yzeng1618
Copy link
Contributor Author

The modification has been made. I hope you can review it when you are free.

.withDescription("The database names RegEx of the database to capture.");

Option<String> SCHEMA_PATTERN =
Options.key("schema-pattern")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused?

@yzeng1618 yzeng1618 requested a review from hailin0 June 12, 2025 06:05
import java.util.Set;
import java.util.stream.Collectors;

@Disabled("Please Test it in your local environment")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why disabled the test case? You can reuse

to mock data.

@yzeng1618 yzeng1618 requested a review from Hisoka-X June 20, 2025 23:09
@davidzollo
Copy link
Contributor

good job. This is already close to merge.
Please add e2e

@github-actions github-actions bot added the e2e label Jun 21, 2025
Comment on lines 448 to 450
boolean isOracleDialect =
jdbcDialect.dialectName().equalsIgnoreCase("oracle")
|| jdbcDialect.dialectName().equalsIgnoreCase("oceanbase-oracle");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot add this special processing logic. There are more than two Oracle-like databases.

@yzeng1618
Copy link
Contributor Author

good job. This is already close to merge. Please add e2e

E2E tests for MySQL and Oracle have been added.

@yzeng1618 yzeng1618 requested a review from Hisoka-X June 23, 2025 13:44
Comment on lines 857 to 864

default Optional<String> getDefaultDatabase() {
return Optional.empty();
}

default boolean useThreePartTablePath() {
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yzeng1618 yzeng1618 requested a review from Hisoka-X June 24, 2025 15:02
Copy link
Member

@Hisoka-X Hisoka-X left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM except two minior problems. Thanks @yzeng1618

Comment on lines 69 to 71
public Boolean getUseRegex() {
return useRegex != null ? useRegex : false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public Boolean getUseRegex() {
return useRegex != null ? useRegex : false;
}

Comment on lines 88 to 92
boolean isRegexPath = tableConfig.getUseRegex();

if (StringUtils.isNotEmpty(tableConfig.getTablePath())
&& StringUtils.isEmpty(tableConfig.getQuery())
&& isRegexPath) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
boolean isRegexPath = tableConfig.getUseRegex();
if (StringUtils.isNotEmpty(tableConfig.getTablePath())
&& StringUtils.isEmpty(tableConfig.getQuery())
&& isRegexPath) {
if (StringUtils.isNotEmpty(tableConfig.getTablePath())
&& StringUtils.isEmpty(tableConfig.getQuery())
&& tableConfig.getUseRegex()) {

@yzeng1618 yzeng1618 requested a review from Hisoka-X June 26, 2025 02:30
Copy link
Member

@Hisoka-X Hisoka-X left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thansk @yzeng1618 ! LGTM if ci passes.

Copy link
Contributor

@davidzollo davidzollo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
good job

@davidzollo davidzollo merged commit 670a52a into apache:dev Jun 27, 2025
5 checks passed
dybyte pushed a commit to dybyte/seatunnel that referenced this pull request Jul 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants