Description
Apache Airflow Provider(s)
Versions of Apache Airflow Providers
apache-airflow-providers-google==6.3.0
Apache Airflow version
2.2.3
Operating System
Linux
Deployment
Composer
Deployment details
- Composer Environment version: composer-2.0.3-airflow-2.2.3
What happened
When I use BigQueryToGCSOperator, I get the following error:
Invalid dataset ID "MY_PROJECT:MY_DATASET". Dataset IDs must be alphanumeric (plus underscores and dashes) and must be at most 1024 characters long.
What you expected to happen
I believe this happens because I use a colon (:) as the separator between project_id and dataset_id in source_project_dataset_table.
When I use a dot (.) as the separator instead, it works.
However, the BigQueryToGCSOperator documentation states that a colon can be used as the separator between project_id and dataset_id. In fact, the colon separator worked at least up to Airflow 1.10.15.
In Airflow 1.10.*, the BigQuery hook splits the input on the colon to extract project_id and dataset_id, but apache-airflow-providers-google==6.3.0 has no equivalent step.
airflow/airflow/contrib/hooks/bigquery_hook.py
Lines 2186 to 2247 in d3b0669
    def _split_tablename(table_input, default_project_id, var_name=None):

        if '.' not in table_input:
            raise ValueError(
                'Expected target table name in the format of '
                '<dataset>.<table>. Got: {}'.format(table_input))

        if not default_project_id:
            raise ValueError("INTERNAL: No default project is specified")

        def var_print(var_name):
            if var_name is None:
                return ""
            else:
                return "Format exception for {var}: ".format(var=var_name)

        if table_input.count('.') + table_input.count(':') > 3:
            raise Exception(('{var}Use either : or . to specify project '
                             'got {input}').format(
                                 var=var_print(var_name), input=table_input))
        cmpt = table_input.rsplit(':', 1)
        project_id = None
        rest = table_input
        if len(cmpt) == 1:
            project_id = None
            rest = cmpt[0]
        elif len(cmpt) == 2 and cmpt[0].count(':') <= 1:
            if cmpt[-1].count('.') != 2:
                project_id = cmpt[0]
                rest = cmpt[1]
        else:
            raise Exception(('{var}Expect format of (<project:)<dataset>.<table>, '
                             'got {input}').format(
                                 var=var_print(var_name), input=table_input))

        cmpt = rest.split('.')
        if len(cmpt) == 3:
            if project_id:
                raise ValueError(
                    "{var}Use either : or . to specify project".format(
                        var=var_print(var_name)))
            project_id = cmpt[0]
            dataset_id = cmpt[1]
            table_id = cmpt[2]

        elif len(cmpt) == 2:
            dataset_id = cmpt[0]
            table_id = cmpt[1]
        else:
            raise Exception(
                ('{var}Expect format of (<project.|<project:)<dataset>.<table>, '
                 'got {input}').format(var=var_print(var_name), input=table_input))

        if project_id is None:
            if var_name is not None:
                log.info(
                    'Project not included in %s: %s; using project "%s"',
                    var_name, table_input, default_project_id
                )
            project_id = default_project_id

        return project_id, dataset_id, table_id
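To make the suspected cause concrete, here is a minimal sketch contrasting a dot-only split with a colon-aware one. Both helpers (`split_dots_only`, `split_colon_aware`) are hypothetical and far simpler than the hook code quoted above, and `DATASET_ID_PATTERN` is only my reading of the error message, not BigQuery's actual validation. I have not confirmed that the provider parses the string exactly this way.

```python
import re

# Assumption: pattern inferred from the wording of the error message,
# not taken from BigQuery or the provider source.
DATASET_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,1024}$")


def split_dots_only(table_input, default_project_id):
    """Hypothetical parser that treats '.' as the only separator."""
    parts = table_input.split(".")
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    if len(parts) == 2:
        return (default_project_id, parts[0], parts[1])
    raise ValueError("Expected [<project>.]<dataset>.<table>, got %r" % table_input)


def split_colon_aware(table_input, default_project_id):
    """Hypothetical, simplified take on the 1.10 behaviour:
    also accept '<project>:<dataset>.<table>'."""
    project_id, sep, rest = table_input.rpartition(":")
    if not sep:
        # No colon present, so fall back to the dot-only rules.
        return split_dots_only(table_input, default_project_id)
    parts = rest.split(".")
    if len(parts) != 2:
        raise ValueError("Use either : or . to specify project, got %r" % table_input)
    return (project_id, parts[0], parts[1])


ref = "MY_PROJECT:MY_DATASET.MY_TABLE"

# With a dot-only split, the project prefix ends up inside the dataset ID.
_, dataset_id, _ = split_dots_only(ref, "default-project")
print(dataset_id)
print(DATASET_ID_PATTERN.match(dataset_id) is not None)

# With a colon-aware split, the three parts come out separately.
print(split_colon_aware(ref, "default-project"))
```

Under these assumptions, the dot-only split yields the dataset ID `MY_PROJECT:MY_DATASET`, which is the same string that appears in the error message.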
How to reproduce
You can reproduce the problem with the following steps.
- Create a test DAG that runs BigQueryToGCSOperator in a Composer environment (composer-2.0.3-airflow-2.2.3).
- Pass the source BigQuery table path to the source_project_dataset_table argument in the following format: source_project_dataset_table = 'PROJECT_ID:DATASET_ID.TABLE_ID'
- Trigger the DAG.
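As a stopgap, since the dot separator works for me, the table reference can be normalized before it is passed to the operator. The helper below is a sketch only: `to_dot_separated` is a hypothetical name, not part of Airflow or the Google provider, and it does not try to handle project IDs that themselves contain a colon or a dot.

```python
def to_dot_separated(table_ref: str) -> str:
    """Rewrite '<project>:<dataset>.<table>' as '<project>.<dataset>.<table>'.

    Hypothetical helper; references without a colon are returned unchanged.
    """
    project_id, sep, rest = table_ref.rpartition(":")
    if not sep:
        return table_ref
    return f"{project_id}.{rest}"


# The result would be used as the value of source_project_dataset_table.
print(to_dot_separated("PROJECT_ID:DATASET_ID.TABLE_ID"))
```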
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct