BigQueryToGCSOperator: Invalid dataset ID error #22034

@shuhoy

Description

Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google==6.3.0

Apache Airflow version

2.2.3

Operating System

Linux

Deployment

Composer

Deployment details

  • Composer Environment version: composer-2.0.3-airflow-2.2.3

What happened

When I use BigQueryToGCSOperator, I get the following error.

Invalid dataset ID "MY_PROJECT:MY_DATASET". Dataset IDs must be alphanumeric (plus underscores and dashes) and must be at most 1024 characters long.

What you expected to happen

I guess this happens because I use a colon (:) as the separator between project_id and dataset_id in source_project_dataset_table.
When I used a dot (.) as the separator instead, it worked.
However, the documentation of BigQueryToGCSOperator states that a colon can be used as the separator between project_id and dataset_id. In fact, at least up to Airflow 1.10.15, the colon separator also worked.
In Airflow 1.10.*, the BigQuery hook splits the input on the colon to extract project_id and dataset_id, but apache-airflow-providers-google==6.3.0 doesn't have this step.

def _split_tablename(table_input, default_project_id, var_name=None):
    if '.' not in table_input:
        raise ValueError(
            'Expected target table name in the format of '
            '<dataset>.<table>. Got: {}'.format(table_input))

    if not default_project_id:
        raise ValueError("INTERNAL: No default project is specified")

    def var_print(var_name):
        if var_name is None:
            return ""
        else:
            return "Format exception for {var}: ".format(var=var_name)

    if table_input.count('.') + table_input.count(':') > 3:
        raise Exception(('{var}Use either : or . to specify project '
                         'got {input}').format(
                             var=var_print(var_name), input=table_input))
    cmpt = table_input.rsplit(':', 1)
    project_id = None
    rest = table_input
    if len(cmpt) == 1:
        project_id = None
        rest = cmpt[0]
    elif len(cmpt) == 2 and cmpt[0].count(':') <= 1:
        if cmpt[-1].count('.') != 2:
            project_id = cmpt[0]
            rest = cmpt[1]
    else:
        raise Exception(('{var}Expect format of (<project:)<dataset>.<table>, '
                         'got {input}').format(
                             var=var_print(var_name), input=table_input))

    cmpt = rest.split('.')
    if len(cmpt) == 3:
        if project_id:
            raise ValueError(
                "{var}Use either : or . to specify project".format(
                    var=var_print(var_name)))
        project_id = cmpt[0]
        dataset_id = cmpt[1]
        table_id = cmpt[2]
    elif len(cmpt) == 2:
        dataset_id = cmpt[0]
        table_id = cmpt[1]
    else:
        raise Exception(
            ('{var}Expect format of (<project.|<project:)<dataset>.<table>, '
             'got {input}').format(var=var_print(var_name), input=table_input))

    if project_id is None:
        if var_name is not None:
            log.info(
                'Project not included in %s: %s; using project "%s"',
                var_name, table_input, default_project_id
            )
        project_id = default_project_id

    return project_id, dataset_id, table_id
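
To make the expected behaviour concrete, here is a much smaller standalone sketch of the same idea. It is not the provider's code: the name split_table_ref is hypothetical, and it skips the edge cases the 1.10 hook handles (for example project IDs that themselves contain a colon). It only shows that the colon and dot forms should resolve to the same project, dataset and table.

```python
from typing import Tuple


def split_table_ref(table_ref: str, default_project_id: str) -> Tuple[str, str, str]:
    """Hypothetical helper: split a table reference into (project, dataset, table)."""
    # A colon, if present, separates the project from '<dataset>.<table>'.
    project_id, colon, rest = table_ref.rpartition(":")
    if colon:
        parts = rest.split(".")
        if not project_id or len(parts) != 2:
            raise ValueError(
                f"Expected <project>:<dataset>.<table>, got {table_ref!r}"
            )
        return project_id, parts[0], parts[1]

    # No colon: accept '<project>.<dataset>.<table>' or '<dataset>.<table>'.
    parts = table_ref.split(".")
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        # Project omitted, so fall back to the caller's default project.
        return default_project_id, parts[0], parts[1]
    raise ValueError(
        f"Expected (<project>.|<project>:)<dataset>.<table>, got {table_ref!r}"
    )


colon_form = split_table_ref("PROJECT_ID:DATASET_ID.TABLE_ID", "DEFAULT_PROJECT")
dot_form = split_table_ref("PROJECT_ID.DATASET_ID.TABLE_ID", "DEFAULT_PROJECT")
```

With a split like this, both separators yield the same three components, so the dataset ID passed on to BigQuery would never contain the "PROJECT:" prefix that triggers the error above.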

How to reproduce

You can reproduce it with the following steps.

  • Create a test DAG that runs BigQueryToGCSOperator in a Composer environment (composer-2.0.3-airflow-2.2.3).
  • Pass the source BigQuery table path to the source_project_dataset_table argument in the following format:
    source_project_dataset_table = 'PROJECT_ID:DATASET_ID.TABLE_ID'
  • Trigger the DAG.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
