Description
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
All versions
Apache Airflow version
2.3.0 (latest released)
Operating System
linux
Deployment
Astronomer
Deployment details
No response
What happened
When using the AWS Glue crawler operator, if the sensing stage receives a 400 error with a "Rate exceeded" message, the operator fails.
The problem is that this can occur with as few as 10 crawler sensors running concurrently with a poll_interval of 60s.
You can set retries and exponential_backoff on the operators, but retrying the Glue crawler operator fails anyway, because the crawler has already been started. In essence, if you use this operator and get rate limited, you cannot retry, and you end up with brittle pipelines.
This rate limiting issue happens with all the other AWS operators I have worked with too.
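For context, botocore's own retry behaviour can be tuned when a client is constructed, which mitigates (but does not eliminate) the throttling. A minimal sketch using plain boto3, where the specific retry settings are my own choices rather than anything the provider exposes:

```python
import boto3
from botocore.config import Config

# "adaptive" mode adds client-side rate limiting on top of retries;
# max_attempts raises the ceiling above botocore's default.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
glue_client = boto3.client("glue", config=retry_config)

# Subsequent calls, e.g. glue_client.get_crawler(Name="my_crawler"),
# will back off and retry on throttling before raising.
```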
What you think should happen instead
This is the kind of error the operator receives when it polls AWS for a status update:
"An error occurred (ThrottlingException) when calling the GetCrawler operation (reached max retries: 4): Rate exceeded"
I think this error should be handled instead of it causing the task to fail.
I have implemented a custom operator which catches this particular exception. I would be happy to try to submit a PR for this myself.
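For reference, a minimal sketch of the kind of workaround I mean. The class and its parameters are hypothetical (not part of the provider), and it assumes the hook's get_conn() returns a plain boto3 Glue client:

```python
import time

from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.glue_crawler import GlueCrawlerHook
from botocore.exceptions import ClientError


class ThrottleTolerantGlueCrawlerOperator(BaseOperator):
    """Hypothetical operator: starts a crawler, then polls its status,
    treating ThrottlingException as "try again next poll" rather than
    as a task failure."""

    def __init__(self, *, crawler_name, poll_interval=60,
                 aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.crawler_name = crawler_name
        self.poll_interval = poll_interval
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        client = GlueCrawlerHook(aws_conn_id=self.aws_conn_id).get_conn()
        client.start_crawler(Name=self.crawler_name)
        while True:
            time.sleep(self.poll_interval)
            try:
                crawler = client.get_crawler(Name=self.crawler_name)["Crawler"]
            except ClientError as e:
                # Swallow throttling and poll again; anything else is fatal.
                if e.response["Error"]["Code"] == "ThrottlingException":
                    self.log.info("GetCrawler throttled; retrying on next poll")
                    continue
                raise
            if crawler["State"] == "READY":
                status = crawler["LastCrawl"]["Status"]
                if status != "SUCCEEDED":
                    raise RuntimeError(
                        f"Crawler {self.crawler_name} finished with status {status}"
                    )
                return self.crawler_name
```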
How to reproduce
As this is a problem of being rate limited by AWS, you will need an AWS account set up with some crawlers.
Create a DAG that uses the Glue crawler operator mentioned above. Set up somewhere between 10 and 20 of these to run Glue crawlers in an AWS environment, as in the sketch below.
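A minimal sketch of such a DAG, assuming the crawlers (with hypothetical names here) already exist in the target account:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator

# Hypothetical crawler names; each must already exist in the AWS account.
CRAWLER_NAMES = [f"example_crawler_{i}" for i in range(12)]

with DAG(
    dag_id="glue_crawler_throttling_repro",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for name in CRAWLER_NAMES:
        GlueCrawlerOperator(
            task_id=f"crawl_{name}",
            config={"Name": name},
            poll_interval=60,
        )
```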
Anything else
This problem occurs without fail every time we try to run 12 crawler operators at once (each with a poll_interval of 60s).
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct