[airflow] Implement airflow-xcom-pull-in-template-string (AIR201)#23583
[airflow] Implement airflow-xcom-pull-in-template-string (AIR201)#23583MichaReiser merged 5 commits intoastral-sh:mainfrom
Conversation
|
| code | total | + violation | - violation | + fix | - fix |
|---|---|---|---|---|---|
| AIR201 | 3 | 3 | 0 | 0 | 0 |
c0c2205 to
fb18572
Compare
|
cc @sjyangkevin for review? |
would be great to also have @Lee-W to have a look when he is available. |
ca33d94 to
25ab2b5
Compare
|
Hey, thanks for creating this. I'm actaully good with adding this, but it would be better if we had a stronger consensus across the community. I'll start a discussion in the dev list today. |
25ab2b5 to
12d2489
Compare
|
Thanks @Lee-W and @sjyangkevin for taking a look! I think I'll convert this back to a draft for now until there's consensus on the Airflow side. |
12d2489 to
e51918c
Compare
|
@ntBre @amyreese @Lee-W @sjyangkevin The Airflow community approved this rule, so I'm marking it "Ready for review". Refs: |
Lee-W
left a comment
There was a problem hiding this comment.
a few nits. but overall looks good!
| # Mixed content (not just xcom_pull) | ||
| task_10 = BashOperator( | ||
| task_id="task_10", | ||
| bash_command="echo {{ ti.xcom_pull(task_ids='task_1') }}", |
There was a problem hiding this comment.
I guess we're worrying about false possitive?
There was a problem hiding this comment.
Shouldn't we? What do you suggest?
There was a problem hiding this comment.
I meant why shouldn't we replace if it's mixed content
There was a problem hiding this comment.
In other cases, the template (string) is replaced by a different type (xcom). In a mixed context we need to keep a string, which relies on the correct serializability of xcoms. IMO this is more trouble than it's worth.
However, let's consider your proposal for a second - what should be the replacement in this example?
bash_command="echo {{ ti.xcom_pull(task_ids='task_1') }}", # original
bash_command="echo {{ task_1.output }}",
bash_command="echo " + task_1.output,
bash_command=f"echo {task_1.output}",
...There was a problem hiding this comment.
I think it should be bash_command="echo {{ task_1.output }}",
e51918c to
64581b7
Compare
64581b7 to
dcdeb6a
Compare
sjyangkevin
left a comment
There was a problem hiding this comment.
Thanks! Overall looks good to me but one more small feedback. @Lee-W when you have time, would you mind also have a look see if the following valid?
Should we also handle an edge case that when the task is defined in a TaskGroup, let's say taskgroup_1. The callable task_id will be group_id.task_id, for example
task_4 = BashOperator(
task_id="task_4",
bash_command="{{ ti.xcom_pull('taskgroup_1.task_1') }}",
)
|
@Lee-W, how many rules do you plan on adding? I'm asking because adding and maintaining all these airflow rules is a considerable effort on our end. Don't get me wrong, the contributions are great, but it's unfortunately not at zero cost for us. If you plan on adding many rules, I think it's best if we discuss that first and how we can scale this process. |
By all means - let's discuss!
|
|
Yeah. I know we've heard in the past that |
|
@MichaReiser, we don't have a clear understanding of whether we should add more rules at this time. However, we definitely need to discuss our next steps regarding this matter! Currently, if anyone has ideas about best practices for Airflow, they will need to initiate a discussion and build consensus within the Airflow community. This process takes time and doesn't happen often, so I don’t expect many changes. |
Thanks, this sounds good to me. Plugin support is on our mind. But we first want to push some highly demanded and larger user-facing features (warning severity, human-readable names, rule recategorization, unified format/check command). I suspect that plugins will be high up on our feature list once these are out. |
MichaReiser
left a comment
There was a problem hiding this comment.
Thank you. This overall looks good. The Jinja parsing makes me a bit uneasy. I think there are a few more cases that need handling and there's potential for more code reuse.
| } | ||
|
|
||
| // Check keyword arguments for xcom_pull template strings. | ||
| for keyword in &*call.arguments.keywords { |
There was a problem hiding this comment.
Does this indeed apply to all keyword arguments or can we restrict the rule to a few known keyword names? What if the arguments are passed as positional arguments (or are they required to be keyword arguments)?
There was a problem hiding this comment.
In Airflow, any operator argument can be a template field — it's determined by the operator's template_fields class attribute and varies per operator. Since the {{ ti.xcom_pull(...) }} pattern is specific enough to avoid false positives, I now check all arguments (both positional and keyword) with a code comment explaining the rationale. Positional template strings are rare in operator calls but technically possible.
There was a problem hiding this comment.
Are there some known fields that we can always skip? E.g. task_id, dag and task_group?
There was a problem hiding this comment.
Can you please elaborate or give an example of the use case you have in mind?
This rule was made to detect a common (and outdated) pattern where specifically the return value of a task is retrieved. The reason for the proliferation of the pattern is because it was the recommended (only?) way of doing things for a long time, and since this is what appeared in the docs - that's what users ended up doing. While it is possible to use positional args to call the function i.e.
ti.xcom_pull("task") # Plausible
ti.xcom_pull("task", None) # Uncommon
ti.xcom_pull("task", None, "return_value") # UncommonI think the above are so uncommon that false-negatives are not a concern in practice. Moreover, xcom_pull has 5 "positional-able" args (see below) - and we don't want to touch it if it the user provides anything except just task_ids and (optionally) key.
To conclude - I think that dealing with arbitrary templates is outside the scope of this rule.
CC: @Lee-W @sjyangkevin @potiuk
The current signature of xcom_pull:
def xcom_pull(
self,
task_ids: str | Iterable[str] | None = None,
dag_id: str | None = None,
key: str = XCOM_RETURN_KEY, # "return_value"
include_prior_dates: bool = False,
session: Session = NEW_SESSION,
*,
map_indexes: int | Iterable[int] | None = None,
default: Any = None,
run_id: str | None = None,
) -> Any:dcdeb6a to
2ab3e1b
Compare
| } | ||
|
|
||
| // Check keyword arguments for xcom_pull template strings. | ||
| for keyword in &*call.arguments.keywords { |
There was a problem hiding this comment.
Are there some known fields that we can always skip? E.g. task_id, dag and task_group?
- Rewrite parse_xcom_pull_template using ruff_python_trivia::Cursor instead of strip_prefix/strip_suffix chains - Extract reusable helpers: eat_whitespace, parse_identifier, parse_quoted_string - Add whitespace tolerance between all token pairs (e.g. ti . xcom_pull) - Bail on escaped quotes in string content - Broaden argument checking to cover both positional and keyword args with explanatory comment about template_fields - Add Fix safety doc section - Add unit tests for new whitespace/escape/unknown-keyword handling Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…_whitespace - Use `iter_source_order` for argument iteration (MichaReiser) - Use `char::is_whitespace` instead of `is_ascii_whitespace` to match Jinja's Unicode whitespace semantics (MichaReiser) - Support reordered keyword arguments: `key='return_value', task_ids='...'` is now recognized in addition to the existing order (MichaReiser + Lee-W) - Extract `parse_task_id_value` helper for list/tuple wrapping logic - Add unit tests for reordered keyword patterns - Add fixture trigger case for reordered keywords (task_25) - Show modern `.output` replacement pattern as comment on mixed-content test - Regenerate snapshot in current insta format Co-Authored-By: Claude Opus 4.6 <[email protected]>
1765da5 to
485dc03
Compare
|
This looks good to me, but Codex had two findings. I find it difficult to assess whether they're correct because I'm unfamiliar with airflow. Could someone take a look:
|
Codex is technically correct here, but I don't think anyone does this. If they do, it's likely a mistake, and while it may be flagged for the wrong reason, it would indicate to the user that something's up. Also, not sure a template like that is even a valid task_id (there are some rules it needs to follow which I cannot find at the moment).
This one's true, but the original syntax is already wrong for the exact same reason. We wouldn't be replacing working code with broken one. |
|
This is great work. Thank you |
Thank you for bringing this over the finish line! |
Summary
Implements rule
AIR004(airflow-xcom-pull-in-template-string) that detects Airflow operator/sensor keyword arguments using a Jinja template string containing a singlexcom_pullcall (e.g.,"{{ ti.xcom_pull(task_ids='some_task') }}") and suggests replacing it with the.outputattribute on the task object (e.g.,some_task.output).Using
.outputinstead ofxcom_pulltemplate strings:@task-decorated functions)What the rule flags
Suggested fix
Template patterns detected
{{ ti.xcom_pull(task_ids='...') }}and{{ task_instance.xcom_pull(task_ids='...') }}{{ ti.xcom_pull('...') }}task_id=andtask_ids=keyword formsWhat it allows (no false positives)
"echo {{ ti.xcom_pull(task_ids='task_1') }}""{{ ti.xcom_pull(task_ids='task_1', key='my_key') }}".output:task_1.output"{{ ti.xcom_pull(task_ids=['a', 'b']) }}"DAG(...))Unsafe fix
When the referenced
task_idmatches a variable in scope (either an operator assignment or a@task-decorated function), an unsafe fix is provided that replaces the template string with<variable>.output. When no matching variable is found, the diagnostic is still reported but without an auto-fix.Test Plan
AIR004.pycovering:ti.xcom_pull,task_instance.xcom_pull, positional args, double quotes, no-space braces, singulartask_idkeyword, sensors, provider operators, and@task-decorated function sources.task_idnot visible as a variable in the current scope..output, regular strings, non-string arguments, listtask_ids, and non-operator calls.parse_xcom_pull_templateparser covering all template variants and rejection cases.related: apache/airflow#43176