-
Notifications
You must be signed in to change notification settings - Fork 1.3k
feat: ignore files not in remote when push is false #10749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: ignore files not in remote when push is false #10749
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #10749 +/- ##
==========================================
+ Coverage 90.68% 91.06% +0.38%
==========================================
Files 504 504
Lines 39795 40040 +245
Branches 3141 3164 +23
==========================================
+ Hits 36087 36462 +375
+ Misses 3042 2950 -92
+ Partials 666 628 -38 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
4537674 to
b5a6a58
Compare
|
@skshetry, have you had time to look at this? This feature would be really great for our team! |
dvc/repo/data.py
Outdated
| not_in_remote = uncommitted_diff.pop("not_in_remote", []) | ||
|
|
||
| if respect_no_push: | ||
| logger.debug("Filtering out paths that are not pushable") | ||
| not_in_remote = _filter_out_push_false_outs(repo, not_in_remote) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have to extract out the "not_in_remote" checks from above _diff(), as it has nothing to do with diff between worktree and committed changes (index). We need to calculate only for the given index (commited changes).
That is what repo.index is. It's an index of committed changes. You can filter that index to a view, using worktree_view:
if not_in_remote:
view = worktree_view(repo.index, push=True)
# ... existing logicpush=True gives us https://github.com/iterative/dvc/blob/549821e6be050b57ab89b8325ed4f3c41b7fd966/dvc/repo/worktree.py#L73.
You can get access to DataIndex using view.data["repo"]. And then use index.iteritems(shallow=not granular) on it.
Let me know if you need help. I can take over the PR if you prefer that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can extract the above not_in_remote stuff. It is using Change type, but you can make minor modification to it to instead use Entry type. change.old and change.new are just Entry type.
Something like follows, maybe:
data_index = view.data["repo"]
for key, entry in data_index.iteritems(shallow=not granular):
if not (entry and entry.hash_info):
continue
k = (*key, "") if entry.meta and entry.meta.isdir else key
try:
if not index.storage_map.remote_exists(entry, refresh=remote_refresh):
yield os.path.sep.join(k)
except StorageError:
passThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks clean! Just to make sure I understand you correctly:
You propose to not use the uncommitted_diff["not_in_remote"] at all, and instead handle it directly, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is from our spec. I think the plan was to show "not_in_remote" only for the tracked directories/files.
I don't think it makes sense for us to show "not_in_remote" for items that are already git-committed. If they are still being tracked by dvc.yaml/.dvc file in the workspace (i.e. the index), and are missing from remote, they will show up in "Not in remote" section anyway.
Only the files that are not being tracked anymore, but were tracked by Git's "HEAD" and is missing from the remote will show up in committed_diff.not_in_remote, which I don't think makes sense. It could be intentional, or even if not intentional, that only adds noise. You can't populate those files into the remote now with dvc push (unless you use --all-commits), as they are no longer being tracked by DVC.
I don't see any discussion about this on either of:
So maybe that was an oversight.
Do you see any use for this? Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there is no Not in cache status for git-committed items, but missing from cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are your conventions on scope of PRs, @skshetry? It is starting to look like the scope here may creep a bit from the original issue, to also cover the nuances that we are discussing above. I'm fine with it, but wanted to check what are your norms. Should I make one PR with narrow scope of the push: false part and then start on another for the rest (which you may be more equipped to do properly), or should we collaborate on doing it in one PR here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then start on another for the rest
Not sure I understand what the rest part is.
You propose to not use the uncommitted_diff["not_in_remote"] at all, and instead handle it directly, right?
^ I was only providing a bit more context for you to work with. I am not asking anything more than what you are proposing, except for making it the default.
And, using a different approach (aka refactoring) is necessary to get a "view" of a pushable data, which I suggested above. This push: false information does not exist in _diff(), and I am not sure how to represent it yet on the data management layers.
And I tried to give you my thoughts on why committed_diff["not_in_remote"] does not make any sense and tried to hear yours.
If you have any questions, please feel free to ask. I am not trying to expand the scope here. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, thanks:) Working on the refactoring now. Really appreciate the thorough and patient follow-up - great maintainership!
|
@skshetry, thanks for the thorough review, and sorry for the very late reply from me. When reading up on your suggestions, especially on the change from diffing index - workspace to inspecting index directly, I realized I am somewhat confused about the intended/expected behavior of What is the difference between
I first thought this meant there was a difference between files generated with pipelines ( I made a simple example to investigate. ❯ source ./run_demo.sh
==================================
## Before repro
==================================
>>> dvc status -c --json
{
"bar.txt": "missing",
"foo.txt": "missing"
}
>>> dvc status --json
{
"create-bar": [
{
"changed outs": {
"bar.txt": "not in cache"
}
}
],
"foo.txt.dvc": [
{
"changed outs": {
"foo.txt": "not in cache"
}
}
]
}
>>> dvc data status --not-in-remote --json
{
"not_in_cache": [
"foo.txt",
"bar.txt"
],
"not_in_remote": [
"foo.txt",
"bar.txt"
],
"committed": {
"not_in_remote": [
"foo.txt",
"bar.txt"
]
}
}
==================================
## Running repro
==================================
>>> dvc repro
Running stage 'create-bar':
> echo bar > bar.txt
Use `dvc push` to send your updates to remote storage.
==================================
## After repro
==================================
>>> dvc status -c --json
{
"bar.txt": "new",
"foo.txt": "missing"
}
>>> dvc status --json
{
"foo.txt.dvc": [
{
"changed outs": {
"foo.txt": "not in cache"
}
}
]
}
>>> dvc data status --not-in-remote --json
{
"not_in_cache": [
"foo.txt"
],
"not_in_remote": [
"foo.txt",
"bar.txt"
],
"committed": {
"not_in_remote": [
"foo.txt",
"bar.txt"
]
}
}
==================================
## Add and commit foo
==================================
echo foo > foo.txt
>>> dvc commit foo.txt
==================================
## After commit
==================================
>>> dvc status -c --json
{
"bar.txt": "new",
"foo.txt": "new"
}
>>> dvc status --json
{}
>>> dvc data status --not-in-remote --json
{
"not_in_remote": [
"bar.txt",
"foo.txt"
],
"committed": {
"not_in_remote": [
"bar.txt",
"foo.txt"
]
}
}
==================================
## Running push
==================================
>>> dvc push
Collecting |2.00 [00:00, 521entry/s]
Pushing
2 files pushed
==================================
## After push
==================================
>>> dvc status -c --json
{}
>>> dvc status --json
{}
>>> dvc data status --not-in-remote --json
{
"committed": {
"not_in_remote": [
"bar.txt",
"foo.txt"
]
}
}
It does seem to contain the same information, structured slightly differently. Also: the Repro of example
|
The
The data from outputs are still "data" tracked by DVC. So they are shown by default (it ignores dependencies' unless they are also part of an output somewhere in the pipeline unlike
|
|
If you are using dvc for data management, use |
I see, thank you, that helps. I just got a bit confused by the |
Note that So while it's called a The unchanged items are always computed unconditionally here, but only shown if (The |
|
@skshetry, updated the PR now, to use the Bit premature there... Need to sort out some issues before review. |
df95f2a to
100983e
Compare
|
It is not clear to me why the failing tests are failing, or if it is related to these changes (main succeeds, so I assume so). Any help appreciated. |
skshetry
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work overall. The change look good! I’ve left a few minor/pedantic comments inline, just for polish. 🙂
Looks unrelated, maybe new pytest release is to blame. Please ignore, that'd fail on main too, but maybe we were lucky. I'll investigate separately. |
Convert kwargs to explicit arguments and extract not_in_remote handling
|
Thanks for the guidance! Ps. Also took the liberty to swap out the |
skshetry
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thank you for contributing!
I'll keep this PR open for a few days to gather feedback from community (will also discuss internally), and then merge by Wednesday.
|
@skshetry, what are your release cycle/policy? Really looking forward to getting this into our CI 🤩 |
I am planning to release by early next week. |
|
@Northo, I have created a new release. :) |


Fixes #10317
Enables opt-in to remove
push: falsestage outputs fromnot_in_remotedata status results.Notable changes:
outs_no_pushtodvc.stage.utils.fill_stage_outputskeys, to facilitate making outputs withpush: false.status, when flag enabled, filter through files reported asnot_in_remote, and remove them if notcan_push.--respect-no-pushflag to CLIOpen to suggestions on how to make the flag names more intuitive!
Corresponding PR for the docs: treeverse/dvc.org#5373
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏