Save remote output metadata to ActionCache and load them before checking action cache #13604
coeuvre wants to merge 20 commits into bazelbuild:master
Conversation
Another thought: we can probably move "load metadata cache" and "update metadata cache" into
General idea seems reasonable, will defer to @alexjski
Can you explain why we need separate files for the metadata as opposed to storing that with the existing structure? That could technically open the possibility of entries being present in the metadata but not in the cache, and the other way round. Is the worry that we could read the metadata in case of a cache miss?
They are technically different types of maps: the action cache is a map of <path of action's first output, digest of action data> while the metadata cache is a map of <path of an output file, metadata>. Storing them in different files seems reasonable.
They are, but I don't think we need 2 maps -- my suggestion was to merge them in which case the storage would be merged too. Anyway, given the check on the output metadata digest is still present, I am less pressed to ask you to do that. Technically, if we merged those we may not need to store the outputs fingerprint and could have the outputs instead, but it may be more work than it's worth.
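To make the merge concrete, here is a minimal sketch of the idea under discussion. All names (`CacheEntry`, `MergedActionCache`) are illustrative, not Bazel's actual classes: a single map keyed by the action's first output path, whose entry carries both the action digest and the per-output metadata, so the two maps cannot drift apart.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the merged map; not Bazel's real ActionCache.
final class CacheEntry {
  final String actionDigest;                // digest of the action data
  final Map<String, String> outputMetadata; // <path of output file, metadata>

  CacheEntry(String actionDigest, Map<String, String> outputMetadata) {
    this.actionDigest = actionDigest;
    this.outputMetadata = outputMetadata;
  }
}

final class MergedActionCache {
  // One map keyed by the action's first output path: the entry and its
  // output metadata are stored and evicted together, so they stay in sync.
  private final Map<String, CacheEntry> cache = new HashMap<>();

  void put(String firstOutputPath, CacheEntry entry) {
    cache.put(firstOutputPath, entry);
  }

  CacheEntry get(String firstOutputPath) {
    return cache.get(firstOutputPath);
  }
}
```

With this shape, an entry missing from the cache necessarily takes its metadata with it, which is the consistency property the single-map suggestion is after.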
That could technically open possibilities of entries present in the metadata, but not in the cache and the other way round.
I think this is a valid case. If entries are present in the metadata but not in the cache, Bazel will just ignore them. If it's the other way round, that is just the case before this change.
I am afraid that will be a possibility, unless we do something like varying the version of the action cache based on whether this feature is enabled (that's technically an abuse of versioning). Another thing we could do: if we had some global indicator in the action_cache file on whether it has metadata, we could reject the action cache if it disagrees on whether it should have it.
Is the worry that we could read the metadata in case of a cache miss?
Even if we store metadata into action cache files, a cache miss shouldn't remove the metadata.
However, we can change to storing the metadata in the existing action cache file (by suffixing the path of the output file with :metadata, for example) if you prefer it this way.
Yeah, that's one way to have a global variance on the option. That sounds reasonable with only the caveat that if we have more options like that, we may end up with multitude of prefixes, but maybe that is not a thing to worry about. Also, there is the case that if you had a prior build storing the metadata, follow up one without it could technically still use it, but I think that is a case not worth optimizing for.
That would take out the possibility of the metadata being absent from prior builds due to changing options -- missing metadata in this case would be an indication of a bug.
(force-pushed 1901917 to 7b7c4a6)
Updated to merge two maps into one. For metadata of an output, the key is its
(force-pushed d33ec55 to 7945971)
(force-pushed 462abdc to 03f6874)
That's good. I think I initially had a more negative reaction to separate maps than it calls for (sorry) because I didn't realize the output fingerprint check is still present, which negates all of the risks. Anyway, with the merged map we know they both are always in sync, which is a nice thing in itself.
Regarding #13604 (comment), I am thinking of separating these metadata methods from

The goal of this PR is allowing Bazel to download remote outputs after action execution. To do that, we need:
1 is implemented: during the step when we update the action cache, we also check for metadata injected during action execution and save those which are "remote". 2 is only implemented for downloads that happened within action execution: if the injected metadata is not "remote", we remove the key from the cache when updating the action cache. Consider the case where the local action cache of a remote action is valid (hence action execution is skipped) but we need to download its outputs. With the current implementation, it won't work; we need a post-action-execution download procedure. Whichever module implements this feature, it needs to access the ActionCache to remove metadata for downloaded outputs. Accessing the ActionCache somewhere else is risky IMO. If we move these metadata methods to the
WDYT?
That may be a reason for 2 maps, so you can still have cache hits for local actions.
My very first reaction is that I think we discussed that before and that was not agreed upon. I see 2 reasons to need to download the outputs:
Seems like if you need those, your best bet may be a FUSE-based solution which tracks all outputs, which would deem all of these changes unnecessary (see #12823). Another thing to consider about downloading results, if you wanted to do that eagerly (I think if we can have FUSE, that would be strictly better), is subscribing to events and looking at outputs. We already issue those for top-level actions (BEP); there also is CachedActionEvent (you might need to add the metadata to it though). I am not pressing too hard, but that is something to think about -- it looks like we may already have the desired interface there. For reference, we already look at the outputs right there:

This PR as-is, which only tracks the metadata in the action cache, can to me serve an interesting case for remote-only builds, which is something that we discussed before and is indeed a nice thing to have.
alexjski
left a comment
I will leave that up to you whether you want to have 2 maps or 1. One high-level comment is that if you are keeping that together, enhancing the ActionCache.Entry may give you a better interface where you do a single update with all remote metadata. This can potentially spare some of the OutputStore manipulations (injections are still needed). Something to think about.
@@ -348,9 +353,24 @@ public ActionCache.Entry get(String key) {

  @Override
  public void put(String key, ActionCache.Entry entry) {
If we were to store that together, would expanding the entry not make more sense than having multiple calls for the metadata? What I mean is for the ActionCache.Entry to have a field of type @Nullable Map<PathFragment, Metadata> remoteMetadata + something similar for tree expansions. Then you also don't need any new method in here or special logic to remove entries -- an update would just add one with less metadata if that is not needed anymore.
Please note that I am actually not leaning very much in either way -- 2 maps actually could be just fine.
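As a sketch of this suggestion (field and class names are assumed here, not Bazel's actual API), the entry itself could carry the nullable metadata maps, so writing a new entry implicitly drops metadata that is no longer needed:

```java
import java.util.Map;

// Illustrative only: an entry that owns its remote metadata. A null map means
// "no remote outputs"; replacing the entry replaces the metadata wholesale,
// so no separate removal calls or special deletion logic are needed.
final class Entry {
  final String actionDigest;
  final Map<String, String> remoteMetadata; // may be null
  final Map<String, String> treeExpansions; // analogous map for tree artifacts; may be null

  Entry(String actionDigest,
        Map<String, String> remoteMetadata,
        Map<String, String> treeExpansions) {
    this.actionDigest = actionDigest;
    this.remoteMetadata = remoteMetadata;
    this.treeExpansions = treeExpansions;
  }

  boolean hasRemoteMetadata() {
    return remoteMetadata != null && !remoteMetadata.isEmpty();
  }
}
```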
  }
}

private static byte[] encodeTreeMetadata(TreeArtifactValue metadata) {
TreeArtifactValue already has serialization code (and so does FileArtifactValue):
ObjectCodecs codecs = ... // may need to have one per the cache
byte[] bytes = codecs.serialize(metadata).toByteArray();
    new ConcurrentHashMap<>();

private final ConcurrentMap<TreeFileArtifact, FileArtifactValue> treeFileArtifactData =
    new ConcurrentHashMap<>();
We should not need that -- treeArtifactData already has the metadata.
FileArtifactValue childMetadata = childEntry.getValue();
if (!childMetadata.isRemote()) {
  // Only save remote tree file metadata
  childMetadata = MISSING_FILE_MARKER;
I wonder if we should have a different marker, or at least add a comment that this differs from what that marker represents -- this is metadata for a file which does not exist at all. In here, we are saying it exists, but locally.
Another case is that we can just inject metadata for the outputs of an action and start executing dependent actions while keeping output downloads in the background (today, we have to wait for the download).
IIUC, we are experimenting with a non-FUSE-based solution. The solution in my head is that we can build artifacts as normal and use "outputs" flags to tell Bazel which outputs we want to download, e.g. no flags means we want all the outputs. If we need outputs that weren't downloaded during a previous build, we can then "rebuild" with different "outputs" flags, which will hit the local action cache but start downloading the missing outputs. The entire process for executing an action remotely should be:
This works without FUSE but provides a similar experience.
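Under the assumptions above, the cache-hit path could be sketched like this (purely illustrative names, not the PR's actual code): on a hit, compare the cached remote metadata against what the current "outputs" flags request locally, and download only what is missing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: given which cached outputs are remote-only and which
// outputs the current flags want materialized locally, pick what to download.
final class RemoteOutputPlanner {
  static List<String> outputsToDownload(
      Map<String, Boolean> cachedIsRemote, // output path -> metadata says "remote"
      Set<String> requestedLocally) {      // outputs the flags ask to materialize
    List<String> toFetch = new ArrayList<>();
    for (Map.Entry<String, Boolean> e : cachedIsRemote.entrySet()) {
      if (e.getValue() && requestedLocally.contains(e.getKey())) {
        toFetch.add(e.getKey()); // remote-only and requested: download it
      }
    }
    Collections.sort(toFetch); // deterministic order for the caller
    return toFetch;
  }
}
```

A rebuild with broader "outputs" flags then amounts to calling this with a larger `requestedLocally` set; everything already local (or still unrequested) stays untouched.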
This is also a good place to consider for post-execution downloading. One thing I mentioned above is that we need to remove remote metadata from the cache after a download. So we need to decide whether we should expose the ActionCache to other modules, or keep the ActionCache as it is and introduce a separate interface for metadata only.
Perhaps the word "FUSE" is a bit of a red herring. Let's say, any filesystem-based output service. An NFS volume would also work. Chi, what is the reason to avoid such a setup and centralize logic inside Bazel?
Just chiming in, but macFUSE performance isn't the best, so any solution that requires FUSE might not be performant on macOS.
To be clear, I didn't mean to replace the FUSE-based (or any filesystem-based output service) solution but just wanted to improve "build without bytes".
IIUC, we don't offer that setup for Bazel today. For users who care about network bandwidth, the only choice is "build without bytes". The ways forward are probably:
I don't think these two ways are incompatible with each other.
I would like to go ahead with Chi's idea of having this logic in Bazel. I think filesystem-based layers are probably the most user-friendly interface when they work, but they come with lots of challenges, too: They are platform-specific, they are often slower than non-filesystem-based APIs (e.g. see sandboxfs performance vs symlink-based sandbox performance), they require additional setup (e.g. install FUSE), they require having special privileges on the system (admin rights, ability to load a kernel extension, ability to mount a filesystem, ...) and they might not be available on all systems due to security policies. As we know from Google they can perform very well in a carefully managed IT environment, but I don't believe we will be able to provide a good out of the box experience for these solutions on typical user systems. The two approaches should nicely complement each other though!
There already was a discussion about that before, and I also believe there was some effort put towards testing how NFS-based solutions perform. I think that in order to have the most informed discussion, we should merge those, look at the results from the tests, and then make design decisions. I believe that this code review may not be the best place for that. The main thinking so far was that if we have a FUSE/NFS-based solution available, then the incentive for this may be low. What I was missing from the proposal before was the justification for why the added complexity is needed if we have a FUSE/NFS solution. If there is none, I would strongly prefer to
Why do we need to remove the metadata? Also, it may be useful to clarify what metadata you mean in this context -- the newly added metadata in the action cache or the metadata stored in ActionExecutionValue (skyframe)?
I was referring to the newly added metadata in the action cache. This PR only saves remote metadata and loads it into the OutputStore before checking the action cache. If we don't remove the remote metadata of an output from the action cache after downloading it, we will still use the remote metadata from the action cache instead of re-calculating it from the local file system the next time we check the action cache (even if the local file content has changed). Now that I think about it again, maybe we can change to storing all metadata of outputs in the action cache (which helps #13566): re-calculate all output metadata for the action from the local file system (we do this today as well) and compare it with what is stored in the action cache. Then we don't need to remove metadata after downloads. It looks like saving metadata directly in the action cache is the preferred way, so I would like to make it stored within one map by extending
I agree, we need to take care of that. My thinking was that it may be a better API to just request a desired end state rather than have fine-grained manipulations. My preference would be to have an API like this:

In fact, that could be a single call as part of updating the cache, which we already do.

over:

In fact, then you can choose in the cache implementation to only track the remote ones, and you don't have to worry about removing old entries since the operation resembles
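One plausible shape for such a desired-end-state API (the names here are mine, not the reviewer's elided suggestion): a single put that replaces everything stored for the action, so stale metadata disappears implicitly, much like `Map.put`.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative coarse-grained API: the caller hands over the full desired
// state in one call; there are no add/remove manipulations to keep in sync.
final class MetadataCache {
  private final Map<String, Map<String, String>> byAction = new HashMap<>();

  // Replaces whatever was stored for this action, so an update with fewer
  // entries implicitly removes the rest -- semantics resemble Map.put.
  void put(String actionKey, Map<String, String> remoteOutputMetadata) {
    byAction.put(actionKey, Map.copyOf(remoteOutputMetadata));
  }

  Map<String, String> get(String actionKey) {
    return byAction.getOrDefault(actionKey, Map.of());
  }
}
```

With this design, the "remove metadata after download" step becomes a regular update that simply supplies less metadata.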
Not sure if we need that. My point was merely about the API seeming too fine grained. Logically it makes perfect sense.
I agree, if we are doing that, the
(force-pushed f29cf9d to 9d78ee1)
(force-pushed b9196f4 to e25da42)
Added a few tests. More are coming tomorrow.
Tests added. I think we can have another review round.
This PR extends ActionCache.Entry to store output metadata by having a map of <Path, Metadata>. This map is updated after action execution, when we update the action cache, so that the metadata of all outputs of the action is saved. Before checking the action cache (when executing actions), we load the output metadata into the output store if it is remote and the corresponding local one is missing. With this change, remote output metadata is saved to disk, so build without bytes can use it across server restarts.
We can also download outputs after action execution since remote output metadata can be accessed outside.
Part of #12665.
Fixes #8248.