Skip to content

Conversation

@samestep
Copy link
Contributor

@samestep samestep commented Mar 9, 2021

Currently, most of our CircleCI jobs run directly on the tip of a PR, but a significant few instead merge the PR into master before running (the two top-level build jobs in this list do the actual merging, and their sub-items use the results of those builds, thus inheriting the merged repo state):

  • pytorch_linux_xenial_py3_6_gcc5_4_build
    • pytorch_cpp_doc_build
    • pytorch_doc_test
    • pytorch_linux_backward_compatibility_check_test
    • pytorch_linux_xenial_py3_6_gcc5_4_jit_legacy_test
    • pytorch_linux_xenial_py3_6_gcc5_4_test
    • pytorch_python_doc_build
  • pytorch_xla_linux_bionic_py3_6_clang9_build
    • pytorch_xla_linux_bionic_py3_6_clang9_test

This is extremely surprising and confusing: for the past few months I've been unsure whether or not our CircleCI setup merges PRs into master before building/testing them, and I would not have guessed (until today, when I finally dug into .circleci/config.yml to figure this out, leading to this PR), that the answer is "both yes and no".

Beyond the principle of least astonishment, here is a snippet from an internal doc explaining a bit more of the motivation for not merging into master on PRs:

Currently, our CircleCI workflows (and some of our GitHub Actions) merge into master before building/testing/etc. ... we do not want to retain this behavior, since .... PyTorch gets many commits per day, and master is frequently broken. By merging with master before running CI, we deny people the benefit of basing their PRs off viable/strict and thus give them worse signal. See these PRs for some discussion of this behavior on the GHA side:

For some additional context, here's some of the Git blame timeline for the existing behavior:

I've added the authors and reviewers of those PRs as reviewers of this one, to hopefully come to a consensus on what to do here. I'm guessing that this PR as initially written is not the best solution, since this automerging seems to perhaps be required for our backward-compatibility test job (and perhaps also our XLA job(s)?), so if those require automerging, we can't get rid of it entirely. However, in that case, I would much prefer to isolate the parts of our Circle build workflow that use automerging, and document exactly which parts of the workflow automerge and why.

Test plan:

PR #53639 serves as a testbed for this PR. Specifically:

  • this CircleCI workflow shows the two failing build jobs with the current automerging behavior when a merge conflict is introduced
  • this CircleCI workflow shows those jobs and their descendants succeeding when this PR is cherry-picked

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Mar 9, 2021

💊 CI failures summary and remediations

As of commit dec71a5 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samestep has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@samestep samestep requested a review from walterddr March 9, 2021 21:39
@samestep
Copy link
Contributor Author

samestep commented Mar 9, 2021

Here's a specific example of automerging causing bad signal on a PR that was based on viable/strict and thus should not have been getting mypy failures:

if [[ -n "$CIRCLE_PULL_REQUEST" && "$CIRCLE_BRANCH" != "nightly" ]]; then
PR_NUM=$(basename $CIRCLE_PULL_REQUEST)
CIRCLE_PR_BASE_BRANCH=$(curl -s https://api.github.com/repos/$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME/pulls/$PR_NUM | jq -r '.base.ref')
if [[ "${BUILD_ENVIRONMENT}" == *"xla"* || "${BUILD_ENVIRONMENT}" == *"gcc5"* ]] ; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the isolating the automerging idea to the configs that need it, though from the list above, it seems like more jobs than just backwards compatibility toggled on automerging. I'm a little confused as I'm not quite sure how pytorch_bazel_test whose build environment seems to be pytorch-linux-xenial-py3.6-gcc7-bazel-test also made it onto the automerge list (based on this if logic).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right! My bad, I misread the graph and put pytorch_bazel_test in there by mistake. Good catch, thanks!

@seemethere
Copy link
Member

I did attempt to do this at one point, #45281

@seemethere
Copy link
Member

cc @zcain117, @ailzhang for pytorch/xla perspective

@ezyang
Copy link
Contributor

ezyang commented Mar 10, 2021

XLA is explained by Ailing in this comment #45281 (review)

To elaborate on BC test, rebasing is needed for BC test because if nightly is NEWER than your base branch, it will look to the BC test like you have REMOVED operators (that were added between your merge base, and nightly) and BC test will spuriously fail.

@janeyx99
Copy link
Contributor

If you were thinking of splitting the build into a merge-with-master vs not, I did something similar with coverage back in the day. This PR might be helpful: #46656

@walterddr
Copy link
Contributor

walterddr commented Mar 10, 2021

To elaborate on BC test, rebasing is needed for BC test because if nightly is NEWER than your base branch, it will look to the BC test like you have REMOVED operators (that were added between your merge base, and nightly) and BC test will spuriously fail.

question on the BC test, wouldn't it make more sense to compare against base commit of the PR instead of nightly, after all it is trying to check if the PR introduce any BC breakage right? I can see there're 2 potential problems not doing so:

  1. base can be too old and some APIs were changed in newer master commits, but i think most likely this would also result in a merge conflict against master.
  2. we need to rebuild a binary from base commit every PR that could cost lots of CI resources

@samestep
Copy link
Contributor Author

compare against base commit of the PR

Minor note: this can be difficult to do for ghstack PRs, particularly with some of the recent unreleased work we've done in ghstack: ezyang/ghstack#44

@ezyang
Copy link
Contributor

ezyang commented Mar 10, 2021

question on the BC test, wouldn't it make more sense to compare against base commit of the PR instead of nightly, after all it is trying to check if the PR introduce any BC breakage right?

The way the BC test works is it literally downloads the nightly wheel and then run some code that imports torch to dump the set of operators. For an arbitrary base commit, no nightly necessarily exists.

TBH, I've always thought this design was kind of hacky and maybe there are better ways to do it (e.g., maintain a separate database that maps commit id to set of supported operators) but you'll have to do implement that first before you can tackle this problem.

@samestep samestep closed this Mar 18, 2021
@ezyang
Copy link
Contributor

ezyang commented Mar 18, 2021

@samestep btw, you don't have to close this issue, I think it is a legitimate complaint, just one that isn't easily fixable right now.

@samestep
Copy link
Contributor Author

@ezyang Ah thanks; yeah I'm just closing it for my own sake (so that my only open PRs are the ones that I am actively working on), but I do definitely intend to come back to this.

facebook-github-bot pushed a commit that referenced this pull request Mar 30, 2021
Summary:
PRs #53652 and #54693 attempted to increase the consistency of our choice of commit (head vs merge) for CI on PRs, and have so far been unsuccessful. This PR takes a less ambitious approach to the problem by clarifying the choice in one specific way (see the following paragraph) and documenting it in `CONTRIBUTING.md`.

In addition to documentation, this PR also removes the current behavior of our GHA jobs that checkout the PR tip instead of the merge commit. At first glance, this behavior seems to increase consistency (by eliminating the special-case for `ghstack` PRs), but in reality, it actually just means that for non-`ghstack` PRs, the question "Which commit is used in CI?" has *two* answers instead of just one; see the description of #53652 for more details.

Once merged, this PR will unblock other PRs that add modify our GHA workflows in breaking ways, such as #54737.

Pull Request resolved: #54967

Test Plan: None.

Reviewed By: walterddr, seemethere

Differential Revision: D27435913

Pulled By: samestep

fbshipit-source-id: 405fb419cf015cf88107d5eb2498cfb5bcb7ce33
@samestep samestep deleted the circleci-no-merge branch April 1, 2021 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants