Skip to content

Commit a2e290d

Browse files
authored
Add GitHub Action to suggest PR reviewers based on git history (#4789)
## Why The CLI's CODEOWNERS catch-all assigns 6 people to every PR outside a few narrow paths. This creates review noise and diffuses responsibility. We want targeted reviewer suggestions based on who actually worked on the changed code recently. ## Changes Before: Every PR touching core code auto-assigns all 6 CODEOWNERS. No signal about who is best suited to review. Now: A new GitHub Action analyzes git history of the changed files and posts a PR comment with two sections: - **Suggested reviewers** (1-3 people best suited based on recency-weighted git history) - **Eligible reviewers** (everyone from CODEOWNERS who could review, minus the suggested ones) This is additive only. CODEOWNERS and auto-assign stay unchanged. How it works: - Triggers on PR open, synchronize, and ready-for-review (skips drafts and fork PRs) - Classifies changed files by type (source=1.0, tests=0.3, acceptance=0.2, generated=0.0) - Scores contributors using recency-weighted commit history (half-life 150 days) - Resolves git author names to GitHub logins via the GitHub API (no hardcoded alias table to maintain) - Parses `.github/CODEOWNERS` to find eligible reviewers for the changed paths - Updates the comment in-place on re-runs (no notification churn) Implementation: a single Python script (`tools/suggest_reviewers.py`, 281 lines) and a minimal workflow YAML. ## Test plan - [x] Action ran on this PR itself and posted a comment successfully - [x] Verified script parses cleanly, passed `make checks`, passed `ruff format` - [x] Dry-run tested against 4 recent merged PRs with different characteristics: **PR #4784** (66 files, by pietern, big DABs rename): ``` ## Suggested reviewers - @denik -- recent work in `./`, `bundle/`, `cmd/bundle/generate/` Confidence: high ## Eligible reviewers @andrewnester, @anton-107, @lennartkats-db, @shreyas-goenka, @simonfaltum ``` Correctly identifies Denis as the clear top reviewer (2x second place score). All 6 CODEOWNERS shown as eligible. **PR #4782** (13 files, by denik, bundle engine priority): ``` ## Suggested reviewers - @andrewnester -- recent work in `bundle/schema/`, `bundle/internal/schema/`, `cmd/bundle/generate/` - @pietern -- recent work in `bundle/schema/`, `cmd/bundle/generate/`, `bundle/internal/schema/` - @shreyas-goenka -- recent work in `bundle/schema/`, `bundle/internal/schema/`, `cmd/bundle/` Confidence: medium ## Eligible reviewers @anton-107, @simonfaltum ``` Suggests 3 reviewers when scores are close. Remaining CODEOWNERS shown as eligible. **PR #4785** (2 files, by MarioCadenas, apps bug fix): ``` ## Suggested reviewers - @arsenyinfo -- recent work in `cmd/apps/` - @pietern -- recent work in `cmd/apps/` - @pkosiec -- recent work in `cmd/apps/` Confidence: low ## Eligible reviewers @databricks/eng-apps-devex ``` Correctly suggests apps-area contributors (not the catch-all CODEOWNERS). Shows the apps team as eligible. Low confidence since only 2 files. **PR #4774** (28 files, by denik, direct engine grants): ``` ## Suggested reviewers - @andrewnester -- recent work in `bundle/direct/dresources/`, `acceptance/bundle/invariant/` - @shreyas-goenka -- recent work in `bundle/direct/dresources/` Confidence: high ## Eligible reviewers @anton-107, @pietern, @simonfaltum ``` Correctly identifies the two main bundle/direct contributors. High confidence with clear score separation.
1 parent 8fb16e5 commit a2e290d

File tree

2 files changed

+332
-0
lines changed

2 files changed

+332
-0
lines changed
Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
name: suggest-reviewers
2+
3+
on:
4+
pull_request:
5+
types: [opened, synchronize, ready_for_review]
6+
7+
permissions:
8+
contents: read
9+
pull-requests: write
10+
11+
jobs:
12+
suggest-reviewers:
13+
runs-on:
14+
group: databricks-deco-testing-runner-group
15+
labels: ubuntu-latest-deco
16+
if: ${{ !github.event.pull_request.draft && !github.event.pull_request.head.repo.fork }}
17+
18+
steps:
19+
- name: Checkout repository
20+
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
21+
with:
22+
fetch-depth: 0
23+
24+
- name: Install uv
25+
uses: astral-sh/setup-uv@eac588ad8def6316056a12d4907a9d4d84ff7a3b # v7.3.0
26+
with:
27+
version: "0.6.5"
28+
29+
- name: Suggest reviewers
30+
env:
31+
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
32+
GITHUB_REPOSITORY: ${{ github.repository }}
33+
PR_NUMBER: ${{ github.event.pull_request.number }}
34+
PR_AUTHOR: ${{ github.event.pull_request.user.login }}
35+
run: uv run tools/suggest_reviewers.py

tools/suggest_reviewers.py

Lines changed: 297 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,297 @@
1+
#!/usr/bin/env python3
2+
# /// script
3+
# requires-python = ">=3.12"
4+
# ///
5+
6+
import fnmatch
7+
import os
8+
import subprocess
9+
import sys
10+
from datetime import datetime, timezone
11+
from pathlib import Path
12+
13+
MENTION_REVIEWERS = True
14+
CODEOWNERS_LINK = "[CODEOWNERS](.github/CODEOWNERS)"
15+
MARKER = "<!-- REVIEWER_SUGGESTION -->"
16+
_login_cache: dict[str, str | None] = {}
17+
18+
19+
def classify_file(path: str) -> float:
20+
p = Path(path)
21+
if p.name.startswith("out.") or p.name == "output.txt":
22+
return 0.0
23+
if path.startswith(("acceptance/", "integration/")):
24+
return 0.2
25+
if path.endswith("_test.go"):
26+
return 0.3
27+
return 1.0 if path.endswith(".go") else 0.5
28+
29+
30+
def get_changed_files(pr_number: str) -> list[str]:
31+
r = subprocess.run(
32+
["gh", "pr", "diff", "--name-only", pr_number],
33+
capture_output=True,
34+
encoding="utf-8",
35+
)
36+
if r.returncode != 0:
37+
print(f"gh pr diff failed: {r.stderr.strip()}", file=sys.stderr)
38+
sys.exit(1)
39+
return [f.strip() for f in r.stdout.splitlines() if f.strip()]
40+
41+
42+
def git_log(path: str) -> list[tuple[str, str, datetime]]:
43+
r = subprocess.run(
44+
["git", "log", "-50", "--no-merges", "--since=12 months ago", "--format=%H|%an|%aI", "--", path],
45+
capture_output=True,
46+
encoding="utf-8",
47+
)
48+
if r.returncode != 0:
49+
return []
50+
entries = []
51+
for line in r.stdout.splitlines():
52+
parts = line.strip().split("|", 2)
53+
if len(parts) != 3:
54+
continue
55+
try:
56+
entries.append((parts[0], parts[1], datetime.fromisoformat(parts[2])))
57+
except ValueError:
58+
continue
59+
return entries
60+
61+
62+
def resolve_login(repo: str, sha: str, author_name: str) -> str | None:
63+
if author_name in _login_cache:
64+
return _login_cache[author_name]
65+
r = subprocess.run(
66+
["gh", "api", f"repos/{repo}/commits/{sha}", "--jq", ".author.login"],
67+
capture_output=True,
68+
encoding="utf-8",
69+
)
70+
login = r.stdout.strip() or None if r.returncode == 0 else None
71+
_login_cache[author_name] = login
72+
return login
73+
74+
75+
def _codeowners_match(pattern: str, filepath: str) -> bool:
76+
if pattern.startswith("/"):
77+
pattern = pattern[1:]
78+
if pattern.endswith("/"):
79+
return filepath.startswith(pattern)
80+
return fnmatch.fnmatch(filepath, pattern) or filepath == pattern
81+
return fnmatch.fnmatch(filepath, pattern) or fnmatch.fnmatch(Path(filepath).name, pattern)
82+
83+
84+
def parse_codeowners(changed_files: list[str]) -> list[str]:
85+
path = Path(".github/CODEOWNERS")
86+
if not path.exists():
87+
return []
88+
rules: list[tuple[str, list[str]]] = []
89+
for line in path.read_text().splitlines():
90+
line = line.strip()
91+
if not line or line.startswith("#"):
92+
continue
93+
parts = line.split()
94+
owners = [p for p in parts[1:] if p.startswith("@")]
95+
if len(parts) >= 2 and owners:
96+
rules.append((parts[0], owners))
97+
98+
all_owners: set[str] = set()
99+
for filepath in changed_files:
100+
matched = []
101+
for pattern, owners in rules:
102+
if _codeowners_match(pattern, filepath):
103+
matched = owners
104+
all_owners.update(matched)
105+
return sorted(all_owners)
106+
107+
108+
def score_contributors(
109+
files: list[str], pr_author: str, now: datetime, repo: str
110+
) -> tuple[dict[str, float], dict[str, dict[str, float]], int]:
111+
scores: dict[str, float] = {}
112+
dir_scores: dict[str, dict[str, float]] = {}
113+
scored_count = 0
114+
author_login = pr_author.lower()
115+
116+
for filepath in files:
117+
weight = classify_file(filepath)
118+
if weight == 0.0:
119+
continue
120+
history = git_log(filepath)
121+
if not history:
122+
parent = str(Path(filepath).parent)
123+
if parent and parent != ".":
124+
history = git_log(parent)
125+
if not history:
126+
continue
127+
128+
top_dir = str(Path(filepath).parent) or "."
129+
file_contributed = False
130+
for sha, name, commit_date in history:
131+
if name.endswith("[bot]"):
132+
continue
133+
login = resolve_login(repo, sha, name)
134+
if not login or login.lower() == author_login:
135+
continue
136+
days_ago = max(0, (now - commit_date).total_seconds() / 86400)
137+
s = weight * (0.5 ** (days_ago / 150))
138+
scores[login] = scores.get(login, 0) + s
139+
dir_scores.setdefault(login, {})
140+
dir_scores[login][top_dir] = dir_scores[login].get(top_dir, 0) + s
141+
file_contributed = True
142+
if file_contributed:
143+
scored_count += 1
144+
return scores, dir_scores, scored_count
145+
146+
147+
def top_dirs(ds: dict[str, float], n: int = 3) -> list[str]:
148+
return [d for d, _ in sorted(ds.items(), key=lambda x: -x[1])[:n]]
149+
150+
151+
def fmt_reviewer(login: str, dirs: list[str]) -> str:
152+
mention = f"@{login}" if MENTION_REVIEWERS else login
153+
return f"- {mention} -- recent work in {', '.join(f'`{d}/`' for d in dirs)}"
154+
155+
156+
def select_reviewers(ss: list[tuple[str, float]]) -> list[tuple[str, float]]:
157+
if not ss:
158+
return []
159+
out = [ss[0]]
160+
if len(ss) >= 2 and ss[0][1] < 1.5 * ss[1][1]:
161+
out.append(ss[1])
162+
if len(ss) >= 3 and ss[1][1] < 1.5 * ss[2][1]:
163+
out.append(ss[2])
164+
return out
165+
166+
167+
def compute_confidence(ss: list[tuple[str, float]], scored_count: int) -> str:
168+
if scored_count < 3 or len(ss) < 2:
169+
return "low"
170+
if len(ss) >= 3 and ss[0][1] > 2 * ss[2][1]:
171+
return "high"
172+
if len(ss) >= 3 and ss[0][1] > 1.5 * ss[2][1]:
173+
return "medium"
174+
return "low"
175+
176+
177+
def fmt_eligible(owners: list[str]) -> str:
178+
if MENTION_REVIEWERS:
179+
return ", ".join(owners)
180+
return ", ".join(o.lstrip("@") for o in owners)
181+
182+
183+
def build_comment(
184+
sorted_scores: list[tuple[str, float]],
185+
dir_scores: dict[str, dict[str, float]],
186+
total_files: int,
187+
scored_count: int,
188+
eligible_owners: list[str],
189+
pr_author: str,
190+
) -> str:
191+
reviewers = select_reviewers(sorted_scores)
192+
suggested_logins = {login.lower() for login, _ in reviewers}
193+
eligible = [
194+
o
195+
for o in eligible_owners
196+
if o.lstrip("@").lower() != pr_author.lower() and o.lstrip("@").lower() not in suggested_logins
197+
]
198+
199+
lines = [MARKER]
200+
if reviewers:
201+
lines += [
202+
"## Suggested reviewers",
203+
"",
204+
"Based on git history of the changed files, these people are best suited to review:",
205+
"",
206+
]
207+
for login, _ in reviewers:
208+
lines.append(fmt_reviewer(login, top_dirs(dir_scores.get(login, {}))))
209+
lines += ["", f"Confidence: {compute_confidence(sorted_scores, scored_count)}"]
210+
if eligible:
211+
lines += [
212+
"",
213+
"## Eligible reviewers",
214+
"",
215+
"Based on CODEOWNERS, these people or teams could also review:",
216+
"",
217+
fmt_eligible(eligible),
218+
]
219+
elif eligible:
220+
lines += [
221+
"## Eligible reviewers",
222+
"",
223+
"Could not determine reviewers from git history. Based on CODEOWNERS, these people or teams could review:",
224+
"",
225+
fmt_eligible(eligible),
226+
]
227+
else:
228+
lines += [
229+
"## Suggested reviewers",
230+
"",
231+
f"Could not determine reviewers from git history. Please pick from {CODEOWNERS_LINK}.",
232+
]
233+
234+
lines += [
235+
"",
236+
f"<sub>Suggestions based on git history of {total_files} changed files "
237+
f"({scored_count} scored). "
238+
f"See {CODEOWNERS_LINK} for path-specific ownership rules.</sub>",
239+
]
240+
return "\n".join(lines) + "\n"
241+
242+
243+
def find_existing_comment(repo: str, pr_number: str) -> str | None:
244+
r = subprocess.run(
245+
[
246+
"gh",
247+
"api",
248+
f"repos/{repo}/issues/{pr_number}/comments",
249+
"--paginate",
250+
"--jq",
251+
f'.[] | select(.body | contains("{MARKER}")) | .id',
252+
],
253+
capture_output=True,
254+
encoding="utf-8",
255+
)
256+
if r.returncode != 0:
257+
print(f"gh api comments failed: {r.stderr.strip()}", file=sys.stderr)
258+
sys.exit(1)
259+
for cid in r.stdout.splitlines():
260+
cid = cid.strip()
261+
if cid:
262+
return cid
263+
return None
264+
265+
266+
def main():
267+
repo = os.environ["GITHUB_REPOSITORY"]
268+
pr_number = os.environ["PR_NUMBER"]
269+
pr_author = os.environ["PR_AUTHOR"]
270+
271+
files = get_changed_files(pr_number)
272+
if not files:
273+
print("No changed files found.")
274+
return
275+
276+
now = datetime.now(timezone.utc)
277+
scores, dir_scores, scored_count = score_contributors(files, pr_author, now, repo)
278+
sorted_scores = sorted(scores.items(), key=lambda x: -x[1])
279+
eligible = parse_codeowners(files)
280+
comment = build_comment(sorted_scores, dir_scores, len(files), scored_count, eligible, pr_author)
281+
282+
print(comment)
283+
existing_id = find_existing_comment(repo, pr_number)
284+
if existing_id:
285+
subprocess.run(
286+
["gh", "api", f"repos/{repo}/issues/comments/{existing_id}", "-X", "PATCH", "-f", f"body={comment}"],
287+
check=True,
288+
)
289+
else:
290+
subprocess.run(
291+
["gh", "pr", "comment", pr_number, "--body", comment],
292+
check=True,
293+
)
294+
295+
296+
if __name__ == "__main__":
297+
main()

0 commit comments

Comments
 (0)