Don't show trap links to recognized bots#12413
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to reduce “trap link” exposure for known/recognized crawlers by gating the hidden show_page_status=1 link behind a recognized-bot check, leveraging the request-scoped context vars infrastructure.
Changes:
- Conditionally render the hidden “Page Status” trap link only when
not is_recognized_bot()in the global nav header. - Expose a new
is_recognized_bot()helper inopenlibrary.plugins.openlibrary.codefor use in templates.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| openlibrary/templates/lib/nav_head.html | Suppresses the hidden trap link for recognized bots. |
| openlibrary/plugins/openlibrary/code.py | Adds a @public is_recognized_bot() helper backed by req_context. |
Comments suppressed due to low confidence (1)
openlibrary/plugins/openlibrary/code.py:1241
- The comment says this reads a ContextVar set by
set_context_from_fastapi(), butset_context_from_fastapi()currently never setsis_recognized_bot(it only computesis_bot). As a result,is_recognized_bot()will always returnFalsein FastAPI contexts. Either updateset_context_from_fastapi()to compute/populateis_recognized_botas well, or adjust this comment so it doesn’t imply FastAPI support.
# Reads from the request-scoped ContextVar set by set_context_from_legacy_web_py()
# (web.py) or set_context_from_fastapi() — the web.py equivalent of web.ctx.
return req_context.get().is_recognized_bot
| $if not is_recognized_bot(): | ||
| $# detect-missing-i18n-skip-line | ||
| <a | ||
| href="$changequery(dict(show_page_status=1))" | ||
| style="color:transparent;position:absolute;pointer-events:none;" |
There was a problem hiding this comment.
is_recognized_bot() is derived from req_context. For responses computed in memcache async update threads (via caching_prethread() in openlibrary/plugins/openlibrary/home.py), the new thread currently recomputes request context from a dummy web.ctx.env, which makes is_recognized_bot false even for recognized crawlers. That means recognized bots can still receive cached pages containing this trap link. Consider updating caching_prethread() to propagate the original UA / is_recognized_bot (or directly copy RequestContextVars) into the background thread before template rendering.
| $if not is_recognized_bot(): | ||
| $# detect-missing-i18n-skip-line | ||
| <a | ||
| href="$changequery(dict(show_page_status=1))" | ||
| style="color:transparent;position:absolute;pointer-events:none;" | ||
| tabindex="-1" | ||
| aria-hidden="true" | ||
| >Page Status</a> |
There was a problem hiding this comment.
This behavior change (suppressing the trap link for recognized bots) doesn’t appear to be covered by tests. There are existing template-rendering tests (e.g., openlibrary/plugins/openlibrary/tests/test_home.py) that could be extended to assert the "Page Status" link is absent when req_context.is_recognized_bot=True and present otherwise.
Closes #
Technical
Note: Pre-existing bug! We need to update our caching prethread to copy over both
is_botandis_recognized_bot. Currently we have cache leak.Testing
Confirmed locally trap link still shows. Put on prod, saw a drop in identifying bots hitting trap links:
sudo tac /1/var/log/nginx/access.log | obfi_grep_bots -v | grep 'show_page_status=1' | obfi_count_minuteScreenshot
Stakeholders