feat: make telemetry off by default#4281
Conversation
|
@PastelStorm Would you please review this? Thanks |
|
@claytonlin1110 There you go :) Code Review:
|
|
@PastelStorm Feel free to take a review the updates. Thanks. |
FindingsHigh
Medium
Low
Potentially missing tests
|
|
@PastelStorm Thanks for your feedback. Just updated. |
|
@claytonlin1110 most likely the last batch of reviews as PR is pretty solid already, just need to iron our a few inconsistencies: FindingsMedium: telemetry tests are no longer hermetic because package import now runs telemetry before per-test mocks existFiles: The new startup boundary executes telemetry on from .partition.utils.config import env_config
from .telemetry import init_telemetry
# init env_config
env_config
# Explicit startup boundary for telemetry (opt-in, best-effort)
init_telemetry()But @pytest.fixture
def telemetry_mocks(monkeypatch):
"""Clear telemetry env and patch requests.get + subprocess.check_output.
Returns (mock_get, mock_subprocess). Use for both send and no-send tests so
we can assert network and subprocess side effects.
"""
monkeypatch.delenv("UNSTRUCTURED_TELEMETRY_ENABLED", raising=False)
monkeypatch.delenv("SCARF_NO_ANALYTICS", raising=False)
monkeypatch.delenv("DO_NOT_TRACK", raising=False)
mock_get = Mock()
mock_subprocess = Mock()
monkeypatch.setattr("unstructured.utils.requests.get", mock_get)
monkeypatch.setattr("unstructured.utils.subprocess.check_output", mock_subprocess)If the outer environment already has Suggested fix: avoid importing Medium: the new opt-in import smoke test is partly dead and can pass without exercising the code path it claims to coverFile: The opt-in test inherits the parent environment, sets def it_import_unstructured_succeeds_with_opt_out(self):
...
def it_import_unstructured_succeeds_with_opt_in(self):
"""Import path with opt-in env does not crash (integration-style)."""
project_root = Path(__file__).resolve().parent.parent
env = {
**os.environ,
"UNSTRUCTURED_TELEMETRY_ENABLED": "true",
"DO_NOT_TRACK": "",
"PYTHONPATH": str(project_root),
}
result = subprocess.run(
[sys.executable, "-c", "import unstructured; print('ok')"],So on any machine with Suggested fix: scrub both opt-out vars in the subprocess env and assert a local observable signal rather than just Medium: opt-out semantics changed in a backward-incompatible way, but the changelog says they were “unchanged”Files: The branch now defines both def _telemetry_opt_out() -> bool:
"""True if telemetry should be disabled via env.
DO_NOT_TRACK and SCARF_NO_ANALYTICS both follow the same rule: any non-empty
value (after strip) opts out. See README/CHANGELOG for the public contract.
"""
return bool((os.getenv("DO_NOT_TRACK") or "").strip()) or bool(
(os.getenv("SCARF_NO_ANALYTICS") or "").strip()
)
...
def scarf_analytics():
...
if _telemetry_opt_out() or not _telemetry_opt_in():
returnAnd the docs now match that: ## 0.22.0
### Enhancements
- **Telemetry off by default**: Analytics/telemetry is now **disabled by default**. Upgrading will stop sending the library-load ping unless you opt in. To restore the previous behavior, set `UNSTRUCTURED_TELEMETRY_ENABLED=true` before importing `unstructured`. Opt-out via `DO_NOT_TRACK` or `SCARF_NO_ANALYTICS` (any non-empty value) is unchanged and takes precedence.## :chart_with_upwards_trend: Analytics
Telemetry is **off by default**. To opt in, set `UNSTRUCTURED_TELEMETRY_ENABLED=true` (or `=1`) before importing `unstructured`. To opt out, set `DO_NOT_TRACK` or `SCARF_NO_ANALYTICS` to any non-empty value (e.g. `true`, `1`, `yes`); opt-out takes precedence. See our [Privacy Policy](https://unstructured.io/privacy-policy).However, relative to Suggested fix: either preserve the old exact-boolean semantics, or document this as an intentional breaking change rather than “unchanged” and add tests for Medium: the repo’s outbound-connectivity harness is now stale and no longer actually validates telemetry egressFiles: The branch moved telemetry from DO_NOT_TRACK=""
HF_HUB_OFFLINE=""
REMOVE_CACHE=0
case "$SCENARIO" in
baseline) ;;
missing-models) REMOVE_CACHE=1 ;;
analytics-online-only) HF_HUB_OFFLINE=1 ;;
offline)
DO_NOT_TRACK=true
HF_HUB_OFFLINE=1
;;
...
docker exec -i -e PYTHONUNBUFFERED=1 "$CID" python - <<PY |& tee "${PY_LOG_DIR}/${SCENARIO}.log"
import logging
from unstructured.partition.auto import partition
from unstructured.logger import logger # force analytics ping if not DO_NOT_TRACKThat script can now pass its “analytics” scenarios without ever covering telemetry traffic at all, so it gives false confidence about outbound behavior. Suggested fix: update the harness to set Low: the new GPU-probe tests still miss the explicit
|
|
@PastelStorm I updated, but test_dockerfile failed. |
We had an issue with pulling our Grype image. We disabled the build fail step temporarily. Thank you for yet another contribution! |
Summary
Closes #3940
Changes
scarf_analytics()sends the ping only whenUNSTRUCTURED_TELEMETRY_ENABLED=true(or1). Opt-out env varsDO_NOT_TRACKandSCARF_NO_ANALYTICSare still respected and take precedence.DescribeScarfAnalyticstests for default off, opt-in (true/1), and opt-out overriding opt-in.