Skip to content

feat: make telemetry off by default#4281

Merged
PastelStorm merged 12 commits intoUnstructured-IO:mainfrom
claytonlin1110:feat/telemetry-off-by-default
Mar 16, 2026
Merged

feat: make telemetry off by default#4281
PastelStorm merged 12 commits intoUnstructured-IO:mainfrom
claytonlin1110:feat/telemetry-off-by-default

Conversation

@claytonlin1110
Copy link
Copy Markdown
Contributor

@claytonlin1110 claytonlin1110 commented Mar 9, 2026

Summary

Closes #3940

Changes

  • Behavior: scarf_analytics() sends the ping only when UNSTRUCTURED_TELEMETRY_ENABLED=true (or 1). Opt-out env vars DO_NOT_TRACK and SCARF_NO_ANALYTICS are still respected and take precedence.
  • Docs: README Analytics section and logger comment updated to describe the new default and opt-in/opt-out.
  • Tests: New DescribeScarfAnalytics tests for default off, opt-in (true/1), and opt-out overriding opt-in.
  • Changelog: Entry under 0.21.13.

@claytonlin1110
Copy link
Copy Markdown
Contributor Author

@PastelStorm Would you please review this? Thanks

@PastelStorm
Copy link
Copy Markdown
Contributor

PastelStorm commented Mar 10, 2026

@claytonlin1110 There you go :)

Code Review: claytonlin1110:feat/telemetry-off-by-default

This PR changes telemetry from opt-out to opt-in (off by default), adds a new UNSTRUCTURED_TELEMETRY_ENABLED env var, updates docs and version. Below are the consolidated findings from all review agents, deduplicated and ranked by severity.


Critical Issues

1. Case-sensitivity asymmetry in opt-out vs opt-in logic

The opt-in check is case-insensitive, but opt-out is case-sensitive -- a dangerous inconsistency:

opt_out = os.getenv("SCARF_NO_ANALYTICS") == "true" or os.getenv("DO_NOT_TRACK") == "true"
opt_in = os.getenv("UNSTRUCTURED_TELEMETRY_ENABLED", "").lower() in ("true", "1")
  • UNSTRUCTURED_TELEMETRY_ENABLED=TRUE will opt in (.lower() normalizes)
  • DO_NOT_TRACK=TRUE will NOT opt out (exact == "true" fails)
  • DO_NOT_TRACK=1 will NOT opt out

This violates the Console Do Not Track standard which specifies any non-empty value should be honored. The docstring claims "opt-out env vars are always respected and take precedence" but the code doesn't deliver on that promise for common case variations. This is a blocker.

2. Wasted nvidia-smi subprocess on every import (ordering bug)

The function shells out to nvidia-smi and computes python_version before checking whether telemetry is even enabled:

def scarf_analytics():
    try:
        subprocess.check_output("nvidia-smi")  # runs FIRST, every import
        gpu_present = True
    except Exception:
        gpu_present = False

    python_version = ".".join(platform.python_version().split(".")[:2])

    opt_out = ...
    opt_in = ...
    if opt_out or not opt_in:  # checked AFTER subprocess
        return

Since telemetry is now off by default, the vast majority of users pay the cost of a subprocess spawn on every import unstructured for zero benefit. This is a regression introduced by this PR -- the old code had the opt-out inside the try block so the subprocess cost was at least paired with the telemetry send. The env var check must be moved to the very top.

3. subprocess.check_output("nvidia-smi") is never mocked in tests

Every test actually shells out to nvidia-smi. This is slow, non-deterministic (different gpu= values on different machines), and noisy on stderr. The tests pass either way by accident, not by design.


Major Issues

4. Massive code duplication in dev/non-dev URL branches

The if "dev" in __version__ / else branches are 12 lines of identical code differing only in &dev=true vs &dev=false. This duplication existed before but the PR was already restructuring the function and had a natural opportunity to collapse it:

dev = "dev" in __version__
requests.get(
    "https://packages.unstructured.io/python-telemetry",
    params={
        "version": __version__,
        "platform": platform.system(),
        "python": python_version,
        "arch": platform.machine(),
        "gpu": str(gpu_present),
        "dev": str(dev).lower(),
    },
    timeout=10,
)

Using params= also fixes the pre-existing URL bug (issue #6) and the fragile string concatenation.

5. Mixing monkeypatch and unittest.mock.patch in tests

The tests use pytest's monkeypatch for env vars but unittest.mock.patch (context manager) for requests.get. This is inconsistent within the same class and with the rest of the test file. Should use one approach throughout -- monkeypatch.setattr is more Pythonic in a pytest codebase.

6. Test isolation gaps -- ambient env vars can mask test intent

it_does_not_send_when_do_not_track_is_set_even_if_opt_in doesn't clear SCARF_NO_ANALYTICS, and vice versa. If a developer's shell has either variable set, the test passes for the wrong reason.

7. dev vs non-dev URL path is never tested

The function has two distinct code paths based on "dev" in __version__ but no test mocks __version__ or asserts which dev= value appears in the URL. Half the code paths are untested.

8. Patch version bump for a behavioral breaking change

Going from 0.21.12 to 0.21.13 (patch) is inappropriate. Changing telemetry from on-by-default to off-by-default is a behavioral change that silently stops analytics for all existing users on upgrade. This should arguably be a minor bump (0.22.0) and deserves a migration note.


Minor Issues

9. Pre-existing bug preserved: missing = in URL query parameter

+ "&python"    # produces &python3.11 instead of &python=3.11
+ python_version

This is inherited from main but the PR is modifying this exact function. Boy scout rule applies.

10. Import-time HTTP request with 10-second timeout

When telemetry IS enabled, requests.get(..., timeout=10) runs at import time. In air-gapped environments or slow networks, import unstructured blocks for up to 10 seconds. Should be fire-and-forget in a daemon thread or use a much shorter timeout.

11. except Exception: pass is too broad

The catch-all silently swallows all exceptions including MemoryError, SystemError, etc. Should narrow to except (requests.RequestException, OSError) or at minimum log at debug level.

12. Missing test coverage for exception handling

No test verifies that the function handles requests.get exceptions gracefully.

13. Missing test coverage for explicit non-opt-in values

UNSTRUCTURED_TELEMETRY_ENABLED=false, =0, =yes, =on are not tested to verify they suppress telemetry.

14. URL assertions in tests are too shallow

Only checking "python-telemetry" in call_url and "version=" in call_url -- doesn't verify actual values, timeout parameter, or URL structure.


Nits

15. Redundant docstring + inline comment -- both say the same thing about "off by default, opt-in via env var."

16. subprocess.check_output("nvidia-smi") should use list form ["nvidia-smi"] and redirect stderr=subprocess.DEVNULL to avoid console noise.

17. Overly broad except Exception for nvidia-smi -- should narrow to except (FileNotFoundError, subprocess.CalledProcessError).

18. it_accepts_opt_in_value_1 should be a parametrized test covering "true", "True", "TRUE", "1" rather than a standalone method.


Recommended Refactored Implementation

All of issues 1-4, 6, 9, 11, 15-17 could be resolved with this rewrite:

def scarf_analytics():
    """Send a lightweight analytics ping. Off by default.

    Set UNSTRUCTURED_TELEMETRY_ENABLED=true to opt in.
    Opt-out env vars (DO_NOT_TRACK, SCARF_NO_ANALYTICS) are always respected and take precedence.
    """
    opt_out = (
        os.getenv("SCARF_NO_ANALYTICS", "").lower() in ("true", "1")
        or bool(os.getenv("DO_NOT_TRACK", ""))
    )
    opt_in = os.getenv("UNSTRUCTURED_TELEMETRY_ENABLED", "").lower() in ("true", "1")
    if opt_out or not opt_in:
        return

    try:
        subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)
        gpu_present = True
    except (FileNotFoundError, subprocess.CalledProcessError):
        gpu_present = False

    python_version = ".".join(platform.python_version().split(".")[:2])

    try:
        requests.get(
            "https://packages.unstructured.io/python-telemetry",
            params={
                "version": __version__,
                "platform": platform.system(),
                "python": python_version,
                "arch": platform.machine(),
                "gpu": str(gpu_present),
                "dev": str("dev" in __version__).lower(),
            },
            timeout=10,
        )
    except Exception:
        pass

@claytonlin1110
Copy link
Copy Markdown
Contributor Author

@PastelStorm Feel free to take a review the updates. Thanks.

@PastelStorm
Copy link
Copy Markdown
Contributor

@claytonlin1110

Findings

High

  • SCARF_NO_ANALYTICS behavior is inconsistent with docs and with DO_NOT_TRACK semantics
    • Files: unstructured/utils.py, CHANGELOG.md, README.md, test_unstructured/test_utils.py
    • CHANGELOG.md says SCARF_NO_ANALYTICS opt-out works for “any non-empty value”, but implementation only treats "true"/"1" as opt-out.
    • This can leak telemetry unexpectedly for users who set values like yes/on expecting opt-out.
    • Fix: unify env parsing rules in one helper and align docs/tests exactly to that contract.

Medium

  • Telemetry can raise on import for unhandled nvidia-smi execution errors

    • Files: unstructured/utils.py, unstructured/logger.py
    • scarf_analytics() now catches only FileNotFoundError and subprocess.CalledProcessError around check_output(["nvidia-smi"], ...). Other OSError-type failures (e.g., permission/execution issues) can bubble up.
    • Because scarf_analytics() is called at import time in logger.py, this can break imports in opt-in environments.
    • Fix: broaden that probe exception handling (at least include OSError) and keep telemetry best-effort/non-fatal.
  • Test for request-exception handling is weak and can pass without exercising the intended path

    • File: test_unstructured/test_utils.py
    • it_handles_requests_exception_gracefully only asserts “does not raise”; it does not assert requests.get was actually called. An early-return regression would still pass this test.
    • Fix: assert requests.get call count/arguments in that test.
  • Import-time side effect architecture remains (logging module triggers telemetry network path)

    • Files: unstructured/logger.py, unstructured/utils.py
    • Telemetry execution is still tied to import-time behavior of logger initialization, which is brittle and hard to reason about.
    • Fix: move telemetry into a dedicated initializer/module and call it from an explicit startup boundary.

Low

  • New telemetry tests contain substantial duplication and are expensive to maintain

    • File: test_unstructured/test_utils.py
    • Repeated env setup and mock wiring across many cases; several can be parametrized.
    • Fix: add shared fixture/helper + table-driven tests.
  • Missing side-effect assertions in “no-send” tests

    • File: test_unstructured/test_utils.py
    • Many negative-path tests only assert requests.get was not called; they don’t assert subprocess.check_output wasn’t called.
    • Fix: assert both network and subprocess side effects are absent when telemetry is disabled/opted-out.

Potentially missing tests

  • Explicit coverage for SCARF_NO_ANALYTICS values beyond "true" (e.g., "1", "yes", whitespace/case variants), depending on intended contract.
  • Coverage for unexpected nvidia-smi failures (OSError/PermissionError) to guarantee import-safe behavior.
  • A deterministic integration-style check for import path behavior via unstructured/logger.py with opt-in/opt-out combinations.

@claytonlin1110
Copy link
Copy Markdown
Contributor Author

@PastelStorm Thanks for your feedback. Just updated.

@PastelStorm
Copy link
Copy Markdown
Contributor

@claytonlin1110 most likely the last batch of reviews as PR is pretty solid already, just need to iron our a few inconsistencies:

Findings

Medium: telemetry tests are no longer hermetic because package import now runs telemetry before per-test mocks exist

Files: test_unstructured/test_utils.py, unstructured/__init__.py, unstructured/telemetry.py

The new startup boundary executes telemetry on import unstructured:

from .partition.utils.config import env_config
from .telemetry import init_telemetry

# init env_config
env_config

# Explicit startup boundary for telemetry (opt-in, best-effort)
init_telemetry()

But test_unstructured/test_utils.py still imports from unstructured import utils at module import time, while the telemetry mocks are only installed inside the telemetry_mocks fixture:

@pytest.fixture
def telemetry_mocks(monkeypatch):
    """Clear telemetry env and patch requests.get + subprocess.check_output.

    Returns (mock_get, mock_subprocess). Use for both send and no-send tests so
    we can assert network and subprocess side effects.
    """
    monkeypatch.delenv("UNSTRUCTURED_TELEMETRY_ENABLED", raising=False)
    monkeypatch.delenv("SCARF_NO_ANALYTICS", raising=False)
    monkeypatch.delenv("DO_NOT_TRACK", raising=False)
    mock_get = Mock()
    mock_subprocess = Mock()
    monkeypatch.setattr("unstructured.utils.requests.get", mock_get)
    monkeypatch.setattr("unstructured.utils.subprocess.check_output", mock_subprocess)

If the outer environment already has UNSTRUCTURED_TELEMETRY_ENABLED=true, test collection can run the real subprocess/network path before any fixture patches it. That makes this test module environment-dependent and can leak real side effects during collection.

Suggested fix: avoid importing unstructured at module scope in telemetry tests. Patch before import and use importlib.reload, or move startup-boundary tests into a dedicated module that clears env and controls import order explicitly.

Medium: the new opt-in import smoke test is partly dead and can pass without exercising the code path it claims to cover

File: test_unstructured/test_utils.py

The opt-in test inherits the parent environment, sets UNSTRUCTURED_TELEMETRY_ENABLED=true, but does not clear SCARF_NO_ANALYTICS, and only asserts import unstructured prints ok:

def it_import_unstructured_succeeds_with_opt_out(self):
    ...
def it_import_unstructured_succeeds_with_opt_in(self):
    """Import path with opt-in env does not crash (integration-style)."""
    project_root = Path(__file__).resolve().parent.parent
    env = {
        **os.environ,
        "UNSTRUCTURED_TELEMETRY_ENABLED": "true",
        "DO_NOT_TRACK": "",
        "PYTHONPATH": str(project_root),
    }
    result = subprocess.run(
        [sys.executable, "-c", "import unstructured; print('ok')"],

So on any machine with SCARF_NO_ANALYTICS exported, the subprocess takes the opt-out path and the test still passes. Even in a clean environment, the test still does not prove that init_telemetry() or scarf_analytics() actually ran; it only proves import succeeded.

Suggested fix: scrub both opt-out vars in the subprocess env and assert a local observable signal rather than just "ok". Better yet, patch the call-site symbol and reload unstructured to assert the import-time boundary executed exactly once.

Medium: opt-out semantics changed in a backward-incompatible way, but the changelog says they were “unchanged”

Files: unstructured/utils.py, CHANGELOG.md, README.md

The branch now defines both DO_NOT_TRACK and SCARF_NO_ANALYTICS as “any non-empty value opts out”:

def _telemetry_opt_out() -> bool:
    """True if telemetry should be disabled via env.

    DO_NOT_TRACK and SCARF_NO_ANALYTICS both follow the same rule: any non-empty
    value (after strip) opts out. See README/CHANGELOG for the public contract.
    """
    return bool((os.getenv("DO_NOT_TRACK") or "").strip()) or bool(
        (os.getenv("SCARF_NO_ANALYTICS") or "").strip()
    )
...
def scarf_analytics():
    ...
    if _telemetry_opt_out() or not _telemetry_opt_in():
        return

And the docs now match that:

## 0.22.0

### Enhancements
- **Telemetry off by default**: Analytics/telemetry is now **disabled by default**. Upgrading will stop sending the library-load ping unless you opt in. To restore the previous behavior, set `UNSTRUCTURED_TELEMETRY_ENABLED=true` before importing `unstructured`. Opt-out via `DO_NOT_TRACK` or `SCARF_NO_ANALYTICS` (any non-empty value) is unchanged and takes precedence.
## :chart_with_upwards_trend: Analytics

Telemetry is **off by default**. To opt in, set `UNSTRUCTURED_TELEMETRY_ENABLED=true` (or `=1`) before importing `unstructured`. To opt out, set `DO_NOT_TRACK` or `SCARF_NO_ANALYTICS` to any non-empty value (e.g. `true`, `1`, `yes`); opt-out takes precedence. See our [Privacy Policy](https://unstructured.io/privacy-policy).

However, relative to main, this is not “unchanged”: values like DO_NOT_TRACK=false or SCARF_NO_ANALYTICS=0 will now block telemetry if users explicitly opt in. That is a real config compatibility change and will be surprising in environments that template booleans as 0/false.

Suggested fix: either preserve the old exact-boolean semantics, or document this as an intentional breaking change rather than “unchanged” and add tests for false/0 combinations so the contract is explicit.

Medium: the repo’s outbound-connectivity harness is now stale and no longer actually validates telemetry egress

Files: scripts/image/test-outbound-connectivity.sh, unstructured/__init__.py

The branch moved telemetry from logger.py to package import and requires explicit opt-in, but the connectivity script still assumes importing logger forces the analytics ping and never exports UNSTRUCTURED_TELEMETRY_ENABLED:

DO_NOT_TRACK=""
HF_HUB_OFFLINE=""
REMOVE_CACHE=0
case "$SCENARIO" in
baseline) ;;
missing-models) REMOVE_CACHE=1 ;;
analytics-online-only) HF_HUB_OFFLINE=1 ;;
offline)
  DO_NOT_TRACK=true
  HF_HUB_OFFLINE=1
  ;;
...
docker exec -i -e PYTHONUNBUFFERED=1 "$CID" python - <<PY |& tee "${PY_LOG_DIR}/${SCENARIO}.log"
import logging
from unstructured.partition.auto import partition
from unstructured.logger import logger  # force analytics ping if not DO_NOT_TRACK

That script can now pass its “analytics” scenarios without ever covering telemetry traffic at all, so it gives false confidence about outbound behavior.

Suggested fix: update the harness to set UNSTRUCTURED_TELEMETRY_ENABLED=1 in telemetry-positive scenarios and remove the stale logger-based trigger assumption.

Low: the new GPU-probe tests still miss the explicit CalledProcessError branch

Files: test_unstructured/test_utils.py, unstructured/utils.py

Production now catches subprocess.CalledProcessError, but the tests only cover OSError and PermissionError:

    try:
        subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)
        gpu_present = True
    except (OSError, subprocess.CalledProcessError):
        gpu_present = False
@pytest.mark.parametrize(
    "exc",
    [OSError(), PermissionError("nvidia-smi denied")],
    ids=["OSError", "PermissionError"],
)
def it_handles_nvidia_smi_failure_gracefully(self, monkeypatch, telemetry_mocks, exc):

So the newly added non-zero-exit-path branch is still untested.

Suggested fix: add a subprocess.CalledProcessError(returncode=1, cmd=["nvidia-smi"]) case and assert gpu == "False".

@claytonlin1110
Copy link
Copy Markdown
Contributor Author

@PastelStorm I updated, but test_dockerfile failed.
Would you please check it ? i don't think it is due to this code issue.

@PastelStorm PastelStorm enabled auto-merge March 16, 2026 16:31
Copy link
Copy Markdown
Contributor

@PastelStorm PastelStorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@PastelStorm PastelStorm added this pull request to the merge queue Mar 16, 2026
Merged via the queue into Unstructured-IO:main with commit cb16853 Mar 16, 2026
53 checks passed
@PastelStorm
Copy link
Copy Markdown
Contributor

@PastelStorm I updated, but test_dockerfile failed. Would you please check it ? i don't think it is due to this code issue.

We had an issue with pulling our Grype image. We disabled the build fail step temporarily.

Thank you for yet another contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat/telemetry-off-default

3 participants