Conversation

@ByteBaker ByteBaker commented Dec 1, 2025

Summary

  • Extract service graph from compactor into dedicated job module
  • Support INTERNAL spans (span_kind='1') in service graph topology
  • Add O2_SERVICE_GRAPH_EXCLUDE_INTERNAL_SPANS config flag
  • Remove early exit when peer attributes are missing

Changes

Job Separation

  • Create src/job/service_graph.rs with independent lifecycle
  • Remove service graph code from src/job/compactor.rs
  • Update src/job/mod.rs to spawn service_graph job
  • Logs now use [SERVICE_GRAPH::JOB] prefix

INTERNAL Span Support

  • Update SQL to include span_kind IN ('1', '2', '3') (was IN ('2', '3'))
  • Add CASE statements to handle INTERNAL spans:
    • Client: service_name
    • Server: COALESCE(peer_expr, 'internal')
  • Remove early exit when no peer identification attributes found
  • Use 'internal' as fallback for peer identification (see the SQL sketch below)
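
A minimal sketch of the resulting query shape, assuming OTLP span-kind codes ('1' = INTERNAL, '2' = SERVER, '3' = CLIENT); the column names (service_name, span_kind, duration), the stream name, and the aggregation functions are illustrative assumptions, and peer_expr stands in for whatever peer-identification expression the processor builds, so this is not the generated SQL verbatim:

    WITH edges AS (
        SELECT
            CASE
                WHEN span_kind = '1' THEN service_name                     -- INTERNAL: client is the emitting service
                ELSE service_name                                          -- placeholder for the existing CLIENT/SERVER mapping
            END AS client,
            CASE
                WHEN span_kind = '1' THEN COALESCE(peer_expr, 'internal')  -- INTERNAL: fall back to 'internal'
                ELSE peer_expr                                             -- placeholder for the existing CLIENT/SERVER mapping
            END AS server,
            duration
        FROM "default"
        WHERE span_kind IN ('1', '2', '3')                                 -- previously IN ('2', '3')
    )
    SELECT client, server,
           COUNT(*)                                AS request_count,
           approx_percentile_cont(duration, 0.50)  AS p50,
           approx_percentile_cont(duration, 0.95)  AS p95
    FROM edges
    GROUP BY client, server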

Configuration

  • Add O2_SERVICE_GRAPH_EXCLUDE_INTERNAL_SPANS env variable (default: false)
  • Allows users to exclude INTERNAL spans for a traditional, service-mesh-only view
  • Conditional SQL generation based on the flag (see the sketch below)
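
A minimal sketch of how the flag is expected to change the generated filter (illustrative only; the surrounding query and exact code path are not shown in this description):

    -- O2_SERVICE_GRAPH_EXCLUDE_INTERNAL_SPANS=false (default): INTERNAL spans included
    WHERE span_kind IN ('1', '2', '3')

    -- O2_SERVICE_GRAPH_EXCLUDE_INTERNAL_SPANS=true: traditional service-mesh-only view
    WHERE span_kind IN ('2', '3')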

Motivation

The service graph was conceptually unrelated to file compaction despite living in the compactor module. Extracting it provides better separation of concerns.

OpenObserve's internal tracing produces primarily INTERNAL spans without explicit peer.service attributes. The previous implementation:

  • Skipped streams without peer attributes (early exit)
  • Only processed CLIENT/SERVER spans
  • Resulted in 0 edges from OpenObserve's own traces (137K+ INTERNAL spans ignored)

The new implementation enables the service graph to visualize internal operation topology while maintaining backward compatibility.

Test Results

With 10,000 INTERNAL spans ingested:

[ServiceGraph] Stream default/default has no peer identification attributes - will use 'internal' fallback for INTERNAL spans
[ServiceGraph] Query returned 2 pre-aggregated edges from default/default
[ServiceGraph] Wrote 2 SQL-aggregated edge summaries for default

Topology created:

  • compactor → internal (19,998 requests, p50=640µs, p95=6.6ms)
  • ingester → internal (2 requests, p50=12ms)

Related

  • openobserve/o2-enterprise#998 - Add exclude_internal_spans config field

Test Plan

  • Service graph runs as independent job
  • OSS build compiles without enterprise feature
  • Enterprise build compiles successfully
  • INTERNAL spans create edges with 'internal' fallback
  • Config flag controls INTERNAL span inclusion
  • Topology API returns correct graph data
  • No warnings in compilation

@ByteBaker ByteBaker requested a review from oasisk December 1, 2025 12:16

github-actions bot commented Dec 1, 2025

Failed to generate code suggestions for PR

@github-actions github-actions bot added the ☢️ Bug (Something isn't working) label Dec 1, 2025

greptile-apps bot commented Dec 1, 2025

Greptile Overview

Greptile Summary

Fixed an SQL query failure when the peer.service attribute is stored in nested structures instead of flattened columns. The query now uses COALESCE to check multiple locations: the peer_service column, attributes['peer.service'], and attributes_string['peer.service']. Also moved NULL filtering from the CTE's WHERE clause to the final WHERE clause after aggregation, which is more correct since client/server are computed values.

Key Changes:

  • Used COALESCE(peer_service, attributes['peer.service'], attributes_string['peer.service']) to handle multiple storage formats
  • Removed AND peer_service IS NOT NULL from initial WHERE clause
  • Added WHERE client IS NOT NULL AND server IS NOT NULL after aggregation to filter out invalid edges
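
A minimal before/after sketch of the change described above (identifiers such as service_name, duration, and the stream name are assumptions, not copied from the diff):

    -- Before: NULL filtering inside the CTE, against the raw column
    WITH spans AS (
        SELECT service_name AS client, peer_service AS server, duration
        FROM "default"
        WHERE span_kind IN ('2', '3') AND peer_service IS NOT NULL
    )

    -- After: resolve peer.service from any of its storage locations,
    -- then filter the computed values in the final WHERE clause
    WITH spans AS (
        SELECT
            service_name AS client,
            COALESCE(peer_service,
                     attributes['peer.service'],
                     attributes_string['peer.service']) AS server,
            duration
        FROM "default"
        WHERE span_kind IN ('2', '3')
    )
    SELECT client, server, COUNT(*) AS request_count
    FROM spans
    WHERE client IS NOT NULL AND server IS NOT NULL
    GROUP BY client, server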

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The change correctly handles different storage formats for the peer.service attribute using COALESCE, which is a defensive and backwards-compatible approach. Moving the NULL filter to after aggregation is more correct since it filters on computed values. The query logic for span_kind remains unchanged, and the fix addresses a real schema-compatibility issue.
  • No files require special attention

Important Files Changed

File Analysis

  • src/service/traces/service_graph/processor.rs — score 5/5 — Enhanced the SQL query to handle peer.service in multiple storage formats (flattened column, attributes map, attributes_string map) using COALESCE, and moved NULL filtering from the WHERE clause to post-aggregation for correctness

Sequence Diagram

sequenceDiagram
    participant Compactor as Compactor Job
    participant Processor as Service Graph Processor
    participant DB as Database
    participant Search as Search Service
    participant Writer as Edge Writer

    Compactor->>Processor: process_service_graph()
    Processor->>DB: get_trace_streams()
    DB-->>Processor: List of (org_id, stream_name)
    
    loop For each stream
        Processor->>Processor: Build SQL with COALESCE for peer.service
        Note over Processor: Try peer_service, attributes['peer.service'],<br/>attributes_string['peer.service']
        Processor->>Search: Execute query with CTE
        Note over Search: Filter span_kind IN ('2', '3')<br/>Compute client/server with COALESCE<br/>Aggregate metrics (count, errors, percentiles)
        Search-->>Processor: Aggregated edges
        alt Has edges
            Processor->>Writer: write_sql_aggregated_edges()
            Writer->>DB: Write to _o2_service_graph stream
        end
    end
    
    Processor-->>Compactor: Success/Failure

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

@ByteBaker ByteBaker force-pushed the fix/service-graph-sql branch from 901242f to af9c3d8 December 2, 2025 03:58
@ByteBaker ByteBaker force-pushed the fix/service-graph-sql branch 4 times, most recently from fb54589 to b130daa December 3, 2025 07:54
@ByteBaker ByteBaker changed the title from "fix: handle peer.service in nested attribute structures" to "refactor: extract service graph into independent job and support INTERNAL spans" Dec 3, 2025
@ByteBaker ByteBaker force-pushed the fix/service-graph-sql branch from b130daa to eff0d53 December 3, 2025 07:58
@ByteBaker ByteBaker force-pushed the fix/service-graph-sql branch from eff0d53 to c0b4af5 December 3, 2025 10:05
The peer.service attribute can be stored in:
- peer_service column (flattened)
- attributes['peer.service'] map
- attributes_string['peer.service'] map

Use COALESCE to try all locations. Filter NULL results in WHERE clause.
refactor: extract service graph into independent job and support INTERNAL spans

- Move service graph processor from compactor to dedicated job module
- Create `src/job/service_graph.rs` with independent lifecycle
- Remove service graph code from `src/job/compactor.rs`
- Update `src/job/mod.rs` to spawn service_graph job

- Support INTERNAL spans (span_kind='1') in addition to CLIENT/SERVER
- Remove early exit when peer attributes are missing
- Use 'internal' as fallback for peer identification
- Update SQL to include `span_kind IN ('1', '2', '3')`

This enables service graph to work with OpenObserve's internal tracing
data which uses INTERNAL spans without explicit peer.service attributes.
- Check schema for peer attributes before querying
- COALESCE priority: peer_service > server_address > db_name > db_system > http_url
- Fallback to 'internal' if no peer attributes
- Support INTERNAL spans with exclude config
- Improved UI empty state with troubleshooting
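
A sketch of the peer-identification expression those commit notes describe; the column/attribute spellings (server_address, db_name, db_system, http_url) follow the notes above and are assumptions, not copied from the code:

    COALESCE(peer_service, server_address, db_name, db_system, http_url, 'internal') AS server
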
@ByteBaker ByteBaker force-pushed the fix/service-graph-sql branch from c0b4af5 to d19bfaf December 3, 2025 10:06
@ByteBaker ByteBaker merged commit 14212b7 into main Dec 3, 2025
40 of 42 checks passed
@ByteBaker ByteBaker deleted the fix/service-graph-sql branch December 3, 2025 11:10