feat: service graph through streams #9418
Greptile Overview
Greptile Summary
Replaces in-memory service graph processing with SQL-based daemon architecture that runs in compactor every hour, completely decoupling service graph from trace ingestion.
Major Changes:
Architecture Benefits:
Configuration:
Confidence Score: 4/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant Client as Frontend Client
participant API as Service Graph API
participant Compactor as Compactor Job
participant Processor as SQL Processor
participant DataFusion as DataFusion Query Engine
participant TraceStream as Trace Streams
participant ServiceGraphStream as _o2_service_graph Stream
Note over Compactor,Processor: Daemon runs every processing_interval_secs (default: 3600s)
Compactor->>Processor: process_service_graph()
Processor->>Processor: get_trace_streams() - List all orgs & trace streams
loop For each trace stream
Processor->>DataFusion: Execute SQL with CTE
Note right of DataFusion: WITH edges AS (<br/> SELECT client, server, duration, span_status<br/> WHERE span_kind IN ('2','3')<br/> AND peer_service IS NOT NULL<br/>)<br/>SELECT COUNT(*), percentiles<br/>GROUP BY client, server
DataFusion->>TraceStream: Query last 60min of traces
TraceStream-->>DataFusion: Span records
DataFusion-->>Processor: Aggregated edges (p50/p95/p99)
Processor->>ServiceGraphStream: write_sql_aggregated_edges()
Note right of ServiceGraphStream: Bulk ingest via logs API<br/>to _o2_service_graph
end
Note over Client,API: User queries topology (independent of daemon)
Client->>API: GET /topology/current?stream_name=X
API->>ServiceGraphStream: SQL query last 60min
ServiceGraphStream-->>API: Edge records
API->>API: build_topology() - Convert to nodes/edges
API-->>Client: {nodes, edges, availableStreams}
20 files reviewed, no comments
b0d49a6 to e563680
5964399 to e23e8c5
Replace in-memory EdgeStore with daemon-based architecture that periodically
queries trace streams using SQL aggregation in DataFusion.
Architecture:
- Daemon runs in compactor job every 1 hour (default, configurable)
- Queries last 60 minutes of trace data (configurable window)
- Uses DataFusion SQL with CTE for efficient aggregation
- Calculates p50, p95, p99 percentiles using approx functions
- Writes aggregated edges to internal `_o2_service_graph` stream
- Zero impact on trace ingestion performance
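The relationship between the run interval and the query window above can be sketched as follows. This is a minimal illustration, not the code in `processor.rs`: the function names are hypothetical, and it assumes timestamps are Unix microseconds.

```rust
/// Microseconds per minute. Assumption: stream timestamps are Unix microseconds.
const MICROS_PER_MINUTE: i64 = 60 * 1_000_000;

/// Returns the half-open window `[start, end)` covering the last
/// `range_minutes` minutes before `now_micros`.
/// Hypothetical helper, not taken from the PR.
fn query_window(now_micros: i64, range_minutes: i64) -> (i64, i64) {
    let end = now_micros;
    let start = end - range_minutes * MICROS_PER_MINUTE;
    (start, end)
}

/// True when consecutive daemon runs leave no gap between their windows,
/// i.e. the query window is at least as long as the run interval.
fn windows_cover_interval(interval_secs: i64, range_minutes: i64) -> bool {
    range_minutes * 60 >= interval_secs
}
```

With the defaults (3600 s interval, 60 min window) consecutive windows line up exactly. Shrinking the window without shrinking the interval would leave spans that are never aggregated, while enlarging it would count some spans in more than one run.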
Key Changes:
- Add `processor.rs` with SQL-based daemon that queries trace streams
- Add `aggregator.rs` to write SQL results to `_o2_service_graph` stream (uses json! macro)
- Refactor `api.rs` to query pre-aggregated data from stream
- Move business logic to enterprise repo (`build_topology()`, `span_to_graph_span()`)
- Remove ALL inline processing during trace ingestion
- Delete in-memory EdgeStore, worker threads, and buffering code
- Feature-gated for enterprise builds with non-enterprise stubs
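To illustrate what converting pre-aggregated edge records into a topology might look like, here is a minimal sketch. It is not the enterprise `build_topology()`: the `EdgeRecord` and `Topology` types, the field names, and the choice to merge repeated edges by summing counts are all assumptions made for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One row read back from the service graph stream (field names are assumed).
#[derive(Debug, Clone, PartialEq)]
struct EdgeRecord {
    client: String,
    server: String,
    total_requests: u64,
    failed_requests: u64,
}

/// Nodes and edges as returned to the frontend (shape is assumed).
#[derive(Debug, PartialEq)]
struct Topology {
    nodes: Vec<String>,
    edges: Vec<EdgeRecord>,
}

/// Collects the unique services as nodes and merges rows that describe the
/// same (client, server) pair by summing their request counts.
fn build_topology_sketch(records: &[EdgeRecord]) -> Topology {
    let mut nodes = BTreeSet::new();
    let mut merged: BTreeMap<(String, String), (u64, u64)> = BTreeMap::new();
    for r in records {
        nodes.insert(r.client.clone());
        nodes.insert(r.server.clone());
        let counts = merged
            .entry((r.client.clone(), r.server.clone()))
            .or_insert((0, 0));
        counts.0 += r.total_requests;
        counts.1 += r.failed_requests;
    }
    let edges = merged
        .into_iter()
        .map(|((client, server), (total, failed))| EdgeRecord {
            client,
            server,
            total_requests: total,
            failed_requests: failed,
        })
        .collect();
    Topology {
        nodes: nodes.into_iter().collect(),
        edges,
    }
}
```

Counts can be merged by summation, but percentiles cannot, which is why the sketch leaves p50/p95/p99 out. How the real implementation combines latency values across rows is not stated here.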
SQL Query:
- CTE extracts client/server from span_kind ('2'=SERVER, '3'=CLIENT)
- Aggregates by (client, server) with COUNT, percentiles, error rate
- Filters: span_kind IN ('2','3') AND peer_service IS NOT NULL
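A query of the shape described above could be assembled roughly like this. This is a sketch under stated assumptions, not the query from `processor.rs`: the column names `service_name`, `duration`, `span_status`, and `_timestamp`, the `'ERROR'` status value, and the way client and server are derived from `span_kind` are guesses for illustration. Only the filters and the percentile functions come from the description.

```rust
/// Builds an edge-aggregation query for one trace stream.
/// Hypothetical helper: column names and the client/server mapping are
/// assumptions, and the stream name is interpolated without validation.
fn build_edge_sql(stream: &str, start_micros: i64, end_micros: i64) -> String {
    format!(
        r#"WITH edges AS (
  SELECT
    -- Assumed mapping: a CLIENT span ('3') calls its peer, a SERVER span ('2') is called by it.
    CASE WHEN span_kind = '3' THEN service_name ELSE peer_service END AS client,
    CASE WHEN span_kind = '3' THEN peer_service ELSE service_name END AS server,
    duration,
    span_status
  FROM "{stream}"
  WHERE span_kind IN ('2', '3')
    AND peer_service IS NOT NULL
    AND _timestamp >= {start_micros}
    AND _timestamp < {end_micros}
)
SELECT
  client,
  server,
  COUNT(*) AS total_requests,
  SUM(CASE WHEN span_status = 'ERROR' THEN 1 ELSE 0 END) AS failed_requests,
  approx_median(duration) AS p50,
  approx_percentile_cont(duration, 0.95) AS p95,
  approx_percentile_cont(duration, 0.99) AS p99
FROM edges
GROUP BY client, server"#
    )
}
```

Because the stream name is spliced into the SQL text, a real implementation would need to validate or escape it first. The sketch skips that to keep the query shape readable.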
Configuration:
- O2_SERVICE_GRAPH_ENABLED (default: false)
- O2_SERVICE_GRAPH_PROCESSING_INTERVAL_SECS (default: 3600) - Daemon run frequency
- O2_SERVICE_GRAPH_QUERY_TIME_RANGE_MINUTES (default: 60) - Query window size
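For reference, reading these three settings with their documented defaults could look like the sketch below. The `ServiceGraphConfig` struct and its helpers are hypothetical and use only the standard library. The project's actual configuration loading is not shown in this PR description and may differ.

```rust
use std::env;

/// Hypothetical holder for the three settings listed above.
#[derive(Debug, PartialEq)]
struct ServiceGraphConfig {
    enabled: bool,
    processing_interval_secs: u64,
    query_time_range_minutes: u64,
}

/// Parses `raw` if present and valid, otherwise falls back to `default`.
fn parse_or<T: std::str::FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

impl ServiceGraphConfig {
    /// Builds the config from any key lookup, which keeps it testable
    /// without touching the process environment.
    fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        Self {
            enabled: parse_or(lookup("O2_SERVICE_GRAPH_ENABLED"), false),
            processing_interval_secs: parse_or(
                lookup("O2_SERVICE_GRAPH_PROCESSING_INTERVAL_SECS"),
                3600,
            ),
            query_time_range_minutes: parse_or(
                lookup("O2_SERVICE_GRAPH_QUERY_TIME_RANGE_MINUTES"),
                60,
            ),
        }
    }

    /// Reads the same keys from the process environment.
    fn from_env() -> Self {
        Self::from_lookup(|key| env::var(key).ok())
    }
}
```

In this sketch an unparseable value silently falls back to the default, and the boolean accepts only `true` or `false`. Whether the real loader is that lenient, or accepts other spellings, is not specified here.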
Technical Details:
- Stream name: `_o2_service_graph` (internal stream)
- Edge timestamp: Uses query end time for API compatibility
- Span kinds: '2' (SERVER), '3' (CLIENT) as VARCHAR in parquet
- Percentiles: approx_median(), approx_percentile_cont(0.95), approx_percentile_cont(0.99)
e23e8c5 to 97c701a
Architecture
Replaces in-memory EdgeStore with daemon that periodically queries trace streams using DataFusion SQL.
Flow:
- Aggregated edges are written to the `_o2_service_graph` internal stream
- API queries `_o2_service_graph` for last 60 minutes

Zero impact on trace ingestion - completely decoupled.
SQL Query
Configuration
- `O2_SERVICE_GRAPH_ENABLED=true` - Enable feature
- `O2_SERVICE_GRAPH_PROCESSING_INTERVAL_SECS=3600` - How often daemon runs (default: 1 hour)
- `O2_SERVICE_GRAPH_QUERY_TIME_RANGE_MINUTES=60` - Query window size (default: 60 min)
Changes
Added:
- `processor.rs` (202 lines) - SQL daemon
- `aggregator.rs` (155 lines) - Stream writer
Modified:
- `api.rs` - Query from `_o2_service_graph` stream, return `availableStreams`
- `compactor.rs` - Integrate service_graph_processor job
- `traces/mod.rs` - Remove ALL inline processing
Deleted:
- `job/service_graph.rs` - Standalone daemon (now in compactor)
Enterprise dependency:
Technical Details