Skip to content

fix: piece state tracking and peer connection handling improvement#532

Merged
ikatson merged 12 commits intomainfrom
fix-testserver-issues
Jan 18, 2026
Merged

fix: piece state tracking and peer connection handling improvement#532
ikatson merged 12 commits intomainfrom
fix-testserver-issues

Conversation

@ikatson
Copy link
Copy Markdown
Owner

@ikatson ikatson commented Jan 18, 2026

Various fixes

  • Fix piece state invariant violations by introducing a new PieceTracker abstraction that wraps ChunkTracker with inflight piece tracking, ensuring pieces are always in exactly one state (HAVE, QUEUED, IN_FLIGHT, or NOT_NEEDED)
  • Fix deadlock where on_steal was called while holding a peer lock, which could deadlock when multiple threads accessed different peer shards
  • Fix peer death cleanup where pieces owned by dead peers were left in both inflight_pieces AND queue_pieces, causing "already requested ChunkInfo" warnings
  • Fix stolen chunks cleanup - chunks are now removed from the victim peer's inflight_requests when stolen
  • Fix incoming connection race - gracefully handle when a peer reconnects during reconnect backoff instead of logging an error
  • Skip queued uploads for disconnected peers to avoid wasting time on dead connections
  • Improve testserver debugging with deterministic ports (50000+), unique torrent names, peer_id in session spans, and configurable file loggin

Use init_logging from tracing_subscriber_config_utils to write DEBUG-level
logs to /tmp/rqbit-simulate-traffic/testserver.log by default. Configurable
via TESTSERVER_LOG_FILE and TESTSERVER_LOG_FILE_RUST_LOG env vars.

Also document log file locations for both devserver and testserver in
CLAUDE.md with guidance to always filter logs with rg instead of reading
directly.
When a piece is stolen from a peer, the on_steal function sends Cancel
messages but didn't remove the chunks from the victim's inflight_requests.
This caused "we already requested ChunkInfo" warnings when the stealer
died and the original peer re-reserved the same piece.

Now uses retain() to both send cancels and remove chunks in a single pass,
which is also more efficient (no intermediate Vec allocation).
When an outgoing peer connection dies, we set the peer to Dead state and
spawn a wait_for_peer task that sleeps for backoff before requeuing.
If an incoming connection from the same peer arrives during this sleep,
the peer transitions from Dead to Live. When wait_for_peer wakes up and
finds the peer Live instead of Dead, this is not a bug - it is a valid
race condition where the peer reconnected to us.

Previously this produced an error log. Now we gracefully handle all
non-Dead states by simply skipping the requeue since the peer is either
already connected, already queued, or no longer needed.
Documents the shared state structures (ChunkTracker, inflight_pieces,
inflight_requests), their intended invariants, and state transitions.

The previous fix (f61c1bd - remove stolen chunks from peer's
inflight_requests) only partially addressed the 'already requested
ChunkInfo' warning. It helps when:
- Peer A steals from live peer B
- Peer A dies
- Peer B re-reserves the piece

But it doesn't help when the victim peer DIES (not just loses pieces):
- When peer B dies, its LivePeerState is consumed (taken)
- on_steal's with_live_mut(B, ...) returns None - nothing to clean up
- Pieces stay in inflight_pieces owned by dead peer B
- mark_piece_broken_if_not_have adds pieces to queue_pieces
- INVARIANT VIOLATED: pieces now in BOTH inflight_pieces AND queue_pieces
- Peer A steals from dead B, requests chunks
- reserve_next_needed_piece returns same piece (still in queue_pieces)
- Warning fires when trying to insert already-present chunks

The root cause is that on_peer_dead() doesn't clean up inflight_pieces.
This doc captures the full picture before deciding on proper fix approach.
Add PieceTracker abstraction that wraps ChunkTracker + inflight_pieces
to encapsulate piece state transitions and maintain invariants.

Key changes:
- New piece_tracker.rs with PieceTracker struct and unit tests
- Consolidated reserve_next_needed_piece + try_steal_old_slow_piece
  into single acquire_piece method with steal(10x) -> reserve -> steal(3x)
- Fixed bug where peer death left pieces in both queue_pieces AND
  inflight_pieces by adding release_pieces_owned_by() method
- Fixed steal bug where stealing peer was not checked for having the piece
- Updated STATE.md to reflect new architecture

The invariant that pieces are in exactly one state (HAVE, IN_FLIGHT,
QUEUED, or NOT_NEEDED) is now enforced by PieceTracker methods.
…n testserver

- Torrent names are now unique by using index-based selection with cycle suffix
- Each session (peers and main) now has explicit peer_id passed to SessionOptions
- Root spans include peer_id for easier log tracing across sessions
The acquire_next_piece function was calling on_steal inside a
with_live_mut closure, which holds a DashMap write lock on the peer.
Since on_steal accesses other peers (with_peer, with_live_mut), this
created a nested lock pattern that could deadlock when multiple threads
try to access different peers shards simultaneously.

Fix by storing the steal info and processing it after the closure
returns, when no peer lock is held.

Also expanded STATE.md lock ordering documentation with the critical
rule about never accessing other peers while holding a peer lock.
Main session listens on port 50000, peer sessions on 50001, 50002, etc.
Also includes port in tracing spans for log filtering.
When a peer disconnects while their upload requests are waiting in the
rate-limited queue, we now detect this via tx.closed() and skip those
requests instead of waiting forever for rate limit clearance.

This helps reduce wasted time on dead peers but doesn't fully solve
the queue starvation problem where one peer can monopolize the upload
queue while others wait.
The delete modal was showing "Torrent #1" instead of the actual name
because TorrentActions was using detailsResponse (which may not be
loaded) instead of the torrent name from the list.

Also simplified component props throughout:
- TorrentActions now takes torrent object instead of separate props
- Removed unused detailsResponse from TorrentCardContent and TorrentCard
- PeersTab now takes torrent object instead of separate id/stats
- Removed redundant TorrentToDelete type from DeleteTorrentModal
@ikatson ikatson force-pushed the fix-testserver-issues branch from b572104 to c84604e Compare January 18, 2026 21:19
@ikatson ikatson force-pushed the fix-testserver-issues branch from c84604e to 2d31467 Compare January 18, 2026 21:21
@ikatson ikatson merged commit 702c3f5 into main Jan 18, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant