Retry availability until the receiver of the request is dropped (#2763)

* guide updates

* keep interactions alive until receivers drop

* retry indefinitely

* cancel approval tasks on finality

* use swap_remove instead of remove
This commit is contained in:
Robert Habermeier
2021-03-30 17:33:38 +02:00
committed by GitHub
parent 6514e00144
commit 08d5b268a0
6 changed files with 341 additions and 266 deletions
@@ -25,31 +25,27 @@ We hold a state which tracks the current recovery interactions we have live, as
```rust
struct State {
/// Each interaction is implemented as its own async task, and these handles are for communicating with them.
interactions: Map<CandidateHash, InteractionHandle>,
/// Each interaction is implemented as its own remote async task, and these handles are remote
/// for it.
interactions: FuturesUnordered<InteractionHandle>,
/// A multiplexer over receivers from live interactions.
interaction_receivers: FuturesUnordered<ResponseReceiver<Concluded>>,
/// A recent block hash for which state should be available.
live_block_hash: Hash,
/// interaction communication. This is cloned and given to interactions that are spun up.
from_interaction_tx: Sender<FromInteraction>,
/// receiver for messages from interactions.
from_interaction_rx: Receiver<FromInteraction>,
// An LRU cache of recently recovered data.
availability_lru: LruCache<CandidateHash, Result<AvailableData, RecoveryError>>,
}
/// This is a future, which concludes either when a response is received from the interaction,
/// or all the `awaiting` channels have closed.
struct InteractionHandle {
candidate_hash: CandidateHash,
interaction_response: RemoteHandle<Concluded>,
awaiting: Vec<ResponseChannel<Result<AvailableData, RecoveryError>>>,
}
struct Unavailable;
enum FromInteraction {
// An interaction concluded.
Concluded(CandidateHash, Result<AvailableData, RecoveryError>),
// Send a request on the network.
NetworkRequest(Requests),
}
struct Concluded(CandidateHash, Result<AvailableData, RecoveryError>);
struct InteractionParams {
validator_authority_keys: Vec<AuthorityId>,
@@ -71,12 +67,12 @@ enum InteractionPhase {
// request the chunk from them.
shuffling: Vec<ValidatorIndex>,
received_chunks: Map<ValidatorIndex, ErasureChunk>,
requesting_chunks: FuturesUnordered<Receiver<DataResponse<ErasureChunk>>>,
requesting_chunks: FuturesUnordered<Receiver<ErasureChunkRequestResponse>>,
}
}
struct Interaction {
to_state: Sender<FromInteraction>,
to_subsystems: SubsystemSender,
params: InteractionParams,
phase: InteractionPhase,
}
@@ -104,10 +100,6 @@ On `Conclude`, shut down the subsystem.
1. Load the entry from the `interactions` map. It should always exist, if not for logic errors. Send the result to each member of `awaiting`.
1. Add the entry to the availability_lru.
#### `FromInteraction::NetworkRequest(requests)`
1. Forward with `NetworkBridgeMessage::SendRequests`.
### Interaction logic
#### `launch_interaction(session_index, session_info, candidate_receipt, candidate_hash, Option<backing_group_index>)`
@@ -115,13 +107,13 @@ On `Conclude`, shut down the subsystem.
1. Compute the threshold from the session info. It should be `f + 1`, where `n = 3f + k`, where `k in {1, 2, 3}`, and `n` is the number of validators.
1. Set the various fields of `InteractionParams` based on the validator lists in `session_info` and information about the candidate.
1. If the `backing_group_index` is `Some`, start in the `RequestFromBackers` phase with a shuffling of the backing group validator indices and a `None` requesting value.
1. Otherwise, start in the `RequestChunks` phase with `received_chunks` and `requesting_chunks` both empty.
1. Set the `to_state` sender to be equal to a clone of `state.from_interaction_tx`.
1. Otherwise, start in the `RequestChunks` phase with `received_chunks`,`requesting_chunks`, and `next_shuffling` all empty.
1. Set the `to_subsystems` sender to be equal to a clone of the `SubsystemContext`'s sender.
1. Initialize `received_chunks` to an empty set, as well as `requesting_chunks`.
Launch the interaction as a background task running `interaction_loop(interaction)`.
#### `interaction_loop(interaction)`
#### `interaction_loop(interaction) -> Result<AvailableData, RecoeryError>`
```rust
// How many parallel requests to have going at once.
@@ -135,13 +127,14 @@ Loop:
* If the backer is `Some`, issue a `FromInteraction::NetworkRequest` with a network request for the `AvailableData` and wait for the response.
* If it concludes with a `None` result, return to beginning.
* If it concludes with available data, attempt a re-encoding.
* If it has the correct erasure-root, break and issue a `Concluded(Ok(available_data))`.
* If it has the correct erasure-root, break and issue a `Ok(available_data)`.
* If it has an incorrect erasure-root, issue a `FromInteraction::ReportPeer` message and return to beginning.
* If the backer is `None`, set the phase to `InteractionPhase::RequestChunks` with a random shuffling of validators and empty `received_chunks` and `requesting_chunks`.
* If the backer is `None`, set the phase to `InteractionPhase::RequestChunks` with a random shuffling of validators and empty `next_shuffling`, `received_chunks`, and `requesting_chunks`.
* If the phase is `InteractionPhase::RequestChunks`:
* Poll for new updates from `requesting_chunks`. Check merkle proofs of any received chunks, and any failures should lead to issuance of a `FromInteraction::ReportPeer` message.
* If `received_chunks` has more than `threshold` entries, attempt to recover the data. If that fails, or a re-encoding produces an incorrect erasure-root, break and issue a `Concluded(RecoveryError::Invalid)`. If correct, break and issue `Concluded(Ok(available_data))`.
* If `received_chunks + requesting_chunks + shuffling` lengths are less than the threshold, break and return `Err(Unavailable)`.
* Poll for new updates from `requesting_chunks`. Check merkle proofs of any received chunks. If the request simply fails due to network issues, push onto the back of `shuffling` to be retried.
* If `received_chunks` has more than `threshold` entries, attempt to recover the data. If that fails, or a re-encoding produces an incorrect erasure-root, break and issue a `Err(RecoveryError::Invalid)`. If correct, break and issue `Ok(available_data)`.
* While there are fewer than `N_PARALLEL` entries in `requesting_chunks`,
* Pop the next item from `shuffling`. If it's empty and `requesting_chunks` is empty, break and set the phase to `Concluded(None)`.
* Pop the next item from `shuffling`. If it's empty and `requesting_chunks` is empty, return `Err(RecoveryError::Unavailable)`.
* Issue a `FromInteraction::NetworkRequest` and wait for the response in `requesting_chunks`.