feat: initialize Kurdistan SDK - independent fork of Polkadot SDK

2025-12-13 15:44:15 +03:00
commit 286de54384
6841 changed files with 1848356 additions and 0 deletions
@@ -0,0 +1,50 @@
+# Node Architecture
+
+## Design Goals
+
+* Modularity: Components of the system should be as self-contained as possible. Communication boundaries between
+  components should be well-defined and mockable. This is key to creating testable, easily reviewable code.
+* Minimizing side effects: Components of the system should aim to minimize side effects and to communicate with other
+  components via message-passing.
+* Operational Safety: The software will be managing signing keys where conflicting messages can lead to large amounts of
+  value to be slashed. Care should be taken to ensure that no messages are signed incorrectly or in conflict with each
+  other.
+
+The architecture of the node-side behavior aims to embody the Rust principles of ownership and message-passing to create
+clean, isolatable code. Each resource should have a single owner, with minimal sharing where unavoidable.
+
+Many operations that need to be carried out involve the network, which is asynchronous. This asynchrony affects all core
+subsystems that rely on the network as well. The approach of hierarchical state machines is well-suited to this kind of
+environment.
+
+We introduce
+
+## Components
+
+The node architecture consists of the following components:
+  * The Overseer (and subsystems): A hierarchy of state machines where an overseer supervises subsystems. Subsystems can
+    contain their own internal hierarchy of jobs. This is elaborated on in the next section on Subsystems.
+  * A block proposer: Logic triggered by the consensus algorithm of the chain when the node should author a block.
+  * A GRANDPA voting rule: A strategy for selecting chains to vote on in the GRANDPA algorithm to ensure that only valid
+    teyrchain candidates appear in finalized relay-chain blocks.
+
+## Assumptions
+
+The Node-side code comes with a set of assumptions that we build upon. These assumptions encompass most of the
+fundamental blockchain functionality.
+
+We assume the following constraints regarding provided basic functionality:
+  * The underlying **consensus** algorithm, whether it is BABE or SASSAFRAS, is implemented.
+  * There is a **chain synchronization** protocol which will search for and download the longest available chains at all
+    times.
+  * The **state** of all blocks at the head of the chain is available. There may be **state pruning** such that the
+    state of the last `k` blocks behind the last finalized block is available, as well as the state of all their
+    descendants. This assumption implies that the state of all active leaves and their last `k` ancestors is
+    available. The underlying implementation is expected to support `k` of a few hundred blocks, but we reduce this to
+    a very conservative `k=5` for our purposes.
+  * There is an underlying **networking** framework which provides **peer discovery** services which will provide us
+    with peers and will not create "loopback" connections to our own node. The number of peers we will have is assumed
+    to be bounded at 1000.
+  * There is a **transaction pool** and a **transaction propagation** mechanism which maintains a set of current
+    transactions and distributes them to connected peers. Current transactions are those which are not outdated
+    relative to some "best" fork of the chain, which is part of the active heads, and have not been included in the
+    best fork.
@@ -0,0 +1,10 @@
+# Approval Subsystems
+
+The approval subsystems implement the node-side of the [Approval Protocol](../../protocol-approval.md).
+
+We separate the [assignment/voting logic](approval-voting.md) from the [distribution
+logic](approval-distribution.md) that distributes assignment certifications and approval votes. The assignment and
+voting logic also informs the GRANDPA voting rule on how to vote.
+
+These subsystems are intended to flag issues and begin participating in live disputes. Dispute subsystems also track all
+observed votes (backing, approval, and dispute-specific) by all validators on all candidates.
@@ -0,0 +1,348 @@
+# Approval Distribution
+
+A subsystem for the distribution of assignments and approvals for approval checks on candidates over the network.
+
+The [Approval Voting](approval-voting.md) subsystem is responsible for active participation in a protocol designed to
+select a sufficient number of validators to check each and every candidate which appears in the relay chain. Statements
+of participation in this checking process are divided into two kinds:
+  * **Assignments** indicate that validators have been selected to do checking
+  * **Approvals** indicate that validators have checked and found the candidate satisfactory.
+
+The [Approval Voting](approval-voting.md) subsystem handles all the issuing and tallying of this protocol, but this
+subsystem is responsible for the disbursal of statements among the validator-set.
+
+The inclusion pipeline of candidates concludes after availability, and only after inclusion do candidates actually get
+pushed into the approval checking pipeline. As such, this protocol deals with the candidates _made available by_
+particular blocks, as opposed to the candidates which actually appear within those blocks, which are the candidates
+_backed by_ those blocks. Unless stated otherwise, whenever we reference a candidate partially by block hash, we are
+referring to the set of candidates _made available by_ those blocks.
+
+We implement this protocol as a gossip protocol, and like other teyrchain-related gossip protocols our primary concerns
+are about ensuring fast message propagation while maintaining an upper bound on the number of messages any given node
+must store at any time.
+
+Approval messages should always follow assignments, so we need to be able to discern two pieces of information based on
+our [View](../../types/network.md#universal-types):
+  1. Is a particular assignment relevant under a given `View`?
+  2. Is a particular approval relevant to any assignment in a set?
+
+For our own local view, these two queries must not yield false negatives. When applied to our peers' views, it is
+acceptable for them to yield false negatives. This is because our peers' views may be beyond ours, and we are not
+capable of fully evaluating them. Once we have caught up, we can check again for false negatives to continue
+distributing.
+
+For assignments, what we need to be checking is whether we are aware of the (block, candidate) pair that the assignment
+references. For approvals, we need to be aware of an assignment by the same validator which references the candidate
+being approved.
+
+However, awareness on its own of a (block, candidate) pair would imply that even ancient candidates all the way back to
+the genesis are relevant. We are actually not interested in anything before finality.
+
+We gossip assignments along a grid topology produced by the [Gossip Support Subsystem](../utility/gossip-support.md) and
+also to a few random peers. The first time we accept an assignment or approval, regardless of the source, which
+originates from a validator peer in a shared dimension of the grid, we propagate the message to validator peers in the
+unshared dimension as well as a few random peers.
+
+But, in case these mechanisms don't work on their own, we need to trade bandwidth for protocol liveness by introducing
+aggression.
+
+Aggression has 3 levels:
+* Aggression Level 0: The basic behaviors described above.
+* Aggression Level 1: The originator of a message sends to all peers. Other peers follow the rules above.
+* Aggression Level 2: All peers send all messages to all their row and column neighbors. This means that each validator
+    will, on average, receive each message approximately 2*sqrt(n) times.
+
+These aggression levels are chosen based on how long a block has taken to finalize: assignments and approvals related to
+the unfinalized block will be propagated with more aggression. In particular, it's only the earliest unfinalized blocks
+that aggression should be applied to, because descendants may be unfinalized only by virtue of being descendants.
+
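As a rough illustration of where the `2*sqrt(n)` figure comes from, the sketch below lays `n` validators out row by row in a near-square grid and lists the row and column neighbors of one validator. The layout rule and the name `grid_neighbors` are assumptions made for this example; the real topology is produced by the Gossip Support Subsystem.

```rust
/// Returns (row neighbors, column neighbors) of validator `index` when `n`
/// validators are placed row by row in a grid of width ceil(sqrt(n)).
/// Hypothetical helper: not the layout code used by gossip support.
fn grid_neighbors(n: usize, index: usize) -> (Vec<usize>, Vec<usize>) {
    assert!(index < n, "validator index out of range");
    // Smallest width whose square covers all validators.
    let mut width = 1;
    while width * width < n {
        width += 1;
    }
    let (row, col) = (index / width, index % width);
    let row_neighbors = (0..n)
        .filter(|&i| i != index && i / width == row)
        .collect();
    let col_neighbors = (0..n)
        .filter(|&i| i != index && i % width == col)
        .collect();
    (row_neighbors, col_neighbors)
}
```

With `n = 100`, `grid_neighbors(100, 0)` yields 9 row neighbors and 9 column neighbors, so at Aggression Level 2 a validator hears each message from close to `2*sqrt(n)` peers.
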
+## Protocol
+
+Input:
+  * `ApprovalDistributionMessage::NewBlocks`
+  * `ApprovalDistributionMessage::DistributeAssignment`
+  * `ApprovalDistributionMessage::DistributeApproval`
+  * `ApprovalDistributionMessage::NetworkBridgeUpdate`
+  * `OverseerSignal::BlockFinalized`
+
+Output:
+  * `ApprovalVotingMessage::ImportAssignment`
+  * `ApprovalVotingMessage::ImportApproval`
+  * `NetworkBridgeMessage::SendValidationMessage::ApprovalDistribution`
+
+## Functionality
+
+```rust
+type BlockScopedCandidate = (Hash, CandidateHash);
+
+enum PendingMessage {
+  Assignment(IndirectAssignmentCert, CoreIndex),
+  Approval(IndirectSignedApprovalVote),
+}
+
+/// The `State` struct is responsible for tracking the overall state of the subsystem.
+///
+/// It tracks metadata about our view of the unfinalized chain, which assignments and approvals we have seen, and our peers' views.
+struct State {
+  // These two fields are used in conjunction to construct a view over the unfinalized chain.
+  blocks_by_number: BTreeMap<BlockNumber, Vec<Hash>>,
+  blocks: HashMap<Hash, BlockEntry>,
+
+  /// Our view updates to our peers can race with `NewBlocks` updates. We store messages received
+  /// against the directly mentioned blocks in our view in this map until `NewBlocks` is received.
+  ///
+  /// As long as the parent is already in the `blocks` map and `NewBlocks` messages aren't delayed
+  /// by more than a block length, this strategy will work well for mitigating the race. This is
+  /// also a race that occurs typically on local networks.
+  pending_known: HashMap<Hash, Vec<(PeerId, PendingMessage)>>,
+
+  // Peer view data is partially stored here, and partially inline within the `BlockEntry`s
+  peer_views: HashMap<PeerId, View>,
+}
+
+enum MessageFingerprint {
+  Assignment(Hash, u32, ValidatorIndex),
+  Approval(Hash, u32, ValidatorIndex),
+}
+
+struct Knowledge {
+  known_messages: HashSet<MessageFingerprint>,
+}
+
+struct PeerKnowledge {
+  /// The knowledge we've sent to the peer.
+  sent: Knowledge,
+  /// The knowledge we've received from the peer.
+  received: Knowledge,
+}
+
+/// Information about blocks in our current view as well as whether peers know of them.
+struct BlockEntry {
+  // Peers who we know are aware of this block and thus, the candidates within it. This maps to their knowledge of messages.
+  known_by: HashMap<PeerId, PeerKnowledge>,
+  // The number of the block.
+  number: BlockNumber,
+  // The parent hash of the block.
+  parent_hash: Hash,
+  // Our knowledge of messages.
+  knowledge: Knowledge,
+  // A votes entry for each candidate.
+  candidates: IndexMap<CandidateHash, CandidateEntry>,
+}
+
+enum ApprovalState {
+  Assigned(AssignmentCert),
+  Approved(AssignmentCert, ApprovalSignature),
+}
+
+/// Information about candidates in the context of a particular block they are included in. In other words,
+/// multiple `CandidateEntry`s may exist for the same candidate, if it is included by multiple blocks - this is likely the case
+/// when there are forks.
+struct CandidateEntry {
+  approvals: HashMap<ValidatorIndex, ApprovalState>,
+}
+```
+
+### Network updates
+
+#### `NetworkBridgeEvent::PeerConnected`
+
+Add a blank view to the `peer_views` state.
+
+#### `NetworkBridgeEvent::PeerDisconnected`
+
+Remove the view under the associated `PeerId` from `State::peer_views`.
+
+Iterate over every `BlockEntry` and remove `PeerId` from it.
+
+#### `NetworkBridgeEvent::OurViewChange`
+
+Remove entries in `pending_known` for all hashes not present in the view. Ensure a vector is present in `pending_known`
+for each hash in the view that does not have an entry in `blocks`.
+
+#### `NetworkBridgeEvent::PeerViewChange`
+
+Invoke `unify_with_peer(peer, view)` to catch them up to messages we have.
+
+We also need to use the `view.finalized_number` to remove the `PeerId` from any blocks that it won't be wanting
+information about anymore. Note that we have to be on guard for peers doing crazy stuff like jumping their
+`finalized_number` forward 10 trillion blocks to try and get us stuck in a loop for ages.
+
+One of the safeguards we can implement is to reject view updates from peers where the new `finalized_number` is less
+than the previous.
+
+We augment that by defining `constrain(x)` to output the x bounded by the first and last numbers in
+`state.blocks_by_number`.
+
+From there, we can loop backwards from `constrain(view.finalized_number)` until `constrain(last_view.finalized_number)`
+is reached, removing the `PeerId` from all `BlockEntry`s referenced at that height. We can break the loop early if we
+ever exit the bound supplied by the first block in `state.blocks_by_number`.
+
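The bounded loop described above might be sketched as follows. The type aliases are simplified stand-ins, `known_by` is flattened into a separate map for brevity, and the function name `remove_peer_below_finalized` is hypothetical. The key point is that the work is bounded by the blocks we store, not by the number a peer claims.

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

type BlockNumber = u32;
type Hash = u64; // stand-in for a real block hash
type PeerId = u32; // stand-in for a real peer identifier

/// Clamp `x` to the first and last numbers we track; `None` if we track nothing.
fn constrain(blocks_by_number: &BTreeMap<BlockNumber, Vec<Hash>>, x: BlockNumber) -> Option<BlockNumber> {
    let first = *blocks_by_number.keys().next()?;
    let last = *blocks_by_number.keys().next_back()?;
    Some(x.clamp(first, last))
}

/// Removes `peer` from blocks at or below its new finalized number.
/// Returns `false` if the update is rejected because finality went backwards.
fn remove_peer_below_finalized(
    blocks_by_number: &BTreeMap<BlockNumber, Vec<Hash>>,
    known_by: &mut HashMap<Hash, HashSet<PeerId>>,
    peer: PeerId,
    last_finalized: BlockNumber,
    new_finalized: BlockNumber,
) -> bool {
    if new_finalized < last_finalized {
        return false;
    }
    let (Some(lo), Some(hi)) = (
        constrain(blocks_by_number, last_finalized),
        constrain(blocks_by_number, new_finalized),
    ) else {
        return true; // nothing stored, nothing to prune
    };
    // Walk backwards over stored heights only.
    for (number, hashes) in blocks_by_number.range(lo..=hi).rev() {
        // Clamping can round up to our first block; never prune above the
        // peer's actual finalized number.
        if *number > new_finalized {
            continue;
        }
        for hash in hashes {
            if let Some(peers) = known_by.get_mut(hash) {
                peers.remove(&peer);
            }
        }
    }
    true
}
```

A peer that jumps its `finalized_number` to `u32::MAX` costs only one pass over the stored heights.
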
+#### `NetworkBridgeEvent::PeerMessage`
+
+If the block hash referenced by the message exists in `pending_known`, add it to the vector of pending messages and
+return.
+
+If the message is of type `ApprovalDistributionV1Message::Assignment(assignment_cert, claimed_index)`, then call
+`import_and_circulate_assignment(MessageSource::Peer(sender), assignment_cert, claimed_index)`.
+
+If the message is of type `ApprovalDistributionV1Message::Approval(approval_vote)`, then call
+`import_and_circulate_approval(MessageSource::Peer(sender), approval_vote)`.
+
+### Subsystem Updates
+
+#### `ApprovalDistributionMessage::NewBlocks`
+
+Create `BlockEntry` and `CandidateEntries` for all blocks.
+
+For all entries in `pending_known`:
+  * If there is now an entry under `blocks` for the block hash, drain all messages and import with
+    `import_and_circulate_assignment` and `import_and_circulate_approval`.
+
+For all peers:
+  * Compute `view_intersection` as the intersection of the peer's view blocks with the hashes of the new blocks.
+  * Invoke `unify_with_peer(peer, view_intersection)`.
+
+#### `ApprovalDistributionMessage::DistributeAssignment`
+
+Call `import_and_circulate_assignment` with `MessageSource::Local`.
+
+#### `ApprovalDistributionMessage::DistributeApproval`
+
+Call `import_and_circulate_approval` with `MessageSource::Local`.
+
+#### `OverseerSignal::BlockFinalized`
+
+Prune all lists from `blocks_by_number` with number less than or equal to `finalized_number`. Prune all the
+`BlockEntry`s referenced by those lists.
+
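A minimal sketch of this pruning step, assuming simplified stand-in types and a hypothetical function name `prune_finalized`. It relies on `BTreeMap::split_off`, which keeps the keys below the split point in place and returns the rest.

```rust
use std::collections::{BTreeMap, HashMap};
use std::mem;

type BlockNumber = u32;
type Hash = u64; // stand-in for a real block hash

/// Drops every height at or below `finalized_number` together with the
/// block entries those heights reference. `E` stands in for `BlockEntry`.
fn prune_finalized<E>(
    blocks_by_number: &mut BTreeMap<BlockNumber, Vec<Hash>>,
    blocks: &mut HashMap<Hash, E>,
    finalized_number: BlockNumber,
) {
    // Keys >= finalized_number + 1 are retained; everything else is pruned.
    let retained = match finalized_number.checked_add(1) {
        Some(next) => blocks_by_number.split_off(&next),
        None => BTreeMap::new(), // every stored height is finalized
    };
    let pruned = mem::replace(blocks_by_number, retained);
    for hash in pruned.into_values().flatten() {
        blocks.remove(&hash);
    }
}
```
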
+
+### Utility
+
+```rust
+enum MessageSource {
+  Peer(PeerId),
+  Local,
+}
+```
+
+#### `import_and_circulate_assignment(...)`
+
+`import_and_circulate_assignment(source: MessageSource, assignment: IndirectAssignmentCert, claimed_candidate_index:
+CandidateIndex)`
+
+Imports an assignment cert referenced by block hash and candidate index. As a postcondition, if the cert is valid, it
+will have distributed the cert to all peers who have the block in their view, with the exclusion of the peer referenced
+by the `MessageSource`.
+
+We maintain a few invariants:
+  * we only send an assignment to a peer after we add its fingerprint to our knowledge
+  * we add a fingerprint of an assignment to our knowledge only if it's valid and hasn't been added before
+
+The algorithm is the following:
+
+  * Load the `BlockEntry` using `assignment.block_hash`. If it does not exist, report the source if it is
+    `MessageSource::Peer` and return.
+  * Compute a fingerprint for the `assignment` using `claimed_candidate_index`.
+  * If the source is `MessageSource::Peer(sender)`:
+    * check if `sender` appears under `known_by` and whether the fingerprint is in the knowledge of the peer. If the
+      peer does not know the block, report for providing data out-of-view and proceed. If the peer does know the block
+      and the `sent` knowledge contains the fingerprint, report for providing duplicate data and return, otherwise,
+      insert into the `received` knowledge and return.
+    * If the message fingerprint appears under the `BlockEntry`'s `Knowledge`, give the peer a small positive reputation
+      boost, add the fingerprint to the peer's knowledge only if it knows about the block and return. Note that we must
+      do this after checking for out-of-view and if the peer knows about the block to avoid being spammed. If we did
+      this check earlier, a peer could provide data out-of-view repeatedly and be rewarded for it.
+    * Check the assignment certificate is valid.
+      * If the cert kind is `RelayVRFModulo`, then the certificate is valid as long as `sample <
+        session_info.relay_vrf_samples` and the VRF is valid for the validator's key with the input
+        `block_entry.relay_vrf_story ++ sample.encode()` as described with
+        [the approvals protocol section](../../protocol-approval.md#assignment-criteria). We set
+        `core_index = vrf.make_bytes().to_u32() % session_info.n_cores`. If the `BlockEntry` causes
+        inclusion of a candidate at `core_index`, then this is a valid assignment for the candidate
+        at `core_index` and has delay tranche 0. Otherwise, it can be ignored.
+      * If the cert kind is `RelayVRFModuloCompact`, then the certificate is valid as long as the VRF
+        is valid for the validator's key with the input `block_entry.relay_vrf_story ++ relay_vrf_samples.encode()`
+        as described with [the approvals protocol section](../../protocol-approval.md#assignment-criteria).
+        We enforce that all `core_bitfield` indices are included in the set of the core indices sampled from the
+        VRF Output. The assignment is considered a valid tranche0 assignment for all claimed candidates if all
+        `core_bitfield` indices match the core indices where the claimed candidates were included at.
+      * If the cert kind is `RelayVRFDelay`, then we check if the VRF is valid for the validator's key with the
+        input `block_entry.relay_vrf_story ++ cert.core_index.encode()` as described in [the approvals protocol
+        section](../../protocol-approval.md#assignment-criteria). The cert can be ignored if the block did not
+        cause inclusion of a candidate on that core index. Otherwise, this is a valid assignment for the included
+        candidate. The delay tranche for the assignment is determined by reducing
+        `(vrf.make_bytes().to_u64() % (session_info.n_delay_tranches + session_info.zeroth_delay_tranche_width)).saturating_sub(session_info.zeroth_delay_tranche_width)`.
+      * We also check that the core index derived by the output is covered by the `VRFProof` by means of an auxiliary signature.
+      * If the delay tranche is too far in the future, return `AssignmentCheckResult::TooFarInFuture`.
+    * If the result is `AssignmentCheckResult::Accepted`
+      * Dispatch `ApprovalVotingMessage::ImportAssignment(assignment)` to approval-voting to import the assignment.
+      * If the vote was accepted but not duplicate, give the peer a positive reputation boost.
+      * Add the fingerprint to both our and the peer's knowledge in the `BlockEntry`. Note that we only do this after
+        making sure we have the right fingerprint.
+    * If the result is `AssignmentCheckResult::AcceptedDuplicate`, add the fingerprint to the peer's knowledge if it
+      knows about the block and return.
+    * If the result is `AssignmentCheckResult::TooFarInFuture`, mildly punish the peer and return.
+    * If the result is `AssignmentCheckResult::Bad`, punish the peer and return.
+  * If the source is `MessageSource::Local`:
+    * check if the fingerprint appears under the `BlockEntry`'s knowledge. If not, add it.
+  * Load the candidate entry for the given candidate index. It should exist unless there is a logic error in the
+    approval voting subsystem.
+  * Set the approval state for the validator index to `ApprovalState::Assigned` unless the approval state is set
+    already. This should not happen as long as the approval voting subsystem instructs us to ignore duplicate
+    assignments.
+  * Dispatch an `ApprovalDistributionV1Message::Assignment(assignment, candidate_index)` to all peers in the
+    `BlockEntry`'s `known_by` set, excluding the peer in the `source`, if `source` has kind `MessageSource::Peer`. Add
+    the fingerprint of the assignment to the knowledge of each peer.
+
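The delay tranche formula used for `RelayVRFDelay` certificates in the algorithm above can be written out as a small function. This is a sketch under assumptions: `vrf_output` stands for the value of `vrf.make_bytes().to_u64()`, the function name is hypothetical, and the zero-modulus guard is an addition for safety rather than something the text specifies.

```rust
/// Delay tranche for a `RelayVRFDelay` assignment, following the formula
/// `(output % (n_delay_tranches + zeroth_delay_tranche_width))
///     .saturating_sub(zeroth_delay_tranche_width)`.
fn delay_tranche(vrf_output: u64, n_delay_tranches: u32, zeroth_delay_tranche_width: u32) -> u32 {
    let width = u64::from(zeroth_delay_tranche_width);
    let modulus = u64::from(n_delay_tranches) + width;
    // Assumption: treat a degenerate configuration as tranche 0 instead of
    // dividing by zero.
    if modulus == 0 {
        return 0;
    }
    // The result is strictly less than `n_delay_tranches`, so it fits in u32.
    (vrf_output % modulus).saturating_sub(width) as u32
}
```

Every output that lands in the first `zeroth_delay_tranche_width` slots collapses to tranche 0, which is how the zeroth tranche ends up wider than the others. For example, `delay_tranche(5, 89, 10)` is 0 while `delay_tranche(50, 89, 10)` is 40.
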
+
+#### `import_and_circulate_approval(source: MessageSource, approval: IndirectSignedApprovalVote)`
+
+Imports an approval signature referenced by block hash and candidate index:
+
+  * Load the `BlockEntry` using `approval.block_hash` and the candidate entry using `approval.candidate_entry`. If
+    either does not exist, report the source if it is `MessageSource::Peer` and return.
+  * Compute a fingerprint for the approval.
+  * Compute a fingerprint for the corresponding assignment. If the `BlockEntry`'s knowledge does not contain that
+    fingerprint, then report the source if it is `MessageSource::Peer` and return. All references to a fingerprint after
+    this refer to the approval's, not the assignment's.
+  * If the source is `MessageSource::Peer(sender)`:
+    * check if `sender` appears under `known_by` and whether the fingerprint is in the knowledge of the peer. If the
+      peer does not know the block, report for providing data out-of-view and proceed. If the peer does know the block
+      and the `sent` knowledge contains the fingerprint, report for providing duplicate data and return, otherwise,
+      insert into the `received` knowledge and return.
+    * If the message fingerprint appears under the `BlockEntry`'s `Knowledge`, give the peer a small positive reputation
+      boost, add the fingerprint to the peer's knowledge only if it knows about the block and return. Note that we must
+      do this after checking for out-of-view to avoid being spammed. If we did this check earlier, a peer could provide
+      data out-of-view repeatedly and be rewarded for it.
+    * Construct a `SignedApprovalVote` using the candidate hashes and check against the validator's approval key,
+      based on the session info of the block. If invalid or no such validator, return `Err(InvalidVoteError)`.
+    * If the result of checking the signature is `Ok(CheckedIndirectSignedApprovalVote)`:
+      * Dispatch `ApprovalVotingMessage::ImportApproval(approval)`.
+      * Give the peer a positive reputation boost and add the fingerprint to both our and the peer's knowledge.
+    * If the result is `Err(InvalidVoteError)`:
+      * Report the peer and return.
+  * Load the candidate entry for the given candidate index. It should exist unless there is a logic error in the
+    approval voting subsystem.
+  * Set the approval state for the validator index to `ApprovalState::Approved`. It should already be in the `Assigned`
+    state as our `BlockEntry` knowledge contains a fingerprint for the assignment.
+  * Dispatch an `ApprovalDistributionV1Message::Approval(approval)` to all peers in the `BlockEntry`'s `known_by` set,
+    excluding the peer in the `source`, if `source` has kind `MessageSource::Peer`. Add the fingerprint of the
+    approval to the knowledge of each peer. Note that this obeys the politeness conditions:
+    * We guarantee elsewhere that all peers within `known_by` are aware of all assignments relative to the block.
+    * We've checked that this specific approval has a corresponding assignment within the `BlockEntry`.
+    * Thus, all peers are aware of the assignment or have a message to them in-flight which will make them so.
+
+#### `unify_with_peer(peer: PeerId, view)`
+
+1. Initialize a set `missing_knowledge = {}`.
+
+For each block in the view:
+  2. Load the `BlockEntry` for the block. If the block is unknown, or the number is less than or equal to the view's
+     finalized number, go to step 6.
+  3. Inspect the `known_by` set of the `BlockEntry`. If the peer already knows all assignments/approvals, go to step 6.
+  4. Add the peer to `known_by` and add the hash and missing knowledge of the block to `missing_knowledge`.
+  5. Return to step 2 with the ancestor of the block.
+
+6. For each block in `missing_knowledge`, send all assignments and approvals for all candidates in those blocks to the
+   peer.
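
The ancestor walk in `unify_with_peer` can be sketched as below. This is a simplification under stated assumptions: the types are stand-ins, `known_by` is a plain set, and a peer present in `known_by` is treated as already knowing every message for that block, whereas the text tracks per-message knowledge. The function returns the block hashes whose assignments and approvals still need to be sent.

```rust
use std::collections::{HashMap, HashSet};

type Hash = u64; // stand-in for a real block hash
type PeerId = u32; // stand-in for a real peer identifier
type BlockNumber = u32;

/// Reduced `BlockEntry` carrying only what the walk needs.
struct BlockEntry {
    number: BlockNumber,
    parent_hash: Hash,
    known_by: HashSet<PeerId>,
}

/// Walks back from each head in the peer's view, marks blocks as known by
/// the peer, and collects the blocks whose messages the peer is missing.
fn unify_with_peer(
    blocks: &mut HashMap<Hash, BlockEntry>,
    peer: PeerId,
    heads: &[Hash],
    finalized_number: BlockNumber,
) -> Vec<Hash> {
    let mut missing_knowledge = Vec::new();
    for &head in heads {
        let mut current = head;
        // The loop ends when we reach a block we do not know.
        while let Some(entry) = blocks.get_mut(&current) {
            // Stop at finalized blocks and at blocks the peer already knows;
            // `insert` returns false when the peer was already present.
            if entry.number <= finalized_number || !entry.known_by.insert(peer) {
                break;
            }
            missing_knowledge.push(current);
            current = entry.parent_hash;
        }
    }
    missing_knowledge
}
```

Calling it twice with the same view returns an empty list the second time, since every block on the path is already marked as known by the peer.
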
@@ -0,0 +1,30 @@
+# Approval voting parallel
+
+The approval-voting-parallel subsystem acts as an orchestrator for the tasks handled by the [Approval Voting](approval-voting.md)
+and [Approval Distribution](approval-distribution.md) subsystems. Initially, these two systems operated separately and interacted
+with each other and other subsystems through orchestra.
+
+With approval-voting-parallel, we have a single subsystem that creates two types of workers:
+- Four approval-distribution workers that operate in parallel, each handling tasks based on the validator_index of the message
+  originator.
+- One approval-voting worker that performs the tasks previously managed by the standalone approval-voting subsystem.
+
+This subsystem does not maintain any state. Instead, it functions as an orchestrator that:
+- Spawns and initializes each worker.
+- Forwards each message and signal to the appropriate worker.
+- Aggregates results for messages that require input from more than one worker, such as GetApprovalSignatures.
+
+## Forwarding logic
+
+The messages received and forwarded by approval-voting-parallel fall into four categories:
+- Signals, which need to be forwarded to all workers.
+- Messages that only the `approval-voting` worker needs to handle: `ApprovalVotingParallelMessage::ApprovedAncestor`
+  and `ApprovalVotingParallelMessage::GetApprovalSignaturesForCandidate`.
+- Control messages that all `approval-distribution` workers need to receive: `ApprovalVotingParallelMessage::NewBlocks`,
+  `ApprovalVotingParallelMessage::ApprovalCheckingLagUpdate`, and all network bridge variants of
+  `ApprovalVotingParallelMessage::NetworkBridgeUpdate` except
+  `ApprovalVotingParallelMessage::NetworkBridgeUpdate(NetworkBridgeEvent::PeerMessage)`.
+- Data messages, `ApprovalVotingParallelMessage::NetworkBridgeUpdate(NetworkBridgeEvent::PeerMessage)`, which need to be
+  sent to just a single `approval-distribution` worker based on the `ValidatorIndex`. The logic for assigning the work is:
+  ```
+  assigned_worker_index = validator_index % number_of_workers;
+  ```
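
The forwarding categories can be pictured as a small routing function. The `Incoming` and `Route` enums below are hypothetical simplifications introduced for this sketch; the real subsystem matches on `ApprovalVotingParallelMessage` variants and overseer signals.

```rust
/// Simplified stand-in for the traffic the orchestrator receives
/// (hypothetical type, one variant per forwarding category).
enum Incoming {
    Signal,
    VotingOnly,
    Control,
    PeerMessage { validator_index: u32 },
}

/// Where a given piece of traffic is forwarded.
#[derive(Debug, PartialEq)]
enum Route {
    AllWorkers,
    VotingWorker,
    AllDistributionWorkers,
    DistributionWorker(usize),
}

fn route(message: &Incoming, number_of_workers: usize) -> Route {
    assert!(number_of_workers > 0, "at least one distribution worker is required");
    match message {
        Incoming::Signal => Route::AllWorkers,
        Incoming::VotingOnly => Route::VotingWorker,
        Incoming::Control => Route::AllDistributionWorkers,
        // Data messages go to exactly one worker, chosen by the originator.
        Incoming::PeerMessage { validator_index } => {
            Route::DistributionWorker(*validator_index as usize % number_of_workers)
        }
    }
}
```

With four distribution workers, a message originating from validator 10 is handled by worker 2, and all messages from that validator are handled by the same worker.
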
@@ -0,0 +1,531 @@
+# Approval Voting
+
+Reading the [section on the approval protocol](../../protocol-approval.md) will likely be necessary to understand the
+aims of this subsystem.
+
+Approval votes are split into two parts: Assignments and Approvals. Validators first broadcast their assignment to
+indicate intent to check a candidate. Upon successfully checking, they don't immediately send the vote. Instead,
+they queue the check for a short period of time, `MAX_APPROVAL_COALESCE_WAIT_TICKS`, to give the validator the
+opportunity to vote for more than one candidate. Once `MAX_APPROVAL_COALESCE_WAIT_TICKS` have passed or at least
+`MAX_APPROVAL_COALESCE_COUNT` checks are ready, they broadcast an approval vote for all candidates. If a validator
+doesn't broadcast their approval vote shortly after issuing an assignment, this is an indication that they are
+being prevented from recovering or validating the block data and that more validators should self-select to
+check the candidate. This is known as a "no-show".
+
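The coalescing rule in the opening paragraph reduces to a small predicate. In this sketch the two limits are passed in as parameters because their values are not given here, and the function name and the notion of an "oldest queued tick" are assumptions made for illustration.

```rust
type Tick = u64;

/// Decides whether the queued approvals should be signed and broadcast now.
/// `max_wait_ticks` and `max_count` play the roles of
/// `MAX_APPROVAL_COALESCE_WAIT_TICKS` and `MAX_APPROVAL_COALESCE_COUNT`.
fn should_send_approvals(
    now: Tick,
    oldest_queued_at: Tick,
    queued: usize,
    max_wait_ticks: Tick,
    max_count: usize,
) -> bool {
    // Nothing to send if no check has completed yet.
    queued > 0
        && (queued >= max_count || now.saturating_sub(oldest_queued_at) >= max_wait_ticks)
}
```

Either condition alone is enough: a full batch is sent immediately, and a partial batch is sent once its oldest entry has waited long enough.
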
+The core of this subsystem is a Tick-based timer loop, where Ticks are 500ms. We also reason about time in terms of
+`DelayTranche`s, which measure the number of ticks elapsed since a block was produced. We track metadata for all
+un-finalized but included candidates. We compute our local assignments to check each candidate, as well as which
+`DelayTranche` those assignments may be minimally triggered at. As the same candidate may appear in more than one block,
+we must produce our potential assignments for each (Block, Candidate) pair. The timing loop is based on waiting for
+assignments to become no-shows or waiting to broadcast and begin our own assignment to check.
+
+Another main component of this subsystem is the logic for determining when a (Block, Candidate) pair has been approved
+and when to broadcast and trigger our own assignment. Once a (Block, Candidate) pair has been approved, we mark a
+corresponding bit in the `BlockEntry` that indicates the candidate has been approved under the block. When we trigger
+our own assignment, we broadcast it via Approval Distribution, begin fetching the data from Availability Recovery, and
+then pass it through to the Candidate Validation. Once these steps are successful, we issue our approval vote. If any of
+these steps fail, we don't issue any vote and will "no-show" from the perspective of other validators. In addition, a
+dispute is raised via the dispute-coordinator by sending `IssueLocalStatement`.
+
+Where this all fits into Pezkuwi is via block finality. Our goal is to not finalize any block containing a candidate
+that is not approved. We provide a hook for a custom GRANDPA voting rule - GRANDPA makes requests of the form (target,
+minimum) consisting of a target block (i.e. longest chain) that it would like to finalize, and a minimum block which,
+due to the rules of GRANDPA, must be voted on. The minimum is typically the last finalized block, but may be beyond it,
+in the case of having a last-round-estimate beyond the last finalized. Thus, our goal is to inform GRANDPA of some block
+between target and minimum which we believe can be finalized safely. We do this by iterating backwards from the target
+to the minimum and finding the longest continuous chain from minimum where all candidates included by those blocks have
+been approved.
+
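The search for a safe block to report to GRANDPA can be sketched as follows. This is an illustration under assumptions: the chain between `minimum` and `target` is taken as already collected into a slice in ascending order, each entry says whether all candidates included by that block are approved, and the function falls back to `minimum` when nothing beyond it qualifies. The name `approved_ancestor` mirrors the message name but the signature is hypothetical.

```rust
/// `chain` lists the blocks after `minimum` up to and including `target`,
/// in ascending order, as (block number, all included candidates approved?).
/// Returns the highest block reachable from `minimum` through approved
/// blocks only.
fn approved_ancestor(minimum: u32, chain: &[(u32, bool)]) -> u32 {
    chain
        .iter()
        // Stop at the first block that still has an unapproved candidate.
        .take_while(|entry| entry.1)
        .last()
        .map_or(minimum, |entry| entry.0)
}
```

For a chain `[(11, true), (12, true), (13, false), (14, true)]` above a minimum of 10, the result is 12: block 14 is approved but sits behind the unapproved block 13, so it cannot be finalized yet.
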
+## Protocol
+
+Input:
+  * `ApprovalVotingMessage::ImportAssignment`
+  * `ApprovalVotingMessage::ImportApproval`
+  * `ApprovalVotingMessage::ApprovedAncestor`
+
+Output:
+  * `ApprovalDistributionMessage::DistributeAssignment`
+  * `ApprovalDistributionMessage::DistributeApproval`
+  * `RuntimeApiMessage::Request`
+  * `ChainApiMessage`
+  * `AvailabilityRecoveryMessage::Recover`
+  * `CandidateExecutionMessage::ValidateFromExhaustive`
+
+## Functionality
+
+The approval voting subsystem is responsible for casting votes and determining approval of candidates and as a result,
+blocks.
+
+This subsystem wraps a database which is used to store metadata about unfinalized blocks and the candidates within them.
+Candidates may appear in multiple blocks, and assignment criteria are chosen differently based on the hash of the block
+they appear in.
+
+## Database Schema
+
+The database schema is designed with the following goals in mind:
+  1. To provide an easy index from unfinalized blocks to candidates
+  1. To provide a lookup from candidate hash to approval status
+  1. To be easy to clear on start-up. What has happened while we were offline is unimportant.
+  1. To be fast to clear entries outdated by finality
+
+Structs:
+
+```rust
+struct TrancheEntry {
+    tranche: DelayTranche,
+    // assigned validators who have not yet approved, and the instant we received
+    // their assignment.
+    assignments: Vec<(ValidatorIndex, Tick)>,
+}
+
+pub struct OurAssignment {
+	/// Our assignment certificate.
+	cert: AssignmentCertV2,
+	/// The tranche the assignment refers to.
+	tranche: DelayTranche,
+	/// Our validator index for the session in which the candidates were included.
+	validator_index: ValidatorIndex,
+	/// Whether the assignment has been triggered already.
+	triggered: bool,
+}
+
+pub struct ApprovalEntry {
+	tranches: Vec<TrancheEntry>, // sorted ascending by tranche number.
+	backing_group: GroupIndex,
+	our_assignment: Option<OurAssignment>,
+	our_approval_sig: Option<ValidatorSignature>,
+	assigned_validators: Bitfield, // `n_validators` bits.
+	approved: bool,
+}
+
+
+struct CandidateEntry {
+    candidate: CandidateReceipt,
+    session: SessionIndex,
+    // Assignments are based on blocks, so we need to track assignments separately
+    // based on the block we are looking at.
+    block_assignments: HashMap<Hash, ApprovalEntry>,
+    approvals: Bitfield, // n_validators bits
+}
+
+struct BlockEntry {
+    block_hash: Hash,
+    session: SessionIndex,
+    slot: Slot,
+    // random bytes derived from the VRF submitted within the block by the block
+    // author as a credential and used as input to approval assignment criteria.
+    relay_vrf_story: [u8; 32],
+    // The candidates included as-of this block and the index of the core they are
+    // leaving. Sorted ascending by core index.
+    candidates: Vec<(CoreIndex, Hash)>,
+    // A bitfield where the i'th bit corresponds to the i'th candidate in `candidates`.
+    // The i'th bit is `true` iff the candidate has been approved in the context of
+    // this block. The block can be considered approved if all bits are set to 1.
+    approved_bitfield: Bitfield,
+    children: Vec<Hash>,
+    // A list of candidates we have checked, but for which we have not yet signed
+    // and advertised the vote.
+    candidates_pending_signature: BTreeMap<CandidateIndex, CandidateSigningContext>,
+    // Assignments we already distributed. A 1 bit marks a candidate index for which
+    // we have already sent out an assignment. We need this to avoid distributing
+    // assignments that cover multiple cores more than once.
+    distributed_assignments: Bitfield,
+}
+
+// slot_duration * 2 + DelayTranche gives the number of delay tranches since the
+// unix epoch.
+type Tick = u64;
+
+struct StoredBlockRange(BlockNumber, BlockNumber);
+```
+
+In the schema, we map
+
+```
+"StoredBlocks" => StoredBlockRange
+BlockNumber => Vec<BlockHash>
+BlockHash => BlockEntry
+CandidateHash => CandidateEntry
+```
+
+## Logic
+
+```rust
+const APPROVAL_SESSIONS: SessionIndex = 6;
+
+// The minimum amount of ticks that an assignment must have been known for.
+const APPROVAL_DELAY: Tick = 2;
+```
+
+In-memory state:
+
+```rust
+struct ApprovalVoteRequest {
+  validator_index: ValidatorIndex,
+  block_hash: Hash,
+  candidate_index: CandidateIndex,
+}
+
+// Requests that background work (approval voting tasks) may need to make of the main subsystem
+// task.
+enum BackgroundRequest {
+  ApprovalVote(ApprovalVoteRequest),
+  // .. others, unspecified as per implementation.
+}
+
+// This is the general state of the subsystem. The actual implementation may split this
+// into further pieces.
+struct State {
+    earliest_session: SessionIndex,
+    session_info: Vec<SessionInfo>,
+    babe_epoch: Option<BabeEpoch>, // information about a cached BABE epoch.
+    keystore: Keystore,
+
+    // A scheduler which keeps at most one wakeup per (block hash, candidate hash)
+    // pair and maps such pairs to `Tick`s.
+    wakeups: Wakeups,
+
+    // These are connected to each other.
+    background_tx: mpsc::Sender<BackgroundRequest>,
+    background_rx: mpsc::Receiver<BackgroundRequest>,
+}
+```
+
+This guide section makes no explicit references to writes to or reads from disk. Instead, it handles them implicitly,
+with the understanding that updates to block, candidate, and approval entries are persisted to disk.
+
+[`SessionInfo`](../../runtime/session_info.md)
+
+On start-up, we clear everything currently stored by the database. This is done by loading the `StoredBlockRange`,
+iterating through each block number, iterating through each block hash, and iterating through each candidate referenced
+by each block. Although this is `O(o*n*p)`, we expect only a few unfinalized blocks at any time, and a few thousand in
+extreme cases. The clearing operation should be relatively fast as a result.
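
The start-up clearing walk described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the subsystem's implementation: the `Db` struct, its in-memory `HashMap` fields, and the half-open interpretation of `StoredBlockRange` are all hypothetical stand-ins for the real on-disk store.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins; the guide does not fix these representations.
type BlockNumber = u32;
type Hash = [u8; 32];

#[derive(Default)]
struct Db {
    // "StoredBlocks" => StoredBlockRange, assumed here to be half-open: [start, end).
    stored_range: Option<(BlockNumber, BlockNumber)>,
    // BlockNumber => Vec<BlockHash>
    blocks_at_height: HashMap<BlockNumber, Vec<Hash>>,
    // BlockHash => candidate hashes referenced by that block's entry.
    block_entries: HashMap<Hash, Vec<Hash>>,
    // CandidateHash => candidate entry (contents omitted in this sketch).
    candidate_entries: HashMap<Hash, ()>,
}

/// Walks range -> heights -> blocks -> candidates, deleting every key visited.
/// Returns the number of keys deleted.
fn clear_on_startup(db: &mut Db) -> usize {
    let mut deleted = 0usize;
    let Some((start, end)) = db.stored_range.take() else { return 0 };
    deleted += 1;
    for number in start..end {
        let Some(hashes) = db.blocks_at_height.remove(&number) else { continue };
        deleted += 1;
        for block_hash in hashes {
            let Some(candidates) = db.block_entries.remove(&block_hash) else { continue };
            deleted += 1;
            for candidate_hash in candidates {
                // A candidate may be referenced by several blocks; delete it only once.
                if db.candidate_entries.remove(&candidate_hash).is_some() {
                    deleted += 1;
                }
            }
        }
    }
    deleted
}
```

The three nested loops correspond to the three factors of the `O(o*n*p)` bound.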
+
+Main loop:
+  * Each iteration, select over all of
+    * The next `Tick` in `wakeups`: trigger `wakeup_process` for each `(Hash, Hash)` pair scheduled under the `Tick` and
+      then remove all entries under the `Tick`.
+    * The next message from the overseer: handle the message as described in the [Incoming Messages
+      section](#incoming-messages)
+    * The next approval vote request from `background_rx`
+      * If this is an `ApprovalVoteRequest`, [Issue an approval vote](#issue-approval-vote).
+
+### Incoming Messages
+
+#### `OverseerSignal::BlockFinalized`
+
+On receiving an `OverseerSignal::BlockFinalized(h)`, we fetch the block number `b` of that block from the `ChainApi`
+subsystem. We update our `StoredBlockRange` to begin at `b+1`. Additionally, we remove all block entries and candidates
+referenced by them up to and including `b`. Lastly, we prune out competing forks transitively: when we remove a
+`BlockEntry` with number `b` that is not equal to `h`, we recursively delete all the `BlockEntry`s referenced as its
+children. We remove the `block_assignments` entry for the block hash and if `block_assignments` is now empty, remove the
+`CandidateEntry`. We also update each of the `BlockNumber -> Vec<Hash>` keys in the database to reflect the blocks at
+that height, clearing if empty.
+
+
+#### `OverseerSignal::ActiveLeavesUpdate`
+
+On receiving an `OverseerSignal::ActiveLeavesUpdate(update)`:
+  * We determine the set of new blocks that were not in our previous view. This is done by querying the ancestry of all
+    new items in the view and contrasting against the stored `BlockNumber`s. Typically, there will be only one new
+    block. We fetch the headers and information on these blocks from the `ChainApi` subsystem. Stale leaves in the
+    update can be ignored.
+  * We update the `StoredBlockRange` and the `BlockNumber` maps.
+  * We use the `RuntimeApiSubsystem` to determine information about these blocks. It is generally safe to assume that
+    runtime state is available for recent, unfinalized blocks. In the case that it isn't, it means that we are catching
+    up to the head of the chain and needn't worry about assignments to those blocks anyway, as the security assumption
+    of the protocol tolerates nodes being temporarily offline or out-of-date.
+    * We fetch the set of candidates included by each block by dispatching a `RuntimeApiRequest::CandidateEvents` and
+      checking the `CandidateIncluded` events.
+    * We fetch the session of the block by dispatching a `session_index_for_child` request with the parent-hash of the
+      block.
+    * If `session index - APPROVAL_SESSIONS > state.earliest_session`, then bump `state.earliest_session` to that
+      amount and prune earlier sessions.
+    * If the session isn't in our `state.session_info`, load the session info for it and for all sessions since the
+      earliest-session, including the earliest-session itself if it is missing. It can be missing just after pruning
+      if we've made a big jump forward, as is the case when we've just finished chain synchronization.
+    * If any of the runtime API calls fail, we just warn and skip the block.
+  * We use the `RuntimeApiSubsystem` to determine the set of candidates included in these blocks and use BABE logic to
+    determine the slot number and VRF of the blocks.
+  * We also note how late we appear to have received the block. We create a `BlockEntry` for each block and a
+    `CandidateEntry` for each candidate obtained from `CandidateIncluded` events after making a
+    `RuntimeApiRequest::CandidateEvents` request.
+  * For each candidate, if the amount of needed approvals is more than the validators remaining after the backing group
+    of the candidate is subtracted, then the candidate is insta-approved as approval would be impossible otherwise. If
+    all candidates in the block are insta-approved, or there are no candidates in the block, then the block is
+    insta-approved. If the block is insta-approved, a [`ChainSelectionMessage::Approved`][CSM] should be sent for the
+    block.
+  * Ensure that the `CandidateEntry` contains a `block_assignments` entry for the block, with the correct backing group
+    set.
+  * If we are a validator in this session, compute and assign `our_assignment` for the `block_assignments`
+    * Only if not a member of the backing group.
+    * Run `RelayVRFModulo` and `RelayVRFDelay` according to the [approvals protocol
+      section](../../protocol-approval.md#assignment-criteria). Ensure that the assigned core derived from the output is
+      covered by the auxiliary signature aggregated in the `VRFProof`.
+  * [Handle Wakeup](#handling-wakeup) for each new candidate in each new block - this will automatically broadcast a
+    0-tranche assignment, kick off approval work, and schedule the next delay.
+  * Dispatch an `ApprovalDistributionMessage::NewBlocks` with the meta information filled out for each new block.
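
The insta-approval rule from the list above can be sketched as two small predicates. The function names and the flat `usize` parameters are assumptions made for illustration; the real subsystem derives these values from `SessionInfo` and the candidate's backing group.

```rust
/// Returns true when approval checking could never gather `needed_approvals`
/// votes, because too few validators remain outside the backing group.
fn candidate_insta_approved(
    n_validators: usize,
    backing_group_size: usize,
    needed_approvals: usize,
) -> bool {
    n_validators.saturating_sub(backing_group_size) < needed_approvals
}

/// A block is insta-approved when it contains no candidates, or when every
/// candidate in it is insta-approved. `backing_group_sizes` holds one entry
/// per candidate included in the block.
fn block_insta_approved(
    n_validators: usize,
    backing_group_sizes: &[usize],
    needed_approvals: usize,
) -> bool {
    backing_group_sizes
        .iter()
        .all(|&size| candidate_insta_approved(n_validators, size, needed_approvals))
}
```

Because `Iterator::all` returns `true` for an empty iterator, a block without candidates is reported as insta-approved without a special case.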
+
+#### `ApprovalVotingMessage::ImportAssignment`
+
+On receiving an `ApprovalVotingMessage::ImportAssignment` message, we assume the assignment cert itself has already been
+checked to be valid, and proceed to import the assignment inside the block entry. The cert itself contains
+information necessary to determine the candidate that is being assigned-to. In detail:
+  * Load the `BlockEntry` for the relay-parent referenced by the message. If there is none, return
+    `AssignmentCheckResult::Bad`.
+  * Fetch the `SessionInfo` for the session of the block
+  * Determine the assignment key of the validator based on that.
+  * Determine the claimed core index by looking up the candidate with given index in `block_entry.candidates`. Return
+    `AssignmentCheckResult::Bad` if missing.
+  * Import the assignment.
+    * Load the candidate in question and access the `approval_entry` for the block hash the cert references.
+    * Ignore if we already observe the validator as having been assigned.
+    * Ensure the validator index is not part of the backing group for the candidate.
+    * Ensure the validator index is not present in the approval entry already.
+    * Create a tranche entry for the delay tranche in the approval entry and note the assignment within it.
+    * Note the candidate index within the approval entry.
+  * [Schedule a wakeup](#schedule-wakeup) for this block, candidate pair.
+  * Return the appropriate `AssignmentCheckResult` on the response channel.
+
+#### `ApprovalVotingMessage::ImportApproval`
+
+On receiving an `ImportApproval(indirect_approval_vote, response_channel)` message:
+  * Fetch the `BlockEntry` from the indirect approval vote's `block_hash`. If none, return `ApprovalCheckResult::Bad`.
+  * Fetch all `CandidateEntry` from the indirect approval vote's `candidate_indices`. If the block did not trigger
+    inclusion of enough candidates, return `ApprovalCheckResult::Bad`.
+  * Send `ApprovalCheckResult::Accepted`
+  * [Import the checked approval vote](#import-checked-approval) for all candidates
+
+#### `ApprovalVotingMessage::ApprovedAncestor`
+
+On receiving an `ApprovedAncestor(Hash, BlockNumber, response_channel)`:
+  * Iterate over the ancestry of the hash all the way back to the given block number, starting from the provided block
+    hash. Load the `CandidateHash`es from each block entry.
+  * Keep track of an `all_approved_max: Option<(Hash, BlockNumber, Vec<(Hash, Vec<CandidateHash>)>)>`.
+  * For each block hash encountered, load the `BlockEntry` associated. If any are not found, return `None` on the
+    response channel and conclude.
+  * If the block entry's `approved_bitfield` has all bits set to 1 and `all_approved_max == None`, set
+    `all_approved_max = Some((current_hash, current_number, Vec::new()))`.
+  * If the block entry's `approved_bitfield` has any 0 bits, set `all_approved_max = None`.
+  * If `all_approved_max` is `Some`, push the current block hash and candidate hashes onto the list of blocks and
+    candidates in `all_approved_max`.
+  * After iterating all ancestry, return `all_approved_max`.
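
The `ApprovedAncestor` walk above can be sketched as a single pass over the ancestry. This is a simplified illustration: the integer `Hash` types, the `number` field on `BlockEntry`, and the pre-fetched `ancestry` slice (ordered from the provided hash backwards) are assumptions; the subsystem itself obtains ancestry from the `ChainApi` subsystem.

```rust
use std::collections::HashMap;

// Hypothetical simplified types for this sketch.
type Hash = u64;
type CandidateHash = u64;
type BlockNumber = u32;

struct BlockEntry {
    number: BlockNumber,
    candidates: Vec<CandidateHash>,
    approved_bitfield: Vec<bool>,
}

type AllApprovedMax = (Hash, BlockNumber, Vec<(Hash, Vec<CandidateHash>)>);

/// `ancestry` starts at the provided block hash and moves towards older blocks.
fn approved_ancestor(
    entries: &HashMap<Hash, BlockEntry>,
    ancestry: &[Hash],
) -> Option<AllApprovedMax> {
    let mut all_approved_max: Option<AllApprovedMax> = None;
    for &hash in ancestry {
        // A missing entry means we cannot answer: return `None` and conclude.
        let entry = entries.get(&hash)?;
        if entry.approved_bitfield.iter().all(|bit| *bit) {
            // Start a new run at this block if none is in progress, then record it.
            let (_, _, blocks) = all_approved_max
                .get_or_insert_with(|| (hash, entry.number, Vec::new()));
            blocks.push((hash, entry.candidates.clone()));
        } else {
            // An unapproved block invalidates everything seen above it.
            all_approved_max = None;
        }
    }
    all_approved_max
}
```

The result names the highest block whose entire ancestry, down to the oldest block visited, is approved.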
+
+### Updates and Auxiliary Logic
+
+#### Import Checked Approval
+  * Import an approval vote which we can assume to have passed signature checks and correspond to an imported
+    assignment.
+  * Requires `(BlockEntry, CandidateEntry, ValidatorIndex)`
+  * Set the corresponding bit of the `approvals` bitfield in the `CandidateEntry` to `1`. If already `1`, return.
+  * Check the approval state of the candidate under the specific block, and update the block and candidate entries
+    accordingly.
+  * Check the `ApprovalEntry` for the block.
+    * [Determine the tranches to inspect](#determine-required-tranches) of the candidate.
+    * If [the candidate is approved under the block](#check-approval), set the corresponding bit in the
+      `block_entry.approved_bitfield`.
+    * If the block is now fully approved and was not before, send a [`ChainSelectionMessage::Approved`][CSM].
+    * Otherwise, [schedule a wakeup of the candidate](#schedule-wakeup).
+  * If the approval vote originates locally, set the `our_approval_sig` in the approval entry.
+
+#### Handling Wakeup
+  * Handle a previously-scheduled wakeup of a candidate under a specific block.
+  * Requires `(relay_block, candidate_hash)`
+  * Load the `BlockEntry` and `CandidateEntry` from disk. If either is not present, this may have lost a race with
+    finality and can be ignored. Also load the `ApprovalEntry` for the block and candidate.
+  * [Determine the `RequiredTranches` of the candidate](#determine-required-tranches).
+  * Determine if we should trigger our assignment.
+    * If we've already triggered or `OurAssignment` is `None`, we do not trigger.
+    * If we have `RequiredTranches::All`, then we trigger if the candidate is [not approved](#check-approval). We have
+      no next wakeup as we assume that other validators are doing the same and we will be implicitly woken up by
+      handling new votes.
+    * If we have `RequiredTranches::Pending { considered, next_no_show, maximum_broadcast, clock_drift }`, then we
+      trigger if our assignment's tranche is less than or equal to `maximum_broadcast` and the current tick, with
+      `clock_drift` applied, is at least the tick of our tranche.
+    * If we have `RequiredTranches::Exact { .. }` then we do not trigger, because this value indicates that no new
+      assignments are needed at the moment.
+  * If we should trigger our assignment
+    * Import the assignment to the `ApprovalEntry`
+    * Broadcast on network with an `ApprovalDistributionMessage::DistributeAssignment`.
+    * [Launch approval work](#launch-approval-work) for the candidate.
+  * [Schedule a new wakeup](#schedule-wakeup) of the candidate.
+
+#### Schedule Wakeup
+
+  * Requires `(approval_entry, candidate_entry)` which effectively denotes a `(Block Hash, Candidate Hash)` pair - the
+    candidate, along with the block it appears in.
+  * Also requires `RequiredTranches`
+  * If the `approval_entry` is approved, this doesn't need to be woken up again.
+  * If `RequiredTranches::All` - no wakeup. We assume other incoming votes will trigger wakeup and potentially
+    re-schedule.
+  * If `RequiredTranches::Pending { considered, next_no_show, maximum_broadcast, clock_drift }` - schedule at the
+    lesser of the next no-show tick or the tick, offset positively by `clock_drift`, of the next non-empty tranche we
+    are aware of after `considered`, including any tranche containing our own unbroadcast assignment. This can lead
+    to no wakeup in the case that we have already broadcast our assignment and there are no pending no-shows; that is,
+    we have approval votes for every assignment we've received that is not already a no-show. In this case, we will be
+    re-triggered by other validators broadcasting their assignments.
+  * If `RequiredTranches::Exact { next_no_show, last_assignment_tick, .. }` - set a wakeup for the earlier of the next
+    no-show tick or the last assignment tick + `APPROVAL_DELAY`.
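
The scheduling rules above reduce to taking the earlier of two optional ticks. The sketch below assumes a trimmed-down `RequiredTranches` carrying only the fields that matter for scheduling, and a hypothetical `next_non_empty_tranche_tick` argument standing in for the lookup of the next non-empty tranche after `considered`.

```rust
type Tick = u64;

const APPROVAL_DELAY: Tick = 2;

// Only the fields relevant to scheduling are kept in this sketch.
enum RequiredTranches {
    All,
    Pending { next_no_show: Option<Tick>, clock_drift: Tick },
    Exact { next_no_show: Option<Tick>, last_assignment_tick: Option<Tick> },
}

/// The earlier of two optional ticks; `None` only if both are absent.
fn min_opt(a: Option<Tick>, b: Option<Tick>) -> Option<Tick> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x.min(y)),
        (x, None) => x,
        (None, y) => y,
    }
}

/// Returns the tick of the next wakeup, or `None` if no wakeup is needed.
fn next_wakeup(
    approved: bool,
    required: &RequiredTranches,
    next_non_empty_tranche_tick: Option<Tick>,
) -> Option<Tick> {
    if approved {
        return None;
    }
    match required {
        // Incoming votes are expected to re-trigger us.
        RequiredTranches::All => None,
        RequiredTranches::Pending { next_no_show, clock_drift } => min_opt(
            *next_no_show,
            next_non_empty_tranche_tick.map(|tick| tick + clock_drift),
        ),
        RequiredTranches::Exact { next_no_show, last_assignment_tick } => min_opt(
            *next_no_show,
            last_assignment_tick.map(|tick| tick + APPROVAL_DELAY),
        ),
    }
}
```

A `None` result in the `Pending` case corresponds to the situation described above where our assignment is already broadcast and no no-shows are pending.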
+
+#### Launch Approval Work
+
+* Requires `(SessionIndex, SessionInfo, CandidateReceipt, ValidatorIndex, backing_group, block_hash, candidate_index)`
+* Extract the public key of the `ValidatorIndex` from the `SessionInfo` for the session.
+* Issue an `AvailabilityRecoveryMessage::RecoverAvailableData(candidate, session_index, Some(backing_group),
+  Some(core_index), response_sender)`
+* Load the historical validation code of the teyrchain by dispatching a
+  `RuntimeApiRequest::ValidationCodeByHash(descriptor.validation_code_hash)` against the state of `block_hash`.
+* Spawn a background task with a clone of `background_tx`
+  * Wait for the available data
+  * Issue a `CandidateValidationMessage::ValidateFromExhaustive` message with `APPROVAL_EXECUTION_TIMEOUT` as the
+    timeout parameter.
+  * Wait for the result of validation
+  * Check that the result of validation, if valid, matches the commitments in the receipt.
+  * If valid, issue a message on `background_tx` detailing the request.
+  * If any of the data, the candidate, or the commitments are invalid, issue on `background_tx` a
+    [`DisputeCoordinatorMessage::IssueLocalStatement`](../../types/overseer-protocol.md#dispute-coordinator-message)
+    with `valid = false` to initiate a dispute.
+
+#### Issue Approval Vote
+  * Fetch the block entry and candidate entry. Ignore if `None` - we've probably just lost a race with finality.
+  * [Import the checked approval vote](#import-checked-approval). It is "checked" as we've just issued the signature.
+  * IF `MAX_APPROVAL_COALESCE_COUNT` candidates are in the waiting queue
+    * Construct a `SignedApprovalVote` with the validator index for the session and all candidate hashes in the waiting queue.
+    * Construct an `IndirectSignedApprovalVote` using the information about the vote.
+    * Dispatch `ApprovalDistributionMessage::DistributeApproval`.
+  * ELSE
+    * Queue the candidate in the `BlockEntry::candidates_pending_signature`.
+    * Arm a per-`BlockEntry` timer with the latest tick at which we can send the vote.
+
+### Delayed vote distribution
+  * [Issue Approval Vote](#issue-approval-vote) arms a per-block timer once if there is no requirement to send the
+    vote immediately.
+  * When the timer wakes up it will either:
+    * IF there is a candidate in the queue past its sending tick:
+      * Construct a `SignedApprovalVote` with the validator index for the session and all candidate hashes in the waiting queue.
+      * Construct an `IndirectSignedApprovalVote` using the information about the vote.
+      * Dispatch `ApprovalDistributionMessage::DistributeApproval`.
+    * ELSE
+      * Re-arm the timer with the latest tick at which we can send the vote.
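
The coalescing behaviour described in this section and in Issue Approval Vote can be sketched as a small queue with two entry points. Everything here is illustrative: the `Action` enum, the `PendingSignatures` type, and the `max_coalesce` parameter (standing in for `MAX_APPROVAL_COALESCE_COUNT`) are hypothetical, and the sketch assumes the timer should fire at the earliest deadline among the queued candidates.

```rust
type Tick = u64;
type CandidateIndex = u32;

#[derive(Debug, PartialEq)]
enum Action {
    /// Sign one vote covering these candidates and distribute it.
    SendCoalesced(Vec<CandidateIndex>),
    /// Nothing to send yet; wake up again at this tick.
    ArmTimer(Tick),
    /// The queue is empty; nothing to do.
    Idle,
}

#[derive(Default)]
struct PendingSignatures {
    // (candidate index, last tick at which its vote may still be sent)
    queue: Vec<(CandidateIndex, Tick)>,
}

impl PendingSignatures {
    fn flush(&mut self) -> Action {
        Action::SendCoalesced(self.queue.drain(..).map(|(candidate, _)| candidate).collect())
    }

    fn earliest_deadline(&self) -> Option<Tick> {
        self.queue.iter().map(|(_, deadline)| *deadline).min()
    }

    /// Called when a candidate has been checked and its vote is ready to sign.
    fn push(&mut self, candidate: CandidateIndex, deadline: Tick, max_coalesce: usize) -> Action {
        self.queue.push((candidate, deadline));
        if self.queue.len() >= max_coalesce {
            self.flush()
        } else {
            // The queue is non-empty here, so a deadline always exists.
            Action::ArmTimer(self.earliest_deadline().expect("just pushed an entry"))
        }
    }

    /// Called when the per-block timer fires.
    fn on_timer(&mut self, now: Tick) -> Action {
        match self.earliest_deadline() {
            None => Action::Idle,
            Some(deadline) if deadline <= now => self.flush(),
            Some(deadline) => Action::ArmTimer(deadline),
        }
    }
}
```

Sending every queued candidate as soon as one deadline expires keeps the number of signatures low while still bounding the delay of each individual vote.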
+
+### Determining Approval of Candidate
+
+#### Determine Required Tranches
+
+This logic is for inspecting an approval entry that tracks the assignments received, along with information on which
+assignments have corresponding approval votes. Inspection also involves the current time and expected requirements and
+is used to help the higher-level code determine the following:
+  * Whether to broadcast the local assignment
+  * Whether to check that the candidate entry has been completely approved.
+  * If the candidate is waiting on approval, when to schedule the next wakeup of the `(candidate, block)` pair at a
+    point where the state machine could be advanced.
+
+These routines are pure functions which only depend on the environmental state. The expectation is that this
+determination is re-run every time we attempt to update an approval entry: either when we trigger a wakeup to advance
+the state machine based on a no-show or our own broadcast, or when we receive further assignments or approvals from the
+network.
+
+Thus it may be that at some point in time, we consider that tranches 0..X are required to be considered, but as we
+receive more information, we might require fewer tranches. Or votes that we perceived to be missing and require
+replacement are filled in and change our view.
+
+Requires `(approval_entry, approvals_received, tranche_now, block_tick, no_show_duration, needed_approvals)`
+
+```rust
+enum RequiredTranches {
+  // All validators appear to be required, based on tranches already taken and remaining no-shows.
+  All,
+  // More tranches required - We're awaiting more assignments.
+  Pending {
+    /// The highest considered delay tranche when counting assignments.
+    considered: DelayTranche,
+    /// The tick at which the next no-show, of the assignments counted, would occur.
+    next_no_show: Option<Tick>,
+    /// The highest tranche to consider when looking to broadcast own assignment.
+    /// This should be considered along with the clock drift to avoid broadcasting
+    /// assignments that are before the local time.
+    maximum_broadcast: DelayTranche,
+    /// The clock drift, in ticks, to apply to the local clock when determining whether
+    /// to broadcast an assignment or when to schedule a wakeup. The local clock should be treated
+    /// as though it is `clock_drift` ticks earlier.
+    clock_drift: Tick,
+  },
+  // An exact number of required tranches and a number of no-shows. This indicates that the amount of `needed_approvals`
+  // are assigned and additionally all no-shows are covered.
+  Exact {
+    /// The tranche to inspect up to.
+    needed: DelayTranche,
+    /// The amount of missing votes that should be tolerated.
+    tolerated_missing: usize,
+    /// When the next no-show would be, if any. This is used to schedule the next wakeup in the
+    /// event that there are some assignments that don't have corresponding approval votes. If this
+    /// is `None`, all assignments have approvals.
+    next_no_show: Option<Tick>,
+    /// The last tick at which a needed assignment was received.
+    last_assignment_tick: Option<Tick>,
+  }
+}
+```
+
+**Clock-drift and Tranche-taking**
+
+Our vote-counting procedure depends heavily on how we interpret time based on the presence of no-shows - assignments
+which have no corresponding approval after some time.
+
+This is because of how we handle no-shows: we keep track of the depth of no-shows we are covering.
+
+As an example: there may be initial no-shows in tranche 0. It'll take `no_show_duration` ticks before those are
+considered no-shows. Then, we don't want to immediately take `no_show_duration` more tranches. Instead, we want to take
+one tranche for each uncovered no-show. However, as we take those tranches, there may be further no-shows. Since these
+depth-1 no-shows should have only been triggered after the depth-0 no-shows were already known to be no-shows, we need
+to discount the local clock by `no_show_duration` to see whether these should be considered no-shows or not. There may
+be malicious parties who broadcast their assignment earlier than they were meant to, who shouldn't be counted as instant
+no-shows. We continue onwards to cover all depth-1 no-shows which may lead to depth-2 no-shows and so on.
+
+Likewise, when considering how many tranches to take, the no-show depth should be used to apply a depth-discount or
+clock drift to the `tranche_now`.
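
As a small worked illustration of the depth-discount, the function below applies a drift of `depth * no_show_duration` to `tranche_now`. It is a sketch that assumes `no_show_duration` is expressed in the same unit as the tranche number; the function name is hypothetical.

```rust
type DelayTranche = u32;

/// The tranche that the local clock is treated as being in once the
/// no-show depth has been discounted. Saturates at tranche 0.
fn drifted_tranche_now(
    tranche_now: DelayTranche,
    depth: u32,
    no_show_duration: u32,
) -> DelayTranche {
    let clock_drift = depth.saturating_mul(no_show_duration);
    tranche_now.saturating_sub(clock_drift)
}
```

With `no_show_duration = 6` and `tranche_now = 20`, depth 0 considers tranches up to 20, depth 1 up to 14, and depth 2 up to 8, so each further layer of no-shows is judged against a correspondingly earlier clock.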
+
+**Procedure**
+
+  * Start with `depth = 0`.
+  * Set a clock drift of `depth * no_show_duration`
+  * Take tranches up to `tranche_now - clock_drift` until all needed assignments are met.
+  * Keep track of the `next_no_show` according to the clock drift, as we go.
+  * Keep track of the `last_assignment_tick` as we go.
+  * If running out of tranches before then, return `Pending { considered, next_no_show, maximum_broadcast, clock_drift
+    }`
+  * If there are no no-shows, return `Exact { needed, tolerated_missing, next_no_show, last_assignment_tick }`
+  * `maximum_broadcast` is either `DelayTranche::max_value()` at tranche 0, or otherwise the last considered tranche +
+    the number of uncovered no-shows at this point.
+  * If there are no-shows, return to the beginning, incrementing `depth` and attempting to cover the number of no-shows.
+    Each no-show must be covered by a non-empty tranche, that is, a tranche with at least one assignment. Each
+    non-empty tranche covers exactly one no-show.
+  * If at any point, it seems that all validators are required, do an early return with `RequiredTranches::All` which
+    indicates that everyone should broadcast.
+
+#### Check Approval
+  * Check whether a candidate is approved under a particular block.
+  * Requires `(block_entry, candidate_entry, approval_entry, n_tranches)`
+  * If we have `3 * n_approvals > n_validators`, return true. This is because any set with f+1 validators must have at
+    least one honest validator, who has approved the candidate.
+  * If `n_tranches` is `RequiredTranches::Pending`, return false
+  * If `n_tranches` is `RequiredTranches::All`, return false.
+  * If `n_tranches` is `RequiredTranches::Exact { needed, tolerated_missing, last_assignment_tick, .. }`, then we
+    return whether all assigned validators up to `needed` less `tolerated_missing` have approved and
+    `last_assignment_tick + APPROVAL_DELAY <= tick_now`.
+    * e.g. if we had 5 tranches and 1 tolerated missing, we would accept only if all but 1 of assigned validators in
+      tranches 0..=5 have approved. In that example, we also accept all validators in tranches 0..=5 having approved,
+      but that would indicate that the `RequiredTranches` value was incorrectly constructed, so it is not realistic.
+      `tolerated_missing` actually represents covered no-shows. If there are more missing approvals than there are
+      tolerated missing, that indicates that there are some assignments which are not yet no-shows, but may become
+      no-shows, and we should wait for the validators to either approve or become no-shows.
+    * e.g. If the above passes and the `last_assignment_tick` was 5 and the current tick was 6, then we'd return
+      false.
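
The approval check can be sketched as below. The flat counters (`n_assigned`, `n_assigned_approved`) are a simplification introduced for this sketch; the subsystem derives them from the bitfields in the `ApprovalEntry` and `CandidateEntry`. The delay condition is written so that it agrees with the worked example above (last assignment at tick 5, current tick 6, `APPROVAL_DELAY = 2`, not yet approved).

```rust
type Tick = u64;

const APPROVAL_DELAY: Tick = 2;

// Only the fields relevant to the check are kept in this sketch.
enum RequiredTranches {
    All,
    Pending,
    Exact { tolerated_missing: usize, last_assignment_tick: Option<Tick> },
}

/// `n_assigned`: validators assigned in tranches `0..=needed`.
/// `n_assigned_approved`: how many of those have approved.
fn is_approved(
    n_validators: usize,
    n_approvals: usize,
    n_assigned: usize,
    n_assigned_approved: usize,
    required: &RequiredTranches,
    tick_now: Tick,
) -> bool {
    // More than one third approving implies at least one honest approval.
    if 3 * n_approvals > n_validators {
        return true;
    }
    match required {
        RequiredTranches::Pending | RequiredTranches::All => false,
        RequiredTranches::Exact { tolerated_missing, last_assignment_tick } => {
            let missing = n_assigned.saturating_sub(n_assigned_approved);
            // The last needed assignment must have been known for `APPROVAL_DELAY` ticks.
            let delay_elapsed = last_assignment_tick
                .map_or(true, |tick| tick + APPROVAL_DELAY <= tick_now);
            missing <= *tolerated_missing && delay_elapsed
        }
    }
}
```

Treating an absent `last_assignment_tick` as already elapsed is a choice made for this sketch, not something the guide specifies.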
+
+### Time
+
+#### Current Tranche
+  * Given the slot number of a block, and the current time, this informs about the current tranche.
+  * Convert `time.saturating_sub(slot_number.to_time())` to a delay tranche value
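
A minimal sketch of that conversion, expressed in ticks. It assumes one delay tranche per tick and takes the number of ticks per slot as a parameter; both are assumptions made for illustration rather than values fixed by this guide.

```rust
use std::convert::TryFrom;

type Tick = u64;
type DelayTranche = u32;

/// Tranche elapsed since the start of the block's slot, saturating at zero
/// for blocks whose slot lies in the future of the local clock.
fn current_tranche(tick_now: Tick, slot_number: u64, ticks_per_slot: u64) -> DelayTranche {
    let slot_start: Tick = slot_number.saturating_mul(ticks_per_slot);
    let elapsed = tick_now.saturating_sub(slot_start);
    // Clamp rather than wrap if the elapsed time does not fit the tranche type.
    DelayTranche::try_from(elapsed).unwrap_or(DelayTranche::MAX)
}
```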
+
+[CSM]: ../../types/overseer-protocol.md#chainselectionmessage
@@ -0,0 +1,7 @@
+# Availability Subsystems
+
+The availability subsystems are responsible for ensuring that Proofs of Validity of backed candidates are widely
+available within the validator set, without requiring every node to retain a full copy. They accomplish this by broadly
+distributing erasure-coded chunks of the PoV, keeping track of which validator has which chunk by means of signed
+bitfields. They are also responsible for reassembling a complete PoV when required, e.g. when an approval checker needs
+to validate a teyrchain block.
@@ -0,0 +1,84 @@
+# Availability Distribution
+
+This subsystem is responsible for distributing availability data to peers. Availability data are chunks, `PoV`s and
+`AvailableData` (which is `PoV` + `PersistedValidationData`). It does so via request response protocols.
+
+In particular this subsystem is responsible for:
+
+- Responding to network requests for availability data by querying the [Availability
+  Store](../utility/availability-store.md).
+- Requesting chunks from backing validators and putting them in the local `Availability Store` whenever we find an
+  occupied core on any fresh leaf. This ensures availability by at least 2/3+ of all validators and happens after a
+  candidate is backed.
+- Fetching `PoV`s from validators when requested via a `FetchPoV` message from backing (`pov_requester` module).
+
+The backing subsystem is responsible for making available data available in the local `Availability Store` upon
+validation. This subsystem will serve any network requests by querying that store.
+
+## Protocol
+
+This subsystem does not handle any peer set messages, but the `pov_requester` does connect to validators of the same
+backing group on the validation peer set, to ensure fast propagation of statements between those validators and for
+ensuring already established connections for requesting `PoV`s. Other than that this subsystem drives request/response
+protocols.
+
+Input:
+
+- `OverseerSignal::ActiveLeaves(ActiveLeavesUpdate)`
+- `AvailabilityDistributionMessage{msg: ChunkFetchingRequest}`
+- `AvailabilityDistributionMessage{msg: PoVFetchingRequest}`
+- `AvailabilityDistributionMessage{msg: FetchPoV}`
+
+Output:
+
+- `NetworkBridgeMessage::SendRequests(Requests, IfDisconnected::TryConnect)`
+- `AvailabilityStore::QueryChunk(candidate_hash, index, response_channel)`
+- `AvailabilityStore::StoreChunk(candidate_hash, chunk)`
+- `AvailabilityStore::QueryAvailableData(candidate_hash, response_channel)`
+- `RuntimeApiRequest::SessionIndexForChild`
+- `RuntimeApiRequest::SessionInfo`
+- `RuntimeApiRequest::AvailabilityCores`
+
+## Functionality
+
+### PoV Requester
+
+The PoV requester in the `pov_requester` module takes care of staying connected to validators of the current backing
+group of this very validator on the `Validation` peer set and it will handle `FetchPoV` requests by issuing network
+requests to those validators. It will check the hash of the received `PoV`, but will not do any further validation. That
+needs to be done by the original `FetchPoV` sender (backing subsystem).
+
+### Chunk Requester
+
+After a candidate is backed, the availability of the PoV block must be confirmed by 2/3+ of all validators. The chunk
+requester is responsible for making that availability a reality.
+
+It does that by checking occupied cores for all active leaves. For each occupied core it will spawn a task
+fetching the erasure chunk which has the `ValidatorIndex` of the node. For this a `ChunkFetchingRequest` is issued, via
+Substrate's generic request/response protocol.
+
+The spawned task will start trying to fetch the chunk from validators in the group responsible for the occupied core,
+in a random order. To ensure that we use already open TCP connections wherever possible, the requester maintains a
+cache and preserves that random order for the entire session.
+
+Note however that, because not all validators in a group have to be actual backers, not all of them are required to have
+the needed chunk. This in turn could lead to low throughput, as we have to wait for fetches to fail, before reaching a
+validator finally having our chunk. We do rank back validators not delivering our chunk, but as backers could vary from
+block to block on a perfectly legitimate basis, this is still not ideal. See issues
+[2509](https://github.com/paritytech/polkadot/issues/2509) and
+[2512](https://github.com/paritytech/polkadot/issues/2512) for more information.
+
+The current implementation also only fetches chunks for occupied cores in blocks in active leaves. This means, though,
+that if active leaves skip a block or we are particularly slow in fetching our chunk, we might not fetch our chunk if
+availability reached 2/3 fast enough (slot becomes free). This is not desirable as we would like as many validators as
+possible to have their chunk. See this [issue](https://github.com/paritytech/polkadot/issues/2513) for more details.
+
+
+### Serving
+
+On the other side, the subsystem will listen for incoming `ChunkFetchingRequest`s and `PoVFetchingRequest`s from the
+network bridge and will respond to queries by looking the requested chunks and `PoV`s up in the availability store.
+This happens in the `responder` module.
+
+We rely on the backing subsystem to make available data available locally in the `Availability Store` after it has
+validated it.
@@ -0,0 +1,184 @@
+# Availability Recovery
+
+This subsystem is responsible for recovering the data made available via the
+[Availability Distribution](availability-distribution.md) subsystem, necessary for candidate validation during the
+approval/disputes processes. Additionally, it is also being used by collators to recover PoVs in adversarial scenarios
+where the other collators of the para are censoring blocks.
+
+According to the Pezkuwi protocol, in order to recover any given `AvailableData`, we generally must recover at least
+`f + 1` pieces from validators of the session. Thus, we should connect to and query randomly chosen validators until we
+have received `f + 1` pieces.
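+
+As a rough illustration of the `f + 1` figure, the sketch below assumes the usual Byzantine bound `n >= 3f + 1`, so
+that `f` is `(n - 1) / 3` rounded down. The text above does not define `f`, and the function names are hypothetical,
+so treat this as a sketch under those assumptions rather than the subsystem's actual code.
+
+```rust
+/// Maximum number of faulty validators tolerated among `n_validators`,
+/// ASSUMING the usual `n >= 3f + 1` bound (not stated in the text above).
+fn max_faulty(n_validators: usize) -> usize {
+    n_validators.saturating_sub(1) / 3
+}
+
+/// Number of pieces needed to recover the data: `f + 1`.
+fn recovery_threshold(n_validators: usize) -> usize {
+    max_faulty(n_validators) + 1
+}
+```
+
+Under these assumptions, a session of 10 validators tolerates 3 faulty ones and needs 4 pieces for recovery.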
+
+In practice, there are various optimisations implemented in this subsystem which avoid querying all chunks from
+different validators and/or avoid doing the chunk reconstruction altogether.
+
+## Protocol
+
+This version of the availability recovery subsystem is based only on request-response network protocols.
+
+Input:
+
+* `AvailabilityRecoveryMessage::RecoverAvailableData(candidate, session, backing_group, core_index, response)`
+
+Output:
+
+* `NetworkBridgeMessage::SendRequests`
+* `AvailabilityStoreMessage::QueryAllChunks`
+* `AvailabilityStoreMessage::QueryAvailableData`
+* `AvailabilityStoreMessage::QueryChunkSize`
+
+
+## Functionality
+
+We hold a state which tracks the currently ongoing recovery tasks. A `RecoveryTask` is a structure encapsulating all
+network tasks needed in order to recover the available data with respect to a candidate.
+
+Each `RecoveryTask` has a collection of ordered recovery strategies to try.
+
+```rust
+/// Subsystem state.
+struct State {
+  /// Each recovery task is implemented as its own async task,
+  /// and these handles are for communicating with them.
+  ongoing_recoveries: FuturesUnordered<RecoveryHandle>,
+  /// A recent block hash for which state should be available.
+  live_block: (BlockNumber, Hash),
+  /// An LRU cache of recently recovered data.
+  availability_lru: LruMap<CandidateHash, CachedRecovery>,
+  /// Cached runtime info.
+  runtime_info: RuntimeInfo,
+}
+
+struct RecoveryParams {
+  /// Discovery ids of `validators`.
+  pub validator_authority_keys: Vec<AuthorityDiscoveryId>,
+  /// Number of validators.
+  pub n_validators: usize,
+  /// The number of regular chunks needed.
+  pub threshold: usize,
+  /// The number of systematic chunks needed.
+  pub systematic_threshold: usize,
+  /// A hash of the relevant candidate.
+  pub candidate_hash: CandidateHash,
+  /// The root of the erasure encoding of the candidate.
+  pub erasure_root: Hash,
+  /// Metrics to report.
+  pub metrics: Metrics,
+  /// Do not request data from availability-store. Useful for collators.
+  pub bypass_availability_store: bool,
+  /// The type of check to perform after available data was recovered.
+  pub post_recovery_check: PostRecoveryCheck,
+  /// The blake2-256 hash of the PoV.
+  pub pov_hash: Hash,
+  /// Protocol name for ChunkFetchingV1.
+  pub req_v1_protocol_name: ProtocolName,
+  /// Protocol name for ChunkFetchingV2.
+  pub req_v2_protocol_name: ProtocolName,
+  /// Whether or not chunk mapping is enabled.
+  pub chunk_mapping_enabled: bool,
+  /// Channel to the erasure task handler.
+  pub erasure_task_tx: mpsc::Sender<ErasureTask>,
+}
+
+pub struct RecoveryTask<Sender: overseer::AvailabilityRecoverySenderTrait> {
+	sender: Sender,
+	params: RecoveryParams,
+	strategies: VecDeque<Box<dyn RecoveryStrategy<Sender>>>,
+	state: task::State,
+}
+
+#[async_trait::async_trait]
+/// Common trait for runnable recovery strategies.
+pub trait RecoveryStrategy<Sender: overseer::AvailabilityRecoverySenderTrait>: Send {
+	/// Main entry point of the strategy.
+	async fn run(
+		mut self: Box<Self>,
+		state: &mut task::State,
+		sender: &mut Sender,
+		common_params: &RecoveryParams,
+	) -> Result<AvailableData, RecoveryError>;
+
+	/// Return the name of the strategy for logging purposes.
+	fn display_name(&self) -> &'static str;
+
+	/// Return the strategy type for use as a metric label.
+	fn strategy_type(&self) -> &'static str;
+}
+```
+
+### Signal Handling
+
+On `ActiveLeavesUpdate`, if `activated` is non-empty, set `state.live_block` (block number and hash) to the first block
+in `activated`.
+
+Ignore `BlockFinalized` signals.
+
+On `Conclude`, shut down the subsystem.
+
+#### `AvailabilityRecoveryMessage::RecoverAvailableData(...)`
+
+1. Check the `availability_lru` for the candidate and return the data if present.
+1. Check if there is already a recovery handle for the request. If so, add the response handle to it.
+1. Otherwise, load the session info for the given session under the state of the live block (`state.live_block`), and
+   initiate a recovery task with `launch_recovery_task`. Add a recovery handle to the state and add the response
+   channel to it.
+1. If the session info is not available, return `RecoveryError::Unavailable` on the response channel.
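+
+The steps above can be sketched roughly as follows. This is a simplified, synchronous model and not the subsystem's
+real code: `SketchState`, `Outcome` and `on_recover_request` are hypothetical names, the candidate hash and the data
+are stand-in types, and the response channels are reduced to a simple counter of waiting requesters.
+
+```rust
+use std::collections::HashMap;
+
+/// Stand-ins for the real `CandidateHash` and `AvailableData` types.
+type CandidateHash = u64;
+type Data = Vec<u8>;
+
+#[derive(Debug, PartialEq)]
+enum Outcome {
+    /// The data was served straight from the cache.
+    Cached(Data),
+    /// The requester was attached to a recovery that is already running.
+    Joined,
+    /// A new recovery task was launched.
+    Launched,
+    /// The session info was not available.
+    Unavailable,
+}
+
+#[derive(Default)]
+struct SketchState {
+    /// Plays the role of `availability_lru` (no eviction in this sketch).
+    cache: HashMap<CandidateHash, Data>,
+    /// Ongoing recoveries, mapped to the number of waiting requesters.
+    ongoing: HashMap<CandidateHash, usize>,
+}
+
+fn on_recover_request(
+    state: &mut SketchState,
+    candidate: CandidateHash,
+    session_info_available: bool,
+) -> Outcome {
+    // Step 1: answer from the cache if possible.
+    if let Some(data) = state.cache.get(&candidate) {
+        return Outcome::Cached(data.clone());
+    }
+    // Step 2: piggyback on a recovery that is already in flight.
+    if let Some(waiters) = state.ongoing.get_mut(&candidate) {
+        *waiters += 1;
+        return Outcome::Joined;
+    }
+    // Step 4: without session info there is nothing we can launch.
+    if !session_info_available {
+        return Outcome::Unavailable;
+    }
+    // Step 3: launch a new recovery and register the first requester.
+    state.ongoing.insert(candidate, 1);
+    Outcome::Launched
+}
+```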
+
+### Recovery logic
+
+#### `handle_recover(...) -> Result<()>`
+
+Instantiate the appropriate `RecoveryStrategy` instances, based on the subsystem configuration, params and session
+info. Call `launch_recovery_task()`.
+
+#### `launch_recovery_task(state, ctx, response_sender, recovery_strategies, params) -> Result<()>`
+
+Create the `RecoveryTask` and launch it as a background task running `recovery_task.run()`.
+
+#### `recovery_task.run(mut self) -> Result<AvailableData, RecoveryError>`
+
+* Loop:
+  * Pop a strategy from the queue. If none are left, return `RecoveryError::Unavailable`.
+  * Run the strategy.
+  * If the strategy returned successfully or returned `RecoveryError::Invalid`, break the loop.
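+
+The loop above can be sketched as follows. This is a simplified, synchronous model: the `Strategy` trait, the `Fixed`
+strategy and the `fixed` and `run_strategies` helpers are hypothetical stand-ins for the async `RecoveryStrategy`
+machinery, and `AvailableData` is reduced to a byte vector.
+
+```rust
+use std::collections::VecDeque;
+
+/// Stand-in for the real `AvailableData` type.
+type AvailableData = Vec<u8>;
+
+#[derive(Debug, PartialEq)]
+enum RecoveryError {
+    Invalid,
+    Unavailable,
+}
+
+/// Simplified, synchronous stand-in for `RecoveryStrategy`.
+trait Strategy {
+    fn run(self: Box<Self>) -> Result<AvailableData, RecoveryError>;
+}
+
+/// Hypothetical strategy that simply returns a fixed result.
+struct Fixed(Result<AvailableData, RecoveryError>);
+
+impl Strategy for Fixed {
+    fn run(self: Box<Self>) -> Result<AvailableData, RecoveryError> {
+        let Fixed(result) = *self;
+        result
+    }
+}
+
+/// Convenience constructor for a boxed `Fixed` strategy.
+fn fixed(result: Result<AvailableData, RecoveryError>) -> Box<dyn Strategy> {
+    Box::new(Fixed(result))
+}
+
+/// Try strategies in order until one succeeds or reports `Invalid`.
+fn run_strategies(
+    mut strategies: VecDeque<Box<dyn Strategy>>,
+) -> Result<AvailableData, RecoveryError> {
+    loop {
+        // Pop the next strategy; running out of them means the data is unavailable.
+        let strategy = match strategies.pop_front() {
+            Some(strategy) => strategy,
+            None => return Err(RecoveryError::Unavailable),
+        };
+        match strategy.run() {
+            // Success and `Invalid` both end the loop.
+            Ok(data) => return Ok(data),
+            Err(RecoveryError::Invalid) => return Err(RecoveryError::Invalid),
+            // Any other failure falls through to the next strategy.
+            Err(RecoveryError::Unavailable) => continue,
+        }
+    }
+}
+```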
+
+### Recovery strategies
+
+#### `FetchFull`
+
+This strategy tries requesting the full available data from the validators in the backing group to
+which the node is already connected. They are tried one by one in a random order.
+It is very performant if there's enough network bandwidth and the backing group is not overloaded.
+The costly Reed-Solomon reconstruction is not needed.
+
+#### `FetchSystematicChunks`
+
+Very similar to `FetchChunks` below but requests from the validators that hold the systematic chunks, so that we avoid
+Reed-Solomon reconstruction. Only possible if `node_features::FeatureIndex::AvailabilityChunkMapping` is enabled and
+the `core_index` is supplied (currently only for recoveries triggered by approval voting).
+
+More info in
+[RFC-47](https://github.com/polkadot-fellows/RFCs/blob/main/text/0047-assignment-of-availability-chunks.md).
+
+#### `FetchChunks`
+
+The least performant strategy but also the most comprehensive one. It's the only one that cannot fail under the
+byzantine threshold assumption, so it's always added as the last one in the `recovery_strategies` queue.
+
+Performs parallel chunk requests to validators. Once enough chunks have been received, performs the reconstruction.
+In the worst case, all validators will be tried.
+
+### Default recovery strategy configuration
+
+#### For validators
+
+If the estimated available data size is smaller than a configured constant (currently 1 MiB for Pezkuwi or 4 MiB for
+other networks), try doing `FetchFull` first.
+Next, if the preconditions described in `FetchSystematicChunks` above are met, try systematic recovery.
+As a last resort, do `FetchChunks`.
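+
+A minimal sketch of this ordering, assuming the size limit is passed in as a parameter. The `StrategyKind` enum and
+the `validator_strategies` function are hypothetical names used only for illustration; the real subsystem builds
+boxed `RecoveryStrategy` objects instead.
+
+```rust
+#[derive(Debug, PartialEq, Clone, Copy)]
+enum StrategyKind {
+    FetchFull,
+    FetchSystematicChunks,
+    FetchChunks,
+}
+
+/// Build the ordered strategy list for a validator.
+///
+/// `fetch_full_limit` is the configured size constant in bytes. Systematic
+/// recovery is only added when chunk mapping is enabled and a core index is known.
+fn validator_strategies(
+    estimated_size: usize,
+    fetch_full_limit: usize,
+    chunk_mapping_enabled: bool,
+    core_index: Option<u32>,
+) -> Vec<StrategyKind> {
+    let mut strategies = Vec::new();
+    if estimated_size < fetch_full_limit {
+        strategies.push(StrategyKind::FetchFull);
+    }
+    if chunk_mapping_enabled && core_index.is_some() {
+        strategies.push(StrategyKind::FetchSystematicChunks);
+    }
+    // `FetchChunks` is always the last resort.
+    strategies.push(StrategyKind::FetchChunks);
+    strategies
+}
+```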
+
+#### For collators
+
+Collators currently only use `FetchChunks`, as they only attempt recoveries in rare scenarios.
+
+Moreover, the recovery task is specially configured to not attempt requesting data from the local availability-store
+(because it doesn't exist) and to not re-encode the data after a successful recovery (because it's an expensive check
+that is not needed; checking the `pov_hash` is enough for collators).
@@ -0,0 +1,40 @@
+# Bitfield Distribution
+
+Validators vote on the availability of a backed candidate by issuing signed bitfields, where each bit corresponds to a
+single candidate. These bitfields can be used to compactly determine which backed candidates are available or not based
+on a 2/3+ quorum.
+
+## Protocol
+
+`PeerSet`: `Validation`
+
+Input: [`BitfieldDistributionMessage`](../../types/overseer-protocol.md#bitfield-distribution-message) which are
+gossiped to all peers, no matter if validator or not.
+
+Output:
+
+- `NetworkBridge::SendValidationMessage([PeerId], message)` gossip a verified incoming bitfield on to interested
+  subsystems within this validator node.
+- `NetworkBridge::ReportPeer(PeerId, cost_or_benefit)` improve or penalize the reputation of peers based on the messages
+  that are received relative to the current view.
+- `ProvisionerMessage::ProvisionableData(ProvisionableData::Bitfield(relay_parent, SignedAvailabilityBitfield))` pass on
+  the bitfield to the other submodules via the overseer.
+
+## Functionality
+
+This is implemented as a gossip system.
+
+It is necessary to track peer connection, view change, and disconnection events, in order to maintain an index of which
+peers are interested in which relay parent bitfields.
+
+
+Before gossiping incoming bitfields, they must be checked to be signed by one of the validators of the validator set
+relevant to the current relay parent. Only accept bitfields relevant to our current view and only distribute bitfields
+to other peers when relevant to their most recent view. Accept and distribute only one bitfield per validator.
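+
+The acceptance rules above can be sketched as follows. This is an illustrative model only: `BitfieldTracker` and its
+fields are hypothetical, relay-parent hashes and validator indices are stand-in integer types, and signature
+verification is reduced to a boolean that the caller is assumed to have computed.
+
+```rust
+use std::collections::HashSet;
+
+/// Stand-ins for the real relay-parent hash and validator index types.
+type Hash = u64;
+type ValidatorIndex = u32;
+
+#[derive(Default)]
+struct BitfieldTracker {
+    /// Relay parents in our current view.
+    view: HashSet<Hash>,
+    /// Bitfields already accepted, keyed by relay parent and validator.
+    seen: HashSet<(Hash, ValidatorIndex)>,
+}
+
+impl BitfieldTracker {
+    /// Returns `true` if the bitfield should be accepted and gossiped on.
+    fn accept(&mut self, relay_parent: Hash, validator: ValidatorIndex, signature_ok: bool) -> bool {
+        // Reject badly signed bitfields and those outside our view.
+        if !signature_ok || !self.view.contains(&relay_parent) {
+            return false;
+        }
+        // `insert` returns `false` for a repeat, enforcing one bitfield per validator.
+        self.seen.insert((relay_parent, validator))
+    }
+}
+```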
+
+
+When receiving a bitfield either from the network or from a `DistributeBitfield` message, forward it along to the block
+authorship (provisioning) subsystem for potential inclusion in a block.
+
+Valid bitfield gossip messages must be cached so that they can be sent to peers that connect or reconnect after those
+messages were received.
@@ -0,0 +1,37 @@
+# Bitfield Signing
+
+Validators vote on the availability of a backed candidate by issuing signed bitfields, where each bit corresponds to a
+single candidate. These bitfields can be used to compactly determine which backed candidates are available or not based
+on a 2/3+ quorum.
+
+## Protocol
+
+Input:
+
+There is no dedicated input mechanism for bitfield signing. Instead, Bitfield Signing produces a bitfield representing
+the current state of availability on `StartWork`.
+
+Output:
+
+- `BitfieldDistribution::DistributeBitfield`: distribute a locally signed bitfield
+- `AvailabilityStore::QueryChunk(CandidateHash, validator_index, response_channel)`
+
+## Functionality
+
+Upon receipt of an `ActiveLeavesUpdate`, launch a bitfield signing job for each `activated` head referring to a fresh
+leaf. Stop the job for each `deactivated` head.
+
+## Bitfield Signing Job
+
+Localized to a specific relay-parent `r`. If not running as a validator, do nothing.
+
+- For each fresh leaf, begin by waiting a fixed period of time so availability distribution has the chance to make
+  candidates available.
+- Determine our validator index `i`, the set of backed candidates pending availability in `r`, and which bit of the
+  bitfield each corresponds to.
+- Start with an empty bitfield. For each bit in the bitfield, if there is a candidate pending availability, query the
+  [Availability Store](../utility/availability-store.md) for whether we have the availability chunk for our validator
+  index. The `OccupiedCore` struct contains the candidate hash so the full candidate does not need to be fetched from
+  runtime.
+- For all chunks we have, set the corresponding bit in the bitfield.
+- Sign the bitfield and dispatch a `BitfieldDistribution::DistributeBitfield` message.
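+
+The bit-setting steps above can be sketched as follows, leaving out the initial delay, signing and dispatch. The
+`build_bitfield` function is a hypothetical helper: each core is modelled as an optional candidate hash (present when
+a candidate is pending availability), and the `have_chunk` closure stands in for the Availability Store query.
+
+```rust
+/// Stand-in for the real `CandidateHash` type.
+type CandidateHash = u64;
+
+/// Build an availability bitfield with one bit per core.
+///
+/// A bit is set only if the core holds a candidate pending availability
+/// and `have_chunk` reports that we hold our chunk for it.
+fn build_bitfield(
+    cores: &[Option<CandidateHash>],
+    have_chunk: impl Fn(CandidateHash) -> bool,
+) -> Vec<bool> {
+    cores
+        .iter()
+        .map(|core| match core {
+            Some(candidate) => have_chunk(*candidate),
+            None => false,
+        })
+        .collect()
+}
+```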
@@ -0,0 +1,15 @@
+# Backing Subsystems
+
+The backing subsystems, when conceived as a black box, receive an arbitrary quantity of parablock candidates and
+associated proofs of validity from arbitrary untrusted collators. From these, they produce a bounded quantity of
+backable candidates which relay chain block authors may choose to include in a subsequent block.
+
+In broad strokes, the flow operates like this:
+
+- **Candidate Selection** winnows the field of parablock candidates, selecting up to one of them to second.
+- **Candidate Backing** ensures that a candidate being seconded is valid, then generates the appropriate `Statement`.
+  It also keeps track of which candidates have received the backing of a quorum of other validators.
+- **Statement Distribution** is the networking component which ensures that all validators receive each other's
+  statements.
+- **PoV Distribution** is the networking component which ensures that validators considering a candidate can get the
+  appropriate PoV.
@@ -0,0 +1,189 @@
+# Candidate Backing
+
+> NOTE: This module has undergone changes for the elastic scaling implementation. As a result, parts of this document
+> may be out of date and will be updated at a later time. Issue tracking the update:
+> <https://github.com/pezkuwichain/pezkuwi-sdk/issues/132>
+
+The Candidate Backing subsystem ensures every parablock considered for relay block inclusion has been seconded by at
+least one validator, and approved by a quorum. Parablocks for which not enough validators will assert correctness are
+discarded. If the block later proves invalid, the initial backers are slashable; this gives Pezkuwi a rational threat
+model during subsequent stages.
+
+Its role is to produce backable candidates for inclusion in new relay-chain blocks. It does so by issuing signed
+[`Statement`s][Statement] and tracking received statements signed by other validators. Once enough statements are
+received, they can be combined into backing for specific candidates.
+
+Note that though the candidate backing subsystem attempts to produce as many backable candidates as possible, it does
+_not_ attempt to choose a single authoritative one. The choice of which actually gets included is ultimately up to the
+block author, by whatever metrics it may use; those are opaque to this subsystem.
+
+Once a sufficient quorum has agreed that a candidate is valid, this subsystem notifies the [Provisioner][PV], which in
+turn engages block production mechanisms to include the parablock.
+
+## Protocol
+
+Input: [`CandidateBackingMessage`][CBM]
+
+Output:
+
+* [`CandidateValidationMessage`][CVM]
+* [`RuntimeApiMessage`][RAM]
+* [`CollatorProtocolMessage`][CPM]
+* [`ProvisionerMessage`][PM]
+* [`AvailabilityDistributionMessage`][ADM]
+* [`StatementDistributionMessage`][SDM]
+
+## Functionality
+
+The [Collator Protocol][CP] subsystem is the primary source of non-overseer messages into this subsystem. That subsystem
+generates appropriate [`CandidateBackingMessage`s][CBM] and passes them to this subsystem.
+
+This subsystem requests validation from the [Candidate Validation][CV] and generates an appropriate
+[`Statement`][Statement]. All `Statement`s are then passed on to the [Statement Distribution][SD] subsystem to be
+gossiped to peers. When [Candidate Validation][CV] decides that a candidate is invalid, and it was recommended to us to
+second by our own [Collator Protocol][CP] subsystem, a message is sent to the [Collator Protocol][CP] subsystem with the
+candidate's hash so that the collator which recommended it can be penalized.
+
+The subsystem should maintain a set of handles to Candidate Backing Jobs that are currently live, as well as the
+relay-parent to which they correspond.
+
+### On Overseer Signal
+
+* If the signal is an [`OverseerSignal`][OverseerSignal]`::ActiveLeavesUpdate`:
+  * spawn a Candidate Backing Job for each `activated` head referring to a fresh leaf, storing a bidirectional channel
+    with the Candidate Backing Job in the set of handles.
+  * cease the Candidate Backing Job for each `deactivated` head, if any.
+* If the signal is an [`OverseerSignal`][OverseerSignal]`::Conclude`: Forward conclude messages to all jobs, wait a
+  small amount of time for them to join, and then exit.
+
+### On Receiving `CandidateBackingMessage`
+
+* If the message is a [`CandidateBackingMessage`][CBM]`::GetBackedCandidates`, get all backable candidates from the
+  statement table and send them back.
+* If the message is a [`CandidateBackingMessage`][CBM]`::Second`, sign and dispatch a `Seconded` statement only if we
+  have not seconded any other candidate and have not signed a `Valid` statement for the requested candidate. Signing
+  both a `Seconded` and `Valid` message is a double-voting misbehavior with a heavy penalty, and this could occur if
+  another validator has seconded the same candidate and we've received their message before the internal seconding
+  request.
+* If the message is a [`CandidateBackingMessage`][CBM]`::Statement`, count the statement to the quorum. If the statement
+  in the message is `Seconded` and it contains a candidate that belongs to our assignment, request the corresponding
+  `PoV` from the backing node via `AvailabilityDistribution` and launch validation. Issue our own `Valid` or `Invalid`
+  statement as a result.
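+
+The guard on `Second` described above can be sketched as follows. `JobState` and `try_second` are hypothetical names,
+and the candidate hash is a stand-in integer type; the sketch only models the bookkeeping that prevents double-voting,
+not the signing or dispatch.
+
+```rust
+use std::collections::HashSet;
+
+/// Stand-in for the real `CandidateHash` type.
+type CandidateHash = u64;
+
+#[derive(Default)]
+struct JobState {
+    /// The candidate we have seconded, if any.
+    seconded: Option<CandidateHash>,
+    /// Candidates for which we already signed a `Valid` statement.
+    issued_valid: HashSet<CandidateHash>,
+}
+
+impl JobState {
+    /// Returns `true` if a `Seconded` statement may be signed for `candidate`.
+    fn try_second(&mut self, candidate: CandidateHash) -> bool {
+        // Seconding a second candidate, or seconding one we already marked
+        // `Valid`, would be a double-vote, so refuse both.
+        if self.seconded.is_some() || self.issued_valid.contains(&candidate) {
+            return false;
+        }
+        self.seconded = Some(candidate);
+        true
+    }
+}
+```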
+
+If the seconding node did not provide us with the `PoV` we will retry fetching from other backing validators.
+
+
+> big TODO: "contextual execution"
+>
+> * At the moment we only allow inclusion of _new_ teyrchain candidates validated by _current_ validators.
+> * Allow inclusion of _old_ teyrchain candidates validated by _current_ validators.
+> * Allow inclusion of _old_ teyrchain candidates validated by _old_ validators.
+>
+> This will probably blur the lines between jobs, will probably require inter-job communication and a short-term memory
+> of recently backable, but not backed candidates.
+
+## Candidate Backing Job
+
+The Candidate Backing Job represents the work a node does for backing candidates with respect to a particular
+relay-parent.
+
+The goal of a Candidate Backing Job is to produce as many backable candidates as possible. This is done via signed
+[`Statement`s][STMT] by validators. If a candidate receives a majority of supporting Statements from the Teyrchain
+Validators currently assigned, then that candidate is considered backable.
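+
+As a minimal sketch of the majority rule stated above, assuming "majority" means strictly more than half of the
+assigned group (the exact threshold used in practice may differ): `is_backable` is a hypothetical helper, and
+`supporters` is the set of validators that issued a `Seconded` or `Valid` statement for the candidate.
+
+```rust
+use std::collections::HashSet;
+
+/// Stand-in for the real validator index type.
+type ValidatorIndex = u32;
+
+/// Returns `true` if strictly more than half of `group` supports the candidate.
+///
+/// ASSUMPTION: "majority" is interpreted as more than half of the group.
+/// Supporters outside the assigned group are ignored.
+fn is_backable(group: &[ValidatorIndex], supporters: &HashSet<ValidatorIndex>) -> bool {
+    let votes = group.iter().filter(|v| supporters.contains(*v)).count();
+    votes * 2 > group.len()
+}
+```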
+
+### On Startup
+
+* Fetch current validator set, validator -> teyrchain assignments from [`Runtime API`][RA] subsystem using
+  [`RuntimeApiRequest::Validators`][RAM] and [`RuntimeApiRequest::ValidatorGroups`][RAM]
+* Determine if the node controls a key in the current validator set. Call this the local key if so.
+* If the local key exists, extract the teyrchain head and validation function from the [`Runtime API`][RA] for the
+  teyrchain the local key is assigned to by issuing a [`RuntimeApiRequest::Validators`][RAM]
+* Issue a [`RuntimeApiRequest::SigningContext`][RAM] message to get a context that will later be used upon signing.
+
+### On Receiving New Candidate Backing Message
+
+```rust
+match msg {
+  CandidateBackingMessage::GetBackedCandidates(hashes, tx) => {
+    // Send back a set of backable candidates.
+  }
+  CandidateBackingMessage::Second(hash, candidate) => {
+    if candidate is unknown and in local assignment {
+      if spawn_validation_work(candidate, teyrchain head, validation function).await == Valid {
+        send(DistributePoV(pov))
+      }
+    }
+  }
+  CandidateBackingMessage::Statement(hash, statement) => {
+    // Count this statement towards the votes on this candidate.
+    if let Statement::Seconded(candidate) = statement {
+      if candidate.teyrchain_id == our_assignment {
+        spawn_validation_work(candidate, teyrchain head, validation function)
+      }
+    }
+  }
+}
+```
+
+Add `Seconded` statements and `Valid` statements to a quorum. If the quorum reaches a pre-defined threshold, send a
+[`ProvisionerMessage`][PM]`::ProvisionableData(ProvisionableData::BackedCandidate(CandidateReceipt))` message. `Invalid`
+statements that conflict with already witnessed `Seconded` and `Valid` statements for the given candidate, statements
+that are double-votes, self-contradictions and so on, should result in issuing a
+[`ProvisionerMessage`][PM]`::MisbehaviorReport` message for each newly detected case of this kind.
+
+Backing does not need to concern itself with providing statements to the dispute coordinator as the dispute coordinator
+scrapes them from chain. This way the import is batched and contains only statements that actually made it on some
+chain.
+
+### Validating Candidates
+
+```rust
+fn spawn_validation_work(candidate, teyrchain head, validation function) {
+  asynchronously {
+    let pov = (fetch pov block).await
+
+    let valid = (validate pov block).await;
+    if valid {
+      // make PoV available for later distribution. Send data to the availability store to keep.
+      // sign and dispatch `valid` statement to network if we have not seconded the given candidate.
+    } else {
+      // sign and dispatch `invalid` statement to network.
+    }
+  }
+}
+```
+
+### Fetch PoV Block
+
+Create a `(sender, receiver)` pair. Dispatch an [`AvailabilityDistributionMessage`][ADM]`::FetchPoV{ validator_index,
+pov_hash, candidate_hash, tx, }` and listen on the passed receiver for a response. Availability distribution will send
+the request to the validator specified by `validator_index`, which might not serve it for whatever reason; in that
+case we need to retry with other backing validators.
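+
+The retry behaviour can be sketched as follows. This is a synchronous simplification: `fetch_pov_with_retry` is a
+hypothetical helper, the `fetch` closure stands in for dispatching `FetchPoV` and awaiting the receiver, and the `PoV`
+is reduced to a byte vector.
+
+```rust
+/// Stand-ins for the real validator index and `PoV` types.
+type ValidatorIndex = u32;
+type PoV = Vec<u8>;
+
+/// Try each backing validator in turn until one of them serves the PoV.
+///
+/// Returns the validator that answered together with the PoV, or `None`
+/// if every backing validator failed to serve it.
+fn fetch_pov_with_retry(
+    backers: &[ValidatorIndex],
+    mut fetch: impl FnMut(ValidatorIndex) -> Option<PoV>,
+) -> Option<(ValidatorIndex, PoV)> {
+    backers
+        .iter()
+        .find_map(|&validator| fetch(validator).map(|pov| (validator, pov)))
+}
+```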
+
+
+### Validate PoV Block
+
+Create a `(sender, receiver)` pair. Dispatch a `CandidateValidationMessage::Validate(validation function, candidate,
+pov, BACKING_EXECUTION_TIMEOUT, sender)` and listen on the receiver for a response.
+
+### Distribute Signed Statement
+
+Dispatch a [`StatementDistributionMessage`][SDM]`::Share(relay_parent, SignedFullStatementWithPVD)`.
+
+[OverseerSignal]: ../../types/overseer-protocol.md#overseer-signal
+[Statement]: ../../types/backing.md#statement-type
+[STMT]: ../../types/backing.md#statement-type
+[CPM]: ../../types/overseer-protocol.md#collator-protocol-message
+[RAM]: ../../types/overseer-protocol.md#runtime-api-message
+[CVM]: ../../types/overseer-protocol.md#validation-request-type
+[PM]: ../../types/overseer-protocol.md#provisioner-message
+[CBM]: ../../types/overseer-protocol.md#candidate-backing-message
+[ADM]: ../../types/overseer-protocol.md#availability-distribution-message
+[SDM]: ../../types/overseer-protocol.md#statement-distribution-message
+[DCM]: ../../types/overseer-protocol.md#dispute-coordinator-message
+
+[CP]: ../collators/collator-protocol.md
+[CV]: ../utility/candidate-validation.md
+[SD]: statement-distribution.md
+[RA]: ../utility/runtime-api.md
+[PV]: ../utility/provisioner.md
@@ -0,0 +1 @@
+# PoV Distribution
@@ -0,0 +1,162 @@
+# Prospective Teyrchains
+
+> NOTE: This module has undergone changes for the elastic scaling implementation. As a result, parts of this document
+> may be out of date and will be updated at a later time. Issue tracking the update:
+> <https://github.com/pezkuwichain/pezkuwi-sdk/issues/132>
+
+## Overview
+
+**Purpose:** Tracks and handles prospective teyrchain fragments and informs
+other backing-stage subsystems of work to be done.
+
+"prospective":
+- [*prə'spɛktɪv*] adj.
+- future, likely, potential
+
+Asynchronous backing changes the runtime to accept teyrchain candidates from a
+certain allowed range of historic relay-parents. This means we can now build
+*prospective teyrchains* – that is, trees of potential (but likely) future
+teyrchain blocks. This is the subsystem responsible for doing so.
+
+Other subsystems such as Backing rely on Prospective Teyrchains, e.g. for
+determining if a candidate can be seconded. This subsystem is the main
+coordinator of work within the node for the collation and backing phases of
+teyrchain consensus.
+
+Prospective Teyrchains is primarily an implementation of fragment trees. It also
+handles concerns such as:
+
+- the relay-chain being forkful
+- session changes
+
+See the following sections for more details.
+
+### Fragment Trees
+
+This subsystem builds up fragment trees, which are trees of prospective para
+candidates. Each path through the tree represents a possible state transition
+path for the para. Each potential candidate is a fragment, or a node, in the
+tree. Candidates are validated against constraints as they are added.
+
+This subsystem builds up trees for each relay-chain block in the view, for each
+para. These fragment trees are used for:
+
+- providing backable candidates to other subsystems
+- sanity-checking that candidates can be seconded
+- getting seconded candidates under active leaves
+- etc.
+
+For example, here is a tree with several possible paths:
+
+```
+Para Head registered by the relay chain:     included_head
+                                                  ↲  ↳
+depth 0:                                  head_0_a    head_0_b
+                                             ↲            ↳
+depth 1:                             head_1_a              head_1_b
+                                  ↲      |     ↳
+depth 2:                 head_2_a1   head_2_a2  head_2_a3
+```
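+
+To make the "each path is a possible state transition path" idea concrete, here is a small sketch that models the
+tree above and enumerates its root-to-leaf paths. The `Fragment` struct and the `fragment`, `paths` and
+`example_tree` helpers are hypothetical and far simpler than the subsystem's real data structures; treating
+`included_head` as the root node is a choice made for this sketch.
+
+```rust
+/// One node of a hypothetical, simplified fragment tree.
+struct Fragment {
+    head: &'static str,
+    children: Vec<Fragment>,
+}
+
+/// Convenience constructor.
+fn fragment(head: &'static str, children: Vec<Fragment>) -> Fragment {
+    Fragment { head, children }
+}
+
+/// Enumerate every root-to-leaf path through the tree.
+fn paths(node: &Fragment) -> Vec<Vec<&'static str>> {
+    if node.children.is_empty() {
+        return vec![vec![node.head]];
+    }
+    let mut out = Vec::new();
+    for child in &node.children {
+        for mut tail in paths(child) {
+            // Prepend this node's head to each path found below it.
+            tail.insert(0, node.head);
+            out.push(tail);
+        }
+    }
+    out
+}
+
+/// The example tree drawn above, rooted at `included_head`.
+fn example_tree() -> Fragment {
+    fragment("included_head", vec![
+        fragment("head_0_a", vec![fragment("head_1_a", vec![
+            fragment("head_2_a1", vec![]),
+            fragment("head_2_a2", vec![]),
+            fragment("head_2_a3", vec![]),
+        ])]),
+        fragment("head_0_b", vec![fragment("head_1_b", vec![])]),
+    ])
+}
+```
+
+Calling `paths(&example_tree())` yields four paths: three ending in `head_2_a1`, `head_2_a2` and `head_2_a3`, and one
+ending in `head_1_b`.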
+
+### The Relay-Chain Being Forkful
+
+We account for the same candidate possibly appearing in different forks. While
+we still build fragment trees for each head in each fork, we are efficient with
+how we reference candidates to save space.
+
+### Session Changes
+
+Allowed ancestry does not cross session boundaries. That is, when a session
+starts, you can only build on top of the freshest relay parent. This is a
+current limitation that may be lifted in the future.
+
+Also, runtime configuration values needed for constraints (such as
+`max_pov_size`) are constant within a session. This is important when building
+prospective validation data. This is unlikely to change.
+
+## Messages
+
+### Incoming
+
+- `ActiveLeaves`
+  - Notification of a change in the set of active leaves.
+  - Constructs fragment trees for each para for each new leaf.
+- `ProspectiveTeyrchainsMessage::IntroduceCandidate`
+  - Informs the subsystem of a new candidate.
+  - Sent by the Backing Subsystem when it is importing a statement for a
+    new candidate.
+- `ProspectiveTeyrchainsMessage::CandidateSeconded`
+  - Informs the subsystem that a previously introduced candidate has
+    been seconded.
+  - Sent by the Backing Subsystem when it is importing a statement for a
+    new candidate after it sends `IntroduceCandidate`, if that wasn't
+    rejected by Prospective Teyrchains.
+- `ProspectiveTeyrchainsMessage::CandidateBacked`
+  - Informs the subsystem that a previously introduced candidate has
+    been backed.
+  - Sent by the Backing Subsystem after it successfully imports a
+    statement giving a candidate the necessary quorum of backing votes.
+- `ProspectiveTeyrchainsMessage::GetBackableCandidates`
+  - Get the requested number of backable candidate hashes along with their relay parent for a given
+    teyrchain, under a given relay-parent (leaf) hash, which are descendants of given candidate
+    hashes.
+  - Sent by the Provisioner when requesting backable candidates, when
+    selecting candidates for a given relay-parent.
+- `ProspectiveTeyrchainsMessage::GetHypotheticalMembership`
+  - Gets the hypothetical frontier membership of candidates with the
+    given properties under the specified active leaves' fragment trees.
+  - Sent by the Backing Subsystem when sanity-checking whether a candidate can
+    be seconded based on its hypothetical frontiers.
+- `ProspectiveTeyrchainsMessage::GetMinimumRelayParents`
+  - Gets the minimum accepted relay-parent number for each para in the
+    fragment tree for the given relay-chain block hash.
+  - That is, this returns the minimum relay-parent block number in the
+    same branch of the relay-chain which is accepted in the fragment
+    tree for each para-id.
+  - Sent by the Backing, Statement Distribution, and Collator Protocol
+    subsystems when activating leaves in the implicit view.
+- `ProspectiveTeyrchainsMessage::GetProspectiveValidationData`
+  - Gets the validation data of some prospective candidate. The
+    candidate doesn't need to be part of any fragment tree.
+  - Sent by the Collator Protocol subsystem (validator side) when
+    handling a fetched collation result.
+
+### Outgoing
+
+- `RuntimeApiRequest::ParaBackingState`
+  - Gets the backing state of the given para (the constraints of the para and
+    candidates pending availability).
+- `RuntimeApiRequest::BackingConstraints`
+  - Gets the constraints on the actions that can be taken by a new teyrchain
+    block.
+- `RuntimeApiRequest::AvailabilityCores`
+  - Gets information on all availability cores.
+- `ChainApiMessage::Ancestors`
+  - Requests the `k` ancestor block hashes of a block with the given
+    hash.
+- `ChainApiMessage::BlockHeader`
+  - Requests the block header by hash.
+
+## Glossary
+
+- **Candidate storage:** Stores candidates and information about them
+  such as their relay-parents and their backing states. Is indexed in
+  various ways.
+- **Constraints:**
+  - Constraints on the actions that can be taken by a new teyrchain
+    block.
+  - Exhaustively define the set of valid inputs and outputs to teyrchain
+    execution.
+- **Fragment:** A prospective para block (that is, a block not yet referenced by
+  the relay-chain). Fragments are anchored to the relay-chain at a particular
+  relay-parent.
+- **Fragment tree:**
+  - A tree of fragments. Together, these fragments define one or more
+    prospective paths a teyrchain's state may transition through.
+  - See the "Fragment Trees" section.
+- **Inclusion emulation:** Emulation of the logic that the runtime uses
+  for checking teyrchain blocks.
+- **Relay-parent:** A particular relay-chain block that a fragment is
+  anchored to.
+- **Scope:** The scope of a fragment tree, defining limits on nodes
+  within the tree.
@@ -0,0 +1,412 @@
+# Statement Distribution
+
+This subsystem is responsible for distributing signed statements that we have generated and forwarding statements
+generated by our peers. Received candidate receipts and statements are passed to the [Candidate Backing
+subsystem](candidate-backing.md) to handle producing local statements. On receiving
+`StatementDistributionMessage::Share`, this subsystem distributes the message across the network with redundancy to
+ensure a fast backing process.
+
+## Overview
+
+**Goal:** every well-connected node is aware of every next potential teyrchain block.
+
+Validators can either:
+
+- receive a teyrchain block from a collator, check the block, and gossip a statement.
+- receive statements from other validators, check the teyrchain block if it originated within their own group, and
+  forward the statement via gossip if valid.
+
+Validators must have statements, candidates, and persisted validation data from all other validators. This is because
+we need to store statements from validators who've checked the candidate on the relay chain, so we know who to hold
+accountable in case of disputes. Any validator can be selected as the next relay-chain block author, and this is not
+revealed in advance for security reasons. As a result, all validators must have an up-to-date view of all possible
+teyrchain candidates + backing statements that could be placed on-chain in the next block.
+
+[This blog post](https://pezkuwichain.io/blog/polkadot-v1-0-sharding-and-economic-security) puts it another way:
+"Validators who aren't assigned to the teyrchain still listen for the attestations [statements] because whichever
+validator ends up being the author of the relay-chain block needs to bundle up attested teyrchain blocks for several
+teyrchains and place them into the relay-chain block."
+
+Backing-group quorum (that is, enough backing group votes) must be reached before the block author will consider the
+candidate. Therefore, validators need to consider _all_ seconded candidates within their own group, because that's what
+they're assigned to work on. Validators only need to consider _backable_ candidates from other groups. This informs the
+design of the statement distribution protocol to have separate phases for in-group and out-group distribution,
+respectively called "cluster" and "grid" mode (see below).
+
+### With Async Backing
+
+Asynchronous backing changes the runtime to accept teyrchain candidates from a certain allowed range of historic
+relay-parents. These candidates must be backed by the group assigned to the teyrchain as-of their corresponding relay
+parents.
+
+## Protocol
+
+To address the concern of dealing with large numbers of spam candidates or statements, the overall design approach is to
+combine a focused "clustering" protocol for legitimate fresh candidates with a broad-distribution "grid" protocol to
+quickly get backed candidates into the hands of many validators. Validators do not eagerly send each other heavy
+`CommittedCandidateReceipt`, but instead request these lazily through request/response protocols.
+
+A high-level description of the protocol follows:
+
+### Messages
+
+Nodes can send each other a few kinds of messages: `Statement`, `BackedCandidateManifest`,
+`BackedCandidateAcknowledgement`.
+
+- `Statement` messages contain only a signed compact statement, without full candidate info.
+- `BackedCandidateManifest` messages advertise a description of a backed candidate and stored statements.
+- `BackedCandidateAcknowledgement` messages acknowledge that a backed candidate is fully known.
+
+### Request/response protocol
+
+Nodes can request the full `CommittedCandidateReceipt` and `PersistedValidationData`, along with statements, over a
+request/response protocol. This is the `AttestedCandidateRequest`; the response is `AttestedCandidateResponse`.
+
+### Importability and the Hypothetical Frontier
+
+The **prospective teyrchains** subsystem maintains prospective "fragment trees" which can be used to determine whether a
+particular teyrchain candidate could possibly be included in the future. Candidates which either are within a fragment
+tree or _would be_ part of a fragment tree if accepted are said to be in the "hypothetical frontier".
+
+The **statement-distribution** subsystem keeps track of all candidates, and updates its knowledge of the hypothetical
+frontier based on events such as new relay parents, new confirmed candidates, and newly backed candidates.
+
+We only consider statements as "importable" when the corresponding candidate is part of the hypothetical frontier, and
+only send "importable" statements to the backing subsystem itself.
+
+### Cluster Mode
+
+- Validator nodes are partitioned into groups (with some exceptions), and validators within a group at a relay-parent
+  can send each other `Statement` messages for any candidates within that group and based on that relay-parent.
+- This is referred to as the "cluster" mode.
+  - Right now these are the same as backing groups, though "cluster" specifically refers to the set of nodes
+    communicating with each other in the first phase of distribution.
+- `Seconded` statements must be sent before `Valid` statements.
+- `Seconded` statements may only be sent to other members of the group when the candidate is fully known by the local
+  validator.
+  - "Fully known" means the validator has the full `CommittedCandidateReceipt` and `PersistedValidationData`, which it
+    receives on request from other validators or from a collator.
+  - The reason for this is that sending a statement (which is always a `CompactStatement` carrying nothing but a hash
+    and signature) to the cluster, is also a signal that the sending node is available to request the candidate from.
+  - This makes the protocol easier to reason about, while also reducing network messages about candidates that don't
+    really exist.
+- Validators in a cluster receiving messages about unknown candidates request the candidate (and statements) from other
+  cluster members which have it.
+- Spam considerations
+  - The maximum depth of candidates allowed in asynchronous backing determines the maximum number of `Seconded`
+    statements originating from a validator V which each validator in a cluster may send to others. This bounds the
+    number of candidates.
+  - There is a small number of validators in each group, which further limits the number of candidates.
+- We accept candidates which don't fit in the fragment trees of any relay parents.
+  - "Accept" means "attempt to request and store in memory until useful or expired".
+  - We listen to the prospective teyrchains subsystem to learn of new additions to the fragment trees.
+  - Use this to attempt to import the candidate later.
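+
+The ordering rule above (`Seconded` before `Valid`) can be sketched as a small per-peer tracker. This is only an
+illustration: `Statement`, `ClusterSender`, and the integer peer and candidate identifiers are hypothetical stand-ins,
+and the sketch deliberately ignores the "fully known" and seconding-limit checks.
+
+```rust
+use std::collections::HashSet;
+
+/// Hypothetical, simplified compact statement: the payload is a candidate hash.
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+enum Statement {
+    Seconded(u64),
+    Valid(u64),
+}
+
+/// Tracks, per cluster peer, the candidates for which `Seconded` has already been sent.
+#[derive(Default)]
+struct ClusterSender {
+    seconded_sent: HashSet<(u32, u64)>,
+}
+
+impl ClusterSender {
+    /// In this sketch `Seconded` may always go out; `Valid` only after a `Seconded` for the same candidate.
+    fn can_send(&self, peer: u32, statement: Statement) -> bool {
+        match statement {
+            Statement::Seconded(_) => true,
+            Statement::Valid(candidate) => self.seconded_sent.contains(&(peer, candidate)),
+        }
+    }
+
+    /// Records a sent statement; returns `false` if the ordering rule forbids sending it.
+    fn send(&mut self, peer: u32, statement: Statement) -> bool {
+        if !self.can_send(peer, statement) {
+            return false;
+        }
+        if let Statement::Seconded(candidate) = statement {
+            self.seconded_sent.insert((peer, candidate));
+        }
+        true
+    }
+}
+
+fn main() {
+    let mut sender = ClusterSender::default();
+    // `Valid` is refused until `Seconded` for candidate 42 has gone to peer 1.
+    println!("valid first: {}", sender.send(1, Statement::Valid(42)));
+    println!("seconded: {}", sender.send(1, Statement::Seconded(42)));
+    println!("valid after: {}", sender.send(1, Statement::Valid(42)));
+}
+```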
+
+### Grid Mode
+
+- Every consensus session provides randomness and a fixed validator set, which is used to build a redundant grid
+  topology.
+  - It's redundant in the sense that there are 2 paths from every node to every other node. See "Grid Topology" section
+    for more details.
+- This grid topology is used to create a sending path from each validator group to every validator.
+- When a node observes a candidate as backed, it sends a `BackedCandidateManifest` to its "receiving" nodes.
+- If receiving nodes don't yet know the candidate, they request it.
+- Once they know the candidate, they respond with a `BackedCandidateAcknowledgement`.
+- Once two nodes perform a manifest/acknowledgement exchange, they can send `Statement` messages directly to each other
+  for any new statements they might need.
+  - This limits the number of statements we'd have to deal with w.r.t. candidates that don't really exist. See "Manifest
+    Exchange" section.
+- There are limitations on the number of candidates that can be advertised by each peer, similar to those in the
+  cluster. Validators do not request candidates which exceed these limitations.
+- Validators request candidates as soon as they are advertised, but do not import the statements until the candidate is
+  part of the hypothetical frontier, and do not re-advertise or acknowledge until the candidate is considered both
+  backable and part of the hypothetical frontier.
+- Note that requesting is not an implicit acknowledgement, and an explicit acknowledgement must be sent upon receipt.
+
+### Disabled validators
+
+After a validator is disabled in the runtime, other validators should no longer
+accept statements from it. Filtering out of statements from disabled validators
+on the node side is purely an optimization, as it will be done in the runtime
+as well.
+
+We use the state of the relay parent to check whether a validator is disabled
+to avoid race conditions and ensure that disabling works well in the presence
+of re-enabling.
+
+## Messages
+
+### Incoming
+
+- `ActiveLeaves`
+  - Notification of a change in the set of active leaves.
+- `StatementDistributionMessage::Share`
+  - Notification of a locally-originating statement. That is, this statement comes from our node and should be
+    distributed to other nodes.
+  - Sent by the Backing Subsystem after it successfully imports a locally-originating statement.
+- `StatementDistributionMessage::Backed`
+  - Notification of a candidate being backed (received enough validity votes from the backing group).
+  - Sent by the Backing Subsystem after it successfully imports a statement for the first time and after sending
+    `Share`.
+- `StatementDistributionMessage::NetworkBridgeUpdate`
+  - See next section.
+
+#### Network bridge events
+
+- v1 compatibility
+  - Messages for the v1 protocol are routed to the legacy statement distribution.
+- `Statement`
+  - Notification of a signed statement.
+  - Sent by a peer's Statement Distribution subsystem when circulating statements.
+- `BackedCandidateManifest`
+  - Notification of a backed candidate being known by the sending node.
+  - For the candidate being requested by the receiving node if needed.
+  - Announcement.
+  - Sent by a peer's Statement Distribution subsystem.
+- `BackedCandidateKnown`
+  - Notification of a backed candidate being known by the sending node.
+  - For informing a receiving node which already has the candidate.
+  - Acknowledgement.
+  - Sent by a peer's Statement Distribution subsystem.
+
+### Outgoing
+
+- `NetworkBridgeTxMessage::SendValidationMessages`
+  - Sends a peer all pending messages / acknowledgements / statements for a relay parent, either through the cluster or
+    the grid.
+- `NetworkBridgeTxMessage::SendValidationMessage`
+  - Circulates a compact statement to all peers who need it, either through the cluster or the grid.
+- `NetworkBridgeTxMessage::ReportPeer`
+  - Reports a peer (either good or bad).
+- `CandidateBackingMessage::Statement`
+  - Note a validator's statement about a particular candidate.
+- `ProspectiveTeyrchainsMessage::GetHypotheticalMembership`
+  - Gets the hypothetical frontier membership of candidates under active leaves' fragment trees.
+- `NetworkBridgeTxMessage::SendRequests`
+  - Sends requests, initiating the request/response protocol.
+
+## Request/Response
+
+We also have a request/response protocol because validators do not eagerly send each other heavy
+`CommittedCandidateReceipt`s, but instead need to request these lazily.
+
+### Protocol
+
+1. Requesting Validator
+
+   - Requests are queued up with `RequestManager::get_or_insert`.
+     - Done as needed, when handling incoming manifests/statements.
+   - `RequestManager::dispatch_requests` sends any queued-up requests.
+     - Calls `RequestManager::next_request` to completion.
+       - Creates the `OutgoingRequest`, saves the receiver in `RequestManager::pending_responses`.
+     - Does nothing if we have more responses pending than the limit of parallel requests.
+
+2. Peer
+
+   - Requests come in on a peer on the `IncomingRequestReceiver`.
+     - Runs in a background responder task which feeds requests to `answer_request` through `MuxedMessage`.
+     - This responder task has a limit on the number of parallel requests.
+   - `answer_request` on the peer takes the request and sends a response.
+     - Does this using the response sender on the request.
+
+3. Requesting Validator
+
+   - `receive_response` on the original validator yields a response.
+     - Response was sent on the request's response sender.
+     - Uses `RequestManager::await_incoming` to await on pending responses in an unordered fashion.
+     - Runs on the `MuxedMessage` receiver.
+   - `handle_response` handles the response.
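+
+The round trip above can be sketched with standard-library channels. This is a minimal model, not the subsystem's
+implementation: `Request`, `Response`, and `request_candidate` are hypothetical names, a thread stands in for the
+background responder task, and a blocking `recv` stands in for awaiting the pending response.
+
+```rust
+use std::sync::mpsc;
+use std::thread;
+
+/// Hypothetical stand-in for `AttestedCandidateRequest`, carrying its own response sender.
+struct Request {
+    candidate_hash: u64,
+    response_tx: mpsc::Sender<Response>,
+}
+
+/// Hypothetical stand-in for `AttestedCandidateResponse`.
+#[derive(Debug, PartialEq)]
+struct Response {
+    candidate_hash: u64,
+    receipt: String,
+}
+
+/// Peer side: answer a request using the response sender attached to it.
+fn answer_request(req: Request) {
+    let receipt = format!("receipt-for-{}", req.candidate_hash);
+    // The requester may have gone away; ignoring the send error is fine in this sketch.
+    let _ = req.response_tx.send(Response { candidate_hash: req.candidate_hash, receipt });
+}
+
+/// Requester side: dispatch one request and wait for its response.
+fn request_candidate(candidate_hash: u64) -> Response {
+    let (request_tx, request_rx) = mpsc::channel::<Request>();
+    // Background responder on the peer, feeding each incoming request to `answer_request`.
+    let responder = thread::spawn(move || {
+        for req in request_rx {
+            answer_request(req);
+        }
+    });
+
+    let (response_tx, response_rx) = mpsc::channel();
+    request_tx.send(Request { candidate_hash, response_tx }).expect("responder is running");
+    let response = response_rx.recv().expect("responder answers every request");
+
+    drop(request_tx); // closing the request channel lets the responder loop end
+    responder.join().expect("responder does not panic");
+    response
+}
+
+fn main() {
+    println!("{:?}", request_candidate(7));
+}
+```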
+
+### API
+
+- `dispatch_requests`
+  - Dispatches pending requests for candidate data & statements.
+- `answer_request`
+  - Answers an incoming request for a candidate.
+  - Takes an incoming `AttestedCandidateRequest`.
+- `receive_response`
+  - Wait on the next incoming response.
+  - If there are no requests pending, this future never resolves.
+  - Returns `UnhandledResponse`
+- `handle_response`
+  - Handles an incoming response.
+  - Takes `UnhandledResponse`
+
+## Manifests
+
+A manifest is a message about a known backed candidate, along with a description of the statements backing it. It can be
+one of two kinds:
+
+- `Full`: Contains information about the candidate and should be sent to peers who may not have the candidate yet. This
+  is also called an `Announcement`.
+- `Acknowledgement`: Omits information implicit in the candidate, and should be sent to peers which are guaranteed to
+  have the candidate already.
+
+### Manifest Exchange
+
+Manifest exchange is when a receiving node receives a `Full` manifest and replies with an `Acknowledgement`. It
+indicates that both nodes know the candidate as valid and backed. This allows the nodes to send `Statement` messages
+directly to each other for any new statements.
+
+Why? This limits the number of statements we'd have to deal with w.r.t. candidates that don't really exist. Limiting
+out-of-group statement distribution between peers to only candidates that both peers agree are backed and exist ensures
+we only have to store statements about real candidates.
+
+In practice, manifest exchange means that one of three things has happened:
+
+- They announced, we acknowledged.
+- We announced, they acknowledged.
+- We announced, they announced.
+
+Concerning the last case, note that it is possible for two nodes to have each other in their sending set. Consider:
+
+```
+1 2
+3 4
+```
+
+If validators 2 and 4 are in group B, then there is a path `2->1->3` and `4->3->1`. Therefore, 1 and 3 might send each
+other manifests for the same candidate at the same time, without having seen the other's yet. This also counts as a
+manifest exchange, but is only allowed to occur in this way.
+
+After the exchange is complete, we update pending statements. Pending statements are those we know locally that the
+remote node does not.
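+
+As a rough sketch, the three completion cases listed in this section can be modelled as a predicate over four flags
+kept per peer and per candidate. `ManifestExchange` and its field names are hypothetical and chosen only for
+illustration.
+
+```rust
+/// What has been exchanged with one peer about one candidate (hypothetical model).
+#[derive(Default, Clone, Copy)]
+struct ManifestExchange {
+    we_announced: bool,
+    they_announced: bool,
+    we_acknowledged: bool,
+    they_acknowledged: bool,
+}
+
+impl ManifestExchange {
+    /// True once any of the three cases from the text has happened.
+    fn is_complete(&self) -> bool {
+        (self.they_announced && self.we_acknowledged)
+            || (self.we_announced && self.they_acknowledged)
+            || (self.we_announced && self.they_announced)
+    }
+}
+
+fn main() {
+    // Both sides announced at the same time without seeing the other's manifest first.
+    let crossed = ManifestExchange { we_announced: true, they_announced: true, ..Default::default() };
+    println!("crossed announcements complete: {}", crossed.is_complete());
+
+    // Only we announced so far: direct `Statement` messages are not yet allowed.
+    let pending = ManifestExchange { we_announced: true, ..Default::default() };
+    println!("single announcement complete: {}", pending.is_complete());
+}
+```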
+
+#### Alternative Paths Through The Topology
+
+Nodes should send a `BackedCandidateAcknowledgement(CandidateHash, StatementFilter)` notification to any peer which has
+sent a manifest when the candidate has already been acquired by other means. This keeps alternative paths through the
+topology open, which allows nodes to receive additional statements that come later, but not after the candidate has
+been posted on-chain.
+
+This is mostly about the limitation that the runtime has no way for block authors to post statements that come after the
+parablock is posted on-chain and ensure those validators still get rewarded. Technically, we only need enough statements
+to back the candidate and the manifest + request will provide that. But more statements might come shortly afterwards,
+and we want those to end up on-chain as well to ensure all validators in the group are rewarded.
+
+For clarity, here is the full timeline:
+
+1. candidate seconded
+1. backable in cluster
+1. distributed along grid
+1. latecomers issue statements
+1. candidate posted on chain
+1. very late latecomers issue statements
+
+## Cluster Module
+
+The cluster module provides direct distribution of unbacked candidates within a group. By utilizing this initial phase
+of propagating only within clusters/groups, we bound the number of `Seconded` messages per validator per relay-parent,
+helping us prevent spam. Validators can try to circumvent this, but they would only consume a few KB of memory and it is
+trivially slashable on chain.
+
+The cluster module determines whether to accept/reject messages from other validators in the same group. It keeps track
+of what we have sent to other validators in the group, and pending statements. For the full protocol, see "Protocol".
+
+## Grid Module
+
+The grid module provides distribution of backed candidates and late statements outside the backing group. For the full
+protocol, see the "Protocol" section.
+
+### Grid Topology
+
+For distributing outside our cluster (aka backing group) we use a 2D grid topology. This limits the number of peers we
+send messages to, and handles view updates.
+
+The basic operation of the grid topology is that:
+
+- A validator producing a message sends it to its row-neighbors and its column-neighbors.
+- A validator receiving a message originating from one of its row-neighbors sends it to its column-neighbors.
+- A validator receiving a message originating from one of its column-neighbors sends it to its row-neighbors.
+
+This grid approach defines 2 unique paths for every validator to reach every other validator in at most 2 hops,
+providing redundancy.
+
+Propagation follows these rules:
+
+- Each node has a receiving set and a sending set. These are different for each group. That is, if a node receives a
+  candidate from group A, it checks if it is allowed to receive from that node for candidates from group A.
+- For groups that we are in, receive from nobody and send to our X/Y peers.
+- For groups that we are not part of:
+  - We receive from any validator in the group we share a slice with and send to the corresponding X/Y slice in the
+    other dimension.
+  - For any validators we don't share a slice with, we receive from the nodes which share a slice with them.
+
+### Example
+
+For size 11, the matrix would be:
+
+```
+0  1  2
+3  4  5
+6  7  8
+9 10
+```
+
+For example, for index 10, the neighbors would be 1, 4, 7, 9 -- these are the nodes we could directly communicate with
+(i.e. either send to or receive from).
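+
+The neighbor computation for this example can be sketched as follows. The function name `grid_neighbors` is
+hypothetical, and the grid width is assumed to be the integer square root of the validator count, which matches the
+three-column matrix shown above.
+
+```rust
+/// Row and column neighbors of `index` in a grid holding `len` validators.
+///
+/// Assumption: the grid width is floor(sqrt(len)), as in the 11-validator example.
+fn grid_neighbors(index: usize, len: usize) -> Vec<usize> {
+    // Integer floor of the square root, computed without floating point.
+    let mut width = 1;
+    while (width + 1) * (width + 1) <= len {
+        width += 1;
+    }
+
+    let (row, col) = (index / width, index % width);
+    // Iterating in ascending order yields an already sorted result.
+    (0..len)
+        .filter(|&i| i != index && (i / width == row || i % width == col))
+        .collect()
+}
+
+fn main() {
+    println!("{:?}", grid_neighbors(10, 11)); // prints [1, 4, 7, 9]
+}
+```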
+
+Now, which of these neighbors can 10 receive from? Recall that the sending/receiving sets for 10 would be different for
+different groups. Here are some hypothetical scenarios:
+
+- **Scenario 1:** 9 belongs to group A but not 10. Here, 10 can directly receive candidates from group A from 9. 10
+  would propagate them to the nodes in {1, 4, 7} that are not in A.
+- **Scenario 2:** 6 is in group A instead of 9, and 7 is not in group A. 10 can receive group A messages from 7 or 9. 10
+  will try to relay these messages, but 7 and 9 together should have already propagated the message to all x/y peers of
+  10. If so, then 10 will just receive acknowledgements in reply rather than requests.
+- **Scenario 3:** 10 itself is in group A. 10 would not receive candidates from this group from any other nodes through
+  the grid. It would itself send such candidates to all its neighbors that are not in A.
+
+### Seconding Limit
+
+The seconding limit is a per-validator limit. Before asynchronous backing, we had a rule that every validator was only
+allowed to second one candidate per relay parent. With asynchronous backing, we have a 'maximum depth' which makes it
+possible to second multiple candidates per relay parent. The seconding limit is set to `max depth + 1` to set an upper
+bound on candidates entering the system.
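+
+A minimal sketch of enforcing this limit, assuming a hypothetical `SecondingTracker` keyed by validator index and a
+numeric relay-parent identifier (the actual subsystem tracks this inside its cluster state):
+
+```rust
+use std::collections::HashMap;
+
+/// Counts `Seconded` statements per (validator, relay parent) pair.
+struct SecondingTracker {
+    limit: usize,
+    seconded: HashMap<(u32, u64), usize>,
+}
+
+impl SecondingTracker {
+    /// The seconding limit is `max depth + 1`.
+    fn new(max_depth: usize) -> Self {
+        Self { limit: max_depth + 1, seconded: HashMap::new() }
+    }
+
+    /// Returns `true` if the statement is within the limit and was counted.
+    fn note_seconded(&mut self, validator: u32, relay_parent: u64) -> bool {
+        let count = self.seconded.entry((validator, relay_parent)).or_insert(0);
+        if *count >= self.limit {
+            return false;
+        }
+        *count += 1;
+        true
+    }
+}
+
+fn main() {
+    // With a maximum depth of 1, each validator may second two candidates per relay parent.
+    let mut tracker = SecondingTracker::new(1);
+    for attempt in 1..=3 {
+        println!("attempt {}: accepted = {}", attempt, tracker.note_seconded(0, 100));
+    }
+}
+```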
+
+## Candidates Module
+
+The candidates module provides a tracker for all known candidates in the view, whether they are confirmed or not, and
+how peers have advertised the candidates. What is a confirmed candidate? It is a candidate for which we have the full
+receipt and the persisted validation data. This module gets confirmed candidates from two sources:
+
+- It can be that a validator fetched a collation directly from the collator and validated it.
+- The first time a validator gets an announcement for an unknown candidate, it will send a request for the candidate.
+  Upon receiving a response and validating it (see `UnhandledResponse::validate_response`), it will mark the candidate
+  as confirmed.
+
+## Requests Module
+
+The requests module provides a manager for pending requests for candidate data, as well as pending responses. See
+the "Request/Response" section for a high-level description of the flow. See module-docs for full details.
+
+## Glossary
+
+- **Acknowledgement:** A partial manifest sent to a validator that already has the candidate to inform them that the
+  sending node also knows the candidate. Concludes a manifest exchange.
+- **Announcement:** A full manifest indicating that a backed candidate is known by the sending node. Initiates a
+  manifest exchange.
+- **Attestation:** See "Statement".
+- **Backable vs. Backed:**
+  - Note that we sometimes use "backed" to refer to candidates that are "backable", but not yet backed on chain.
+  - **Backed** should technically mean that the parablock candidate and its backing statements have been added to a
+    relay chain block.
+  - **Backable** is when the necessary backing statements have been acquired but those statements and the parablock
+    candidate haven't been backed in a relay chain block yet.
+- **Fragment tree:** A teyrchain fragment not referenced by the relay-chain. It is a tree of prospective teyrchain
+  blocks.
+- **Manifest:** A message about a known backed candidate, along with a description of the statements backing it. There
+  are two kinds of manifest, `Acknowledgement` and `Announcement`. See "Manifests" section.
+- **Peer:** Another validator that a validator is connected to.
+- **Request/response:** A protocol used to lazily request and receive heavy candidate data when needed.
+- **Reputation:** Tracks reputation of peers. Applies annoyance cost and good behavior benefits.
+- **Statement:** Signed statements that can be made about teyrchain candidates.
+  - **Seconded:** Proposal of a teyrchain candidate. Implicit validity vote.
+  - **Valid:** States that a teyrchain candidate is valid.
+- **Target:** Target validator to send a statement to.
+- **View:** Current knowledge of the chain state.
+  - **Explicit view** / **immediate view**
+    - The view a peer has of the relay chain heads and highest finalized block.
+  - **Implicit view**
+    - Derived from the immediate view. Composed of active leaves and minimum relay-parents allowed for candidates of
+      various teyrchains at those leaves.
@@ -0,0 +1,8 @@
+# Collators
+
+Collators are special nodes which bridge a teyrchain to the relay chain. They are simultaneously full nodes of the
+teyrchain, and at least light clients of the relay chain. Their overall contribution to the system is the generation of
+Proofs of Validity for teyrchain candidates.
+
+The **Collation Generation** subsystem triggers collators to produce collations and then forwards them to **Collator
+Protocol** to circulate to validators.
@@ -0,0 +1,142 @@
+# Collation Generation
+
+The collation generation subsystem is executed on collator nodes and produces candidates to be distributed to
+validators. If configured to produce collations for a para, it produces collations and then feeds them to the [Collator
+Protocol][CP] subsystem, which handles the networking.
+
+## Protocol
+
+Collation generation for Teyrchains currently works in the following way:
+
+1. A new relay chain block is imported.
+2. The collation generation subsystem checks if the core associated with the teyrchain is free and if yes, continues.
+3. Collation generation calls our collator callback, if present, to generate a PoV. If none exists, do nothing.
+4. Authoring logic determines if the current node should build a PoV.
+5. Build new PoV and give it back to collation generation.
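+
+The steps above can be sketched as a single function. This is a simplified, synchronous illustration: `Pov`,
+`CollatorCallback`, and `on_relay_block_imported` are hypothetical stand-ins, the relay parent is reduced to a number,
+and the authoring decision of step 4 is folded into the callback returning `None`.
+
+```rust
+/// Hypothetical stand-in for a proof of validity.
+type Pov = Vec<u8>;
+
+/// Hypothetical, synchronous stand-in for the collator callback.
+type CollatorCallback = Box<dyn Fn(u64) -> Option<Pov>>;
+
+/// Runs the collation generation steps for one imported relay chain block.
+fn on_relay_block_imported(
+    relay_parent: u64,
+    core_is_free: bool,
+    collator: Option<&CollatorCallback>,
+) -> Option<Pov> {
+    // Step 2: only continue if the core assigned to the teyrchain is free.
+    if !core_is_free {
+        return None;
+    }
+    // Step 3: without a collator callback there is nothing to do.
+    let collator = collator?;
+    // Steps 4 and 5: the callback decides whether to author and, if so, builds the PoV.
+    collator(relay_parent)
+}
+
+fn main() {
+    // Toy authoring logic: this node only builds on even relay parents.
+    let collator: CollatorCallback = Box::new(|relay_parent: u64| {
+        if relay_parent % 2 == 0 { Some(vec![0xAB]) } else { None }
+    });
+    println!("{:?}", on_relay_block_imported(2, true, Some(&collator)));
+    println!("{:?}", on_relay_block_imported(2, false, Some(&collator)));
+}
+```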
+
+## Messages
+
+### Incoming
+
+- `ActiveLeaves`
+  - Notification of a change in the set of active leaves.
+  - Triggers collation generation procedure outlined in "Protocol" section.
+- `CollationGenerationMessage::Initialize`
+  - Initializes the subsystem. Carries a config.
+  - No more than one initialization message should ever be sent to the collation generation subsystem.
+  - Sent by a collator to initialize this subsystem.
+- `CollationGenerationMessage::SubmitCollation`
+  - If the subsystem isn't initialized or the relay-parent is too old to be relevant, ignore the message.
+  - Otherwise, use the provided parameters to generate a [`CommittedCandidateReceipt`].
+  - Submit the collation to the collator-protocol with `CollatorProtocolMessage::DistributeCollation`.
+
+### Outgoing
+
+- `CollatorProtocolMessage::DistributeCollation`
+  - Provides a generated collation to distribute to validators.
+
+## Functionality
+
+The process of generating a collation for a teyrchain is very teyrchain-specific. As such, the details of how to do so
+are left beyond the scope of this description. The subsystem should be implemented as an abstract wrapper, which is
+aware of this configuration:
+
+```rust
+/// The output of a collator.
+///
+/// This differs from `CandidateCommitments` in two ways:
+///
+/// - does not contain the erasure root; that's computed at the Pezkuwi level, not at Cumulus
+/// - contains a proof of validity.
+pub struct Collation {
+  /// Messages destined to be interpreted by the Relay chain itself.
+  pub upward_messages: Vec<UpwardMessage>,
+  /// The horizontal messages sent by the teyrchain.
+  pub horizontal_messages: Vec<OutboundHrmpMessage<ParaId>>,
+  /// New validation code.
+  pub new_validation_code: Option<ValidationCode>,
+  /// The head-data produced as a result of execution.
+  pub head_data: HeadData,
+  /// Proof to verify the state transition of the teyrchain.
+  pub proof_of_validity: PoV,
+  /// The number of messages processed from the DMQ.
+  pub processed_downward_messages: u32,
+  /// The mark which specifies the block number up to which all inbound HRMP messages are processed.
+  pub hrmp_watermark: BlockNumber,
+}
+
+/// Result of the [`CollatorFn`] invocation.
+pub struct CollationResult {
+  /// The collation that was built.
+  pub collation: Collation,
+  /// An optional result sender that should be informed about a successfully seconded collation.
+  ///
+  /// There is no guarantee that this sender is ever informed about any result; it is completely okay to just drop it.
+  /// However, if it is called, it should be called with the signed statement of a teyrchain validator seconding the
+  /// collation.
+  pub result_sender: Option<oneshot::Sender<CollationSecondedSignal>>,
+}
+
+/// Signal that is being returned when a collation was seconded by a validator.
+pub struct CollationSecondedSignal {
+  /// The hash of the relay chain block that was used as context to sign [`Self::statement`].
+  pub relay_parent: Hash,
+  /// The statement about seconding the collation.
+  ///
+  /// Anything other than `Statement::Seconded` is forbidden here.
+  pub statement: SignedFullStatement,
+}
+
+/// Collation function.
+///
+/// Will be called with the hash of the relay chain block the teyrchain block should be built on and the
+/// [`PersistedValidationData`] that provides information about the state of the teyrchain on the relay chain.
+///
+/// Returns an optional [`CollationResult`].
+pub type CollatorFn = Box<
+  dyn Fn(
+      Hash,
+      &PersistedValidationData,
+    ) -> Pin<Box<dyn Future<Output = Option<CollationResult>> + Send>>
+    + Send
+    + Sync,
+>;
+
+/// Configuration for the collation generator.
+pub struct CollationGenerationConfig {
+  /// Collator's authentication key, so it can sign things.
+  pub key: CollatorPair,
+  /// Collation function. See [`CollatorFn`] for more details.
+  pub collator: Option<CollatorFn>,
+  /// The teyrchain that this collator collates for
+  pub para_id: ParaId,
+}
+```
+
+The configuration should be optional, to allow for the case where the node is not run with the capability to collate.
+
+### Summary in plain English
+
+- **Collation (output of a collator)**
+
+  - Contains the PoV (proof to verify the state transition of the teyrchain) and other data.
+
+- **Collation result**
+
+  - Contains the collation, and an optional result sender for a collation-seconded signal.
+
+- **Collation seconded signal**
+
+  - The signal that is returned when a collation was seconded by a validator.
+
+- **Collation function**
+
+  - Called with the relay chain block the parablock will be built on top of.
+  - Called with the validation data.
+    - Provides information about the state of the teyrchain on the relay chain.
+
+- **Collation generation config**
+
+  - Contains collator's authentication key, optional collator function, and teyrchain ID.
+
+[CP]: collator-protocol.md
@@ -0,0 +1,196 @@
+# Collator Protocol
+
+> NOTE: This module has undergone changes for the elastic scaling implementation. As a result, parts of this document may
+be out of date and will be updated at a later time. Issue tracking the update:
+https://github.com/pezkuwichain/pezkuwi-sdk/issues/132
+
+The Collator Protocol implements the network protocol by which collators and validators communicate. It is used by
+collators to distribute collations to validators and used by validators to accept collations from collators.
+
+Collator-to-Validator networking is more difficult than Validator-to-Validator networking because the set of possible
+collators for any given para is unbounded, unlike the validator set. Validator-to-Validator networking protocols can
+easily be implemented as gossip because the data can be bounded, and validators can authenticate each other by their
+`PeerId`s for the purposes of instantiating and accepting connections.
+
+Since, at least at the level of the para abstraction, the collator-set for any given para is unbounded, validators need
+to make sure that they are receiving connections from capable and honest collators and that their bandwidth and time are
+not being wasted by attackers. Communicating across this trust-boundary is the most difficult part of this subsystem.
+
+Validation of candidates is a heavy task, and furthermore, the [`PoV`][PoV] itself is a large piece of data.
+Empirically, `PoV`s are on the order of 10MB.
+
+> TODO: note the incremental validation function Ximin proposes at https://github.com/paritytech/polkadot/issues/1348
+
+As this network protocol serves as a bridge between collators and validators, it communicates primarily with one
+subsystem on behalf of each. As a collator, this will receive messages from the [`CollationGeneration`][CG] subsystem.
+As a validator, this will communicate only with the [`CandidateBacking`][CB].
+
+## Protocol
+
+Input: [`CollatorProtocolMessage`][CPM]
+
+Output:
+
+* [`RuntimeApiMessage`][RAM]
+* [`NetworkBridgeMessage`][NBM]
+* [`CandidateBackingMessage`][CBM]
+
+## Functionality
+
+This network protocol uses the `Collation` peer-set of the [`NetworkBridge`][NB].
+
+It uses the [`CollatorProtocolV1Message`](../../types/network.md#collator-protocol) as its `WireMessage`.
+
+Since this protocol functions both for validators and collators, it is easiest to go through the protocol actions for
+each of them separately.
+
+Validators and collators.
+```dot process
+digraph {
+  c1 [shape=MSquare, label="Collator 1"];
+  c2 [shape=MSquare, label="Collator 2"];
+
+  v1 [shape=MSquare, label="Validator 1"];
+  v2 [shape=MSquare, label="Validator 2"];
+
+  c1 -> v1;
+  c1 -> v2;
+  c2 -> v2;
+}
+```
+
+### Collators
+
+It is assumed that collators are only collating on a single teyrchain. Collations are generated by the [Collation
+Generation][CG] subsystem. We will keep up to one local collation per relay-parent, based on `DistributeCollation`
+messages. If the para is not scheduled on any core, at the relay-parent, or the relay-parent isn't in the active-leaves
+set, we ignore the message as it must be invalid in that case - although this indicates a logic error elsewhere in the
+node.
+
+We keep track of the Para ID we are collating on as a collator. This starts as `None`, and is updated with each
+`CollateOn` message received. If the `ParaId` of a collation requested to be distributed does not match the one we
+expect, we ignore the message.
+
+As with most other subsystems, we track the active leaves set by following `ActiveLeavesUpdate` signals.
+
+For the purposes of actually distributing a collation, we need to be connected to the validators who are interested in
+collations on that `ParaId` at this point in time. We assume that there is a discovery API for connecting to a set of
+validators.
+
+As seen in the [Scheduler Module][SCH] of the runtime, validator groups are fixed for an entire session and their
+rotations across cores are predictable. Collators will want to do these things when attempting to distribute collations
+at a given relay-parent:
+  * Determine which core the para collated-on is assigned to.
+  * Determine the group on that core.
+  * Issue a discovery request for the validators of the current group
+    with [`NetworkBridgeMessage`][NBM]`::ConnectToValidators`.
+
+Once connected to the relevant peers for the current group assigned to the core (transitively, the para), advertise the
+collation to any of them which advertise the relay-parent in their view (as provided by the [Network Bridge][NB]). If
+any respond with a request for the full collation, provide it. However, we only send one collation at a time per relay
+parent; other requests need to wait. This reduces the bandwidth requirements of a collator and also increases the
+chance of fully sending the collation to at least one validator. From the point where one validator has received the
+collation and seconded it, it will also start to share this collation with other validators in its backing group. Upon
+receiving a view update from any of these peers which includes a relay-parent for which we have a collation that they
+will find relevant, advertise the collation to them if we haven't already.
+
+### Validators
+
+On the validator side of the protocol, validators need to accept incoming connections from collators. They should keep
+some peer slots open for accepting new speculative connections from collators and should disconnect from collators who
+are not relevant.
+
+```dot process
+digraph G {
+  label = "Declaring, advertising, and providing collations";
+  labelloc = "t";
+  rankdir = LR;
+
+  subgraph cluster_collator {
+      rank = min;
+      label = "Collator";
+      graph[style = border, rank = min];
+
+      c1, c2 [label = ""];
+  }
+
+  subgraph cluster_validator {
+      rank = same;
+      label = "Validator";
+      graph[style = border];
+
+      v1, v2 [label = ""];
+  }
+
+  c1 -> v1 [label = "Declare and advertise"];
+
+  v1 -> c2 [label = "Request"];
+
+  c2 -> v2 [label = "Provide"];
+
+  v2 -> v2 [label = "Note Good/Bad"];
+}
+```
+
+When peers connect to us, they can `Declare` that they represent a collator with given public key and intend to collate
+on a specific para ID. Once they've declared that, and we checked their signature, they can begin to send advertisements
+of collations. The peers should not send us any advertisements for collations that are on a relay-parent outside of our
+view or for a para outside of the one they've declared.
+
+The protocol tracks advertisements received and the source of the advertisement. The advertisement source is the
+`PeerId` of the peer who sent the message. We accept one advertisement per collator per source per relay-parent.
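+
+The acceptance rule above can be sketched as a set keyed by collator, source and relay-parent. This is a minimal
+illustration, not the subsystem's actual bookkeeping: `AdvertisementTracker` and the simplified `PeerId`, `CollatorId`
+and `RelayParent` aliases are hypothetical stand-ins for the real types.
+
```rust
use std::collections::HashSet;

// Hypothetical stand-ins for the real peer, collator key and relay-parent hash types.
type PeerId = u64;
type CollatorId = [u8; 32];
type RelayParent = [u8; 32];

/// Remembers which (collator, source, relay-parent) triples have already advertised.
#[derive(Default)]
pub struct AdvertisementTracker {
    seen: HashSet<(CollatorId, PeerId, RelayParent)>,
}

impl AdvertisementTracker {
    /// Returns `true` if the advertisement is new and was recorded, `false` if this
    /// collator already advertised from this source at this relay-parent.
    pub fn note_advertisement(
        &mut self,
        collator: CollatorId,
        source: PeerId,
        relay_parent: RelayParent,
    ) -> bool {
        self.seen.insert((collator, source, relay_parent))
    }

    /// Forgets all advertisements for a relay-parent, e.g. once it left our view.
    pub fn prune_relay_parent(&mut self, relay_parent: &RelayParent) {
        self.seen.retain(|(_, _, rp)| rp != relay_parent);
    }
}
```
+
+A second advertisement for the same triple is rejected, while the same collator may still advertise from another
+source or at another relay-parent.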
+
+As a validator, we will handle requests from other subsystems to fetch a collation on a specific `ParaId` and
+relay-parent. These requests are made via the request-response protocol using a `CollationFetchingRequest`. To do so,
+we need to first check if we have already gathered a collation on that `ParaId` and relay-parent. If not, we need to
+select one of the advertisements and issue a request for it. If we've already issued a request, we shouldn't issue
+another one until the first has returned.
+
+When acting on an advertisement, we issue a `Requests::CollationFetchingV1`. However, we only request one collation at a
+time per relay parent. This reduces the bandwidth requirements and as we can second only one candidate per relay parent,
+the others are probably not required anyway. If the request times out, we need to note the collator as being unreliable
+and reduce its priority relative to other collators.
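+
+A rough sketch of this request discipline follows. It assumes a simple timeout counter as the priority measure, which
+the text does not prescribe; `FetchState` and its methods are hypothetical names, and the numeric `PeerId` and
+`RelayParent` aliases are placeholders.
+
```rust
use std::collections::{HashMap, HashSet};

// Hypothetical simplified identifiers.
type PeerId = u64;
type RelayParent = u32;

#[derive(Default)]
pub struct FetchState {
    /// Relay parents that currently have a request in flight.
    in_flight: HashSet<RelayParent>,
    /// Timeouts observed per collator peer; more timeouts mean lower priority (assumed metric).
    timeouts: HashMap<PeerId, u32>,
}

impl FetchState {
    /// Picks the advertiser with the fewest recorded timeouts, unless a request for this
    /// relay parent is already in flight or nobody advertised.
    pub fn try_start(&mut self, relay_parent: RelayParent, advertisers: &[PeerId]) -> Option<PeerId> {
        if self.in_flight.contains(&relay_parent) {
            return None;
        }
        let peer = advertisers
            .iter()
            .copied()
            .min_by_key(|p| self.timeouts.get(p).copied().unwrap_or(0))?;
        self.in_flight.insert(relay_parent);
        Some(peer)
    }

    /// Marks the request as finished; a timeout lowers the collator's priority.
    pub fn on_finished(&mut self, relay_parent: RelayParent, peer: PeerId, timed_out: bool) {
        self.in_flight.remove(&relay_parent);
        if timed_out {
            *self.timeouts.entry(peer).or_insert(0) += 1;
        }
    }
}
```
+
+After a timeout, the next request for the same relay parent goes to a collator with a better record, if one exists.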
+
+### Interaction with [Candidate Backing][CB]
+
+As collators advertise the availability, a validator will simply second the first valid parablock candidate per relay
+head by sending a [`CandidateBackingMessage`][CBM]`::Second`. Note that this message contains the relay parent of the
+advertised collation, the candidate receipt and the [PoV][PoV].
+
+Subsequently, once a valid parablock candidate has been seconded, the [`CandidateBacking`][CB] subsystem will send a
+[`CollatorProtocolMessage`][CPM]`::Seconded`, which will trigger this subsystem to notify the collator at the `PeerId`
+that first advertised the parablock on the seconded relay head of their successful seconding.
+
+
+## Future Work
+
+Several approaches have been discussed, but all have some issues:
+
+* The current approach is very straightforward. However, that protocol is vulnerable to a single collator which, as an
+  attack or simply through chance, gets its block candidate to the node more often than its fair share of the time.
+* If collators produce blocks via Aura, BABE or in future Sassafras, it may be possible to choose an "Official" collator
+  for the round, but it may be tricky to ensure that the PVF logic is enforced at collator leader election.
+* We could use relay-chain BABE randomness to generate some delay `D` on the order of 1 second, +- 1 second. The
+  collator would then second the first valid parablock which arrives after `D`, or in case none has arrived by `2*D`,
+  the last valid parablock which has arrived. This makes it very hard for a collator to game the system to always get
+  its block nominated, but it reduces the maximum throughput of the system by introducing delay into an already tight
+  schedule.
+* A variation of that scheme would be to have a fixed acceptance window `D` for parablock candidates and keep track of
+  count `C`: the number of parablock candidates received. At the end of the period `D`, we choose a random number I in
+  the range `[0, C)` and second the block at Index I. Its drawback is the same: it must wait the full `D` period before
+  seconding any of its received candidates, reducing throughput.
+* In order to protect against DoS attacks, it may be prudent to throw out collations from collators that have
+  behaved poorly (whether recently or historically) and subsequently only verify the PoV for the most suitable of
+  collations.
+
+[CB]: ../backing/candidate-backing.md
+[CBM]: ../../types/overseer-protocol.md#candidate-backing-mesage
+[CG]: collation-generation.md
+[CPM]: ../../types/overseer-protocol.md#collator-protocol-message
+[CS]: ../backing/candidate-selection.md
+[CSM]: ../../types/overseer-protocol.md#candidate-selection-message
+[NB]: ../utility/network-bridge.md
+[NBM]: ../../types/overseer-protocol.md#network-bridge-message
+[PoV]: ../../types/availability.md#proofofvalidity
+[RAM]: ../../types/overseer-protocol.md#runtime-api-message
+[SCH]: ../../runtime/scheduler.md
@@ -0,0 +1,18 @@
+# Disputes Subsystems
+
+If approval voting finds an invalid candidate, a dispute is raised. The disputes
+subsystems are concerned with the following:
+
+1. Disputes can be raised
+1. Disputes (votes) get propagated to all other validators
+1. Votes get recorded as necessary
+1. Nodes will participate in disputes in a sensible fashion
+1. Finality is stopped while a candidate is being disputed on chain
+1. Chains can be reverted in case a dispute concludes invalid
+1. Votes are provided to the provisioner for importing on chain, in order for
+   slashing to work.
+
+The dispute-coordinator subsystem interfaces with the provisioner and chain
+selection to make the bulk of this possible. `dispute-distribution` is concerned
+with getting votes out to other validators and receiving them in a spam
+resilient way.
@@ -0,0 +1,659 @@
+# Dispute Coordinator
+
+The coordinator is the central subsystem of the node-side components which participate in disputes. It wraps a database,
+which is used to track statements observed by _all_ validators over some window of sessions. Votes older than this
+session window are pruned.
+
+In particular the dispute-coordinator is responsible for:
+
+- Ensuring that the node is able to raise a dispute in case an invalid candidate is found during approval checking.
+- Ensuring that backing and approval votes will be recorded on chain. With these votes on chain we can be certain that
+  appropriate targets for slashing will be available for concluded disputes. Also, scraping these votes during a dispute
+  is necessary for critical spam prevention measures.
+- Ensuring backing votes will never get overridden by explicit votes.
+- Coordinating actual participation in a dispute, ensuring that the node participates in any justified dispute in a way
+  that ensures resolution of disputes on the network even in the case of many disputes raised (flood/DoS scenario).
+- Ensuring disabled validators are not able to spam disputes.
+- Ensuring disputes resolve, even for candidates on abandoned forks as much as reasonably possible, to rule out "free
+  tries" and thus guarantee our gambler's ruin property.
+- Providing an API for chain selection, so we can prevent finalization of any chain which has included candidates for
+  which a dispute is either ongoing or concluded invalid and avoid building on chains with an included invalid
+  candidate.
+- Providing an API for retrieving (resolved) disputes, including all votes, both implicit (approval, backing) and
+  explicit dispute votes. So validators can get rewarded/slashed accordingly.
+
+## Ensuring That Disputes Can Be Raised
+
+If a candidate turns out invalid in approval checking, the `approval-voting` subsystem will try to issue a dispute. For
+this, it will send a message `DisputeCoordinatorMessage::IssueLocalStatement` to the dispute coordinator, indicating to
+cast an explicit invalid vote. It is the responsibility of the dispute coordinator on reception of such a message to
+create and sign that explicit invalid vote and trigger a dispute if none for that candidate is already ongoing.
+
+In order to raise a dispute, a node has to be able to provide two opposing votes. Given that the purpose of the backing
+phase is to have validators with skin in the game, the opposing valid vote will very likely be a backing vote. It could
+also be some already cast approval vote, but the significant point here is: As long as we have backing votes available,
+any node will be able to raise a dispute.
+
+Therefore a vital responsibility of the dispute coordinator is to make sure backing votes are available for all
+candidates that might still get disputed. To accomplish this task in an efficient way the dispute-coordinator relies on
+chain scraping. Whenever a candidate gets backed on chain, we record in chain storage the backing votes imported in that
+block. This way, given the chain state for a given relay chain block, we can retrieve via a provided runtime API the
+backing votes imported by that block. The dispute coordinator makes sure to query those votes for any non finalized
+blocks: In case of missed blocks, it will do chain traversal as necessary.
+
+Relying on chain scraping is very efficient for two reasons:
+
+1. Votes are already batched. We import all available backing votes for a candidate all at once. If instead we imported
+   votes from candidate-backing as they came along, we would import each vote individually which is inefficient in the
+   current dispute coordinator implementation (quadratic complexity).
+2. We also import fewer votes in total, as we avoid importing statements for candidates that never got successfully
+   backed on any chain.
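+
+To make the first point concrete, here is a toy cost model. It assumes, purely for illustration, that each individual
+import does work proportional to the number of votes recorded so far, while a batch is handled in a single pass. The
+real cost profile of the dispute coordinator may differ; both function names are hypothetical.
+
```rust
/// Work units for importing `k` votes one at a time, under the assumed model that every
/// import touches all votes recorded so far, including the new one: 1 + 2 + ... + k.
pub fn one_by_one_cost(k: u64) -> u64 {
    (1..=k).sum()
}

/// Work units for importing the same `k` votes as a single batch: one pass over the votes.
pub fn batched_cost(k: u64) -> u64 {
    k
}
```
+
+Under this model, 35 votes cost 630 units when imported individually, compared to 35 units as one batch.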
+
+It is also secure, because disputes are only ever raised in the approval voting phase. A node only starts the approval
+process after it has seen a candidate included on some chain; for that to happen, it must have been backed previously.
+Therefore backing votes are available at that point in time. Signals are processed first, so even if a block is skipped
+and we only start importing backing votes on the including block, we will have seen the backing votes by the time we
+process messages from approval voting.
+
+In summary, for making it possible for a dispute to be raised, recording of backing votes from chain is sufficient and
+efficient. In particular there is no need to preemptively import approval votes, which has been shown to be a very
+inefficient process. (Quadratic complexity adds up, with 35 votes in total per candidate.)
+
+Approval votes are very relevant nonetheless as we are going to see in the next section.
+
+## Ensuring approval votes will be recorded
+
+### Ensuring Recording
+
+Only votes recorded by the dispute coordinator will be considered for slashing.
+
+While there is no need to record approval votes in the dispute coordinator preemptively, we make some effort to have
+any approval votes received by approval-voting recorded when a dispute actually happens.
+
+This is not required for concluding the dispute, as nodes send their own vote anyway (either explicit valid or their
+existing approval-vote). What nodes can do though, is participating in approval-voting, casting a vote, but later when a
+dispute is raised reconsider their vote and send an explicit invalid vote. If they managed to only have that one
+recorded, then they could avoid a slash.
+
+This is not a problem for our basic security assumptions: The backers are the ones supposed to have skin in the game,
+so we are not too worried about colluding approval voters getting away slash free, as the gambler's ruin property is
+maintained anyway. There is however a problem separate from colluding approval voters: "lazy" approval voters. If it
+were easy and reliable for approval voters to reconsider their vote in case of an actual dispute, then they would have
+no direct incentive (apart from playing a part in securing the network) to properly run the validation function at
+all - they could just always vote "valid" totally risk free. (While they would always risk a slash by voting invalid.)
+
+
+So we do want to fetch approval votes from approval-voting. Importing votes is most efficient when batched. At the same
+time approval voting and disputes are running concurrently so approval votes are expected to trickle in still, when a
+dispute is already ongoing.
+
+Hence, we have the following requirements for importing approval votes:
+
+1. Only import them when there is a dispute, because otherwise we are wasting lots of resources _always_ for the
+   exceptional case of a dispute.
+2. Import votes batched when possible, to avoid quadratic import complexity.
+3. Take into account that approval voting is still ongoing, while a dispute is already running.
+
+With a design where approval voting sends votes to the dispute-coordinator by itself, we would need to make approval
+voting aware of ongoing disputes and once it is aware it could start sending all already existing votes batched and
+trickling in votes as they come. The problem with this is, that it adds some unnecessary complexity to approval-voting
+and also we might still import most of the votes unbatched one-by-one, depending on what point in time the dispute was
+raised.
+
+Instead of the dispute coordinator informing approval-voting of an ongoing dispute for it to begin forwarding votes to
+the dispute coordinator, it makes more sense for the dispute-coordinator to just ask approval-voting for votes of
+candidates in dispute. This way, the dispute coordinator can also pick the best time for maximizing the number of votes
+in the batch.
+
+Now the question remains, when should the dispute coordinator ask approval-voting for votes?
+
+In fact for slashing it is only relevant to have them once the dispute concluded, so we can query approval voting the
+moment the dispute concludes! Two concerns that come to mind, are easily addressed:
+
+1. Timing: We would like to rely as little as possible on implementation details of approval voting. In particular, if
+   the dispute is ongoing for a long time, do we have any guarantees that approval votes are kept around long enough by
+   approval voting? Will approval votes still be present by the time the dispute concludes in all cases? The answer is
+   nuanced, but in general we cannot rely on it. The first problem is that finalization and approval-voting are
+   off-chain processes, so there is no global consensus: As soon as at least f+1 honest (f=n/3, where n is the number of
+   validators/nodes) nodes have seen the dispute conclude, finalization will take place and approval votes will be
+   cleared. This would still be fine, if we had some guarantees that those honest nodes will be able to include those
+   votes in a block. Unfortunately this guarantee does not exist; we will discuss the problem and solutions in more
+   detail [below](#ensuring-chain-import).
+
+   The second problem is that approval-voting will abandon votes as soon as a chain can no longer be finalized (some
+   other/better fork already has been). This second problem can be mitigated somewhat, though not fully resolved, by
+   also importing votes as soon as a dispute is detected. It is still inherently racy. The good thing is, this should be
+   good enough: We are worried about lazy approval checkers; the system does not need to be perfect. It should be
+   enough if there is some risk of getting caught.
+2. We are not worried about the dispute not concluding, as nodes will always send their own vote, regardless of it being
+   an explicit or an already existing approval-vote.
+
+Conclusion: As long as we make sure that, if our own approval vote gets imported (which would prevent dispute
+participation), we also distribute it via dispute-distribution, disputes can conclude. To mitigate raciness with
+approval-voting deleting votes, we will import approval votes twice during a dispute: once when it is raised, to make as
+sure as possible to see approval votes also for abandoned forks, and a second time when the dispute concludes, to
+maximize the number of potentially malicious approval votes recorded. The raciness obviously is not fully resolved by
+this, but this is fine as argued above.
+
+Ensuring vote import on chain is covered in the next section.
+
+What we don't care about is that honest approval-voters will likely validate twice, once in approval voting and once via
+dispute-participation. Avoiding that does not really seem worthwhile though: First, disputes are exceptional, so a
+little wasted effort won't affect everyday performance. Second, even with eager importing of approval votes, that
+doubled work is still present, as disputes and approvals are racing. Every time participation is faster than approval, a
+node would do double work.
+
+### Ensuring Chain Import
+
+While in the previous section we discussed means for nodes to ensure relevant votes are recorded so lazy approval
+checkers get slashed properly, it is crucial to also discuss the actual chain import. Only if we guarantee that recorded
+votes will get imported on chain (on all potential chains really) we will succeed in executing slashes. Particularly we
+need to make sure backing votes end up on chain consistently.
+
+Dispute distribution will make sure all explicit dispute votes get distributed among nodes which includes current block
+producers (current authority set) which is an important property: If the dispute carries on across an era change, we
+need to ensure that the new validator set will learn about any disputes and their votes, so they can put that
+information on chain. Dispute-distribution luckily has this property and always sends votes to the current authority
+set. The issue is, for dispute-distribution, nodes send only their own explicit vote (or in some cases their approval
+vote) in addition to some opposing vote. This guarantees that at least some backing or approval vote will be present at
+the block producer, but we don't have a 100% guarantee to have votes for all backers, even less for approval checkers.
+
+Reason for backing votes: While backing votes will be present on at least some chain, that does not mean that any such
+chain is still considered for block production in the current set - they might only exist on an already abandoned fork.
+This means a block producer that just joined the set, might not have seen any of them.
+
+For approvals it is even more tricky and less necessary: Approval voting together with finalization is a completely
+off-chain process therefore those protocols don't care about block production at all. Approval votes only have a
+guarantee of being propagated between the nodes that are responsible for finalizing the concerned blocks. This implies
+that on an era change the current authority set will not necessarily get informed about any approval votes for the
+previous era. Hence even if all validators of the previous era successfully recorded all approval votes in the dispute
+coordinator, they won't get a chance to put them on chain, hence they won't be considered for slashing.
+
+It is important to note, that the essential properties of the system still hold: Dispute-distribution will distribute at
+_least one_ "valid" vote to the current authority set, hence at least one node will get slashed in case of outcome
+"invalid". Also in reality the validator set is rarely exchanged 100%, therefore in practice some validators in the
+current authority set will overlap with the ones in the previous set and will be able to record votes on chain.
+
+Still, for maximum accountability we need to make sure a previous authority set can communicate votes to the next one,
+regardless of any chain: This is yet to be implemented, see section "Resiliency" in dispute-distribution and
+[this](https://github.com/paritytech/polkadot/issues/3398) ticket.
+
+## Coordinating Actual Dispute Participation
+
+Once the dispute coordinator learns about a dispute, it is its responsibility to make sure the local node participates
+in that dispute.
+
+The dispute coordinator learns about a dispute by importing votes from either chain scraping or from
+dispute-distribution. If it finds opposing votes (always the case when coming from dispute-distribution), it records the
+presence of a dispute. Then, in case it does not find any local vote for that dispute already, it needs to trigger
+participation in the dispute (see previous section for considerations when the found local vote is an approval vote).
+
+Participation means recovering availability and re-evaluating the PoV. The result of that validation (either valid or
+invalid) will be the node's vote on that dispute: Either explicit "invalid" or "valid". The dispute coordinator will
+inform `dispute-distribution` about our vote and `dispute-distribution` will make sure that our vote gets distributed to
+all other validators.
+
+Nothing ever is that easy though. We can not blindly import anything that comes along and trigger participation no
+matter what.
+
+### Spam Considerations
+
+In Pezkuwi's security model, it is important that attempts to attack the system result in a slash of the offenders.
+Therefore we need to make sure that this slash is actually happening. Attackers could try to prevent the slashing from
+taking place, by overwhelming validators with disputes in such a way that no single dispute ever concludes, because
+nodes are busy processing newly incoming ones. Other attacks are imaginable as well, like raising disputes for
+candidates that don't exist, just filling up everyone's disk slowly or worse making nodes try to participate, which will
+result in lots of network requests for recovering availability.
+
+The last point brings up a significant consideration in general: Disputes are about escalation: Every node will suddenly
+want to check, instead of only a few. A single message will trigger the whole network to start a significant amount of
+work and will cause lots of network traffic and messages. Hence the dispute system is very susceptible to being a brutal
+amplifier for DoS attacks, making DoS attacks very easy and cheap, if we are not careful.
+
+One counter measure we are taking is making raising of disputes a costly thing: If you raise a dispute, because you
+claim a candidate is invalid, although it is in fact valid - you will get slashed, hence you pay for consuming those
+resources. The issue is: This only works if the dispute concerns a candidate that actually exists!
+
+If a node raises a dispute for a candidate that never got included (became available) on any chain, then the dispute can
+never conclude, hence nobody gets slashed. It makes sense to point out that this is less bad than it might sound at
+first, as trying to participate in a dispute for a non existing candidate is "relatively" cheap. Each node will send out
+a few hundred tiny request messages for availability chunks, which all will end up in a tiny response "NoSuchChunk" and
+then no participation will actually happen as there is nothing to participate in. Malicious nodes could provide chunks,
+which would make things more costly, but at the full expense of the attacker's bandwidth - no amplification here. I am
+bringing that up for completeness only: Triggering a thousand nodes to send out a thousand tiny network messages by just
+sending out a single garbage message, is still a significant amplification and is nothing to ignore - this could
+absolutely be used to cause harm!
+
+### Participation
+
+As explained, just blindly participating in any "dispute" that comes along is not a good idea. First we would like to
+make sure the dispute is actually genuine, to prevent cheap DoS attacks. Secondly, in case of genuine disputes, we would
+like to conclude one after the other, in contrast to processing all at the same time, slowing down progress on all of
+them, bringing individual processing to a complete halt in the worst case (nodes get overwhelmed at some stage in the
+pipeline).
+
+To ensure we only spend significant work on genuine disputes, we only trigger participation at all on any _vote import_
+if any of the following holds true:
+
+- We saw the disputed candidate included in some not yet finalized block on at least one fork of the chain.
+- We have seen the disputed candidate backed in some not yet finalized block on at least one fork of the chain. This
+  ensures the candidate is at least not completely made up and some effort has already gone into that
+  candidate. Generally speaking a dispute shouldn't be raised for a candidate which is backed but is not yet included.
+  Disputes are raised during approval checking. We participate in such disputes as a precaution - maybe we haven't seen
+  the `CandidateIncluded` event yet?
+- The dispute is already confirmed: Meaning that 1/3+1 nodes already participated, as this suggests in our threat model
+  that there was at least one honest node that already voted, so the dispute must be genuine.
+
+In addition to that, we only participate in a non-confirmed dispute if at least one vote against the candidate is from
+a non-disabled validator.
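+
+The participation rules above can be condensed into a small predicate. The sketch below is illustrative only:
+`DisputeFacts` and its fields are hypothetical, and "1/3+1 nodes" is interpreted as strictly more than
+`(n - 1) / 3` participants, which is an assumption about the exact threshold.
+
```rust
/// Locally known facts about a disputed candidate (hypothetical struct).
pub struct DisputeFacts {
    /// Candidate was seen included in a not yet finalized block on some fork.
    pub seen_included: bool,
    /// Candidate was seen backed in a not yet finalized block on some fork.
    pub seen_backed: bool,
    /// Number of distinct validators that already voted in the dispute.
    pub participants: usize,
    /// Number of validators in the session.
    pub n_validators: usize,
    /// At least one `invalid` vote comes from a non-disabled validator.
    pub invalid_vote_from_enabled: bool,
}

/// Assumed byzantine threshold: f = (n - 1) / 3.
pub fn byzantine_threshold(n: usize) -> usize {
    n.saturating_sub(1) / 3
}

/// A dispute is confirmed once more than `f` validators have participated.
pub fn is_confirmed(facts: &DisputeFacts) -> bool {
    facts.participants > byzantine_threshold(facts.n_validators)
}

/// Decides whether a vote import should trigger participation.
pub fn should_participate(facts: &DisputeFacts) -> bool {
    if is_confirmed(facts) {
        return true;
    }
    // Unconfirmed: the candidate must be known from chain, and at least one vote
    // against it must come from a validator that is not disabled.
    (facts.seen_included || facts.seen_backed) && facts.invalid_vote_from_enabled
}
```
+
+Because the inputs change as blocks arrive, such a predicate would have to be re-evaluated on block import as well as
+on vote import.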
+
+Note: A node might be out of sync with the chain and we might only learn about a block, including a candidate, after we
+learned about the dispute. This means, we have to re-evaluate participation decisions on block import!
+
+With this, nodes won't waste significant resources on completely made up candidates. The next step is to process dispute
+participation in a (globally) ordered fashion. Meaning a majority of validators should arrive at roughly the same
+ordering of participation, for disputes to get resolved one after another. This order is only relevant if there are
+lots of disputes, so we obviously only need to worry about order if participations start queuing up.
+
+We treat participation for candidates that we have seen included with priority and put them on a priority queue which
+sorts participation based on the block number of the relay parent of the candidate and for candidates with the same
+relay parent height further by the `CandidateHash`. This ordering is globally unique and also prioritizes older
+candidates.
+
+The latter property makes sense, because if an older candidate turns out invalid, we can roll back the full chain at
+once. If we resolved disputes for newer candidates first and they turned out invalid as well, we might need to roll back
+a couple of times instead of just once to the oldest offender. Prioritizing older candidates also makes it impossible
+for an attacker to prevent rolling back a very old candidate by continually raising disputes for newer candidates.
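+
+The ordering of the priority queue can be sketched with an ordered set whose key compares the relay-parent block
+number first and the candidate hash second. `ParticipationKey` and `PriorityQueue` are hypothetical names, and the
+byte-array `CandidateHash` alias is a simplification.
+
```rust
use std::collections::BTreeSet;

// Hypothetical simplified types.
type BlockNumber = u32;
type CandidateHash = [u8; 32];

/// Ordering key. The derived `Ord` compares fields in declaration order, so the
/// relay-parent block number dominates and the candidate hash breaks ties.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ParticipationKey {
    pub relay_parent_number: BlockNumber,
    pub candidate_hash: CandidateHash,
}

#[derive(Default)]
pub struct PriorityQueue {
    queue: BTreeSet<ParticipationKey>,
}

impl PriorityQueue {
    /// Queues a participation request; duplicates are ignored.
    pub fn push(&mut self, key: ParticipationKey) {
        self.queue.insert(key);
    }

    /// Removes and returns the request for the oldest candidate.
    pub fn pop(&mut self) -> Option<ParticipationKey> {
        self.queue.pop_first()
    }
}
```
+
+Every node that holds the same set of requests pops them in the same order, which is the property the text relies on.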
+
+For candidates we have not seen included, but we know are backed (thanks to chain scraping) or we have seen a dispute
+with 1/3+1 participation (confirmed dispute) on them - we put participation on a best-effort queue. It has the same
+ordering as the priority one - by block heights of the relay parent, older blocks take priority. There is a
+possibility not to be able to obtain the block number of the parent when we are inserting the dispute in the queue. To
+account for races, we will promote any existing participation request to the priority queue once we learn about an
+including block. NOTE: this is still work in progress and is tracked by [this
+issue](https://github.com/paritytech/polkadot/issues/5875).
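+
+One possible shape for the two queues and the promotion step is sketched below. Since the text marks promotion as
+work in progress, this is only an assumption about how it could look; `Queues`, `note_included` and `pop_next` are
+hypothetical names.
+
```rust
use std::collections::BTreeSet;

/// (relay-parent block number, candidate hash), ordered oldest first.
type Key = (u32, [u8; 32]);

#[derive(Default)]
pub struct Queues {
    priority: BTreeSet<Key>,
    best_effort: BTreeSet<Key>,
}

impl Queues {
    /// Queues a participation request on the queue matching our knowledge of inclusion.
    pub fn queue(&mut self, key: Key, seen_included: bool) {
        if seen_included {
            self.best_effort.remove(&key);
            self.priority.insert(key);
        } else if !self.priority.contains(&key) {
            self.best_effort.insert(key);
        }
    }

    /// Promotes a best-effort request once an including block is seen.
    /// Returns `true` if a request was promoted.
    pub fn note_included(&mut self, key: &Key) -> bool {
        if self.best_effort.remove(key) {
            self.priority.insert(*key);
            true
        } else {
            false
        }
    }

    /// Next request to process: the priority queue is drained before best-effort.
    pub fn pop_next(&mut self) -> Option<Key> {
        self.priority.pop_first().or_else(|| self.best_effort.pop_first())
    }
}
```
+
+With this layout, a request that started out as best-effort is served ahead of other best-effort requests as soon as
+its including block becomes known.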
+
+### Abandoned Forks
+
+Finalization: As mentioned we care about included and backed candidates on any non-finalized chain, given that any
+disputed chain will not get finalized, we don't need to care about finalized blocks, but what about forks that fall
+behind the finalized chain in terms of block number? For those we would still like to be able to participate in any
+raised disputes, otherwise attackers might be able to avoid a slash if they manage to create a better fork after they
+learned about the approval checkers. Therefore we do care about those forks even after they have fallen behind the
+finalized chain.
+
+For simplicity we also care about the actual finalized chain (not just forks) up to a certain depth. We do have to limit
+the depth, because otherwise we open a DoS vector again. The depth (into the finalized chain) should be oriented on the
+approval-voting execution timeout, in particular it should be significantly larger. Otherwise by the time the execution
+is allowed to finish, we already dropped information about those candidates and the dispute could not conclude.
+
+## Import
+
+### Spam Considerations
+
+In the last section we looked at how to treat queuing participations to handle heavy dispute load well. This already
+ensures that honest nodes won't amplify cheap DoS attacks. There is one minor issue remaining: Even if we delay
+participation until we have some confirmation of the authenticity of the dispute, we should also not blindly import all
+votes arriving into the database as this might be used to just slowly fill up disk space, until the node is no longer
+functional. This leads to our last protection mechanism at the dispute coordinator level (dispute-distribution also has
+its own), which is spam slots. For each import containing an invalid vote, where we don't know whether it might be spam
+or not, we increment a counter for each signing participant of explicit `invalid` votes.
+
+What votes do we treat as potential spam? A vote will increase a spam slot if and only if all of the following
+conditions are satisfied:
+
+- the candidate under dispute was neither seen included nor backed on any chain
+- the dispute is not confirmed
+- we haven't cast a vote for the dispute
+- at least one vote against the candidate is from a non-disabled validator
+
+Whenever any vote on a dispute is imported these conditions are checked. If the dispute is found not to be potential
+spam, then spam slots for the disputed candidate hash are cleared. This decrements the spam count for every validator
+which had voted invalid.
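+
+The spam-slot bookkeeping described above can be sketched as follows. This is a simplified illustration under the
+stated conditions, not the coordinator's actual data structure: `ImportContext`, `SpamSlots` and the type aliases
+are hypothetical, and enforcement of a per-validator maximum is left out.
+
```rust
use std::collections::{HashMap, HashSet};

// Hypothetical simplified types.
type ValidatorIndex = u32;
type CandidateHash = [u8; 32];

/// What we know at vote import time (hypothetical struct).
#[derive(Clone, Copy)]
pub struct ImportContext {
    pub seen_included_or_backed: bool,
    pub confirmed: bool,
    pub own_vote_cast: bool,
    pub invalid_vote_from_enabled: bool,
}

/// A dispute is potential spam only if all four conditions from the list hold.
pub fn is_potential_spam(ctx: &ImportContext) -> bool {
    !ctx.seen_included_or_backed
        && !ctx.confirmed
        && !ctx.own_vote_cast
        && ctx.invalid_vote_from_enabled
}

#[derive(Default)]
pub struct SpamSlots {
    /// Current spam count per validator.
    counts: HashMap<ValidatorIndex, u32>,
    /// Validators that were charged a slot, per disputed candidate.
    charged: HashMap<CandidateHash, HashSet<ValidatorIndex>>,
}

impl SpamSlots {
    /// Charges `validator` one slot for an `invalid` vote on `candidate`, at most once.
    pub fn note_invalid_vote(&mut self, candidate: CandidateHash, validator: ValidatorIndex) {
        if self.charged.entry(candidate).or_default().insert(validator) {
            *self.counts.entry(validator).or_insert(0) += 1;
        }
    }

    /// Clears the slots of a candidate found not to be spam, decrementing every
    /// validator that was charged for it.
    pub fn clear(&mut self, candidate: &CandidateHash) {
        if let Some(validators) = self.charged.remove(candidate) {
            for v in validators {
                if let Some(count) = self.counts.get_mut(&v) {
                    *count = count.saturating_sub(1);
                }
            }
        }
    }

    /// Current spam count of `validator`.
    pub fn count(&self, validator: ValidatorIndex) -> u32 {
        self.counts.get(&validator).copied().unwrap_or(0)
    }
}
```
+
+Calling `clear` whenever `is_potential_spam` turns false on a vote import reproduces the decrement described above.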
+
+To keep spam slots from filling up unnecessarily we want to clear spam slots whenever a candidate is seen to be backed
+or included. Fortunately this behavior is achieved by clearing slots on vote import as described above. Because on chain
+backing votes are processed when a block backing the disputed candidate is discovered, spam slots are cleared for every
+backed candidate. Included candidates have also been seen as backed on the same fork, so decrementing spam slots is
+handled in that case as well.
+
+The reason this works is because we only need to worry about actual dispute votes. Import of backing votes is already
+rate limited and concern only real candidates. For approval votes a similar argument holds (if they come from
+approval-voting), but we also don't import them until a dispute already concluded. For actual dispute votes we need two
+opposing votes, so there must be an explicit `invalid` vote in the import. Only a third of the validators can be
+malicious, so spam disk usage is limited to `2*vote_size*n/3*NUM_SPAM_SLOTS`, with `n` being the number of validators.
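+
+As a worked example of this bound, the helper below evaluates the formula directly. The function name and all input
+values used here are hypothetical and chosen only for easy arithmetic; they are not the real vote size, validator
+count or `NUM_SPAM_SLOTS` setting.
+
```rust
/// Upper bound on spam disk usage in bytes: `2 * vote_size * n / 3 * NUM_SPAM_SLOTS`,
/// evaluated left to right with integer arithmetic.
pub const fn spam_disk_bound(vote_size: u64, n_validators: u64, num_spam_slots: u64) -> u64 {
    2 * vote_size * n_validators / 3 * num_spam_slots
}
```
+
+With an assumed vote size of 100 bytes, 300 validators and 50 spam slots, the bound is
+`2 * 100 * 300 / 3 * 50 = 1_000_000` bytes.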
+
+### Disabling
+
+Once a validator has committed an offence (e.g. losing a dispute), it is considered disabled for the rest of the era.
+In addition to using the on-chain state of disabled validators, we also keep track of validators who lost a dispute
+off-chain. The reason for this is a dispute can be raised for a candidate in a previous era, which means that a
+validator that is going to be slashed for it might not even be in the current active set. That means it can't be
+disabled on-chain. We need a way to prevent someone from disputing all valid candidates in the previous era. We do this
+by keeping track of the validators who lost a dispute in the past few sessions and use that list in addition to the
+on-chain disabled validators state. In addition to past session misbehavior, this also helps in case a slash is delayed.
+
+When we receive a dispute statements set, we do the following:
+1. Take the on-chain state of disabled validators at the relay parent block.
+1. Take a list of those who lost a dispute in that session, ordered so that the biggest and newest offences come first.
+1. Combine the two lists and take the first `byzantine_threshold` validators from it.
+1. If the dispute is unconfirmed, check if all votes against the candidate are from disabled validators.
+   If so, we don't participate in the dispute, but record the votes.
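+
+The steps above could be sketched as follows. This is a simplified illustration, not the subsystem's code: the function
+names are hypothetical, validators are reduced to plain indices, and placing the on-chain list before the off-chain
+list when combining is an assumption.
+
+```rust
+use std::collections::HashSet;
+
+type ValidatorIndex = u32;
+
+/// Steps 1-3: combine on-chain disabled validators with off-chain dispute
+/// losers (assumed to be pre-sorted by offence priority), drop duplicates
+/// and cap the result at the byzantine threshold.
+fn effective_disabled(
+  on_chain: &[ValidatorIndex],
+  offchain_losers: &[ValidatorIndex],
+  byzantine_threshold: usize,
+) -> Vec<ValidatorIndex> {
+  let mut seen = HashSet::new();
+  on_chain
+    .iter()
+    .chain(offchain_losers.iter())
+    .copied()
+    .filter(|v| seen.insert(*v))
+    .take(byzantine_threshold)
+    .collect()
+}
+
+/// Step 4: for an unconfirmed dispute, skip participation if every vote
+/// against the candidate comes from a disabled validator.
+fn should_skip_participation(
+  is_confirmed: bool,
+  invalid_voters: &[ValidatorIndex],
+  disabled: &[ValidatorIndex],
+) -> bool {
+  !is_confirmed
+    && !invalid_voters.is_empty()
+    && invalid_voters.iter().all(|v| disabled.contains(v))
+}
+```
+
+Capping the combined list keeps an attacker from getting more than the byzantine threshold of validators ignored.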
+
+### Backing Votes
+
+Backing votes are special in some ways. For starters, they are the only valid votes that are guaranteed to exist for any
+valid dispute to be raised. Second, they are the only votes that commit to a shorter execution timeout
+`BACKING_EXECUTION_TIMEOUT`, compared to a more lenient timeout used in approval voting. To account properly for
+execution time variance across machines, slashing might treat backing votes differently (more aggressively) than other
+`valid` votes. Hence, on import we shall never override a backing vote with another valid vote. They cannot be
+assumed to be interchangeable.
+
+## Attacks & Considerations
+
+The following attacks on the priority queue and best-effort queues are considered in the above design.
+
+### Priority Queue
+
+On the priority queue, we will only queue participations for candidates we have seen included on any chain. Any attack
+attempt would start with a candidate included on some chain, but an attacker could try to only reveal the including
+relay chain blocks to just some honest validators and stop as soon as it learns that some honest validator would have a
+relevant approval assignment.
+
+Without revealing the including block to any honest validator, we don't really have an attack yet. Once the block is
+revealed though, the above is actually very hard. Each honest validator will re-distribute the block it just learned
+about. This means an attacker would need to pull off a targeted DoS attack, which allows the validator to send its
+assignment, but prevents it from forwarding and sharing the relay chain block.
+
+This sounds already hard enough, provided that we also start participation if we learned about an including block after
+the dispute has been raised already (we need to update participation queues on new leaves), but to be even safer we
+choose to have an additional best-effort queue.
+
+### Best-Effort Queue
+
+While attacking the priority queue is already pretty hard, attacking the best-effort queue is even harder. For a
+candidate to be a threat, it has to be included on some chain. For it to be included, it has to have been backed before
+and at least n/3 honest nodes must have seen that block, so availability (inclusion) can be reached. Making a full third
+of the nodes not further propagate a block, while at the same time allowing them to fetch chunks, sign and distribute
+bitfields seems almost infeasible and even if accomplished, those nodes would be enough to confirm a dispute and we have
+not even touched the above fact that in addition, for an attack, the following including block must be shared with
+honest validators as well.
+
+It is worth mentioning that a successful attack on the priority queue as outlined above is already outside of our threat
+model, as it assumes n/3 malicious nodes + additionally malfunctioning/DoSed nodes. Even more so for attacks on the
+best-effort queue, as our threat model only allows for n/3 malicious _or_ malfunctioning nodes in total. It would
+therefore be a valid decision to ditch the best-effort queue, if it proves to become a burden or creates other issues.
+
+One issue we should not be worried about though is spam. For abusing best-effort for spam, the following scenario would
+be necessary:
+
+An attacker controls a backing group: The attacker can then have candidates backed and choose to not provide chunks.
+This should come at a cost to miss out on rewards for backing, so is not free. At the same time it is rate limited, as a
+backing group can only back so many candidates legitimately. (~ 1 per slot):
+
+1. They have to wait until a malicious actor becomes block producer (for causing additional forks via equivocation for
+  example).
+2. Forks are possible, but if caused by equivocation also not free.
+3. For each fork the attacker has to wait until the candidate times out, for backing another one.
+
+Assuming there can only be a handful of forks, restrictions 2) and 3) (the candidate timeout) keep the frequency in the
+ballpark of once per slot. This scales linearly in the number of controlled backing groups, so two groups would mean 2
+backings per slot, and so on.
+
+So by this reasoning an attacker could only do very limited harm and at the same time will have to pay some price for it
+(it will miss out on rewards). Overall the work done by the network might even be in the same ballpark as if actors just
+behaved honestly:
+
+1. Validators would have fetched chunks
+2. Approval checkers would have done approval checks
+
+While because of the attack (backing, not providing chunks and afterwards disputing the candidate), the work for 1000
+validators would be:
+
+All validators sending out ~ 1000 tiny requests over already established connections, with also tiny (byte) responses.
+
+This means around a million requests, while in the honest case it would be ~ 10000 (30 approval checkers x330) - where
+each request triggers a response in the range of kilobytes. Hence network load alone will likely be higher in the honest
+case than in the DoS attempt case, which would mean the DoS attempt actually reduces load, while also costing rewards.
+
+In the worst case this can happen multiple times, as we would retry that on every vote import. The effect would still be
+in the same ballpark as honest behavior though and can also be mitigated by chilling repeated availability recovery
+requests for example.
+
+## Out of Scope
+
+### No Disputes for Non Included Candidates
+
+We only ever care about disputes for candidates that have been included on at least some chain (became available). This
+is because the availability system was designed for precisely that: Only with inclusion (availability) we have
+guarantees about the candidate to actually be available. Because only then we have guarantees that malicious backers can
+be reliably checked and slashed. Also, by design non included candidates do not pose any threat to the system.
+
+One could think of an (additional) dispute system to make it possible to dispute any candidate that has been proposed by
+a validator, no matter whether it got successfully included or even backed. Unfortunately, it would be very brittle (no
+availability) and also spam protection would be way harder than for the disputes handled by the dispute-coordinator. In
+fact, all the spam handling strategies described above would simply be unavailable.
+
+It is worth thinking about who could actually raise such disputes anyway: Approval checkers certainly not, as they will
+only ever check once availability succeeded. The only other nodes that meaningfully could/would are honest backing nodes
+or collators. For collators spam considerations would be even worse as there can be an unlimited number of them and we
+can not charge them for spam, so trying to handle disputes raised by collators would be even more complex. For honest
+backers: It actually makes more sense for them to wait until availability is reached as well, as only then they have
+guarantees that other nodes will be able to check. If they disputed before, all nodes would need to recover the data
+from them, so they would be an easy DoS target.
+
+In summary: The availability system was designed for raising disputes in a meaningful and secure way after availability
+was reached. Trying to raise disputes before does not meaningfully contribute to the system's security and might even
+weaken it, as attackers are warned before availability is reached, while at the same time adding a significant amount
+of complexity. We therefore punt on such disputes and concentrate on disputes the system was designed to handle.
+
+### No Disputes for Already Finalized Blocks
+
+Note that by the above rules in the `Participation` section, we will not participate in disputes concerning a candidate
+in an already finalized block. This is because disputing an already finalized block is simply too late and therefore of
+little value. Once finalized, bridges have already processed the block for example, so we have to assume the damage is
+already done. Governance has to step in and fix what can be fixed.
+
+Making disputes for already finalized blocks possible would only provide two features:
+
+1. We can at least still slash attackers.
+2. We can freeze the chain to some governance only mode, in an attempt to minimize potential harm done.
+
+Both seem kind of worthwhile, although as argued above, it is likely that there is not too much that can be done in 2
+and we would likely only end up DoSing the whole system without much we can do. 1 can also be achieved via governance
+mechanisms.
+
+In any case, our focus should be making as sure as reasonably possible that any potentially invalid block does not get
+finalized in the first place. Not allowing disputing already finalized blocks actually helps a great deal with this goal
+as it massively reduces the number of candidates that can be disputed.
+
+This makes attempts to overwhelm the system with disputes significantly harder and countermeasures way easier. We can
+limit inclusion, for example (as suggested [here](https://github.com/paritytech/polkadot/issues/5898)), in case of high
+dispute load. Another measure we have at our disposal is that on finality lag block production will slow down,
+implicitly reducing the rate of new candidates that can be disputed. Hence, cutting off the unlimited candidate
+supply of already finalized blocks guarantees the necessary DoS protection and ensures we can have measures in place to
+keep up with processing of disputes.
+
+If we allowed participation for disputes for already finalized candidates, the above spam protection mechanisms would be
+insufficient/relying 100% on full and quick disabling of spamming validators.
+
+## Database Schema
+
+We use an underlying Key-Value database where we assume we have the following operations available:
+  - `write(key, value)`
+  - `read(key) -> Option<value>`
+  - `iter_with_prefix(prefix) -> Iterator<(key, value)>` - gives all keys and values in lexicographical order where the
+    key starts with `prefix`.
+
+We use this database to encode the following schema:
+
+```rust
+("candidate-votes", SessionIndex, CandidateHash) -> Option<CandidateVotes>
+"recent-disputes" -> RecentDisputes
+"earliest-session" -> Option<SessionIndex>
+```
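+
+To make the assumed database interface concrete, here is a minimal in-memory sketch of the three operations together
+with one possible key encoding. `MemDb` and `candidate_votes_key` are hypothetical names, and the big-endian session
+index in the key is an assumption chosen so that lexicographical key order matches numerical session order.
+
+```rust
+use std::collections::BTreeMap;
+
+/// Hypothetical in-memory stand-in for the key-value database.
+#[derive(Default)]
+struct MemDb {
+  inner: BTreeMap<Vec<u8>, Vec<u8>>,
+}
+
+impl MemDb {
+  fn write(&mut self, key: &[u8], value: &[u8]) {
+    self.inner.insert(key.to_vec(), value.to_vec());
+  }
+
+  fn read(&self, key: &[u8]) -> Option<&[u8]> {
+    self.inner.get(key).map(|v| v.as_slice())
+  }
+
+  /// Yields all entries whose key starts with `prefix`, in lexicographical key order.
+  fn iter_with_prefix<'a>(
+    &'a self,
+    prefix: &'a [u8],
+  ) -> impl Iterator<Item = (&'a [u8], &'a [u8])> + 'a {
+    self
+      .inner
+      .range(prefix.to_vec()..)
+      .take_while(move |(k, _)| k.starts_with(prefix))
+      .map(|(k, v)| (k.as_slice(), v.as_slice()))
+  }
+}
+
+/// Assumed key layout: prefix, big-endian session index, 32-byte candidate hash.
+fn candidate_votes_key(session: u32, candidate_hash: &[u8; 32]) -> Vec<u8> {
+  let mut key = b"candidate-votes".to_vec();
+  key.extend_from_slice(&session.to_be_bytes());
+  key.extend_from_slice(candidate_hash);
+  key
+}
+```
+
+With such a layout, iterating over the `"candidate-votes"` prefix visits the oldest sessions first, which is convenient
+when pruning everything older than `"earliest-session"`.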
+
+The meta information that we track per-candidate is defined as the `CandidateVotes` struct. This draws on the [dispute
+statement types][DisputeTypes].
+
+```rust
+/// Tracked votes on candidates, for the purposes of dispute resolution.
+pub struct CandidateVotes {
+  /// The receipt of the candidate itself.
+  pub candidate_receipt: CandidateReceipt,
+  /// Votes of validity, sorted by validator index.
+  pub valid: Vec<(ValidDisputeStatementKind, ValidatorIndex, ValidatorSignature)>,
+  /// Votes of invalidity, sorted by validator index.
+  pub invalid: Vec<(InvalidDisputeStatementKind, ValidatorIndex, ValidatorSignature)>,
+}
+
+/// The mapping for recent disputes; any which have not yet been pruned for being ancient.
+pub type RecentDisputes = std::collections::BTreeMap<(SessionIndex, CandidateHash), DisputeStatus>;
+
+/// The status of dispute. This is a state machine which can be altered by the
+/// helper methods.
+pub enum DisputeStatus {
+  /// The dispute is active and unconcluded.
+  Active,
+  /// The dispute has been concluded in favor of the candidate
+  /// since the given timestamp.
+  ConcludedFor(Timestamp),
+  /// The dispute has been concluded against the candidate
+  /// since the given timestamp.
+  ///
+  /// This takes precedence over `ConcludedFor` in the case that
+  /// both are true, which is impossible unless a large amount of
+  /// validators are participating on both sides.
+  ConcludedAgainst(Timestamp),
+  /// Dispute has been confirmed (more than `byzantine_threshold` have already participated, or
+  /// we have seen the candidate included already, or we participated successfully ourselves).
+  Confirmed,
+}
+```
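+
+The helper methods mentioned in the `DisputeStatus` doc comment are not shown here. A minimal sketch of how such
+transitions could look is given below. The method names and the choice to keep the earliest conclusion timestamp are
+assumptions; only the rule that `ConcludedAgainst` takes precedence over `ConcludedFor` is taken from the comments
+above. `Timestamp` is simplified to `u64`.
+
+```rust
+type Timestamp = u64;
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum DisputeStatus {
+  Active,
+  ConcludedFor(Timestamp),
+  ConcludedAgainst(Timestamp),
+  Confirmed,
+}
+
+impl DisputeStatus {
+  /// `Active` becomes `Confirmed`; all other states are left untouched.
+  fn confirm(self) -> Self {
+    match self {
+      DisputeStatus::Active => DisputeStatus::Confirmed,
+      other => other,
+    }
+  }
+
+  /// Conclude in favor of the candidate. Never overrides `ConcludedAgainst`.
+  fn conclude_for(self, now: Timestamp) -> Self {
+    match self {
+      DisputeStatus::Active | DisputeStatus::Confirmed => DisputeStatus::ConcludedFor(now),
+      DisputeStatus::ConcludedFor(at) => DisputeStatus::ConcludedFor(at.min(now)),
+      against @ DisputeStatus::ConcludedAgainst(_) => against,
+    }
+  }
+
+  /// Conclude against the candidate. Takes precedence over `ConcludedFor`.
+  fn conclude_against(self, now: Timestamp) -> Self {
+    match self {
+      DisputeStatus::Active | DisputeStatus::Confirmed => DisputeStatus::ConcludedAgainst(now),
+      DisputeStatus::ConcludedFor(at) | DisputeStatus::ConcludedAgainst(at) =>
+        DisputeStatus::ConcludedAgainst(at.min(now)),
+    }
+  }
+}
+```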
+
+## Protocol
+
+Input: [`DisputeCoordinatorMessage`][DisputeCoordinatorMessage]
+
+Output:
+  - [`RuntimeApiMessage`][RuntimeApiMessage]
+
+## Functionality
+
+This assumes a constant `DISPUTE_WINDOW: SessionWindowSize`. This should correspond to at least 1 day.
+
+Ephemeral in-memory state:
+
+```rust
+struct State {
+  keystore: Arc<LocalKeystore>,
+  rolling_session_window: RollingSessionWindow,
+  highest_session: SessionIndex,
+  spam_slots: SpamSlots,
+  participation: Participation,
+  ordering_provider: OrderingProvider,
+  participation_receiver: WorkerMessageReceiver,
+  metrics: Metrics,
+  // This tracks only rolling session window failures.
+  // It can be a `Vec` if the need to track more arises.
+  error: Option<SessionsUnavailable>,
+  /// Latest relay blocks that have been successfully scraped.
+  last_scraped_blocks: LruMap<Hash, ()>,
+}
+```
+
+### On startup
+
+When the subsystem is initialised it waits for a new leaf (message `OverseerSignal::ActiveLeaves`). The leaf is used to
+initialise a `RollingSessionWindow` instance (contains leaf hash and `DISPUTE_WINDOW` which is a constant).
+
+Next, the active disputes are loaded from the DB and spam slots are initialized accordingly. Then, for each loaded
+dispute, we either send a `DisputeDistributionMessage::SendDispute` if there is a local vote from us available or, if
+there is none and participation is in order, we push the dispute to participation.
+
+### The main loop
+
+Just after the subsystem initialisation the main loop (`fn run_until_error()`) runs until an
+`OverseerSignal::Conclude` signal is received. Before executing the actual main loop, the leaf and the participations
+obtained during startup are enqueued for processing. If there is capacity (the number of running participations is less
+than `MAX_PARALLEL_PARTICIPATIONS`), participation jobs are started (`fn participate()`). Finally the component waits
+for messages from the Overseer. The behaviour on each message is described in the following subsections.
+
+### On `OverseerSignal::ActiveLeaves`
+
+Initiates processing via the `Participation` module and updates the internal state of the subsystem. More concretely:
+
+- Passes the `ActiveLeavesUpdate` message to the ordering provider.
+- Updates the session info cache.
+- Updates `self.highest_session`.
+- Prunes old spam slots in case the session window has advanced.
+- Scrapes on-chain votes.
+
+### On `MuxedMessage::Participation`
+
+This message is sent from the `Participation` module and indicates a processed dispute participation. It's the result of
+the processing job initiated with `OverseerSignal::ActiveLeaves`. The subsystem issues a `DisputeMessage` with the
+result.
+
+### On `OverseerSignal::Conclude`
+
+Exit gracefully.
+
+### On `OverseerSignal::BlockFinalized`
+
+Performs cleanup of the finalized candidate.
+
+### On `DisputeCoordinatorMessage::ImportStatements`
+
+Import statements by validators are processed in `fn handle_import_statements()`. The function has three main
+responsibilities:
+- Initiate participation in disputes and sending out of any existing own approval vote in case of a raised dispute.
+- Persist all fresh votes in the database. Fresh votes in this context means votes that are not already processed by the
+  node.
+- Spam protection on all invalid (`DisputeStatement::Invalid`) votes. Please check the SpamSlots section for details on
+  how spam protection works.
+
+### On `DisputeCoordinatorMessage::RecentDisputes`
+
+Returns all recent disputes saved in the DB.
+
+### On `DisputeCoordinatorMessage::ActiveDisputes`
+
+Returns all recent disputes that are unconcluded or were concluded within the last `ACTIVE_DURATION_SECS`.
+
+### On `DisputeCoordinatorMessage::QueryCandidateVotes`
+
+Loads `candidate-votes` for every `(SessionIndex, CandidateHash)` in the input query and returns data within each
+`CandidateVotes`. If a particular `candidate-votes` entry is missing, that particular request is omitted from the
+response.
+
+### On `DisputeCoordinatorMessage::IssueLocalStatement`
+
+Executes `fn issue_local_statement()` which performs the following operations:
+
+- Deconstruct into parts `{ session_index, candidate_hash, candidate_receipt, is_valid }`.
+- Construct a [`DisputeStatement`][DisputeStatement] based on `Valid` or `Invalid`, depending on the parameterization of
+  this routine.
+- Sign the statement with each key in the `SessionInfo`'s list of teyrchain validation keys which is present in the
+  keystore, except those whose indices appear in `voted_indices`. This will typically just be one key, but this does
+  provide some future-proofing for situations where the same node may run on behalf of multiple validators. At the time
+  of writing, this is not a use-case we support as other subsystems do not invariably provide this guarantee.
+- Write statement to DB.
+- Send a `DisputeDistributionMessage::SendDispute` message to get the vote distributed to other validators.
+
+### On `DisputeCoordinatorMessage::DetermineUndisputedChain`
+
+Executes `fn determine_undisputed_chain()` which performs the following:
+
+- Load `"recent-disputes"`.
+- Deconstruct into parts `{ base_number, block_descriptions, rx }`
+- Starting from the beginning of `block_descriptions`:
+  1. Check the `RecentDisputes` for a dispute of each candidate in the block description.
+  1. If there is a dispute which is active or concluded negative, exit the loop.
+- For the highest index `i` reached in the `block_descriptions`, send `(base_number + i + 1, block_hash)` on the
+  channel, unless `i` is 0, in which case `None` should be sent. The `block_hash` is determined by inspecting
+  `block_descriptions[i]`.
+
+[DisputeTypes]: ../../types/disputes.md
+[DisputeStatement]: ../../types/disputes.md#disputestatement
+[DisputeCoordinatorMessage]: ../../types/overseer-protocol.md#dispute-coordinator-message
+[RuntimeApiMessage]: ../../types/overseer-protocol.md#runtime-api-message
@@ -0,0 +1,429 @@
+# Dispute Distribution
+
+Dispute distribution is responsible for ensuring all concerned validators will
+be aware of a dispute and have the relevant votes.
+
+## Design Goals
+
+This design should result in a protocol that is:
+
+- resilient to nodes being temporarily unavailable
+- quick to make nodes aware of a dispute
+- relatively efficient, so it does not cause too much stress on the network
+- resilient when it comes to spam
+- simple and boring: We want disputes to work when they happen
+
+## Protocol
+
+Distributing disputes needs to be a reliable protocol. We would like to make as
+sure as possible that our vote got properly delivered to all concerned
+validators. For this to work, this subsystem won't be gossip based, but instead
+will use a request/response protocol for application level confirmations. The
+request will be the payload (the actual votes/statements), the response will
+be the confirmation. See [below](#wire-format).
+
+### Input
+
+[`DisputeDistributionMessage`][DisputeDistributionMessage]
+
+### Output
+
+- [`DisputeCoordinatorMessage::ActiveDisputes`][DisputeCoordinatorMessage]
+- [`DisputeCoordinatorMessage::ImportStatements`][DisputeCoordinatorMessage]
+- [`DisputeCoordinatorMessage::QueryCandidateVotes`][DisputeCoordinatorMessage]
+- [`RuntimeApiMessage`][RuntimeApiMessage]
+
+### Wire format
+
+#### Disputes
+
+Protocol: `"/<genesis_hash>/<fork_id>/send_dispute/1"`
+
+Request:
+
+```rust
+struct DisputeRequest {
+  /// The candidate being disputed.
+  pub candidate_receipt: CandidateReceipt,
+
+  /// The session the candidate appears in.
+  pub session_index: SessionIndex,
+
+  /// The invalid vote data that makes up this dispute.
+  pub invalid_vote: InvalidDisputeVote,
+
+  /// The valid vote that makes this dispute request valid.
+  pub valid_vote: ValidDisputeVote,
+}
+
+/// Any invalid vote (currently only explicit).
+pub struct InvalidDisputeVote {
+  /// The voting validator index.
+  pub validator_index: ValidatorIndex,
+
+  /// The validator signature, that can be verified when constructing a
+  /// `SignedDisputeStatement`.
+  pub signature: ValidatorSignature,
+
+  /// Kind of dispute statement.
+  pub kind: InvalidDisputeStatementKind,
+}
+
+/// Any valid vote (backing, approval, explicit).
+pub struct ValidDisputeVote {
+  /// The voting validator index.
+  pub validator_index: ValidatorIndex,
+
+  /// The validator signature, that can be verified when constructing a
+  /// `SignedDisputeStatement`.
+  pub signature: ValidatorSignature,
+
+  /// Kind of dispute statement.
+  pub kind: ValidDisputeStatementKind,
+}
+```
+
+Response:
+
+```rust
+enum DisputeResponse {
+  Confirmed
+}
+```
+
+#### Vote Recovery
+
+Protocol: `"/<genesis_hash>/<fork_id>/req_votes/1"`
+
+Request:
+
+```rust
+struct IHaveVotesRequest {
+  candidate_hash: CandidateHash,
+  session: SessionIndex,
+  valid_votes: Bitfield,
+  invalid_votes: Bitfield,
+}
+```
+
+Response:
+
+```rust
+struct VotesResponse {
+  /// All votes we have, but the requester was missing.
+  missing: Vec<(DisputeStatement, ValidatorIndex, ValidatorSignature)>,
+}
+```
+
+## Starting a Dispute
+
+A dispute is initiated once a node sends the first `DisputeRequest` wire message,
+which must contain an "invalid" vote and a "valid" vote.
+
+The dispute distribution subsystem can get instructed to send that message out to
+all concerned validators by means of a `DisputeDistributionMessage::SendDispute`
+message. That message must contain an invalid vote from the local node and some
+valid one, e.g. a backing statement.
+
+We include a valid vote as well, so any node, regardless of whether it is synced
+with the chain or has seen the backing/approval votes, can see that there are
+conflicting votes available, hence we have a valid dispute. Nodes will still
+need to check whether the disputing votes are somewhat current and not some
+stale ones.
+
+## Participating in a Dispute
+
+Upon receiving a `DisputeRequest` message, dispute distribution will trigger the
+import of the received votes via the dispute coordinator
+(`DisputeCoordinatorMessage::ImportStatements`). The dispute coordinator will
+take care of participating in that dispute if necessary. Once it is done, the
+coordinator will send a `DisputeDistributionMessage::SendDispute` message to dispute
+distribution. From here, everything is the same as for starting a dispute,
+except that if the local node deemed the candidate valid, the `SendDispute`
+message will contain a valid vote signed by our node and will contain the
+initially received `Invalid` vote.
+
+Note that we rely on `dispute-coordinator` to check validity of a dispute for spam
+protection (see below).
+
+## Sending of messages
+
+Starting and participating in a dispute are pretty similar from the perspective
+of dispute distribution. Once we receive a `SendDispute` message, we try to make
+sure to get the data out. We keep track of all the teyrchain validators that
+should see the message, which are all the teyrchain validators of the session
+where the dispute happened as they will want to participate in the dispute. In
+addition we also need to get the votes out to all authorities of the current
+session (which might be the same or not and may change during the dispute).
+Those authorities will not participate in the dispute, but need to see the
+statements so they can include them in blocks.
+
+### Reliability
+
+We only consider a message transmitted once we have received a confirmation
+message. Until then, we will keep retrying to get that message out as long as
+the dispute is deemed alive. To determine whether a dispute is still alive we
+will ask the `dispute-coordinator` for a list of all still active disputes via a
+`DisputeCoordinatorMessage::ActiveDisputes` message before each retry run. Once
+a dispute is no longer alive, we will clean up the state accordingly.
+
+### Order
+
+We assume `SendDispute` messages are coming in an order of importance, hence
+`dispute-distribution` will make sure to send out network messages in the same
+order, even on retry.
+
+### Rate Limit
+
+For spam protection (see below), we employ artificial rate limiting on sending
+out messages in order to not hit the rate limit at the receiving side, which
+would result in our messages getting dropped and our reputation getting reduced.
+
+## Reception
+
+As we shall see, the receiving side is mostly about handling spam and ensuring
+the dispute-coordinator learns about disputes as fast as possible.
+
+Goals for the receiving side:
+
+1. Get new disputes to the dispute-coordinator as fast as possible, so
+   prioritization can happen properly.
+2. Batch votes per dispute as much as possible for good import performance.
+3. Prevent malicious nodes from exhausting node resources by sending lots of
+   messages.
+4. Prevent malicious nodes from sending so many messages/(fake) disputes that
+   they prevent us from concluding good ones.
+5. Limit the ability of malicious nodes to delay the vote import due to
+   batching logic.
+
+Goals 1 and 2 seem to be conflicting, but an easy compromise is possible: When
+learning about a new dispute, we will import the vote immediately, making the
+dispute coordinator aware and also getting immediate feedback on the validity.
+Then if valid we can batch further incoming votes, with less time constraints as
+the dispute-coordinator already knows about the dispute.
+
+Goals 3 and 4 are obviously very related and both can easily be solved via rate
+limiting as we shall see below. Rate limits should already be implemented at the
+Substrate level, but [are not](https://github.com/paritytech/substrate/issues/7750)
+at the time of writing. But even if they were, the enforced Substrate limits would
+likely not be configurable and thus would still be too high for our needs as we can
+rely on the following observations:
+
+1. Each honest validator will only send one message (apart from duplicates on
+  timeout) per candidate/dispute.
+2. An honest validator needs to fully recover availability and validate the
+  candidate for casting a vote.
+
+With these two observations, we can conclude that honest validators will usually
+not send messages at a high rate. We can therefore enforce conservative rate
+limits and thus minimize the harm spamming malicious nodes can do.
+
+Before we dive into how rate limiting solves all spam issues elegantly, let's
+discuss that honest behaviour further:
+
+What about session changes? Here we might have to inform a new validator set of
+lots of already existing disputes at once.
+
+With observation 1) and a rate limit that is per peer, we are still good:
+
+Let's assume a rate limit of one message per 200ms per sender. This means 5
+messages from each validator per second. 5 messages means 5 disputes!
+Consequently, we will be able to conclude 5 disputes per second - no matter what
+malicious actors are doing. This is assuming dispute messages are sent ordered,
+but even if not perfectly ordered: On average it will be 5 disputes per second.
+
+This is good enough! All those disputes are valid ones and will result in
+slashing and disabling of validators. Even if we assume that all of them
+conclude `valid` and that we disable validators only after they have raised 100
+disputes concluding `valid`, we would still start disabling misbehaving
+validators in only 20 seconds (100 disputes at 5 disputes per second).
+
+One could also think that in addition participation is expected to take longer,
+which means on average we can import/conclude disputes faster than they are
+generated - regardless of dispute spam. Unfortunately this is not necessarily
+true: There might be teyrchains with very light load where recovery and
+validation can be accomplished very quickly - maybe faster than we can import
+those disputes.
+
+This is probably an argument for not imposing too low a rate limit, although the
+issue is more general: Even without any rate limit, if an attacker generates
+disputes at a very high rate, nodes will be having trouble keeping participation
+up, hence the problem should be mitigated at a [more fundamental
+layer](https://github.com/paritytech/polkadot/issues/5898).
+
+For nodes that have been offline for a while, the same argument as for session
+changes holds, but matters even less: We assume 2/3 of nodes to be online, so
+even if the worst case 1/3 offline happens and they could not import votes fast
+enough (as argued above, they in fact can) it would not matter for consensus.
+
+### Rate Limiting
+
+As suggested previously, rate limiting allows us to mitigate all threats that come
+from malicious actors trying to overwhelm the system in order to get away without
+a slash, when it comes to dispute-distribution. In this section we will explain
+how in greater detail.
+
+The idea is to open a queue with limited size for each peer. We will process
+incoming messages as fast as we can by doing the following:
+
+1. Check that the sending peer is actually a valid authority - otherwise drop
+   message and decrease reputation/disconnect.
+2. Put message on the peer's queue, if queue is full - drop it.
+
+Every `RATE_LIMIT` seconds (or rather milliseconds), we pause processing
+incoming requests to go a full circle and process one message from each queue.
+Processing means `Batching` as explained in the next section.
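+
+The per-peer queues and the round-robin processing described above could be sketched like this. `PeerQueues` and its
+methods are hypothetical names, peers are reduced to integer ids, and the authority check is passed in as a flag rather
+than looked up. A real node would also adjust the reputation of the sending peer when dropping a message.
+
+```rust
+use std::collections::{HashMap, VecDeque};
+
+type PeerId = u32;
+
+/// One bounded FIFO queue per sending peer.
+struct PeerQueues<M> {
+  queues: HashMap<PeerId, VecDeque<M>>,
+  max_per_peer: usize,
+}
+
+impl<M> PeerQueues<M> {
+  fn new(max_per_peer: usize) -> Self {
+    Self { queues: HashMap::new(), max_per_peer }
+  }
+
+  /// Returns `false` if the message was dropped, either because the sender
+  /// is not an authority or because its queue is already full.
+  fn push(&mut self, peer: PeerId, is_authority: bool, msg: M) -> bool {
+    if !is_authority {
+      return false;
+    }
+    let queue = self.queues.entry(peer).or_default();
+    if queue.len() >= self.max_per_peer {
+      return false;
+    }
+    queue.push_back(msg);
+    true
+  }
+
+  /// Called once per `RATE_LIMIT` tick: take one message from each
+  /// non-empty queue and forget queues that became empty.
+  fn pop_round(&mut self) -> Vec<(PeerId, M)> {
+    let mut out = Vec::new();
+    for (peer, queue) in self.queues.iter_mut() {
+      if let Some(msg) = queue.pop_front() {
+        out.push((*peer, msg));
+      }
+    }
+    self.queues.retain(|_, q| !q.is_empty());
+    out
+  }
+}
+```
+
+Because each tick handles at most one message per peer, a single spamming peer cannot delay messages of other peers by
+more than its own share of the round.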
+
+### Batching
+
+To achieve goal 2 we will batch incoming votes/messages together before passing
+them on as a single batch to the `dispute-coordinator`. To adhere to goal 1 as
+well, we will do the following:
+
+1. For an incoming message, we check whether we have an existing batch for that
+   candidate, if not we import directly to the dispute-coordinator, as we have
+   to assume this is concerning a new dispute.
+2. We open a batch and start collecting incoming messages for that candidate,
+   instead of immediately forwarding.
+3. We keep collecting votes in the batch until we receive fewer than
+   `MIN_KEEP_BATCH_ALIVE_VOTES` unique votes in the last `BATCH_COLLECTING_INTERVAL`. This is
+   important to accommodate goal 5 and also goal 3.
+4. We send the whole batch to the dispute-coordinator.
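+
+The four steps above can be sketched as a small state holder. This is a simplification under stated assumptions: the
+type and method names are hypothetical, a vote is represented only by the index of the voting validator, and
+`on_interval` is assumed to be called by a timer every `BATCH_COLLECTING_INTERVAL`.
+
+```rust
+use std::collections::hash_map::Entry;
+use std::collections::{HashMap, HashSet};
+
+type CandidateHash = [u8; 32];
+type ValidatorIndex = u32;
+
+const MIN_KEEP_BATCH_ALIVE_VOTES: usize = 10;
+
+#[derive(Default)]
+struct Batch {
+  votes: HashSet<ValidatorIndex>,
+  new_in_interval: usize,
+}
+
+#[derive(Debug, PartialEq)]
+enum Action {
+  /// No batch existed: import directly, the dispute may be new (step 1).
+  ImportNow,
+  /// The vote was added to an existing batch (step 2).
+  Batched,
+}
+
+#[derive(Default)]
+struct Batches {
+  open: HashMap<CandidateHash, Batch>,
+}
+
+impl Batches {
+  fn on_vote(&mut self, candidate: CandidateHash, voter: ValidatorIndex) -> Action {
+    match self.open.entry(candidate) {
+      Entry::Vacant(slot) => {
+        slot.insert(Batch::default());
+        Action::ImportNow
+      },
+      Entry::Occupied(mut slot) => {
+        let batch = slot.get_mut();
+        // Only unique votes count towards keeping the batch alive.
+        if batch.votes.insert(voter) {
+          batch.new_in_interval += 1;
+        }
+        Action::Batched
+      },
+    }
+  }
+
+  /// Steps 3 and 4: flush every batch that received too few new votes.
+  fn on_interval(&mut self) -> Vec<(CandidateHash, Vec<ValidatorIndex>)> {
+    let mut flushed = Vec::new();
+    self.open.retain(|candidate, batch| {
+      if batch.new_in_interval < MIN_KEEP_BATCH_ALIVE_VOTES {
+        let mut votes: Vec<_> = batch.votes.drain().collect();
+        votes.sort_unstable();
+        flushed.push((*candidate, votes));
+        false
+      } else {
+        batch.new_in_interval = 0;
+        true
+      }
+    });
+    flushed
+  }
+}
+```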
+
+This together with rate limiting explained above ensures we will be able to
+process valid disputes: We can limit the number of simultaneous existing batches
+to some high value, but can be rather certain that this limit will never be
+reached - hence we won't drop valid disputes:
+
+Let's assume `MIN_KEEP_BATCH_ALIVE_VOTES` is 10, `BATCH_COLLECTING_INTERVAL`
+is `500ms` and above `RATE_LIMIT` is `100ms`. 1/3 of validators are malicious,
+so for 1000 this means around 330 malicious actors worst case.
+
+All those actors can send a message every `100ms`, that is 10 per second. This
+means at the beginning of an attack they can open up around 3300 batches, each
+containing two votes. So memory usage is still negligible. In reality it is even
+less, as we also demand 10 new votes to trickle in per batch in order to keep it
+alive, every `500ms`. Hence for the first second, each batch requires 20 votes.
+Each message is 2 votes, so this means 10 messages per batch. Hence to
+keep those batches alive 10 attackers are needed for each batch. This reduces
+the number of opened batches by a factor of 10: So we only have 330 batches in 1
+second - each containing 20 votes.
+
+The next second: In order to further grow memory usage, attackers have to
+maintain 10 messages per batch and second. Number of batches equals the number
+of attackers, each has 10 messages per second, all are needed to maintain the
+batches in memory. Therefore we have a hard cap of around 330 (number of
+malicious nodes) open batches. Each can be filled with the votes of all
+malicious actors. So 330 batches with 330 votes each: Let's assume approximately
+100 bytes per signature/vote. This results in a worst case memory usage of
+`330 * 330 * 100 ~= 10 MiB`.
+
+For 10_000 validators, we are already in the Gigabyte range, which means that
+with a validator set that large we might want to be more strict with the rate limit or
+require a larger rate of incoming votes per batch to keep them alive.
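+
+The bound above can be reproduced with a small calculation. This is a minimal
+sketch only: the function name `worst_case_batch_memory` is hypothetical, and it
+uses exact integer division by 3 (333 attackers for 1000 validators) where the
+text rounds to 330.
+
+```rust
+/// Worst-case bytes held in open batches, under the assumptions of the text:
+/// up to 1/3 of validators are malicious, there is at most one open batch per
+/// attacker, and each batch holds at most one vote per attacker.
+/// Hypothetical helper, not part of the subsystem's API.
+pub fn worst_case_batch_memory(validators: u64, bytes_per_vote: u64) -> u64 {
+    let malicious = validators / 3;
+    malicious * malicious * bytes_per_vote
+}
+```
+
+With 1000 validators and 100 bytes per vote this yields roughly 11 MB; with
+10_000 validators it exceeds 1 GB, which matches the orders of magnitude given
+above.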
+
+For a thousand validators a limit on batches of around 1000 should never be
+reached in practice. Hence due to rate limiting we have a very good chance of
+never having to drop a potentially valid dispute due to some resource limit.
+
+Further safeguards are possible: The dispute-coordinator actually
+confirms/denies imports. So once we receive a denial by the dispute-coordinator
+for the initial imported votes, we can opt into flushing the batch immediately
+and importing the votes. This swaps memory usage for more CPU usage, but if that
+import is deemed invalid again we can immediately decrease the reputation of the
+sending peers, so this should be a net win. For the time being we punt on this
+for simplicity.
+
+Instead of filling batches to maximize memory usage, attackers could also try to
+overwhelm the dispute coordinator by only sending votes for new candidates all
+the time. This attack vector is also mitigated by the rate limit above and by
+decreasing the peer's reputation when the coordinator denies the invalid
+imports.
+
+### Node Startup
+
+Nothing special happens on node startup. We expect the `dispute-coordinator` to
+inform us about any ongoing disputes via `SendDispute` messages.
+
+## Backing and Approval Votes
+
+Backing and approval votes get imported into the dispute coordinator by the
+corresponding subsystems when they arrive or are created.
+
+We assume that under normal operation each node will be aware of backing and
+approval votes and optimize for that case. Nevertheless we want disputes to
+conclude quickly and reliably. Therefore, if a node is not aware of
+backing/approval votes, it can request the missing votes from the node that
+informed it about the dispute (see [Resiliency](#resiliency)).
+
+## Resiliency
+
+The above protocol should be sufficient for most cases, but there are certain
+cases we also want to have covered:
+
+- Non-validator nodes might be interested in ongoing voting, even before it is
+  recorded on chain.
+- Nodes might have missed votes, especially backing or approval votes.
+  Recovering them from chain is difficult and expensive, due to runtime upgrades
+  and untyped extrinsics.
+- More importantly, on era changes the new authority set has, from the
+  perspective of approval-voting, no need to see "old" approval votes. Hence its
+  members might not see them and therefore cannot import them into the dispute
+  coordinator, so no authority will put them on chain.
+
+To cover those cases, we introduce a second request/response protocol, which can
+be handled at a lower priority than the one above. It consists of the
+request/response messages as described in the [protocol
+section](#vote-recovery).
+
+Nodes may send those requests to validators if they believe they are missing
+votes, e.g. after some timeout if no majority has been reached yet from their
+point of view, or if they are not aware of any backing/approval votes for a
+received disputed candidate.
+
+The receiver of an `IHaveVotesRequest` message will do the following:
+
+1. See if the sender is missing votes we are aware of - if so, respond with
+   those votes.
+2. Check whether the sender knows about any votes we don't know about, and if so
+   send an `IHaveVotesRequest` back with our knowledge.
+3. Record the peer's knowledge.
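+
+The three steps above amount to two set differences plus bookkeeping. The
+following is a minimal sketch, assuming votes are identified by validator index
+and peers by a plain integer; `handle_i_have_votes`, `VoteExchange`, `PeerId`
+and `ValidatorIndex` are hypothetical names used for illustration and are not
+taken from the actual protocol types.
+
+```rust
+use std::collections::{BTreeMap, BTreeSet};
+
+/// Hypothetical stand-ins for the real identifier types.
+pub type ValidatorIndex = u32;
+pub type PeerId = u64;
+
+/// Outcome of handling one incoming request.
+pub struct VoteExchange {
+    /// Step 1: votes we know about that the sender is missing.
+    pub votes_to_send: BTreeSet<ValidatorIndex>,
+    /// Step 2: true if the sender knows votes we lack, so we request back.
+    pub request_back: bool,
+}
+
+pub fn handle_i_have_votes(
+    peer: PeerId,
+    ours: &BTreeSet<ValidatorIndex>,
+    theirs: &BTreeSet<ValidatorIndex>,
+    peer_knowledge: &mut BTreeMap<PeerId, BTreeSet<ValidatorIndex>>,
+) -> VoteExchange {
+    // Step 1: everything in `ours` that is not in `theirs`.
+    let votes_to_send: BTreeSet<ValidatorIndex> = ours.difference(theirs).copied().collect();
+    // Step 2: is there at least one vote in `theirs` that we do not have?
+    let request_back = theirs.difference(ours).next().is_some();
+    // Step 3: remember what this peer claimed to know.
+    peer_knowledge.insert(peer, theirs.clone());
+    VoteExchange { votes_to_send, request_back }
+}
+```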
+
+When to send `IHaveVotesRequest` messages:
+
+1. Whenever we are asked to do so via
+   `DisputeDistributionMessage::FetchMissingVotes`.
+2. Approximately once per block to some random validator as long as the dispute
+   is active.
+
+Spam considerations: Nodes want to accept those messages once per validator and
+per slot. They are free to drop more frequent requests or requests for stale
+data. Requests coming from non-validator nodes can be handled on a best-effort
+basis.
+
+## Considerations
+
+Dispute distribution is critical. We should keep track of available validator
+connections and issue warnings if we are not connected to a majority of
+validators. We should also keep track of failed sending attempts and log
+warnings accordingly. As disputes are rare and TCP is a reliable protocol,
+each failed attempt should probably trigger a warning in the logs and also be
+recorded in some Prometheus metric.
+
+## Disputes for non-available candidates
+
+If deemed necessary we can later on also support disputes for non-available
+candidates, but disputes for those cases have totally different requirements.
+
+First of all such disputes are not time critical. We just want to have
+some offender slashed at some point, but we have no risk of finalizing any bad
+data.
+
+Second, as we won't have availability for such data, the node that initiated the
+dispute will be responsible for providing the disputed data initially. Then
+nodes which did the check already are also providers of the data, hence
+distributing load and making prevention of the dispute from concluding harder
+and harder over time. Assuming an attacker cannot DoS a node forever, the
+dispute will succeed eventually, which is all that matters. And again, even if
+an attacker managed to prevent such a dispute from happening somehow, there is
+no real harm done: There was no serious attack to begin with.
+
+[DisputeDistributionMessage]: ../../types/overseer-protocol.md#dispute-distribution-message
+[RuntimeApiMessage]: ../../types/overseer-protocol.md#runtime-api-message
@@ -0,0 +1,25 @@
+# GRANDPA Voting Rule
+
+Specifics on the motivation and types of constraints we apply to the GRANDPA voting logic as well as the definitions of
+**viable** and **finalizable** blocks can be found in the [Chain Selection Protocol](../protocol-chain-selection.md)
+section. The subsystem which provides us with viable leaves is the [Chain Selection
+Subsystem](utility/chain-selection.md).
+
+GRANDPA's regular voting rule is for each validator to select the longest chain they are aware of. GRANDPA proceeds in
+rounds, collecting information from all online validators and determining the blocks that a supermajority of validators
+all have in common with each other.
+
+The low-level GRANDPA logic will provide us with a **required block**. We can find the best leaf containing that block
+in its chain with the
+[`ChainSelectionMessage::BestLeafContaining`](../types/overseer-protocol.md#chain-selection-message) message. If the
+result is `None`, then we will simply cast a vote on the required block.
+
+The **viable** leaves provided from the chain selection subsystem are not necessarily **finalizable**, so we need to
+perform further work to discover the finalizable ancestor of the block. The first constraint is to avoid voting on any
+unapproved block. The highest approved ancestor of a given block can be determined by querying the Approval Voting
+subsystem via the [`ApprovalVotingMessage::ApprovedAncestor`](../types/overseer-protocol.md#approval-voting) message. If
+the response is `Some`, we continue and apply the second constraint. The second constraint is to avoid voting on any
+block containing a candidate undergoing an active dispute. The list of block hashes and candidates returned from
+`ApprovedAncestor` should be reversed, and passed to the
+[`DisputeCoordinatorMessage::DetermineUndisputedChain`](../types/overseer-protocol.md#dispute-coordinator-message)
+message to determine the **finalizable** block which will be our eventual vote.
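+
+The decision flow described above can be sketched as a single function. This is a simplified illustration, not the actual
+implementation: the three subsystem requests are modelled as plain closures, `Hash` is a stand-in integer type,
+`select_vote_target` is a hypothetical name, and falling back to the required block when `ApprovedAncestor` returns
+`None` is an assumption that the text does not state explicitly.
+
+```rust
+/// Hypothetical stand-in for a block hash.
+pub type Hash = u64;
+
+pub fn select_vote_target(
+    required: Hash,
+    // Models `ChainSelectionMessage::BestLeafContaining`.
+    best_leaf_containing: impl Fn(Hash) -> Option<Hash>,
+    // Models `ApprovalVotingMessage::ApprovedAncestor`; returns the approved blocks, highest first.
+    approved_ancestor: impl Fn(Hash) -> Option<Vec<Hash>>,
+    // Models `DisputeCoordinatorMessage::DetermineUndisputedChain`.
+    determine_undisputed_chain: impl Fn(&[Hash]) -> Hash,
+) -> Hash {
+    // No viable leaf contains the required block: vote on the required block itself.
+    let Some(leaf) = best_leaf_containing(required) else {
+        return required;
+    };
+    // First constraint: never vote on an unapproved block.
+    // Assumption: with no approved ancestor we fall back to the required block.
+    let Some(mut chain) = approved_ancestor(leaf) else {
+        return required;
+    };
+    // Second constraint: reverse the list and let the dispute coordinator pick the undisputed block.
+    chain.reverse();
+    determine_undisputed_chain(&chain)
+}
+```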
@@ -0,0 +1,147 @@
+# Overseer
+
+The overseer is responsible for these tasks:
+
+1. Setting up, monitoring, and handling failure for overseen subsystems.
+1. Providing a "heartbeat" of which relay-parents subsystems should be working on.
+1. Acting as a message bus between subsystems.
+
+The hierarchy of subsystems:
+
+```text
++--------------+      +------------------+    +--------------------+
+|              |      |                  |---->   Subsystem A      |
+| Block Import |      |                  |    +--------------------+
+|    Events    |------>                  |    +--------------------+
++--------------+      |                  |---->   Subsystem B      |
+                      |   Overseer       |    +--------------------+
++--------------+      |                  |    +--------------------+
+|              |      |                  |---->   Subsystem C      |
+| Finalization |------>                  |    +--------------------+
+|    Events    |      |                  |    +--------------------+
+|              |      |                  |---->   Subsystem D      |
++--------------+      +------------------+    +--------------------+
+
+```
+
+The overseer determines work to do based on block import events and block finalization events. It does this by keeping
+track of the set of relay-parents for which work is currently being done. This is known as the "active leaves" set. It
+determines an initial set of active leaves on startup based on the data on-disk, and uses events about blockchain import
+to update the active leaves. Updates lead to
+[`OverseerSignal`](../types/overseer-protocol.md#overseer-signal)`::ActiveLeavesUpdate` being sent according to new
+relay-parents, as well as relay-parents to stop considering. Block import events inform the overseer of leaves that no
+longer need to be built on, now that they have children, and inform us to begin building on those children. Block
+finalization events inform us when we can stop focusing on blocks that appear to have been orphaned.
+
+The overseer is also responsible for tracking the freshness of active leaves. Leaves are fresh when they're encountered
+for the first time, and stale when they're encountered for subsequent times. This can occur after chain reversions or
+when the fork-choice rule abandons some chain. This distinction is used to manage **Reversion Safety**. Consensus
+messages are often localized to a specific relay-parent, and it is often a misbehavior to equivocate or sign two
+conflicting messages. When reverting the chain, we may begin work on a leaf that subsystems have already signed messages
+for. Subsystems which need to account for reversion safety should avoid performing work on stale leaves.
+
+The overseer's logic can be described with these functions:
+
+## On Startup
+
+* Start all subsystems
+* Determine all blocks of the blockchain that should be built on. This should typically be the head of the best fork of
+  the chain we are aware of. Sometimes add recent forks as well.
+* Send an `OverseerSignal::ActiveLeavesUpdate` to all subsystems with `activated` containing each of these blocks.
+* Begin listening for block import and finality events
+
+## On Block Import Event
+
+* Apply the block import event to the active leaves. A new block should lead to its addition to the active leaves set
+  and its parent being deactivated.
+* Mark any stale leaves as stale. The overseer should track all leaves it activates to determine whether leaves are
+  fresh or stale.
+* Send an `OverseerSignal::ActiveLeavesUpdate` message to all subsystems containing all activated and deactivated
+  leaves.
+* Ensure all `ActiveLeavesUpdate` messages are flushed before resuming activity as a message router.
+
+> TODO: in the future, we may want to avoid building on too many sibling blocks at once. the notion of a "preferred
+> head" among many competing sibling blocks would imply changes in our "active leaves" update rules here
+
+## On Finalization Event
+
+* Note the height `h` of the newly finalized block `B`.
+* Prune all leaves from the active leaves which have height `<= h` and are not `B`.
+* Issue `OverseerSignal::ActiveLeavesUpdate` containing all deactivated leaves.
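+
+The update rules for block import and finalization events can be sketched as follows. This is a minimal illustration
+under stated assumptions: `Hash` and `BlockNumber` are stand-in integer types, and the `ActiveLeaves` struct, its
+methods, and the shape of `ActiveLeavesUpdate` are hypothetical simplifications rather than the overseer's actual
+types.
+
+```rust
+use std::collections::{HashMap, HashSet};
+
+pub type Hash = u64;
+pub type BlockNumber = u32;
+
+/// Simplified update: activated leaves carry a "fresh" flag.
+#[derive(Debug, Default, PartialEq)]
+pub struct ActiveLeavesUpdate {
+    pub activated: Vec<(Hash, bool)>,
+    pub deactivated: Vec<Hash>,
+}
+
+#[derive(Default)]
+pub struct ActiveLeaves {
+    leaves: HashMap<Hash, BlockNumber>,
+    /// Every leaf ever activated, used to tell fresh leaves from stale ones.
+    seen: HashSet<Hash>,
+}
+
+impl ActiveLeaves {
+    /// A new block becomes a leaf; its parent stops being one.
+    pub fn on_block_import(&mut self, hash: Hash, parent: Hash, number: BlockNumber) -> ActiveLeavesUpdate {
+        let mut update = ActiveLeavesUpdate::default();
+        if self.leaves.remove(&parent).is_some() {
+            update.deactivated.push(parent);
+        }
+        self.leaves.insert(hash, number);
+        // `HashSet::insert` returns true only the first time a value is added.
+        let fresh = self.seen.insert(hash);
+        update.activated.push((hash, fresh));
+        update
+    }
+
+    /// Prune all leaves with height `<= h` that are not the finalized block.
+    pub fn on_finalized(&mut self, finalized: Hash, h: BlockNumber) -> ActiveLeavesUpdate {
+        let mut deactivated: Vec<Hash> = self
+            .leaves
+            .iter()
+            .filter(|(hash, number)| **number <= h && **hash != finalized)
+            .map(|(hash, _)| *hash)
+            .collect();
+        deactivated.sort_unstable();
+        for hash in &deactivated {
+            self.leaves.remove(hash);
+        }
+        ActiveLeavesUpdate { activated: Vec::new(), deactivated }
+    }
+}
+```
+
+A leaf that is pruned and later imported again is reported with the fresh flag set to `false`, which is the signal
+subsystems concerned with reversion safety would use to skip work.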
+
+## On Subsystem Failure
+
+Subsystems are essential tasks meant to run as long as the node does. Subsystems can spawn ephemeral work in the form of
+jobs, but the subsystems themselves should not go down. If a subsystem goes down, it will be because of a critical error
+that should take the entire node down as well.
+
+## Communication Between Subsystems
+
+When a subsystem wants to communicate with another subsystem, or, more typically, a job within a subsystem wants to
+communicate with its counterpart under another subsystem, that communication must happen via the overseer. Consider this
+example where a job on subsystem A wants to send a message to its counterpart under subsystem B. This is a realistic
+scenario, where you can imagine that both jobs correspond to work under the same relay-parent.
+
+```text
+     +--------+                                                           +--------+
+     |        |                                                           |        |
+     |Job A-1 | (sends message)                       (receives message)  |Job B-1 |
+     |        |                                                           |        |
+     +----|---+                                                           +----^---+
+          |                  +------------------------------+                  ^
+          v                  |                              |                  |
++---------v---------+        |                              |        +---------|---------+
+|                   |        |                              |        |                   |
+| Subsystem A       |        |       Overseer / Message     |        | Subsystem B       |
+|                   -------->>                  Bus         -------->>                   |
+|                   |        |                              |        |                   |
++-------------------+        |                              |        +-------------------+
+                             |                              |
+                             +------------------------------+
+```
+
+First, the subsystem that spawned a job is responsible for handling the first step of the communication. The overseer is
+not aware of the hierarchy of tasks within any given subsystem and is only responsible for subsystem-to-subsystem
+communication. So the sending subsystem must pass on the message via the overseer to the receiving subsystem, in such a
+way that the receiving subsystem can further address the communication to one of its internal tasks, if necessary.
+
+This communication prevents a certain class of race conditions. When the Overseer determines that it is time for
+subsystems to begin working on top of a particular relay-parent, it will dispatch an `ActiveLeavesUpdate` message to all
+subsystems to do so, and those messages will be handled asynchronously by those subsystems. Some subsystems will receive
+those messages before others, and it is important that a message sent by subsystem A after receiving its
+`ActiveLeavesUpdate` message arrives at subsystem B after subsystem B's `ActiveLeavesUpdate` message. If subsystem A
+maintained an independent channel with subsystem B to communicate, it would be possible for subsystem B to handle the
+side message before the `ActiveLeavesUpdate` message, but it wouldn't have any logical course of action to take with the
+side message - leading to it being discarded or improperly handled. Well-architected state machines should have a
+single source of inputs, so that is what we do here.
+
+One exception is reasonable to make for responses to requests. A request should be made via the overseer in order to
+ensure that it arrives after any relevant `ActiveLeavesUpdate` message. A subsystem issuing a request as a result of an
+`ActiveLeavesUpdate` message can safely receive the response via a side-channel for two reasons:
+
+1. It's impossible for a request to be answered before it arrives, so any response to a request provably obeys
+   the same ordering constraint.
+1. The request was sent as a result of handling an `ActiveLeavesUpdate` message, so there is no possible future in
+   which the `ActiveLeavesUpdate` message has not been handled upon the receipt of the response.
+
+So as a single exception to the rule that all communication must happen via the overseer we allow the receipt of
+responses to requests via a side-channel, which may be established for that purpose. This simplifies any cases where the
+outside world desires to make a request to a subsystem, as the outside world can then establish a side-channel to
+receive the response on.
+
+It's important to note that the overseer is not aware of the internals of subsystems, and this extends to the jobs that
+they spawn. The overseer isn't aware of the existence or definition of those jobs, and is only aware of the outer
+subsystems with which it interacts. This gives subsystem implementations leeway to define internal jobs as they see fit,
+and to wrap a more complex hierarchy of state machines than having a single layer of jobs for relay-parent-based work.
+Likewise, subsystems aren't required to spawn jobs. Certain types of subsystems, such as those for shared storage or
+networking resources, won't perform block-based work but would still benefit from being on the Overseer's message bus.
+These subsystems can just ignore the overseer's signals for block-based work.
+
+Furthermore, the protocols by which subsystems communicate with each other should be well-defined irrespective of the
+implementation of the subsystem. In other words, their interface should be distinct from their implementation. This will
+prevent subsystems from accessing aspects of each other that are beyond the scope of the communication boundary.
+
+## On Shutdown
+
+Send an `OverseerSignal::Conclude` message to each subsystem and wait some time for them to conclude before
+hard-exiting.
@@ -0,0 +1,469 @@
+# Subsystems and Jobs
+
+In this section we define the notions of Subsystems and Jobs. These are
+guidelines for how we will employ an architecture of hierarchical state
+machines. We'll have a top-level state machine which oversees the next level of
+state machines which oversee another layer of state machines and so on. The next
+sections will lay out these guidelines for what we've called subsystems and
+jobs, since this model applies to many of the tasks that the Node-side behavior
+needs to encompass, but these are only guidelines and some Subsystems may have
+deeper hierarchies internally.
+
+Subsystems are long-lived worker tasks that are in charge of performing some
+particular kind of work. All subsystems can communicate with each other via a
+well-defined protocol. Subsystems can't generally communicate directly, but must
+coordinate communication through an [Overseer](overseer.md), which is
+responsible for relaying messages, handling subsystem failures, and dispatching
+work signals.
+
+Most work that happens on the Node-side is related to building on top of a
+specific relay-chain block, which is contextually known as the "relay parent".
+We call it the relay parent to explicitly denote that it is a block in the relay
+chain and not on a teyrchain. We refer to the parent because when we are in the
+process of building a new block, we don't know what that new block is going to
+be. The parent block is our only stable point of reference, even though it is
+usually only useful when it is not yet a parent but in fact a leaf of the
+block-DAG expected to soon become a parent (because validators are authoring on
+top of it). Furthermore, we are assuming a forkful blockchain-extension
+protocol, which means that there may be multiple possible children of the
+relay-parent. Even if the relay parent has multiple children blocks, the parent
+of those children is the same, and the context in which those children are
+authored should be the same. The parent block is the best and most stable
+reference to use for defining the scope of work items and messages, and is
+typically referred to by its cryptographic hash.
+
+Since this goal of determining when to start and conclude work relative to a
+specific relay-parent is common to most, if not all subsystems, it is logically
+the job of the Overseer to distribute those signals as opposed to each subsystem
+duplicating that effort, potentially being out of synchronization with each
+other. Subsystem A should be able to expect that subsystem B is working on the
+same relay-parents as it is. One of the Overseer's tasks is to provide this
+heartbeat, or synchronized rhythm, to the system.
+
+The work that subsystems spawn to be done on a specific relay-parent is known as
+a job. Subsystems should set up and tear down jobs according to the signals
+received from the overseer. Subsystems may share or cache state between jobs.
+
+Subsystems must be robust to spurious exits. The outputs of the set of
+subsystems as a whole comprise signed messages and data committed to disk.
+Care must be taken to avoid issuing messages that are not substantiated. Since
+subsystems need to be safe under spurious exits, it is the expected behavior
+that an `OverseerSignal::Conclude` can just lead to breaking the loop and
+exiting directly as opposed to waiting for everything to shut down gracefully.
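+
+A subsystem's main loop following these rules might look like the sketch below.
+It is illustrative only: it uses a blocking `std::sync::mpsc` channel instead of
+the asynchronous machinery a real node would use, and `FromOverseer`,
+`run_subsystem`, the `Communication(String)` payload and the simplified
+`OverseerSignal` shape are hypothetical stand-ins for the real types.
+
+```rust
+use std::collections::HashSet;
+use std::sync::mpsc::Receiver;
+
+/// Hypothetical stand-in for a relay-parent hash.
+pub type Hash = u64;
+
+pub enum OverseerSignal {
+    ActiveLeavesUpdate { activated: Vec<Hash>, deactivated: Vec<Hash> },
+    Conclude,
+}
+
+/// Single source of inputs: signals and messages arrive on one channel.
+pub enum FromOverseer {
+    Signal(OverseerSignal),
+    Communication(String),
+}
+
+/// Runs until `Conclude` (or a closed channel) and returns the relay parents
+/// that still had jobs at that point.
+pub fn run_subsystem(rx: Receiver<FromOverseer>) -> HashSet<Hash> {
+    let mut jobs: HashSet<Hash> = HashSet::new();
+    while let Ok(msg) = rx.recv() {
+        match msg {
+            FromOverseer::Signal(OverseerSignal::ActiveLeavesUpdate { activated, deactivated }) => {
+                // Tear down jobs for deactivated leaves, set up jobs for new ones.
+                for leaf in deactivated {
+                    jobs.remove(&leaf);
+                }
+                jobs.extend(activated);
+            }
+            // Safe under spurious exits: break out directly, no graceful teardown.
+            FromOverseer::Signal(OverseerSignal::Conclude) => break,
+            // A real subsystem would forward this to the relevant job.
+            FromOverseer::Communication(_message) => {}
+        }
+    }
+    jobs
+}
+```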
+
+## Subsystem Message Traffic
+
+The following diagram shows which subsystems send messages to which other subsystems.
+
+**Note**: This diagram omits the overseer for simplicity. In fact, all messages
+are relayed via the overseer.
+
+**Note**: Messages with a filled diamond arrowhead ("♦") include a
+`oneshot::Sender` which communicates a response from the recipient. Messages
+with an open triangle arrowhead ("Δ") do not include a return sender.
+
+```dot process
+digraph {
+    rankdir=LR;
+    node [shape = oval];
+    concentrate = true;
+
+    av_store    [label = "Availability Store"]
+    avail_dist  [label = "Availability Distribution"]
+    avail_rcov  [label = "Availability Recovery"]
+    bitf_dist   [label = "Bitfield Distribution"]
+    bitf_sign   [label = "Bitfield Signing"]
+    cand_back   [label = "Candidate Backing"]
+    cand_sel    [label = "Candidate Selection"]
+    cand_val    [label = "Candidate Validation"]
+    chn_api     [label = "Chain API"]
+    coll_gen    [label = "Collation Generation"]
+    coll_prot   [label = "Collator Protocol"]
+    net_brdg    [label = "Network Bridge"]
+    pov_dist    [label = "PoV Distribution"]
+    provisioner [label = "Provisioner"]
+    runt_api    [label = "Runtime API"]
+    stmt_dist   [label = "Statement Distribution"]
+
+    av_store    -> runt_api     [arrowhead = "diamond", label = "Request::CandidateEvents"]
+    av_store    -> chn_api      [arrowhead = "diamond", label = "BlockNumber"]
+    av_store    -> chn_api      [arrowhead = "diamond", label = "BlockHeader"]
+    av_store    -> runt_api     [arrowhead = "diamond", label = "Request::Validators"]
+    av_store    -> chn_api      [arrowhead = "diamond", label = "FinalizedBlockHash"]
+
+    avail_dist  -> net_brdg     [arrowhead = "onormal", label = "Request::SendValidationMessages"]
+    avail_dist  -> runt_api     [arrowhead = "diamond", label = "Request::AvailabilityCores"]
+    avail_dist  -> net_brdg     [arrowhead = "onormal", label = "ReportPeer"]
+    avail_dist  -> av_store     [arrowhead = "diamond", label = "QueryDataAvailability"]
+    avail_dist  -> av_store     [arrowhead = "diamond", label = "QueryChunk"]
+    avail_dist  -> av_store     [arrowhead = "diamond", label = "StoreChunk"]
+    avail_dist  -> runt_api     [arrowhead = "diamond", label = "Request::Validators"]
+    avail_dist  -> chn_api      [arrowhead = "diamond", label = "Ancestors"]
+    avail_dist  -> runt_api     [arrowhead = "diamond", label = "Request::SessionIndexForChild"]
+
+    avail_rcov  -> net_brdg     [arrowhead = "onormal", label = "ReportPeer"]
+    avail_rcov  -> av_store     [arrowhead = "diamond", label = "QueryChunk"]
+    avail_rcov  -> net_brdg     [arrowhead = "diamond", label = "ConnectToValidators"]
+    avail_rcov  -> net_brdg     [arrowhead = "onormal", label = "SendValidationMessage::Chunk"]
+    avail_rcov  -> net_brdg     [arrowhead = "onormal", label = "SendValidationMessage::RequestChunk"]
+
+    bitf_dist   -> net_brdg     [arrowhead = "onormal", label = "ReportPeer"]
+    bitf_dist   -> provisioner  [arrowhead = "onormal", label = "ProvisionableData::Bitfield"]
+    bitf_dist   -> net_brdg     [arrowhead = "onormal", label = "SendValidationMessage"]
+    bitf_dist   -> net_brdg     [arrowhead = "onormal", label = "SendValidationMessage"]
+    bitf_dist   -> runt_api     [arrowhead = "diamond", label = "Request::Validators"]
+    bitf_dist   -> runt_api     [arrowhead = "diamond", label = "Request::SessionIndexForChild"]
+
+    bitf_sign   -> av_store     [arrowhead = "diamond", label = "QueryChunkAvailability"]
+    bitf_sign   -> runt_api     [arrowhead = "diamond", label = "Request::AvailabilityCores"]
+    bitf_sign   -> bitf_dist    [arrowhead = "onormal", label = "DistributeBitfield"]
+
+    cand_back   -> av_store     [arrowhead = "diamond", label = "StoreAvailableData"]
+    cand_back   -> pov_dist     [arrowhead = "diamond", label = "FetchPoV"]
+    cand_back   -> cand_val     [arrowhead = "diamond", label = "ValidateFromChainState"]
+    cand_back   -> cand_sel     [arrowhead = "onormal", label = "Invalid"]
+    cand_back   -> provisioner  [arrowhead = "onormal", label = "ProvisionableData::MisbehaviorReport"]
+    cand_back   -> provisioner  [arrowhead = "onormal", label = "ProvisionableData::BackedCandidate"]
+    cand_back   -> pov_dist     [arrowhead = "onormal", label = "DistributePoV"]
+    cand_back   -> stmt_dist    [arrowhead = "onormal", label = "Share"]
+
+    cand_sel    -> coll_prot    [arrowhead = "diamond", label = "FetchCollation"]
+    cand_sel    -> cand_back    [arrowhead = "onormal", label = "Second"]
+
+    cand_val    -> runt_api     [arrowhead = "diamond", label = "Request::PersistedValidationData"]
+    cand_val    -> runt_api     [arrowhead = "diamond", label = "Request::ValidationCode"]
+    cand_val    -> runt_api     [arrowhead = "diamond", label = "Request::CheckValidationOutputs"]
+
+    coll_gen    -> coll_prot    [arrowhead = "onormal", label = "DistributeCollation"]
+
+    coll_prot   -> net_brdg     [arrowhead = "onormal", label = "ReportPeer"]
+    coll_prot   -> net_brdg     [arrowhead = "onormal", label = "Declare"]
+    coll_prot   -> net_brdg     [arrowhead = "onormal", label = "AdvertiseCollation"]
+    coll_prot   -> net_brdg     [arrowhead = "onormal", label = "Collation"]
+    coll_prot   -> net_brdg     [arrowhead = "onormal", label = "RequestCollation"]
+    coll_prot   -> cand_sel     [arrowhead = "onormal", label = "Collation"]
+
+    net_brdg    -> avail_dist   [arrowhead = "onormal", label = "NetworkBridgeUpdate"]
+    net_brdg    -> bitf_dist    [arrowhead = "onormal", label = "NetworkBridgeUpdate"]
+    net_brdg    -> pov_dist     [arrowhead = "onormal", label = "NetworkBridgeUpdate"]
+    net_brdg    -> stmt_dist    [arrowhead = "onormal", label = "NetworkBridgeUpdate"]
+    net_brdg    -> coll_prot    [arrowhead = "onormal", label = "NetworkBridgeUpdate"]
+
+    pov_dist    -> net_brdg     [arrowhead = "onormal", label = "SendValidationMessage"]
+    pov_dist    -> net_brdg     [arrowhead = "onormal", label = "ReportPeer"]
+
+    provisioner -> cand_back    [arrowhead = "diamond", label = "GetBackedCandidates"]
+    provisioner -> chn_api      [arrowhead = "diamond", label = "BlockNumber"]
+
+    stmt_dist   -> net_brdg     [arrowhead = "onormal", label = "SendValidationMessage"]
+    stmt_dist   -> net_brdg     [arrowhead = "onormal", label = "ReportPeer"]
+    stmt_dist   -> cand_back    [arrowhead = "onormal", label = "Statement"]
+    stmt_dist   -> runt_api     [arrowhead = "onormal", label = "Request::Validators"]
+    stmt_dist   -> runt_api     [arrowhead = "onormal", label = "Request::SessionIndexForChild"]
+}
+```
+
+## The Path to Inclusion (Node Side)
+
+Let's contextualize that diagram a bit by following a teyrchain block from its
+creation through finalization. Teyrchains can use completely arbitrary processes
+to generate blocks. The relay chain doesn't know or care about the details; each
+teyrchain just needs to provide a [collator](collators/collation-generation.md).
+
+**Note**: Inter-subsystem communications are relayed via the overseer, but that
+step is omitted here for brevity.
+
+**Note**: Dashed lines indicate a request/response cycle, where the response is
+communicated asynchronously via a oneshot channel. Adjacent dashed lines may be
+processed in parallel.
+
+```mermaid
+sequenceDiagram
+    participant Overseer
+    participant CollationGeneration
+    participant RuntimeApi
+    participant CollatorProtocol
+
+    Overseer ->> CollationGeneration: ActiveLeavesUpdate
+    loop for each activated head
+        CollationGeneration -->> RuntimeApi: Request availability cores
+        CollationGeneration -->> RuntimeApi: Request validators
+
+        Note over CollationGeneration: Determine an appropriate ScheduledCore <br/>and OccupiedCoreAssumption
+
+        CollationGeneration -->> RuntimeApi: Request full validation data
+
+        Note over CollationGeneration: Build the collation
+
+        CollationGeneration ->> CollatorProtocol: DistributeCollation
+    end
+```
+
+The `DistributeCollation` messages that `CollationGeneration` sends to the
+`CollatorProtocol` contain two items: a `CandidateReceipt` and `PoV`. The
+`CollatorProtocol` is then responsible for distributing that collation to
+interested validators. However, not all potential collations are of interest.
+The `CandidateSelection` subsystem is responsible for determining which
+collations are interesting, before `CollatorProtocol` actually fetches the
+collation.
+
+```mermaid
+sequenceDiagram
+    participant CollationGeneration
+    participant CS as CollatorProtocol::CollatorSide
+    participant NB as NetworkBridge
+    participant VS as CollatorProtocol::ValidatorSide
+    participant CandidateSelection
+
+    CollationGeneration ->> CS: DistributeCollation
+    CS -->> NB: ConnectToValidators
+
+    Note over CS,NB: This connects to multiple validators.
+
+    CS ->> NB: Declare
+    NB ->> VS: Declare
+
+    Note over CS: Ensure that the connected validator is among<br/>the para's validator set. Otherwise, skip it.
+
+    CS ->> NB: AdvertiseCollation
+    NB ->> VS: AdvertiseCollation
+
+    VS ->> CandidateSelection: Collation
+
+    Note over CandidateSelection: Lots of other machinery in play here,<br/>but there are only two outcomes from the<br/>perspective of the `CollatorProtocol`:
+
+    alt happy path
+        CandidateSelection -->> VS: FetchCollation
+        Activate VS
+        VS ->> NB: RequestCollation
+        NB ->> CS: RequestCollation
+        CS ->> NB: Collation
+        NB ->> VS: Collation
+        Deactivate VS
+
+    else CandidateSelection already selected a different candidate
+        Note over CandidateSelection: silently drop
+    end
+```
+
+Assuming we hit the happy path, flow continues with `CandidateSelection`
+receiving a `(candidate_receipt, pov)` as the return value from its
+`FetchCollation` request. The only time `CandidateSelection` actively requests a
+collation is when it hasn't yet seconded one for some `relay_parent`, and is
+ready to second.
+
+```mermaid
+sequenceDiagram
+    participant CS as CandidateSelection
+    participant CB as CandidateBacking
+    participant CV as CandidateValidation
+    participant PV as Provisioner
+    participant SD as StatementDistribution
+    participant PD as PoVDistribution
+
+    CS ->> CB: Second
+    % fn validate_and_make_available
+    CB -->> CV: ValidateFromChainState
+
+    Note over CB,CV: There's some complication in the source, as<br/>candidates are actually validated in a separate task.
+
+    alt valid
+        Note over CB: This is where we transform the CandidateReceipt into a CommittedCandidateReceipt
+        % CandidateBackingJob::sign_import_and_distribute_statement
+        % CandidateBackingJob::import_statement
+        CB ->> PV: ProvisionableData::BackedCandidate
+        % CandidateBackingJob::issue_new_misbehaviors
+        opt if there is misbehavior to report
+            CB ->> PV: ProvisionableData::MisbehaviorReport
+        end
+        % CandidateBackingJob::distribute_signed_statement
+        CB ->> SD: Share
+        % CandidateBackingJob::distribute_pov
+        CB ->> PD: DistributePoV
+    else invalid
+        CB ->> CS: Invalid
+    end
+```
+
+At this point, you'll see that control flows in two directions: to
+`StatementDistribution` to distribute the `SignedStatement`, and to
+`PoVDistribution` to distribute the `PoV`. However, that's largely a mirage:
+while the initial implementation distributes `PoV`s by gossip, that's
+inefficient, and will be replaced with a system which fetches `PoV`s only when
+actually necessary.
+
+> TODO: figure out more precisely the current status and plans; write them up
+
+Therefore, we'll follow the `SignedStatement`. The `StatementDistribution`
+subsystem is largely concerned with implementing a gossip protocol:
+
+```mermaid
+sequenceDiagram
+    participant SD as StatementDistribution
+    participant NB as NetworkBridge
+
+    alt On receipt of a<br/>SignedStatement from CandidateBacking
+        % fn circulate_statement_and_dependents
+        SD ->> NB: SendValidationMessage
+
+        Note right of NB: Bridge sends validation message to all appropriate peers
+    else On receipt of peer validation message
+        NB ->> SD: NetworkBridgeUpdate
+
+        % fn handle_incoming_message
+        alt if we aren't already aware of the relay parent for this statement
+            SD ->> NB: ReportPeer
+        end
+
+        % fn circulate_statement
+        opt if we know of peers who haven't seen this message, gossip it
+            SD ->> NB: SendValidationMessage
+        end
+    end
+```
+
+But who are these `Listener`s who've asked to be notified about incoming
+`SignedStatement`s? Nobody, as yet.
+
+Let's pick back up with the PoV Distribution subsystem.
+
+```mermaid
+sequenceDiagram
+    participant CB as CandidateBacking
+    participant PD as PoVDistribution
+    participant Listener
+    participant NB as NetworkBridge
+
+    CB ->> PD: DistributePoV
+
+    Note over PD,Listener: Various subsystems can register listeners for when PoVs arrive
+
+    loop for each Listener
+        PD ->> Listener: Arc<PoV>
+    end
+
+    Note over PD: Gossip to connected peers
+
+    PD ->> NB: SendPoV
+
+    Note over PD,NB: On receipt of a network PoV, PoVDistribution forwards it to each Listener.<br/>It also penalizes bad gossipers.
+```
+
+Unlike in the case of `StatementDistribution`, there is another subsystem which
+in various circumstances already registers a listener to be notified when a new
+`PoV` arrives: `CandidateBacking`. Note that this is the second time that
+`CandidateBacking` has gotten involved. The first instance was from the
+perspective of the validator choosing to second a candidate via its
+`CandidateSelection` subsystem. This time, it's from the perspective of some
+other validator, being informed that this foreign `PoV` has been received.
+
+```mermaid
+sequenceDiagram
+    participant SD as StatementDistribution
+    participant CB as CandidateBacking
+    participant PD as PoVDistribution
+    participant AS as AvailabilityStore
+
+    SD ->> CB: Statement
+    % CB::maybe_validate_and_import => CB::kick_off_validation_work
+    CB -->> PD: FetchPoV
+    Note over CB,PD: This call creates the Listener from the previous diagram
+
+    CB ->> AS: StoreAvailableData
+```
+
+At this point, things have gone a bit nonlinear. Let's pick up the thread again
+with `BitfieldSigning`. As the `Overseer` activates each relay parent, it starts
+a `BitfieldSigningJob` which operates on an extremely simple metric: after
+creation, it immediately goes to sleep for 1.5 seconds. On waking, it records
+the state of the world pertaining to availability at that moment.
+
+```mermaid
+sequenceDiagram
+    participant OS as Overseer
+    participant BS as BitfieldSigning
+    participant RA as RuntimeApi
+    participant AS as AvailabilityStore
+    participant BD as BitfieldDistribution
+
+    OS ->> BS: ActiveLeavesUpdate
+    loop for each activated relay parent
+        Note over BS: Wait 1.5 seconds
+        BS -->> RA: Request::AvailabilityCores
+        loop for each availability core
+            BS -->> AS: QueryChunkAvailability
+        end
+        BS ->> BD: DistributeBitfield
+    end
+```
+
+`BitfieldDistribution` is, like the other `*Distribution` subsystems, primarily
+interested in implementing a peer-to-peer gossip network propagating its
+particular messages. However, it also serves as an essential relay, passing
+each bitfield along to the `Provisioner`.
+
+```mermaid
+sequenceDiagram
+    participant BS as BitfieldSigning
+    participant BD as BitfieldDistribution
+    participant NB as NetworkBridge
+    participant PV as Provisioner
+
+    BS ->> BD: DistributeBitfield
+    BD ->> PV: ProvisionableData::Bitfield
+    BD ->> NB: SendValidationMessage::BitfieldDistribution::Bitfield
+```
+
+We've now seen the message flow to the `Provisioner`: both `CandidateBacking`
+and `BitfieldDistribution` contribute provisionable data. Now, let's look at
+that subsystem.
+
+Much like the `BitfieldSigning` subsystem, the `Provisioner` creates a new job
+for each newly-activated leaf, and starts a timer. Unlike `BitfieldSigning`, we
+won't depict that part of the process, because the `Provisioner` also has other
+things going on.
+
+```mermaid
+sequenceDiagram
+    participant A as Arbitrary
+    participant PV as Provisioner
+    participant CB as CandidateBacking
+    participant BD as BitfieldDistribution
+    participant RA as RuntimeApi
+    participant PI as TeyrchainsInherentDataProvider
+
+    alt receive provisionable data
+        alt
+            CB ->> PV: ProvisionableData
+        else
+            BD ->> PV: ProvisionableData
+        end
+
+        loop over stored Senders
+            PV ->> A: ProvisionableData
+        end
+
+        Note over PV: store bitfields and backed candidates
+    else receive request for inherent data
+        PI ->> PV: RequestInherentData
+        alt we have already constructed the inherent data
+            PV ->> PI: send the inherent data
+        else we have not yet constructed the inherent data
+            Note over PV,PI: Store the return sender without sending immediately
+        end
+    else timer times out
+        Note over PV: Waited 2 seconds
+        PV -->> RA: RuntimeApiRequest::AvailabilityCores
+        Note over PV: construct and store the inherent data
+        loop over stored inherent data requests
+            PV ->> PI: (SignedAvailabilityBitfields, BackedCandidates)
+        end
+    end
+```
+
+In principle, any arbitrary subsystem could send a `RequestInherentData` to the
+`Provisioner`. In practice, only the `TeyrchainsInherentDataProvider` does so.
+
+The tuple `(SignedAvailabilityBitfields, BackedCandidates, ParentHeader)` is
+injected by the `TeyrchainsInherentDataProvider` into the inherent data. From
+that point on, control passes from the node to the runtime.
@@ -0,0 +1,3 @@
+# Utility Subsystems
+
+The utility subsystems are an assortment of subsystems that don't have a natural home in another subsystem collection.
@@ -0,0 +1,240 @@
+# Availability Store
+
+This is a utility subsystem responsible for keeping certain data available and for pruning that data.
+
+The two data types:
+
+- Full PoV blocks of candidates we have validated
+- Availability chunks of candidates that were backed and noted available on-chain.
+
+For each of these data types, pruning rules determine how long we need to keep the data available.
+
+PoVs hypothetically only need to be kept around until the block where the data was made fully available is finalized.
+However, disputes can revert finality, so we need to be a bit more conservative and we add a delay. We should keep the
+PoV until a block that finalized availability of it has been finalized for 1 day + 1 hour.
+
+Availability chunks need to be kept available until the dispute period for the corresponding candidate has ended. We can
+accomplish this by using the same criterion as the above. This gives us a pruning condition of the block finalizing
+availability of the chunk being final for 1 day + 1 hour.
+
+There is also the case where a validator commits to make a PoV available, but the corresponding candidate is never
+backed. In this case, we keep the PoV available for 1 hour.
+
+There may be multiple competing blocks all ending the availability phase for a particular candidate. Until finality, it
+will be unclear which of those is actually the canonical chain, so the pruning records for PoVs and Availability chunks
+should keep track of all such blocks.
+
+## Lifetime of the block data and chunks in storage
+
+```dot process
+digraph {
+ label = "Block data FSM\n\n\n";
+ labelloc = "t";
+ rankdir="LR";
+
+ st [label = "Stored"; shape = circle]
+ inc [label = "Included"; shape = circle]
+ fin [label = "Finalized"; shape = circle]
+ prn [label = "Pruned"; shape = circle]
+
+ st -> inc [label = "Block\nincluded"]
+ st -> prn [label = "Stored block\ntimed out"]
+ inc -> fin [label = "Block\nfinalized"]
+ inc -> st [label = "Competing blocks\nfinalized"]
+ fin -> prn [label = "Block keep time\n(1 day + 1 hour) elapsed"]
+}
+```
+
+## Database Schema
+
+We use an underlying Key-Value database where we assume we have the following operations available:
+
+- `write(key, value)`
+- `read(key) -> Option<value>`
+- `iter_with_prefix(prefix) -> Iterator<(key, value)>` - gives all keys and values in lexicographical order where the
+  key starts with `prefix`.
+
+We use this database to encode the following schema:
+
+```rust
+("available", CandidateHash) -> Option<AvailableData>
+("chunk", CandidateHash, u32) -> Option<ErasureChunk>
+("meta", CandidateHash) -> Option<CandidateMeta>
+
+("unfinalized", BlockNumber, BlockHash, CandidateHash) -> Option<()>
+("prune_by_time", Timestamp, CandidateHash) -> Option<()>
+```
+
+Timestamps are the wall-clock seconds since Unix epoch. Timestamps and block numbers are both encoded as big-endian so
+lexicographic order is ascending.
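+
+A minimal sketch of why the big-endian encoding matters, assuming keys are plain byte strings. The
+`prune_by_time_key` helper, the exact byte layout, and the use of a `BTreeMap` as a stand-in for the database are
+illustrative assumptions, not the actual on-disk format:
+
+```rust
+use std::collections::BTreeMap;
+
+/// Hypothetical stand-in for a 32-byte candidate hash.
+type CandidateHash = [u8; 32];
+
+/// Build a `("prune_by_time", Timestamp, CandidateHash)` key (assumed layout).
+/// The timestamp is encoded big-endian so that byte-wise (lexicographic)
+/// comparison of keys agrees with numeric comparison of timestamps.
+fn prune_by_time_key(timestamp: u64, candidate: &CandidateHash) -> Vec<u8> {
+    let mut key = b"prune_by_time".to_vec();
+    key.extend_from_slice(&timestamp.to_be_bytes());
+    key.extend_from_slice(candidate);
+    key
+}
+
+/// Decode the timestamps of all stored keys, in key iteration order.
+/// A `BTreeMap` iterates keys lexicographically, like `iter_with_prefix`.
+fn timestamps_in_key_order(db: &BTreeMap<Vec<u8>, ()>) -> Vec<u64> {
+    let prefix_len = b"prune_by_time".len();
+    db.keys()
+        .map(|key| {
+            let mut raw = [0u8; 8];
+            raw.copy_from_slice(&key[prefix_len..prefix_len + 8]);
+            u64::from_be_bytes(raw)
+        })
+        .collect()
+}
+```
+
+With a little-endian encoding, the timestamp `256` would sort before `1`, and a scan that stops at the first key
+beyond `now` would skip entries that are already due.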
+
+The meta information that we track per-candidate is defined as the `CandidateMeta` struct:
+
+```rust
+struct CandidateMeta {
+  state: State,
+  data_available: bool,
+  chunks_stored: Bitfield,
+}
+
+enum State {
+  /// Candidate data was first observed at the given time but is not available in any block.
+  Unavailable(Timestamp),
+  /// The candidate was first observed at the given time and was included in the given list of unfinalized blocks, which may be
+  /// empty. The timestamp here is not used for pruning. Either one of these blocks will be finalized or the state will regress to
+  /// `State::Unavailable`, in which case the same timestamp will be reused.
+  Unfinalized(Timestamp, Vec<(BlockNumber, BlockHash)>),
+  /// Candidate data has appeared in a finalized block and did so at the given time.
+  Finalized(Timestamp)
+}
+```
+
+We maintain the invariant that if a candidate has a meta entry, its available data exists on disk if `data_available` is
+true. All chunks mentioned in the meta entry are available.
+
+Additionally, there is exactly one `prune_by_time` entry which holds the candidate hash unless the state is
+`Unfinalized`. There may be zero, one, or many "unfinalized" keys with the given candidate, and this will correspond to
+the `state` of the meta entry.
+
+## Protocol
+
+Input: [`AvailabilityStoreMessage`][ASM]
+
+Output:
+
+- [`RuntimeApiMessage`][RAM]
+
+## Functionality
+
+For each head in the `activated` list:
+
+- Load all ancestors of the head back to the finalized block so we don't miss anything if import notifications are
+  missed. If a `StoreChunk` message is received for a candidate which has no entry, then we will prematurely lose the
+  data.
+- Note any new candidates backed in the head. Update the `CandidateMeta` for each. If the `CandidateMeta` does not
+  exist, create it as `Unavailable` with the current timestamp. Register a `"prune_by_time"` entry based on the current
+  timestamp + 1 hour.
+- Note any new candidate included in the head. Update the `CandidateMeta` for each, performing a transition from
+  `Unavailable` to `Unfinalized` if necessary. That includes removing the `"prune_by_time"` entry. Add the head hash and
+  number to the state, if unfinalized. Add an `"unfinalized"` entry for the block and candidate.
+- The `CandidateEvent` runtime API can be used for this purpose.
+
+On `OverseerSignal::BlockFinalized(finalized)` events:
+
+- for each key in `iter_with_prefix("unfinalized")`
+  - Stop if the key is beyond `("unfinalized", finalized)`
+  - For each block number `f` that we encounter, load the finalized hash for that block.
+    - The state of each `CandidateMeta` we encounter here must be `Unfinalized`, since we loaded the candidate from an
+      `"unfinalized"` key.
+    - For each candidate that we encounter under `f` and the finalized block hash,
+      - Update the `CandidateMeta` to have `State::Finalized`.  Remove all `"unfinalized"` entries from the old
+        `Unfinalized` state.
+      - Register a `"prune_by_time"` entry for the candidate based on the current time + 1 day + 1 hour.
+    - For each candidate that we encounter under `f` which is not under the finalized block hash,
+      - Remove all entries under `f` in the `Unfinalized` state.
+      - If the `CandidateMeta` has state `Unfinalized` with an empty list of blocks, downgrade to `Unavailable` and
+        re-schedule pruning under the timestamp + 1 hour. We do not prune here as the candidate still may be included in
+        a descendant of the finalized chain.
+    - Remove all `"unfinalized"` keys under `f`.
+- Update `last_finalized` = finalized.
+
+  This is roughly `O(n * m)` where `n` is the number of blocks finalized since the last update, and `m` is the number of
+  teyrchains.
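+
+The per-candidate state changes described above can be sketched as a pure function. This is a simplified illustration
+under stated assumptions: the `on_block_finalized` name is hypothetical, database writes and `"prune_by_time"`
+bookkeeping are omitted, and `State` is repeated from the schema section so that the block is self-contained.
+
+```rust
+type BlockNumber = u32;
+type BlockHash = [u8; 32];
+type Timestamp = u64;
+
+#[derive(Debug, Clone, PartialEq)]
+enum State {
+    /// First observed at the given time, not included in any block.
+    Unavailable(Timestamp),
+    /// First observed at the given time, included in the listed unfinalized blocks.
+    Unfinalized(Timestamp, Vec<(BlockNumber, BlockHash)>),
+    /// Appeared in a finalized block at the given time.
+    Finalized(Timestamp),
+}
+
+/// Apply the finalization of block `(number, finalized_hash)` to one candidate.
+/// `now` is the current time, recorded when the candidate becomes finalized.
+fn on_block_finalized(
+    state: State,
+    number: BlockNumber,
+    finalized_hash: BlockHash,
+    now: Timestamp,
+) -> State {
+    match state {
+        State::Unfinalized(first_seen, blocks) => {
+            if blocks.contains(&(number, finalized_hash)) {
+                // The candidate was included in the block that got finalized.
+                State::Finalized(now)
+            } else {
+                // Entries at this height belong to forks that lost; drop them.
+                let remaining: Vec<_> =
+                    blocks.into_iter().filter(|(n, _)| *n != number).collect();
+                if remaining.is_empty() {
+                    // Downgrade, reusing the original timestamp.
+                    State::Unavailable(first_seen)
+                } else {
+                    State::Unfinalized(first_seen, remaining)
+                }
+            }
+        }
+        // Other states are not reachable from an "unfinalized" key.
+        other => other,
+    }
+}
+```
+
+Keeping the transition free of I/O makes it easy to unit test each branch before wiring it to the database.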
+
+On `QueryAvailableData` message:
+
+- Query `("available", candidate_hash)`
+
+  This is `O(n)` in the size of the data, which may be large.
+
+On `QueryDataAvailability` message:
+
+- Query whether `("meta", candidate_hash)` exists and `data_available == true`.
+
+  This is `O(n)` in the size of the metadata, which is small.
+
+On `QueryChunk` message:
+
+- Query `("chunk", candidate_hash, index)`
+
+  This is `O(n)` in the size of the data, which may be large.
+
+On `QueryAllChunks` message:
+
+- Query `("meta", candidate_hash)`. If `None`, send an empty response and return.
+- For all `1` bits in the `chunks_stored`, query `("chunk", candidate_hash, index)`. Ignore but warn on errors, and
+  return a vector of all loaded chunks.
+
+On `QueryChunkAvailability` message:
+
+- Query whether `("meta", candidate_hash)` exists and the bit at `index` is set.
+
+  This is `O(n)` in the size of the metadata, which is small.
+
+On `StoreChunk` message:
+
+- If there is a `CandidateMeta` under the candidate hash, set the bit of the erasure-chunk in the `chunks_stored`
+  bitfield to `1`. If it was not `1` already, write the chunk under `("chunk", candidate_hash, chunk_index)`.
+
+  This is `O(n)` in the size of the chunk.
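+
+As a rough sketch of the `StoreChunk` rule, assuming an in-memory store: the `Store` type, the `store_chunk` method,
+and the choice to ignore out-of-range indices are hypothetical, and `CandidateMeta` is reduced to its `chunks_stored`
+bitfield.
+
+```rust
+use std::collections::HashMap;
+
+/// Hypothetical stand-in for a 32-byte candidate hash.
+type CandidateHash = [u8; 32];
+
+/// Simplified metadata: only the `chunks_stored` bitfield is modelled.
+struct CandidateMeta {
+    chunks_stored: Vec<bool>,
+}
+
+/// In-memory stand-in for the key-value database.
+#[derive(Default)]
+struct Store {
+    meta: HashMap<CandidateHash, CandidateMeta>,
+    chunks: HashMap<(CandidateHash, u32), Vec<u8>>,
+}
+
+impl Store {
+    /// Returns `true` only if the chunk was newly written.
+    fn store_chunk(&mut self, candidate: CandidateHash, index: u32, chunk: Vec<u8>) -> bool {
+        // Without a `CandidateMeta` entry the chunk is not stored at all.
+        let meta = match self.meta.get_mut(&candidate) {
+            Some(meta) => meta,
+            None => return false,
+        };
+        // Assumption: an index outside the bitfield is silently ignored.
+        let bit = match meta.chunks_stored.get_mut(index as usize) {
+            Some(bit) => bit,
+            None => return false,
+        };
+        if *bit {
+            // Already stored; do not write the chunk again.
+            return false;
+        }
+        *bit = true;
+        self.chunks.insert((candidate, index), chunk);
+        true
+    }
+}
+```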
+
+On `StoreAvailableData` message:
+
+- Compute the erasure root of the available data and compare it with `expected_erasure_root`. Return
+  `StoreAvailableDataError::InvalidErasureRoot` on mismatch.
+- If there is no `CandidateMeta` under the candidate hash, create it with `State::Unavailable(now)`. Load the
+  `CandidateMeta` otherwise.
+- Store `data` under `("available", candidate_hash)` and set `data_available` to true.
+- Store each chunk under `("chunk", candidate_hash, index)` and set every bit in `chunks_stored` to `1`.
+
+  This is `O(n)` in the size of the data as the aggregate size of the chunks is proportional to the data.
+
+Every 5 minutes, run a pruning routine:
+
+- for each key in `iter_with_prefix("prune_by_time")`:
+  - If the key is beyond `("prune_by_time", now)`, return.
+  - Remove the key.
+  - Extract `candidate_hash` from the key.
+  - Load and remove the `("meta", candidate_hash)`
+  - For each erasure chunk bit set, remove `("chunk", candidate_hash, bit_index)`.
+  - If `data_available`, remove `("available", candidate_hash)`
+
+  This is `O(n * m)` in the number of candidates and average size of the data stored. This is probably the most expensive
+  operation but does not need to be run very often.
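+
+The pruning routine can be sketched against ordered in-memory maps. This is an illustration only: the `Store` type
+and its field layout are hypothetical, and treating a timestamp equal to `now` as due is an assumption.
+
+```rust
+use std::collections::BTreeMap;
+
+type CandidateHash = [u8; 32];
+type Timestamp = u64;
+
+/// In-memory stand-in for the database. `meta` maps a candidate to
+/// `(data_available, chunks_stored)`.
+#[derive(Default)]
+struct Store {
+    prune_by_time: BTreeMap<(Timestamp, CandidateHash), ()>,
+    meta: BTreeMap<CandidateHash, (bool, Vec<bool>)>,
+    chunks: BTreeMap<(CandidateHash, u32), Vec<u8>>,
+    available: BTreeMap<CandidateHash, Vec<u8>>,
+}
+
+impl Store {
+    /// Prune every candidate scheduled at or before `now`.
+    /// Returns the number of `prune_by_time` entries removed.
+    fn prune(&mut self, now: Timestamp) -> usize {
+        // Keys are ordered by timestamp first, so all due entries come first
+        // and the scan can stop at the first key beyond `now`.
+        let due: Vec<(Timestamp, CandidateHash)> = self
+            .prune_by_time
+            .keys()
+            .take_while(|(ts, _)| *ts <= now)
+            .copied()
+            .collect();
+        for key in &due {
+            self.prune_by_time.remove(key);
+            let candidate = key.1;
+            if let Some((data_available, chunks_stored)) = self.meta.remove(&candidate) {
+                // Remove each chunk whose bit is set in the bitfield.
+                for (index, stored) in chunks_stored.iter().enumerate() {
+                    if *stored {
+                        self.chunks.remove(&(candidate, index as u32));
+                    }
+                }
+                if data_available {
+                    self.available.remove(&candidate);
+                }
+            }
+        }
+        due.len()
+    }
+}
+```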
+
+## Basic scenarios to test
+
+We need to test that data flows correctly through the state FSM described earlier. These tests assume that time is
+mocked.
+
+- Stored data that is never included is pruned after the necessary timeout
+  - A block (and/or a chunk) is added to the store.
+  - We never note that the respective candidate is included.
+  - Until a defined timeout the data in question is available.
+  - After this timeout the data is no longer available.
+
+- Stored data is kept until we are certain it is finalized.
+  - A block (and/or a chunk) is added to the store.
+  - It is available.
+  - Before the inclusion timeout expires notify storage that the candidate was included.
+  - The data is still available.
+  - Wait for an absurd amount of time (longer than 1 day).
+  - Check that the data is still available.
+  - Send finality notification about the block in question.
+  - Wait for some time below finalized data timeout.
+  - The data is still available.
+  - Wait until the data should have been pruned.
+  - The data is no longer available.
+
+- Fork-awareness of the relay chain is taken into account
+  - Block `B1` is added to the store.
+  - Block `B2` is added to the store.
+  - Notify the subsystem that both `B1` and `B2` were included in different leaves of the relay chain.
+  - Notify the subsystem that the leaf with `B1` was finalized.
+  - Leaf with `B2` is never finalized.
+  - Leaf with `B2` is pruned and its data is no longer available.
+  - Wait until the finalized data of `B1` should have been pruned.
+  - `B1` is no longer available.
+
+[RAM]: ../../types/overseer-protocol.md#runtime-api-message
+[ASM]: ../../types/overseer-protocol.md#availability-store-message
@@ -0,0 +1,99 @@
+# Candidate Validation
+
+This subsystem is responsible for handling candidate validation requests. It is a simple request/response server.
+
+A variety of subsystems want to know if a teyrchain block candidate is valid. None of them care about the detailed
+mechanics of how a candidate gets validated, just the results. This subsystem handles those details.
+
+## High-Level Flow
+
+```dot process
+digraph {
+ rankdir="LR";
+
+ pre [label = "Pvf-Checker"; shape = square]
+ bac [label = "Backing"; shape = square]
+ app [label = "Approval\nVoting"; shape = square]
+ dis [label = "Dispute\nCoordinator"; shape = square]
+
+ can [label = "Candidate\nValidation"; shape = square]
+
+ pvf [label = "PVF Host"; shape = square]
+
+ pre -> can [style = dashed]
+ bac -> can
+ app -> can
+ dis -> can
+
+ can -> pvf [label = "Precheck"; style = dashed]
+ can -> pvf [label = "Validate"]
+}
+```
+
+## Protocol
+
+Input: [`CandidateValidationMessage`](../../types/overseer-protocol.md#validation-request-type)
+
+Output: Validation result via the provided response side-channel.
+
+## Functionality
+
+This subsystem groups the requests it handles in two categories: *candidate validation* and *PVF pre-checking*.
+
+The first category can be further subdivided into two request types: one which draws out validation data from the state,
+and another which accepts all validation data exhaustively. Validation returns three possible outcomes on the response
+channel: the candidate is valid, the candidate is invalid, or an internal error occurred.
+
+Teyrchain candidates are validated against their validation function: a piece of Wasm code that describes the
+state-transition of the teyrchain. Validation function execution is not metered. This means that an execution which is
+an infinite loop or simply takes too long must be forcibly exited by some other means. For this reason, we recommend
+dispatching candidate validation to subprocesses which can be killed if they time out.
+
+Upon receiving a validation request, the first thing the candidate validation subsystem should do is make sure it has
+all the necessary parameters to the validation function. These are:
+  * The Validation Function itself.
+  * The [`CandidateDescriptor`](../../types/candidate.md#candidatedescriptor).
+  * The [`ValidationData`](../../types/candidate.md#validationdata).
+  * The [`PoV`](../../types/availability.md#proofofvalidity).
+
+The second category is for PVF pre-checking. This is primarily used by the [PVF pre-checker](pvf-prechecker.md)
+subsystem.
+
+### Determining Parameters
+
+For a [`CandidateValidationMessage`][CVM]`::ValidateFromExhaustive`, these parameters are exhaustively provided.
+
+For a [`CandidateValidationMessage`][CVM]`::ValidateFromChainState`, some more work needs to be done. Due to the
+uncertainty of Availability Cores (implemented in the [`Scheduler`](../../runtime/scheduler.md) module of the runtime),
+a candidate at a particular relay-parent and for a particular para may have two different valid validation-data to be
+executed under depending on what is assumed to happen if the para is occupying a core at the onset of the new block.
+This is encoded as an `OccupiedCoreAssumption` in the runtime API.
+
+The way that we can determine which assumption the candidate is meant to be executed under is simply to do an exhaustive
+check of both possibilities based on the state of the relay-parent. First we fetch the validation data under the
+assumption that the block occupying the core becomes available. If the `validation_data_hash` of the `CandidateDescriptor`
+matches this validation data, we use that. Otherwise, if the `validation_data_hash` matches the validation data fetched
+under the `TimedOut` assumption, we use that. Otherwise, we return a `ValidationResult::Invalid` response and conclude.
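+
+The exhaustive check can be sketched as follows. This is a simplified illustration: `pick_assumption` and the
+`fetch_validation_data_hash` callback are hypothetical names standing in for the runtime API requests, and the
+`Included` variant name for the "becomes available" case is an assumption. A `None` result corresponds to answering
+with `ValidationResult::Invalid`.
+
+```rust
+/// Reduced stand-in for the runtime API's `OccupiedCoreAssumption`,
+/// limited to the two cases discussed above.
+#[derive(Debug, Clone, Copy, PartialEq)]
+enum OccupiedCoreAssumption {
+    /// The block occupying the core becomes available (assumed name).
+    Included,
+    /// The block occupying the core times out.
+    TimedOut,
+}
+
+/// Hypothetical stand-in for a 32-byte hash.
+type Hash = [u8; 32];
+
+/// Try each assumption in order and return the first one whose validation
+/// data hash matches the `validation_data_hash` of the descriptor.
+fn pick_assumption<F>(
+    validation_data_hash: Hash,
+    mut fetch_validation_data_hash: F,
+) -> Option<OccupiedCoreAssumption>
+where
+    F: FnMut(OccupiedCoreAssumption) -> Option<Hash>,
+{
+    for assumption in [OccupiedCoreAssumption::Included, OccupiedCoreAssumption::TimedOut] {
+        if fetch_validation_data_hash(assumption) == Some(validation_data_hash) {
+            return Some(assumption);
+        }
+    }
+    // Neither assumption matches: the candidate is treated as invalid.
+    None
+}
+```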
+
+Then, we can fetch the validation code from the runtime based on which type of candidate this is. This gives us all the
+parameters. The descriptor and PoV come from the request itself, and the other parameters have been derived from the
+state.
+
+> TODO: This would be a great place for caching to avoid making lots of runtime requests. That would need a job, though.
+
+### Execution of the Teyrchain Wasm
+
+Once we have all parameters, we can spin up a background task to perform the validation in a way that doesn't hold up
+the entire event loop. Before invoking the validation function itself, this should first do some basic checks:
+  * The collator signature is valid (only if `CandidateDescriptor` has version 1)
+  * The PoV provided matches the `pov_hash` field of the descriptor
+
+For more details please see [PVF Host and Workers](pvf-host-and-workers.md).
+
+### Checking Validation Outputs
+
+If we can assume the presence of the relay-chain state (that is, while processing a
+[`CandidateValidationMessage`][CVM]`::ValidateFromChainState`), we can run all the checks that the relay chain would
+run at inclusion time, thus confirming that the candidate will be accepted.
+
+[CVM]: ../../types/overseer-protocol.md#validationrequesttype
@@ -0,0 +1,23 @@
+# Chain API
+
+The Chain API subsystem is responsible for providing a single point of access to chain state data via a set of
+pre-determined queries.
+
+## Protocol
+
+Input: [`ChainApiMessage`](../../types/overseer-protocol.md#chain-api-message)
+
+Output: None
+
+## Functionality
+
+On receipt of `ChainApiMessage`, answer the request and provide the response to the side-channel embedded within the
+request.
+
+Currently, the following requests are supported:
+* Block hash to number
+* Block hash to header
+* Block weight
+* Finalized block number to hash
+* Last finalized block number
+* Ancestors
@@ -0,0 +1,61 @@
+# Chain Selection Subsystem
+
+This subsystem maintains the metadata necessary to implement the [chain
+selection](../../protocol-chain-selection.md) portion of the protocol.
+
+The subsystem wraps a database component which maintains a view of the unfinalized chain and records the properties of
+each block: whether the block is **viable**, whether it is **stagnant**, and whether it is **reverted**. It should also
+maintain an updated set of active leaves in accordance with this view, which should be cheap to query. Leaves are
+ordered descending first by weight and then by block number.
+
+This subsystem needs to update its information on the unfinalized chain:
+  * On every leaf-activated signal
+  * On every block-finalized signal
+  * On every `ChainSelectionMessage::Approved`
+  * On every `ChainSelectionMessage::RevertBlocks`
+  * Periodically, to detect stagnation.
+
+Simple implementations of these updates do `O(n_unfinalized_blocks)` disk operations. If the number of unfinalized
+blocks is relatively small, the updates should not take very much time. However, in cases where there are hundreds or
+thousands of unfinalized blocks the naive implementations of these update algorithms would have to be replaced with more
+sophisticated versions.
+
+## `OverseerSignal::ActiveLeavesUpdate`
+
+Determine all new blocks implicitly referenced by any new active leaves and add them to the view. Update the set of
+viable leaves accordingly. The weights of imported blocks can be determined by the
+[`ChainApiMessage::BlockWeight`](../../types/overseer-protocol.md#chain-api-message).
+
+## `OverseerSignal::BlockFinalized`
+
+Delete data for all orphaned chains and update all metadata descending from the new finalized block accordingly, along
+with the set of viable leaves. Note that finalizing a **reverted** or **stagnant** block means that the descendants of
+those blocks may lose that status because the definitions of those properties don't include the finalized chain. Update
+the set of viable leaves accordingly.
+
+## `ChainSelectionMessage::Approved`
+
+Update the approval status of the referenced block. If the block was stagnant and thus non-viable and is now viable,
+then the metadata of all of its descendants needs to be updated as well, as they may no longer be stagnant either.
+Update the set of viable leaves accordingly.
+
+## `ChainSelectionMessage::Leaves`
+
+Gets all leaves of the chain, i.e. block hashes that are suitable to build upon and have no suitable children. Supplies
+the leaves in descending order by score.
+
+## `ChainSelectionMessage::BestLeafContaining`
+
+If the required block is unknown or not viable, then return `None`. Iterate over all leaves in order of descending
+weight, returning the first leaf containing the required block in its chain, and `None` otherwise.
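+
+A minimal sketch of this lookup over an in-memory view. The `BlockEntry` layout, the `u64` hash stand-in, and the
+`best_leaf_containing` signature are assumptions made for illustration. The tie-break by block number follows the
+leaf ordering described at the top of this page.
+
+```rust
+use std::cmp::Reverse;
+use std::collections::HashMap;
+
+/// Hypothetical stand-in for a block hash.
+type Hash = u64;
+
+/// Simplified per-block metadata.
+struct BlockEntry {
+    parent: Option<Hash>,
+    weight: u64,
+    number: u32,
+    viable: bool,
+}
+
+fn best_leaf_containing(
+    blocks: &HashMap<Hash, BlockEntry>,
+    leaves: &[Hash],
+    required: Hash,
+) -> Option<Hash> {
+    // Unknown or non-viable required block: nothing to return.
+    match blocks.get(&required) {
+        Some(entry) if entry.viable => {}
+        _ => return None,
+    }
+    // Order leaves descending by weight, then by block number.
+    let mut ordered = leaves.to_vec();
+    ordered.sort_by_key(|leaf| Reverse(blocks.get(leaf).map(|e| (e.weight, e.number))));
+    // Return the first leaf whose ancestry contains the required block.
+    ordered.into_iter().find(|leaf| {
+        let mut cursor = Some(*leaf);
+        while let Some(hash) = cursor {
+            if hash == required {
+                return true;
+            }
+            cursor = blocks.get(&hash).and_then(|e| e.parent);
+        }
+        false
+    })
+}
+```
+
+Walking parent pointers from every leaf is linear in the number of unfinalized blocks per leaf, which matches the
+cost expectations stated earlier for simple implementations.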
+
+## `ChainSelectionMessage::RevertBlocks`
+
+This message indicates that a dispute has concluded against a teyrchain block candidate. The message passes along a
+vector containing the block number and block hash of each block where the disputed candidate was included. The passed
+blocks will be marked as reverted, and their descendants will be marked as non-viable.
+
+## Periodically
+
+Detect stagnant blocks and apply the stagnant definition to all descendants. Update the set of viable leaves
+accordingly.
@@ -0,0 +1,19 @@
+# Gossip Support
+
+The Gossip Support Subsystem is responsible for keeping track of session changes
+and issuing a connection request to all validators in the next, current and
+a few past sessions if we are a validator in these sessions.
+The request will add all validators to a reserved PeerSet, meaning we will not
+reject a connection request from any validator in that set.
+
+In addition to that, it creates a gossip overlay topology per session which
+limits the number of messages sent and received to the order of the square root
+of the number of validators. Our neighbors in this graph will be forwarded to the
+network bridge with the `NetworkBridgeMessage::NewGossipTopology` message.
+
+See https://github.com/paritytech/polkadot/issues/3239 for more details.
+
+The gossip topology is used by teyrchain distribution subsystems,
+such as Bitfield Distribution, (small) Statement Distribution and
+Approval Distribution to limit the number of peers we send messages to
+and handle view updates.
@@ -0,0 +1,161 @@
+# Network Bridge
+
+One of the main features of the overseer/subsystem duality is to avoid shared ownership of resources and to communicate
+via message-passing. However, implementing each networking subsystem as its own network protocol brings a fair share of
+challenges.
+
+The most notable challenge is coordinating and eliminating race conditions of peer connection and disconnection events.
+If we have many network protocols that peers are supposed to be connected on, it is difficult to enforce that a peer is
+indeed connected on all of them or the order in which those protocols receive notifications that peers have connected.
+This becomes especially difficult when attempting to share peer state across protocols. All of the Teyrchain-Host's
+gossip protocols eliminate DoS with a data-dependency on current chain heads. However, it is inefficient and confusing
+to implement the logic for tracking our current chain heads as well as our peers' on each of those subsystems. Having
+one subsystem for tracking this shared state and distributing it to the others is an improvement in architecture and
+efficiency.
+
+One other piece of shared state to track is peer reputation. When peers are found to have provided value or cost, we
+adjust their reputation accordingly.
+
+So in short, this Subsystem acts as a bridge between an actual network component and a subsystem's protocol. The
+implementation of the underlying network component is beyond the scope of this module. We make certain assumptions about
+the network component:
+  - The network allows registering of protocols and multiple versions of each protocol.
+  - The network handles version negotiation of protocols with peers and only connects the peer on the highest version of
+    the protocol.
+  - Each protocol has its own peer-set, although there may be some overlap.
+  - The network provides peer-set management utilities for discovering the peer-IDs of validators and a means of dialing
+    peers with given IDs.
+
+The network bridge makes use of the peer-set feature, but is not generic over peer-set. Instead, it exposes two
+peer-sets that event producers can attach to: `Validation` and `Collation`. More information can be found on the
+documentation of the [`NetworkBridgeMessage`][NBM].
+
+## Protocol
+
+Input: [`NetworkBridgeMessage`][NBM]
+
+Output:
+
+- [`ApprovalDistributionMessage`][AppD]`::NetworkBridgeUpdate`
+- [`BitfieldDistributionMessage`][BitD]`::NetworkBridgeUpdate`
+- [`CollatorProtocolMessage`][CollP]`::NetworkBridgeUpdate`
+- [`StatementDistributionMessage`][StmtD]`::NetworkBridgeUpdate`
+
+## Functionality
+
+This network bridge sends messages of these types over the network.
+
+```rust
+enum WireMessage<M> {
+	ProtocolMessage(M),
+	ViewUpdate(View),
+}
+```
+
+and instantiates this type twice, once using the [`ValidationProtocolV1`][VP1] message type, and once with the
+[`CollationProtocolV1`][CP1] message type.
+
+```rust
+type ValidationV1Message = WireMessage<ValidationProtocolV1>;
+type CollationV1Message = WireMessage<CollationProtocolV1>;
+```
+
+### Startup
+
+On startup, we register two protocols with the underlying network utility: one for validation and one for collation. We
+register only version 1 of each of these protocols.
+
+### Main Loop
+
+The bulk of the work done by this subsystem is in responding to network events, signals from the overseer, and messages
+from other subsystems.
+
+Each network event is associated with a particular peer-set.
+
+### Overseer Signal: `ActiveLeavesUpdate`
+
+The `activated` and `deactivated` lists determine the evolution of our local view over time. A
+`WireMessage::ViewUpdate` is issued to each connected peer on each peer-set, and a
+`NetworkBridgeEvent::OurViewChange` is issued to each event handler for each protocol.
+
+We only send view updates if the node has indicated that it has finished major blockchain synchronization.
+
+If we are connected to the same peer on both peer-sets, we will send the peer two view updates as a result.
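+
+A minimal sketch of the behaviour described above, under stated assumptions: `Hash`, `PeerId`, `PeerSet`, `View`,
+`apply_active_leaves_update`, and `view_update_targets` are hypothetical names chosen for illustration and are not the
+subsystem's actual types.
+
+```rust
+/// Hypothetical stand-ins for a relay-chain block hash and a network peer ID.
+type Hash = u64;
+type PeerId = u32;
+
+#[derive(Clone, Copy, Debug, PartialEq)]
+enum PeerSet {
+	Validation,
+	Collation,
+}
+
+/// Simplified local view: the active heads plus the finalized block number.
+#[derive(Clone, Debug, Default, PartialEq)]
+struct View {
+	heads: Vec<Hash>,
+	finalized_number: u32,
+}
+
+/// Applies the `activated` and `deactivated` lists to the local view.
+fn apply_active_leaves_update(view: &mut View, activated: &[Hash], deactivated: &[Hash]) {
+	view.heads.retain(|head| !deactivated.contains(head));
+	for head in activated {
+		if !view.heads.contains(head) {
+			view.heads.push(*head);
+		}
+	}
+}
+
+/// Lists every (peer-set, peer) pair that should receive a view update.
+/// A peer connected on both peer-sets appears twice; nothing is sent before
+/// major sync has finished.
+fn view_update_targets(validation: &[PeerId], collation: &[PeerId], synced: bool) -> Vec<(PeerSet, PeerId)> {
+	if !synced {
+		return Vec::new();
+	}
+	let on_validation = validation.iter().map(|peer| (PeerSet::Validation, *peer));
+	let on_collation = collation.iter().map(|peer| (PeerSet::Collation, *peer));
+	on_validation.chain(on_collation).collect()
+}
+```
+
+With peers `7` and `8` on the validation peer-set and peer `7` also on the collation peer-set, a synced node would
+produce three targets, two of them for peer `7`.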
+
+### Overseer Signal: `BlockFinalized`
+
+We update our view's `finalized_number` to the provided one and delay the `WireMessage::ViewUpdate` and
+`NetworkBridgeEvent::OurViewChange` until the next `ActiveLeavesUpdate`.
+
+### Network Event: `PeerConnected`
+
+Issue a `NetworkBridgeEvent::PeerConnected` for each [Event Handler](#event-handlers) of the peer-set and negotiated
+protocol version of the peer. Also issue a `NetworkBridgeEvent::PeerViewChange` and send the peer our current view, but
+only if the node has indicated that it has finished major blockchain synchronization. Otherwise, we only send the peer
+an empty view.
+
+### Network Event: `PeerDisconnected`
+
+Issue a `NetworkBridgeEvent::PeerDisconnected` for each [Event Handler](#event-handlers) of the peer-set and negotiated
+protocol version of the peer.
+
+### Network Event: `ProtocolMessage`
+
+Map the message onto the corresponding [Event Handler](#event-handlers) based on the peer-set this message was received
+on and dispatch via overseer.
+
+### Network Event: `ViewUpdate`
+
+- Check that the new view is valid and note it as the most recent view update of the peer on this peer-set.
+- Map a `NetworkBridgeEvent::PeerViewChange` onto the corresponding [Event Handler](#event-handlers) based on the
+  peer-set this message was received on and dispatch via overseer.
+
+### `ReportPeer`
+
+- Adjust the peer's reputation according to the cost or benefit provided.
+
+### `DisconnectPeer`
+
+- Disconnect the peer from the peer-set requested, if connected.
+
+### `SendValidationMessage` / `SendValidationMessages`
+
+- Issue a corresponding `ProtocolMessage` to each listed peer on the validation peer-set.
+
+### `SendCollationMessage` / `SendCollationMessages`
+
+- Issue a corresponding `ProtocolMessage` to each listed peer on the collation peer-set.
+
+### `ConnectToValidators`
+
+- Determine the DHT keys to use for each validator based on the relay-chain state and Runtime API.
+- Recover the Peer IDs of the validators from the DHT. There may be more than one peer ID per validator.
+- Send all `(ValidatorId, PeerId)` pairs on the response channel.
+- Feed all Peer IDs to the peer-set manager that the underlying network provides.
+
+### `NewGossipTopology`
+
+- Map all `AuthorityDiscoveryId`s to `PeerId`s and issue a corresponding `NetworkBridgeUpdate` to all validation
+  subsystems.
+
+## Event Handlers
+
+Network bridge event handlers are the intended recipients of particular network protocol messages. These are each a
+variant of a message to be sent via the overseer.
+
+### Validation V1
+
+- `ApprovalDistributionV1Message -> ApprovalDistributionMessage::NetworkBridgeUpdate`
+- `BitfieldDistributionV1Message -> BitfieldDistributionMessage::NetworkBridgeUpdate`
+- `StatementDistributionV1Message -> StatementDistributionMessage::NetworkBridgeUpdate`
+
+### Collation V1
+
+- `CollatorProtocolV1Message -> CollatorProtocolMessage::NetworkBridgeUpdate`
+
+[NBM]: ../../types/overseer-protocol.md#network-bridge-message
+[AppD]: ../../types/overseer-protocol.md#approval-distribution-message
+[BitD]: ../../types/overseer-protocol.md#bitfield-distribution-message
+[StmtD]: ../../types/overseer-protocol.md#statement-distribution-message
+[CollP]: ../../types/overseer-protocol.md#collator-protocol-message
+
+[VP1]: ../../types/network.md#validation-v1
+[CP1]: ../../types/network.md#collation-v1
@@ -0,0 +1,9 @@
+# Peer Set Manager
+
+> TODO
+
+## Protocol
+
+## Functionality
+
+## Jobs, if any
@@ -0,0 +1,271 @@
+# Provisioner
+
+> NOTE: This module has undergone changes for the elastic scaling implementation. As a result, parts of this document may
+be out of date and will be updated at a later time. Issue tracking the update:
+https://github.com/pezkuwichain/pezkuwi-sdk/issues/132
+
+Relay chain block authorship authority is governed by BABE and is beyond the scope of the Overseer and the rest of the
+subsystems. That said, ultimately the block author needs to select a set of backable teyrchain candidates and other
+consensus data, and assemble a block from them. This subsystem is responsible for providing the necessary data to all
+potential block authors.
+
+## Provisionable Data
+
+There are several distinct types of provisionable data, but they share this property in common: all should eventually be
+included in a relay chain block.
+
+### Backed Candidates
+
+The block author can choose 0 or 1 backed teyrchain candidates per teyrchain; the only constraint is that each backable
+candidate has the appropriate relay parent. However, the choice of a backed candidate must be the block author's. The
+provisioner subsystem is how those block authors make this choice in practice.
+
+### Signed Bitfields
+
+[Signed bitfields](../../types/availability.md#signed-availability-bitfield) are attestations from a particular
+validator about which candidates it believes are available. Those will only be provided on fresh leaves.
+
+### Misbehavior Reports
+
+Misbehavior reports are self-contained proofs of misbehavior by a validator or group of validators. For example, it is
+very easy to verify a double-voting misbehavior report: the report contains two votes signed by the same key, advocating
+different outcomes. Concretely, misbehavior reports become inherents which cause dots to be slashed.
+
+Note that there is no mechanism in place which forces a block author to include a misbehavior report which it doesn't
+like, for example if it would be slashed by such a report. The chain's defense against this is to have a relatively long
+slash period, such that it's likely to encounter an honest author before the slash period expires.
+
+### Dispute Inherent
+
+The dispute inherent is similar to a misbehavior report in that it is an attestation of misbehavior on the part of a
+validator or group of validators. Unlike a misbehavior report, it is not self-contained: resolution requires coordinated
+action by several validators. The canonical example of a dispute inherent involves an approval checker discovering that
+a set of validators has improperly approved an invalid teyrchain block: resolving this requires the entire validator set
+to re-validate the block, so that the minority can be slashed.
+
+Dispute resolution is complex and is explained in substantially more detail [here](../../runtime/disputes.md).
+
+## Protocol
+
+The subsystem should maintain a set of handles to Block Authorship Provisioning iterations that are currently live.
+
+### On Overseer Signal
+
+- `ActiveLeavesUpdate`:
+  - For each `activated` head:
+    - spawn a Block Authorship Provisioning iteration with the given relay parent, storing a bidirectional channel with
+      that iteration.
+  - For each `deactivated` head:
+    - terminate the Block Authorship Provisioning iteration for the given relay parent, if any.
+- `Conclude`: Forward `Conclude` to all iterations, wait a short amount of time for them to join, and then
+  hard-exit.
+
+### On `ProvisionerMessage`
+
+Forward the message to the appropriate Block Authorship Provisioning iteration, or discard if no appropriate iteration
+is currently active.
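+
+The routing described in this section can be sketched as follows. This is an illustration only: `Iterations`, `Hash`,
+and the single-variant `ProvisionerMessage` are hypothetical simplifications, and a one-directional
+`std::sync::mpsc` channel stands in for the bidirectional channel mentioned above.
+
+```rust
+use std::collections::HashMap;
+use std::sync::mpsc::{channel, Receiver, Sender};
+
+/// Hypothetical stand-in for a relay-parent hash.
+type Hash = u64;
+
+/// Hypothetical, heavily reduced message type.
+#[derive(Debug, PartialEq)]
+enum ProvisionerMessage {
+	ProvisionableData(String),
+}
+
+/// Tracks one channel per live provisioning iteration, keyed by relay parent.
+#[derive(Default)]
+struct Iterations {
+	live: HashMap<Hash, Sender<ProvisionerMessage>>,
+}
+
+impl Iterations {
+	/// Handles `ActiveLeavesUpdate`. Returns the receiving end of each newly
+	/// created channel; a real implementation would hand it to a spawned task.
+	fn on_active_leaves_update(
+		&mut self,
+		activated: &[Hash],
+		deactivated: &[Hash],
+	) -> Vec<(Hash, Receiver<ProvisionerMessage>)> {
+		let mut spawned = Vec::new();
+		for &head in activated {
+			let (tx, rx) = channel();
+			self.live.insert(head, tx);
+			spawned.push((head, rx));
+		}
+		for head in deactivated {
+			// Dropping the sender closes the channel to that iteration.
+			self.live.remove(head);
+		}
+		spawned
+	}
+
+	/// Forwards a message to the matching iteration. Returns `false` if the
+	/// message was discarded because no such iteration is live.
+	fn forward(&self, relay_parent: Hash, msg: ProvisionerMessage) -> bool {
+		match self.live.get(&relay_parent) {
+			Some(tx) => tx.send(msg).is_ok(),
+			None => false,
+		}
+	}
+}
+```
+
+Keying the map by relay parent makes both termination on `deactivated` and the "discard if no iteration" rule a single
+lookup.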
+
+### Per Provisioning Iteration
+
+Input: [`ProvisionerMessage`](../../types/overseer-protocol.md#provisioner-message). Backed candidates come from the
+[Candidate Backing subsystem](../backing/candidate-backing.md), signed bitfields come from the [Bitfield Distribution
+subsystem](../availability/bitfield-distribution.md), and disputes come from the [Disputes
+Subsystem](../disputes/dispute-coordinator.md). Misbehavior reports are currently sent from the [Candidate Backing
+subsystem](../backing/candidate-backing.md) and contain the following misbehaviors:
+
+1. `Misbehavior::ValidityDoubleVote`
+2. `Misbehavior::UnauthorizedStatement`
+3. `Misbehavior::DoubleSign`
+
+But we choose not to punish these forms of misbehavior for the time being. Risks from misbehavior are sufficiently
+mitigated at the protocol level via reputation changes. Punitive actions here may become desirable enough to dedicate
+time to in the future.
+
+At initialization, this subsystem has no outputs.
+
+Block authors request the inherent data they should use for constructing the inherent in the block which contains
+teyrchain execution information.
+
+## Block Production
+
+When a validator is selected by BABE to author a block, it becomes a block producer. The provisioner is the subsystem
+best suited to choosing which specific backed candidates and availability bitfields should be assembled into the block.
+To engage this functionality, a `ProvisionerMessage::RequestInherentData` is sent; the response is a
+[`ParaInherentData`](../../types/runtime.md#parainherentdata). Each relay chain block backs at most one backable
+teyrchain block candidate per teyrchain. Additionally, no further block candidate can be backed until the previous one
+is either declared available or expires. If bitfields indicate that candidate A, predecessor of B, should be declared
+available, then B can be backed in the same relay block. Appropriate bitfields, as outlined in the section on [bitfield
+selection](#bitfield-selection), and any dispute statements should be attached as well.
+
+### Bitfield Selection
+
+Our goal with respect to bitfields is simple: maximize availability. However, it's not quite as simple as always
+including all bitfields; there are constraints which still need to be met:
+
+- not more than one bitfield per validator
+- each 1 bit must correspond to an occupied core
+
+Beyond that, a semi-arbitrary selection policy is fine. In order to meet the goal of maximizing availability, a
+heuristic of picking the bitfield with the greatest number of 1 bits set in the event of conflict is useful.
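+
+One way to express these constraints and the tie-breaking heuristic is sketched below. `SignedBitfield` is a
+hypothetical, simplified type (no signature, plain `Vec<bool>`), and `select_bitfields` is an illustrative name rather
+than the subsystem's actual function.
+
+```rust
+use std::collections::BTreeMap;
+
+/// Hypothetical, simplified signed bitfield: one bit per availability core.
+#[derive(Clone, Debug, PartialEq)]
+struct SignedBitfield {
+	validator_index: u32,
+	bits: Vec<bool>,
+}
+
+/// Counts the 1 bits in a bitfield.
+fn ones(bitfield: &SignedBitfield) -> usize {
+	bitfield.bits.iter().filter(|bit| **bit).count()
+}
+
+/// Keeps at most one bitfield per validator, preferring the one with the most
+/// 1 bits, and drops any bitfield that sets a bit for an unoccupied core.
+/// The result is ordered by validator index.
+fn select_bitfields(candidates: &[SignedBitfield], occupied: &[bool]) -> Vec<SignedBitfield> {
+	let mut best: BTreeMap<u32, &SignedBitfield> = BTreeMap::new();
+	for bitfield in candidates {
+		// Each 1 bit must correspond to an occupied core.
+		let valid = bitfield.bits.len() == occupied.len()
+			&& bitfield.bits.iter().zip(occupied).all(|(bit, occ)| !*bit || *occ);
+		if !valid {
+			continue;
+		}
+		let replace = match best.get(&bitfield.validator_index) {
+			Some(current) => ones(bitfield) > ones(*current),
+			None => true,
+		};
+		if replace {
+			best.insert(bitfield.validator_index, bitfield);
+		}
+	}
+	best.into_values().cloned().collect()
+}
+```
+
+Given two valid bitfields from the same validator, the one with more 1 bits wins; on a tie, the first one seen is kept.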
+
+### Dispute Statement Selection
+
+This is the point at which the block author provides further votes to active disputes or initiates new disputes in the
+runtime state.
+
+The block-authoring logic of the runtime has an extra step between handling the inherent-data and producing the actual
+inherent call, which we assume performs the work of filtering out disputes which are not relevant to the on-chain state.
+Backing votes are always kept in the dispute statement set. This ensures we punish the maximum number of misbehaving
+backers.
+
+To select disputes:
+
+- Issue a `DisputeCoordinatorMessage::RecentDisputes` message and wait for the response. This is a set of all disputes
+  in recent sessions which we are aware of.
+
+### Determining Bitfield Availability
+
+An occupied core has a `CoreAvailability` bitfield. We also have a list of `SignedAvailabilityBitfield`s. We need to
+determine from these whether or not a core at a particular index has become available.
+
+The key insight required is that `CoreAvailability` is transverse to the `SignedAvailabilityBitfield`s: if we
+conceptualize the list of bitfields as many rows, each bit of which is its own column, then `CoreAvailability` for a
+given core index is the vertical slice of bits in the set at that index.
+
+To compute bitfield availability, then:
+
+- Start with a copy of `OccupiedCore.availability`
+- For each bitfield in the list of `SignedAvailabilityBitfield`s:
+  - Get the bitfield's `validator_index`
+  - Update the availability. Conceptually, assuming bit vectors: `availability[validator_index] |= bitfield[core_idx]`
+- Availability has a 2/3 threshold. Therefore: `3 * availability.count_ones() >= 2 * availability.len()`
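+
+The steps above can be sketched as follows, assuming plain `Vec<bool>` bit vectors. `SignedBitfield` here is a
+hypothetical simplification of `SignedAvailabilityBitfield`, and the function name is illustrative.
+
+```rust
+/// Hypothetical, simplified signed bitfield: `bits[core_idx]` is this
+/// validator's availability vote for the core at `core_idx`.
+struct SignedBitfield {
+	validator_index: usize,
+	bits: Vec<bool>,
+}
+
+/// Returns whether the core at `core_idx` reaches the 2/3 availability
+/// threshold. `availability` holds one bit per validator, as in
+/// `OccupiedCore.availability`.
+fn bitfields_indicate_availability(
+	core_idx: usize,
+	bitfields: &[SignedBitfield],
+	availability: &[bool],
+) -> bool {
+	// Start with a copy of the occupied core's availability.
+	let mut availability = availability.to_vec();
+	for bitfield in bitfields {
+		// Out-of-range indices are skipped rather than causing a panic.
+		if let (Some(slot), Some(bit)) =
+			(availability.get_mut(bitfield.validator_index), bitfield.bits.get(core_idx))
+		{
+			*slot |= *bit;
+		}
+	}
+	let ones = availability.iter().filter(|bit| **bit).count();
+	3 * ones >= 2 * availability.len()
+}
+```
+
+With three validators, two votes for a core give `3 * 2 >= 2 * 3`, so the core counts as available; a single vote does
+not.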
+
+### Candidate Selection: Prospective Teyrchains Mode
+
+The state of the provisioner `PerRelayParent` tracks an important setting, `ProspectiveTeyrchainsMode`. This setting
+determines which backable candidate selection method the provisioner uses.
+
+- `ProspectiveTeyrchainsMode::Disabled`: The provisioner uses its own internal legacy candidate selection.
+- `ProspectiveTeyrchainsMode::Enabled`: The provisioner requests that [prospective
+  teyrchains](../backing/prospective-teyrchains.md) provide selected candidates.
+
+Candidates selected with `ProspectiveTeyrchainsMode::Enabled` are able to benefit from the increased block production
+time asynchronous backing allows. For this reason all Pezkuwi protocol networks will eventually use prospective
+teyrchains candidate selection. Then legacy candidate selection will be removed as obsolete.
+
+### Prospective Teyrchains Candidate Selection
+
+The goal of candidate selection is to determine which cores are free, and then to the degree possible, pick a candidate
+appropriate to each free core. In prospective teyrchains candidate selection the provisioner handles the former process
+while [prospective teyrchains](../backing/prospective-teyrchains.md) handles the latter.
+
+To select backable candidates:
+
+- Get the list of core states from the runtime API
+- For each core state:
+  - On `CoreState::Free`
+    - The core is unscheduled and doesn’t need to be provisioned with a candidate
+  - On `CoreState::Scheduled`
+    - The core is unoccupied and scheduled to accept a backed block for a particular `para_id`.
+    - The provisioner requests a backable candidate from [prospective teyrchains](../backing/prospective-teyrchains.md)
+      with the desired relay parent, the core’s scheduled `para_id`, and an empty required path.
+  - On `CoreState::Occupied`
+    - The availability core is occupied by a teyrchain block candidate pending availability. A further candidate need
+      not be provided by the provisioner unless the core will be vacated this block. This is the case when either
+      bitfields indicate the current core occupant has been made available or a timeout is reached.
+    - If `bitfields_indicate_availability`
+      - If `Some(scheduled_core) = occupied_core.next_up_on_available`, the core will be vacated and in need of a
+        provisioned candidate. The provisioner requests a backable candidate from [prospective
+        teyrchains](../backing/prospective-teyrchains.md) with the core’s scheduled `para_id` and a required path with
+        one entry. This entry corresponds to the parablock candidate previously occupying this core, which was made
+        available and can be built upon even though it hasn’t been seen as included in a relay chain block yet. See the
+        Required Path section below for more detail.
+      - If `occupied_core.next_up_on_available` is `None`, then the core being vacated is unscheduled and doesn’t need
+        to be provisioned with a candidate.
+    - Else-if `occupied_core.time_out_at == block_number`
+      - If `Some(scheduled_core) = occupied_core.next_up_on_time_out`, the core will be vacated and in need of a
+        provisioned candidate. A candidate is requested in exactly the same way as with `CoreState::Scheduled`.
+      - Else the core being vacated is unscheduled and doesn’t need to be provisioned with a candidate.
+
+The end result of this process is a vector of `CandidateHash`s, sorted in order of their core index.
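+
+The per-core decision above can be sketched as a single `match`. All types here are hypothetical simplifications:
+`ParaId` and `CandidateHash` are plain integers, the scheduled core is reduced to its `para_id`, the `available` flag
+stands for a precomputed `bitfields_indicate_availability`, and `BackableRequest` only models the parameters passed to
+prospective teyrchains.
+
+```rust
+type ParaId = u32;
+type CandidateHash = u64;
+
+struct OccupiedCore {
+	candidate_hash: CandidateHash,
+	next_up_on_available: Option<ParaId>,
+	next_up_on_time_out: Option<ParaId>,
+	time_out_at: u32,
+	/// Precomputed result of the bitfield availability check.
+	available: bool,
+}
+
+enum CoreState {
+	Free,
+	Scheduled(ParaId),
+	Occupied(OccupiedCore),
+}
+
+/// What the provisioner would ask prospective teyrchains for.
+#[derive(Debug, PartialEq)]
+struct BackableRequest {
+	para_id: ParaId,
+	required_path: Vec<CandidateHash>,
+}
+
+/// Walks the cores in index order and returns one request per core that needs
+/// a candidate.
+fn requests_for_cores(cores: &[CoreState], block_number: u32) -> Vec<BackableRequest> {
+	let mut requests = Vec::new();
+	for core in cores {
+		let request = match core {
+			CoreState::Free => None,
+			CoreState::Scheduled(para_id) =>
+				Some(BackableRequest { para_id: *para_id, required_path: Vec::new() }),
+			// Vacated because the occupant became available: build on top of it.
+			CoreState::Occupied(core) if core.available => core
+				.next_up_on_available
+				.map(|para_id| BackableRequest { para_id, required_path: vec![core.candidate_hash] }),
+			// Vacated because of a timeout: same as a scheduled core.
+			CoreState::Occupied(core) if core.time_out_at == block_number => core
+				.next_up_on_time_out
+				.map(|para_id| BackableRequest { para_id, required_path: Vec::new() }),
+			CoreState::Occupied(_) => None,
+		};
+		requests.extend(request);
+	}
+	requests
+}
+```
+
+Because the cores are visited in index order, the resulting requests are already sorted by core index.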
+
+#### Required Path
+
+Required path is a parameter for `ProspectiveTeyrchainsMessage::GetBackableCandidates`, which the provisioner sends in
+candidate selection.
+
+An empty required path indicates that the requested candidate chain should start with the most recently included
+parablock for the given `para_id` as of the given relay parent.
+
+In contrast, a required path with one or more entries prompts [prospective
+teyrchains](../backing/prospective-teyrchains.md) to step forward through its fragment tree for the given `para_id` and
+relay parent until the desired parablock is reached. We then select the chain starting with the direct child of that
+parablock to pass to the provisioner.
+
+The parablocks making up a required path do not need to have been previously seen as included in relay chain blocks.
+Thus the ability to provision backable candidates based on a required path effectively decouples backing from inclusion.
+
+### Legacy Candidate Selection
+
+Legacy candidate selection takes place in the provisioner. Thus the provisioner needs to keep an up to date record of
+all [backed_candidates](../../types/backing.md#backed-candidate) `PerRelayParent` to pick from.
+
+The goal of candidate selection is to determine which cores are free, and then to the degree possible, pick a candidate
+appropriate to each free core.
+
+To determine availability:
+
+- Get the list of core states from the runtime API
+- For each core state:
+  - On `CoreState::Scheduled`, we can make an `OccupiedCoreAssumption::Free`.
+  - On `CoreState::Occupied`, we may be able to make an assumption:
+    - If the bitfields indicate availability and there is a scheduled `next_up_on_available`, then we can make an
+      `OccupiedCoreAssumption::Included`.
+    - If the bitfields do not indicate availability, and there is a scheduled `next_up_on_time_out`, and
+      `occupied_core.time_out_at == block_number_under_production`, then we can make an
+      `OccupiedCoreAssumption::TimedOut`.
+  - If we did not make an `OccupiedCoreAssumption`, then continue on to the next core.
+  - Now compute the core's `validation_data_hash`: get the `PersistedValidationData` from the runtime, given the known
+    `ParaId` and `OccupiedCoreAssumption`;
+  - Find an appropriate candidate for the core.
+    - There are two constraints: `backed_candidate.candidate.descriptor.para_id == scheduled_core.para_id` and
+      `backed_candidate.candidate.descriptor.validation_data_hash == computed_validation_data_hash`.
+    - In the event that more than one candidate meets the constraints, selection between the candidates is arbitrary.
+      However, not more than one candidate can be selected per core.
+
+The end result of this process is a vector of `CandidateHash`s, sorted in order of their core index.
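+
+A minimal sketch of the two halves of legacy selection: deriving an `OccupiedCoreAssumption` for a core, and matching
+a candidate against the two constraints. The types are hypothetical simplifications (integer hashes, flattened
+candidate fields, a precomputed `available` flag), and `assumption_for` and `pick_candidate` are illustrative names.
+
+```rust
+type ParaId = u32;
+type Hash = u64;
+
+#[derive(Clone, Copy, Debug, PartialEq)]
+enum OccupiedCoreAssumption {
+	Included,
+	TimedOut,
+	Free,
+}
+
+struct OccupiedCore {
+	/// Precomputed result of the bitfield availability check.
+	available: bool,
+	next_up_on_available: Option<ParaId>,
+	next_up_on_time_out: Option<ParaId>,
+	time_out_at: u32,
+}
+
+enum CoreState {
+	Scheduled(ParaId),
+	Occupied(OccupiedCore),
+	Free,
+}
+
+/// Returns the para to provision and the assumption to make, or `None` if no
+/// assumption can be made and the core should be skipped.
+fn assumption_for(core: &CoreState, block_number: u32) -> Option<(ParaId, OccupiedCoreAssumption)> {
+	match core {
+		CoreState::Scheduled(para_id) => Some((*para_id, OccupiedCoreAssumption::Free)),
+		CoreState::Occupied(core) if core.available =>
+			core.next_up_on_available.map(|para| (para, OccupiedCoreAssumption::Included)),
+		CoreState::Occupied(core) if core.time_out_at == block_number =>
+			core.next_up_on_time_out.map(|para| (para, OccupiedCoreAssumption::TimedOut)),
+		_ => None,
+	}
+}
+
+/// Flattened view of the candidate fields used by the two constraints.
+struct BackedCandidate {
+	hash: Hash,
+	para_id: ParaId,
+	validation_data_hash: Hash,
+}
+
+/// Picks the first candidate meeting both constraints; the choice between
+/// several matching candidates is arbitrary.
+fn pick_candidate(
+	candidates: &[BackedCandidate],
+	para_id: ParaId,
+	validation_data_hash: Hash,
+) -> Option<&BackedCandidate> {
+	candidates
+		.iter()
+		.find(|c| c.para_id == para_id && c.validation_data_hash == validation_data_hash)
+}
+```
+
+Taking the first match is one acceptable reading of "arbitrary"; it also guarantees that at most one candidate is
+selected per core.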
+
+### Retrieving Full `BackedCandidate`s for Selected Hashes
+
+Legacy candidate selection and prospective teyrchains candidate selection both leave us with a vector of
+`CandidateHash`s. These are passed to the backing subsystem with `CandidateBackingMessage::GetBackedCandidates`.
+
+The response is a vector of `BackedCandidate`s, sorted in order of their core index and ready to be provisioned to block
+authoring. The candidate selection and retrieval process should select at most one candidate which upgrades the
+runtime validation code.
+
+## Glossary
+
+- **Relay-parent:**
+  - A particular relay-chain block which serves as an anchor and reference point for processes and data which depend on
+    relay-chain state.
+- **Active Leaf:**
+  - A relay chain block which is the head of an active fork of the relay chain.
+  - Block authorship provisioning jobs are spawned per active leaf and concluded for any leaves which become inactive.
+- **Candidate Selection:**
+  - The process by which the provisioner selects backable teyrchain block candidates to pass to block authoring.
+  - Two versions, prospective teyrchains candidate selection and legacy candidate selection. See their respective
+    protocol sections for details.
+- **Availability Core:**
+  - Often referred to simply as "cores", availability cores are an abstraction used for resource management. For the
+    provisioner, availability cores are most relevant in that core states determine which `para_id`s to provision
+    backable candidates for.
+  - For more on availability cores see [Scheduler Module: Availability
+    Cores](../../runtime/scheduler.md#availability-cores)
+- **Availability Bitfield:**
+  - Often referred to simply as a "bitfield", an availability bitfield represents the view of parablock candidate
+    availability from a particular validator's perspective. Each bit in the bitfield corresponds to a single
+    [availability core](../../runtime-api/availability-cores.md).
+  - For more on availability bitfields see [availability](../../types/availability.md)
+- **Backable vs. Backed:**
+  - Note that we sometimes use "backed" to refer to candidates that are "backable", but not yet backed on chain.
+  - Backable means that a quorum of the candidate's assigned backing group have provided signed affirming statements.
@@ -0,0 +1,265 @@
+# PVF Host and Workers
+
+The PVF host is responsible for handling requests to prepare and execute PVF
+code blobs, which it sends to PVF **workers** running in their own child
+processes. These workers are spawned from the `pezkuwi-prepare-worker` and
+`pezkuwi-execute-worker` binaries.
+
+While the workers are generally long-living, they also spawn one-off secure
+**job processes** that perform the jobs. See the "Job Processes" section below.
+
+## High-Level Flow
+
+```dot process
+digraph {
+ rankdir="LR";
+
+ can [label = "Candidate\nValidation\nSubsystem"; shape = square]
+
+ pvf [label = "PVF Host"; shape = square]
+
+ pq [label = "Prepare\nQueue"; shape = square]
+ eq [label = "Execute\nQueue"; shape = square]
+ pp [label = "Prepare\nPool"; shape = square]
+
+ subgraph "cluster partial_sandbox_prep" {
+  label = "pezkuwi-prepare-worker\n(Partial Sandbox)\n\n\n";
+  labelloc = "t";
+
+  pw [label = "Prepare\nWorker"; shape = square]
+
+  subgraph "cluster full_sandbox_prep" {
+   label = "Fully Isolated Sandbox\n\n\n";
+   labelloc = "t";
+
+   pj [label = "Prepare\nJob"; shape = square]
+  }
+ }
+
+ subgraph "cluster partial_sandbox_exec" {
+  label = "pezkuwi-execute-worker\n(Partial Sandbox)\n\n\n";
+  labelloc = "t";
+
+  ew [label = "Execute\nWorker"; shape = square]
+
+  subgraph "cluster full_sandbox_exec" {
+   label = "Fully Isolated Sandbox\n\n\n";
+   labelloc = "t";
+
+   ej [label = "Execute\nJob"; shape = square]
+  }
+ }
+
+ can -> pvf [label = "Precheck"; style = dashed]
+ can -> pvf [label = "Validate"]
+
+ pvf -> pq [label = "Prepare"; style = dashed]
+ pvf -> eq [label = "Execute";]
+ pvf -> pvf [label = "see (2) and (3)"; style = dashed]
+ pq -> pp [style = dashed]
+
+ pp -> pw [style = dashed]
+ eq -> ew
+
+ pw -> pj [style = dashed]
+ ew -> ej
+}
+```
+
+Some notes about the graph:
+
+1. Once a job has finished, the response will flow back up the way it came.
+2. In the case of execution, the host will send a request for preparation to the
+   Prepare Queue if needed. In that case, only after the preparation succeeds
+   does the Execute Queue continue with validation.
+3. Multiple requests for preparing the same artifact are coalesced, so that the
+   work is only done once.
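+
+Note (3) can be illustrated with a small sketch. `PrepareQueue`, `ArtifactId`, and `RequestId` are hypothetical names
+used only for this example; the real host tracks considerably more state.
+
+```rust
+use std::collections::HashMap;
+
+/// Hypothetical identifiers for an artifact and for a waiting request.
+type ArtifactId = u64;
+type RequestId = u32;
+
+/// Tracks which requests are waiting on each artifact being prepared.
+#[derive(Default)]
+struct PrepareQueue {
+	in_flight: HashMap<ArtifactId, Vec<RequestId>>,
+}
+
+impl PrepareQueue {
+	/// Registers a request. Returns `true` only if preparation work must
+	/// actually be started, i.e. this is the first request for the artifact.
+	fn enqueue(&mut self, artifact: ArtifactId, request: RequestId) -> bool {
+		let waiters = self.in_flight.entry(artifact).or_default();
+		waiters.push(request);
+		waiters.len() == 1
+	}
+
+	/// Called once when preparation concludes; returns every request that
+	/// should be notified of the result.
+	fn conclude(&mut self, artifact: ArtifactId) -> Vec<RequestId> {
+		self.in_flight.remove(&artifact).unwrap_or_default()
+	}
+}
+```
+
+A second `enqueue` for the same artifact returns `false`, so the caller only adds a waiter and does not start another
+preparation job.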
+
+## Goals
+
+This system has two high-level goals that we will touch on here: *determinism*
+and *security*.
+
+## Determinism
+
+One high-level goal is to make PVF operations as deterministic as possible, to
+reduce the rate of disputes. Disputes can happen due to e.g. a job timing out on
+one machine, but not another. While we do not have full determinism, there are
+some dispute reduction mechanisms in place right now.
+
+### Retrying execution requests
+
+If the execution request fails during **preparation**, we will retry if it is
+possible that the preparation error was transient (e.g. if the error was a panic
+or timeout). We will only retry preparation if another request comes in after
+15 minutes, to ensure any potential transient conditions had time to be
+resolved. We will retry up to 5 times.
+
+If the actual **execution** of the artifact fails, we will retry once if it was
+a possibly transient error, to allow the conditions that led to the error to
+hopefully resolve. We use a shorter delay here (1 second, as opposed to 15
+minutes for preparation, see above) because a successful execution must happen
+in a short amount of time.
+
+If the execution fails during the backing phase, we won't retry. This reduces the chance of nondeterministic
+candidates getting backed and of honest backers getting slashed.
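+
+The retry rules in this section can be summarised in a short sketch. The constant and function names are hypothetical,
+as is the `Phase` enum; the numeric values are the ones stated above, and treating "after 15 minutes" as an inclusive
+bound is an assumption.
+
+```rust
+use std::time::Duration;
+
+/// Values taken from the text above; the names are illustrative.
+const PREPARE_RETRY_COOLDOWN: Duration = Duration::from_secs(15 * 60);
+const MAX_PREPARE_RETRIES: u32 = 5;
+const EXECUTE_RETRY_DELAY: Duration = Duration::from_secs(1);
+
+/// Whether a new request for a previously failed artifact should trigger
+/// another preparation attempt.
+fn should_retry_preparation(retries_so_far: u32, since_last_failure: Duration, possibly_transient: bool) -> bool {
+	possibly_transient
+		&& retries_so_far < MAX_PREPARE_RETRIES
+		&& since_last_failure >= PREPARE_RETRY_COOLDOWN
+}
+
+/// Hypothetical split between backing and every other phase.
+#[derive(Clone, Copy, Debug, PartialEq)]
+enum Phase {
+	Backing,
+	Other,
+}
+
+/// Returns the delay before retrying a failed execution, or `None` if the
+/// execution must not be retried.
+fn execution_retry_delay(phase: Phase, retries_so_far: u32, possibly_transient: bool) -> Option<Duration> {
+	if phase == Phase::Backing || !possibly_transient || retries_so_far >= 1 {
+		None
+	} else {
+		Some(EXECUTE_RETRY_DELAY)
+	}
+}
+```
+
+Keeping both decisions as pure functions of a few inputs makes the policy easy to test independently of the queues.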
+
+We currently know of the following specific cases that will lead to a retried
+execution request:
+
+1. **OOM:** We have memory limits to try to prevent attackers from exhausting
+   host memory. If the memory limit is hit, we kill the job process and retry
+   the job. Alternatively, the host might have been temporarily low on memory
+   due to other processes running on the same machine. **NOTE:** This case will
+   lead to voting against the candidate (and possibly a dispute) if the retry is
+   still not successful.
+2. **Syscall violations:** If the job attempts a system call that is blocked by
+   the sandbox's security policy, the job process is immediately killed and we
+   retry. **NOTE:** In the future, if we have a proper way to detect that the
+   job died due to a security violation, it might make sense not to retry in
+   this case.
+3. **Artifact missing:** The prepared artifact might have been deleted due to
+   operator error or some bug in the system.
+4. **Job errors:** For example, the job process panicked for some indeterminate
+   reason, which may or may not be independent of the candidate or PVF.
+5. **Internal errors:** See "Internal Errors" section. In this case, after the
+   retry we abstain from voting.
+6. **RuntimeConstruction error:** The precheck handles the general case of a
+   wrong artifact but doesn't guarantee its consistency between preparation
+   and execution. Something may have happened to the artifact in between
+   (e.g. it was corrupted on disk, or a dirty node upgrade left the prepare
+   worker with a wasmtime version different from the execute worker's). We
+   treat such an error as possibly transient due to local issues and retry
+   one time.
+
+### Preparation timeouts
+
+We use timeouts for both preparation and execution jobs to limit the amount of
+time they can take. As the time for a job can vary depending on the machine and
+load on the machine, this can potentially lead to disputes where some validators
+successfully execute a PVF and others don't.
+
+One dispute mitigation we have in place is a more lenient timeout for
+preparation during execution than during pre-checking. The rationale is that the
+PVF has already passed pre-checking, so we know it should be valid, and we allow
+it to take longer than expected, as this is likely due to an issue with the
+machine and not the PVF.
+
+### CPU clock timeouts
+
+Another timeout-related mitigation we employ is to measure the time taken by
+jobs using CPU time, rather than wall clock time. This is because the CPU time
+of a process is less variable under different system conditions. When the
+overall system is under heavy load, the wall clock time of a job is affected
+more than the CPU time.
+
+### Internal errors
+
+An internal, or local, error is one that we treat as independent of the PVF
+and/or candidate, i.e. local to the running machine. If this happens, then we
+will first retry the job and if the error persists, then we simply do not vote.
+This prevents slashes, since otherwise our vote may not agree with that of the
+other validators.
+
+In general, we have to be very careful with errors that do not raise a dispute.
+Not raising one is only sound if either:
+
+1. We ruled out that error in pre-checking. If something is not checked in
+   pre-checking, even if independent of the candidate and PVF, we must raise a
+   dispute.
+2. We are 100% confident that it is a hardware/local issue, such as a
+   corrupted file.
+
+Reasoning: Otherwise it would be possible to register a PVF where candidates
+cannot be checked, but we don't get a dispute, so nobody gets punished. Worse,
+we would end up with a finality stall that is not going to resolve!
+
+Note that we cannot treat any error from the job process as internal. The job
+runs untrusted code and an attacker can therefore return arbitrary errors. If
+they were to return errors that we treat as internal, they could make us abstain
+from voting. Since we are unsure if such errors are legitimate, we will first
+retry the candidate, and if the issue persists we are forced to vote invalid.
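+
+The resulting voting rule can be sketched as follows. `Outcome`, `Decision`, and `decide` are hypothetical names, and
+the sketch deliberately leaves out outcomes that are not retried at all (such as a deterministic invalidity result).
+
+```rust
+/// Simplified result of one validation attempt.
+#[derive(Clone, Copy, Debug, PartialEq)]
+enum Outcome {
+	Valid,
+	/// Error raised by trusted code and local to this machine.
+	InternalError,
+	/// Error reported by the job process, which runs untrusted code.
+	JobError,
+}
+
+#[derive(Debug, PartialEq)]
+enum Decision {
+	VoteValid,
+	VoteInvalid,
+	Abstain,
+}
+
+/// Retries once on any error, then decides based on the final outcome:
+/// persistent internal errors lead to abstaining, persistent job errors to
+/// an invalid vote.
+fn decide(first: Outcome, retry: impl FnOnce() -> Outcome) -> Decision {
+	let final_outcome = if first == Outcome::Valid { first } else { retry() };
+	match final_outcome {
+		Outcome::Valid => Decision::VoteValid,
+		Outcome::InternalError => Decision::Abstain,
+		Outcome::JobError => Decision::VoteInvalid,
+	}
+}
+```
+
+The asymmetry is the point: only errors that an attacker cannot forge are allowed to result in abstaining.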
+
+## Security
+
+With [on-demand teyrchains](https://github.com/orgs/paritytech/projects/67), it
+is much easier to submit PVFs to the chain for preparation and execution. This
+makes it easier for erroneous disputes and slashing to occur, whether
+intentional (as a result of a malicious attacker) or not (a bug or operator
+error occurred).
+
+Therefore, another goal of ours is to harden our security around PVFs, in order
+to protect the economic interests of validators and increase overall confidence
+in the system.
+
+### Possible attacks / threat model
+
+WebAssembly is already sandboxed, but multiple CVEs enabling remote code
+execution have been reported. See e.g. these two advisories from
+[Mar 2023](https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-ff4p-7xrq-q5r8)
+and [Jul 2022](https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-7f6x-jwh5-m9r4).
+
+So what are we actually worried about? Things that come to mind:
+
+1. **Consensus faults** - If an attacker can get some source of randomness they
+   could vote against with 50% chance and cause unresolvable disputes.
+2. **Targeted slashes** - An attacker can target certain validators (e.g. some
+   validators running on vulnerable hardware) and make them vote invalid and get
+   them slashed.
+3. **Mass slashes** - With some source of randomness they can do an untargeted
+   attack. I.e. a baddie can do significant economic damage by voting against
+   with 1/3 chance, without even stealing keys or completely replacing the
+   binary.
+4. **Stealing keys** - That would be pretty bad. Should not be possible with
+   sandboxing. We should at least not allow filesystem-access or network access.
+5. **Taking control over the validator.** E.g. replacing the `pezkuwi` binary
+   with a `pezkuwi-evil` binary. Should again not be possible with the above
+   sandboxing in place.
+6. **Intercepting and manipulating packages** - Effect very similar to the
+   above, hard to do without also being able to do 4 or 5.
+
+We do not protect against (1), (2), and (3), because there are too many sources
+of randomness for an attacker to exploit.
+
+We provide very good protection against (4), (5), and (6).
+
+### Job Processes
+
+As mentioned above, our architecture includes long-living **worker processes**
+and one-off **job processes**. This separation is important so that the handling
+of untrusted code can be limited to the job processes. A hijacked job process
+can therefore not interfere with other jobs running in separate processes.
+
+Furthermore, if an unexpected execution error occurred in the execution worker
+and not the job itself, we generally can be confident that it has nothing to do
+with the candidate, so we can abstain from voting. On the other hand, a hijacked
+job is able to send back erroneous responses for candidates, so we know that we
+should not abstain from voting on such errors from jobs. Otherwise, an attacker
+could trigger a finality stall. (See "Internal Errors" section above.)
+
+### Restricting file-system access
+
+A basic security mechanism is to make sure that any process directly interfacing
+with untrusted code does not have unnecessary access to the file-system. This
+provides some protection against attackers accessing sensitive data or modifying
+data on the host machine.
+
+*Currently this is only supported on Linux.*
+
+### Restricting networking
+
+We also disable networking on PVF threads by disabling certain syscalls, such as
+the creation of sockets. This prevents attackers from either downloading
+payloads or communicating sensitive data from the validator's machine to the
+outside world.
+
+*Currently this is only supported on Linux.*
+
+### Clearing env vars
+
+We clear environment variables before handling untrusted code, because why give
+attackers potentially sensitive data unnecessarily? And even if everything else
+is locked down, env vars can potentially provide a source of randomness (see
+point 1, "Consensus faults" above).
@@ -0,0 +1,73 @@
+# PVF Pre-checker
+
+The PVF pre-checker is a subsystem that is responsible for watching the relay chain for new PVFs that require
+pre-checking. See the [overview] for a description of the PVF pre-checking process.
+
+## Protocol
+
+There is no dedicated input mechanism for the PVF pre-checker. Instead, the PVF pre-checker watches the
+`ActiveLeavesUpdate` event stream for work.
+
+This subsystem does not produce any output messages either. The subsystem will, however, send messages to the
+[Runtime API] subsystem to query for the pending PVFs and to submit votes. In addition to that, it will also
+communicate with the [Candidate Validation] subsystem to request PVF pre-checks.
+
+## Functionality
+
+If the node is running in collator mode, this subsystem will be disabled. The PVF pre-checker subsystem keeps track of
+the PVFs that are relevant for the subsystem.
+
+To be relevant for the subsystem, a PVF must be returned by the [`pvfs_require_precheck` runtime API][PVF pre-checking
+runtime API] in any of the active leaves. If the PVF is not present in any of the active leaves, it ceases to be
+relevant.
+
+As soon as a PVF becomes relevant, the subsystem sends a message to the [Candidate Validation] subsystem asking for
+the pre-check.
+
+Upon receiving a response from the candidate-validation subsystem, the pre-checker records the judgement for that PVF
+and signs and submits a [`PvfCheckStatement`][PvfCheckStatement] via the [`submit_pvf_check_statement`
+runtime API][PVF pre-checking runtime API]. If a judgement is received for a PVF that is no longer in view, it is
+ignored.
+
+Since a vote is only valid during [one session][overview], the subsystem has to re-sign and re-submit the statements
+for each new session. A new session is assumed to have started if at least one of the leaves has a session index
+greater than any previously observed in any of the leaves.
+
+The subsystem tracks all the statements that it submitted within a session. If for some reason a PVF became irrelevant
+and then becomes relevant again, the subsystem will not submit a new statement for that PVF within the same session.
+
+If the node is not in the active validator set, it will still perform all the checks. However, it will only submit the
+check statements when the node is in the active validator set.
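The bookkeeping described in this section can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the subsystem's actual code: `PreChecker`, `Action`, the integer type aliases, and the single `is_validator` flag (the real membership can change per session) are all hypothetical names introduced here.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins for the real hash and session types.
type PvfHash = u64;
type LeafHash = u64;
type SessionIndex = u32;

/// What the sketch asks its caller to do next (hypothetical).
#[derive(Debug, PartialEq)]
enum Action {
    /// Ask candidate validation to pre-check this PVF.
    RequestPrecheck(PvfHash),
    /// Sign and submit a statement for this PVF in the given session.
    SubmitStatement { pvf: PvfHash, session: SessionIndex, accept: bool },
}

#[derive(Default)]
struct PreChecker {
    /// PVFs that require pre-checking, per active leaf.
    leaves: BTreeMap<LeafHash, BTreeSet<PvfHash>>,
    /// Relevant PVFs; `None` while the pre-check is still in flight.
    judgements: BTreeMap<PvfHash, Option<bool>>,
    /// Greatest session index observed on any leaf so far.
    session: Option<SessionIndex>,
    /// PVFs for which a statement was already submitted this session.
    voted: BTreeSet<PvfHash>,
    /// Simplification: assumed constant instead of tracked per session.
    is_validator: bool,
}

impl PreChecker {
    fn activate_leaf(&mut self, leaf: LeafHash, session: SessionIndex, pvfs: &[PvfHash]) -> Vec<Action> {
        let mut actions = Vec::new();
        // A session index greater than any seen before starts a new session:
        // forget old votes and re-sign every judgement we already hold.
        if self.session.map_or(true, |s| session > s) {
            self.session = Some(session);
            self.voted.clear();
            let known: Vec<(PvfHash, bool)> = self
                .judgements
                .iter()
                .filter_map(|(&pvf, &j)| j.map(|accept| (pvf, accept)))
                .collect();
            for (pvf, accept) in known {
                actions.extend(self.try_vote(pvf, accept));
            }
        }
        // Newly relevant PVFs need a pre-check.
        for &pvf in pvfs {
            if !self.judgements.contains_key(&pvf) {
                self.judgements.insert(pvf, None);
                actions.push(Action::RequestPrecheck(pvf));
            }
        }
        self.leaves.insert(leaf, pvfs.iter().copied().collect());
        actions
    }

    fn deactivate_leaf(&mut self, leaf: LeafHash) {
        self.leaves.remove(&leaf);
        // A PVF absent from every active leaf ceases to be relevant.
        let leaves = &self.leaves;
        self.judgements.retain(|pvf, _| leaves.values().any(|set| set.contains(pvf)));
    }

    fn on_judgement(&mut self, pvf: PvfHash, accept: bool) -> Option<Action> {
        // A judgement for a PVF that is no longer in view is ignored.
        let slot = self.judgements.get_mut(&pvf)?;
        *slot = Some(accept);
        self.try_vote(pvf, accept)
    }

    fn try_vote(&mut self, pvf: PvfHash, accept: bool) -> Option<Action> {
        let session = self.session?;
        // Checks are always performed, but statements are submitted only by
        // validators, and at most once per PVF per session.
        if !self.is_validator || !self.voted.insert(pvf) {
            return None;
        }
        Some(Action::SubmitStatement { pvf, session, accept })
    }
}
```

Keeping `voted` separate from `judgements` is what lets a PVF drop out of view and return within the same session without producing a second statement, while a session change clears only `voted` and re-signs from the judgements still held.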
+
+### Rejecting failed PVFs
+
+It is possible that the candidate validation was not able to check the PVF, e.g. if it timed out. In that case, the PVF
+pre-checker will vote against it. This is considered safe, as there is no slashing for being on the wrong side of a
+pre-check vote.
+
+Rejecting instead of abstaining is better in several ways:
+
+1. Conclusion is reached faster - we have actual votes, instead of relying on a timeout.
+1. Being strict in pre-checking makes it safer to be more lenient with preparation errors afterwards. Hence we have more
+   leeway in avoiding raising dubious disputes, without making things less secure.
+
+Also, if we only abstain, an attacker can specially craft a PVF wasm blob so that it will fail on e.g. 50% of the
+validators. In that case a supermajority will never be reached and the vote will repeat multiple times, most likely with
+the same result (since all votes are cleared on a session change). This is avoided by rejecting failed PVFs, and by only
+requiring 1/3 of validators to reject a PVF to reach a decision.
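The effect of rejecting rather than abstaining can be illustrated with a small tally function. The exact thresholds are not stated in this document, so the rule below is an assumption made for illustration: a PVF is accepted once a supermajority of `n - (n - 1) / 3` validators vote for it, and rejected once enough validators vote against it that a supermajority can no longer be reached. `Outcome` and `tally` are hypothetical names.

```rust
/// Possible states of a pre-check vote (hypothetical).
#[derive(Debug, PartialEq)]
enum Outcome {
    Accepted,
    Rejected,
    Undecided,
}

/// Tally a vote among `validators` participants.
/// Assumed rule, for illustration only: accept at a supermajority of ayes,
/// reject as soon as the nays make a supermajority impossible.
fn tally(validators: u32, ayes: u32, nays: u32) -> Outcome {
    let supermajority = validators - validators.saturating_sub(1) / 3;
    if ayes >= supermajority {
        Outcome::Accepted
    } else if nays > validators - supermajority {
        Outcome::Rejected
    } else {
        Outcome::Undecided
    }
}
```

Under these assumed thresholds, with 100 validators and a PVF crafted to fail on 50 of them, `tally(100, 50, 0)` (failures abstain) stays `Undecided` until the session changes, whereas `tally(100, 50, 50)` (failures vote against) is `Rejected` right away.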
+
+### Note on Disputes
+
+Having a pre-checking phase allows us to make certain assumptions later when preparing the PVF for execution. If a
+runtime passed pre-checking, then we know that the runtime should be valid, and therefore any issue during preparation
+for execution can be assumed to be a local problem on the current node.
+
+For this reason, even deterministic preparation errors should not trigger disputes. And since we do not dispute as a
+result of the pre-checking phase, as stated above, it should be impossible for preparation in general to result in
+disputes.
+
+[overview]: ../../pvf-prechecking.md
+[Runtime API]: runtime-api.md
+[PVF pre-checking runtime API]: ../../runtime-api/pvf-prechecking.md
+[Candidate Validation]: candidate-validation.md
+[PvfCheckStatement]: ../../types/pvf-prechecking.md#pvfcheckstatement
@@ -0,0 +1,21 @@
+# Runtime API
+
+The Runtime API subsystem is responsible for providing a single point of access to runtime state data via a set of
+pre-determined queries. This prevents shared ownership of a blockchain client resource: all other subsystems reach
+runtime state through this single point of access.
+
+## Protocol
+
+Input: [`RuntimeApiMessage`](../../types/overseer-protocol.md#runtime-api-message)
+
+Output: None
+
+## Functionality
+
+On receipt of `RuntimeApiMessage::Request(relay_parent, request)`, answer the request using the post-state of the
+given `relay_parent` and send the response over the side-channel embedded within the request.
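The request/response pattern can be sketched with standard-library channels. This is an illustration under assumptions, not the real message definitions: the single `SessionIndexForChild` variant, the `u64` hash, the `HashMap` standing in for runtime state, and the helper functions are simplifications introduced here, and `std::sync::mpsc` is swapped in for whatever channel type the node actually uses.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Hypothetical stand-in for a relay-chain block hash.
type Hash = u64;

/// Each request carries the sending half of its own response channel
/// (the "side-channel"). Only one illustrative query is modelled.
enum Request {
    SessionIndexForChild(Sender<Option<u32>>),
}

enum RuntimeApiMessage {
    Request(Hash, Request),
}

/// The subsystem loop: the only owner of `state`, which stands in for
/// the post-state of each known relay parent.
fn run_runtime_api(state: HashMap<Hash, u32>, inbox: Receiver<RuntimeApiMessage>) {
    for msg in inbox {
        match msg {
            RuntimeApiMessage::Request(relay_parent, Request::SessionIndexForChild(tx)) => {
                // The requester may have gone away; a failed send is not an error here.
                let _ = tx.send(state.get(&relay_parent).copied());
            }
        }
    }
}

/// Start the subsystem on its own thread and return a handle for sending requests.
fn spawn_runtime_api(state: HashMap<Hash, u32>) -> Sender<RuntimeApiMessage> {
    let (tx, rx) = channel();
    thread::spawn(move || run_runtime_api(state, rx));
    tx
}

/// Client side: create a fresh response channel, embed it in the request, await the answer.
fn query_session(to_api: &Sender<RuntimeApiMessage>, relay_parent: Hash) -> Option<u32> {
    let (tx, rx) = channel();
    to_api
        .send(RuntimeApiMessage::Request(relay_parent, Request::SessionIndexForChild(tx)))
        .ok()?;
    rx.recv().ok()?
}
```

Because the response sender travels inside the request, the subsystem needs no routing table for replies, and callers never hold a reference to the state itself.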
+
+## Jobs
+
+> TODO Don't limit requests based on parent hash, but limit caching. No caching should be done for any requests on
+> `relay_parent`s that are not active based on `ActiveLeavesUpdate` messages. Maybe with some leeway for things that
+> have just been stopped.