Request based availability distribution (#2423)

* WIP

* availability distribution, still very wip.

Work on the requesting side of things.

* Some docs on what I intend to do.

* Checkpoint of session cache implementation

as I will likely replace it with something smarter.

* More work, mostly on cache

and getting things to type check.

* Only derive MallocSizeOf and Debug for std.

* availability-distribution: Cache feature complete.

* Sketch out logic in `FetchTask` for actual fetching.

- Compile fixes.
- Cleanup.

* Format cleanup.

* More format fixes.

* Almost feature complete `fetch_task`.

Missing:

- Check for cancel
- Actual querying of peer ids.

* Finish FetchTask so far.

* Directly use AuthorityDiscoveryId in protocol and cache.

* Resolve `AuthorityDiscoveryId` on sending requests.

* Rework fetch_task

- also make it impossible to check the wrong chunk index.
- Export needed function in validator_discovery.

* From<u32> implementation for `ValidatorIndex`.

* Fixes and more integration work.

* Make session cache proper lru cache.

* Use proper lru cache.

* Requester finished.

* ProtocolState -> Requester

Also make sure to not fetch our own chunk.

* Cleanup + fixes.

* Remove unused functions

- FetchTask::is_finished
- SessionCache::fetch_session_info

* availability-distribution responding side.

* Cleanup + Fixes.

* More fixes.

* More fixes.

adder-collator is running!

* Some docs.

* Docs.

* Fix reporting of bad guys.

* Fix tests

* Make all tests compile.

* Fix test.

* Cleanup + get rid of some warnings.

* state -> requester

* Mostly doc fixes.

* Fix test suite.

* Get rid of now redundant message types.

* WIP

* Rob's review remarks.

* Fix test suite.

* core.relay_parent -> leaf for session request.

* Style fix.

* Decrease request timeout.

* Cleanup obsolete errors.

* Metrics + don't fail on non fatal errors.

* requester.rs -> requester/mod.rs

* Panic on invalid BadValidator report.

* Fix indentation.

* Use typed default timeout constant.

* Make channel size 0, as each sender gets one slot anyways.

* Fix incorrect metrics initialization.

* Fix build after merge.

* More fixes.

* Hopefully valid metrics names.

* Better metrics names.

* Some tests that already work.

* Slightly better docs.

* Some more tests.

* Fix network bridge test.
Robert Klotzner
2021-02-26 18:58:07 +01:00
committed by GitHub
parent 241b1f12a7
commit 48409e5548
45 changed files with 2037 additions and 1523 deletions
@@ -0,0 +1,275 @@
// Copyright 2021 Parity Technologies (UK) Ltd.
// This file is part of Polkadot.
// Polkadot is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
// Polkadot is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
// You should have received a copy of the GNU General Public License
// along with Polkadot. If not, see <http://www.gnu.org/licenses/>.
use std::collections::HashSet;
use lru::LruCache;
use rand::{seq::SliceRandom, thread_rng};
use sp_application_crypto::AppKey;
use sp_core::crypto::Public;
use sp_keystore::{CryptoStore, SyncCryptoStorePtr};
use polkadot_node_subsystem_util::{
request_session_index_for_child_ctx, request_session_info_ctx,
};
use polkadot_primitives::v1::SessionInfo as GlobalSessionInfo;
use polkadot_primitives::v1::{
AuthorityDiscoveryId, GroupIndex, Hash, SessionIndex, ValidatorId, ValidatorIndex,
};
use polkadot_subsystem::SubsystemContext;
use super::{
error::{recv_runtime, Result},
Error,
LOG_TARGET,
};
/// Caching of session info as needed by availability distribution.
///
/// It should be ensured that a cached session stays live in the cache as long as we might need it.
pub struct SessionCache {
/// Get the session index for a given relay parent.
///
/// We query this up to 100 times per block, so caching it here without roundtrips over the
/// overseer seems sensible.
session_index_cache: LruCache<Hash, SessionIndex>,
/// Look up cached sessions by SessionIndex.
///
/// Note: Performance of fetching is secondary here, but we need to make sure we use any
/// existing cache entry before fetching new information, as we should not mess up the order
/// of validators in `SessionInfo::validator_groups`. (We want live TCP connections wherever
/// possible.)
session_info_cache: LruCache<SessionIndex, SessionInfo>,
/// Key store for determining whether we are a validator and what `ValidatorIndex` we have.
keystore: SyncCryptoStorePtr,
}
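// The "consult the cache before fetching" rule described above can be sketched in isolation.
// This is a minimal illustration, not the subsystem's code: `Cache` is a hypothetical stand-in
// that uses a plain `HashMap` instead of an `LruCache`, `Vec<u32>` instead of `SessionInfo`,
// and a `fetch` closure instead of the runtime query.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `SessionCache` (no eviction, simplified types).
struct Cache {
    infos: HashMap<u32, Vec<u32>>,
}

impl Cache {
    /// Run `with_info` on the cached entry, fetching it only on a miss.
    ///
    /// `fetch` stands in for the runtime query; it returns `None` when no
    /// info is available (for example, when this node is not a validator).
    fn with_info<R>(
        &mut self,
        session_index: u32,
        fetch: impl FnOnce() -> Option<Vec<u32>>,
        with_info: impl FnOnce(&Vec<u32>) -> R,
    ) -> Option<R> {
        // Always consult the cache first, so an existing (already ordered)
        // entry is never replaced by freshly fetched data.
        if let Some(info) = self.infos.get(&session_index) {
            return Some(with_info(info));
        }
        let info = fetch()?;
        let r = with_info(&info);
        self.infos.insert(session_index, info);
        Some(r)
    }
}

fn main() {
    let mut cache = Cache { infos: HashMap::new() };
    // First call misses and fetches.
    let first = cache.with_info(7, || Some(vec![2, 0, 1]), |i| i[0]);
    // Second call hits; the fetch closure is never run.
    let second = cache.with_info(7, || unreachable!(), |i| i[0]);
    println!("{:?} {:?}", first, second);
}
```

// Passing a closure rather than returning the entry mirrors the idea of handing out a
// reference to the cached value instead of cloning it.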
/// Localized session information, tailored for the needs of availability distribution.
#[derive(Clone)]
pub struct SessionInfo {
/// The index of this session.
pub session_index: SessionIndex,
/// Validator groups of the current session.
///
/// Each group's order is randomized. This way we achieve load balancing when requesting
/// chunks, as the validators in a group will be tried in that randomized order. Each node
/// should arrive at a different order, therefore we distribute the load on individual
/// validators.
pub validator_groups: Vec<Vec<AuthorityDiscoveryId>>,
/// Our own `ValidatorIndex` in this session.
pub our_index: ValidatorIndex,
/// Remember to which group we belong, so we won't start fetching chunks for candidates with
/// our group being responsible. (We should have that chunk already.)
pub our_group: GroupIndex,
}
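// The per-group randomization described for `validator_groups` can be sketched without
// external crates. This is only an illustration under stated assumptions: `XorShift64` is a
// hypothetical toy generator swapped in for `rand::thread_rng()`, `shuffle_group` is a
// hand-written Fisher-Yates shuffle standing in for `SliceRandom::shuffle`, and `u32` stands
// in for the validator identifiers. The modulo step has a slight bias, which is acceptable
// for a sketch but not for real use.

```rust
/// Toy xorshift64 generator (assumption: stand-in for a real RNG).
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Fisher-Yates shuffle of a single validator group.
fn shuffle_group(group: &mut [u32], rng: &mut XorShift64) {
    for i in (1..group.len()).rev() {
        // Pick a position in `0..=i` and swap it into slot `i`.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        group.swap(i, j);
    }
}

fn main() {
    let mut groups: Vec<Vec<u32>> = vec![vec![0, 1, 2], vec![3, 4, 5]];
    // Each node would use its own seed, so each node ends up with its own order.
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    for g in groups.iter_mut() {
        shuffle_group(g, &mut rng);
    }
    // Group membership is unchanged; only the order within each group differs.
    println!("{:?}", groups);
}
```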
/// Report of bad validators.
///
/// Fetching tasks will report back validators that did not respond as expected, so we can re-order
/// them.
pub struct BadValidators {
/// The session index that was used.
pub session_index: SessionIndex,
/// The group that the unresponsive validators belong to.
pub group_index: GroupIndex,
/// The list of bad validators.
pub bad_validators: Vec<AuthorityDiscoveryId>,
}
impl SessionCache {
/// Create a new `SessionCache`.
pub fn new(keystore: SyncCryptoStorePtr) -> Self {
SessionCache {
// 5 is relatively conservative, 1 to 2 should suffice:
session_index_cache: LruCache::new(5),
// We need to cache the current and the last session the most:
session_info_cache: LruCache::new(2),
keystore,
}
}
/// Tries to retrieve `SessionInfo` and calls `with_info` if successful.
///
/// If this node is not a validator, the function will return `None`.
///
/// Prefer this function if all you need is a reference to `SessionInfo`, as it avoids an
/// expensive clone.
pub async fn with_session_info<Context, F, R>(
&mut self,
ctx: &mut Context,
parent: Hash,
with_info: F,
) -> Result<Option<R>>
where
Context: SubsystemContext,
F: FnOnce(&SessionInfo) -> R,
{
let session_index = match self.session_index_cache.get(&parent) {
Some(index) => *index,
None => {
let index =
recv_runtime(request_session_index_for_child_ctx(parent, ctx).await)
.await?;
self.session_index_cache.put(parent, index);
index
}
};
if let Some(info) = self.session_info_cache.get(&session_index) {
return Ok(Some(with_info(info)));
}
if let Some(info) = self
.query_info_from_runtime(ctx, parent, session_index)
.await?
{
let r = with_info(&info);
self.session_info_cache.put(session_index, info);
return Ok(Some(r));
}
Ok(None)
}
/// Variant of `report_bad` that never fails, but just logs errors.
///
/// Not being able to report bad validators is not fatal, so we should not shut down the
/// subsystem because of it.
pub fn report_bad_log(&mut self, report: BadValidators) {
if let Err(err) = self.report_bad(report) {
tracing::warn!(
target: LOG_TARGET,
err = ?err,
"Reporting bad validators failed with error"
);
}
}
/// Make sure we try unresponsive or misbehaving validators last.
///
/// We assume validators in a group are tried in reverse order, so the reported bad validators
/// will be put at the beginning of the group.
#[tracing::instrument(level = "trace", skip(self, report), fields(subsystem = LOG_TARGET))]
pub fn report_bad(&mut self, report: BadValidators) -> Result<()> {
let session = self
.session_info_cache
.get_mut(&report.session_index)
.ok_or(Error::NoSuchCachedSession)?;
let group = session
.validator_groups
.get_mut(report.group_index.0 as usize)
.expect("A bad validator report must contain a valid group for the reported session. qed.");
let bad_set = report.bad_validators.iter().collect::<HashSet<_>>();
// Get rid of bad boys:
group.retain(|v| !bad_set.contains(v));
// We are trying validators in reverse order, so bad ones should be first:
let mut new_group = report.bad_validators;
new_group.append(group);
*group = new_group;
Ok(())
}
/// Query needed information from runtime.
///
/// We need to pass in the relay parent for our call to `request_session_info_ctx`. We should
/// not actually need it: I suppose it is used for internal caching based on relay parents,
/// which we don't use here. It should not do any harm though.
async fn query_info_from_runtime<Context>(
&self,
ctx: &mut Context,
parent: Hash,
session_index: SessionIndex,
) -> Result<Option<SessionInfo>>
where
Context: SubsystemContext,
{
let GlobalSessionInfo {
validators,
discovery_keys,
mut validator_groups,
..
} = recv_runtime(request_session_info_ctx(parent, session_index, ctx).await)
.await?
.ok_or(Error::NoSuchSession(session_index))?;
if let Some(our_index) = self.get_our_index(validators).await {
// Get our group index:
let our_group = validator_groups
.iter()
.enumerate()
.find_map(|(i, g)| {
g.iter().find_map(|v| {
if *v == our_index {
Some(GroupIndex(i as u32))
} else {
None
}
})
})
.expect("Every validator should be in a validator group. qed.");
// Shuffle validators in groups:
let mut rng = thread_rng();
for g in validator_groups.iter_mut() {
g.shuffle(&mut rng);
}
// Look up `AuthorityDiscoveryId`s right away:
let validator_groups: Vec<Vec<_>> = validator_groups
.into_iter()
.map(|group| {
group
.into_iter()
.map(|index| {
discovery_keys.get(index.0 as usize)
.expect("There should be a discovery key for each validator of each validator group. qed.")
.clone()
})
.collect()
})
.collect();
let info = SessionInfo {
validator_groups,
our_index,
session_index,
our_group,
};
return Ok(Some(info));
}
Ok(None)
}
/// Get our `ValidatorIndex`.
///
/// Returns `None` if we are not a validator.
async fn get_our_index(&self, validators: Vec<ValidatorId>) -> Option<ValidatorIndex> {
for (i, v) in validators.iter().enumerate() {
if CryptoStore::has_keys(&*self.keystore, &[(v.to_raw_vec(), ValidatorId::ID)])
.await
{
return Some(ValidatorIndex(i as u32));
}
}
None
}
}
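// The reordering performed by `report_bad` can be tried out in isolation. This is a minimal
// sketch, assuming plain `u32` IDs in place of `AuthorityDiscoveryId`; `move_bad_to_front` is
// a hypothetical helper name used for illustration and is not part of this crate.

```rust
use std::collections::HashSet;

/// Move reported validators to the front of `group`.
///
/// Assumption: consumers try validators starting from the back of the vector,
/// so the front is the "try last" position.
fn move_bad_to_front(group: &mut Vec<u32>, bad: Vec<u32>) {
    let bad_set: HashSet<u32> = bad.iter().copied().collect();
    // Drop reported validators from their current positions.
    group.retain(|v| !bad_set.contains(v));
    // Re-insert them at the front, followed by the remaining validators.
    let mut new_group = bad;
    new_group.append(group);
    *group = new_group;
}

fn main() {
    let mut group = vec![1, 2, 3, 4];
    move_bad_to_front(&mut group, vec![3]);
    // Taking validators from the back yields 4, 2, 1 and only then the reported 3.
    println!("{:?}", group);
}
```

// As in `report_bad`, every reported ID ends up in the group, whether or not it was a member
// before; the sketch does not validate the report.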