Introduce approval-voting/distribution benchmark (#2621)

## Summary
Built on top of the tooling and ideas introduced in
https://github.com/paritytech/polkadot-sdk/pull/2528, this PR introduces
a synthetic benchmark for measuring and assessing the performance
characteristics of the approval-voting and approval-distribution
subsystems.

Currently this allows, us to simulate the behaviours of these systems
based on the following dimensions:
```
TestConfiguration:
# Test 1
- objective: !ApprovalsTest
    last_considered_tranche: 89
    min_coalesce: 1
    max_coalesce: 6
    enable_assignments_v2: true
    send_till_tranche: 60
    stop_when_approved: false
    coalesce_tranche_diff: 12
    workdir_prefix: "/tmp"
    num_no_shows_per_candidate: 0
    approval_distribution_expected_tof: 6.0
    approval_distribution_cpu_ms: 3.0
    approval_voting_cpu_ms: 4.30
  n_validators: 500
  n_cores: 100
  n_included_candidates: 100
  min_pov_size: 1120
  max_pov_size: 5120
  peer_bandwidth: 524288000000
  bandwidth: 524288000000
  latency:
    min_latency:
      secs: 0
      nanos: 1000000
    max_latency:
      secs: 0
      nanos: 100000000
  error: 0
  num_blocks: 10
```

## The approach
1. We build a real overseer with the real implementations for
approval-voting and approval-distribution subsystems.
2. For a given network size, for each validator we pre-computed all
potential assignments and approvals it would send, because this a
computation heavy operation this will be cached on a file on disk and be
re-used if the generation parameters don't change.
3. The messages will be sent accordingly to the configured parameters
and those are split into 3 main benchmarking scenarios.

## Benchmarking scenarios

### Best case scenario *approvals_throughput_best_case.yaml*
It send to the approval-distribution only the minimum required tranche
to gathered the needed_approvals, so that a candidate is approved.

### Behaviour in the presence of no-shows *approvals_no_shows.yaml*
It sends the tranche needed to approve a candidate when we have a
maximum of *num_no_shows_per_candidate* tranches with no-shows for each
candidate.

### Maximum throughput *approvals_throughput.yaml*
It sends all the tranches for each block and measures the used CPU and
necessary network bandwidth. by the approval-voting and
approval-distribution subsystem.

## How to run it
```
cargo run -p polkadot-subsystem-bench --release -- test-sequence --path polkadot/node/subsystem-bench/examples/approvals_throughput.yaml
```

## Evaluating performance
### Use the real subsystems metrics
If you follow the steps in
https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-grafana
for installing locally prometheus and grafana, all real metrics for the
`approval-distribution`, `approval-voting` and overseer are available.
E.g:
<img width="2149" alt="Screenshot 2023-12-05 at 11 07 46"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/cb8ae2dd-178b-4922-bfa4-dc37e572ed38">

<img width="2551" alt="Screenshot 2023-12-05 at 11 09 42"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/8b4542ba-88b9-46f9-9b70-cc345366081b">

<img width="2154" alt="Screenshot 2023-12-05 at 11 10 15"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/b8874d8d-632e-443a-9840-14ad8e90c54f">

<img width="2535" alt="Screenshot 2023-12-05 at 11 10 52"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/779a439f-fd18-4985-bb80-85d5afad78e2">

### Profile with pyroscope
1. Setup pyroscope following the steps in
https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-pyroscope,
then run any of the benchmark scenario with `--profile` as the
arguments.
2. Open the pyroscope dashboard in grafana, e.g:
<img width="2544" alt="Screenshot 2024-01-09 at 17 09 58"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/58f50c99-a910-4d20-951a-8b16639303d9">



### Useful  logs
1. Network bandwidth requirements:
```
Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
```

2. Cpu usage by the approval-distribution/approval-voting subsystems.
```
approval-distribution CPU usage 84.061s
approval-distribution CPU usage per block 8.406s
approval-voting CPU usage 96.532s
approval-voting CPU usage per block 9.653s
```

3. Time passed until a given block is approved
```
 Chain selection approved  after 3500 ms hash=0x0101010101010101010101010101010101010101010101010101010101010101
Chain selection approved  after 4500 ms hash=0x0202020202020202020202020202020202020202020202020202020202020202
```

### Using benchmark to quantify improvements from
https://github.com/paritytech/polkadot-sdk/pull/1178 +
https://github.com/paritytech/polkadot-sdk/pull/1191

Using a versi-node we compare the scenarios where all new optimisations
are disabled with a scenarios where tranche0 assignments are sent in a
single message and a conservative simulation where the coalescing of
approvals gives us just 50% reduction in the number of messages we send.

Overall, what we see is a speedup of around 30-40% in the time it takes
to process the necessary messages and a 30-40% reduction in the
necessary bandwidth.

#### Best case scenario comparison(minimum required tranches sent).
Unoptimised
```
    Number of blocks: 10
    Payload bytes received from peers: 53289 KiB total, 5328 KiB/block
    Payload bytes sent to peers: 52489 KiB total, 5248 KiB/block
    approval-distribution CPU usage 6.732s
    approval-distribution CPU usage per block 0.673s
    approval-voting CPU usage 9.523s
    approval-voting CPU usage per block 0.952s
```

vs Optimisation enabled
```
   Number of blocks: 10
   Payload bytes received from peers: 32141 KiB total, 3214 KiB/block
   Payload bytes sent to peers: 37314 KiB total, 3731 KiB/block
   approval-distribution CPU usage 4.658s
   approval-distribution CPU usage per block 0.466s
   approval-voting CPU usage 6.236s
   approval-voting CPU usage per block 0.624s
```

#### Worst case all tranches sent, very unlikely happens when sharding
breaks.

Unoptimised
```
   Number of blocks: 10
   Payload bytes received from peers: 746393 KiB total, 74639 KiB/block
   Payload bytes sent to peers: 729151 KiB total, 72915 KiB/block
   approval-distribution CPU usage 118.681s
   approval-distribution CPU usage per block 11.868s
   approval-voting CPU usage 124.118s
   approval-voting CPU usage per block 12.412s
```

vs optimised
```
    Number of blocks: 10
    Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
    Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
    approval-distribution CPU usage 84.061s
    approval-distribution CPU usage per block 8.406s
    approval-voting CPU usage 96.532s
    approval-voting CPU usage per block 9.653s
```


## TODOs
[x] Polish implementation.
[x] Use what we have so far to evaluate
https://github.com/paritytech/polkadot-sdk/pull/1191 before merging.
[x] List of features and additional dimensions we want to use for
benchmarking.
[x] Run benchmark on hardware similar with versi and kusama nodes.
[ ] Add benchmark to be run in CI for catching regression in
performance.
[ ] Rebase on latest changes for network emulation.

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Andrei Sandu <andrei-mihail@parity.io>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
This commit is contained in:
Alexandru Gheorghe
2024-02-05 08:46:22 +02:00
committed by GitHub
parent 90849b66b9
commit f9f886886b
29 changed files with 2856 additions and 126 deletions
@@ -55,11 +55,11 @@ pub struct OurAssignment {
}
impl OurAssignment {
pub(crate) fn cert(&self) -> &AssignmentCertV2 {
pub fn cert(&self) -> &AssignmentCertV2 {
&self.cert
}
pub(crate) fn tranche(&self) -> DelayTranche {
pub fn tranche(&self) -> DelayTranche {
self.tranche
}
@@ -225,7 +225,7 @@ fn assigned_core_transcript(core_index: CoreIndex) -> Transcript {
/// Information about the world assignments are being produced in.
#[derive(Clone, Debug)]
pub(crate) struct Config {
pub struct Config {
/// The assignment public keys for validators.
assignment_keys: Vec<AssignmentId>,
/// The groups of validators assigned to each core.
@@ -321,7 +321,7 @@ impl AssignmentCriteria for RealAssignmentCriteria {
/// different times. The idea is that most assignments are never triggered and fall by the wayside.
///
/// This will not assign to anything the local validator was part of the backing group for.
pub(crate) fn compute_assignments(
pub fn compute_assignments(
keystore: &LocalKeystore,
relay_vrf_story: RelayVRFStory,
config: &Config,
+28 -13
View File
@@ -92,11 +92,11 @@ use time::{slot_number_to_tick, Clock, ClockExt, DelayedApprovalTimer, SystemClo
mod approval_checking;
pub mod approval_db;
mod backend;
mod criteria;
pub mod criteria;
mod import;
mod ops;
mod persisted_entries;
mod time;
pub mod time;
use crate::{
approval_checking::{Check, TranchesToApproveResult},
@@ -159,6 +159,7 @@ pub struct ApprovalVotingSubsystem {
db: Arc<dyn Database>,
mode: Mode,
metrics: Metrics,
clock: Box<dyn Clock + Send + Sync>,
}
#[derive(Clone)]
@@ -444,6 +445,25 @@ impl ApprovalVotingSubsystem {
keystore: Arc<LocalKeystore>,
sync_oracle: Box<dyn SyncOracle + Send>,
metrics: Metrics,
) -> Self {
ApprovalVotingSubsystem::with_config_and_clock(
config,
db,
keystore,
sync_oracle,
metrics,
Box::new(SystemClock {}),
)
}
/// Create a new approval voting subsystem with the given keystore, config, and database.
pub fn with_config_and_clock(
config: Config,
db: Arc<dyn Database>,
keystore: Arc<LocalKeystore>,
sync_oracle: Box<dyn SyncOracle + Send>,
metrics: Metrics,
clock: Box<dyn Clock + Send + Sync>,
) -> Self {
ApprovalVotingSubsystem {
keystore,
@@ -452,6 +472,7 @@ impl ApprovalVotingSubsystem {
db_config: DatabaseConfig { col_approval_data: config.col_approval_data },
mode: Mode::Syncing(sync_oracle),
metrics,
clock,
}
}
@@ -493,15 +514,10 @@ fn db_sanity_check(db: Arc<dyn Database>, config: DatabaseConfig) -> SubsystemRe
impl<Context: Send> ApprovalVotingSubsystem {
fn start(self, ctx: Context) -> SpawnedSubsystem {
let backend = DbBackend::new(self.db.clone(), self.db_config);
let future = run::<DbBackend, Context>(
ctx,
self,
Box::new(SystemClock),
Box::new(RealAssignmentCriteria),
backend,
)
.map_err(|e| SubsystemError::with_origin("approval-voting", e))
.boxed();
let future =
run::<DbBackend, Context>(ctx, self, Box::new(RealAssignmentCriteria), backend)
.map_err(|e| SubsystemError::with_origin("approval-voting", e))
.boxed();
SpawnedSubsystem { name: "approval-voting-subsystem", future }
}
@@ -909,7 +925,6 @@ enum Action {
async fn run<B, Context>(
mut ctx: Context,
mut subsystem: ApprovalVotingSubsystem,
clock: Box<dyn Clock + Send + Sync>,
assignment_criteria: Box<dyn AssignmentCriteria + Send + Sync>,
mut backend: B,
) -> SubsystemResult<()>
@@ -923,7 +938,7 @@ where
let mut state = State {
keystore: subsystem.keystore,
slot_duration_millis: subsystem.slot_duration_millis,
clock,
clock: subsystem.clock,
assignment_criteria,
spans: HashMap::new(),
};
@@ -549,7 +549,7 @@ fn test_harness<T: Future<Output = VirtualOverseer>>(
let subsystem = run(
context,
ApprovalVotingSubsystem::with_config(
ApprovalVotingSubsystem::with_config_and_clock(
Config {
col_approval_data: test_constants::TEST_CONFIG.col_approval_data,
slot_duration_millis: SLOT_DURATION_MILLIS,
@@ -558,8 +558,8 @@ fn test_harness<T: Future<Output = VirtualOverseer>>(
Arc::new(keystore),
sync_oracle,
Metrics::default(),
clock.clone(),
),
clock.clone(),
assignment_criteria,
backend,
);
+18 -6
View File
@@ -33,14 +33,14 @@ use std::{
};
use polkadot_primitives::{Hash, ValidatorIndex};
const TICK_DURATION_MILLIS: u64 = 500;
pub const TICK_DURATION_MILLIS: u64 = 500;
/// A base unit of time, starting from the Unix epoch, split into half-second intervals.
pub(crate) type Tick = u64;
pub type Tick = u64;
/// A clock which allows querying of the current tick as well as
/// waiting for a tick to be reached.
pub(crate) trait Clock {
pub trait Clock {
/// Yields the current tick.
fn tick_now(&self) -> Tick;
@@ -49,7 +49,7 @@ pub(crate) trait Clock {
}
/// Extension methods for clocks.
pub(crate) trait ClockExt {
pub trait ClockExt {
fn tranche_now(&self, slot_duration_millis: u64, base_slot: Slot) -> DelayTranche;
}
@@ -61,7 +61,8 @@ impl<C: Clock + ?Sized> ClockExt for C {
}
/// A clock which uses the actual underlying system clock.
pub(crate) struct SystemClock;
#[derive(Clone)]
pub struct SystemClock;
impl Clock for SystemClock {
/// Yields the current tick.
@@ -93,11 +94,22 @@ fn tick_to_time(tick: Tick) -> SystemTime {
}
/// assumes `slot_duration_millis` evenly divided by tick duration.
pub(crate) fn slot_number_to_tick(slot_duration_millis: u64, slot: Slot) -> Tick {
pub fn slot_number_to_tick(slot_duration_millis: u64, slot: Slot) -> Tick {
let ticks_per_slot = slot_duration_millis / TICK_DURATION_MILLIS;
u64::from(slot) * ticks_per_slot
}
/// Converts a tick to the slot number.
pub fn tick_to_slot_number(slot_duration_millis: u64, tick: Tick) -> Slot {
let ticks_per_slot = slot_duration_millis / TICK_DURATION_MILLIS;
(tick / ticks_per_slot).into()
}
/// Converts a tranche from a slot to the tick number.
pub fn tranche_to_tick(slot_duration_millis: u64, slot: Slot, tranche: u32) -> Tick {
slot_number_to_tick(slot_duration_millis, slot) + tranche as u64
}
/// A list of delayed futures that gets triggered when the waiting time has expired and it is
/// time to sign the candidate.
/// We have a timer per relay-chain block.