Introduce subsystem benchmarking tool (#2528)

This tool makes it easy to run parachain consensus stress/performance
testing on your development machine or in CI.

## Motivation
The parachain consensus node implementation spans many modules
which we call subsystems. Each subsystem is responsible for a small part
of the parachain consensus pipeline, but in practice most of the load
and performance issues are concentrated in just a few core subsystems
like `availability-recovery`, `approval-voting` or
`dispute-coordinator`. In the absence of such a tool, we would have to run
large test networks to load/stress test these parts of the system. Setting
up such a test and making sense of the amount of data it produces is very
expensive, hard to orchestrate and a huge development time sink.

## PR contents
- CLI tool
- Data Availability Read test
- Reusable mocks and components needed so far
- Documentation on how to get started

### Data Availability Read test

An overseer is built using a real `availability-recovery` subsystem
instance, while dependent subsystems like `av-store`, `network-bridge`
and `runtime-api` are mocked. The mock network bridge emulates all the
network peers and their responses to requests.
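The peer emulation can be pictured with a small self-contained sketch (all names here are hypothetical; the real emulator lives in the tool's `core` module and is driven by the CLI's latency and error options):

```rust
use std::time::Duration;

// Hypothetical config mirroring the CLI's peer latency/error options.
#[derive(Debug)]
struct PeerEmulationConfig {
    min_latency: Duration,
    max_latency: Duration,
    error_ratio: u8, // percentage of requests that fail, 0-100
}

// Decide how an emulated peer answers a single chunk request: `None` models
// a failed request, otherwise the response is delayed by a latency
// interpolated between min and max (`latency_fraction` in 0.0..=1.0).
fn emulate_response(
    cfg: &PeerEmulationConfig,
    error_roll: u8,
    latency_fraction: f64,
) -> Option<Duration> {
    if error_roll < cfg.error_ratio {
        return None; // emulated request failure
    }
    let spread = cfg.max_latency.saturating_sub(cfg.min_latency);
    Some(cfg.min_latency + spread.mul_f64(latency_fraction))
}

fn main() {
    // Roughly the parameters shown in the sample run below: 1-100ms latency, 3% errors.
    let cfg = PeerEmulationConfig {
        min_latency: Duration::from_millis(1),
        max_latency: Duration::from_millis(100),
        error_ratio: 3,
    };
    println!("{:?}", emulate_response(&cfg, 50, 0.5));
}
```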

The test runs for a configurable number of blocks. For each block it
sends a `RecoverAvailableData` request for an arbitrary number of
candidates and waits for the subsystem to respond to all requests before
moving on to the next block.
At the same time we collect the usual subsystem metrics as well as task
CPU metrics, and show nice progress reports while running.
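The shape of that driver loop can be sketched as follows (hypothetical names; the real test dispatches `RecoverAvailableData` messages to the subsystem and awaits the responses asynchronously):

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for one recovery request; the real test sends a
// RecoverAvailableData message to the availability-recovery subsystem.
fn recover_candidate(_candidate: usize) { /* mocked work */ }

// Per-block benchmark loop: issue one recovery per candidate, wait for
// all of them to complete, then record the elapsed block time.
fn run_blocks(num_blocks: usize, candidates_per_block: usize) -> Vec<Duration> {
    let mut block_times = Vec::with_capacity(num_blocks);
    for _block in 0..num_blocks {
        let start = Instant::now();
        for candidate in 0..candidates_per_block {
            recover_candidate(candidate);
        }
        block_times.push(start.elapsed());
    }
    block_times
}

fn main() {
    // Mirrors the sample run below: 3 blocks, 20 candidates each.
    let times = run_blocks(3, 20);
    println!("{} blocks processed", times.len());
}
```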

### Here is what the CLI output looks like:

```
[2023-11-28T13:06:27Z INFO  subsystem_bench::core::display] n_validators = 1000, n_cores = 20, pov_size = 5120 - 5120, error = 3, latency = Some(PeerLatency { min_latency: 1ms, max_latency: 100ms })
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Generating template candidate index=0 pov_size=5242880
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Created test environment.
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Pre-generating 60 candidates.
[2023-11-28T13:06:30Z INFO  subsystem-bench::core] Initializing network emulation for 1000 peers.
[2023-11-28T13:06:30Z INFO  subsystem-bench::availability] Current block 1/3
[2023-11-28T13:06:30Z INFO  substrate_prometheus_endpoint] 〽️ Prometheus exporter started at 127.0.0.1:9999
[2023-11-28T13:06:30Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:37Z INFO  subsystem_bench::availability] Block time 6262ms
[2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Current block 2/3
[2023-11-28T13:06:37Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:43Z INFO  subsystem_bench::availability] Block time 6369ms
[2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Current block 3/3
[2023-11-28T13:06:43Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time 6194ms
[2023-11-28T13:06:49Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] All blocks processed in 18829ms
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Throughput: 102400 KiB/block
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time: 6276 ms
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] 
    
    Total received from network: 415 MiB
    Total sent to network: 724 KiB
    Total subsystem CPU usage 24.00s
    CPU usage per block 8.00s
    Total test environment CPU usage 0.15s
    CPU usage per block 0.05s
```
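The reported throughput follows directly from the configuration shown above: 20 cores per block, each carrying a 5120 KiB PoV, gives 20 × 5120 = 102400 KiB/block. A sanity check (hypothetical helper, not part of the tool):

```rust
// Reproduce the throughput figure from the log above: each block recovers
// n_cores candidates, each with a PoV of pov_size_kib KiB.
fn throughput_kib_per_block(n_cores: usize, pov_size_kib: usize) -> usize {
    n_cores * pov_size_kib
}

fn main() {
    // 20 cores * 5120 KiB PoV = 102400 KiB/block, matching the log.
    println!("{} KiB/block", throughput_kib_per_block(20, 5120));
}
```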

### Prometheus/Grafana stack in action
<img width="1246" alt="Screenshot 2023-11-28 at 15 11 10"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/eaa47422-4a5e-4a3a-aaef-14ca644c1574">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 01"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/237329d6-1710-4c27-8f67-5fb11d7f66ea">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 38"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/a07119e8-c9f1-4810-a1b3-f1b7b01cf357">

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
```rust
// Copyright (C) Parity Technologies (UK) Ltd.
// This file is part of Polkadot.

// Polkadot is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.

// Polkadot is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.

// You should have received a copy of the GNU General Public License
// along with Polkadot. If not, see <http://www.gnu.org/licenses/>.

//! A tool for running subsystem benchmark tests designed for development and
//! CI regression testing.

use clap::Parser;
use color_eyre::eyre;
use colored::Colorize;
use std::{path::Path, time::Duration};

pub(crate) mod availability;
pub(crate) mod cli;
pub(crate) mod core;

use availability::{prepare_test, NetworkEmulation, TestState};
use cli::TestObjective;
use core::{
	configuration::TestConfiguration,
	environment::{TestEnvironment, GENESIS_HASH},
};

use clap_num::number_range;

use crate::core::display::display_configuration;

fn le_100(s: &str) -> Result<usize, String> {
	number_range(s, 0, 100)
}

fn le_5000(s: &str) -> Result<usize, String> {
	number_range(s, 0, 5000)
}

#[derive(Debug, Parser)]
#[allow(missing_docs)]
struct BenchCli {
	#[arg(long, value_enum, ignore_case = true, default_value_t = NetworkEmulation::Ideal)]
	/// The type of network to be emulated
	pub network: NetworkEmulation,

	#[clap(flatten)]
	pub standard_configuration: cli::StandardTestOptions,

	#[clap(short, long)]
	/// The bandwidth of simulated remote peers in KiB
	pub peer_bandwidth: Option<usize>,

	#[clap(short, long)]
	/// The bandwidth of our simulated node in KiB
	pub bandwidth: Option<usize>,

	#[clap(long, value_parser=le_100)]
	/// Simulated connection error ratio [0-100].
	pub peer_error: Option<usize>,

	#[clap(long, value_parser=le_5000)]
	/// Minimum remote peer latency in milliseconds [0-5000].
	pub peer_min_latency: Option<u64>,

	#[clap(long, value_parser=le_5000)]
	/// Maximum remote peer latency in milliseconds [0-5000].
	pub peer_max_latency: Option<u64>,

	#[command(subcommand)]
	pub objective: cli::TestObjective,
}

impl BenchCli {
	fn launch(self) -> eyre::Result<()> {
		let configuration = self.standard_configuration;
		let mut test_config = match self.objective {
			TestObjective::TestSequence(options) => {
				let test_sequence =
					core::configuration::TestSequence::new_from_file(Path::new(&options.path))
						.expect("File exists")
						.into_vec();
				let num_steps = test_sequence.len();
				gum::info!(
					"{}",
					format!("Sequence contains {} step(s)", num_steps).bright_purple()
				);
				for (index, test_config) in test_sequence.into_iter().enumerate() {
					gum::info!("{}", format!("Step {}/{}", index + 1, num_steps).bright_purple(),);
					display_configuration(&test_config);

					let mut state = TestState::new(&test_config);
					let (mut env, _protocol_config) = prepare_test(test_config, &mut state);
					env.runtime()
						.block_on(availability::benchmark_availability_read(&mut env, state));
				}
				return Ok(())
			},
			TestObjective::DataAvailabilityRead(ref _options) => match self.network {
				NetworkEmulation::Healthy => TestConfiguration::healthy_network(
					self.objective,
					configuration.num_blocks,
					configuration.n_validators,
					configuration.n_cores,
					configuration.min_pov_size,
					configuration.max_pov_size,
				),
				NetworkEmulation::Degraded => TestConfiguration::degraded_network(
					self.objective,
					configuration.num_blocks,
					configuration.n_validators,
					configuration.n_cores,
					configuration.min_pov_size,
					configuration.max_pov_size,
				),
				NetworkEmulation::Ideal => TestConfiguration::ideal_network(
					self.objective,
					configuration.num_blocks,
					configuration.n_validators,
					configuration.n_cores,
					configuration.min_pov_size,
					configuration.max_pov_size,
				),
			},
		};

		let mut latency_config = test_config.latency.clone().unwrap_or_default();
		if let Some(latency) = self.peer_min_latency {
			latency_config.min_latency = Duration::from_millis(latency);
		}
		if let Some(latency) = self.peer_max_latency {
			latency_config.max_latency = Duration::from_millis(latency);
		}
		if let Some(error) = self.peer_error {
			test_config.error = error;
		}
		if let Some(bandwidth) = self.peer_bandwidth {
			// CLI expects bw in KiB
			test_config.peer_bandwidth = bandwidth * 1024;
		}
		if let Some(bandwidth) = self.bandwidth {
			// CLI expects bw in KiB
			test_config.bandwidth = bandwidth * 1024;
		}

		display_configuration(&test_config);

		let mut state = TestState::new(&test_config);
		let (mut env, _protocol_config) = prepare_test(test_config, &mut state);
		// test_config.write_to_disk();

		env.runtime()
			.block_on(availability::benchmark_availability_read(&mut env, state));
		Ok(())
	}
}

fn main() -> eyre::Result<()> {
	color_eyre::install()?;
	env_logger::builder()
		.filter(Some("hyper"), log::LevelFilter::Info)
		// Avoid `Terminating due to subsystem exit subsystem` warnings
		.filter(Some("polkadot_overseer"), log::LevelFilter::Error)
		.filter(None, log::LevelFilter::Info)
		// .filter(None, log::LevelFilter::Trace)
		.try_init()
		.unwrap();

	let cli: BenchCli = BenchCli::parse();
	cli.launch()?;
	Ok(())
}
```