This tool makes it easy to run parachain consensus stress/performance
testing on your development machine or in CI.
## Motivation
The parachain consensus node implementation spans many modules which we
call subsystems. Each subsystem is responsible for a small part of the
parachain consensus pipeline logic, but in practice most of the load and
performance issues are concentrated in just a few core subsystems like
`availability-recovery`, `approval-voting` or `dispute-coordinator`. In
the absence of such a tool, we would run large test nets to load/stress
test these parts of the system. Setting up such a test and making sense
of the amount of data it produces is very expensive, hard to orchestrate
and a huge development time sink.
## PR contents
- CLI tool
- Data Availability Read test
- Reusable mocks and components needed so far
- Documentation on how to get started
### Data Availability Read test
An overseer is built using a real `availability-recovery` subsystem
instance, while dependent subsystems like `av-store`, `network-bridge`
and `runtime-api` are mocked. The network bridge emulates all the
network peers and their responses to requests.
The test runs for a configurable number of blocks. For each block it
sends a `RecoverAvailableData` request for an arbitrary number of
candidates and waits for the subsystem to respond to all requests
before moving on to the next block.
At the same time we collect the usual subsystem metrics and task CPU
metrics and show some nice progress reports while running.
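Conceptually, the per-block loop looks roughly like the sketch below. This is
only an illustration: `TestEnvironment`, its `send_message` helper and the
exact `RecoverAvailableData` signature are simplified stand-ins and don't
necessarily match the real crate internals.

```rust
use futures::channel::oneshot;

// Illustrative sketch only: the environment type and message shape are
// simplified stand-ins for the real subsystem-bench internals.
async fn run_block(
    env: &mut TestEnvironment,
    candidates: Vec<CandidateReceipt>,
    session_index: SessionIndex,
) {
    let mut pending = Vec::with_capacity(candidates.len());

    for candidate in candidates {
        let (tx, rx) = oneshot::channel();
        // The real `availability-recovery` subsystem handles this message;
        // `av-store`, `network-bridge` and `runtime-api` answer from mocks.
        env.send_message(AvailabilityRecoveryMessage::RecoverAvailableData(
            candidate,
            session_index,
            None,
            tx,
        ))
        .await;
        pending.push(rx);
    }

    // Only move on to the next block once every recovery has completed.
    for rx in pending {
        let _available_data = rx.await.expect("subsystem responds to the request");
    }
}
```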
### Here is how the CLI output looks:
```
[2023-11-28T13:06:27Z INFO subsystem_bench::core::display] n_validators = 1000, n_cores = 20, pov_size = 5120 - 5120, error = 3, latency = Some(PeerLatency { min_latency: 1ms, max_latency: 100ms })
[2023-11-28T13:06:27Z INFO subsystem-bench::availability] Generating template candidate index=0 pov_size=5242880
[2023-11-28T13:06:27Z INFO subsystem-bench::availability] Created test environment.
[2023-11-28T13:06:27Z INFO subsystem-bench::availability] Pre-generating 60 candidates.
[2023-11-28T13:06:30Z INFO subsystem-bench::core] Initializing network emulation for 1000 peers.
[2023-11-28T13:06:30Z INFO subsystem-bench::availability] Current block 1/3
[2023-11-28T13:06:30Z INFO substrate_prometheus_endpoint] 〽️ Prometheus exporter started at 127.0.0.1:9999
[2023-11-28T13:06:30Z INFO subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:37Z INFO subsystem_bench::availability] Block time 6262ms
[2023-11-28T13:06:37Z INFO subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:37Z INFO subsystem-bench::availability] Current block 2/3
[2023-11-28T13:06:37Z INFO subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:43Z INFO subsystem_bench::availability] Block time 6369ms
[2023-11-28T13:06:43Z INFO subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:43Z INFO subsystem-bench::availability] Current block 3/3
[2023-11-28T13:06:43Z INFO subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] Block time 6194ms
[2023-11-28T13:06:49Z INFO subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] All blocks processed in 18829ms
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] Throughput: 102400 KiB/block
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] Block time: 6276 ms
[2023-11-28T13:06:49Z INFO subsystem_bench::availability]
Total received from network: 415 MiB
Total sent to network: 724 KiB
Total subsystem CPU usage 24.00s
CPU usage per block 8.00s
Total test environment CPU usage 0.15s
CPU usage per block 0.05s
```
### Prometheus/Grafana stack in action
<img width="1246" alt="Screenshot 2023-11-28 at 15 11 10"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/eaa47422-4a5e-4a3a-aaef-14ca644c1574">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 01"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/237329d6-1710-4c27-8f67-5fb11d7f66ea">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 38"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/a07119e8-c9f1-4810-a1b3-f1b7b01cf357">
---------
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
We currently use a bit of a hack in `.cargo/config` to make sure that
clippy isn't too annoying, by specifying the list of lints.
There is now a stable way to define lints for a workspace. The only
downside is that every crate seems to have to opt into this, so there
are a *few* files modified in this PR.
Dependencies:
- [x] PR that upgrades CI to use rust 1.74 is merged.
---------
Co-authored-by: joe petrowski <25483142+joepetrowski@users.noreply.github.com>
Co-authored-by: Branislav Kontur <bkontur@gmail.com>
Co-authored-by: Liam Aharon <liam.aharon@hotmail.com>
Collators were previously reencoding the available data and checking the
erasure root.
Replace that with just checking the PoV hash, which consumes much less
CPU and takes less time.
We also don't need to check the `PersistedValidationData` hash, as
collators don't use it.
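For illustration, the cheaper check boils down to something like the sketch
below (not the exact collator-side code; `expected_pov_hash` stands in for the
hash the collator already has):

```rust
use polkadot_node_primitives::PoV;
use polkadot_primitives::Hash;

// Sketch: hashing the PoV once is far cheaper than re-running the
// reed-solomon erasure coding over the full available data just to
// recompute and compare the erasure root.
fn pov_matches(pov: &PoV, expected_pov_hash: Hash) -> bool {
    pov.hash() == expected_pov_hash
}
```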
Reason:
https://github.com/paritytech/polkadot-sdk/issues/575#issuecomment-1806572230
After systematic chunks recovery is merged, collators will no longer do
any reed-solomon encoding/decoding, which has proven to be a significant
CPU consumer.
Signed-off-by: alindima <alin@parity.io>
While investigating some db migrations that make the node startup fail,
I noticed that the node wasn't exiting and that the log files were
growing exponentially, until my whole system froze. That made it really
hard to actually find why it was failing in the first place.
E.g.:
```
ls -lh /tmp/zombie-01a04c2a2c0265d85f6440cf01c0f44a_-51319-uyggzuD4wEpV/bob.log
32,6G oct 27 11:16 /tmp/zombie-01a04c2a2c0265d85f6440cf01c0f44a_-51319-uyggzuD4wEpV/bob.log
```
This was happening because the following errors were being printed
continuously, without the subsystem main loop exiting:
From dispute-coordinator:
```
WARN tokio-runtime-worker parachain::dispute-coordinator: error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
```
From availability recovery:
```
Erasure task channel closed. Node shutting down ?
```
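In other words, such errors need to be treated as fatal by the subsystem main
loop rather than just logged. A minimal sketch of that idea (simplified, not
the actual patch; `handle` is a placeholder and `LOG_TARGET` is the
subsystem's usual log target constant):

```rust
// Sketch: a terminated channel is fatal, so stop the loop and propagate the
// error instead of printing the same warning forever.
async fn run_loop<Context: SubsystemContext>(mut ctx: Context) -> Result<(), SubsystemError> {
    loop {
        match ctx.recv().await {
            // `handle` is a placeholder for the subsystem's message handling.
            Ok(from_overseer) => handle(from_overseer)?,
            Err(err) => {
                gum::warn!(target: LOG_TARGET, ?err, "Channel terminated, shutting subsystem down");
                return Err(err)
            },
        }
    }
}
```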
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Refactors availability-recovery strategies to allow for easily adding
new hotpaths and failover mechanisms.
The new interface allows for chaining multiple `RecoveryStrategy`
implementations together, to cleanly express the relationship between
them and to share state and code where necessary/possible:
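A rough sketch of the shape of that chaining (crate paths, trait and state
types are simplified here, not the exact interface introduced by the PR):

```rust
use std::collections::HashMap;

// Crate paths and type shapes are approximate; this is a sketch of the
// chaining idea only.
use polkadot_node_primitives::{AvailableData, ErasureChunk};
use polkadot_primitives::ValidatorIndex;

// Errors simplified to a single "could not recover" case.
struct Unavailable;

// State shared along the chain: chunks fetched by earlier strategies, so a
// later `FetchChunks` pass only requests whatever is still missing.
struct RecoveryState {
    received_chunks: HashMap<ValidatorIndex, ErasureChunk>,
}

#[async_trait::async_trait]
trait RecoveryStrategy: Send {
    async fn run(&mut self, state: &mut RecoveryState) -> Result<AvailableData, Unavailable>;
}

// Strategies are tried in order; each one sees (and may extend) the shared
// state left behind by the previous ones.
async fn run_chain(
    strategies: &mut [Box<dyn RecoveryStrategy>],
    state: &mut RecoveryState,
) -> Result<AvailableData, Unavailable> {
    for strategy in strategies.iter_mut() {
        if let Ok(data) = strategy.run(state).await {
            return Ok(data)
        }
        // On failure, fall through to the next strategy, keeping any
        // partially recovered chunks in `state`.
    }
    Err(Unavailable)
}
```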
This was done in order to aid in implementing new hotpaths like
[systematic chunks
recovery](https://github.com/paritytech/polkadot-sdk/issues/598) and
[fetching from approval
checkers](https://github.com/paritytech/polkadot-sdk/issues/575).
Thanks to this design, intermediate state can be shared between the
strategies. For example, if the systematic chunks recovery retrieved
fewer than the needed number of chunks, they are passed over to the next
`FetchChunks` strategy, which will only need to recover the remaining
number of chunks.
Draft example of how a systematic chunk recovery strategy would look:
https://github.com/paritytech/polkadot-sdk/commit/667d870bdf1470525d66c13929d5eac7249dd995
(notice how easy it was to add and reuse code)
Note that this PR doesn't itself add any new strategy; it should fully
preserve backwards compatibility in terms of functionality. Follow-up
PRs that add new strategies will come.
* polkadot: propagate UnpinHandle to ActiveLeafUpdate
Also extract the leaf creation for tests
into a common function.
* dispute-coordinator: try pinned blocks for slashing
* apparently 1.72 is smarter than 1.70
* address nits
* rename fresh_leaf to new_leaf
* Happy New Year!
* Remove year entirely
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
* Remove years from copyright notice in the entire repo
---------
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
* rust 1.64 enables workspace properties
* add edition, repository and authors.
* of course, update the version in one place.
Co-authored-by: Andronik <write@reusable.software>
* Add clippy config and remove .cargo from gitignore
* first fixes
* Clippyfied
* Add clippy CI job
* comment out rusty-cachier
* minor
* fix ci
* remove DAG from check-dependent-project
* add DAG to clippy
Co-authored-by: alvicsam <alvicsam@gmail.com>
* westend: update transaction version
* polkadot: update transaction version
* kusama: update transaction version
* Bump spec_version to 9330
* bump versions to 0.9.33
Was seeing these warnings when running `cargo check --all`:
```
warning: for loop over an `Option`. This is more readably written as an `if let` statement
--> node/core/approval-voting/src/lib.rs:1147:21
|
1147 | for activated in update.activated {
| ^^^^^^^^^^^^^^^^
|
= note: `#[warn(for_loops_over_fallibles)]` on by default
help: to check pattern in a loop use `while let`
|
1147 | while let Some(activated) = update.activated {
| ~~~~~~~~~~~~~~~ ~~~
help: consider using `if let` to clear intent
|
1147 | if let Some(activated) = update.activated {
| ~~~~~~~~~~~~ ~~~
```
My guess is that `activated` used to be a SmallVec or similar, as is
`deactivated`. It was changed to an `Option`, the `for` still compiled (it's
technically correct, just weird), and the compiler didn't catch it until now.
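The fix is the one the compiler suggests; a tiny self-contained illustration
(types simplified, `handle_activated` is just a placeholder):

```rust
// `Option` implements `IntoIterator`, so `for activated in update_activated`
// compiles, but `if let` states the intent directly.
fn process(update_activated: Option<u32>) {
    if let Some(activated) = update_activated {
        handle_activated(activated);
    }
}

fn handle_activated(_leaf: u32) {}
```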
* Bump crate versions
* Bump spec_version to 9280 for kusama
* Bump spec_version to 9280 for polkadot
* Bump spec_version to 9280 for rococo
* Bump spec_version to 9280 for westend
* update Cargo.lock
Co-authored-by: parity-processbot <>
* foo
* rolling session window
* fixup
* remove use statement
* fmt
* split NetworkBridge into two subsystems
Pending cleanup
* split
* chore: reexport OrchestraError as OverseerError
* chore: silence warnings
* fixup tests
* chore: add default timeout of 30s to subsystem test helper ctx handle
* single item channel
* fixins
* fmt
* cleanup
* remove dead code
* remove sync bounds again
* wire up shared state
* deal with some FIXMEs
* use distinct tags
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* use tag
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* address naming
tx and rx are common in networking and also have an implicit networking
meaning, compared to incoming and outgoing, which are already used for the subsystems themselves
* remove unused sync oracle
* remove unneeded state
* fix tests
* chore: fmt
* do not try to register twice
* leak Metrics type
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
Co-authored-by: Andronik <write@reusable.software>