Commit Graph

7 Commits

Author SHA1 Message Date
jokess123 1d60f9ca9a Fix typos (#3725) 2024-03-18 12:18:40 +00:00
cuinix ea458d0b95 fix some typos (#3587)
Signed-off-by: cuinix <915115094@qq.com>
Co-authored-by: Bastian Köcher <git@kchr.de>
2024-03-08 21:28:04 +00:00
Andrei Eres f0e589d72e subsystem-bench: add regression tests for availability read and write (#3311)
### What's been done
- `subsystem-bench` has been split into two parts: a CLI benchmark
runner and a library.
- The CLI runner is quite simple. It just allows us to run `.yaml`-based
test sequences. It should now only be used to run benchmarks during
development.
- The library is used in the CLI runner and in regression tests. Some
code was changed to make the library independent of the runner.
- Added first regression tests for availability read and write that
replicate existing test sequences.

### How we run regression tests
- Regression tests are simply Rust integration tests without the
test harness.
- They should only be compiled under the `subsystem-benchmarks` feature
to prevent them from running with other tests.
- This doesn't work when running tests with `nextest` in CI, so
additional filters have been added to the `nextest` runs.
- Each benchmark run takes a different amount of time at the beginning,
so we "warm up" the tests until their CPU usage differs by only 1%.
- After the warm-up, we run the benchmarks a few more times and compare
the average with the expected value, within a given precision.
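
The warm-up and comparison steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names `warm_up`, `average_of` and `within_precision` are hypothetical, and it assumes that "differs by only 1%" means the relative difference between two consecutive runs.

```rust
/// Hypothetical helper: runs `bench` until two consecutive results differ by
/// at most `tolerance` (relative, e.g. 0.01 for 1%) or `max_runs` is reached.
/// Returns the number of runs performed, or `None` if results never settled.
fn warm_up<F: FnMut() -> f64>(mut bench: F, tolerance: f64, max_runs: usize) -> Option<usize> {
    let mut prev = bench();
    for run in 2..=max_runs {
        let curr = bench();
        // Relative difference to the previous run; guard against division by zero.
        let diff = (curr - prev).abs() / prev.abs().max(f64::EPSILON);
        if diff <= tolerance {
            return Some(run);
        }
        prev = curr;
    }
    None
}

/// Hypothetical helper: averages `runs` benchmark results (`runs` must be > 0).
fn average_of<F: FnMut() -> f64>(mut bench: F, runs: usize) -> f64 {
    (0..runs).map(|_| bench()).sum::<f64>() / runs as f64
}

/// Hypothetical helper: true if `average` is within `precision` (relative)
/// of the `expected` value.
fn within_precision(average: f64, expected: f64, precision: f64) -> bool {
    (average - expected).abs() <= expected.abs() * precision
}
```

A regression test would call `warm_up` first, then `average_of` on a few more runs, and fail if `within_precision` returns `false`.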

### What is still wrong?
- I haven't managed to set up approval voting tests. The spread of their
results is too large and can't be narrowed down in a reasonable amount
of time in the warm-up phase.
- The tests start an unconfigurable prometheus endpoint inside, which
causes errors because they use the same 9999 port. I disable it with a
flag, but I think it's better to extract the endpoint launching outside
the test, as we already do with `valgrind` and `pyroscope`. But we still
use `prometheus` inside the tests.

### Future work
* https://github.com/paritytech/polkadot-sdk/issues/3528
* https://github.com/paritytech/polkadot-sdk/issues/3529
* https://github.com/paritytech/polkadot-sdk/issues/3530
* https://github.com/paritytech/polkadot-sdk/issues/3531

---------

Co-authored-by: Alexander Samusev <41779041+alvicsam@users.noreply.github.com>
2024-03-01 14:30:43 +00:00
Andrei Sandu 47e46d178b Add subsystem benchmarks for availability-distribution and bitfield-distribution (availability write) (#2970)
Introduce a new test objective: `DataAvailabilityWrite`.

The new benchmark measures the network and cpu usage of
`availability-distribution`, `bitfield-distribution` and
`availability-store` subsystems from the perspective of a validator node
during the process when candidates are made available.

Additionally, I refactored the networking emulation to support bandwidth
accounting and limits on incoming and outgoing requests.
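
One common way to combine bandwidth accounting with a limit is a token bucket. The sketch below is only an illustration of that general technique and is not taken from the PR: `BandwidthLimiter` and its fields are hypothetical names, and time is passed in explicitly (in milliseconds) to keep the example deterministic.

```rust
/// Hypothetical token-bucket limiter that also counts the bytes it let through.
struct BandwidthLimiter {
    rate_bytes_per_sec: u64, // refill rate
    capacity: u64,           // maximum burst size in bytes
    available: u64,          // bytes that may be sent right now
    last_refill_ms: u64,     // timestamp of the last refill
    total_bytes: u64,        // accounting: bytes accepted so far
}

impl BandwidthLimiter {
    fn new(rate_bytes_per_sec: u64, capacity: u64) -> Self {
        Self { rate_bytes_per_sec, capacity, available: capacity, last_refill_ms: 0, total_bytes: 0 }
    }

    /// Adds tokens for the time elapsed since the last refill.
    /// Fractional remainders are dropped for simplicity.
    fn refill(&mut self, now_ms: u64) {
        let elapsed_ms = now_ms.saturating_sub(self.last_refill_ms);
        let refill = self.rate_bytes_per_sec.saturating_mul(elapsed_ms) / 1000;
        self.available = self.available.saturating_add(refill).min(self.capacity);
        self.last_refill_ms = now_ms;
    }

    /// Returns true and records the bytes if the request fits the budget.
    fn try_send(&mut self, now_ms: u64, bytes: u64) -> bool {
        self.refill(now_ms);
        if bytes > self.available {
            return false;
        }
        self.available -= bytes;
        self.total_bytes += bytes;
        true
    }
}
```

An emulated peer would keep one such limiter per direction, so that incoming and outgoing traffic are limited and counted separately.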

Screenshot of a successful run


<img width="1293" alt="Screenshot 2024-01-17 at 19 17 44"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/fde11280-e25b-4dc3-9dc9-d4b9752f9b7a">

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
2024-01-25 17:02:24 +00:00
Andrei Eres ec7bfae00a subsystem-bench: cache misses profiling (#2893)
## Why we need it
To provide another level of understanding of why Polkadot's subsystems
may perform slower than expected. Cache misses occur when processing
large amounts of data, such as during availability recovery.

## Why Cachegrind
Cachegrind has many drawbacks: it is slow, and it uses its own cache
simulation, which is very basic. But unlike `perf`, which is a great
tool, Cachegrind can run in a virtual machine. This means we can easily
run it in remote installations and even use it in CI/CD to catch
possible regressions.

Why Cachegrind and not Callgrind, another part of Valgrind? We simply
found empirically that profiling runs faster with Cachegrind.

## First results
First results were obtained while testing the approach. Here is
an example.

```
$ target/testnet/subsystem-bench --n-cores 10 --cache-misses data-availability-read
$ cat cachegrind_report.txt
I refs:        64,622,081,485
I1  misses:         3,018,168
LLi misses:           437,654
I1  miss rate:           0.00%
LLi miss rate:           0.00%

D refs:        12,161,833,115  (9,868,356,364 rd   + 2,293,476,751 wr)
D1  misses:       167,940,701  (   71,060,073 rd   +    96,880,628 wr)
LLd misses:        33,550,018  (   16,685,853 rd   +    16,864,165 wr)
D1  miss rate:            1.4% (          0.7%     +           4.2%  )
LLd miss rate:            0.3% (          0.2%     +           0.7%  )

LL refs:          170,958,869  (   74,078,241 rd   +    96,880,628 wr)
LL misses:         33,987,672  (   17,123,507 rd   +    16,864,165 wr)
LL miss rate:             0.0% (          0.0%     +           0.7%  )
```

The CLI output shows an L1 data cache miss rate of 1.4%, which is not
so bad, given that the last-level cache held that data most of the time
and missed only 0.3% of the time. The L1 instruction cache has a 0.00%
miss rate. Looking at the output file with `cg_annotate` shows that most
of the misses occur in the Reed-Solomon code, which is expected.
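
The rates in the report follow directly from the raw counters: each one is misses divided by references. The small sketch below reproduces them; `miss_rate_percent` is a hypothetical helper, not part of `subsystem-bench`. The overall `LL miss rate` line matches dividing by the sum of instruction and data references, which is the assumption used in the last call of the usage note.

```rust
/// Hypothetical helper: miss rate in percent, 0.0 if there were no references.
fn miss_rate_percent(misses: u64, refs: u64) -> f64 {
    if refs == 0 {
        return 0.0;
    }
    misses as f64 / refs as f64 * 100.0
}
```

With the counters above, `miss_rate_percent(167_940_701, 12_161_833_115)` is about 1.38 (reported as 1.4% for D1), `miss_rate_percent(33_550_018, 12_161_833_115)` is about 0.28 (reported as 0.3% for LLd), and `miss_rate_percent(33_987_672, 64_622_081_485 + 12_161_833_115)` is about 0.04 (reported as 0.0% for LL).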
2024-01-16 17:14:29 +00:00
Andrei Eres 526c81b138 subsystem benchmarks: add cpu profiling (#2734)
Ready-to-merge version of
https://github.com/paritytech/polkadot-sdk/pull/2601

- Added optional CPU profiling
- Updated instructions on how to set up Prometheus, Pyroscope and Grafana
- Added a flamegraph dashboard
<img width="1470" alt="image"
src="https://github.com/paritytech/polkadot-sdk/assets/27277055/c8f3b33d-3c01-4ec0-ac34-72d52325b6e6">

---------

Co-authored-by: ordian <write@reusable.software>
2023-12-19 12:14:22 +01:00
Andrei Sandu 8a6e9ef189 Introduce subsystem benchmarking tool (#2528)
This tool makes it easy to run parachain consensus stress/performance
testing on your development machine or in CI.

## Motivation
The parachain consensus node implementation spans many modules
which we call subsystems. Each subsystem is responsible for a small part
of the logic of the parachain consensus pipeline, but in general most
load and performance issues are localized in just a few core subsystems
like `availability-recovery`, `approval-voting` or
`dispute-coordinator`. In the absence of such a tool, we would run large
test nets to load/stress test these parts of the system. Setting up such
a large test and making sense of the amount of data it produces is very
expensive, hard to orchestrate and a huge development time sink.

## PR contents
- CLI tool 
- Data Availability Read test
- Reusable mockups and components needed so far
- Documentation on how to get started

### Data Availability Read test

An overseer is built using a real `availability-recovery` subsystem
instance while dependent subsystems like `av-store`, `network-bridge`
and `runtime-api` are mocked. The network bridge emulates all the
network peers and their answers to requests.

The test is run for a number of blocks. For each block it
will generate and send a “RecoverAvailableData” request for an arbitrary
number of candidates. We wait for the subsystem to respond to all
requests before moving to the next block.
At the same time we collect the usual subsystem metrics and task CPU
metrics and show some nice progress reports while running.
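
The per-block loop described above can be sketched with plain threads and channels. This is a simplified stand-in, not the tool's code: the real tool drives an overseer, whereas here `RecoverRequest`, `spawn_mock_subsystem` and `run_benchmark` are hypothetical names, the "subsystem" is a thread that answers immediately, and one request per candidate is assumed.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a "RecoverAvailableData" request.
struct RecoverRequest {
    candidate_index: usize,
    response_tx: mpsc::Sender<usize>,
}

/// Spawns a mock subsystem that answers every request with its candidate index.
fn spawn_mock_subsystem() -> (mpsc::Sender<RecoverRequest>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<RecoverRequest>();
    let handle = thread::spawn(move || {
        for req in rx {
            // Ignore send errors: the requester may have stopped listening.
            let _ = req.response_tx.send(req.candidate_index);
        }
    });
    (tx, handle)
}

/// Sends `candidates_per_block` requests per block and waits for all responses
/// before moving to the next block. Returns the number of recovered candidates.
fn run_benchmark(num_blocks: usize, candidates_per_block: usize) -> usize {
    let (subsystem_tx, handle) = spawn_mock_subsystem();
    let mut recovered = 0;
    for block in 0..num_blocks {
        let (response_tx, response_rx) = mpsc::channel();
        for i in 0..candidates_per_block {
            let request = RecoverRequest {
                candidate_index: block * candidates_per_block + i,
                response_tx: response_tx.clone(),
            };
            subsystem_tx.send(request).expect("mock subsystem is running");
        }
        drop(response_tx);
        // Block until every request of this block has been answered.
        recovered += response_rx.iter().take(candidates_per_block).count();
    }
    drop(subsystem_tx); // closes the channel so the mock subsystem thread exits
    handle.join().expect("mock subsystem thread panicked");
    recovered
}
```

For example, `run_benchmark(3, 20)` sends 20 requests in each of 3 blocks and returns 60 once all of them have been answered.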

### Here is what the CLI looks like:

```
[2023-11-28T13:06:27Z INFO  subsystem_bench::core::display] n_validators = 1000, n_cores = 20, pov_size = 5120 - 5120, error = 3, latency = Some(PeerLatency { min_latency: 1ms, max_latency: 100ms })
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Generating template candidate index=0 pov_size=5242880
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Created test environment.
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Pre-generating 60 candidates.
[2023-11-28T13:06:30Z INFO  subsystem-bench::core] Initializing network emulation for 1000 peers.
[2023-11-28T13:06:30Z INFO  subsystem-bench::availability] Current block 1/3
[2023-11-28T13:06:30Z INFO  substrate_prometheus_endpoint] 〽️ Prometheus exporter started at 127.0.0.1:9999
[2023-11-28T13:06:30Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:37Z INFO  subsystem_bench::availability] Block time 6262ms
[2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Current block 2/3
[2023-11-28T13:06:37Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:43Z INFO  subsystem_bench::availability] Block time 6369ms
[2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Current block 3/3
[2023-11-28T13:06:43Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time 6194ms
[2023-11-28T13:06:49Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] All blocks processed in 18829ms
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Throughput: 102400 KiB/block
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time: 6276 ms
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] 
    
    Total received from network: 415 MiB
    Total sent to network: 724 KiB
    Total subsystem CPU usage 24.00s
    CPU usage per block 8.00s
    Total test environment CPU usage 0.15s
    CPU usage per block 0.05s
```
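
The summary figures in this log are consistent with simple arithmetic on the run parameters. The sketch below shows that arithmetic. The helper names are hypothetical, and it is an assumption that the tool derives the numbers this way, with `pov_size` given in KiB (5120 KiB is 5242880 bytes, matching the `pov_size=5242880` line).

```rust
/// Hypothetical helper: data recovered per block if every core has one
/// candidate with a PoV of `pov_size_kib` KiB.
fn throughput_kib_per_block(n_cores: u64, pov_size_kib: u64) -> u64 {
    n_cores * pov_size_kib
}

/// Hypothetical helper: average of a total over the number of blocks.
fn per_block(total: f64, num_blocks: u32) -> f64 {
    total / f64::from(num_blocks)
}
```

With the values from the log, `throughput_kib_per_block(20, 5120)` gives 102400 KiB/block, `per_block(18829.0, 3)` gives roughly 6276 ms of block time, and `per_block(24.0, 3)` gives 8.0 s of subsystem CPU usage per block.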

### Prometheus/Grafana stack in action
<img width="1246" alt="Screenshot 2023-11-28 at 15 11 10"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/eaa47422-4a5e-4a3a-aaef-14ca644c1574">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 01"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/237329d6-1710-4c27-8f67-5fb11d7f66ea">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 38"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/a07119e8-c9f1-4810-a1b3-f1b7b01cf357">

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
2023-12-14 12:57:17 +02:00