Document benchmarking CLI (#11246)

* Decrease default repeats

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Add benchmarking READMEs

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Update docs

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Update docs

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Update README

Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Review fixes

Co-authored-by: Shawn Tabrizi <shawntabrizi@gmail.com>

Co-authored-by: parity-processbot <>
Co-authored-by: Shawn Tabrizi <shawntabrizi@gmail.com>
This commit is contained in:
Oliver Tale-Yazdi
2022-05-25 05:47:21 +02:00
committed by GitHub
parent 35af8fd726
commit 29474f9893
9 changed files with 505 additions and 8 deletions
@@ -1 +1,46 @@
License: Apache-2.0
# The Benchmarking CLI
This crate contains commands to benchmark various aspects of Substrate and the hardware.
All commands are exposed by the Substrate node but can be exposed by any Substrate client.
The goal is to have a comprehensive suite of benchmarks that covers all aspects of Substrate and the hardware that it is running on.
Invoking the root benchmark command prints a help menu:
```sh
$ cargo run --profile=production -- benchmark
Sub-commands concerned with benchmarking.
USAGE:
substrate benchmark <SUBCOMMAND>
OPTIONS:
-h, --help Print help information
-V, --version Print version information
SUBCOMMANDS:
block Benchmark the execution time of historic blocks
machine Command to benchmark the hardware.
overhead Benchmark the execution overhead per-block and per-extrinsic
pallet Benchmark the extrinsic weight of FRAME Pallets
storage Benchmark the storage speed of a chain snapshot
```
All examples use the `production` profile for correctness, which makes compilation *very* slow; for testing you can use `--release`.
For the final results the `production` profile and reference hardware should be used, otherwise the results are not comparable.
The sub-commands are explained in depth here:
- [block] Compare the weight of a historic block to its actual resource usage
- [machine] Gauges the speed of the hardware
- [overhead] Creates weight files for the *Block*- and *Extrinsic*-base weights
- [pallet] Creates weight files for a Pallet
- [storage] Creates weight files for *Read* and *Write* storage operations
License: Apache-2.0
<!-- LINKS -->
[pallet]: ../../../frame/benchmarking/README.md
[machine]: src/machine/README.md
[storage]: src/storage/README.md
[overhead]: src/overhead/README.md
[block]: src/block/README.md
@@ -0,0 +1,118 @@
# The `benchmark block` command
The whole benchmarking process in Substrate aims to predict the resource usage of an unexecuted block.
This command measures how accurate this prediction was by executing a block and comparing the predicted weight to its actual resource usage.
It can be used to measure the accuracy of the pallet benchmarking.
The following sections explain it once for Polkadot and once for Substrate.
## Polkadot # 1
<sup>(Also works for Kusama, Westend and Rococo)</sup>
Suppose you either have a synced Polkadot node or downloaded a snapshot from [Polkachu].
This example uses a pruned ParityDB snapshot from 2022-04-19 whose last block is 9939462.
For pruned snapshots you need to know the number of the last block (to be improved [here]).
Pruned snapshots normally store the last 256 blocks; archive nodes can use any block range.
In this example we will benchmark just the last 10 blocks:
```sh
cargo run --profile=production -- benchmark block --from 9939453 --to 9939462 --db paritydb
```
Output:
```pre
Block 9939453 with 2 tx used 4.57% of its weight ( 26,458,801 of 579,047,053 ns)
Block 9939454 with 3 tx used 4.80% of its weight ( 28,335,826 of 590,414,831 ns)
Block 9939455 with 2 tx used 4.76% of its weight ( 27,889,567 of 586,484,595 ns)
Block 9939456 with 2 tx used 4.65% of its weight ( 27,101,306 of 582,789,723 ns)
Block 9939457 with 2 tx used 4.62% of its weight ( 26,908,882 of 582,789,723 ns)
Block 9939458 with 2 tx used 4.78% of its weight ( 28,211,440 of 590,179,467 ns)
Block 9939459 with 4 tx used 4.78% of its weight ( 27,866,077 of 583,260,451 ns)
Block 9939460 with 3 tx used 4.72% of its weight ( 27,845,836 of 590,462,629 ns)
Block 9939461 with 2 tx used 4.58% of its weight ( 26,685,119 of 582,789,723 ns)
Block 9939462 with 2 tx used 4.60% of its weight ( 26,840,938 of 583,697,101 ns)
```
### Output Interpretation
<sup>(Only results from reference hardware are relevant)</sup>
Each block is executed multiple times and the results are averaged.
The percentage is the interesting part and indicates how much weight was used compared to how much was predicted.
The closer to 100% this is without exceeding 100%, the better.
If it exceeds 100%, the block is marked with "**OVER WEIGHT!**" to make it easier to spot. This is not good, since it means that the benchmarking under-estimated the weight.
This would mean that an honest validator would possibly not be able to keep up with importing blocks since users did not pay for enough weight.
If that happens the validator could lag behind the chain and get slashed for missing deadlines.
It is therefore important to investigate any overweight blocks.
In this example you can see an unexpected result: less than 5% of the weight was used!
The measured blocks can be executed much faster than predicted.
This means that the benchmarking process massively over-estimated the execution time.
Since the predictions are off by so much, this is tracked as an issue in [polkadot#5192].
The ideal range for these results would be 85-100%.
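The interpretation above can be sketched in a few lines of Rust. This is only an illustration, not the code of the CLI: the names `used_percent`, `classify` and `Verdict` are hypothetical, and the thresholds are taken from the text (above 100% is over weight, 85-100% is ideal).

```rust
/// Hypothetical classification of one measured block.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    /// More time was used than predicted; this needs investigation.
    OverWeight,
    /// Between 85% and 100% of the predicted weight was used.
    Ideal,
    /// Less than 85% was used; the prediction over-estimated the block.
    OverEstimated,
}

/// Percentage of the predicted weight that was actually used.
/// Assumes `predicted_ns` is non-zero.
pub fn used_percent(used_ns: u64, predicted_ns: u64) -> f64 {
    used_ns as f64 / predicted_ns as f64 * 100.0
}

/// Applies the thresholds from the text to one block.
pub fn classify(used_ns: u64, predicted_ns: u64) -> Verdict {
    let pct = used_percent(used_ns, predicted_ns);
    if pct > 100.0 {
        Verdict::OverWeight
    } else if pct >= 85.0 {
        Verdict::Ideal
    } else {
        Verdict::OverEstimated
    }
}
```

For block 9939453 from the output above, `used_percent(26_458_801, 579_047_053)` gives roughly 4.57, which falls into the over-estimated range.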
## Polkadot # 2
Let's take a more interesting example where the blocks use more of their predicted weight.
Every day when validators pay out rewards, the blocks are nearly full.
Using an archive node here is the easiest.
The Polkadot blocks TODO-TODO for example contain large batch transactions for staking payout.
```sh
cargo run --profile=production -- benchmark block --from TODO --to TODO --db paritydb
```
```pre
TODO
```
## Substrate
It is also possible to try the procedure in Substrate, although it's a bit boring.
First you need to create some blocks with either a local or dev chain.
This example will use the standard development spec.
Pick a non-existent directory where the chain data will be stored, e.g. `/tmp/dev`.
```sh
cargo run --profile=production -- --dev -d /tmp/dev
```
After a few seconds you should see that it started to produce blocks:
```pre
✨ Imported #1 (0x801d…9189)
```
You can now kill the node with `Ctrl+C`. Then measure how long it takes to execute these blocks:
```sh
cargo run --profile=production -- benchmark block --from 1 --to 1 --dev -d /tmp/dev --pruning archive
```
This will benchmark the first block. If you killed the node at a later point, you can measure multiple blocks.
```pre
Block 1 with 1 tx used 72.04% of its weight ( 4,945,664 of 6,864,702 ns)
```
In this example the block used ~72% of its weight.
The benchmarking therefore over-estimated the effort to execute the block.
Since this block is empty, it's not very interesting.
## Arguments
- `--from` Number of the first block to measure (inclusive).
- `--to` Number of the last block to measure (inclusive).
- `--repeat` How often each block should be measured.
- [`--db`]
- [`--pruning`]
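Because both `--from` and `--to` are inclusive, the range for "the last N blocks" of a snapshot is easy to get wrong by one. A minimal sketch of the arithmetic, assuming a hypothetical helper `last_n_blocks` that is not part of the CLI (excluding block 0 is a simplifying assumption of this sketch):

```rust
/// Returns the inclusive `(from, to)` pair covering the last `count` blocks
/// of a snapshot whose newest block is `last_block`.
/// Returns `None` if `count` is zero or the range would reach block 0.
pub fn last_n_blocks(last_block: u32, count: u32) -> Option<(u32, u32)> {
    if count == 0 || count > last_block {
        return None;
    }
    Some((last_block - count + 1, last_block))
}

/// Number of blocks in an inclusive range; `None` if the range is reversed.
pub fn block_count(from: u32, to: u32) -> Option<u32> {
    to.checked_sub(from).map(|diff| diff + 1)
}
```

With the snapshot from the Polkadot example, `last_n_blocks(9_939_462, 10)` yields `(9_939_453, 9_939_462)`, which matches the `--from` and `--to` values used there.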
License: Apache-2.0
<!-- LINKS -->
[Polkachu]: https://polkachu.com/snapshots
[here]: https://github.com/paritytech/substrate/issues/11141
[polkadot#5192]: https://github.com/paritytech/polkadot/issues/5192
[`--db`]: ../shared/README.md#arguments
[`--pruning`]: ../shared/README.md#arguments
@@ -0,0 +1,71 @@
# The `benchmark machine` command
Different Substrate chains can have different hardware requirements.
It is therefore important to be able to quickly gauge if a piece of hardware fits a chain's requirements.
The `benchmark machine` command achieves this by measuring key metrics and making them comparable.
Invoking the command looks like this:
```sh
cargo run --profile=production -- benchmark machine --dev
```
## Output
The output on reference hardware:
```pre
+----------+----------------+---------------+--------------+-------------------+
| Category | Function | Score | Minimum | Result |
+----------+----------------+---------------+--------------+-------------------+
| CPU | BLAKE2-256 | 1023.00 MiB/s | 1.00 GiB/s | ✅ Pass ( 99.4 %) |
+----------+----------------+---------------+--------------+-------------------+
| CPU | SR25519-Verify | 665.13 KiB/s | 666.00 KiB/s | ✅ Pass ( 99.9 %) |
+----------+----------------+---------------+--------------+-------------------+
| Memory | Copy | 14.39 GiB/s | 14.32 GiB/s | ✅ Pass (100.4 %) |
+----------+----------------+---------------+--------------+-------------------+
| Disk | Seq Write | 457.00 MiB/s | 450.00 MiB/s | ✅ Pass (101.6 %) |
+----------+----------------+---------------+--------------+-------------------+
| Disk | Rnd Write | 190.00 MiB/s | 200.00 MiB/s | ✅ Pass ( 95.0 %) |
+----------+----------------+---------------+--------------+-------------------+
```
The *score* is the average result of each benchmark. It always adheres to "higher is better".
The *category* indicates which part of the hardware was benchmarked:
- **CPU** Processor intensive task
- **Memory** RAM intensive task
- **Disk** Hard drive intensive task
The *function* is the concrete benchmark that was run:
- **BLAKE2-256** The throughput of the [Blake2-256] cryptographic hashing function with 32 KiB input. The [blake2_256 function] is used in many places in Substrate. The throughput of a hash function strongly depends on the input size, so a fixed input size is used to get comparable results.
- **SR25519-Verify** Sr25519 is a Schnorr signature scheme built on [Curve25519]. Signature verification is used by Substrate when verifying extrinsics and blocks.
- **Copy** The throughput of copying memory from one place in the RAM to another.
- **Seq Write** The throughput of writing data to the storage location sequentially. It is important to use the same disk that will later be used to store the chain data.
- **Rnd Write** The throughput of writing data to the storage location in a random order. This is normally much slower than the sequential write.
The *score* needs to reach the *minimum* in order to pass the benchmark. The *minimum* can be reduced with the `--tolerance` flag.
The *result* indicates whether a specific benchmark was passed by the machine. The percentage is the score relative to the required *minimum*. The `--tolerance` flag is taken into account for this decision. For example, a benchmark that passes with 95% because the *tolerance* was set to 10% would look like this: `✅ Pass ( 95.0 %)`.
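The relation between *score*, *minimum* and *tolerance* can be sketched as follows. This is an illustration under the assumption that the tolerance lowers the required minimum by the given percentage; the function names are hypothetical and the real check in the CLI may differ in detail.

```rust
/// Score relative to the minimum, in percent (the number shown in *Result*).
pub fn relative_score(score: f64, minimum: f64) -> f64 {
    score / minimum * 100.0
}

/// Assumption: a benchmark passes if the score reaches the minimum
/// after the minimum was reduced by `tolerance_percent`.
pub fn passes(score: f64, minimum: f64, tolerance_percent: f64) -> bool {
    let reduced_minimum = minimum * (1.0 - tolerance_percent / 100.0);
    score >= reduced_minimum
}
```

For the *Rnd Write* row above, `relative_score(190.0, 200.0)` is about 95 and `passes(190.0, 200.0, 10.0)` is `true`, while the same score would fail with a tolerance of 0.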
## Interpretation
Ideally all results show a `Pass` and the program exits with code 0. Currently some of the benchmarks can fail even on reference hardware; they are still being improved to make them more deterministic.
Make sure to run nothing else on the machine when benchmarking it.
You can re-run them multiple times to get more reliable results.
## Arguments
- `--tolerance` A percent number to reduce the *minimum* requirement. This should be used to ignore outliers of the benchmarks. The default value is 10%.
- `--verify-duration` How long the verification benchmark should run.
- `--disk-duration` How long the *read* and *write* benchmarks should run each.
- `--allow-fail` Always exit the program with code 0.
- `--chain` / `--dev` Specify the chain config to use. This will be used to compare the results with the requirements of the chain (WIP).
- [`--base-path`]
License: Apache-2.0
<!-- LINKS -->
[Blake2-256]: https://www.blake2.net/
[blake2_256 function]: https://crates.parity.io/sp_core/hashing/fn.blake2_256.html
[Curve25519]: https://en.wikipedia.org/wiki/Curve25519
[`--base-path`]: ../shared/README.md#arguments
@@ -0,0 +1,136 @@
# The `benchmark overhead` command
Each time an extrinsic or a block is executed, a fixed weight is charged as "execution overhead".
This is necessary since the weight that is calculated by the pallet benchmarks does not include this overhead.
The exact overhead can vary between Substrate chains and needs to be calculated per chain.
This command calculates the exact values of these overhead weights for any Substrate chain that supports it.
## How does it work?
The benchmark consists of two parts: the [`BlockExecutionWeight`] and the [`ExtrinsicBaseWeight`].
Both are executed sequentially when invoking the command.
## BlockExecutionWeight
The block execution weight is defined as the weight that it takes to execute an *empty block*.
It is measured by constructing an empty block and measuring its execution time.
The result is written to a `block_weights.rs` file which is created from a template.
The file will contain the concrete weight value and various statistics about the measurements. For example:
```rust
/// Time to execute an empty block.
/// Calculated by multiplying the *Average* with `1` and adding `0`.
///
/// Stats [NS]:
/// Min, Max: 3_508_416, 3_680_498
/// Average: 3_532_484
/// Median: 3_522_111
/// Std-Dev: 27070.23
///
/// Percentiles [NS]:
/// 99th: 3_631_863
/// 95th: 3_595_674
/// 75th: 3_526_435
pub const BlockExecutionWeight: Weight = 3_532_484 * WEIGHT_PER_NANOS;
```
In this example it takes 3.5 ms to execute an empty block. That means that it always takes at least 3.5 ms to execute *any* block.
This constant weight is therefore added to each block to ensure that Substrate budgets enough time to execute it.
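To make the statistics in the generated file more concrete, here is a minimal sketch of how such a summary could be computed from raw timings. It is not the implementation used by the CLI: the `Stats` struct and `summarize` function are hypothetical, and the choice of nearest-rank percentiles and a population standard deviation is an assumption.

```rust
/// Hypothetical summary of raw measurements in nanoseconds.
#[derive(Debug, PartialEq)]
pub struct Stats {
    pub min: u64,
    pub max: u64,
    pub avg: u64,
    pub median: u64,
    pub stddev: f64,
    pub p99: u64,
    pub p95: u64,
    pub p75: u64,
}

/// Nearest-rank percentile of a sorted, non-empty slice (`p` in 1..=100).
pub fn percentile(sorted: &[u64], p: usize) -> u64 {
    // rank = ceil(p * n / 100), converted to a zero-based index.
    let rank = (p * sorted.len() + 99) / 100;
    sorted[rank.max(1) - 1]
}

/// Summarizes the samples; returns `None` for an empty input.
pub fn summarize(samples: &[u64]) -> Option<Stats> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    let sum: u64 = sorted.iter().sum();
    let mean = sum as f64 / n as f64;
    // Population variance: mean of the squared deviations.
    let variance = sorted
        .iter()
        .map(|&x| {
            let d = x as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n as f64;
    Some(Stats {
        min: sorted[0],
        max: sorted[n - 1],
        avg: sum / n as u64,
        median: percentile(&sorted, 50),
        stddev: variance.sqrt(),
        p99: percentile(&sorted, 99),
        p95: percentile(&sorted, 95),
        p75: percentile(&sorted, 75),
    })
}
```

A small standard deviation relative to the average, as in the example file, indicates that the repeated measurements agree well with each other.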
## ExtrinsicBaseWeight
The extrinsic base weight is defined as the weight that it takes to execute an *empty* extrinsic.
An *empty* extrinsic is also called a *NO-OP*. It does nothing and is the equivalent of the empty block from above.
The benchmark now constructs a block which is filled with only NO-OP extrinsics.
This block is then executed many times and the weights are measured.
The result is divided by the number of extrinsics in that block and the results are written to `extrinsic_weights.rs`.
The relevant section in the output file looks like this:
```rust
/// Time to execute a NO-OP extrinsic, for example `System::remark`.
/// Calculated by multiplying the *Average* with `1` and adding `0`.
///
/// Stats [NS]:
/// Min, Max: 67_561, 69_855
/// Average: 67_745
/// Median: 67_701
/// Std-Dev: 264.68
///
/// Percentiles [NS]:
/// 99th: 68_758
/// 95th: 67_843
/// 75th: 67_749
pub const ExtrinsicBaseWeight: Weight = 67_745 * WEIGHT_PER_NANOS;
```
In this example it takes 67.7 µs to execute a NO-OP extrinsic. That means that it always takes at least 67.7 µs to execute *any* extrinsic.
This constant weight is therefore added to each extrinsic to ensure that Substrate budgets enough time to execute it.
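The division step described above can be sketched like this. The helper names are hypothetical, and `WEIGHT_PER_NANOS = 1_000` is an assumption about the value of the constant that appears in the generated files.

```rust
/// Assumption: one nanosecond of execution time corresponds to 1_000 weight units.
pub const WEIGHT_PER_NANOS: u64 = 1_000;

/// Time per extrinsic, given the execution time of a block that contains
/// only NO-OP extrinsics. Returns `None` for a block without extrinsics.
pub fn per_extrinsic_ns(block_time_ns: u64, extrinsics_in_block: u64) -> Option<u64> {
    block_time_ns.checked_div(extrinsics_in_block)
}

/// Converts a time in nanoseconds into a weight value.
pub fn to_weight(time_ns: u64) -> u64 {
    time_ns.saturating_mul(WEIGHT_PER_NANOS)
}
```

As a purely illustrative input, a block with 12000 NO-OP extrinsics that takes 812_940_000 ns to execute gives `per_extrinsic_ns(812_940_000, 12_000) == Some(67_745)`; the block time here is a made-up number chosen to match the average in the example file.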
## Invocation
The base command looks like this (for debugging you can use `--release`):
```sh
cargo run --profile=production -- benchmark overhead --dev
```
Output:
```pre
# BlockExecutionWeight
Running 10 warmups...
Executing block 100 times
Per-block execution overhead [ns]:
Total: 353248430
Min: 3508416, Max: 3680498
Average: 3532484, Median: 3522111, Stddev: 27070.23
Percentiles 99th, 95th, 75th: 3631863, 3595674, 3526435
Writing weights to "block_weights.rs"
# Setup
Building block, this takes some time...
Extrinsics per block: 12000
# ExtrinsicBaseWeight
Running 10 warmups...
Executing block 100 times
Per-extrinsic execution overhead [ns]:
Total: 6774590
Min: 67561, Max: 69855
Average: 67745, Median: 67701, Stddev: 264.68
Percentiles 99th, 95th, 75th: 68758, 67843, 67749
Writing weights to "extrinsic_weights.rs"
```
The complete command for Polkadot looks like this:
```sh
cargo run --profile=production -- benchmark overhead --chain=polkadot-dev --execution=wasm --wasm-execution=compiled --weight-path=runtime/polkadot/constants/src/weights/
```
This will overwrite the [block_weights.rs](https://github.com/paritytech/polkadot/blob/c254e5975711a6497af256f6831e9a6c752d28f5/runtime/polkadot/constants/src/weights/block_weights.rs) and [extrinsic_weights.rs](https://github.com/paritytech/polkadot/blob/c254e5975711a6497af256f6831e9a6c752d28f5/runtime/polkadot/constants/src/weights/extrinsic_weights.rs) files in the Polkadot runtime directory.
You can try the same for *Rococo* to see that the results differ slightly.
👉 It is paramount to use `--profile=production`, `--execution=wasm` and `--wasm-execution=compiled` as the results are otherwise useless.
## Output Interpretation
Lower is better. The less weight the execution overhead needs, the better.
Since the overhead weight is charged per extrinsic and per block, a larger weight results in fewer extrinsics per block.
Minimizing this is important to have a large transaction throughput.
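The effect on throughput can be illustrated with a deliberately simplified model. It assumes a single time budget per block and ignores everything else that limits a real block (block length, dispatch classes and so on); the function `max_extrinsics` and the budget used below are hypothetical.

```rust
/// Upper bound on the number of extrinsics that fit into one block under a
/// simplified model: a fixed time budget, a fixed per-block overhead and a
/// per-extrinsic cost of base overhead plus the weight of the call itself.
pub fn max_extrinsics(
    block_budget_ns: u64,
    block_overhead_ns: u64,
    extrinsic_base_ns: u64,
    call_ns: u64,
) -> u64 {
    let per_extrinsic = extrinsic_base_ns.saturating_add(call_ns);
    if per_extrinsic == 0 {
        return 0;
    }
    block_budget_ns.saturating_sub(block_overhead_ns) / per_extrinsic
}
```

With an assumed budget of one second and the overheads from the example files, `max_extrinsics(1_000_000_000, 3_532_484, 67_745, 0)` is 14709 NO-OP extrinsics. Doubling the extrinsic base weight roughly halves that number.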
## Arguments
- `--chain` / `--dev` Set the chain specification.
- `--weight-path` Set the output directory or file to write the weights to.
- `--repeat` Set the repetitions of both benchmarks.
- `--warmup` Set the rounds of warmup before measuring.
- `--execution` Should be set to `wasm` for correct results.
- `--wasm-execution` Should be set to `compiled` for correct results.
- [`--mul`](../shared/README.md#arguments)
- [`--add`](../shared/README.md#arguments)
- [`--metric`](../shared/README.md#arguments)
- [`--weight-path`](../shared/README.md#arguments)
License: Apache-2.0
<!-- LINKS -->
[`ExtrinsicBaseWeight`]: https://github.com/paritytech/substrate/blob/580ebae17fa30082604f1c9720f6f4a1cfe95b50/frame/support/src/weights/extrinsic_weights.rs#L26
[`BlockExecutionWeight`]: https://github.com/paritytech/substrate/blob/580ebae17fa30082604f1c9720f6f4a1cfe95b50/frame/support/src/weights/block_weights.rs#L26
[System::Remark]: https://github.com/paritytech/substrate/blob/580ebae17fa30082604f1c9720f6f4a1cfe95b50/frame/system/src/lib.rs#L382
@@ -43,11 +43,11 @@ use crate::shared::Stats;
#[derive(Debug, Default, Serialize, Clone, PartialEq, Args)]
pub struct BenchmarkParams {
/// Rounds of warmups before measuring.
- #[clap(long, default_value = "100")]
+ #[clap(long, default_value = "10")]
pub warmup: u32,
/// How many times the benchmark should be repeated.
- #[clap(long, default_value = "1000")]
+ #[clap(long, default_value = "100")]
pub repeat: u32,
/// Maximal number of extrinsics that should be put into a block.
@@ -0,0 +1,3 @@
The pallet command is explained in [frame/benchmarking](../../../../../frame/benchmarking/README.md).
License: Apache-2.0
@@ -0,0 +1,15 @@
# Shared code
Contains code that is shared among multiple sub-commands.
## Arguments
- `--mul` Multiply the result with a factor. Can be used to manually adjust for future chain growth.
- `--add` Add a value to the result. Can be used to manually offset the results.
- `--metric` Set the metric to use for calculating the final weight from the raw data. Defaults to `average`.
- `--weight-path` Set the file or directory to write the weight files to.
- `--db` The database backend to use. This depends on your snapshot.
- `--pruning` Set the pruning mode of the node. Some benchmarks require you to set this to `archive`.
- `--base-path` The location on the disk that should be used for the benchmarks. You can try this on different disks or even on a mounted RAM-disk. To get correct results, it is important to use the same location that will later be used to store the chain data.
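The generated weight files describe the result as "multiplying the *Average* with `1` and adding `0`". A minimal sketch of that adjustment, assuming the value selected by `--metric` is multiplied by `--mul`, truncated to an integer and then increased by `--add` (the function name and the truncation are assumptions):

```rust
/// Applies the `--mul` factor and the `--add` offset to the value that was
/// selected by `--metric` (for example the average), in nanoseconds.
pub fn adjust(metric_ns: u64, mul: f64, add: u64) -> u64 {
    // Truncating the product is an assumption of this sketch.
    let scaled = (metric_ns as f64 * mul) as u64;
    scaled.saturating_add(add)
}
```

With the defaults from the example files, `adjust(3_532_484, 1.0, 0)` returns the average unchanged. A factor such as `1.1` adds a 10% safety margin, for example `adjust(14_262, 1.1, 0)` gives 15688.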
License: Apache-2.0
@@ -0,0 +1,105 @@
# The `benchmark storage` command
The cost of storage operations in a Substrate chain depends on the current chain state.
It is therefore important to regularly update these weights as the chain grows.
This sub-command measures the cost of storage operations for a concrete snapshot.
For the Substrate node it looks like this (for debugging you can use `--release`):
```sh
cargo run --profile=production -- benchmark storage --dev --state-version=1
```
Running the command on Substrate itself is not very meaningful, since the genesis state of the `--dev` chain spec is used.
The output for the Polkadot client with a recent chain snapshot will give you a better impression. A recent snapshot can be downloaded from [Polkachu].
Then run (remove the `--db=paritydb` if you have a RocksDB snapshot):
```sh
cargo run --profile=production -- benchmark storage --dev --state-version=0 --db=paritydb --weight-path runtime/polkadot/constants/src/weights
```
This takes a while since it reads and writes all keys from the snapshot:
```pre
# The 'read' benchmark
Preparing keys from block BlockId::Number(9939462)
Reading 1379083 keys
Time summary [ns]:
Total: 19668919930
Min: 6450, Max: 1217259
Average: 14262, Median: 14190, Stddev: 3035.79
Percentiles 99th, 95th, 75th: 18270, 16190, 14819
Value size summary:
Total: 265702275
Min: 1, Max: 1381859
Average: 192, Median: 80, Stddev: 3427.53
Percentiles 99th, 95th, 75th: 3368, 383, 80
# The 'write' benchmark
Preparing keys from block BlockId::Number(9939462)
Writing 1379083 keys
Time summary [ns]:
Total: 98393809781
Min: 12969, Max: 13282577
Average: 71347, Median: 69499, Stddev: 25145.27
Percentiles 99th, 95th, 75th: 135839, 106129, 79239
Value size summary:
Total: 265702275
Min: 1, Max: 1381859
Average: 192, Median: 80, Stddev: 3427.53
Percentiles 99th, 95th, 75th: 3368, 383, 80
Writing weights to "paritydb_weights.rs"
```
You will see that the [paritydb_weights.rs] file was modified and now contains new weights.
The exact command for Polkadot can be seen at the top of the file.
This uses the most recent block from your snapshot which is printed at the top.
The value size summary tells us that the pruned Polkadot chain state is ~253 MiB in size.
Reading a value takes on average (in this example) 14.3 µs and writing one takes 71.3 µs.
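These numbers can be re-derived from the summary output. The sketch below assumes that the printed *Average* is the *Total* divided by the number of keys; the helper names are hypothetical.

```rust
/// Average time per key; `None` if no keys were measured.
pub fn average_ns(total_ns: u64, keys: u64) -> Option<u64> {
    total_ns.checked_div(keys)
}

/// Converts a size in bytes to MiB.
pub fn bytes_to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}

/// Converts nanoseconds to microseconds.
pub fn ns_to_micros(ns: u64) -> f64 {
    ns as f64 / 1_000.0
}
```

For the output above, `average_ns(19_668_919_930, 1_379_083)` is `Some(14_262)` for reads and `average_ns(98_393_809_781, 1_379_083)` is `Some(71_347)` for writes, while `bytes_to_mib(265_702_275)` is roughly 253.4.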
The interesting part in the generated weight file tells us the weight constants and some statistics about the measurements:
```rust
/// Time to read one storage item.
/// Calculated by multiplying the *Average* of all values with `1.1` and adding `0`.
///
/// Stats [NS]:
/// Min, Max: 4_611, 1_217_259
/// Average: 14_262
/// Median: 14_190
/// Std-Dev: 3035.79
///
/// Percentiles [NS]:
/// 99th: 18_270
/// 95th: 16_190
/// 75th: 14_819
read: 14_262 * constants::WEIGHT_PER_NANOS,
/// Time to write one storage item.
/// Calculated by multiplying the *Average* of all values with `1.1` and adding `0`.
///
/// Stats [NS]:
/// Min, Max: 12_969, 13_282_577
/// Average: 71_347
/// Median: 69_499
/// Std-Dev: 25145.27
///
/// Percentiles [NS]:
/// 99th: 135_839
/// 95th: 106_129
/// 75th: 79_239
write: 71_347 * constants::WEIGHT_PER_NANOS,
```
## Arguments
- `--db` Specify which database backend to use. This greatly influences the results.
- `--state-version` Set the version of the state encoding that this snapshot uses. Should be set to `1` for Substrate `--dev` and `0` for Polkadot et al. Using the wrong version can corrupt the snapshot.
- [`--mul`](../shared/README.md#arguments)
- [`--add`](../shared/README.md#arguments)
- [`--metric`](../shared/README.md#arguments)
- [`--weight-path`](../shared/README.md#arguments)
- `--json-read-path` Write the raw 'read' results to this file or directory.
- `--json-write-path` Write the raw 'write' results to this file or directory.
License: Apache-2.0
<!-- LINKS -->
[Polkachu]: https://polkachu.com/snapshots
[paritydb_weights.rs]: https://github.com/paritytech/polkadot/blob/c254e5975711a6497af256f6831e9a6c752d28f5/runtime/polkadot/constants/src/weights/paritydb_weights.rs#L60