If an approval was in progress, we didn't actually restart it, so we
ended up in a situation where we distributed our assignment but did not
distribute any approval.
---------
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
This MR is the merge of
https://github.com/paritytech/substrate/pull/14414 and
https://github.com/paritytech/substrate/pull/14275. It implements
[RFC#13](https://github.com/polkadot-fellows/RFCs/pull/13), closes
https://github.com/paritytech/polkadot-sdk/issues/198.
-----
This merge request introduces three major topics:
1. Multi-Block-Migrations
1. New pallet `poll` hook for periodic service work
1. Replacement hooks for `on_initialize` and `on_finalize` in cases
where `poll` cannot be used
and some more general changes to FRAME.
The changes for each topic span multiple crates. They are listed
in topic order below.
# 1.) Multi-Block-Migrations
Multi-Block-Migrations are facilitated by creating `pallet_migrations`
and configuring `System::Config::MultiBlockMigrator` to point to it.
Executive picks this up and triggers one step of the migrations pallet
per block.
The chain is in lockdown mode for as long as an MBM is ongoing.
Executive does this by polling `MultiBlockMigrator::ongoing` and not
allowing any transactions in a block if that returns true.
An MBM is defined through the trait `SteppedMigration`. A condensed
version looks like this:
```rust
/// A migration that can proceed in multiple steps.
pub trait SteppedMigration {
	type Cursor: FullCodec + MaxEncodedLen;
	type Identifier: FullCodec + MaxEncodedLen;

	fn id() -> Self::Identifier;

	fn max_steps() -> Option<u32>;

	fn step(
		cursor: Option<Self::Cursor>,
		meter: &mut WeightMeter,
	) -> Result<Option<Self::Cursor>, SteppedMigrationError>;
}
```
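To make the trait concrete, here is a hedged sketch of what a step-wise migration could look like against the condensed trait above (the storage items, weight constant and identifier below are illustrative assumptions, not code from this PR):

```rust
/// Hypothetical migration that moves entries from an `old::Values` map into
/// `new::Values`, doing a weight-bounded batch per block. `old`, `new` and
/// `ITEM_WEIGHT` are assumed helpers for the sake of the example.
pub struct MoveValuesMigration;

impl SteppedMigration for MoveValuesMigration {
	// The cursor persists progress between blocks; here it is the next key to migrate.
	type Cursor = u32;
	type Identifier = [u8; 16];

	fn id() -> Self::Identifier {
		*b"move-values-v1.0"
	}

	fn max_steps() -> Option<u32> {
		// Give up after 100 blocks; `FailedMigrationHandler::failed` then decides what to do.
		Some(100)
	}

	fn step(
		cursor: Option<Self::Cursor>,
		meter: &mut WeightMeter,
	) -> Result<Option<Self::Cursor>, SteppedMigrationError> {
		let mut next = cursor.unwrap_or(0);
		// Keep migrating while this block still has enough weight left for one more item.
		while meter.try_consume(ITEM_WEIGHT).is_ok() {
			match old::Values::take(next) {
				Some(value) => new::Values::insert(next, value),
				// Nothing left to move: returning `None` marks the migration as complete.
				None => return Ok(None),
			}
			next = next.saturating_add(1);
		}
		// Out of weight for this block; resume from `next` in the next block.
		Ok(Some(next))
	}
}
```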
`pallet_migrations` can be configured with an aggregated tuple of these
migrations. It then starts to migrate them one-by-one on the next
runtime upgrade.
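A hedged sketch of how this wiring could look (the exact `Config` item names are assumptions based on the description in this section):

```rust
// The migrations pallet receives the aggregated tuple of stepped migrations,
// and `frame_system` points at that pallet as its multi-block migrator.
impl pallet_migrations::Config for Runtime {
	type Migrations = (MoveValuesMigration, AnotherSteppedMigration);
	// ... remaining config items elided ...
}

impl frame_system::Config for Runtime {
	type MultiBlockMigrator = Migrations; // the `pallet_migrations` instance
	// ... remaining config items elided ...
}
```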
Two things are important here:
- 1. Doing another runtime upgrade while MBMs are ongoing is not a good
idea and can lead to messed up state.
- 2. **Pallet Migrations MUST BE CONFIGURED IN `System::Config`,
otherwise it is not used.**
The pallet supports an `UpgradeStatusHandler` that can be used to notify
external logic of upgrade start/finish (for example to pause XCM
dispatch).
Error recovery is very limited in the case that a migration errors or
times out (exceeds its `max_steps`). Currently the runtime dev can
decide in `FailedMigrationHandler::failed` how to handle this. One
follow-up would be to pair this with the `SafeMode` pallet and enact
safe mode when an upgrade fails, to allow governance to rescue the
chain. This is currently not possible, since governance is not
`Mandatory`.
## Runtime API
- `Core`: `initialize_block` now returns `ExtrinsicInclusionMode` to
inform the Block Author whether they can push transactions.
### Integration
Add it to your runtime implementation of `Core` and `BlockBuilder`:
```patch
diff --git a/runtime/src/lib.rs b/runtime/src/lib.rs
@@ impl_runtime_apis! {
 	impl sp_api::Core<Block> for Runtime {
-		fn initialize_block(header: &<Block as BlockT>::Header) {
+		fn initialize_block(header: &<Block as BlockT>::Header) -> ExtrinsicInclusionMode {
 			Executive::initialize_block(header)
 		}
 		...
 	}
```
# 2.) `poll` hook
A new pallet hook is introduced: `poll`. `Poll` is intended to replace
most usage of `on_initialize`.
The reason for this is that any code that can be called from
`on_initialize` cannot be migrated through an MBM. Currently there is no
way to statically check this; the implication is to use `on_initialize`
as rarely as possible.
Failing to do so can result in broken storage invariants.
The implementation of the poll hook depends on the `Runtime API` changes
that are explained above.
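As a hedged example of the intended usage (the hook signature is assumed here to take a weight meter, and `service_queues` is a hypothetical helper):

```rust
#[pallet::hooks]
impl<T: Config> Hooks<BlockNumberFor<T>> for Pallet<T> {
	// Periodic, best-effort service work goes into `poll` instead of
	// `on_initialize`, so it can be skipped while an MBM is ongoing.
	fn poll(_n: BlockNumberFor<T>, weight: &mut WeightMeter) {
		// Only do the work if there is enough weight left in this block.
		if weight.try_consume(T::WeightInfo::service_queues()).is_ok() {
			Self::service_queues();
		}
	}
}
```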
# 3.) Hard-Deadline callbacks
Three new callbacks are introduced and configured on `System::Config`:
`PreInherents`, `PostInherents` and `PostTransactions`.
These hooks are meant as replacement for `on_initialize` and
`on_finalize` in cases where the code that runs cannot be moved to
`poll`.
The reason for this is to make the usage of HD-code (hard deadline) more
explicit - again to prevent broken invariants by MBMs.
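A hedged sketch of such a callback, assuming each hook boils down to a simple no-argument trait method (the trait shape is an assumption here):

```rust
/// Hypothetical hard-deadline callback: work that truly must run before the
/// inherents of every block is made explicit instead of hiding in `on_initialize`.
pub struct RecordBlockStart;

impl PreInherents for RecordBlockStart {
	fn pre_inherents() {
		// Hard-deadline work goes here. It runs unconditionally each block, so it
		// must be careful about storage that an MBM may still be migrating.
	}
}
```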
# 4.) FRAME (general changes)
## `frame_system` pallet
A new storage item `InherentsApplied` is added. It is used by
Executive to track whether inherents have already been applied.
Executive can then execute the MBMs directly between inherents and
transactions.
The `Config` gets five new items:
- `SingleBlockMigrations`: the new way of configuring migrations that
run in a single block. Previously they were defined as the last generic
argument of `Executive`. This shift brings all central configuration
about migrations closer into view of the developer (migrations that are
configured in `Executive` will still work for now, but this is
deprecated).
- `MultiBlockMigrator`: can be configured to an engine that drives
MBMs. One example would be `pallet_migrations`. Note that this is
only the engine; the exact MBMs are injected into the engine.
- `PreInherents`: a callback that executes after `on_initialize` but
before inherents.
- `PostInherents`: a callback that executes after all inherents ran
(including MBMs and `poll`).
- `PostTransactions`: in symmetry to `PreInherents`, this one is called
before `on_finalize` but after all transactions.
A sane default is to set all of these to `()`. Example diff suitable for
any chain:
```patch
@@ impl frame_system::Config for Test {
type MaxConsumers = ConstU32<16>;
+ type SingleBlockMigrations = ();
+ type MultiBlockMigrator = ();
+ type PreInherents = ();
+ type PostInherents = ();
+ type PostTransactions = ();
}
```
An overview of how block execution now looks is in the rust docs; see
the "Block Execution Flow" graph there.
## Inherent Order
Moved to https://github.com/paritytech/polkadot-sdk/pull/2154
---------------
## TODO
- [ ] Check that `try-runtime` still works
- [ ] Ensure backwards compatibility with old Runtime APIs
- [x] Consume weight correctly
- [x] Cleanup
---------
Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
Co-authored-by: Liam Aharon <liam.aharon@hotmail.com>
Co-authored-by: Juan Girini <juangirini@gmail.com>
Co-authored-by: command-bot <>
Co-authored-by: Francisco Aguirre <franciscoaguirreperez@gmail.com>
Co-authored-by: Gavin Wood <gavin@parity.io>
Co-authored-by: Bastian Köcher <git@kchr.de>
Hey everyone,
this PR will replace the existing Polkadotters bootnodes for Polkadot,
Kusama and Westend and add a Paseo bootnode to the relay chain suite. At
the same time, it will add new bootnodes for all the system parachains,
including People on Westend. This PR is a part of our membership in the
IBP, meaning that all the bootnodes are hosted on our hardware housed in
the data center in Christchurch, New Zealand.
All the bootnodes were tested with an empty chain spec file, with each
of the following commands yielding 1 peer.
The test commands used are as follows:
```
./polkadot --base-path /tmp/node --reserved-only --chain paseo --reserved-nodes "/dns/paseo.bootnodes.polkadotters.com/tcp/30540/wss/p2p/12D3KooWPbbFy4TefEGTRF5eTYhq8LEzc4VAHdNUVCbY4nAnhqPP"
./polkadot --base-path /tmp/node --reserved-only --chain westend --reserved-nodes "/dns/westend.bootnodes.polkadotters.com/tcp/30310/wss/p2p/12D3KooWHPHb64jXMtSRJDrYFATWeLnvChL8NtWVttY67DCH1eC5"
./polkadot --base-path /tmp/node --reserved-only --chain kusama --reserved-nodes "/dns/kusama.bootnodes.polkadotters.com/tcp/30313/wss/p2p/12D3KooWHB5rTeNkQdXNJ9ynvGz8Lpnmsctt7Tvp7mrYv6bcwbPG"
./polkadot --base-path /tmp/node --no-hardware-benchmarks --reserved-only --chain polkadot --reserved-nodes "/dns/polkadot.bootnodes.polkadotters.com/tcp/30316/wss/p2p/12D3KooWPAVUgBaBk6n8SztLrMk8ESByncbAfRKUdxY1nygb9zG3"
./polkadot-parachain --base-path /tmp/node --reserved-only --chain asset-hub-kusama --reserved-nodes "/dns/asset-hub-kusama.bootnodes.polkadotters.com/tcp/30513/wss/p2p/12D3KooWDpk7wVH7RgjErEvbvAZ2kY5VeaAwRJP5ojmn1e8b8UbU"
./polkadot-parachain --base-path /tmp/node --reserved-only --chain asset-hub-polkadot --reserved-nodes "/dns/asset-hub-polkadot.bootnodes.polkadotters.com/tcp/30510/wss/p2p/12D3KooWKbfY9a9oywxMJKiALmt7yhrdQkjXMtvxhhDDN23vG93R"
./polkadot-parachain --base-path /tmp/node --reserved-only --chain asset-hub-westend --reserved-nodes "/dns/asset-hub-westend.bootnodes.polkadotters.com/tcp/30516/wss/p2p/12D3KooWNFYysCqmojxqjjaTfD2VkWBNngfyUKWjcR4WFixfHNTk"
./polkadot-parachain --base-path /tmp/node --reserved-only --chain bridge-hub-kusama --reserved-nodes "/dns/bridge-hub-kusama.bootnodes.polkadotters.com/tcp/30522/wss/p2p/12D3KooWH3pucezRRS5esoYyzZsUkKWcPSByQxEvmM819QL1HPLV"
./polkadot-parachain --base-path /tmp/node --reserved-only --chain bridge-hub-westend --reserved-nodes "/dns/bridge-hub-westend.bootnodes.polkadotters.com/tcp/30525/wss/p2p/12D3KooWPkwgJofp4GeeRwNgXqkp2aFwdLkCWv3qodpBJLwK43Jj"
./polkadot-parachain --base-path /tmp/node --reserved-only --chain collectives-polkadot --reserved-nodes "/dns/collectives-polkadot.bootnodes.polkadotters.com/tcp/30528/wss/p2p/12D3KooWNohUjvJtGKUa8Vhy8C1ZBB5N8JATB6e7rdLVCioeb3ff"
./polkadot-parachain --base-path /tmp/node --reserved-only --chain collectives-westend --reserved-nodes "/dns/collectives-westend.bootnodes.polkadotters.com/tcp/30531/wss/p2p/12D3KooWAFkXNSBfyPduZVgfS7pj5NuVpbU8Ee5gHeF8wvos7Yqn"
./polkadot-parachain --base-path /tmp/node --reserved-only --chain people-westend --reserved-nodes "/dns/identity-westend.bootnodes.polkadotters.com/tcp/30534/wss/p2p/12D3KooWKr9San6KTM7REJ95cBaDoiciGcWnW8TTftEJgxGF5Ehb"
```
Best regards,
Petr, Polkadotters
Add more debug logs to understand if statement-distribution is in a bad
state; this should be useful for debugging
https://github.com/paritytech/polkadot-sdk/issues/3314 on production
networks.
Additionally, increase the number of parallel requests the subsystem
should make, since I noticed that requests take around 100ms on Kusama,
and the previous limit of 5 parallel requests was picked mostly at
random; there is no reason why we can't do more than that.
---------
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: ordian <write@reusable.software>
Fixes https://github.com/paritytech/polkadot-sdk/issues/3144
Builds on top of https://github.com/paritytech/polkadot-sdk/pull/3229
### Summary
Some preparations for the Runtime to support elastic scaling, guarded by
the `FeatureIndex::ElasticScalingMVP` node features config bit. This PR
introduces a per-candidate `CoreIndex`, but does it in a hacky way to
avoid changing the `CandidateCommitments` and `CandidateReceipt`
primitives and networking protocols.
#### Including `CoreIndex` in `BackedCandidate`
If the `ElasticScalingMVP` feature bit is enabled then
`BackedCandidate::validator_indices` is extended by 8 bits.
The value stored in these bits represents the assumed core index for the
candidate.
It is a temporary solution that works by creating a mapping from
`BackedCandidate` to `CoreIndex`, assuming the `CoreIndex` can be
discovered by checking which validator group the validator that
signed the statement belongs to.
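A minimal sketch of the two ideas above, using plain types instead of the real primitives: appending the core index to the `validator_indices` bits and recovering the group (and thus core) from the signing validator:

```rust
/// Append an 8-bit core index to the candidate's `validator_indices` bitfield
/// (sketch only; the real code works on the actual bitvec type, not `Vec<bool>`).
fn inject_core_index(validator_indices: &mut Vec<bool>, core_index: u8) {
	for i in 0..8 {
		validator_indices.push((core_index >> i) & 1 == 1);
	}
}

/// Split the trailing 8 bits back out and decode them as the assumed core index.
fn extract_core_index(validator_indices: &[bool]) -> Option<(u8, &[bool])> {
	if validator_indices.len() < 8 {
		return None;
	}
	let (prefix, tail) = validator_indices.split_at(validator_indices.len() - 8);
	let core = tail
		.iter()
		.enumerate()
		.fold(0u8, |acc, (i, bit)| acc | ((*bit as u8) << i));
	Some((core, prefix))
}

/// Fallback mapping: the assumed group/core is whichever validator group contains
/// the validator index that signed the statement.
fn group_for_signer(signer: u32, groups: &[Vec<u32>]) -> Option<usize> {
	groups.iter().position(|group| group.contains(&signer))
}
```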
TODO:
- [x] fix tests
- [x] add new tests
- [x] Bump runtime API for Kusama, so we have that node features thing!
-> https://github.com/polkadot-fellows/runtimes/pull/194
---------
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Signed-off-by: alindima <alin@parity.io>
Co-authored-by: alindima <alin@parity.io>
The `fee` should be calculated with the reanchored asset, otherwise it
could lead to a failure where the fee that was set aside ends up not
being enough.
@acatangiu
---------
Co-authored-by: Adrian Catangiu <adrian@parity.io>
This introduces a check to ensure that the parachain code matches the
validation code stored in the relay chain state. If not, it will print a
warning. This should be mainly useful for parachain builders to make
sure they have set up everything correctly.
First step in implementing
https://github.com/paritytech/polkadot-sdk/issues/3144
### Summary of changes
- switch statement `Table` candidate mapping from `ParaId` to
`CoreIndex`
- introduce experimental `InjectCoreIndex` node feature.
- determine and assume a `CoreIndex` for a candidate based on the
statement validator index. If the signature is valid, it means the
validator controls that validator index and we can easily map it to a
validator group/core.
- introduce a temporary provisioner fix until we fully enable elastic
scaling in the subsystem. The fix ensures we don't fetch the same
backable candidate when calling `get_backable_candidate` for each core.
TODO:
- [x] fix backing tests
- [x] fix statement table tests
- [x] add new test
---------
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Signed-off-by: alindima <alin@parity.io>
Co-authored-by: alindima <alin@parity.io>
The rationale behind this is that it may be useful for some users to
actually disable RPC batch requests, or to limit them by length instead
of by the total size in bytes of the batch.
This PR adds two new CLI options:
```
--rpc-disable-batch-requests - disable batch requests on the server
--rpc-max-batch-request-len <LEN> - limit batches to LEN on the server.
```
Lifting some more dependencies to the workspace. Just using the
most-often updated ones for now.
It can be reproduced locally.
```sh
# First you can check if there would be semver incompatible bumps (looks good in this case):
$ zepter transpose dependency lift-to-workspace --ignore-errors syn quote thiserror "regex:^serde.*"
# Then apply the changes:
$ zepter transpose dependency lift-to-workspace --version-resolver=highest syn quote thiserror "regex:^serde.*" --fix
# And format the changes:
$ taplo format --config .config/taplo.toml
```
---------
Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
~The previous fix was actually incomplete because we updated the
authorities only in the situation where we decided to reconnect because
we had a low-connectivity issue. Now the problem is that
update_authority_ids uses the list of connected peers, which on restart
does not contain anything, so calling it immediately after
issue_connection_request won't detect all authorities; we would need to
also check every block as the comment said, but that did not match the
code.~
Actually, the fix was correct. The flow is as follows: if more than 1/3
of the authorities cannot be resolved, we set last_failure and call
`ConnectToResolvedValidators`.
We will call UpdateAuthorities for all the authorities that are already
connected and for which we already know the address; for the ones that
connect later, `PeerConnected` will have the AuthorityId field set,
because it is already known, so approval-distribution will update its
cached topology.
---------
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Alexander Samusev <41779041+alvicsam@users.noreply.github.com>
Add RPC server rate limiting, which can be enabled via the CLI flag
`--rpc-rate-limit <calls per minute>`.
Resolves first part of
https://github.com/paritytech/polkadot-sdk/issues/3028
//cc @PierreBesson @kogeler you might be interested in this one
---------
Co-authored-by: James Wilson <james@jsdw.me>
Co-authored-by: Xiliang Chen <xlchen1291@gmail.com>
Changes (partial https://github.com/paritytech/polkadot-sdk/issues/994):
- Set log to `0.4.20` everywhere
- Lift `log` to the workspace
Starting with a simpler one after seeing
https://github.com/paritytech/polkadot-sdk/pull/2065 from @jsdw.
This sets `default-features` to `false` in the root and then
overwrites that in each crate to its original value. This is necessary
since otherwise the `default` features are additive and it's impossible
to disable them again in the crate once they are enabled in the
workspace.
I am using a tool to do this, so it's mostly a test to see that it works
as expected.
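For context, the additive-features situation described above looks roughly like this in the manifests (a sketch with an illustrative dependency, not the exact diff):

```toml
# Workspace root Cargo.toml: declare the dependency once, with default features off.
[workspace.dependencies]
serde = { version = "1", default-features = false }

# Member crate Cargo.toml: re-enable defaults only where that crate needs them.
# Because features are additive, defaults enabled at the workspace level could
# never be turned off again per crate; hence `false` in the root.
[dependencies]
serde = { workspace = true, default-features = true }
```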
---------
Signed-off-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
With grid distribution, messages have two paths of reaching a node, so
there is the possibility of a race when two peers send each other the
same statement around the same time. The statement local_knowledge will
tell us that the peer should not have sent the statement because we
already sent it to them.
Fix it by also keeping track of the statements we received from a given
peer and penalizing it only if it sends us the same statement more than
once.
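A rough sketch of the idea with placeholder types (not the statement-distribution internals):

```rust
use std::collections::{HashMap, HashSet};

// Placeholders standing in for the real network and statement types.
type PeerId = u64;
type Fingerprint = [u8; 32];

/// Track which statements each peer has actually sent us, so a statement that
/// legitimately arrives via both grid paths is only punished if the *same* peer
/// sends it twice.
#[derive(Default)]
struct PerPeerReceipts {
	received: HashMap<PeerId, HashSet<Fingerprint>>,
}

impl PerPeerReceipts {
	/// Returns `true` if this peer already sent us this exact statement (a duplicate).
	fn note_received(&mut self, peer: PeerId, fingerprint: Fingerprint) -> bool {
		!self.received.entry(peer).or_default().insert(fingerprint)
	}
}
```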
Fixes: https://github.com/paritytech/polkadot-sdk/issues/2346
Additionally, also use different Cost labels for different paths to make
it easier to debug things.
---------
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
This PR removes the configuration of subsystem benchmarks via CLI
arguments. After this, we only keep configurations in yaml files. It
removes unnecessary code duplication.
1. Benchmark results are collected in a single struct.
2. The output of the results is prettified.
3. The result struct is used to save the output as yaml and store it as
an artifact in a CI job.
```
$ cargo run -p polkadot-subsystem-bench --release -- test-sequence --path polkadot/node/subsystem-bench/examples/availability_read.yaml | tee output.txt
$ cat output.txt
polkadot/node/subsystem-bench/examples/availability_read.yaml #1

Network usage, KiB        total        per block
Received from peers       510796.000   170265.333
Sent to peers             221.000      73.667

CPU usage, s              total        per block
availability-recovery     38.671       12.890
Test environment          0.255        0.085

polkadot/node/subsystem-bench/examples/availability_read.yaml #2

Network usage, KiB        total        per block
Received from peers       413633.000   137877.667
Sent to peers             353.000      117.667

CPU usage, s              total        per block
availability-recovery     52.630       17.543
Test environment          0.271        0.090

polkadot/node/subsystem-bench/examples/availability_read.yaml #3

Network usage, KiB        total        per block
Received from peers       424379.000   141459.667
Sent to peers             703.000      234.333

CPU usage, s              total        per block
availability-recovery     51.128       17.043
Test environment          0.502        0.167
```
```
$ cargo run -p polkadot-subsystem-bench --release -- --ci test-sequence --path polkadot/node/subsystem-bench/examples/availability_read.yaml | tee output.txt
$ cat output.txt
- benchmark_name: 'polkadot/node/subsystem-bench/examples/availability_read.yaml #1'
  network:
    - resource: Received from peers
      total: 509011.0
      per_block: 169670.33333333334
    - resource: Sent to peers
      total: 220.0
      per_block: 73.33333333333333
  cpu:
    - resource: availability-recovery
      total: 31.845848445
      per_block: 10.615282815
    - resource: Test environment
      total: 0.23582828799999941
      per_block: 0.07860942933333313
- benchmark_name: 'polkadot/node/subsystem-bench/examples/availability_read.yaml #2'
  network:
    - resource: Received from peers
      total: 411738.0
      per_block: 137246.0
    - resource: Sent to peers
      total: 351.0
      per_block: 117.0
  cpu:
    - resource: availability-recovery
      total: 18.93596025099999
      per_block: 6.31198675033333
    - resource: Test environment
      total: 0.2541994199999979
      per_block: 0.0847331399999993
- benchmark_name: 'polkadot/node/subsystem-bench/examples/availability_read.yaml #3'
  network:
    - resource: Received from peers
      total: 424548.0
      per_block: 141516.0
    - resource: Sent to peers
      total: 703.0
      per_block: 234.33333333333334
  cpu:
    - resource: availability-recovery
      total: 16.54178526900001
      per_block: 5.513928423000003
    - resource: Test environment
      total: 0.43960946299999537
      per_block: 0.14653648766666513
```
---------
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
## Summary
Built on top of the tooling and ideas introduced in
https://github.com/paritytech/polkadot-sdk/pull/2528, this PR introduces
a synthetic benchmark for measuring and assessing the performance
characteristics of the approval-voting and approval-distribution
subsystems.
Currently this allows us to simulate the behaviour of these subsystems
based on the following dimensions:
```
TestConfiguration:
# Test 1
- objective: !ApprovalsTest
    last_considered_tranche: 89
    min_coalesce: 1
    max_coalesce: 6
    enable_assignments_v2: true
    send_till_tranche: 60
    stop_when_approved: false
    coalesce_tranche_diff: 12
    workdir_prefix: "/tmp"
    num_no_shows_per_candidate: 0
    approval_distribution_expected_tof: 6.0
    approval_distribution_cpu_ms: 3.0
    approval_voting_cpu_ms: 4.30
  n_validators: 500
  n_cores: 100
  n_included_candidates: 100
  min_pov_size: 1120
  max_pov_size: 5120
  peer_bandwidth: 524288000000
  bandwidth: 524288000000
  latency:
    min_latency:
      secs: 0
      nanos: 1000000
    max_latency:
      secs: 0
      nanos: 100000000
  error: 0
  num_blocks: 10
```
## The approach
1. We build a real overseer with the real implementations of the
approval-voting and approval-distribution subsystems.
2. For a given network size, for each validator we pre-compute all
potential assignments and approvals it would send. Because this is a
computation-heavy operation, the result is cached in a file on disk and
re-used if the generation parameters don't change.
3. The messages are then sent according to the configured parameters;
they are split into 3 main benchmarking scenarios.
## Benchmarking scenarios
### Best case scenario *approvals_throughput_best_case.yaml*
It sends to approval-distribution only the minimum required tranches
to gather the needed_approvals, so that a candidate is approved.
### Behaviour in the presence of no-shows *approvals_no_shows.yaml*
It sends the tranches needed to approve a candidate when we have a
maximum of *num_no_shows_per_candidate* tranches with no-shows for each
candidate.
### Maximum throughput *approvals_throughput.yaml*
It sends all the tranches for each block and measures the CPU and
network bandwidth used by the approval-voting and approval-distribution
subsystems.
## How to run it
```
cargo run -p polkadot-subsystem-bench --release -- test-sequence --path polkadot/node/subsystem-bench/examples/approvals_throughput.yaml
```
## Evaluating performance
### Use the real subsystems metrics
If you follow the steps in
https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-grafana
for installing Prometheus and Grafana locally, all real metrics for the
`approval-distribution`, `approval-voting` subsystems and the overseer
are available.
E.g:
<img width="2149" alt="Screenshot 2023-12-05 at 11 07 46"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/cb8ae2dd-178b-4922-bfa4-dc37e572ed38">
<img width="2551" alt="Screenshot 2023-12-05 at 11 09 42"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/8b4542ba-88b9-46f9-9b70-cc345366081b">
<img width="2154" alt="Screenshot 2023-12-05 at 11 10 15"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/b8874d8d-632e-443a-9840-14ad8e90c54f">
<img width="2535" alt="Screenshot 2023-12-05 at 11 10 52"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/779a439f-fd18-4985-bb80-85d5afad78e2">
### Profile with pyroscope
1. Set up pyroscope following the steps in
https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-pyroscope,
then run any of the benchmark scenarios with `--profile` as an
argument.
2. Open the pyroscope dashboard in grafana, e.g:
<img width="2544" alt="Screenshot 2024-01-09 at 17 09 58"
src="https://github.com/paritytech/polkadot-sdk/assets/49718502/58f50c99-a910-4d20-951a-8b16639303d9">
### Useful logs
1. Network bandwidth requirements:
```
Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
```
2. Cpu usage by the approval-distribution/approval-voting subsystems.
```
approval-distribution CPU usage 84.061s
approval-distribution CPU usage per block 8.406s
approval-voting CPU usage 96.532s
approval-voting CPU usage per block 9.653s
```
3. Time passed until a given block is approved
```
Chain selection approved after 3500 ms hash=0x0101010101010101010101010101010101010101010101010101010101010101
Chain selection approved after 4500 ms hash=0x0202020202020202020202020202020202020202020202020202020202020202
```
### Using the benchmark to quantify improvements from
https://github.com/paritytech/polkadot-sdk/pull/1178 +
https://github.com/paritytech/polkadot-sdk/pull/1191
Using a versi node, we compare the scenario where all new optimisations
are disabled against a scenario where tranche0 assignments are sent in a
single message, plus a conservative simulation where the coalescing of
approvals gives us just a 50% reduction in the number of messages we
send.
Overall, what we see is a speedup of around 30-40% in the time it takes
to process the necessary messages and a 30-40% reduction in the
necessary bandwidth.
#### Best case scenario comparison (minimum required tranches sent).
Unoptimised
```
Number of blocks: 10
Payload bytes received from peers: 53289 KiB total, 5328 KiB/block
Payload bytes sent to peers: 52489 KiB total, 5248 KiB/block
approval-distribution CPU usage 6.732s
approval-distribution CPU usage per block 0.673s
approval-voting CPU usage 9.523s
approval-voting CPU usage per block 0.952s
```
vs Optimisation enabled
```
Number of blocks: 10
Payload bytes received from peers: 32141 KiB total, 3214 KiB/block
Payload bytes sent to peers: 37314 KiB total, 3731 KiB/block
approval-distribution CPU usage 4.658s
approval-distribution CPU usage per block 0.466s
approval-voting CPU usage 6.236s
approval-voting CPU usage per block 0.624s
```
#### Worst case: all tranches sent (very unlikely; happens when sharding
breaks).
Unoptimised
```
Number of blocks: 10
Payload bytes received from peers: 746393 KiB total, 74639 KiB/block
Payload bytes sent to peers: 729151 KiB total, 72915 KiB/block
approval-distribution CPU usage 118.681s
approval-distribution CPU usage per block 11.868s
approval-voting CPU usage 124.118s
approval-voting CPU usage per block 12.412s
```
vs optimised
```
Number of blocks: 10
Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
approval-distribution CPU usage 84.061s
approval-distribution CPU usage per block 8.406s
approval-voting CPU usage 96.532s
approval-voting CPU usage per block 9.653s
```
## TODOs
- [x] Polish implementation.
- [x] Use what we have so far to evaluate
https://github.com/paritytech/polkadot-sdk/pull/1191 before merging.
- [x] List of features and additional dimensions we want to use for
benchmarking.
- [x] Run benchmark on hardware similar to versi and kusama nodes.
- [ ] Add benchmark to be run in CI for catching regressions in
performance.
- [ ] Rebase on latest changes for network emulation.
---------
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Andrei Sandu <andrei-mihail@parity.io>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
This change is mainly for people running the local variants. They can
directly start with async backing.
---------
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Candidate validation has a lot of CPU-bound operations on its main loop,
things like:
```
validation_code.hash()

sp_maybe_compressed_blob::decompress(
	&validation_code.0,
	VALIDATION_CODE_BOMB_LIMIT,
)

sp_maybe_compressed_blob::decompress(&pov.block_data.0, POV_BOMB_LIMIT)

let code_hash = sp_crypto_hashing::blake2_256(&code).into();
```
When you add all of that up, for a large PoV and code it is going to
take on the order of tens of milliseconds, and because these are
CPU-bound operations they are going to hog the executor thread and
negatively affect the other subsystems around it. It is therefore better
to move the subsystem onto the blocking pool to make sure such
unexpected behaviour is avoided.
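To illustrate the general principle (this is not the overseer's actual mechanism, just the standard pattern for keeping CPU-bound work off async executor threads):

```rust
// CPU-heavy work such as hashing or decompressing large blobs is offloaded,
// e.g. onto tokio's dedicated blocking pool, so other async tasks keep making progress.
async fn hash_code_off_executor(code: Vec<u8>) -> [u8; 32] {
	tokio::task::spawn_blocking(move || sp_crypto_hashing::blake2_256(&code))
		.await
		.expect("blocking task panicked")
}
```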
Note! In practice this subsystem does not have a lot of work to do, so
the impact is probably really low, but better safe than sorry.
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Currently, collators and the nodes running alongside them spin up a
full-scale overseer running a bunch of subsystems that are not needed if
the node
is not a validator. That was considered to be harmless; however, we've
got problems with unused subsystems getting stalled for a reason not
currently known, resulting in the overseer exiting and bringing down the
whole node.
This PR aims to only run needed subsystems on such nodes, replacing the
rest with `DummySubsystem`.
It also enables collator-optimized availability recovery subsystem
implementation.
Partially solves #1730.
Introduce a new test objective: `DataAvailabilityWrite`.
The new benchmark measures the network and CPU usage of the
`availability-distribution`, `bitfield-distribution` and
`availability-store` subsystems from the perspective of a validator node
during the process when candidates are made available.
Additionally, I refactored the networking emulation to support bandwidth
accounting and limits on incoming and outgoing requests.
Screenshot of a successful run:
<img width="1293" alt="Screenshot 2024-01-17 at 19 17 44"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/fde11280-e25b-4dc3-9dc9-d4b9752f9b7a">
---------
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Fixes: https://github.com/paritytech/polkadot-sdk/issues/2138.
Especially on restart, the AuthorityDiscovery cache is not populated, so
we create an invalid topology and messages won't be routed correctly for
the entire session. This PR proposes to fix this by updating the
topology as soon as we know the Authority/PeerId mapping, which should
improve the situation dramatically.
[This issue was hit
yesterday](https://grafana.teleport.parity.io/goto/o9q2625Sg?orgId=1)
on Westend and resulted in stalled finality.
# TODO
- [x] Unit tests
- [x] Test impact on versi
---------
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Found the issue while investigating the recent finality stall on Westend
after upgrading to 1.6.0. Approval distribution aggression is supposed
to trade off bandwidth and re-send assignments/approvals until enough
approvals are received by at least 2/3 of the validators. This is
supposed to be a catch-all mechanism for when network connectivity goes
south or many validators reboot at the same time.
This fix ensures that we always resend approvals starting with the first
unfinalized block even in the case when it appears approved from the
node's perspective.
TODO:
- [x] Versi test
---------
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
This is a rather big change in jsonrpsee, the major things in this bump
are:
- Server backpressure (the subscription impls are modified to deal with
that)
- Allow custom error types / return types (remove jsonrpsee::core::Error
and jsonrpsee::core::CallError)
- Bug fixes (graceful shutdown in particular not used by substrate
anyway)
- Less dependencies for the clients in particular
- Return type requires Clone in method call responses
- Moved to tokio channels
- Async subscription API (not used in this PR)
Major changes in this PR:
- The subscriptions are now bounded and if a subscription can't keep up
with the server it is dropped
- CLI: add a parameter to configure the JSON-RPC server's bounded
message buffer (default is 64)
- Add our own subscription helper to deal with the unbounded streams in
substrate
The most important thing to review in this PR is the added helper
functions in `substrate/client/rpc/src/utils.rs`; the rest is pretty
much a chore.
Regarding the "bounded buffer limit": it may cause the server to handle
the JSON-RPC calls slower than before.
The message size limit is bounded by `--rpc-response-size`, thus by
default 10MB * 64 = 640MB, but the subscription message size is not
covered by this limit and could be capped as well.
Hopefully the last release prior to 1.0, sorry in advance for a big PR
Previous attempt: https://github.com/paritytech/substrate/pull/13992
Resolves https://github.com/paritytech/polkadot-sdk/issues/748, resolves
https://github.com/paritytech/polkadot-sdk/issues/627
Step towards https://github.com/paritytech/polkadot-sdk/issues/1975
As reported in
https://github.com/paritytech/polkadot-sdk/issues/1975#issuecomment-1774534225,
I'd like to encapsulate crypto-related stuff in a dedicated folder.
Currently all cryptographic primitive wrappers are scattered across
`substrate/core`, which contains "misc core" stuff.
To simplify the process, as the first step with this PR I propose to
move the cryptographic hashing there.
The `substrate/crypto` folder was already created to contain the
`ec-utils` crate.
Notes:
- rename `sp-core-hashing` to `sp-crypto-hashing`
- rename `sp-core-hashing-proc-macro` to `sp-crypto-hashing-proc-macro`
- As the crate names changed, I took the liberty of restarting fresh
from version 0.1.0 for both crates
---------
Co-authored-by: Robert Hambrock <roberthambrock@gmail.com>