[litep2p](https://github.com/altonen/litep2p) is a libp2p-compatible P2P
networking library. It supports all of the features of `rust-libp2p`
that are currently being utilized by Polkadot SDK.
Compared to `rust-libp2p`, `litep2p` has a quite different architecture
which is why the new `litep2p` network backend is only able to use a
little of the existing code in `sc-network`. The design has been mainly
influenced by how we'd wish to structure our networking-related code in
Polkadot SDK: independent higher-levels protocols directly communicating
with the network over links that support bidirectional backpressure. A
good example would be `NotificationHandle`/`RequestResponseHandle`
abstractions which allow, e.g., `SyncingEngine` to directly communicate
with peers to announce/request blocks.
I've tried running `polkadot --network-backend litep2p` with a few
different peer configurations and there is a noticeable reduction in
networking CPU usage. For high load (`--out-peers 200`), networking CPU
usage goes down from ~110% to ~30% (80 pp) and for normal load
(`--out-peers 40`), the usage goes down from ~55% to ~18% (37 pp).
These should not be taken as final numbers because:
a) there are still some low-hanging optimization fruits, such as
enabling [receive window
auto-tuning](https://github.com/libp2p/rust-yamux/pull/176), integrating
`Peerset` more closely with `litep2p` or improving memory usage of the
WebSocket transport
b) fixing bugs/instabilities that incorrectly cause `litep2p` to do less
work will increase the networking CPU usage
c) verification in a more diverse set of tests/conditions is needed
Nevertheless, these numbers should give an early estimate for CPU usage
of the new networking backend.
This PR consists of three separate changes:
* introduce a generic `PeerId` (wrapper around `Multihash`) so that we
don't have use `NetworkService::PeerId` in every part of the code that
uses a `PeerId`
* introduce `NetworkBackend` trait, implement it for the libp2p network
stack and make Polkadot SDK generic over `NetworkBackend`
* implement `NetworkBackend` for litep2p
The new library should be considered experimental which is why
`rust-libp2p` will remain as the default option for the time being. This
PR currently depends on the master branch of `litep2p` but I'll cut a
new release for the library once all review comments have been
addresses.
---------
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Co-authored-by: Alexandru Vasile <alexandru.vasile@parity.io>
**Update:** Pushed additional changes based on the review comments.
**This pull request fixes various spelling mistakes in this
repository.**
Most of the changes are contained in the first **3** commits:
- `Fix spelling mistakes in comments and docs`
- `Fix spelling mistakes in test names`
- `Fix spelling mistakes in error messages, panic messages, logs and
tracing`
Other source code spelling mistakes are separated into individual
commits for easier reviewing:
- `Fix the spelling of 'authority'`
- `Fix the spelling of 'REASONABLE_HEADERS_IN_JUSTIFICATION_ANCESTRY'`
- `Fix the spelling of 'prev_enqueud_messages'`
- `Fix the spelling of 'endpoint'`
- `Fix the spelling of 'children'`
- `Fix the spelling of 'PenpalSiblingSovereignAccount'`
- `Fix the spelling of 'PenpalSudoAccount'`
- `Fix the spelling of 'insufficient'`
- `Fix the spelling of 'PalletXcmExtrinsicsBenchmark'`
- `Fix the spelling of 'subtracted'`
- `Fix the spelling of 'CandidatePendingAvailability'`
- `Fix the spelling of 'exclusive'`
- `Fix the spelling of 'until'`
- `Fix the spelling of 'discriminator'`
- `Fix the spelling of 'nonexistent'`
- `Fix the spelling of 'subsystem'`
- `Fix the spelling of 'indices'`
- `Fix the spelling of 'committed'`
- `Fix the spelling of 'topology'`
- `Fix the spelling of 'response'`
- `Fix the spelling of 'beneficiary'`
- `Fix the spelling of 'formatted'`
- `Fix the spelling of 'UNKNOWN_PROOF_REQUEST'`
- `Fix the spelling of 'succeeded'`
- `Fix the spelling of 'reopened'`
- `Fix the spelling of 'proposer'`
- `Fix the spelling of 'InstantiationNonce'`
- `Fix the spelling of 'depositor'`
- `Fix the spelling of 'expiration'`
- `Fix the spelling of 'phantom'`
- `Fix the spelling of 'AggregatedKeyValue'`
- `Fix the spelling of 'randomness'`
- `Fix the spelling of 'defendant'`
- `Fix the spelling of 'AquaticMammal'`
- `Fix the spelling of 'transactions'`
- `Fix the spelling of 'PassingTracingSubscriber'`
- `Fix the spelling of 'TxSignaturePayload'`
- `Fix the spelling of 'versioning'`
- `Fix the spelling of 'descendant'`
- `Fix the spelling of 'overridden'`
- `Fix the spelling of 'network'`
Let me know if this structure is adequate.
**Note:** The usage of the words `Merkle`, `Merkelize`, `Merklization`,
`Merkelization`, `Merkleization`, is somewhat inconsistent but I left it
as it is.
~~**Note:** In some places the term `Receival` is used to refer to
message reception, IMO `Reception` is the correct word here, but I left
it as it is.~~
~~**Note:** In some places the term `Overlayed` is used instead of the
more acceptable version `Overlaid` but I also left it as it is.~~
~~**Note:** In some places the term `Applyable` is used instead of the
correct version `Applicable` but I also left it as it is.~~
**Note:** Some usage of British vs American english e.g. `judgement` vs
`judgment`, `initialise` vs `initialize`, `optimise` vs `optimize` etc.
are both present in different places, but I suppose that's
understandable given the number of contributors.
~~**Note:** There is a spelling mistake in `.github/CODEOWNERS` but it
triggers errors in CI when I make changes to it, so I left it as it
is.~~
Previously, it was only possible to retry the same request on a
different protocol name that had the exact same binary payloads.
Introduce a way of trying a different request on a different protocol if
the first one fails with Unsupported protocol.
This helps with adding new req-response versions in polkadot while
preserving compatibility with unupgraded nodes.
The way req-response protocols were bumped previously was that they were
bundled with some other notifications protocol upgrade, like for async
backing (but that is more complicated, especially if the feature does
not require any changes to a notifications protocol). Will be needed for
implementing https://github.com/polkadot-fellows/RFCs/pull/47
TODO:
- [x] add tests
- [x] add guidance docs in polkadot about req-response protocol
versioning
This tool makes it easy to run parachain consensus stress/performance
testing on your development machine or in CI.
## Motivation
The parachain consensus node implementation spans across many modules
which we call subsystems. Each subsystem is responsible for a small part
of logic of the parachain consensus pipeline, but in general the most
load and performance issues are localized in just a few core subsystems
like `availability-recovery`, `approval-voting` or
`dispute-coordinator`. In the absence of such a tool, we would run large
test nets to load/stress test these parts of the system. Setting up and
making sense of the amount of data produced by such a large test is very
expensive, hard to orchestrate and is a huge development time sink.
## PR contents
- CLI tool
- Data Availability Read test
- reusable mockups and components needed so far
- Documentation on how to get started
### Data Availability Read test
An overseer is built with using a real `availability-recovery` susbsytem
instance while dependent subsystems like `av-store`, `network-bridge`
and `runtime-api` are mocked. The network bridge will emulate all the
network peers and their answering to requests.
The test is going to be run for a number of blocks. For each block it
will generate send a “RecoverAvailableData” request for an arbitrary
number of candidates. We wait for the subsystem to respond to all
requests before moving to the next block.
At the same time we collect the usual subsystem metrics and task CPU
metrics and show some nice progress reports while running.
### Here is how the CLI looks like:
```
[2023-11-28T13:06:27Z INFO subsystem_bench::core::display] n_validators = 1000, n_cores = 20, pov_size = 5120 - 5120, error = 3, latency = Some(PeerLatency { min_latency: 1ms, max_latency: 100ms })
[2023-11-28T13:06:27Z INFO subsystem-bench::availability] Generating template candidate index=0 pov_size=5242880
[2023-11-28T13:06:27Z INFO subsystem-bench::availability] Created test environment.
[2023-11-28T13:06:27Z INFO subsystem-bench::availability] Pre-generating 60 candidates.
[2023-11-28T13:06:30Z INFO subsystem-bench::core] Initializing network emulation for 1000 peers.
[2023-11-28T13:06:30Z INFO subsystem-bench::availability] Current block 1/3
[2023-11-28T13:06:30Z INFO substrate_prometheus_endpoint] 〽️ Prometheus exporter started at 127.0.0.1:9999
[2023-11-28T13:06:30Z INFO subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:37Z INFO subsystem_bench::availability] Block time 6262ms
[2023-11-28T13:06:37Z INFO subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:37Z INFO subsystem-bench::availability] Current block 2/3
[2023-11-28T13:06:37Z INFO subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:43Z INFO subsystem_bench::availability] Block time 6369ms
[2023-11-28T13:06:43Z INFO subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:43Z INFO subsystem-bench::availability] Current block 3/3
[2023-11-28T13:06:43Z INFO subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] Block time 6194ms
[2023-11-28T13:06:49Z INFO subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] All blocks processed in 18829ms
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] Throughput: 102400 KiB/block
[2023-11-28T13:06:49Z INFO subsystem_bench::availability] Block time: 6276 ms
[2023-11-28T13:06:49Z INFO subsystem_bench::availability]
Total received from network: 415 MiB
Total sent to network: 724 KiB
Total subsystem CPU usage 24.00s
CPU usage per block 8.00s
Total test environment CPU usage 0.15s
CPU usage per block 0.05s
```
### Prometheus/Grafana stack in action
<img width="1246" alt="Screenshot 2023-11-28 at 15 11 10"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/eaa47422-4a5e-4a3a-aaef-14ca644c1574">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 01"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/237329d6-1710-4c27-8f67-5fb11d7f66ea">
<img width="1246" alt="Screenshot 2023-11-28 at 15 12 38"
src="https://github.com/paritytech/polkadot-sdk/assets/54316454/a07119e8-c9f1-4810-a1b3-f1b7b01cf357">
---------
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Collators were previously reencoding the available data and checking the
erasure root.
Replace that with just checking the PoV hash, which consumes much less
CPU and takes less time.
We also don't need to check the `PersistedValidationData` hash, as
collators don't use it.
Reason:
https://github.com/paritytech/polkadot-sdk/issues/575#issuecomment-1806572230
After systematic chunks recovery is merged, collators will no longer do
any reed-solomon encoding/decoding, which has proven to be a great CPU
consumer.
Signed-off-by: alindima <alin@parity.io>
While investigating some db migrations that make the node startup fail,
I noticed that the node wasn't exiting and that the log file were
growing exponentially, until my whole system was freezing and that makes
it really hard to actually find why it was failing in the first place.
E.g:
```
ls -lh /tmp/zombie-01a04c2a2c0265d85f6440cf01c0f44a_-51319-uyggzuD4wEpV/bob.log
32,6G oct 27 11:16 /tmp/zombie-01a04c2a2c0265d85f6440cf01c0f44a_-51319-uyggzuD4wEpV/bob.log
```
This was happening because the following errors were being printed
continously without the subsystem main loop exiting:
From dispute-coordinator:
```
WARN tokio-runtime-worker parachain::dispute-coordinator: error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
```
From availability recovery:
```
Erasure task channel closed. Node shutting down ?
```
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Refactors availability-recovery strategies to allow for easily adding
new hotpaths and failover mechanisms.
The new interface allows for chaining multiple `RecoveryStrategy`-es
together, to cleanly express the relationship between them and share
state and code where neccessary/possible:
This was done in order to aid in implementing new hotpaths like
[systematic chunks
recovery](https://github.com/paritytech/polkadot-sdk/issues/598) and
[fetching from approval
checkers](https://github.com/paritytech/polkadot-sdk/issues/575).
Thanks to this design, intermediate state can be shared between the
strategies. For example, if the systematic chunks recovery retrieved
less than the needed amount of chunks, pass them over to the next
FetchChunks strategy, which will only need to recover the remaining
number of chunks.
Draft example of how a systematic chunk recovery strategy would look:
https://github.com/paritytech/polkadot-sdk/commit/667d870bdf1470525d66c13929d5eac7249dd995
(notice how easy it was to add and reuse code)
Note that this PR doesn't itself add any new strategy, it should fully
preserve backwards compatiblity in terms of functionality. Follow-up PRs
to add new strategies will come.
* polkadot: propagate UnpinHandle to ActiveLeafUpdate
Also extract the leaf creation for tests
into a common function.
* dispute-coordinator: try pinned blocks for slashin
* apparently 1.72 is smarter than 1.70
* address nits
* rename fresh_leaf to new_leaf
* Happy New Year!
* Remove year entierly
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
* Remove years from copyright notice in the entire repo
---------
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
* Add clippy config and remove .cargo from gitignore
* first fixes
* Clippyfied
* Add clippy CI job
* comment out rusty-cachier
* minor
* fix ci
* remove DAG from check-dependent-project
* add DAG to clippy
Co-authored-by: alvicsam <alvicsam@gmail.com>
Was seeing these warnings when running `cargo check --all`:
```
warning: for loop over an `Option`. This is more readably written as an `if let` statement
--> node/core/approval-voting/src/lib.rs:1147:21
|
1147 | for activated in update.activated {
| ^^^^^^^^^^^^^^^^
|
= note: `#[warn(for_loops_over_fallibles)]` on by default
help: to check pattern in a loop use `while let`
|
1147 | while let Some(activated) = update.activated {
| ~~~~~~~~~~~~~~~ ~~~
help: consider using `if let` to clear intent
|
1147 | if let Some(activated) = update.activated {
| ~~~~~~~~~~~~ ~~~
```
My guess is that `activated` used to be a SmallVec or similar, as is
`deactivated`. It was changed to an `Option`, the `for` still compiled (it's
technically correct, just weird), and the compiler didn't catch it until now.
* foo
* rolling session window
* fixup
* remove use statemetn
* fmt
* split NetworkBridge into two subsystems
Pending cleanup
* split
* chore: reexport OrchestraError as OverseerError
* chore: silence warnings
* fixup tests
* chore: add default timenout of 30s to subsystem test helper ctx handle
* single item channel
* fixins
* fmt
* cleanup
* remove dead code
* remove sync bounds again
* wire up shared state
* deal with some FIXMEs
* use distinct tags
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* use tag
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* address naming
tx and rx are common in networking and also have an implicit meaning regarding networking
compared to incoming and outgoing which are already used with subsystems themselvesq
* remove unused sync oracle
* remove unneeded state
* fix tests
* chore: fmt
* do not try to register twice
* leak Metrics type
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
Co-authored-by: Andronik <write@reusable.software>
* explicitly tag network requests with version
* fmt
* make PeerSet more aware of versioning
* some generalization of the network bridge to support upgrades
* walk back some renaming
* walk back some version stuff
* extract version from fallback
* remove V1 from NetworkBridgeUpdate
* add accidentally-removed timer
* implement focusing for versioned messages
* fmt
* fix up network bridge & tests
* remove inaccurate version check in bridge
* remove some TODO [now]s
* fix fallout in statement distribution
* fmt
* fallout in gossip-support
* fix fallout in collator-protocol
* fix fallout in bitfield-distribution
* fix fallout in approval-distribution
* fmt
* use never!
* fmt
* remove v0 primitives from polkadot-primitives
* first pass: remove v0
* fix fallout in erasure-coding
* remove v1 primitives, consolidate to v2
* the great import update
* update runtime_api_impl_v1 to v2 as well
* guide: add `Version` request for runtime API
* add version query to runtime API
* reintroduce OldV1SessionInfo in a limited way
* seed commit for fatality based errors
* fatality
* first draft of fatality
* cleanup
* differnt approach
* simplify
* first working version for enums, with documentation
* add split
* fix simple split test case
* extend README.md
* update fatality impl
* make tests passed
* apply fatality to first subsystem
* fatality fixes
* use fatality in a subsystem
* fix subsystemg
* fixup proc macro
* fix/test: log::*! do not execute when log handler is missing
* fix spelling
* rename Runtime2 to something sane
* allow nested split with `forward` annotations
* add free license
* enable and fixup all tests
* use external fatality
Makes this more reviewable.
* bump fatality dep
Avoid duplicate expander compilations.
* migrate availability distribution
* more fatality usage
* chore: bump fatality to 0.0.6
* fixup remaining subsystems
* chore: fmt
* make cargo spellcheck happy
* remove single instance of `#[fatal(false)]`
* last quality sweep
* fixup
* Companion PR for removing Prometheus metrics prefix
* Was missing some metrics
* Fix missing renames
* Fix test
* Fixes
* Update test
* Update Substrate
* Second time
* remove prefix from intergration test for zombienet
* update zombienet image
* Update Substrate
Co-authored-by: Bastian Köcher <info@kchr.de>
Co-authored-by: Javier Viola <pepoviola@gmail.com>
* remove Default from CandidateHash
* Apply suggestions from code review
Co-authored-by: Andronik Ordian <write@reusable.software>
* chore: fmt
* remove backed candidate default
* Partial migration away from CandidateReceipt::default
* Remove more CandidateReceipt defaults
* fmt
* Mostly remove CommittedCandidateReceipt default usage
* Remove CommittedCandidateReceipt
* Remove more Defaults from polakdot primitives v1 + fmt
* Remove more Default from polkadot primites v1
* WIP trying to get overseer example + tests to compile
* feat: add primitives test helpers
* reduce deps of helper
* update primitive helpers
* make candidate validation compile
* fixup cargo lock
* make av-store compile
* fixup disputes coordinator tests
* test: fixup backing
* test: fixup approval voting
* fixup bitfield signing
* test: fixup runtime-api
* test: fixup availability dist
* foxi[ pverseer test]
* remove some Defaults, remove bounds from `dummy`
All `fn dummy` in primitives need to be removed anyways.
This aids in the transition.
* it's a test helper, so always use std
* test: fixup parachains runtime tests
Excluding benches.
* fix keyring
* fix paras runtime properly, no more default
* Remove fn dummy() usage from approval voting
* Move TestCandidateBuilder out of av store to test helpers
* Make candidate validation tests pass
* Make most dispute coirdinator tests pass
* Make provisioner tests work
* Make availability recovery tests work with test helpers
* Update polkadot-collator-protocol tests
* Update statement distribution tests
* Update polkadot overseer examples and tests
* Derive default for validation code so we don't break unrelated things
* Make para runtime test pass (no bench)
* Some more work
* chore: cargo fmt
* cargo fix
* avoid some Default::default
* fixup dispute coordinator test
* remove unused crate deps
* remove Default::default wherever possible, replace by dummy_* for the most part
* chore: cargo fmt
* Remove some warnings
* Remove CommittedCandidateReceipt dummy
* Remove CandidateReceipt dummy
* Remove CandidateDescriptor dummy
* Remove commented out code
* Fix para runtime tests
* chore: nightly
* Some updates to the builder
* Dynamically adjust mock head data size
* Make dispute cooridinator tests work
* Fix test candidate_backing_reorders_votes work
* +nightly-2021-10-29 fmt
* Spelling and remove a default use in builder
* Various clean up
* More small updates
* fmt
* More small updates
* Doc comments for test helpers
* cargo run --quiet --release --features=runtime-benchmarks -- benchmark --chain=kusama-dev --steps=50 --repeat=20 --pallet=runtime_parachains::paras_inherent --extrinsic=* --execution=wasm --wasm-execution=compiled --heap-pages=4096 --header=./file_header.txt --output=./runtime/kusama/src/weights/runtime_parachains_paras_inherent.rs
* cargo run --quiet --release --features=runtime-benchmarks -- benchmark --chain=polkadot-dev --steps=50 --repeat=20 --pallet=runtime_parachains::paras_inherent --extrinsic=* --execution=wasm --wasm-execution=compiled --heap-pages=4096 --header=./file_header.txt --output=./runtime/polkadot/src/weights/runtime_parachains_paras_inherent.rs
* Update lib.rs
* review comments
* fix warnings
* fix test by using correct candidate receipt relay parent
Co-authored-by: Andronik Ordian <write@reusable.software>
Co-authored-by: emostov <32168567+emostov@users.noreply.github.com>
Co-authored-by: Parity Bot <admin@parity.io>
Co-authored-by: Gavin Wood <gavin@parity.io>