* metrics: Increase the resolution of histogram metrics
These metrics are using the default histogram buckets:
```
pub const DEFAULT_BUCKETS: &[f64; 11] = &[
0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0,
];
```
This gives us a resolution of 5ms, which is good, but there are some subsystems
where we process hundreds or even a few thousand messages per second, like
approval-voting or approval-distribution, so it makes sense to increase the
resolution of the buckets to better understand whether the processing is in the
range of microseconds.
The new bucket ranges will be:
```
[0.0001, 0.0004, 0.0016, 0.0064, 0.0256, 0.1024, 0.4096, 1.6384, 6.5536]
```
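The new ranges form a geometric series: each bound is four times the previous one, starting at 0.1ms. A minimal sketch of how such a list could be generated follows; the helper name `exponential_buckets` is hypothetical here and is not taken from this change.

```rust
/// Builds `count` histogram bucket upper bounds (in seconds), starting at
/// `start` and multiplying by `factor` for each following bucket.
/// Hypothetical helper, written only to illustrate the series above.
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    assert!(start > 0.0, "the first bucket bound must be positive");
    assert!(factor > 1.0, "the factor must grow the bounds");

    let mut buckets = Vec::with_capacity(count);
    let mut next = start;
    for _ in 0..count {
        buckets.push(next);
        next *= factor;
    }
    buckets
}
```

Calling `exponential_buckets(0.0001, 4.0, 9)` yields nine bounds from 0.0001 up to roughly 6.5536 seconds, which trades the 10s upper bound of the default buckets for finer resolution below 5ms.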
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
* Use buckets with higher resolution
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
---------
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
* approval-distribution: Add approvals/assignments spans on all paths
The approval and assignment logic gets called from multiple paths, so make sure
we create a tracing span on all paths to make debugging easier and to be able to
correlate with the spans from approval-voting.
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
* Tag each label with a different tracing name
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
* Address review feedback
Use the source to determine the tag name
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
---------
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
* Happy New Year!
* Remove year entirely
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
* Remove years from copyright notice in the entire repo
---------
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
* Pass the PerLeafSpan as mutable reference to handle_new_head function
* cargo +nightly fmt --all
* Add mock span for test
* cargo +nightly fmt --all
* add new-blocks-hashes to span
* ref span in match statement, set span to disabled if not passed
* remove second match clause, make handle_new_head_span mutable
* cargo +nightly fmt --all
* improve tag on error and warning
* add imported blocks and info span
* cargo +nightly fmt --all
* Improve error for imported_blocks_and_info trace
* format tags on get_header_span
* add lost-to-finality tag
* add missing bracket
* - Add bitfield child span
- Add block db insertion span
* - fix update-bitfield span tag
* - Fix type conversion to u64
- Add missing argument
* - Cargo fmt
* - Test add_follows_from
* - Revert as relationship between spans not working correctly
* - use drop to test if parent-child relationship can be re-established
* - remove bitfield span, check if parent-child relationship can be reestablished
* - Remove dangling bitfield span which is not used, to see if parent-child relationship can be re-established
* Another dangling bitfield span
* cargo fmt
* - add imported blocks and info span
- add candidate span per candidate
* add tags before moving block_header to push scope
* - Add db-insertion span
* cargo fmt
* fix types
* * Pass mutable reference to span in handle_new_head
* Change get-header-span tags in handle_new_head
* Create cache-session-info span in handle_new_head
* Create optional argument in determine_new_blocks
* Pass mutable reference to handle_new_head_span in determine_new_blocks in handle_new_head function
* Add candidate-hash, candidate-number, lost-to-finality tags to candidate_span in handle_new_head function
* Manually drop db_insertion_span and remove superfluous tags to it, only keeping approved-bitfields tag
* Add ApprovalVoting stage in jaeger
* * Pass mutable reference to jaeger::Span instead of PerLeafSpan
* Add block-import span
* * Pass optional_span (optional argument) to determine_new_blocks util function
* * Add num-candidates int tag to block_import_span
* * Add head tag to cache_session_span
* * Create PerLeafSpan in handle_from_overseer (this is required to establish parent-child relationship between approval-voting span, and leaf-activated root span)
* * Add candidate-import-span as child of block-import-span
* Add candidate-hash and num-approval tags to candidate-import-span
* * Fix num-candidate tag to bitvec-len tag in candidate-import-span
* * Fix imported_blocks_and_info span to create new-block-span as not dealing with candidates
* Consider the future::select! block
* Use HashMap<Hash, jaeger::PerLeafSpan>
* Remove Stage 9
* Add missing spans
* cargo +nightly fmt --all
* Remove optional span argument for determine_new_blocks
* * Remove no-longer needed default PerLeafSpan implementation
* Remove no-longer necessary mock span given re-factoring of handle_new_head() no longer needing mutable span
* Split validation-result and request-data (availability and validation code) spans into two by dropping request_validation_data_spans
* Remove drop statements for cache_session_info_span
* Remove unnecessary span
* Remove another excessively spammy span
* Add missing spans from State in import tests
* Use functional approach to get spans
* - Add functional approach for the approval-voting span
- Add doc on block_numbers given labelling ambiguity
- Add span pruning logic
- Use .add_para_id on validation_result_span
* Replace for hash_set in hash_set_iter with map closure
* cargo +nightly fmt --all
* Change from unconsumed `map` to `.for_each`
* cargo +nightly fmt --all
* Refactor add_para_id to validation_result_span
* cargo +nightly fmt --all
* Remove duplicate tag
* Add missing tag to handle-approved-ancestor span
* Refactor span pruning to only invoke retain once
* Typo in span name
* - Replace unwrap_or with unwrap_or_else due to lazy evaluation of trace-identifier in polkadot_node_jaeger
- Remove some redundant spans
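The difference between `unwrap_or` and `unwrap_or_else` mentioned above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `fallback_id` function that stands in for a costly fallback; it is not the jaeger code itself.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for an expensive fallback computation.
/// It counts how often it is invoked.
fn fallback_id(calls: &Cell<u32>) -> u64 {
    calls.set(calls.get() + 1);
    0
}

/// `unwrap_or` evaluates its argument eagerly, so the fallback runs
/// even when `value` is `Some`. Returns the number of fallback calls.
fn eager_calls(value: Option<u64>) -> u32 {
    let calls = Cell::new(0);
    let _ = value.unwrap_or(fallback_id(&calls));
    calls.get()
}

/// `unwrap_or_else` takes a closure and only runs it when `value`
/// is `None`. Returns the number of fallback calls.
fn lazy_calls(value: Option<u64>) -> u32 {
    let calls = Cell::new(0);
    let _ = value.unwrap_or_else(|| fallback_id(&calls));
    calls.get()
}
```

With `Some(7)` as input, the eager variant still runs the fallback once, while the lazy variant never runs it.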
* Add approval-distribution spans
* - Add unwrap_or_else on note-approved-in-chain-selection
- Use child_with_trace_id to add traceID string tag on span (note this does not change the traceID, but just adds a tag)
* cargo +nightly fmt --all
* - Add traceID tags where necessary in approval-voting and availability-distribution
- Always use block-hash tag instead of relay-parent tag in approval-distribution
* Remove schedule-wakeup span as it will duplicate spans on existing wakeups (which should be a no-op)
* Remove a couple of warnings related to mutability
* Fix failing tests in availability distribution
* Add traceID tag to launch-approval and validation-result
* Reshuffle the validation and validation result spans to where more appropriate and add block-hash tag
* - Add tranche and should-trigger tag to process-wakeup span
- Add candidate-hash and traceID to check-and-import-approval span
* cargo fmt
* - Adjustments after PR comments
* Move span pruning after other pruning logic
* Remove DerefMut - no longer needed
* Relabel request-chunk spans
* - Fix typo in span label
- Add docs for drops
* Add new approval-voting span pruning logic
* Undo removal of !
* cargo fmt
* rust 1.64 enables workspace properties
* add edition, repository and authors.
* of course, update the version in one place.
Co-authored-by: Andronik <write@reusable.software>
* Add clippy config and remove .cargo from gitignore
* first fixes
* Clippyfied
* Add clippy CI job
* comment out rusty-cachier
* minor
* fix ci
* remove DAG from check-dependent-project
* add DAG to clippy
Co-authored-by: alvicsam <alvicsam@gmail.com>
* westend: update transaction version
* polkadot: update transaction version
* kusama: update transaction version
* Bump spec_version to 9330
* bump versions to 0.9.33
* Add `DisputeStatus` to `DisputeCoordinatorMessage::RecentDisputes`
The new signature of the message is:
```
RecentDisputes(oneshot::Sender<Vec<(SessionIndex, CandidateHash, DisputeStatus)>>),
```
As part of the change also add `DisputeStatus` to
`polkadot_node_primitives`.
* Move dummy_signature() in primitives/test-helpers
* Enable staging runtime api on Rococo
* Implementation
* Move disputes to separate module
* Vote prioritisation
* Duplicates handling
* Double vote handling
* Unit tests
* Logs and metrics
* Code review feedback
* Fix ACTIVE/INACTIVE separation and update partition names
* Add `fn dispute_is_inactive` to node primitives and refactor `fn get_active_with_status()` logic
* Keep the 'old' logic if the staging api is not enabled
* Fix some comments in tests
* Add warning message if there are any inactive_unknown_onchain disputes
* Add file headers and remove `use super::*;` usage outside tests
* Adding doc comments
* Fix test methods names
* Fix staging api usage
* Fix `get_disputes` runtime function implementation
* Fix compilation error
* Fix arithmetic operations in tests
* Use smaller test data
* Rename `RuntimeApiRequest::StagingDisputes` to `RuntimeApiRequest::Disputes`
* Remove `staging-client` feature flag
* fmt
* Remove `vstaging` feature flag
* Some comments regarding the staging api
* Rename dispute selection modules in provisioner
with_staging_api -> prioritized_selection
without_staging_api -> random_selection
* Comments for staging api
* Comments
* Additional logging
* Code review feedback
process_selected_disputes -> into_multi_dispute_statement_set
typo
In trait VoteType: vote_value -> is_valid
* Code review feedback
* Fix metrics
* get_disputes -> disputes
* Get time only once during partitioning
* Fix partitioning
* Comments
* Reduce the number of hardcoded api versions
* Code review feedback
* Unused import
* Comments
* More precise log messages
* Code review feedback
* Code review feedback
* Code review feedback - remove `trait VoteType`
* Code review feedback
* Trace log for DisputeCoordinatorMessage::QueryCandidateVotes counter in vote_selection
* Bump crate versions
* Bump spec_version to 9280 for kusama
* Bump spec_version to 9280 for polkadot
* Bump spec_version to 9280 for rococo
* Bump spec_version to 9280 for westend
* update Cargo.lock
Co-authored-by: parity-processbot <>
* Don't import backing statements directly
into the dispute coordinator. This also gets rid of a redundant
signature check. Both should have some impact on backing performance.
In general this PR should make us scale better in the number of parachains.
Reasoning (aka why this is fine):
For the signature check: As mentioned, it is a redundant check. The
signature has already been checked at this point. This is even made
obvious by the used types. The smart constructor is not perfect as
discussed [here](https://github.com/paritytech/polkadot/issues/3455),
but still provides reasonable security.
For not importing to the dispute-coordinator: This should be good as the
dispute coordinator does scrape backing votes from chain. This suffices
in practice as a super majority of validators must have seen a backing
fork in order for a candidate to get included and only included
candidates pose a threat to our system. The import from chain is
preferable over direct import of backing votes for two reasons:
1. The import is batched, greatly improving import performance. All
backing votes for a candidate are imported with a single import.
And indeed we were able to see in metrics that importing votes
from chain is fast.
2. We do less work in general as not every candidate for which
statements are gossiped might actually make it on a chain. The
dispute coordinator as with the current implementation would still
import and keep those votes around for six sessions.
While redundancy is good for reliability in the event of bugs, this also
comes at a non negligible cost. The dispute-coordinator right now is the
subsystem with the highest load, despite the fact that it should not be
doing much during normal operation and it is only getting worse
with more parachains as the load is a direct function of the number of statements.
We'll see on Versi how much of a performance improvement this PR brings.
* Get rid of dead code.
* Don't send approval vote
* Make it pass CI
* Bring back tests for fixing them later.
* Explicit signature check.
* Resurrect approval-voting tests (not fixed yet)
* Send out approval votes in dispute-distribution.
Use BTreeMap for ordered dispute votes.
* Bring back an important warning.
* Fix approval voting tests.
* Don't send out dispute message on import + test
+ Some cleanup.
* Guide changes.
Note that the introduced complexity is actually redundant.
* WIP: guide changes.
* Finish guide changes about dispute-coordinator
conceptually. Still requires more proofreading.
Also removed obsolete implementation details, where the code is better
suited as the source of truth.
* Finish guide changes for now.
* Remove own approval vote import logic.
* Implement logic for retrieving approval-votes
into approval-voting and approval-distribution subsystems.
* Update roadmap/implementers-guide/src/node/disputes/dispute-coordinator.md
Co-authored-by: asynchronous rob <rphmeier@gmail.com>
* Review feedback.
In particular: Add note about disputes of non included candidates.
* Incorporate Review Remarks
* Get rid of superfluous space.
* Tidy up import logic a bit.
Logical vote import is now separated, making the code more readable and
maintainable.
Also: Accept import if there is at least one invalid signer that has not
exceeded its spam slots, instead of requiring all of them to not exceed
their limits. This is more correct and a preparation for vote batching.
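The relaxed acceptance rule can be sketched as a switch from `all` to `any` over the invalid signers. The type `InvalidSigner`, the constant `MAX_SPAM_SLOTS` and both function names are hypothetical, chosen only to illustrate the rule described above.

```rust
/// Hypothetical limit on spam slots per validator (assumed value).
const MAX_SPAM_SLOTS: u32 = 50;

/// Hypothetical view of a validator that signed an `invalid` vote.
struct InvalidSigner {
    spam_slots_used: u32,
}

impl InvalidSigner {
    /// True if this signer has not yet used up its spam slots.
    fn has_free_slot(&self) -> bool {
        self.spam_slots_used < MAX_SPAM_SLOTS
    }
}

/// Stricter rule: every invalid signer must still have a free slot.
fn accept_if_all_within_limit(signers: &[InvalidSigner]) -> bool {
    signers.iter().all(InvalidSigner::has_free_slot)
}

/// Relaxed rule: one invalid signer with a free slot is enough.
fn accept_if_any_within_limit(signers: &[InvalidSigner]) -> bool {
    signers.iter().any(InvalidSigner::has_free_slot)
}
```

Under this sketch, an import signed by one exhausted validator and one fresh validator is rejected by the stricter rule and accepted by the relaxed one.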
* We don't need/have empty imports.
* Fix tests and bugs.
* Remove error prone redundancy.
* Import approval votes on dispute initiated/concluded.
* Add test for approval vote import.
* Make guide checker happy (hopefully)
* Another sanity check + better logs.
* Reasoning about boundedness.
* Use `CandidateIndex` as opposed to `CoreIndex`.
* Remove redundant import.
* Review remarks.
* Add metric for calls to request signatures
* More review remarks.
* Add metric on imported approval votes.
* Include candidate hash in logs.
* More trace log
* Break cycle.
* Add some tracing.
* Cleanup allowed messages.
* fmt
* Tracing + timeout for get inherent data.
* Better error.
* Break cycle in all places.
* Clarified comment some more.
* Typo.
* Break cycle approval-distribution - approval-voting.
Co-authored-by: asynchronous rob <rphmeier@gmail.com>
* foo
* rolling session window
* fixup
* remove use statement
* fmt
* split NetworkBridge into two subsystems
Pending cleanup
* split
* chore: reexport OrchestraError as OverseerError
* chore: silence warnings
* fixup tests
* chore: add default timeout of 30s to subsystem test helper ctx handle
* single item channel
* fixins
* fmt
* cleanup
* remove dead code
* remove sync bounds again
* wire up shared state
* deal with some FIXMEs
* use distinct tags
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* use tag
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* address naming
tx and rx are common in networking and also have an implicit meaning regarding networking
compared to incoming and outgoing which are already used with subsystems themselves
* remove unused sync oracle
* remove unneeded state
* fix tests
* chore: fmt
* do not try to register twice
* leak Metrics type
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
Co-authored-by: Andronik <write@reusable.software>
* Move NewGossipTopology -> SessionGridTopology outside as this implementation is shared
* Add method to return peers difference between topologies
* Implement basic grid topology usage for the bitfield distribution
* Fix tests
* Oops, fix tests
* Add some tests for random routing
* Add a unit test for topology distribution
* Store the current and the previous topology to match sessions boundaries
* Update tests
* Update node/network/bitfield-distribution/src/lib.rs
Co-authored-by: Andronik <write@reusable.software>
* Update node/network/protocol/src/grid_topology.rs
Co-authored-by: Andronik <write@reusable.software>
* Update node/network/bitfield-distribution/src/lib.rs
Co-authored-by: Andronik <write@reusable.software>
* Add some debug
* Fix tests as HashSet order is undefined
* Move session bounded topology to the common code part
* Fix tests
* Allow to select routing by peer index
* Implement grid topology in the statement distribution subsystem
* Fix tests compilation
* Fix test
* Refactor API slightly
* Address review comments
* Reduce runtime error logging severity
* Update node/network/protocol/src/grid_topology.rs
Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>
* Update node/network/bitfield-distribution/src/tests.rs
Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>
* Fmt run
* Use named struct
* Fix logging stuff
* One more accidental fmt damage
* Increase active queue size and add metrics
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
* Revert "Increase active queue size and add metrics"
This reverts commit c4f48e8bded6dfeb9c62814ba2f8d815c34b04cf.
* Use validator index to choose the routing strategy
Noted by: @rphmeier
* Fix test after distribution logic fix
Co-authored-by: Andronik <write@reusable.software>
Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>
Co-authored-by: Andrei Sandu <andrei-mihail@parity.io>
* Double grandpa gossip duration.
* Make resend period slightly larger.
So it won't get triggered by additional grandpa delay.
* Bump other values as well.
* Don't change gossip duration on Polkadot.
(and Westend as it is meant to be a testbed for Polkadot)
* Move NewGossipTopology -> SessionGridTopology outside as this implementation is shared
* Add method to return peers difference between topologies
* Implement basic grid topology usage for the bitfield distribution
* Fix tests
* Oops, fix tests
* Add some tests for random routing
* Add a unit test for topology distribution
* Store the current and the previous topology to match sessions boundaries
* Update tests
* Update node/network/bitfield-distribution/src/lib.rs
Co-authored-by: Andronik <write@reusable.software>
* Update node/network/protocol/src/grid_topology.rs
Co-authored-by: Andronik <write@reusable.software>
* Update node/network/bitfield-distribution/src/lib.rs
Co-authored-by: Andronik <write@reusable.software>
* Add some debug
* Fix tests as HashSet order is undefined
Co-authored-by: Andronik <write@reusable.software>
* Initial attempt to extract grid topology related code
* Use shared code in the approval distribution subsystem
* Fix spellcheck issues
* Move Aggression stuff back to the approval-distribution subsystem
* Cargo fmt
* explicitly tag network requests with version
* fmt
* make PeerSet more aware of versioning
* some generalization of the network bridge to support upgrades
* walk back some renaming
* walk back some version stuff
* extract version from fallback
* remove V1 from NetworkBridgeUpdate
* add accidentally-removed timer
* implement focusing for versioned messages
* fmt
* fix up network bridge & tests
* remove inaccurate version check in bridge
* remove some TODO [now]s
* fix fallout in statement distribution
* fmt
* fallout in gossip-support
* fix fallout in collator-protocol
* fix fallout in bitfield-distribution
* fix fallout in approval-distribution
* fmt
* use never!
* fmt
* gossip-support: be explicit about dimensions
* some guide updates
* update network-bridge to distinguish x and y dimensions
* get everything to compile
* beginnings
* some TODOs
* polkadot runtime: use relevant_authorities
* make gossip topologies per-session
* better formatting
* gossip support: use current session validators
* expand in comment
* adjust tests and fix index bug
* add past/present/future connection test and clean up code
* fmt
* network bridge: updated types
* update protocols to new gossip topology message
* guide updates
* add session to BlockApprovalMeta
* add session to block info
* refactor knowledge and remove most unify logic
* start replacing gossip_peers with new SessionTopologies
* add routing information to message state
* add some utilities to SessionTopology
* implement new gossip topology logic
* re-implement unify_with_peer
* distribute assignments according to topology
* finish grid topology implementation
* refactor network bridge slightly
* issue connection requests on all past/present/future
* fmt
* address grumbles
* tighten invariants in unify_with_peer
* implement random propagation
* refactor: extract required routing adjustment logic
* some block-age logic
* aggressively propagate messages when finality is slow
* overhaul aggression system to have 3 levels
* add aggression metrics
* remove aggression L3
* reduce random circulation
* remove PeerData
* get approval tests compiling
* use btree_map in known_by to make deterministic
* Revert "use btree_map in known_by to make deterministic"
This reverts commit 330d65343a7bb6fe4dd0f24bd8dbc15c0cbdbd9d.
* test XY grid propagation
* remove stray println
* test unshared dimension propagation
* add random gossip check
* test unify_with_peer better
* test sending after getting gossip topology
* test L1 aggression on originator
* test L1 aggression for non-originators
* test non-originator aggression L2
* fmt
* ~spellcheck
* fix statement-distribution tests
* fix flaky test
* fix metrics typo
* re-send periodically
* test resending
* typo
Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>
* add more metrics about apd messages
* add back unify_with_peer logs
* make Resend an enum
* be more explicit when resending
* fmt
* fix error
* add a TODO for refactoring
* remove debug metrics
* add some guide stuff
* fmt
* update runtime API in test-runtime
Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>
This issue happens when a peer sends a good but already known Seconded statement and the statement-distribution code does not update the statements_received field in the peer_knowledge structure. A subsequent Valid statement is then incorrectly treated as an out-of-view message, which causes a reputation loss.
This PR also introduces a concept of passing the specific pseudo-random generator to subsystems to make it easier to write deterministic tests. This functionality is not really necessary for the specific issue and unit test but it can be useful for other tests and subsystems.
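The idea of handing a generator to a subsystem can be sketched as below. All names (`RandomSource`, `XorShift64`, `Subsystem`, `pick_peer`) are hypothetical and use only the standard library; they illustrate the pattern and are not the types used in this PR.

```rust
/// Hypothetical minimal random-source abstraction.
trait RandomSource {
    fn next_u64(&mut self) -> u64;
}

/// Small xorshift64 generator: the same non-zero seed always
/// produces the same sequence, which keeps tests reproducible.
struct XorShift64(u64);

impl RandomSource for XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical subsystem that owns the generator it was given
/// instead of reaching for a global one.
struct Subsystem<R: RandomSource> {
    rng: R,
}

impl<R: RandomSource> Subsystem<R> {
    fn new(rng: R) -> Self {
        Self { rng }
    }

    /// Picks a peer index in `0..peer_count`; `peer_count` must be non-zero.
    fn pick_peer(&mut self, peer_count: usize) -> usize {
        (self.rng.next_u64() % peer_count as u64) as usize
    }
}
```

A test can construct two subsystems from the same seed and expect identical routing choices, while production code would pass in a properly seeded generator.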
* Try to fix out-of-view messages in approval distribution
Suggested by: @ordian
* Cargo fmt
* Add a unit test for the proposed fix
* Spelling fix
* Use a simpler approach to fix the race condition as suggested by @rphmeier
* Cargo fmt run
* remove v0 primitives from polkadot-primitives
* first pass: remove v0
* fix fallout in erasure-coding
* remove v1 primitives, consolidate to v2
* the great import update
* update runtime_api_impl_v1 to v2 as well
* guide: add `Version` request for runtime API
* add version query to runtime API
* reintroduce OldV1SessionInfo in a limited way