Commit Graph

79 Commits

Author SHA1 Message Date
Bernhard Schuster 2a6f460e4c reword error: channel is _terminated and_ empty (#3041) 2021-05-18 07:51:46 +02:00
Bastian Köcher 7830bae524 Companion for Substrate#8526 (#2845)
* Update branch

* Make it compile

* Compile

* gate approval-checking logic (#2470)

* Fix build

* Updates

* Fix merge

* Adds missing crate

* Companion for Substrate#8386

https://github.com/paritytech/substrate/pull/8386

* Fix fix fix

* Fix

* Fix compilation

* Rewrite to `ParachainsInherentDataProvider`

* Make it compile

* Renamings

* Revert stuff

* Remove stale file

* Guide updates

* Update node/core/parachains-inherent/src/lib.rs

Co-authored-by: Andronik Ordian <write@reusable.software>

* Update node/core/parachains-inherent/src/lib.rs

Co-authored-by: Andronik Ordian <write@reusable.software>

* Apply suggestions from code review

* Reset accidental changes

* More

* Remove stale file

* update Substrate

Co-authored-by: Robert Habermeier <rphmeier@gmail.com>
Co-authored-by: Andronik Ordian <write@reusable.software>
Co-authored-by: parity-processbot <>
2021-05-03 15:21:13 +00:00
Bernhard Schuster 1ef8eac8ec feat: add proc macro to reduce overseer mock boilerplate (#2949) 2021-04-29 12:07:28 +02:00
Robert Klotzner 305375e1e4 Req/res optimization for statement distribution (#2803)
* Wip

* Increase proposer timeout.

* WIP.

* Better timeout values now that we are going to be connected to all nodes. (#2778)

* Better timeout values.

* Fix typo.

* Fix validator bandwidth.

* Fix compilation.

* Better and more consistent sizes.

Most importantly, code size is now 5 Meg, which is the limit we currently
want to support in statement distribution.

* Introduce statement fetching request.

* WIP

* Statement cache retrieval logic.

* Review remarks by @rphmeier

* Fixes.

* Better requester logic.

* WIP: Handle requester messages.

* Missing dep.

* Fix request launching logic.

* Finish fetching logic.

* Sending logic.

* Redo code size calculations.

Now that max code size is compressed size.

* Update Cargo.lock (new dep)

* Get request receiver to statement distribution.

* Expose new functionality for responding to requests.

* Cleanup.

* Responder logic.

* Fixes + Cleanup.

* Cargo.lock

* Whitespace.

* Add lost copyright.

* Launch responder task.

* Typo.

* info -> warn

* Typo.

* Fix.

* Fix.

* Update comment.

* Doc fix.

* Better large statement heuristics.

* Fix tests.

* Fix network bridge tests.

* Add test for size estimate.

* Very simple test that checks we get LargeStatement.

* Basic check that fetching of large candidates is performed.

* More tests.

* Basic metrics for responder.

* More metrics.

* Use Encode::encoded_size().

* Some useful spans.

* Get rid of redundant metrics.

* Don't add peer on duplicate.

* Properly check hash

instead of relying on signatures alone.

* Preserve ordering + better flood protection.

* Get rid of redundant clone.

* Don't shutdown responder on failed query.

And add test for this.

* Smaller fixes.

* Quotes.

* Better queue size calculation.

* A bit saner response sizes.

* Fixes.
2021-04-09 21:30:12 +00:00
Robert Habermeier 57038b2e46 Remove real-overseer 🎉 (#2834)
* remove real-overseer

* overseer: only activate leaves which support parachains

* integrate HeadSupportsParachains into service

* remove unneeded line
2021-04-08 20:24:06 +02:00
Robert Habermeier fc154d2ada Reduce signal channel sizes and more logging on approval-voting (#2751)
* reduce signal channel capacity

* more tracing for approval-voting
2021-03-29 15:46:12 +02:00
Robert Klotzner 0a9fe852df Move non runtime related stuff into node/primitives (#2743)
* Remove stuff out of the runtime that does not belong there.

There might be more, but it is a start.

* White space fixes.

* Fix tests.

* Leave whitespace in ui tests alone.

* Add back zstd for no reason.

* Fix browser wasm (hopefully)
2021-03-29 02:15:44 +02:00
Robert Habermeier 5952e790fa Overseer: subsystems communicate directly (#2227)
* overseer: pass messages directly between subsystems

* test that message is held on to

* Update node/overseer/src/lib.rs

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>

* give every subsystem an unbounded sender too

* remove metered_channel::name

1. we don't provide good names
2. these names are never used anywhere

* unused mut

* remove unnecessary &mut

* subsystem unbounded_send

* remove unused MaybeTimer

We have channel size metrics that serve the same purpose better now and the implementation of message timing was pretty ugly.

* remove comment

* split up senders and receivers

* update metrics

* fix tests

* fix test subsystem context

* fix flaky test

* fix docs

* doc

* use select_biased to favor signals

* Update node/subsystem/src/lib.rs

Co-authored-by: Andronik Ordian <write@reusable.software>

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>
Co-authored-by: Andronik Ordian <write@reusable.software>
2021-03-28 15:55:10 +00:00
Robert Klotzner c6f07d8f31 Request based PoV distribution (#2640)
* Indentation fix.

* Prepare request-response for PoV fetching.

* Drop old PoV distribution.

* WIP: Fetch PoV directly from backing.

* Backing compiles.

* Runtime access and connection management for PoV distribution.

* Get rid of seemingly dead code.

* Implement PoV fetching.

Backing does not yet use it.

* Don't send `ConnectToValidators` for empty list.

* Even better - no need to check over and over again.

* PoV fetching implemented.

+ Typechecks
+ Should work

Missing:

- Guide
- Tests
- Do fallback fetching in case fetching from seconding validator fails.

* Check PoV hash upon reception.

* Implement retry of PoV fetching in backing.

* Avoid pointless validation spawning.

* Add jaeger span to pov requesting.

* Add back tracing.

* Review remarks.

* Whitespace.

* Whitespace again.

* Cleanup + fix tests.

* Log to log target in overseer.

* Fix more tests.

* Don't fail if group cannot be found.

* Simple test for PoV fetcher.

* Handle missing group membership better.

* Add test for retry functionality.

* Fix flaky test.

* Spaces again.

* Guide updates.

* Spaces.
2021-03-28 17:11:38 +02:00
Robert Habermeier 73b9247c10 Separate metrics for messages sent & received (#2721)
* metered channel - sent & received

* Add for readouts

* metrics for both sent & received

* retract on send failure
2021-03-26 14:11:01 +01:00
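The #2721 change above does three things: it keeps separate counters for messages sent and received, it derives the channel fill level from their difference, and it retracts the send count when a send fails. What follows is a minimal sketch of that bookkeeping. It wraps `std::sync::mpsc` so that it stays self-contained. The node's actual metered channel and its Prometheus export are not reproduced here, and `Meter`, `MeteredSender`, `MeteredReceiver` and `metered_channel` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{self, Receiver, RecvError, SendError, Sender};
use std::sync::Arc;

/// Shared counters; the fill level is `sent - received`.
#[derive(Default)]
pub struct Meter {
    sent: AtomicUsize,
    received: AtomicUsize,
}

impl Meter {
    /// Number of messages currently sitting in the channel.
    pub fn fill_level(&self) -> usize {
        let sent = self.sent.load(Ordering::SeqCst);
        let received = self.received.load(Ordering::SeqCst);
        sent.saturating_sub(received)
    }
}

pub struct MeteredSender<T> {
    inner: Sender<T>,
    meter: Arc<Meter>,
}

pub struct MeteredReceiver<T> {
    inner: Receiver<T>,
    meter: Arc<Meter>,
}

/// Hypothetical constructor returning both ends plus a handle for readouts.
pub fn metered_channel<T>() -> (MeteredSender<T>, MeteredReceiver<T>, Arc<Meter>) {
    let (tx, rx) = mpsc::channel();
    let meter = Arc::new(Meter::default());
    (
        MeteredSender { inner: tx, meter: Arc::clone(&meter) },
        MeteredReceiver { inner: rx, meter: Arc::clone(&meter) },
        meter,
    )
}

impl<T> MeteredSender<T> {
    pub fn send(&self, msg: T) -> Result<(), SendError<T>> {
        // Count before sending, so a fast receiver never sees received > sent.
        self.meter.sent.fetch_add(1, Ordering::SeqCst);
        self.inner.send(msg).map_err(|e| {
            // Retract on send failure: the message never entered the channel.
            self.meter.sent.fetch_sub(1, Ordering::SeqCst);
            e
        })
    }
}

impl<T> MeteredReceiver<T> {
    pub fn recv(&self) -> Result<T, RecvError> {
        let msg = self.inner.recv()?;
        self.meter.received.fetch_add(1, Ordering::SeqCst);
        Ok(msg)
    }
}
```

The sender increments before sending rather than after. That ordering is why the failure path must subtract again: otherwise a message that was never delivered would keep inflating the fill level.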
Robert Habermeier 064df81ee4 Add block number to activated leaves and associated fixes (#2718)
* add number to `ActivatedLeavesUpdate`

* update subsystem util and overseer

* use new ActivatedLeaf everywhere

* sort view

* sorted and limited view in network bridge

* use live block hash only if it's newer

* grumples
2021-03-26 13:06:40 +01:00
Robert Habermeier 2388141b25 overseer: AllSubsystems magic and subsystem channel sizes metrics (#2711)
* overseer: AllSubsystems magic and report subsystem channel sizes to prometheus

* fix tests
2021-03-25 21:21:55 +01:00
Bernhard Schuster ea6294fa79 restructure polkadot-node-jaeger (#2642) 2021-03-19 16:51:16 +01:00
Andronik Ordian baa691deb1 prefix parachain log targets with parachain:: (#2600)
* prefix parachain log targets with parachain::

* even more consistent
2021-03-10 17:07:56 +01:00
Andronik Ordian 4c1de66d5d subsystem for issuing background connection requests (#2538)
* initial subsystem for issuing connection requests

* finish the initial impl

* integrate with the overseer

* rename to gossip-support

* fix renamings leftover

* remove run_inner

* fix compilation

* random subset of sqrt
2021-03-02 10:40:06 +00:00
Robert Klotzner 48409e5548 Request based availability distribution (#2423)
* WIP

* availability distribution, still very wip.

Work on the requesting side of things.

* Some docs on what I intend to do.

* Checkpoint of session cache implementation

as I will likely replace it with something smarter.

* More work, mostly on cache

and getting things to type check.

* Only derive MallocSizeOf and Debug for std.

* availability-distribution: Cache feature complete.

* Sketch out logic in `FetchTask` for actual fetching.

- Compile fixes.
- Cleanup.

* Format cleanup.

* More format fixes.

* Almost feature complete `fetch_task`.

Missing:

- Check for cancel
- Actual querying of peer ids.

* Finish FetchTask so far.

* Directly use AuthorityDiscoveryId in protocol and cache.

* Resolve `AuthorityDiscoveryId` on sending requests.

* Rework fetch_task

- also make it impossible to check the wrong chunk index.
- Export needed function in validator_discovery.

* From<u32> implementation for `ValidatorIndex`.

* Fixes and more integration work.

* Make session cache proper lru cache.

* Use proper lru cache.

* Requester finished.

* ProtocolState -> Requester

Also make sure to not fetch our own chunk.

* Cleanup + fixes.

* Remove unused functions

- FetchTask::is_finished
- SessionCache::fetch_session_info

* availability-distribution responding side.

* Cleanup + Fixes.

* More fixes.

* More fixes.

adder-collator is running!

* Some docs.

* Docs.

* Fix reporting of bad guys.

* Fix tests

* Make all tests compile.

* Fix test.

* Cleanup + get rid of some warnings.

* state -> requester

* Mostly doc fixes.

* Fix test suite.

* Get rid of now redundant message types.

* WIP

* Rob's review remarks.

* Fix test suite.

* core.relay_parent -> leaf for session request.

* Style fix.

* Decrease request timeout.

* Cleanup obsolete errors.

* Metrics + don't fail on non fatal errors.

* requester.rs -> requester/mod.rs

* Panic on invalid BadValidator report.

* Fix indentation.

* Use typed default timeout constant.

* Make channel size 0, as each sender gets one slot anyway.

* Fix incorrect metrics initialization.

* Fix build after merge.

* More fixes.

* Hopefully valid metrics names.

* Better metrics names.

* Some tests that already work.

* Slightly better docs.

* Some more tests.

* Fix network bridge test.
2021-02-26 11:58:07 -06:00
Andronik Ordian 241b1f12a7 make runtime_api non blocking task again (#2531) 2021-02-26 12:04:58 +01:00
Robert Habermeier ae218bb608 make runtime API and chain API subsystems blocking too (#2526) 2021-02-25 07:26:49 +00:00
Robert Habermeier 3cdaa88509 spawn availability store and approval voting subsystems as blocking tasks (#2521)
* spawn availability store and approval voting subsystems as blocking tasks

* refactor
2021-02-24 14:22:48 -06:00
Bernhard Schuster 49c6aa9a76 feat/jaeger: more spans, more stages (#2477)
* feat/jaeger: more spans, more stages

Stage numbers are still arbitrarily picked.

* feat/jaeger: additional spans

* chore/spellcheck: improve the dictionary

* fix/jaeger JaegerSpan -> jaeger::Span
2021-02-19 14:19:43 +00:00
Robert Habermeier b7aac51341 A fast-path for requesting AvailableData from backing validators (#2453)
* guide changes for a fast-path requesting from backing validators

* add backing group to availability recovery message

* add new phase to interaction

* typos

* add full data messages

* handle new network messages

* dispatch full data requests

* cleanup

* check chunk index

* test for invalid recovery

* tests

* Typos.

* fix some grumbles

* be more explicit about error handling and control flow

* fast-path param

* use with_chunks_only in Service

Co-authored-by: Robert Klotzner <robert.klotzner@gmx.at>
2021-02-17 13:51:50 -06:00
Bernhard Schuster 1e2161258b refactor/reputation: unify the values used (#2462)
* refactor/reputation: unify the values used

* chore/rep: rename Annoy* to Cost*, make duplicate message Cost*Repeated

* fix/reputation: lost and found, convert at the boundary to substrate

* refactor/rep: move conversion to base reputation one level down, left conversions

* fix/rep: order of magnitude adjustments

Thanks, Pierre!

* remove spaces

* chore/rep: give rationale for order of magnitude

* refactor/rep: move UnifiedReputationChange to separate file

* fix/rep: order of magnitudes correction
2021-02-17 17:18:13 +01:00
Robert Habermeier 4f21cc7e5c Integrate Approval Voting into Overseer / Service / GRANDPA (#2412)
* integrate approval voting into overseer

* expose public API and make keystore arc

* integrate overseer in service

* guide: `ApprovedAncestor` returns block number

* return block number along with hash from ApprovedAncestor

* introduce a voting rule for reporting on approval checking

* integrate the delay voting rule

* Rococo configuration

* fix compilation and add slack

* fix web-wasm build

* tweak parameterization

* migrate voting rules to async

* remove hack comment
2021-02-15 22:37:13 -06:00
Bastian Köcher 4975521d48 Notify collators about seconded collation (#2430)
* Notify collators about seconded collation

This PR adds functionality to inform a collator that its collation was
seconded by a parachain validator. Previously, this signed statement was only
gossiped over the validation substream. Now, we explicitly send the
seconded statement to the collator after it has been validated successfully.

Besides that, it changes the `CollatorFn` to return an optional result
sender that is informed when the built collation is seconded by a
parachain validator.

* Add test

* Make sure we only send `Seconded` statements

* Make sure we only receive valid statements

* Review feedback
2021-02-14 17:36:04 +01:00
Robert Habermeier e48c687504 Implement Approval Voting Subsystem (#2112)
* skeleton

* skeleton aux-schema module

* start approval types

* start aux schema with aux store

* doc

* finish basic types

* start approval types

* doc

* finish basic types

* write out schema types

* add debug and codec impls to approval types

* add debug and codec impls to approval types

also add some key computation

* add debug and codec impls to approval types

* getters for block and candidate entries

* grumbles

* remove unused AssignmentId

* load_decode utility

* implement DB clearing

* function for adding new block entry to aux store

* start `canonicalize` implementation

* more skeleton

* finish implementing canonicalize

* tag TODO

* implement a test AuxStore

* add allow(unused)

* basic loading and deleting test

* block_entry test function

* add a test for `add_block_entry`

* ensure range is exclusive at end

* test clear()

* test that add_block sets children

* add a test for canonicalize

* extract Pre-digest from header

* utilities for extracting RelayVRFStory from the header-chain

* add approval voting message types

* approval distribution message type

* subsystem skeleton

* state struct

* add futures-timer

* prepare service for babe slot duration

* more skeleton

* better integrate AuxStore

* RelayVRF -> RelayVRFStory

* canonicalize

* implement some tick functionality

* guide: tweaks

* check_approval

* more tweaks and helpers

* guide: add core index to candidate event

* primitives: add core index to candidate event

* runtime: add core index to candidate events

* head handling (session window)

* implement `determine_new_blocks`

* add TODO

* change error type on functions

* compute RelayVRFModulo assignments

* compute RelayVRFDelay assignments

* fix delay tranche calc

* assignment checking

* pluralize

* some dummy code for fetching assignments

* guide: add babe epoch runtime API

* implement a current_epoch() runtime API

* compute assignments

* candidate events get backing group

* import blocks and assignments into DB

* push block approval meta

* add message types, no overseer integration yet

* notify approval distribution of new blocks

* refactor import into separate functions

* impl tranches_to_approve

* guide: improve function signatures

* guide: remove Tick from ApprovalEntry

* trigger and broadcast assignment

* most of approval launching

* remove byteorder crate

* load blocks back to finality, except on startup

* check unchecked assignments

* add claimed core to approval voting message

* fix checks

* assign only to backing group

* remove import_checked_assignment from guide

* newline

* import assignments

* abstract out a bit

* check and import approvals

* check full approvals from assignment import too

* comment

* create a Transaction utility

* must_use

* use transaction in `check_full_approvals`

* wire up wakeups

* add Ord to CandidateHash

* wakeup refactoring

* return candidate info from add_block_entry

* schedule wakeups

* background task: do candidate validation

* forward candidate validation requests

* issue approval votes when requested

* clean up a couple TODOs

* fix up session caching

* clean up last unimplemented!() items

* fix remaining warnings

* remove TODO

* implement handle_approved_ancestor

* update Cargo.lock

* fix runtime API tests

* guide: cleanup assignment checking

* use claimed candidate index instead of core

* extract time to a trait

* tests module

* write a mock clock for testing

* allow swapping out the clock

* make abstract over assignment criteria

* add some skeleton tests and simplify params

* fix backing group check

* do backing group check inside check_assignment_cert

* write some empty test functions to implement

* add a test for non-backing

* test that produced checks pass

* some empty test ideas

* runtime/inclusion: remove outdated TODO

* fix compilation

* av-store: fix tests

* dummy cert

* criteria tests

* move `TestStore` to main tests file

* fix unused warning

* test harness beginnings

* resolve slots renaming fallout

* more compilation fixes

* wip: extract pure data into a separate module

* wip: extract pure data into a separate module

* move types completely to v1

* add persisted_entries

* add conversion trait impls

* clean up some warnings

* extract import logic to own module

* schedule wakeups

* experiment with Actions

* uncomment approval-checking

* separate module for approval checking utilities

* port more code to use actions

* get approval pipeline using actions

* all logic is uncommented

* main loop processes actions

* all loop logic uncommented

* separate function for handling actions

* remove last unimplemented item

* clean up warnings

* State gives read-only access to underlying DB

* tests for approval checking

* tests for approval criteria

* skeleton test module for import

* list of import tests to do

* some test glue code

* test reject bad assignment

* test slot too far in future

* test reject assignment with unknown candidate

* remove loads_blocks tests

* determine_new_blocks back to finalized & harness

* more coverage for determining new blocks

* make `imported_block_info` have less reliance on State

* candidate_info tests

* tests for session caching

* remove println

* extricate DB and main TestStores

* rewrite approval checking logic to counteract early delays

* move state out of function

* update approval-checking tests

* tweak wakeups & scheduling logic

* rename check_full_approvals

* test that assignment import updates candidate

* some approval import tests

* some tests for check_and_apply_approval

* add 'full' qualifier to avoid confusion

* extract should-trigger logic to separate function

* some tests for all triggering

* tests for when we trigger assignments

* test wakeups

* add block utilities for testing

* some more tests for approval updates

* approved_ancestor tests

* new action type for launch approval

* process-wakeup tests

* clean up some warnings

* fix in_future test

* approval checking tests

* tighten up too-far-in-future

* special-case genesis when caching sessions

* fix bitfield len

Co-authored-by: Andronik Ordian <write@reusable.software>
2021-02-11 10:21:47 -06:00
Robert Klotzner 0cb1ccd122 Generic request/response infrastructure for Polkadot (#2352)
* Move NetworkBridgeEvent to subsystem::messages.

It is not protocol-related at all; it is only part of the
subsystem communication, as it gets wrapped into the messages of each
subsystem.

* Request/response infrastructure is taking shape.

WIP: Does not compile.

* Multiplexer variant not supported by Rust's type system.

* request_response::request type checks.

* Cleanup.

* Minor fixes for request_response.

* Implement request sending + move multiplexer.

The request multiplexer is moved to the bridge, where the implementation is
more straightforward because we can specialize on `AllMessages` as the
multiplexing target.

Sending of requests is mostly complete, apart from a few `From`
instances. Receiving is also almost done; initialization still needs to be
fixed and the multiplexer needs to be invoked.

* Remove obsolete multiplexer.

* Initialize bridge with multiplexer.

* Finish generic request sending/receiving.

Subsystems are now able to receive and send requests and responses via
the overseer.

* Doc update.

* Fixes.

* Link issue for not yet implemented code.

* Fixes suggested by @ordian - thanks!

- start encoding at 0
- don't crash on zero protocols
- don't panic on not yet implemented request handling

* Update node/network/protocol/src/request_response/v1.rs

Use index 0 instead of 1.

Co-authored-by: Andronik Ordian <write@reusable.software>

* Update node/network/protocol/src/request_response.rs

Co-authored-by: Andronik Ordian <write@reusable.software>

* Fix existing tests.

* Better avoidance of division by zero errors.

* Doc fixes.

* send_request -> start_request.

* Fix missing renamings.

* Update substrate.

* Pass TryConnect instead of true.

* Actually import `IfDisconnected`.

* Fix wrong import.

* Update node/network/bridge/src/lib.rs

typo

Co-authored-by: Pierre Krieger <pierre.krieger1708@gmail.com>

* Update node/network/bridge/src/multiplexer.rs

Remove redundant import.

Co-authored-by: Pierre Krieger <pierre.krieger1708@gmail.com>

* Stop doing tracing from within `From` instance.

Thanks for the catch @tomaka!

* Get rid of redundant import.

* Formatting cleanup.

* Fix tests.

* Add link to issue.

* Clarify comments some more.

* Fix tests.

* Formatting fix.

* tabs

* Fix link

Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>

* Use map_err.

Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>

* Improvements inspired by suggestions by @drahnr.

- Channel size is now determined by a function.
- Explicitly scope NetworkService::start_request.

Co-authored-by: Andronik Ordian <write@reusable.software>
Co-authored-by: Pierre Krieger <pierre.krieger1708@gmail.com>
Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>
2021-02-03 20:21:09 +00:00
Andronik Ordian 3f1e1a6ff7 impl approval distribution (#2160)
* initial impl approval distribution

* initial tests and fixes

* batching seems difficult: different peers have different needs

* bridge: fix test after merge

* some guide updates

* only send assignments to peers who know about the block

* fix a test, add approvals test

* simplify

* do not send assignment to peers for finalized blocks

* guide: protocol input and output

* one more test

* more comments, logs, initial metrics

* fix a typo

* one more thing: early return when reimporting a thing locally
2021-01-25 18:14:32 -05:00
Bernhard Schuster 366c229f6f use fill_level gauges instead of histograms (#2276) 2021-01-15 18:46:31 +01:00
Fedor Sakharov 90a686266f Availability recovery subsystem (#2122)
* Adds message types

* Add code skeleton

* Adds subsystem code.

* Adds a first test

* Adds interaction result to availability_lru

* Use LruCache instead of a HashMap

* Whitespaces to tabs

* Do not ignore errors

* Change error type

* Add a timeout to chunk requests

* Add custom errors and log them

* Adds replace_availability_recovery method

* recovery_threshold computed by erasure crate

* change core to std

* adds docs to error type

* Adds a test for invalid reconstruction

* refactors interaction run into multiple methods

* Cleanup AwaitedChunks

* Even more fixes

* Test that recovery with wrong root is an error

* Break to launch another request

* Styling fixes

* Add SessionIndex to API

* Proper relay parents for MakeRequest

* Remove validator_discovery and use message

* Remove a stream on exhaustion

* On cleanup free the request streams

* Fix merge and refactor
2021-01-15 02:06:25 +00:00
Andronik Ordian deff43fd30 small cleanup (#2267)
* session_info: fix authorities docstring

* overseer: more consistent metrics naming

* session_info: mention ordering

* use correct bucket sizes

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>
2021-01-14 10:41:19 +00:00
Bernhard Schuster d7adf8f201 metered mpsc channels (#2235) 2021-01-13 17:40:27 +01:00
Robert Habermeier 03e39cf5bc subsystems have an unbounded channel to the overseer (#2236)
* subsystems have an unbounded channel to the overseer

* Update node/overseer/src/lib.rs

Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>

* bump Cargo.lock

Co-authored-by: Bernhard Schuster <bernhard@ahoi.io>
2021-01-08 21:08:59 +00:00
Bastian Köcher 475915ff10 Do not send empty view updates to peers (#2233)
* Do not send empty view updates to peers

We sometimes sent empty view updates to our peers when only our
finalized block had changed. This could overwhelm subsystems with too
many messages. On Rococo this led to constant restarts of our nodes,
because some node was apparently finalizing a lot of blocks.

To prevent this, the PR does the following:

1. If a peer sends us an empty view, we report this peer and decrease its
reputation.

2. We ensure that we only send a view update when the `heads` changed
and not only the `finalized_number`.

3. We do not send empty `ActiveLeavesUpdates` from the overseer, as such
updates carry no information. Any subsystem that relies on the finalized
block needs to listen for the overseer signal instead.

* Update node/network/bridge/src/lib.rs

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>

* Don't do any work if there are no added heads

* Fix test

* Ahhh

* More fixes

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>
2021-01-08 13:05:19 -05:00
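The three rules listed in the #2233 message above fit in a few lines. This is a simplified sketch, not the network bridge's code. It assumes a `View` that holds a list of head hashes plus a `finalized_number`, and it uses `u64` as a stand-in for a block hash. The helper functions are hypothetical.

```rust
/// Stand-in for a block hash (assumption; the node uses a 32-byte hash type).
type Hash = u64;

/// Simplified view: the heads we are interested in plus the finalized number.
#[derive(Clone, Debug, Default, PartialEq)]
struct View {
    heads: Vec<Hash>,
    finalized_number: u32,
}

/// Rule 1: a peer that announces an empty view should be reported.
fn should_report_peer_view(view: &View) -> bool {
    view.heads.is_empty()
}

/// Rule 2: send a view update only if the heads changed; a change in
/// `finalized_number` alone is not enough.
fn should_send_view_update(old: &View, new: &View) -> bool {
    old.heads != new.heads
}

/// Rule 3: an active-leaves update that neither activates nor
/// deactivates any leaf is not broadcast.
fn should_broadcast_leaves_update(activated: &[Hash], deactivated: &[Hash]) -> bool {
    !activated.is_empty() || !deactivated.is_empty()
}
```

Rule 2 compares the heads in order. If views are kept sorted, as the later "sort view" change in this log does, this is equivalent to comparing them as sets.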
Bastian Köcher 0d614374e9 Switch to latest Jaeger and improve the spans (#2216)
* Switch to latest Jaeger and improve the spans

* Update node/jaeger/src/lib.rs

Co-authored-by: Robert Habermeier <rphmeier@gmail.com>

* Use better span in bitfield signing

* Update node/core/bitfield-signing/src/lib.rs

Co-authored-by: Andronik Ordian <write@reusable.software>

Co-authored-by: Robert Habermeier <rphmeier@gmail.com>
Co-authored-by: Andronik Ordian <write@reusable.software>
2021-01-07 20:44:20 +01:00
Peter Goodspeed-Niklaus cd58c02cd3 Add metrics timing message passing from OverseerSubsystemContext to Overseer::route_message (#2201)
* add timing setup to OverseerSubsystemContext

* figure out how to initialize the rng

* attach a timer to a portion of the messages traveling to the Overseer

This timer only exists / logs a fraction of the time (configurable
by `MESSAGE_TIMER_METRIC_CAPTURE_RATE`). When it exists, it tracks
the span between the `OverseerSubsystemContext` receiving the message
and its receipt in `Overseer::run`.

* propagate message timing to the start of route_message

This should be more accurate; it ensures that the timer runs
at least as long as that function. As `route_message` is async,
it may not actually run for some time after it is called (or ever).

* fix failing test

* rand_chacha apparently implicitly has getrandom feature

* change rng initialization

The previous impl using `from_entropy` depends on the `getrandom`
crate, which uses the system entropy source, and which does not
work on `wasm32-unknown-unknown` because it wants to fall back to
a JS implementation which we can't assume exists.

This impl depends only on `rand::thread_rng`, which has no documentation
stating that it's similarly limited.

* remove randomness in favor of a simpler 1 of N procedure

This deserves a bit of explanation, as the motivating issue explicitly
requested randomness. In short, it's hard to get randomness to compile
for `wasm32-unknown-unknown` because that is explicitly intended to be
as deterministic as practical. Additionally, even though it would never
be used for consensus purposes, it still felt off-putting to intentionally
introduce randomness into a node's operations. Except, it wasn't really
random, either: it was a deterministic PRNG varying only in its state,
and getting the state to work right for that target would have required
initializing from a constant.

Given that it was a deterministic sequence anyway, it seemed much simpler
and more explicit to simply select one of every N messages instead of
attempting any kind of realistic randomness.

* reinstate randomness for better statistical properties

This partially reverts commit 0ab8594c328b3f9ce1f696fe405556d4000630e9.

`oorandom` is much lighter than the previous `rand`-based implementation,
which makes this easier to work with.

This implementation gives each subsystem and each child RNG a distinct
increment, which should ensure they produce distinct streams of values.
2021-01-06 14:25:04 +01:00
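The #2201 message above describes timing only a fraction of messages. Each subsystem gets its own RNG with a distinct increment, so that the streams differ. The intermediate 1-of-N variant amounts to `counter % n == 0`. `oorandom` implements the PCG family of generators, and the sketch below hand-rolls the standard PCG32 (XSH-RR) step so that it needs no dependencies. It is an illustration under those assumptions, not the overseer's code. `Pcg32`, `MessageSampler` and its `capture_rate` parameter are hypothetical; only the idea of a capture rate (`MESSAGE_TIMER_METRIC_CAPTURE_RATE`) comes from the commit.

```rust
/// Minimal PCG32 (XSH-RR) generator, standing in for the `oorandom` crate.
struct Pcg32 {
    state: u64,
    inc: u64, // always odd; distinct increments give distinct streams
}

impl Pcg32 {
    const MULTIPLIER: u64 = 6364136223846793005;

    /// Seeds the generator; `stream` selects the increment.
    fn new(seed: u64, stream: u64) -> Self {
        let mut rng = Pcg32 { state: 0, inc: (stream << 1) | 1 };
        rng.next_u32();
        rng.state = rng.state.wrapping_add(seed);
        rng.next_u32();
        rng
    }

    fn next_u32(&mut self) -> u32 {
        let old = self.state;
        self.state = old.wrapping_mul(Self::MULTIPLIER).wrapping_add(self.inc);
        let xorshifted = (((old >> 18) ^ old) >> 27) as u32;
        let rot = (old >> 59) as u32;
        xorshifted.rotate_right(rot)
    }
}

/// Hypothetical sampler deciding whether a message gets a timer attached.
struct MessageSampler {
    rng: Pcg32,
    threshold: u64, // 32-bit draws below this value are captured
}

impl MessageSampler {
    /// `capture_rate` is the fraction of messages to time, clamped to [0, 1].
    fn new(seed: u64, stream: u64, capture_rate: f64) -> Self {
        let rate = capture_rate.clamp(0.0, 1.0);
        let threshold = (rate * (1u64 << 32) as f64) as u64;
        MessageSampler { rng: Pcg32::new(seed, stream), threshold }
    }

    fn should_capture(&mut self) -> bool {
        u64::from(self.rng.next_u32()) < self.threshold
    }
}
```

Giving each subsystem the same seed but a different `stream` keeps the runs reproducible while avoiding correlated sampling across subsystems. A plain modulo counter cannot provide that.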
Bastian Köcher 5be092894e Add one Jaeger span per relay parent (#2196)
* Add one Jaeger span per relay parent

This adds one Jaeger span per relay parent, instead of each subsystem
creating its own new span for every relay parent. This should improve the
UI view, because subsystems are now grouped below one common span.

* Fix doc tests

* Rename `PerLeaveSpan` to `PerLeafSpan`

* More renaming

* More

* Update node/subsystem/src/lib.rs

Co-authored-by: Andronik Ordian <write@reusable.software>

* Skip the spans

* Increase `spec_version`

Co-authored-by: Andronik Ordian <write@reusable.software>
2021-01-05 15:09:25 +01:00
Robert Habermeier e95be77eb6 overseer: observe stalled subsystems and shut down (#2148)
* overseer: observe stalled subsystems and shut down

* notify on send_message failure as well
2020-12-20 20:30:02 +00:00
Robert Klotzner 9ebb9015d3 Some typos and misspellings in docs I found during my studies. (#2144)
* Fix stale link to overseer docs

* Some typos and misspellings in docs/comments

I found while studying how Polkadot works.
2020-12-18 18:31:43 -05:00
Andronik Ordian 3f5156e866 refactor View to include finalized_number (#2128)
* refactor View to include finalized_number

* guide: update the NetworkBridge on BlockFinalized

* av-store: fix the tests

* actually fix tests

* grumbles

* ignore macro doctest

* use Hash::repeat_bytes more consistently

* broadcast empty leaves updates as well

* fix issuing view updates on empty leaves updates
2020-12-17 12:50:58 -05:00
Bastian Köcher d7047578e9 Fix tests on master (#2080)
Because of a bug in the test script, CI did not stop when the main
tests failed.
2020-12-07 14:47:39 +00:00
Bastian Köcher 4ce744818c Some code cleanup in overseer (#2008)
* Some code cleanup in overseer

- Switches to select! in the overseer run loop to be more fair about
message processing between the different sources.
- Added a check to only send `ActiveLeaves` if the update actually
contains any data.

* Move the check

* Restore old behavior

* Simplify message sending and signal sending to subsystems

* Update node/subsystem/src/lib.rs
2020-11-25 09:27:50 +00:00
Andronik Ordian 69b103b1d5 overseer: send_msg should not return an error (#1995)
* send_message should not return an error

* Apply suggestions from code review

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>

* s/send_logging_error/send_and_log_error

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>
2020-11-23 12:42:14 +01:00
Peter Goodspeed-Niklaus e49989971d Add tracing support to node (#1940)
* drop in tracing to replace log

* add structured logging to trace messages

* add structured logging to debug messages

* add structured logging to info messages

* add structured logging to warn messages

* add structured logging to error messages

* normalize spacing and Display vs Debug

* add instrumentation to the various 'fn run'

* use explicit tracing module throughout

* fix availability distribution test

* don't double-print errors

* remove further redundancy from logs

* fix test errors

* fix more test errors

* remove unused kv_log_macro

* fix unused variable

* add tracing spans to collation generation

* add tracing spans to av-store

* add tracing spans to backing

* add tracing spans to bitfield-signing

* add tracing spans to candidate-selection

* add tracing spans to candidate-validation

* add tracing spans to chain-api

* add tracing spans to provisioner

* add tracing spans to runtime-api

* add tracing spans to availability-distribution

* add tracing spans to bitfield-distribution

* add tracing spans to network-bridge

* add tracing spans to collator-protocol

* add tracing spans to pov-distribution

* add tracing spans to statement-distribution

* add tracing spans to overseer

* cleanup
2020-11-20 12:02:04 +01:00
Andronik Ordian 0a8a607a58 update most of the dependencies (#1946)
* update tiny-keccak to 0.2

* update deps except bitvec and shared_memory

* fix some warnings after the futures upgrade

* remove useless package rename caused by bug in cargo-upgrade

* revert parity-util-mem *

* remove unused import

* cargo update

* remove all renames on parity-scale-codec

* remove the leftovers

* remove unused dep
2020-11-17 11:16:31 +01:00
Bastian Köcher 640264f38b Make CandidateHash a real type (#1916)
* Make `CandidateHash` a real type

This PR adds a new type, `CandidateHash`, that is used instead of the
opaque `Hash` type. This helps ensure at the type-system level that
we are passing the correct types.

This PR also fixes the incorrect use of `relay_parent` as `candidate_hash`
when communicating with the availability store.

* Update core-primitives/src/lib.rs

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>

* Wrap the lines

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>
2020-11-05 16:28:45 +01:00
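The newtype pattern behind `CandidateHash` can be sketched like this, assuming a plain 32-byte array in place of the real opaque `Hash` type. `AvailabilityStore` is a hypothetical, in-memory stand-in for the av-store, included only to show where the type check applies.

```rust
/// Stand-in for the opaque 32-byte hash type (assumption).
pub type Hash = [u8; 32];

/// A distinct type for candidate hashes. Because it wraps `Hash` instead of
/// being a type alias, the compiler rejects a relay-parent `Hash` wherever a
/// `CandidateHash` is expected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CandidateHash(pub Hash);

/// Hypothetical in-memory store keyed by candidate hash.
#[derive(Default)]
pub struct AvailabilityStore {
    entries: Vec<(CandidateHash, Vec<u8>)>,
}

impl AvailabilityStore {
    /// Only a `CandidateHash` is accepted as the key.
    pub fn store(&mut self, candidate: CandidateHash, data: Vec<u8>) {
        self.entries.push((candidate, data));
    }

    /// Look up data by candidate hash.
    pub fn get(&self, candidate: &CandidateHash) -> Option<&Vec<u8>> {
        self.entries.iter().find(|(c, _)| c == candidate).map(|(_, d)| d)
    }
}
```

With this in place, `store.store(relay_parent, data)` with `relay_parent: Hash` fails to compile, which is exactly the class of mistake the commit fixes.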
Bastian Köcher 002e1141a8 Parachain improvements (#1905)
* Parachain improvements

- Set the parachains configuration in Rococo genesis
- Don't stop the overseer when a subsystem job is stopped
- Several small code changes

* Remove unused functionality

* Return error from the runtime instead of printing it

* Apply suggestions from code review

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>

* Update primitives/src/v1.rs

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>

* Update primitives/src/v1.rs

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>

* Fix test

* Revert "Update primitives/src/v1.rs"

This reverts commit 11fce2785acd1de481ca57815b8e18400f09fd52.

* Revert "Update primitives/src/v1.rs"

This reverts commit d6439fed4f954360c89fb1e12b73954902c76a41.

* Revert "Return error from the runtime instead of printing it"

This reverts commit cb4b5c0830ac516a6d54b2c24197e9354f2b98cb.

* Revert "Fix test"

This reverts commit 0c5fa1b5566d4cd3c55a55d485e707165ce7a59e.

* Update runtime/parachains/src/runtime_api_impl/v1.rs

Co-authored-by: Sergei Shulepov <sergei@parity.io>

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>
Co-authored-by: Sergei Shulepov <sergei@parity.io>
2020-11-03 11:22:38 +00:00
Fedor Sakharov 935fcd1666 Change SpawnedSubsystem type to log subsystem errors (#1878)
* Change SpawnedSubsystem type to log subsystem errors

* Remove clone
2020-10-28 20:57:06 +01:00
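A rough model of why this type change helps: if a spawned subsystem's task yields a `Result` rather than `()`, whoever drives it can log the error together with the subsystem's name. This sketch is synchronous and hypothetical; the real `SpawnedSubsystem` wraps an asynchronous future, and the `SubsystemError` and `run_and_log` defined here are illustrative placeholders.

```rust
/// Placeholder error type a subsystem might return.
#[derive(Debug)]
pub struct SubsystemError(pub String);

/// Hypothetical, synchronous stand-in for `SpawnedSubsystem`: a name plus a
/// task that returns `Result` instead of `()`, so failures are observable.
pub struct SpawnedSubsystem {
    pub name: &'static str,
    pub task: Box<dyn FnOnce() -> Result<(), SubsystemError>>,
}

/// Run the task; on error, log it tagged with the subsystem's name.
/// Returns the emitted log line (if any) so it can be inspected.
pub fn run_and_log(subsystem: SpawnedSubsystem) -> Option<String> {
    let SpawnedSubsystem { name, task } = subsystem;
    match task() {
        Ok(()) => None,
        Err(err) => {
            let line = format!("subsystem {} exited with error: {:?}", name, err);
            eprintln!("{}", line);
            Some(line)
        }
    }
}
```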
Peter Goodspeed-Niklaus 1a25c41277 start working on building the real overseer (#1795)
* start working on building the real overseer

Unfortunately, this fails to compile right now due to an upstream
compilation failure, probably brought on by a recent upgrade
to rustc v1.47.

* fill in AllSubsystems internal constructors

* replace fn make_metrics with Metrics::attempt_to_register

* update to account for #1740

* remove Metrics::register, rename Metrics::attempt_to_register

* add 'static bounds to real_overseer type params

* pass authority_discovery and network_service to real_overseer

It's not obvious that this is the best way to handle the case when
there is no authority discovery service, but it seems to be the best
option available at the moment.

* select a proper database configuration for the availability store db

* use subdirectory for av-store database path

* apply Basti's patch which avoids needing to parameterize everything on Block

* simplify path extraction

* get all tests to compile

* Fix Prometheus double-registry error

For debugging purposes, I added this to node/subsystem-util/src/lib.rs:472-476:

```rust
Some(registry) => Self::try_register(registry).map_err(|err| {
	eprintln!("PrometheusError calling {}::register: {:?}", std::any::type_name::<Self>(), err);
	err
}),
```

That pointed out where the registration was failing, which led to
this fix. The test still doesn't pass, but it now fails in a new
and different way!

* authorities must have authority discovery, but not necessarily overseer handlers

* fix broken SpawnedSubsystem impls

Detailed logging showed that with the `Box::new` style of
future generation, the `self.run` method was never called,
which left those subsystems with dropped receivers / closed senders
and caused the overseer to shut down immediately.

This is not the final fix needed to get things working properly,
but it's a good start.

* use prometheus properly

Prometheus lets us register simple counters, which aren't very
interesting. It also allows us to register CounterVecs, which are.
With a CounterVec, you can provide a set of labels, which can
later be used to filter the counts.

We were using them wrong, though. This pattern was repeated in a
variety of places in the code:

```rust
// panics with a label cardinality mismatch: two label names are declared,
// but only one value is supplied
let my_counter = register(CounterVec::new(opts, &["succeeded", "failed"])?, registry)?;
my_counter.with_label_values(&["succeeded"]).inc();
```

The problem is that the labels provided in the constructor are not
the set of legal label values, but the set of label names, each of
which can take arbitrary values.

This commit fixes that.
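
For contrast with the wrong pattern above, here is a minimal model of the intended semantics. It is a hypothetical, std-only stand-in, not the `prometheus` crate's API: the constructor takes label names (one here, a hypothetical `success`), and each increment supplies exactly one value per name. With the real crate, the equivalent would be declaring `CounterVec::new(opts, &["success"])` and calling `with_label_values(&["succeeded"])` or `with_label_values(&["failed"])`.

```rust
use std::collections::HashMap;

/// Hypothetical model of a labeled counter vector (not the real crate).
/// `label_names` fixes how many labels each sample carries; the values
/// of those labels are arbitrary strings.
pub struct CounterVecModel {
    label_names: Vec<&'static str>,
    counts: HashMap<Vec<String>, u64>,
}

impl CounterVecModel {
    pub fn new(label_names: &[&'static str]) -> Self {
        Self { label_names: label_names.to_vec(), counts: HashMap::new() }
    }

    /// Increment the counter for one combination of label values.
    /// Panics if the number of values differs from the number of names,
    /// i.e. the cardinality mismatch the old code triggered.
    pub fn inc(&mut self, label_values: &[&str]) {
        assert_eq!(label_values.len(), self.label_names.len(), "label cardinality mismatch");
        let key = label_values.iter().map(|v| v.to_string()).collect();
        *self.counts.entry(key).or_insert(0) += 1;
    }

    /// Current count for one combination of label values (0 if never seen).
    pub fn get(&self, label_values: &[&str]) -> u64 {
        let key: Vec<String> = label_values.iter().map(|v| v.to_string()).collect();
        self.counts.get(&key).copied().unwrap_or(0)
    }
}
```

One label name, many values: a dashboard can then filter on `success="succeeded"` versus `success="failed"` from a single metric.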

* get av-store subsystem to actually run properly and not die on first signal

* typo fix: incomming -> incoming

* don't disable authority discovery in test nodes

* Fix rococo-v1 missing session keys

* Update node/core/av-store/Cargo.toml

* try dummying out av-store on non-full-nodes

* overseer and subsystems are required only for full nodes

* Reduce the amount of warnings on browser target

* Fix two more warnings

* InclusionInherent should actually have an Inherent module on rococo

* Ancestry: don't return genesis' parent hash

* Update Cargo.lock

* fix broken test

* update test script: specify chainspec as script argument

* Apply suggestions from code review

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

* Update node/service/src/lib.rs

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

* node/service/src/lib: Return error via ? operator

* post-merge blues

* add is_collator flag

* prevent occasional av-store test panic

* simplify fix; expand application

* run authority_discovery in Role::Discover when collating

* distinguish between proposer closed channel errors

* add IsCollator enum, remove is_collator CLI flag

* improve formatting

* remove nop loop

* Fix some stuff

Co-authored-by: Andronik Ordian <write@reusable.software>
Co-authored-by: Bastian Köcher <git@kchr.de>
Co-authored-by: Fedor Sakharov <fedor.sakharov@gmail.com>
Co-authored-by: Robert Habermeier <robert@Roberts-MBP.lan1>
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
Co-authored-by: Max Inden <mail@max-inden.de>
2020-10-28 10:26:50 +00:00
Bernhard Schuster f345123748 introduce errors with info (#1834) 2020-10-27 08:10:03 +01:00
Bastian Köcher f2d7b6f5ac Make AllSubsystems usage easier in tests (#1794)
* Make `AllSubsystems` usage easier in tests

This makes `AllSubsystems` easier to use in tests by introducing
new methods:

- `dummy` initializes `AllSubsystems` with every subsystem set to a dummy
- `replace_*` replaces any individual subsystem

Besides that, this PR adds a `ForwardSubsystem`, which is also useful for
tests. This subsystem forwards all incoming messages to the given channel.

* Update node/overseer/src/lib.rs

Co-authored-by: Andronik Ordian <write@reusable.software>

* Update node/subsystem/src/lib.rs

Co-authored-by: Andronik Ordian <write@reusable.software>

* Update node/subsystem/src/lib.rs

Co-authored-by: Andronik Ordian <write@reusable.software>

* Move ForwardSubsystem and add a test

* Break some lines

Co-authored-by: Andronik Ordian <write@reusable.software>
2020-10-08 11:27:19 +02:00
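A minimal sketch of the two testing aids, assuming a synchronous `std::sync::mpsc` channel and a two-slot `AllSubsystems` with hypothetical slots `a` and `b`; the real struct has one slot per subsystem, and its replacement methods are named per subsystem (`replace_*`). `dummy()` starts every slot as a placeholder, and `replace_a` swaps one slot for a `ForwardSubsystem` that passes every message on to a channel the test can read from.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical forwarding test subsystem: every message it handles is
/// sent on to a channel held by the test.
pub struct ForwardSubsystem<M> {
    tx: Sender<M>,
}

impl<M> ForwardSubsystem<M> {
    pub fn new(tx: Sender<M>) -> Self {
        Self { tx }
    }

    /// Forward one incoming message. Returns false once the test side
    /// has dropped its receiver.
    pub fn handle(&self, msg: M) -> bool {
        self.tx.send(msg).is_ok()
    }
}

/// Placeholder subsystem that does nothing.
#[derive(Debug, Default, PartialEq)]
pub struct Dummy;

/// Hypothetical two-slot stand-in for `AllSubsystems`, generic per slot.
pub struct AllSubsystems<A, B> {
    pub a: A,
    pub b: B,
}

impl AllSubsystems<Dummy, Dummy> {
    /// Every slot starts as a dummy.
    pub fn dummy() -> Self {
        Self { a: Dummy, b: Dummy }
    }
}

impl<A, B> AllSubsystems<A, B> {
    /// Replace slot `a` (possibly changing its type), keeping slot `b`.
    pub fn replace_a<N>(self, a: N) -> AllSubsystems<N, B> {
        AllSubsystems { a, b: self.b }
    }
}
```

Making each slot a type parameter is what lets `replace_*` change one subsystem's type while leaving the rest untouched.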