Commit Graph

22 Commits

Author SHA1 Message Date
Andronik Ordian 69b103b1d5 overseer: send_msg should not return an error (#1995)
* send_message should not return an error

* Apply suggestions from code review

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>

* s/send_logging_error/send_and_log_error

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>
2020-11-23 12:42:14 +01:00
Peter Goodspeed-Niklaus 0a5bc82529 Add Prometheus timers to the subsystems (#1923)
* reexport prometheus-super for ease of use of other subsystems

* add some prometheus timers for collation generation subsystem

* add timing metrics to av-store

* add metrics to candidate backing

* add timing metric to bitfield signing

* add timing metrics to candidate selection

* add timing metrics to candidate-validation

* add timing metrics to chain-api

* add timing metrics to provisioner

* add timing metrics to runtime-api

* add timing metrics to availability-distribution

* add timing metrics to bitfield-distribution

* add timing metrics to collator protocol: collator side

* add timing metrics to collator protocol: validator side

* fix candidate validation test failures

* add timing metrics to pov distribution

* add timing metrics to statement-distribution

* use substrate_prometheus_endpoint prometheus reexport instead of prometheus_super

* don't include JOB_DELAY in bitfield-signing metrics

* give adder-collator ability to easily export its genesis-state and validation code

* wip: adder-collator pushbutton script

* don't attempt to register the adder-collator automatically

Instead, get these values with

```sh
target/release/adder-collator export-genesis-state
target/release/adder-collator export-genesis-wasm
```

And then register the parachain on https://polkadot.js.org/apps/?rpc=ws%3A%2F%2F127.0.0.1%3A9944#/explorer

To collect prometheus data, after running the script, create `prometheus.yml` per the instructions
at https://www.notion.so/paritytechnologies/Setting-up-Prometheus-locally-835cb3a9df7541a781c381006252b5ff
and then run:

```sh
docker run -v `pwd`/prometheus.yml:/etc/prometheus/prometheus.yml:z --network host prom/prometheus
```

Demonstrates that data makes it across to prometheus, though it is likely to be useful in the future
to tweak the buckets.

* Update parachain/test-parachains/adder/collator/src/cli.rs

Co-authored-by: Andronik Ordian <write@reusable.software>

* use the grandpa-pause parameter

* skip metrics in tracing instrumentation

* remove unnecessary grandpa_pause cli param

Co-authored-by: Andronik Ordian <write@reusable.software>
2020-11-20 15:04:51 +01:00
Peter Goodspeed-Niklaus e49989971d Add tracing support to node (#1940)
* drop in tracing to replace log

* add structured logging to trace messages

* add structured logging to debug messages

* add structured logging to info messages

* add structured logging to warn messages

* add structured logging to error messages

* normalize spacing and Display vs Debug

* add instrumentation to the various 'fn run'

* use explicit tracing module throughout

* fix availability distribution test

* don't double-print errors

* remove further redundancy from logs

* fix test errors

* fix more test errors

* remove unused kv_log_macro

* fix unused variable

* add tracing spans to collation generation

* add tracing spans to av-store

* add tracing spans to backing

* add tracing spans to bitfield-signing

* add tracing spans to candidate-selection

* add tracing spans to candidate-validation

* add tracing spans to chain-api

* add tracing spans to provisioner

* add tracing spans to runtime-api

* add tracing spans to availability-distribution

* add tracing spans to bitfield-distribution

* add tracing spans to network-bridge

* add tracing spans to collator-protocol

* add tracing spans to pov-distribution

* add tracing spans to statement-distribution

* add tracing spans to overseer

* cleanup
2020-11-20 12:02:04 +01:00
Bastian Köcher 640264f38b Make CandidateHash a real type (#1916)
* Make `CandidateHash` a real type

This pr adds a new type `CandidateHash` that is used instead of the
opaque `Hash` type. This helps to ensure on the type system level that
we are passing the correct types.

This pr also fixes wrong usage of `relay_parent` as `candidate_hash`
when communicating with the av storage.

* Update core-primitives/src/lib.rs

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>

* Wrap the lines

Co-authored-by: Peter Goodspeed-Niklaus <coriolinus@users.noreply.github.com>
2020-11-05 16:28:45 +01:00
Fedor Sakharov 935fcd1666 Change SpawnedSubsystem type to log subsystem errors (#1878)
* Change SpawnedSubsystem type to log subsystem errors

* Remove clone
2020-10-28 20:57:06 +01:00
Peter Goodspeed-Niklaus 1a25c41277 start working on building the real overseer (#1795)
* start working on building the real overseer

Unfortunately, this fails to compile right now due to an upstream
failure to compile which is probably brought on by a recent upgrade
to rustc v1.47.

* fill in AllSubsystems internal constructors

* replace fn make_metrics with Metrics::attempt_to_register

* update to account for #1740

* remove Metrics::register, rename Metrics::attempt_to_register

* add 'static bounds to real_overseer type params

* pass authority_discovery and network_service to real_overseer

It's not straightforwardly obvious that this is the best way to handle
the case when there is no authority discovery service, but it seems
to be the best option available at the moment.

* select a proper database configuration for the availability store db

* use subdirectory for av-store database path

* apply Basti's patch which avoids needing to parameterize everything on Block

* simplify path extraction

* get all tests to compile

* Fix Prometheus double-registry error

for debugging purposes, added this to node/subsystem-util/src/lib.rs:472-476:

```rust
Some(registry) => Self::try_register(registry).map_err(|err| {
	eprintln!("PrometheusError calling {}::register: {:?}", std::any::type_name::<Self>(), err);
	err
}),
```

That pointed out where the registration was failing, which led to
this fix. The test still doesn't pass, but it now fails in a new
and different way!

* authorities must have authority discovery, but not necessarily overseer handlers

* fix broken SpawnedSubsystem impls

detailed logging determined that using the `Box::new` style of
future generation, the `self.run` method was never being called,
leading to dropped receivers / closed senders for those subsystems,
causing the overseer to shut down immediately.

This is not the final fix needed to get things working properly,
but it's a good start.

* use prometheus properly

Prometheus lets us register simple counters, which aren't very
interesting. It also allows us to register CounterVecs, which are.
With a CounterVec, you can provide a set of labels, which can
later be used to filter the counts.

We were using them wrong, though. This pattern was repeated in a
variety of places in the code:

```rust
// panics with an cardinality mismatch
let my_counter = register(CounterVec::new(opts, &["succeeded", "failed"])?, registry)?;
my_counter.with_label_values(&["succeeded"]).inc()
```

The problem is that the labels provided in the constructor are not
the set of legal values which can be annotated, but a set of individual
label names which can have individual, arbitrary values.

This commit fixes that.

* get av-store subsystem to actually run properly and not die on first signal

* typo fix: incomming -> incoming

* don't disable authority discovery in test nodes

* Fix rococo-v1 missing session keys

* Update node/core/av-store/Cargo.toml

* try dummying out av-store on non-full-nodes

* overseer and subsystems are required only for full nodes

* Reduce the amount of warnings on browser target

* Fix two more warnings

* InclusionInherent should actually have an Inherent module on rococo

* Ancestry: don't return genesis' parent hash

* Update Cargo.lock

* fix broken test

* update test script: specify chainspec as script argument

* Apply suggestions from code review

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

* Update node/service/src/lib.rs

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

* node/service/src/lib: Return error via ? operator

* post-merge blues

* add is_collator flag

* prevent occasional av-store test panic

* simplify fix; expand application

* run authority_discovery in Role::Discover when collating

* distinguish between proposer closed channel errors

* add IsCollator enum, remove is_collator CLI flag

* improve formatting

* remove nop loop

* Fix some stuff

Co-authored-by: Andronik Ordian <write@reusable.software>
Co-authored-by: Bastian Köcher <git@kchr.de>
Co-authored-by: Fedor Sakharov <fedor.sakharov@gmail.com>
Co-authored-by: Robert Habermeier <robert@Roberts-MBP.lan1>
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
Co-authored-by: Max Inden <mail@max-inden.de>
2020-10-28 10:26:50 +00:00
Bernhard Schuster f345123748 introduce errors with info (#1834) 2020-10-27 08:10:03 +01:00
Rakan Alhneiti bd75a4ce18 Update to work with async keystore – Companion PR for #7000 (#1740)
* Fix keystore types

* Use SyncCryptoStorePtr

* Borrow keystore

* Fix unused imports

* Fix polkadot service

* Fix bitfield-distribution tests

* Fix indentation

* Fix backing tests

* Fix tests

* Fix provisioner tests

* Removed SyncCryptoStorePtr

* Fix services

* Address PR feedback

* Address PR feedback - 2

* Update CryptoStorePtr imports to be from sp_keystore

* Typo

* Fix CryptoStore import

* Document the reason behind using filesystem keystore

* Remove VALIDATORS

* Fix duplicate dependency

* Mark sp-keystore as optional

* Fix availability distribution

* Fix call to sign_with

* Fix keystore usage

* Remove tokio and fix parachains Cargo config

* Typos

* Fix keystore dereferencing

* Fix CryptoStore import

* Fix provisioner

* Fix node backing

* Update services

* Cleanup dependencies

* Use sync_keystore

* Fix node service

* Fix node service - 2

* Fix node service - 3

* Rename CryptoStorePtr to SyncCryptoStorePtr

* "Update Substrate"

* Apply suggestions from code review

* Update node/core/backing/Cargo.toml

* Update primitives/src/v0.rs

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

* Fix wasm build

* Update Cargo.lock

Co-authored-by: parity-processbot <>
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
2020-10-09 10:54:03 +00:00
Andronik Ordian 579614d127 implement remaining subsystem metrics (#1770)
* overseer metrics: messages relayed

* provisioner metrics: cosmetic changes

* candidate selection metrics: cosmetic changes

* availability bitfields metrics

* availability distribution metrics

* PoV distribution metrics

* statement-distribution: small simplification

* statement-distribution: extract log target into a const

* statement-distribution: metrics

* address review nits
2020-10-01 12:08:03 +02:00
Andronik Ordian de05bec4d6 move Metrics to utils (#1765) 2020-09-29 11:42:20 +00:00
Fedor Sakharov 9756b2d676 Register listeners in statement distribution (#1759)
* Register listeners in statement distribution

* Review fixes
2020-09-28 13:55:59 +00:00
Andronik Ordian e7ead40255 initial prometheus metrics (#1536)
* service-new: cosmetic changes

* overseer: draft of prometheus metrics

* metrics: update active_leaves metrics

* metrics: extract into functions

* metrics: resolve XXX

* metrics: it's ugly, but it works

* Bump Substrate

* metrics: move a bunch of code around

* Bumb substrate again

* metrics: fix a warning

* fix a warning in runtime

* metrics: statements signed

* metrics: statements impl RegisterMetrics

* metrics: refactor Metrics trait

* metrics: add Metrics assoc type to JobTrait

* metrics: move Metrics trait to util

* metrics: fix overseer

* metrics: fix backing

* metrics: fix candidate validation

* metrics: derive Default

* metrics: docs

* metrics: add stubs for other subsystems

* metrics: add more stubs and fix compilation

* metrics: fix doctest

* metrics: move to subsystem

* metrics: fix candidate validation

* metrics: bitfield signing

* metrics: av store

* metrics: chain API

* metrics: runtime API

* metrics: stub for avad

* metrics: candidates seconded

* metrics: ok I gave up

* metrics: provisioner

* metrics: remove a clone by requiring Metrics: Sync

* metrics: YAGNI

* metrics: remove another TODO

* metrics: for later

* metrics: add parachain_ prefix

* metrics: s/signed_statement/signed_statements

* utils: add a comment for job metrics

* metrics: address review comments

* metrics: oops

* metrics: make sure to save files before commit 😅

* use _total suffix for requests metrics

Co-authored-by: Max Inden <mail@max-inden.de>

* metrics: add tests for overseer

* update Cargo.lock

* overseer: add a test for CollationGeneration

* collation-generation: impl metrics

* collation-generation: use kebab-case for name

* collation-generation: add a constructor

Co-authored-by: Gav Wood <gavin@parity.io>
Co-authored-by: Ashley Ruglys <ashley.ruglys@gmail.com>
Co-authored-by: Max Inden <mail@max-inden.de>
2020-08-18 11:18:54 +02:00
Robert Habermeier a6b1d91d6e Network bridge refactoring impl (#1537)
* update networking types

* port over overseer-protocol message types

* Add the collation protocol to network bridge

* message sending

* stub for ConnectToValidators

* add some helper traits and methods to protocol types

* add collator protocol message

* leaves-updating

* peer connection and disconnection

* add utilities for dispatching multiple events

* implement message handling

* add an observedrole enum with equality and no sentry nodes

* derive partial-eq on network bridge event

* add PartialEq impls for network message types

* add Into implementation for observedrole

* port over existing network bridge tests

* add some more tests

* port bitfield distribution

* port over bitfield distribution tests

* add codec indices

* port PoV distribution

* port over PoV distribution tests

* port over statement distribution

* port over statement distribution tests

* update overseer and service-new

* address review comments

* port availability distribution

* port over availability distribution tests
2020-08-12 11:16:28 +00:00
Peter Goodspeed-Niklaus 9fda6cb416 break out subsystem-util and subsystem-test-helpers into individual crates (#1553)
* break out subsystem-util and subsystem-test-helpers into individual crates

* cause all packages to check successfully
2020-08-07 16:51:36 +02:00
Robert Habermeier c8cdfbfd17 Add new Runtime API messages and make runtime API request fallible (#1485)
* polkadot-subsystem: update runtime API message types

* update all networking subsystems to use fallible runtime APIs

* fix bitfield-signing and make it use new runtime APIs

* port candidate-backing to handle runtime API errors and new types

* remove old runtime API messages

* remove unused imports

* fix grumbles

* fix backing tests
2020-07-28 18:02:39 +00:00
Fedor Sakharov 32a20a178c Availability store subsystem (#1404)
* Initial commit

* WIP

* Make atomic transactions

* Remove pruning code

* Fix build and add a Nop to bridge

* Fixes from review

* Move config struct around for clarity

* Rename constructor and warn on missing docs

* Fix a test and rename a message

* Fix some more reviews

* Obviously failed to rebase cleanly
2020-07-27 14:13:02 +03:00
Peter Goodspeed-Niklaus 106bd929ce add ActiveLeavesUpdate, remove StartWork, StopWork (#1458)
* add ActiveLeavesUpdate, remove StartWork, StopWork

* replace StartWork, StopWork in subsystem crate tests

* mechanically update OverseerSignal in other modules

* convert overseer to take advantage of new multi-hash update abilities

Note: this does not yet convert the tests; some of the tests now freeze:

test tests::overseer_start_stop_works ... test tests::overseer_start_stop_works has been running for over 60 seconds
test tests::overseer_finalize_works ... test tests::overseer_finalize_works has been running for over 60 seconds

* fix broken overseer tests

* manually impl PartialEq for ActiveLeavesUpdate, rm trait Equivalent

This cleans up the code a bit and makes it easier in the future to
do the right thing when comparing ALUs.

* use target in all network bridge logging

* reduce spamming of  and
2020-07-27 10:39:52 +02:00
Bastian Köcher fa598f176b Companion for #6726 (#1469)
* Companion for #6726

* Spaces

* 'Update substrate'

Co-authored-by: parity-processbot <>
2020-07-26 13:16:09 +00:00
Peter Goodspeed-Niklaus 5cfcc8446c Add test suite and minor refinements to the utility subsystem (#1403)
* get conclude signal working properly; don't allocate a vector

* wip: add test suite / example / explanation for using utility subsystem

Unfortunately, the test fails right now for reasons which seem
very odd. Just have to keep poking at it.

* explicitly import everything

* fix subsystem-util test

The root problem here was two-fold:

- there was a circular dependency from subsystem -> test-helpers/subsystem ->
  subsystem
- cfg(test) doesn't propagate between crates

The solution: move the subsystem test helpers into a sub-module
within subsystem. Publicly export them from the previous location
so no other code breaks.

Doing this has an additional benefit: it ensures that no production
code can ever accidentally use the subsystem helpers, as they are compile-
gated on cfg(test).

* fully commit to moving test helpers into a subsystem module

* add some more tests

* get rid of log tests in favor of real error forwarding

It's not obvious whether we'll ever really want to chase down
these errors outside a testing context, but having the capability
won't hurt.

* fix issue which caused test to hang on osx

* only require that job errors are PartialEq when testing

also fix polkadot-node-core-backing tests

* get rid of any notion of partialeq

* rethink testing

Combine tests of starting and stopping job: leaving a test executor
with a job running was pretty clearly the cause of the sometimes-hang.

Also, add a timeout so tests _can't_ hang anymore; they just fail
after a while.

* rename fwd_errors -> forward_errors

* warn on error propagation failure

* fix unused import leftover from merge

* derive eq for subsystemerror
2020-07-20 20:35:14 -04:00
Fedor Sakharov 5624bd8bf4 Use SpawnNamed instead of Spawn in Overseer (#1430)
* Use SpawnNamed instead of Spawn in Overseer

* reexport SpawnNamed and fix doc tests

* Fix deps
2020-07-17 20:04:02 +03:00
Robert Habermeier 3b13cd9a85 Refactor primitives (#1383)
* create a v1 primitives module

* Improve guide on availability types

* punctuate

* new parachains runtime uses new primitives

* tests of new runtime now use new primitives

* add ErasureChunk to guide

* export erasure chunk from v1 primitives

* subsystem crate uses v1 primitives

* node-primitives uses new v1 primitives

* port overseer to new primitives

* new-proposer uses v1 primitives (no ParachainHost anymore)

* fix no-std compilation for primitives

* service-new uses v1 primitives

* network-bridge uses new primitives

* statement distribution uses v1 primitives

* PoV distribution uses v1 primitives; add PoV::hash fn

* move parachain to v0

* remove inclusion_inherent module and place into v1

* remove everything from primitives crate root

* remove some unused old types from v0 primitives

* point everything else at primitives::v0

* squanch some warns up

* add RuntimeDebug import to no-std as well

* port over statement-table and validation

* fix final errors in validation and node-primitives

* add dummy Ord impl to committed candidate receipt

* guide: update CandidateValidationMessage

* add primitive for validationoutputs

* expand CandidateValidationMessage further

* bikeshed

* add some impls to omitted-validation-data and available-data

* expand CandidateValidationMessage

* make erasure-coding generic over v1/v0

* update usages of erasure-coding

* implement commitments.hash()

* use Arc<Pov> for CandidateValidation

* improve new erasure-coding method names

* fix up candidate backing

* update docs a bit

* fix most tests and add short-circuiting to make_pov_available

* fix remainder of candidate backing tests

* squanching warns

* squanch it up

* some fallout

* overseer fallout

* free from polkadot-test-service hell
2020-07-09 21:23:03 -04:00
Robert Habermeier ac8e1ca206 Implement the Statement Distribution Subsystem (#1326)
* set up data types and control flow for statement distribution

* add some set-like methods to View

* implement sending to peers

* start fixing equivocation handling

* Add a section to the statement distribution subsystem on equivocations and flood protection

* fix typo and amend wording

* implement flood protection

* have peer knowledge tracker follow when peer first learns about a candidate

* send dependents after circulating

* add another TODO

* trigger send in one more place

* refactors from review

* send new statements to candidate backing

* instantiate active head data with runtime API values

* track our view changes and peer view changes

* apply a benefit to peers who send us statements we want

* remove unneeded TODO

* add some comments and improve Hash implementation

* start tests and fix `note_statement`

* test active_head seconding logic

* test that the per-peer tracking logic works

* test per-peer knowledge tracker

* test that peer view updates lead to messages being sent

* test statement circulation

* address review comments

* have view set methods return references
2020-07-06 11:16:17 -04:00