* .maintain/monitoring: Add alert when continuous task ends
Through the `polkadot_tasks_ended_total` Prometheus metric one can tell
when a task ended. Use this metric to alert when specific
known-to-be-continuous tasks end on a node.
* .maintain/monitoring: Don't hard-code task names
* .maintain/monitoring: Normalize alerting rules
- Start alert names with their component and end them with a
descriptive adjective.
- Describe alert duration in `message` with `for more than` across all
alerts.
* .maintain/monitoring: Fix alert tests
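As a rough illustration of the task-ended alert described in this commit, the sketch below compares two scrapes of `polkadot_tasks_ended_total` and reports the continuous tasks whose counter grew in between. The per-task dictionaries, the task names `task-a` and `task-b`, and the helper `ended_continuous_tasks` are hypothetical and only serve the example; the actual alert is a Prometheus rule, not Python. The task list is passed in as a parameter rather than hard-coded.

```python
from typing import Dict, Iterable, List


def ended_continuous_tasks(
    previous: Dict[str, float],
    current: Dict[str, float],
    continuous_tasks: Iterable[str],
) -> List[str]:
    """Return the continuous tasks whose ended-counter grew between two scrapes."""
    ended = []
    for task in continuous_tasks:
        # A missing series is treated as zero, i.e. the task has never ended.
        before = previous.get(task, 0.0)
        after = current.get(task, 0.0)
        # A counter that dropped (e.g. after a restart) is not reported here.
        if after > before:
            ended.append(task)
    return ended


# Hypothetical samples of polkadot_tasks_ended_total, keyed by task name.
previous_scrape = {"task-a": 0.0, "task-b": 3.0}
current_scrape = {"task-a": 1.0, "task-b": 3.0}

print(ended_continuous_tasks(previous_scrape, current_scrape, ["task-a", "task-b"]))
# prints ['task-a']
```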
* Initiate chaostest CLI test suite: singlenodeheight on one dev node
Added chaostest stages in CI
Added new docker/k8s resources and environments to CI
Added new chaos-only tag to gitlab-ci.yml
* Update .maintain/chaostest/src/commands/singlenodeheight/index.js
Co-authored-by: Max Inden <mail@max-inden.de>
* change nameSpace to namespace (one word)
* update chaos ci job to match template
* rename build-pr ci stage to docker [chaos:basic]
* test gitlab-ci [chaos:basic]
* Update .gitlab-ci.yml
* add new build-chaos-only condition
* add *default-vars to singlenodeheight [chaos:basic]
* change build-only to build-rules on substrate jobs [chaos:basic]
* test and change when:on_success to when:always [chaos:basic]
* resolve conflicts and test [chaos:basic]
Co-authored-by: Max Inden <mail@max-inden.de>
Co-authored-by: Denis Pisarev <denis.pisarev@parity.io>
The `HighCPUUsage` alert is based on the `cpu_usage_percentage` metric.
Instead of exposing the overall CPU usage in percent, the metric exposes
the per core usage summed over all cores.
This commit removes the alert for two reasons:
1. Substrate itself does not expose the core count and thus one cannot
alert based on the `cpu_usage_percentage` metric.
2. Alerting based on CPU usage is generic and not specific to Substrate
or blockchains. Thus any generic CPU usage alert suffices.
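To make the first reason concrete, here is a minimal sketch of why a per-core sum cannot be compared against a fixed threshold without the core count. The function `overall_cpu_percent` and the sample value of 200 are assumptions for illustration only; they are not part of Substrate.

```python
def overall_cpu_percent(summed_per_core_percent: float, core_count: int) -> float:
    """Convert a per-core usage sum into an overall usage percentage."""
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return summed_per_core_percent / core_count


# The same summed reading means very different things on different machines.
summed = 200.0
for cores in (2, 4, 8):
    print(cores, overall_cpu_percent(summed, cores))
# prints:
# 2 100.0
# 4 50.0
# 8 25.0
```

A reading of 200 is full saturation on a 2-core machine and only a quarter of the capacity on an 8-core machine, so a single threshold on the raw metric would either fire constantly or never.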
* Initial commit
Forked at: f54614e256
Parent branch: origin/master
* Remove polkadot companion detection from branch name
Even though it was nice, it was also error-prone, as there was no indication whatsoever on the PR
that a polkadot companion branch exists.
* Update SubstrateCli to return String
* Add default implementation for executable_name()
* Use display instead of PathBuf
* Get file_name in default impl of executable_name
* Remove String::from and use .into()
* Use default impl for executable_name()
* Use .as_str() and remove useless .to_string()
* Update only sp-io when running companion build
* Remove unneeded update of sp-io in CI
Co-authored-by: Cecile Tonglet <cecile@parity.io>
The transaction queue size alert has been firing with a constant 10
transactions in the queue. While possibly problematic, those 10 transactions
need not be the same ones across scrape intervals.
Instead of alerting with a size above 10, alert based on two things:
1. Monotonically increasing queue size
2. Upper limit queue size reached
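The two conditions above can be sketched roughly as follows. This is an illustration only, not the Prometheus rule itself: the function names, the window of samples, and the limit of 100 are hypothetical, and "monotonically increasing" is read here as strictly increasing across consecutive scrapes.

```python
from typing import Sequence


def queue_is_growing(samples: Sequence[float]) -> bool:
    """True if every sample is strictly larger than the one before it."""
    if len(samples) < 2:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


def queue_limit_reached(samples: Sequence[float], limit: float) -> bool:
    """True if the most recent sample is at or above the upper limit."""
    return len(samples) > 0 and samples[-1] >= limit


def should_alert(samples: Sequence[float], limit: float) -> bool:
    """Alert on either condition; a constant queue size alone does not fire."""
    return queue_is_growing(samples) or queue_limit_reached(samples, limit)


LIMIT = 100  # hypothetical upper limit, for illustration only

print(should_alert([10, 10, 10], LIMIT))   # constant size: False
print(should_alert([10, 12, 15], LIMIT))   # growing every scrape: True
print(should_alert([40, 30, 100], LIMIT))  # limit reached: True
```

Under this reading, a queue that sits at a constant 10 transactions no longer triggers the alert, which is the behavior the commit is after.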
Create a place to collaborate on Prometheus alerting rules for
Substrate starting with a basic set of rules covering:
- Resource usage
- Block production
- Block finalization
- Transaction queue
- Networking
- ... Others
* Run script in strict mode
* Add proper separator between revision and file
* Fix copy paste error
* Do not repeat limit number in error text
* Fix bad revision error
* Do not mask pipe errors
* Fix typo
* Remove unnecessary ... syntax
* Do not fetch all commits of master
* Fetching one commit is enough
* client/authority-discovery: Allow to be run by sentry node
When run as a sentry node, the authority discovery module does not
publish any addresses to the DHT, but still discovers validators and
sentry nodes of validators.
* client/authority-discovery/src/lib: Wrap lines at 100 characters
* client/authority-discovery: Remove TODO and unused import
* client/authority-discovery: Pass role to new unit tests
* client/authority-discovery: Apply suggestions
Co-Authored-By: André Silva <123550+andresilva@users.noreply.github.com>
* bin/node/cli/src/service: Use expressions instead of statements
Co-authored-by: André Silva <123550+andresilva@users.noreply.github.com>
* Substrate Dashboard example
* Improve README
* Update README_dashboard.md
* Add screenshots
* Minor fix
* Minor fix, image link
* .maintain/sentry-node: Add monitoring to docker-compose stack
With this patch a user can run the following fully configured and
monitored setup with a single command:
`docker-compose -f .maintain/sentry-node/docker-compose.yml up`
- 2 validators in two different network namespaces, connected via one
sentry node.
- Polkadot-js/apps to connect to one of the nodes above.
- Prometheus scraping the 3 Substrate nodes.
- Grafana displaying data from Prometheus with community dashboards
* .maintain/monitoring/grafana: Change default datasource name
* .maintain/monitoring/grafana: Add metric namespace option
* .maintain/monitoring/grafana: Remove `host` metric from most metrics
* .maintain/monitoring/grafana: Remove underscore from metric_namespace
* .maintain/monitoring: Use `instance` label instead of `hostname`
To identify a scrape target, one should use `instance` and not
`hostname` as multiple targets might run on the same node.
See https://prometheus.io/docs/concepts/jobs_instances/ for details.
* .maintain/monitoring: Introduce instance variable
* .maintain/monitoring/grafana: Rename substrate_block_height_number
* .maintain/monitoring/grafana: Use instance instead of host in legend
* .maintain/monitoring: Remove node exporter dependency
* .maintain/sentry-node/prometheus: Simplify configuration
* .maintain/monitoring/grafana: Update README and remove images
* .maintain/sentry-node: Improve docs
* .maintain/monitoring/grafana: Use metric_namespace template variable
* Use --sentry from v0.7.29 instead of a reserved-node
* .maintain/sentry-node: Revert sentry-a using validator-b as bootnode
Co-authored-by: DerFredy - @derfredy:matrix.org <derfredy@gmail.com>
Co-authored-by: david <davidd@custom.home>