18 Commits

Author SHA1 Message Date
Pierre Krieger 7dc47ab93d Add Prometheus alerts if unbounded channels are too large (#7866)
* Add Prometheus alerts if unbounded channels are too large

* Tweaks
2021-01-12 09:59:17 +01:00
André Silva 1871a95088 grandpa: remove light-client specific block import pipeline (#7546)
* grandpa: remove light-client specific block import

* consensus, network: remove finality proofs
2020-11-23 14:28:55 +00:00
Pierre Krieger a12df194c6 Exclude basic-authorship-proposer from the continuous tasks alert (#7484) 2020-11-04 14:01:37 +01:00
Max Inden 0ff724c939 .maintain/monitoring: Add alert when continuous task ends (#7250)
* .maintain/monitoring: Add alert when continuous task ends

Through the `polkadot_tasks_ended_total` Prometheus metric one can tell
when a task ended. Use this metric to alert when specific
known-to-be-continuous tasks end on a node.

* .maintain/monitoring: Don't hard-code task names
2020-10-05 08:40:24 +00:00
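The idea behind the alert above can be sketched in a few lines of Python. This is a minimal illustration, not the rule shipped in the repository: it assumes the `polkadot_tasks_ended_total` counter is available per task as a plain mapping from task name to value, and the function name `ended_tasks` and the sample task names are hypothetical.

```python
def ended_tasks(previous, current):
    """Return names of tasks whose ended-counter rose between two scrapes.

    `previous` and `current` map a task name to the value of its
    `polkadot_tasks_ended_total` counter (assumed shape, for illustration).
    """
    ended = []
    for task, count in current.items():
        # A task that was not present in the previous scrape counts from 0.
        if count > previous.get(task, 0):
            ended.append(task)
    return sorted(ended)


# Hypothetical task names; the counter for "task-b" increased, so it ended.
before = {"task-a": 0, "task-b": 1}
after = {"task-a": 0, "task-b": 2}
print(ended_tasks(before, after))
```

Taking the task names from the data rather than from a fixed list mirrors the second change in this commit, which avoids hard-coding task names.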
Max Inden 51c0d27aa1 .maintain/monitoring: Normalize alerting rules (#7232)
* .maintain/monitoring: Normalize alerting rules

- Start alert names with their component and end with the describing
adjective.

- Describe alert duration in `message` with `for more than` across all
alerts.

* .maintain/monitoring: Fix alert tests
2020-09-30 08:48:48 +00:00
Pierre Krieger b4aa5f328e Update networking Prometheus dashboard (#7180) 2020-09-24 09:30:27 +00:00
Pierre Krieger 7c7ad18d4e Update the service tasks Grafana dashboard (#7038) 2020-09-08 09:59:37 +00:00
Max Inden f04afd596b .maintain/monitoring/alerting-rules: Add fd alert (#6946)
Alert on high file descriptor allocation.
2020-08-24 15:37:07 +02:00
Pierre Krieger 63bd1d8346 Add the Substrate Service Tasks dashboard (#6665) 2020-07-27 09:34:02 +00:00
Max Inden fe9c01fc68 .maintain/monitoring/alerting-rules: Remove HighCPUUsage alert (#6648)
The `HighCPUUsage` alert is based on the `cpu_usage_percentage` metric.
Instead of exposing the overall CPU usage in percent, the metric exposes
the per core usage summed over all cores.

This commit removes the alert for two reasons:

1. Substrate itself does not expose the core count and thus one cannot
alert based on the `cpu_usage_percentage` metric.

2. Alerting based on CPU usage is generic and not specific to Substrate
or blockchains. Thus any CPU usage alert suffices.
2020-07-17 07:43:57 +00:00
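The first reason above can be made concrete with a small calculation. This is a sketch under stated assumptions: the function `overall_cpu_percent` and the sample reading are hypothetical, and the core count is passed in by hand precisely because, per the commit message, Substrate does not expose it.

```python
def overall_cpu_percent(summed_core_percent, core_count):
    """Normalize a per-core usage sum to an overall percentage (0 to 100)."""
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return summed_core_percent / core_count


# The same metric value means very different load depending on the machine.
reading = 250.0  # hypothetical value of the summed per-core metric
print(overall_cpu_percent(reading, 4))  # 62.5
print(overall_cpu_percent(reading, 8))  # 31.25
```

Without the core count, a fixed threshold on the summed value cannot distinguish a busy 4-core machine from a lightly loaded 8-core one.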
Pierre Krieger 84d607b5ff Update substrate-networking Grafana dashboard (#6649) 2020-07-16 16:45:12 +00:00
Max Inden 585ea531a3 .maintain/monitoring/alerting-rules: Adjust transaction queue size alert (#6426)
The transaction queue size alert has been firing with a constant 10
transactions in the queue. While this may be problematic, those 10 transactions
need not be the same across scrape intervals.

Instead of alerting with a size above 10, alert based on two things:

1. Monotonically increasing queue size

2. Upper limit queue size reached
2020-07-01 10:31:56 +02:00
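The two conditions above can be sketched as a small predicate over a window of queue-size samples. This is only an illustration of the logic, not the actual alerting rule: the function name `queue_alert`, the window contents, and the limit of 1000 are hypothetical, and "monotonically increasing" is read here as strictly increasing between consecutive samples.

```python
def queue_alert(samples, upper_limit):
    """Decide whether a window of queue-size samples should raise an alert.

    Fires when the queue grew at every step of the window, or when the
    latest sample has reached `upper_limit`.
    """
    growing = len(samples) >= 2 and all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )
    at_limit = bool(samples) and samples[-1] >= upper_limit
    return growing or at_limit


# A constant queue of 10 no longer fires; steady growth or a full queue does.
print(queue_alert([10, 10, 10, 10], upper_limit=1000))
print(queue_alert([10, 20, 35, 50], upper_limit=1000))
print(queue_alert([5, 3, 1000], upper_limit=1000))
```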
Max Inden fe76ebd548 .maintain/monitoring: Add alerting rule tests (#6343)
* .maintain/monitoring: Add alerting rule tests

* .maintain/monitoring/alerting-rules/alerting-rules.yaml: Break lines

* .gitlab-ci.yml: Add promtool rule testing step
2020-06-19 08:31:42 +02:00
Pierre Krieger 9cac359f44 Add a Substrate networking Grafana dashboard template (#6171)
* Add a substrate-networking grafana dashboard

* Capitalize data source

* Do changes
2020-06-01 10:46:34 +02:00
Max Inden aa95c596e6 .maintain/monitoring: Add an initial set of Prometheus alerting rules (#6095)
Create a place to collaborate on Prometheus alerting rules for
Substrate starting with a basic set of rules covering:

- Resource usage
- Block production
- Block finalization
- Transaction queue
- Networking
- ... Others
2020-05-21 16:26:29 +02:00
Max Inden 61d64e2ca1 .maintain/sentry-node: Add monitoring to docker-compose stack (#5321)
* Substrate Dashboard example

* Improve README

* Update README_dashboard.md

* Add screenshots

* Minor fix

* Minor fix, image link

* .maintain/sentry-node: Add monitoring to docker-compose stack

With this patch a user can run the following fully configured and
monitored setup with a single command:

`docker-compose -f .maintain/sentry-node/docker-compose.yml up`

- 2 validators in two different network namespaces, connected via one
sentry node.

- Polkadot-js/apps to connect to one of the nodes above.

- Prometheus scraping the 3 Substrate nodes.

- Grafana displaying data from Prometheus with community dashboards

* .maintain/monitoring/grafana: Change default datasource name

* .maintain/monitoring/grafana: Add metric namespace option

* .maintain/monitoring/grafana: Remove `host` metric from most metrics

* .maintain/monitoring/grafana: Remove underscore from metric_namespace

* .maintain/monitoring: Use `instance` label instead of `hostname`

To identify a scrape target, one should use `instance` and not
`hostname` as multiple targets might run on the same node.

See https://prometheus.io/docs/concepts/jobs_instances/ for details.

* .maintain/monitoring: Introduce instance variable

* .maintain/monitoring/grafana: Rename substrate_block_height_number

* .maintain/monitoring/grafana: Use instance instead of host in legend

* .maintain/monitoring: Remove node exporter dependency

* .maintain/sentry-node/prometheus: Simplify configuration

* .maintain/monitoring/grafana: Update README and remove images

* .maintain/sentry-node: Improve docs

* .maintain/monitoring/grafana: Use metric_namespace template variable

* Use --sentry from v0.7.29 instead of a reserved-node

* .maintain/sentry-node: Revert sentry-a using validator-b as bootnode

Co-authored-by: DerFredy - @derfredy:matrix.org <derfredy@gmail.com>
Co-authored-by: david <davidd@custom.home>
2020-04-14 16:08:09 +02:00
Max Inden 733d486814 Revert "Substrate Dashboard example (#5284)" (#5293)
This reverts commit 082b66434e.
2020-03-18 11:13:58 +01:00
derfredy 082b66434e Substrate Dashboard example (#5284) 2020-03-18 10:31:50 +01:00