# Monitoring multiple Celestia nodes

By [GLCstaked](https://paragraph.com/@glcstaked) · 2023-05-15

---

**This covers an attempt at monitoring a set of light nodes and full storage nodes across different hardware profiles, hosted both locally and externally.**

To use any of the monitoring tools discussed here, see:

[celestia-node-scripts/multi-network/monitoring/README.md at main · GLCNI/celestia-node-scripts](https://github.com/GLCNI/celestia-node-scripts/blob/main/multi-network/monitoring/README.md)

**Dashboard Overview - All Connected Nodes**

![](https://storage.googleapis.com/papyrus_images/08f693228374aa15be61a39f4e741ad2a72281a8363555777722f71ae0e50b4d.png)

Hardware - Nodes
----------------

The idea was to set up light nodes and full storage nodes across different hardware profiles and monitor them all from one place.

**Full Storage Nodes**

![](https://storage.googleapis.com/papyrus_images/f942d080520ee709fadb4efa074f0351847af8f8b0171415c3c0ce6567914cee.png)

**Light Nodes**

![](https://storage.googleapis.com/papyrus_images/3837398af92cad68236ccf359182822ad7c674f5e7876a9d5e2eb9709f44836b.png)

Node Setup
----------

To deploy nodes quickly for testing, I used my `multi-client` deployment scripts:

[celestia-node-scripts/multi-network at main · GLCNI/celestia-node-scripts](https://github.com/GLCNI/celestia-node-scripts/tree/main/multi-network)

This was useful for deploying and re-deploying nodes quickly across different hardware types for testing, such as ARM-based devices (PinePhone, Raspberry Pi 4).

Monitoring Setup
----------------

**SNMP (Simple Network Management Protocol):** for hardware-based monitoring.

SNMP is a widely used protocol for managing devices on IP networks; the SNMP daemon acts as a remote probe that can be deployed to monitor most devices.

It can be installed on any Linux device with:

    sudo apt update
    sudo apt install snmpd snmp libsnmp-dev
    

See example configuration settings for connecting to local network devices here:

[celestia-node-scripts/multi-network/monitoring/snmp at main · GLCNI/celestia-node-scripts](https://github.com/GLCNI/celestia-node-scripts/tree/main/multi-network/monitoring/snmp)

**To open externally:**

To connect monitoring from a server outside the local network, edit the config file:

    nano /etc/snmp/snmpd.conf
    

![](https://storage.googleapis.com/papyrus_images/4cef32adcc0ff5ecaf7e01ef7ce3cc55740c199cdf65c49304095b5aa21c327d.png)

Under the `rocommunity` section, allow localhost and the IP of the server that will connect.

![](https://storage.googleapis.com/papyrus_images/c8fa1707b9133270a83ca60a43d4eac6e7667f5cc63c861c2e785751a8b8cd5c.png)

Ensure port `161` is open on both the monitored device and the monitoring server.
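As an illustrative fragment (the `public` community string and the address below are placeholders; a unique community string is strongly recommended in practice), the relevant `snmpd.conf` lines look something like this:

```conf
# /etc/snmp/snmpd.conf (fragment)
# Listen for SNMP requests on all interfaces, UDP port 161
agentAddress udp:161

# Read-only community string, scoped per allowed source
rocommunity public localhost
rocommunity public 203.0.113.10   # placeholder: the monitoring server's IP
```

Restart the daemon (`sudo systemctl restart snmpd`) after editing for the changes to take effect.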

**PRTG:** for alerts and dashboards

![](https://storage.googleapis.com/papyrus_images/f649a3df2fb9b475b7aa928c133fc5abf43c6ab2ea07f994a7c1da51b99d3838.png)

PRTG is powerful Windows-based network monitoring software that can monitor any device with an IP address. SNMP is an open, widely supported protocol for hardware-based monitoring, and PRTG is one example of a monitoring dashboard compatible with SNMP (among other protocols).

Download the PRTG server to a Windows device:

[PRTG Network Monitor: All-in-One Network Monitoring Software](https://www.paessler.com/prtg)

**Setup:** It is very straightforward to add local network devices. Set up an account (default login: `prtgadmin`/`prtgadmin`), then simply ‘add devices’ using the local IP and run ‘recommend sensors’ to auto-discover what is available. With SNMP enabled on the target device, the sensors should appear for selection.

**Connecting the external sensors**

This is more difficult to set up; some extra configuration is required (see also the SNMP side):

Windows Firewall settings: add inbound and outbound rules for port 443, and allow port 161, which is used for SNMP.

Port forwarding: if the Windows server is locally hosted, forward ports 161 and 443 via the router settings to the local IP of the PRTG server.

Notes on Windows settings: it is worth reviewing the network, sleep, and auto-update settings so the machine stays reachable.

In PRTG settings: ensure that under ‘Probe Connection Settings’, ‘All IP addresses available on this computer’ is selected.

![](https://storage.googleapis.com/papyrus_images/d8db0e856c2e238fcc36a62432b915a15f982d498526e4b90f3bed268406ed89.png)

**Selecting Sensors**

There may be many redundant or duplicate sensors; below is a reduced list of the most useful ones.

![](https://storage.googleapis.com/papyrus_images/babf640df51b33e587a83b0c1cd16bd4f9768414261616b650a91a15c1c28b29.png)

**Arrange devices into groups**

![](https://storage.googleapis.com/papyrus_images/7d602050614a8f5add1f25f13bfec1da21a0114d45deca180689ba332cb56d2e.png)

As long as PRTG remains active, sensor data will be logged and can be reviewed. Examples:

![System memory usage - from PinePhone running light client](https://storage.googleapis.com/papyrus_images/3297efd980cbdeab57e3dd78e999a2695b24dcb6ceb7eb9f5eb310e01b833518.png)

System memory usage - from PinePhone running light client

Select device > sensor, for example CPU Load:

![CPU load over 7 days – Light node Device 4](https://storage.googleapis.com/papyrus_images/8a7ac64a727a14ca83e82c8e8c0c1e5b14f8f3c3edaba88fdd60d78ca06c608c.png)

CPU load over 7 days – Light node Device 4

Monitoring Celestia Service - Liveness
--------------------------------------

Problems such as the service failing to start or restarting can be a frequent occurrence, especially during testnets. [Here is an example issue](https://github.com/celestiaorg/celestia-node/issues/2135) that I encountered myself during ‘blockspacerace’.

It is useful to monitor the service liveness directly, so that when the node itself encounters errors, without any hardware failure, you can be alerted immediately.

![](https://storage.googleapis.com/papyrus_images/2fb930ac8f9e1721e988d992b01d319614e94e952e81223be2926dbd9791800c.png)

SNMP monitors system sensors and cannot tell whether the celestia-node service is active (the same applies if deployed with Docker).

This can be achieved with a small script on the server that checks whether `celestia-full.service` is active or in an error state (`sudo systemctl status celestia-full`) and outputs 1 for active or 0 for inactive.

Then have PRTG run the script by adding it as a custom sensor and connecting via SSH.
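A minimal sketch of such a check (the unit name follows this setup; any PRTG-specific output formatting is left out):

```shell
#!/usr/bin/env bash
# Liveness check sketch: prints 1 if the given systemd unit is active,
# 0 otherwise, so a dashboard sensor can alert on the value.

service_active() {
    # `systemctl is-active` exits 0 only when the unit is active
    if systemctl is-active --quiet "$1" 2>/dev/null; then
        echo 1
    else
        echo 0
    fi
}

service_active "${1:-celestia-full.service}"
```

PRTG's SSH script sensor can then run this over SSH and alert whenever the reported value drops to 0.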

[https://github.com/GLCNI/celestia-node-scripts/tree/main/multi-network/monitoring/snmp#monitor-system-service-with-prtg](https://github.com/GLCNI/celestia-node-scripts/tree/main/multi-network/monitoring/snmp#monitor-system-service-with-prtg)

Monitoring DA node performance
------------------------------

I wanted to monitor DA node metrics across the setup in order to compare the nodes' performance against each other.

This uses the RPC API to query node metrics directly on the device; there are many RPC methods [available here](https://node-rpc-docs.celestia.org/).

**Sampling Stats**

    export CELESTIA_NODE_AUTH_TOKEN=$(celestia full auth admin --p2p.network blockspacerace)
    celestia rpc das SamplingStats
    

Celestia data availability nodes perform sampling on the data availability network, and this is a good measure of performance: nodes that are struggling to sync will have trouble keeping up, and `head_of_sampled_chain` will be far from `network_head_height`.

A script was set up to query the API every 15 minutes and store the results in a logfile; it can be deployed to any DA node to capture data for later review and graphing.
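A sketch of what such a polling script can look like (the JSON field paths, logfile location, and cron interval here are assumptions; `jq` is assumed to be installed):

```shell
#!/usr/bin/env bash
# Sketch: append one timestamped line of sampling stats to a logfile.
# Intended to be run from cron every 15 minutes.

log_sampling_stats() {
    local logfile="$1" stats sampled network
    stats=$(celestia rpc das SamplingStats) || return 1
    # extract the two heights that matter for sync lag
    sampled=$(echo "$stats" | jq '.result.head_of_sampled_chain')
    network=$(echo "$stats" | jq '.result.network_head_height')
    echo "$(date -u +%FT%TZ) sampled=$sampled network=$network lag=$((network - sampled))" >> "$logfile"
}

# crontab entry (illustrative paths):
#   */15 * * * * /path/to/sampling-log.sh /var/log/celestia-sampling.log
```

The `lag` column makes it easy to graph later how far each node's sampling falls behind the network head.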

[celestia-node-scripts/multi-network/monitoring at main · GLCNI/celestia-node-scripts](https://github.com/GLCNI/celestia-node-scripts/tree/main/multi-network/monitoring)

* * *

Monitoring Analysis
-------------------

The second part of this document is an attempt at analysing data captured using the setup in part 1. It was difficult to get anything really concrete due to problems with remote access: despite the setup working and hardware metrics being accessible externally, I found that alerts need to be properly utilised, access properly configured, and better standards applied in order to monitor and maintain a node cluster.

### 1\. CPU Temp problem - ARM devices

With the ARM devices I noticed regular downtime: gaps in monitoring, followed by the devices being off.

![SNMP CPU (Mobian left and RPi right); red is downtime where the device would have shut down.](https://storage.googleapis.com/papyrus_images/10c643d8d4bc6dc8e6520c38728a08a3543e92bfce1092845d2d483a2fc0216f.png)

SNMP CPU (Mobian left and RPi right); red is downtime where the device would have shut down.

Coupled with the high CPU usage seen on the charts, and the devices being very hot when physically checked, it seemed the ARM devices did not cope well with light clients putting the CPU under high load.

![](https://storage.googleapis.com/papyrus_images/926b551b4cb0240ac1f5983f064f143943330c17ff0dbc335ac89919d904d5a6.png)

What seemed to curb this behaviour was adding more extreme cooling: I had to remove the back cover and place the phone on top of a fan, and the Pi, despite having a heatsink enclosure, had to be placed in a PC enclosure with more cooling.

Note on the version: this was running v0.9.1, which was known to have `shrex` errors causing higher-than-normal CPU usage.

[shrex/eds causes node to stop syncing · Issue #2097 · celestiaorg/celestia-node](https://github.com/celestiaorg/celestia-node/issues/2097)

### 2\. Updating Nodes

**Updating and effect on CPU**

Upgrading devices seemed to have a random effect on CPU load. v0.9.3 did not entirely fix the `shrex` errors (as noted in the release notes), and as the captures below show, some cases saw worse CPU load and some a reduction.

**Full Storage Nodes**

![](https://storage.googleapis.com/papyrus_images/aa71faae3b856d0356c3abb39e37ea0a529962221e4928e03bf9fcd7017761ca.png)

**Light Nodes**

![](https://storage.googleapis.com/papyrus_images/8b7e566263c9a724fdda30edf0d359e4d1713ad1f9cee3571041047eae82f26b.png)

It’s hard to tell, but this may be the result of other, more device-related processes, which would make sense for non-dedicated devices (PinePhone, Steam Deck).

### 3\. General Metrics comparison

Running v0.9.3 and monitoring while abroad ran into many issues, such as downtime and setup problems. This shows the importance of properly setting up alerts and being able to act on them: static IPs should have been set for local devices to avoid losing access through stale port-forwarding rules, and alerts should have been set up via PRTG rather than relying on manual checks.

**Light Nodes**

![Device 4: VPS](https://storage.googleapis.com/papyrus_images/4e13b4dbe7f92dc6e1c975461f71022880f51a89a71caddbb9b8d897460dc40d.png)

Device 4: VPS

Running v0.9.3 and the v0.9.4 upgrade, captured over 2 days.

![Device 6: PinePhone](https://storage.googleapis.com/papyrus_images/84461f5112c15b749a1890b6a4e26ec08233ae4c01c853b864973d3662e89efb.png)

Device 6: PinePhone

Mobian was in a down state, hence the large gaps, until restarted, followed by high load while playing catch-up. The upgrade spike (as on other devices) pales compared to the high load under these conditions; I am still not sure what went wrong here.

![Device 5: Pi 4b](https://storage.googleapis.com/papyrus_images/b8bc02c0724ecd36c4237d42954d8f7021d8ca03e21d8561945443dab807a000.png)

Device 5: Pi 4b

Memory was consistent during two days of v0.9.3 operation. Note: the Pi chart is messy, with noise from the downtime field (the device should have shown as off; the reading is incorrect). RPi CPU load is much more volatile compared to the VPS.

**Full Nodes**

Running v0.9.4 and the v0.9.5 upgrade.

v0.9.5 is the upgrade which fixed many of the previous issues, such as `shrex` errors leading to higher CPU load.

![](https://storage.googleapis.com/papyrus_images/44e6504524e860215cf1f00ce15d80c5339daf8c20e6bca972c6b3464d06337a.png)

![](https://storage.googleapis.com/papyrus_images/be521f8c2286207c5ea57256de34f52526a5d2b34e10cd0159c3bdeb6f49b433.png)

### 4\. PFB Transactions

I set up automated PFBs (PayForBlob transactions) on the celestia nodes every minute to see the effect on hardware strain and performance,

using a script I made here: [https://github.com/GLCNI/celestia-node-scripts/tree/main/multi-network/payforblob](https://github.com/GLCNI/celestia-node-scripts/tree/main/multi-network/payforblob)

This was active from the v0.9.4 upgrade on the light nodes, and there was no apparent or noticeable effect on any hardware metric. Perhaps single PFBs at minute intervals are not enough; I suspect it is mostly down to version difficulties and trying to capture data remotely.

Note: since all PFBs are saved to a logfile, I’ve noticed occasional failed PFBs; there may be some insight here to look into at a later date.

### 5\. DAS Performance comparison

This was extremely difficult to capture and compare. Because of the frequent updates and the troubleshooting needed to get monitoring and access right, I had to settle (for now) for a 24-hour sample on the recently updated v0.9.5.

**Full Nodes:**

![Note: difference in time zone, UTC / local UTC-1](https://storage.googleapis.com/papyrus_images/c83e3cc47c198e85ec5a89411138a0a0b0d51aedac168947032a3e1034884601.png)

Note: difference in time zone, UTC / local UTC-1

Device 2: ThinkCentre on the left and VPS on the right. The locally run server seemed to have some trouble keeping up, but it was never far behind and eventually caught up.

_NOTE: data from both the RPi and the Steam Deck were omitted due to remote-access issues at the time._

**Checking when home – days later**

The Deck was still on v0.9.4, as it could not be upgraded at the time. Once the v0.9.5 update was applied and the nodes were allowed to sync, there was no real performance difference between running locally and on a virtual private server, though it would be interesting to take a longer capture and graph the output. The most stable throughout remains the dedicated server.

![](https://storage.googleapis.com/papyrus_images/ed8f3fe18817a3d844eb0772ab35115d36fb7ec4caa9910d1d9a3237993716c1.png)

**Light Nodes:**

**Problems:** as mentioned above, this was still early testnet, so it is hard to tell whether the lag in sampling was device related or caused by PFBs. While the phone had issues, the light node running on the VPS (device 4) had near-perfect sampling stats.

Example from the PinePhone: the sampled head is stuck at 489,908.

![](https://storage.googleapis.com/papyrus_images/1d10ed16a594cd725d205ec4b88e4e7b4c41298ee3f8272498829e00fed736a5.png)

This seemed to be cleared by running `unsafe reset`, though the node then lagged very far behind.

![After v0.9.5 and reset](https://storage.googleapis.com/papyrus_images/25ca33afb8538158dd2c4d87bff75f483872afe8e77c7644c3cd67c5a54667db.png)

After v0.9.5 and reset

It seemed to fix itself after being allowed to run on its own for a while.

![](https://storage.googleapis.com/papyrus_images/0f8d4769037caa16bbc7b60505cf215243f8a7038d742373ff7b104ed276fc12.png)

**Conclusion:**

There were lessons on the importance of properly configured system monitoring for node clusters, and of managing remote data access for maintenance. The setup covered in the first part of this document is a good starting point for monitoring multiple nodes, but it needs to be further optimised.

This needs to run for a longer period on a stable release, with the same version across all devices, to be able to graph, compare, and get something more valuable.

For both node types it is the dedicated server that appears to have the best performance, which is not surprising.

I thought spamming PFBs would have a more noticeable effect, but this appears not to be the case, although the failed PFBs in the logfiles might be worth looking into later.

---

*Originally published on [GLCstaked](https://paragraph.com/@glcstaked/monitoring-multiple-celestia-nodes)*
