This post was written by Tim on Sep 27, 2023
TL;DR The Go-Flare Helper Script is designed to help you install a Flare node and upgrade it with a single command. It installs all required dependencies, including Go, build-essential and jq. It clones the latest node repository, switches to the desired tag, builds the node and creates a systemd service that starts the node and enables it to start on server reboot.
Upgrades are done by entering the go-flare directory, pulling any updates and rebuilding. During systemd configuration, a node runner script is created as the executable for the service file so that the latest bootstrap node ID and IP are fetched dynamically on node launch.
Note: This script is not fully developed - some features might not work as expected.
For anyone wishing to run a Flare / Songbird node, I have created a script that installs a node with only a couple of commands - it also has some helper functions to check the node status and upgrade in one command.
👉 Also read on for my current node strategy.
It’s an evolving script, so not perfect but perhaps a great starting point for anyone wanting to make their life a bit easier when managing their nodes. Contributions are also very welcome!
Node Script: GitHub Link
A bit more detail …
I have been running Flare / Songbird nodes for almost three years now but previously haven’t been able to invest as much time into their management.
I wanted to create a plan that made me confident in the infrastructure I’m running. When many applications such as Flare Metrics, FTSO AU’s data provider & public nodes and auxiliary apps depend on this vital component it brings a lot of stress.
Provisioning and maintaining infrastructure, especially on the scale I have been (20+ individual servers), isn’t something I’d consider myself an expert at. But I’ve kept at it long enough to figure it out.
My requirements were:
~ Highly available & easily scalable RPC nodes.
~ Fast disaster recovery.
~ Monitoring and alerting.
The way I tackled this was to first write a bash script which would allow me to fully provision my node of choice essentially with a single command. That is: install all dependencies, fetch the latest node source, build and start the nodes.
Writing the script was fairly easy even though bash isn’t something I have used often, especially with the help of ChatGPT to give me advice and flatten the learning curve.
The challenge was deciding which strategy to take to back up my nodes. I intended to write some logic in my bash script to copy the node database to an AWS S3 or GCP bucket and then download it to any new node to significantly reduce bootstrapping time (Songbird nodes now take almost a week to bootstrap from scratch). However, this turned out to be an expensive operation, and I also found it quite slow to upload and download. Ideally I wanted to minimise any downtime of my node - and here is a tip: you must shut down your node before copying or moving your database, otherwise it has a high chance of becoming corrupt.
I always intended to have a single node dedicated to backups only - so downtime wasn’t too big of a concern but I still wanted to make it as fast as possible so here’s what I did …
I wrote a simple bash script which checks if my node is healthy, stops the node and then runs a GCP VM snapshot of the disk that hosts the database. Finally the node starts again and continues normal operation.
Two notes here:
~ I check if the node is healthy first; if it isn’t, there is a chance that the database is corrupted and therefore useless. Additionally, I use that information as a label on the snapshot.
~ The script that installs my node allows me to configure where the node database is hosted, so I attach an additional SSD to my VM that is dedicated to the database, making it highly portable and self-contained.
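To make the sequence concrete, here is a minimal sketch of the same flow in Python rather than bash (my actual script is bash). It is an illustration under stated assumptions, not the real script: the disk, zone and service names are placeholders, the health check assumes the node serves a JSON body with a `healthy` field at `/ext/health` on port 9650, and it assumes the snapshot is taken either way and simply labelled with the health result.

```python
import json
import subprocess
import urllib.error
import urllib.request
from datetime import datetime, timezone


def fetch_local_health(url: str = "http://localhost:9650/ext/health") -> str:
    """Fetch the raw health response. The URL is an assumption, adjust to your node."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        # An unhealthy node may answer with a non-2xx status but still send a body.
        return err.read().decode()
    except urllib.error.URLError:
        # Node unreachable: return an empty body, which is treated as unhealthy.
        return ""


def is_healthy(health_json: str) -> bool:
    """Return True only if the response is JSON containing healthy: true."""
    try:
        return json.loads(health_json).get("healthy") is True
    except (json.JSONDecodeError, AttributeError):
        return False


def snapshot_command(disk: str, zone: str, healthy: bool, now: datetime) -> list[str]:
    """Build the gcloud call that snapshots the database disk with a health label."""
    name = f"{disk}-{now:%Y%m%d-%H%M%S}"
    return [
        "gcloud", "compute", "snapshots", "create", name,
        f"--source-disk={disk}",
        f"--source-disk-zone={zone}",
        f"--labels=healthy={'true' if healthy else 'false'}",
    ]


def run_backup(disk, zone, service, fetch_health=fetch_local_health,
               run=subprocess.run, now=None) -> bool:
    """Check health, stop the node, snapshot the disk, then always restart the node."""
    healthy = is_healthy(fetch_health())
    now = now or datetime.now(timezone.utc)
    run(["systemctl", "stop", service], check=True)
    try:
        run(snapshot_command(disk, zone, healthy, now), check=True)
    finally:
        # Restart even if the snapshot fails so the node is not left offline.
        run(["systemctl", "start", service], check=True)
    return healthy
```

A call such as `run_backup("songbird-db", "us-central1-a", "flare-node")` would be the whole backup job; all three names are hypothetical. The `try`/`finally` is the part worth copying into any version of this: a failed snapshot should never leave the node stopped.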
Storing snapshots can be a bit pricey, so my script deletes any snapshots older than 7 days. This probably could be lower, but I feel more comfortable having a few options in the case of any corruption in one of them.
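The retention rule is simple enough to sketch. This is an illustrative Python version, not my actual bash: it assumes you already have each snapshot’s name and creation time (for example from listing your snapshots), and the snapshot names below are made up.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(snapshots: list[tuple[str, datetime]], now: datetime,
                      max_age_days: int = 7) -> list[str]:
    """Return the names of snapshots created more than max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    return [name for name, created in snapshots if created < cutoff]


def delete_command(name: str) -> list[str]:
    """Build the gcloud call that removes one snapshot without prompting."""
    return ["gcloud", "compute", "snapshots", "delete", name, "--quiet"]


if __name__ == "__main__":
    now = datetime(2023, 9, 27, tzinfo=timezone.utc)
    # Hypothetical snapshots: one 10 days old, one 2 days old.
    snapshots = [
        ("songbird-db-20230917-000000", now - timedelta(days=10)),
        ("songbird-db-20230925-000000", now - timedelta(days=2)),
    ]
    for name in expired_snapshots(snapshots, now):
        print(" ".join(delete_command(name)))
```

Keeping the selection separate from the deletion makes it easy to dry-run the cleanup before letting it delete anything.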
So, what have we got so far?
~ A script which installs our node.
~ A backup mechanism using snapshots of a node’s database disk.
We still need an efficient way to deploy these nodes and also monitoring/alerting …
Now, if you’ve had to manage more than a handful of servers, you know it can be very tedious to manually configure and deploy them, especially if you wish to quickly iterate on configurations. There was a product that I had heard of before but always dismissed: Terraform. It allows you to configure servers using code, which makes them highly reproducible. I can define very specific configurations with variables, allowing me to scale my nodes up or down by adding or removing servers from a simple array configuration. Each node can be given its own performance allocation, and since I can set variables here, I can also pass specific settings through to the underlying nodes.
This is done through startup scripts on my GCP VMs. When I launch a new node through my Terraform configuration, it will:
~ Run a script which mounts an additional provisioned SSD disk. If I define a snapshot, the disk is created from it (the latest bootstrapped database).
~ Install my node using my previously created bash script with parameters passed from the Terraform config.
~ If I define the node as dedicated for backup, it will install an additional script as previously described to run snapshots daily using a cronjob.
That is all done automatically for as many nodes as I define in my array.
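The real configuration lives in Terraform, but the decision logic is easy to model. The sketch below is a hypothetical Python rendering of it: the field names, node names and machine types are my own placeholders for illustration and do not come from my Terraform files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeConfig:
    """One entry in the node array. All field names are illustrative."""
    name: str
    network: str                       # "flare" or "songbird"
    machine_type: str                  # per-node performance allocation
    db_snapshot: Optional[str] = None  # snapshot to create the database disk from
    backup_node: bool = False          # dedicated backup node?


def startup_steps(node: NodeConfig) -> list[str]:
    """Describe what the startup script does for a given node entry."""
    source = f"snapshot {node.db_snapshot}" if node.db_snapshot else "a blank disk"
    steps = [
        f"create and mount the database SSD from {source}",
        f"install the {node.network} node with the install script",
    ]
    if node.backup_node:
        steps.append("install the daily snapshot cron job")
    return steps


# Scaling up or down means adding or removing entries from this list.
NODES = [
    NodeConfig("rpc-1", "songbird", "n2-standard-8", db_snapshot="songbird-db-latest"),
    NodeConfig("backup-1", "songbird", "n2-standard-4",
               db_snapshot="songbird-db-latest", backup_node=True),
]

if __name__ == "__main__":
    for node in NODES:
        print(node.name, "->", "; ".join(startup_steps(node)))
```

The point of the model is that every per-node difference is data, so launching another node is a one-line change to the array.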
My nodes can be fully bootstrapped in ~15-45 minutes depending on when the latest backup was done.
I think that’s a pretty good outcome and I am very pleased with how easily I can launch a new node and how fast it can come online.
The missing piece is monitoring and alerting. The speed at which I can launch nodes isn’t so beneficial if I don’t know my nodes are offline - especially considering I haven’t opted into any kind of automated scaling or disaster recovery.
For this, each node exposes a metrics page that can be consumed by Prometheus (found at /ext/metrics). So, I simply set up a VM for monitoring which runs a Prometheus server and scrapes the metrics page of each deployed node. I also deployed a Grafana server which enables me to visualise the metrics and, of course, set up alerts. I can now monitor each node’s number of peers, any failing health checks and their bootstrap status, and configure alerts if any of these go into dangerous territory. Another important thing to monitor is the database size. I hate to admit it, but it’s caught me out more than once - you must ensure your disk size keeps up with the growing node database.
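If you want to see what those checks amount to without standing up Prometheus and Grafana first, here is a rough Python sketch. It is not how my alerts are implemented (those live in Grafana). The metric name `avalanche_network_peers` and the thresholds are assumptions for illustration, so check your own node’s /ext/metrics output for the real names, and note that the parser only handles simple `name value` sample lines without timestamps.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text-format samples into {series: value}.

    Comment lines (# HELP / # TYPE) are skipped. Assumes no trailing timestamps.
    """
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.rpartition(" ")
        if not key:
            continue
        try:
            metrics[key] = float(value)
        except ValueError:
            continue
    return metrics


def check_node(metrics: dict[str, float], disk_used: int, disk_total: int,
               peers_metric: str = "avalanche_network_peers",  # assumed name
               min_peers: int = 5, max_disk_fraction: float = 0.85) -> list[str]:
    """Return a list of human-readable alerts; an empty list means all clear."""
    alerts = []
    peers = metrics.get(peers_metric)
    if peers is None:
        alerts.append(f"metric {peers_metric} is missing")
    elif peers < min_peers:
        alerts.append(f"only {peers:.0f} peers connected")
    if disk_total > 0 and disk_used / disk_total > max_disk_fraction:
        alerts.append(f"database disk is {disk_used / disk_total:.0%} full")
    return alerts
```

In practice `disk_used` and `disk_total` could come from `shutil.disk_usage` on the database mount, and the metrics text from an HTTP GET against the node’s /ext/metrics page.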
So there you have it, my current strategy - and while there are likely a number of things that could be improved it has increased my confidence in the infrastructure I run.
~ Flare/Songbird nodes are installed with a custom bash script.
~ Backups are made using GCP VM snapshots of a dedicated disk for the node’s database (the node is stopped during the snapshot).
~ Scaling nodes up and down is done via Terraform which automatically provisions nodes using startup scripts.
~ Monitoring and alerting is done using Prometheus (targeting each node’s /ext/metrics endpoint), with visualisation and alerts in Grafana.
Another nice note here is that this configuration is very low cost, as I’m not using any paid third-party software; everything is self-hosted. This setup does require a little bit of hands-on maintenance, such as when upgrading nodes, but otherwise it serves its purpose for the key outcomes I defined earlier.
Again, I am perhaps not a devops expert, but I’ve learned a lot through this process and I’m happy to help anyone setting up their own node - even if it’s not intended to be a complex setup like this one. Just reach out!