Incident Report - 14 Feb 2020
At 01:40am UTC the Jormungandr process running on our node stopped functioning properly.
Unfortunately, our watchdog process design to monitor and restart the node in case of this type of events was not up and running. The watchdog process was stopped for maintanence activities and never restarted.
This was the first major problem we observed since the upgrade to Jormungandr v0.8.9.
In the morning we opted for stopping the node and restarting it with the latest release of Jormungandr v0.8.10. We observed the same problems we already had with v0.8.10-alpha1 and decided to revert back to v0.8.9.
The performance of our node took a big hit in the current epoch. We started Epoch 62 with the lowest slot count we had so far and we lost 12 out of 22 slots due to the node downtime.
The node is now running with the same configuration that gave us near 100% perfomance for the previous epochs and should recover normal operation starting with Epoch 63.
Update - 28 Jan 2020
LEAF pool has been running on jormungandr v0.8.7 for the last 5 days. Pool operators expressed mixed feelings about this particular release in the official Telegram channel. We found it to be as stable as the previous release but we also noticed a different load on the cores of our server.
A few threads of the Jormungandr process started consuming an unusual amount of cpu-time. This is something we never noticed before and we are now monitoring. It could also be related to a few changes we made to the pool configuration, things are moving quite fast. We are updating the node on a weekly basis and tweaking configurations in between slot times.
In the meanwhile, interesting commits are piling up in the Jormungandr repo in preparation for the next release. With v0.8.8 we decided to change our upgrade policy. We will start by upgrading a passive node first, look for possible regressions and then we will update the pool process 24h later.
Update - 21 Jan 2020
The latest release of Jormungandr (v0.8.6) improved the stability of the node, our uptime is now at 6 days and counting.
Unfortunately, it is still not perfect. Issue #1580 is causing us to miss a few blocks. The issue has been marked as 'high priority' and we are confident that the IOHK team will fix it soon. In the meanwhile, the node perfromance will be slightly impacted. Considering that in turn we will have less leader slots assigned to us, there is a possibility that the ROI of our pool will decrease to 8-10% during the next few epochs.
Jormungandr is on a weekly schedule release. We will upgrade our node as soon as a new release is available.
Our first million Ada
We are very proud to announce that at the end of epoch 34 we reached the one million ada mark in rewards for our delegators.
Since upgrading to Jormungandr v0.8.6, our node has been much, much more stable. The node uptime is now at 50 hours and counting and it's serving ~2400 connections.
The way forward
We decided to test fewer server configurations for a longer period of time. The instability of the network introduced high variance in the pool performace, epoch over epoch, even for the same configuration.
We are currently collecting more data and trying to define better metrics to assess the server performance, resources utilization and bottlenecks before going back for more optimizations.
When we are happy with the pool performance and our set of tools to monitor and control the stake pool, it will be time to show some love to this website (newsletter, live statistics, alerts, etc...).
Our mission is to contribute to the Cardano network's decentralization by running a reliable and independent node.
And we want to get it right.
Pedal to the metal
After the hardware upgrade of December 21st, we tried several node and kernel configurations. At the end we were not pleased with our virtual server provider and decided to go for a dedicated server solution. On December 28th at the beginning of Epoch 15, LEAF stake pool migrated to the new server.
Simply put, the new server is stupid fast and overspec'd for the task.
We also added a couple of charts at the beginning of the page. Return of investment (ROI) data is straight from adapools.org, while the performance graph represents the number of blocks signed over the total number of blocks assigned to the stake pool.
With the old server, we had a failure rate of 50-60% (we should also consider that the network was really unstable in those epochs). The new server is making a difference reducing the failure rate to 5-10%.
Our goal is, of course, 100% reliability. We will spend the next two weeks applying incremental changes to the server configuration. Improvements/regressions will be quantified over a two-epoch observation period. We will keep you posted.
The server maintenance activity scheduled on 17:20 UTC on December 21st was successful. In less than 10 minutes we took the server offline, doubled the number of vCPUs and RAM and got back online.
The sense of community
Jormungandr v0.8.4 has been released. The node was updated right at the beginning of Epoch 6, but it didn't solve the fork issues. Practically, every node has been missing assigned blocks.
The general feeling is that all stakepool operators are into this together and are trying their best, no discouragement. It is called Testnet for a reason after all.
Shout out to all those helping out fellow stakepool operators and welcoming newcomers on the official forum, telegram channels and their Reddit posts, tweets, blogs, YouTube videos and live streams.
The long night of Epoch 5
Epoch 5 is almost over. It was not easy.
The forking of the blockchain that we saw happening in the previous days got worse. We still managed to process ~90 blocks during this epoch, but we had to trade in a few hours of sleep to make sure that the node was not running on the wrong side of the chain.
Hopefully things will get better with future releases of Jormungandr.
We had a bumpy start
During Epoch 3 and 4 the Shelley Incentivized Testnet network became partitioned. We had to restart our node in numerous occasions, encountering problems during the bootstrapping phase.
We do apologize for the prolonged downtime of our node. We are closely monitoring the state of the network to minimize the impact of similar events should they occur again.
In the next weeks, we will work to establish our presence on social media channels and setup an email contact for support requests.
In the meantime, please buckle up and enjoy the ride :)