Health checks are a way of seeing how a system is running at the moment the check is performed. Let’s see how we can apply them to Orleans!
Health checks are generally exposed via HTTP endpoints, and when hit (often at a “/hc” or “/health” endpoint) they report on the “health” of the running system.
Health checks, at least in .NET land, revolve around an enum, HealthStatus, which indicates Healthy, Degraded, or Unhealthy. The health checks themselves are created by writing concrete implementations of the IHealthCheck interface.
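To make that a bit more concrete, here is a minimal sketch of an IHealthCheck implementation using the types from Microsoft.Extensions.Diagnostics.HealthChecks. The class name ExampleHealthCheck and the somethingIsWrong flag are made up for illustration; a real check would probe an actual dependency.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical check: the name and the "probe" are placeholders for illustration.
public class ExampleHealthCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // A real check would test something here (a database, a queue, a grain...).
        var somethingIsWrong = false;

        // HealthCheckResult carries a HealthStatus (Healthy, Degraded, or Unhealthy)
        // plus an optional description.
        HealthCheckResult result = somethingIsWrong
            ? HealthCheckResult.Unhealthy("The dependency did not respond.")
            : HealthCheckResult.Healthy("Everything looks fine.");

        return Task.FromResult(result);
    }
}
```

What counts as Healthy, Degraded, or Unhealthy is entirely up to the implementation; the framework only transports the result.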
Any system can include one or more health checks, and what “a health check” means is completely up to you as the implementer. You could, as an example, have one health check that “checks”:
These could all be contained within a “single” health check called “muhSystem” (or whatever), or you could implement the three “checks” above as their own individual health checks: a single check versus multiple checks, all of which represent “the same thing”. Why would you choose one over the other? Well, going the single check route allows you to check on Baz’s health without “leaking” any information about what’s being checked. This sort of check is useful if you need to be careful about revealing the internal workings of your system.
In the “three separate checks” scenario, it might not matter to you if you leak some information about your system; you want to give your users (or perhaps your watchdog) more detailed information.
We’ll be starting the code section of this post from the v0.58 tag on my OrleansGettingStarted repository.
I did not write a blog post about the changes made in the update to the v0.58 tag, but one of them was getting the silo host running under the new UseOrleans extension method. In newer versions of .NET Core and Orleans, you’re able to host multiple “processes” from the same IHostBuilder. This allows us to host a small API that serves the health check endpoint over HTTP.
The first thing we’ll do is add a default web host to our host builder, which after the change will be hosting both our silo and our API:
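As a rough sketch of what that wiring can look like (not necessarily the exact code from the repository), assuming Orleans 3.x with the generic host, localhost clustering for development, and a bare-bones Startup class included only so the snippet stands on its own:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

public static class Program
{
    public static Task Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            // Hosts the Orleans silo inside the generic host.
            // Localhost clustering is an assumption suitable for local development only.
            .UseOrleans(siloBuilder => siloBuilder.UseLocalhostClustering())
            // Hosts the web API in the same process, next to the silo.
            .ConfigureWebHostDefaults(webBuilder => webBuilder.UseStartup<Startup>())
            .RunConsoleAsync();
}

// Placeholder Startup so the sketch compiles; the real one configures the API.
public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
    }
}
```

Both the silo and the web host share one IHostBuilder, so they start and stop together.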
(Note that there will be various other changes that I may not specifically call out, but the end code is here and in the references at the bottom of the post.)
We’ll also introduce a Startup class:
The above class should be more or less the “default” Startup class you get when creating a new web API project from the template.
The first health check we’re going to write will be a basic one - in fact, that’s what we’ll name it. For this health check, we’ll use an IClusterClient and ensure it can get an instance of a grain and get a result from that grain. If it can get a result from the grain, the health check should return “Healthy”, otherwise “Unhealthy”.
As always, we first need our Orleans grain interface:
The above is quite simple: we’re creating an interface that extends both IGrainWithGuidKey and IHealthCheck. IGrainWithGuidKey should be familiar from some of my other Orleans posts, and IHealthCheck was mentioned earlier in this post; it’s the interface that describes a health check. We’re not adding anything to this interface that isn’t already provided by the interfaces it extends.
Our basic health check grain implementation looks like this:
That’s it! Just return a healthy result. If our actual IHealthCheck implementation is unable to get an instance of this grain and an exception is encountered, the exception handler will return “Unhealthy” for us.
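Putting the interface and the grain together, a minimal sketch could look like the following. The names IBasicHealthCheckGrain and BasicHealthCheckGrain are assumptions on my part, and whether HealthCheckContext and CancellationToken travel cleanly across a grain call depends on your Orleans version and serializer setup, so treat this as illustrative rather than definitive.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Orleans;

// Hypothetical name. The interface adds nothing of its own:
// it only combines the grain key interface with IHealthCheck.
public interface IBasicHealthCheckGrain : IGrainWithGuidKey, IHealthCheck
{
}

// Hypothetical name. If this grain can be activated and called at all,
// the cluster is at least capable of serving grain requests.
public class BasicHealthCheckGrain : Grain, IBasicHealthCheckGrain
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        return Task.FromResult(HealthCheckResult.Healthy());
    }
}
```

The grain never reports “Unhealthy” itself; that outcome comes from the caller failing to reach it.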
Now that we have a health check grain, we’ll need an actual IHealthCheck implementation that will utilize our newly created “health check grain”. I know we’ll be creating several health checks here, all of which do “a lot of the same thing”, so this seems like the perfect opportunity to introduce an abstract class:
All of our health checks will depend on a connection to the cluster, so the above will take in an IClusterClient and ensure the cluster is initialized before proceeding with the actual, yet-to-be-implemented check.
As I mentioned previously, for our “Basic” health check, we’ll just be checking that we can get an instance of a grain, and return a value. Such a health check will look like:
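Here is a sketch of how the abstract class and the basic check could fit together, assuming Orleans 3.x (where IClusterClient exposes IsInitialized). The names OrleansHealthCheckBase, CheckAsync, and IBasicHealthCheckGrain are hypothetical; only BasicOrleansHealthCheck is a name used in this post.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Orleans;

// Hypothetical grain interface, declared here so the sketch is self-contained.
public interface IBasicHealthCheckGrain : IGrainWithGuidKey, IHealthCheck { }

// Hypothetical base class: shared cluster checks and exception handling.
public abstract class OrleansHealthCheckBase : IHealthCheck
{
    protected OrleansHealthCheckBase(IClusterClient client) => Client = client;

    protected IClusterClient Client { get; }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // No point calling a grain if the client is not connected yet.
        if (!Client.IsInitialized)
        {
            return HealthCheckResult.Unhealthy("Cluster client is not initialized.");
        }

        try
        {
            return await CheckAsync(context, cancellationToken);
        }
        catch (Exception ex)
        {
            // Any failure to reach or run the grain counts as unhealthy.
            return HealthCheckResult.Unhealthy("The health check threw an exception.", ex);
        }
    }

    // Each concrete health check supplies its own logic here.
    protected abstract Task<HealthCheckResult> CheckAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken);
}

public class BasicOrleansHealthCheck : OrleansHealthCheckBase
{
    public BasicOrleansHealthCheck(IClusterClient client) : base(client) { }

    // Get a grain and ask it for a result; Guid.Empty is an arbitrary key choice.
    protected override Task<HealthCheckResult> CheckAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken) =>
        Client.GetGrain<IBasicHealthCheckGrain>(Guid.Empty)
            .CheckHealthAsync(context, cancellationToken);
}
```

Keeping the try/catch in the base class means every derived check gets the “exception means Unhealthy” behavior for free.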
Now that the basic health check is out of the way, we can implement some more meaningful ones. The following health checks require a registered IHostEnvironmentStatistics (which you can find out more about here).
These health checks will be especially useful for gauging the utilization of an Orleans node over time, which would allow you to make decisions on questions such as “should I spin up or down additional nodes for this cluster?”. Answers to such questions, especially if running your Orleans cluster in a k8s environment, are much simpler when you have performance metrics exposed via a health check endpoint, and are making use of a watchdog.
I’m going to go through these quickly; they should be mostly self-explanatory, but you can view the completed code for anything I don’t specifically cover.
New grain interface, new grain implementation for CPU health checking. We’re going to return Unhealthy if above 90% CPU, Degraded if above 70%, Healthy otherwise.
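A sketch of a grain applying those thresholds might look like this. The names ICpuHealthCheckGrain, CpuHealthCheckGrain, and Classify are hypothetical, and reporting “Healthy” when the CPU usage is unknown is an assumption made for this example, not something the thresholds above dictate.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Orleans;
using Orleans.Statistics;

// Hypothetical names; only the 90% and 70% thresholds come from the post.
public interface ICpuHealthCheckGrain : IGrainWithGuidKey, IHealthCheck { }

public class CpuHealthCheckGrain : Grain, ICpuHealthCheckGrain
{
    private readonly IHostEnvironmentStatistics _stats;

    // Requires an IHostEnvironmentStatistics implementation registered with the silo.
    public CpuHealthCheckGrain(IHostEnvironmentStatistics stats) => _stats = stats;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default) =>
        Task.FromResult(Classify(_stats.CpuUsage));

    // Pure mapping from a CPU percentage (0-100) to a health result.
    public static HealthCheckResult Classify(float? cpuUsagePercent)
    {
        if (cpuUsagePercent == null)
        {
            // Assumption: unknown usage is not treated as a failure here.
            return HealthCheckResult.Healthy("CPU usage could not be determined.");
        }

        if (cpuUsagePercent > 90)
        {
            return HealthCheckResult.Unhealthy($"CPU usage at {cpuUsagePercent:0.0}%.");
        }

        if (cpuUsagePercent > 70)
        {
            return HealthCheckResult.Degraded($"CPU usage at {cpuUsagePercent:0.0}%.");
        }

        return HealthCheckResult.Healthy($"CPU usage at {cpuUsagePercent:0.0}%.");
    }
}
```

Pulling the threshold logic into a static method keeps it easy to unit test without spinning up a silo.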
Same basic idea for the memory health check, again making use of our registered IHostEnvironmentStatistics:
In this case, I’m going a slightly different route and returning “Unhealthy” if the memory information cannot be determined. This should probably be handled consistently between this and the CPU health check, but I wanted to show how you, as the implementer, are able to choose what “Healthy” vs “Unhealthy” means. For this memory health check, we’re unhealthy if above 95% memory utilization, degraded if above 90%, and healthy otherwise.
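The memory rules can be sketched the same way. Here utilization is computed from the TotalPhysicalMemory and AvailableMemory values on IHostEnvironmentStatistics; that formula, along with the names IMemoryHealthCheckGrain, MemoryHealthCheckGrain, and Classify, is an assumption for illustration.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Orleans;
using Orleans.Statistics;

// Hypothetical names; only the 95% and 90% thresholds come from the post.
public interface IMemoryHealthCheckGrain : IGrainWithGuidKey, IHealthCheck { }

public class MemoryHealthCheckGrain : Grain, IMemoryHealthCheckGrain
{
    private readonly IHostEnvironmentStatistics _stats;

    public MemoryHealthCheckGrain(IHostEnvironmentStatistics stats) => _stats = stats;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default) =>
        Task.FromResult(Classify(_stats.TotalPhysicalMemory, _stats.AvailableMemory));

    // Pure mapping from total/available bytes to a health result.
    public static HealthCheckResult Classify(long? totalBytes, long? availableBytes)
    {
        // Unknown (or nonsensical) memory information is reported as Unhealthy.
        if (totalBytes == null || availableBytes == null || totalBytes.Value <= 0)
        {
            return HealthCheckResult.Unhealthy("Memory usage could not be determined.");
        }

        // Assumed formula: used = total - available, as a percentage of total.
        double usedPercent =
            100.0 * (totalBytes.Value - availableBytes.Value) / totalBytes.Value;

        if (usedPercent > 95)
        {
            return HealthCheckResult.Unhealthy($"Memory usage at {usedPercent:0.0}%.");
        }

        if (usedPercent > 90)
        {
            return HealthCheckResult.Degraded($"Memory usage at {usedPercent:0.0}%.");
        }

        return HealthCheckResult.Healthy($"Memory usage at {usedPercent:0.0}%.");
    }
}
```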
Now that we have our health check grains, we’ll introduce new IHealthChecks very similar to the BasicOrleansHealthCheck, which extended our abstract base class:
Now we need to wire all of these health checks up to our “/health” endpoint within our web host. Luckily, this is pretty easy. The earlier Startup class is updated, the difference being that we’re adding the health checks (and giving them names) within ConfigureServices, and mapping the health checks to a “/health” endpoint when configuring the request pipeline.
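For reference, the registration and mapping can be sketched as below, assuming ASP.NET Core 3.x endpoint routing. PlaceholderHealthCheck is a stand-in so the snippet compiles on its own; in the real Startup you would register BasicOrleansHealthCheck and the CPU and memory checks instead, each under its own name.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Stand-in check (hypothetical) used only to keep this sketch self-contained.
public class PlaceholderHealthCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default) =>
        Task.FromResult(HealthCheckResult.Healthy());
}

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddRouting();

        // Each AddCheck call registers one named health check.
        // The name "Basic" is an illustrative choice.
        services.AddHealthChecks()
            .AddCheck<PlaceholderHealthCheck>("Basic");
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseRouting();

        // Expose the aggregated result of all registered checks at /health.
        app.UseEndpoints(endpoints => endpoints.MapHealthChecks("/health"));
    }
}
```

By default the endpoint responds with just the overall status as plain text, which is the worst status reported by any registered check.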
Let’s fire up the silo, and test this thing out!
Well, that was pretty anticlimactic… we’ll have to see about prettying up that health check response, hopefully in another post that I’ll totally write real soon!