Microsoft Orleans - Health Checks

Health checks are a way to see how a system is doing at the moment the check is performed. Let’s see how we can apply them to Orleans!

Overview

Health checks are generally exposed via HTTP endpoints, and when hit (often at a “/hc” or “/health” endpoint) they report on the “health” of the current system.

The health checks, at least in .NET land, are built around an enum, HealthStatus, which indicates Healthy, Degraded, or Unhealthy. The health checks themselves are created by writing concrete implementations of the IHealthCheck interface.
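To make the interface concrete, here is a minimal sketch of an IHealthCheck implementation. The class name ExampleHealthCheck and the probe delegate are hypothetical, made up for illustration; only IHealthCheck, HealthCheckContext, HealthCheckResult, and HealthStatus come from Microsoft.Extensions.Diagnostics.HealthChecks.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical check: reports Healthy when a caller-supplied probe returns true.
public class ExampleHealthCheck : IHealthCheck
{
    private readonly Func<bool> _probe;

    public ExampleHealthCheck(Func<bool> probe)
    {
        _probe = probe;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // What "healthy" means is entirely up to the implementer.
        var result = _probe()
            ? HealthCheckResult.Healthy("Probe succeeded.")
            : HealthCheckResult.Unhealthy("Probe failed.");

        return Task.FromResult(result);
    }
}
```

The description string is optional; it is just extra detail attached to the status.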

Any system can include one or more health checks, and what “a health check” means is completely up to you as the implementer. You could, as an example, have one health check that “checks”:

  • Foo
  • Bar
  • Baz

All contained within a “single” health check called “muhSystem” (or whatever), or you could implement the three “checks” above as their own individual health checks. That gives you a single check versus multiple checks, all of which represent “the same thing”. Why would you choose one over the other? Well, going the single check route allows you to check on the health of Foo, Bar, and Baz without “leaking” any information about what’s being checked. This sort of check could be useful if you need to be careful about revealing the internal workings of your system.

In the “three separate checks” scenario, it might not matter to you if you leak some information about your system; you want to give your users (or perhaps your watchdog) more detailed information.
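As a rough sketch of the “single check” route, the class below rolls several inner probes up into one result and reports only the worst status. MuhSystemHealthCheck and the probe delegates are hypothetical names for illustration, not code from the repository. It relies on HealthStatus being ordered Unhealthy (0), Degraded (1), Healthy (2), so “worse” means “smaller”.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical roll-up check: one public status for Foo, Bar, and Baz.
public class MuhSystemHealthCheck : IHealthCheck
{
    private readonly IReadOnlyList<Func<CancellationToken, Task<HealthStatus>>> _probes;

    public MuhSystemHealthCheck(IReadOnlyList<Func<CancellationToken, Task<HealthStatus>>> probes)
    {
        _probes = probes;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var worst = HealthStatus.Healthy;

        foreach (var probe in _probes)
        {
            var status = await probe(cancellationToken);

            // Lower enum values are worse, so keep the minimum seen so far.
            if (status < worst)
            {
                worst = status;
            }
        }

        // Deliberately no description: callers learn the overall status,
        // not which inner probe caused it.
        return new HealthCheckResult(worst);
    }
}
```

Registering Foo, Bar, and Baz as three separate checks instead gives callers the per-check detail, at the cost of revealing what is being checked.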

Getting set up

We’ll be starting the code section of this post from the v0.58 tag on my OrleansGettingStarted repository.

I did not write a blog post about the changes made in the update to the v0.58 tag, but one of them was getting the silo host running under the new UseOrleans extension method. In the newer versions of .NET Core and Orleans, you’re able to host multiple “processes” from the same IHostBuilder. This allows us to host a small API that will serve the health check endpoint over HTTP.

The first thing we’ll do is add a default web host to our host builder, which after the change will be hosting both our silo and our API:

hostBuilder.ConfigureWebHostDefaults(builder => { builder.UseStartup<Startup>(); })

(Note that there are other assorted changes I won’t call out specifically, but the finished code is here and in the references at the bottom of the post.)
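For orientation, the combined host builder might look roughly like the sketch below. This is an assumption-laden outline rather than the repository’s actual Program.cs: it assumes the Orleans 3.x generic host integration, uses UseLocalhostClustering as a stand-in for whatever clustering the real silo is configured with, and leaves out the rest of the silo configuration. Startup refers to the class introduced in this post.

```csharp
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

public static class Program
{
    public static void Main(string[] args)
    {
        CreateHostBuilder(args).Build().Run();
    }

    public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            // Silo hosted through the generic host.
            // Assumption: local development clustering; the real setup may differ.
            .UseOrleans(siloBuilder => siloBuilder.UseLocalhostClustering())
            // Web host that will serve the health check endpoint.
            .ConfigureWebHostDefaults(webBuilder => webBuilder.UseStartup<Startup>());
}
```

Because both live in the same host, they share one service container, which is what later lets the health checks ask for an IClusterClient through constructor injection.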

We’ll also introduce a Startup class:

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

public class Startup
{
    public Startup(IConfiguration configuration)
    {
        Configuration = configuration;
    }

    public IConfiguration Configuration { get; }

    // This method gets called by the runtime. Use this method to add services to the container.
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddControllers();
    }

    // This method gets called by the runtime. Use this method to configure the HTTP request pipeline.
    public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
    {
        if (env.IsDevelopment())
        {
            app.UseDeveloperExceptionPage();
        }

        app.UseHttpsRedirection();

        app.UseRouting();

        app.UseAuthorization();

        app.UseEndpoints(endpoints =>
        {
            endpoints.MapControllers();
        });
    }
}

The above class should more or less be the “default” Startup class present when creating a new web API project from template.

Health Checks

Basic Health Check

The first health check we’re going to do will be a basic one - in fact that’s what we’ll name it. For this health check, we’ll use an IClusterClient and ensure it can get an instance of a grain, and get a result from that grain. If it can get a result from the grain the health check should return “Healthy”, otherwise “Unhealthy”.

As always, we first need our Orleans grain interface:

public interface IBasicHealthCheckGrain : IHealthCheck, IGrainWithGuidKey
{
}

The above is quite simple: we’re creating an interface that extends both IHealthCheck and IGrainWithGuidKey. IGrainWithGuidKey should be familiar from some of my other Orleans posts, and IHealthCheck was mentioned earlier in this post; it’s the interface that describes a health check. We’re not adding anything to this interface that isn’t already provided by IHealthCheck or IGrainWithGuidKey.

Our basic health check grain implementation looks like this:

[StatelessWorker(1)]
public class BasicHealthCheckGrain : Grain, IBasicHealthCheckGrain
{
    public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        return Task.FromResult(new HealthCheckResult(HealthStatus.Healthy));
    }
}

That’s it! Just return a healthy result. If our actual IHealthCheck implementation is unable to get an instance of this grain and an exception is encountered, the exception handler will return “Unhealthy” for us.

Now that we have a health check grain, we’ll need an actual IHealthCheck implementation that will utilize our newly created “health check grain”. I know we’ll be creating several health checks here, all of which do “a lot of the same thing”, so this seems like the perfect opportunity to introduce an abstract class OrleansHealthCheckBase:

public abstract class OrleansHealthCheckBase : IHealthCheck
{
    protected readonly IClusterClient _client;

    protected OrleansHealthCheckBase(IClusterClient client)
    {
        _client = client;
    }

    /// <summary>
    /// Entry point into the health check. Ensures the client is initialized; if it is not, returns a healthy status.
    /// </summary>
    /// <param name="context">The health check context.</param>
    /// <param name="cancellationToken">The cancellation token.</param>
    /// <returns><see cref="Task"/> of <see cref="HealthCheckResult"/></returns>
    public virtual async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        if (!_client.IsInitialized)
        {
            return HealthCheckResult.Healthy($"{nameof(_client)} not yet initialized.");
        }

        return await CheckHealthGrainAsync(context, cancellationToken);
    }

    /// <summary>
    /// Perform the actual health check work within this implemented method.
    /// </summary>
    /// <param name="context">The health check context.</param>
    /// <param name="cancellationToken">The cancellation token.</param>
    /// <returns><see cref="Task"/> of <see cref="HealthCheckResult"/></returns>
    protected abstract Task<HealthCheckResult> CheckHealthGrainAsync(HealthCheckContext context, CancellationToken cancellationToken);
}

All of our health checks will depend on a connection to the cluster, so the base class takes in an IClusterClient and ensures the client is initialized before proceeding with the actual check, which each subclass implements.

As I mentioned previously, for our “Basic” health check, we’ll just be checking that we can get an instance of a grain, and return a value. Such a health check will look like:

public class BasicOrleansHealthCheck : OrleansHealthCheckBase
{
    public BasicOrleansHealthCheck(IClusterClient client) : base(client)
    {
    }

    protected override async Task<HealthCheckResult> CheckHealthGrainAsync(HealthCheckContext context, CancellationToken cancellationToken)
    {
        try
        {
            return await _client.GetGrain<IBasicHealthCheckGrain>(Guid.Empty)
                .CheckHealthAsync(context, cancellationToken);
        }
        catch (Exception e)
        {
            return HealthCheckResult.Unhealthy("Health check failed.", e);
        }
    }
}

Performance Health Checks

Now that the basic health check is out of the way, we can implement some more meaningful ones. The following health checks require a registered IHostEnvironmentStatistics (which you can find out more about here).

These health checks will be especially useful for gauging the utilization of an Orleans node over time, which would allow you to make decisions on questions such as “should I spin up or down additional nodes for this cluster?”. Answers to such questions, especially if running your Orleans cluster in a k8s environment, are much simpler when you have performance metrics exposed via a health check endpoint, and are making use of a watchdog.

I’m going to go through these quickly. They should be mostly self-explanatory, but you can view the completed code for anything I don’t specifically cover.

CPU Health Check

public interface ICpuHealthCheckGrain : IHealthCheck, IGrainWithGuidKey
{
}

[StatelessWorker(1)]
public class CpuHealthCheckGrain : Grain, ICpuHealthCheckGrain
{
    private const float UnhealthyThreshold = 90;
    private const float DegradedThreshold = 70;

    private readonly IHostEnvironmentStatistics _hostEnvironmentStatistics;

    public CpuHealthCheckGrain(IHostEnvironmentStatistics hostEnvironmentStatistics)
    {
        _hostEnvironmentStatistics = hostEnvironmentStatistics;
    }

    public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        if (_hostEnvironmentStatistics.CpuUsage > UnhealthyThreshold)
        {
            return Task.FromResult(HealthCheckResult.Unhealthy(
                $"CPU utilization is unhealthy at {_hostEnvironmentStatistics.CpuUsage}%."));
        }

        if (_hostEnvironmentStatistics.CpuUsage > DegradedThreshold)
        {
            return Task.FromResult(HealthCheckResult.Degraded(
                $"CPU utilization is degraded at {_hostEnvironmentStatistics.CpuUsage}%."));
        }

        return Task.FromResult(HealthCheckResult.Healthy(
            $"CPU utilization is healthy at {_hostEnvironmentStatistics.CpuUsage}%."));
    }
}

That’s a new grain interface and a new grain implementation for CPU health checking. We return Unhealthy if CPU usage is above 90%, Degraded if above 70%, and Healthy otherwise.

Memory Health Check

Same basic idea for the memory health check, again making use of our registered IHostEnvironmentStatistics:

public interface IMemoryHealthCheckGrain : IHealthCheck, IGrainWithGuidKey
{
}

[StatelessWorker(1)]
public class MemoryHealthCheckGrain : Grain, IMemoryHealthCheckGrain
{
    private const float UnhealthyThreshold = 95;
    private const float DegradedThreshold = 90;

    private readonly IHostEnvironmentStatistics _hostEnvironmentStatistics;

    public MemoryHealthCheckGrain(IHostEnvironmentStatistics hostEnvironmentStatistics)
    {
        _hostEnvironmentStatistics = hostEnvironmentStatistics;
    }

    public Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // No statistics available at all: we cannot compute a percentage.
        if (_hostEnvironmentStatistics?.AvailableMemory == null || _hostEnvironmentStatistics?.TotalPhysicalMemory == null)
        {
            return Task.FromResult(HealthCheckResult.Unhealthy("Could not determine memory calculation."));
        }

        // Both values reported as zero: treat as "unknown" rather than dividing by zero.
        // (The second operand previously repeated AvailableMemory.)
        if (_hostEnvironmentStatistics.AvailableMemory == 0 && _hostEnvironmentStatistics.TotalPhysicalMemory == 0)
        {
            return Task.FromResult(HealthCheckResult.Unhealthy("Could not determine memory calculation."));
        }

        // Percentage of physical memory in use.
        var memoryUsed = 100 - ((float)_hostEnvironmentStatistics.AvailableMemory / (float)_hostEnvironmentStatistics.TotalPhysicalMemory * 100);

        if (memoryUsed > UnhealthyThreshold)
        {
            return Task.FromResult(HealthCheckResult.Unhealthy(
                $"Memory utilization is unhealthy at {memoryUsed:0.00}%."));
        }

        if (memoryUsed > DegradedThreshold)
        {
            return Task.FromResult(HealthCheckResult.Degraded(
                $"Memory utilization is degraded at {memoryUsed:0.00}%."));
        }

        return Task.FromResult(HealthCheckResult.Healthy(
            $"Memory utilization is healthy at {memoryUsed:0.00}%."));
    }
}

In this case, I’m going a slightly different route and returning “Unhealthy” if the memory information cannot be determined. This should probably be done consistently between this and the CPU health check, but I wanted to show how you as the implementer are able to choose what “Healthy” vs “Unhealthy” means. For this memory health check, we’re unhealthy if above 95% memory utilization, degraded if above 90%, and healthy otherwise.
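If you did want the two checks to treat missing data the same way, one option is to pull the threshold logic into a shared helper. The sketch below is hypothetical (UtilizationClassifier is not part of the repository) and assumes the utilization arrives as a nullable percentage, as IHostEnvironmentStatistics.CpuUsage does in the Orleans version used here.

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: maps a utilization percentage to a health check result.
public static class UtilizationClassifier
{
    public static HealthCheckResult Classify(
        string name,
        float? percent,
        float degradedThreshold,
        float unhealthyThreshold)
    {
        // Unknown utilization is reported as Unhealthy for every metric.
        if (!percent.HasValue)
        {
            return HealthCheckResult.Unhealthy($"Could not determine {name} utilization.");
        }

        var value = percent.Value;

        if (value > unhealthyThreshold)
        {
            return HealthCheckResult.Unhealthy($"{name} utilization is unhealthy at {value:0.00}%.");
        }

        if (value > degradedThreshold)
        {
            return HealthCheckResult.Degraded($"{name} utilization is degraded at {value:0.00}%.");
        }

        return HealthCheckResult.Healthy($"{name} utilization is healthy at {value:0.00}%.");
    }
}
```

With something like this in place, the CPU grain could return Task.FromResult(UtilizationClassifier.Classify("CPU", _hostEnvironmentStatistics.CpuUsage, 70, 90)), and the memory grain could pass its computed percentage with thresholds of 90 and 95.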

Wiring up the health checks

Now that we have our health check grains, we’ll introduce new IHealthCheck implementations, very similar to the BasicOrleansHealthCheck, which extend OrleansHealthCheckBase.

public class CpuOrleansHealthCheck : OrleansHealthCheckBase
{
    public CpuOrleansHealthCheck(IClusterClient client) : base(client)
    {
    }

    protected override async Task<HealthCheckResult> CheckHealthGrainAsync(HealthCheckContext context, CancellationToken cancellationToken)
    {
        try
        {
            return await _client.GetGrain<ICpuHealthCheckGrain>(Guid.Empty)
                .CheckHealthAsync(context, cancellationToken);
        }
        catch (Exception e)
        {
            return HealthCheckResult.Unhealthy("Health check failed.", e);
        }
    }
}

public class MemoryOrleansHealthCheck : OrleansHealthCheckBase
{
    public MemoryOrleansHealthCheck(IClusterClient client) : base(client)
    {
    }

    protected override async Task<HealthCheckResult> CheckHealthGrainAsync(HealthCheckContext context, CancellationToken cancellationToken)
    {
        try
        {
            return await _client.GetGrain<IMemoryHealthCheckGrain>(Guid.Empty)
                .CheckHealthAsync(context, cancellationToken);
        }
        catch (Exception e)
        {
            return HealthCheckResult.Unhealthy("Health check failed.", e);
        }
    }
}

Now we need to wire all of these health checks up to our “/health” endpoint within our web host. Luckily, this is pretty easy. The earlier Startup becomes:

using Microsoft.AspNetCore.Authorization; // for AllowAnonymousAttribute

public class Startup
{
    public Startup(IConfiguration configuration)
    {
        Configuration = configuration;
    }

    public IConfiguration Configuration { get; }

    // This method gets called by the runtime. Use this method to add services to the container.
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddControllers();

        // Register each health check under the name it will be reported as.
        services.AddHealthChecks()
            .AddCheck<BasicOrleansHealthCheck>("basicOrleans")
            .AddCheck<CpuOrleansHealthCheck>("cpuOrleans")
            .AddCheck<MemoryOrleansHealthCheck>("memoryOrleans");
    }

    // This method gets called by the runtime. Use this method to configure the HTTP request pipeline.
    public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
    {
        if (env.IsDevelopment())
        {
            app.UseDeveloperExceptionPage();
        }

        app.UseHttpsRedirection();

        app.UseRouting();

        app.UseAuthorization();

        app.UseEndpoints(endpoints =>
        {
            // Expose the health checks anonymously at "/health".
            endpoints.MapHealthChecks("/health").WithMetadata(new AllowAnonymousAttribute());
            endpoints.MapControllers();
        });
    }
}

The difference is that we’re adding the health checks (and giving them names) within ConfigureServices, and mapping the health checks to a “/health” endpoint within Configure.
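To give an idea of how a watchdog might consume the endpoint, here is a small sketch of a probe. HealthWatchdog is a hypothetical name, and the URL in the usage note is an assumption about where the API listens. It relies on the default behavior of the health check middleware: a 200 response for Healthy and Degraded, a 503 for Unhealthy, with the overall status name as a plain-text body.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical watchdog helper for polling the "/health" endpoint.
public static class HealthWatchdog
{
    // Anything other than a 200 with the body "Healthy" deserves a closer look,
    // which includes the Degraded case (200 with the body "Degraded").
    public static bool NeedsAttention(HttpStatusCode code, string body)
    {
        return code != HttpStatusCode.OK
            || !string.Equals(body.Trim(), "Healthy", StringComparison.Ordinal);
    }

    public static async Task<bool> ProbeAsync(HttpClient http, Uri endpoint)
    {
        using (var response = await http.GetAsync(endpoint))
        {
            var body = await response.Content.ReadAsStringAsync();
            return NeedsAttention(response.StatusCode, body);
        }
    }
}
```

A caller might use it as await HealthWatchdog.ProbeAsync(new HttpClient(), new Uri("https://localhost:5001/health")), adjusting the address to wherever the web host is actually bound.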

Testing it out

Let’s fire up the silo, and test this thing out!

[Screenshot: Health Check breakpoint]

tada!

Well, that was pretty anticlimactic… we’ll have to see about prettying up that health check response, hopefully in another post that I’ll totally write real soon!

References
