Bad Health Checks With Dynamic Container Ports

Last night, we solved a long-standing bug in one of our Terraform modules. It’s been haunting us for over a year. We managed to reproduce the bug in a CloudFormation stack as well, but we just couldn’t figure out where it was coming from.

For reference, we were trying to stand up an Elastic Container Service (ECS) cluster with containers using dynamic port mapping. When you do this, you build it with an Application Load Balancer (ALB) plus a target group. The way things are supposed to work is that when a container is spun up on the cluster, it is assigned an ephemeral host port, and that port is registered with the target group, which then runs its health checks against it.
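To make the moving parts concrete, here is a minimal Python sketch of the two pieces of configuration involved. It is not our actual module: the function names, container name, image, and health check path are hypothetical. Only the field names follow the ECS task definition and ELBv2 target group formats. It assumes the bridge network mode, where a hostPort of 0 asks ECS to pick an ephemeral host port, and a health check port of "traffic-port", which tells the target group to check each target on the port it was registered with.

```python
import json


def container_definition(name: str, image: str, container_port: int) -> dict:
    """Build an ECS-style container definition that requests a dynamic host port."""
    return {
        "name": name,    # hypothetical container name
        "image": image,  # hypothetical image
        "memory": 256,   # arbitrary value for illustration
        "portMappings": [
            # hostPort 0 (bridge network mode) lets ECS assign an ephemeral host port
            {"containerPort": container_port, "hostPort": 0, "protocol": "tcp"}
        ],
    }


def target_group_health_check(path: str = "/") -> dict:
    """Build ELBv2-style health check settings that follow the registered port."""
    return {
        "HealthCheckProtocol": "HTTP",
        # "traffic-port" means: check each target on the port it was registered with
        "HealthCheckPort": "traffic-port",
        "HealthCheckPath": path,  # hypothetical path
    }


if __name__ == "__main__":
    print(json.dumps(container_definition("web", "example/web:latest", 8080), indent=2))
    print(json.dumps(target_group_health_check("/healthz"), indent=2))
```

With settings along these lines, each container instance ends up registered as its own target on whatever host port it was given, and the health check follows that port rather than a fixed one.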

That was essentially working, but something was adding an extra, erroneous health check on the exposed port (HTTPS/443), which caused the Auto Scaling group (ASG) to think things were amiss… and continuously terminate and rebuild instances. Not a fun situation. Our workaround was to manually remove the health check. But each time the ASG terminated and added an instance, the bad health check would come back. We finally figured this mess out.

Organically Adjusting DevSecOps Processes

I’ve spent a number of years working in a few DevOps/DevSecOps roles, helping organizations move to new ways of doing business. I love automation and cloud, and I particularly love infrastructure as code. DevSecOps transformation is not only about the tech; it’s also about the people. Maybe even more so.