Building Self-Healing Applications with Kubernetes and Spring Boot

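1. Introduction

In this tutorial, we take a close look at Kubernetes probes and show how to use the HealthIndicator provided by Spring Boot Actuator to report an application's real running state more accurately.

To keep the discussion focused, we assume readers already have some experience with Spring Boot Actuator, Kubernetes, and Docker.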

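2. Kubernetes Probes

This tutorial covers two Kubernetes probe types that periodically check whether an application is working properly: the liveness probe and the readiness probe. Newer Kubernetes versions also provide a startup probe, which is not discussed here.

2.1. Liveness vs. Readiness Probes

With liveness and readiness probes, the kubelet can react promptly when it detects a problem, which reduces application downtime.

Although the two are configured in much the same way, they have different semantics, and the kubelet takes a different action depending on which probe fails:

Readiness: verifies whether the Pod is ready to receive traffic. A Pod is marked ready only when all of its containers are ready.
Liveness: checks whether the container needs to be restarted. If the application is running but stuck in a state where it can no longer handle requests (a deadlock, for example), the kubelet restarts the container.

Both probes are configured at the container level: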

apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
  - name: goproxy
    image: k8s.gcr.io/goproxy:0.1
    ports:
    - containerPort: 8080
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 1
      successThreshold: 1
    livenessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
      timeoutSeconds: 2
      failureThreshold: 1
      successThreshold: 1

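We can tune the probes further. The key fields are:

initialDelaySeconds: number of seconds to wait after the container starts before the first probe.
periodSeconds: interval between probes. Defaults to 10 seconds; the minimum is 1.
timeoutSeconds: time after which a probe times out. Defaults to 1 second; the minimum is 1.
failureThreshold: number of consecutive failures after which the probe is considered failed. For readiness, the Pod is then marked unready; for liveness, the container is restarted. Defaults to 3; the minimum is 1.
successThreshold: number of consecutive successes required for the probe to be considered successful again after a failure. Defaults to 1; the minimum is 1, and it must be 1 for liveness probes.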
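
To illustrate how failureThreshold and successThreshold interact, the following Java sketch models the counting rule described above. The class name ProbeThresholds and the initiallyPassing parameter are hypothetical, and this is a simplified model under those assumptions rather than the kubelet's actual implementation.

```java
/** Simplified model of how failureThreshold and successThreshold gate a probe's reported state. */
final class ProbeThresholds {
    private final int failureThreshold;
    private final int successThreshold;
    private boolean passing;   // state currently reported for the probe
    private int streak;        // consecutive results that contradict the reported state

    ProbeThresholds(int failureThreshold, int successThreshold, boolean initiallyPassing) {
        this.failureThreshold = failureThreshold;
        this.successThreshold = successThreshold;
        this.passing = initiallyPassing;   // hypothetical starting state for the sketch
    }

    /** Feeds one probe result and returns the state reported afterwards. */
    boolean record(boolean success) {
        if (success == passing) {
            streak = 0;                    // result agrees with the reported state
        } else {
            streak++;
            int needed = success ? successThreshold : failureThreshold;
            if (streak >= needed) {
                passing = success;         // enough consecutive results: flip the state
                streak = 0;
            }
        }
        return passing;
    }

    public static void main(String[] args) {
        ProbeThresholds probe = new ProbeThresholds(3, 1, true);
        boolean[] results = {true, false, false, true, false, false, false, true};
        for (boolean result : results) {
            System.out.println(result + " -> reported " + probe.record(result));
        }
    }
}
```

With failureThreshold set to 3, two failures followed by a success leave the reported state unchanged; only the third consecutive failure flips it, and a single success (successThreshold of 1) flips it back.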

This example uses TCP probes, but Kubernetes supports other probe types as well.

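2.2. Probe Types

Choosing the right probe type for the scenario matters. If the container is a web server, for instance, an HTTP probe is usually more reliable than a TCP probe, because an open port alone does not prove that the server can answer requests.

Kubernetes supports the following three probe types:

exec: runs a command inside the container, for example to check that a file exists. The probe fails if the command exits with a non-zero status code.
tcpSocket: tries to open a TCP connection to the given port. The probe fails if the connection cannot be established.
httpGet: sends an HTTP GET request to the server running in the container. A response code from 200 to 399 counts as success.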
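
The tcpSocket check described above can be sketched in plain Java as follows. This is only an illustration of the idea (a connection attempt with a timeout), not how the kubelet implements it; the class TcpProbeSketch and its methods are hypothetical names, and the demo opens a throwaway local port to probe.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class TcpProbeSketch {

    /** Returns true if a TCP connection to host:port can be opened within timeoutMillis. */
    static boolean tcpProbe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;   // refused, unreachable, or timed out
        }
    }

    /** Probes a local port while it is listening, then again after it is closed. */
    static boolean[] demo() {
        try {
            int port;
            boolean whileListening;
            try (ServerSocket server = new ServerSocket(0)) {   // 0: pick any free port
                port = server.getLocalPort();
                whileListening = tcpProbe("127.0.0.1", port, 2000);
            }
            boolean afterClose = tcpProbe("127.0.0.1", port, 2000);
            return new boolean[] {whileListening, afterClose};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] results = demo();
        System.out.println("listening: " + results[0]);
        System.out.println("closed:    " + results[1]);
    }
}
```

Note that the probe succeeds as soon as the port accepts a connection, whether or not the application behind it can do useful work, which is why an HTTP probe is often the better choice for web servers.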

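In addition to the common fields above, HTTP probes support these extra fields:

host: host name to connect to; defaults to the Pod IP.
scheme: protocol to use, HTTP or HTTPS; defaults to HTTP.
path: path to request on the HTTP server.
httpHeaders: custom headers to send with the request.
port: port number or port name.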
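
The httpGet success rule (any status from 200 to 399) can be sketched with the JDK alone. The class HttpProbeSketch, its method names, and the throwaway local server answering on /health are all assumptions made for this illustration; the sketch also does not follow redirects, which is a simplification rather than a statement about kubelet behavior.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.concurrent.atomic.AtomicInteger;

final class HttpProbeSketch {

    /** The httpGet rule: any status in the range 200..399 counts as success. */
    static boolean isSuccess(int status) {
        return status >= 200 && status < 400;
    }

    /** Sends one GET request; I/O errors and timeouts count as failure. */
    static boolean httpGetProbe(String url, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.setInstanceFollowRedirects(false);   // simplification for this sketch
            try {
                return isSuccess(conn.getResponseCode());
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            return false;
        }
    }

    /** Probes a throwaway local server that answers 200 first and 503 afterwards. */
    static boolean[] demo() {
        AtomicInteger status = new AtomicInteger(200);
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/health", exchange -> {
                exchange.sendResponseHeaders(status.get(), -1);   // -1: no response body
                exchange.close();
            });
            server.start();
            try {
                String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/health";
                boolean whenUp = httpGetProbe(url, 2000);
                status.set(503);
                boolean whenDown = httpGetProbe(url, 2000);
                return new boolean[] {whenUp, whenDown};
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] results = demo();
        System.out.println("status 200 -> " + results[0]);
        System.out.println("status 503 -> " + results[1]);
    }
}
```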

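3. Combining Spring Actuator with Kubernetes Self-Healing

Now that we know how Kubernetes detects application state, let's see how Spring Boot Actuator can monitor the health of the application and the components it depends on.

The examples below run on Minikube.

3.1. Actuator and HealthIndicators

Spring Boot ships with many ready-to-use HealthIndicator implementations that report the health of the components an application depends on. We only need to add the Actuator dependency: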

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

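3.2. Liveness Probe Example

Let's build an application that starts normally but enters a broken state after 30 seconds.

We simulate the broken state with a HealthIndicator that checks whether a boolean field is true. The field starts as true, and a scheduled task sets it to false after 30 seconds: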

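import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class CustomHealthIndicator implements HealthIndicator {

    // volatile: written by the scheduler thread, read by request threads
    private volatile boolean isHealthy = true;

    public CustomHealthIndicator() {
        ScheduledExecutorService scheduled =
          Executors.newSingleThreadScheduledExecutor();
        // Simulate a failure: report DOWN 30 seconds after construction
        scheduled.schedule(() -> {
            isHealthy = false;
        }, 30, TimeUnit.SECONDS);
    }

    @Override
    public Health health() {
        return isHealthy ? Health.up().build() : Health.down().build();
    }
}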

Next, we package the application as a Docker image:

FROM openjdk:8-jdk-alpine
RUN mkdir -p /usr/opt/service
COPY target/*.jar /usr/opt/service/service.jar
EXPOSE 8080
ENTRYPOINT exec java -jar /usr/opt/service/service.jar

Then we write the Kubernetes deployment template:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: liveness-example
spec:
  ...
    spec:
      containers:
      - name: liveness-example
        image: dbdock/liveness-example:1.0.0
        ...
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          timeoutSeconds: 2
          periodSeconds: 3
          failureThreshold: 1
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 20
          timeoutSeconds: 2
          periodSeconds: 8
          failureThreshold: 1

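Here both probes are HTTP GET probes that target the Actuator /health endpoint. When a HealthIndicator reports DOWN, Actuator answers with HTTP 503, which is outside the 200 to 399 success range, so any change in application state is picked up by the probes and reflected in the Pod's health checks. Note that Spring Boot 2.x and later expose this endpoint at /actuator/health by default, so the probe path may need adjusting for newer versions.

About 30 seconds after deployment, the Pod is marked unready and removed from service traffic; shortly afterwards, the kubelet restarts the container.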
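
The timing can be estimated with a little arithmetic. The sketch below uses an idealized schedule in which probes fire exactly at initialDelaySeconds and then every periodSeconds, and it assumes the application starts failing exactly 30 seconds after container start. Both are assumptions: the real kubelet does not probe on such an exact schedule, and the flag in the example flips 30 seconds after the bean is constructed. The class and method names are hypothetical.

```java
final class ProbeTimeline {

    /**
     * Idealized model: probes run at initialDelay, initialDelay + period, and so on.
     * Returns the time, in seconds after container start, of the probe that completes
     * failureThreshold consecutive failures when the app fails from failsAt onward.
     */
    static int detectionTime(int initialDelay, int period, int failureThreshold, int failsAt) {
        int t = initialDelay;
        while (t < failsAt) {
            t += period;                              // these probes still see a healthy app
        }
        return t + (failureThreshold - 1) * period;   // wait for the remaining failures
    }

    public static void main(String[] args) {
        // Values taken from the deployment template above
        System.out.println("readiness fails at: " + detectionTime(10, 3, 1, 30));
        System.out.println("liveness fails at:  " + detectionTime(20, 8, 1, 30));
    }
}
```

Under these assumptions the readiness probe fails first, at 31 seconds, and the liveness probe follows at 36 seconds, which matches the order of events described above: the Pod first becomes unready, and the restart comes a few seconds later.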

We can inspect the Pod's events with the following command:

kubectl describe pod liveness-example

The output looks like this:

Warning  Unhealthy 3s (x2 over 7s)   kubelet, minikube  Readiness probe failed: HTTP probe failed ...
Warning  Unhealthy 1s                kubelet, minikube  Liveness probe failed: HTTP probe failed ...
Normal   Killing   0s                kubelet, minikube  Killing container with id ...

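3.3. Readiness Probe Example

The previous example showed how a HealthIndicator can reflect application state. Now let's look at another common scenario:

Suppose the application needs some time after startup to load data or finish initializing, so it should not receive traffic right away.

This is a typical use case for a readiness probe.

We modify the previous HealthIndicator: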

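import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class CustomHealthIndicator implements HealthIndicator {

    // volatile: written by the scheduler thread, read by request threads
    private volatile boolean isHealthy = false;

    public CustomHealthIndicator() {
        ScheduledExecutorService scheduled =
          Executors.newSingleThreadScheduledExecutor();
        // Simulate slow initialization: report UP 40 seconds after construction
        scheduled.schedule(() -> {
            isHealthy = true;
        }, 40, TimeUnit.SECONDS);
    }

    @Override
    public Health health() {
        return isHealthy ? Health.up().build() : Health.down().build();
    }
}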

The field starts as false and becomes true after 40 seconds.

Next, we deploy the application with the following Kubernetes template:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: readiness-example
spec:
  ...
    spec:
      containers:
      - name: readiness-example
        image: dbdock/readiness-example:1.0.0
        ...
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 40
          timeoutSeconds: 2
          periodSeconds: 3
          failureThreshold: 2
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 100
          timeoutSeconds: 2
          periodSeconds: 8
          failureThreshold: 1

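The key points of this configuration are:

Because the application needs about 40 seconds to become ready, we set the readiness probe's initialDelaySeconds to 40.
Likewise, we set the liveness probe's initialDelaySeconds to 100 so that the application is not killed before it is ready.

This gives the application roughly 60 extra seconds to finish initializing if it is not ready after the expected 40. If it is still unhealthy when the first liveness probe runs, the kubelet restarts the container.

4. Conclusion

This article introduced Kubernetes probes and showed how to combine them with Spring Boot Actuator to improve application health monitoring.

The complete example code is available in the GitHub repository.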
