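Readiness Probe for Redis with large dataset: answers

【Question title】: Readiness Probe for Redis with large dataset
【Posted】: 2020-12-21 16:11:01
【Question】:

Problem

I have a Redis K8s Deployment linked to a separate Service. The manifest, heavily trimmed, is shown below (let me know if more information is needed):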

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cache
      environment: dev
  template:
    metadata:
      labels:
        app: cache
        environment: dev
    spec:
      containers:
        - name: cache
          image: marketplace.gcr.io/google/redis5
          imagePullPolicy: IfNotPresent
          livenessProbe:
            exec:
              command:
              - redis-cli
              - ping
            initialDelaySeconds: 30
            timeoutSeconds: 5
          readinessProbe:
            exec:
              command:
              - redis-cli
              - ping
            initialDelaySeconds: 30
            timeoutSeconds: 5
      volumes:
        - name: data
          nfs:
            server: "nfs-server.recs-api.svc.cluster.local"
            path: "/data"

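I want to periodically redeploy Redis with a new dataset rather than update the existing cache. When I run kubectl rollout restart deployment/cache, the old Redis pods are terminated before the new Redis pods are ready to accept traffic. The new Redis pods are marked READY and the old ones are terminated as expected, but redis-cli ping on the new pods returns (error) LOADING Redis is loading the dataset in memory. Currently it takes 5-10 minutes for Redis to finish loading the dataset and be ready to accept connections, but by then the new pods have already been READY for that whole time, with live traffic directed to them because the old pods are gone.

My suspicion is that because this response has an exit status of 0, the readinessProbe reports READY 1/1 and the old pods are killed, but I cannot find a suitable exec: command: to avoid this.

redis-cli info has a loading:0|1 line, so I tested: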

readinessProbe:
  exec:
    command: ["redis-cli", "info", "|", "grep loading:", "|", "grep 0"]

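hoping that for a non-0 loading value grep would return a non-zero exit status and fail the readinessProbe. This does not seem to work: it behaves the same as redis-cli ping, with the old pods terminated prematurely and the service down until loading completes.

What I want

When deploying new Redis cache pods, I want a pod to be available to accept connections at all times while the new Redis cache pods are loading the dataset into memory
- Ideally this would take the form of a tidy readinessProbe check, but I am completely open to suggestions!
- It is also possible that I have misunderstood the purpose of a readinessProbe, so please let me know
If possible, a better understanding of why redis-cli ping or other readinessProbes still trigger READY status for the new pods despite a non-zero status code on the exec: command:

Thanks!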

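【Comments】:

A common problem with redis-cli is that it returns a zero exit code on failure. Have you tried printing the exit code after redis-cli ping? echo $?
Yes, I can confirm that a failed redis-cli ping returns 0, which is why I was hoping to find a workaround with grep

Tags: kubernetes redis

【Solution 1】:

I looked into the bitnami/redis chart to see how it implements the liveness/readiness probes.

The chart creates a health ConfigMap containing a shell script that runs redis-cli ping against the Redis server as a health check and handles the response.

Here is the ConfigMap definition: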
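
A likely reason the grep attempt above has no effect: an exec probe runs its command directly rather than through a shell, so the "|" entries are passed to redis-cli as literal arguments and no pipeline is ever built. A minimal sketch of the same idea as a shell function follows. It assumes the image ships sh, tr and grep; the REDIS_CLI variable and the redis_loaded name are hypothetical and exist only so the check can be exercised without a server.

```shell
#!/bin/sh
# Sketch of a readiness check based on the "loading:" field of INFO.
# REDIS_CLI is a hypothetical override (not a redis-cli feature); it
# defaults to the real client.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

redis_loaded() {
  # INFO lines end in CRLF, so strip \r before matching the whole line.
  # grep -q sets the exit status: 0 only if a line is exactly "loading:0".
  "$REDIS_CLI" info 2>/dev/null | tr -d '\r' | grep -qx 'loading:0'
}
```

In a probe, the equivalent pipeline would have to be wrapped in a shell, for example command: ["sh", "-c", "redis-cli info | tr -d '\r' | grep -qx 'loading:0'"].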

data:
  ping_readiness_local.sh: |-
    #!/bin/bash
{{- if .Values.usePasswordFile }}
    password_aux=`cat ${REDIS_PASSWORD_FILE}`
    export REDIS_PASSWORD=$password_aux
{{- end }}
{{- if .Values.usePassword }}
    no_auth_warning=$([[ "$(redis-cli --version)" =~ (redis-cli 5.*) ]] && echo --no-auth-warning)
{{- end }}
    response=$(
      timeout -s 3 $1 \
      redis-cli \
{{- if .Values.usePassword }}
        -a $REDIS_PASSWORD $no_auth_warning \
{{- end }}
        -h localhost \
{{- if .Values.tls.enabled }}
        -p $REDIS_TLS_PORT \
        --tls \
        --cacert {{ template "redis.tlsCACert" . }} \
        {{- if .Values.tls.authClients }}
          --cert {{ template "redis.tlsCert" . }} \
          --key {{ template "redis.tlsCertKey" . }} \
        {{- end }}
{{- else }}
        -p $REDIS_PORT \
{{- end }}
        ping
    )
    if [ "$response" != "PONG" ]; then
      echo "$response"
      exit 1
    fi

Then, in the Deployment/StatefulSet, simply set the probe to execute this shell script:

readinessProbe:
    initialDelaySeconds: {{ .Values.redis.readinessProbe.initialDelaySeconds }}
    periodSeconds: {{ .Values.redis.readinessProbe.periodSeconds }}
    timeoutSeconds: {{ .Values.redis.readinessProbe.timeoutSeconds }}
    successThreshold: {{ .Values.redis.readinessProbe.successThreshold }}
    failureThreshold: {{ .Values.redis.readinessProbe.failureThreshold }}
    exec:
      command:
        - sh
        - -c
        - /scripts/ping_readiness_local.sh {{ .Values.redis.readinessProbe.timeoutSeconds }}
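
Without the Helm templating, the core of that script can be reduced to roughly the sketch below. It assumes no password, no TLS and the default port; the REDIS_CLI variable and the redis_ready name are hypothetical and are not part of the bitnami chart. The timeout wrapper from the chart's script is left out for brevity.

```shell
#!/bin/sh
# Sketch of a PONG-based readiness check, modelled on the chart's script.
# REDIS_CLI is a hypothetical override so the function can be tested
# without a running server; it defaults to the real client.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

redis_ready() {
  # Capture the reply instead of trusting the exit code, since a
  # LOADING reply still comes back with exit status 0.
  response=$("$REDIS_CLI" -h localhost ping 2>&1)
  if [ "$response" != "PONG" ]; then
    # Print the unexpected reply so it is visible in the probe's failure output.
    echo "$response"
    return 1
  fi
  return 0
}
```

To use this as a probe script, a final line calling redis_ready would be added so that the script's exit status is the function's result, and the exec probe would point at the script as in the snippet above.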

【Comments】:

This is really helpful! Thanks, I will test the `if [ "$response" != "PONG" ]; then echo "$response"; exit 1; fi` check

【Solution 2】:

The following should work.

The key part is

tcpSocket:
        port: client # named port

The whole snippet:

       - name: redis
         image: ${DOCKER_PATH_AND_IMAGE}
         resources:
           limits:
             memory: "1.5Gi"
           requests:
             memory: "1.5Gi"
         ports:
         - name: client
           containerPort: 6379
         - name: gossip
           containerPort: 16379
         command: ["/conf/update-node.sh", "redis-server", "/conf/redis.conf"]
         livenessProbe:
          tcpSocket:
            port: client # named port
          initialDelaySeconds: 30
          timeoutSeconds: 5
          periodSeconds: 5
          failureThreshold: 5
          successThreshold: 1
         readinessProbe:
          exec:
            command:
            - redis-cli
            - ping
          initialDelaySeconds: 20
          timeoutSeconds: 5
          periodSeconds: 3

【Comments】: