【Question title】: Kubernetes cluster shuts down after some processing
【Posted】: 2020-01-18 10:26:06
【Question】:

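I have a cluster on GCP running a NodeJS server. The server runs fine locally, but on the cluster it stops without any message when I send a POST to one of its routes. That POST is supposed to send push messages to some of my users using FCM. My database is Cloud Firestore.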

Pod logs:

Not sending to xxxxxxxxxxxxxxx
Not sending to xxxxxxxxxxxxxyx

app@1.0.0 prestart /opt/app
tsc


app@1.0.0 start /opt/app
node src/index.js

Dockerfile:

FROM node:11.15-alpine

# install deps
ADD package.json /tmp/package.json
RUN apk update && apk add yarn python g++ make && rm -rf /var/cache/apk/*
RUN cd /tmp && npm install

# Copy deps
RUN mkdir -p /opt/app && cp -a /tmp/node_modules /opt/app

# Setup workdir
WORKDIR /opt/app
COPY . /opt/app

# run
EXPOSE 3000
CMD ["npm", "start"] 

Kubernetes.yaml.tpl

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  labels:
    app: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          env:
          - name: var1
            value: value1
          - name: var2
            value: value2
          - name: var3
            value: value3
          - name: var4
            value: value4
          - name: var5
            value: value5
          - name: var6
            value: value6
          image: gcr.io/${PROJECT_ID}/app:COMMIT_SHA
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /alive
              port: 3000
            initialDelaySeconds: 30
          readinessProbe:
            httpGet:
              path: /alive
              port: 3000
            initialDelaySeconds: 30
            timeoutSeconds: 1

---
apiVersion: networking.gke.io/v1beta1
kind: ManagedCertificate
metadata:
  name: app
spec:
  domains:
    - myDomain.com.br
---
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  type: NodePort
  selector:
    app: app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: app
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "00.000.000.000"
    networking.gke.io/managed-certificates: app
spec:
  backend:
    serviceName: app
    servicePort: 80

The function I am calling:

var query = tokens;
const getTokens = (
    doc: FirebaseFirestore.QueryDocumentSnapshot
) => {
    // Get user token and send push    
}

const canSend = (user: User): boolean => {
 // Apply business logic to check if the user will receive a push
}


let allUsers: FirebaseFirestore.QuerySnapshot = userdata;
let allGroups: FirebaseFirestore.QuerySnapshot = groups;
await this.asyncForEach(
   query.docs,
   async (doc: FirebaseFirestore.QueryDocumentSnapshot) => {
       let userDoc: User;
       allUsers.docs.filter(
           (userDoc) => userDoc.data()['userId'] === doc.data()['id']
       ).forEach((user: any) => {
           userDoc = new User(user);
       });
       if (userDoc) {
           if (canSend(userDoc)) {
               console.log(`Sending to: ${userDoc.id}`);
               await getTokens(doc);
           } else {
               console.log(`Not sending to: ${doc.data()['id']} `);
           }
        } else {
            console.log(`${doc.data()['id']} Has no document`);
        }
    }
);
console.log('Finished');

EDIT1

I just noticed that this happens when my server sends a large request or a lot of small requests.

EDIT2

kubectl get events returns No resources found.

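【Comments】:

  • Did you check the logs with the -f flag after the call, or with the --previous flag?
  • Yes, I used kubectl logs -f $POD_NAME, but it stopped following before `app@1.0.0 prestart /opt/app`. Running the same command again only returns the lines after that one. The full log (the one I included in the question) is kept in the GCP console.
  • When you say the server stops, do you mean that the pod running this server restarted?
  • @A_Suh Yes, the pod restarted, without any message.
  • @LucasSzavara Can you share the output of kubectl get events?

Tags: node.js typescript kubernetes google-cloud-platform firebase-cloud-messaging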


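【Solution 1】:

As the OP confirmed, the problem was the livenessProbe failing because of a timeout, which caused the pod to be terminated.

I would also suggest not removing the probes entirely, but instead increasing the timeout value (timeoutSeconds: the number of seconds after which the probe times out; defaults to 1 second) to, say, 3-5 seconds in your Deployment YAML: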

timeoutSeconds: 5
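
To put numbers on the probe settings, here is a rough back-of-the-envelope sketch. It assumes the documented Kubernetes defaults for the fields the question's livenessProbe leaves unset (periodSeconds 10, failureThreshold 3, timeoutSeconds 1). The `ProbeSettings` type and the `secondsUntilRestart` helper are hypothetical names used only for illustration, and the formula is an approximation that ignores kubelet scheduling jitter.

```typescript
// Probe fields as named in the Kubernetes probe configuration.
interface ProbeSettings {
    periodSeconds: number;    // how often the probe runs (default 10)
    timeoutSeconds: number;   // how long one probe waits for a reply (default 1)
    failureThreshold: number; // consecutive failures before a restart (default 3)
}

// Defaults that apply to the question's livenessProbe, which sets only
// initialDelaySeconds.
const DEFAULTS: ProbeSettings = {
    periodSeconds: 10,
    timeoutSeconds: 1,
    failureThreshold: 3,
};

// Approximate time from the start of the first failing probe until the
// kubelet restarts the container: (failureThreshold - 1) full periods
// plus the timeout of the last probe.
function secondsUntilRestart(overrides: Partial<ProbeSettings> = {}): number {
    const p: ProbeSettings = { ...DEFAULTS, ...overrides };
    return (p.failureThreshold - 1) * p.periodSeconds + p.timeoutSeconds;
}

console.log(secondsUntilRestart());                      // 21
console.log(secondsUntilRestart({ timeoutSeconds: 5 })); // 25
```

The total window barely changes. What the larger timeout buys is that each individual probe tolerates a reply that takes up to 5 seconds instead of 1, so a briefly busy server no longer counts as a failure.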

More information: configuring probes
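
If the probe times out because the push-sending loop keeps the Node.js event loop busy (an assumption: the thread confirms only the probe timeout, not its cause), it may also help to make the loop cheaper and to let it yield now and then so the /alive route can answer. The sketch below is a minimal illustration, not the OP's code: `UserRecord`, `TokenRecord`, `indexUsers`, `processTokens` and `yieldToEventLoop` are hypothetical names, and plain objects stand in for the Firestore snapshots.

```typescript
// Hypothetical shapes; the real User class and Firestore documents
// are not shown in the question.
interface UserRecord { userId: string; }
interface TokenRecord { id: string; }

// Resolve on a later timer tick so pending I/O, such as an incoming
// /alive probe request, gets a chance to be handled.
const yieldToEventLoop = (): Promise<void> =>
    new Promise<void>((resolve) => setTimeout(resolve, 0));

// Build the lookup once instead of filtering all users for every token.
// As with the filter/forEach in the question, the last match wins.
function indexUsers(users: UserRecord[]): Map<string, UserRecord> {
    const byId = new Map<string, UserRecord>();
    for (const user of users) {
        byId.set(user.userId, user);
    }
    return byId;
}

// Process tokens sequentially, pausing after every `batchSize` items.
async function processTokens(
    tokens: TokenRecord[],
    users: UserRecord[],
    handle: (token: TokenRecord, user: UserRecord | undefined) => Promise<void>,
    batchSize = 100
): Promise<number> {
    const byId = indexUsers(users);
    let processed = 0;
    for (const token of tokens) {
        await handle(token, byId.get(token.id));
        processed++;
        if (processed % batchSize === 0) {
            await yieldToEventLoop();
        }
    }
    return processed;
}
```

In the question's code, `handle` would wrap the `canSend` check and the `getTokens` call. The batch size of 100 is arbitrary and would need tuning against the real workload.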
