
Direction for finding zookeeper issues #642

Description

@ben-efiz

Hello, I am running a 1-shard, 2-replica ClickHouse cluster using clickhouse-operator. Everything works, including ReplicatedMergeTree tables backed by 3 ZooKeeper instances, and data is replicated correctly. However, both the Grafana clickhouse-operator dashboard and ClickHouse itself (via system.errors) report ZooKeeperUserExceptions. I am having trouble finding the root cause and am looking for direction on where to check.

Prometheus is reporting, e.g.

# HELP chi_clickhouse_event_ZooKeeperUserExceptions 
# TYPE chi_clickhouse_event_ZooKeeperUserExceptions counter
chi_clickhouse_event_ZooKeeperUserExceptions{chi="ugcluster",hostname="chi-ugcluster-production-0-0.production.svc.cluster.local",namespace="production"} 12718
chi_clickhouse_event_ZooKeeperUserExceptions{chi="ugcluster",hostname="chi-ugcluster-production-0-1.production.svc.cluster.local",namespace="production"} 7971

I assume it is related to ClickHouse's KEEPER_EXCEPTION, which I see confirmed:

SELECT *
FROM system.errors
ORDER BY value DESC

┌─name────────────────────────┬─code─┬─value─┐
│ KEEPER_EXCEPTION            │  999 │ 12718 │
│ NETWORK_ERROR               │  210 │   407 │
│ FILE_DOESNT_EXIST           │  107 │   241 │
│ ALL_CONNECTION_TRIES_FAILED │  279 │   116 │
│ CANNOT_READ_ALL_DATA        │   33 │    68 │
│ TABLE_IS_READ_ONLY          │  242 │    49 │
│ SYNTAX_ERROR                │   62 │    23 │
│ NOT_FOUND_NODE              │  142 │    18 │
│ UNKNOWN_TABLE               │   60 │    14 │
│ NO_REPLICA_HAS_PART         │  234 │     7 │
│ TOO_MANY_ROWS_OR_BYTES      │  396 │     6 │
│ UNKNOWN_IDENTIFIER          │   47 │     4 │
│ NOT_AN_AGGREGATE            │  215 │     4 │
│ BAD_ARGUMENTS               │   36 │     3 │
│ UNKNOWN_DATABASE            │   81 │     3 │
│ UNKNOWN_FUNCTION            │   46 │     1 │
│ CANNOT_OPEN_FILE            │   76 │     1 │
│ NO_ELEMENTS_IN_CONFIG       │  139 │     1 │
│ REPLICA_IS_ALREADY_EXIST    │  253 │     1 │
│ FUNCTION_NOT_ALLOWED        │  446 │     1 │
└─────────────────────────────┴──────┴───────┘
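Given the KEEPER_EXCEPTION and TABLE_IS_READ_ONLY counters above, one direction I could check (a sketch, not something I have run yet) is system.replicas, which in recent ClickHouse versions records per replicated table whether the replica dropped to read-only, whether its ZooKeeper session expired, and the last Keeper exception text:

```sql
-- Sketch: find replicas that went read-only or lost their ZooKeeper session.
-- Column availability (notably zookeeper_exception) depends on the ClickHouse version.
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    zookeeper_exception
FROM system.replicas
WHERE is_readonly OR zookeeper_exception != ''
```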

The system.zookeeper table also looks good:

SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse/tables/0'

┌─name────────────────────────────┬─value─┬───────czxid─┬───────mzxid─┬───────────────ctime─┬───────────────mtime─┬─version─┬─cversion─┬─aversion─┬─ephemeralOwner─┬─dataLength─┬─numChildren─┬───────pzxid─┬─path─────────────────┐
│ <redacted db.table name> │       │ 25769804336 │ 25769804336 │ 2021-01-27 15:30:38 │ 2021-01-27 15:30:38 │       0 │       13 │        0 │              0 │          0 │          11 │ 25769804343 │ /clickhouse/tables/0 │
└─────────────────────────────────┴───────┴─────────────┴─────────────┴─────────────────────┴─────────────────────┴─────────┴──────────┴──────────┴────────────────┴────────────┴─────────────┴─────────────┴──────────────────────┘

The ClickHouse error logs show nothing related to ZooKeeper either, checked e.g. via
kubectl exec chi-ugcluster-production-0-0-0 -- cat /var/log/clickhouse-server/clickhouse-server.err.log

The ZooKeeper logs also don't show any errors; they mainly contain the liveness probe's ruok checks:

$ kubectl logs zk-0
...
2021-01-28 14:35:56,051 [myid:0] - INFO  [NIOWorkerThread-2:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41174
2021-01-28 14:36:01,420 [myid:0] - INFO  [NIOWorkerThread-1:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41186
2021-01-28 14:36:06,051 [myid:0] - INFO  [NIOWorkerThread-2:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41188
2021-01-28 14:36:11,420 [myid:0] - INFO  [NIOWorkerThread-1:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41200
...
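Beyond ruok, ZooKeeper's mntr four-letter-word command (whitelisted here via 4lw.commands.whitelist=*) reports per-server state and live session counts, which would show whether ClickHouse sessions are churning. On the cluster this would be something like kubectl exec zk-0 -- bash -c 'echo mntr | nc -w 2 localhost 2181'; the snippet below only filters a hypothetical sample capture (the values are made up) for the fields worth watching:

```shell
# Hypothetical 'mntr' capture; on the real cluster, replace with:
#   kubectl exec zk-0 -- bash -c 'echo mntr | nc -w 2 localhost 2181'
sample='zk_version=3.6.1
zk_server_state=leader
zk_num_alive_connections=3
zk_outstanding_requests=0
zk_znode_count=412'

# A healthy 3-node ensemble shows one leader and two followers, and a
# stable zk_num_alive_connections; sessions dropping and reconnecting
# here would line up with repeated KEEPER_EXCEPTION on the ClickHouse side.
echo "$sample" | grep -E '^zk_(server_state|num_alive_connections|outstanding_requests)'
```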

This is the zk.yaml (zk-0/1/2 run in the same production namespace as ClickHouse):

apiVersion: v1
kind: ConfigMap
metadata:
  name: zk
  namespace: production
data:
  run.sh: |
    #!/bin/bash

    HOSTNAME=`hostname -s`
    echo "My hostname: $HOSTNAME"
    if [[ $HOSTNAME =~ (.*)-([0-9]+)$ ]]; then
      ORD=${BASH_REMATCH[2]}
      export ZOO_MY_ID=$((ORD))
    else
      echo "Failed to get index from hostname $HOSTNAME"
      exit 1
    fi

    echo $ZOO_MY_ID > /zk/data/myid

    /docker-entrypoint.sh ./bin/zkServer.sh start-foreground
  zoo.cfg: |
    dataDir=/zk/data
    dataLogDir=/zk/datalog
    clientPort=2181
    clientPortAddress=0.0.0.0
    maxClientCnxns=0
    tickTime=2000
    initLimit=5
    syncLimit=2
    autopurge.snapRetainCount=3
    autopurge.purgeInterval=0
    standaloneEnabled=true
    admin.enableServer=true
    4lw.commands.whitelist=*
    server.0=zk-0.zk.production.svc:2888:3888;2181
    server.1=zk-1.zk.production.svc:2888:3888;2181
    server.2=zk-2.zk.production.svc:2888:3888;2181
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: zk
  name: zk
  namespace: production
spec:
  podManagementPolicy: Parallel
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: zk
  serviceName: zk
  template:
    metadata:
      labels:
        app: zk
      name: zk
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: zk
            topologyKey: kubernetes.io/hostname
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                - key: node.kubernetes.io/instance-type
                  operator: In
                  values:
                  - s3.large.4
                  - s3.large.8
      containers:
      - command:
        - bash
        - /run.sh
        image: zookeeper:3.6.1
        imagePullPolicy: Always
        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok
          failureThreshold: 6
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: zk
        ports:
        - containerPort: 2181
          name: client
          protocol: TCP
        - containerPort: 2888
          name: follower
          protocol: TCP
        - containerPort: 3888
          name: election
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok
          failureThreshold: 6
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
        volumeMounts:
        - mountPath: /run.sh
          name: config
          subPath: run.sh
        - mountPath: /conf/zoo.cfg
          name: config
          subPath: zoo.cfg
        - mountPath: /zk/data
          name: data
        - mountPath: /zk/datalog
          name: datalog
      volumes:
      - configMap:
          defaultMode: 420
          name: zk
        name: config
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datalog
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: zk
  name: zk
  namespace: production
spec:
  clusterIP: None
  ports:
  - name: prometheus
    port: 7000
  - name: tcp-client
    port: 2181
    protocol: TCP
    targetPort: client
  - name: follower
    port: 2888
    protocol: TCP
    targetPort: follower
  - name: tcp-election
    port: 3888
    protocol: TCP
    targetPort: election
  publishNotReadyAddresses: true
  selector:
    app: zk
  sessionAffinity: None
  type: ClusterIP
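One quick sanity check on the manifest above (a standalone sketch with a hypothetical helper, no cluster needed): the ordinal that run.sh extracts from the pod hostname must line up with the server.N entries in zoo.cfg, i.e. zk-0 → myid 0 → server.0, and so on. A mismatch there is a classic cause of an ensemble that looks alive but keeps dropping client sessions.

```shell
# get_myid mirrors the bash regex used in run.sh ((.*)-([0-9]+)$) with POSIX sed:
# it extracts the trailing digits after the last '-' in a StatefulSet pod name.
get_myid() {
  echo "$1" | sed -n 's/.*-\([0-9][0-9]*\)$/\1/p'
}

# zk-0/zk-1/zk-2 must map to myid 0/1/2 to match server.0/1/2 in zoo.cfg.
for pod in zk-0 zk-1 zk-2; do
  echo "$pod -> myid $(get_myid "$pod")"
done
```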

I am mainly asking for directions and hope to track down the issue myself from there.

$ kubectl describe deployment clickhouse-operator -n kube-system
Name:                   clickhouse-operator
Namespace:              kube-system
CreationTimestamp:      Wed, 23 Dec 2020 15:38:43 +0100
Labels:                 app=clickhouse-operator
                        clickhouse.altinity.com/app=chop
                        clickhouse.altinity.com/chop=0.13.0
