
Direction for finding zookeeper issues #642

Description

@ben-efiz

Hello, I am running a 1-shard, 2-replica ClickHouse cluster using clickhouse-operator. Everything works, including ReplicatedMergeTree tables backed by 3 ZooKeeper instances, and data is replicated correctly. However, both the Grafana clickhouse-operator dashboard and ClickHouse itself (via system.errors) report ZooKeeperUserExceptions. I am having trouble finding the root cause and am looking for direction on where to check.

Prometheus is reporting, e.g.

# HELP chi_clickhouse_event_ZooKeeperUserExceptions 
# TYPE chi_clickhouse_event_ZooKeeperUserExceptions counter
chi_clickhouse_event_ZooKeeperUserExceptions{chi="ugcluster",hostname="chi-ugcluster-production-0-0.production.svc.cluster.local",namespace="production"} 12718
chi_clickhouse_event_ZooKeeperUserExceptions{chi="ugcluster",hostname="chi-ugcluster-production-0-1.production.svc.cluster.local",namespace="production"} 7971

I assume it is related to ClickHouse's KEEPER_EXCEPTION, which I see confirmed:

SELECT *
FROM system.errors
ORDER BY value DESC

┌─name────────────────────────┬─code─┬─value─┐
│ KEEPER_EXCEPTION            │  999 │ 12718 │
│ NETWORK_ERROR               │  210 │   407 │
│ FILE_DOESNT_EXIST           │  107 │   241 │
│ ALL_CONNECTION_TRIES_FAILED │  279 │   116 │
│ CANNOT_READ_ALL_DATA        │   33 │    68 │
│ TABLE_IS_READ_ONLY          │  242 │    49 │
│ SYNTAX_ERROR                │   62 │    23 │
│ NOT_FOUND_NODE              │  142 │    18 │
│ UNKNOWN_TABLE               │   60 │    14 │
│ NO_REPLICA_HAS_PART         │  234 │     7 │
│ TOO_MANY_ROWS_OR_BYTES      │  396 │     6 │
│ UNKNOWN_IDENTIFIER          │   47 │     4 │
│ NOT_AN_AGGREGATE            │  215 │     4 │
│ BAD_ARGUMENTS               │   36 │     3 │
│ UNKNOWN_DATABASE            │   81 │     3 │
│ UNKNOWN_FUNCTION            │   46 │     1 │
│ CANNOT_OPEN_FILE            │   76 │     1 │
│ NO_ELEMENTS_IN_CONFIG       │  139 │     1 │
│ REPLICA_IS_ALREADY_EXIST    │  253 │     1 │
│ FUNCTION_NOT_ALLOWED        │  446 │     1 │
└─────────────────────────────┴──────┴───────┘
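Given the KEEPER_EXCEPTION and TABLE_IS_READ_ONLY counters above, one direction I could check (a sketch, not something I have run yet) is system.replicas, which in recent ClickHouse versions records per replicated table whether the replica dropped to read-only, whether its ZooKeeper session expired, and the last Keeper exception text:

```sql
-- Sketch: find replicas that went read-only or lost their ZooKeeper session.
-- Column availability (notably zookeeper_exception) depends on the ClickHouse version.
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    zookeeper_exception
FROM system.replicas
WHERE is_readonly OR zookeeper_exception != ''
```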

The system.zookeeper table also looks good:

SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse/tables/0'

┌─name────────────────────────────┬─value─┬───────czxid─┬───────mzxid─┬───────────────ctime─┬───────────────mtime─┬─version─┬─cversion─┬─aversion─┬─ephemeralOwner─┬─dataLength─┬─numChildren─┬───────pzxid─┬─path─────────────────┐
│ <redacted db.table name> │       │ 25769804336 │ 25769804336 │ 2021-01-27 15:30:38 │ 2021-01-27 15:30:38 │       0 │       13 │        0 │              0 │          0 │          11 │ 25769804343 │ /clickhouse/tables/0 │
└─────────────────────────────────┴───────┴─────────────┴─────────────┴─────────────────────┴─────────────────────┴─────────┴──────────┴──────────┴────────────────┴────────────┴─────────────┴─────────────┴──────────────────────┘

The ClickHouse error logs show nothing related to ZooKeeper either, checked e.g. via
kubectl exec chi-ugcluster-production-0-0-0 -- cat /var/log/clickhouse-server/clickhouse-server.err.log

The ZooKeeper logs also don't show any errors; they mainly contain the liveness probe's ruok checks:

$ kubectl logs zk-0
...
2021-01-28 14:35:56,051 [myid:0] - INFO  [NIOWorkerThread-2:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41174
2021-01-28 14:36:01,420 [myid:0] - INFO  [NIOWorkerThread-1:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41186
2021-01-28 14:36:06,051 [myid:0] - INFO  [NIOWorkerThread-2:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41188
2021-01-28 14:36:11,420 [myid:0] - INFO  [NIOWorkerThread-1:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41200
...
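Beyond ruok, ZooKeeper's mntr four-letter-word command (whitelisted here via 4lw.commands.whitelist=*) reports per-server state and live session counts, which would show whether ClickHouse sessions are churning. On the cluster this would be something like kubectl exec zk-0 -- bash -c 'echo mntr | nc -w 2 localhost 2181'; the snippet below only filters a hypothetical sample capture (the values are made up) for the fields worth watching:

```shell
# Hypothetical 'mntr' capture; on the real cluster, replace with:
#   kubectl exec zk-0 -- bash -c 'echo mntr | nc -w 2 localhost 2181'
sample='zk_version=3.6.1
zk_server_state=leader
zk_num_alive_connections=3
zk_outstanding_requests=0
zk_znode_count=412'

# A healthy 3-node ensemble shows one leader and two followers, and a
# stable zk_num_alive_connections; sessions dropping and reconnecting
# here would line up with repeated KEEPER_EXCEPTION on the ClickHouse side.
echo "$sample" | grep -E '^zk_(server_state|num_alive_connections|outstanding_requests)'
```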

This is the zk.yaml (zk-0/1/2 run in the same production namespace as ClickHouse):

apiVersion: v1
kind: ConfigMap
metadata:
  name: zk
  namespace: production
data:
  run.sh: |
    #!/bin/bash

    HOSTNAME=`hostname -s`
    echo "My hostname: $HOSTNAME"
    if [[ $HOSTNAME =~ (.*)-([0-9]+)$ ]]; then
      ORD=${BASH_REMATCH[2]}
      export ZOO_MY_ID=$((ORD))
    else
      echo "Failed to get index from hostname $HOSTNAME"
      exit 1
    fi

    echo $ZOO_MY_ID > /zk/data/myid

    /docker-entrypoint.sh ./bin/zkServer.sh start-foreground
  zoo.cfg: |
    dataDir=/zk/data
    dataLogDir=/zk/datalog
    clientPort=2181
    clientPortAddress=0.0.0.0
    maxClientCnxns=0
    tickTime=2000
    initLimit=5
    syncLimit=2
    autopurge.snapRetainCount=3
    autopurge.purgeInterval=0
    standaloneEnabled=true
    admin.enableServer=true
    4lw.commands.whitelist=*
    server.0=zk-0.zk.production.svc:2888:3888;2181
    server.1=zk-1.zk.production.svc:2888:3888;2181
    server.2=zk-2.zk.production.svc:2888:3888;2181
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: zk
  name: zk
  namespace: production
spec:
  podManagementPolicy: Parallel
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: zk
  serviceName: zk
  template:
    metadata:
      labels:
        app: zk
      name: zk
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: zk
            topologyKey: kubernetes.io/hostname
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                - key: node.kubernetes.io/instance-type
                  operator: In
                  values:
                  - s3.large.4
                  - s3.large.8
      containers:
      - command:
        - bash
        - /run.sh
        image: zookeeper:3.6.1
        imagePullPolicy: Always
        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok
          failureThreshold: 6
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: zk
        ports:
        - containerPort: 2181
          name: client
          protocol: TCP
        - containerPort: 2888
          name: follower
          protocol: TCP
        - containerPort: 3888
          name: election
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok
          failureThreshold: 6
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
        volumeMounts:
        - mountPath: /run.sh
          name: config
          subPath: run.sh
        - mountPath: /conf/zoo.cfg
          name: config
          subPath: zoo.cfg
        - mountPath: /zk/data
          name: data
        - mountPath: /zk/datalog
          name: datalog
      volumes:
      - configMap:
          defaultMode: 420
          name: zk
        name: config
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datalog
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: zk
  name: zk
  namespace: production
spec:
  clusterIP: None
  ports:
  - name: prometheus
    port: 7000
  - name: tcp-client
    port: 2181
    protocol: TCP
    targetPort: client
  - name: follower
    port: 2888
    protocol: TCP
    targetPort: follower
  - name: tcp-election
    port: 3888
    protocol: TCP
    targetPort: election
  publishNotReadyAddresses: true
  selector:
    app: zk
  sessionAffinity: None
  type: ClusterIP
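One quick sanity check on the manifest above (a standalone sketch with a hypothetical helper, no cluster needed): the ordinal that run.sh extracts from the pod hostname must line up with the server.N entries in zoo.cfg, i.e. zk-0 → myid 0 → server.0, and so on. A mismatch there is a classic cause of an ensemble that looks alive but keeps dropping client sessions.

```shell
# get_myid mirrors the bash regex used in run.sh ((.*)-([0-9]+)$) with POSIX sed:
# it extracts the trailing digits after the last '-' in a StatefulSet pod name.
get_myid() {
  echo "$1" | sed -n 's/.*-\([0-9][0-9]*\)$/\1/p'
}

# zk-0/zk-1/zk-2 must map to myid 0/1/2 to match server.0/1/2 in zoo.cfg.
for pod in zk-0 zk-1 zk-2; do
  echo "$pod -> myid $(get_myid "$pod")"
done
```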

I am mainly asking for directions and hope to track down the issue myself from there.

$ kubectl describe deployment clickhouse-operator -n kube-system
Name:                   clickhouse-operator
Namespace:              kube-system
CreationTimestamp:      Wed, 23 Dec 2020 15:38:43 +0100
Labels:                 app=clickhouse-operator
                        clickhouse.altinity.com/app=chop
                        clickhouse.altinity.com/chop=0.13.0
