Direction for finding zookeeper issues #642
Description
Hello, I am running a 1-shard, 2-replica ClickHouse cluster using clickhouse-operator. Everything works, including ReplicatedMergeTree tables backed by 3 ZooKeeper instances, and data is replicated correctly. However, both the Grafana clickhouse-operator dashboard and ClickHouse itself (via system.errors) report ZooKeeperUserExceptions. I am having trouble finding the root cause and am looking for direction on where to check.
Prometheus reports, e.g.:
# HELP chi_clickhouse_event_ZooKeeperUserExceptions
# TYPE chi_clickhouse_event_ZooKeeperUserExceptions counter
chi_clickhouse_event_ZooKeeperUserExceptions{chi="ugcluster",hostname="chi-ugcluster-production-0-0.production.svc.cluster.local",namespace="production"} 12718
chi_clickhouse_event_ZooKeeperUserExceptions{chi="ugcluster",hostname="chi-ugcluster-production-0-1.production.svc.cluster.local",namespace="production"} 7971
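Since chi_clickhouse_event_ZooKeeperUserExceptions is a cumulative counter, the raw value can include errors from long-past incidents. A rate query (a sketch, using the label names from the scrape above) shows whether exceptions are still occurring right now:

```promql
# Non-zero means exceptions are currently being produced, not just history
rate(chi_clickhouse_event_ZooKeeperUserExceptions{chi="ugcluster",namespace="production"}[5m])
```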
I assume this is related to the ClickHouse KEEPER_EXCEPTION error, which I see confirmed:
SELECT *
FROM system.errors
ORDER BY value DESC
┌─name────────────────────────┬─code─┬─value─┐
│ KEEPER_EXCEPTION │ 999 │ 12718 │
│ NETWORK_ERROR │ 210 │ 407 │
│ FILE_DOESNT_EXIST │ 107 │ 241 │
│ ALL_CONNECTION_TRIES_FAILED │ 279 │ 116 │
│ CANNOT_READ_ALL_DATA │ 33 │ 68 │
│ TABLE_IS_READ_ONLY │ 242 │ 49 │
│ SYNTAX_ERROR │ 62 │ 23 │
│ NOT_FOUND_NODE │ 142 │ 18 │
│ UNKNOWN_TABLE │ 60 │ 14 │
│ NO_REPLICA_HAS_PART │ 234 │ 7 │
│ TOO_MANY_ROWS_OR_BYTES │ 396 │ 6 │
│ UNKNOWN_IDENTIFIER │ 47 │ 4 │
│ NOT_AN_AGGREGATE │ 215 │ 4 │
│ BAD_ARGUMENTS │ 36 │ 3 │
│ UNKNOWN_DATABASE │ 81 │ 3 │
│ UNKNOWN_FUNCTION │ 46 │ 1 │
│ CANNOT_OPEN_FILE │ 76 │ 1 │
│ NO_ELEMENTS_IN_CONFIG │ 139 │ 1 │
│ REPLICA_IS_ALREADY_EXIST │ 253 │ 1 │
│ FUNCTION_NOT_ALLOWED │ 446 │ 1 │
└─────────────────────────────┴──────┴───────┘
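Recent ClickHouse versions also expose last_error_time and last_error_message in system.errors, which point straight at the failing operation. A sketch (these columns may not exist on older releases):

```sql
-- last_error_* columns are only available in newer ClickHouse releases
SELECT
    name,
    code,
    value,
    last_error_time,
    last_error_message
FROM system.errors
WHERE name = 'KEEPER_EXCEPTION'
```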
The system.zookeeper table also looks good:
SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse/tables/0'
┌─name────────────────────────────┬─value─┬───────czxid─┬───────mzxid─┬───────────────ctime─┬───────────────mtime─┬─version─┬─cversion─┬─aversion─┬─ephemeralOwner─┬─dataLength─┬─numChildren─┬───────pzxid─┬─path─────────────────┐
│ <redacted db.table name> │ │ 25769804336 │ 25769804336 │ 2021-01-27 15:30:38 │ 2021-01-27 15:30:38 │ 0 │ 13 │ 0 │ 0 │ 0 │ 11 │ 25769804343 │ /clickhouse/tables/0 │
└─────────────────────────────────┴───────┴─────────────┴─────────────┴─────────────────────┴─────────────────────┴─────────┴──────────┴──────────┴────────────────┴────────────┴─────────────┴─────────────┴──────────────────────┘
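For Keeper problems, system.replicas is usually more telling than system.zookeeper: is_readonly, is_session_expired and zookeeper_exception would also explain the TABLE_IS_READ_ONLY count above. A sketch (the exact column set varies by ClickHouse version):

```sql
-- Per-replica health; a replica that lost its ZooKeeper session
-- goes read-only and accumulates KEEPER_EXCEPTION errors
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    queue_size,
    zookeeper_exception  -- last ZooKeeper error seen by this replica
FROM system.replicas
```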
The ClickHouse error logs show nothing related to ZooKeeper, e.g. via
kubectl exec chi-ugcluster-production-0-0-0 -- cat /var/log/clickhouse-server/clickhouse-server.err.log
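The .err.log only carries errors; ZooKeeper session expiries and reconnects are often logged at warning/information level in the main server log. A grep sketch (log path as in the default ClickHouse packaging):

```shell
# Look for session expiry / reconnect chatter in the main log
kubectl exec chi-ugcluster-production-0-0-0 -- \
  grep -E 'ZooKeeper|Session expired' \
  /var/log/clickhouse-server/clickhouse-server.log | tail -n 50
```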
The ZooKeeper logs also show no errors, mostly just the following liveness-probe output:
$ kubectl logs zk-0
...
2021-01-28 14:35:56,051 [myid:0] - INFO [NIOWorkerThread-2:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41174
2021-01-28 14:36:01,420 [myid:0] - INFO [NIOWorkerThread-1:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41186
2021-01-28 14:36:06,051 [myid:0] - INFO [NIOWorkerThread-2:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41188
2021-01-28 14:36:11,420 [myid:0] - INFO [NIOWorkerThread-1:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:41200
...
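Since 4lw.commands.whitelist=* is set in zoo.cfg below, the mntr four-letter command can be used to check quorum state and connection counts on each node. A sketch:

```shell
# Expect one leader and two followers; watch connection/request counters
for i in 0 1 2; do
  kubectl exec zk-$i -- bash -c 'echo mntr | nc -w 2 localhost 2181' \
    | grep -E 'zk_server_state|zk_num_alive_connections|zk_outstanding_requests'
done
```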
This is the zk.yaml (zk-0/1/2 run in the same production namespace as ClickHouse):
apiVersion: v1
kind: ConfigMap
metadata:
  name: zk
  namespace: production
data:
  run.sh: |
    #!/bin/bash
    HOSTNAME=`hostname -s`
    echo "My hostname: $HOSTNAME"
    if [[ $HOSTNAME =~ (.*)-([0-9]+)$ ]]; then
      ORD=${BASH_REMATCH[2]}
      export ZOO_MY_ID=$((ORD))
    else
      echo "Failed to get index from hostname $HOSTNAME"
      exit 1
    fi
    echo $ZOO_MY_ID > /zk/data/myid
    /docker-entrypoint.sh ./bin/zkServer.sh start-foreground
  zoo.cfg: |
    dataDir=/zk/data
    dataLogDir=/zk/datalog
    clientPort=2181
    clientPortAddress=0.0.0.0
    maxClientCnxns=0
    tickTime=2000
    initLimit=5
    syncLimit=2
    autopurge.snapRetainCount=3
    autopurge.purgeInterval=0
    standaloneEnabled=true
    admin.enableServer=true
    4lw.commands.whitelist=*
    server.0=zk-0.zk.production.svc:2888:3888;2181
    server.1=zk-1.zk.production.svc:2888:3888;2181
    server.2=zk-2.zk.production.svc:2888:3888;2181
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: zk
  name: zk
  namespace: production
spec:
  podManagementPolicy: Parallel
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: zk
  serviceName: zk
  template:
    metadata:
      labels:
        app: zk
      name: zk
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: zk
            topologyKey: kubernetes.io/hostname
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - s3.large.4
                - s3.large.8
      containers:
      - command:
        - bash
        - /run.sh
        image: zookeeper:3.6.1
        imagePullPolicy: Always
        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok
          failureThreshold: 6
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: zk
        ports:
        - containerPort: 2181
          name: client
          protocol: TCP
        - containerPort: 2888
          name: follower
          protocol: TCP
        - containerPort: 3888
          name: election
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - echo "ruok" | timeout 2 nc -w 2 localhost 2181 | grep imok
          failureThreshold: 6
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
        volumeMounts:
        - mountPath: /run.sh
          name: config
          subPath: run.sh
        - mountPath: /conf/zoo.cfg
          name: config
          subPath: zoo.cfg
        - mountPath: /zk/data
          name: data
        - mountPath: /zk/datalog
          name: datalog
      volumes:
      - configMap:
          defaultMode: 420
          name: zk
        name: config
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datalog
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: zk
  name: zk
  namespace: production
spec:
  clusterIP: None
  ports:
  - name: prometheus
    port: 7000
  - name: tcp-client
    port: 2181
    protocol: TCP
    targetPort: client
  - name: follower
    port: 2888
    protocol: TCP
    targetPort: follower
  - name: tcp-election
    port: 3888
    protocol: TCP
    targetPort: election
  publishNotReadyAddresses: true
  selector:
    app: zk
  sessionAffinity: None
  type: ClusterIP
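To rule out DNS or connectivity flaps between ClickHouse and the headless service, the same ruok probe can be run from inside a ClickHouse pod against each ZooKeeper hostname from zoo.cfg. A sketch (assumes nc is available in the ClickHouse image):

```shell
# Each node should answer "imok"; a timeout points at DNS/network issues
for i in 0 1 2; do
  kubectl exec chi-ugcluster-production-0-0-0 -- bash -c \
    "echo ruok | nc -w 2 zk-$i.zk.production.svc 2181; echo"
done
```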
I am mainly asking for direction and hope to track down the issue myself from there.
$ kubectl describe deployment clickhouse-operator -n kube-system
Name: clickhouse-operator
Namespace: kube-system
CreationTimestamp: Wed, 23 Dec 2020 15:38:43 +0100
Labels: app=clickhouse-operator
clickhouse.altinity.com/app=chop
clickhouse.altinity.com/chop=0.13.0