
BIG-IP Next for Kubernetes on NVIDIA BlueField-3 DPUs can only be deployed
on one DPU per chassis. Additional DPUs in the same chassis are used to
accelerate GPU-to-GPU or GPU-to-storage communications.


Troubleshooting¶

This section describes how to troubleshoot and fix some of the common issues that users may encounter. The following are possible error messages, along with steps to diagnose and resolve them.

Troubleshooting common error scenarios¶

ERROR: Ingress Traffic is not working¶

Cause: No static route is created or TMM is rejecting the static route.

Run the following command to get the TMM logs:

kubectl logs deploy/f5-tmm -c f5-tmm -f

Sample Output:

...

decl_static_route_handler/174: Creating new route_entry: app-ns-static-route-10.

decl_static_route_handler/210: Adding gateway: app-ns-static-route-10.244.0.2

decl_static_route_handler/236: route is unresolved: app-ns-static-route-10.244.0

<134>Oct 1 04:52:57 f5-tmm-57d4488c4-2x9x4 tmm[15]: 01010058:6: audit log:

decl_traffic_matching_criteria_handler/837: Received create tmc message app-ns-g

FIX:¶


1. Check node annotation:

◦ Add an annotation to the Host Node by setting the IP address of the VF interface that is connected to the DPU/TMM.

◦ To ensure static routes are created for directing traffic to the TMM pod on the DPU, you need to add an annotation to the Host Node. The IP address must be in the same CIDR range as the internal network.

◦ For example, 192.20.28.146/22 is the IP on the Host Node Virtual Function (VF) interface that is connected to the DPU node through the sf_internal bridge.

◦ Run the following command to annotate the Host Node:

kubectl annotate node arm64-sm-2 k8s.ovn.org/node-primary-ifaddr

Note: The IP address is the IP address of the VF interface, and the node is the name of the node where the application pod is running.
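The annotation value is cut off in the command above; a minimal sketch of what the complete command might look like, assuming the annotation takes the ovn-kubernetes JSON value format (the value format is an assumption; verify it against your cluster before using):

# Hypothetical complete form; the JSON value format is an assumption
kubectl annotate node arm64-sm-2 'k8s.ovn.org/node-primary-ifaddr={"ipv4":"192.20.28.146/22"}'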

2. Check IP on interface:

◦ Check whether the external Node interface has an IP address in the same CIDR range as your IP address.

◦ Verify whether you can ping that IP address (see the sketch after the sample output below).

◦ Run the following command to verify that the nginx application is running in the application namespace:

kubectl get pods -n <application namespace>

For example:

kubectl get pods -w -n app-ns -owide

Sample Output:

NAME READY STATUS RESTARTS

nginx-deploy-5798c85b9c-qtqnd 1/1 Running 0
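As referenced in step 2, a minimal sketch of the interface and ping checks from the Host Node (the interface name is a placeholder; the address is the example used above):

# Confirm the VF interface carries an address in the internal CIDR,
# then confirm that address answers pings
ip addr show dev <vf-interface>
ping -c 3 192.20.28.146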

ERROR: TMM not starting¶

Verify the TMM logs on the f5-tmm container. Run the following command to get the logs:

kubectl logs pod/f5-tmm-cf595cb87-wdfrn -n f5-spk

Sample Output:

dpdk: mempool_alloc Successfully created RTE_RING

dpdk: mempool_alloc RTE_RING descriptor count: 262144. MBUF header count: 262143

xnet_dev [mlx5_core.sf.4]: Kernel driver is already unbound or no such device

xnet_dev [mlx5_core.sf.4]: Error: Failed to open /sys/bus/pci/devices/mlx5_core.sf

dpdk[mlx5_core.sf.4]: Error: **** Fatal xnet DPDK Driver Configuration Error

TMM clock is 0 seconds from system time

ticks since last clock update: 161

ticks since start of poll: 111850104

TMM version: no_pgo aarch64 TMM Version 0.1010.1+0.1.5 Build Date: Tue Sep

FIX:¶

Perform the following checks to debug the TMM not starting issue:

1. Verify whether the vfio_pci kernel module is loaded on the DPU node:

Sample Output:


root@localhost:~# lsmod | grep vfio

root@localhost:~# modprobe vfio_pci

root@localhost:~# lsmod | grep vfio

vfio_pci 16384 0

vfio_pci_core 69632 1 vfio_pci

vfio_virqfd 20480 1 vfio_pci_core

vfio_iommu_type1 49152 0

vfio 45056 2 vfio_pci_core,vfio_iommu_type1
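If the module was missing, note that modprobe does not persist across reboots. A minimal sketch to load the module at boot, using the standard systemd modules-load convention (the file name is illustrative):

# Load vfio_pci automatically on every boot
echo vfio_pci | sudo tee /etc/modules-load.d/vfio_pci.conf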

2. Verify whether the SRIOV plugin is creating the scalable function (SF) resources
for K8S.

Run the following command to verify:

kubectl get pods -n kube-system -owide

Sample Output:


NAME READY STATUS RESTARTS AGE

coredns-76f75df574-4pnbt 1/1 Running 0 28h

coredns-76f75df574-pgjx4 1/1 Running 0 28h

etcd-sm-mgx1 1/1 Running 19 28h

kube-apiserver-sm-mgx1 1/1 Running 15 28h

kube-controller-manager-sm-mgx1 1/1 Running 2 28h

kube-proxy-9hnxf 1/1 Running 0 28h

kube-proxy-zzbjd 1/1 Running 0 28h

kube-scheduler-sm-mgx1 1/1 Running 18 28h

kube-sriov-device-plugin-sgjl2 1/1 Running 0 7h

kube-sriov-device-plugin-vgnkt 1/1 Running 0 7h
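You can also confirm that the SF resources are actually advertised as allocatable on the DPU node; a minimal sketch (the node name is a placeholder, and the resource names follow the configmap shown in the next step):

# List the nvidia.com/* resources the node reports as allocatable
kubectl get node <dpu-node-name> -o jsonpath='{.status.allocatable}' | tr ',' '\n' | grep nvidia.com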

3. Check the kube-sriov-device-plugin logs running on the Host Node.

Run the following command to check:

kubectl logs pod/kube-sriov-device-plugin-vgnkt -n kube-system

Look for the configmap in the logs:


"resourceList": [

"resourceName": "bf3_p0_sf",

"resourcePrefix": "nvidia.com",

"deviceType": "auxNetDevice",

"selectors": [{

"vendors": ["15b3"],

"devices": ["a2dc"],

"pciAddresses": ["0000:03:00.0"],

"pfNames": ["p0#1-2"],

"auxTypes": ["sf"]

}]

},

"resourceName": "bf3_p1_sf",

"resourcePrefix": "nvidia.com",

"deviceType": "auxNetDevice",

"selectors": [{

"vendors": ["15b3"],

"devices": ["a2dc"],

"pciAddresses": ["0000:03:00.1"],

6 of 17 5/24/25, 2:17 PM
Firefox https://clouddocs.f5.com/bigip-next-for-kubernetes/2.0.0-GA/spk-troub...

"pfNames": ["p1#1-2"],

"auxTypes": ["sf"]

}]

Sample Output:

I0306 06:37:22.534796 1 factory.go:203] *types.AuxNetDeviceSelectors for res


I0306 06:37:22.534805 1 manager.go:106] unmarshalled ResourceList: [{Resourc
I0306 06:37:22.534876 1 manager.go:217] validating resource name "nvidia.com
I0306 06:37:22.534894 1 manager.go:217] validating resource name "nvidia.com
I0306 06:37:22.534901 1 main.go:62] Discovering host devices
I0306 06:37:22.619234 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDe
I0306 06:37:22.619318 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDe
I0306 06:37:22.619329 1 netDeviceProvider.go:67] netdevice AddTargetDevices(
I0306 06:37:22.621099 1 netDeviceProvider.go:67] netdevice AddTargetDevices(
I0306 06:37:22.623488 1 main.go:68] Initializing resource servers
I0306 06:37:22.623530 1 manager.go:117] number of config: 2
I0306 06:37:22.623555 1 manager.go:121] Creating new ResourcePool: bf3_p0_sf
I0306 06:37:22.623561 1 manager.go:122] DeviceType: auxNetDevice

If the configmap is not found, check whether the scalable functions are created properly. Also, verify that the Network Attachment Definition Custom Resources are created:

kubectl get net-attach-def -A

NAMESPACE   NAME          AGE
f5-spk      sf-external   41s
f5-spk      sf-internal   41s

ERROR: VLAN creation fails¶


While trying to create VLAN, the following error is observed:

octo@arm64-sm-2:~/orchestrator$ kubectl apply -f vlan.yaml

Error from server (InternalError): error when creating "vlan.yaml": Internal

FIX:¶

1. Check whether the clusterissuer is created. To check, run the following command (see also the verification sketch after this list):

kubectl get clusterissuer

No resources found

Note: The error from the validating webhook indicates that the webhook cannot authenticate with the API server. This error comes from the validating webhook, not the conversion webhook.

2. Check the CRD installation values. If the values are placed incorrectly, go back to the validating webhook.

3. Set the annotation properly in the validatingwebhookconfiguration so that cainjector injects the CA into the API server.

4. Check the validatingwebhookconfiguration:

kubectl get validatingwebhookconfiguration f5validate-default -o yaml

cert-manager.io/inject-ca-from: default/tls-f5ingress-webhookvalidating-s

5. Check for the secrets which should contain the ca. Run the following command in the terminal:

kubectl get secret | grep tls-f5ingress-webhookvalidating-svr

Sample Output:

tls-f5ingress-webhookvalidating-svr-98clw    Opaque
tls-f5ingress-webhookvalidating-svr-secret   kubernetes.io/tls


◦ The first value in the sample output indicates that the issuer/clusterissuer is not ready; hence, the secret name is appended with a random suffix.

◦ The second value indicates that the secrets were not properly deleted during the previous installation. Delete the secrets with the helm uninstall f5ingress command, or delete the spkinstance CR.

◦ The secrets remain in the cluster even if the pods or deployments were manually deleted earlier.

6. Clean up the environment and ensure that all the tls-* secrets are deleted.

7. Configure the ca secret and create the clusterissuer for Cert Manager. To create the clusterissuer, follow the instructions in the Create Clusterissuer and Certificates (cert-manager.html) guide.
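As referenced in step 1, a short verification sketch once the clusterissuer exists (resource names follow this guide; the caBundle output is base64 data, truncated here for readability):

# Check that the clusterissuer reports READY=True
kubectl get clusterissuer -o wide

# Confirm the CA bundle was injected into the webhook configuration
kubectl get validatingwebhookconfiguration f5validate-default -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | head -c 40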

ERROR: Cannot join DPU node to Kubernetes cluster¶

The following error may occur while trying to join the DPU node to the Kubernetes cluster:

ubuntu@localhost:~$ sudo kubeadm join 10.144.47.34:6443 --token xed0f5.9pvw5csqu9w

[preflight] Running pre-flight checks

error execution phase preflight: [preflight] Some fatal errors occurred:

[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sy

[preflight] If you know what you are doing, you can make a check non-fatal

To see the stack trace of this error execute with --v=5 or higher

FIX:¶

Load the br_netfilter kernel module and enable the required kernel parameters:


sudo su

modprobe br_netfilter

echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables

echo 1 > /proc/sys/net/ipv4/ip_forward
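These settings do not survive a reboot; a minimal sketch to make them persistent, using the standard modules-load.d and sysctl.d conventions (the file names are illustrative):

echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf
cat <<'EOF' | sudo tee /etc/sysctl.d/99-kubernetes.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system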

For more information on this error and troubleshooting steps, refer to the Kubernetes Community Forums (https://discuss.kubernetes.io/t/kubeadmin-join-throws-this-error-proc-sys-net-bridge-bridge-nf-call-iptables-does-not-exist/24855).

ERROR: DPU does not have management IP for internet access¶

When the DPU does not have a management interface configured, it has access neither to the internet nor to the Host Node.

FIX:¶

Configure Kubernetes to advertise on the tmfifo_net0 interface, and add an iptables rule to send DPU traffic through the Host.

On the host¶

1. Run the following command on the Host:

iptables -t nat -I POSTROUTING -o eno1 -j MASQUERADE

2. Have Kubernetes advertise on the RShim tmfifo_net0 interface. Start Kubernetes with --apiserver-advertise-address=192.168.100.1 to advertise the Kubernetes API on the tmfifo_net0 interface so the DPU can reach it:

sudo kubeadm init --apiserver-advertise-address=192.168.100.1 --pod

This routes all traffic through the eno1 interface of the Host Node.

3. Change the DNS server to 1.1.1.1 on the DPU node. To change it, edit the /etc/netplan/50-cloud-init.yaml file and change the nameservers for oob_net0 to 1.1.1.1.


vi /etc/netplan/50-cloud-init.yaml


# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        oob_net0:
            dhcp4: true
        tmfifo_net0:
            addresses:
            - 192.168.100.2/30
            dhcp4: false
            nameservers:
                addresses:
                - 1.1.1.1
            routes:
            - metric: 1025
              to: 0.0.0.0/0
              via: 192.168.100.1
    renderer: NetworkManager
    version: 2

4. Now, run the following command:

netplan apply
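After applying, a quick sanity check from the DPU node (a sketch; the addresses follow the example above):

# The default route should point at the host over tmfifo_net0,
# and external traffic should flow through the host's MASQUERADE rule
ip route show default
ping -c 3 1.1.1.1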

ERROR: multus error¶

The following error may occur if the multus CNI plugin is not installed properly.

kubectl get pods -A -owide

Sample Output

...

kube-system   kube-multus-ds-jt9jk   0/1   Init:CrashLoopBackOff
kube-system   kube-multus-ds-vmfws   1/1   Running

FIX:¶

To troubleshoot the multus pod CrashLoopBackOff error, describe the failing pod:

kubectl describe pod/kube-multus-ds-jt9jk -n kube-system


...

Init Containers:

install-multus-binary:

Container ID: containerd://841c0318f6265009e82bc6c82f41f9e27ebb80bde1ae31c6c4

Image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick

Image ID: ghcr.io/k8snetworkplumbingwg/multus-cni@sha256:6879c6efc5dddd56

Port: <none>

Host Port: <none>

Command:

cp

/usr/src/multus-cni/bin/multus-shim

/host/opt/cni/bin/multus-shim

State: Waiting

Reason: CrashLoopBackOff

Last State: Terminated

Reason: Error

Message: cp: cannot create regular file '/host/opt/cni/bin/multus-shim'

Exit Code: 1

Started: Mon, 28 Oct 2024 16:39:22 +0000

Finished: Mon, 28 Oct 2024 16:39:22 +0000
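In this example, the init container fails because it cannot write the multus-shim binary to /host/opt/cni/bin on the node. A hedged remediation sketch, assuming the node's CNI bin directory is missing or not writable (verify the correct path for your distribution):

# On the affected node: ensure the CNI bin directory exists and is writable
sudo mkdir -p /opt/cni/bin
ls -ld /opt/cni/bin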

ERROR: BIG-IP Next for Kubernetes not starting up¶


kubectl describe pod/spkinstance-sample-f5ingress-5d7b978f86-xxw9k

Events:

Type Reason Age From Message

---- ------ ---- ---- -------

Normal Scheduled 4m11s default-scheduler Successfully ass

Warning FailedMount 4m10s (x2 over 4m10s) kubelet MountVolume.SetU

Warning FailedMount 4m10s (x2 over 4m10s) kubelet MountVolume.SetU

Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU

Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU

Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU

Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU

Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU

Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU

Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU
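The event messages above are truncated by the page width; a sketch to read them in full (the pod name is the example from this section):

# Print the complete event messages for the failing pod
kubectl get events --field-selector involvedObject.name=spkinstance-sample-f5ingress-5d7b978f86-xxw9k -o custom-columns=REASON:.reason,MESSAGE:.message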

ERROR: TMM OOMKilled/CrashLoopBackOff¶

In the TMM pod, the following OOMKilled or CrashLoopBackOff error is observed.

default f5-tmm-dpu-594f5f9d4-lp8mq 3/4 OOMKilled

FIX:¶

If you observe an OOMKilled or CrashLoopBackOff error in the TMM pod, perform the following steps to fix the issue:

1. Check the TMM Deployment configuration.

Sample Output


- name: USE_PHYS_MEM
  value: "true"
- name: TMM_GENERIC_SOCKET_DRIVER
  value: "false"
- name: TMM_MAPRES_ADDL_VETHS_ON_DP
  value: "false"

2. Remove both TMM_GENERIC_SOCKET_DRIVER and TMM_MAPRES_ADDL_VETHS_ON_DP from the configuration (see the sketch below).
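A minimal sketch for inspecting and removing these variables with kubectl (the deployment name and namespace follow the examples in this guide):

# Show the env block of the f5-tmm container
kubectl get deploy f5-tmm -n f5-spk -o jsonpath='{.spec.template.spec.containers[?(@.name=="f5-tmm")].env}'

# Remove the two variables (a trailing "-" unsets an env var)
kubectl set env deploy/f5-tmm -n f5-spk -c f5-tmm TMM_GENERIC_SOCKET_DRIVER- TMM_MAPRES_ADDL_VETHS_ON_DP-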

ERROR: Persistence Problems¶

The fluentd and CWC pods have unbound immediate PersistentVolumeClaims, which can cause an error.

FIX:¶

If persistence is enabled in fluentd or CWC, you must create the persistent volumes for these objects to depend on (a sketch follows the list below).

• For fluentd: If there is a PVC named "f5-toda-fluentd" in the toda pod, a persistent volume for this claim has to be created.

• For CWC: If there is a PVC named "cluster-wide-controller" in the toda pod, a persistent volume for this claim has to be created.
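A minimal sketch of a PersistentVolume the fluentd claim could bind to (the PV name, capacity, access mode, and hostPath are assumptions; match them to the actual PVC requests, and repeat for the CWC claim):

# Hypothetical PV for the "f5-toda-fluentd" claim; all values are assumptions
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: f5-toda-fluentd-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /var/lib/f5-toda-fluentd
EOF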

ERROR: Inotify-Related Issues in Pods¶

If your pods experience errors related to inotify limits, you may see logs like the
following:

"ts"="2025-03-24 05:09:43.012"|"l"="error"|"m"="failed to create log watcher"

Cause: This issue occurs when the system’s inotify limits are too low, restricting the
number of files and directories that can be monitored.


FIX:¶

1. To increase the inotify limits, update the following kernel parameters:

echo "fs.inotify.max_user_watches=2099999999" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_user_instances=2099999999" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_queued_events=2099999999" | sudo tee -a /etc/sysctl.conf

2. After updating the configuration, reload the kernel parameters with the following
command.

sudo sysctl -p

These changes will ensure that your system has higher inotify limits to accommodate
your workload.
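To confirm the new limits are active, a quick check (sysctl accepts multiple keys in one invocation):

sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances fs.inotify.max_queued_events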
