Note: BIG-IP Next for Kubernetes on NVIDIA BlueField-3 DPUs can only be deployed on one DPU per chassis. Additional DPUs in the same chassis are used to accelerate GPU-to-GPU or GPU-to-storage communications.
Troubleshooting¶
This section describes how to troubleshoot and fix common issues. Each entry below shows a possible error message along with the steps to diagnose and resolve it.
Troubleshooting common error scenarios¶
ERROR: Ingress Traffic is not working¶
Cause: No static route is created or TMM is rejecting the static route.
TMM logs:
Run the following command to get the TMM logs:
kubectl logs deploy/f5-tmm -c f5-tmm -f
Sample Output:
...
decl_static_route_handler/174: Creating new route_entry: app-ns-static-route-10.
decl_static_route_handler/210: Adding gateway: app-ns-static-route-10.244.0.2
decl_static_route_handler/236: route is unresolved: app-ns-static-route-10.244.0
<134>Oct 1 04:52:57 f5-tmm-57d4488c4-2x9x4 tmm[15]: 01010058:6: audit log:
decl_traffic_matching_criteria_handler/837: Received create tmc message app-ns-g
FIX:¶
1. Check node annotation:
◦ Add an annotation to the Host Node that sets the IP address of the VF interface connected to the DPU/TMM.
◦ To ensure static routes are created for directing traffic to the TMM pod on the DPU, you need to add an annotation to the Host Node. The IP address must be in the same CIDR range as the internal network.
◦ For example, 192.20.28.146/22 is the IP on the Host Node Virtual Function (VF) interface that is connected to the DPU node through the sf_internal bridge.
◦ Run the following command to annotate the Host Node (a sketch of the full command follows this step):
kubectl annotate node arm64-sm-2 k8s.ovn.org/node-primary-ifaddr
Note: The IP address is the IP address of the VF interface, and the node is the name of the node where the application pod is running.
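A hedged sketch of the complete annotation command, assuming the OVN-Kubernetes JSON value format and reusing the example VF IP above; verify both against your environment:
# Assumption: k8s.ovn.org/node-primary-ifaddr takes a JSON value holding the VF IP/CIDR
kubectl annotate node arm64-sm-2 \
    k8s.ovn.org/node-primary-ifaddr='{"ipv4":"192.20.28.146/22"}'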
2. Check IP on interface:
◦ Check whether the external Node interface has an IP address in the same CIDR range as your IP address (a quick check is sketched after the sample output below).
◦ Verify whether you can ping the IP address.
◦ List the pods in the application namespace:
kubectl get pods -n <application namespace>
◦ Run the following command to verify that the nginx application is running:
kubectl get pods -w -n app-ns -owide
Sample Output:
NAME READY STATUS RESTARTS
nginx-deploy-5798c85b9c-qtqnd 1/1 Running 0
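A quick way to run both checks from step 2; the interface name and IP address are placeholders:
# Show addresses on the external interface; one should be in the expected CIDR range
ip addr show <external-interface>
# Confirm the IP address responds
ping -c 3 <ip-address>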
ERROR: TMM not starting¶
Verify the TMM logs on the f5-tmm container. Run the following command to get the logs:
kubectl logs pod/f5-tmm-cf595cb87-wdfrn -n f5-spk
Sample Output:
dpdk: mempool_alloc Successfully created RTE_RING
dpdk: mempool_alloc RTE_RING descriptor count: 262144. MBUF header count: 262143
xnet_dev [mlx5_core.sf.4]: Kernel driver is already unbound or no such device
xnet_dev [mlx5_core.sf.4]: Error: Failed to open /sys/bus/pci/devices/mlx5_core.sf
dpdk[mlx5_core.sf.4]: Error: **** Fatal xnet DPDK Driver Configuration Error
TMM clock is 0 seconds from system time
ticks since last clock update: 161
ticks since start of poll: 111850104
TMM version: no_pgo aarch64 TMM Version 0.1010.1+0.1.5 Build Date: Tue Sep
FIX:¶
Perform the following checks to debug the TMM not starting issue:
1. Verify whether the vfio_pci kernel module is loaded on the DPU node:
Sample Output:
root@localhost:~# lsmod | grep vfio
root@localhost:~# modprobe vfio_pci
root@localhost:~# lsmod | grep vfio
vfio_pci 16384 0
vfio_pci_core 69632 1 vfio_pci
vfio_virqfd 20480 1 vfio_pci_core
vfio_iommu_type1 49152 0
vfio 45056 2 vfio_pci_core,vfio_iommu_type1
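If the module was missing and modprobe fixed it, note that the load does not persist across reboots. A hedged way to make it permanent (the file name is an arbitrary choice):
# Load vfio_pci automatically at boot
echo vfio_pci > /etc/modules-load.d/vfio_pci.conf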
2. Verify whether the SRIOV plugin is creating the scalable function (SF) resources
for K8S.
Run the following command to verify:
kubectl get pods -n kube-system -owide
Sample Output:
NAME READY STATUS RESTARTS AGE
coredns-76f75df574-4pnbt 1/1 Running 0 28h
coredns-76f75df574-pgjx4 1/1 Running 0 28h
etcd-sm-mgx1 1/1 Running 19 28h
kube-apiserver-sm-mgx1 1/1 Running 15 28h
kube-controller-manager-sm-mgx1 1/1 Running 2 28h
kube-proxy-9hnxf 1/1 Running 0 28h
kube-proxy-zzbjd 1/1 Running 0 28h
kube-scheduler-sm-mgx1 1/1 Running 18 28h
kube-sriov-device-plugin-sgjl2 1/1 Running 0 7h
kube-sriov-device-plugin-vgnkt 1/1 Running 0 7h
3. Check the logs of the kube-sriov-device-plugin pod running on the Host Node.
Run the following command to check:
kubectl logs pod/kube-sriov-device-plugin-vgnkt -n kube-system
Look for the configmap in the logs:
"resourceList": [
"resourceName": "bf3_p0_sf",
"resourcePrefix": "nvidia.com",
"deviceType": "auxNetDevice",
"selectors": [{
"vendors": ["15b3"],
"devices": ["a2dc"],
"pciAddresses": ["0000:03:00.0"],
"pfNames": ["p0#1-2"],
"auxTypes": ["sf"]
}]
},
"resourceName": "bf3_p1_sf",
"resourcePrefix": "nvidia.com",
"deviceType": "auxNetDevice",
"selectors": [{
"vendors": ["15b3"],
"devices": ["a2dc"],
"pciAddresses": ["0000:03:00.1"],
6 of 17 5/24/25, 2:17 PM
Firefox https://clouddocs.f5.com/bigip-next-for-kubernetes/2.0.0-GA/spk-troub...
"pfNames": ["p1#1-2"],
"auxTypes": ["sf"]
}]
Sample Output:
I0306 06:37:22.534796 1 factory.go:203] *types.AuxNetDeviceSelectors for res
I0306 06:37:22.534805 1 manager.go:106] unmarshalled ResourceList: [{Resourc
I0306 06:37:22.534876 1 manager.go:217] validating resource name "nvidia.com
I0306 06:37:22.534894 1 manager.go:217] validating resource name "nvidia.com
I0306 06:37:22.534901 1 main.go:62] Discovering host devices
I0306 06:37:22.619234 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDe
I0306 06:37:22.619318 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDe
I0306 06:37:22.619329 1 netDeviceProvider.go:67] netdevice AddTargetDevices(
I0306 06:37:22.621099 1 netDeviceProvider.go:67] netdevice AddTargetDevices(
I0306 06:37:22.623488 1 main.go:68] Initializing resource servers
I0306 06:37:22.623530 1 manager.go:117] number of config: 2
I0306 06:37:22.623555 1 manager.go:121] Creating new ResourcePool: bf3_p0_sf
I0306 06:37:22.623561 1 manager.go:122] DeviceType: auxNetDevice
If the configmap is not found, check whether the scalable functions are created properly. Also, verify that the Network Attachment Definition Custom Resources are created:
kubectl get net-attach-def -A
NAMESPACE NAME AGE
f5-spk sf-external 41s
f5-spk sf-internal 41s
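If the net-attach-def resources are also missing, one hedged way to confirm the scalable functions exist on the DPU is devlink (SF ports are reported with flavour pcisf):
# List devlink ports on the DPU; scalable functions appear with flavour pcisf
devlink port show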
ERROR: VLAN¶
While trying to create a VLAN, the following error is observed:
octo@arm64-sm-2:~/orchestrator$ kubectl apply -f vlan.yaml
Error from server (InternalError): error when creating "vlan.yaml": Internal
FIX:¶
1. Check whether the clusterissuer is created. To check, run the following
command:
kubectl get clusterissuer
No resources found
Note: The error from the validating webhook indicates that the webhook cannot authenticate with the API server. This comes from the validating webhook, not the conversion webhook.
2. Check the CRD installation values. If the values are misplaced, go back to the validating webhook.
3. Set the annotation properly in the validatingwebhookconfiguration so that cainjector can inject the CA for the API server.
4. Check the validatingwebhookconfiguration:
kubectl get validatingwebhookconfiguration f5validate-default -o yaml
cert-manager.io/inject-ca-from: default/tls-f5ingress-webhookvalidating-s
5. Check for the secrets that should contain the CA. Run the following command in the terminal:
kubectl get secret | grep tls-f5ingress-webhookvalidating-svr
Sample Output:
tls-f5ingress-webhookvalidating-svr-98clw Opaque
tls-f5ingress-webhookvalidating-svr-secret kubernetes.io/tls
◦ The first entry in the sample output indicates that the issuer/clusterissuer is not ready; hence, the secret name is appended with a random suffix.
◦ The second entry indicates that the secrets were not properly deleted during the previous installation. Delete the secrets with the helm uninstall f5ingress command or by deleting the spkinstance CR.
◦ The secrets remain in the cluster even if the pods or deployments were deleted manually earlier.
6. Clean up the environment and ensure that all the tls-* secrets are deleted.
7. Configure the ca secret and create the clusterissuer for Cert Manager. To create the clusterissuer, follow the instructions in the Create Clusterissuer and Certificates (cert-manager.html) guide. A minimal sketch follows.
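A minimal clusterissuer sketch based on a CA secret; both names below are placeholders, and the linked guide remains the authoritative reference:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: example-clusterissuer      # placeholder name
spec:
  ca:
    secretName: example-ca-secret  # placeholder: secret holding tls.crt/tls.key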
ERROR: Cannot join DPU node to Kubernetes cluster¶
The following error may occur while trying to join the DPU node to the Kubernetes cluster:
ubuntu@localhost:~$ sudo kubeadm join 10.144.47.34:6443 --token xed0f5.9pvw5csqu9w
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sy
[preflight] If you know what you are doing, you can make a check non-fatal
To see the stack trace of this error execute with --v=5 or higher
FIX:¶
Load the br_netfilter module and enable the required kernel settings:
sudo su
modprobe br_netfilter
echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
echo 1 > /proc/sys/net/ipv4/ip_forward
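The settings above do not survive a reboot. A hedged way to persist them; the file names under /etc/modules-load.d and /etc/sysctl.d are arbitrary choices:
# Load br_netfilter at boot
echo br_netfilter > /etc/modules-load.d/k8s.conf
# Persist the bridge and forwarding sysctls
printf 'net.bridge.bridge-nf-call-iptables = 1\nnet.ipv4.ip_forward = 1\n' > /etc/sysctl.d/k8s.conf
# Apply all sysctl files now
sysctl --system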
For more information on this error and troubleshooting steps, refer to the Kubernetes Community Forums (https://discuss.kubernetes.io/t/kubeadmin-join-throws-this-error-proc-sys-net-bridge-bridge-nf-call-iptables-does-not-exist/24855).
ERROR: DPU does not have management IP for internet access¶
When the DPU does not have a management interface configured, it has no access to the internet or to the Host Node.
FIX:¶
Configure Kubernetes to advertise on the tmfifo_net0 interface and add an iptables rule to send DPU traffic through the Host.
On the host¶
1. Run the following command on the Host:
iptables -t nat -I POSTROUTING -o eno1 -j MASQUERADE
2. Have Kubernetes advertise the API on the RShim tmfifo_net0 interface. Start Kubernetes with --apiserver-advertise-address=192.168.100.1 so the DPU can reach the Kubernetes API:
sudo kubeadm init --apiserver-advertise-address=192.168.100.1 --pod
Route all traffic through the eno1 interface of the Host Node.
3. Change the DNS server to 1.1.1.1 on the DPU node. To do so, edit the /etc/netplan/50-cloud-init.yaml file and set the nameservers for tmfifo_net0 to 1.1.1.1.
vi /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        oob_net0:
            dhcp4: true
        tmfifo_net0:
            addresses:
            - 192.168.100.2/30
            dhcp4: false
            nameservers:
                addresses:
                - 1.1.1.1
            routes:
            -   metric: 1025
                to: 0.0.0.0/0
                via: 192.168.100.1
    renderer: NetworkManager
    version: 2
4. Now, run the following command:
netplan apply
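To confirm the DPU now reaches the internet through the Host, a quick hedged check (the ping target is an arbitrary example):
# Default route should point at the Host over tmfifo_net0
ip route
# Reachability through the Host's MASQUERADE rule
ping -c 3 1.1.1.1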
ERROR: multus error¶
The following error may occur if the multus CNI plugin is not installed properly.
kubectl get pods -A -owide
Sample Output:
...
kube-system kube-multus-ds-jt9jk 0/1 Init:CrashL
kube-system kube-multus-ds-vmfws 1/1 Running
FIX:¶
Use the following method to troubleshoot the multus pod CrashLoopBackOff error:
kubectl describe pod/kube-multus-ds-jt9jk -n kube-system
...
Init Containers:
install-multus-binary:
Container ID: containerd://841c0318f6265009e82bc6c82f41f9e27ebb80bde1ae31c6c4
Image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick
Image ID: ghcr.io/k8snetworkplumbingwg/multus-cni@sha256:6879c6efc5dddd56
Port: <none>
Host Port: <none>
Command:
cp
/usr/src/multus-cni/bin/multus-shim
/host/opt/cni/bin/multus-shim
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Message: cp: cannot create regular file '/host/opt/cni/bin/multus-shim'
Exit Code: 1
Started: Mon, 28 Oct 2024 16:39:22 +0000
Finished: Mon, 28 Oct 2024 16:39:22 +0000
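The Message line shows the install-multus-binary init container failing to copy multus-shim into /host/opt/cni/bin, which maps to /opt/cni/bin on the node. A hedged check to run on the affected node; the path is the standard CNI default and may differ in your setup:
# The CNI binary directory must exist and be writable
ls -ld /opt/cni/bin
# If it is missing, create it, then delete the multus pod so it retries
mkdir -p /opt/cni/bin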
ERROR: BIG-IP Next for Kubernetes not starting up¶
kubectl describe pod/spkinstance-sample-f5ingress-5d7b978f86-xxw9k
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m11s default-scheduler Successfully ass
Warning FailedMount 4m10s (x2 over 4m10s) kubelet MountVolume.SetU
Warning FailedMount 4m10s (x2 over 4m10s) kubelet MountVolume.SetU
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU
Warning FailedMount 4m9s (x3 over 4m10s) kubelet MountVolume.SetU
ERROR: TMM OOMKilled/CrashLoopBackOff¶
In the TMM pod, the following OOMKilled or CrashLoopBackOff error is observed.
default f5-tmm-dpu-594f5f9d4-lp8mq 3/4 OOMKil
FIX:¶
If you observe an OOMKilled or CrashLoopBackOff error in the TMM pod, perform the following steps to fix the issue:
1. Check the TMM Deployment configuration.
Sample Output:
- name: USE_PHYS_MEM
value: "true"
- name: TMM_GENERIC_SOCKET_DRIVER
value: "false"
- name: TMM_MAPRES_ADDL_VETHS_ON_DP
value: "false"
2. Remove both TMM_GENERIC_SOCKET_DRIVER and TMM_MAPRES_ADDL_VETHS_ON_DP from the configuration, as sketched below.
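A hedged way to remove both variables without hand-editing the manifest, assuming the deployment is named f5-tmm in the f5-spk namespace; a trailing '-' after a variable name removes it:
kubectl set env deployment/f5-tmm -n f5-spk \
    TMM_GENERIC_SOCKET_DRIVER- TMM_MAPRES_ADDL_VETHS_ON_DP-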
ERROR: Persistence Problems¶
The fluentd and CWC pods have unbound immediate PersistentVolumeClaims, which can cause an error.
FIX:¶
If Persistence is enabled in fluentd or CWC, the user must create the persistent volumes for these claims to bind, for example with a PersistentVolume like the sketch below.
• For fluentd: If there is a PVC named "f5-toda-fluentd" in the toda pod, a persistent volume for this claim has to be created.
• For CWC: If there is a PVC named "cluster-wide-controller" in the CWC pod, a persistent volume for this claim has to be created.
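A minimal hostPath PersistentVolume sketch for the fluentd claim; the capacity, access mode, and path are assumptions and must match the actual PVC:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: f5-toda-fluentd-pv        # placeholder name
spec:
  capacity:
    storage: 10Gi                 # assumption: match the PVC request
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /var/local/f5-fluentd   # placeholder path on the node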
ERROR: Inotify-Related Issues in Pods¶
If your pods experience errors related to inotify limits, you may see logs like the
following:
"ts"="2025-03-24 05:09:43.012"|"l"="error"|"m"="failed to create log watcher"
Cause: This issue occurs when the system’s inotify limits are too low, restricting the
number of files and directories that can be monitored.
FIX:¶
1. To increase inotify limits, update the following kernel parameters.
echo "fs.inotify.max_user_watches=2099999999" | sudo tee -a /etc/sysctl.c
echo "fs.inotify.max_user_instances=2099999999" | sudo tee -a /etc/sysctl
echo "fs.inotify.max_queued_events=2099999999" | sudo tee -a /etc/sysctl.
2. After updating the configuration, reload the kernel parameters with the following
command.
sudo sysctl -p
These changes will ensure that your system has higher inotify limits to accommodate
your workload.
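To confirm the new limits are active, read the same keys back:
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances fs.inotify.max_queued_events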