[Bug]: qos/test_qos_sai.py::TestQosSai::testQosSaiPgSharedWatermark fails with multi_asic and multi_dut variants #16167

@arista-nwolfe

Issue Description

Failure seen:

dst_port_id: 47, src_port_id: 34 src_port_vlan: None
actual dst_port_id: 47
Initial watermark:[112, 0, 0, 0, 256, 0, 0, 0]
Received packets: 0
Init pkts num sent: 0, min: 0, actual watermark value to start: 0
Filled PG min
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
|               | Pfc3TxPkt | InDiscard | InDropPkt | OutDiscard | OutDropPkt | OutUcPkt | InUcPkt | InNonUcPkt | OutNonUcPkt | OutQlen | Ing Pg3 Pkt | Ing Pg3 Share Wm |
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
| base src port |  3159299  |  2821770  |     0     |     0      |     0      |    4     | 5839385 |    2442    |     1248    |    0    |    837638   |        0         |
|      src port |  3159299  |  2821770  |     0     |     0      |     0      |    4     | 5839385 |    2443    |     1249    |    0    |    837638   |        0         |
| base dst port |     0     |     3     |     0     |     0      |     0      |  422261  |   5177  |    1224    |      88     |    0    |      0      |        0         |
|      dst port |     0     |     3     |     0     |     0      |     0      |  422261  |   5177  |    1224    |      88     |    0    |      0      |        0         |
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
pkts num to send: 41, total pkts: 41, pg shared: 415271
Compensate 2176538 packets to port 34, and retry 1 times
Received packets: 418930
To fill PG share pool, send 41 pkt
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
|               | Pfc3TxPkt | InDiscard | InDropPkt | OutDiscard | OutDropPkt | OutUcPkt | InUcPkt | InNonUcPkt | OutNonUcPkt | OutQlen | Ing Pg3 Pkt | Ing Pg3 Share Wm |
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
| base src port |  3159299  |  2821770  |     0     |     0      |     0      |    4     | 5839385 |    2442    |     1248    |    0    |    837638   |        0         |
|      src port |  5555195  |  4576881  |     0     |     0      |     0      |    4     | 8013426 |    2493    |     1274    |    0    |   1256568   |     46510464     |
| base dst port |     0     |     3     |     0     |     0      |     0      |  422261  |   5177  |    1224    |      88     |    0    |      0      |        0         |
|      dst port |     0     |     3     |     0     |     0      |     0      |  422408  |   5227  |    1250    |      88     |    0    |      0      |        0         |
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
lower bound: 167936, actual value: 46510464, upper bound (+40): 9072
> /root/saitests/py3/sai_qos_tests.py(4624)runTest()
======================================================================
FAIL: sai_qos_tests.PGSharedWatermarkTest
----------------------------------------------------------------------
Traceback (most recent call last):
  File "saitests/py3/sai_qos_tests.py", line 4623, in runTest
    * (packet_length + internal_hdr_size)))
AssertionError

----------------------------------------------------------------------
Ran 1 test in 934.434s

The issue appears to occur in dynamically_compensate_leakout:
Compensate 2176538 packets to port 34, and retry 1 times
This sends far too many packets (2176538).

This function compares the TX_OK value before and after sending the 41 packets.
Here is where it stores the counts before the packets are sent:
https://github.com/sonic-net/sonic-mgmt/blob/202405/tests/saitests/py3/sai_qos_tests.py#L4464

             xmit_counters_history, _ = sai_thrift_read_port_counters(
                 self.dst_client, asic_type, port_list['dst'][dst_port_id])

And within dynamically_compensate_leakout here is where they are read:
https://github.com/sonic-net/sonic-mgmt/blob/202405/tests/saitests/py3/sai_qos_tests.py#L454

    curr, _ = counter_checker(thrift_client, asic_type, check_port)
    leakout_num = curr[check_field] - prev[check_field]

The problem is that the call to dynamically_compensate_leakout passes self.src_client as the thrift_client argument, even though the port it checks belongs to self.dst_client:
https://github.com/sonic-net/sonic-mgmt/blob/202405/tests/saitests/py3/sai_qos_tests.py#L4551

                    dynamically_compensate_leakout(self.src_client, asic_type, sai_thrift_read_port_counters,
                                                   port_list['dst'][dst_port_id], TRANSMITTED_PKTS,
                                                   xmit_counters_history, self, src_port_id, pkt, 40)

In this failure case the SAI object ID of the dst port is also a valid port ID on the src asic (where it is port 32), so the same value resolves to a port on both asics:

(Pdb) port_list['src'][32]
4294967297
(Pdb) port_list['dst'][dst_port_id]
4294967297

If I dump the TX_OK of port 32 on the src asic I get:

(Pdb) sai_thrift_read_port_counters(self.src_client, asic_type, port_list['dst'][dst_port_id])[0][TRANSMITTED_PKTS]
2598800

On the dst asic I get:

(Pdb) sai_thrift_read_port_counters(self.dst_client, asic_type, port_list['dst'][dst_port_id])[0][TRANSMITTED_PKTS]
422409

This is where the massive compensate packet number comes from:

(Pdb) xmit_counters_history[TRANSMITTED_PKTS]
422261
2598800 - 422261 = 2176539, which matches the compensate count in the log (2176538) to within one packet.
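
The arithmetic above can be reproduced with a small stand-alone sketch. Everything here is an illustrative stand-in, not the sonic-mgmt code: FakeThriftClient, read_port_counters, leakout and the TX_OK key are hypothetical names, and the counter values are the ones from the Pdb session. It only shows how reading the same port OID through the wrong client inflates the computed leakout.

```python
# Illustrative stand-ins only; not the sonic-mgmt implementation.
TX_OK = "tx_ok"  # hypothetical key; the real test indexes with TRANSMITTED_PKTS


class FakeThriftClient:
    """Hypothetical per-asic client holding TX_OK counters keyed by SAI port OID."""

    def __init__(self, tx_ok_by_oid):
        self.tx_ok_by_oid = tx_ok_by_oid


def read_port_counters(client, port_oid):
    """Mimic the (port_counters, queue_counters) return shape seen in the issue."""
    return {TX_OK: client.tx_ok_by_oid[port_oid]}, None


def leakout(client, port_oid, prev):
    """Difference between the current TX_OK reading and the stored baseline."""
    curr, _ = read_port_counters(client, port_oid)
    return curr[TX_OK] - prev[TX_OK]


PORT_OID = 4294967297  # same OID is valid on both asics in this failure
src_client = FakeThriftClient({PORT_OID: 2598800})  # TX_OK of port 32 on the src asic
dst_client = FakeThriftClient({PORT_OID: 422409})   # TX_OK of the real dst port
xmit_counters_history = {TX_OK: 422261}             # baseline read via dst_client

wrong_leakout = leakout(src_client, PORT_OID, xmit_counters_history)
right_leakout = leakout(dst_client, PORT_OID, xmit_counters_history)
print(wrong_leakout, right_leakout)
```

With the baseline taken from the dst asic, only the second reading compares like with like; the first subtracts a dst-asic baseline from an unrelated src-asic counter.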

Results you see

We poll the incorrect asic/dut client in dynamically_compensate_leakout: self.src_client is used to read counters for a port that belongs to self.dst_client.

Results you expected to see

dynamically_compensate_leakout should poll the asic/dut client that owns the port it is checking.
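
One possible way to make this mismatch harder to write is to carry the port OID together with the client that owns it. The following is only a sketch under that assumption: PortHandle, read_tx_ok, compensate_leakout and the counter table are hypothetical and simplified, and the real function takes more arguments and retries. In the existing code, the minimal change would presumably be to pass self.dst_client at the call site quoted above.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class PortHandle:
    """Hypothetical pairing of a port OID with the client of the asic that owns it."""
    client_name: str
    oid: int


# Hypothetical TX_OK counters per asic client, using the values from the Pdb session.
TX_OK_BY_CLIENT: Dict[str, Dict[int, int]] = {
    "src_client": {4294967297: 2598800},
    "dst_client": {4294967297: 422409},
}


def read_tx_ok(port: PortHandle) -> int:
    """Read TX_OK through the client bound to the port, never a separately passed one."""
    return TX_OK_BY_CLIENT[port.client_name][port.oid]


def compensate_leakout(port: PortHandle, prev_tx_ok: int,
                       send_packets: Callable[[int], None]) -> int:
    """Simplified stand-in: send as many packets as leaked out since the baseline."""
    leakout_num = read_tx_ok(port) - prev_tx_ok
    if leakout_num > 0:
        send_packets(leakout_num)
    return leakout_num


dst_port = PortHandle("dst_client", 4294967297)
sent = []
compensated = compensate_leakout(dst_port, 422261, sent.append)
print(compensated, sent)
```

Because the handle fixes the client at construction time, the baseline read and the later read necessarily go to the same asic.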

Is it platform specific

generic

Relevant log output

No response

Output of show version

No response

Attach files (if any)

No response
