
qa/suites/orch: whitelist warnings that are expected in test environments #55507

Merged
ljflores merged 1 commit into ceph:main from ljflores:wip-tracker-64343
Feb 10, 2024
@ljflores ljflores commented Feb 8, 2024

The idea here is to ignore warnings that we know are happening because of deliberate testing conditions.

An alternative to the solution proposed in #55498.

Fixes: https://tracker.ceph.com/issues/64343


ljflores commented Feb 8, 2024

Here's an example link to a main test run that experienced a lot of failures due to MON_DOWN warnings. In these tests, this warning is expected, so it's not ideal to have them fail. Affected tests are rados/cephadm and rados/thrash-old-clients.
https://pulpito.ceph.com/yuriw-2024-02-07_00:12:44-rados-wip-yuri2-testing-2024-02-06-1154-distro-default-smithi/

Here are some test runs I scheduled with the changes applied:
rados/cephadm: https://pulpito.ceph.com/lflores-2024-02-08_22:34:31-rados-wip-yuri2-testing-2024-02-06-1154-distro-default-smithi/
rados/thrash-old-clients: https://pulpito.ceph.com/lflores-2024-02-08_22:37:00-rados-wip-yuri2-testing-2024-02-06-1154-distro-default-smithi/

@ljflores ljflores requested a review from a team February 8, 2024 22:55
ljflores commented Feb 8, 2024

Still seeing errors like this in the "after" results:

"2024-02-08T23:08:41.185384+0000 mon.a (mon.0) 197 : cluster 3 [WRN] MON_DOWN: 1/3 mons down, quorum a,c" in cluster log

I think I need to whitelist MON_DOWN rather than (MON_DOWN).
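To illustrate why the parenthesized entry misses this line, here is a minimal sketch. It assumes ignorelist entries are applied as regular expressions searched against each cluster log line; the helper `is_ignored` is hypothetical and is not part of teuthology.

```python
import re

# The cluster log line quoted above.
LOG_LINE = (
    "2024-02-08T23:08:41.185384+0000 mon.a (mon.0) 197 : cluster 3 "
    "[WRN] MON_DOWN: 1/3 mons down, quorum a,c"
)


def is_ignored(line, patterns):
    """Hypothetical helper: True if any ignorelist regex matches the line."""
    return any(re.search(p, line) for p in patterns)


# r"\(MON_DOWN\)" only matches the code wrapped in literal parentheses,
# which this line does not contain.
print(is_ignored(LOG_LINE, [r"\(MON_DOWN\)"]))  # False

# The bare code matches wherever it appears in the line.
print(is_ignored(LOG_LINE, [r"MON_DOWN"]))  # True
```

The bare pattern is broader: under the same assumption it would also match lines that do wrap the code in parentheses.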

@athanatos
Contributor

Most likely.

ljflores commented Feb 8, 2024

New "after" results:
rados/cephadm: https://pulpito.ceph.com/lflores-2024-02-08_23:30:13-rados-wip-yuri2-testing-2024-02-06-1154-distro-default-smithi/
rados/thrash-old-clients: https://pulpito.ceph.com/lflores-2024-02-08_23:34:25-rados-wip-yuri2-testing-2024-02-06-1154-distro-default-smithi/

@markhpc markhpc self-requested a review February 8, 2024 23:40
markhpc commented Feb 8, 2024

I looked through the original test results and saw that 35 of the 54 failures were from MON_DOWN. Laura mentioned that she looked through all of them and the others seemed to be unrelated, so I won't dig through them as well. I'll try to keep an eye on this run as it progresses.

Looks like quite a few of the remaining failures are OSD_DOWN related.

lflores-2024-02-08_23:30:13-rados-wip-yuri2-testing-2024-02-06-1154-distro-default-smithi

$ find . -name "teuthology.log" -exec grep -l -H 'FAIL' {} \; | xargs grep -l -H 'MON_DOWN' | sort
./7553028/teuthology.log
./7553032/teuthology.log
./7553038/teuthology.log
./7553041/teuthology.log
./7553051/teuthology.log
./7553053/teuthology.log
./7553055/teuthology.log
$ find . -name "teuthology.log" -exec grep -l -H 'FAIL' {} \; | xargs grep -l -H 'OSD_DOWN' | sort
./7553023/teuthology.log
./7553025/teuthology.log
./7553026/teuthology.log
./7553027/teuthology.log
./7553031/teuthology.log
./7553033/teuthology.log
./7553035/teuthology.log
./7553037/teuthology.log
./7553042/teuthology.log
./7553043/teuthology.log
./7553046/teuthology.log
./7553047/teuthology.log
./7553050/teuthology.log
./7553052/teuthology.log
$ find . -name "teuthology.log" -exec grep -l -H 'FAIL' {} \; | xargs grep -l -H 'PG_AVAIL' | sort
./7553032/teuthology.log
./7553050/teuthology.log
./7553053/teuthology.log

lflores-2024-02-08_23:34:25-rados-wip-yuri2-testing-2024-02-06-1154-distro-default-smithi

$ find . -name "teuthology.log" -exec grep -l -H 'FAIL' {} \; | xargs grep -l -H 'MON_DOWN' | sort
./7553059/teuthology.log
./7553060/teuthology.log
./7553064/teuthology.log
./7553065/teuthology.log
$ find . -name "teuthology.log" -exec grep -l -H 'FAIL' {} \; | xargs grep -l -H 'OSD_DOWN' | sort
./7553059/teuthology.log
./7553060/teuthology.log
./7553065/teuthology.log
$ find . -name "teuthology.log" -exec grep -l -H 'FAIL' {} \; | xargs grep -l -H 'PG_AVAIL' | sort
./7553059/teuthology.log
./7553060/teuthology.log
./7553064/teuthology.log
./7553065/teuthology.log
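
The find/grep pipelines above can also be run in one pass. The following is a rough Python sketch of the same check, assuming the run directory contains one subdirectory per job, each holding a `teuthology.log`; the function name `failed_jobs_by_warning` is made up for illustration.

```python
from pathlib import Path

# Warning codes searched for in the commands above.
WARNINGS = ("MON_DOWN", "OSD_DOWN", "PG_AVAIL")


def failed_jobs_by_warning(root, warnings=WARNINGS):
    """Map each warning code to the job directories whose teuthology.log
    contains both 'FAIL' and that code (plain substring checks, like grep -l)."""
    hits = {w: [] for w in warnings}
    for log in sorted(Path(root).rglob("teuthology.log")):
        text = log.read_text(errors="replace")
        if "FAIL" not in text:
            continue  # same filter as the first grep in the pipeline
        for w in warnings:
            if w in text:
                hits[w].append(log.parent.name)
    return hits


# Example usage from inside a downloaded run directory:
#   for code, jobs in failed_jobs_by_warning(".").items():
#       print(code, len(jobs), jobs)
```

As with the shell version, a job that mentions a warning code anywhere in its log is counted, so the lists show which failed jobs mention a code rather than which code caused the failure.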

ljflores commented Feb 9, 2024

All instances of MON_DOWN are silenced in rados/cephadm; I see some instances of OSDMAP_FLAGS: noscrub flag(s) set and OSD_DOWN in rados/thrash-old-clients, so I will silence those as well.

ljflores commented Feb 9, 2024

markhpc commented Feb 9, 2024

@ljflores Nice improvements! I'm out for an hour or two, but I'll try to check in after that. If there's anything you need to help keep things moving, please reach out.

markhpc commented Feb 9, 2024

In the last run, we ignorelist both MON_DOWN and OSD_DOWN in 7554107, 7554108, and 7554109. We are also ignoring \(PG_, but I wonder if we also need to ignore PG_AVAILABILITY, similar to the others in the thrashosds-health ignorelist.

ljflores commented Feb 9, 2024

@markhpc yes, looks like we need PG_DEGRADED, PG_AVAILABILITY, and POOL_APP_NOT_ENABLED based on the latest results. Making those changes.
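
One way to spot codes that still slip past an ignorelist is sketched below. It assumes warnings appear in the cluster log as `[WRN] CODE: ...` and that ignorelist entries are regexes searched against each line. The `IGNORELIST` contents, the `uncovered_codes` helper, and the two PG sample lines are illustrative only and are not copied from the actual suite files or logs.

```python
import re

# Entries mentioned in this thread; the real ones live in the suite YAML files.
IGNORELIST = [r"MON_DOWN", r"OSD_DOWN", r"\(PG_"]

# Assumed log shape: "[WRN] SOME_CODE: message".
WRN_CODE = re.compile(r"\[WRN\] ([A-Z][A-Z_]+):")


def uncovered_codes(lines, ignorelist):
    """Return the sorted warning codes on lines that no ignorelist regex matches."""
    missing = set()
    for line in lines:
        m = WRN_CODE.search(line)
        if m and not any(re.search(p, line) for p in ignorelist):
            missing.add(m.group(1))
    return sorted(missing)


SAMPLE = [
    # Shortened form of the line quoted earlier in this conversation.
    "mon.a (mon.0) 197 : cluster 3 [WRN] MON_DOWN: 1/3 mons down, quorum a,c",
    # Hypothetical lines, made up for illustration.
    "mon.a (mon.0) 201 : cluster 3 [WRN] PG_AVAILABILITY: Reduced data availability",
    "mon.a (mon.0) 202 : cluster 3 [WRN] PG_DEGRADED: Degraded data redundancy",
]

print(uncovered_codes(SAMPLE, IGNORELIST))
# ['PG_AVAILABILITY', 'PG_DEGRADED']

print(uncovered_codes(SAMPLE, IGNORELIST + [r"PG_AVAILABILITY", r"PG_DEGRADED"]))
# []
```

In this sketch `\(PG_` does not cover the two PG lines because the code is not preceded by an opening parenthesis, which is the same situation seen with `\(MON_DOWN\)` earlier.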

ljflores commented Feb 9, 2024

@markhpc latest runs passed! If all looks good to you, I think we are good to merge. I suspect a few more warnings might pop up in subsequent runs, as these failures are nondeterministic, but this PR takes care of the majority. We can always raise a part 2 to knock down any more that might arise.

@markhpc markhpc left a comment

This looks good to me! Just eyeballing it, I don't think any of the ignorelists here are clearly incorrect, but it's been a while since I've looked over the nuance of the different suites. Either way, I'd take overly broad ignore-lists over the old behavior any day.

Thank you so much @ljflores for putting in the effort to make this happen! 💪

@ljflores ljflores merged commit 228dab8 into ceph:main Feb 10, 2024
@ljflores ljflores deleted the wip-tracker-64343 branch February 12, 2024 15:44