
os/bluestore: log txc details in slow op notification on committed_kv #59481

Merged
SrinivasaBharath merged 3 commits into ceph:main from ifed01:wip-ifed-more-info-in-slow-op-log on Nov 4, 2024

@ifed01
Contributor

@ifed01 ifed01 commented Aug 28, 2024

This might be helpful to troubleshoot issues with slow ops caused by bulky client transactions.

Related-to: https://tracker.ceph.com/issues/67339

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@ifed01 ifed01 requested a review from a team as a code owner August 28, 2024 13:08
@ifed01 ifed01 added the feature label Aug 28, 2024
@ifed01 ifed01 force-pushed the wip-ifed-more-info-in-slow-op-log branch from e1f2fa0 to d6e2f47 Compare August 28, 2024 16:30
@ifed01 ifed01 force-pushed the wip-ifed-more-info-in-slow-op-log branch from d6e2f47 to 1e20b50 Compare August 28, 2024 18:58
@github-actions github-actions bot added the tests label Aug 28, 2024
@ifed01
Contributor Author

ifed01 commented Aug 29, 2024

jenkins test api

@batrick
Member

batrick commented Aug 29, 2024

Adding this to my future runs. Don't wait for me to merge.

@batrick
Member

batrick commented Aug 29, 2024

jenkins test make check arm64

@batrick
Member

batrick commented Sep 16, 2024

@ifed01 I think this is fixing a tracker ticket but it's missing in the commit message. Can you please add it?

@batrick
Member

batrick commented Sep 16, 2024

@ifed01
Contributor Author

ifed01 commented Sep 17, 2024

> @ifed01 I think this is fixing a tracker ticket but it's missing in the commit message. Can you please add it?

@batrick - do you mean it fixes https://tracker.ceph.com/issues/67339?
It can't do that: this patch only extends the log output to help troubleshoot the issue. The issue itself seems to be intermittent...

@batrick
Member

batrick commented Sep 17, 2024

> > @ifed01 I think this is fixing a tracker ticket but it's missing in the commit message. Can you please add it?
>
> @batrick - do you mean it fixes https://tracker.ceph.com/issues/67339? It can't do that: this patch only extends the log output to help troubleshoot the issue. The issue itself seems to be intermittent...

Ah, my mistake. Maybe add Test-for: https://tracker.ceph.com/issues/67339?

@ifed01 ifed01 force-pushed the wip-ifed-more-info-in-slow-op-log branch from 1e20b50 to f061054 Compare September 17, 2024 14:30
@ifed01
Contributor Author

ifed01 commented Sep 17, 2024

> > > @ifed01 I think this is fixing a tracker ticket but it's missing in the commit message. Can you please add it?
>
> > @batrick - do you mean it fixes https://tracker.ceph.com/issues/67339? It can't do that: this patch only extends the log output to help troubleshoot the issue. The issue itself seems to be intermittent...
>
> Ah, my mistake. Maybe add Test-for: https://tracker.ceph.com/issues/67339?

Related-to clause has been added.

@ifed01
Contributor Author

ifed01 commented Sep 18, 2024

jenkins test make check

@ifed01
Contributor Author

ifed01 commented Sep 18, 2024

jenkins test api

@batrick
Member

batrick commented Sep 20, 2024

kv_committed.

This might be helpful to troubleshoot issues with slow ops caused by
bulky client transactions.

Related-to:  https://tracker.ceph.com/issues/67339
Signed-off-by: Igor Fedotov <[email protected]>
@batrick
Member

batrick commented Sep 25, 2024

@ifed01 ifed01 force-pushed the wip-ifed-more-info-in-slow-op-log branch from f061054 to 719fd98 Compare September 30, 2024 10:46
@ifed01
Contributor Author

ifed01 commented Sep 30, 2024

> This PR is under test in https://tracker.ceph.com/issues/68170.

> @ifed01
> https://pulpito.ceph.com/pdonnell-2024-09-20_23:48:22-fs-wip-pdonnell-testing-20240920.212106-debug-distro-default-smithi/7913346/
> and a few others to look at with this PR.

@batrick - nothing looks bad in the content of any single transaction for the above run.

I've just added another commit to track the maximum number of pending transactions (and their cost). A long list of pending transactions could also cause slow-op indications.
Please try to reproduce the issue once more with this new patch, if possible.
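
For readers following along, here is a minimal Python sketch of the idea of tracking a high-water mark for pending transactions and their cost. It is not the BlueStore implementation (which is C++); the class `PendingTxnTracker` and its method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PendingTxnTracker:
    """Hypothetical high-water-mark tracker; not BlueStore's actual code."""

    count: int = 0      # transactions currently pending
    cost: int = 0       # summed cost of the pending transactions
    max_count: int = 0  # largest pending count seen since the last report
    max_cost: int = 0   # largest pending cost seen since the last report

    def submit(self, txn_cost: int) -> None:
        # A transaction joins the pending list; update the peaks.
        self.count += 1
        self.cost += txn_cost
        self.max_count = max(self.max_count, self.count)
        self.max_cost = max(self.max_cost, self.cost)

    def commit(self, txn_cost: int) -> None:
        # A transaction leaves the pending list once it is committed.
        self.count -= 1
        self.cost -= txn_cost

    def report_and_reset(self) -> Tuple[int, int]:
        # Return the peaks, then restart tracking from the current backlog.
        peaks = (self.max_count, self.max_cost)
        self.max_count, self.max_cost = self.count, self.cost
        return peaks
```

Reporting the peak rather than the instantaneous value matters here: a backlog may have drained by the time a slow op is noticed, so only a recorded maximum shows that the queue was long.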

@batrick
Member

batrick commented Sep 30, 2024

> > This PR is under test in https://tracker.ceph.com/issues/68170.
>
> > @ifed01
> > https://pulpito.ceph.com/pdonnell-2024-09-20_23:48:22-fs-wip-pdonnell-testing-20240920.212106-debug-distro-default-smithi/7913346/
> > and a few others to look at with this PR.
>
> @batrick - nothing looks bad in the content of any single transaction for the above run.
>
> I've just added another commit to track the maximum number of pending transactions (and their cost). A long list of pending transactions could also cause slow-op indications. Please try to reproduce the issue once more with this new patch, if possible.

I'll keep it in my next batch but not sure when that will be.

@vshankar @rishabh-d-dave @mchangir please include this debugging PR in your future runs.

@batrick
Member

batrick commented Oct 4, 2024

@batrick
Member

batrick commented Oct 7, 2024

@ifed01 note we're ignoring BLUESTORE_SLOW_OP_ALERT now via #60011. I will continue including your debugging PR in my batches until you tell me it's no longer necessary.

Anyway, here are some jobs you can look at:

 /a/pdonnell-2024-10-04_18:42:30-fs-wip-pdonnell-testing-20241004.144202-debug-distro-default-smithi$ find -name \*ceph.log\* -exec zgrep BLUESTORE_SLOW {} +
./7933201/remote/smithi192/log/9df3a942-8291-11ef-bafb-efdc52797490/ceph.log.gz:2024-10-04T21:03:48.991775+0000 mon.a (mon.0) 1187 : cluster [WRN] Health check failed: 2 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
./7933201/remote/smithi192/log/9df3a942-8291-11ef-bafb-efdc52797490/ceph.log.gz:2024-10-04T21:10:00.000638+0000 mon.a (mon.0) 1239 : cluster [WRN] [WRN] BLUESTORE_SLOW_OP_ALERT: 2 OSD(s) experiencing slow operations in BlueStore
./7933201/remote/smithi192/log/9df3a942-8291-11ef-bafb-efdc52797490/ceph.log.gz:2024-10-04T21:12:37.372393+0000 mon.a (mon.0) 1259 : cluster [WRN] Health check update: 3 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
./7933201/remote/smithi017/log/9df3a942-8291-11ef-bafb-efdc52797490/ceph.log.gz:2024-10-04T21:03:48.991775+0000 mon.a (mon.0) 1187 : cluster [WRN] Health check failed: 2 OSD(s) experiencing slow operations in BlueStore (BLUESTORE_SLOW_OP_ALERT)
...
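
The same search can be scripted when the alert counts need to be collected per job. The following is a rough Python sketch, assuming the cluster log line layout shown in the excerpt above (timestamp first, then the health message); the function names `parse_alert` and `scan_logs` are made up for this example and are not part of any Ceph tooling.

```python
import gzip
import re
from pathlib import Path

# Matches both count-bearing message shapes seen in the excerpt above.
ALERT_RE = re.compile(r"(\d+) OSD\(s\) experiencing slow operations in BlueStore")


def parse_alert(line: str):
    """Return (timestamp, osd_count) for a BLUESTORE_SLOW_OP_ALERT line, else None."""
    if "BLUESTORE_SLOW_OP_ALERT" not in line:
        return None
    match = ALERT_RE.search(line)
    if match is None:
        return None
    # Assumption: the timestamp is the first space-separated field of the line.
    timestamp = line.split(" ", 1)[0]
    return timestamp, int(match.group(1))


def scan_logs(root: str):
    """Yield (path, timestamp, osd_count) from every *ceph.log* file under root."""
    for path in sorted(Path(root).rglob("*ceph.log*")):
        # Rotated logs are gzip-compressed; plain logs are read directly.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", errors="replace") as handle:
            for line in handle:
                parsed = parse_alert(line)
                if parsed:
                    yield (str(path), *parsed)
```

Calling `list(scan_logs("."))` from the run directory would return one tuple per matching log line, which makes it easy to see how the affected OSD count changes over time in each job.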

@Naveenaidu
Contributor

@batrick
Member

batrick commented Oct 24, 2024

@ifed01 do you want to merge this into main or keep it only for debugging wip- branches?

@batrick
Member

batrick commented Oct 24, 2024

jenkins test api

@batrick
Member

batrick commented Oct 31, 2024

@SrinivasaBharath SrinivasaBharath merged commit cfd73d5 into ceph:main Nov 4, 2024