Skip to content

client: fix memory leak in Client::CRF_iofinish::complete#60507

Merged
vshankar merged 1 commit intoceph:mainfrom
synarete:ss-cephfs-asyncio-use-after-free-bugfix
Jun 18, 2025
Merged

client: fix memory leak in Client::CRF_iofinish::complete#60507
vshankar merged 1 commit intoceph:mainfrom
synarete:ss-cephfs-asyncio-use-after-free-bugfix

Conversation

@synarete
Copy link
Contributor

@synarete synarete commented Oct 27, 2024

Commit 1210ddf ("Client: Add non-blocking helper classes") introduced Client::C_Read_Finisher Context object for async READ operations, but it has a read-after-free bug which may cause memory leak when calling libcephf's non-blocking ceph_ll_nonblocking_readv_writev API with async READ:

ceph_ll_nonblocking_readv_writev (READ)
Client::ll_preadv_pwritev
...
Client::_read_async
Context::complete
Client::CRF_iofinish::complete
Client::CRF_iofinish::finish
CRF->finish_io()
Client::C_Read_Finisher::finish_io
...
delete this; // frees CRF_iofinish->CRF
if (CRF->iofinished) // use-after-free of CRF
delete this; // may not get here

A possible memory leak depends on timing and race with other thread allocation which alters the memory address of CRF->iofinished to false, thus skipping the last delete operation.

The check of if (CRF->iofinished) is unnecessary: it is always set to true upon calling CRF->finish_io(). Removed.

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions github-actions bot added the cephfs Ceph File System label Oct 27, 2024
@synarete synarete added wip-ssharon-test2 synarete testing and removed cephfs Ceph File System labels Oct 27, 2024
@synarete synarete marked this pull request as draft October 27, 2024 10:57
@synarete synarete force-pushed the ss-cephfs-asyncio-use-after-free-bugfix branch from d98aa32 to b479023 Compare October 28, 2024 08:19
@github-actions github-actions bot added the cephfs Ceph File System label Oct 28, 2024
@synarete synarete marked this pull request as ready for review October 28, 2024 08:19
@synarete synarete requested review from ffilz and vshankar October 28, 2024 08:24
@vshankar vshankar requested a review from kotreshhr October 28, 2024 09:25
@kotreshhr kotreshhr self-assigned this Oct 28, 2024
@ffilz ffilz requested a review from dparmar18 October 28, 2024 14:28
@synarete synarete force-pushed the ss-cephfs-asyncio-use-after-free-bugfix branch from b479023 to a655d80 Compare October 29, 2024 10:13
@synarete synarete requested a review from batrick October 29, 2024 10:15
@dparmar18 dparmar18 self-assigned this Oct 29, 2024
@batrick batrick self-assigned this Oct 30, 2024
@dparmar18
Copy link
Contributor

from what I see is we might also need to audit C_Read_Sync_NonBlocking which has a similar override of complete() in the same commit mentioned in description 1210ddf#diff-7a209b0f3a8f5e5b20525fb85acc702a4b1ec7d568dcf3e88ebc826fa2104e8aR1346-R1352

@synarete
Copy link
Contributor Author

from what I see is we might also need to audit C_Read_Sync_NonBlocking which has a similar override of complete() in the same commit mentioned in description 1210ddf#diff-7a209b0f3a8f5e5b20525fb85acc702a4b1ec7d568dcf3e88ebc826fa2104e8aR1346-R1352

@dparmar18 As far as I can tell from reading the code of Client::C_Read_Sync_NonBlocking::finish() and Client::C_Read_Sync_NonBlocking::complete() it does not have the same use-after-free case as fixed by this PR (the statement if (fini) delete this; uses private member which is still valid at this point). Please correct me if I am wrong here.

@dparmar18
Copy link
Contributor

from what I see is we might also need to audit C_Read_Sync_NonBlocking which has a similar override of complete() in the same commit mentioned in description 1210ddf#diff-7a209b0f3a8f5e5b20525fb85acc702a4b1ec7d568dcf3e88ebc826fa2104e8aR1346-R1352

@dparmar18 As far as I can tell from reading the code of Client::C_Read_Sync_NonBlocking::finish() and Client::C_Read_Sync_NonBlocking::complete() it does not have the same use-after-free case as fixed by this PR (the statement if (fini) delete this; uses private member which is still valid at this point). Please correct me if I am wrong here.

Yeah but since @batrick mentioned, im wondering if that part is really required since the base class should take care of it, no?

@synarete
Copy link
Contributor Author

from what I see is we might also need to audit C_Read_Sync_NonBlocking which has a similar override of complete() in the same commit mentioned in description 1210ddf#diff-7a209b0f3a8f5e5b20525fb85acc702a4b1ec7d568dcf3e88ebc826fa2104e8aR1346-R1352

@dparmar18 As far as I can tell from reading the code of Client::C_Read_Sync_NonBlocking::finish() and Client::C_Read_Sync_NonBlocking::complete() it does not have the same use-after-free case as fixed by this PR (the statement if (fini) delete this; uses private member which is still valid at this point). Please correct me if I am wrong here.

Yeah but since @batrick mentioned, im wondering if that part is really required since the base class should take care of it, no?

The code of Client::C_Read_Sync_NonBlocking::finish() has a case in which it may return without setting fini to false, thus the if (fini) in Client::C_Read_Sync_NonBlocking::complete() will not delete this. However, I do not know if it OK or (another) bug.

@dparmar18
Copy link
Contributor

from what I see is we might also need to audit C_Read_Sync_NonBlocking which has a similar override of complete() in the same commit mentioned in description 1210ddf#diff-7a209b0f3a8f5e5b20525fb85acc702a4b1ec7d568dcf3e88ebc826fa2104e8aR1346-R1352

@dparmar18 As far as I can tell from reading the code of Client::C_Read_Sync_NonBlocking::finish() and Client::C_Read_Sync_NonBlocking::complete() it does not have the same use-after-free case as fixed by this PR (the statement if (fini) delete this; uses private member which is still valid at this point). Please correct me if I am wrong here.

Yeah but since @batrick mentioned, im wondering if that part is really required since the base class should take care of it, no?

The code of Client::C_Read_Sync_NonBlocking::finish() has a case in which it may return without setting fini to false, thus the if (fini) in Client::C_Read_Sync_NonBlocking::complete() will not delete this. However, I do not know if it OK or (another) bug.

Okay, I guess it might be needed for Client::C_Read_Sync_NonBlocking::retry() to make sure the retried calls don't segfault.

@ffilz
Copy link
Contributor

ffilz commented Oct 30, 2024

from what I see is we might also need to audit C_Read_Sync_NonBlocking which has a similar override of complete() in the same commit mentioned in description 1210ddf#diff-7a209b0f3a8f5e5b20525fb85acc702a4b1ec7d568dcf3e88ebc826fa2104e8aR1346-R1352

@dparmar18 As far as I can tell from reading the code of Client::C_Read_Sync_NonBlocking::finish() and Client::C_Read_Sync_NonBlocking::complete() it does not have the same use-after-free case as fixed by this PR (the statement if (fini) delete this; uses private member which is still valid at this point). Please correct me if I am wrong here.

Yeah but since @batrick mentioned, im wondering if that part is really required since the base class should take care of it, no?

The code of Client::C_Read_Sync_NonBlocking::finish() has a case in which it may return without setting fini to false, thus the if (fini) in Client::C_Read_Sync_NonBlocking::complete() will not delete this. However, I do not know if it OK or (another) bug.

Okay, I guess it might be needed for Client::C_Read_Sync_NonBlocking::retry() to make sure the retried calls don't segfault.

Yea, I think that one really is needed, or if we can find a different way to resolve the Context lifetime.

@dparmar18
Copy link
Contributor

from what I see is we might also need to audit C_Read_Sync_NonBlocking which has a similar override of complete() in the same commit mentioned in description 1210ddf#diff-7a209b0f3a8f5e5b20525fb85acc702a4b1ec7d568dcf3e88ebc826fa2104e8aR1346-R1352

@dparmar18 As far as I can tell from reading the code of Client::C_Read_Sync_NonBlocking::finish() and Client::C_Read_Sync_NonBlocking::complete() it does not have the same use-after-free case as fixed by this PR (the statement if (fini) delete this; uses private member which is still valid at this point). Please correct me if I am wrong here.

Yeah but since @batrick mentioned, im wondering if that part is really required since the base class should take care of it, no?

The code of Client::C_Read_Sync_NonBlocking::finish() has a case in which it may return without setting fini to false, thus the if (fini) in Client::C_Read_Sync_NonBlocking::complete() will not delete this. However, I do not know if it OK or (another) bug.

Okay, I guess it might be needed for Client::C_Read_Sync_NonBlocking::retry() to make sure the retried calls don't segfault.

Yea, I think that one really is needed, or if we can find a different way to resolve the Context lifetime.

since this bug was caused due to some race where the onfinished's memory addr was altered to the inverse value, im curious if the same can happen for the retries, what if the fini gets altered to true, the context gets deleted and we have a UAF. I think we should refactor the case to some robust solution. Maybe manage the retry context in a thread like finisher thread (ceph/src/common/Finisher.h)?

@synarete synarete force-pushed the ss-cephfs-asyncio-use-after-free-bugfix branch 4 times, most recently from e5d2a37 to d26c19d Compare November 5, 2024 11:55
@synarete synarete force-pushed the ss-cephfs-asyncio-use-after-free-bugfix branch from d26c19d to 0c27fe3 Compare November 10, 2024 09:28
@vshankar
Copy link
Contributor

I'm suspecting this change to be causing failures reported in #60411 (comment)

Will update when I verify that.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jan 28, 2025
@github-actions
Copy link

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Feb 27, 2025
@dparmar18 dparmar18 reopened this Feb 27, 2025
@dparmar18 dparmar18 removed the stale label Feb 27, 2025
@vshankar
Copy link
Contributor

@dparmar18 This change need to be reviewed and testes. @dparmar18 Mind having a look?

I will get to it later today.

@dparmar18
Copy link
Contributor

@dparmar18 This change need to be reviewed and testes. @dparmar18 Mind having a look?

I will get to it later today.

ACK

@dparmar18
Copy link
Contributor

This patch looks good to me but as @ffilz mentioned in #60507 (comment) that if in case of short reads finish_io might need to return without setting iofinished to true. If that is implemented, the need to have this overridden complete() might exist.

@dparmar18
Copy link
Contributor

I feel we should have a better mechanism of managing context lifetime i.e. independent of fini (in C_Read_Sync_NonBlocking) or iofinished

@ffilz
Copy link
Contributor

ffilz commented Mar 5, 2025

I feel we should have a better mechanism of managing context lifetime i.e. independent of fini (in C_Read_Sync_NonBlocking) or iofinished

Yea, lifetime seems mucky... Not sure best way to fix it.

@github-actions
Copy link

github-actions bot commented May 4, 2025

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label May 4, 2025
@dparmar18 dparmar18 removed the stale label May 7, 2025
@dparmar18
Copy link
Contributor

jenkins test make check arm64

Copy link
Contributor

@anoopcs9 anoopcs9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@synarete Sorry to mention at this stage but can you please fix the commit message by removing the unwanted text and duplicate Signed-off-by tag?

@dparmar18
Copy link
Contributor

jenkins test make check arm64

Commit 1210ddf ("Client: Add non-blocking helper classes") introduced
Client::C_Read_Finisher Context object for async READ operations, but
it has a read-after-free bug which may cause memory leak when calling
libcephf's non-blocking ceph_ll_nonblocking_readv_writev API with async
READ:

ceph_ll_nonblocking_readv_writev (READ)
  Client::ll_preadv_pwritev
  ...
    Client::_read_async
      Context::complete
        Client::CRF_iofinish::complete
          Client::CRF_iofinish::finish
          CRF->finish_io()
            Client::C_Read_Finisher::finish_io
            ...
            delete this; // frees CRF_iofinish->CRF
          if (CRF->iofinished) // use-after-free of CRF
            delete this; // may not get here

A possible memory leak depends on timing and race with other thread
allocation which alters the memory address of CRF->iofinished to
false, thus skipping the last delete operation.

The check of `if (CRF->iofinished)` is unnecessary: it is always set to
true upon calling CRF->finish_io(). Thus, there is no need to have the
override function Client::CRF_iofinish::complete() as it now has the
same logic as Context::complete(). Removed.

Signed-off-by: Shachar Sharon <[email protected]>
@synarete synarete force-pushed the ss-cephfs-asyncio-use-after-free-bugfix branch from 5b89503 to 6dc7756 Compare June 13, 2025 11:34
@vshankar
Copy link
Contributor

Testing update: https://tracker.ceph.com/issues/71514#note-12

The jobs have picked up crimson flavour for OSD and is causing lots of failures. Not sure how did that happen. This is under resolution and we are talking to crimson team. Standby!

@vshankar vshankar merged commit a75c560 into ceph:main Jun 18, 2025
13 checks passed
@vshankar vshankar removed the wip-ssharon-test2 synarete testing label Jun 18, 2025
@vshankar
Copy link
Contributor

@synarete I will backport this fix to squid/tentacle branch with other client side fixes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cephfs Ceph File System

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants