Skip to content

Add detection of buggy DEP cursors#1016

Merged
korylprince merged 1 commit intomicromdm:mainfrom
korylprince:dep-cursor-fix
Dec 5, 2024
Merged

Add detection of buggy DEP cursors#1016
korylprince merged 1 commit intomicromdm:mainfrom
korylprince:dep-cursor-fix

Conversation

@korylprince
Copy link
Contributor

@korylprince korylprince commented Jul 21, 2024

Recently on the MacAdmins Slack @trekkim described an issue where the DEP syncer was syncing several times a second.

On closer inspection, Apple's response to the fetch devices endpoint used the same cursor sent in the request, and set the more_to_follow field to true. This caused MicroMDM to make the same request over and over again in a tight loop.

Ultimately the fix was to delete the "bad" cursor out of the database and start a new sync.

Ideally, Apple would fix this behavior on their side. Until then, this PR adds detection (return same cursor and more_to_follow=true) and correction (fetch with an empty cursor) for this situation.

@jessepeterson
Copy link
Member

I would be careful with this. Resetting the cursor and re-doing a fetch will cause all devices to re-sync. Would it be better to just go into the normal polling loop when this condition is detected?

@korylprince
Copy link
Contributor Author

Yeah, I'm not 100% sure this is the way to go either, but it does follow the existing behavior when the cursor is explicitly invalid (which one could argue is the case here as well).

By "normal polling loop" do you mean use the fetch request instead of the sync request? Presumably, it's going to have the same issue with the cursor (either returns the same one, or we delete it and sync all the devices again).

An alternative might be to simply detect the issue (the cursor being the same), and exit the loop and log a message letting the user know what's going on. Then provide a way for the user to manually delete the cursor if they want - either provide docs for removing it from the db, or, even better, an API endpoint to do it.

@jessepeterson
Copy link
Member

By "normal polling loop" do you mean use the fetch request instead of the sync request?

No I mean try and avoid doing a fetch. Keep the cursor and the sync mode - basically just pretend like the "more" condition isn't true and don't continue out of the loop the immediately — just wait another 30m.

@korylprince
Copy link
Contributor Author

Ah, yeah that would be much less aggressive. I do wonder if Apple would eventually return the right thing, though, or if the sync would get stuck (possibly forever?) at that point. Maybe we should also log a message if the cursor is the same and more_to_follow is true so at least the user would realize that the sync isn't actually working?

When I get a free moment, I'll update this PR to do that instead.

@korylprince korylprince changed the title Add detection and correction of buggy DEP cursors Add detection of buggy DEP cursors Aug 1, 2024
@korylprince
Copy link
Contributor Author

@jessepeterson,

I pushed a new change to just exit the sync early and log an error message.

@jessepeterson
Copy link
Member

jessepeterson commented Dec 5, 2024

Thinking about this more: I wonder if we're getting a cursor exhausted message for this and that's a clue that we should be ignoring this cursor. 🤔

@korylprince korylprince merged commit c6ae057 into micromdm:main Dec 5, 2024
@korylprince
Copy link
Contributor Author

In this particular case, the API responded 200 OK, more_to_follow=true, and with the same cursor used in the request. (If it had responded with any other HTTP status, the existing code would have errored out.)

Based on Apple's docs, that's incorrect behavior from Apple's side for sure.

@jessepeterson
Copy link
Member

jessepeterson commented Dec 6, 2024

In this particular case, the API responded 200 OK, more_to_follow=true, and with the same cursor used in the request. (If it had responded with any other HTTP status, the existing code would have errored out.)

Well, the logs in the Slack thread are inconclusive because I want to see the preceding Fetch runs but I'm curious if it got a cursor exhausted on the fetch run (up in line 289 in this PR) and then every next subsequent sync was using (or not handling) an exhausted cursor? Maybe? Just sorta thinking out loud. I guess the logic works either way; but if that's expected behavior maybe the logging could be omitted? Dunno, I'd have to see it in action I guess.

I also ask because NanoDEP's sync logic is slightly different:

https://github.com/micromdm/nanodep/blob/27ffba3fe19e64a47eb99885d690c4c01db7c07b/sync/syncer.go#L211-L220

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants