Skip to content

sys/ztimer doc: List prerequisites for successful use of ztimer_now#17614

Merged
chrysn merged 1 commit intoRIOT-OS:masterfrom
chrysn-pull-requests:ztimer-doc-stricter
Feb 15, 2022
Merged

sys/ztimer doc: List prerequisites for successful use of ztimer_now#17614
chrysn merged 1 commit intoRIOT-OS:masterfrom
chrysn-pull-requests:ztimer-doc-stricter

Conversation

@chrysn
Copy link
Copy Markdown
Member

@chrysn chrysn commented Feb 3, 2022

Contribution description

ztimer_now() is hard to use correctly, and even though it set out to allow implementation flexibility, expectations are already creeping up based on properties ZTimer happens to have now.

This puts a damper on any expectations one might have on ztimer_now() values. It documents design goals (as I understand them), and may easily be stricter than the current implementation is -- but that is necessary to keep implementation flexibility that is already being exercised in PRs like #17607.

(It may also help motivate the creation of a high-level stopwatch mechanism as sketched in #17607, because yes that's a scary wall of text any user would have to go through).

Open questions

If a timer is set with a timeout way less than 2^32 (say, 2^31, as interrupts need to happen at least every 2^31 clock cycles anyway), it's really also OK to use now values with laxer constraints (eg. it'd be OK to compare to a now value obtained in the timer's interrupt then).

I think we can still relax all of this if it turns out that these are worth mentioning; the current text is phrased to give strictly sufficient criteria, whereas insufficient measures are only given by example.

Testing procedure

Compare built documentation to your mental model and/or the implementation of ZTimer.

Issues/PRs references

Closes: #17298

@github-actions github-actions bot added Area: sys Area: System Area: timers Area: timer subsystems labels Feb 3, 2022
@chrysn chrysn added the Area: doc Area: Documentation label Feb 3, 2022
@benpicco benpicco added the CI: skip compile test If set, CI server will run only non-compile jobs, but no compile jobs or their dependent jobs label Feb 3, 2022
@benpicco benpicco requested a review from kfessel February 3, 2022 22:49
Copy link
Copy Markdown
Contributor

@kfessel kfessel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this write up i shows how to handle ztimer_now.
Even if another mechanism is introduced that keeps the clock running the documentation needs to clarify how to do that.

@kfessel kfessel added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Feb 4, 2022
@kfessel
Copy link
Copy Markdown
Contributor

kfessel commented Feb 4, 2022

https://gist.github.com/kfessel/9beb45ab4efe2be5e4317757306d3308
may solve many ztimer now cases

those are two sets of functions one forever set
and a for (timeout) set

@chrysn
Copy link
Copy Markdown
Member Author

chrysn commented Feb 4, 2022 via email

@kfessel
Copy link
Copy Markdown
Contributor

kfessel commented Feb 4, 2022

The user owns it (and thus does not force the system to have 64Bit) and the user is also able to stop it (it is the owner) e.g. an app needs to track time does not know how long (maybe the timer wraps) but when the app is stopped its ztime_forever can also be stopped

and the second set simply handles that timeout case

maybe aquire and release is a better solution (and maybe its portable to ztimer64 (make it releasable)

@chrysn
Copy link
Copy Markdown
Member Author

chrysn commented Feb 5, 2022

Yes, when acquire/release semantics are available, ztimer64 might be adjustable to them as well, making it also check the boxes your gist does.

Please hold a bit on merging this, I'd like to make amendments from having applied these rules in practice (nothing seriously changing, just additional sufficient criteria one may want to apply, and it might sense to review them as a whole too).

Comment on lines +502 to +505
* In contrast, the clock is not guaranteed to be active if a timer is
* removed and then a second one is started (even if the thread does not
* block between these events), or when an expiring timer wakes up a thread
* that then sets the second timer.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the case only when PM is used?

@chrysn
Copy link
Copy Markdown
Member Author

chrysn commented Feb 7, 2022 via email

@kfessel
Copy link
Copy Markdown
Contributor

kfessel commented Feb 7, 2022

This rises the question will we be able to mark modules that should be PM safe or for which we can guarantee that they do the acquire and release correctly?

@kaspar030
Copy link
Copy Markdown
Contributor

This does not yet take into account clocks that might be continuously running, which for some RTC/msec might be a desirable default behavior.

@jue89
Copy link
Copy Markdown
Contributor

jue89 commented Feb 7, 2022

This does not yet take into account clocks that might be continuously running, which for some RTC/msec might be a desirable default behavior.

Yes, but in that case you have to ensure that you application is only executed on boards with ZTIMER_MSEC running on ztimer_periph_rtt. If ZTIMER_MSEC is running on ztimer_periph_timer, it may stall.

I think it's good habit that users of the ztimer API don't have to think if the ztimer clock could stall oder not. This is the main reason why I came up with ztimer_acquire() and ztimer_release() in #17607. The first one gives you the guarantee that the clock won't stall until you've released the clock again.

I shot myself into the foot several times with stalled ztimer_now(ZTIMER_USEC) - and it hurts if it takes hours to days to trace down that you've done it again ;-)

@jue89
Copy link
Copy Markdown
Contributor

jue89 commented Feb 7, 2022

This rises the question will we be able to mark modules that should be PM safe or for which we can guarantee that they do the acquire and release correctly?

Yes, we have to do some bookkeeping here and should implement necessary ztimer_acquire() (or whatever solution we come up with to fix the issue of stalled clocks).

@kfessel
Copy link
Copy Markdown
Contributor

kfessel commented Feb 7, 2022

would be a good habit to setup a timer as long as you need access to now (that was part of the reason why i wrote #17229 (it is much like now but allways needs a timer)) but at the same time it is much heavier than aquire() and release().
We could also go with some thing like external users (things that are not ztimer) of now have to proof they know of a timer that is running (which spares us the O(n) part of #17229).
like ztimer_now( CLOCK, aTimerThatIKeepRunning) (and rename ztimer_now -> _ztimer_now)

@jue89
Copy link
Copy Markdown
Contributor

jue89 commented Feb 7, 2022

would be a good habit to setup a timer as long as you need access to now (that was part of the reason why i wrote #17229 (it is much like now but allways needs a timer)) but at the same time it is much heavier than aquire() and release(). We could also go with some thing like external users (things that are not ztimer) of now have to proof they know of a timer that is running (which spares us the O(n) part of #17229). like ztimer_now( CLOCK, aTimerThatIKeepRunning) (and rename ztimer_now -> _ztimer_now)

Yeah, but in the end users will need a running timer to proof that they have blocked the clock from stalling, don't they?

If yes: In my opinion, that feels a little bit like a workaround and not like a proper solution. Those blocking timer_t will tend to have large offsets - so ztimer must traverse to the end of the linked list when the timer ist set, reset or removed. Furthermore, code using this pattern must check if the blocking ztimer_t will run long enough (however that may be defined ...) and if the timer hasn't expired in the meantime.

If we need to touch code that makes use of ztimer_now() (and I don't see how this can be avoided, TBH) we should go for a proper solution.

(Just to make things clear: I don't want to be harsh. I just want to find a good solution that fixes obvious problems in terms of good power management :-))

@kfessel
Copy link
Copy Markdown
Contributor

kfessel commented Feb 7, 2022

  • with an explicit timer management (acquire() and release()) there might be timer_leak and timer_use_after_release (or use without acquire (many current ones) )
  • with with a private now() and access functions that need the programmer to proof that he has told the clock to be active we would remove the timer_use_after_release (or without acquire)
  • with an timeout on each acquire (like setting up a timer) we would also remove timer leak removing now and providing sys/ztimer: add ztimer_until, tell the time until a timer triggers #17229 would force that

Furthermore, code using this pattern must check if the blocking ztimer_t will run long enough (however that may be defined ...) and if the timer hasn't expired in the meantime.

you would not check if your timer runs long enough but set it how long you need it:
- waiting for input or timeout -> set it to timeout (you might not even need to remove the timer),
- want to count events in 2 minutes set it to two minutes (as long as timer is_set() keep counting),
- want to meassure time forever -> set it to as long as possible and if it triggers set it again and count the overflow. <- this is a proper time measurement ( not aquire() and hope there is no overflow ), But do not do this in ztimer (ztimer_now64 was not good) just do it in your application/driver/module/package what ever (create another module that just does this if you have many of these).
if you dont need the trigger set it to some donothing_fn().

  • but setting up a timer is also very costly (uint32 offset, Tnext, T cb T* args) 128byte and O(n) to add or remove (checking is O(1))
  • setting up something to track maybe cheaper (one would at least need an id)
  • (aquire and release without tracking and timeout are much cheaper)

and there is another blocker for that, you have to setup a timer before you can ask for now it is very uncommon -> much adaptation work in external packages

even though: i think doing it by setting up a timer is a more proper solution (avoids timer-leak and therefor may safe power) than that user-counter workaround.
if something in your build acquires a timer without releasing it and thus your device doesn't sleep you are down to user tracking like valgrind but for timers. i hope this will not happen to often.

I think your solution is simpler to use and more portable (especially it might also be usable for ztimer64 with all its absolute interfaces )

@jue89
Copy link
Copy Markdown
Contributor

jue89 commented Feb 7, 2022

Thank you for your detailed reply! (And sry @chrysn for abusing your PR for bike shedding about a related problem!)

  • with an explicit timer management (acquire() and release()) there might be timer_leak and timer_use_after_release (or use without acquire (many current ones) )

Good point - that may happen. I don't know if a higher level API (like this one - for example) could make those problems at least more obvious. But it will be very hard to avoid problems like this with C. C doesn't bring all these nice guarantees Rust is offering ...

This mentioned high level API could of course make use of your approach with timer_t used as blocker or the acquire/release pattern.

  • waiting for input or timeout -> set it to timeout (you might not even need to remove the timer),
  • want to count events in 2 minutes set it to two minutes (as long as timer is_set() keep counting),

In those cases, I would stick to ztimer_set() - no manual ztimer_acquire() / ztimer_release() required.

  • want to meassure time forever -> set it to as long as possible and if it triggers set it again and count the overflow. <- this is a proper time measurement ( not acquire() and hope there is no overflow ), But do not do this in ztimer (ztimer_now64 was not good) just do it in your application/driver/module/package what ever (create another module that just does this if you have many of these).

Time measurement is the only use-case I considered for this acquire/release pattern. That may could become a separate helper module which can handle acquire/release - or setting long timers - in the background. We shouldn't expose this implementation detail.

even though: i think doing it by setting up a timer is a more proper solution (avoids timer-leak and therefor may safe power) than that user-counter workaround.

What bugs me at this solution: When I just want to measure a time difference between two events, the API asks me what the longest expected time period will be. That could be a tricky question. Furthermore, a ztimer_t on the stack that isn't removed from the clock's list before returning from the function scope will lead to very unexpected behavior.

if something in your build acquires a timer without releasing it and thus your device doesn't sleep you are down to user tracking like valgrind but for timers. i hope this will not happen to often.

Yes right. That are the costs :/
(I'd ask gdb who's calling ztimer_acquire() and ztimer_release() - but yes debugging will take some time.)

@kfessel
Copy link
Copy Markdown
Contributor

kfessel commented Feb 8, 2022

Time measurement is the only use-case I considered for this acquire/release pattern. That may could become a separate helper module which can handle acquire/release - or setting long timers - in the background. We shouldn't expose this implementation detail.

If you do not know how long but you know it will be within 32 bits (~1h10min for usec) -> if not you need to mange overflows -> you need a timer anyway.

https://gist.github.com/kfessel/9beb45ab4efe2be5e4317757306d3308

@chrysn
Copy link
Copy Markdown
Member Author

chrysn commented Feb 9, 2022

Ad "This does not yet take into account clocks that might be continuously running": I don't think that should be on the list, or if only with caveats of its own. Most users of ztimer are not the top-level application and thus don't influence whether that is the case or not. And even code residing in the top-level application may be part of a structured application -- there, a part implicitly asserting that the clocks are running would depend on a completely unrelated part.

It would certainly be cool if code that makes that assertion can make that in a minimal-overhead fashion (like, a ztimer_acquire/_leak(ZTIMER_MSEC) being a no-op if the MSEC timer is based on a clock that won't shut off anyway for some reasons), but that's more for the API that such modules would already need anyway to not silently conflict with power management.

Ad timers leaking: There's already nothing stopping anyone from creating a ztimer_periodic they forget about. Acquire/release semantics are IMO familiar enough to users that they'd manage (with the error likelihood inherent to these things in C). And finally, leaking an acquisition might be exactly what a module wants: If a module needs a never-stopping possibly-wrapping 32bit seconds timer, then a ztimer_acquire(ZTIMER_SEC); in its init is just the right thing to do. (With ztimer_periodic on maximum time being the alternative if no acquisition counters happen).

Ad ztimer private: I don't know whether that's a good idea, but being an embedded OS I think we can well afford some not-trivial-to-use functions, if they are well documented.

What bugs me at this solution: When I just want to measure a time difference between two events, the API asks me what the longest expected time period will be. That could be a tricky question. Furthermore, a ztimer_t on the stack that isn't removed from the clock's list before returning from the function scope will lead to very unexpected behavior

It is a tricky question, but ultimately one that users better consider rather than not. Once considered, the user will know whether overflows are relevant or not. (Also, in PM'd systems, it is good to have it answered: If we're counting when an event expected within milliseconds arrives, but it never does, for how long are we prepared to keep spending energy?).

On the overall "acquire vs. set a long timer", I have no strong preferences. Acquire/release has some nice properties in terms of optimization (being O(1) and the possibility to make it a no-op on systems where the clock never sleeps), but the users that can legitimately use it without odd errors will be very few (for they need external assertions on their differences never wrapping, and if they do work with wrapping they could just as well go for ztimer64 or @kfessel's variation of it that is user-acquired). Anyhow, I doubt that the precise method of acquisition is relevant to this PR (even though it does give valuable input).


On the overall status, I've just added two more points in a follow-up commit, asking for another round of review:

  • Emphasize that all criteria are sufficient, but not necessary. Depending on board, user or application there may be additional ways to get the desired properties (of a timer being active or a difference not having wrapped), which are not described here.
  • Highlight one particularly useful criterion on wrapping: If the set timer was shorter than the wrap-around time by the maximum time the system can spend with pending interrupts (for which there is no agreed-on number, but as a highly conservative estimate, modules can assume interrupts to be serviced at least once per minute), the measurements being between start and triggering of a timer is sufficient.

@github-actions github-actions bot removed the Area: doc Area: Documentation label Feb 9, 2022
@chrysn chrysn added the Area: doc Area: Documentation label Feb 9, 2022
*
* If the clock was active, then the difference between the second value and
* the first is then the elapsed time in the clock's unit, **modulo the
* clock's wrap-around time**.
Copy link
Copy Markdown
Contributor

@kaspar030 kaspar030 Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is always 2**32. edit

Copy link
Copy Markdown
Contributor

@kfessel kfessel Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you do not jump the 0 or ff..ff then it should be mod 2^32 I think

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not using the precise value due to the presence of the ZTIMER_NOW64 module; if you think that can be disregarded here I'm happy to add that in.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought (and re-reading the text), I'm putting 2^32 in there with a note on ZTIMER_NOW64 (so that in particular if the latter gets deprecated / removed, the text automatically becomes better usable).

@kaspar030
Copy link
Copy Markdown
Contributor

LGTM.

I really think we should go with "ZTIMER_MSEC is always running". But I guess that's another topic.

Highlight one particularly useful criterion on wrapping: If the set timer was shorter than the wrap-around time by the maximum
time the system can spend with pending interrupts (for which there is no agreed-on number, but as a highly conservative
estimate, modules can assume interrupts to be serviced at least once per minute), the measurements being between start
and triggering of a timer is sufficient.

I think this refers to the dance with checking a mutex within interrupt? I don't get that yet.
Using a mutex is another way to check if a timer has expired, but one could also use ztimer_remove() in ISR, no? Where does the "maximum time the system can spend with pending interrupts" come into play?
everything can be late the maximum time an interrupt might take, even in a highest priority thread? Both when using ztimer_remove() or the mutex_trylock(), in ISR, the ztimer ISR and the testing ISR might be late. Or is this for asserting that even if they'd be late, the timer clock can be assumed to have been active during that time and thus the now value is valid (in the case the timer didn't expire -> mutex was unlocked)? Ah now that I write it, that makes sense. :)

@kaspar030
Copy link
Copy Markdown
Contributor

BTW, the extreme cases for "bad" timer configurations are 16bit counters being extended to 32bit in microseconds, or in 250khz (on avr). ztimer is fine until it misses a whole period (one half period can be missed). That would be a maximum of ~65ms resp. ~260ms that ztimer's tick ISR can be late before ztimer loses a tick. In theory.

@kaspar030 kaspar030 removed the CI: skip compile test If set, CI server will run only non-compile jobs, but no compile jobs or their dependent jobs label Feb 10, 2022
@kaspar030
Copy link
Copy Markdown
Contributor

kaspar030 removed the CI: skip compile test label 29 seconds ago

comment-only changes should do this automatically.

@chrysn
Copy link
Copy Markdown
Member Author

chrysn commented Feb 10, 2022 via email

chrysn added a commit to chrysn-pull-requests/RIOT that referenced this pull request Feb 10, 2022
@github-actions github-actions bot removed the Area: doc Area: Documentation label Feb 10, 2022
@chrysn
Copy link
Copy Markdown
Member Author

chrysn commented Feb 10, 2022

I think this refers to the dance with checking a mutex within interrupt? I don't get that yet.

It is purely there because unlike ztimer_is_set / ztimer_remove it is constant-time (and thus more suitable for an interrupt handler); granted, it is contrived, so I can switch that over to the more regular ztimer_remove.

Using a mutex is another way to check if a timer has expired, but one could also use ztimer_remove() in ISR, no? Where does the "maximum time the system can spend with pending interrupts" come into play?

(answering anyway because while it makes sense to you now the question is good)

It's really doing both things at the same time: The mutex check (which is indeed equivalent to the ztimer_remove check) ensures that the timer is active. To also satisfy the "difference is guaranteed to not have wrapped" one would usually need the extra intermediate step of having interrupts on (which can't be enabled in arbitrary code just like that) -- otherwise the timer might still be in the queue, but the now-value wrapped anyway due to critical sections which ztimer generally tolerates. When the timer was much shorter than the 2**32 ticks, it is safe to assume that the timer fires early enough that even a long critical section won't cause a difference wraparound.


Would you prefer if the second example were also expressed in terms of timer removal or is_set?

May I squash this?

@chrysn
Copy link
Copy Markdown
Member Author

chrysn commented Feb 14, 2022

Pinging @kfessel (ACK holding after the extensions) or @kaspar030 (would you ACK?)

Copy link
Copy Markdown
Contributor

@kfessel kfessel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two nitpicks but my ACK hold for the content

* that then sets the second timer.
*
* If the clock was active, then the difference between the second value and
* the first is then the elapsed time in the clock's unit, **modulo 2³²
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for all 2³² and 2⁶⁴ i would prefer another notation like 2^32 (2**32 may be missrendered ) ** indicating bold or italic

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have ² and ³ around already, both in documentation and in strings, including use for exponentiation. The ** easily misrenders, and the ^ triggers LaTeX traumata, so I've left this one out. (doxygen does have math support to do this properly, but I don't think it's worth the time it'd need to set it up for the cases where exponentiation is around).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2⁶⁴³² << why does github do this

Copy link
Copy Markdown
Contributor

@kfessel kfessel Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2**32 is used in l169 same file

i think marking the numbers as code would help against misrendering

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am ok with any of them
(doc.riot-os.org has meta data utf-8 and superscripts of numbers are certainly not the last chars somone would implement )

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2⁶⁴³² << why does github do this

I don't see where github mixes the 64 and 32, what do you mean?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
i mean ²³¹⁰⁴⁵⁶⁷⁸⁹ somehow 123 are a different super than 0456789 (in code)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny -- I see very subtle differences as it renders here, but they might easily be explained by a font designer trying to badly do oldstyle figures.

What I see in your screenshot more looks like what'd happen when the font was originally cut for latin-1 (which had superscript for digits 1, 2, and 3) and when later extended to he superscript block of Unicode.

Do they render fine in the documentation for you? (deep link into the build artifact)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes doc renders fine

these are the fonts github uses for code ui-monospace,SFMono-Regular,SF Mono,Menlo,Consolas,Liberation Mono,monospace;

Liberation Mono is the choosen one on this system

ui-monospace -> sytems monospace font (working in safari) see https://caniuse.com/extended-system-fonts
SFMono -> apples default monospace
Menlo -> apples old default monospace
Consolas -> windows default since 8

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, the gucharmap tool on Linux offers convenient inspection of fall-back font selection -- on my system, it shows superscript 1, 2 and 3 with Bitstream Vera Sans (on right-clicking the glyph's cell), but the others in DejaVu Sans. (Sadly, I know of no tool that allows one to input a list of font names like in CSS and then tells you where there is a Zwiebelfisch (where I contest the articles assessment that the term does not apply here)).

Copy link
Copy Markdown
Contributor

@kaspar030 kaspar030 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK from my side, once Karl's latest comments are addressed.

@chrysn chrysn force-pushed the ztimer-doc-stricter branch from 85c87a7 to 632ca35 Compare February 14, 2022 14:58
@chrysn chrysn enabled auto-merge February 15, 2022 08:09
@chrysn chrysn merged commit 69db27f into RIOT-OS:master Feb 15, 2022
@chrysn chrysn deleted the ztimer-doc-stricter branch February 15, 2022 08:09
@OlegHahm OlegHahm added this to the Release 2022.04 milestone Apr 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: sys Area: System Area: timers Area: timer subsystems CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ztimer_now() documentation is misleading

7 participants