
CRYPTO_atomic_read #4414

Closed
paulidale wants to merge 2 commits into openssl:master from paulidale:atomic_read

Conversation

@paulidale
Contributor

Add a CRYPTO_atomic_read call which allows an int variable to be read in an atomic fashion.

  • documentation is added or updated
  • tests are added or updated

@paulidale paulidale mentioned this pull request Sep 25, 2017
Contributor

shouldn't that second read mention be atomic add?

Contributor Author

Yeah :(

Contributor Author

And removed.

Contributor

I'm not an expert in this matter, just googled it, but wouldn't this implementation here be more natural?

int CRYPTO_atomic_read(int *val, int *ret, CRYPTO_RWLOCK *lock)
{
    *ret = InterlockedCompareExchange(val, 0, 0);
    return 1;
}

See https://msdn.microsoft.com/en-us/library/ms686355(VS.85).aspx. Unless OpenSSL still supports Windows XP, evenInterlockedCompareExchangeAcquire() could be used.

Contributor Author

I know nothing about Windows programming :)
I'll switch this over.

@mspncp
Contributor

mspncp commented Sep 25, 2017

Looks like there are already some existing use cases for CRYPTO_atomic_read(), which should be replaced:

~/src/openssl$ grep -r -nH 'CRYPTO_atomic_add.*, 0'  *
crypto/lhash/lh_stats.c:75:    CRYPTO_atomic_add(&lh_mut->num_hash_calls, 0, &ret,
crypto/lhash/lh_stats.c:78:    CRYPTO_atomic_add(&lh_mut->num_comp_calls, 0, &ret,
crypto/lhash/lh_stats.c:85:    CRYPTO_atomic_add(&lh_mut->num_retrieve, 0, &ret, lh->retrieve_stats_lock);
crypto/lhash/lh_stats.c:87:    CRYPTO_atomic_add(&lh_mut->num_retrieve_miss, 0, &ret,
crypto/lhash/lh_stats.c:90:    CRYPTO_atomic_add(&lh_mut->num_hash_comps, 0, &ret,

Member

Is acquire the order you want to use here? Maybe acquire release?

Contributor Author

Acquire/release isn't legal for an atomic load. The valid memory orders are: relaxed, seq_cst, acquire, consume. Because this is reading but not writing, I think acquire is the correct ordering.

@dot-asm
Contributor

dot-asm commented Sep 25, 2017

As was discussed already [internally], you can't make any sensible decision without modifying an atomic variable. Or in other words, how is an atomic read different from a plain read? Indeed, consider

    if (!CRYPTO_THREAD_write_lock(lock))
        return 0;

    *ret  = *val;

    if (!CRYPTO_THREAD_unlock(lock))
        return 0;

If there is contention for *val, then it can be different by the time the caller regains execution control after CRYPTO_THREAD_unlock. So what does locking give you? Well, one can still argue that the suggested atomic read would prevent the compiler from moving it as a result of optimization, but so would a cast to volatile. As for the use of atomic add with a 0 increment in lh_stats.c, it's terrible and not right. If a coherent view of the stats is an absolute requirement, then one shouldn't use atomic_add on structure fields elsewhere, but lock the whole structure. Currently the view is incoherent, and no atomic add with 0 increment would fix that.

@dot-asm
Contributor

dot-asm commented Sep 25, 2017

Well, there is a chance that I've let the "practitioner" in myself take over the "theoretician". I mean, is it possible that the notion of an atomic load is meaningful on platforms where a load has to be performed with multiple instructions? And the "practitioner" says there are no such platforms. Is this it? Yet we do make assumptions about the platform, for example that it's [[I]L]P(32|64) and two's complement, so we don't actually aim for a purely abstract platform. So is it reasonable to assume that a conventional load is actually atomic?

@kaduk
Contributor

kaduk commented Sep 25, 2017

I, for one, am not confident enough to assert that we can assume that a conventional load is atomic on all platforms we are used on. (Note that I say "platforms we are used on", not "supported platforms", because there are no doubt people pushing the envelope.)
And if I understand correctly, here "platform" includes not just the processor but also the memory model and the potential for unaligned accesses.

@paulidale
Contributor Author

This came out of #4402. I don't think that an atomic read is required there. The most likely outcome of a race is an extra reseeding (or possibly a delayed reseed).

I have seen processors where int loads took several instructions. Some small micros do this, and MIPS does for constant loads (which aren't directly relevant). However, on systems where loads are atomic, I'd expect these calls to be replaced by a plain load/store sequence. This happens on x86/64. There is still a cost: the memory write, the function call and the return.

If people don't think this PR is worthwhile, speak up. I'm happy for it to not go in. The current use case doesn't need it and the other places in lhash where it could be used should be done differently.

Locking is the solution for the lhash cases. In a new PR.

@paulidale
Contributor Author

I've dealt with the lhash concerns in #4418

@dot-asm
Contributor

dot-asm commented Sep 26, 2017

"platform" includes not just processor but also memory model and potential for unaligned accesses.

Trick question. There is ambiguity in "memory model" when it's used in the same sentence as "platform" and "processor", in the sense that the processor itself upholds a certain discipline referred to as a "memory model", but on the other hand there is the language "memory model", i.e. how the compiler schedules loads and stores :-) Anyway, I'd say that we can take unaligned accesses out of the equation, for the following reason. We are not talking about reading variables in general, but specifically about base types that match a processor register: int, long, void *. And we can expect them to be aligned; the compiler is obliged to do that for us. In the context of an "atomic" load we can take the processor's memory model out of the equation too, because there is ambiguity in either case, i.e. did you get the value before or after the modification. I mean the processor's memory model affects the ambiguity, but doesn't eliminate it, and it doesn't matter which ambiguity it is. As for the language "memory model", it's tricky, isn't it? Our requirement for language standard compliance predates the notion of a "memory model" in C. It should be noted however that two disjoint subroutines accessing the same shared variable, and called from a disjoint translation unit, do constitute the desirable "memory model", don't they? [I suppose this is also an argument against LTO.]

@dot-asm
Contributor

dot-asm commented Sep 26, 2017

I have seen processors where int loads took several instructions. Some small micros do this, MIPS does for constant loads

This can't be viewed as a load, because there is no separately addressable memory cell; the pieces of the constant are encoded into the instructions themselves. As already mentioned, we are not talking about loads in the most general sense, but specifically about aligned loads of shared variables that fit into a processor register.

@dot-asm
Contributor

dot-asm commented Sep 26, 2017

two disjoint subroutines accessing the same shared variable, and called from a disjoint translation unit, do constitute the desirable "memory model", don't they?

And so would a variable declared volatile...

@dot-asm
Contributor

dot-asm commented Sep 26, 2017

OK, would it be appropriate to say it like this? It's desirable to use atomic_load because it conveys information to the compiler about what it can do in terms of optimization. It will be compiled as just a load [on supported platforms], but the compiler won't be as free to move it around as a plain reference. It should be noted that essentially the same effect can be achieved by declaring the variable volatile. And that would actually be an acceptable approach for compilers that don't specify a "memory model". One can also point out that the CRYPTO_atomic_* functions reside in a separate translation unit, which actually limits such a compiler's ability to screw things up. [On the other hand one can note that, formally speaking, adhering to atomic_load should make LTO safe, at least with respect to the involved variables.]

If that's an adequate summary, then it might be appropriate to implement CRYPTO_atomic_read. But the suggestion is to adhere to volatile where atomic_load is not an option. The concern here is that synchronization costs would be off-the-charts disproportionate to the [zero] benefit. And "might" in the first sentence means that one can still wonder whether there is really a legitimate reason for it. I mean, since you can't really take a sensible decision based on the value of a reference counter without actually modifying it... I suppose I have to double-check #4402.

@mspncp
Contributor

mspncp commented Sep 26, 2017

The counter drbg->restart_count in #4402 is not a reference counter, it's more like a modification timestamp. A chained drbg checks whether drbg->restart_count != drbg->parent->restart_count. If this is the case, it knows that the parent reseeded more recently than itself, and will auto-reseed. As @paulidale already noted, a race condition will in the worst case have the effect that the drbg is reseeded more than once, or the reseed occurs one or more generate() requests later than intended.

@paulidale
Contributor Author

@mspncp summed it up well. The instance in question doesn't strictly require this functionality.
Hence the offer to close this without merging.

@paulidale
Contributor Author

It seems that volatile isn't a good answer: https://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming

It guarantees going to memory, not that the memory access is atomic.

@dot-asm
Contributor

dot-asm commented Sep 27, 2017

Once again, volatile is suggested as a compromise for a specific case: the load of an aligned integer. There is no contradiction with the referred article, because it too recognizes that any load of an aligned integer, be it volatile or not, is atomic. And what is the compromise? Pre-C11[?] C doesn't make any promises about how and when variables are loaded, written or cached in registers, except for volatile ones. So by adhering to a volatile load we simply prevent the compiler from caching the value in a register. And the referred article effectively recognizes that. And once again, the reason for seeking a compromise is that synchronization doesn't give you any advantage, in the sense that you can get an outdated value even with mutexes around it.

@dot-asm
Contributor

dot-asm commented Sep 27, 2017

Pre-C11[?] C doesn't make any promises about how and when variables are loaded, written or cached in registers.

Well, it's not like C11 makes such promises in general. It's just that in addition to volatiles it adds the notion of atomic variables and specifies their behaviour with respect to memory references.

@mspncp
Contributor

mspncp commented Sep 27, 2017

synchronization doesn't give you any advantage, in the sense that you can get an outdated value even with mutexes around it.

As far as I understand it, an atomic read/write is meant to prevent me from reading inconsistent values (partially updated by another thread). This has nothing to do with the question whether the value is current or might become outdated immediately afterwards. So isn't this discussion comparing apples and oranges? Or did I miss something?

@dot-asm
Contributor

dot-asm commented Sep 27, 2017

by adhering to a volatile load we simply prevent the compiler from caching the value in a register.

Just to clarify: you have to imagine some super-aggressive LTO that inlines all functions and for whatever reason chooses to cache the variable's value in a register, like, forever.

@dot-asm
Contributor

dot-asm commented Sep 27, 2017

an atomic read/write is meant to prevent me from reading inconsistent values (partially updated by another thread).

This is a misleading formulation. An atomic variable can't be partially updated, by definition. For this reason the compiler won't even support atomic access for types that the hardware doesn't guarantee it for. I.e. you are not free to declare an arbitrary type [base or not] as atomic.

Or did I miss something?

Well, the spirit of the quote is rather "[provided that we recognize that load of aligned integer is actually atomic (on all known ILP32 and better platforms)] synchronization doesn't give you any advantage."

@dot-asm
Contributor

dot-asm commented Sep 27, 2017

[provided that we recognize that load of aligned integer is actually atomic (on all known ILP32 and better platforms)]

And of course so is the store in the thread that is updating the variable in question, be it declared atomic or not.

@mspncp
Contributor

mspncp commented Sep 27, 2017

Well, the spirit of the quote is rather "[provided that we recognize that load of aligned integer is actually atomic (on all known ILP32 and better platforms)] synchronization doesn't give you any advantage."

Ok, I think you convinced me now to keep things simple in PR #4402 and just do a naive compare without worrying too much ;-).

@paulidale
Contributor Author

I can't see a simple compare hurting for #4402. A delayed or extra reseed is harmless. I don't think a reseed can reasonably be missed, which is the problem case (2^32 reseeds between generations would be a problem, but that's essentially impossible).

What synchronisation does give you is a guarantee that the two halves weren't read separately, e.g. if the structure is misaligned, or you're running on an 8-bit CPU that requires four loads, or, perhaps, the value crosses a page boundary.

Anyway, I'll close this unmerged tomorrow if I don't hear any objections.

@paulidale
Contributor Author

No concerns about keeping this have been raised so I'm closing without merging.

@paulidale paulidale closed this Sep 27, 2017
@paulidale paulidale deleted the atomic_read branch September 27, 2017 20:57
@kaduk
Contributor

kaduk commented Sep 27, 2017

It would probably be good to document the conclusions in the tree somewhere, in terms of what behavior we have decided to rely upon.

@paulidale paulidale restored the atomic_read branch September 27, 2017 21:12
@paulidale
Contributor Author

Reopening in light of #4467

@richsalz
Contributor

richsalz commented Oct 8, 2017

Now that it's re-opened because it seems useful (see the atomic-add calls with 0), are @dot-asm and others satisfied with it? Another option is to define atomic-read as atomic-add with zero and "fix" it later.

@paulidale
Contributor Author

Fixed a defined() check to use the correct macro (__ATOMIC_ACQUIRE instead of __ATOMIC_ACQ_REL).

@paulidale
Contributor Author

Should an atomic write also be supported?

int CRYPTO_atomic_write(int *val, int n, CRYPTO_RWLOCK *lock)

Again, atomic_add can do this by adding zero, but a write would be more efficient on many platforms since it avoids half of the read/write cycle.

@paulidale
Contributor Author

CRYPTO_atomic_write added.

@paulidale
Contributor Author

ping?

levitte pushed a commit that referenced this pull request Oct 10, 2017
in an atomic fashion.

Reviewed-by: Rich Salz <[email protected]>
(Merged from #4414)
levitte pushed a commit that referenced this pull request Oct 10, 2017
Reviewed-by: Rich Salz <[email protected]>
(Merged from #4414)
@paulidale
Contributor Author

Merged, thanks.

@paulidale paulidale closed this Oct 10, 2017
@paulidale
Contributor Author

Merged, thanks.

@paulidale paulidale deleted the atomic_read branch October 10, 2017 02:58
@dot-asm
Contributor

dot-asm commented Oct 10, 2017

Generally speaking it's rather a compiler-specific thing than a threading-model-specific one. For example, there is gcc for Windows, you know. To extend the argument, it might use the C11 coverage (see refcount.h for inspiration). In other words it might be appropriate to have it as a separate module as opposed to threading-module-specific...


int CRYPTO_atomic_read(int *val, int *ret, CRYPTO_RWLOCK *lock)
{
    InterlockedCompareExchange(val, 0, 0);
Contributor

And who updates *ret?

Contributor

Besides, if one argues that atomic_load is better than atomic_add with 0 because it translates to an instruction with no lock prefix, then how come this one? This is just as expensive an instruction...

Contributor Author

I'm open to other suggestions as to how to go about this on Windows.
I don't know enough about the operating system.

Contributor

@paulidale, you forgot the assignment to *ret, see my original suggestion.

I have no idea about the cost of these functions; @dot-asm seems to be better informed. If he says that the compare exchange is much more expensive than a simple atomic_add with 0, then you could use InterlockedAdd() instead.

Contributor

Ok, sorry, I overlooked #4517.

Contributor Author

Yeah, I missed the assignment part of your suggestion @mspncp.
Final versions are best done by someone with substantial Windows experience.

    return 1;
}
# endif
    if (!CRYPTO_THREAD_write_lock(lock))
Contributor

Why write lock?

Contributor Author

Copy paste :(

@dot-asm
Contributor

dot-asm commented Oct 10, 2017

are @dot-asm and others satisfied with it?

I'm not, because I still think that it gives you the wrong impression. But I have to collect my thoughts...

@paulidale
Contributor Author

I've made the changes to the above problems in #4517
