In versions v0.0.12 and earlier, the documentation in ConflictSet.h is ambiguous about the behavior of a call to check or setOldestVersion with a version greater than the version of the last call to addWrites. One might reasonably expect the behavior to be the same as if there's an implicit call to addWrites with zero writes and the later version prior to the check or setOldestVersion call. However, the 32-bit version implementation was written with the implicit assumption that this does not happen, and so effectively there's a missing precondition in the documentation for check and setOldestVersion. Subsequent versions will explicitly list this precondition.

Versions v0.0.7 - v0.0.12 can report conflicts incorrectly if this precondition is violated. The precondition is now explicit, and so later versions have undefined behavior (the usual consequence of violating preconditions) if the precondition is violated.

Timeline

2024-07-08

v0.0.7 is released containing the bug/missing precondition.

2024/08/23

A downstream project reports an unexpected conflict in v0.0.12

2024/08/24

Root cause is discovered, the missing precondition is published

Details

Recall that 32 bit versions are safe as long as you never compare versions whose full-precision values are different by more than 2^31 - 1.

If you check a version large enough to be outside the window, it can give a wrong result.

cs.addWrites(0, write(b""))  
cs.check(read(2**31 + 1, b""))  
# incorrectly gives conflict, should be commit

The implementation of setOldestVersion doesn't garbage collect until the oldest extant version is back in the valid window (addWrites does), so calling setOldestVersion can result in versions within the tree being outside the window as well.

Extensive fuzzing has not found a scenario where this issue causes an incorrect commit.

You may see use-of-uninitialized-memory if running under Valgrind, which technically means undefined behavior in the c++ abstract machine, but the only undesired behavior observed in the generated code is incorrect conflicts.

Root Cause Analysis

Why not caught sooner?

Since the code was written implicitly assuming that versions for check and setOldestVersion are always <= the version of the last addWrites, there is no code that tests this scenario. The fuzz tests were written in a way where this never happens (and they still are, otherwise the test case would violate a precondition).

How was it discovered?

A downstream project reports an unexpected conflict in v0.0.12

How did we go from symptom to root cause?

After some exploratory experiments trying to rule out various possibilities, we learned that this doesn't reproduce in 64-bit version mode. Then we added some instrumentation to check for comparisons of versions whose full precision is different by more than 2^31 - 1, and found those. Asking "why do we expect this to not happen" led to noticing the implicit precondition.

Prevention

We plan to add a new fuzz test or change the existing one such that the structure of the calls is less constrained. Currently it calls check, then addWrites, then setOldestVersion in a cycle. See https://git.weaselab.dev/weaselab/conflict-set/issues/33

What went well

Deterministic simulation to reproduce consistently
Structuring the code to allow switching between 32-bit versions and 64-bit versions was convenient for root-causing
Having USE_SIMD_FALLBACK handy was also convenient as a way to disable all the simd code that assumes a particular version representation