Here at weaselab we take correctness seriously, so when a release contains a correctness issue it's worth taking the time to reflect on how it happened. This serves to improve our understanding of our test effectiveness, incentivize not releasing bugs, and communicate to users transparently.

Summary

In affected versions, interleaving calls to multiple conflict sets in the same thread can cause wrong results. Upgrade to v0.0.9.

Affected versions:

v0.0.7
v0.0.8

Timeline

2024-06-25

We realize (by inspection/thought experiment) that performance for range reads is unsatisfactory in certain scenarios (https://git.weaselab.dev/weaselab/conflict-set/issues/27). This results in a spree of optimizations intended to improve read range performance, the most ambitious of which is storing versions in the tree with 32-bit precision in order to use wider SIMD instructions and save memory. Doing this correctly is quite tricky, but basically it suffices to compare versions (using two's complement as guaranteed in c++20) with subtract followed by signed compare with 0, and ensure that you never compare versions whose full-precision values are different by more than 2^31 - 1. As part of this effort, test coverage is added for advancing the version by large amounts and testing with large versions.

2024-07-02

The bug is introduced in https://git.weaselab.dev/weaselab/conflict-set/commit/3e2c8310bb91681b30ad0b1bf0924d1df308ee93, as part of implementing 32-bit precision versions

2024-07-08

v0.0.7 is released containing the bug

2024-07-18

A downstream project discovers the bug. The symptom is a use-of-uninitialized-memory in an internal ConflictSet call stack, reported by Valgrind.

2024-07-22

Root cause discovered, fixed, and released as v0.0.9

Details

A suitable value for "zero" is tracked in a thread local variable. The semantics of "zero" are simply "<= all valid versions in the current window". This value was incorrectly not updated at the beginning of a call to addWrites. Here's how to reproduce (in a python-like pseudocode):

# write the empty key at version 2 to conflict set one
cs1.addWrites(2, write(b""))  
cs1.setOldestVersion(2)  
# The internal thread local variable "zero" is now 2

# Make a Node48. Correctly checking range reads on a Node48 relies on "zero"
# being set correctly when the Node48 is created.
for i in range(256 - 17, 256):  
    cs2.addWrites(int(1), write(bytes([i])))

# Scan until first point write. This incorrectly conflicts. It should commit.
cs2.check(read(0, b"\x00", bytes([256 - 17])))

It's also possible to commit incorrectly:

# Add an empty list of writes at version 2**32
cs1.addWrites(2**32)  
cs1.setOldestVersion(2**32)  
cs2.addWrites(2**31 + 100)  
cs2.setOldestVersion(2**31 + 100)  
# "zero" is now 2**31 + 100
cs1.addWrites(2**32 + 101, write(b"", b"\x02"), write(b"\x01"))  
# rangeVersion of \x01 is now 2**31 + 100 ("max" of (2**31 + 100, 2**32 + 101))
cs1.check(read(2**32 + 1, b"\x00"))  
# but 2**32 + 1 ">" 2**31 + 100, and it incorrectly commits

If there's only one conflict set at a time per thread, this bug cannot manifest since "zero" remains correct throughout.

Root Cause Analysis

The question to ask is almost always "why wasn't this caught sooner?" The answer in this case is pretty simple. We have sophisticated testing for many scenarios, including concurrent calls on different threads, but we do not test interleaving calls to different conflict sets on the same thread. In hindsight this would have been prudent given the error-prone thread local variable usage.

But let's keep asking "why?". Why were we using a thread local variable in an error-prone way? This was to avoid plumbing "zero" throughout the call tree internally. Why have "zero" at all? Something like this is a necessary part of the book-keeping required to enable storing 32-bit versions internally. Why 32-bit versions? Performance. Why performance? Outperforming the skip list is the raison d'être of this project. The skip list still beats us in "pathological" workloads, but now it's mostly just slow in the usually way a radix tree can be made slow: a, aa, aaa, aaaa, ...

So in the end this is a case of introducing complexity in the name of performance without managing that complexity well enough.

Another interesting question is "how was it discovered?" In this case Valgrind caught it, which is a real blessing since debugging memory errors is often much easier than logical errors. The usual knee-jerk reaction to use-of-uninitialized-memory is to ... initialize the memory. However, while this would have prevented Valgrind from issuing a diagnostic, this would not have been correct and the net effect is that debugging this would be much harder, and delayed since in the short-term we assume we fixed it. In this case, a correct "zero" value prevents the uninitialized memory from getting used, and the memory was uninitialized in the first place because there's nothing meaningful to initialize it with. If you're not comfortable with actual uninitialized memory, you can zero it but still have valgrind track it as uninitialized using client requests.

Last question: "How did we go from symptom to root cause?" Once we could reproduce, we tried recording a trace of the calls to each conflict set and replaying them, to isolate the conflict set. This of course didn't work since interleaving calls from multiple conflict sets is necessary to trigger the bug. After initial frustration with the "record and replay" idea, we realized that "record and replay isn't working" was actually a big clue! Then we started looking at differences between replay and in situ, and found the bug by inspection.

Prevention

Introduce regression tests for this specific bug, and include coverage for interleaving calls in the same thread in our fuzz testing.

What went well

Using Valgrind, and tracking "meaningfulness" of memory as "initializedness"
Deterministic simulation to reproduce consistently
Using libfuzzer to find minimal counter examples to assist with understanding the bug