Skip to content

fix(batcher): replace static global with dynamic allocation and refcounting#1556

Merged
jpnurmi merged 20 commits intomasterfrom
jpnurmi/fix/batcher-global
Mar 9, 2026
Merged

fix(batcher): replace static global with dynamic allocation and refcounting#1556
jpnurmi merged 20 commits intomasterfrom
jpnurmi/fix/batcher-global

Conversation

@jpnurmi
Copy link
Collaborator

@jpnurmi jpnurmi commented Mar 6, 2026

This PR adds a new stress test that spawns multiple producer threads continuously logging while the main thread repeatedly re-inits the SDK, and fixes the resulting issue flagged by TSan:

$ RUN_ANALYZER=tsan pytest -v --capture=no "tests/test_unit.py::test_unit[logs_reinit_stress]"
...
WARNING: ThreadSanitizer: data race (pid=1194861)
  Write of size 8 at 0x5555556341b8 by main thread (mutexes: write M0):
    #0 sentry__batcher_startup sentry-native/src/sentry_batcher.c:302 (sentry_test_unit+0x37c38) (BuildId: 19335422fa6a19b9e786eea02ccd28b1770e2ea3)
    #1 sentry__logs_startup sentry-native/src/sentry_logs.c:463 (sentry_test_unit+0x48297) (BuildId: 19335422fa6a19b9e786eea02ccd28b1770e2ea3)
    #2 sentry_init sentry-native/src/sentry_core.c:302 (sentry_test_unit+0x3903f) (BuildId: 19335422fa6a19b9e786eea02ccd28b1770e2ea3)
    #3 test_sentry_logs_reinit_stress sentry-native/tests/unit/test_logs.c:461 (sentry_test_unit+0x7da63) (BuildId: 19335422fa6a19b9e786eea02ccd28b1770e2ea3)
    #4 test_do_run__ sentry-native/include/../vendor/acutest.h:943 (sentry_test_unit+0x70ad1) (BuildId: 19335422fa6a19b9e786eea02ccd28b1770e2ea3)
    #5 test_run__ sentry-native/include/../vendor/acutest.h:1114 (sentry_test_unit+0x1d154) (BuildId: 19335422fa6a19b9e786eea02ccd28b1770e2ea3)
    #6 main sentry-native/include/../vendor/acutest.h:1623 (sentry_test_unit+0x1d154)

  Previous read of size 8 at 0x5555556341b8 by thread T2:
    #0 pthread_cond_signal ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1319 (libtsan.so.2+0x8b77e) (BuildId: edd4cc50d12638c09456940b2f9b3fb7d4f8eec4)
    #1 sentry__batcher_enqueue sentry-native/src/sentry_batcher.c:203 (sentry_test_unit+0x37ad1) (BuildId: 19335422fa6a19b9e786eea02ccd28b1770e2ea3)
    #2 sentry__logs_log sentry-native/src/sentry_logs.c:385 (sentry_test_unit+0x47892) (BuildId: 19335422fa6a19b9e786eea02ccd28b1770e2ea3)
    #3 sentry__logs_log sentry-native/src/sentry_logs.c:362 (sentry_test_unit+0x47e9c) (BuildId: 19335422fa6a19b9e786eea02ccd28b1770e2ea3)
    #4 sentry_log_info sentry-native/src/sentry_logs.c:422 (sentry_test_unit+0x47e9c)
    #5 log_producer_thread sentry-native/tests/unit/test_logs.c:442 (sentry_test_unit+0x7badd) (BuildId: 19335422fa6a19b9e786eea02ccd28b1770e2ea3)

  Location is global 'g_batcher' of size 248 at 0x555555634120 (sentry_test_unit+0xe01b8)

@jpnurmi jpnurmi changed the title test(logs): add stress test for batcher reinit hang test(logs): add stress test for batcher reinit Mar 6, 2026
@jpnurmi jpnurmi changed the title test(logs): add stress test for batcher reinit fix(batcher): replace static global with dynamic allocation and refcounting Mar 6, 2026
@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Mar 6, 2026

@sentry review

@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Mar 6, 2026

@cursor review

@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Mar 6, 2026

@sentry review

@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Mar 6, 2026

@cursor review

@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Mar 6, 2026

@sentry review

@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Mar 6, 2026

@cursor review

@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Mar 6, 2026

@sentry review

@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Mar 6, 2026

@cursor review

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

jpnurmi and others added 12 commits March 6, 2026 16:45
Spawns 8 producer threads continuously logging while the main thread
does 10 cycles of sentry_init/sentry_close. Reproduces a condvar
corruption hang in the batcher under TSan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…unting

The static g_batcher persisted across init/close cycles, causing condvar
corruption when sentry__batcher_startup re-initialized the condvar while
old threads still used it. Dynamic allocation with refcounting solves
both the condvar corruption and the lifetime problem.

Introduces sentry_batcher_ref_t with acquire/release/swap API that
encapsulates a spinlock to make concurrent access from producer threads
and shutdown safe.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the previous startup failed to spawn its thread, shutdown_begin
returns false and shutdown_wait is never called. The old batcher stays
in g_batcher and leaks when startup swaps it out without releasing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A producer that acquired a ref before shutdown could enqueue items after
the final flush. These items were leaked when the batcher was freed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add sentry__cond_destroy for all platforms (pthread_cond_destroy on
POSIX, CloseHandle on pre-Vista Windows, no-op on Vista+) and call it
in sentry__batcher_release. Without this, each SDK re-initialization
leaks kernel handles on pre-Vista Windows and violates POSIX which
requires pthread_cond_destroy before freeing memory containing a
pthread_cond_t.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit 467bfb5ecb2c4528e1048dab1c467203fdc29cdf.
The two-phase shutdown (begin/wait) was designed to signal both the logs
and metrics threads in parallel before joining either. With #1558
replacing the condvar with a level-triggered waitable flag, thread wake
latency drops to sub-millisecond (futex/WaitOnAddress), making the
parallel signaling unnecessary.

Merging into a single function also fixes a race where a concurrent
sentry_init() could replace g_batcher between the signal and join steps,
causing the new batcher to be shut down instead of the old one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jpnurmi jpnurmi force-pushed the jpnurmi/fix/batcher-global branch from a3738d5 to ba0b566 Compare March 6, 2026 15:45
@jpnurmi jpnurmi marked this pull request as ready for review March 6, 2026 15:45
@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Mar 6, 2026

@cursor review

The crash-safe flush functions called sentry__batcher_acquire, which
spins on a spinlock. If the crash occurs on the thread holding that
lock, the signal handler deadlocks. Replace with a new lock-free
sentry__batcher_peek that reads ref->ptr via atomic load without
taking the spinlock or bumping the refcount (safe because the process
is dying).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jpnurmi and others added 2 commits March 6, 2026 17:21
On Windows x64, `long` is 32-bit but pointers are 64-bit. Use
InterlockedCompareExchangePointer on Windows and __atomic_load
elsewhere to avoid truncating the pointer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
force_flush_begin and force_flush_wait each acquired the batcher
independently. A concurrent sentry_init could swap the batcher between
the two calls, causing wait to flush the new empty batcher while the
original data is lost.

Fix by returning an opaque token (the acquired batcher ref) from begin
and passing it to wait, ensuring both operate on the same instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Mar 6, 2026

@cursor review

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Copy link
Collaborator

@supervacuus supervacuus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very clean! Only minor comments.

jpnurmi and others added 2 commits March 9, 2026 12:47
Co-authored-by: Mischan Toosarani-Hausberger <mischan@abovevacant.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Metrics startup swaps NULL batcher on allocation failure
    • Updated metrics startup to return early on batcher allocation failure and to shut down any swapped-out batcher before releasing it.

Create PR

Or push these changes by commenting:

@cursor push e977efb2bc
Preview (e977efb2bc)
diff --git a/src/sentry_metrics.c b/src/sentry_metrics.c
--- a/src/sentry_metrics.c
+++ b/src/sentry_metrics.c
@@ -133,10 +133,17 @@
 {
     sentry_batcher_t *batcher
         = sentry__batcher_new(sentry__envelope_add_metrics);
-    if (batcher) {
-        sentry__batcher_startup(batcher, options);
+    if (!batcher) {
+        SENTRY_WARN("failed to allocate metrics batcher");
+        return;
     }
-    sentry__batcher_release(sentry__batcher_swap(&g_batcher, batcher));
+
+    sentry__batcher_startup(batcher, options);
+    sentry_batcher_t *old = sentry__batcher_swap(&g_batcher, batcher);
+    if (old) {
+        sentry__batcher_shutdown(old, 0);
+    }
+    sentry__batcher_release(old);
 }
 
 void
This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.

@jpnurmi jpnurmi merged commit 5f1b18c into master Mar 9, 2026
50 checks passed
@jpnurmi jpnurmi deleted the jpnurmi/fix/batcher-global branch March 9, 2026 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants