Tests whether CUPTI recycles graphId values after cudaGraphExecDestroy,
which would be the only scenario where the graphLaunchCache in TracyCUDA
could serve stale entries for a non-matching exec handle.
Result (H100, CUDA 12, CUPTI): graphId is a monotonically increasing
counter that is never recycled. 22 create/instantiate/launch/destroy
cycles produced unique IDs ranging from 2 to 65 (incrementing by 3 per
cycle — one unit per node created during graph construction).
This confirms that the stale-cache concern raised in code review is not
a real risk in practice: two distinct exec handles always have distinct
graphIds.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests two questions:
1. Does relaunching the same cudaGraphExec produce a new correlationId
each time, or is it reused?
2. Do two different cudaGraphExec handles from the same cudaGraph share
a graphId?
Results on H100, CUDA 13.1:
- Each launch of the same exec handle gets a strictly unique, monotonically
increasing correlationId. CPU callback corrId == GPU activity corrId.
This is formally documented in cupti_activity.h:
"Each graph launch is assigned a unique correlation ID that is
identical to the correlation ID in the driver API activity record
that launched the graph."
- graphId identifies the exec handle (instantiation), not the graph
definition. Two cudaGraphInstantiate calls on the same graph produce
different graphIds.
These findings confirm that the cudaGraphCurrentLaunch cache in
matchGraphActivityToAPICall is always refreshed by the first activity
of each new launch before the graphId fallback path is ever used.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVCC 13.1 defaults to a PTX version incompatible with the installed
driver (580.105.08), causing kernels to silently fail with "provided
PTX was compiled with an unsupported toolchain". Use -arch=native so
NVCC auto-detects the target GPU (H100, sm_90) at build time.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the synthetic APICallInfo hack with proper correlation via
CUPTI_ACTIVITY_KIND_GRAPH_TRACE. When cuGraphLaunch fires an API
callback, its correlationId is stored in cudaCallSiteInfo. The
GRAPH_TRACE activity record carries the same correlationId plus the
graphId, which lets us build a graphId→APICallInfo map. Kernel/memcpy/
memset activities then look up this map via their graphId field.
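
The correlation flow described above can be sketched roughly as follows. This is an illustrative stand-in, not the TracyCUDA code: the GraphCorrelator class, its method names, and the simplified APICallInfo struct are hypothetical, and plain integers stand in for the CUPTI record fields.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified stand-in for the CPU call-site data.
struct APICallInfo {
    std::string callSite;
    uint32_t correlationId;
};

// Hypothetical helper that joins API callbacks, GRAPH_TRACE records,
// and device activities, mirroring the three steps in the text.
class GraphCorrelator {
public:
    // Step 1: the cuGraphLaunch API callback stores the call site
    // under its correlationId.
    void onApiCallback(uint32_t correlationId, std::string callSite) {
        callSites_[correlationId] = APICallInfo{std::move(callSite), correlationId};
    }

    // Step 2: the GRAPH_TRACE record carries correlationId and graphId,
    // so the call site can be republished under the graphId.
    bool onGraphTrace(uint32_t correlationId, uint32_t graphId) {
        auto it = callSites_.find(correlationId);
        if (it == callSites_.end()) return false;  // no matching API call seen
        currentLaunch_[graphId] = it->second;
        return true;
    }

    // Step 3: kernel/memcpy/memset records resolve their call site
    // through the graphId field.
    std::optional<APICallInfo> onDeviceActivity(uint32_t graphId) const {
        auto it = currentLaunch_.find(graphId);
        if (it == currentLaunch_.end()) return std::nullopt;  // caller reports a match error
        return it->second;
    }

private:
    std::unordered_map<uint32_t, APICallInfo> callSites_;      // correlationId -> call site
    std::unordered_map<uint32_t, APICallInfo> currentLaunch_;  // graphId -> latest launch
};
```

Because step 2 overwrites the entry for a graphId on every launch, a lookup in step 3 sees the most recent launch of that exec handle, provided the GRAPH_TRACE record is processed first.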
Key changes:
- Add cuGraphLaunch/cuGraphLaunch_ptsz to cbidDriverTrackers so the
API callback machinery captures the CPU call site
- Enable CUPTI_ACTIVITY_KIND_GRAPH_TRACE and handle it in
DoProcessDeviceEvent to populate cudaGraphCurrentLaunch[graphId]
- Add cudaGraphCurrentLaunch map to PersistentState
- Two-pass buffer processing in OnBufferCompleted so GRAPH_TRACE
records (which complete last on GPU) are processed before the
kernel/memcpy/memset records that depend on them
- Replace graphId=0 fallback in kernel/memcpy/memset with proper
cudaGraphCurrentLaunch lookup; fall through to matchError if
the graphId is not found
- Update repro to include TracyCUDA headers and properly test
GPU zone correlation
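
The two-pass ordering from the list above could look something like this minimal sketch. The Record struct, RecordKind enum, and twoPassOrder function are hypothetical names for illustration only; the real OnBufferCompleted walks CUPTI activity records rather than a vector.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record kinds standing in for the CUPTI activity kinds.
enum class RecordKind { GraphTrace, Kernel, Memcpy, Memset };

// Hypothetical, simplified activity record.
struct Record {
    RecordKind kind;
    uint32_t graphId;
};

// Returns the records in the order a two-pass walk would handle them:
// first every GraphTrace record, then everything else, each group
// keeping its original relative order.
std::vector<Record> twoPassOrder(const std::vector<Record>& buffer) {
    std::vector<Record> ordered;
    ordered.reserve(buffer.size());
    // Pass 1: GRAPH_TRACE records populate the graphId lookup.
    for (const Record& r : buffer) {
        if (r.kind == RecordKind::GraphTrace) ordered.push_back(r);
    }
    // Pass 2: kernel/memcpy/memset records that depend on that lookup.
    for (const Record& r : buffer) {
        if (r.kind != RecordKind::GraphTrace) ordered.push_back(r);
    }
    return ordered;
}
```

Walking the buffer twice avoids reordering it in place, at the cost of a second iteration.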
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Minimal reproducer showing that CUDA Graph-launched kernels produce
0 GPU zones in Tracy. The repro creates a simple graph (2 kernels +
1 memcpy), launches it 10 times, and expects ~30 GPU zones. Without
the fallback patch, all activity records are dropped by matchError().
Tested on NVIDIA H100, CUDA 13.1.
Introduce native Windows ARM64 support across the ToyPathTracer,
profiler, and server code paths when building with MSVC (_M_ARM64).
Key changes:
- MathSimd.h/Maths.h:
  - Fix NEON movemask constants for MSVC/ARM64 by loading from a uint32_t[]
    via vld1q_u32() and using vdupq_n_u32() for highbit.
- enkiTS/TaskScheduler.cpp:
  - Provide Pause() implementation on _M_ARM64 using __yield().
- profiler/winmain.cpp:
  - Restrict AVX feature checks to x86/x64 and skip them on ARM64.
- server/TracyPopcnt.hpp:
  - Implement TracyCountBits using ARM NEON intrinsics.
  - Implement TracyLzcnt using _BitScanReverse64().
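
As a rough illustration of how a leading-zero count can be derived from a bit-scan-reverse result, here is a portable sketch. It does not use the MSVC or NEON intrinsics named above: bitScanReverse64 is a hypothetical stand-in with the same index-out, found-flag-return shape, and lzcntFromBsr and countBits are hypothetical reference helpers. Returning 64 for a zero input is an assumption of this sketch, not a statement about TracyLzcnt.

```cpp
#include <cstdint>

// Hypothetical portable stand-in for a bit-scan-reverse primitive:
// writes the index of the highest set bit and returns true, or
// returns false when mask is zero.
inline bool bitScanReverse64(unsigned long* index, uint64_t mask) {
    if (mask == 0) return false;
    unsigned long i = 63;
    while (((mask >> i) & 1u) == 0) --i;  // walk down from the top bit
    *index = i;
    return true;
}

// Leading zeros = 63 - (index of highest set bit).
// Assumption: a zero input reports 64 leading zeros.
inline uint64_t lzcntFromBsr(uint64_t x) {
    unsigned long index = 0;
    return bitScanReverse64(&index, x) ? 63 - index : 64;
}

// Reference population count (clears the lowest set bit each step),
// useful for checking an intrinsic-based implementation against.
inline uint64_t countBits(uint64_t x) {
    uint64_t n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}
```

A scalar reference like this can serve as a cross-check when porting the intrinsic paths to a new architecture.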
* Add <cmath> to fix the unknown symbol sqrt when building on Linux (GCC). Remove <algorithm>, as suggested by clangd.
* Adjust according to the review comment
- Adds the file TracyOpenCL.hpp which contains the API to annotate OpenCL applications
- It works in a similar fashion to the Vulkan annotations
- Adds an example OpenCL application in examples/OpenCLVectorAdd
- Adds "OpenCL Context" to the UI
- Adds a manual entry on annotating OpenCL zones