33 Commits

Author SHA1 Message Date
Clément Grégoire
f93d17a96f Add TRACY_PLATFORM_HEADER hook for unsupported platforms.
Extension point so private/unsupported platforms can plug in their own implementations of the kernel/libc primitives Tracy depends on, without patching the `#if`/`#elif` chains.

Projects supply a platform header via `-DTRACY_PLATFORM_HEADER="\"my_platform.h\""` at build time. Tracy includes it in any TU that needs the hooks. The header toggles per-category `TRACY_HAS_CUSTOM_*` macros and declares matching `tracy::Platform*` functions.

Available hooks:

- `TRACY_HAS_CUSTOM_THREAD_ID` → `PlatformGetThreadId`
- `TRACY_HAS_CUSTOM_USER_INFO` → `PlatformGetHostname`, `PlatformGetUserLogin`, `PlatformGetUserFullName`
- `TRACY_HAS_CUSTOM_SAFE_COPY` → `PlatformSafeMemcpy`
- `TRACY_HAS_CUSTOM_ALLOCATOR` → `PlatformMalloc`, `PlatformFree`, `PlatformRealloc`, `PlatformAllocatorInit`, `PlatformAllocatorThreadInit`, `PlatformAllocatorFinalize`, `PlatformAllocatorThreadFinalize`

Each hook is wired as the first arm of its respective `#if`/`#elif` chain, so existing supported platforms are unaffected.

Template files in `examples/CustomPlatform/` and a new subsection in `manual/tracy.tex` document the mechanism.
2026-05-24 15:42:42 +02:00
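
A minimal sketch of the first-arm wiring that f93d17a96f describes, shown for the thread-ID hook only. The `uint32_t` return type of `PlatformGetThreadId`, the fallback arm, and the `GetThreadIdSketch` wrapper are assumptions made for illustration; the actual declarations are the ones in the templates under `examples/CustomPlatform/`.

```cpp
#include <cstdint>
#include <functional>
#include <thread>

// Normally this macro and the function below would come from the header
// named by TRACY_PLATFORM_HEADER. They are inlined here so the sketch
// compiles on its own.
#define TRACY_HAS_CUSTOM_THREAD_ID 1

namespace tracy
{

// Assumed signature. A private platform would query its own kernel here;
// the constant is just a placeholder value.
inline uint32_t PlatformGetThreadId()
{
    return 42u;
}

// Hypothetical wrapper: the custom hook is the first arm of the chain,
// so the existing per-OS arms are only reached when the hook is absent.
inline uint32_t GetThreadIdSketch()
{
#if defined TRACY_HAS_CUSTOM_THREAD_ID
    return PlatformGetThreadId();
#else
    // Stand-in for the existing supported-platform arms.
    return (uint32_t)std::hash<std::thread::id>{}( std::this_thread::get_id() );
#endif
}

}
```

With the macro defined, `tracy::GetThreadIdSketch()` returns whatever the platform function returns; removing the `#define` falls back to the generic arm without touching the rest of the chain.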
Basil Milanich
12dc23f67e Add graphId recycle investigation test
Tests whether CUPTI recycles graphId values after cudaGraphExecDestroy,
which would be the only scenario where the graphLaunchCache in TracyCUDA
could serve stale entries for a non-matching exec handle.

Result (H100, CUDA 12, CUPTI): graphId is a monotonically increasing
counter that is never recycled. 22 create/instantiate/launch/destroy
cycles produced unique IDs ranging from 2 to 65 (incrementing by 3 per
cycle — one unit per node created during graph construction).

This confirms that the stale-cache concern raised in code review is not
a real risk in practice: two distinct exec handles always have distinct
graphIds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:34:51 -05:00
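
The reported range in 12dc23f67e is consistent with a plain counter model: starting at 2 and advancing by 3 per cycle, the 22nd ID is 2 + 3 × 21 = 65. The helper below is a hypothetical sketch of that check; the start value and step are taken from the reported result, not from CUPTI documentation.

```cpp
#include <cstdint>
#include <vector>

// Generates the graphId sequence implied by the observation:
// a monotonically increasing counter with a fixed step per cycle.
inline std::vector<uint32_t> ObservedGraphIds( uint32_t first, uint32_t step, uint32_t cycles )
{
    std::vector<uint32_t> ids;
    ids.reserve( cycles );
    for( uint32_t i = 0; i < cycles; i++ ) ids.push_back( first + step * i );
    return ids;
}
```

Calling `ObservedGraphIds( 2, 3, 22 )` yields 22 distinct values whose first element is 2 and whose last is 65, matching the range quoted in the commit message.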
Basil Milanich
c807367099 Add correlationId investigation test for CUDA Graph launches
Tests two questions:
1. Does relaunching the same cudaGraphExec produce a new correlationId
   each time, or is it reused?
2. Do two different cudaGraphExec handles from the same cudaGraph share
   a graphId?

Results on H100, CUDA 13.1:
- Each launch of the same exec handle gets a strictly unique, monotonically
  increasing correlationId. CPU callback corrId == GPU activity corrId.
  This is formally documented in cupti_activity.h:
    "Each graph launch is assigned a unique correlation ID that is
     identical to the correlation ID in the driver API activity record
     that launched the graph."
- graphId identifies the exec handle (instantiation), not the graph
  definition. Two cudaGraphInstantiate calls on the same graph produce
  different graphIds.

These findings confirm that the cudaGraphCurrentLaunch cache in
matchGraphActivityToAPICall is always refreshed by the first activity
of each new launch before the graphId fallback path is ever used.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:28:34 -05:00
Basil Milanich
d9a1cc06c1 Fix repro build: add -arch=native to use correct GPU architecture
NVCC 13.1 defaults to a PTX version incompatible with the installed
driver (580.105.08), causing kernels to silently fail with "provided
PTX was compiled with an unsupported toolchain". Use -arch=native so
NVCC auto-detects the target GPU (H100, sm_90) at build time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 13:33:05 -05:00
Basil Milanich
4ccaea9f08 Expand repro: multiple graphs, multiple kernels, interleaved launches
Tests:
- Two distinct graphs (different graphIds) on the same stream
- Graph A: kernel + memcpy + kernel (3 nodes)
- Graph B: scale + add + scale (3 nodes)
- 5 interleaved launches of each, stressing the graphId cache
- Expected 30 graph GPU zones total

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:54:08 -05:00
Basil Milanich
7bca9dcd90 Fix CUDA Graph GPU zones with proper cuGraphLaunch correlation
Replace the synthetic APICallInfo hack with proper correlation via
CUPTI_ACTIVITY_KIND_GRAPH_TRACE. When cuGraphLaunch fires an API
callback, its correlationId is stored in cudaCallSiteInfo. The
GRAPH_TRACE activity record carries the same correlationId plus the
graphId, which lets us build a graphId→APICallInfo map. Kernel/memcpy/
memset activities then look up this map via their graphId field.

Key changes:
- Add cuGraphLaunch/cuGraphLaunch_ptsz to cbidDriverTrackers so the
  API callback machinery captures the CPU call site
- Enable CUPTI_ACTIVITY_KIND_GRAPH_TRACE and handle it in
  DoProcessDeviceEvent to populate cudaGraphCurrentLaunch[graphId]
- Add cudaGraphCurrentLaunch map to PersistentState
- Two-pass buffer processing in OnBufferCompleted so GRAPH_TRACE
  records (which complete last on GPU) are processed before the
  kernel/memcpy/memset records that depend on them
- Replace graphId=0 fallback in kernel/memcpy/memset with proper
  cudaGraphCurrentLaunch lookup; fall through to matchError if
  the graphId is not found
- Update repro to include TracyCUDA headers and properly test
  GPU zone correlation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:03:58 -05:00
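
The correlation scheme in 7bca9dcd90 can be sketched roughly as follows. All types and function names here (`ActivityRecord`, `State`, `ProcessBuffer`, and so on) are simplified, hypothetical stand-ins and not the actual TracyCUDA or CUPTI definitions; the sketch only illustrates why a two-pass walk lets node records resolve even when the GRAPH_TRACE record appears later in the same buffer.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified stand-in for the CPU call site captured by the API callback.
struct APICallInfo { uint64_t cpuTimestamp; };

enum class Kind { GraphTrace, Kernel, Memcpy, Memset };

// Simplified stand-in for a CUPTI activity record.
struct ActivityRecord
{
    Kind kind;
    uint32_t correlationId;
    uint32_t graphId;
};

struct State
{
    std::unordered_map<uint32_t, APICallInfo> callSites;          // correlationId -> CPU call site
    std::unordered_map<uint32_t, APICallInfo> graphCurrentLaunch; // graphId -> latest launch
    int zones = 0;
    int matchErrors = 0;
};

// Returns the call site for a graph node activity, or nullptr if the
// graphId has not been seen in any GRAPH_TRACE record.
inline const APICallInfo* MatchGraphActivity( const State& s, const ActivityRecord& r )
{
    auto it = s.graphCurrentLaunch.find( r.graphId );
    return it == s.graphCurrentLaunch.end() ? nullptr : &it->second;
}

inline void ProcessBuffer( State& s, const std::vector<ActivityRecord>& buf )
{
    // Pass 1: GRAPH_TRACE records populate the graphId map, wherever
    // they happen to sit in the buffer.
    for( const auto& r : buf )
    {
        if( r.kind != Kind::GraphTrace ) continue;
        auto it = s.callSites.find( r.correlationId );
        if( it != s.callSites.end() ) s.graphCurrentLaunch[r.graphId] = it->second;
    }
    // Pass 2: kernel/memcpy/memset records resolve through the map;
    // unknown graphIds are counted as match errors.
    for( const auto& r : buf )
    {
        if( r.kind == Kind::GraphTrace ) continue;
        if( MatchGraphActivity( s, r ) ) s.zones++;
        else s.matchErrors++;
    }
}
```

In this model, a buffer holding two node records for graphId 7 followed by the GRAPH_TRACE record for the same launch produces two zones, while a node record whose graphId never appears in a GRAPH_TRACE record is counted as a match error.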
Basil Milanich
6c6999bf01 Add CUDA Graph GPU zone repro
Minimal reproducer showing that CUDA Graph-launched kernels produce
0 GPU zones in Tracy. The repro creates a simple graph (2 kernels +
1 memcpy), launches it 10 times, and expects ~30 GPU zones. Without
the fallback patch, all activity records are dropped by matchError().

Tested on NVIDIA H100, CUDA 13.1.
2026-03-24 09:37:50 -05:00
Naveen Regulla
9879a31fc5 Add Windows on ARM64 with MSVC support for Tracy Profiler
Introduce native Windows ARM64 support across the ToyPathTracer,
profiler, and server code paths when building with MSVC (_M_ARM64).

Key changes:
- MathSimd.h/Maths.h:
   - Fix NEON movemask constants for MSVC/ARM64 by loading from a uint32_t[]
     via vld1q_u32() and using vdupq_n_u32() for highbit.
- enkiTS/TaskScheduler.cpp:
   - Provide Pause() implementation on _M_ARM64 using __yield().
- profiler/winmain.cpp:
   - Restrict AVX feature checks to x86/x64 and skip them on ARM64.
- server/TracyPopcnt.hpp:
   - Implement TracyCountBits using ARM NEON intrinsics.
   - Implement TracyLzcnt using _BitScanReverse64().
2026-02-17 16:42:50 +05:30
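
For the bit-counting helpers mentioned in 9879a31fc5, the relationship between a bit-scan-reverse result and a leading-zero count can be sketched in portable C++. This is not the Tracy implementation: the function names are hypothetical, the loops stand in for `_BitScanReverse64()` and the NEON intrinsics, and returning 64 for a zero input is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for a bit-scan-reverse: index of the highest set bit.
// Precondition: v != 0.
inline int HighestSetBit( uint64_t v )
{
    assert( v != 0 );
    int idx = 0;
    while( v >>= 1 ) idx++;
    return idx;
}

// Leading-zero count derived from the bit-scan result: if the highest
// set bit is at index i, there are 63 - i zeros above it.
inline int LzcntSketch( uint64_t v )
{
    return v == 0 ? 64 : 63 - HighestSetBit( v );
}

// Portable population count (clears the lowest set bit each iteration),
// standing in for the NEON-based path.
inline int CountBitsSketch( uint64_t v )
{
    int n = 0;
    while( v ) { v &= v - 1; n++; }
    return n;
}
```

The zero check in `LzcntSketch` matters because a bit-scan-reverse has no meaningful index to report when no bit is set.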
FrK5E
b8c2e25c3d Fix build of OpenCLVectorAdd example on Linux (#989)
* Add <cmath> to fix the unknown symbol sqrt when building on Linux (gcc). Remove <algorithm> following a clangd suggestion.

* Adjust according to the review comment
2025-02-26 15:37:49 +01:00
Bartosz Taudul
06c7984a16 Move all client headers and sources to public/ directory. 2022-07-17 15:47:38 +02:00
Cloud Han
c9fb07ffe5 remove blocking collect 2022-01-28 11:55:42 +08:00
Cloud Han
9b60c4970e better example 2022-01-25 12:59:35 +08:00
Cloud Han
d660425e00 mute warnings 2022-01-25 12:17:11 +08:00
Cloud Han
e28c562a20 add log and simple statistics for async kernel 2022-01-25 12:17:04 +08:00
Cloud Han
f6894c8d6c Make OpenCL example async 2022-01-25 12:17:03 +08:00
Bartosz Taudul
94bcc6ffcc Update OpenCLVectorAdd.cpp
Cosmetics.
2022-01-20 13:02:38 +01:00
Dantali0n
e73d21f5da Bump OpenCLVectorAdd cmake version to 3.10 2022-01-20 12:44:00 +01:00
Dantali0n
1120f0b07e Cross platform solution to linking OpenCLVectorAdd libraries 2022-01-20 12:42:51 +01:00
Dantali0n
bff83aab4a Fix OpenCLVectorAdd example 2022-01-20 11:09:33 +01:00
Bartosz Taudul
bcdbd2f7d7 Add simple fiber example. 2021-11-20 17:09:20 +01:00
Bartosz Taudul
68d2812e82 Retarget ToyPathTracer to MSVC 2022. 2021-11-14 12:22:52 +01:00
David Farrell
3f2d0bbf6e The ToyPathTracer example uses the D3D11 GPU zones 2021-05-01 17:31:49 -07:00
Bartosz Taudul
3f495f9cd7 Disable edit-and-continue in ToyPathTracer example. 2021-01-26 20:46:55 +01:00
Bartosz Taudul
046638a2b7 Update all sln files to MSVC 2019. 2021-01-26 20:24:47 +01:00
Thales Sabino
a46f83364e Add OpenCL trace support
- Adds the file TracyOpenCL.hpp which contains the API to annotate OpenCL applications
- It works in a similar fashion to the Vulkan annotations
- Adds an example OpenCL application in examples/OpenCLVectorAdd
- Adds "OpenCL Context" to the UI
- Manual entry for annotating OpenCL zones
2020-06-05 10:15:47 +01:00
Bartosz Taudul
ff56f1d2f8 Update example application memory requirements. 2020-01-26 17:30:05 +01:00
Bartosz Taudul
390558b627 Update memory requirements. 2019-11-03 16:29:45 +01:00
Bartosz Taudul
079e21ea43 Leave two threads for smooth operation of profiler. 2019-10-29 22:53:03 +01:00
Bartosz Taudul
3e19fbc2fb Instrument functions. 2019-10-29 22:45:30 +01:00
Bartosz Taudul
516ec6883d Limit number of rendered frames. 2019-10-29 22:45:01 +01:00
Bartosz Taudul
5bcf288333 Integrate Tracy. 2019-10-29 22:27:04 +01:00
Bartosz Taudul
546eeda1cd Ignore compiled shaders. 2019-10-29 22:25:10 +01:00
Bartosz Taudul
0b1eff8b0d Add aras-p's ToyPathTracer.
https://github.com/aras-p/ToyPathTracer
b076563906169aa2f9e6d7218ef85decf81f8f72
2019-10-29 22:21:34 +01:00