Compare commits

84 Commits

Author SHA1 Message Date
Marcos Slomp
1de94aa856 add routine to check for GL features/extensions at run-time 2026-06-15 21:19:12 -07:00
Bartosz Taudul
ec1d5bd3d7 Merge pull request #1402 from wolfpld/slomp/webgpu-example-platform
Switch webgpu example to SDL3, plus patch edge-case for wgpu-native
2026-06-15 23:48:14 +02:00
Marcos Slomp
69af195c98 edge-case bug-fix (could cause wgpu-native to panic) 2026-06-15 13:14:28 -07:00
Marcos Slomp
60699c4a92 fixing win32 builds with SDL3 + WebGPU 2026-06-15 13:14:28 -07:00
Marcos Slomp
cc45cf6046 switch to SDL3 (no cmake fetch, just find_package) 2026-06-15 13:14:28 -07:00
Bartosz Taudul
62560a6429 Add 8-bit length string transfers to the protocol. 2026-06-15 20:43:55 +02:00
Bartosz Taudul
f7b4e177ff Change misleading etc1buf variable to texbuf. 2026-06-15 19:31:06 +02:00
Bartosz Taudul
084daf0516 Force inline send string strlen helpers. 2026-06-15 19:19:13 +02:00
Bartosz Taudul
a98956f2d9 Another typo. 2026-06-15 17:17:32 +02:00
Bartosz Taudul
ac6f0f88fa Actually describe the message severity levels. 2026-06-15 17:09:42 +02:00
Bartosz Taudul
33fccb3530 Typos. 2026-06-15 17:08:39 +02:00
Bartosz Taudul
45576f6972 Merge pull request #1400 from wolfpld/slomp/gl-example
adding OpenGL example (spinning triangle)
2026-06-14 21:39:54 +02:00
Marcos Slomp
17e13bc2e0 SDL2 -> SDL3 2026-06-14 12:18:06 -07:00
Marcos Slomp
ee0c73bf25 switch to SDL2 (no cmake fetch, just find_package) 2026-06-14 11:24:14 -07:00
Bartosz Taudul
343567a3f2 Regenerate markdown manual. 2026-06-14 17:31:49 +02:00
Bartosz Taudul
20b3535623 Use fancy quotes in the manual. 2026-06-14 17:31:32 +02:00
Bartosz Taudul
5298316480 Revert emscripten back to 5.0.7. There are threading problems with 6.0.0.
Specifically, click on the red power off button to go back to the welcome
screen, and the cleanup popup never goes away.
2026-06-14 16:24:13 +02:00
Bartosz Taudul
83719fb29b WASM_BIGINT is enabled by default since emscripten 4.0.0. 2026-06-14 15:17:32 +02:00
Bartosz Taudul
f7d789eddb Split emscripten link options to multiple lines. 2026-06-14 15:09:40 +02:00
Bartosz Taudul
3816b2485e Bump used emscripten version to 6.0.0. 2026-06-14 15:06:53 +02:00
Bartosz Taudul
f8aa88d522 Explicitly disable shared libs for md4c.
Fixes emscripten build.
2026-06-14 15:06:36 +02:00
Bartosz Taudul
b5ae187f76 Disable separate fast model by default. 2026-06-12 22:20:47 +02:00
Marcos Slomp
3f203806e2 X11 workaround check 2026-06-12 13:00:33 -07:00
Bartosz Taudul
15c6b49de2 Mark text embeds as TEXT. 2026-06-12 21:43:51 +02:00
Bartosz Taudul
a153f3a562 Extend Embed macro to support TEXT parameter enabling CRLF to LF conversion. 2026-06-12 21:43:12 +02:00
Bartosz Taudul
c2998310cf Add CRLF to LF conversion support to embed. 2026-06-12 21:42:45 +02:00
Bartosz Taudul
a43b74ed8f Update NEWS. 2026-06-12 21:08:33 +02:00
Bartosz Taudul
d3047f8069 Fix memory discard + callstack.
Bug (High Severity): Wrong queue type in MemDiscardCallstack

In the callstack path of MemDiscardCallstack, the wrong queue type is
sent:

  SendMemDiscard( QueueType::MemDiscard, thread, name );

Every other callstack variant correctly uses its callstack queue type
(MemAllocCallstack, MemFreeCallstack, etc.), but this one uses the
non-callstack type. The SendMemDiscard assertion at line 1026 confirms
MemDiscardCallstack is a valid value.

Impact: The callstack captured by SendCallstackSerial() will be orphaned.
The server processes the event via the non-callstack handler, leaving the
callstack serial data unconsumed, which desynchronizes the serial queue
and corrupts all subsequent events.
2026-06-12 20:30:59 +02:00
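The desynchronization described in the d3047f8069 message can be illustrated with a toy model. This is a sketch under assumptions, not Tracy's server code: the `Event` enum, the `CountOrphanedCallstacks` helper, and the rule that a callstack payload is claimed only by a callstack-aware event are hypothetical simplifications.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified event kinds; the real protocol has many more.
enum class Event { CallstackSerial, MemDiscard, MemDiscardCallstack };

// Walks a serial stream and counts callstack payloads that no event claimed.
// Assumption: only a *Callstack event consumes a pending payload.
inline std::size_t CountOrphanedCallstacks(const std::vector<Event>& stream) {
    std::size_t orphaned = 0;
    bool pending = false;  // a callstack payload is waiting for its owner
    for (Event e : stream) {
        switch (e) {
        case Event::CallstackSerial:
            if (pending) ++orphaned;  // the previous payload was never claimed
            pending = true;
            break;
        case Event::MemDiscardCallstack:
            pending = false;  // callstack-aware handler consumes the payload
            break;
        case Event::MemDiscard:
            // Non-callstack handler: leaves any pending payload untouched.
            break;
        }
    }
    if (pending) ++orphaned;
    return orphaned;
}
```

In this model, sending `MemDiscard` after a callstack payload leaves one orphan, while sending `MemDiscardCallstack` leaves none, which mirrors the one-word nature of the fix.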
Bartosz Taudul
3804b2580a Regenerate markdown manual. 2026-06-12 19:58:06 +02:00
Bartosz Taudul
329ac6c9f1 Document memory discard macro. 2026-06-12 19:57:45 +02:00
Bartosz Taudul
a091bb4ad2 Remove "secure" variant of alloc/free.
Random crashes are not fun. Always use the "secure" code path.
2026-06-12 19:41:17 +02:00
Bartosz Taudul
86b5f43959 Provide proper test directory. 2026-06-12 19:23:28 +02:00
Marcos Slomp
39dc688340 adding Xrandr dependency 2026-06-12 08:44:46 -07:00
Marcos Slomp
832234838b better comments and messages 2026-06-12 07:33:06 -07:00
Marcos Slomp
daba5acfbc more explicit compiler warning message 2026-06-12 07:31:03 -07:00
Bartosz Taudul
07bfe3465e Merge pull request #1356 from wolfpld/slomp/tracy-webgpu
GPU: WebGPU back-end
2026-06-12 12:11:32 +02:00
Bartosz Taudul
0544440a34 Remove unused, extremely broken code. 2026-06-11 22:35:25 +02:00
Marcos Slomp
f287508772 addressing type conversion warning 2026-06-11 13:28:32 -07:00
Bartosz Taudul
f622b97436 Backdate init time when a producer token predates it.
A zone emitted from a shared object initializer runs before the
executable's constructors, so its timestamp precedes s_initTime, which
the server uses as the trace epoch (baseTime). Such a zone converts to
negative trace time and its end no longer satisfies IsEndValid(), which
excludes it from statistics reconstruction and makes it render as
never-ending.

Record the current time when a producer token is created before
s_initTime is constructed and use it as the init time, ensuring no event
timestamp precedes the trace epoch.
2026-06-11 20:05:59 +02:00
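The arithmetic behind f622b97436 can be sketched as follows. The helper names `ToTraceTime` and `BackdatedInitTime` are hypothetical, and taking the minimum of the two times is an assumption that approximates "use the token creation time when it predates the init time"; the actual change may record the time differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Trace time is the raw timestamp relative to the trace epoch (baseTime).
inline int64_t ToTraceTime(int64_t timestamp, int64_t baseTime) {
    return timestamp - baseTime;
}

// Hypothetical helper: choose the earlier of the constructor-recorded init
// time and the time at which the first producer token was created.
inline int64_t BackdatedInitTime(int64_t initTime, int64_t firstTokenTime) {
    return std::min(initTime, firstTokenTime);
}
```

With an init time of 100 and a zone emitted at 90, the zone maps to trace time -10. If the first token was created at 85, backdating moves the epoch to 85 and the same zone maps to 5.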
Bartosz Taudul
dfded9d55d Recover main thread producer orphaned by cross-module init order.
ELF init_priority only orders constructors within a single module. All of
a shared object's initializers run before any of the executable's, so an
instrumented dependency .so emitting a zone from its static initializer
creates the main thread producer token against the zero-initialized
s_queue. The queue constructor then resets the producer list, orphaning
that producer: every zone emitted on the main thread from that point on
is enqueued into blocks no consumer ever iterates and silently lost,
while sampling (worker thread producer) keeps working.

Re-link such a producer right after the queue is constructed. In the
common case, where nothing was emitted during shared object init, this
merely constructs the main thread token eagerly.
2026-06-11 20:05:57 +02:00
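The orphaning and re-linking described in dfded9d55d can be modeled with a toy queue. Everything here is hypothetical: `ToyQueue`, `Producer`, and `Relink` are illustration-only names, and `Construct()` stands in for a constructor that runs after the object was already used in its zero-initialized state.

```cpp
#include <cassert>
#include <vector>

struct Producer { int id; };

// Toy stand-in for the queue and its producer list.
struct ToyQueue {
    std::vector<Producer*> producers;

    void Register(Producer* p) { producers.push_back(p); }

    // Mimics the late-running constructor that resets the producer list.
    void Construct() { producers.clear(); }

    bool IsLinked(const Producer* p) const {
        for (const Producer* q : producers) {
            if (q == p) return true;
        }
        return false;
    }
};

// Hypothetical recovery step: re-link a producer that was registered before
// the queue was constructed.
inline void Relink(ToyQueue& q, Producer* early) {
    if (early && !q.IsLinked(early)) q.Register(early);
}
```

Registering a producer, calling `Construct()`, and then checking `IsLinked()` reproduces the orphaned state; calling `Relink()` afterwards restores the link and is a no-op when the producer is already linked.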
Marcos Slomp
a2555fbb33 fixing Windows/Linux build 2026-06-11 07:37:58 -07:00
Bartosz Taudul
7180ea381f Merge pull request #1401 from Lectem/fix/win32-non-desktop
`TRACY_WIN32_NO_DESKTOP` should use `GetVersionExW` explicitly.
2026-06-11 13:01:24 +02:00
Clément Grégoire
0c74658dd3 TRACY_WIN32_NO_DESKTOP should use GetVersionExW explicitly.
Since we use `RTL_OSVERSIONINFOW` we need to use W version explicitly
2026-06-11 12:06:34 +02:00
Marcos Slomp
debda1df55 scoping the GpuCtx constructor 2026-06-10 18:57:22 -07:00
Marcos Slomp
d98608b022 issue a Tracy warning message when timestamp queries are supported but not properly implemented 2026-06-10 18:52:57 -07:00
Marcos Slomp
eb88c6eba0 adding warning about TracyOpenGL usage on Apple devices 2026-06-10 18:52:09 -07:00
Marcos Slomp
e83429c926 replacing the various platform layers by RGFW 2026-06-10 18:38:48 -07:00
Bartosz Taudul
cc091a99a2 Support key modifiers on emscripten. 2026-06-10 23:39:08 +02:00
Marcos Slomp
1b207d3e2a adding OpenGL example (spinning triangle) 2026-06-10 14:14:54 -07:00
Bartosz Taudul
f89709e99e Prevent click-through when activating annotation. 2026-06-10 22:50:45 +02:00
Bartosz Taudul
a4c5f15312 Rewrite annotations drawing. 2026-06-10 22:42:05 +02:00
Bartosz Taudul
3455fd9f82 Fix regression making existing annotations non-editable after trace load. 2026-06-10 21:48:12 +02:00
Marcos Slomp
cfc046abcd refactoring requirements 2026-06-10 11:09:51 -07:00
Marcos Slomp
0d848c3042 proper device descriptor chaining 2026-06-10 08:03:18 -07:00
Marcos Slomp
54270d3fd5 move window to top when launching from console 2026-06-10 06:24:53 -07:00
Marcos Slomp
1341f98c61 cleanup 2026-06-10 06:24:32 -07:00
Marcos Slomp
6fc279eef4 more descriptive API name 2026-06-10 06:23:57 -07:00
Marcos Slomp
28d3a91980 more changes to allow for null context 2026-06-09 16:26:51 -07:00
Marcos Slomp
0fbb2eaaa4 typo 2026-06-09 16:00:43 -07:00
Marcos Slomp
b27dab4584 remove "spontaneous" callback (better determinism) 2026-06-09 15:59:38 -07:00
Marcos Slomp
75bee5370f cosmetics 2026-06-09 15:58:24 -07:00
Marcos Slomp
e7499458e9 allow scoped instrumentation to no-op with null context 2026-06-09 15:58:06 -07:00
Marcos Slomp
958cb8d7f8 WGPU_PATH fix 2026-06-09 12:56:34 -07:00
Marcos Slomp
59f17794a5 fixing MemWrite casts 2026-06-09 09:06:48 -07:00
Marcos Slomp
3b2c7dbacb fixing webgpu lib linkage based on WGPU_PATH 2026-06-09 09:06:48 -07:00
Marcos Slomp
56ed480ed2 relocating webgpu example 2026-06-09 09:06:48 -07:00
Marcos Slomp
0572c86551 Wayland woes... 2026-06-09 09:06:48 -07:00
Marcos Slomp
6499e3383b fix Linux build 2026-06-09 09:06:48 -07:00
Marcos Slomp
8278ace0c1 build fix 2026-06-09 09:06:48 -07:00
Marcos Slomp
5981eca141 adding webgpu example/demo 2026-06-09 09:06:48 -07:00
Marcos Slomp
1b2856b885 GPU context name 2026-06-09 09:06:48 -07:00
Marcos Slomp
118f18cf4b updating docs 2026-06-09 09:06:48 -07:00
Marcos Slomp
bfbc1d3bee missing interface, and more debugging 2026-06-09 09:06:48 -07:00
Marcos Slomp
831779508f minor fixes/comments 2026-06-09 09:06:48 -07:00
Marcos Slomp
286309af3f refactoring calibration estimations 2026-06-09 09:06:47 -07:00
Marcos Slomp
3db70a2237 refactoring 2026-06-09 09:06:47 -07:00
Marcos Slomp
da952f3f38 more refactoring 2026-06-09 09:06:47 -07:00
Marcos Slomp
efba4685ef more cleanup and refactoring 2026-06-09 09:06:47 -07:00
Marcos Slomp
598984c45d refactoring initial calibration 2026-06-09 09:06:47 -07:00
Marcos Slomp
860011c604 calibration stability 2026-06-09 09:06:47 -07:00
Marcos Slomp
0cdcbfc75d refactoring query resolve 2026-06-09 09:06:47 -07:00
Marcos Slomp
e5d4be95df getting rid of spontaneous callbacks 2026-06-09 09:06:47 -07:00
Marcos Slomp
7b3863d93d redesign... 2026-06-09 09:06:47 -07:00
Marcos Slomp
de2a18d964 initial prototype for WebGPU back-end 2026-06-09 09:06:47 -07:00
35 changed files with 2546 additions and 347 deletions

View File

@@ -20,7 +20,7 @@ jobs:
- name: Setup emscripten
uses: emscripten-core/setup-emsdk@v16
with:
-version: 4.0.10
+version: 5.0.7
- name: Trust git repo
run: git config --global --add safe.directory '*'
- uses: actions/checkout@v4

View File

@@ -6,7 +6,7 @@
"${workspaceFolder}/import",
"${workspaceFolder}/merge",
"${workspaceFolder}/update",
-"${workspaceFolder}/test",
+"${workspaceFolder}/tests/tracy",
"${workspaceFolder}",
],
"cmake.buildDirectory": "${sourceDirectory}/build",

29
NEWS
View File

@@ -5,9 +5,16 @@ here.
vx.xx.x (2026-xx-xx)
--------------------
- API break: removed "secure" variants of memory alloc and free macros. The
secure code path is now always enabled. Migrate by removing "Secure" from
the macros you use, e.g. TracySecureAlloc(...) -> TracyAlloc(...).
- Added tracy-capture-daemon for automated multi-client trace capture.
- Added tracy-merge utility for combining multiple trace files into one.
- Added support for Windows on ARM64 with MSVC.
- Added support for WebGPU.
- Trace-specific settings storage has been completely overhauled. It is now
possible to make the settings sidecar file public, saved next to the trace
file.
- External frames are now omitted in the single-line call stack list visible
in messages list, or in memory allocation info window.
- External frames are now hidden by default in various contexts where they
@@ -147,8 +154,13 @@ vx.xx.x (2026-xx-xx)
- There is now a chapter tree and the manual contents are displayed section
by section.
- Links to chapters are now properly working.
- The "bclogo" blocks are now correctly processed.
- The "bclogo" blocks are now correctly processed and displayed as proper
admonitions.
- The font awesome icons now show as in the rest of the UI.
- Footnotes are now rendered as proper footnotes.
- Tables are now rendered as intended.
- LaTeX math is now converted to readable form.
- Added a button to download the full PDF manual to the user manual window.
- The call stack window will now show the thread the viewed call stack
originates from (if possible).
- "Visible threads" checkboxes in messages, flame graph and wait stacks
@@ -172,6 +184,21 @@ vx.xx.x (2026-xx-xx)
- Fixed NVCC builds.
- Fixed possible lockups in Vulkan timer calibration loop.
- The flame graph view now supports zooming in and panning with the mouse.
- General application crash information polish in the profiler UI.
- The achievements system has been converted to use markdown renderer.
- Offline symbol resolution with the update utility now supports custom
addr2line-compatible tools via -a and -A command line parameters.
Additionally, it is now possible to reset all call stack frame symbols to
unresolved with the -R parameter.
- Periodic recalibration of the clock drift in OpenGL contexts can be enabled
with the TRACY_OPENGL_AUTO_CALIBRATION compilation define. Note that this
requires a full CPU/GPU sync on each calibration event. These events will
not fire more often than once every second.
- Added missing C API for shared locks.
- Implemented semi-unique, nonsense random name generator.
- Can be used to set a trace description.
- Will be used to provide default description for newly added annotations.
- Polished look and feel of annotation regions on the timeline.
v0.13.1 (2025-12-11)

View File

@@ -4,7 +4,7 @@
### A real time, nanosecond resolution, remote telemetry, hybrid frame and sampling profiler for games and other applications.
-Tracy supports profiling CPU (Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as [Rust](https://github.com/nagisa/rust_tracy_client), [Zig](https://github.com/tealsnow/zig-tracy), [C#](https://github.com/clibequilibrium/Tracy-CSharp), [OCaml](https://github.com/imandra-ai/ocaml-tracy), [Odin](https://github.com/oskarnp/odin-tracy), etc.), GPU (All major graphic APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL, CUDA.), memory allocations, locks, context switches, automatically attribute screenshots to captured frames, and much more.
+Tracy supports profiling CPU (Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as [Rust](https://github.com/nagisa/rust_tracy_client), [Zig](https://github.com/tealsnow/zig-tracy), [C#](https://github.com/clibequilibrium/Tracy-CSharp), [OCaml](https://github.com/imandra-ai/ocaml-tracy), [Odin](https://github.com/oskarnp/odin-tracy), etc.), GPU (All major graphics/compute APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL, CUDA, WebGPU.), memory allocations, locks, context switches, automatically attribute screenshots to captured frames, and much more.
- [Documentation](https://github.com/wolfpld/tracy/releases/latest/download/tracy.pdf) for usage and build process instructions
- [Releases](https://github.com/wolfpld/tracy/releases) containing the documentation (`tracy.pdf`) and compiled Windows x64 binaries (`Tracy-<version>.7z`) as assets

View File

@@ -218,6 +218,8 @@ CPMAddPackage(
NAME md4c
GITHUB_REPOSITORY mity/md4c
GIT_TAG 755ce49acdc7cd682d4502b4796db5ed6a1230fb
OPTIONS
"BUILD_SHARED_LIBS OFF"
EXCLUDE_FROM_ALL TRUE
)

View File

@@ -0,0 +1,83 @@
# CMakeLists.txt — OpenGL spinning triangle demo
#
# macOS:
# cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -B build/ninja .
# cmake --build build/ninja
#
# Linux (requires libsdl3-dev libgl1-mesa-dev):
# cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -B build/ninja .
# cmake --build build/ninja
#
# Windows:
# cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -B build/ninja .
# cmake --build build/ninja
cmake_minimum_required(VERSION 3.16)
project(gl_spinning_triangle LANGUAGES C CXX)
# ---------------------------------------------------------------------------
# Tracy root — defaults to three directories above this CMakeLists.txt.
# ---------------------------------------------------------------------------
set(TRACY_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../../..")
option(TRACY_ENABLE "Enable Tracy profiling" ON)
# ---------------------------------------------------------------------------
# Platform — SDL3 (cross-platform windowing, must be installed on the system)
# ---------------------------------------------------------------------------
find_package(SDL3 REQUIRED)
# ---------------------------------------------------------------------------
# GL extension loader — GLEW (Windows + Linux, fetched automatically)
# ---------------------------------------------------------------------------
if(NOT APPLE)
include(FetchContent)
set(glew-cmake_BUILD_SHARED OFF CACHE BOOL "" FORCE)
set(ONLY_LIBS ON CACHE BOOL "" FORCE)
FetchContent_Declare(glew
GIT_REPOSITORY https://github.com/Perlmint/glew-cmake.git
GIT_TAG master # pin to a specific commit for reproducible builds
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(glew)
endif()
set(PLATFORM_SOURCES platform/platform_sdl3.cpp)
if(APPLE)
set(PLATFORM_LIBS SDL3::SDL3 "-framework OpenGL")
elseif(WIN32)
set(PLATFORM_LIBS SDL3::SDL3 opengl32 libglew_static)
else()
set(PLATFORM_LIBS SDL3::SDL3 GL libglew_static)
endif()
# ---------------------------------------------------------------------------
# Target
# ---------------------------------------------------------------------------
add_executable(gl_spinning_triangle
spinning_triangle.cpp
"${TRACY_DIR}/public/TracyClient.cpp"
${PLATFORM_SOURCES}
)
# Suppress upstream warnings from TracyClient.cpp
if(MSVC)
set_source_files_properties("${TRACY_DIR}/public/TracyClient.cpp"
PROPERTIES COMPILE_FLAGS "/w"
)
else()
set_source_files_properties("${TRACY_DIR}/public/TracyClient.cpp"
PROPERTIES COMPILE_FLAGS "-w"
)
endif()
target_compile_features(gl_spinning_triangle PRIVATE cxx_std_17)
if(TRACY_ENABLE)
target_compile_definitions(gl_spinning_triangle PRIVATE TRACY_ENABLE)
endif()
target_include_directories(gl_spinning_triangle PRIVATE
"${TRACY_DIR}/public"
)
target_link_libraries(gl_spinning_triangle PRIVATE ${PLATFORM_LIBS})

View File

@@ -0,0 +1,37 @@
// platform.h — interface between platform-agnostic code and platform backends
//
// Each platform_*.mm / platform_*.cpp file implements these six functions.
// Exactly one backend must be linked into the final binary.
#pragma once
#ifdef __APPLE__
# include <OpenGL/gl3.h>
#else
# include <GL/glew.h>
#endif
// Initialize the windowing system, create a window, and make an OpenGL 3.3
// Core Profile context current on the calling thread.
// Returns true on success.
bool platformInit(int width, int height, const char* title);
// Load OpenGL function pointers (no-op on macOS where the framework exports them directly).
// Must be called after platformInit() while the GL context is current.
// Returns true on success.
bool platformInitGL();
// Elapsed wall-clock time in seconds since platformInit().
double platformGetTime();
// Swap front and back buffers (present the rendered frame).
void platformSwapBuffers();
// Pixel scaling factor relative to the logical window size (1.0 on non-HiDPI displays).
// Must be called after platformInit().
void platformGetPixelDensityScale(float* x, float* y);
// Enter the platform event/render loop.
// Calls render() each frame at ~60 fps.
// Calls shutdown() exactly once before returning.
void platformRunLoop(void (*render)(), void (*shutdown)());
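The `platformRunLoop()` contract declared above (call `render()` every frame, call `shutdown()` exactly once before returning) can be exercised without a window. The following is a minimal headless sketch: `stubRunLoop`, its fixed frame budget, and the counting callbacks are hypothetical and not part of the example.

```cpp
#include <cassert>

// Counters observed by the callbacks; plain function pointers cannot capture
// state, so the sketch uses file-scope variables.
static int gRenderCalls = 0;
static int gShutdownCalls = 0;

void countRender() { ++gRenderCalls; }
void countShutdown() { ++gShutdownCalls; }

// Hypothetical headless loop: runs a fixed number of frames instead of
// polling window events, then calls shutdown() exactly once.
inline void stubRunLoop(int frames, void (*render)(), void (*shutdown)()) {
    for (int i = 0; i < frames; ++i) render();
    shutdown();
}
```

Calling `stubRunLoop(3, countRender, countShutdown)` leaves `gRenderCalls` at 3 and `gShutdownCalls` at 1, which is the behavior a real backend is expected to match apart from frame pacing.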

View File

@@ -0,0 +1,85 @@
// platform_sdl3.cpp — SDL3 windowing backend (cross-platform)
#include "platform.h" // GL headers first (gl3.h / glew.h) so SDL sees guards set
#define SDL_MAIN_HANDLED // we don't want SDL_main
#include <SDL3/SDL.h>
#include <chrono>
#include <cstdio>
static SDL_Window* sWin = nullptr;
static SDL_GLContext sCtx = nullptr;
static std::chrono::steady_clock::time_point sStartTime;
bool platformInit(int width, int height, const char* title) {
if (!SDL_Init(SDL_INIT_VIDEO)) {
fprintf(stderr, "ERROR: SDL_Init failed: %s\n", SDL_GetError());
return false;
}
SDL_GL_SetAttribute(SDL_GL_CONTEXT_MAJOR_VERSION, 3);
SDL_GL_SetAttribute(SDL_GL_CONTEXT_MINOR_VERSION, 3);
SDL_GL_SetAttribute(SDL_GL_CONTEXT_PROFILE_MASK, SDL_GL_CONTEXT_PROFILE_CORE);
sWin = SDL_CreateWindow(title, width, height, SDL_WINDOW_OPENGL);
if (!sWin) {
fprintf(stderr, "ERROR: SDL_CreateWindow failed: %s\n", SDL_GetError());
SDL_Quit();
return false;
}
SDL_SetWindowPosition(sWin, SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED);
sCtx = SDL_GL_CreateContext(sWin);
if (!sCtx) {
fprintf(stderr, "ERROR: SDL_GL_CreateContext failed: %s\n", SDL_GetError());
SDL_DestroyWindow(sWin);
SDL_Quit();
return false;
}
SDL_GL_SetSwapInterval(1);
sStartTime = std::chrono::steady_clock::now();
return true;
}
bool platformInitGL() {
#ifndef __APPLE__
glewExperimental = GL_TRUE;
if (glewInit() != GLEW_OK) {
fprintf(stderr, "Failed to initialize GLEW\n");
return false;
}
#endif
return true;
}
double platformGetTime() {
return std::chrono::duration<double>(
std::chrono::steady_clock::now() - sStartTime).count();
}
void platformSwapBuffers() { SDL_GL_SwapWindow(sWin); }
void platformGetPixelDensityScale(float* x, float* y) {
int pw, ph, ww, wh;
SDL_GetWindowSizeInPixels(sWin, &pw, &ph);
SDL_GetWindowSize(sWin, &ww, &wh);
*x = (ww > 0) ? (float)pw / (float)ww : 1.0f;
*y = (wh > 0) ? (float)ph / (float)wh : 1.0f;
}
void platformRunLoop(void (*render)(), void (*shutdown)()) {
bool running = true;
while (running) {
SDL_Event e;
while (SDL_PollEvent(&e)) {
if (e.type == SDL_EVENT_QUIT) running = false;
if (e.type == SDL_EVENT_KEY_DOWN && e.key.key == SDLK_ESCAPE) running = false;
}
if (running) render();
}
shutdown();
SDL_GL_DestroyContext(sCtx);
SDL_DestroyWindow(sWin);
SDL_Quit();
}
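The per-axis computation in `platformGetPixelDensityScale()` can be isolated into a pure function for clarity. This is a sketch; `densityScale` is a hypothetical helper name that does not appear in the example.

```cpp
#include <cassert>

// Ratio of drawable pixels to logical window units along one axis.
// Falls back to 1.0 when the logical size is not positive, matching the
// guard used in the backend.
inline float densityScale(int pixels, int logical) {
    return (logical > 0)
        ? static_cast<float>(pixels) / static_cast<float>(logical)
        : 1.0f;
}
```

On a HiDPI display where an 800-unit-wide window is backed by 1600 pixels, the scale is 2.0, so a viewport sized from the logical width must be multiplied by it to cover the whole drawable.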

View File

@@ -0,0 +1,145 @@
// spinning_triangle.cpp — OpenGL spinning triangle demo with Tracy GPU profiling.
#ifdef __APPLE__
// NOTE: OpenGL is only available on macOS (no iOS support)
// Including and using anything related to OpenGL on Apple (like <OpenGL/gl3.h>)
// will emit deprecation warnings, unless GL_SILENCE_DEPRECATION is defined
#define GL_SILENCE_DEPRECATION
// NOTE: TracyOpenGL.hpp will not work as expected even on Apple devices that
// support OpenGL, because the OpenGL drivers do not implement ARB_timer_query
// properly (querying GL_TIMESTAMP always resolves to 0). TracyOpenGL.hpp will
// emit a compiler warning, and a Tracy message to the trace/profiler, but the
// program will still run.
#endif
#include "platform/platform.h" // also includes OpenGL headers
#include <tracy/Tracy.hpp>
// NOTE: opt-in toggle for periodic recalibrations during Collect()
#define TRACY_OPENGL_AUTO_CALIBRATION
#include <tracy/TracyOpenGL.hpp>
static const int kWidth = 800;
static const int kHeight = 600;
static GLuint gProgram = 0;
static GLuint gVao = 0;
static GLint gAngleLoc = -1;
// Vertex colors and positions are baked in; rotation is driven by a uniform.
static const char* kVertSrc = R"(
#version 150 core
uniform float uAngle;
const vec2 kPos[3] = vec2[3](
vec2( 0.0, 0.5 ),
vec2(-0.433, -0.25 ),
vec2( 0.433, -0.25 )
);
const vec3 kCol[3] = vec3[3](
vec3(1.0, 0.0, 0.0),
vec3(0.0, 1.0, 0.0),
vec3(0.0, 0.0, 1.0)
);
out vec3 vColor;
void main() {
float c = cos(uAngle);
float s = sin(uAngle);
vec2 p = kPos[gl_VertexID];
gl_Position = vec4(p.x*c - p.y*s, p.x*s + p.y*c, 0.0, 1.0);
vColor = kCol[gl_VertexID];
}
)";
static const char* kFragSrc = R"(
#version 150 core
in vec3 vColor;
out vec4 fragColor;
void main() { fragColor = vec4(vColor, 1.0); }
)";
static GLuint compileShader(GLenum type, const char* src) {
GLuint s = glCreateShader(type);
glShaderSource(s, 1, &src, nullptr);
glCompileShader(s);
GLint ok = 0;
glGetShaderiv(s, GL_COMPILE_STATUS, &ok);
if (!ok) {
char log[512];
glGetShaderInfoLog(s, sizeof(log), nullptr, log);
fprintf(stderr, "Shader compile error: %s\n", log);
glDeleteShader(s);
return 0;
}
return s;
}
static int initGL() {
if (!platformInitGL()) return 1;
TracyGpuContext;
TracyGpuContextName("OpenGL", 6);
GLuint vert = compileShader(GL_VERTEX_SHADER, kVertSrc);
GLuint frag = compileShader(GL_FRAGMENT_SHADER, kFragSrc);
if (!vert || !frag) return 1;
gProgram = glCreateProgram();
glAttachShader(gProgram, vert);
glAttachShader(gProgram, frag);
glLinkProgram(gProgram);
glDeleteShader(vert);
glDeleteShader(frag);
GLint ok = 0;
glGetProgramiv(gProgram, GL_LINK_STATUS, &ok);
if (!ok) {
char log[512];
glGetProgramInfoLog(gProgram, sizeof(log), nullptr, log);
fprintf(stderr, "Program link error: %s\n", log);
return 1;
}
gAngleLoc = glGetUniformLocation(gProgram, "uAngle");
// Core profile requires a bound VAO even with no vertex attributes.
glGenVertexArrays(1, &gVao);
glBindVertexArray(gVao);
glClearColor(0.05f, 0.05f, 0.08f, 1.0f);
float scaleX, scaleY;
platformGetPixelDensityScale(&scaleX, &scaleY);
glViewport(0, 0, (int)(kWidth * scaleX), (int)(kHeight * scaleY));
return 0;
}
static void renderFrame() {
ZoneScoped;
glClear(GL_COLOR_BUFFER_BIT);
glUseProgram(gProgram);
{
TracyGpuZone("triangle draw");
glUniform1f(gAngleLoc, (float)platformGetTime());
glDrawArrays(GL_TRIANGLES, 0, 3);
}
platformSwapBuffers();
TracyGpuCollect;
}
static void shutdown() {
fprintf(stderr, "application is shutting down...\n");
glDeleteVertexArrays(1, &gVao);
glDeleteProgram(gProgram);
}
int main() {
if (!platformInit(kWidth, kHeight, "OpenGL Spinning Triangle"))
return 1;
if (initGL() != 0)
return 2;
platformRunLoop(renderFrame, shutdown);
return 0;
}
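The rotation applied in the vertex shader is a standard 2D counter-clockwise rotation, and it can be checked on the CPU. This is a sketch for illustration only; `Vec2` and `rotateCcw` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };

// Counter-clockwise rotation by `angle` radians, using the same formula as
// the shader: (x*c - y*s, x*s + y*c).
inline Vec2 rotateCcw(Vec2 p, float angle) {
    const float c = std::cos(angle);
    const float s = std::sin(angle);
    return { p.x * c - p.y * s, p.x * s + p.y * c };
}
```

Rotating the top vertex `(0.0, 0.5)` by a quarter turn moves it to approximately `(-0.5, 0.0)`. The three baked-in vertices lie on a circle of radius about 0.5, so the triangle stays inside clip space at every angle.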

View File

@@ -0,0 +1,157 @@
# CMakeLists.txt — WebGPU spinning triangle demo
#
# macOS:
# clang++ -std=c++17 -ObjC++ spinning_triangle.cpp platform/platform_macos.mm \
# -I/path/to/wgpu/include -L/path/to/wgpu/lib -lwgpu_native \
# -Wl,-rpath,@executable_path \
# -framework Cocoa -framework Metal -framework QuartzCore \
# -framework Foundation -framework IOKit -framework IOSurface \
# -o spinning_triangle
#
# Windows (MSVC):
# cl /std:c++17 spinning_triangle.cpp platform/platform_windows.cpp \
# /I\path\to\wgpu\include \path\to\wgpu\lib\wgpu_native.lib \
# user32.lib gdi32.lib /Fe:spinning_triangle.exe
#
# Linux (requires libsdl3-dev):
# g++ -std=c++17 spinning_triangle.cpp platform/platform_wayland.cpp \
# xdg-shell-protocol.c \
# -I/path/to/wgpu/include -L/path/to/wgpu/lib -lwgpu_native \
# -lwayland-client -o spinning_triangle
cmake_minimum_required(VERSION 3.16)
project(spinning_triangle LANGUAGES C CXX)
# ---------------------------------------------------------------------------
# WebGPU backend — set WGPU_PATH to your wgpu-native or Dawn installation.
# The library name differs between backends:
# wgpu-native → wgpu_native
# Dawn → webgpu_dawn
# ---------------------------------------------------------------------------
set(WGPU_PATH "" CACHE PATH "Root of the WebGPU native installation (contains include/ and lib/)")
set(WGPU_LIB "" CACHE STRING "WebGPU library name (wgpu_native or webgpu_dawn); auto-detected if empty")
if(NOT WGPU_PATH)
message(FATAL_ERROR "Set WGPU_PATH to the root of your WebGPU native installation.")
endif()
# When WGPU_PATH changes, discard any previously auto-detected WGPU_LIB so
# detection re-runs against the new path.
if(NOT "${WGPU_PATH}" STREQUAL "${_WGPU_PATH_LAST}")
unset(WGPU_LIB CACHE)
set(WGPU_LIB "" CACHE STRING "WebGPU library name (wgpu_native or webgpu_dawn); auto-detected if empty")
endif()
set(_WGPU_PATH_LAST "${WGPU_PATH}" CACHE INTERNAL "")
if(NOT WGPU_LIB)
unset(_WGPU_NATIVE_LIB CACHE)
unset(_WEBGPU_DAWN_LIB CACHE)
find_library(_WGPU_NATIVE_LIB NAMES wgpu_native wgpu_native.dll PATHS "${WGPU_PATH}/lib" NO_DEFAULT_PATH)
find_library(_WEBGPU_DAWN_LIB NAMES webgpu_dawn PATHS "${WGPU_PATH}/lib" NO_DEFAULT_PATH)
if(_WGPU_NATIVE_LIB)
set(WGPU_LIB "wgpu_native" CACHE STRING "WebGPU library name (wgpu_native or webgpu_dawn); auto-detected if empty" FORCE)
elseif(_WEBGPU_DAWN_LIB)
set(WGPU_LIB "webgpu_dawn" CACHE STRING "WebGPU library name (wgpu_native or webgpu_dawn); auto-detected if empty" FORCE)
else()
message(FATAL_ERROR "Could not detect a WebGPU library in ${WGPU_PATH}/lib. Set WGPU_LIB explicitly (wgpu_native or webgpu_dawn).")
endif()
message(STATUS "WebGPU library auto-detected: ${WGPU_LIB}")
endif()
# ---------------------------------------------------------------------------
# Tracy root — defaults to three directories above this CMakeLists.txt.
# ---------------------------------------------------------------------------
set(TRACY_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../../..")
option(TRACY_ENABLE "Enable Tracy profiling" ON)
# ---------------------------------------------------------------------------
# macOS quarantine — pre-built WebGPU binaries downloaded from the internet
# carry a com.apple.quarantine extended attribute that prevents dyld from
# loading them ("damaged or incomplete" / Gatekeeper block). Strip it once
# at configure time so the linker and the runtime loader can both access the
# library directory without further user intervention.
# ---------------------------------------------------------------------------
if(APPLE)
execute_process(
COMMAND xattr -dr com.apple.quarantine "${WGPU_PATH}/lib"
)
endif()
# ---------------------------------------------------------------------------
# Platform — SDL3 (cross-platform windowing, must be installed on the system)
# ---------------------------------------------------------------------------
find_package(SDL3 REQUIRED)
set(PLATFORM_SOURCES platform/platform_sdl3.cpp)
if(APPLE)
set(PLATFORM_LIBS
SDL3::SDL3
"-framework Cocoa"
"-framework Metal"
"-framework QuartzCore"
"-framework Foundation"
"-framework IOKit"
"-framework IOSurface"
)
elseif(WIN32)
# wgpu-native (Rust stdlib) pull-ins: NtReadFile, GetUserProfileDirectoryW, ...
set(WGPU_NATIVE_WIN32_LIBS ntdll userenv)
# Dawn pull-ins: WKPDID_D3DDebugObjectName GUID, CompareObjectHandles, ...
set(WEBGPU_DAWN_WIN32_LIBS dxguid onecore)
set(PLATFORM_LIBS SDL3::SDL3 ${WGPU_NATIVE_WIN32_LIBS} ${WEBGPU_DAWN_WIN32_LIBS})
else()
set(PLATFORM_LIBS SDL3::SDL3)
endif()
# ---------------------------------------------------------------------------
# Target
# ---------------------------------------------------------------------------
add_executable(spinning_triangle
spinning_triangle.cpp
"${TRACY_DIR}/public/TracyClient.cpp"
${PLATFORM_SOURCES}
)
# Treat TracyClient.cpp as third-party code — suppress all warnings so that
# upstream changes don't pollute our build output.
if(MSVC)
set_source_files_properties("${TRACY_DIR}/public/TracyClient.cpp"
PROPERTIES COMPILE_FLAGS "/w"
)
else()
set_source_files_properties("${TRACY_DIR}/public/TracyClient.cpp"
PROPERTIES COMPILE_FLAGS "-w"
)
endif()
target_compile_features(spinning_triangle PRIVATE cxx_std_17)
if(TRACY_ENABLE)
target_compile_definitions(spinning_triangle PRIVATE TRACY_ENABLE)
endif()
target_include_directories(spinning_triangle PRIVATE
"${WGPU_PATH}/include"
"${TRACY_DIR}/public"
)
target_link_directories(spinning_triangle PRIVATE "${WGPU_PATH}/lib")
target_link_libraries(spinning_triangle PRIVATE
${WGPU_LIB}
${PLATFORM_LIBS}
)
# Embed the rpath so the binary finds the WebGPU dylib/so next to itself.
if(APPLE)
set_target_properties(spinning_triangle PROPERTIES
BUILD_RPATH "${WGPU_PATH}/lib"
INSTALL_RPATH "@executable_path"
)
elseif(UNIX)
set_target_properties(spinning_triangle PROPERTIES
BUILD_RPATH "${WGPU_PATH}/lib"
INSTALL_RPATH "$ORIGIN"
)
endif()

View File

@@ -0,0 +1,23 @@
// platform.h — interface between platform-agnostic code and platform backends
//
// Each platform_*.mm / platform_*.cpp file implements these four functions.
// Exactly one backend must be linked into the final binary.
#pragma once
#include <webgpu/webgpu.h>
// Initialize the windowing system and create a window of the given dimensions.
// Returns true on success.
bool platformInit(int width, int height, const char* title);
// Create a WebGPU surface backed by the platform window.
// Must be called after wgpuCreateInstance() and platformInit().
WGPUSurface platformCreateSurface(WGPUInstance instance);
// Elapsed wall-clock time in seconds since platformInit().
double platformGetTime();
// Enter the platform event/render loop.
// Calls render() each frame at ~60 fps.
// Calls shutdown() exactly once before returning.
void platformRunLoop(void (*render)(), void (*shutdown)());


@@ -0,0 +1,95 @@
// platform_sdl3.cpp — SDL3 windowing backend for the WebGPU example
#include "platform.h" // webgpu/webgpu.h first
#define SDL_MAIN_HANDLED // we don't want SDL_main
#include <SDL3/SDL.h>
#ifdef __APPLE__
# include <SDL3/SDL_metal.h>
#endif
#include <chrono>
#include <cstdio>
static SDL_Window* sWin = nullptr;
static std::chrono::steady_clock::time_point sStartTime;
#ifdef __APPLE__
static SDL_MetalView sMetalView = nullptr;
#endif
bool platformInit(int width, int height, const char* title) {
if (!SDL_Init(SDL_INIT_VIDEO)) {
fprintf(stderr, "ERROR: SDL_Init failed: %s\n", SDL_GetError());
return false;
}
SDL_WindowFlags flags = 0;
#ifdef __APPLE__
flags |= SDL_WINDOW_METAL;
#endif
sWin = SDL_CreateWindow(title, width, height, flags);
if (!sWin) {
fprintf(stderr, "ERROR: SDL_CreateWindow failed: %s\n", SDL_GetError());
SDL_Quit();
return false;
}
SDL_SetWindowPosition(sWin, SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED);
sStartTime = std::chrono::steady_clock::now();
return true;
}
WGPUSurface platformCreateSurface(WGPUInstance instance) {
WGPUSurfaceDescriptor desc = {};
SDL_PropertiesID props = SDL_GetWindowProperties(sWin);
#if defined(__APPLE__)
sMetalView = SDL_Metal_CreateView(sWin);
if (!sMetalView) {
fprintf(stderr, "ERROR: SDL_Metal_CreateView failed\n");
return nullptr;
}
WGPUSurfaceSourceMetalLayer metalDesc = {};
metalDesc.chain.sType = WGPUSType_SurfaceSourceMetalLayer;
metalDesc.layer = SDL_Metal_GetLayer(sMetalView);
desc.nextInChain = &metalDesc.chain;
#elif defined(_WIN32)
WGPUSurfaceSourceWindowsHWND hwndDesc = {};
hwndDesc.chain.sType = WGPUSType_SurfaceSourceWindowsHWND;
hwndDesc.hinstance = SDL_GetPointerProperty(props, SDL_PROP_WINDOW_WIN32_INSTANCE_POINTER, nullptr);
hwndDesc.hwnd = SDL_GetPointerProperty(props, SDL_PROP_WINDOW_WIN32_HWND_POINTER, nullptr);
desc.nextInChain = &hwndDesc.chain;
#else // Linux / X11
WGPUSurfaceSourceXlibWindow x11Desc = {};
x11Desc.chain.sType = WGPUSType_SurfaceSourceXlibWindow;
x11Desc.display = SDL_GetPointerProperty(props, SDL_PROP_WINDOW_X11_DISPLAY_POINTER, nullptr);
x11Desc.window = (uint32_t)SDL_GetNumberProperty(props, SDL_PROP_WINDOW_X11_WINDOW_NUMBER, 0);
desc.nextInChain = &x11Desc.chain;
#endif
return wgpuInstanceCreateSurface(instance, &desc);
}
double platformGetTime() {
return std::chrono::duration<double>(
std::chrono::steady_clock::now() - sStartTime).count();
}
void platformRunLoop(void (*render)(), void (*shutdown)()) {
bool running = true;
while (running) {
SDL_Event e;
while (SDL_PollEvent(&e)) {
if (e.type == SDL_EVENT_QUIT) running = false;
if (e.type == SDL_EVENT_KEY_DOWN && e.key.key == SDLK_ESCAPE) running = false;
}
if (running) render();
}
shutdown();
#ifdef __APPLE__
SDL_Metal_DestroyView(sMetalView);
#endif
SDL_DestroyWindow(sWin);
SDL_Quit();
}


@@ -0,0 +1,352 @@
// spinning_triangle.cpp — platform-agnostic WebGPU spinning triangle demo.
#include "platform/platform.h"
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <webgpu/webgpu.h>
#include <tracy/Tracy.hpp>
#include <tracy/TracyWebGPU.hpp>
// ---------------------------------------------------------------------------
// Globals
// ---------------------------------------------------------------------------
static const int kWidth = 800;
static const int kHeight = 600;
static WGPUInstance gInstance = nullptr;
static WGPUSurface gSurface = nullptr;
static WGPUAdapter gAdapter = nullptr;
static WGPUDevice gDevice = nullptr;
static WGPUQueue gQueue = nullptr;
static WGPURenderPipeline gPipeline = nullptr;
static WGPUBuffer gUniformBuf = nullptr;
static WGPUBindGroup gBindGroup = nullptr;
static TracyWebGPUCtx gTracyCtx = nullptr;
static WGPUTextureFormat gSurfaceFormat = WGPUTextureFormat_BGRA8Unorm;
// TODO: this can become platformError() instead
int error(int code, const char* message) {
fprintf(stderr, "ERROR: %s (code: %d)\n", message, code);
return code;
}
// ---------------------------------------------------------------------------
// WGSL shader — vertex colours baked in, rotation via a uniform float.
// ---------------------------------------------------------------------------
static const char* kShaderSource = R"(
struct Uniforms {
angle: f32,
};
@group(0) @binding(0) var<uniform> u: Uniforms;
struct VSOut {
@builtin(position) pos: vec4f,
@location(0) color: vec3f,
};
@vertex
fn vs_main(@builtin(vertex_index) vi: u32) -> VSOut {
var positions = array<vec2f, 3>(
vec2f( 0.0, 0.5),
vec2f(-0.433, -0.25),
vec2f( 0.433, -0.25),
);
var colors = array<vec3f, 3>(
vec3f(1.0, 0.0, 0.0),
vec3f(0.0, 1.0, 0.0),
vec3f(0.0, 0.0, 1.0),
);
let c = cos(u.angle);
let s = sin(u.angle);
let p = positions[vi];
let rotated = vec2f(p.x * c - p.y * s, p.x * s + p.y * c);
var out: VSOut;
out.pos = vec4f(rotated, 0.0, 1.0);
out.color = colors[vi];
return out;
}
@fragment
fn fs_main(@location(0) color: vec3f) -> @location(0) vec4f {
return vec4f(color, 1.0);
}
)";
// ---------------------------------------------------------------------------
// Adapter / Device request callbacks (current wgpu-native API)
// ---------------------------------------------------------------------------
static void onAdapterReady(WGPURequestAdapterStatus status,
WGPUAdapter adapter,
WGPUStringView message,
void* userdata1, void* /*userdata2*/) {
if (status == WGPURequestAdapterStatus_Success) {
*(WGPUAdapter*)userdata1 = adapter;
} else {
fprintf(stderr, "Adapter request failed: %.*s\n",
(int)message.length, message.data);
}
}
static void onDeviceReady(WGPURequestDeviceStatus status,
WGPUDevice device,
WGPUStringView message,
void* userdata1, void* /*userdata2*/) {
if (status == WGPURequestDeviceStatus_Success) {
*(WGPUDevice*)userdata1 = device;
} else {
fprintf(stderr, "Device request failed: %.*s\n",
(int)message.length, message.data);
}
}
// ---------------------------------------------------------------------------
// WebGPU init
// ---------------------------------------------------------------------------
static int initWebGPU() {
// Adapter
WGPURequestAdapterOptions adapterOpts = {};
adapterOpts.compatibleSurface = gSurface;
WGPURequestAdapterCallbackInfo adapterCB = {};
adapterCB.mode = WGPUCallbackMode_AllowProcessEvents;
adapterCB.callback = onAdapterReady;
adapterCB.userdata1 = &gAdapter;
wgpuInstanceRequestAdapter(gInstance, &adapterOpts, adapterCB);
while (!gAdapter) { wgpuInstanceProcessEvents(gInstance); }
if (!gAdapter) return error(11, "No adapter");
WGPUUncapturedErrorCallbackInfo errorCB = {};
errorCB.callback = [](WGPUDevice const*, WGPUErrorType type,
WGPUStringView message, void*, void*) {
fprintf(stderr, "[WGPU ERROR] type=%d %.*s\n",
(int)type, (int)message.length, message.data);
};
WGPUDeviceDescriptor deviceDesc = {};
deviceDesc.uncapturedErrorCallbackInfo = errorCB;
TracyWebGPUSetupDeviceDescriptor(deviceDesc);
WGPURequestDeviceCallbackInfo deviceCB = {};
deviceCB.mode = WGPUCallbackMode_AllowProcessEvents;
deviceCB.callback = onDeviceReady;
deviceCB.userdata1 = &gDevice;
wgpuAdapterRequestDevice(gAdapter, &deviceDesc, deviceCB);
while (!gDevice) { wgpuInstanceProcessEvents(gInstance); }
if (!gDevice) return error(12, "No device");
gQueue = wgpuDeviceGetQueue(gDevice);
gTracyCtx = TracyWebGPUContext(gInstance, gDevice, gQueue);
TracyWebGPUContextName(gTracyCtx, "WebGPU", 6);
// Configure surface
WGPUSurfaceConfiguration config = {};
config.device = gDevice;
config.format = gSurfaceFormat;
config.usage = WGPUTextureUsage_RenderAttachment;
config.alphaMode = WGPUCompositeAlphaMode_Opaque;
config.width = kWidth;
config.height = kHeight;
config.presentMode = WGPUPresentMode_Fifo;
wgpuSurfaceConfigure(gSurface, &config);
// Shader module
WGPUShaderSourceWGSL wgslSrc = {};
wgslSrc.chain.sType = WGPUSType_ShaderSourceWGSL;
wgslSrc.code = { kShaderSource, WGPU_STRLEN };
WGPUShaderModuleDescriptor smDesc = {};
smDesc.nextInChain = (WGPUChainedStruct*)&wgslSrc;
WGPUShaderModule shaderMod = wgpuDeviceCreateShaderModule(gDevice, &smDesc);
// Uniform buffer (one f32 for rotation angle)
WGPUBufferDescriptor bufDesc = {};
bufDesc.usage = WGPUBufferUsage_Uniform | WGPUBufferUsage_CopyDst;
bufDesc.size = sizeof(float);
gUniformBuf = wgpuDeviceCreateBuffer(gDevice, &bufDesc);
// Bind group layout + bind group
WGPUBindGroupLayoutEntry bglEntry = {};
bglEntry.binding = 0;
bglEntry.visibility = WGPUShaderStage_Vertex;
bglEntry.buffer.type = WGPUBufferBindingType_Uniform;
bglEntry.buffer.minBindingSize = sizeof(float);
WGPUBindGroupLayoutDescriptor bglDesc = {};
bglDesc.entryCount = 1;
bglDesc.entries = &bglEntry;
WGPUBindGroupLayout bgl = wgpuDeviceCreateBindGroupLayout(gDevice, &bglDesc);
WGPUBindGroupEntry bgEntry = {};
bgEntry.binding = 0;
bgEntry.buffer = gUniformBuf;
bgEntry.size = sizeof(float);
WGPUBindGroupDescriptor bgDesc = {};
bgDesc.layout = bgl;
bgDesc.entryCount = 1;
bgDesc.entries = &bgEntry;
gBindGroup = wgpuDeviceCreateBindGroup(gDevice, &bgDesc);
// Pipeline layout
WGPUPipelineLayoutDescriptor plDesc = {};
plDesc.bindGroupLayoutCount = 1;
plDesc.bindGroupLayouts = &bgl;
WGPUPipelineLayout pipelineLayout = wgpuDeviceCreatePipelineLayout(gDevice, &plDesc);
// Render pipeline
WGPUColorTargetState colorTarget = {};
colorTarget.format = gSurfaceFormat;
colorTarget.writeMask = WGPUColorWriteMask_All;
WGPUFragmentState fragState = {};
fragState.module = shaderMod;
fragState.entryPoint = { "fs_main", WGPU_STRLEN };
fragState.targetCount = 1;
fragState.targets = &colorTarget;
WGPURenderPipelineDescriptor rpDesc = {};
rpDesc.layout = pipelineLayout;
rpDesc.vertex.module = shaderMod;
rpDesc.vertex.entryPoint = { "vs_main", WGPU_STRLEN };
rpDesc.primitive.topology = WGPUPrimitiveTopology_TriangleList;
rpDesc.multisample.count = 1;
rpDesc.multisample.mask = 0xFFFFFFFF;
rpDesc.fragment = &fragState;
gPipeline = wgpuDeviceCreateRenderPipeline(gDevice, &rpDesc);
// Cleanup intermediates
wgpuShaderModuleRelease(shaderMod);
wgpuPipelineLayoutRelease(pipelineLayout);
wgpuBindGroupLayoutRelease(bgl);
return 0;
}
// ---------------------------------------------------------------------------
// Frame rendering
// ---------------------------------------------------------------------------
// Returns the surface texture for the current frame, or {.texture=nullptr} on
// a skippable condition (timeout, occlusion) or an error.
static WGPUSurfaceTexture getWindowSurface() {
WGPUSurfaceTexture surfTex = {};
wgpuSurfaceGetCurrentTexture(gSurface, &surfTex);
if (surfTex.status == WGPUSurfaceGetCurrentTextureStatus_SuccessOptimal ||
surfTex.status == WGPUSurfaceGetCurrentTextureStatus_SuccessSuboptimal)
return surfTex;
// Timeout and Occluded are normal OS events (window covered / on a different Space).
bool silent = surfTex.status == WGPUSurfaceGetCurrentTextureStatus_Timeout;
#ifdef WGPU_H_
silent = silent || surfTex.status == (WGPUSurfaceGetCurrentTextureStatus)WGPUSurfaceGetCurrentTextureStatus_Occluded;
#endif
if (!silent)
fprintf(stderr, "Failed to get surface texture (status %d)\n", surfTex.status);
if (surfTex.texture) wgpuTextureRelease(surfTex.texture);
surfTex.texture = nullptr;
return surfTex;
}
static void renderFrame() {
ZoneScoped;
// Update rotation angle
float angle = (float)platformGetTime();
wgpuQueueWriteBuffer(gQueue, gUniformBuf, 0, &angle, sizeof(float));
WGPUSurfaceTexture surfTex = getWindowSurface();
if (!surfTex.texture) return;
WGPUTextureView view = wgpuTextureCreateView(surfTex.texture, nullptr);
// Command encoder
WGPUCommandEncoder encoder = wgpuDeviceCreateCommandEncoder(gDevice, nullptr);
// Render pass
WGPURenderPassColorAttachment colorAtt = {};
colorAtt.view = view;
colorAtt.loadOp = WGPULoadOp_Clear;
colorAtt.storeOp = WGPUStoreOp_Store;
colorAtt.clearValue = { 0.05, 0.05, 0.08, 1.0 };
colorAtt.depthSlice = WGPU_DEPTH_SLICE_UNDEFINED;
WGPURenderPassDescriptor passDesc = {};
passDesc.colorAttachmentCount = 1;
passDesc.colorAttachments = &colorAtt;
{
ZoneScopedN("render-pass");
TracyWebGPUNamedZone(gTracyCtx, tracyZone, encoder, passDesc, "triangle draw", true);
WGPURenderPassEncoder pass = wgpuCommandEncoderBeginRenderPass(encoder, &passDesc);
wgpuRenderPassEncoderSetPipeline(pass, gPipeline);
wgpuRenderPassEncoderSetBindGroup(pass, 0, gBindGroup, 0, nullptr);
wgpuRenderPassEncoderDraw(pass, 3, 1, 0, 0);
wgpuRenderPassEncoderEnd(pass);
wgpuRenderPassEncoderRelease(pass);
}
// Submit
WGPUCommandBuffer cmdBuf = wgpuCommandEncoderFinish(encoder, nullptr);
wgpuQueueSubmit(gQueue, 1, &cmdBuf);
// Present
wgpuSurfacePresent(gSurface);
// Process Events
wgpuInstanceProcessEvents(gInstance);
TracyWebGPUCollect(gTracyCtx);
// Cleanup
wgpuCommandBufferRelease(cmdBuf);
wgpuCommandEncoderRelease(encoder);
wgpuTextureViewRelease(view);
wgpuTextureRelease(surfTex.texture);
}
// ---------------------------------------------------------------------------
// Shutdown
// ---------------------------------------------------------------------------
static void shutdown() {
fprintf(stderr, "application is shutting down...\n");
TracyWebGPUDestroy(gTracyCtx);
if (gBindGroup) wgpuBindGroupRelease(gBindGroup);
if (gUniformBuf) wgpuBufferRelease(gUniformBuf);
if (gPipeline) wgpuRenderPipelineRelease(gPipeline);
if (gQueue) wgpuQueueRelease(gQueue);
if (gDevice) wgpuDeviceRelease(gDevice);
if (gAdapter) wgpuAdapterRelease(gAdapter);
if (gSurface) wgpuSurfaceRelease(gSurface);
if (gInstance) wgpuInstanceRelease(gInstance);
}
// ---------------------------------------------------------------------------
// main
// ---------------------------------------------------------------------------
int main(int argc, char* argv[]) {
if (!platformInit(kWidth, kHeight, "WebGPU Spinning Triangle"))
return 1;
gInstance = wgpuCreateInstance(nullptr);
if (!gInstance) return error(2, "Failed to create WebGPU instance.");
gSurface = platformCreateSurface(gInstance);
if (!gSurface) return error(3, "Failed to create surface.");
if (initWebGPU() != 0) return 4;
platformRunLoop(renderFrame, shutdown);
return 0;
}


@@ -11,7 +11,7 @@ The user manual
**Bartosz Taudul** [\<wolf@nereid.pl\>](mailto:wolf@nereid.pl)
2026-06-09 <https://github.com/wolfpld/tracy>
2026-06-15 <https://github.com/wolfpld/tracy>
# Quick overview {#quick-overview .unnumbered}
@@ -69,7 +69,7 @@ Tracy is a real-time, nanosecond resolution *hybrid frame and sampling profiler*
[^1]: Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as Rust, Zig, C#, OCaml, Odin, etc.
[^2]: All major graphic APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL.
[^2]: All major graphics/compute APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL, CUDA, WebGPU.
While Tracy can perform statistical analysis of sampled call stack data, just like other *statistical profilers* (such as VTune, perf, or Very Sleepy), it mainly focuses on manual markup of the source code. Such markup allows frame-by-frame inspection of the program execution. For example, you will be able to see exactly which functions are called, how much time they require, and how they interact with each other in a multi-threaded environment. In contrast, the statistical analysis may show you the hot spots in your code, but it cannot accurately pinpoint the underlying cause for semi-random frame stutter that may occur every couple of seconds.
@@ -145,7 +145,7 @@ Tracy aims to give you an understanding of the inner workings of a tight loop of
## Sampling profiler
Tracy can periodically sample what the profiled application is doing, which provides detailed performance information at the source line/assembly instruction level. This can give you a deep understanding of how the processor executes the program. Using this information, you can get a coarse view at the call stacks, fine-tune your algorithms, or even 'steal' an optimization performed by one compiler and make it available for the others.
Tracy can periodically sample what the profiled application is doing, which provides detailed performance information at the source line/assembly instruction level. This can give you a deep understanding of how the processor executes the program. Using this information, you can get a coarse view at the call stacks, fine-tune your algorithms, or even "steal" an optimization performed by one compiler and make it available for the others.
On some platforms, it is possible to sample the hardware performance counters, which will give you information not only *where* your program is running slowly, but also *why*.
@@ -279,7 +279,7 @@ Tracy Profiler supports MSVC, GCC, and clang. You will need to use a reasonably
- QNX (x64)
[^11]: Requires **\"OpenCL, OpenGL, and Vulkan Compatibility Pack\"** from Microsoft Store.
[^11]: Requires **"OpenCL, OpenGL, and Vulkan Compatibility Pack"** from Microsoft Store.
Moreover, the following platforms are not supported due to how secretive their owners are but were reported to be working after extending the system integration layer:
@@ -463,7 +463,7 @@ In the case of some programming environments, you may need to take extra steps t
If you are using MSVC, you will need to disable the *Edit And Continue* feature, as it makes the compiler non-conformant to some aspects of the C++ standard. In order to do so, open the project properties and go to C/C++,General,Debug Information Format and make sure *Program Database for Edit And Continue (/ZI)* is *not* selected.
For context, if you experience errors like \"error C2131: expression did not evaluate to a constant\", \"failure was caused by non-constant arguments or reference to a non-constant symbol\", and \"see usage of '`__LINE__Var`'\", chances are that your project has the *Edit And Continue* feature enabled.
For context, if you experience errors like "error C2131: expression did not evaluate to a constant", "failure was caused by non-constant arguments or reference to a non-constant symbol", and "see usage of '`__LINE__Var`'", chances are that your project has the *Edit And Continue* feature enabled.
#### Universal Windows Platform
@@ -641,11 +641,11 @@ Nevertheless, let's look at how we can try to stabilize the profiling data.
Also known as: the *spectre* thing we have to deal with now.
You must be aware that most processors available on the market[^19] *do not* execute machine code linearly, as laid out in the source code. This can lead to counterintuitive timing results reported by Tracy. Trying to get more 'reliable' readings[^20] would require a change in the behavior of the code, and this is not a thing a profiler should do. So instead, Tracy shows you what the hardware is *really* doing.
You must be aware that most processors available on the market[^19] *do not* execute machine code linearly, as laid out in the source code. This can lead to counterintuitive timing results reported by Tracy. Trying to get more "reliable" readings[^20] would require a change in the behavior of the code, and this is not a thing a profiler should do. So instead, Tracy shows you what the hardware is *really* doing.
[^19]: Except low-cost ARM CPUs.
[^20]: And by saying 'reliable,' you do in reality mean: behaving in a way you expect it.
[^20]: And by saying "reliable," you do in reality mean: behaving in a way you expect it.
This is a complex subject, and the details vary from one CPU to another. You can read a brief rundown of the topic at the following address: <https://travisdowns.github.io/blog/2019/06/11/speed-limits.html>.
@@ -675,7 +675,7 @@ While the CPU is more-or-less designed always to be able to work at the advertis
- Do you have complete control over the power profile? Spoiler alert: no. The operating system may run anything at any time on any of the other cores, which will impact the turbo frequency you're able to achieve.
As you can see, this feature basically screams 'unreliable results!' Best keep it disabled and run at the base frequency. Otherwise, your timings won't make much sense. A true example: branchless compression function executing multiple times with the same input data was measured executing at *four* different speeds.
As you can see, this feature basically screams "unreliable results!" Best keep it disabled and run at the base frequency. Otherwise, your timings won't make much sense. A true example: a branchless compression function executing multiple times with the same input data was measured executing at *four* different speeds.
Keep in mind that even at the base frequency, you may hit the thermal limits of the silicon and be down throttled.
@@ -797,7 +797,7 @@ If you want to use X11 instead, you can enable the `LEGACY` option in CMake buil
Special considerations must be taken to run the Tracy server/profiler GUI on Windows on ARM.
Ensure that the **\"OpenCL, OpenGL, and Vulkan Compatibility Pack\"** is installed (from the Microsoft Store), otherwise the GUI will fail to open.
Ensure that the **"OpenCL, OpenGL, and Vulkan Compatibility Pack"** is installed (from the Microsoft Store), otherwise the GUI will fail to open.
### Using an IDE
@@ -813,7 +813,7 @@ The CMake build configuration will begin immediately. It is likely that you will
After the build configuration phase is over, you may want to make some further adjustments to what is being built. The primary place to do this is in the *Project Status* section of the CMake side panel. The two key settings there are also available in the status bar at the bottom of the window:
- The *Folder* setting allows you to choose which Tracy utility you want to work with. Select \"profiler\" for the profiler's GUI.
- The *Folder* setting allows you to choose which Tracy utility you want to work with. Select "profiler" for the profiler's GUI.
- The *Build variant* setting is used to toggle between the debug and release build configurations.
@@ -877,7 +877,7 @@ Some source location data such as function name, file path or line number can be
On selected platforms (see section [2.6](#featurematrix)) Tracy will intercept application crashes[^28]. This serves two purposes. First, the client application will be able to send the remaining profiling data to the server. Second, the server will receive a crash report with the crash reason, call stack at the time of the crash, etc.
[^28]: For example, invalid memory accesses ('segmentation faults', 'null pointer exceptions'), divisions by zero, etc.
[^28]: For example, invalid memory accesses ("segmentation faults", "null pointer exceptions"), divisions by zero, etc.
This is an automatic process, and it doesn't require user interaction. If you are experiencing issues with crash handling you may want to try defining the `TRACY_NO_CRASH_HANDLER` macro to disable the built in crash handling.
@@ -905,6 +905,8 @@ Some features of the profiler are only available on selected platforms. Please r
| GPU zones (OpenGL) |  |  |  |  |  | |  |
| GPU zones (Vulkan) |  |  |  |  |  | |  |
| GPU zones (Metal) |  |  |  | ^*b*^ | ^*b*^ |  |  |
| GPU zones (CUDA) |  |  |  |  |  | ? |  |
| GPU zones (WebGPU) |  |  |  |  |  | ? | ? |
| Call stacks |  |  |  |  |  |  |  |
| Symbol resolution |  |  |  |  |  |  |  |
| Crash handling |  |  |  |  |  |  |  |
@@ -966,7 +968,7 @@ In some cases marked in the manual, Tracy expects you to provide a unique pointe
Here, we pass two string literals with identical contents to two different macros. It is entirely up to the compiler to decide if it will pool these two strings into one pointer or if there will be two instances present in the executable image[^33]. For example, on MSVC, this is controlled by Configuration Properties,C/C++,Code Generation,Enable String Pooling option in the project properties (optimized builds enable it automatically). Note that even if string pooling is used on the compilation unit level, it is still up to the linker to implement pooling across object files.
[^33]: [@ISO:2012:III] §2.14.5.12: \"Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined.\"
[^33]: [@ISO:2012:III] §2.14.5.12: "Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined."
As you can see, making sure that string literals are properly pooled can be surprisingly tricky. To work around this problem, you may employ the following technique. In *one* source file create the unique pointer for a string literal, for example:
@@ -1237,7 +1239,7 @@ Zone objects can't be moved or copied.
### Filtering zones {#filteringzones}
Zone logging can be disabled on a per-zone basis by making use of the `ZoneNamed` macros. Each of the macros takes an `active` argument ('`true`' in the example in section [3.4.2](#multizone)), which will determine whether the zone should be logged.
Zone logging can be disabled on a per-zone basis by making use of the `ZoneNamed` macros. Each of the macros takes an `active` argument ("`true`" in the example in section [3.4.2](#multizone)), which will determine whether the zone should be logged.
Note that this parameter may be a run-time variable, such as a user-controlled switch to enable profiling of a specific part of code only when required.
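As a minimal sketch of such a run-time switch, the snippet below gates a zone on an atomic flag. The flag `gProfilePhysics` and the function `simulatePhysics` are hypothetical names used only for illustration, and the fallback macro exists only so that the snippet compiles when the Tracy headers are not on the include path.

```cpp
#include <atomic>
#include <cassert>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Fallback so the sketch compiles without Tracy; the real macro opens a zone.
#  define ZoneNamedN(varname, name, active) ((void)(active))
#endif

// Hypothetical user-controlled switch, e.g. toggled from a debug menu.
static std::atomic<bool> gProfilePhysics{false};

// Hypothetical workload: sums the integers in [0, steps).
int simulatePhysics(int steps) {
    assert(steps >= 0);
    // The zone is only logged while the switch is on.
    ZoneNamedN(physicsZone, "simulatePhysics", gProfilePhysics.load());
    int sum = 0;
    for (int i = 0; i < steps; ++i) sum += i;
    return sum;
}
```

Flipping `gProfilePhysics` at run time changes only whether the zone is recorded; the instrumented code itself behaves the same either way.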
@@ -1371,13 +1373,27 @@ Fast navigation in large data sets and correlating zones with what was happening
If you want to include color coding of the messages (for example to make critical messages easily visible), you can use `TracyMessageC(text, size, color)` or `TracyMessageLC(text, color)` macros.
Messages can also have different severity levels: `Trace`, `Debug`, `Info`, `Warning`, `Error` or `Fatal`. The `TracyMessage` macros will log messages with the severity `Info`. To log a message with a different severity, you may use the `TracyLogString` macro that regroups all the functionalities from the previous macros. We recommend writing your own macros, wrapping the different severities for easier use. You may provide a color of 0 if you do not want to set a color for this message.
Messages can also have different severity levels:
- *Trace* -- Broadly tracks variable states and events in the software program.
- *Debug* -- Describes variable states and details about specific internal events in the software, that are useful for investigations.
- *Info* -- Describes normal events, which inform on the expected progress and state of your software.
- *Warning* -- Describes potentially dangerous situations caused by unexpected events and states.
- *Error* -- Describes the occurrence of unexpected behavior. Does not interrupt the execution of the software.
- *Fatal* -- Describes a critical event that will lead to a software failure/crash.
The `TracyMessage` macros will log messages with the severity `Info`. To log a message with a different severity, you may use the `TracyLogString` macro that regroups all the functionalities from the previous macros. We recommend writing your own macros, wrapping the different severities for easier use. You may provide a color of 0 if you do not want to set a color for this message.
Examples:
std::string dynStr = "Trace using a dynamic string, blue color, no callstack";
TracyLogString( tracy::MessageSeverity::Trace, 0xFF, 0, dynStr.size(), dynStr.c_str() );
TracyLogString( tracy::MessageSeverity::Warning, 0, TRACY_CALLSTACK, "Warning using a string litteral, no color, capturing the callstack to a depth of TRACY_CALLSTACK" );
TracyLogString( tracy::MessageSeverity::Warning, 0, TRACY_CALLSTACK, "Warning using a string literal, no color, capturing the callstack to a depth of TRACY_CALLSTACK" );
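The wrapper macros recommended above could look roughly like the following sketch. The `APP_LOG_*` names and the `clampVolume` function are hypothetical and not part of Tracy; the argument order simply mirrors the two `TracyLogString` examples above. The fallback definition is only there so that the snippet compiles without the Tracy headers.

```cpp
#include <cassert>
#include <string>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Fallback so the sketch compiles without Tracy; messages are dropped.
#  define TracyLogString(...) ((void)0)
#endif

// Hypothetical wrappers (the APP_LOG_* names are not part of Tracy).
// String literal, no color, call stack captured to a depth of TRACY_CALLSTACK.
#define APP_LOG_WARNING(literal) \
    TracyLogString(tracy::MessageSeverity::Warning, 0, TRACY_CALLSTACK, literal)
// Dynamic std::string, no color, no call stack.
#define APP_LOG_DEBUG_STR(str) \
    TracyLogString(tracy::MessageSeverity::Debug, 0, 0, (str).size(), (str).c_str())

// Hypothetical caller: clamps a volume setting to [0, 100] and logs what it did.
int clampVolume(int percent) {
    int result = percent;
    if (percent < 0 || percent > 100) {
        APP_LOG_WARNING("volume out of range, clamping");
        result = percent < 0 ? 0 : 100;
    }
    const std::string msg = "volume set to " + std::to_string(result);
    APP_LOG_DEBUG_STR(msg);
    assert(result >= 0 && result <= 100);
    return result;
}
```

Keeping one wrapper per severity means call sites never repeat the color and call stack arguments, and the policy for both can be changed in a single place.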
### Application information {#appinfo}
@@ -1416,8 +1432,6 @@ To mark memory events, use the `TracyAlloc(ptr, size)` and `TracyFree(ptr)` macr
free(ptr);
}
In some rare cases (e.g., destruction of TLS block), events may be reported after the profiler is no longer available, which would lead to a crash. To work around this issue, you may use `TracySecureAlloc` and `TracySecureFree` variants of the macros.
> [!IMPORTANT]
> **Important**
>
@@ -1446,9 +1460,11 @@ Sometimes an application will use more than one memory pool. For example, in add
To mark that a separate memory pool is to be tracked you should use the named version of memory macros, for example `TracyAllocN(ptr, size, name)` and `TracyFreeN(ptr, name)`, where `name` is a unique pointer to a string literal (section [3.1.2](#uniquepointers)) identifying the memory pool.
Certain memory allocator designs ("arena allocators") use an always-incrementing pointer to track the next region to allocate and do not support deallocation of individual objects. The only way to free memory with such an allocator is to simultaneously release all the objects that were allocated (reset the allocator state). You can mark such a mass-deallocation event in a memory pool with the `TracyMemoryDiscard(name)` macro.
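A minimal sketch of how this might be wired into an arena allocator follows. The `Arena` class and the `kArenaPool` name are hypothetical, alignment is ignored for brevity, and the fallback macros are only there so that the snippet compiles without the Tracy headers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Fallbacks so the sketch compiles without Tracy; events are dropped.
#  define TracyAllocN(ptr, size, name) ((void)0)
#  define TracyMemoryDiscard(name) ((void)0)
#endif

// Single definition, so the pool name is always the same pointer.
static const char* const kArenaPool = "frame arena";

// Hypothetical bump allocator over a fixed buffer (not part of Tracy).
class Arena {
public:
    explicit Arena(std::size_t capacity) : mStorage(capacity), mOffset(0) {
        assert(capacity > 0);
    }

    // Hands out the next `size` bytes, or nullptr when the arena is full.
    void* allocate(std::size_t size) {
        if (size > mStorage.size() - mOffset) return nullptr;
        void* ptr = mStorage.data() + mOffset;
        mOffset += size;
        TracyAllocN(ptr, size, kArenaPool);
        return ptr;
    }

    // Releases every allocation at once; no per-object TracyFreeN calls.
    void reset() {
        mOffset = 0;
        TracyMemoryDiscard(kArenaPool);
    }

    std::size_t used() const { return mOffset; }

private:
    std::vector<unsigned char> mStorage;
    std::size_t mOffset;
};
```

Each `allocate()` reports a named allocation, while `reset()` reports a single discard event for the whole pool instead of one free per object.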
## GPU profiling {#gpuprofiling}
Tracy provides bindings for profiling OpenGL, Vulkan, Direct3D 11, Direct3D 12, Metal, OpenCL and CUDA execution time on GPU.
Tracy provides bindings for profiling OpenGL, Vulkan, Direct3D 11, Direct3D 12, Metal, OpenCL, CUDA and WebGPU execution time on GPU.
Note that the CPU and GPU timers may be unsynchronized unless you create a calibrated context, but the availability of calibrated contexts is limited. You can try to correct the desynchronization of uncalibrated contexts in the profiler's options (section [5.4](#options)).
@@ -1589,6 +1605,16 @@ Unlike other GPU backends in Tracy, there is no need to call `TracyCUDACollect(c
To stop profiling, call the `TracyCUDAStopProfiling(ctx)` macro.
### WebGPU
WebGPU support is enabled by including the `public/tracy/TracyWebGPU.hpp` header file. Both major implementations of WebGPU (Dawn and wgpu-native) are supported.
Before creating the WebGPU device, call `TracyWebGPUSetupDeviceDescriptor()` to let Tracy request the device features and extensions required for profiling. After the device is created, use the `TracyWebGPUContext()` macro to instantiate the `WebGPUQueueCtx` object required for GPU instrumentation. The object should later be cleaned up with the `TracyWebGPUDestroy()` macro. To set a custom name for the context, use the `TracyWebGPUContextName()` macro.
To instrument a GPU zone, use the various `TracyWebGPU*Zone*()` macros. Note that WebGPU only offers command instrumentation at the "pass"-level. While command-level granularity is possible through implementation-specific WebGPU extensions, Tracy does not support it at the moment. Supply the corresponding WebGPU pass descriptor to the instrumentation macro *before* creating the WebGPU pass encoder.
You are required to periodically collect the GPU events using the `TracyWebGPUCollect()` macro. Good places for collection are: after synchronous waits, after event processing (`wgpuInstanceProcessEvents`), after present drawable calls (`wgpuSurfacePresent`), and inside the completion callback of command queues (`wgpuQueueOnSubmittedWorkDone`).
### ROCm
On Linux, if rocprofiler-sdk is installed, Tracy can automatically trace GPU dispatches and collect performance counter values. If CMake can't find rocprofiler-sdk, you can set the CMake variable `rocprofiler-sdk_DIR` to point it at the correct module directory. Use the `TRACY_ROCPROF_COUNTERS` environment variable with the desired counters separated by commas to control what values are collected. The results will appear for each dispatch in the tool tip and zone detail window. Results are summed across dimensions. You can get a list of the counters available for your hardware with this command:
@@ -1613,13 +1639,13 @@ rocprofv3 -L
Putting more than one GPU zone macro in a single scope features the same issue as with the `ZoneScoped` macros, described in section [3.4.2](#multizone) (but this time the variable name is `___tracy_gpu_zone`).
To solve this problem, in case of OpenGL use the `TracyGpuNamedZone` macro in place of `TracyGpuZone` (or the color variant). The same applies to Vulkan, Direct3D 11/12 and Metal -- replace `TracyVkZone` with `TracyVkNamedZone`, `TracyD3D11Zone`/`TracyD3D12Zone` with `TracyD3D11NamedZone`/`TracyD3D12NamedZone`, and `TracyMetalZone` with `TracyMetalNamedZone`.
To solve this problem, in case of OpenGL use the `TracyGpuNamedZone` macro in place of `TracyGpuZone` (or the color variant). The same applies to Vulkan, Direct3D 11/12, Metal and WebGPU -- replace `TracyVkZone` with `TracyVkNamedZone`, `TracyD3D11Zone`/`TracyD3D12Zone` with `TracyD3D11NamedZone`/`TracyD3D12NamedZone`, `TracyMetalZone` with `TracyMetalNamedZone`, and `TracyWebGPUZone` with `TracyWebGPUNamedZone`.
Remember to provide your name for the created stack variable as the first parameter to the macros.
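The collision can be reproduced without Tracy. The sketch below uses a hypothetical `DemoZone` type and two hypothetical macros, `DEMO_ZONE` and `DEMO_NAMED_ZONE`, which only mimic the general shape of scoped zone macros and are not Tracy's implementation. The unnamed variant always declares a variable with the same fixed name, so a second use in the same scope is a redefinition error. The named variant takes the variable name as its first parameter, which avoids the clash.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a scoped GPU zone: logs begin on construction, end on destruction.
struct DemoZone
{
    DemoZone( std::vector<std::string>& log, const char* name )
        : m_log( log ), m_name( name )
    {
        m_log.push_back( "begin " + m_name );
    }
    ~DemoZone() { m_log.push_back( "end " + m_name ); }

    std::vector<std::string>& m_log;
    std::string m_name;
};

// Unnamed variant: the variable name is fixed, so only one use per scope compiles.
#define DEMO_ZONE( log, name ) DemoZone demo_gpu_zone_( log, name )
// Named variant: the caller supplies the variable name as the first parameter.
#define DEMO_NAMED_ZONE( var, log, name ) DemoZone var( log, name )

std::vector<std::string> TwoZonesInOneScope()
{
    std::vector<std::string> log;
    {
        DEMO_NAMED_ZONE( shadowPass, log, "shadow" );
        DEMO_NAMED_ZONE( lightPass, log, "light" );
        // Two DEMO_ZONE uses here would fail to compile:
        // both would declare a variable called demo_gpu_zone_.
    }
    return log;
}
```

Because the zones are ordinary stack variables, they end in reverse order of declaration when the scope closes.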
### Transient GPU zones
Transient zones (see section [3.4.4](#transientzones) for details) are available in OpenGL, Vulkan, and Direct3D 11/12 macros. Transient zones are not available for Metal at this moment.
Transient zones (see section [3.4.4](#transientzones) for details) are available in OpenGL, Vulkan, Direct3D 11/12 and WebGPU macros. Transient zones are not available for Metal at this moment.
## Fibers
@@ -1664,7 +1690,7 @@ As you can see, there are two threads, `t1` and `t2`, which are simulating worke
## Collecting call stacks {#collectingcallstacks}
Capture of true call stacks can be performed by using macros with the `S` postfix, which require an additional parameter, specifying the depth of call stack to be captured. The greater the depth, the longer it will take to perform capture. Currently you can use the following macros: `ZoneScopedS`, `ZoneScopedNS`, `ZoneScopedCS`, `ZoneScopedNCS`, `TracyAllocS`, `TracyFreeS`, `TracySecureAllocS`, `TracySecureFreeS`, `TracyMessageS`, `TracyMessageLS`, `TracyMessageCS`, `TracyMessageLCS`, `TracyGpuZoneS`, `TracyGpuZoneCS`, `TracyVkZoneS`, `TracyVkZoneCS`, and the named and transient variants.
Capture of true call stacks can be performed by using macros with the `S` postfix, which require an additional parameter, specifying the depth of call stack to be captured. The greater the depth, the longer it will take to perform capture. Currently you can use the following macros: `ZoneScopedS`, `ZoneScopedNS`, `ZoneScopedCS`, `ZoneScopedNCS`, `TracyAllocS`, `TracyFreeS`, `TracyMessageS`, `TracyMessageLS`, `TracyMessageCS`, `TracyMessageLCS`, `TracyGpuZoneS`, `TracyGpuZoneCS`, `TracyVkZoneS`, `TracyVkZoneCS`, and the named and transient variants.
Be aware that call stack collection is a relatively slow operation. Table [6](#CallstackTimes) and figure [6](#CallstackPlot) show how long it took to perform a single capture of varying depth on multiple CPU architectures.
@@ -1788,7 +1814,7 @@ An example implementation of such a lock interface is provided below, as a refer
void DbgHelpUnlock() { ReleaseMutex(dbgHelpLock); }
}
At initialization time, tracy will attempt to preload symbols for device drivers and process modules. As this process can be slow when a lot of pdbs are involved, you can set the `TRACY_NO_DBGHELP_INIT_LOAD` environment variable to \"1\" to disable this behavior and rely on on-demand symbol loading.
At initialization time, tracy will attempt to preload symbols for device drivers and process modules. As this process can be slow when a lot of pdbs are involved, you can set the `TRACY_NO_DBGHELP_INIT_LOAD` environment variable to "1" to disable this behavior and rely on on-demand symbol loading.
#### Disabling resolution of inline frames
@@ -2039,10 +2065,6 @@ Use the following macros in your implementations of `malloc` and `free`:
- `TracyCFree(ptr)`
- `TracyCSecureAlloc(ptr, size)`
- `TracyCSecureFree(ptr)`
Correctly using this functionality can be pretty tricky. You also will need to handle all the memory allocations made by external libraries (which typically allow usage of custom memory allocation functions) and the allocations made by system functions. If you can't track such an allocation, you will need to make sure freeing is not reported[^56].
[^56]: It's not uncommon to see a pattern where a system function returns some allocated memory, which you then need to release.
@@ -2108,7 +2130,7 @@ To see how you should use this API, you should look at the reference implementat
> [!IMPORTANT]
> **Important**
>
> A common mistake is to skip the zone \"`isActive`\" check. When using `TRACY_ON_DEMAND`, you need to read the value of `TracyCIsConnected` once, and check the same value for both\
> A common mistake is to skip the zone "`isActive`" check. When using `TRACY_ON_DEMAND`, you need to read the value of `TracyCIsConnected` once, and check the same value for both\
> `___tracy_emit_gpu_zone_begin_alloc` and `___tracy_emit_gpu_zone_end`. Tracy may otherwise receive a zone end without a zone begin.
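The pattern described above can be sketched with mock types. `MockProfiler` and `LatchedGpuZone` are hypothetical names used only for illustration; the counters stand in for the begin and end emit calls, and the `connected` flag stands in for the value you would read from `TracyCIsConnected`. The point is that the connection state is read once at zone begin and the stored value is reused at zone end.

```cpp
#include <atomic>

// Hypothetical stand-in for the profiler: a connection flag plus emit counters.
struct MockProfiler
{
    std::atomic<bool> connected{ false };
    int begins = 0;
    int ends = 0;
};

// Reads the connection state once and uses the same value for begin and end.
class LatchedGpuZone
{
public:
    explicit LatchedGpuZone( MockProfiler& profiler )
        : m_profiler( profiler ), m_active( profiler.connected.load() )
    {
        if( m_active ) ++m_profiler.begins;   // stands in for the zone begin emit
    }
    ~LatchedGpuZone()
    {
        if( m_active ) ++m_profiler.ends;     // latched value, not a fresh read
    }

private:
    MockProfiler& m_profiler;
    const bool m_active;
};

// The connection state flips while the zone is open.
// Returns true if the begin and end counts still match.
bool BalancedAcrossStateChange( bool connectedAtBegin )
{
    MockProfiler profiler;
    profiler.connected = connectedAtBegin;
    {
        LatchedGpuZone zone( profiler );
        profiler.connected = !connectedAtBegin;
    }
    return profiler.begins == profiler.ends;
}
```

If the destructor re-read `connected` instead of using `m_active`, a connection made while the zone was open would produce an end event with no matching begin.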
### Fibers
@@ -2546,9 +2568,9 @@ To collect frame images, use `tracy_image(image, w, h, offset, flip)` call.
Use the following calls in your implementations of allocator/deallocator:
- `tracy_memory_alloc(ptr, size, name, depth, secure)`
- `tracy_memory_alloc(ptr, size, name, depth)`
- `tracy_memory_free(ptr, name, depth, secure)`
- `tracy_memory_free(ptr, name, depth)`
Correctly using this functionality can be pretty tricky, especially in Fortran. In Fortran, you cannot redefine the `allocate` statement (or the `deallocate` statement) to profile memory usage by `allocatable` variables. However, many applications[^58] use a stack allocator on a memory tape, where these calls can be useful.
@@ -2600,7 +2622,7 @@ Some profiling data can only be retrieved using the kernel facilities, which are
[^59]: To make this easier, you can run MSVC with admin privileges, which will be inherited by your program when you start it from within the IDE.
As this system-level tracing functionality is part of the automated collection process, no user intervention is necessary to enable it (assuming that the program was granted the rights needed). However, if, for some reason, you would want to prevent your application from trying to access kernel data, you may recompile your program with the `TRACY_NO_SYSTEM_TRACING` define. If you want to disable this functionality dynamically at runtime instead, you can set the `TRACY_NO_SYSTEM_TRACING` environment variable to \"1\".
As this system-level tracing functionality is part of the automated collection process, no user intervention is necessary to enable it (assuming that the program was granted the rights needed). However, if, for some reason, you would want to prevent your application from trying to access kernel data, you may recompile your program with the `TRACY_NO_SYSTEM_TRACING` define. If you want to disable this functionality dynamically at runtime instead, you can set the `TRACY_NO_SYSTEM_TRACING` environment variable to "1".
> [!TIP]
> **What should be granted privileges?**
@@ -2737,7 +2759,7 @@ It would be best to be extra careful when working with non-public code, as parts
### Vertical synchronization
On Windows and Linux, Tracy will automatically capture hardware Vsync events, provided that the application has access to the kernel data (privilege elevation may be needed, see section [3.17.1](#privilegeelevation)). These events will be reported as '`[x] Vsync`' frame sets, where `x` is the identifier of a specific monitor. Note that hardware vertical synchronization might not correspond to the one seen by your application due to desktop composition, command queue buffering, and so on. Also, in some instances, when there is nothing to update on the screen, the graphic driver may choose to stop issuing screen refresh. As a result, there may be periods where no vertical synchronization events are reported.
On Windows and Linux, Tracy will automatically capture hardware Vsync events, provided that the application has access to the kernel data (privilege elevation may be needed, see section [3.17.1](#privilegeelevation)). These events will be reported as "`[x] Vsync`" frame sets, where `x` is the identifier of a specific monitor. Note that hardware vertical synchronization might not correspond to the one seen by your application due to desktop composition, command queue buffering, and so on. Also, in some instances, when there is nothing to update on the screen, the graphic driver may choose to stop issuing screen refresh. As a result, there may be periods where no vertical synchronization events are reported.
Use the `TRACY_NO_VSYNC_CAPTURE` macro to disable capture of Vsync events.
@@ -2887,7 +2909,7 @@ The * Wrench* button opens the about dialog, which also contains a number of
The client *address entry* field and the  *Connect* button are used to connect to a running client[^66]. You can use the connection history button  to display a list of commonly used targets, from which you can quickly select an address. You can remove entries from this list by hovering the  mouse cursor over an entry and pressing the Delete button on the keyboard.
[^66]: Note that a custom port may be provided here, for example by entering '127.0.0.1:1234'.
[^66]: Note that a custom port may be provided here, for example by entering "127.0.0.1:1234".
If you want to open a trace that you have stored on the disk, you can do so by pressing the  *Open saved trace* button.
@@ -3403,13 +3425,13 @@ You will find the zones with locks and their associated threads on this combined
The left-hand side *index area* of the timeline view displays various labels (threads, locks), which can be categorized in the following way:
- *Light blue label* -- GPU context. Multi-threaded Vulkan, OpenCL, Direct3D 12 and Metal contexts are additionally split into separate threads.
- *Light blue label* -- GPU context. Multi-threaded Vulkan, OpenCL, Direct3D 12, Metal and WebGPU contexts are additionally split into separate threads.
- *Pink label* -- CPU data graph.
- *White label* -- A CPU thread. It will be replaced by a bright red label in a thread that has crashed (section [2.5](#crashhandling)). If automated sampling was performed, clicking the left mouse button on the * ghost zones* button will switch zone display mode between 'instrumented' and 'ghost.'
- *White label* -- A CPU thread. It will be replaced by a bright red label in a thread that has crashed (section [2.5](#crashhandling)). If automated sampling was performed, clicking the left mouse button on the * ghost zones* button will switch zone display mode between "instrumented" and "ghost."
- *Green label* -- Fiber, coroutine, or any other sort of cooperative multitasking 'green thread.'
- *Green label* -- Fiber, coroutine, or any other sort of cooperative multitasking "green thread."
- *Light red label* -- Indicates a lock.
@@ -3437,7 +3459,7 @@ In an example in figure [18](#zoneslocks) you can see that there are two thread
Meanwhile, the *Streaming thread* is performing some *Streaming jobs*. The first *Streaming job* sent a message (section [3.7](#messagelog)). In addition to being listed in the message log, it is indicated by a triangle over the thread separator. When multiple messages are in one place, the triangle outline shape changes to a filled triangle.
The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/Metal/OpenCL context in place of a thread name.
The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/Metal/OpenCL/CUDA/WebGPU context in place of a thread name.
Hovering the  mouse pointer over a zone will highlight all other zones that have the exact source location with a white outline. Clicking the left mouse button on a zone will open the zone information window (section [5.14](#zoneinfo)). Holding the Ctrl key and clicking the left mouse button on a zone will open the zone statistics window (section [5.7](#findzone)). Clicking the middle mouse button on a zone will zoom the view to the extent of the zone.
@@ -3659,7 +3681,7 @@ In this window, you can set various trace-related options. For example, the time
- * Draw CPU usage graph* -- You can disable drawing of the CPU usage graph here.
- * Draw GPU zones* -- Allows disabling display of OpenGL/Vulkan/Metal/Direct3D/OpenCL zones. The *GPU zones* drop-down allows disabling individual GPU contexts and setting CPU/GPU drift offsets of uncalibrated contexts (see section [3.9](#gpuprofiling) for more information). The * Auto* button automatically measures the GPU drift value[^78].
- * Draw GPU zones* -- Allows disabling display of OpenGL/Vulkan/Metal/Direct3D/OpenCL/CUDA/WebGPU zones. The *GPU zones* drop-down allows disabling individual GPU contexts and setting CPU/GPU drift offsets of uncalibrated contexts (see section [3.9](#gpuprofiling) for more information). The * Auto* button automatically measures the GPU drift value[^78].
- * Draw CPU zones* -- Determines whether CPU zones are displayed.
@@ -3738,7 +3760,7 @@ You can filter the message list in the following ways:
- By the originating thread in the * Visible threads* drop-down.
- By matching the message text to the expression in the * Filter messages* entry field. Multiple filter expressions can be comma-separated (e.g. 'warn, info' will match messages containing strings 'warn' *or* 'info'). You can exclude matches by preceding the term with a minus character (e.g., '-debug' will hide all messages containing the string 'debug').
- By matching the message text to the expression in the * Filter messages* entry field. Multiple filter expressions can be comma-separated (e.g. "warn, info" will match messages containing strings "warn" *or* "info"). You can exclude matches by preceding the term with a minus character (e.g., "-debug" will hide all messages containing the string "debug").
- By message source, distinguishing between * User* messages and internal * Tracy* diagnostics.
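The text filter semantics described in the list above can be sketched as follows. This is an illustration, not the profiler's code: `SplitTerms` and `MessagePasses` are hypothetical helpers, and the details the manual does not state (trimming spaces around terms, case-sensitive matching, an empty filter passing everything) are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a filter on commas and trims surrounding spaces (trimming is an assumption).
static std::vector<std::string> SplitTerms( const std::string& filter )
{
    std::vector<std::string> terms;
    std::size_t pos = 0;
    while( pos <= filter.size() )
    {
        std::size_t comma = filter.find( ',', pos );
        if( comma == std::string::npos ) comma = filter.size();
        const std::string term = filter.substr( pos, comma - pos );
        const std::size_t first = term.find_first_not_of( ' ' );
        const std::size_t last = term.find_last_not_of( ' ' );
        if( first != std::string::npos ) terms.push_back( term.substr( first, last - first + 1 ) );
        pos = comma + 1;
    }
    return terms;
}

// A message passes if it contains no excluded term and, when any plain
// terms are given, contains at least one of them.
bool MessagePasses( const std::string& message, const std::string& filter )
{
    bool hasInclude = false;
    bool included = false;
    for( const auto& term : SplitTerms( filter ) )
    {
        if( term[0] == '-' )
        {
            // Exclusion term; a bare "-" is ignored.
            if( term.size() > 1 && message.find( term.substr( 1 ) ) != std::string::npos ) return false;
        }
        else
        {
            hasInclude = true;
            if( message.find( term ) != std::string::npos ) included = true;
        }
    }
    return !hasInclude || included;
}
```

With the filter `warn, info`, a message containing either word passes. With `-debug`, every message passes except those containing `debug`.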
@@ -4215,7 +4237,7 @@ The zone information window displays detailed information about a single zone. T
- Timing information.
- If the profiler performed context switch capture (section [3.17.3](#contextswitches)) and a thread was suspended during zone execution, a list of wait regions will be displayed, with complete information about the timing, CPU migrations, and wait reasons. If CPU topology data is available (section [3.17.4](#cputopology)), the profiler will mark zone migrations across cores with 'C' and migrations across packages -- with 'P.' In some cases, context switch data might be incomplete[^92], in which case a warning message will be displayed.
- If the profiler performed context switch capture (section [3.17.3](#contextswitches)) and a thread was suspended during zone execution, a list of wait regions will be displayed, with complete information about the timing, CPU migrations, and wait reasons. If CPU topology data is available (section [3.17.4](#cputopology)), the profiler will mark zone migrations across cores with "C" and migrations across packages -- with "P." In some cases, context switch data might be incomplete[^92], in which case a warning message will be displayed.
- Memory events list, both summarized and a list of individual allocation/free events (see section [5.10](#memorywindow) for more information on the memory events list).
@@ -4275,7 +4297,7 @@ This window shows the frames contained in the selected call stack. Information a
A single stack frame may have multiple function call places associated with it. This happens in the case of inlined function calls. Such entries will be displayed in the call stack window, with *inline* in place of frame number[^94].
[^94]: Or '' icon in case of call stack tooltips.
[^94]: Or "" icon in case of call stack tooltips.
If the call stack shows a crash (see section [2.5](#crashhandling)), a red * Crash* label will be displayed. Clicking it will center the timeline on the crash. Note that the crash stack may contain OS or Tracy frames where the crash was intercepted and processed.
@@ -4289,7 +4311,7 @@ Stack frame location may be displayed in the following number of ways, depending
- *Symbol address* -- displays begin address of the function containing the frame address.
In some cases, it may not be possible to decode stack frame addresses correctly. Such frames will be presented with a dimmed '`[ntdll.dll]`' name of the image containing the frame address, or simply '`[unknown]`' if the profiler cannot retrieve even this information. Additionally, '`[kernel]`' is used to indicate unknown stack frames within the operating system's internal routines.
In some cases, it may not be possible to decode stack frame addresses correctly. Such frames will be presented with a dimmed "`[ntdll.dll]`" name of the image containing the frame address, or simply "`[unknown]`" if the profiler cannot retrieve even this information. Additionally, "`[kernel]`" is used to indicate unknown stack frames within the operating system's internal routines.
External frames from system libraries are hidden by default. Enabling the * External* option will show these frames, which can be useful for debugging issues in external code. When external frames are displayed, they are dimmed out.
@@ -4388,7 +4410,7 @@ Some modes may be unavailable in some circumstances (missing or outdated source
#### Source mode
This is pretty much the source file view window, but with the ability to select one of the source files that the compiler used to build the symbol. Additionally, each source file line that produced machine code in the symbol will show a count of associated assembly instructions, displayed with an '`@`' prefix, and will be marked with grey color on the scroll bar. Due to how optimizing compilers work, some lines may seemingly not produce any machine code, for example, because iterating a loop counter index might have been reduced to advancing a data pointer. Some other lines may have a disproportionate amount of associated instructions, e.g., when the compiler applied a loop unrolling optimization. This varies from case to case and from compiler to compiler.
This is pretty much the source file view window, but with the ability to select one of the source files that the compiler used to build the symbol. Additionally, each source file line that produced machine code in the symbol will show a count of associated assembly instructions, displayed with an "`@`" prefix, and will be marked with grey color on the scroll bar. Due to how optimizing compilers work, some lines may seemingly not produce any machine code, for example, because iterating a loop counter index might have been reduced to advancing a data pointer. Some other lines may have a disproportionate amount of associated instructions, e.g., when the compiler applied a loop unrolling optimization. This varies from case to case and from compiler to compiler.
The *Propagate inlines* option, available when sample data is present, will enable propagation of the instruction costs down the local call stack. For example, suppose a base function in the symbol issues a call to an inlined function (which may not be readily visible due to being contained in another source file). In that case, any cost attributed to the inlined function will be visible in the base function. Because the cost information is added to all the entries in the local call stacks, it is possible to see seemingly nonsense total cost values when this feature is enabled. To quickly toggle this on or off, you may also press the X key.
@@ -4403,7 +4425,7 @@ If the * Source locations* option is selected, each line of the assembly code
>
> In some cases, it may be challenging to understand what is being displayed in the disassembly. For example, calling the `std::lower_bound` function may generate multiple levels of inlined functions: first, we enter the search algorithm, then the comparison functions, which in turn may be lambdas that call even more external code, and so on. In such an event, you will most likely see that some external code is taking a long time to execute, and you will be none the wiser on improving things.
>
> The local call stack for an assembly instruction represents all the inline function calls *within the symbol* (hence the 'local' part), which were made to reach the instruction. Deeper inspection of the local call stack, including navigation to the source call site of each participating inline function, can be performed through the context menu accessible by pressing the right mouse button on the source location.
> The local call stack for an assembly instruction represents all the inline function calls *within the symbol* (hence the "local" part), which were made to reach the instruction. Deeper inspection of the local call stack, including navigation to the source call site of each participating inline function, can be performed through the context menu accessible by pressing the right mouse button on the source location.
Selecting the * Raw code* option will enable the display of raw machine code bytes for each line. Individual bytes are displayed with interwoven colors to make reading easier.
@@ -4487,9 +4509,9 @@ In this mode, the source and assembly panes will be displayed together, providin
#### Instruction pointer cost statistics
If automated call stack sampling (see chapter [3.17.5](#sampling)) was performed, additional profiling information will be available. The first column of source and assembly views will contain percentage counts of collected instruction pointer samples for each displayed line, both in numerical and graphical bar form. You can use this information to determine which function line takes the most time. The displayed percentage values are heat map color-coded, with the lowest values mapped to dark red and the highest to bright yellow. The color code will appear next to the percentage value and on the scroll bar so that you can identify 'hot' places in the code at a glance.
If automated call stack sampling (see chapter [3.17.5](#sampling)) was performed, additional profiling information will be available. The first column of source and assembly views will contain percentage counts of collected instruction pointer samples for each displayed line, both in numerical and graphical bar form. You can use this information to determine which function line takes the most time. The displayed percentage values are heat map color-coded, with the lowest values mapped to dark red and the highest to bright yellow. The color code will appear next to the percentage value and on the scroll bar so that you can identify "hot" places in the code at a glance.
By default, samples are displayed only within the selected symbol, in isolation. In some cases, you may, however, want to include samples from functions that the selected symbol called. To do so, enable the * Child calls* option, which you may also temporarily toggle by holding the Z key. You can also click the  drop down control to display a child call distribution list[^101], which shows each known function[^102] that the symbol called. Make sure to familiarize yourself with section [5.15.1](#readingcallstacks) to be able to read the results correctly. Each child call on the list has an attributed time cost, which is also displayed as a percentage of the child calls (\"% Calls\") and the percentage of the total symbol time (\"% Total\").
By default, samples are displayed only within the selected symbol, in isolation. In some cases, you may, however, want to include samples from functions that the selected symbol called. To do so, enable the * Child calls* option, which you may also temporarily toggle by holding the Z key. You can also click the  drop down control to display a child call distribution list[^101], which shows each known function[^102] that the symbol called. Make sure to familiarize yourself with section [5.15.1](#readingcallstacks) to be able to read the results correctly. Each child call on the list has an attributed time cost, which is also displayed as a percentage of the child calls ("% Calls") and the percentage of the total symbol time ("% Total").
[^101]: The height of the list can be changed by dragging the separator bar.
@@ -4710,7 +4732,7 @@ There are no ideal LLM providers, but here are some options:
- *llama-swap* (<https://github.com/mostlygeek/llama-swap>) -- Wrapper for llama.cpp that allows model selection. Recommended to augment the above.
- *LM Studio* (<https://lmstudio.ai/>) -- It is easy to install on all platforms and has a GUI. But it is overwhelming when it comes to the number of options it offers. Some people may question the licensing. Its features lag behind llama.cpp. Manual configuration of each model is required. To get it to work properly, go to its settings (using the gear icon in the bottom right corner of the program window), then select the Developer tab and enable \"When applicable, separate `reasoning_content` and `content` in API responses\".
- *LM Studio* (<https://lmstudio.ai/>) -- It is easy to install on all platforms and has a GUI. But it is overwhelming when it comes to the number of options it offers. Some people may question the licensing. Its features lag behind llama.cpp. Manual configuration of each model is required. To get it to work properly, go to its settings (using the gear icon in the bottom right corner of the program window), then select the Developer tab and enable "When applicable, separate `reasoning_content` and `content` in API responses".
## Model selection
@@ -4727,19 +4749,19 @@ A good *starting* point that will work fairly well on almost any hardware is the
> [!TIP]
> **Model quantization**
>
> Running a model with full 32-bit floating-point weights is not feasible due to memory requirements. Instead, the model parameters are quantized, for which 4 bits is typically the sweet spot. In general, the lower the parameter precision, the more \"dumbed down\" the model becomes. However, the loss of model coherence due to quantization is less than the benefit of being able to run a larger model.
> Running a model with full 32-bit floating-point weights is not feasible due to memory requirements. Instead, the model parameters are quantized, for which 4 bits is typically the sweet spot. In general, the lower the parameter precision, the more "dumbed down" the model becomes. However, the loss of model coherence due to quantization is less than the benefit of being able to run a larger model.
> [!TIP]
> **Model size**
>
> Another thing to consider when selecting a model is its size, which is typically measured in billions of parameters (weights) and written as 4B, for example. The model size determines how much memory, computation, and time are required to run it. Generally, the larger the model, the \"smarter\" its responses will be.
> Another thing to consider when selecting a model is its size, which is typically measured in billions of parameters (weights) and written as 4B, for example. The model size determines how much memory, computation, and time are required to run it. Generally, the larger the model, the "smarter" its responses will be.
>
> Most modern models will be \"Mixture of Experts\", or MoE, and their size will be denoted, for example, 35B-A3B. This means that the model size is 35B, but only 3B parameters are active and used to compute the next token. In practice, this means that the model has knowledge closer to the full, dense 35B model but speed and GPU memory requirements closer to the fast 3B model.
> Most modern models will be "Mixture of Experts", or "MoE", and their size will be denoted, for example, 35B-A3B. This means that the model size is 35B, but only 3B parameters are active and used to compute the next token. In practice, this means that the model has knowledge closer to the full, dense 35B model but speed and GPU memory requirements closer to the fast 3B model.
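As a rough worked example of how model size and quantization combine, the storage needed for the weights alone is approximately the parameter count multiplied by the bits per weight, divided by eight. The helper below is a hypothetical back-of-the-envelope sketch; it ignores quantization metadata, the context, and any runtime buffers, so real figures will be somewhat higher.

```cpp
// Approximate storage for the model weights alone, in bytes.
// Assumption: every parameter is stored with the same number of bits.
constexpr double WeightBytes( double parameters, double bitsPerWeight )
{
    return parameters * bitsPerWeight / 8.0;
}

// A 4B model with 32-bit weights: 4e9 * 32 / 8 = 16e9 bytes (about 16 GB).
constexpr double kDense4BFull = WeightBytes( 4e9, 32.0 );
// The same 4B model quantized to 4 bits: 4e9 * 4 / 8 = 2e9 bytes (about 2 GB).
constexpr double kDense4BQuant = WeightBytes( 4e9, 4.0 );
// A 35B model quantized to 4 bits: 35e9 * 4 / 8 = 17.5e9 bytes (about 17.5 GB).
constexpr double k35BQuant = WeightBytes( 35e9, 4.0 );
```

The eightfold difference between the first two values is why quantization makes it practical to run a larger model on the same hardware.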
> [!TIP]
> **Context size**
>
> The model size only indicates the minimum memory requirement. For the model to operate properly, you also need to set the context size, which determines how much information from the conversation the model can \"remember\". This size is measured in tokens, and a very rough approximation is that each token is a combination of three or four letters.
> The model size only indicates the minimum memory requirement. For the model to operate properly, you also need to set the context size, which determines how much information from the conversation the model can "remember". This size is measured in tokens, and a very rough approximation is that each token is a combination of three or four letters.
>
> Each token present in the context window may require a fairly large amount of memory, and that can quickly add up to gigabytes. Some modern models use solutions that greatly reduce context memory requirements, but that varies from model to model. If needed, the KV cache used for context can be quantized, just like model parameters. In this case, the recommended size per weight is 8 bits.
>
@@ -4749,7 +4771,7 @@ A good *starting* point that will work fairly well on almost any hardware is the
Sometimes Tracy needs to do some language processing where speed is more important than the smarts. The default setting is to use the chat model with the reasoning disabled, which is fine for most applications.
It may be more convenient to use a small, quick model instead, in which case enable the *Fast model* checkbox and choose the second model. To save precious GPU resources for the chat model, you may want to keep this model entirely in system RAM (set `-ngl 0` for llama.cpp or set \"GPU offload\" to 0 in LM Studio) and disable the KV cache offload to GPU (set `-nkvo` for llama.cpp or disable \"Offload KV Cache to GPU Memory\" in LM Studio).
It may be more convenient to use a small, quick model instead, in which case enable the *Fast model* checkbox and choose the second model. To save precious GPU resources for the chat model, you may want to keep this model entirely in system RAM (set `-ngl 0` for llama.cpp or set "GPU offload" to 0 in LM Studio) and disable the KV cache offload to GPU (set `-nkvo` for llama.cpp or disable "Offload KV Cache to GPU Memory" in LM Studio).
### Embedding model
@@ -4817,7 +4839,7 @@ The horizontal meter directly below shows how much of the context size has been
The chat section contains the conversation with the automated assistant with alternating user and assistant turns. Clicking on the * User* role icon removes the chat content up to the selected question. Similarly, clicking on the * Assistant* role icon removes the conversation content up to this point and generates another response from the assistant.
The assistant may give preliminary replies to the user, for example, *\"I will now check the source of function foobar\"*, followed by performing the actual check, then a continuation of the reply, such as *\"Now I can see that\...\"*. To make reading these tiered replies easier, only the most recent reply is printed in normal text, while the preliminary responses are dimmed out.
The assistant may give preliminary replies to the user, for example, "I will now check the source of function foobar", followed by performing the actual check, then a continuation of the reply, such as "Now I can see that\...". To make reading these tiered replies easier, only the most recent reply is printed in normal text, while the preliminary responses are dimmed out.
Each assistant reply contains a note about the language model that was used and the time it took to generate the text.

@@ -141,7 +141,7 @@ There's much more Tracy can do, which can be explored by carefully reading this
\section{A quick look at Tracy Profiler}
\label{quicklook}
Tracy is a real-time, nanosecond resolution \emph{hybrid frame and sampling profiler} that you can use for remote or embedded telemetry of games and other applications. It can profile CPU\footnote{Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as Rust, Zig, C\#, OCaml, Odin, etc.}, GPU\footnote{All major graphic APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL.}, memory allocations, locks, context switches, automatically attribute screenshots to captured frames, and much more.
Tracy is a real-time, nanosecond resolution \emph{hybrid frame and sampling profiler} that you can use for remote or embedded telemetry of games and other applications. It can profile CPU\footnote{Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as Rust, Zig, C\#, OCaml, Odin, etc.}, GPU\footnote{All major graphics/compute APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL, CUDA, WebGPU.}, memory allocations, locks, context switches, automatically attribute screenshots to captured frames, and much more.
While Tracy can perform statistical analysis of sampled call stack data, just like other \emph{statistical profilers} (such as VTune, perf, or Very Sleepy), it mainly focuses on manual markup of the source code. Such markup allows frame-by-frame inspection of the program execution. For example, you will be able to see exactly which functions are called, how much time they require, and how they interact with each other in a multi-threaded environment. In contrast, the statistical analysis may show you the hot spots in your code, but it cannot accurately pinpoint the underlying cause for semi-random frame stutter that may occur every couple of seconds.
@@ -228,7 +228,7 @@ Tracy aims to give you an understanding of the inner workings of a tight loop of
\subsection{Sampling profiler}
Tracy can periodically sample what the profiled application is doing, which provides detailed performance information at the source line/assembly instruction level. This can give you a deep understanding of how the processor executes the program. Using this information, you can get a coarse view at the call stacks, fine-tune your algorithms, or even 'steal' an optimization performed by one compiler and make it available for the others.
Tracy can periodically sample what the profiled application is doing, which provides detailed performance information at the source line/assembly instruction level. This can give you a deep understanding of how the processor executes the program. Using this information, you can get a coarse view at the call stacks, fine-tune your algorithms, or even \enquote{steal} an optimization performed by one compiler and make it available for the others.
On some platforms, it is possible to sample the hardware performance counters, which will give you information not only \emph{where} your program is running slowly, but also \emph{why}.
@@ -369,7 +369,7 @@ Note that these binary releases require AVX2 instruction set support on the proc
Tracy Profiler supports MSVC, GCC, and clang. You will need to use a reasonably recent version of the compiler due to the C++11 requirement. The following platforms are confirmed to be working (this is not a complete list):
\begin{itemize}
\item Windows (x86, x64, ARM64\footnote{Requires \textbf{"OpenCL, OpenGL, and Vulkan Compatibility Pack"} from Microsoft Store.})
\item Windows (x86, x64, ARM64\footnote{Requires \textbf{\enquote{OpenCL, OpenGL, and Vulkan Compatibility Pack}} from Microsoft Store.})
\item Linux (x86, x64, ARM, ARM64)
\item Android (ARM, ARM64, x86)
\item FreeBSD (x64)
@@ -594,7 +594,7 @@ In the case of some programming environments, you may need to take extra steps t
If you are using MSVC, you will need to disable the \emph{Edit And Continue} feature, as it makes the compiler non-conformant to some aspects of the C++ standard. In order to do so, open the project properties and go to \menu[,]{C/C++,General,Debug Information Format} and make sure \emph{Program Database for Edit And Continue (/ZI)} is \emph{not} selected.
For context, if you experience errors like "error C2131: expression did not evaluate to a constant", "failure was caused by non-constant arguments or reference to a non-constant symbol", and "see usage of '\texttt{\_\_LINE\_\_Var}'", chances are that your project has the \emph{Edit And Continue} feature enabled.
For context, if you experience errors like \enquote{error C2131: expression did not evaluate to a constant}, \enquote{failure was caused by non-constant arguments or reference to a non-constant symbol}, and \enquote{see usage of \enquote{\texttt{\_\_LINE\_\_Var}}}, chances are that your project has the \emph{Edit And Continue} feature enabled.
\paragraph{Universal Windows Platform}
@@ -778,7 +778,7 @@ Nevertheless, let's look at how we can try to stabilize the profiling data.
Also known as: the \emph{spectre} thing we have to deal with now.
You must be aware that most processors available on the market\footnote{Except low-cost ARM CPUs.} \emph{do not} execute machine code linearly, as laid out in the source code. This can lead to counterintuitive timing results reported by Tracy. Trying to get more 'reliable' readings\footnote{And by saying 'reliable,' you do in reality mean: behaving in a way you expect it.} would require a change in the behavior of the code, and this is not a thing a profiler should do. So instead, Tracy shows you what the hardware is \emph{really} doing.
You must be aware that most processors available on the market\footnote{Except low-cost ARM CPUs.} \emph{do not} execute machine code linearly, as laid out in the source code. This can lead to counterintuitive timing results reported by Tracy. Trying to get more \enquote{reliable} readings\footnote{And by saying \enquote{reliable,} you do in reality mean: behaving in a way you expect it.} would require a change in the behavior of the code, and this is not a thing a profiler should do. So instead, Tracy shows you what the hardware is \emph{really} doing.
This is a complex subject, and the details vary from one CPU to another. You can read a brief rundown of the topic at the following address: \url{https://travisdowns.github.io/blog/2019/06/11/speed-limits.html}.
@@ -805,7 +805,7 @@ While the CPU is more-or-less designed always to be able to work at the advertis
\item Do you have complete control over the power profile? Spoiler alert: no. The operating system may run anything at any time on any of the other cores, which will impact the turbo frequency you're able to achieve.
\end{itemize}
As you can see, this feature basically screams 'unreliable results!' Best keep it disabled and run at the base frequency. Otherwise, your timings won't make much sense. A true example: branchless compression function executing multiple times with the same input data was measured executing at \emph{four} different speeds.
As you can see, this feature basically screams \enquote{unreliable results!} Best keep it disabled and run at the base frequency. Otherwise, your timings won't make much sense. A true example: branchless compression function executing multiple times with the same input data was measured executing at \emph{four} different speeds.
Keep in mind that even at the base frequency, you may hit the thermal limits of the silicon and be down throttled.
@@ -940,7 +940,7 @@ Please don't ask about window decorations in Gnome. The current behavior is the
Special considerations must be taken to run the Tracy server/profiler GUI on Windows on ARM.
Ensure that the \textbf{"OpenCL, OpenGL, and Vulkan Compatibility Pack"} is installed (from the Microsoft Store), otherwise the GUI will fail to open.
Ensure that the \textbf{\enquote{OpenCL, OpenGL, and Vulkan Compatibility Pack}} is installed (from the Microsoft Store), otherwise the GUI will fail to open.
\subsubsection{Using an IDE}
@@ -955,7 +955,7 @@ The CMake build configuration will begin immediately. It is likely that you will
After the build configuration phase is over, you may want to make some further adjustments to what is being built. The primary place to do this is in the \emph{Project Status} section of the CMake side panel. The two key settings there are also available in the status bar at the bottom of the window:
\begin{itemize}
\item The \emph{Folder} setting allows you to choose which Tracy utility you want to work with. Select "profiler" for the profiler's GUI.
\item The \emph{Folder} setting allows you to choose which Tracy utility you want to work with. Select \enquote{profiler} for the profiler's GUI.
\item The \emph{Build variant} setting is used to toggle between the debug and release build configurations.
\end{itemize}
@@ -1016,7 +1016,7 @@ void Graphics::Render()
\subsection{Crash handling}
\label{crashhandling}
On selected platforms (see section~\ref{featurematrix}) Tracy will intercept application crashes\footnote{For example, invalid memory accesses ('segmentation faults', 'null pointer exceptions'), divisions by zero, etc.}. This serves two purposes. First, the client application will be able to send the remaining profiling data to the server. Second, the server will receive a crash report with the crash reason, call stack at the time of the crash, etc.
On selected platforms (see section~\ref{featurematrix}) Tracy will intercept application crashes\footnote{For example, invalid memory accesses (\enquote{segmentation faults}, \enquote{null pointer exceptions}), divisions by zero, etc.}. This serves two purposes. First, the client application will be able to send the remaining profiling data to the server. Second, the server will receive a crash report with the crash reason, call stack at the time of the crash, etc.
This is an automatic process, and it doesn't require user interaction. If you are experiencing issues with crash handling you may want to try defining the \texttt{TRACY\_NO\_CRASH\_HANDLER} macro to disable the built in crash handling.
@@ -1050,6 +1050,8 @@ Memory & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faXm
GPU zones (OpenGL) & \faCheck & \faCheck & \faCheck & \faPoo & \faPoo & & \faXmark \\
GPU zones (Vulkan) & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & & \faXmark \\
GPU zones (Metal) & \faXmark & \faXmark & \faXmark & \faCheck\textsuperscript{\emph{b}} & \faCheck\textsuperscript{\emph{b}} & \faXmark & \faXmark \\
GPU zones (CUDA) & \faCheck & \faCheck & \faXmark & \faXmark & \faXmark & \faQuestion & \faXmark \\
GPU zones (WebGPU) & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faQuestion & \faQuestion \\
Call stacks & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faXmark \\
Symbol resolution & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck \\
Crash handling & \faCheck & \faCheck & \faCheck & \faXmark & \faXmark & \faXmark & \faXmark \\
@@ -1108,7 +1110,7 @@ FrameMarkStart("Audio processing");
FrameMarkEnd("Audio processing");
\end{lstlisting}
Here, we pass two string literals with identical contents to two different macros. It is entirely up to the compiler to decide if it will pool these two strings into one pointer or if there will be two instances present in the executable image\footnote{\cite{ISO:2012:III} \S 2.14.5.12: "Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined."}. For example, on MSVC, this is controlled by \menu[,]{Configuration Properties,C/C++,Code Generation,Enable String Pooling} option in the project properties (optimized builds enable it automatically). Note that even if string pooling is used on the compilation unit level, it is still up to the linker to implement pooling across object files.
Here, we pass two string literals with identical contents to two different macros. It is entirely up to the compiler to decide if it will pool these two strings into one pointer or if there will be two instances present in the executable image\footnote{\cite{ISO:2012:III} \S 2.14.5.12: \enquote{Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined.}}. For example, on MSVC, this is controlled by \menu[,]{Configuration Properties,C/C++,Code Generation,Enable String Pooling} option in the project properties (optimized builds enable it automatically). Note that even if string pooling is used on the compilation unit level, it is still up to the linker to implement pooling across object files.
As you can see, making sure that string literals are properly pooled can be surprisingly tricky. To work around this problem, you may employ the following technique. In \emph{one} source file create the unique pointer for a string literal, for example:
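A minimal sketch of this technique is shown below. The identifier \texttt{sl\_AudioProcessing} and the two helper functions are hypothetical names chosen for illustration, and the two call sites are placed in one file only so that the sketch compiles on its own.

```cpp
#include <cassert>

// Hypothetical name. In a real project, define this pointer in exactly ONE
// source file and declare it elsewhere as:
//     extern const char* const sl_AudioProcessing;
// (the defining file must also see that extern declaration, because a
// namespace-scope const object otherwise has internal linkage).
const char* const sl_AudioProcessing = "Audio processing";

// Stand-ins for two call sites that would normally live in different source files.
const char* NameUsedAtFrameStart() { return sl_AudioProcessing; }
const char* NameUsedAtFrameEnd()   { return sl_AudioProcessing; }

// Both call sites observe the same address, whatever the compiler or linker
// decides about pooling identical string literals.
void VerifyCallSitesShareOnePointer()
{
    assert( NameUsedAtFrameStart() == NameUsedAtFrameEnd() );
}
```

With such a pointer in place, it can be passed to both \texttt{FrameMarkStart} and \texttt{FrameMarkEnd} instead of repeating the string literal.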
@@ -1406,7 +1408,7 @@ It is valid to set the \texttt{Zone1} text or name \emph{only} in places \circle
\subsubsection{Filtering zones}
\label{filteringzones}
Zone logging can be disabled on a per-zone basis by making use of the \texttt{ZoneNamed} macros. Each of the macros takes an \texttt{active} argument ('\texttt{true}' in the example in section~\ref{multizone}), which will determine whether the zone should be logged.
Zone logging can be disabled on a per-zone basis by making use of the \texttt{ZoneNamed} macros. Each of the macros takes an \texttt{active} argument (\enquote{\texttt{true}} in the example in section~\ref{multizone}), which will determine whether the zone should be logged.
Note that this parameter may be a run-time variable, such as a user-controlled switch to enable profiling of a specific part of code only when required.
@@ -1558,14 +1560,24 @@ Fast navigation in large data sets and correlating zones with what was happening
If you want to include color coding of the messages (for example to make critical messages easily visible), you can use \texttt{TracyMessageC(text, size, color)} or \texttt{TracyMessageLC(text, color)} macros.
Messages can also have different severity levels: \texttt{Trace}, \texttt{Debug}, \texttt{Info}, \texttt{Warning}, \texttt{Error} or \texttt{Fatal}.
Messages can also have different severity levels:
\begin{itemize}
\item \emph{Trace} -- Broadly tracks variable states and events in the program.
\item \emph{Debug} -- Describes variable states and details about specific internal events in the software that are useful for investigations.
\item \emph{Info} -- Describes normal events, which inform on the expected progress and state of your software.
\item \emph{Warning} -- Describes potentially dangerous situations caused by unexpected events and states.
\item \emph{Error} -- Describes the occurrence of unexpected behavior. Does not interrupt the execution of the software.
\item \emph{Fatal} -- Describes a critical event that will lead to a software failure/crash.
\end{itemize}
The \texttt{TracyMessage} macros log messages with the \texttt{Info} severity. To log a message with a different severity, use the \texttt{TracyLogString} macro, which combines the functionality of all the previous macros. We recommend writing your own macros that wrap the different severities for easier use. Pass a color of 0 if you do not want to set a color for the message.
Examples:
\begin{lstlisting}
std::string dynStr = "Trace using a dynamic string, blue color, no callstack";
TracyLogString( tracy::MessageSeverity::Trace, 0xFF, 0, dynStr.size(), dynStr.c_str() );
TracyLogString( tracy::MessageSeverity::Warning, 0, TRACY_CALLSTACK, "Warning using a string litteral, no color, capturing the callstack to a depth of TRACY_CALLSTACK" );
TracyLogString( tracy::MessageSeverity::Warning, 0, TRACY_CALLSTACK, "Warning using a string literal, no color, capturing the callstack to a depth of TRACY_CALLSTACK" );
\end{lstlisting}
@@ -1607,8 +1619,6 @@ void operator delete(void* ptr) noexcept
}
\end{lstlisting}
In some rare cases (e.g., destruction of TLS block), events may be reported after the profiler is no longer available, which would lead to a crash. To work around this issue, you may use \texttt{TracySecureAlloc} and \texttt{TracySecureFree} variants of the macros.
\begin{bclogo}[
noborder=true,
couleur=black!5,
@@ -1642,10 +1652,12 @@ Sometimes an application will use more than one memory pool. For example, in add
To mark that a separate memory pool is to be tracked you should use the named version of memory macros, for example \texttt{TracyAllocN(ptr, size, name)} and \texttt{TracyFreeN(ptr, name)}, where \texttt{name} is a unique pointer to a string literal (section~\ref{uniquepointers}) identifying the memory pool.
Certain memory allocator designs (\enquote{arena allocators}) use an always-incrementing pointer to track the next region to allocate and do not support deallocation of individual objects. The only way to free memory with such an allocator is to simultaneously release all the objects that were allocated (reset the allocator state). You can mark such a mass-deallocation event in a memory pool with the \texttt{TracyMemoryDiscard(name)} macro.
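The following is a minimal sketch of such an arena allocator, assuming a fixed-size backing buffer. The \texttt{Arena} class and the \texttt{kFrameArena} pool name are hypothetical. The fallback macro definitions exist only so that the sketch compiles without Tracy; in a real project the macros come from the Tracy header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fallbacks so the sketch builds standalone; with Tracy included these are already defined.
#ifndef TracyAllocN
#define TracyAllocN(ptr, size, name)
#endif
#ifndef TracyMemoryDiscard
#define TracyMemoryDiscard(name)
#endif

// Hypothetical unique pointer identifying the memory pool.
static const char* const kFrameArena = "Frame arena";

class Arena
{
public:
    explicit Arena( std::size_t capacity ) : m_buf( capacity ), m_offset( 0 ) {}

    // Bump-allocates `size` bytes aligned to `align` (a power of two).
    // Returns nullptr when the arena is exhausted.
    void* Alloc( std::size_t size, std::size_t align = alignof( std::max_align_t ) )
    {
        assert( align != 0 && ( align & ( align - 1 ) ) == 0 );
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>( m_buf.data() );
        const std::uintptr_t mask = static_cast<std::uintptr_t>( align ) - 1;
        const std::uintptr_t aligned = ( base + m_offset + mask ) & ~mask;
        const std::size_t start = static_cast<std::size_t>( aligned - base );
        if( start > m_buf.size() || size > m_buf.size() - start ) return nullptr;
        void* ptr = m_buf.data() + start;
        m_offset = start + size;
        TracyAllocN( ptr, size, kFrameArena );   // report each allocation to the named pool
        return ptr;
    }

    // Releases every object at once. There are no individual frees to report,
    // so the mass deallocation is marked with a single discard event.
    void Reset()
    {
        m_offset = 0;
        TracyMemoryDiscard( kFrameArena );
    }

    std::size_t Used() const { return m_offset; }

private:
    std::vector<unsigned char> m_buf;
    std::size_t m_offset;
};
```

A typical use would be to call \texttt{Alloc} throughout a frame and \texttt{Reset} once at the end of it.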
\subsection{GPU profiling}
\label{gpuprofiling}
Tracy provides bindings for profiling OpenGL, Vulkan, Direct3D 11, Direct3D 12, Metal, OpenCL and CUDA execution time on GPU.
Tracy provides bindings for profiling OpenGL, Vulkan, Direct3D 11, Direct3D 12, Metal, OpenCL, CUDA and WebGPU execution time on GPU.
Note that the CPU and GPU timers may be unsynchronized unless you create a calibrated context, but the availability of calibrated contexts is limited. You can try to correct the desynchronization of uncalibrated contexts in the profiler's options (section~\ref{options}).
@@ -1791,6 +1803,16 @@ Unlike other GPU backends in Tracy, there is no need to call \texttt{TracyCUDACo
To stop profiling, call the \texttt{TracyCUDAStopProfiling(ctx)} macro.
\subsubsection{WebGPU}
WebGPU support is enabled by including the \texttt{public/tracy/TracyWebGPU.hpp} header file. Both major implementations of WebGPU (Dawn and wgpu-native) are supported.
Before creating the WebGPU device, call \texttt{TracyWebGPUSetupDeviceDescriptor()} to let Tracy request the device features and extensions required for profiling. After the device is created, use the \texttt{TracyWebGPUContext()} macro to instantiate the \texttt{WebGPUQueueCtx} object required for GPU instrumentation. The object should later be cleaned up with the \texttt{TracyWebGPUDestroy()} macro. To set a custom name for the context, use the \texttt{TracyWebGPUContextName()} macro.
To instrument a GPU zone, use the various \texttt{TracyWebGPU*Zone*()} macros. Note that WebGPU only offers command instrumentation at the \enquote{pass}-level. While command-level granularity is possible through implementation-specific WebGPU extensions, Tracy does not support it at the moment. Supply the corresponding WebGPU pass descriptor to the instrumentation macro \textit{before} creating the WebGPU pass encoder.
You are required to periodically collect the GPU events using the \texttt{TracyWebGPUCollect()} macro. Good places for collection are: after synchronous waits, after event processing (\texttt{wgpuInstanceProcessEvents}), after present drawable calls (\texttt{wgpuSurfacePresent}), and inside the completion callback of command queues (\texttt{wgpuQueueOnSubmittedWorkDone}).
\subsubsection{ROCm}
On Linux, if rocprofiler-sdk is installed, tracy can automatically trace GPU dispatches and collect
@@ -1824,13 +1846,13 @@ sudo amd-smi set -g 0 -l stable_std
Putting more than one GPU zone macro in a single scope features the same issue as with the \texttt{ZoneScoped} macros, described in section~\ref{multizone} (but this time the variable name is \texttt{\_\_\_tracy\_gpu\_zone}).
To solve this problem, in case of OpenGL use the \texttt{TracyGpuNamedZone} macro in place of \texttt{TracyGpuZone} (or the color variant). The same applies to Vulkan, Direct3D 11/12 and Metal -- replace \texttt{TracyVkZone} with \texttt{TracyVkNamedZone}, \texttt{TracyD3D11Zone}/\texttt{TracyD3D12Zone} with \texttt{TracyD3D11NamedZone}/\texttt{TracyD3D12NamedZone}, and \texttt{TracyMetalZone} with \texttt{TracyMetalNamedZone}.
To solve this problem, in case of OpenGL use the \texttt{TracyGpuNamedZone} macro in place of \texttt{TracyGpuZone} (or the color variant). The same applies to Vulkan, Direct3D 11/12, Metal and WebGPU -- replace \texttt{TracyVkZone} with \texttt{TracyVkNamedZone}, \texttt{TracyD3D11Zone}/\texttt{TracyD3D12Zone} with \texttt{TracyD3D11NamedZone}/\texttt{TracyD3D12NamedZone}, \texttt{TracyMetalZone} with \texttt{TracyMetalNamedZone}, and \texttt{TracyWebGPUZone} with \texttt{TracyWebGPUNamedZone}.
Remember to provide your name for the created stack variable as the first parameter to the macros.
\subsubsection{Transient GPU zones}
Transient zones (see section~\ref{transientzones} for details) are available in OpenGL, Vulkan, and Direct3D 11/12 macros. Transient zones are not available for Metal at this moment.
Transient zones (see section~\ref{transientzones} for details) are available in OpenGL, Vulkan, Direct3D 11/12 and WebGPU macros. Transient zones are not available for Metal at this moment.
\subsection{Fibers}
\label{fibers}
@@ -1877,7 +1899,7 @@ As you can see, there are two threads, \texttt{t1} and \texttt{t2}, which are si
\subsection{Collecting call stacks}
\label{collectingcallstacks}
Capture of true call stacks can be performed by using macros with the \texttt{S} postfix, which require an additional parameter, specifying the depth of call stack to be captured. The greater the depth, the longer it will take to perform capture. Currently you can use the following macros: \texttt{ZoneScopedS}, \texttt{ZoneScopedNS}, \texttt{ZoneScopedCS}, \texttt{ZoneScopedNCS}, \texttt{TracyAllocS}, \texttt{TracyFreeS}, \texttt{TracySecureAllocS}, \texttt{TracySecureFreeS}, \texttt{TracyMessageS}, \texttt{TracyMessageLS}, \texttt{TracyMessageCS}, \texttt{TracyMessageLCS}, \texttt{TracyGpuZoneS}, \texttt{TracyGpuZoneCS}, \texttt{TracyVkZoneS}, \texttt{TracyVkZoneCS}, and the named and transient variants.
Capture of true call stacks can be performed by using macros with the \texttt{S} postfix, which require an additional parameter, specifying the depth of call stack to be captured. The greater the depth, the longer it will take to perform capture. Currently you can use the following macros: \texttt{ZoneScopedS}, \texttt{ZoneScopedNS}, \texttt{ZoneScopedCS}, \texttt{ZoneScopedNCS}, \texttt{TracyAllocS}, \texttt{TracyFreeS}, \texttt{TracyMessageS}, \texttt{TracyMessageLS}, \texttt{TracyMessageCS}, \texttt{TracyMessageLCS}, \texttt{TracyGpuZoneS}, \texttt{TracyGpuZoneCS}, \texttt{TracyVkZoneS}, \texttt{TracyVkZoneCS}, and the named and transient variants.
Be aware that call stack collection is a relatively slow operation. Table~\ref{CallstackTimes} and figure~\ref{CallstackPlot} show how long it took to perform a single capture of varying depth on multiple CPU architectures.
@@ -2023,7 +2045,7 @@ void DbgHelpUnlock() { ReleaseMutex(dbgHelpLock); }
}
\end{lstlisting}
At initialization time, Tracy will attempt to preload symbols for device drivers and process modules. As this process can be slow when a lot of PDB files are involved, you can set the \texttt{TRACY\_NO\_DBGHELP\_INIT\_LOAD} environment variable to "1" to disable this behavior and rely on on-demand symbol loading.
At initialization time, Tracy will attempt to preload symbols for device drivers and process modules. As this process can be slow when a lot of PDB files are involved, you can set the \texttt{TRACY\_NO\_DBGHELP\_INIT\_LOAD} environment variable to \enquote{1} to disable this behavior and rely on on-demand symbol loading.
\paragraph{Disabling resolution of inline frames}
@@ -2306,8 +2328,6 @@ Use the following macros in your implementations of \texttt{malloc} and \texttt{
\begin{itemize}
\item \texttt{TracyCAlloc(ptr, size)}
\item \texttt{TracyCFree(ptr)}
\item \texttt{TracyCSecureAlloc(ptr, size)}
\item \texttt{TracyCSecureFree(ptr)}
\end{itemize}
Correctly using this functionality can be pretty tricky. You also will need to handle all the memory allocations made by external libraries (which typically allow usage of custom memory allocation functions) and the allocations made by system functions. If you can't track such an allocation, you will need to make sure freeing is not reported\footnote{It's not uncommon to see a pattern where a system function returns some allocated memory, which you then need to release.}.
@@ -2369,7 +2389,7 @@ To see how you should use this API, you should look at the reference implementat
couleur=black!5,
logo=\bcbombe
]{Important}
A common mistake is to skip the zone "\texttt{isActive}" check. When using \texttt{TRACY\_ON\_DEMAND}, you need to read the value of \texttt{TracyCIsConnected} once, and check the same value for both \newline \texttt{\_\_\_tracy\_emit\_gpu\_zone\_begin\_alloc} and \texttt{\_\_\_tracy\_emit\_gpu\_zone\_end}. Tracy may otherwise receive a zone end without a zone begin.
A common mistake is to skip the zone \enquote{\texttt{isActive}} check. When using \texttt{TRACY\_ON\_DEMAND}, you need to read the value of \texttt{TracyCIsConnected} once, and check the same value for both \newline \texttt{\_\_\_tracy\_emit\_gpu\_zone\_begin\_alloc} and \texttt{\_\_\_tracy\_emit\_gpu\_zone\_end}. Tracy may otherwise receive a zone end without a zone begin.
\end{bclogo}
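The pattern described in the note above can be sketched as follows. This is an illustration only: \texttt{IsConnected}, \texttt{EmitGpuZoneBegin} and \texttt{EmitGpuZoneEnd} are hypothetical stand-ins for \texttt{TracyCIsConnected} and the \texttt{\_\_\_tracy\_emit\_gpu\_zone\_*} functions, and they do not reproduce the real signatures. The sketch only shows that the connection state is read once and that the same value guards both the begin and the end event.

```cpp
#include <cassert>

// Hypothetical stand-ins for the Tracy C API, so that the sketch is self-contained.
static bool g_connected = false;   // state that TracyCIsConnected would report
static int g_openZones = 0;        // begin events minus end events

static bool IsConnected() { return g_connected; }
static void EmitGpuZoneBegin() { ++g_openZones; }
static void EmitGpuZoneEnd() { --g_openZones; }

// Binding-side zone object: the connection state is sampled once, on zone
// entry, and the stored value decides whether the end event is emitted.
class GpuZoneGuard
{
public:
    GpuZoneGuard() : m_active( IsConnected() )
    {
        if( m_active ) EmitGpuZoneBegin();
    }

    ~GpuZoneGuard()
    {
        // Do NOT query the connection state again here. It may have changed
        // since the constructor ran, which would unbalance begin and end.
        if( m_active ) EmitGpuZoneEnd();
    }

    GpuZoneGuard( const GpuZoneGuard& ) = delete;
    GpuZoneGuard& operator=( const GpuZoneGuard& ) = delete;

private:
    const bool m_active;
};
```

Bindings for languages without destructors can apply the same idea by storing the sampled value in the zone handle returned to the caller.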
\subsubsection{Fibers}
@@ -2867,8 +2887,8 @@ logo=\bclampe
Use the following calls in your implementations of allocator/deallocator:
\begin{itemize}
\item \texttt{tracy\_memory\_alloc(ptr, size, name, depth, secure)}
\item \texttt{tracy\_memory\_free(ptr, name, depth, secure)}
\item \texttt{tracy\_memory\_alloc(ptr, size, name, depth)}
\item \texttt{tracy\_memory\_free(ptr, name, depth)}
\end{itemize}
Correctly using this functionality can be pretty tricky especially in Fortran.
@@ -2924,7 +2944,7 @@ Tracy will perform an automatic collection of system data without user intervent
Some profiling data can only be retrieved using the kernel facilities, which are not available to users with normal privilege level. To collect such data, you will need to elevate your rights to the administrator level. You can do so either by running the profiled program from the \texttt{root} account on Unix or through the \emph{Run as administrator} option on Windows\footnote{To make this easier, you can run MSVC with admin privileges, which will be inherited by your program when you start it from within the IDE.}. On Android, you will need to have a rooted device (see section~\ref{androidlunacy} for additional information).
As this system-level tracing functionality is part of the automated collection process, no user intervention is necessary to enable it (assuming that the program was granted the rights needed). However, if, for some reason, you would want to prevent your application from trying to access kernel data, you may recompile your program with the \texttt{TRACY\_NO\_SYSTEM\_TRACING} define. If you want to disable this functionality dynamically at runtime instead, you can set the \texttt{TRACY\_NO\_SYSTEM\_TRACING} environment variable to "1".
As this system-level tracing functionality is part of the automated collection process, no user intervention is necessary to enable it (assuming that the program was granted the rights needed). However, if, for some reason, you would want to prevent your application from trying to access kernel data, you may recompile your program with the \texttt{TRACY\_NO\_SYSTEM\_TRACING} define. If you want to disable this functionality dynamically at runtime instead, you can set the \texttt{TRACY\_NO\_SYSTEM\_TRACING} environment variable to \enquote{1}.
\begin{bclogo}[
noborder=true,
@@ -3076,7 +3096,7 @@ On Linux, Tracy will override the \texttt{dlclose} function call to prevent shar
\subsubsection{Vertical synchronization}
On Windows and Linux, Tracy will automatically capture hardware Vsync events, provided that the application has access to the kernel data (privilege elevation may be needed, see section~\ref{privilegeelevation}). These events will be reported as '\texttt{[x] Vsync}' frame sets, where \texttt{x} is the identifier of a specific monitor. Note that hardware vertical synchronization might not correspond to the one seen by your application due to desktop composition, command queue buffering, and so on. Also, in some instances, when there is nothing to update on the screen, the graphic driver may choose to stop issuing screen refresh. As a result, there may be periods where no vertical synchronization events are reported.
On Windows and Linux, Tracy will automatically capture hardware Vsync events, provided that the application has access to the kernel data (privilege elevation may be needed, see section~\ref{privilegeelevation}). These events will be reported as \enquote{\texttt{[x] Vsync}} frame sets, where \texttt{x} is the identifier of a specific monitor. Note that hardware vertical synchronization might not correspond to the one seen by your application due to desktop composition, command queue buffering, and so on. Also, in some instances, when there is nothing to update on the screen, the graphic driver may choose to stop issuing screen refresh. As a result, there may be periods where no vertical synchronization events are reported.
Use the \texttt{TRACY\_NO\_VSYNC\_CAPTURE} macro to disable capture of Vsync events.
@@ -3230,7 +3250,7 @@ If you want to look at the profile data in real-time (or load a saved trace file
The \emph{\faWrench{}~Wrench} button opens the about dialog, which also contains a number of global settings you may want to tweak (section~\ref{aboutwindow}).
The client \emph{address entry} field and the \faWifi{}~\emph{Connect} button are used to connect to a running client\footnote{Note that a custom port may be provided here, for example by entering '127.0.0.1:1234'.}. You can use the connection history button~\faCaretDown{} to display a list of commonly used targets, from which you can quickly select an address. You can remove entries from this list by hovering the \faArrowPointer{}~mouse cursor over an entry and pressing the \keys{\del} button on the keyboard.
The client \emph{address entry} field and the \faWifi{}~\emph{Connect} button are used to connect to a running client\footnote{Note that a custom port may be provided here, for example by entering \enquote{127.0.0.1:1234}.}. You can use the connection history button~\faCaretDown{} to display a list of commonly used targets, from which you can quickly select an address. You can remove entries from this list by hovering the \faArrowPointer{}~mouse cursor over an entry and pressing the \keys{\del} button on the keyboard.
If you want to open a trace that you have stored on the disk, you can do so by pressing the \faFolderOpen{}~\emph{Open saved trace} button.
@@ -3877,10 +3897,10 @@ You will find the zones with locks and their associated threads on this combined
The left-hand side \emph{index area} of the timeline view displays various labels (threads, locks), which can be categorized in the following way:
\begin{itemize}
\item \emph{Light blue label} -- GPU context. Multi-threaded Vulkan, OpenCL, Direct3D 12 and Metal contexts are additionally split into separate threads.
\item \emph{Light blue label} -- GPU context. Multi-threaded Vulkan, OpenCL, Direct3D 12, Metal and WebGPU contexts are additionally split into separate threads.
\item \emph{Pink label} -- CPU data graph.
\item \emph{White label} -- A CPU thread. It will be replaced by a bright red label in a thread that has crashed (section~\ref{crashhandling}). If automated sampling was performed, clicking the~\LMB{}~left mouse button on the \emph{\faGhost{}~ghost zones} button will switch zone display mode between 'instrumented' and 'ghost.'
\item \emph{Green label} -- Fiber, coroutine, or any other sort of cooperative multitasking 'green thread.'
\item \emph{White label} -- A CPU thread. It will be replaced by a bright red label in a thread that has crashed (section~\ref{crashhandling}). If automated sampling was performed, clicking the~\LMB{}~left mouse button on the \emph{\faGhost{}~ghost zones} button will switch zone display mode between \enquote{instrumented} and \enquote{ghost.}
\item \emph{Green label} -- Fiber, coroutine, or any other sort of cooperative multitasking \enquote{green thread.}
\item \emph{Light red label} -- Indicates a lock.
\item \emph{Yellow label} -- Plot.
\end{itemize}
@@ -3899,7 +3919,7 @@ In an example in figure~\ref{zoneslocks} you can see that there are two threads:
Meanwhile, the \emph{Streaming thread} is performing some \emph{Streaming jobs}. The first \emph{Streaming job} sent a message (section~\ref{messagelog}). In addition to being listed in the message log, it is indicated by a triangle over the thread separator. When multiple messages are in one place, the triangle outline shape changes to a filled triangle.
The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/Metal/OpenCL context in place of a thread name.
The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/Metal/OpenCL/CUDA/WebGPU context in place of a thread name.
Hovering the \faArrowPointer{} mouse pointer over a zone will highlight all other zones that have the exact source location with a white outline. Clicking the \LMB{}~left mouse button on a zone will open the zone information window (section~\ref{zoneinfo}). Holding the \keys{\ctrl} key and clicking the \LMB{}~left mouse button on a zone will open the zone statistics window (section~\ref{findzone}). Clicking the \MMB{}~middle mouse button on a zone will zoom the view to the extent of the zone.
@@ -4108,7 +4128,7 @@ In this window, you can set various trace-related options. For example, the time
\begin{itemize}
\item \emph{\faSignature{} Draw CPU usage graph} -- You can disable drawing of the CPU usage graph here.
\end{itemize}
\item \emph{\faEye{} Draw GPU zones} -- Allows disabling display of OpenGL/Vulkan/Metal/Direct3D/OpenCL zones. The \emph{GPU zones} drop-down allows disabling individual GPU contexts and setting CPU/GPU drift offsets of uncalibrated contexts (see section~\ref{gpuprofiling} for more information). The \emph{\faRobot~Auto} button automatically measures the GPU drift value\footnote{There is an assumption that drift is linear. Automated measurement calculates and removes change over time in delay-to-execution of GPU zones. Resulting value may still be incorrect.}.
\item \emph{\faEye{} Draw GPU zones} -- Allows disabling display of OpenGL/Vulkan/Metal/Direct3D/OpenCL/CUDA/WebGPU zones. The \emph{GPU zones} drop-down allows disabling individual GPU contexts and setting CPU/GPU drift offsets of uncalibrated contexts (see section~\ref{gpuprofiling} for more information). The \emph{\faRobot~Auto} button automatically measures the GPU drift value\footnote{There is an assumption that drift is linear. Automated measurement calculates and removes change over time in delay-to-execution of GPU zones. Resulting value may still be incorrect.}.
\item \emph{\faMicrochip{} Draw CPU zones} -- Determines whether CPU zones are displayed.
\begin{itemize}
\item \emph{\faGhost{} Draw ghost zones} -- Controls if ghost zones should be displayed in threads which don't have any instrumented zones available.
@@ -4158,7 +4178,7 @@ You can filter the message list in the following ways:
\begin{itemize}
\item By the originating thread in the \emph{\faShuffle{} Visible threads} drop-down.
\item By matching the message text to the expression in the \emph{\faFilter{}~Filter messages} entry field. Multiple filter expressions can be comma-separated (e.g. 'warn, info' will match messages containing strings 'warn' \emph{or} 'info'). You can exclude matches by preceding the term with a minus character (e.g., '-debug' will hide all messages containing the string 'debug').
\item By matching the message text to the expression in the \emph{\faFilter{}~Filter messages} entry field. Multiple filter expressions can be comma-separated (e.g. \enquote{warn, info} will match messages containing strings \enquote{warn} \emph{or} \enquote{info}). You can exclude matches by preceding the term with a minus character (e.g., \enquote{-debug} will hide all messages containing the string \enquote{debug}).
\item By message source, distinguishing between \emph{\faUser{}~User} messages and internal \emph{\faMicroscope{}~Tracy} diagnostics.
\item By severity level: \emph{\faShoePrints{}~Trace}, \emph{\faBug{}~Debug}, \emph{\faInfo{}~Info}, \emph{\faTriangleExclamation{}~Warning}, \emph{\faCircleXmark{}~Error}, or \emph{\faSkullCrossbones{}~Fatal}.
\end{itemize}
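The text-filter rule above (comma-separated terms combined with \emph{or}, and a leading minus to exclude) can be sketched in a few lines of C++17. This is an illustration only, not Tracy's implementation: the names \texttt{Trim} and \texttt{PassesFilter} are hypothetical, matching is assumed to be a case-sensitive substring test, and a filter made up solely of exclusions is assumed to show every message that is not excluded.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of Tracy): strips spaces from both ends.
static std::string_view Trim( std::string_view s )
{
    while( !s.empty() && s.front() == ' ' ) s.remove_prefix( 1 );
    while( !s.empty() && s.back() == ' ' ) s.remove_suffix( 1 );
    return s;
}

// Returns true if `message` should stay visible under `filter`.
// Assumptions: case-sensitive substring matching; any matching "-term"
// hides the message; otherwise the message is shown if at least one plain
// term matches, or if the filter has no plain terms at all.
bool PassesFilter( std::string_view filter, std::string_view message )
{
    bool hasInclude = false;
    bool included = false;
    size_t pos = 0;
    while( pos <= filter.size() )
    {
        size_t comma = filter.find( ',', pos );
        if( comma == std::string_view::npos ) comma = filter.size();
        std::string_view term = Trim( filter.substr( pos, comma - pos ) );
        pos = comma + 1;
        if( term.empty() ) continue;
        if( term.front() == '-' )
        {
            // Exclusion term: a single hit is enough to hide the message.
            term.remove_prefix( 1 );
            if( !term.empty() && message.find( term ) != std::string_view::npos ) return false;
        }
        else
        {
            // Inclusion term: terms are OR-ed together.
            hasInclude = true;
            if( message.find( term ) != std::string_view::npos ) included = true;
        }
    }
    return !hasInclude || included;
}
```

Under these assumptions, the filter \texttt{warn, info} keeps a message containing either word, while \texttt{-debug} hides only the messages that contain \texttt{debug}.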
@@ -4623,7 +4643,7 @@ The zone information window displays detailed information about a single zone. T
\begin{itemize}
\item Basic source location information: function name, source file location, and the thread name.
\item Timing information.
\item If the profiler performed context switch capture (section~\ref{contextswitches}) and a thread was suspended during zone execution, a list of wait regions will be displayed, with complete information about the timing, CPU migrations, and wait reasons. If CPU topology data is available (section~\ref{cputopology}), the profiler will mark zone migrations across cores with 'C' and migrations across packages -- with 'P.' In some cases, context switch data might be incomplete\footnote{For example, when capture is ongoing and context switch information has not yet been received.}, in which case a warning message will be displayed.
\item If the profiler performed context switch capture (section~\ref{contextswitches}) and a thread was suspended during zone execution, a list of wait regions will be displayed, with complete information about the timing, CPU migrations, and wait reasons. If CPU topology data is available (section~\ref{cputopology}), the profiler will mark zone migrations across cores with \enquote{C} and migrations across packages -- with \enquote{P.} In some cases, context switch data might be incomplete\footnote{For example, when capture is ongoing and context switch information has not yet been received.}, in which case a warning message will be displayed.
\item Memory events list, both summarized and a list of individual allocation/free events (see section~\ref{memorywindow} for more information on the memory events list).
\item List of messages that the profiler logged in the zone's scope. If the \emph{exclude children} option is disabled, messages emitted in child zones will also be included.
\item Parent zones list, showing the hierarchy of parent zones that contain the current zone. Hovering the \faArrowPointer{}~mouse pointer over a parent zone will highlight it on the timeline view with a red outline. Clicking the \LMB{}~left mouse button on a zone will switch the zone info window to that zone. Clicking the \MMB{}~middle mouse button on a zone will zoom the timeline view to the zone's extent. Clicking the \RMB{}~right mouse button on a source file location will open the source file view window (if applicable, see section~\ref{sourceview}).
@@ -4660,7 +4680,7 @@ Clicking on the \emph{\faClipboard{}~Copy to clipboard} buttons will copy the ap
This window shows the frames contained in the selected call stack. Information about the originating thread is included. Each frame is described by a function name, source file location, and originating image\footnote{Executable images are called \emph{modules} by Microsoft.} name. Function frames originating from the kernel are marked with a red color. Clicking the \LMB{}~left mouse button on either the function name or source file location will copy the name to the clipboard. Clicking the \RMB{}~right mouse button on the source file location will open the source file view window (if applicable, see section~\ref{sourceview}).
A single stack frame may have multiple function call places associated with it. This happens in the case of inlined function calls. Such entries will be displayed in the call stack window, with \emph{inline} in place of frame number\footnote{Or '\faCaretRight{}'~icon in case of call stack tooltips.}.
A single stack frame may have multiple function call places associated with it. This happens in the case of inlined function calls. Such entries will be displayed in the call stack window, with \emph{inline} in place of frame number\footnote{Or \enquote{\faCaretRight{}}~icon in case of call stack tooltips.}.
If the call stack shows a crash (see section~\ref{crashhandling}), a red \emph{\faSkull{}~Crash} label will be displayed. Clicking it will center the timeline on the crash. Note that the crash stack may contain OS or Tracy frames where the crash was intercepted and processed.
@@ -4673,7 +4693,7 @@ Stack frame location may be displayed in the following number of ways, depending
\item \emph{Symbol address} -- displays begin address of the function containing the frame address.
\end{itemize}
In some cases, it may not be possible to decode stack frame addresses correctly. Such frames will be presented with a dimmed '\texttt{[ntdll.dll]}' name of the image containing the frame address, or simply '\texttt{[unknown]}' if the profiler cannot retrieve even this information. Additionally, '\texttt{[kernel]}' is used to indicate unknown stack frames within the operating system's internal routines.
In some cases, it may not be possible to decode stack frame addresses correctly. Such frames will be presented with a dimmed \enquote{\texttt{[ntdll.dll]}} name of the image containing the frame address, or simply \enquote{\texttt{[unknown]}} if the profiler cannot retrieve even this information. Additionally, \enquote{\texttt{[kernel]}} is used to indicate unknown stack frames within the operating system's internal routines.
External frames from system libraries are hidden by default. Enabling the \emph{\faShieldHalved{}~External} option will show these frames, which can be useful for debugging issues in external code. When external frames are displayed, they are dimmed out.
@@ -4761,7 +4781,7 @@ Some modes may be unavailable in some circumstances (missing or outdated source
\paragraph{Source mode}
This is pretty much the source file view window, but with the ability to select one of the source files that the compiler used to build the symbol. Additionally, each source file line that produced machine code in the symbol will show a count of associated assembly instructions, displayed with an '\texttt{@}' prefix, and will be marked with grey color on the scroll bar. Due to how optimizing compilers work, some lines may seemingly not produce any machine code, for example, because iterating a loop counter index might have been reduced to advancing a data pointer. Some other lines may have a disproportionate amount of associated instructions, e.g., when the compiler applied a loop unrolling optimization. This varies from case to case and from compiler to compiler.
This is pretty much the source file view window, but with the ability to select one of the source files that the compiler used to build the symbol. Additionally, each source file line that produced machine code in the symbol will show a count of associated assembly instructions, displayed with an \enquote{\texttt{@}} prefix, and will be marked with grey color on the scroll bar. Due to how optimizing compilers work, some lines may seemingly not produce any machine code, for example, because iterating a loop counter index might have been reduced to advancing a data pointer. Some other lines may have a disproportionate amount of associated instructions, e.g., when the compiler applied a loop unrolling optimization. This varies from case to case and from compiler to compiler.
The \emph{Propagate inlines} option, available when sample data is present, will enable propagation of the instruction costs down the local call stack. For example, suppose a base function in the symbol issues a call to an inlined function (which may not be readily visible due to being contained in another source file). In that case, any cost attributed to the inlined function will be visible in the base function. Because the cost information is added to all the entries in the local call stacks, it is possible to see seemingly nonsense total cost values when this feature is enabled. To quickly toggle this on or off, you may also press the \keys{X} key.
@@ -4779,7 +4799,7 @@ logo=\bclampe
]{Local call stack}
In some cases, it may be challenging to understand what is being displayed in the disassembly. For example, calling the \texttt{std::lower\_bound} function may generate multiple levels of inlined functions: first, we enter the search algorithm, then the comparison functions, which in turn may be lambdas that call even more external code, and so on. In such an event, you will most likely see that some external code is taking a long time to execute, and you will be none the wiser on improving things.
The local call stack for an assembly instruction represents all the inline function calls \emph{within the symbol} (hence the 'local' part), which were made to reach the instruction. Deeper inspection of the local call stack, including navigation to the source call site of each participating inline function, can be performed through the context menu accessible by pressing the \RMB{}~right mouse button on the source location.
The local call stack for an assembly instruction represents all the inline function calls \emph{within the symbol} (hence the \enquote{local} part), which were made to reach the instruction. Deeper inspection of the local call stack, including navigation to the source call site of each participating inline function, can be performed through the context menu accessible by pressing the \RMB{}~right mouse button on the source location.
\end{bclogo}
Selecting the \emph{\faGears{}~Raw code} option will enable the display of raw machine code bytes for each line. Individual bytes are displayed with interwoven colors to make reading easier.
@@ -4830,9 +4850,9 @@ In this mode, the source and assembly panes will be displayed together, providin
\paragraph{Instruction pointer cost statistics}
If automated call stack sampling (see chapter~\ref{sampling}) was performed, additional profiling information will be available. The first column of source and assembly views will contain percentage counts of collected instruction pointer samples for each displayed line, both in numerical and graphical bar form. You can use this information to determine which function line takes the most time. The displayed percentage values are heat map color-coded, with the lowest values mapped to dark red and the highest to bright yellow. The color code will appear next to the percentage value and on the scroll bar so that you can identify 'hot' places in the code at a glance.
If automated call stack sampling (see chapter~\ref{sampling}) was performed, additional profiling information will be available. The first column of source and assembly views will contain percentage counts of collected instruction pointer samples for each displayed line, both in numerical and graphical bar form. You can use this information to determine which function line takes the most time. The displayed percentage values are heat map color-coded, with the lowest values mapped to dark red and the highest to bright yellow. The color code will appear next to the percentage value and on the scroll bar so that you can identify \enquote{hot} places in the code at a glance.
By default, samples are displayed only within the selected symbol, in isolation. In some cases, you may, however, want to include samples from functions that the selected symbol called. To do so, enable the \emph{\faRightFromBracket{}~Child calls} option, which you may also temporarily toggle by holding the \keys{Z} key. You can also click the~\faCaretDown{}~drop down control to display a child call distribution list\footnote{The height of the list can be changed by dragging the separator bar.}, which shows each known function\footnote{You should remember that these are results of random sampling. Some function calls may be missing here.} that the symbol called. Make sure to familiarize yourself with section~\ref{readingcallstacks} to be able to read the results correctly. Each child call on the list has an attributed time cost, which is also displayed as a percentage of the child calls ("\%~Calls") and the percentage of the total symbol time ("\%~Total").
By default, samples are displayed only within the selected symbol, in isolation. In some cases, you may, however, want to include samples from functions that the selected symbol called. To do so, enable the \emph{\faRightFromBracket{}~Child calls} option, which you may also temporarily toggle by holding the \keys{Z} key. You can also click the~\faCaretDown{}~drop down control to display a child call distribution list\footnote{The height of the list can be changed by dragging the separator bar.}, which shows each known function\footnote{You should remember that these are results of random sampling. Some function calls may be missing here.} that the symbol called. Make sure to familiarize yourself with section~\ref{readingcallstacks} to be able to read the results correctly. Each child call on the list has an attributed time cost, which is also displayed as a percentage of the child calls (\enquote{\%~Calls}) and the percentage of the total symbol time (\enquote{\%~Total}).
The total number of collected samples is displayed in the UI under the~\emph{\faEyeDropper~Samples} label and converted to a time approximation at the~\emph{\faStopwatch~Time} label. The displayed values show the local count if child calls are disabled and the total count if the option is enabled. In either case, the number of samples attributed only to the child calls is displayed in parentheses with the + or - symbol and as a percentage of the total symbol time.
@@ -5009,7 +5029,7 @@ There are no ideal LLM providers, but here are some options:
\begin{itemize}
\item \emph{llama.cpp} (\url{https://github.com/ggml-org/llama.cpp}) -- Recommended as the easiest to use. Clone from git and build it yourself. By default it fits the model automatically to available memory. It is rapidly advancing with new features and model support. Most other providers use it to do the actual work, and they typically use an outdated release. The \url{https://llama.app/} site might provide an easy way to install llama.
\item \emph{llama-swap} (\url{https://github.com/mostlygeek/llama-swap}) -- Wrapper for llama.cpp that allows model selection. Recommended to augment the above.
\item \emph{LM Studio} (\url{https://lmstudio.ai/}) -- It is easy to install on all platforms and has a GUI. But it is overwhelming when it comes to the number of options it offers. Some people may question the licensing. Its features lag behind llama.cpp. Manual configuration of each model is required. To get it to work properly, go to its settings (using the gear icon in the bottom right corner of the program window), then select the Developer tab and enable "When applicable, separate \texttt{reasoning\_content} and \texttt{content} in API responses".
\item \emph{LM Studio} (\url{https://lmstudio.ai/}) -- It is easy to install on all platforms and has a GUI. But it is overwhelming when it comes to the number of options it offers. Some people may question the licensing. Its features lag behind llama.cpp. Manual configuration of each model is required. To get it to work properly, go to its settings (using the gear icon in the bottom right corner of the program window), then select the Developer tab and enable \enquote{When applicable, separate \texttt{reasoning\_content} and \texttt{content} in API responses}.
\end{itemize}
\subsection{Model selection}
@@ -5029,7 +5049,7 @@ noborder=true,
couleur=black!5,
logo=\bclampe
]{Model quantization}
Running a model with full 32-bit floating-point weights is not feasible due to memory requirements. Instead, the model parameters are quantized, for which 4 bits is typically the sweet spot. In general, the lower the parameter precision, the more "dumbed down" the model becomes. However, the loss of model coherence due to quantization is less than the benefit of being able to run a larger model.
Running a model with full 32-bit floating-point weights is not feasible due to memory requirements. Instead, the model parameters are quantized, for which 4 bits is typically the sweet spot. In general, the lower the parameter precision, the more \enquote{dumbed down} the model becomes. However, the loss of model coherence due to quantization is less than the benefit of being able to run a larger model.
\end{bclogo}
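The memory argument behind quantization comes down to simple arithmetic, sketched below. The function name \texttt{WeightBytes} is hypothetical, and the estimate assumes every parameter is stored at exactly the stated bit width. Real quantization formats store additional per-block scaling data, so actual file sizes will be somewhat larger.

```cpp
#include <cstdint>

// Approximate number of bytes needed to hold the model weights alone.
// Assumption: each parameter occupies exactly `bitsPerWeight` bits; the
// overhead of per-block scales used by real quantization formats is ignored.
constexpr uint64_t WeightBytes( uint64_t parameters, uint32_t bitsPerWeight )
{
    return parameters * bitsPerWeight / 8;
}

// A hypothetical 4-billion-parameter model:
// 32-bit weights take about 16 GB, 4-bit weights about 2 GB.
static_assert( WeightBytes( 4000000000ull, 32 ) == 16000000000ull, "32-bit estimate" );
static_assert( WeightBytes( 4000000000ull, 4 ) == 2000000000ull, "4-bit estimate" );
```

By this estimate, the memory that holds a 32-bit model of a given size can hold a 4-bit model with eight times as many parameters, which is why running a larger quantized model is usually the better trade.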
\begin{bclogo}[
@@ -5037,9 +5057,9 @@ noborder=true,
couleur=black!5,
logo=\bclampe
]{Model size}
Another thing to consider when selecting a model is its size, which is typically measured in billions of parameters (weights) and written as 4B, for example. The model size determines how much memory, computation, and time are required to run it. Generally, the larger the model, the "smarter" its responses will be.
Another thing to consider when selecting a model is its size, which is typically measured in billions of parameters (weights) and written as 4B, for example. The model size determines how much memory, computation, and time are required to run it. Generally, the larger the model, the \enquote{smarter} its responses will be.
Most modern models will be "Mixture of Experts", or MoE, and their size will be denoted, for example, 35B-A3B. This means that the model size is 35B, but only 3B parameters are active and used to compute the next token. In practice, this means that the model has knowledge closer to the full, dense 35B model but speed and GPU memory requirements closer to the fast 3B model.
Most modern models will be \enquote{Mixture of Experts}, or \enquote{MoE}, and their size will be denoted, for example, 35B-A3B. This means that the model size is 35B, but only 3B parameters are active and used to compute the next token. In practice, this means that the model has knowledge closer to the full, dense 35B model but speed and GPU memory requirements closer to the fast 3B model.
\end{bclogo}
\begin{bclogo}[
@@ -5047,7 +5067,7 @@ noborder=true,
couleur=black!5,
logo=\bclampe
]{Context size}
The model size only indicates the minimum memory requirement. For the model to operate properly, you also need to set the context size, which determines how much information from the conversation the model can "remember". This size is measured in tokens, and a very rough approximation is that each token is a combination of three or four letters.
The model size only indicates the minimum memory requirement. For the model to operate properly, you also need to set the context size, which determines how much information from the conversation the model can \enquote{remember}. This size is measured in tokens, and a very rough approximation is that each token is a combination of three or four letters.
Each token present in the context window may require a fairly large amount of memory, and that can quickly add up to gigabytes. Some modern models use solutions that greatly reduce context memory requirements, but that varies from model to model. If needed, the KV cache used for context can be quantized, just like model parameters. In this case, the recommended size per weight is 8 bits.
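To see how context memory adds up, the following sketch estimates the KV cache size for a conventional transformer in which every layer keeps a key and a value vector per token. The structure \texttt{AttentionShape}, the function \texttt{KvCacheBytes}, and the example dimensions are all hypothetical; models that use the memory-saving techniques mentioned above will need less than this estimate.

```cpp
#include <cstdint>

// Hypothetical description of a model's attention layout.
struct AttentionShape
{
    uint32_t layers;    // number of transformer layers
    uint32_t kvHeads;   // key/value heads per layer
    uint32_t headDim;   // values per head
};

// Estimated KV cache size in bytes for `tokens` tokens of context.
// Assumption: plain attention, where each layer stores one key and one
// value vector (hence the factor of 2) for every token.
constexpr uint64_t KvCacheBytes( const AttentionShape& m, uint64_t tokens, uint32_t bitsPerValue )
{
    return 2ull * m.layers * m.kvHeads * m.headDim * tokens * bitsPerValue / 8;
}

// Example: 32 layers, 8 KV heads of 128 values, 32768 tokens of context.
// 16-bit cache: 4 GiB. 8-bit cache: 2 GiB.
static_assert( KvCacheBytes( AttentionShape{ 32, 8, 128 }, 32768, 16 ) == 4294967296ull, "16-bit cache" );
static_assert( KvCacheBytes( AttentionShape{ 32, 8, 128 }, 32768, 8 ) == 2147483648ull, "8-bit cache" );
```

In this example a single token costs 128 KiB at 16 bits, so quantizing the cache to 8 bits halves the total.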
@@ -5058,7 +5078,7 @@ The realistic minimum required context size for Tracy to run the assistant is 10
Sometimes Tracy needs to do some language processing where speed is more important than the smarts. The default setting is to use the chat model with the reasoning disabled, which is fine for most applications.
It may be more convenient to use a small, quick model instead, in which case enable the \emph{Fast model} checkbox and choose the second model. To save precious GPU resources for the chat model, you may want to keep this model entirely in system RAM (set \texttt{-ngl 0} for llama.cpp or set "GPU offload" to 0 in LM Studio) and disable the KV cache offload to GPU (set \texttt{-nkvo} for llama.cpp or disable "Offload KV Cache to GPU Memory" in LM Studio).
It may be more convenient to use a small, quick model instead, in which case enable the \emph{Fast model} checkbox and choose the second model. To save precious GPU resources for the chat model, you may want to keep this model entirely in system RAM (set \texttt{-ngl 0} for llama.cpp or set \enquote{GPU offload} to 0 in LM Studio) and disable the KV cache offload to GPU (set \texttt{-nkvo} for llama.cpp or disable \enquote{Offload KV Cache to GPU Memory} in LM Studio).
\subsubsection{Embedding model}
@@ -5119,7 +5139,7 @@ The horizontal meter directly below shows how much of the context size has been
The chat section contains the conversation with the automated assistant with alternating user and assistant turns. Clicking on the~\emph{\faUser{}~User} role icon removes the chat content up to the selected question. Similarly, clicking on the~\emph{\faRobot{}~Assistant} role icon removes the conversation content up to this point and generates another response from the assistant.
The assistant may give preliminary replies to the user, for example, \emph{"I will now check the source of function foobar"}, followed by performing the actual check, then a continuation of the reply, such as \emph{"Now I can see that..."}. To make reading these tiered replies easier, only the most recent reply is printed in normal text, while the preliminary responses are dimmed out.
The assistant may give preliminary replies to the user, for example, \enquote{I will now check the source of function foobar}, followed by performing the actual check, then a continuation of the reply, such as \enquote{Now I can see that...}. To make reading these tiered replies easier, only the most recent reply is printed in normal text, while the preliminary responses are dimmed out.
Each assistant reply contains a note about the language model that was used and the time it took to generate the text.
@@ -5187,8 +5207,8 @@ You can customize the output with the following command line options:
\item \texttt{-h, -\hspace{-1.25ex} -help} -- Display a help message
\item \texttt{-f, -\hspace{-1.25ex} -filter <name>} -- Filter the zone names
\item \texttt{-c, -\hspace{-1.25ex} -case} -- Make the name filtering case sensitive
\item \texttt{-s, -\hspace{-1.25ex} -sep <separator>} -- Customize the CSV separator (default is ``\texttt{,}'')
\item \texttt{-e, -\hspace{-1.25ex} -self} -- Use self time (equivalent to the ``Self time'' toggle in the profiler GUI)
\item \texttt{-s, -\hspace{-1.25ex} -sep <separator>} -- Customize the CSV separator (default is \enquote{\texttt{,}})
\item \texttt{-e, -\hspace{-1.25ex} -self} -- Use self time (equivalent to the \enquote{Self time} toggle in the profiler GUI)
\item \texttt{-u, -\hspace{-1.25ex} -unwrap} -- Report each zone individually; this will discard the statistics columns and instead report the timestamp and duration for each zone entry
\item \texttt{-g, -\hspace{-1.25ex} -gpu} -- Report each GPU zone event
\item \texttt{-m, -\hspace{-1.25ex} -messages} -- Report only messages

@@ -44,10 +44,16 @@ ExternalProject_Add(embed
)
function(Embed LIST NAME FILE)
cmake_parse_arguments(EMBED "TEXT" "" "" ${ARGN})
if(EMBED_TEXT)
set(EMBED_FLAGS -t)
else()
set(EMBED_FLAGS)
endif()
add_custom_command(
OUTPUT data/${NAME}.cpp data/${NAME}.hpp
COMMAND ${CMAKE_COMMAND} -E make_directory data
COMMAND ${CMAKE_CURRENT_BINARY_DIR}/embed ${NAME} ${CMAKE_CURRENT_LIST_DIR}/${FILE} data/${NAME}
COMMAND ${CMAKE_CURRENT_BINARY_DIR}/embed ${EMBED_FLAGS} ${NAME} ${CMAKE_CURRENT_LIST_DIR}/${FILE} data/${NAME}
DEPENDS embed ${CMAKE_CURRENT_LIST_DIR}/${FILE}
)
list(APPEND ${LIST} data/${NAME}.cpp)
@@ -146,10 +152,10 @@ set(PROFILER_FILES
src/winmainArchDiscovery.cpp
)
Embed(PROFILER_FILES SystemPrompt src/llm/system.prompt.md)
Embed(PROFILER_FILES SkillCallstack src/llm/skill.callstack.md)
Embed(PROFILER_FILES SkillOptimization src/llm/skill.optimization.md)
Embed(PROFILER_FILES ToolsJson src/llm/tools.json)
Embed(PROFILER_FILES SystemPrompt src/llm/system.prompt.md TEXT)
Embed(PROFILER_FILES SkillCallstack src/llm/skill.callstack.md TEXT)
Embed(PROFILER_FILES SkillOptimization src/llm/skill.optimization.md TEXT)
Embed(PROFILER_FILES ToolsJson src/llm/tools.json TEXT)
Embed(PROFILER_FILES FontFixed src/font/FiraCode-Retina.ttf)
Embed(PROFILER_FILES FontIcons src/font/Font\ Awesome\ 7\ Free-Solid-900.otf)
@@ -159,20 +165,20 @@ Embed(PROFILER_FILES FontItalic src/font/Roboto-Italic.ttf)
Embed(PROFILER_FILES FontBoldItalic src/font/Roboto-BoldItalic.ttf)
Embed(PROFILER_FILES FontEmoji src/font/NotoEmoji-Regular.ttf)
Embed(PROFILER_FILES Manual ../manual/tracy.md)
Embed(PROFILER_FILES Manual ../manual/tracy.md TEXT)
Embed(PROFILER_FILES Text100Million src/achievements/100Million.md)
Embed(PROFILER_FILES TextConnectToClient src/achievements/ConnectToClient.md)
Embed(PROFILER_FILES TextFindZone src/achievements/FindZone.md)
Embed(PROFILER_FILES TextFrameImages src/achievements/FrameImages.md)
Embed(PROFILER_FILES TextGlobalSettings src/achievements/GlobalSettings.md)
Embed(PROFILER_FILES TextInstrumentationIntro src/achievements/InstrumentationIntro.md)
Embed(PROFILER_FILES TextInstrumentationStatistics src/achievements/InstrumentationStatistics.md)
Embed(PROFILER_FILES TextInstrumentFrames src/achievements/InstrumentFrames.md)
Embed(PROFILER_FILES TextIntro src/achievements/Intro.md)
Embed(PROFILER_FILES TextLoadTrace src/achievements/LoadTrace.md)
Embed(PROFILER_FILES TextSamplingIntro src/achievements/SamplingIntro.md)
Embed(PROFILER_FILES TextSaveTrace src/achievements/SaveTrace.md)
Embed(PROFILER_FILES Text100Million src/achievements/100Million.md TEXT)
Embed(PROFILER_FILES TextConnectToClient src/achievements/ConnectToClient.md TEXT)
Embed(PROFILER_FILES TextFindZone src/achievements/FindZone.md TEXT)
Embed(PROFILER_FILES TextFrameImages src/achievements/FrameImages.md TEXT)
Embed(PROFILER_FILES TextGlobalSettings src/achievements/GlobalSettings.md TEXT)
Embed(PROFILER_FILES TextInstrumentationIntro src/achievements/InstrumentationIntro.md TEXT)
Embed(PROFILER_FILES TextInstrumentationStatistics src/achievements/InstrumentationStatistics.md TEXT)
Embed(PROFILER_FILES TextInstrumentFrames src/achievements/InstrumentFrames.md TEXT)
Embed(PROFILER_FILES TextIntro src/achievements/Intro.md TEXT)
Embed(PROFILER_FILES TextLoadTrace src/achievements/LoadTrace.md TEXT)
Embed(PROFILER_FILES TextSamplingIntro src/achievements/SamplingIntro.md TEXT)
Embed(PROFILER_FILES TextSaveTrace src/achievements/SaveTrace.md TEXT)
set(INCLUDES "${CMAKE_CURRENT_BINARY_DIR}")
set(LIBS "")
@@ -294,7 +300,19 @@ if(NOT EMSCRIPTEN)
endif()
if(EMSCRIPTEN)
target_link_options(${PROJECT_NAME} PRIVATE -pthread -sASSERTIONS=0 -sINITIAL_MEMORY=384mb -sALLOW_MEMORY_GROWTH=1 -sMAXIMUM_MEMORY=4gb -sSTACK_SIZE=1048576 -sWASM_BIGINT=1 -sPTHREAD_POOL_SIZE=8 -sEXPORTED_FUNCTIONS=_main,_nativeOpenFile,_tracy_paste_clipboard -sEXPORTED_RUNTIME_METHODS=ccall -sENVIRONMENT=web,worker --preload-file embed.tracy)
target_link_options(${PROJECT_NAME} PRIVATE
-pthread
-sASSERTIONS=0
-sINITIAL_MEMORY=384mb
-sALLOW_MEMORY_GROWTH=1
-sMAXIMUM_MEMORY=4gb
-sSTACK_SIZE=1048576
-sPTHREAD_POOL_SIZE=8
-sEXPORTED_FUNCTIONS=_main,_nativeOpenFile,_tracy_paste_clipboard
-sEXPORTED_RUNTIME_METHODS=ccall
-sENVIRONMENT=web,worker
--preload-file embed.tracy
)
file(DOWNLOAD https://share.nereid.pl/i/embed.tracy ${CMAKE_CURRENT_BINARY_DIR}/embed.tracy EXPECTED_MD5 ca0fa4f01e7b8ca5581daa16b16c768d)
file(COPY ${CMAKE_CURRENT_LIST_DIR}/wasm/index.html DESTINATION ${CMAKE_CURRENT_BINARY_DIR})

@@ -1,17 +1,27 @@
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <string>
#include "../../public/common/tracy_lz4hc.hpp"
static void Usage()
{
fprintf( stderr, "Usage: embed <objectName> <source> <destination>\n" );
fprintf( stderr, "Usage: embed [-t] <objectName> <source> <destination>\n" );
fprintf( stderr, " destination should be without extension, will create cpp, hpp pair\n" );
fprintf( stderr, " -t: treat source as text, convert line endings to unix\n" );
}
int main( int argc, char** argv )
{
bool text = false;
if( argc >= 2 && strcmp( argv[1], "-t" ) == 0 )
{
text = true;
argc--;
argv++;
}
if( argc < 4 )
{
Usage();
@@ -38,6 +48,16 @@ int main( int argc, char** argv )
fread( data, 1, sz, src );
fclose( src );
if( text )
{
size_t pos = 0;
for( size_t i=0; i<sz; i++ )
{
if( data[i] != '\r' ) data[pos++] = data[i];
}
sz = pos;
}
const auto lz4szMax = tracy::LZ4_compressBound( sz );
auto lz4data = new uint8_t[lz4szMax];
const auto lz4sz = tracy::LZ4_compress_HC( (const char*)data, (char*)lz4data, sz, lz4szMax, 6 );

@@ -162,6 +162,15 @@ static ImGuiKey TranslateKeyCode( const char* code )
return ImGuiKey_None;
}
static void UpdateKeyModifiers( const EmscriptenKeyboardEvent* e )
{
ImGuiIO& io = ImGui::GetIO();
io.AddKeyEvent( ImGuiMod_Ctrl, e->ctrlKey );
io.AddKeyEvent( ImGuiMod_Shift, e->shiftKey );
io.AddKeyEvent( ImGuiMod_Alt, e->altKey );
io.AddKeyEvent( ImGuiMod_Super, e->metaKey );
}
Backend::Backend( const char* title, const std::function<void()>& redraw, const std::function<void(float)>& scaleChanged, const std::function<int(void)>& isBusy, RunQueue* mainThreadTasks )
{
constexpr EGLint eglConfigAttrib[] = {
@@ -243,6 +252,7 @@ Backend::Backend( const char* title, const std::function<void()>& redraw, const
return EM_TRUE;
} );
emscripten_set_keydown_callback( EMSCRIPTEN_EVENT_TARGET_WINDOW, nullptr, EM_TRUE, [] ( int, const EmscriptenKeyboardEvent* e, void* ) -> EM_BOOL {
UpdateKeyModifiers( e );
const auto code = TranslateKeyCode( e->code );
if( code == ImGuiKey_None ) return EM_FALSE;
ImGui::GetIO().AddKeyEvent( code, true );
@@ -250,6 +260,7 @@ Backend::Backend( const char* title, const std::function<void()>& redraw, const
return EM_TRUE;
} );
emscripten_set_keyup_callback( EMSCRIPTEN_EVENT_TARGET_WINDOW, nullptr, EM_TRUE, [] ( int, const EmscriptenKeyboardEvent* e, void* ) -> EM_BOOL {
UpdateKeyModifiers( e );
const auto code = TranslateKeyCode( e->code );
if( code == ImGuiKey_None ) return EM_FALSE;
ImGui::GetIO().AddKeyEvent( code, false );

@@ -44,7 +44,7 @@ struct Config
std::string llmSearchIdentifier;
std::string llmSearchApiKey;
std::string llmSearchBraveApiKey;
bool llmSeparateFastModel = true;
bool llmSeparateFastModel = false;
bool llmAnnotateCallstacks = false;
bool llmLimitToolReplySize = false;
int llmMaxToolReplySizeValue = 48*1024;

@@ -295,6 +295,7 @@ bool UserData::Load()
LoadValue( v, "min", a->range.min );
LoadValue( v, "max", a->range.max );
LoadValue( v, "color", a->color );
a->range.active = true;
m_annotations.emplace_back( std::move( a ) );
}
}

@@ -49,7 +49,8 @@ constexpr const char* GpuContextNames[] = {
"Metal",
"Custom",
"CUDA",
"Rocprof"
"Rocprof",
"WebGPU"
};
struct MemoryPage;

@@ -299,6 +299,22 @@ void View::DrawTimeline()
v->range.StartFrame();
HandleRange( v->range, timespan, ImGui::GetCursorScreenPos(), w );
}
if( IsMouseClicked( 0 ) )
{
const auto ty = ImGui::GetTextLineHeight();
for( auto& ann : m_annotations )
{
if( ann->range.min >= m_vd.zvEnd || ann->range.max <= m_vd.zvStart ) continue;
const auto aMin = ( ann->range.min - m_vd.zvStart ) * pxns;
const auto aMax = ( ann->range.max - m_vd.zvStart ) * pxns;
if( ImGui::IsMouseHoveringRect( linepos + ImVec2( aMin, lineh - ty * 1.5f ), linepos + ImVec2( aMax, lineh ) ) )
{
m_selectedAnnotation = ann.get();
ConsumeMouseEvents( 0 );
break;
}
}
}
HandleTimelineMouse( timespan, ImGui::GetCursorScreenPos(), w );
}
if( ImGui::IsWindowFocused( ImGuiHoveredFlags_ChildWindows | ImGuiHoveredFlags_AllowWhenBlockedByActiveItem ) )
@@ -360,9 +376,8 @@ void View::DrawTimeline()
bool hover = ImGui::IsWindowHovered() && ImGui::IsMouseHoveringRect( wpos, wpos + ImVec2( w, h ) );
draw = ImGui::GetWindowDrawList();
const auto scale = GetScale();
const auto ty = ImGui::GetTextLineHeight();
const auto to = 9.f;
const auto th = ( ty - to ) * sqrt( 3 ) * 0.5;
if( m_vd.drawGpuZones )
{
@@ -415,17 +430,24 @@ void View::DrawTimeline()
m_lockHighlight = m_nextLockHighlight;
const auto iconSize = ImGui::CalcTextSize( ICON_FA_NOTE_STICKY );
for( auto& ann : m_annotations )
{
if( ann->range.min < m_vd.zvEnd && ann->range.max > m_vd.zvStart )
{
uint32_t c0 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0x44000000 : 0x22000000 );
uint32_t c1 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0x66000000 : 0x44000000 );
uint32_t c2 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0xCC000000 : 0xAA000000 );
draw->AddRectFilled( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns, 0 ), linepos + ImVec2( ( ann->range.max - m_vd.zvStart ) * pxns, lineh ), c0 );
DrawLine( draw, linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + 0.5f, 0.5f ), linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + 0.5f, lineh + 0.5f ), ann->range.hiMin ? c2 : c1, ann->range.hiMin ? 2 : 1 );
DrawLine( draw, linepos + ImVec2( ( ann->range.max - m_vd.zvStart ) * pxns + 0.5f, 0.5f ), linepos + ImVec2( ( ann->range.max - m_vd.zvStart ) * pxns + 0.5f, lineh + 0.5f ), ann->range.hiMax ? c2 : c1, ann->range.hiMax ? 2 : 1 );
if( drawMouseLine && ImGui::IsMouseHoveringRect( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns, 0 ), linepos + ImVec2( ( ann->range.max - m_vd.zvStart ) * pxns, lineh ) ) )
uint32_t c0 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0x22000000 : 0x11000000 );
uint32_t c1 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0x88000000 : 0x66000000 );
uint32_t c2 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0xDD000000 : 0xBB000000 );
const auto aMin = ( ann->range.min - m_vd.zvStart ) * pxns;
const auto aMax = ( ann->range.max - m_vd.zvStart ) * pxns;
draw->AddRectFilled( linepos + ImVec2( aMin, 0 ), linepos + ImVec2( aMax, lineh ), c0 );
draw->AddRectFilled( linepos + ImVec2( aMin + 1, lineh - ty * 1.5f ), linepos + ImVec2( aMax - 1, lineh ), 0x88000000 );
DrawLine( draw, linepos + ImVec2( aMin + 0.5f, 0.5f ), linepos + ImVec2( aMin + 0.5f, lineh + 0.5f ), ann->range.hiMin ? c2 : c1, ann->range.hiMin ? 2 : 1 );
DrawLine( draw, linepos + ImVec2( aMax - 0.5f, 0.5f ), linepos + ImVec2( aMax - 0.5f, lineh + 0.5f ), ann->range.hiMax ? c2 : c1, ann->range.hiMax ? 2 : 1 );
if( drawMouseLine && ImGui::IsMouseHoveringRect( linepos + ImVec2( aMin, 0 ), linepos + ImVec2( aMax, lineh ) ) )
{
ImGui::BeginTooltip();
if( ann->text.empty() )
@@ -442,27 +464,22 @@ void View::DrawTimeline()
TextFocused( "Annotation length:", TimeToString( ann->range.max - ann->range.min ) );
ImGui::EndTooltip();
}
const auto aw = ( ann->range.max - ann->range.min ) * pxns;
if( aw > th * 4 )
{
draw->AddCircleFilled( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th * 2, th * 2 ), th, 0x88AABB22 );
draw->AddCircle( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th * 2, th * 2 ), th, 0xAAAABB22 );
if( drawMouseLine && IsMouseClicked( 0 ) && ImGui::IsMouseHoveringRect( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th, th ), linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th * 3, th * 3 ) ) )
{
m_selectedAnnotation = ann.get();
}
const auto aw = ( ann->range.max - ann->range.min ) * pxns;
if( aw > ty + iconSize.x )
{
draw->AddText( linepos + ImVec2( aMin + ty * 0.5f, lineh - ty * 1.25f ), ann->color | 0xFF000000, ICON_FA_NOTE_STICKY );
if( !ann->text.empty() )
{
const auto tw = ImGui::CalcTextSize( ann->text.c_str() ).x;
if( aw - th*4 > tw )
if( aw > ty + iconSize.x + tw )
{
draw->AddText( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th * 4, th * 0.5 ), 0xFFFFFFFF, ann->text.c_str() );
draw->AddText( linepos + ImVec2( aMin + ty + iconSize.x, lineh - ty * 1.25f ), 0xFFFFFFFF, ann->text.c_str() );
}
else
{
draw->PushClipRect( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns, 0 ), linepos + ImVec2( ( ann->range.max - m_vd.zvStart ) * pxns, lineh ), true );
draw->AddText( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th * 4, th * 0.5 ), 0xFFFFFFFF, ann->text.c_str() );
draw->PushClipRect( linepos + ImVec2( aMin + 1, lineh - ty * 1.5f ), linepos + ImVec2( aMax - 1, lineh ) );
draw->AddText( linepos + ImVec2( aMin + ty + iconSize.x, lineh - ty * 1.25f ), 0xFFFFFFFF, ann->text.c_str() );
draw->PopClipRect();
}
}
@@ -485,7 +502,6 @@ void View::DrawTimeline()
draw->AddRect( ImVec2( wpos.x + px0, linepos.y ), ImVec2( wpos.x + px1, linepos.y + lineh ), 0x4488DD88 );
}
const auto scale = GetScale();
if( m_findZone.range.active && ( m_findZone.show || m_showRanges ) )
{
const auto px0 = ( m_findZone.range.min - m_vd.zvStart ) * pxns;

@@ -861,43 +861,38 @@ module tracy
end interface
interface
subroutine impl_tracy_emit_memory_alloc_callstack(ptr, size, depth, secure) &
subroutine impl_tracy_emit_memory_alloc_callstack(ptr, size, depth) &
bind(C, name="___tracy_emit_memory_alloc_callstack")
import
type(c_ptr), intent(in), value :: ptr
integer(c_size_t), intent(in), value :: size
integer(c_int32_t), intent(in), value :: depth
integer(c_int32_t), intent(in), value :: secure
end subroutine impl_tracy_emit_memory_alloc_callstack
subroutine impl_tracy_emit_memory_alloc_callstack_named(ptr, size, depth, secure, name) &
subroutine impl_tracy_emit_memory_alloc_callstack_named(ptr, size, depth, name) &
bind(C, name="___tracy_emit_memory_alloc_callstack_named")
import
type(c_ptr), intent(in), value :: ptr
integer(c_size_t), intent(in), value :: size
integer(c_int32_t), intent(in), value :: depth
integer(c_int32_t), intent(in), value :: secure
type(c_ptr), intent(in), value :: name
end subroutine impl_tracy_emit_memory_alloc_callstack_named
subroutine impl_tracy_emit_memory_free_callstack(ptr, depth, secure) &
subroutine impl_tracy_emit_memory_free_callstack(ptr, depth) &
bind(C, name="___tracy_emit_memory_free_callstack")
import
type(c_ptr), intent(in), value :: ptr
integer(c_int32_t), intent(in), value :: depth
integer(c_int32_t), intent(in), value :: secure
end subroutine impl_tracy_emit_memory_free_callstack
subroutine impl_tracy_emit_memory_free_callstack_named(ptr, depth, secure, name) &
subroutine impl_tracy_emit_memory_free_callstack_named(ptr, depth, name) &
bind(C, name="___tracy_emit_memory_free_callstack_named")
import
type(c_ptr), intent(in), value :: ptr
integer(c_int32_t), intent(in), value :: depth
integer(c_int32_t), intent(in), value :: secure
type(c_ptr), intent(in), value :: name
end subroutine impl_tracy_emit_memory_free_callstack_named
subroutine impl_tracy_emit_memory_discard_callstack(name, secure, depth) &
subroutine impl_tracy_emit_memory_discard_callstack(name, depth) &
bind(C, name="___tracy_emit_memory_discard_callstack")
import
type(c_ptr), intent(in), value :: name
integer(c_int32_t), intent(in), value :: secure
integer(c_int32_t), intent(in), value :: depth
end subroutine impl_tracy_emit_memory_discard_callstack
end interface
@@ -1128,58 +1123,43 @@ contains
tracy_connected = impl_tracy_connected() /= 0_c_int32_t
end function tracy_connected
subroutine tracy_memory_alloc(ptr, size, name, depth, secure)
subroutine tracy_memory_alloc(ptr, size, name, depth)
type(c_ptr), intent(in) :: ptr
integer(c_size_t), intent(in) :: size
character(kind=c_char, len=*), target, intent(in), optional :: name
integer(c_int32_t), intent(in), optional :: depth
logical(1), intent(in), optional :: secure
!
integer(c_int32_t) :: depth_, secure_
secure_ = 0_c_int32_t
integer(c_int32_t) :: depth_
depth_ = 0_c_int32_t
if (present(secure)) then
if (secure) secure_ = 1_c_int32_t
end if
if (present(depth)) depth_ = depth
if (present(name)) then
call impl_tracy_emit_memory_alloc_callstack_named(ptr, size, depth_, secure_, c_loc(name))
call impl_tracy_emit_memory_alloc_callstack_named(ptr, size, depth_, c_loc(name))
else
call impl_tracy_emit_memory_alloc_callstack(ptr, size, depth_, secure_)
call impl_tracy_emit_memory_alloc_callstack(ptr, size, depth_)
end if
end subroutine tracy_memory_alloc
subroutine tracy_memory_free(ptr, name, depth, secure)
subroutine tracy_memory_free(ptr, name, depth)
type(c_ptr), intent(in) :: ptr
character(kind=c_char, len=*), target, intent(in), optional :: name
integer(c_int32_t), intent(in), optional :: depth
logical(1), intent(in), optional :: secure
!
integer(c_int32_t) :: depth_, secure_
secure_ = 0_c_int32_t
integer(c_int32_t) :: depth_
depth_ = 0_c_int32_t
if (present(secure)) then
if (secure) secure_ = 1_c_int32_t
end if
if (present(depth)) depth_ = depth
if (present(name)) then
call impl_tracy_emit_memory_free_callstack_named(ptr, depth_, secure_, c_loc(name))
call impl_tracy_emit_memory_free_callstack_named(ptr, depth_, c_loc(name))
else
call impl_tracy_emit_memory_free_callstack(ptr, depth_, secure_)
call impl_tracy_emit_memory_free_callstack(ptr, depth_)
end if
end subroutine tracy_memory_free
subroutine tracy_memory_discard(name, depth, secure)
subroutine tracy_memory_discard(name, depth)
character(kind=c_char, len=*), target, intent(in) :: name
integer(c_int32_t), intent(in), optional :: depth
logical(1), intent(in), optional :: secure
!
integer(c_int32_t) :: depth_, secure_
secure_ = 0_c_int32_t
integer(c_int32_t) :: depth_
depth_ = 0_c_int32_t
if (present(secure)) then
if (secure) secure_ = 1_c_int32_t
end if
if (present(depth)) depth_ = depth
call impl_tracy_emit_memory_discard_callstack(c_loc(name), depth_, secure_)
call impl_tracy_emit_memory_discard_callstack(c_loc(name), depth_)
end subroutine tracy_memory_discard
subroutine tracy_message(msg, color, depth)

@@ -524,7 +524,7 @@ static const char* GetHostInfo()
auto ptr = buf;
#if defined _WIN32
# if defined TRACY_WIN32_NO_DESKTOP
auto GetVersion = &::GetVersionEx;
auto GetVersion = &::GetVersionExW;
# else
auto GetVersion = (t_RtlGetVersion)GetProcAddress( GetModuleHandleA( "ntdll.dll" ), "RtlGetVersion" );
# endif
@@ -1408,9 +1408,30 @@ namespace
// 1a. But s_queue is needed for initialization of variables in point 2.
extern moodycamel::ConcurrentQueue<QueueItem> s_queue;
// A producer token may be created before s_initTime is constructed (the dynamic loader
// runs shared object initializers before any of the executable's constructors, and such
// an initializer may emit a zone). Remember the time of such an early token creation, so
// that the init time can be backdated accordingly and no event timestamp precedes the
// trace epoch.
static std::atomic<int64_t> s_earlyTokenTime { 0 };
static bool s_initTimeConstructed = false;
// 2. If these variables would be in the .CRT$XCB section, they would be initialized only in main thread.
thread_local moodycamel::ProducerToken init_order(107) s_token_detail( s_queue );
thread_local ProducerWrapper init_order(108) s_token { s_queue.get_explicit_producer( s_token_detail ) };
static moodycamel::ConcurrentQueue<QueueItem>::ExplicitProducer* CreateProducerToken()
{
auto ptr = s_queue.get_explicit_producer( s_token_detail );
if( !s_initTimeConstructed )
{
const auto t = Profiler::GetTime();
auto e = s_earlyTokenTime.load( std::memory_order_relaxed );
while( ( e == 0 || t < e ) && !s_earlyTokenTime.compare_exchange_weak( e, t, std::memory_order_relaxed ) ) {}
}
return ptr;
}
thread_local ProducerWrapper init_order(108) s_token { CreateProducerToken() };
thread_local ThreadHandleWrapper init_order(104) s_threadHandle { detail::GetThreadHandleImpl() };
# ifdef _MSC_VER
@@ -1419,12 +1440,36 @@ thread_local ThreadHandleWrapper init_order(104) s_threadHandle { detail::GetThr
# pragma init_seg( ".CRT$XCB" )
# endif
static InitTimeWrapper init_order(101) s_initTime { SetupHwTimer() };
static int64_t GetInitTimeImpl()
{
auto t = SetupHwTimer();
const auto e = s_earlyTokenTime.load( std::memory_order_relaxed );
if( e != 0 && e < t ) t = e;
s_initTimeConstructed = true;
return t;
}
static InitTimeWrapper init_order(101) s_initTime { GetInitTimeImpl() };
std::atomic<int> init_order(102) RpInitDone( 0 );
std::atomic<int> init_order(102) RpInitLock( 0 );
thread_local bool RpThreadInitDone = false;
thread_local bool RpThreadShutdown = false;
moodycamel::ConcurrentQueue<QueueItem> init_order(103) s_queue( QueuePrealloc );
# ifndef _MSC_VER
// An instrumented shared object may emit zones from its static initializers, which the
// dynamic loader runs before any of the executable's constructors, including the
// priority-ordered constructor of s_queue above. The main thread producer token (s_token)
// is then lazily created against the zero-initialized queue memory, and the queue
// constructor subsequently orphans it, making all zones emitted on the main thread
// invisible to the consumer. Re-adopt such a producer here. If no zones were emitted up
// to this point, this only triggers construction of s_token, which is a no-op repair.
struct EarlyMainThreadTokenRepair
{
EarlyMainThreadTokenRepair() { if( s_token.ptr ) s_queue.readopt_orphaned_producer( s_token.ptr ); }
};
static EarlyMainThreadTokenRepair init_order(104) s_earlyMainThreadTokenRepair;
# endif
std::atomic<uint32_t> init_order(104) s_lockCounter( 0 );
std::atomic<uint8_t> init_order(104) s_gpuCtxCounter( 0 );
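
The `s_earlyTokenTime` logic in the hunk above amounts to an atomic "keep the smallest non-zero timestamp" update, followed by a clamp when the init time is finally computed. The sketch below isolates that pattern under simplified assumptions: `RecordEarliest` and `Backdate` are hypothetical names, and plain integers stand in for `Profiler::GetTime()` and `SetupHwTimer()`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// 0 is reserved as the "nothing recorded yet" sentinel, as in the diff.
static std::atomic<int64_t> s_earliest { 0 };

// Hypothetical stand-in for the CAS loop in CreateProducerToken(): store t
// only if no value was recorded yet or t is smaller than the recorded one.
static void RecordEarliest( int64_t t )
{
    assert( t != 0 );
    auto e = s_earliest.load( std::memory_order_relaxed );
    // On CAS failure, e is refreshed with the current value and the
    // condition is evaluated again against it.
    while( ( e == 0 || t < e ) && !s_earliest.compare_exchange_weak( e, t, std::memory_order_relaxed ) ) {}
}

// Hypothetical stand-in for GetInitTimeImpl(): move the init time back if an
// earlier timestamp was recorded before initialization.
static int64_t Backdate( int64_t initTime )
{
    const auto e = s_earliest.load( std::memory_order_relaxed );
    return ( e != 0 && e < initTime ) ? e : initTime;
}
```

With this arrangement no recorded event can carry a timestamp earlier than the value returned by `Backdate`, which is the property the comment in the diff describes.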
@@ -2290,12 +2335,12 @@ void Profiler::CompressWorker()
const auto w = fi->w;
const auto h = fi->h;
const auto csz = size_t( w * h / 2 );
auto etc1buf = (char*)tracy_malloc( csz );
CompressImageDxt1( (const char*)fi->image, etc1buf, w, h );
auto texbuf = (char*)tracy_malloc( csz );
CompressImageDxt1( (const char*)fi->image, texbuf, w, h );
tracy_free( fi->image );
TracyLfqPrepare( QueueType::FrameImage );
MemWrite( &item->frameImageFat.image, (uint64_t)etc1buf );
MemWrite( &item->frameImageFat.image, (uint64_t)texbuf );
MemWrite( &item->frameImageFat.frame, fi->frame );
MemWrite( &item->frameImageFat.w, w );
MemWrite( &item->frameImageFat.h, h );
@@ -3409,34 +3454,68 @@ void Profiler::SendString( uint64_t str, const char* ptr, size_t len, QueueType
AppendDataUnsafe( ptr, l16 );
}
void Profiler::SendSingleString( const char* ptr, size_t len )
void Profiler::SendSingleString8( const char* ptr, size_t len )
{
QueueItem item;
MemWrite( &item.hdr.type, QueueType::SingleStringData8 );
assert( len <= std::numeric_limits<uint8_t>::max() );
auto l8 = uint8_t( len );
NeedDataSize( QueueDataSize[(int)QueueType::SingleStringData8] + sizeof( l8 ) + len );
AppendDataUnsafe( &item, QueueDataSize[(int)QueueType::SingleStringData8] );
AppendDataUnsafe( &l8, sizeof( l8 ) );
AppendDataUnsafe( ptr, len );
}
void Profiler::SendSingleString16( const char* ptr, size_t len )
{
QueueItem item;
MemWrite( &item.hdr.type, QueueType::SingleStringData );
// Ignoring u16+ range by design
assert( len > std::numeric_limits<uint8_t>::max() );
assert( len <= std::numeric_limits<uint16_t>::max() );
auto l16 = uint16_t( len );
auto l16 = uint16_t( len - ProtocolOffset8Bit );
NeedDataSize( QueueDataSize[(int)QueueType::SingleStringData] + sizeof( l16 ) + l16 );
NeedDataSize( QueueDataSize[(int)QueueType::SingleStringData] + sizeof( l16 ) + len );
AppendDataUnsafe( &item, QueueDataSize[(int)QueueType::SingleStringData] );
AppendDataUnsafe( &l16, sizeof( l16 ) );
AppendDataUnsafe( ptr, l16 );
AppendDataUnsafe( ptr, len );
}
void Profiler::SendSecondString( const char* ptr, size_t len )
void Profiler::SendSecondString8( const char* ptr, size_t len )
{
QueueItem item;
MemWrite( &item.hdr.type, QueueType::SecondStringData8 );
assert( len <= std::numeric_limits<uint8_t>::max() );
auto l8 = uint8_t( len );
NeedDataSize( QueueDataSize[(int)QueueType::SecondStringData8] + sizeof( l8 ) + len );
AppendDataUnsafe( &item, QueueDataSize[(int)QueueType::SecondStringData8] );
AppendDataUnsafe( &l8, sizeof( l8 ) );
AppendDataUnsafe( ptr, len );
}
void Profiler::SendSecondString16( const char* ptr, size_t len )
{
QueueItem item;
MemWrite( &item.hdr.type, QueueType::SecondStringData );
// Ignoring u16+ range by design
assert( len > std::numeric_limits<uint8_t>::max() );
assert( len <= std::numeric_limits<uint16_t>::max() );
auto l16 = uint16_t( len );
auto l16 = uint16_t( len - ProtocolOffset8Bit );
NeedDataSize( QueueDataSize[(int)QueueType::SecondStringData] + sizeof( l16 ) + l16 );
NeedDataSize( QueueDataSize[(int)QueueType::SecondStringData] + sizeof( l16 ) + len );
AppendDataUnsafe( &item, QueueDataSize[(int)QueueType::SecondStringData] );
AppendDataUnsafe( &l16, sizeof( l16 ) );
AppendDataUnsafe( ptr, l16 );
AppendDataUnsafe( ptr, len );
}
void Profiler::SendLongString( uint64_t str, const char* ptr, size_t len, QueueType type )
@@ -4664,64 +4743,64 @@ TRACY_API void ___tracy_emit_zone_value( TracyCZoneCtx ctx, uint64_t value )
}
}
TRACY_API void ___tracy_emit_memory_alloc( const void* ptr, size_t size, int32_t secure ) { tracy::Profiler::MemAlloc( ptr, size, secure != 0 ); }
TRACY_API void ___tracy_emit_memory_alloc_callstack( const void* ptr, size_t size, int32_t depth, int32_t secure )
TRACY_API void ___tracy_emit_memory_alloc( const void* ptr, size_t size ) { tracy::Profiler::MemAlloc( ptr, size ); }
TRACY_API void ___tracy_emit_memory_alloc_callstack( const void* ptr, size_t size, int32_t depth )
{
if( depth > 0 && tracy::has_callstack() )
{
tracy::Profiler::MemAllocCallstack( ptr, size, depth, secure != 0 );
tracy::Profiler::MemAllocCallstack( ptr, size, depth );
}
else
{
tracy::Profiler::MemAlloc( ptr, size, secure != 0 );
tracy::Profiler::MemAlloc( ptr, size );
}
}
TRACY_API void ___tracy_emit_memory_free( const void* ptr, int32_t secure ) { tracy::Profiler::MemFree( ptr, secure != 0 ); }
TRACY_API void ___tracy_emit_memory_free_callstack( const void* ptr, int32_t depth, int32_t secure )
TRACY_API void ___tracy_emit_memory_free( const void* ptr ) { tracy::Profiler::MemFree( ptr ); }
TRACY_API void ___tracy_emit_memory_free_callstack( const void* ptr, int32_t depth )
{
if( depth > 0 && tracy::has_callstack() )
{
tracy::Profiler::MemFreeCallstack( ptr, depth, secure != 0 );
tracy::Profiler::MemFreeCallstack( ptr, depth );
}
else
{
tracy::Profiler::MemFree( ptr, secure != 0 );
tracy::Profiler::MemFree( ptr );
}
}
TRACY_API void ___tracy_emit_memory_discard( const char* name, int32_t secure ) { tracy::Profiler::MemDiscard( name, secure != 0 ); }
TRACY_API void ___tracy_emit_memory_discard_callstack( const char* name, int32_t secure, int32_t depth )
TRACY_API void ___tracy_emit_memory_discard( const char* name ) { tracy::Profiler::MemDiscard( name ); }
TRACY_API void ___tracy_emit_memory_discard_callstack( const char* name, int32_t depth )
{
if( depth > 0 && tracy::has_callstack() )
{
tracy::Profiler::MemDiscardCallstack( name, secure != 0, depth );
tracy::Profiler::MemDiscardCallstack( name, depth );
}
else
{
tracy::Profiler::MemDiscard( name, secure != 0 );
tracy::Profiler::MemDiscard( name );
}
}
TRACY_API void ___tracy_emit_memory_alloc_named( const void* ptr, size_t size, int32_t secure, const char* name ) { tracy::Profiler::MemAllocNamed( ptr, size, secure != 0, name ); }
TRACY_API void ___tracy_emit_memory_alloc_callstack_named( const void* ptr, size_t size, int32_t depth, int32_t secure, const char* name )
TRACY_API void ___tracy_emit_memory_alloc_named( const void* ptr, size_t size, const char* name ) { tracy::Profiler::MemAllocNamed( ptr, size, name ); }
TRACY_API void ___tracy_emit_memory_alloc_callstack_named( const void* ptr, size_t size, int32_t depth, const char* name )
{
if( depth > 0 && tracy::has_callstack() )
{
tracy::Profiler::MemAllocCallstackNamed( ptr, size, depth, secure != 0, name );
tracy::Profiler::MemAllocCallstackNamed( ptr, size, depth, name );
}
else
{
tracy::Profiler::MemAllocNamed( ptr, size, secure != 0, name );
tracy::Profiler::MemAllocNamed( ptr, size, name );
}
}
TRACY_API void ___tracy_emit_memory_free_named( const void* ptr, int32_t secure, const char* name ) { tracy::Profiler::MemFreeNamed( ptr, secure != 0, name ); }
TRACY_API void ___tracy_emit_memory_free_callstack_named( const void* ptr, int32_t depth, int32_t secure, const char* name )
TRACY_API void ___tracy_emit_memory_free_named( const void* ptr, const char* name ) { tracy::Profiler::MemFreeNamed( ptr, name ); }
TRACY_API void ___tracy_emit_memory_free_callstack_named( const void* ptr, int32_t depth, const char* name )
{
if( depth > 0 && tracy::has_callstack() )
{
tracy::Profiler::MemFreeCallstackNamed( ptr, depth, secure != 0, name );
tracy::Profiler::MemFreeCallstackNamed( ptr, depth, name );
}
else
{
tracy::Profiler::MemFreeNamed( ptr, secure != 0, name );
tracy::Profiler::MemFreeNamed( ptr, name );
}
}
TRACY_API void ___tracy_emit_frame_mark( const char* name ) { tracy::Profiler::SendFrameMark( name ); }

@@ -535,9 +535,9 @@ public:
TracyLfqCommit;
}
static tracy_force_inline void MemAlloc( const void* ptr, size_t size, bool secure )
static tracy_force_inline void MemAlloc( const void* ptr, size_t size )
{
if( secure && !ProfilerAvailable() ) return;
if( !ProfilerAvailable() ) return;
#ifdef TRACY_ON_DEMAND
if( !GetProfiler().IsConnected() ) return;
#endif
@@ -548,9 +548,9 @@ public:
GetProfiler().m_serialLock.unlock();
}
static tracy_force_inline void MemFree( const void* ptr, bool secure )
static tracy_force_inline void MemFree( const void* ptr )
{
if( secure && !ProfilerAvailable() ) return;
if( !ProfilerAvailable() ) return;
#ifdef TRACY_ON_DEMAND
if( !GetProfiler().IsConnected() ) return;
#endif
@@ -561,9 +561,9 @@ public:
GetProfiler().m_serialLock.unlock();
}
static tracy_force_inline void MemAllocCallstack( const void* ptr, size_t size, int32_t depth, bool secure )
static tracy_force_inline void MemAllocCallstack( const void* ptr, size_t size, int32_t depth )
{
if( secure && !ProfilerAvailable() ) return;
if( !ProfilerAvailable() ) return;
if( depth > 0 && has_callstack() )
{
auto& profiler = GetProfiler();
@@ -581,16 +581,16 @@ public:
}
else
{
MemAlloc( ptr, size, secure );
MemAlloc( ptr, size );
}
}
static tracy_force_inline void MemFreeCallstack( const void* ptr, int32_t depth, bool secure )
static tracy_force_inline void MemFreeCallstack( const void* ptr, int32_t depth )
{
if( secure && !ProfilerAvailable() ) return;
if( !ProfilerAvailable() ) return;
if( !ProfilerAllocatorAvailable() )
{
MemFree( ptr, secure );
MemFree( ptr );
return;
}
if( depth > 0 && has_callstack() )
@@ -610,13 +610,13 @@ public:
}
else
{
MemFree( ptr, secure );
MemFree( ptr );
}
}
static tracy_force_inline void MemAllocNamed( const void* ptr, size_t size, bool secure, const char* name )
static tracy_force_inline void MemAllocNamed( const void* ptr, size_t size, const char* name )
{
if( secure && !ProfilerAvailable() ) return;
if( !ProfilerAvailable() ) return;
#ifdef TRACY_ON_DEMAND
if( !GetProfiler().IsConnected() ) return;
#endif
@@ -628,9 +628,9 @@ public:
GetProfiler().m_serialLock.unlock();
}
static tracy_force_inline void MemFreeNamed( const void* ptr, bool secure, const char* name )
static tracy_force_inline void MemFreeNamed( const void* ptr, const char* name )
{
if( secure && !ProfilerAvailable() ) return;
if( !ProfilerAvailable() ) return;
#ifdef TRACY_ON_DEMAND
if( !GetProfiler().IsConnected() ) return;
#endif
@@ -642,9 +642,9 @@ public:
GetProfiler().m_serialLock.unlock();
}
static tracy_force_inline void MemAllocCallstackNamed( const void* ptr, size_t size, int32_t depth, bool secure, const char* name )
static tracy_force_inline void MemAllocCallstackNamed( const void* ptr, size_t size, int32_t depth, const char* name )
{
if( secure && !ProfilerAvailable() ) return;
if( !ProfilerAvailable() ) return;
if( depth > 0 && has_callstack() )
{
auto& profiler = GetProfiler();
@@ -663,13 +663,13 @@ public:
}
else
{
MemAllocNamed( ptr, size, secure, name );
MemAllocNamed( ptr, size, name );
}
}
static tracy_force_inline void MemFreeCallstackNamed( const void* ptr, int32_t depth, bool secure, const char* name )
static tracy_force_inline void MemFreeCallstackNamed( const void* ptr, int32_t depth, const char* name )
{
if( secure && !ProfilerAvailable() ) return;
if( !ProfilerAvailable() ) return;
if( depth > 0 && has_callstack() )
{
auto& profiler = GetProfiler();
@@ -688,13 +688,13 @@ public:
}
else
{
MemFreeNamed( ptr, secure, name );
MemFreeNamed( ptr, name );
}
}
static tracy_force_inline void MemDiscard( const char* name, bool secure )
static tracy_force_inline void MemDiscard( const char* name )
{
if( secure && !ProfilerAvailable() ) return;
if( !ProfilerAvailable() ) return;
#ifdef TRACY_ON_DEMAND
if( !GetProfiler().IsConnected() ) return;
#endif
@@ -705,9 +705,9 @@ public:
GetProfiler().m_serialLock.unlock();
}
static tracy_force_inline void MemDiscardCallstack( const char* name, bool secure, int32_t depth )
static tracy_force_inline void MemDiscardCallstack( const char* name, int32_t depth )
{
if( secure && !ProfilerAvailable() ) return;
if( !ProfilerAvailable() ) return;
if( depth > 0 && has_callstack() )
{
# ifdef TRACY_ON_DEMAND
@@ -719,12 +719,12 @@ public:
GetProfiler().m_serialLock.lock();
SendCallstackSerial( callstack );
SendMemDiscard( QueueType::MemDiscard, thread, name );
SendMemDiscard( QueueType::MemDiscardCallstack, thread, name );
GetProfiler().m_serialLock.unlock();
}
else
{
MemDiscard( name, secure );
MemDiscard( name );
}
}
@@ -827,12 +827,12 @@ public:
void RequestShutdown() { m_shutdown.store( true, std::memory_order_relaxed ); m_shutdownManual.store( true, std::memory_order_relaxed ); }
bool HasShutdownFinished() const { return m_shutdownFinished.load( std::memory_order_relaxed ); }
void SendString( uint64_t str, const char* ptr, QueueType type ) { SendString( str, ptr, strlen( ptr ), type ); }
tracy_force_inline void SendString( uint64_t str, const char* ptr, QueueType type ) { SendString( str, ptr, strlen( ptr ), type ); }
void SendString( uint64_t str, const char* ptr, size_t len, QueueType type );
void SendSingleString( const char* ptr ) { SendSingleString( ptr, strlen( ptr ) ); }
void SendSingleString( const char* ptr, size_t len );
void SendSecondString( const char* ptr ) { SendSecondString( ptr, strlen( ptr ) ); }
void SendSecondString( const char* ptr, size_t len );
tracy_force_inline void SendSingleString( const char* ptr ) { SendSingleString( ptr, strlen( ptr ) ); }
tracy_force_inline void SendSingleString( const char* ptr, size_t len ) { len <= 255 ? SendSingleString8( ptr, len ) : SendSingleString16( ptr, len ); }
tracy_force_inline void SendSecondString( const char* ptr ) { SendSecondString( ptr, strlen( ptr ) ); }
tracy_force_inline void SendSecondString( const char* ptr, size_t len ) { len <= 255 ? SendSecondString8( ptr, len ) : SendSecondString16( ptr, len ); }
// Allocated source location data layout:
@@ -975,6 +975,11 @@ private:
void CalibrateDelay();
void ReportTopology();
void SendSingleString8( const char* ptr, size_t len );
void SendSingleString16( const char* ptr, size_t len );
void SendSecondString8( const char* ptr, size_t len );
void SendSecondString16( const char* ptr, size_t len );
static tracy_force_inline void SendCallstackSerial( void* ptr )
{
if( has_callstack() )

@@ -52,20 +52,8 @@ public:
RingBuffer( const RingBuffer& ) = delete;
RingBuffer& operator=( const RingBuffer& ) = delete;
RingBuffer( RingBuffer&& other )
{
memcpy( (char*)&other, (char*)this, sizeof( RingBuffer ) );
m_metadata = nullptr;
m_fd = 0;
}
RingBuffer& operator=( RingBuffer&& other )
{
memcpy( (char*)&other, (char*)this, sizeof( RingBuffer ) );
m_metadata = nullptr;
m_fd = 0;
return *this;
}
RingBuffer( RingBuffer&& other ) = delete;
RingBuffer& operator=( RingBuffer&& other ) = delete;
bool IsValid() const { return m_metadata != nullptr; }
int GetId() const { return m_id; }

@@ -1210,6 +1210,21 @@ private:
return static_cast<ExplicitProducer*>(token.producer);
}
// If a producer token is created before the constructor of a statically allocated
// queue runs (which may happen due to the undefined order of static initialization
// across module boundaries), the constructor will orphan it by resetting the
// producer list. Such a producer is functional, as producer creation works on the
// zero-initialized queue memory, but the consumer is not able to see the data it
// enqueues. This method links the producer back into the list.
bool readopt_orphaned_producer(ExplicitProducer* producer)
{
for (auto ptr = producerListTail.load(std::memory_order_relaxed); ptr != nullptr; ptr = ptr->next_prod()) {
if (ptr == static_cast<ProducerBase*>(producer)) return false;
}
add_producer(static_cast<ProducerBase*>(producer));
return true;
}
private:
//////////////////////////////////
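
The orphaning scenario described in the comment above can be reproduced with a much smaller model. The sketch below is an illustration only, not the moodycamel implementation: `MiniQueue`, `Producer`, `Reset`, `IsLinked` and `Readopt` are hypothetical names, the list is single-threaded, and `Reset` stands in for the queue constructor clearing the producer list.

```cpp
#include <cassert>

// Hypothetical, single-threaded model of a queue's producer list.
struct Producer { Producer* next = nullptr; };

struct MiniQueue
{
    Producer* tail = nullptr;

    void Add( Producer* p ) { p->next = tail; tail = p; }

    // Stands in for the late-running constructor that resets the list and
    // thereby orphans any producer registered earlier.
    void Reset() { tail = nullptr; }

    bool IsLinked( const Producer* p ) const
    {
        for( auto ptr = tail; ptr != nullptr; ptr = ptr->next )
        {
            if( ptr == p ) return true;
        }
        return false;
    }

    // Links the producer back in, unless it is already reachable.
    // Returns true only if a repair was actually performed.
    bool Readopt( Producer* p )
    {
        assert( p != nullptr );
        if( IsLinked( p ) ) return false;
        Add( p );
        return true;
    }
};
```

The membership check before re-linking is what makes the repair safe to call unconditionally: when the producer was never orphaned, the call does nothing.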

@@ -10,7 +10,7 @@ namespace tracy
constexpr unsigned Lz4CompressBound( unsigned isize ) { return isize + ( isize / 255 ) + 16; }
constexpr uint32_t ProtocolVersion = 79;
constexpr uint32_t ProtocolVersion = 80;
constexpr uint16_t BroadcastVersion = 3;
using lz4sz_t = uint32_t;
@@ -155,6 +155,7 @@ struct BroadcastMessage_v0
#pragma pack( pop )
constexpr uint64_t ProtocolOffset8Bit = (1ull << 8);
constexpr uint64_t ProtocolOffset16Bit = (1ull << 16);
constexpr uint64_t ProtocolOffset32Bit = (1ull << 16) + (1ull << 32);
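
`ProtocolOffset8Bit` is what lets the 16-bit string variant store `len - 256` instead of `len`, since lengths up to 255 are already covered by the 8-bit variant. The sketch below illustrates that split under stated assumptions: the `Tag` enum is a hypothetical one-byte stand-in for the queue header, and `DecodeLength` is an assumed receiver-side counterpart that is not shown anywhere in this diff.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr uint64_t ProtocolOffset8Bit = (1ull << 8);

// Hypothetical tags standing in for the SingleStringData8 / SingleStringData
// queue types; the real header layout is not reproduced here.
enum class Tag : uint8_t { String8 = 0, String16 = 1 };

static std::vector<uint8_t> EncodeString( const char* ptr, size_t len )
{
    assert( len <= 0xFFFF );
    std::vector<uint8_t> out;
    if( len <= 0xFF )
    {
        // Short string: one length byte.
        out.push_back( uint8_t( Tag::String8 ) );
        out.push_back( uint8_t( len ) );
    }
    else
    {
        // Long string: two length bytes holding len - 256, in native byte order.
        const auto l16 = uint16_t( len - ProtocolOffset8Bit );
        uint8_t tmp[sizeof( l16 )];
        memcpy( tmp, &l16, sizeof( l16 ) );
        out.push_back( uint8_t( Tag::String16 ) );
        out.insert( out.end(), tmp, tmp + sizeof( l16 ) );
    }
    // The payload is always written with the true length, not the stored one.
    out.insert( out.end(), ptr, ptr + len );
    return out;
}

// Assumed decoder: recovers the true payload length from the tag and length field.
static size_t DecodeLength( const uint8_t* buf )
{
    if( Tag( buf[0] ) == Tag::String8 ) return buf[1];
    uint16_t l16;
    memcpy( &l16, buf + 1, sizeof( l16 ) );
    return size_t( l16 ) + ProtocolOffset8Bit;
}
```

Because the stored 16-bit value differs from the payload size, the size reservation and the payload copy both have to use `len`, which matches the `NeedDataSize` and `AppendDataUnsafe` changes in the `SendSingleString16` and `SendSecondString16` hunks.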

View File

@@ -122,6 +122,8 @@ enum class QueueType : uint8_t
CpuTopology,
SingleStringData,
SecondStringData,
SingleStringData8,
SecondStringData8,
MemNamePayload,
ThreadGroupHint,
GpuZoneAnnotation,
@@ -390,7 +392,7 @@ enum class MessageSeverity : uint8_t
Debug, // Describes variable states and details about specific internal events in the software, that are useful for investigations.
Info, // Describes normal events, which inform on the expected progress and state of your software.
Warning, // Describes potentially dangerous situations caused by unexpected events and states.
Error, // Describes the occurance of unexpected behavior. Does not interrupt the execution of the software.
Error, // Describes the occurrence of unexpected behavior. Does not interrupt the execution of the software.
Fatal, // Describes a critical event that will lead to a software failure/crash.
COUNT
};
@@ -492,7 +494,8 @@ enum class GpuContextType : uint8_t
Metal,
Custom,
CUDA,
Rocprof
Rocprof,
WebGPU
};
enum GpuContextFlags : uint8_t
@@ -1039,6 +1042,8 @@ static constexpr size_t QueueDataSize[] = {
sizeof( QueueHeader ) + sizeof( QueueCpuTopology ),
sizeof( QueueHeader ), // single string data
sizeof( QueueHeader ), // second string data
sizeof( QueueHeader ), // single string data, 8 bit length
sizeof( QueueHeader ), // second string data, 8 bit length
sizeof( QueueHeader ) + sizeof( QueueMemNamePayload ),
sizeof( QueueHeader ) + sizeof( QueueThreadGroupHint ),
sizeof( QueueHeader ) + sizeof( QueueGpuZoneAnnotation ), // GPU zone annotation

View File

@@ -76,14 +76,9 @@
#define TracyAlloc(x,y)
#define TracyFree(x)
#define TracyMemoryDiscard(x)
#define TracySecureAlloc(x,y)
#define TracySecureFree(x)
#define TracySecureMemoryDiscard(x)
#define TracyAllocN(x,y,z)
#define TracyFreeN(x,y)
#define TracySecureAllocN(x,y,z)
#define TracySecureFreeN(x,y)
#define ZoneNamedS(x,y,z)
#define ZoneNamedNS(x,y,z,w)
@@ -101,14 +96,9 @@
#define TracyAllocS(x,y,z)
#define TracyFreeS(x,y)
#define TracyMemoryDiscardS(x,y)
#define TracySecureAllocS(x,y,z)
#define TracySecureFreeS(x,y)
#define TracySecureMemoryDiscardS(x,y)
#define TracyAllocNS(x,y,z,w)
#define TracyFreeNS(x,y,z)
#define TracySecureAllocNS(x,y,z,w)
#define TracySecureFreeNS(x,y,z)
#define TracyMessageS(x,y,z)
#define TracyMessageLS(x,y)
@@ -221,17 +211,12 @@
#define TracyMessageC( txt, size, color ) tracy::Profiler::LogString( tracy::MessageSourceType::User, tracy::MessageSeverity::Info, color, TRACY_CALLSTACK, size, txt )
#define TracyMessageLC( txt, color ) tracy::Profiler::LogString( tracy::MessageSourceType::User, tracy::MessageSeverity::Info, color, TRACY_CALLSTACK, txt )
#define TracyAlloc( ptr, size ) tracy::Profiler::MemAllocCallstack( ptr, size, TRACY_CALLSTACK, false )
#define TracyFree( ptr ) tracy::Profiler::MemFreeCallstack( ptr, TRACY_CALLSTACK, false )
#define TracySecureAlloc( ptr, size ) tracy::Profiler::MemAllocCallstack( ptr, size, TRACY_CALLSTACK, true )
#define TracySecureFree( ptr ) tracy::Profiler::MemFreeCallstack( ptr, TRACY_CALLSTACK, true )
#define TracyAlloc( ptr, size ) tracy::Profiler::MemAllocCallstack( ptr, size, TRACY_CALLSTACK )
#define TracyFree( ptr ) tracy::Profiler::MemFreeCallstack( ptr, TRACY_CALLSTACK )
#define TracyAllocN( ptr, size, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, TRACY_CALLSTACK, false, name )
#define TracyFreeN( ptr, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, TRACY_CALLSTACK, false, name )
#define TracyMemoryDiscard( name ) tracy::Profiler::MemDiscardCallstack( name, false, TRACY_CALLSTACK )
#define TracySecureAllocN( ptr, size, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, TRACY_CALLSTACK, true, name )
#define TracySecureFreeN( ptr, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, TRACY_CALLSTACK, true, name )
#define TracySecureMemoryDiscard( name ) tracy::Profiler::MemDiscardCallstack( name, true, TRACY_CALLSTACK )
#define TracyAllocN( ptr, size, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, TRACY_CALLSTACK, name )
#define TracyFreeN( ptr, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, TRACY_CALLSTACK, name )
#define TracyMemoryDiscard( name ) tracy::Profiler::MemDiscardCallstack( name, TRACY_CALLSTACK )
#define ZoneNamedS( varname, depth, active ) static constexpr tracy::SourceLocationData TracyConcat(__tracy_source_location,TracyLine) { nullptr, TracyFunction, TracyFile, (uint32_t)TracyLine, 0 }; tracy::ScopedZone varname( &TracyConcat(__tracy_source_location,TracyLine), depth, active )
#define ZoneNamedNS( varname, name, depth, active ) static constexpr tracy::SourceLocationData TracyConcat(__tracy_source_location,TracyLine) { name, TracyFunction, TracyFile, (uint32_t)TracyLine, 0 }; tracy::ScopedZone varname( &TracyConcat(__tracy_source_location,TracyLine), depth, active )
@@ -246,17 +231,12 @@
#define ZoneScopedCS( color, depth ) ZoneNamedCS( ___tracy_scoped_zone, color, depth, true )
#define ZoneScopedNCS( name, color, depth ) ZoneNamedNCS( ___tracy_scoped_zone, name, color, depth, true )
#define TracyAllocS( ptr, size, depth ) tracy::Profiler::MemAllocCallstack( ptr, size, depth, false )
#define TracyFreeS( ptr, depth ) tracy::Profiler::MemFreeCallstack( ptr, depth, false )
#define TracySecureAllocS( ptr, size, depth ) tracy::Profiler::MemAllocCallstack( ptr, size, depth, true )
#define TracySecureFreeS( ptr, depth ) tracy::Profiler::MemFreeCallstack( ptr, depth, true )
#define TracyAllocS( ptr, size, depth ) tracy::Profiler::MemAllocCallstack( ptr, size, depth )
#define TracyFreeS( ptr, depth ) tracy::Profiler::MemFreeCallstack( ptr, depth )
#define TracyAllocNS( ptr, size, depth, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, depth, false, name )
#define TracyFreeNS( ptr, depth, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, depth, false, name )
#define TracyMemoryDiscardS( name, depth ) tracy::Profiler::MemDiscardCallstack( name, false, depth )
#define TracySecureAllocNS( ptr, size, depth, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, depth, true, name )
#define TracySecureFreeNS( ptr, depth, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, depth, true, name )
#define TracySecureMemoryDiscardS( name, depth ) tracy::Profiler::MemDiscardCallstack( name, true, depth )
#define TracyAllocNS( ptr, size, depth, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, depth, name )
#define TracyFreeNS( ptr, depth, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, depth, name )
#define TracyMemoryDiscardS( name, depth ) tracy::Profiler::MemDiscardCallstack( name, depth )
#define TracyMessageS( txt, size, depth ) tracy::Profiler::LogString( tracy::MessageSourceType::User, tracy::MessageSeverity::Info, 0, depth, size, txt )
#define TracyMessageLS( txt, depth ) tracy::Profiler::LogString( tracy::MessageSourceType::User, tracy::MessageSeverity::Info, 0, depth, txt )

View File

@@ -64,14 +64,9 @@ typedef const void* TracyCSharedLockCtx;
#define TracyCAlloc(x,y)
#define TracyCFree(x)
#define TracyCMemoryDiscard(x)
#define TracyCSecureAlloc(x,y)
#define TracyCSecureFree(x)
#define TracyCSecureMemoryDiscard(x)
#define TracyCAllocN(x,y,z)
#define TracyCFreeN(x,y)
#define TracyCSecureAllocN(x,y,z)
#define TracyCSecureFreeN(x,y)
#define TracyCFrameMark
#define TracyCFrameMarkNamed(x)
@@ -98,14 +93,9 @@ typedef const void* TracyCSharedLockCtx;
#define TracyCAllocS(x,y,z)
#define TracyCFreeS(x,y)
#define TracyCMemoryDiscardS(x,y)
#define TracyCSecureAllocS(x,y,z)
#define TracyCSecureFreeS(x,y)
#define TracyCSecureMemoryDiscardS(x,y)
#define TracyCAllocNS(x,y,z,w)
#define TracyCFreeNS(x,y,z)
#define TracyCSecureAllocNS(x,y,z,w)
#define TracyCSecureFreeNS(x,y,z)
#define TracyCMessageS(x,y,z)
#define TracyCMessageLS(x,y)
@@ -295,31 +285,26 @@ TRACY_API int32_t ___tracy_connected(void);
#define TracyCZoneValue( ctx, value ) ___tracy_emit_zone_value( ctx, value );
TRACY_API void ___tracy_emit_memory_alloc( const void* ptr, size_t size, int32_t secure );
TRACY_API void ___tracy_emit_memory_alloc_callstack( const void* ptr, size_t size, int32_t depth, int32_t secure );
TRACY_API void ___tracy_emit_memory_free( const void* ptr, int32_t secure );
TRACY_API void ___tracy_emit_memory_free_callstack( const void* ptr, int32_t depth, int32_t secure );
TRACY_API void ___tracy_emit_memory_alloc_named( const void* ptr, size_t size, int32_t secure, const char* name );
TRACY_API void ___tracy_emit_memory_alloc_callstack_named( const void* ptr, size_t size, int32_t depth, int32_t secure, const char* name );
TRACY_API void ___tracy_emit_memory_free_named( const void* ptr, int32_t secure, const char* name );
TRACY_API void ___tracy_emit_memory_free_callstack_named( const void* ptr, int32_t depth, int32_t secure, const char* name );
TRACY_API void ___tracy_emit_memory_discard( const char* name, int32_t secure );
TRACY_API void ___tracy_emit_memory_discard_callstack( const char* name, int32_t secure, int32_t depth );
TRACY_API void ___tracy_emit_memory_alloc( const void* ptr, size_t size );
TRACY_API void ___tracy_emit_memory_alloc_callstack( const void* ptr, size_t size, int32_t depth );
TRACY_API void ___tracy_emit_memory_free( const void* ptr );
TRACY_API void ___tracy_emit_memory_free_callstack( const void* ptr, int32_t depth );
TRACY_API void ___tracy_emit_memory_alloc_named( const void* ptr, size_t size, const char* name );
TRACY_API void ___tracy_emit_memory_alloc_callstack_named( const void* ptr, size_t size, int32_t depth, const char* name );
TRACY_API void ___tracy_emit_memory_free_named( const void* ptr, const char* name );
TRACY_API void ___tracy_emit_memory_free_callstack_named( const void* ptr, int32_t depth, const char* name );
TRACY_API void ___tracy_emit_memory_discard( const char* name );
TRACY_API void ___tracy_emit_memory_discard_callstack( const char* name, int32_t depth );
TRACY_API void ___tracy_emit_logString( int8_t severity, int32_t color, int32_t callstack_depth, size_t size, const char* txt );
TRACY_API void ___tracy_emit_logStringL( int8_t severity, int32_t color, int32_t callstack_depth, const char* txt );
#define TracyCAlloc( ptr, size ) ___tracy_emit_memory_alloc_callstack( ptr, size, TRACY_CALLSTACK, 0 )
#define TracyCFree( ptr ) ___tracy_emit_memory_free_callstack( ptr, TRACY_CALLSTACK, 0 )
#define TracyCMemoryDiscard( name ) ___tracy_emit_memory_discard_callstack( name, 0, TRACY_CALLSTACK );
#define TracyCSecureAlloc( ptr, size ) ___tracy_emit_memory_alloc_callstack( ptr, size, TRACY_CALLSTACK, 1 )
#define TracyCSecureFree( ptr ) ___tracy_emit_memory_free_callstack( ptr, TRACY_CALLSTACK, 1 )
#define TracyCSecureMemoryDiscard( name ) ___tracy_emit_memory_discard_callstack( name, 1, TRACY_CALLSTACK );
#define TracyCAlloc( ptr, size ) ___tracy_emit_memory_alloc_callstack( ptr, size, TRACY_CALLSTACK )
#define TracyCFree( ptr ) ___tracy_emit_memory_free_callstack( ptr, TRACY_CALLSTACK )
#define TracyCMemoryDiscard( name ) ___tracy_emit_memory_discard_callstack( name, TRACY_CALLSTACK );
#define TracyCAllocN( ptr, size, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, TRACY_CALLSTACK, 0, name )
#define TracyCFreeN( ptr, name ) ___tracy_emit_memory_free_callstack_named( ptr, TRACY_CALLSTACK, 0, name )
#define TracyCSecureAllocN( ptr, size, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, TRACY_CALLSTACK, 1, name )
#define TracyCSecureFreeN( ptr, name ) ___tracy_emit_memory_free_callstack_named( ptr, TRACY_CALLSTACK, 1, name )
#define TracyCAllocN( ptr, size, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, TRACY_CALLSTACK, name )
#define TracyCFreeN( ptr, name ) ___tracy_emit_memory_free_callstack_named( ptr, TRACY_CALLSTACK, name )
#define TracyCMessage( txt, size ) ___tracy_emit_logString( TracyMessageSeverityInfo, 0, TRACY_CALLSTACK, size, txt )
#define TracyCMessageL( txt ) ___tracy_emit_logStringL( TracyMessageSeverityInfo, 0, TRACY_CALLSTACK, txt )
@@ -357,17 +342,12 @@ TRACY_API void ___tracy_emit_message_appinfo( const char* txt, size_t size );
#define TracyCZoneCS( ctx, color, depth, active ) static const struct ___tracy_source_location_data TracyConcat(__tracy_source_location,TracyLine) = { NULL, __func__, TracyFile, (uint32_t)TracyLine, color }; TracyCZoneCtx ctx = ___tracy_emit_zone_begin_callstack( &TracyConcat(__tracy_source_location,TracyLine), depth, active );
#define TracyCZoneNCS( ctx, name, color, depth, active ) static const struct ___tracy_source_location_data TracyConcat(__tracy_source_location,TracyLine) = { name, __func__, TracyFile, (uint32_t)TracyLine, color }; TracyCZoneCtx ctx = ___tracy_emit_zone_begin_callstack( &TracyConcat(__tracy_source_location,TracyLine), depth, active );
#define TracyCAllocS( ptr, size, depth ) ___tracy_emit_memory_alloc_callstack( ptr, size, depth, 0 )
#define TracyCFreeS( ptr, depth ) ___tracy_emit_memory_free_callstack( ptr, depth, 0 )
#define TracyCMemoryDiscardS( name, depth ) ___tracy_emit_memory_discard_callstack( name, 0, depth )
#define TracyCSecureAllocS( ptr, size, depth ) ___tracy_emit_memory_alloc_callstack( ptr, size, depth, 1 )
#define TracyCSecureFreeS( ptr, depth ) ___tracy_emit_memory_free_callstack( ptr, depth, 1 )
#define TracyCSecureMemoryDiscardS( name, depth ) ___tracy_emit_memory_discard_callstack( name, 1, depth )
#define TracyCAllocS( ptr, size, depth ) ___tracy_emit_memory_alloc_callstack( ptr, size, depth )
#define TracyCFreeS( ptr, depth ) ___tracy_emit_memory_free_callstack( ptr, depth )
#define TracyCMemoryDiscardS( name, depth ) ___tracy_emit_memory_discard_callstack( name, depth )
#define TracyCAllocNS( ptr, size, depth, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, depth, 0, name )
#define TracyCFreeNS( ptr, depth, name ) ___tracy_emit_memory_free_callstack_named( ptr, depth, 0, name )
#define TracyCSecureAllocNS( ptr, size, depth, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, depth, 1, name )
#define TracyCSecureFreeNS( ptr, depth, name ) ___tracy_emit_memory_free_callstack_named( ptr, depth, 1, name )
#define TracyCAllocNS( ptr, size, depth, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, depth, name )
#define TracyCFreeNS( ptr, depth, name ) ___tracy_emit_memory_free_callstack_named( ptr, depth, name )
#define TracyCMessageS( txt, size, depth ) ___tracy_emit_logString( TracyMessageSeverityInfo, 0, depth, size, txt )
#define TracyCMessageLS( txt, depth ) ___tracy_emit_logStringL( TracyMessageSeverityInfo, 0, depth, txt )

View File

@@ -1,7 +1,12 @@
#ifndef __TRACYOPENGL_HPP__
#define __TRACYOPENGL_HPP__
#if !defined TRACY_ENABLE || defined __APPLE__
#ifdef __APPLE__
#define TRACY_OPENGL_DISABLE
#warning "OpenGL timestamps are unreliable on Apple devices that still run OpenGL."
#endif
#if !defined TRACY_ENABLE || defined TRACY_OPENGL_DISABLE
#define TracyGpuContext
#define TracyGpuContextName(x,y)
@@ -98,17 +103,31 @@ public:
, m_head( 0 )
, m_tail( 0 )
{
ZoneScopedC( Color::Red4 );
assert( m_context != 255 );
glGenQueries( QueryCount, m_query );
if( !CheckFeature( "GL_ARB_timer_query" ) )
{
Profiler::LogString( MessageSourceType::Tracy, MessageSeverity::Warning, Color::Tomato, 0,
"OpenGL context does not support GL_ARB_timer_query." );
}
GLint bits;
glGetQueryiv( GL_TIMESTAMP, GL_QUERY_COUNTER_BITS, &bits );
if( bits == 0 )
{
// all timestamp queries would resolve to 0 (and produce 0ns GPU zones).
// (this is the case for many TBDR GPUs, including Apple Silicon)
Profiler::LogString( MessageSourceType::Tracy, MessageSeverity::Warning, Color::Tomato, 0,
"OpenGL driver does not implement GL_TIMESTAMP precision." );
}
assert( bits > 0 );
int64_t tgpu;
glGetInteger64v( GL_TIMESTAMP, &tgpu );
int64_t tcpu = Profiler::GetTime();
GLint bits;
glGetQueryiv( GL_TIMESTAMP, GL_QUERY_COUNTER_BITS, &bits );
#ifdef TRACY_OPENGL_AUTO_CALIBRATION
// The anchor above is never refreshed; advertise calibration and emit periodic
// GpuCalibration events to correct CPU/GPU drift (see Recalibrate). Opt-in,
@@ -117,6 +136,8 @@ public:
m_prevCalibration = GetHostTimeNs();
#endif
glGenQueries( QueryCount, m_query );
const float period = 1.f;
const auto thread = GetThreadHandle();
TracyLfqPrepare( QueueType::GpuNewContext );
@@ -194,6 +215,30 @@ public:
}
private:
// Returns whether the driver advertises a single extension (full GL_-prefixed token).
static bool CheckFeature( const char* feature )
{
GLint major = 0;
glGetIntegerv( GL_MAJOR_VERSION, &major );
if( glGetError() != GL_NO_ERROR ) major = 0; // pre-3.0: enum not supported
if( major >= 3 )
{
GLint numExt = 0;
glGetIntegerv( GL_NUM_EXTENSIONS, &numExt );
for( GLint i = 0; i < numExt; i++ )
{
auto ext = (const char*)glGetStringi( GL_EXTENSIONS, i );
if( ext && strcmp( ext, feature ) == 0 ) return true;
}
return false;
}
// pre GL3 fallback:
auto exts = (const char*)glGetString( GL_EXTENSIONS );
return exts && strstr( exts, feature ) != nullptr;
}
#ifdef TRACY_OPENGL_AUTO_CALIBRATION
// Monotonic host ns for the inter-calibration interval (cpuDelta), kept
// separate from Profiler::GetTime() as in the D3D12/Vulkan backends.
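Reviewer note on the pre-GL3 fallback in `CheckFeature`: `strstr` also succeeds when the requested name is only a prefix (or any substring) of a longer extension name in the space-separated list. A minimal sketch of a whole-token match is shown below, assuming that stricter behavior is wanted; `HasExtensionToken` is a hypothetical helper and is not part of this diff.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns true only when `feature` appears as a complete
// space-delimited token inside the extension string `exts`.
static bool HasExtensionToken( const char* exts, const char* feature )
{
    if( !exts || !feature || !*feature ) return false;
    const std::size_t len = std::strlen( feature );
    for( const char* p = std::strstr( exts, feature ); p != nullptr; p = std::strstr( p + 1, feature ) )
    {
        // A match counts only if it is bounded by the string ends or by spaces.
        const bool startsToken = ( p == exts ) || ( p[-1] == ' ' );
        const bool endsToken = ( p[len] == ' ' ) || ( p[len] == '\0' );
        if( startsToken && endsToken ) return true;
    }
    return false;
}
```

The helper takes the extension list as a plain C string, so it can be exercised without a GL context.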

View File

@@ -0,0 +1,971 @@
#ifndef __TRACYWEBGPU_HPP__
#define __TRACYWEBGPU_HPP__
// WebGPU, unlike other graphics APIs, has many annoying restrictions that complicate
// the design of the Tracy WebGPU back-end:
// - there's no CPU/GPU clock calibration API
// - submitting GPU commands that touch a buffer that the host is mapping is not permitted
// - resolving timestamps requires destination offsets aligned to 256 bytes
// - timestamps are only available at pass granularity (implementations may need to emulate this)
// - spec mandates timestamps to be in nanoseconds (implementations may need to emulate this)
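Reviewer sketch of the 256-byte restriction listed above. With 8-byte timestamps, an aligned destination offset corresponds to a query index that is a multiple of 32. The names `kResolveAlignment`, `kQueriesPerBlock`, `AlignUp` and `AlignQueryDown` are hypothetical and only illustrate the arithmetic; they are not taken from this header.

```cpp
#include <cstdint>

// Destination-offset alignment for timestamp resolves (hypothetical constant name).
constexpr uint64_t kResolveAlignment = 256;

// Number of 8-byte timestamps that fit in one aligned block: 256 / 8 = 32.
constexpr uint32_t kQueriesPerBlock = uint32_t( kResolveAlignment / sizeof( uint64_t ) );

// Round a byte offset up to the next multiple of `alignment` (must be a power of two).
constexpr uint64_t AlignUp( uint64_t offset, uint64_t alignment )
{
    return ( offset + alignment - 1 ) & ~( alignment - 1 );
}

// Round a query index down to the first query of its aligned block, so that
// index * sizeof(uint64_t) is a valid destination offset.
constexpr uint32_t AlignQueryDown( uint32_t queryIndex )
{
    return queryIndex - ( queryIndex % kQueriesPerBlock );
}

static_assert( kQueriesPerBlock == 32, "32 timestamps per 256-byte block" );
static_assert( AlignUp( 8, kResolveAlignment ) == 256, "offsets round up to 256" );
static_assert( AlignQueryDown( 45 ) == 32, "query 45 lives in the block starting at 32" );
```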
#ifndef TRACY_ENABLE
#define TracyWebGPUSetupDeviceDescriptor(deviceDescriptor)
#define TracyWebGPUContext(instance, device, queue) nullptr
#define TracyWebGPUDestroy(ctx)
#define TracyWebGPUContextName(ctx, name, size)
#define TracyWebGPUZone(ctx, encoder, passDesc, name)
#define TracyWebGPUZoneC(ctx, encoder, passDesc, name, color)
#define TracyWebGPUNamedZone(ctx, varname, encoder, passDesc, name, active)
#define TracyWebGPUNamedZoneC(ctx, varname, encoder, passDesc, name, color, active)
#define TracyWebGPUZoneTransient(ctx, varname, encoder, passDesc, name, active)
#define TracyWebGPUZoneS(ctx, encoder, passDesc, name, depth)
#define TracyWebGPUZoneCS(ctx, encoder, passDesc, name, color, depth)
#define TracyWebGPUNamedZoneS(ctx, varname, encoder, passDesc, name, depth, active)
#define TracyWebGPUNamedZoneCS(ctx, varname, encoder, passDesc, name, color, depth, active)
#define TracyWebGPUZoneTransientS(ctx, varname, encoder, passDesc, name, depth, active)
#define TracyWebGPUCollect(ctx)
namespace tracy
{
class WebGPUZoneScope {};
}
using TracyWebGPUCtx = void*;
#else
#include "Tracy.hpp"
#include "../client/TracyProfiler.hpp"
#include "../client/TracyCallstack.hpp"
#include "../common/TracyAlign.hpp"
#include "../common/TracyAlloc.hpp"
#include <atomic>
#include <mutex>
#include <vector>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cassert>
#include <chrono>
#include <thread>
#include <webgpu/webgpu.h>
// piggy-back on WGPU_DAWN_TOGGLES_DESCRIPTOR_INIT to detect Dawn header
#ifdef WGPU_DAWN_TOGGLES_DESCRIPTOR_INIT
#define TRACY_WEBGPU_DAWN_NATIVE (1)
#include <dawn/native/DawnNative.h>
#else
#define TRACY_WEBGPU_WGPU_NATIVE (1)
#include <webgpu/wgpu.h>
#endif
#ifndef TRACY_WEBGPU_DEBUG_LEVEL
#define TRACY_WEBGPU_DEBUG_LEVEL (0)
#endif//TRACY_WEBGPU_DEBUG_LEVEL
#if TRACY_WEBGPU_DEBUG_LEVEL
#define TracyWebGPUDebug(...) __VA_ARGS__;
#if defined(_MSC_VER)
extern "C" int32_t IsDebuggerPresent(void);
#define TracyWebGPUBreak() if (IsDebuggerPresent()) __debugbreak()
#else
#define TracyWebGPUBreak() ((void)0)
#endif
#define TracyWebGPUAssert(predicate, ...) if (predicate) {} else { __VA_ARGS__; TracyWebGPUBreak(); }
#else
#define TracyWebGPUDebug(...)
#define TracyWebGPUBreak()
#define TracyWebGPUAssert(predicate, ...) assert(predicate);
#endif
#define TracyWebGPULog(severity, msg) fprintf(stdout, "%s", msg), tracy::Profiler::LogString( tracy::MessageSourceType::Tracy, tracy::MessageSeverity::severity, tracy::Color::Red4, 0, msg );
#define TracyWebGPUPanic(msg, ...) do { TracyWebGPULog(Error, msg); TracyWebGPUAssert(false && "TracyWebGPU: " msg); __VA_ARGS__; } while(false);
namespace tracy
{
class WebGPUQueueCtx
{
friend class WebGPUZoneScope;
uint8_t m_contextId = 255; // 255 represents "invalid id"
std::mutex m_collectionMutex;
WGPUInstance m_instance = nullptr;
WGPUDevice m_device = nullptr;
WGPUQueue m_queue = nullptr;
struct ReadbackStage
{
WGPUBuffer buffer = nullptr;
std::atomic<uint64_t> copiedUpto {0};
std::atomic<WGPUMapAsyncStatus> mapStatus = {};
WGPUFuture pendingFuture = {};
};
static_assert(std::atomic<WGPUMapAsyncStatus>::is_always_lock_free, "WGPUMapAsyncStatus must be lock-free atomic");
WGPUQuerySet m_querySet = nullptr;
WGPUBuffer m_resolveBuffer = nullptr;
ReadbackStage m_readbackReel [3];
std::atomic<int> m_writeIdx {0};
using atomic_counter = std::atomic<uint64_t>;
atomic_counter m_queryCounter = 0;
atomic_counter m_previousCheckpoint = 0;
uint32_t m_queryLimit = 0;
std::vector<uint64_t> m_shadowBuffer;
using WallTime = std::chrono::steady_clock::time_point;
static tracy_force_inline auto GetWallTime() { return WallTime::clock::now(); }
static tracy_force_inline auto Milliseconds(int value) { return std::chrono::milliseconds(value); }
static bool WaitQueueIdle(WGPUQueue queue, WGPUInstance instance)
{
bool gpuDone = false;
WGPUQueueWorkDoneCallbackInfo doneCB = {};
doneCB.mode = WGPUCallbackMode_AllowProcessEvents;
doneCB.callback = [](WGPUQueueWorkDoneStatus, WGPUStringView, void* userData, void*) {
*static_cast<bool*>(userData) = true;
};
doneCB.userdata1 = &gpuDone;
wgpuQueueOnSubmittedWorkDone(queue, doneCB);
const auto deadline = GetWallTime() + Milliseconds(2000);
while (!gpuDone && GetWallTime() < deadline)
wgpuInstanceProcessEvents(instance);
return gpuDone;
}
static const uint64_t* MapBufferSync(WGPUBuffer buffer, WGPUInstance instance)
{
struct MapCtx { WGPUMapAsyncStatus status = {}; } ctx;
WGPUBufferMapCallbackInfo cbInfo = {};
cbInfo.mode = WGPUCallbackMode_AllowProcessEvents;
cbInfo.callback = [](WGPUMapAsyncStatus status, WGPUStringView, void* userData, void*) {
auto* ctx = static_cast<MapCtx*>(userData);
ctx->status = status;
};
cbInfo.userdata1 = &ctx;
size_t offset = 0;
size_t size = 2 * sizeof(uint64_t);
wgpuBufferMapAsync(buffer, WGPUMapMode_Read, offset, size, cbInfo);
const auto deadline = GetWallTime() + Milliseconds(2000);
while (ctx.status == 0 && GetWallTime() < deadline)
wgpuInstanceProcessEvents(instance);
if (ctx.status != WGPUMapAsyncStatus_Success) return nullptr;
auto data = wgpuBufferGetConstMappedRange(buffer, offset, size);
return static_cast<const uint64_t*>(data);
}
struct Calibration {
int64_t minCpuRange = ~uint64_t(0) >> 1;
struct Regression
{
int64_t n = 0;
int64_t mean_x = 0;
int64_t mean_y = 0;
int64_t S_xx = 0;
int64_t S_xy = 0;
void Update(int64_t x, int64_t y)
{
n += 1;
int64_t dx = x - mean_x;
int64_t dy = y - mean_y;
mean_x += dx / n;
mean_y += dy / n;
S_xx += dx * (x - mean_x);
S_xy += dx * (y - mean_y);
}
double Slope() const { return double(S_xy) / S_xx; }
double Intercept() const { return mean_y - Slope() * mean_x; }
};
Regression cpuToGpuModel; // cpu-ticks to gpu-ticks
Regression cpuRangeModel; // cpu-tick interval uncertainty
Regression wallToGpuModel; // nanoseconds to gpu-ticks
void GetReferenceTime(uint64_t& cpuTime, uint64_t& gpuTime) const
{
// the point (mean_x, mean_y) always lies on the least-squares regression line
cpuTime = cpuToGpuModel.mean_x;
gpuTime = cpuToGpuModel.mean_y;
}
double Period() const { return 1.0 / wallToGpuModel.Slope(); } // ns/tick
bool AcceptX(const Regression& r, int64_t x, double threshold = 3.0) const {
if (r.n < 2) return true;
auto dx = x - r.mean_x;
if (dx <= 0) return true; // always accept "tighter" outliers
double variance = double(r.S_xx) / (r.n - 1);
if (variance == 0.0) return true;
// WARN: dx*dx "could" overflow, but very unlikely in practice
double zz = (double)(dx*dx) / variance;
return zz <= (threshold*threshold);
}
bool Update(WallTime twall0, WallTime twall1, uint64_t tcpu0, uint64_t tcpu1, uint64_t tgpu)
{
using namespace std::chrono;
int64_t cpuRange = tcpu1 - tcpu0;
cpuRangeModel.Update(cpuRange, 0);
if (!AcceptX(cpuRangeModel, cpuRange, 1.0)) return false;
// Process sample:
int64_t tcpu = tcpu0 + (tcpu1 - tcpu0) / 2; // mid-point
int64_t twall = duration_cast<nanoseconds>(
(twall0 + (twall1 - twall0) / 2) // mid-point
.time_since_epoch()
).count();
// incremental regression:
cpuToGpuModel.Update(tcpu, tgpu);
wallToGpuModel.Update(twall, tgpu);
TracyWebGPUDebug( fprintf(stderr, "----- (sample accepted! wall = %lld | cpu = %lld | gpu = %lld | period = %f)\n", twall, tcpu, tgpu, Period()) );
return true;
}
} m_calibration;
tracy_force_inline void SubmitQueueItem(tracy::QueueItem* item)
{
#ifdef TRACY_ON_DEMAND
GetProfiler().DeferItem(*item);
#endif
Profiler::QueueSerialFinish();
}
bool CalibrateClocks(uint64_t& outCpuTime, uint64_t& outGpuTime, double& period)
{
// WebGPU does not have any clock calibration API.
// This routine attempts to estimate a reasonable (cpuTime, gpuTime) correlation
// by sampling CPU and GPU timestamps around a "synchronous" draw call.
// Several samples are taken to tighten the estimation.
ZoneScoped;
WGPUShaderSourceWGSL wgslSrc = {};
wgslSrc.chain.sType = WGPUSType_ShaderSourceWGSL;
wgslSrc.code =
{
R"(
@vertex fn vs(@builtin(vertex_index) i: u32) -> @builtin(position) vec4f {
var p = array(vec4f(-1,-1,.5,1), vec4f(3,-1,.5,1), vec4f(-1,3,.5,1));
return p[i];
}
@fragment fn fs() -> @location(0) vec4f { return vec4f(0.0); }
)",
WGPU_STRLEN
};
WGPUShaderModuleDescriptor smDesc = {};
smDesc.nextInChain = reinterpret_cast<WGPUChainedStruct*>(&wgslSrc);
WGPUShaderModule calibShader = wgpuDeviceCreateShaderModule(m_device, &smDesc);
if (!calibShader) { TracyWebGPUPanic("Failed to create calibration shader.", return false); }
WGPUTextureDescriptor texDesc = {};
texDesc.usage = WGPUTextureUsage_RenderAttachment;
texDesc.dimension = WGPUTextureDimension_2D;
texDesc.size = { 1, 1, 1 };
texDesc.format = WGPUTextureFormat_BGRA8Unorm;
texDesc.mipLevelCount = 1;
texDesc.sampleCount = 1;
WGPUTexture tex = wgpuDeviceCreateTexture(m_device, &texDesc);
if (!tex) { wgpuShaderModuleRelease(calibShader); TracyWebGPUPanic("Failed to create calibration scratch texture.", return false); }
WGPUTextureView texView = wgpuTextureCreateView(tex, nullptr);
if (!texView) { wgpuTextureRelease(tex); wgpuShaderModuleRelease(calibShader); TracyWebGPUPanic("Failed to create calibration scratch texture view.", return false); }
WGPUColorTargetState colorTarget = {};
colorTarget.format = WGPUTextureFormat_BGRA8Unorm;
colorTarget.writeMask = WGPUColorWriteMask_All;
WGPUFragmentState fragState = {};
fragState.module = calibShader;
fragState.entryPoint = { "fs", WGPU_STRLEN };
fragState.targetCount = 1;
fragState.targets = &colorTarget;
WGPURenderPipelineDescriptor pipeDesc = {};
pipeDesc.vertex.module = calibShader;
pipeDesc.vertex.entryPoint = { "vs", WGPU_STRLEN };
pipeDesc.primitive.topology = WGPUPrimitiveTopology_TriangleList;
pipeDesc.multisample.count = 1;
pipeDesc.fragment = &fragState;
WGPURenderPipeline calibPipeline = wgpuDeviceCreateRenderPipeline(m_device, &pipeDesc);
if (!calibPipeline) { wgpuTextureViewRelease(texView); wgpuTextureRelease(tex); wgpuShaderModuleRelease(calibShader); TracyWebGPUPanic("Failed to create calibration pipeline.", return false); }
uint32_t queryId = 0;
WGPUPassTimestampWrites anchorTs = {};
anchorTs.querySet = m_querySet;
anchorTs.beginningOfPassWriteIndex = queryId;
anchorTs.endOfPassWriteIndex = queryId+1;
WGPURenderPassColorAttachment att = {};
att.view = texView;
att.loadOp = WGPULoadOp_Clear;
att.storeOp = WGPUStoreOp_Store;
att.depthSlice = WGPU_DEPTH_SLICE_UNDEFINED;
WGPURenderPassDescriptor passDesc = {};
passDesc.colorAttachmentCount = 1;
passDesc.colorAttachments = &att;
passDesc.timestampWrites = &anchorTs;
// calibration loop
const auto deadline = GetWallTime() + Milliseconds(100);
for (int i = 0; i < 1000; ++i)
{
// keep looping while the time budget (100ms) allows, but always run iterations 0..5
// (5 candidate samples, since the first iteration is skipped below)
if ((GetWallTime() >= deadline) && (i > 5))
break;
WGPUCommandEncoder enc = wgpuDeviceCreateCommandEncoder(m_device, nullptr);
if (!enc) { TracyWebGPUPanic("Failed to create command encoder for time calibration.", return false); }
WGPURenderPassEncoder pass = wgpuCommandEncoderBeginRenderPass(enc, &passDesc);
wgpuRenderPassEncoderSetPipeline(pass, calibPipeline);
wgpuRenderPassEncoderDraw(pass, 3, 1, 0, 0);
wgpuRenderPassEncoderEnd(pass);
wgpuRenderPassEncoderRelease(pass);
WGPUBuffer readBackBuffer = m_readbackReel[0].buffer;
uint32_t byteOffset = queryId * sizeof(uint64_t);
uint32_t sizeInBytes = 2 * sizeof(uint64_t);
wgpuCommandEncoderResolveQuerySet(enc, m_querySet, queryId, 2, m_resolveBuffer, byteOffset);
wgpuCommandEncoderCopyBufferToBuffer(enc, m_resolveBuffer, byteOffset, readBackBuffer, byteOffset, sizeInBytes);
WGPUCommandBuffer cmd = wgpuCommandEncoderFinish(enc, nullptr);
wgpuCommandEncoderRelease(enc);
if (!cmd) { TracyWebGPUPanic("Failed to finish calibration command encoder.", return false); }
WaitQueueIdle(m_queue, m_instance);
int64_t cpu [2] = {};
int64_t gpu [2] = {};
WallTime wall [2] = {};
cpu[0] = Profiler::GetTime();
wall[0] = GetWallTime();
wgpuQueueSubmit(m_queue, 1, &cmd);
wgpuCommandBufferRelease(cmd);
WaitQueueIdle(m_queue, m_instance);
wall[1] = GetWallTime();
cpu[1] = Profiler::GetTime();
auto gpuTimestamps = MapBufferSync(readBackBuffer, m_instance);
TracyWebGPUAssert(gpuTimestamps != nullptr);
gpu[0] = gpuTimestamps[0];
gpu[1] = gpuTimestamps[1];
wgpuBufferUnmap(readBackBuffer);
TracyWebGPUDebug(
fprintf(stdout, "[%03d] CalibrateClocks() [CPU] %16lld | %16lld | /// %lld\n", i, cpu[0], cpu[1], cpu[1]-cpu[0]);
fprintf(stdout, "----------------------- [GPU] %16llu | %16llu | /// %lld\n", gpu[0], gpu[1], gpu[1]-gpu[0]);
uint64_t cpuTimeRef, gpuTimeRef;
m_calibration.GetReferenceTime(cpuTimeRef, gpuTimeRef);
if (gpu[0] < gpuTimeRef)
fprintf(stdout, "!!!!! CalibrateClocks() -> WARNING!!! going backwards!\n%llu\n%llu\n%lld\n", gpuTimeRef, gpu[0], gpu[0] - gpuTimeRef);
);
// skip first sample since it is quite jittery (lazy initialization of WebGPU objects)
if (i == 0)
continue;
m_calibration.Update(wall[0], wall[1], cpu[0], cpu[1], gpu[0]);
};
TracyWebGPUDebug(
fprintf(stdout, "##### CalibrateClocks() WALL = %lld | CPU = %lld | GPU = %lld | period = %f\n",
m_calibration.wallToGpuModel.mean_x,
m_calibration.cpuToGpuModel.mean_x,
m_calibration.cpuToGpuModel.mean_y,
m_calibration.Period());
);
wgpuRenderPipelineRelease(calibPipeline);
wgpuShaderModuleRelease(calibShader);
wgpuTextureViewRelease(texView);
wgpuTextureRelease(tex);
m_calibration.GetReferenceTime(outCpuTime, outGpuTime);
period = m_calibration.Period();
// assume 1 ns/tick if the period estimation is close enough to 1
if (std::abs(period - 1.0) < 0.001)
period = 1.0;
return true;
}
public:
class Requirements
{
private:
# if (TRACY_WEBGPU_DAWN_NATIVE)
WGPUDawnTogglesDescriptor dawnTogglesDesc = {};
static constexpr int NumExtras = 0;
# elif (TRACY_WEBGPU_WGPU_NATIVE)
static constexpr int NumExtras = 1;
# endif
public:
static constexpr int NumFeatures = 1 + NumExtras;
WGPUFeatureName features [NumFeatures] = {};
WGPUChainedStruct* togglesDesc = nullptr;
Requirements()
{
this->features[0] = WGPUFeatureName_TimestampQuery;
# if (TRACY_WEBGPU_WGPU_NATIVE)
this->features[1] = (WGPUFeatureName)WGPUNativeFeature_TimestampQueryInsideEncoders;
# endif
# if (TRACY_WEBGPU_DAWN_NATIVE)
static const char* dawnDisabledToggles[] = { "timestamp_quantization" };
static const char* dawnEnabledToggles[] = { "disable_timestamp_query_conversion" };
this->dawnTogglesDesc.chain.sType = WGPUSType_DawnTogglesDescriptor;
this->dawnTogglesDesc.disabledToggles = dawnDisabledToggles;
this->dawnTogglesDesc.disabledToggleCount = 1;
this->dawnTogglesDesc.enabledToggles = dawnEnabledToggles;
this->dawnTogglesDesc.enabledToggleCount = 1;
this->togglesDesc = reinterpret_cast<WGPUChainedStruct*>(&this->dawnTogglesDesc);
# endif
}
static bool VerifyDevice(WGPUDevice device)
{
if (device == nullptr)
return false;
if (wgpuDeviceHasFeature(device, WGPUFeatureName_TimestampQuery) == WGPU_FALSE)
return false;
# if (TRACY_WEBGPU_DAWN_NATIVE)
bool hasDisableConversion = false, hasQuantization = false;
for (const char* t : ::dawn::native::GetTogglesUsed(device))
{
if (strcmp(t, "disable_timestamp_query_conversion") == 0)
hasDisableConversion = true;
if (strcmp(t, "timestamp_quantization") == 0)
hasQuantization = true;
}
return hasDisableConversion && !hasQuantization;
# elif (TRACY_WEBGPU_WGPU_NATIVE)
if (wgpuDeviceHasFeature(device, (WGPUFeatureName)WGPUNativeFeature_TimestampQueryInsideEncoders) == WGPU_FALSE)
return false;
return true;
# endif
return false;
}
void ApplyToDeviceDescriptor(WGPUDeviceDescriptor& deviceDescriptor)
{
size_t userCount = deviceDescriptor.requiredFeatureCount;
size_t totalCount = userCount + NumFeatures;
// NOTE: this allocation is never freed: deviceDescriptor.requiredFeatures keeps pointing at it.
auto* mergedFeatures = static_cast<WGPUFeatureName*>(tracy_malloc(totalCount * sizeof(WGPUFeatureName)));
if (userCount > 0 && deviceDescriptor.requiredFeatures)
memcpy(mergedFeatures, deviceDescriptor.requiredFeatures, userCount * sizeof(WGPUFeatureName));
memcpy(mergedFeatures + userCount, features, NumFeatures * sizeof(WGPUFeatureName));
deviceDescriptor.requiredFeatures = mergedFeatures;
deviceDescriptor.requiredFeatureCount = totalCount;
if (togglesDesc)
{
togglesDesc->next = deviceDescriptor.nextInChain;
deviceDescriptor.nextInChain = togglesDesc;
}
}
};
WebGPUQueueCtx(WGPUInstance instance, WGPUDevice device, WGPUQueue queue)
{
ZoneScopedC(Color::Red4);
if (!Requirements::VerifyDevice(device))
TracyWebGPUPanic("GPU profiling disabled because the device did not enable the necessary features.", return);
TracyWebGPUAssert(instance); wgpuInstanceAddRef(instance); m_instance = instance;
TracyWebGPUAssert(device); wgpuDeviceAddRef(device); m_device = device;
TracyWebGPUAssert(queue); wgpuQueueAddRef(queue); m_queue = queue;
// Set up the query set: it must have an even size since queries are issued in pairs.
// (The WebGPU spec caps a query set at 4096 queries; there is no device limit to query.)
WGPUQuerySetDescriptor qsDesc = {};
qsDesc.type = WGPUQueryType_Timestamp;
qsDesc.count = 4096;
for (;;)
{
m_querySet = wgpuDeviceCreateQuerySet(m_device, &qsDesc);
if (m_querySet) break;
qsDesc.count /= 2;
if (qsDesc.count < 128) break;
}
if (m_querySet == nullptr)
TracyWebGPUPanic("Failed to create timestamp query set.", return);
m_queryLimit = qsDesc.count;
WGPUBufferDescriptor resolveDesc = {};
resolveDesc.usage = WGPUBufferUsage_QueryResolve | WGPUBufferUsage_CopySrc;
resolveDesc.size = static_cast<uint64_t>(m_queryLimit) * sizeof(uint64_t);
m_resolveBuffer = wgpuDeviceCreateBuffer(m_device, &resolveDesc);
if (!m_resolveBuffer)
TracyWebGPUPanic("Failed to create timestamp resolve buffer.", return);
WGPUBufferDescriptor readbackDesc = {};
readbackDesc.usage = WGPUBufferUsage_CopyDst | WGPUBufferUsage_MapRead;
readbackDesc.size = static_cast<uint64_t>(m_queryLimit) * sizeof(uint64_t);
for (auto& stage : m_readbackReel)
{
stage.buffer = wgpuDeviceCreateBuffer(m_device, &readbackDesc);
stage.copiedUpto = 0;
if (!stage.buffer) { TracyWebGPUPanic("Failed to create timestamp readback buffer.", return); }
}
uint64_t cpuTimestamp = 0;
uint64_t gpuTimestamp = 0;
double period = 0.0; // in nanoseconds per gpu-tick
if (!CalibrateClocks(cpuTimestamp, gpuTimestamp, period))
TracyWebGPUPanic("Failed to calibrate CPU/GPU clocks.", return);
TracyWebGPUDebug( fprintf(stdout, "[WebGPUQueueCtx] cpuTimestamp: %llu | gpuTimestamp: %llu | period: %f\n", cpuTimestamp, gpuTimestamp, period) );
m_shadowBuffer.resize(m_queryLimit, gpuTimestamp);
// All setup completed: register the context.
m_contextId = GetGpuCtxCounter().fetch_add(1);
ZoneValue(m_contextId);
auto* item = Profiler::QueueSerial();
MemWrite(&item->hdr.type, QueueType::GpuNewContext);
MemWrite(&item->gpuNewContext.cpuTime, static_cast<int64_t>(cpuTimestamp));
MemWrite(&item->gpuNewContext.gpuTime, static_cast<int64_t>(gpuTimestamp));
MemWrite(&item->gpuNewContext.thread, static_cast<uint32_t>(0));
MemWrite(&item->gpuNewContext.period, static_cast<float>(period));
MemWrite(&item->gpuNewContext.context, static_cast<uint8_t>(GetId()));
MemWrite(&item->gpuNewContext.flags, GpuContextFlags(0)); // no calibration available
MemWrite(&item->gpuNewContext.type, GpuContextType::WebGPU);
SubmitQueueItem(item);
}
~WebGPUQueueCtx()
{
// TODO: a few problems to address later during this final Collect():
// 1. ensure "partial" query batches are collected
// 2. ensure all readback stages are collected and empty
// 3. ensure readback buffers are not mapped before deleting them
Collect();
for (auto& stage : m_readbackReel)
if (stage.buffer) { wgpuBufferRelease(stage.buffer); stage.buffer = nullptr; }
if (m_resolveBuffer) { wgpuBufferRelease(m_resolveBuffer); m_resolveBuffer = nullptr; }
if (m_querySet) { wgpuQuerySetRelease(m_querySet); m_querySet = nullptr; }
if (m_queue) { wgpuQueueRelease(m_queue); m_queue = nullptr; }
if (m_device) { wgpuDeviceRelease(m_device); m_device = nullptr; }
if (m_instance) { wgpuInstanceRelease(m_instance); m_instance = nullptr; }
}
tracy_force_inline uint8_t GetId() const
{
return m_contextId;
}
void Name(const char* name, uint16_t len)
{
auto ptr = (char*)tracy_malloc(len);
memcpy(ptr, name, len);
auto item = Profiler::QueueSerial();
MemWrite(&item->hdr.type, QueueType::GpuContextName);
MemWrite(&item->gpuContextNameFat.context, GetId());
MemWrite(&item->gpuContextNameFat.ptr, (uint64_t)ptr);
MemWrite(&item->gpuContextNameFat.size, len);
SubmitQueueItem(item);
}
void Collect(bool webgpuProcessEvents=false)
{
#ifdef TRACY_ON_DEMAND
if (!GetProfiler().IsConnected()) return;
#endif
if (!m_collectionMutex.try_lock()) return;
std::unique_lock<std::mutex> lock(m_collectionMutex, std::adopt_lock);
ZoneScopedC(Color::Red4);
if (Distance(m_previousCheckpoint, m_queryCounter) <= 0)
return;
// Current Readback "Reel" Stages:
const int state = m_writeIdx;
const int fillingIdx = (state + 0) % 3; // this is where instrumentation is pushing new queries
const int pendingIdx = (state + 1) % 3; // instrumentation is done here; ready to be collected
const int collectIdx = (state + 2) % 3; // this is where queries are being collected right now
// Ensure readback buffer has been mapped to the host
auto& collectStage = m_readbackReel[collectIdx];
if (collectStage.pendingFuture.id != 0)
{
if (webgpuProcessEvents)
wgpuInstanceProcessEvents(m_instance);
if (collectStage.mapStatus == WGPUMapAsyncStatus{})
return; // callback hasn't fired yet
collectStage.pendingFuture = {};
if (collectStage.mapStatus != WGPUMapAsyncStatus_Success)
TracyWebGPUPanic("Collect(): unable to map readback buffer.", return);
}
if (collectStage.mapStatus == WGPUMapAsyncStatus_Success)
{
const uint64_t* ts = static_cast<const uint64_t*>(
wgpuBufferGetConstMappedRange(collectStage.buffer, 0,
static_cast<uint64_t>(m_queryLimit) * sizeof(uint64_t)));
if (ts)
{
uint64_t ticket = m_previousCheckpoint;
const uint64_t end = collectStage.copiedUpto;
TracyWebGPUDebug( fprintf(stdout, "[TWG] Collect [%d] (%llu, %llu)\n", collectIdx, ticket, end) );
for (; Distance(ticket, end) > 0; ticket += 2)
{
const uint32_t slotB = RingIndex(ticket);
const uint32_t slotE = slotB + 1;
TracyWebGPUDebug(
fprintf(stderr,
"[TWG] slot B=%4u E=%4u ts[B]=%llu ts[E]=%llu shadow[E]=%llu ts-diff=%lld shadow-diff=%lld\n",
slotB, slotE,
ts[slotB], ts[slotE], m_shadowBuffer[slotE],
Distance(ts[slotB], ts[slotE]),
Distance(m_shadowBuffer[slotE], ts[slotE]));
);
if (Distance(m_shadowBuffer[slotE], ts[slotE]) <= 0)
break; // GPU hasn't written this timestamp yet; retry next Collect()
EmitGpuTime(ts[slotB], slotB);
EmitGpuTime(ts[slotE], slotE);
}
m_previousCheckpoint = ticket;
if (Distance(ticket, end) > 0)
return; // still unresolved queries in this buffer; come back next Collect()
}
// All queries resolved (or getMappedRange failed): unmap and fall through to rotate.
wgpuBufferUnmap(collectStage.buffer);
collectStage.mapStatus = {};
}
// At this point, all queries in the collect buffer have been processed.
// (it's now time to "rotate" the buffers around...)
// Has any ResolveQueryBatch call landed in this reel stage since it was last recycled?
// (Are there any queries to resolve and collect at all?)
if (m_readbackReel[fillingIdx].copiedUpto <= m_previousCheckpoint)
return;
// Rotate/Cycle the Readback Pipeline State:
// the buffer that was just collected shall now be used for instrumentation
collectStage.copiedUpto = m_previousCheckpoint.load();
m_writeIdx = collectIdx; // atomically commit the pipeline rotation
auto& nextToCollect = m_readbackReel[pendingIdx];
WGPUBufferMapCallbackInfo cbInfo = {};
cbInfo.mode = WGPUCallbackMode_AllowProcessEvents;
cbInfo.callback = [](WGPUMapAsyncStatus status, WGPUStringView, void* userData, void*)
{
auto* stage = static_cast<ReadbackStage*>(userData);
stage->mapStatus = status;
};
cbInfo.userdata1 = &nextToCollect;
nextToCollect.pendingFuture = wgpuBufferMapAsync(
nextToCollect.buffer, WGPUMapMode_Read, 0,
static_cast<uint64_t>(m_queryLimit) * sizeof(uint64_t), cbInfo);
}
private:
void EmitGpuTime(uint64_t gpuTimestamp, uint32_t queryId)
{
auto* item = Profiler::QueueSerial();
MemWrite(&item->hdr.type, QueueType::GpuTime);
MemWrite(&item->gpuTime.gpuTime, static_cast<int64_t>(gpuTimestamp));
MemWrite(&item->gpuTime.queryId, static_cast<uint16_t>(queryId));
MemWrite(&item->gpuTime.context, GetId());
Profiler::QueueSerialFinish();
m_shadowBuffer[queryId] = gpuTimestamp;
}
tracy_force_inline uint32_t RingCapacity() const { return m_queryLimit; }
tracy_force_inline uint32_t RingIndex(uint64_t t) const
{
return static_cast<uint32_t>(t % RingCapacity());
}
tracy_force_inline static int64_t Distance(uint64_t begin, uint64_t end)
{
return static_cast<int64_t>(end - begin);
}
tracy_force_inline uint64_t NextQueryId()
{
const uint64_t ticket = m_queryCounter.fetch_add(2, std::memory_order_relaxed);
if (Distance(m_previousCheckpoint, ticket)
>= static_cast<int64_t>(RingCapacity()))
{
TracyWebGPULog(Warning, "Too many pending GPU queries: stalling!");
Collect();
}
return ticket;
}
};
class WebGPUZoneScope
{
const bool m_active;
WebGPUQueueCtx* m_ctx = nullptr;
WGPUCommandEncoder m_encoder = nullptr;
uint64_t m_rawTicket = 0;
uint32_t m_queryId = 0;
WGPUPassTimestampWrites m_timestampWrites = {};
void ResolveQueryBatch(uint32_t queryBatchStartId)
{
// Ensure there are pending queries to resolve in the batch
auto& stage = m_ctx->m_readbackReel[m_ctx->m_writeIdx];
if (WebGPUQueueCtx::Distance(stage.copiedUpto, m_rawTicket) <= 0) return;
// 32 queries = 32 * 8 bytes = 256 bytes
TracyWebGPUAssert(queryBatchStartId % 32 == 0, return);
queryBatchStartId = m_ctx->RingIndex(queryBatchStartId);
const uint64_t blockOffset = static_cast<uint64_t>(queryBatchStartId) * sizeof(uint64_t);
wgpuCommandEncoderResolveQuerySet(
m_encoder,
m_ctx->m_querySet,
queryBatchStartId, 32,
m_ctx->m_resolveBuffer,
blockOffset // MUST be a multiple of (aligned to) 256...
);
auto readbackBuffer = stage.buffer;
wgpuCommandEncoderCopyBufferToBuffer(
m_encoder,
m_ctx->m_resolveBuffer,
blockOffset,
readbackBuffer,
blockOffset,
32 * sizeof(uint64_t)
);
// Advance this stage's high-water mark to cover the block just encoded.
// TODO: maybe we can use fetch_add to increment the atomic and not need
// to keep track of the raw ticket; Collect would need to derive the raw
// end ticket number.
const uint64_t blockEnd = m_rawTicket;
uint64_t prev = stage.copiedUpto;
while ((WebGPUQueueCtx::Distance(prev, blockEnd) > 0) &&
!stage.copiedUpto.compare_exchange_weak(prev, blockEnd)) {}
TracyWebGPUDebug( fprintf(stdout, "[TWG] WebGPUZoneScope [%d] (%u,%u)\n", (int)m_ctx->m_writeIdx, queryBatchStartId, queryBatchStartId+32) );
}
tracy_force_inline void WriteQueueItem(const SourceLocationData* srcLocation, int32_t callstackDepth, uint32_t sourceLine, const char* sourceFile, size_t sourceFileLen, const char* functionName, size_t functionNameLen, const char* zoneName, size_t zoneNameLen)
{
if (!m_active) return;
const bool captureCallstack = callstackDepth > 0 && has_callstack();
const bool transientZone = srcLocation == nullptr;
uint64_t srcLocationAddr = reinterpret_cast<uint64_t>(srcLocation);
QueueItem* item = nullptr;
QueueType itemType;
if (transientZone)
{
srcLocationAddr = Profiler::AllocSourceLocation(sourceLine, sourceFile, sourceFileLen, functionName, functionNameLen, zoneName, zoneNameLen);
if (captureCallstack)
{
item = Profiler::QueueSerialCallstack(Callstack(callstackDepth));
itemType = QueueType::GpuZoneBeginAllocSrcLocCallstackSerial;
}
else
{
item = Profiler::QueueSerial();
itemType = QueueType::GpuZoneBeginAllocSrcLocSerial;
}
}
else
{
if (captureCallstack)
{
item = Profiler::QueueSerialCallstack(Callstack(callstackDepth));
itemType = QueueType::GpuZoneBeginCallstackSerial;
}
else
{
item = Profiler::QueueSerial();
itemType = QueueType::GpuZoneBeginSerial;
}
}
MemWrite(&item->hdr.type, itemType);
MemWrite(&item->gpuZoneBegin.cpuTime, Profiler::GetTime());
MemWrite(&item->gpuZoneBegin.srcloc, srcLocationAddr);
MemWrite(&item->gpuZoneBegin.thread, GetThreadHandle());
MemWrite(&item->gpuZoneBegin.queryId, static_cast<uint16_t>(m_queryId));
MemWrite(&item->gpuZoneBegin.context, m_ctx->GetId());
Profiler::QueueSerialFinish();
}
// Fills in m_timestampWrites and assigns its address to passDesc.timestampWrites.
// Works with both WGPURenderPassDescriptor and WGPUComputePassDescriptor.
template<typename PassDescriptor>
tracy_force_inline void InitBase(WebGPUQueueCtx* ctx, WGPUCommandEncoder encoder, PassDescriptor& passDesc)
{
m_ctx = ctx;
m_encoder = encoder;
m_rawTicket = m_ctx->NextQueryId();
m_queryId = m_ctx->RingIndex(m_rawTicket);
m_timestampWrites.querySet = m_ctx->m_querySet;
m_timestampWrites.beginningOfPassWriteIndex = m_queryId;
m_timestampWrites.endOfPassWriteIndex = m_queryId + 1;
passDesc.timestampWrites = &m_timestampWrites;
}
public:
template<typename PassDescriptor>
tracy_force_inline WebGPUZoneScope(WebGPUQueueCtx* ctx, WGPUCommandEncoder encoder, PassDescriptor& passDesc, const SourceLocationData* srcLocation, bool active)
#ifdef TRACY_ON_DEMAND
: m_active(active && GetProfiler().IsConnected())
#else
: m_active(active)
#endif
{
if (!m_active || !ctx) return;
InitBase(ctx, encoder, passDesc);
WriteQueueItem(srcLocation, 0, 0, nullptr, 0, nullptr, 0, nullptr, 0);
}
template<typename PassDescriptor>
tracy_force_inline WebGPUZoneScope(WebGPUQueueCtx* ctx, WGPUCommandEncoder encoder, PassDescriptor& passDesc, const SourceLocationData* srcLocation, int32_t depth, bool active)
#ifdef TRACY_ON_DEMAND
: m_active(active && GetProfiler().IsConnected())
#else
: m_active(active)
#endif
{
if (!m_active || !ctx) return;
InitBase(ctx, encoder, passDesc);
WriteQueueItem(srcLocation, depth, 0, nullptr, 0, nullptr, 0, nullptr, 0);
}
template<typename PassDescriptor>
tracy_force_inline WebGPUZoneScope(WebGPUQueueCtx* ctx, uint32_t line, const char* source, size_t sourceSz, const char* function, size_t functionSz, const char* name, size_t nameSz, WGPUCommandEncoder encoder, PassDescriptor& passDesc, bool active)
#ifdef TRACY_ON_DEMAND
: m_active(active && GetProfiler().IsConnected())
#else
: m_active(active)
#endif
{
if (!m_active || !ctx) return;
InitBase(ctx, encoder, passDesc);
WriteQueueItem(nullptr, 0, line, source, sourceSz, function, functionSz, name, nameSz);
}
template<typename PassDescriptor>
tracy_force_inline WebGPUZoneScope(WebGPUQueueCtx* ctx, uint32_t line, const char* source, size_t sourceSz, const char* function, size_t functionSz, const char* name, size_t nameSz, WGPUCommandEncoder encoder, PassDescriptor& passDesc, int32_t depth, bool active)
#ifdef TRACY_ON_DEMAND
: m_active(active && GetProfiler().IsConnected())
#else
: m_active(active)
#endif
{
if (!m_active || !ctx) return;
InitBase(ctx, encoder, passDesc);
WriteQueueItem(nullptr, depth, line, source, sourceSz, function, functionSz, name, nameSz);
}
tracy_force_inline ~WebGPUZoneScope()
{
if (!m_active || !m_ctx) return;
const auto queryId = m_queryId + 1;
auto* item = Profiler::QueueSerial();
MemWrite(&item->hdr.type, QueueType::GpuZoneEndSerial);
MemWrite(&item->gpuZoneEnd.cpuTime, Profiler::GetTime());
MemWrite(&item->gpuZoneEnd.thread, GetThreadHandle());
MemWrite(&item->gpuZoneEnd.queryId, static_cast<uint16_t>(queryId));
MemWrite(&item->gpuZoneEnd.context, m_ctx->GetId());
Profiler::QueueSerialFinish();
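// This zone took the first slot of a new 32-query block, so the previous block
// has been fully issued and can be resolved. When m_queryId == 0 the unsigned
// subtraction wraps around; RingIndex() maps it back to the last block of the
// ring, which works because the ring capacity is a power of two.
if (m_queryId % 32 == 0)
ResolveQueryBatch(m_queryId-32);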
}
};
static inline void DestroyWebGPUContext(WebGPUQueueCtx* ctx)
{
if (!ctx) return;
ctx->~WebGPUQueueCtx();
tracy_free(ctx);
}
static inline WebGPUQueueCtx* CreateWebGPUContext(WGPUInstance instance, WGPUDevice device, WGPUQueue queue)
{
auto* ctx = static_cast<WebGPUQueueCtx*>(tracy_malloc(sizeof(WebGPUQueueCtx)));
new (ctx) WebGPUQueueCtx{ instance, device, queue };
if (ctx->GetId() == 255)
{
DestroyWebGPUContext(ctx);
return nullptr;
}
return ctx;
}
}
#undef TracyWebGPUPanic
#undef TracyWebGPULog
#undef TracyWebGPUAssert
#undef TracyWebGPUBreak
#undef TracyWebGPUDebug
#undef TRACY_WEBGPU_DEBUG_LEVEL
using TracyWebGPUCtx = tracy::WebGPUQueueCtx*;
#define TracyWebGPUSetupDeviceDescriptor(deviceDescriptor) tracy::WebGPUQueueCtx::Requirements TracyConcat(__tracy_wgpu_setup_, TracyLine); TracyConcat(__tracy_wgpu_setup_, TracyLine).ApplyToDeviceDescriptor(deviceDescriptor)
#define TracyWebGPUContext(instance, device, queue) tracy::CreateWebGPUContext(instance, device, queue);
#define TracyWebGPUDestroy(ctx) tracy::DestroyWebGPUContext(ctx);
#define TracyWebGPUContextName(ctx, name, size) if (ctx) ctx->Name(name, size);
#define TracyWebGPUUnnamedZone ___tracy_gpu_webgpu_zone
#define TracyWebGPUSrcLocSymbol TracyConcat(__tracy_webgpu_source_location,TracyLine)
#define TracyWebGPUSrcLocObject(name, color) static constexpr tracy::SourceLocationData TracyWebGPUSrcLocSymbol { name, TracyFunction, TracyFile, (uint32_t)TracyLine, color };
#if defined TRACY_HAS_CALLSTACK && defined TRACY_CALLSTACK
# define TracyWebGPUZone(ctx, encoder, passDesc, name) TracyWebGPUNamedZoneS(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, TRACY_CALLSTACK, true)
# define TracyWebGPUZoneC(ctx, encoder, passDesc, name, color) TracyWebGPUNamedZoneCS(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, color, TRACY_CALLSTACK, true)
# define TracyWebGPUNamedZone(ctx, varname, encoder, passDesc, name, active) TracyWebGPUSrcLocObject(name, 0); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, TRACY_CALLSTACK, active };
# define TracyWebGPUNamedZoneC(ctx, varname, encoder, passDesc, name, color, active) TracyWebGPUSrcLocObject(name, color); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, TRACY_CALLSTACK, active };
# define TracyWebGPUZoneTransient(ctx, varname, encoder, passDesc, name, active) TracyWebGPUZoneTransientS(ctx, varname, encoder, passDesc, name, TRACY_CALLSTACK, active)
#else
# define TracyWebGPUZone(ctx, encoder, passDesc, name) TracyWebGPUNamedZone(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, true)
# define TracyWebGPUZoneC(ctx, encoder, passDesc, name, color) TracyWebGPUNamedZoneC(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, color, true)
# define TracyWebGPUNamedZone(ctx, varname, encoder, passDesc, name, active) TracyWebGPUSrcLocObject(name, 0); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, active };
# define TracyWebGPUNamedZoneC(ctx, varname, encoder, passDesc, name, color, active) TracyWebGPUSrcLocObject(name, color); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, active };
# define TracyWebGPUZoneTransient(ctx, varname, encoder, passDesc, name, active) tracy::WebGPUZoneScope varname{ ctx, TracyLine, TracyFile, strlen(TracyFile), TracyFunction, strlen(TracyFunction), name, strlen(name), encoder, passDesc, active };
#endif
#ifdef TRACY_HAS_CALLSTACK
# define TracyWebGPUZoneS(ctx, encoder, passDesc, name, depth) TracyWebGPUNamedZoneS(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, depth, true)
# define TracyWebGPUZoneCS(ctx, encoder, passDesc, name, color, depth) TracyWebGPUNamedZoneCS(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, color, depth, true)
# define TracyWebGPUNamedZoneS(ctx, varname, encoder, passDesc, name, depth, active) TracyWebGPUSrcLocObject(name, 0); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, depth, active };
# define TracyWebGPUNamedZoneCS(ctx, varname, encoder, passDesc, name, color, depth, active) TracyWebGPUSrcLocObject(name, color); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, depth, active };
# define TracyWebGPUZoneTransientS(ctx, varname, encoder, passDesc, name, depth, active) tracy::WebGPUZoneScope varname{ ctx, TracyLine, TracyFile, strlen(TracyFile), TracyFunction, strlen(TracyFunction), name, strlen(name), encoder, passDesc, depth, active };
#else
# define TracyWebGPUZoneS(ctx, encoder, passDesc, name, depth) TracyWebGPUZone(ctx, encoder, passDesc, name)
# define TracyWebGPUZoneCS(ctx, encoder, passDesc, name, color, depth) TracyWebGPUZoneC(ctx, encoder, passDesc, name, color)
# define TracyWebGPUNamedZoneS(ctx, varname, encoder, passDesc, name, depth, active) TracyWebGPUNamedZone(ctx, varname, encoder, passDesc, name, active)
# define TracyWebGPUNamedZoneCS(ctx, varname, encoder, passDesc, name, color, depth, active) TracyWebGPUNamedZoneC(ctx, varname, encoder, passDesc, name, color, active)
# define TracyWebGPUZoneTransientS(ctx, varname, encoder, passDesc, name, depth, active) TracyWebGPUZoneTransient(ctx, varname, encoder, passDesc, name, active)
#endif
#define TracyWebGPUCollect(ctx) if (ctx) ctx->Collect();
#endif
#endif

View File

@@ -1033,14 +1033,15 @@ PYBIND11_MODULE( TracyServerBindings, m )
// --- GPU contexts ---
.def( "get_gpu_contexts", []( const Worker& w ) {
static const char* gpuTypeStr[] = {
-"Invalid", "OpenGL", "Vulkan", "OpenCL", "Direct3D12", "Direct3D11", "Metal", "Custom", "CUDA", "Rocprof" };
+"Invalid", "OpenGL", "Vulkan", "OpenCL", "Direct3D12", "Direct3D11", "Metal", "Custom", "CUDA", "Rocprof", "WebGPU" };
+static size_t numTypes = sizeof(gpuTypeStr) / sizeof(gpuTypeStr[0]);
std::vector<GpuContextSummary> result;
for( const auto* ctx : w.GetGpuData() )
{
if( !ctx ) continue;
const std::string name = ctx->name.Active() ? w.GetString( ctx->name ) : "";
const uint8_t typeIdx = (uint8_t)ctx->type;
-const char* typeStr = typeIdx < 10 ? gpuTypeStr[typeIdx] : "Unknown";
+const char* typeStr = typeIdx < numTypes ? gpuTypeStr[typeIdx] : "Unknown";
result.push_back( GpuContextSummary{
name, ctx->count, std::string( typeStr ), ctx->thread } );
}

View File

@@ -3137,6 +3137,7 @@ void Worker::DispatchFailure( const QueueItem& ev, const char*& ptr )
}
else
{
uint8_t sz8;
uint16_t sz;
switch( ev.hdr.type )
{
@@ -3144,6 +3145,7 @@ void Worker::DispatchFailure( const QueueItem& ev, const char*& ptr )
ptr += sizeof( QueueHeader );
memcpy( &sz, ptr, sizeof( sz ) );
ptr += sizeof( sz );
sz += ProtocolOffset8Bit;
AddSingleStringFailure( ptr, sz );
ptr += sz;
break;
@@ -3151,9 +3153,24 @@ void Worker::DispatchFailure( const QueueItem& ev, const char*& ptr )
ptr += sizeof( QueueHeader );
memcpy( &sz, ptr, sizeof( sz ) );
ptr += sizeof( sz );
sz += ProtocolOffset8Bit;
AddSecondString( ptr, sz );
ptr += sz;
break;
case QueueType::SingleStringData8:
ptr += sizeof( QueueHeader );
memcpy( &sz8, ptr, sizeof( sz8 ) );
ptr += sizeof( sz8 );
AddSingleStringFailure( ptr, sz8 );
ptr += sz8;
break;
case QueueType::SecondStringData8:
ptr += sizeof( QueueHeader );
memcpy( &sz8, ptr, sizeof( sz8 ) );
ptr += sizeof( sz8 );
AddSecondString( ptr, sz8 );
ptr += sz8;
break;
default:
ptr += QueueDataSize[ev.hdr.idx];
switch( ev.hdr.type )
@@ -3337,6 +3354,7 @@ bool Worker::DispatchProcess( const QueueItem& ev, const char*& ptr )
}
else
{
uint8_t sz8;
uint16_t sz;
switch( ev.hdr.type )
{
@@ -3344,6 +3362,7 @@ bool Worker::DispatchProcess( const QueueItem& ev, const char*& ptr )
ptr += sizeof( QueueHeader );
memcpy( &sz, ptr, sizeof( sz ) );
ptr += sizeof( sz );
sz += ProtocolOffset8Bit;
AddSingleString( ptr, sz );
ptr += sz;
return true;
@@ -3351,9 +3370,24 @@ bool Worker::DispatchProcess( const QueueItem& ev, const char*& ptr )
ptr += sizeof( QueueHeader );
memcpy( &sz, ptr, sizeof( sz ) );
ptr += sizeof( sz );
sz += ProtocolOffset8Bit;
AddSecondString( ptr, sz );
ptr += sz;
return true;
case QueueType::SingleStringData8:
ptr += sizeof( QueueHeader );
memcpy( &sz8, ptr, sizeof( sz8 ) );
ptr += sizeof( sz8 );
AddSingleString( ptr, sz8 );
ptr += sz8;
return true;
case QueueType::SecondStringData8:
ptr += sizeof( QueueHeader );
memcpy( &sz8, ptr, sizeof( sz8 ) );
ptr += sizeof( sz8 );
AddSecondString( ptr, sz8 );
ptr += sz8;
return true;
default:
ptr += QueueDataSize[ev.hdr.idx];
return Process( ev );