add routine to check for GL features/extensions at run-time

Merge pull request #1402 from wolfpld/slomp/webgpu-example-platform
Switch webgpu example to SDL3, plus patch edge-case for wgpu-native
2026-06-16 12:18:59 +00:00 · 2026-06-15 21:19:12 -07:00 · 2026-06-15 23:48:14 +02:00 · 2026-06-15 13:14:28 -07:00 · 2026-06-15 13:14:28 -07:00 · 2026-06-15 13:14:28 -07:00
35 changed files with 2546 additions and 347 deletions
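
The run-time feature check named in the commit title is not part of this excerpt. As a rough sketch of the idea only, the helper below tests whether an extension name occurs as a whole token in a space-separated extension list, which is the format of the legacy `glGetString(GL_EXTENSIONS)` string. The function name `hasExtensionToken` is hypothetical and is not taken from Tracy. In a core profile context the names would instead be enumerated one at a time with `glGetStringi(GL_EXTENSIONS, i)`, but the exact-match comparison is the same.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not Tracy's API): returns true if `name` appears as a
// complete, space-delimited token in `list`. A plain substring search would
// wrongly match "GL_ARB_timer_query" inside "GL_ARB_timer_query_extra".
bool hasExtensionToken(std::string_view list, std::string_view name) {
    if (name.empty() || name.find(' ') != std::string_view::npos) return false;
    std::size_t pos = 0;
    while (pos < list.size()) {
        std::size_t end = list.find(' ', pos);
        if (end == std::string_view::npos) end = list.size();
        if (list.substr(pos, end - pos) == name) return true;
        pos = end + 1;  // skip the separator
    }
    return false;
}
```

Matching whole tokens rather than substrings avoids false positives when one extension name is a prefix of another.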
--- a/.github/workflows/emscripten.yml
+++ b/.github/workflows/emscripten.yml
@@ -20,7 +20,7 @@ jobs:
    - name: Setup emscripten
      uses: emscripten-core/setup-emsdk@v16
      with:
-        version: 4.0.10
+        version: 5.0.7
    - name: Trust git repo
      run: git config --global --add safe.directory '*'
    - uses: actions/checkout@v4
--- a/.vscode/settings.json
+++ b/.vscode/settings.json
@@ -6,7 +6,7 @@
        "${workspaceFolder}/import",
        "${workspaceFolder}/merge",
        "${workspaceFolder}/update",
-        "${workspaceFolder}/test",
+        "${workspaceFolder}/tests/tracy",
        "${workspaceFolder}",
    ],
    "cmake.buildDirectory": "${sourceDirectory}/build",
--- a/NEWS
+++ b/NEWS
@@ -5,9 +5,16 @@ here.
 vx.xx.x (2026-xx-xx)
 --------------------

+- API break: removed "secure" variants of memory alloc and free macros. The
+  secure code path is now always enabled. Migrate by removing "Secure" from
+  the macros you use, e.g. TracySecureAlloc(...) -> TracyAlloc(...).
 - Added tracy-capture-daemon for automated multi-client trace capture.
 - Added tracy-merge utility for combining multiple trace files into one.
 - Added support for Windows on ARM64 with MSVC.
+- Added support for WebGPU.
+- Trace-specific settings storage has been completely overhauled. It is now
+  possible to make the settings sidecar file public, saved next to the trace
+  file.
 - External frames are now omitted in the single-line call stack list visible
  in messages list, or in memory allocation info window.
 - External frames are now hidden by default in various contexts where they
@@ -147,8 +154,13 @@ vx.xx.x (2026-xx-xx)
  - There is now chapter tree and the manual contents are displayed section
    by section.
  - Links to chapters are now properly working.
-  - The "bclogo" blocks are now correctly processed.
+  - The "bclogo" blocks are now correctly processed and displayed as proper
+    admonitions.
  - The font awesome icons now show as in the rest of the UI.
+  - Footnotes are now rendered as proper footnotes.
+  - Tables are now rendered as intended.
+  - LaTeX math is now converted to readable form.
+  - Added a button to download the full PDF manual to the user manual window.
 - Call stack window will now show the thread viewed call stack originates
  from (if possible).
 - "Visible threads" checkboxes in messages, flame graph and wait stacks
@@ -172,6 +184,21 @@ vx.xx.x (2026-xx-xx)
 - Fixed NVCC builds.
 - Fixed possible lockups in Vulkan timer calibration loop.
 - The flame graph view now supports zooming in and panning with the mouse.
+- General application crash information polish in the profiler UI.
+- The achievements system has been converted to use the markdown renderer.
+- Offline symbol resolution with the update utility now supports custom
+  addr2line-compatible tools via -a and -A command line parameters.
+  Additionally, it is now possible to reset all call stack frame symbols to
+  unresolved with the -R parameter.
+- Periodic recalibration of the clock drift in OpenGL contexts can be enabled
+  with the TRACY_OPENGL_AUTO_CALIBRATION compilation define. Note that this
+  requires a full CPU/GPU sync on each calibration event. These events will
+  not fire more often than once every second.
+- Added missing C API for shared locks.
+- Implemented semi-unique, nonsense random name generator.
+  - Can be used to set a trace description.
+  - Will be used to provide a default description for newly added annotations.
+- Polished look and feel of annotation regions on the timeline.


 v0.13.1 (2025-12-11)
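
The changelog entry for `TRACY_OPENGL_AUTO_CALIBRATION` says calibration events fire no more often than once every second. A minimal sketch of that kind of rate limit is shown below. The class name `CalibrationThrottle` and its interface are hypothetical; this is not Tracy's implementation, only an illustration of the "at most once per interval" rule.

```cpp
#include <chrono>

// Hypothetical helper (not part of Tracy): accepts an event only if at least
// `minInterval` has elapsed since the previously accepted event.
class CalibrationThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit CalibrationThrottle(Clock::duration minInterval = std::chrono::seconds(1))
        : mMinInterval(minInterval) {}

    // Returns true and records `now` when the event is allowed to fire.
    bool tryFire(Clock::time_point now) {
        if (mHasFired && now - mLast < mMinInterval) return false;
        mLast = now;
        mHasFired = true;
        return true;
    }

private:
    Clock::duration   mMinInterval;
    Clock::time_point mLast{};
    bool              mHasFired = false;
};
```

Passing the current time in as a parameter, rather than reading the clock inside the helper, keeps the logic easy to test with synthetic timestamps.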
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@

 ### A real time, nanosecond resolution, remote telemetry, hybrid frame and sampling profiler for games and other applications.

-Tracy supports profiling CPU (Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as [Rust](https://github.com/nagisa/rust_tracy_client), [Zig](https://github.com/tealsnow/zig-tracy), [C#](https://github.com/clibequilibrium/Tracy-CSharp), [OCaml](https://github.com/imandra-ai/ocaml-tracy), [Odin](https://github.com/oskarnp/odin-tracy), etc.), GPU (All major graphic APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL, CUDA.), memory allocations, locks, context switches, automatically attribute screenshots to captured frames, and much more.
+Tracy supports profiling CPU (Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as [Rust](https://github.com/nagisa/rust_tracy_client), [Zig](https://github.com/tealsnow/zig-tracy), [C#](https://github.com/clibequilibrium/Tracy-CSharp), [OCaml](https://github.com/imandra-ai/ocaml-tracy), [Odin](https://github.com/oskarnp/odin-tracy), etc.), GPU (All major graphics/compute APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL, CUDA, WebGPU.), memory allocations, locks, context switches, automatically attribute screenshots to captured frames, and much more.

 - [Documentation](https://github.com/wolfpld/tracy/releases/latest/download/tracy.pdf) for usage and build process instructions
 - [Releases](https://github.com/wolfpld/tracy/releases) containing the documentation (`tracy.pdf`) and compiled Windows x64 binaries (`Tracy-<version>.7z`) as assets
--- a/cmake/vendor.cmake
+++ b/cmake/vendor.cmake
@@ -218,6 +218,8 @@ CPMAddPackage(
    NAME md4c
    GITHUB_REPOSITORY mity/md4c
    GIT_TAG 755ce49acdc7cd682d4502b4796db5ed6a1230fb
+    OPTIONS
+        "BUILD_SHARED_LIBS OFF"
    EXCLUDE_FROM_ALL TRUE
 )

--- a/examples/opengl/triangle/CMakeLists.txt
+++ b/examples/opengl/triangle/CMakeLists.txt
@@ -0,0 +1,83 @@
+# CMakeLists.txt — OpenGL spinning triangle demo
+#
+#   macOS:
+#     cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -B build/ninja .
+#     cmake --build build/ninja
+#
+#   Linux (requires libsdl3-dev libgl1-mesa-dev):
+#     cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -B build/ninja .
+#     cmake --build build/ninja
+#
+#   Windows:
+#     cmake -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -B build/ninja .
+#     cmake --build build/ninja
+
+cmake_minimum_required(VERSION 3.16)
+project(gl_spinning_triangle LANGUAGES C CXX)
+
+# ---------------------------------------------------------------------------
+# Tracy root — defaults to three directories above this CMakeLists.txt.
+# ---------------------------------------------------------------------------
+set(TRACY_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../../..")
+option(TRACY_ENABLE                  "Enable Tracy profiling"                    ON)
+
+# ---------------------------------------------------------------------------
+# Platform — SDL3 (cross-platform windowing, must be installed on the system)
+# ---------------------------------------------------------------------------
+find_package(SDL3 REQUIRED)
+
+# ---------------------------------------------------------------------------
+# GL extension loader — GLEW (Windows + Linux, fetched automatically)
+# ---------------------------------------------------------------------------
+if(NOT APPLE)
+    include(FetchContent)
+    set(glew-cmake_BUILD_SHARED OFF CACHE BOOL "" FORCE)
+    set(ONLY_LIBS               ON  CACHE BOOL "" FORCE)
+    FetchContent_Declare(glew
+        GIT_REPOSITORY https://github.com/Perlmint/glew-cmake.git
+        GIT_TAG        master   # pin to a specific commit for reproducible builds
+        GIT_SHALLOW    TRUE
+    )
+    FetchContent_MakeAvailable(glew)
+endif()
+
+set(PLATFORM_SOURCES  platform/platform_sdl3.cpp)
+
+if(APPLE)
+    set(PLATFORM_LIBS SDL3::SDL3 "-framework OpenGL")
+elseif(WIN32)
+    set(PLATFORM_LIBS SDL3::SDL3 opengl32 libglew_static)
+else()
+    set(PLATFORM_LIBS SDL3::SDL3 GL libglew_static)
+endif()
+
+# ---------------------------------------------------------------------------
+# Target
+# ---------------------------------------------------------------------------
+add_executable(gl_spinning_triangle
+    spinning_triangle.cpp
+    "${TRACY_DIR}/public/TracyClient.cpp"
+    ${PLATFORM_SOURCES}
+)
+
+# Suppress upstream warnings from TracyClient.cpp
+if(MSVC)
+    set_source_files_properties("${TRACY_DIR}/public/TracyClient.cpp"
+        PROPERTIES COMPILE_FLAGS "/w"
+    )
+else()
+    set_source_files_properties("${TRACY_DIR}/public/TracyClient.cpp"
+        PROPERTIES COMPILE_FLAGS "-w"
+    )
+endif()
+
+target_compile_features(gl_spinning_triangle PRIVATE cxx_std_17)
+
+if(TRACY_ENABLE)
+    target_compile_definitions(gl_spinning_triangle PRIVATE TRACY_ENABLE)
+endif()
+
+target_include_directories(gl_spinning_triangle PRIVATE
+    "${TRACY_DIR}/public"
+)
+target_link_libraries(gl_spinning_triangle PRIVATE ${PLATFORM_LIBS})
--- a/examples/opengl/triangle/platform/platform.h
+++ b/examples/opengl/triangle/platform/platform.h
@@ -0,0 +1,37 @@
+// platform.h — interface between platform-agnostic code and platform backends
+//
+// Each platform_*.mm / platform_*.cpp file implements these six functions.
+// Exactly one backend must be linked into the final binary.
+
+#pragma once
+
+#ifdef __APPLE__
+#  include <OpenGL/gl3.h>
+#else
+#  include <GL/glew.h>
+#endif
+
+// Initialize the windowing system, create a window, and make an OpenGL 3.3
+// Core Profile context current on the calling thread.
+// Returns true on success.
+bool platformInit(int width, int height, const char* title);
+
+// Load OpenGL function pointers (no-op on macOS where the framework exports them directly).
+// Must be called after platformInit() while the GL context is current.
+// Returns true on success.
+bool platformInitGL();
+
+// Elapsed wall-clock time in seconds since platformInit().
+double platformGetTime();
+
+// Swap front and back buffers (present the rendered frame).
+void platformSwapBuffers();
+
+// Pixel scaling factor relative to the logical window size (1.0 on non-HiDPI displays).
+// Must be called after platformInit().
+void platformGetPixelDensityScale(float* x, float* y);
+
+// Enter the platform event/render loop.
+// Calls render() once per loop iteration; frame pacing is left to the backend (vsync in the SDL3 backend).
+// Calls shutdown() exactly once before returning.
+void platformRunLoop(void (*render)(), void (*shutdown)());
--- a/examples/opengl/triangle/platform/platform_sdl3.cpp
+++ b/examples/opengl/triangle/platform/platform_sdl3.cpp
@@ -0,0 +1,85 @@
+// platform_sdl3.cpp — SDL3 windowing backend (cross-platform)
+#include "platform.h"   // GL headers first (gl3.h / glew.h) so SDL sees guards set
+
+#define SDL_MAIN_HANDLED    // we don't want SDL_main
+#include <SDL3/SDL.h>
+
+#include <chrono>
+#include <cstdio>
+
+static SDL_Window*   sWin = nullptr;
+static SDL_GLContext sCtx = nullptr;
+static std::chrono::steady_clock::time_point sStartTime;
+
+bool platformInit(int width, int height, const char* title) {
+    if (!SDL_Init(SDL_INIT_VIDEO)) {
+        fprintf(stderr, "ERROR: SDL_Init failed: %s\n", SDL_GetError());
+        return false;
+    }
+
+    SDL_GL_SetAttribute(SDL_GL_CONTEXT_MAJOR_VERSION, 3);
+    SDL_GL_SetAttribute(SDL_GL_CONTEXT_MINOR_VERSION, 3);
+    SDL_GL_SetAttribute(SDL_GL_CONTEXT_PROFILE_MASK, SDL_GL_CONTEXT_PROFILE_CORE);
+
+    sWin = SDL_CreateWindow(title, width, height, SDL_WINDOW_OPENGL);
+    if (!sWin) {
+        fprintf(stderr, "ERROR: SDL_CreateWindow failed: %s\n", SDL_GetError());
+        SDL_Quit();
+        return false;
+    }
+    SDL_SetWindowPosition(sWin, SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED);
+
+    sCtx = SDL_GL_CreateContext(sWin);
+    if (!sCtx) {
+        fprintf(stderr, "ERROR: SDL_GL_CreateContext failed: %s\n", SDL_GetError());
+        SDL_DestroyWindow(sWin);
+        SDL_Quit();
+        return false;
+    }
+
+    SDL_GL_SetSwapInterval(1);
+    sStartTime = std::chrono::steady_clock::now();
+    return true;
+}
+
+bool platformInitGL() {
+#ifndef __APPLE__
+    glewExperimental = GL_TRUE;
+    if (glewInit() != GLEW_OK) {
+        fprintf(stderr, "Failed to initialize GLEW\n");
+        return false;
+    }
+#endif
+    return true;
+}
+
+double platformGetTime() {
+    return std::chrono::duration<double>(
+        std::chrono::steady_clock::now() - sStartTime).count();
+}
+
+void platformSwapBuffers() { SDL_GL_SwapWindow(sWin); }
+
+void platformGetPixelDensityScale(float* x, float* y) {
+    int pw, ph, ww, wh;
+    SDL_GetWindowSizeInPixels(sWin, &pw, &ph);
+    SDL_GetWindowSize(sWin, &ww, &wh);
+    *x = (ww > 0) ? (float)pw / (float)ww : 1.0f;
+    *y = (wh > 0) ? (float)ph / (float)wh : 1.0f;
+}
+
+void platformRunLoop(void (*render)(), void (*shutdown)()) {
+    bool running = true;
+    while (running) {
+        SDL_Event e;
+        while (SDL_PollEvent(&e)) {
+            if (e.type == SDL_EVENT_QUIT) running = false;
+            if (e.type == SDL_EVENT_KEY_DOWN && e.key.key == SDLK_ESCAPE) running = false;
+        }
+        if (running) render();
+    }
+    shutdown();
+    SDL_GL_DestroyContext(sCtx);
+    SDL_DestroyWindow(sWin);
+    SDL_Quit();
+}
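
For reference, the pixel density calculation in `platformGetPixelDensityScale()` can be isolated from SDL as a pure function. The sketch below uses hypothetical names (`Scale2`, `pixelDensityScale`) and takes the sizes as plain integers instead of querying a window, so it only illustrates the arithmetic and the zero-size fallback.

```cpp
// Hypothetical stand-alone version of the scale computation (no SDL needed).
struct Scale2 {
    float x;
    float y;
};

// Ratio of drawable size in pixels to logical window size.
// Falls back to 1.0 on an axis whose logical size is not positive.
Scale2 pixelDensityScale(int pixelW, int pixelH, int windowW, int windowH) {
    Scale2 s;
    s.x = (windowW > 0) ? static_cast<float>(pixelW) / static_cast<float>(windowW) : 1.0f;
    s.y = (windowH > 0) ? static_cast<float>(pixelH) / static_cast<float>(windowH) : 1.0f;
    return s;
}
```

With an 800x600 logical window backed by a 1600x1200 drawable, both factors are 2.0, so a viewport derived from the logical size would be 1600x1200 pixels.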
--- a/examples/opengl/triangle/spinning_triangle.cpp
+++ b/examples/opengl/triangle/spinning_triangle.cpp
@@ -0,0 +1,145 @@
+// spinning_triangle.cpp — OpenGL spinning triangle demo with Tracy GPU profiling.
+
+#ifdef __APPLE__
+// NOTE: OpenGL is only available on macOS (no iOS support)
+// Including and using anything related to OpenGL on Apple (like <OpenGL/gl3.h>)
+// will emit deprecation warnings, unless GL_SILENCE_DEPRECATION is defined
+#define GL_SILENCE_DEPRECATION
+// NOTE: TracyOpenGL.hpp will not work as expected even on Apple devices that
+// support OpenGL, because the OpenGL drivers do not implement ARB_timer_query
+// properly (querying GL_TIMESTAMP always resolves to 0). TracyOpenGL.hpp will
+// emit a compiler warning, and a Tracy message to the trace/profiler, but the
+// program will still run.
+#endif
+
+#include "platform/platform.h"  // also includes OpenGL headers
+
+#include <tracy/Tracy.hpp>
+
+// NOTE: opt-in toggle for periodic recalibrations during Collect()
+#define TRACY_OPENGL_AUTO_CALIBRATION
+#include <tracy/TracyOpenGL.hpp>
+
+static const int kWidth  = 800;
+static const int kHeight = 600;
+
+static GLuint gProgram  = 0;
+static GLuint gVao      = 0;
+static GLint  gAngleLoc = -1;
+
+// Vertex colors and positions are baked in; rotation is driven by a uniform.
+static const char* kVertSrc = R"(
+#version 150 core
+uniform float uAngle;
+const vec2 kPos[3] = vec2[3](
+    vec2( 0.0,    0.5  ),
+    vec2(-0.433, -0.25 ),
+    vec2( 0.433, -0.25 )
+);
+const vec3 kCol[3] = vec3[3](
+    vec3(1.0, 0.0, 0.0),
+    vec3(0.0, 1.0, 0.0),
+    vec3(0.0, 0.0, 1.0)
+);
+out vec3 vColor;
+void main() {
+    float c = cos(uAngle);
+    float s = sin(uAngle);
+    vec2  p = kPos[gl_VertexID];
+    gl_Position = vec4(p.x*c - p.y*s, p.x*s + p.y*c, 0.0, 1.0);
+    vColor = kCol[gl_VertexID];
+}
+)";
+
+static const char* kFragSrc = R"(
+#version 150 core
+in  vec3 vColor;
+out vec4 fragColor;
+void main() { fragColor = vec4(vColor, 1.0); }
+)";
+
+static GLuint compileShader(GLenum type, const char* src) {
+    GLuint s = glCreateShader(type);
+    glShaderSource(s, 1, &src, nullptr);
+    glCompileShader(s);
+    GLint ok = 0;
+    glGetShaderiv(s, GL_COMPILE_STATUS, &ok);
+    if (!ok) {
+        char log[512];
+        glGetShaderInfoLog(s, sizeof(log), nullptr, log);
+        fprintf(stderr, "Shader compile error: %s\n", log);
+        glDeleteShader(s);
+        return 0;
+    }
+    return s;
+}
+
+static int initGL() {
+    if (!platformInitGL()) return 1;
+
+    TracyGpuContext;
+    TracyGpuContextName("OpenGL", 6);
+
+    GLuint vert = compileShader(GL_VERTEX_SHADER,   kVertSrc);
+    GLuint frag = compileShader(GL_FRAGMENT_SHADER, kFragSrc);
+    if (!vert || !frag) return 1;
+
+    gProgram = glCreateProgram();
+    glAttachShader(gProgram, vert);
+    glAttachShader(gProgram, frag);
+    glLinkProgram(gProgram);
+    glDeleteShader(vert);
+    glDeleteShader(frag);
+
+    GLint ok = 0;
+    glGetProgramiv(gProgram, GL_LINK_STATUS, &ok);
+    if (!ok) {
+        char log[512];
+        glGetProgramInfoLog(gProgram, sizeof(log), nullptr, log);
+        fprintf(stderr, "Program link error: %s\n", log);
+        return 1;
+    }
+
+    gAngleLoc = glGetUniformLocation(gProgram, "uAngle");
+
+    // Core profile requires a bound VAO even with no vertex attributes.
+    glGenVertexArrays(1, &gVao);
+    glBindVertexArray(gVao);
+
+    glClearColor(0.05f, 0.05f, 0.08f, 1.0f);
+    float scaleX, scaleY;
+    platformGetPixelDensityScale(&scaleX, &scaleY);
+    glViewport(0, 0, (int)(kWidth * scaleX), (int)(kHeight * scaleY));
+    return 0;
+}
+
+static void renderFrame() {
+    ZoneScoped;
+
+    glClear(GL_COLOR_BUFFER_BIT);
+    glUseProgram(gProgram);
+
+    {
+        TracyGpuZone("triangle draw");
+        glUniform1f(gAngleLoc, (float)platformGetTime());
+        glDrawArrays(GL_TRIANGLES, 0, 3);
+    }
+
+    platformSwapBuffers();
+    TracyGpuCollect;
+}
+
+static void shutdown() {
+    fprintf(stderr, "application is shutting down...\n");
+    glDeleteVertexArrays(1, &gVao);
+    glDeleteProgram(gProgram);
+}
+
+int main() {
+    if (!platformInit(kWidth, kHeight, "OpenGL Spinning Triangle"))
+        return 1;
+    if (initGL() != 0)
+        return 2;
+    platformRunLoop(renderFrame, shutdown);
+    return 0;
+}
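
The vertex shader in this file rotates each baked-in position by `uAngle` using the standard 2D rotation formula. As a small CPU-side illustration of the same arithmetic, the sketch below applies it in C++. The `Vec2` type and `rotate` function are hypothetical names introduced for this example and do not exist in the demo.

```cpp
#include <cmath>

// Hypothetical CPU-side mirror of the shader math (names are illustrative).
struct Vec2 {
    float x;
    float y;
};

// Rotates `p` counter-clockwise by `angle` radians about the origin:
//   x' = x*cos(a) - y*sin(a)
//   y' = x*sin(a) + y*cos(a)
Vec2 rotate(Vec2 p, float angle) {
    const float c = std::cos(angle);
    const float s = std::sin(angle);
    return { p.x * c - p.y * s, p.x * s + p.y * c };
}

// Distance from the origin, used to show that rotation preserves length.
float length(Vec2 p) {
    return std::sqrt(p.x * p.x + p.y * p.y);
}
```

All three triangle vertices lie roughly 0.5 units from the origin, so the rotation moves them along a circle of that radius and the triangle keeps its shape.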
--- a/examples/webgpu/triangle/CMakeLists.txt
+++ b/examples/webgpu/triangle/CMakeLists.txt
@@ -0,0 +1,157 @@
+# CMakeLists.txt — WebGPU spinning triangle demo
+#
+#   macOS:
+#     clang++ -std=c++17 -ObjC++ spinning_triangle.cpp platform/platform_macos.mm \
+#         -I/path/to/wgpu/include -L/path/to/wgpu/lib -lwgpu_native \
+#         -Wl,-rpath,@executable_path \
+#         -framework Cocoa -framework Metal -framework QuartzCore \
+#         -framework Foundation -framework IOKit -framework IOSurface \
+#         -o spinning_triangle
+#
+#   Windows (MSVC):
+#     cl /std:c++17 spinning_triangle.cpp platform/platform_windows.cpp \
+#         /I\path\to\wgpu\include \path\to\wgpu\lib\wgpu_native.lib \
+#         user32.lib gdi32.lib /Fe:spinning_triangle.exe
+#
+#   Linux (requires libsdl3-dev):
+#     g++ -std=c++17 spinning_triangle.cpp platform/platform_wayland.cpp \
+#         xdg-shell-protocol.c \
+#         -I/path/to/wgpu/include -L/path/to/wgpu/lib -lwgpu_native \
+#         -lwayland-client -o spinning_triangle
+
+cmake_minimum_required(VERSION 3.16)
+project(spinning_triangle LANGUAGES C CXX)
+
+# ---------------------------------------------------------------------------
+# WebGPU backend — set WGPU_PATH to your wgpu-native or Dawn installation.
+# The library name differs between backends:
+#   wgpu-native  →  wgpu_native
+#   Dawn         →  webgpu_dawn
+# ---------------------------------------------------------------------------
+set(WGPU_PATH "" CACHE PATH "Root of the WebGPU native installation (contains include/ and lib/)")
+set(WGPU_LIB  "" CACHE STRING "WebGPU library name (wgpu_native or webgpu_dawn); auto-detected if empty")
+
+if(NOT WGPU_PATH)
+    message(FATAL_ERROR "Set WGPU_PATH to the root of your WebGPU native installation.")
+endif()
+
+# When WGPU_PATH changes, discard any previously auto-detected WGPU_LIB so
+# detection re-runs against the new path.
+if(NOT "${WGPU_PATH}" STREQUAL "${_WGPU_PATH_LAST}")
+    unset(WGPU_LIB CACHE)
+    set(WGPU_LIB "" CACHE STRING "WebGPU library name (wgpu_native or webgpu_dawn); auto-detected if empty")
+endif()
+set(_WGPU_PATH_LAST "${WGPU_PATH}" CACHE INTERNAL "")
+
+if(NOT WGPU_LIB)
+    unset(_WGPU_NATIVE_LIB CACHE)
+    unset(_WEBGPU_DAWN_LIB CACHE)
+    find_library(_WGPU_NATIVE_LIB NAMES wgpu_native wgpu_native.dll PATHS "${WGPU_PATH}/lib" NO_DEFAULT_PATH)
+    find_library(_WEBGPU_DAWN_LIB NAMES webgpu_dawn                 PATHS "${WGPU_PATH}/lib" NO_DEFAULT_PATH)
+    if(_WGPU_NATIVE_LIB)
+        set(WGPU_LIB "wgpu_native" CACHE STRING "WebGPU library name (wgpu_native or webgpu_dawn); auto-detected if empty" FORCE)
+    elseif(_WEBGPU_DAWN_LIB)
+        set(WGPU_LIB "webgpu_dawn" CACHE STRING "WebGPU library name (wgpu_native or webgpu_dawn); auto-detected if empty" FORCE)
+    else()
+        message(FATAL_ERROR "Could not detect a WebGPU library in ${WGPU_PATH}/lib. Set WGPU_LIB explicitly (wgpu_native or webgpu_dawn).")
+    endif()
+    message(STATUS "WebGPU library auto-detected: ${WGPU_LIB}")
+endif()
+
+# ---------------------------------------------------------------------------
+# Tracy root — defaults to three directories above this CMakeLists.txt.
+# ---------------------------------------------------------------------------
+set(TRACY_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../../..")
+option(TRACY_ENABLE "Enable Tracy profiling" ON)
+
+# ---------------------------------------------------------------------------
+# macOS quarantine — pre-built WebGPU binaries downloaded from the internet
+# carry a com.apple.quarantine extended attribute that prevents dyld from
+# loading them ("damaged or incomplete" / Gatekeeper block).  Strip it once
+# at configure time so the linker and the runtime loader can both access the
+# library directory without further user intervention.
+# ---------------------------------------------------------------------------
+if(APPLE)
+    execute_process(
+        COMMAND xattr -dr com.apple.quarantine "${WGPU_PATH}/lib"
+    )
+endif()
+
+# ---------------------------------------------------------------------------
+# Platform — SDL3 (cross-platform windowing, must be installed on the system)
+# ---------------------------------------------------------------------------
+find_package(SDL3 REQUIRED)
+
+set(PLATFORM_SOURCES platform/platform_sdl3.cpp)
+
+if(APPLE)
+    set(PLATFORM_LIBS
+        SDL3::SDL3
+        "-framework Cocoa"
+        "-framework Metal"
+        "-framework QuartzCore"
+        "-framework Foundation"
+        "-framework IOKit"
+        "-framework IOSurface"
+    )
+elseif(WIN32)
+    # wgpu-native (Rust stdlib) pull-ins: NtReadFile, GetUserProfileDirectoryW, ...
+    set(WGPU_NATIVE_WIN32_LIBS ntdll userenv)
+    # Dawn pull-ins: WKPDID_D3DDebugObjectName GUID, CompareObjectHandles, ...
+    set(WEBGPU_DAWN_WIN32_LIBS dxguid onecore)
+    set(PLATFORM_LIBS SDL3::SDL3 ${WGPU_NATIVE_WIN32_LIBS} ${WEBGPU_DAWN_WIN32_LIBS})
+else()
+    set(PLATFORM_LIBS SDL3::SDL3)
+endif()
+
+# ---------------------------------------------------------------------------
+# Target
+# ---------------------------------------------------------------------------
+add_executable(spinning_triangle
+    spinning_triangle.cpp
+    "${TRACY_DIR}/public/TracyClient.cpp"
+    ${PLATFORM_SOURCES}
+)
+
+# Treat TracyClient.cpp as third-party code — suppress all warnings so that
+# upstream changes don't pollute our build output.
+if(MSVC)
+    set_source_files_properties("${TRACY_DIR}/public/TracyClient.cpp"
+        PROPERTIES COMPILE_FLAGS "/w"
+    )
+else()
+    set_source_files_properties("${TRACY_DIR}/public/TracyClient.cpp"
+        PROPERTIES COMPILE_FLAGS "-w"
+    )
+endif()
+
+target_compile_features(spinning_triangle PRIVATE cxx_std_17)
+
+if(TRACY_ENABLE)
+    target_compile_definitions(spinning_triangle PRIVATE TRACY_ENABLE)
+endif()
+
+target_include_directories(spinning_triangle PRIVATE
+    "${WGPU_PATH}/include"
+    "${TRACY_DIR}/public"
+)
+
+target_link_directories(spinning_triangle PRIVATE "${WGPU_PATH}/lib")
+
+target_link_libraries(spinning_triangle PRIVATE
+    ${WGPU_LIB}
+    ${PLATFORM_LIBS}
+)
+
+# Embed the rpath so the binary finds the WebGPU dylib/so next to itself.
+if(APPLE)
+    set_target_properties(spinning_triangle PROPERTIES
+        BUILD_RPATH "${WGPU_PATH}/lib"
+        INSTALL_RPATH "@executable_path"
+    )
+elseif(UNIX)
+    set_target_properties(spinning_triangle PROPERTIES
+        BUILD_RPATH "${WGPU_PATH}/lib"
+        INSTALL_RPATH "$ORIGIN"
+    )
+endif()
--- a/examples/webgpu/triangle/platform/platform.h
+++ b/examples/webgpu/triangle/platform/platform.h
@@ -0,0 +1,23 @@
+// platform.h — interface between platform-agnostic code and platform backends
+//
+// Each platform_*.mm / platform_*.cpp file implements these four functions.
+// Exactly one backend must be linked into the final binary.
+
+#pragma once
+#include <webgpu/webgpu.h>
+
+// Initialize the windowing system and create a window of the given dimensions.
+// Returns true on success.
+bool platformInit(int width, int height, const char* title);
+
+// Create a WebGPU surface backed by the platform window.
+// Must be called after wgpuCreateInstance() and platformInit().
+WGPUSurface platformCreateSurface(WGPUInstance instance);
+
+// Elapsed wall-clock time in seconds since platformInit().
+double platformGetTime();
+
+// Enter the platform event/render loop.
+// Calls render() once per loop iteration (the loop itself does no frame pacing).
+// Calls shutdown() exactly once before returning.
+void platformRunLoop(void (*render)(), void (*shutdown)());
--- a/examples/webgpu/triangle/platform/platform_sdl3.cpp
+++ b/examples/webgpu/triangle/platform/platform_sdl3.cpp
@@ -0,0 +1,95 @@
+// platform_sdl3.cpp — SDL3 windowing backend for the WebGPU example
+#include "platform.h"   // webgpu/webgpu.h first
+
+#define SDL_MAIN_HANDLED    // we don't want SDL_main
+#include <SDL3/SDL.h>
+
+#ifdef __APPLE__
+#  include <SDL3/SDL_metal.h>
+#endif
+
+#include <chrono>
+#include <cstdio>
+
+static SDL_Window* sWin = nullptr;
+static std::chrono::steady_clock::time_point sStartTime;
+#ifdef __APPLE__
+static SDL_MetalView sMetalView = nullptr;
+#endif
+
+bool platformInit(int width, int height, const char* title) {
+    if (!SDL_Init(SDL_INIT_VIDEO)) {
+        fprintf(stderr, "ERROR: SDL_Init failed: %s\n", SDL_GetError());
+        return false;
+    }
+
+    SDL_WindowFlags flags = 0;
+#ifdef __APPLE__
+    flags |= SDL_WINDOW_METAL;
+#endif
+
+    sWin = SDL_CreateWindow(title, width, height, flags);
+    if (!sWin) {
+        fprintf(stderr, "ERROR: SDL_CreateWindow failed: %s\n", SDL_GetError());
+        SDL_Quit();
+        return false;
+    }
+    SDL_SetWindowPosition(sWin, SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED);
+
+    sStartTime = std::chrono::steady_clock::now();
+    return true;
+}
+
+WGPUSurface platformCreateSurface(WGPUInstance instance) {
+    WGPUSurfaceDescriptor desc = {};
+    SDL_PropertiesID props = SDL_GetWindowProperties(sWin);
+
+#if defined(__APPLE__)
+    sMetalView = SDL_Metal_CreateView(sWin);
+    if (!sMetalView) {
+        fprintf(stderr, "ERROR: SDL_Metal_CreateView failed\n");
+        return nullptr;
+    }
+    WGPUSurfaceSourceMetalLayer metalDesc = {};
+    metalDesc.chain.sType = WGPUSType_SurfaceSourceMetalLayer;
+    metalDesc.layer       = SDL_Metal_GetLayer(sMetalView);
+    desc.nextInChain      = &metalDesc.chain;
+#elif defined(_WIN32)
+    WGPUSurfaceSourceWindowsHWND hwndDesc = {};
+    hwndDesc.chain.sType = WGPUSType_SurfaceSourceWindowsHWND;
+    hwndDesc.hinstance   = SDL_GetPointerProperty(props, SDL_PROP_WINDOW_WIN32_INSTANCE_POINTER, nullptr);
+    hwndDesc.hwnd        = SDL_GetPointerProperty(props, SDL_PROP_WINDOW_WIN32_HWND_POINTER, nullptr);
+    desc.nextInChain     = &hwndDesc.chain;
+#else   // Linux / X11
+    WGPUSurfaceSourceXlibWindow x11Desc = {};
+    x11Desc.chain.sType = WGPUSType_SurfaceSourceXlibWindow;
+    x11Desc.display     = SDL_GetPointerProperty(props, SDL_PROP_WINDOW_X11_DISPLAY_POINTER, nullptr);
+    x11Desc.window      = (uint32_t)SDL_GetNumberProperty(props, SDL_PROP_WINDOW_X11_WINDOW_NUMBER, 0);
+    desc.nextInChain    = &x11Desc.chain;
+#endif
+
+    return wgpuInstanceCreateSurface(instance, &desc);
+}
+
+double platformGetTime() {
+    return std::chrono::duration<double>(
+        std::chrono::steady_clock::now() - sStartTime).count();
+}
+
+void platformRunLoop(void (*render)(), void (*shutdown)()) {
+    bool running = true;
+    while (running) {
+        SDL_Event e;
+        while (SDL_PollEvent(&e)) {
+            if (e.type == SDL_EVENT_QUIT) running = false;
+            if (e.type == SDL_EVENT_KEY_DOWN && e.key.key == SDLK_ESCAPE) running = false;
+        }
+        if (running) render();
+    }
+    shutdown();
+#ifdef __APPLE__
+    SDL_Metal_DestroyView(sMetalView);
+#endif
+    SDL_DestroyWindow(sWin);
+    SDL_Quit();
+}
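
`platformCreateSurface()` passes the platform-specific window handle by linking an extension struct into the descriptor through `nextInChain`, with `sType` identifying which struct it is. The sketch below imitates that chaining pattern with stand-in types so that it compiles without `webgpu.h`. Every type and function name here (`SType`, `ChainedStruct`, `SurfaceDescriptor`, `SourceWindowHandle`, `findWindowSource`) is hypothetical; the real WebGPU structs carry more fields, and this is not how any particular implementation walks the chain.

```cpp
#include <cstdint>

// Stand-in types that imitate the chained-struct convention (all hypothetical).
enum class SType : std::uint32_t { Invalid = 0, SourceWindowHandle = 1 };

struct ChainedStruct {
    const ChainedStruct* next;   // next extension struct, or nullptr
    SType                sType;  // tag identifying the enclosing struct
};

struct SurfaceDescriptor {
    const ChainedStruct* nextInChain;
};

struct SourceWindowHandle {
    ChainedStruct chain;   // must be the first member
    void*         handle;  // opaque native window handle
};

// Walks the chain and returns the first SourceWindowHandle, if any.
// The cast is valid because `chain` is the first member of a
// standard-layout struct.
const SourceWindowHandle* findWindowSource(const SurfaceDescriptor& desc) {
    for (const ChainedStruct* c = desc.nextInChain; c != nullptr; c = c->next) {
        if (c->sType == SType::SourceWindowHandle)
            return reinterpret_cast<const SourceWindowHandle*>(c);
    }
    return nullptr;
}
```

The tag is what lets the consumer of the descriptor recover the concrete struct type, which is why the backend sets `chain.sType` before assigning `nextInChain`.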
--- a/examples/webgpu/triangle/spinning_triangle.cpp
+++ b/examples/webgpu/triangle/spinning_triangle.cpp
@@ -0,0 +1,352 @@
+// spinning_triangle.cpp — platform-agnostic WebGPU spinning triangle demo.
+
+#include "platform/platform.h"
+#include <cmath>
+#include <cstdio>
+#include <cstdlib>
+#include <webgpu/webgpu.h>
+
+#include <tracy/Tracy.hpp>
+#include <tracy/TracyWebGPU.hpp>
+
+// ---------------------------------------------------------------------------
+// Globals
+// ---------------------------------------------------------------------------
+
+static const int kWidth  = 800;
+static const int kHeight = 600;
+
+static WGPUInstance       gInstance   = nullptr;
+static WGPUSurface        gSurface    = nullptr;
+static WGPUAdapter        gAdapter    = nullptr;
+static WGPUDevice         gDevice     = nullptr;
+static WGPUQueue          gQueue      = nullptr;
+static WGPURenderPipeline gPipeline   = nullptr;
+static WGPUBuffer         gUniformBuf = nullptr;
+static WGPUBindGroup      gBindGroup  = nullptr;
+
+static TracyWebGPUCtx     gTracyCtx   = nullptr;
+
+static WGPUTextureFormat gSurfaceFormat = WGPUTextureFormat_BGRA8Unorm;
+
+// TODO: this can become platformError() instead
+int error(int code, const char* message) {
+    fprintf(stderr, "ERROR: %s (code: %d)\n", message, code);
+    return code;
+}
+
+// ---------------------------------------------------------------------------
+// WGSL shader — vertex colours baked in, rotation via a uniform float.
+// ---------------------------------------------------------------------------
+
+static const char* kShaderSource = R"(
+struct Uniforms {
+    angle: f32,
+};
+@group(0) @binding(0) var<uniform> u: Uniforms;
+
+struct VSOut {
+    @builtin(position) pos: vec4f,
+    @location(0) color: vec3f,
+};
+
+@vertex
+fn vs_main(@builtin(vertex_index) vi: u32) -> VSOut {
+    var positions = array<vec2f, 3>(
+        vec2f( 0.0,  0.5),
+        vec2f(-0.433, -0.25),
+        vec2f( 0.433, -0.25),
+    );
+    var colors = array<vec3f, 3>(
+        vec3f(1.0, 0.0, 0.0),
+        vec3f(0.0, 1.0, 0.0),
+        vec3f(0.0, 0.0, 1.0),
+    );
+
+    let c = cos(u.angle);
+    let s = sin(u.angle);
+    let p = positions[vi];
+    let rotated = vec2f(p.x * c - p.y * s, p.x * s + p.y * c);
+
+    var out: VSOut;
+    out.pos   = vec4f(rotated, 0.0, 1.0);
+    out.color = colors[vi];
+    return out;
+}
+
+@fragment
+fn fs_main(@location(0) color: vec3f) -> @location(0) vec4f {
+    return vec4f(color, 1.0);
+}
+)";
+
+// ---------------------------------------------------------------------------
+// Adapter / Device request callbacks  (current wgpu-native API)
+// ---------------------------------------------------------------------------
+
+static void onAdapterReady(WGPURequestAdapterStatus status,
+                           WGPUAdapter adapter,
+                           WGPUStringView message,
+                           void* userdata1, void* /*userdata2*/) {
+    if (status == WGPURequestAdapterStatus_Success) {
+        *(WGPUAdapter*)userdata1 = adapter;
+    } else {
+        fprintf(stderr, "Adapter request failed: %.*s\n",
+                (int)message.length, message.data);
+    }
+}
+
+static void onDeviceReady(WGPURequestDeviceStatus status,
+                          WGPUDevice device,
+                          WGPUStringView message,
+                          void* userdata1, void* /*userdata2*/) {
+    if (status == WGPURequestDeviceStatus_Success) {
+        *(WGPUDevice*)userdata1 = device;
+    } else {
+        fprintf(stderr, "Device request failed: %.*s\n",
+                (int)message.length, message.data);
+    }
+}
+
+// ---------------------------------------------------------------------------
+// WebGPU init
+// ---------------------------------------------------------------------------
+
+static int initWebGPU() {
+    // Adapter
+    WGPURequestAdapterOptions adapterOpts = {};
+    adapterOpts.compatibleSurface = gSurface;
+
+    WGPURequestAdapterCallbackInfo adapterCB = {};
+    adapterCB.mode     = WGPUCallbackMode_AllowProcessEvents;
+    adapterCB.callback  = onAdapterReady;
+    adapterCB.userdata1 = &gAdapter;
+    wgpuInstanceRequestAdapter(gInstance, &adapterOpts, adapterCB);
+    while (!gAdapter) { wgpuInstanceProcessEvents(gInstance); }
+    if (!gAdapter) return error(11, "No adapter");
+
+    WGPUUncapturedErrorCallbackInfo errorCB = {};
+    errorCB.callback = [](WGPUDevice const*, WGPUErrorType type,
+                          WGPUStringView message, void*, void*) {
+        fprintf(stderr, "[WGPU ERROR] type=%d  %.*s\n",
+                (int)type, (int)message.length, message.data);
+    };
+
+    WGPUDeviceDescriptor deviceDesc = {};
+    deviceDesc.uncapturedErrorCallbackInfo = errorCB;
+
+    TracyWebGPUSetupDeviceDescriptor(deviceDesc);
+
+    WGPURequestDeviceCallbackInfo deviceCB = {};
+    deviceCB.mode      = WGPUCallbackMode_AllowProcessEvents;
+    deviceCB.callback  = onDeviceReady;
+    deviceCB.userdata1 = &gDevice;
+    wgpuAdapterRequestDevice(gAdapter, &deviceDesc, deviceCB);
+    while (!gDevice) { wgpuInstanceProcessEvents(gInstance); }
+    if (!gDevice) return error(12, "No device");
+
+    gQueue = wgpuDeviceGetQueue(gDevice);
+    gTracyCtx = TracyWebGPUContext(gInstance, gDevice, gQueue);
+    TracyWebGPUContextName(gTracyCtx, "WebGPU", 6);
+
+    // Configure surface
+    WGPUSurfaceConfiguration config = {};
+    config.device      = gDevice;
+    config.format      = gSurfaceFormat;
+    config.usage       = WGPUTextureUsage_RenderAttachment;
+    config.alphaMode   = WGPUCompositeAlphaMode_Opaque;
+    config.width       = kWidth;
+    config.height      = kHeight;
+    config.presentMode = WGPUPresentMode_Fifo;
+    wgpuSurfaceConfigure(gSurface, &config);
+
+    // Shader module
+    WGPUShaderSourceWGSL wgslSrc = {};
+    wgslSrc.chain.sType = WGPUSType_ShaderSourceWGSL;
+    wgslSrc.code = { kShaderSource, WGPU_STRLEN };
+
+    WGPUShaderModuleDescriptor smDesc = {};
+    smDesc.nextInChain = (WGPUChainedStruct*)&wgslSrc;
+    WGPUShaderModule shaderMod = wgpuDeviceCreateShaderModule(gDevice, &smDesc);
+
+    // Uniform buffer (one f32 for rotation angle)
+    WGPUBufferDescriptor bufDesc = {};
+    bufDesc.usage = WGPUBufferUsage_Uniform | WGPUBufferUsage_CopyDst;
+    bufDesc.size  = sizeof(float);
+    gUniformBuf = wgpuDeviceCreateBuffer(gDevice, &bufDesc);
+
+    // Bind group layout + bind group
+    WGPUBindGroupLayoutEntry bglEntry = {};
+    bglEntry.binding    = 0;
+    bglEntry.visibility = WGPUShaderStage_Vertex;
+    bglEntry.buffer.type            = WGPUBufferBindingType_Uniform;
+    bglEntry.buffer.minBindingSize  = sizeof(float);
+
+    WGPUBindGroupLayoutDescriptor bglDesc = {};
+    bglDesc.entryCount = 1;
+    bglDesc.entries    = &bglEntry;
+    WGPUBindGroupLayout bgl = wgpuDeviceCreateBindGroupLayout(gDevice, &bglDesc);
+
+    WGPUBindGroupEntry bgEntry = {};
+    bgEntry.binding = 0;
+    bgEntry.buffer  = gUniformBuf;
+    bgEntry.size    = sizeof(float);
+
+    WGPUBindGroupDescriptor bgDesc = {};
+    bgDesc.layout     = bgl;
+    bgDesc.entryCount = 1;
+    bgDesc.entries    = &bgEntry;
+    gBindGroup = wgpuDeviceCreateBindGroup(gDevice, &bgDesc);
+
+    // Pipeline layout
+    WGPUPipelineLayoutDescriptor plDesc = {};
+    plDesc.bindGroupLayoutCount = 1;
+    plDesc.bindGroupLayouts     = &bgl;
+    WGPUPipelineLayout pipelineLayout = wgpuDeviceCreatePipelineLayout(gDevice, &plDesc);
+
+    // Render pipeline
+    WGPUColorTargetState colorTarget = {};
+    colorTarget.format    = gSurfaceFormat;
+    colorTarget.writeMask = WGPUColorWriteMask_All;
+
+    WGPUFragmentState fragState = {};
+    fragState.module      = shaderMod;
+    fragState.entryPoint  = { "fs_main", WGPU_STRLEN };
+    fragState.targetCount = 1;
+    fragState.targets     = &colorTarget;
+
+    WGPURenderPipelineDescriptor rpDesc = {};
+    rpDesc.layout = pipelineLayout;
+    rpDesc.vertex.module     = shaderMod;
+    rpDesc.vertex.entryPoint = { "vs_main", WGPU_STRLEN };
+    rpDesc.primitive.topology = WGPUPrimitiveTopology_TriangleList;
+    rpDesc.multisample.count  = 1;
+    rpDesc.multisample.mask   = 0xFFFFFFFF;
+    rpDesc.fragment = &fragState;
+
+    gPipeline = wgpuDeviceCreateRenderPipeline(gDevice, &rpDesc);
+
+    // Cleanup intermediates
+    wgpuShaderModuleRelease(shaderMod);
+    wgpuPipelineLayoutRelease(pipelineLayout);
+    wgpuBindGroupLayoutRelease(bgl);
+    return 0;
+}
+
+// ---------------------------------------------------------------------------
+// Frame rendering
+// ---------------------------------------------------------------------------
+
+// Returns the surface texture for the current frame, or {.texture=nullptr} on
+// a skippable condition (timeout, occlusion) or an error.
+static WGPUSurfaceTexture getWindowSurface() {
+    WGPUSurfaceTexture surfTex = {};
+    wgpuSurfaceGetCurrentTexture(gSurface, &surfTex);
+    if (surfTex.status == WGPUSurfaceGetCurrentTextureStatus_SuccessOptimal ||
+        surfTex.status == WGPUSurfaceGetCurrentTextureStatus_SuccessSuboptimal)
+        return surfTex;
+
+    // Timeout and Occluded are normal OS events (window covered / on a different Space).
+    bool silent = surfTex.status == WGPUSurfaceGetCurrentTextureStatus_Timeout;
+#ifdef WGPU_H_
+    silent = silent || surfTex.status == (WGPUSurfaceGetCurrentTextureStatus)WGPUSurfaceGetCurrentTextureStatus_Occluded;
+#endif
+    if (!silent)
+        fprintf(stderr, "Failed to get surface texture (status %d)\n", surfTex.status);
+    if (surfTex.texture) wgpuTextureRelease(surfTex.texture);
+    surfTex.texture = nullptr;
+    return surfTex;
+}
+
+static void renderFrame() {
+    ZoneScoped;
+
+    // Update rotation angle
+    float angle = (float)platformGetTime();
+    wgpuQueueWriteBuffer(gQueue, gUniformBuf, 0, &angle, sizeof(float));
+
+    WGPUSurfaceTexture surfTex = getWindowSurface();
+    if (!surfTex.texture) return;
+
+    WGPUTextureView view = wgpuTextureCreateView(surfTex.texture, nullptr);
+
+    // Command encoder
+    WGPUCommandEncoder encoder = wgpuDeviceCreateCommandEncoder(gDevice, nullptr);
+
+    // Render pass
+    WGPURenderPassColorAttachment colorAtt = {};
+    colorAtt.view       = view;
+    colorAtt.loadOp     = WGPULoadOp_Clear;
+    colorAtt.storeOp    = WGPUStoreOp_Store;
+    colorAtt.clearValue  = { 0.05, 0.05, 0.08, 1.0 };
+    colorAtt.depthSlice  = WGPU_DEPTH_SLICE_UNDEFINED;
+
+    WGPURenderPassDescriptor passDesc = {};
+    passDesc.colorAttachmentCount = 1;
+    passDesc.colorAttachments     = &colorAtt;
+
+    {
+        ZoneScopedN("render-pass");
+        TracyWebGPUNamedZone(gTracyCtx, tracyZone, encoder, passDesc, "triangle draw", true);
+        WGPURenderPassEncoder pass = wgpuCommandEncoderBeginRenderPass(encoder, &passDesc);
+        wgpuRenderPassEncoderSetPipeline(pass, gPipeline);
+        wgpuRenderPassEncoderSetBindGroup(pass, 0, gBindGroup, 0, nullptr);
+        wgpuRenderPassEncoderDraw(pass, 3, 1, 0, 0);
+        wgpuRenderPassEncoderEnd(pass);
+        wgpuRenderPassEncoderRelease(pass);
+    }
+
+    // Submit
+    WGPUCommandBuffer cmdBuf = wgpuCommandEncoderFinish(encoder, nullptr);
+    wgpuQueueSubmit(gQueue, 1, &cmdBuf);
+
+    // Present
+    wgpuSurfacePresent(gSurface);
+
+    // Process Events
+    wgpuInstanceProcessEvents(gInstance);
+    TracyWebGPUCollect(gTracyCtx);
+
+    // Cleanup
+    wgpuCommandBufferRelease(cmdBuf);
+    wgpuCommandEncoderRelease(encoder);
+    wgpuTextureViewRelease(view);
+    wgpuTextureRelease(surfTex.texture);
+}
+
+// ---------------------------------------------------------------------------
+// Shutdown
+// ---------------------------------------------------------------------------
+
+static void shutdown() {
+    fprintf(stderr, "application is shutting down...\n");
+    TracyWebGPUDestroy(gTracyCtx);
+    if (gBindGroup)  wgpuBindGroupRelease(gBindGroup);
+    if (gUniformBuf) wgpuBufferRelease(gUniformBuf);
+    if (gPipeline)   wgpuRenderPipelineRelease(gPipeline);
+    if (gQueue)      wgpuQueueRelease(gQueue);
+    if (gDevice)     wgpuDeviceRelease(gDevice);
+    if (gAdapter)    wgpuAdapterRelease(gAdapter);
+    if (gSurface)    wgpuSurfaceRelease(gSurface);
+    if (gInstance)   wgpuInstanceRelease(gInstance);
+}
+
+// ---------------------------------------------------------------------------
+// main
+// ---------------------------------------------------------------------------
+
+int main(int argc, char* argv[]) {
+    if (!platformInit(kWidth, kHeight, "WebGPU Spinning Triangle"))
+        return 1;
+
+    gInstance = wgpuCreateInstance(nullptr);
+    if (!gInstance) return error(2, "Failed to create WebGPU instance.");
+
+    gSurface = platformCreateSurface(gInstance);
+    if (!gSurface) return error(3, "Failed to create surface.");
+
+    if (initWebGPU() != 0) return 4;
+
+    platformRunLoop(renderFrame, shutdown);
+    return 0;
+}
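
Review note on the wait loops in `initWebGPU()`: both loops spin while the handle is null, and the callbacks leave the handle null when a request fails, so the `error(11, ...)` and `error(12, ...)` branches look unreachable and a failed request would spin forever. A minimal sketch of one possible remedy follows: the callback sets a completion flag on both success and failure, and the wait is bounded. `RequestState`, `FakeInstance`, `onRequestReady` and `waitForRequest` are hypothetical names used only for illustration; `FakeInstance::processEvents()` stands in for `wgpuInstanceProcessEvents()` so the sketch compiles without any WebGPU headers.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a WGPUAdapter / WGPUDevice handle.
using Handle = void*;

// State shared between the request callback and the code that waits on it.
struct RequestState {
    Handle handle = nullptr;
    bool   done   = false;  // set on success AND on failure
};

// Hypothetical event pump standing in for wgpuInstanceProcessEvents():
// it runs every callback that has been queued so far.
struct FakeInstance {
    std::vector<std::function<void()>> pending;
    void processEvents() {
        auto ready = std::move(pending);
        pending.clear();
        for (auto& cb : ready) cb();
    }
};

// Same shape as onAdapterReady / onDeviceReady: store the handle on success,
// log on failure, and mark the request as finished in both cases.
static void onRequestReady(bool success, Handle handle, RequestState* state) {
    if (success) {
        state->handle = handle;
    } else {
        fprintf(stderr, "request failed\n");
    }
    state->done = true;
}

// Pump events until the callback has fired or the pump budget is exhausted.
// Returns the handle, or nullptr on failure or timeout.
static Handle waitForRequest(FakeInstance& inst, RequestState& state,
                             int maxPumps = 1000) {
    for (int i = 0; i < maxPumps && !state.done; ++i) {
        inst.processEvents();
    }
    return state.done ? state.handle : nullptr;
}
```

With this shape the caller can keep the existing `if (!handle) return error(...)` check, and that check becomes reachable. The pump budget is an arbitrary choice for the sketch; a real application might prefer a wall-clock timeout instead.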
--- a/manual/tracy.md
+++ b/manual/tracy.md
@@ -11,7 +11,7 @@ The user manual

 **Bartosz Taudul** [\<wolf@nereid.pl\>](mailto:wolf@nereid.pl)

-2026-06-09 <https://github.com/wolfpld/tracy>
+2026-06-15 <https://github.com/wolfpld/tracy>

 # Quick overview {#quick-overview .unnumbered}

@@ -69,7 +69,7 @@ Tracy is a real-time, nanosecond resolution *hybrid frame and sampling profiler*

 [^1]: Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as Rust, Zig, C#, OCaml, Odin, etc.

-[^2]: All major graphic APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL.
+[^2]: All major graphics/compute APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL, CUDA, WebGPU.

 While Tracy can perform statistical analysis of sampled call stack data, just like other *statistical profilers* (such as VTune, perf, or Very Sleepy), it mainly focuses on manual markup of the source code. Such markup allows frame-by-frame inspection of the program execution. For example, you will be able to see exactly which functions are called, how much time they require, and how they interact with each other in a multi-threaded environment. In contrast, the statistical analysis may show you the hot spots in your code, but it cannot accurately pinpoint the underlying cause for semi-random frame stutter that may occur every couple of seconds.

@@ -145,7 +145,7 @@ Tracy aims to give you an understanding of the inner workings of a tight loop of

 ## Sampling profiler

-Tracy can periodically sample what the profiled application is doing, which provides detailed performance information at the source line/assembly instruction level. This can give you a deep understanding of how the processor executes the program. Using this information, you can get a coarse view at the call stacks, fine-tune your algorithms, or even 'steal' an optimization performed by one compiler and make it available for the others.
+Tracy can periodically sample what the profiled application is doing, which provides detailed performance information at the source line/assembly instruction level. This can give you a deep understanding of how the processor executes the program. Using this information, you can get a coarse view of the call stacks, fine-tune your algorithms, or even "steal" an optimization performed by one compiler and make it available for the others.

 On some platforms, it is possible to sample the hardware performance counters, which will give you information not only *where* your program is running slowly, but also *why*.

@@ -279,7 +279,7 @@ Tracy Profiler supports MSVC, GCC, and clang. You will need to use a reasonably

 - QNX (x64)

-[^11]: Requires **\"OpenCL, OpenGL, and Vulkan Compatibility Pack\"** from Microsoft Store.
+[^11]: Requires **"OpenCL, OpenGL, and Vulkan Compatibility Pack"** from Microsoft Store.

 Moreover, the following platforms are not supported due to how secretive their owners are but were reported to be working after extending the system integration layer:

@@ -463,7 +463,7 @@ In the case of some programming environments, you may need to take extra steps t

 If you are using MSVC, you will need to disable the *Edit And Continue* feature, as it makes the compiler non-conformant to some aspects of the C++ standard. In order to do so, open the project properties and go to C/C++,General,Debug Information Format and make sure *Program Database for Edit And Continue (/ZI)* is *not* selected.

-For context, if you experience errors like \"error C2131: expression did not evaluate to a constant\", \"failure was caused by non-constant arguments or reference to a non-constant symbol\", and \"see usage of '`__LINE__Var`'\", chances are that your project has the *Edit And Continue* feature enabled.
+For context, if you experience errors like "error C2131: expression did not evaluate to a constant", "failure was caused by non-constant arguments or reference to a non-constant symbol", and "see usage of '`__LINE__Var`'", chances are that your project has the *Edit And Continue* feature enabled.

 #### Universal Windows Platform

@@ -641,11 +641,11 @@ Nevertheless, let's look at how we can try to stabilize the profiling data.

 Also known as: the *spectre* thing we have to deal with now.

-You must be aware that most processors available on the market[^19] *do not* execute machine code linearly, as laid out in the source code. This can lead to counterintuitive timing results reported by Tracy. Trying to get more 'reliable' readings[^20] would require a change in the behavior of the code, and this is not a thing a profiler should do. So instead, Tracy shows you what the hardware is *really* doing.
+You must be aware that most processors available on the market[^19] *do not* execute machine code linearly, as laid out in the source code. This can lead to counterintuitive timing results reported by Tracy. Trying to get more "reliable" readings[^20] would require a change in the behavior of the code, and this is not a thing a profiler should do. So instead, Tracy shows you what the hardware is *really* doing.

 [^19]: Except low-cost ARM CPUs.

-[^20]: And by saying 'reliable,' you do in reality mean: behaving in a way you expect it.
+[^20]: And by saying "reliable," you do in reality mean: behaving in a way you expect it.

 This is a complex subject, and the details vary from one CPU to another. You can read a brief rundown of the topic at the following address: <https://travisdowns.github.io/blog/2019/06/11/speed-limits.html>.

@@ -675,7 +675,7 @@ While the CPU is more-or-less designed always to be able to work at the advertis

 - Do you have complete control over the power profile? Spoiler alert: no. The operating system may run anything at any time on any of the other cores, which will impact the turbo frequency you're able to achieve.

-As you can see, this feature basically screams 'unreliable results!' Best keep it disabled and run at the base frequency. Otherwise, your timings won't make much sense. A true example: branchless compression function executing multiple times with the same input data was measured executing at *four* different speeds.
+As you can see, this feature basically screams "unreliable results!" Best keep it disabled and run at the base frequency. Otherwise, your timings won't make much sense. A true example: branchless compression function executing multiple times with the same input data was measured executing at *four* different speeds.

 Keep in mind that even at the base frequency, you may hit the thermal limits of the silicon and be down throttled.

@@ -797,7 +797,7 @@ If you want to use X11 instead, you can enable the `LEGACY` option in CMake buil

 Special considerations must be taken to run the Tracy server/profiler GUI on Windows on ARM.

-Ensure that the **\"OpenCL, OpenGL, and Vulkan Compatibility Pack\"** is installed (from the Microsoft Store), otherwise the GUI will fail to open.
+Ensure that the **"OpenCL, OpenGL, and Vulkan Compatibility Pack"** is installed (from the Microsoft Store), otherwise the GUI will fail to open.

 ### Using an IDE

@@ -813,7 +813,7 @@ The CMake build configuration will begin immediately. It is likely that you will

 After the build configuration phase is over, you may want to make some further adjustments to what is being built. The primary place to do this is in the *Project Status* section of the CMake side panel. The two key settings there are also available in the status bar at the bottom of the window:

- The *Folder* setting allows you to choose which Tracy utility you want to work with. Select \"profiler\" for the profiler's GUI.
+- The *Folder* setting allows you to choose which Tracy utility you want to work with. Select "profiler" for the profiler's GUI.

 - The *Build variant* setting is used to toggle between the debug and release build configurations.

@@ -877,7 +877,7 @@ Some source location data such as function name, file path or line number can be

 On selected platforms (see section [2.6](#featurematrix)) Tracy will intercept application crashes[^28]. This serves two purposes. First, the client application will be able to send the remaining profiling data to the server. Second, the server will receive a crash report with the crash reason, call stack at the time of the crash, etc.

-[^28]: For example, invalid memory accesses ('segmentation faults', 'null pointer exceptions'), divisions by zero, etc.
+[^28]: For example, invalid memory accesses ("segmentation faults", "null pointer exceptions"), divisions by zero, etc.

 This is an automatic process, and it doesn't require user interaction. If you are experiencing issues with crash handling you may want to try defining the `TRACY_NO_CRASH_HANDLER` macro to disable the built in crash handling.

@@ -905,6 +905,8 @@ Some features of the profiler are only available on selected platforms. Please r
 | GPU zones (OpenGL) |  |  |  |  |  |  |  |
 | GPU zones (Vulkan) |  |  |  |  |  |  |  |
 | GPU zones (Metal) |  |  |  | ^*b*^ | ^*b*^ |  |  |
+| GPU zones (CUDA) |  |  |  |  |  | ? |  |
+| GPU zones (WebGPU) |  |  |  |  |  | ? | ? |
 | Call stacks |  |  |  |  |  |  |  |
 | Symbol resolution |  |  |  |  |  |  |  |
 | Crash handling |  |  |  |  |  |  |  |
@@ -966,7 +968,7 @@ In some cases marked in the manual, Tracy expects you to provide a unique pointe

 Here, we pass two string literals with identical contents to two different macros. It is entirely up to the compiler to decide if it will pool these two strings into one pointer or if there will be two instances present in the executable image[^33]. For example, on MSVC, this is controlled by Configuration Properties,C/C++,Code Generation,Enable String Pooling option in the project properties (optimized builds enable it automatically). Note that even if string pooling is used on the compilation unit level, it is still up to the linker to implement pooling across object files.

-[^33]: [@ISO:2012:III] §2.14.5.12: \"Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined.\"
+[^33]: [@ISO:2012:III] §2.14.5.12: "Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined."

 As you can see, making sure that string literals are properly pooled can be surprisingly tricky. To work around this problem, you may employ the following technique. In *one* source file create the unique pointer for a string literal, for example:

@@ -1237,7 +1239,7 @@ Zone objects can't be moved or copied.

 ### Filtering zones {#filteringzones}

-Zone logging can be disabled on a per-zone basis by making use of the `ZoneNamed` macros. Each of the macros takes an `active` argument ('`true`' in the example in section [3.4.2](#multizone)), which will determine whether the zone should be logged.
+Zone logging can be disabled on a per-zone basis by making use of the `ZoneNamed` macros. Each of the macros takes an `active` argument ("`true`" in the example in section [3.4.2](#multizone)), which will determine whether the zone should be logged.

 Note that this parameter may be a run-time variable, such as a user-controlled switch to enable profiling of a specific part of code only when required.

@@ -1371,13 +1373,27 @@ Fast navigation in large data sets and correlating zones with what was happening

 If you want to include color coding of the messages (for example to make critical messages easily visible), you can use `TracyMessageC(text, size, color)` or `TracyMessageLC(text, color)` macros.

-Messages can also have different severity levels: `Trace`, `Debug`, `Info`, `Warning`, `Error` or `Fatal`. The `TracyMessage` macros will log messages with the severity `Info`. To log a message with a different severity, you may use the `TracyLogString` macro that regroups all the functionalities from the previous macros. We recommend writing your own macros, wrapping the different severities for easier use. You may provide a color of 0 if you do not want to set a color for this message.
+Messages can also have different severity levels:
+
+- *Trace* -- Broadly tracks variable states and events in the software program.
+
+- *Debug* -- Describes variable states and details about specific internal events in the software that are useful for investigations.
+
+- *Info* -- Describes normal events, which inform on the expected progress and state of your software.
+
+- *Warning* -- Describes potentially dangerous situations caused by unexpected events and states.
+
+- *Error* -- Describes the occurrence of unexpected behavior. Does not interrupt the execution of the software.
+
+- *Fatal* -- Describes a critical event that will lead to a software failure/crash.
+
+The `TracyMessage` macros will log messages with the severity `Info`. To log a message with a different severity, you may use the `TracyLogString` macro that regroups all the functionalities from the previous macros. We recommend writing your own macros, wrapping the different severities for easier use. You may provide a color of 0 if you do not want to set a color for this message.

 Examples:

    std::string dynStr = "Trace using a dynamic string, blue color, no callstack";
    TracyLogString( tracy::MessageSeverity::Trace, 0xFF, 0, dynStr.size(), dynStr.c_str() );
-    TracyLogString( tracy::MessageSeverity::Warning, 0, TRACY_CALLSTACK, "Warning using a string litteral, no color, capturing the callstack to a depth of TRACY_CALLSTACK" );
+    TracyLogString( tracy::MessageSeverity::Warning, 0, TRACY_CALLSTACK, "Warning using a string literal, no color, capturing the callstack to a depth of TRACY_CALLSTACK" );

 ### Application information {#appinfo}

@@ -1416,8 +1432,6 @@ To mark memory events, use the `TracyAlloc(ptr, size)` and `TracyFree(ptr)` macr
        free(ptr);
    }

-In some rare cases (e.g., destruction of TLS block), events may be reported after the profiler is no longer available, which would lead to a crash. To work around this issue, you may use `TracySecureAlloc` and `TracySecureFree` variants of the macros.
-
 > [!IMPORTANT]
 > **Important**
 >
@@ -1446,9 +1460,11 @@ Sometimes an application will use more than one memory pool. For example, in add

 To mark that a separate memory pool is to be tracked you should use the named version of memory macros, for example `TracyAllocN(ptr, size, name)` and `TracyFreeN(ptr, name)`, where `name` is an unique pointer to a string literal (section [3.1.2](#uniquepointers)) identifying the memory pool.

+Certain memory allocator designs ("arena allocators") use an always-incrementing pointer to track the next region to allocate and do not support deallocation of individual objects. The only way to free memory with such an allocator is to simultaneously release all the objects that were allocated (reset the allocator state). You can mark such a mass-deallocation event in a memory pool with the `TracyMemoryDiscard(name)` macro.
+
 ## GPU profiling {#gpuprofiling}

-Tracy provides bindings for profiling OpenGL, Vulkan, Direct3D 11, Direct3D 12, Metal, OpenCL and CUDA execution time on GPU.
+Tracy provides bindings for profiling OpenGL, Vulkan, Direct3D 11, Direct3D 12, Metal, OpenCL, CUDA and WebGPU execution time on GPU.

 Note that the CPU and GPU timers may be unsynchronized unless you create a calibrated context, but the availability of calibrated contexts is limited. You can try to correct the desynchronization of uncalibrated contexts in the profiler's options (section [5.4](#options)).

@@ -1589,6 +1605,16 @@ Unlike other GPU backends in Tracy, there is no need to call `TracyCUDACollect(c

 To stop profiling, call the `TracyCUDAStopProfiling(ctx)` macro.

+### WebGPU
+
+WebGPU support is enabled by including the `public/tracy/TracyWebGPU.hpp` header file. Both major implementations of WebGPU (Dawn and wgpu-native) are supported.
+
+Before creating the WebGPU device, call `TracyWebGPUSetupDeviceDescriptor()` so that Tracy can request the device features and extensions required for profiling. After the device is created, use the `TracyWebGPUContext()` macro to instantiate the `WebGPUQueueCtx` object required for GPU instrumentation. The object should later be cleaned up with the `TracyWebGPUDestroy()` macro. To set a custom name for the context, use the `TracyWebGPUContextName()` macro.
+
+To instrument a GPU zone, use the various `TracyWebGPU*Zone*()` macros. Note that WebGPU only offers command instrumentation at the "pass" level. While command-level granularity is possible through implementation-specific WebGPU extensions, Tracy does not support it at the moment. Supply the corresponding WebGPU pass descriptor to the instrumentation macro *before* creating the WebGPU pass encoder.
+
+You are required to periodically collect the GPU events using the `TracyWebGPUCollect()` macro. Good places for collection are: after synchronous waits, after event processing (`wgpuInstanceProcessEvents`), after present drawable calls (`wgpuSurfacePresent`), and inside the completion callback of command queues (`wgpuQueueOnSubmittedWorkDone`).
+
 ### ROCm

 On Linux, if rocprofiler-sdk is installed, tracy can automatically trace GPU dispatches and collect performance counter values. If CMake can't find rocprofiler-sdk, you can set the CMake variable `rocprofiler-sdk_DIR` to point it at the correct module directory. Use the `TRACY_ROCPROF_COUNTERS` environment variable with the desired counters separated by commas to control what values are collected. The results will appear for each dispatch in the tool tip and zone detail window. Results are summed across dimensions. You can get a list of the counters available for your hardware with this command:
@@ -1613,13 +1639,13 @@ rocprofv3 -L

 Putting more than one GPU zone macro in a single scope features the same issue as with the `ZoneScoped` macros, described in section [3.4.2](#multizone) (but this time the variable name is `___tracy_gpu_zone`).

-To solve this problem, in case of OpenGL use the `TracyGpuNamedZone` macro in place of `TracyGpuZone` (or the color variant). The same applies to Vulkan, Direct3D 11/12 and Metal -- replace `TracyVkZone` with `TracyVkNamedZone`, `TracyD3D11Zone`/`TracyD3D12Zone` with `TracyD3D11NamedZone`/`TracyD3D12NamedZone`, and `TracyMetalZone` with `TracyMetalNamedZone`.
+To solve this problem, in case of OpenGL use the `TracyGpuNamedZone` macro in place of `TracyGpuZone` (or the color variant). The same applies to Vulkan, Direct3D 11/12, Metal and WebGPU -- replace `TracyVkZone` with `TracyVkNamedZone`, `TracyD3D11Zone`/`TracyD3D12Zone` with `TracyD3D11NamedZone`/`TracyD3D12NamedZone`, `TracyMetalZone` with `TracyMetalNamedZone`, and `TracyWebGPUZone` with `TracyWebGPUNamedZone`.

 Remember to provide your name for the created stack variable as the first parameter to the macros.

 ### Transient GPU zones

-Transient zones (see section [3.4.4](#transientzones) for details) are available in OpenGL, Vulkan, and Direct3D 11/12 macros. Transient zones are not available for Metal at this moment.
+Transient zones (see section [3.4.4](#transientzones) for details) are available in OpenGL, Vulkan, Direct3D 11/12 and WebGPU macros. Transient zones are not available for Metal at this moment.

 ## Fibers

@@ -1664,7 +1690,7 @@ As you can see, there are two threads, `t1` and `t2`, which are simulating worke

 ## Collecting call stacks {#collectingcallstacks}

-Capture of true calls stacks can be performed by using macros with the `S` postfix, which require an additional parameter, specifying the depth of call stack to be captured. The greater the depth, the longer it will take to perform capture. Currently you can use the following macros: `ZoneScopedS`, `ZoneScopedNS`, `ZoneScopedCS`, `ZoneScopedNCS`, `TracyAllocS`, `TracyFreeS`, `TracySecureAllocS`, `TracySecureFreeS`, `TracyMessageS`, `TracyMessageLS`, `TracyMessageCS`, `TracyMessageLCS`, `TracyGpuZoneS`, `TracyGpuZoneCS`, `TracyVkZoneS`, `TracyVkZoneCS`, and the named and transient variants.
+Capture of true call stacks can be performed by using macros with the `S` postfix, which require an additional parameter, specifying the depth of call stack to be captured. The greater the depth, the longer it will take to perform capture. Currently you can use the following macros: `ZoneScopedS`, `ZoneScopedNS`, `ZoneScopedCS`, `ZoneScopedNCS`, `TracyAllocS`, `TracyFreeS`, `TracyMessageS`, `TracyMessageLS`, `TracyMessageCS`, `TracyMessageLCS`, `TracyGpuZoneS`, `TracyGpuZoneCS`, `TracyVkZoneS`, `TracyVkZoneCS`, and the named and transient variants.

 Be aware that call stack collection is a relatively slow operation. Table [6](#CallstackTimes) and figure [6](#CallstackPlot) show how long it took to perform a single capture of varying depth on multiple CPU architectures.

@@ -1788,7 +1814,7 @@ An example implementation of such a lock interface is provided below, as a refer
    void DbgHelpUnlock() { ReleaseMutex(dbgHelpLock); }
    }

-At initilization time, tracy will attempt to preload symbols for device drivers and process modules. As this process can be slow when a lot of pdbs are involved, you can set the `TRACY_NO_DBGHELP_INIT_LOAD` environment variable to \"1\" to disable this behavior and rely on-demand symbol loading.
+At initialization time, Tracy will attempt to preload symbols for device drivers and process modules. As this process can be slow when a lot of PDB files are involved, you can set the `TRACY_NO_DBGHELP_INIT_LOAD` environment variable to "1" to disable this behavior and rely on on-demand symbol loading.

 #### Disabling resolution of inline frames

@@ -2039,10 +2065,6 @@ Use the following macros in your implementations of `malloc` and `free`:

 - `TracyCFree(ptr)`

- `TracyCSecureAlloc(ptr, size)`
-
- `TracyCSecureFree(ptr)`
-
 Correctly using this functionality can be pretty tricky. You also will need to handle all the memory allocations made by external libraries (which typically allow usage of custom memory allocation functions) and the allocations made by system functions. If you can't track such an allocation, you will need to make sure freeing is not reported[^56].

 [^56]: It's not uncommon to see a pattern where a system function returns some allocated memory, which you then need to release.
@@ -2108,7 +2130,7 @@ To see how you should use this API, you should look at the reference implementat
 > [!IMPORTANT]
 > **Important**
 >
-> A common mistake is to skip the zone \"`isActive`\" check. When using `TRACY_ON_DEMAND`, you need to read the value of `TracyCIsConnected` once, and check the same value for both\
+> A common mistake is to skip the zone "`isActive`" check. When using `TRACY_ON_DEMAND`, you need to read the value of `TracyCIsConnected` once, and check the same value for both\
 > `___tracy_emit_gpu_zone_begin_alloc` and `___tracy_emit_gpu_zone_end`. Tracy may otherwise receive a zone end without a zone begin.

 ### Fibers
@@ -2546,9 +2568,9 @@ To collect frame images, use `tracy_image(image, w, h, offset, flip)` call.

 Use the following calls in your implementations of allocator/deallocator:

- `tracy_memory_alloc(ptr, size, name, depth, secure)`
+- `tracy_memory_alloc(ptr, size, name, depth)`

- `tracy_memory_free(ptr, name, depth, secure)`
+- `tracy_memory_free(ptr, name, depth)`

 Correctly using this functionality can be pretty tricky, especially in Fortran. In Fortran, you cannot redefine the `allocate` statement (or the `deallocate` statement) to profile memory usage by `allocatable` variables. However, many applications[^58] use a stack allocator on a memory tape, where these calls can be useful.

@@ -2600,7 +2622,7 @@ Some profiling data can only be retrieved using the kernel facilities, which are

 [^59]: To make this easier, you can run MSVC with admin privileges, which will be inherited by your program when you start it from within the IDE.

-As this system-level tracing functionality is part of the automated collection process, no user intervention is necessary to enable it (assuming that the program was granted the rights needed). However, if, for some reason, you would want to prevent your application from trying to access kernel data, you may recompile your program with the `TRACY_NO_SYSTEM_TRACING` define. If you want to disable this functionality dynamically at runtime instead, you can set the `TRACY_NO_SYSTEM_TRACING` environment variable to \"1\".
+As this system-level tracing functionality is part of the automated collection process, no user intervention is necessary to enable it (assuming that the program was granted the rights needed). However, if, for some reason, you would want to prevent your application from trying to access kernel data, you may recompile your program with the `TRACY_NO_SYSTEM_TRACING` define. If you want to disable this functionality dynamically at runtime instead, you can set the `TRACY_NO_SYSTEM_TRACING` environment variable to "1".

 > [!TIP]
 > **What should be granted privileges?**
@@ -2737,7 +2759,7 @@ It would be best to be extra careful when working with non-public code, as parts

 ### Vertical synchronization

-On Windows and Linux, Tracy will automatically capture hardware Vsync events, provided that the application has access to the kernel data (privilege elevation may be needed, see section [3.17.1](#privilegeelevation)). These events will be reported as '`[x] Vsync`' frame sets, where `x` is the identifier of a specific monitor. Note that hardware vertical synchronization might not correspond to the one seen by your application due to desktop composition, command queue buffering, and so on. Also, in some instances, when there is nothing to update on the screen, the graphic driver may choose to stop issuing screen refresh. As a result, there may be periods where no vertical synchronization events are reported.
+On Windows and Linux, Tracy will automatically capture hardware Vsync events, provided that the application has access to the kernel data (privilege elevation may be needed, see section [3.17.1](#privilegeelevation)). These events will be reported as "`[x] Vsync`" frame sets, where `x` is the identifier of a specific monitor. Note that hardware vertical synchronization might not correspond to the one seen by your application due to desktop composition, command queue buffering, and so on. Also, in some instances, when there is nothing to update on the screen, the graphics driver may choose to stop issuing screen refreshes. As a result, there may be periods where no vertical synchronization events are reported.

 Use the `TRACY_NO_VSYNC_CAPTURE` macro to disable capture of Vsync events.

@@ -2887,7 +2909,7 @@ The * Wrench* button opens the about dialog, which also contains a number of

 The client *address entry* field and the  *Connect* button are used to connect to a running client[^66]. You can use the connection history button  to display a list of commonly used targets, from which you can quickly select an address. You can remove entries from this list by hovering the  mouse cursor over an entry and pressing the Delete button on the keyboard.

-[^66]: Note that a custom port may be provided here, for example by entering '127.0.0.1:1234'.
+[^66]: Note that a custom port may be provided here, for example by entering "127.0.0.1:1234".

 If you want to open a trace that you have stored on the disk, you can do so by pressing the  *Open saved trace* button.

@@ -3403,13 +3425,13 @@ You will find the zones with locks and their associated threads on this combined

 The left-hand side *index area* of the timeline view displays various labels (threads, locks), which can be categorized in the following way:

- *Light blue label* -- GPU context. Multi-threaded Vulkan, OpenCL, Direct3D 12 and Metal contexts are additionally split into separate threads.
+- *Light blue label* -- GPU context. Multi-threaded Vulkan, OpenCL, Direct3D 12, Metal and WebGPU contexts are additionally split into separate threads.

 - *Pink label* -- CPU data graph.

- *White label* -- A CPU thread. It will be replaced by a bright red label in a thread that has crashed (section [2.5](#crashhandling)). If automated sampling was performed, clicking the left mouse button on the * ghost zones* button will switch zone display mode between 'instrumented' and 'ghost.'
+- *White label* -- A CPU thread. It will be replaced by a bright red label in a thread that has crashed (section [2.5](#crashhandling)). If automated sampling was performed, clicking the left mouse button on the * ghost zones* button will switch zone display mode between "instrumented" and "ghost."

- *Green label* -- Fiber, coroutine, or any other sort of cooperative multitasking 'green thread.'
+- *Green label* -- Fiber, coroutine, or any other sort of cooperative multitasking "green thread."

 - *Light red label* -- Indicates a lock.

@@ -3437,7 +3459,7 @@ In an example in figure [18](#zoneslocks) you can see that there are two thread

 Meanwhile, the *Streaming thread* is performing some *Streaming jobs*. The first *Streaming job* sent a message (section [3.7](#messagelog)). In addition to being listed in the message log, it is indicated by a triangle over the thread separator. When multiple messages are in one place, the triangle outline shape changes to a filled triangle.

-The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/Metal/OpenCL context in place of a thread name.
+The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/Metal/OpenCL/CUDA/WebGPU context in place of a thread name.

 Hovering the  mouse pointer over a zone will highlight all other zones that have the exact source location with a white outline. Clicking the left mouse button on a zone will open the zone information window (section [5.14](#zoneinfo)). Holding the Ctrl key and clicking the left mouse button on a zone will open the zone statistics window (section [5.7](#findzone)). Clicking the middle mouse button on a zone will zoom the view to the extent of the zone.

@@ -3659,7 +3681,7 @@ In this window, you can set various trace-related options. For example, the time

  - * Draw CPU usage graph* -- You can disable drawing of the CPU usage graph here.

- * Draw GPU zones* -- Allows disabling display of OpenGL/Vulkan/Metal/Direct3D/OpenCL zones. The *GPU zones* drop-down allows disabling individual GPU contexts and setting CPU/GPU drift offsets of uncalibrated contexts (see section [3.9](#gpuprofiling) for more information). The * Auto* button automatically measures the GPU drift value[^78].
+- * Draw GPU zones* -- Allows disabling display of OpenGL/Vulkan/Metal/Direct3D/OpenCL/CUDA/WebGPU zones. The *GPU zones* drop-down allows disabling individual GPU contexts and setting CPU/GPU drift offsets of uncalibrated contexts (see section [3.9](#gpuprofiling) for more information). The * Auto* button automatically measures the GPU drift value[^78].

 - * Draw CPU zones* -- Determines whether CPU zones are displayed.

@@ -3738,7 +3760,7 @@ You can filter the message list in the following ways:

 - By the originating thread in the * Visible threads* drop-down.

- By matching the message text to the expression in the * Filter messages* entry field. Multiple filter expressions can be comma-separated (e.g. 'warn, info' will match messages containing strings 'warn' *or* 'info'). You can exclude matches by preceding the term with a minus character (e.g., '-debug' will hide all messages containing the string 'debug').
+- By matching the message text to the expression in the * Filter messages* entry field. Multiple filter expressions can be comma-separated (e.g. "warn, info" will match messages containing strings "warn" *or* "info"). You can exclude matches by preceding the term with a minus character (e.g., "-debug" will hide all messages containing the string "debug").

 - By message source, distinguishing between * User* messages and internal * Tracy* diagnostics.
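
The filter expression rules described above (comma-separated terms, a leading minus to exclude) can be sketched as follows. This is only an illustration of the described semantics, not Tracy's actual implementation: the function name `message_passes` is hypothetical, and the case-sensitive substring matching, the trimming of spaces around terms, and the way include and exclude terms combine are all assumptions.

```cpp
#include <cstddef>
#include <string>

// Strip leading and trailing spaces from a filter term.
static std::string trim(const std::string& s)
{
    const auto b = s.find_first_not_of(' ');
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(' ');
    return s.substr(b, e - b + 1);
}

// Hypothetical helper: returns true if `msg` should stay visible under `filter`.
// Assumed semantics:
//  - plain terms are OR-ed: at least one must occur in the message,
//  - a term starting with '-' hides any message that contains it,
//  - an empty filter (or one with only exclusions that do not match) shows the message.
bool message_passes(const std::string& msg, const std::string& filter)
{
    bool hasInclude = false;
    bool matched = false;
    std::size_t pos = 0;
    while (pos <= filter.size())
    {
        std::size_t comma = filter.find(',', pos);
        if (comma == std::string::npos) comma = filter.size();
        const std::string term = trim(filter.substr(pos, comma - pos));
        pos = comma + 1;
        if (term.empty()) continue;

        if (term[0] == '-')
        {
            // Exclusion term: a match hides the message immediately.
            if (term.size() > 1 && msg.find(term.substr(1)) != std::string::npos) return false;
        }
        else
        {
            hasInclude = true;
            if (msg.find(term) != std::string::npos) matched = true;
        }
    }
    return !hasInclude || matched;
}
```

With these assumptions, the filter `warn, info` keeps a message containing either string, while `-debug` hides only the messages that contain `debug`.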

@@ -4215,7 +4237,7 @@ The zone information window displays detailed information about a single zone. T

 - Timing information.

- If the profiler performed context switch capture (section [3.17.3](#contextswitches)) and a thread was suspended during zone execution, a list of wait regions will be displayed, with complete information about the timing, CPU migrations, and wait reasons. If CPU topology data is available (section [3.17.4](#cputopology)), the profiler will mark zone migrations across cores with 'C' and migrations across packages -- with 'P.' In some cases, context switch data might be incomplete[^92], in which case a warning message will be displayed.
+- If the profiler performed context switch capture (section [3.17.3](#contextswitches)) and a thread was suspended during zone execution, a list of wait regions will be displayed, with complete information about the timing, CPU migrations, and wait reasons. If CPU topology data is available (section [3.17.4](#cputopology)), the profiler will mark zone migrations across cores with "C" and migrations across packages -- with "P." In some cases, context switch data might be incomplete[^92], in which case a warning message will be displayed.

 - Memory events list, both summarized and a list of individual allocation/free events (see section [5.10](#memorywindow) for more information on the memory events list).

@@ -4275,7 +4297,7 @@ This window shows the frames contained in the selected call stack. Information a

 A single stack frame may have multiple function call places associated with it. This happens in the case of inlined function calls. Such entries will be displayed in the call stack window, with *inline* in place of frame number[^94].

-[^94]: Or '' icon in case of call stack tooltips.
+[^94]: Or "" icon in case of call stack tooltips.

 If the call stack shows a crash (see section [2.5](#crashhandling)), a red * Crash* label will be displayed. Clicking it will center the timeline on the crash. Note that the crash stack may contain OS or Tracy frames where the crash was intercepted and processed.

@@ -4289,7 +4311,7 @@ Stack frame location may be displayed in the following number of ways, depending

 - *Symbol address* -- displays begin address of the function containing the frame address.

-In some cases, it may not be possible to decode stack frame addresses correctly. Such frames will be presented with a dimmed '`[ntdll.dll]`' name of the image containing the frame address, or simply '`[unknown]`' if the profiler cannot retrieve even this information. Additionally, '`[kernel]`' is used to indicate unknown stack frames within the operating system's internal routines.
+In some cases, it may not be possible to decode stack frame addresses correctly. Such frames will be presented with a dimmed "`[ntdll.dll]`" name of the image containing the frame address, or simply "`[unknown]`" if the profiler cannot retrieve even this information. Additionally, "`[kernel]`" is used to indicate unknown stack frames within the operating system's internal routines.

 External frames from system libraries are hidden by default. Enabling the * External* option will show these frames, which can be useful for debugging issues in external code. When external frames are displayed, they are dimmed out.

@@ -4388,7 +4410,7 @@ Some modes may be unavailable in some circumstances (missing or outdated source

 #### Source mode

-This is pretty much the source file view window, but with the ability to select one of the source files that the compiler used to build the symbol. Additionally, each source file line that produced machine code in the symbol will show a count of associated assembly instructions, displayed with an '`@`' prefix, and will be marked with grey color on the scroll bar. Due to how optimizing compilers work, some lines may seemingly not produce any machine code, for example, because iterating a loop counter index might have been reduced to advancing a data pointer. Some other lines may have a disproportionate amount of associated instructions, e.g., when the compiler applied a loop unrolling optimization. This varies from case to case and from compiler to compiler.
+This is pretty much the source file view window, but with the ability to select one of the source files that the compiler used to build the symbol. Additionally, each source file line that produced machine code in the symbol will show a count of associated assembly instructions, displayed with an "`@`" prefix, and will be marked with grey color on the scroll bar. Due to how optimizing compilers work, some lines may seemingly not produce any machine code, for example, because iterating a loop counter index might have been reduced to advancing a data pointer. Some other lines may have a disproportionate amount of associated instructions, e.g., when the compiler applied a loop unrolling optimization. This varies from case to case and from compiler to compiler.

 The *Propagate inlines* option, available when sample data is present, will enable propagation of the instruction costs down the local call stack. For example, suppose a base function in the symbol issues a call to an inlined function (which may not be readily visible due to being contained in another source file). In that case, any cost attributed to the inlined function will be visible in the base function. Because the cost information is added to all the entries in the local call stacks, it is possible to see seemingly nonsense total cost values when this feature is enabled. To quickly toggle this on or off, you may also press the X key.

@@ -4403,7 +4425,7 @@ If the * Source locations* option is selected, each line of the assembly code
 >
 > In some cases, it may be challenging to understand what is being displayed in the disassembly. For example, calling the `std::lower_bound` function may generate multiple levels of inlined functions: first, we enter the search algorithm, then the comparison functions, which in turn may be lambdas that call even more external code, and so on. In such an event, you will most likely see that some external code is taking a long time to execute, and you will be none the wiser on improving things.
 >
-> The local call stack for an assembly instruction represents all the inline function calls *within the symbol* (hence the 'local' part), which were made to reach the instruction. Deeper inspection of the local call stack, including navigation to the source call site of each participating inline function, can be performed through the context menu accessible by pressing the right mouse button on the source location.
+> The local call stack for an assembly instruction represents all the inline function calls *within the symbol* (hence the "local" part), which were made to reach the instruction. Deeper inspection of the local call stack, including navigation to the source call site of each participating inline function, can be performed through the context menu accessible by pressing the right mouse button on the source location.

 Selecting the * Raw code* option will enable the display of raw machine code bytes for each line. Individual bytes are displayed with interwoven colors to make reading easier.

@@ -4487,9 +4509,9 @@ In this mode, the source and assembly panes will be displayed together, providin

 #### Instruction pointer cost statistics

-If automated call stack sampling (see chapter [3.17.5](#sampling)) was performed, additional profiling information will be available. The first column of source and assembly views will contain percentage counts of collected instruction pointer samples for each displayed line, both in numerical and graphical bar form. You can use this information to determine which function line takes the most time. The displayed percentage values are heat map color-coded, with the lowest values mapped to dark red and the highest to bright yellow. The color code will appear next to the percentage value and on the scroll bar so that you can identify 'hot' places in the code at a glance.
+If automated call stack sampling (see chapter [3.17.5](#sampling)) was performed, additional profiling information will be available. The first column of source and assembly views will contain percentage counts of collected instruction pointer samples for each displayed line, both in numerical and graphical bar form. You can use this information to determine which function line takes the most time. The displayed percentage values are heat map color-coded, with the lowest values mapped to dark red and the highest to bright yellow. The color code will appear next to the percentage value and on the scroll bar so that you can identify "hot" places in the code at a glance.

-By default, samples are displayed only within the selected symbol, in isolation. In some cases, you may, however, want to include samples from functions that the selected symbol called. To do so, enable the * Child calls* option, which you may also temporarily toggle by holding the Z key. You can also click the  drop down control to display a child call distribution list[^101], which shows each known function[^102] that the symbol called. Make sure to familiarize yourself with section [5.15.1](#readingcallstacks) to be able to read the results correctly. Each child call on the list has an attributed time cost, which is also displayed as a percentage of the child calls (\"% Calls\") and the percentage of the total symbol time (\"% Total\").
+By default, samples are displayed only within the selected symbol, in isolation. In some cases, you may, however, want to include samples from functions that the selected symbol called. To do so, enable the * Child calls* option, which you may also temporarily toggle by holding the Z key. You can also click the  drop down control to display a child call distribution list[^101], which shows each known function[^102] that the symbol called. Make sure to familiarize yourself with section [5.15.1](#readingcallstacks) to be able to read the results correctly. Each child call on the list has an attributed time cost, which is also displayed as a percentage of the child calls ("% Calls") and the percentage of the total symbol time ("% Total").

 [^101]: The height of the list can be changed by dragging the separator bar.

@@ -4710,7 +4732,7 @@ There are no ideal LLM providers, but here are some options:

 - *llama-swap* (<https://github.com/mostlygeek/llama-swap>) -- Wrapper for llama.cpp that allows model selection. Recommended to augment the above.

- *LM Studio* (<https://lmstudio.ai/>) -- It is easy to install on all platforms and has a GUI. But it is overwhelming when it comes to the number of options it offers. Some people may question the licensing. Its features lag a behind llama.cpp. Manual configuration of each model is required. To get it to work properly, go to it settings (using the gear icon in the bottom right corner of the program window), then select the Developer tab and enable \"When applicable, separate `reasoning_content` and `content` in API responses\".
+- *LM Studio* (<https://lmstudio.ai/>) -- It is easy to install on all platforms and has a GUI. But it is overwhelming when it comes to the number of options it offers. Some people may question the licensing. Its features lag behind llama.cpp. Manual configuration of each model is required. To get it to work properly, go to its settings (using the gear icon in the bottom right corner of the program window), then select the Developer tab and enable "When applicable, separate `reasoning_content` and `content` in API responses".

 ## Model selection

@@ -4727,19 +4749,19 @@ A good *starting* point that will work fairly well on almost any hardware is the
 > [!TIP]
 > **Model quantization**
 >
-> Running a model with full 32-bit floating-point weights is not feasible due to memory requirements. Instead, the model parameters are quantized, for which 4 bits is typically the sweet spot. In general, the lower the parameter precision, the more \"dumbed down\" the model becomes. However, the loss of model coherence due to quantization is less than the benefit of being able to run a larger model.
+> Running a model with full 32-bit floating-point weights is not feasible due to memory requirements. Instead, the model parameters are quantized, for which 4 bits is typically the sweet spot. In general, the lower the parameter precision, the more "dumbed down" the model becomes. However, the loss of model coherence due to quantization is less than the benefit of being able to run a larger model.

 > [!TIP]
 > **Model size**
 >
-> Another thing to consider when selecting a model is its size, which is typically measured in billions of parameters (weights) and written as 4B, for example. The model size determines how much memory, computation, and time are required to run it. Generally, the larger the model, the \"smarter\" its responses will be.
+> Another thing to consider when selecting a model is its size, which is typically measured in billions of parameters (weights) and written as 4B, for example. The model size determines how much memory, computation, and time are required to run it. Generally, the larger the model, the "smarter" its responses will be.
 >
-> Most modern models will be \"Mixture of Experts\", or MoE, and their size will be denoted, for example, 35B-A3B. This means that the model size is 35B, but only 3B parameters are active and used to compute the next token. In practice, this means that the model has knowledge closer to the full, dense 35B model but speed and GPU memory requirements closer to the fast 3B model.
+> Most modern models will be "Mixture of Experts", or "MoE", and their size will be denoted, for example, 35B-A3B. This means that the model size is 35B, but only 3B parameters are active and used to compute the next token. In practice, this means that the model has knowledge closer to the full, dense 35B model but speed and GPU memory requirements closer to the fast 3B model.
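
To make the quantization and model size notes more tangible, here is a small back-of-the-envelope calculation of the memory needed for the weights alone. The helper name `weights_gb` is made up for this sketch, and the result is only an approximation: it assumes every parameter is stored at exactly the given bit width, ignores the overhead of real quantization formats, and does not include the context (KV cache) memory.

```cpp
// Approximate weight storage in gigabytes (10^9 bytes) for a model with
// `paramsBillions` billion parameters stored at `bitsPerWeight` bits each.
// Hypothetical helper for illustration; real model files are somewhat larger.
constexpr double weights_gb(double paramsBillions, double bitsPerWeight)
{
    return paramsBillions * bitsPerWeight / 8.0;
}

// A 35B model at full 32-bit precision: 35 * 32 / 8 = 140 GB.
constexpr double kFullPrecision35B = weights_gb(35.0, 32.0);

// The same model quantized to 4 bits: 35 * 4 / 8 = 17.5 GB.
constexpr double kQuantized35B = weights_gb(35.0, 4.0);

// A small 4B model at 4 bits: 4 * 4 / 8 = 2 GB.
constexpr double kQuantized4B = weights_gb(4.0, 4.0);
```

Note that for a Mixture of Experts model such as 35B-A3B, all 35B parameters still have to be stored somewhere, so this estimate applies to the full size and not to the active 3B.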

 > [!TIP]
 > **Context size**
 >
-> The model size only indicates the minimum memory requirement. For the model to operate properly, you also need to set the context size, which determines how much information from the conversation the model can \"remember\". This size is measured in tokens, and a very rough approximation is that each token is a combination of three or four letters.
+> The model size only indicates the minimum memory requirement. For the model to operate properly, you also need to set the context size, which determines how much information from the conversation the model can "remember". This size is measured in tokens, and a very rough approximation is that each token is a combination of three or four letters.
 >
 > Each token present in the context window may require a fairly large amount of memory, and that can quickly add up to gigabytes. Some modern models use solutions that greatly reduce context memory requirements, but that varies from model to model. If needed, the KV cache used for context can be quantized, just like model parameters. In this case, the recommended size per weight is 8 bits.
 >
@@ -4749,7 +4771,7 @@ A good *starting* point that will work fairly well on almost any hardware is the

 Sometimes Tracy needs to do some language processing where speed is more important than the smarts. The default setting is to use the chat model with the reasoning disabled, which is fine for most applications.

-It may be more convenient to use a small, quick model instead, in which case enable the *Fast model* checkbox and choose the second model. To save precious GPU resources for the chat model, you may want to keep this model entirely in system RAM (set `-ngl 0` for llama.cpp or set \"GPU offload\" to 0 in LM Studio) and disable the KV cache offload to GPU (set `-nkvo` for llama.cpp or disable \"Offload KV Cache to GPU Memory\" in LM Studio).
+It may be more convenient to use a small, quick model instead, in which case enable the *Fast model* checkbox and choose the second model. To save precious GPU resources for the chat model, you may want to keep this model entirely in system RAM (set `-ngl 0` for llama.cpp or set "GPU offload" to 0 in LM Studio) and disable the KV cache offload to GPU (set `-nkvo` for llama.cpp or disable "Offload KV Cache to GPU Memory" in LM Studio).

 ### Embedding model

@@ -4817,7 +4839,7 @@ The horizontal meter directly below shows how much of the context size has been

 The chat section contains the conversation with the automated assistant with alternating user and assistant turns. Clicking on the * User* role icon removes the chat content up to the selected question. Similarly, clicking on the * Assistant* role icon removes the conversation content up to this point and generates another response from the assistant.

-The assistant may give preliminary replies to the user, for example, *\"I will now check the source of function foobar\"*, followed by performing the actual check, then a continuation of the reply, such as *\"Now I can see that\...\"*. To make reading these tiered replies easier, only the most recent reply is printed in normal text, while the preliminary responses are dimmed out.
+The assistant may give preliminary replies to the user, for example, "I will now check the source of function foobar", followed by performing the actual check, then a continuation of the reply, such as "Now I can see that\...". To make reading these tiered replies easier, only the most recent reply is printed in normal text, while the preliminary responses are dimmed out.

 Each assistant reply contains a note about the language model that was used and the time it took to generate the text.

--- a/manual/tracy.tex
+++ b/manual/tracy.tex
@@ -141,7 +141,7 @@ There's much more Tracy can do, which can be explored by carefully reading this
 \section{A quick look at Tracy Profiler}
 \label{quicklook}

-Tracy is a real-time, nanosecond resolution \emph{hybrid frame and sampling profiler} that you can use for remote or embedded telemetry of games and other applications. It can profile CPU\footnote{Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as Rust, Zig, C\#, OCaml, Odin, etc.}, GPU\footnote{All major graphic APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL.}, memory allocations, locks, context switches, automatically attribute screenshots to captured frames, and much more.
+Tracy is a real-time, nanosecond resolution \emph{hybrid frame and sampling profiler} that you can use for remote or embedded telemetry of games and other applications. It can profile CPU\footnote{Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as Rust, Zig, C\#, OCaml, Odin, etc.}, GPU\footnote{All major graphics/compute APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL, CUDA, WebGPU.}, memory allocations, locks, context switches, automatically attribute screenshots to captured frames, and much more.

 While Tracy can perform statistical analysis of sampled call stack data, just like other \emph{statistical profilers} (such as VTune, perf, or Very Sleepy), it mainly focuses on manual markup of the source code. Such markup allows frame-by-frame inspection of the program execution. For example, you will be able to see exactly which functions are called, how much time they require, and how they interact with each other in a multi-threaded environment. In contrast, the statistical analysis may show you the hot spots in your code, but it cannot accurately pinpoint the underlying cause for semi-random frame stutter that may occur every couple of seconds.

@@ -228,7 +228,7 @@ Tracy aims to give you an understanding of the inner workings of a tight loop of

 \subsection{Sampling profiler}

-Tracy can periodically sample what the profiled application is doing, which provides detailed performance information at the source line/assembly instruction level. This can give you a deep understanding of how the processor executes the program. Using this information, you can get a coarse view at the call stacks, fine-tune your algorithms, or even 'steal' an optimization performed by one compiler and make it available for the others.
+Tracy can periodically sample what the profiled application is doing, which provides detailed performance information at the source line/assembly instruction level. This can give you a deep understanding of how the processor executes the program. Using this information, you can get a coarse view of the call stacks, fine-tune your algorithms, or even \enquote{steal} an optimization performed by one compiler and make it available for the others.

 On some platforms, it is possible to sample the hardware performance counters, which will give you information not only \emph{where} your program is running slowly, but also \emph{why}.

@@ -369,7 +369,7 @@ Note that these binary releases require AVX2 instruction set support on the proc
 Tracy Profiler supports  MSVC, GCC, and clang. You will need to use a reasonably recent version of the compiler due to the C++11 requirement. The following platforms are confirmed to be working (this is not a complete list):

 \begin{itemize}
-\item Windows (x86, x64, ARM64\footnote{Requires \textbf{"OpenCL, OpenGL, and Vulkan Compatibility Pack"} from Microsoft Store.})
+\item Windows (x86, x64, ARM64\footnote{Requires \textbf{\enquote{OpenCL, OpenGL, and Vulkan Compatibility Pack}} from Microsoft Store.})
 \item Linux (x86, x64, ARM, ARM64)
 \item Android (ARM, ARM64, x86)
 \item FreeBSD (x64)
@@ -594,7 +594,7 @@ In the case of some programming environments, you may need to take extra steps t

 If you are using MSVC, you will need to disable the \emph{Edit And Continue} feature, as it makes the compiler non-conformant to some aspects of the C++ standard. In order to do so, open the project properties and go to \menu[,]{C/C++,General,Debug Information Format} and make sure \emph{Program Database for Edit And Continue (/ZI)} is \emph{not} selected.

-For context, if you experience errors like "error C2131: expression did not evaluate to a constant", "failure was caused by non-constant arguments or reference to a non-constant symbol", and "see usage of '\texttt{\_\_LINE\_\_Var}'", chances are that your project has the \emph{Edit And Continue} feature enabled.
+For context, if you experience errors like \enquote{error C2131: expression did not evaluate to a constant}, \enquote{failure was caused by non-constant arguments or reference to a non-constant symbol}, and \enquote{see usage of \enquote{\texttt{\_\_LINE\_\_Var}}}, chances are that your project has the \emph{Edit And Continue} feature enabled.

 \paragraph{Universal Windows Platform}

@@ -778,7 +778,7 @@ Nevertheless, let's look at how we can try to stabilize the profiling data.

 Also known as: the \emph{spectre} thing we have to deal with now.

-You must be aware that most processors available on the market\footnote{Except low-cost ARM CPUs.} \emph{do not} execute machine code linearly, as laid out in the source code. This can lead to counterintuitive timing results reported by Tracy. Trying to get more 'reliable' readings\footnote{And by saying 'reliable,' you do in reality mean: behaving in a way you expect it.} would require a change in the behavior of the code, and this is not a thing a profiler should do. So instead, Tracy shows you what the hardware is \emph{really} doing.
+You must be aware that most processors available on the market\footnote{Except low-cost ARM CPUs.} \emph{do not} execute machine code linearly, as laid out in the source code. This can lead to counterintuitive timing results reported by Tracy. Trying to get more \enquote{reliable} readings\footnote{And by saying \enquote{reliable,} you really mean: behaving the way you expect.} would require a change in the behavior of the code, and this is not a thing a profiler should do. So instead, Tracy shows you what the hardware is \emph{really} doing.

 This is a complex subject, and the details vary from one CPU to another. You can read a brief rundown of the topic at the following address: \url{https://travisdowns.github.io/blog/2019/06/11/speed-limits.html}.

@@ -805,7 +805,7 @@ While the CPU is more-or-less designed always to be able to work at the advertis
 \item Do you have complete control over the power profile? Spoiler alert: no. The operating system may run anything at any time on any of the other cores, which will impact the turbo frequency you're able to achieve.
 \end{itemize}

-As you can see, this feature basically screams 'unreliable results!' Best keep it disabled and run at the base frequency. Otherwise, your timings won't make much sense. A true example: branchless compression function executing multiple times with the same input data was measured executing at \emph{four} different speeds.
+As you can see, this feature basically screams \enquote{unreliable results!} Best keep it disabled and run at the base frequency. Otherwise, your timings won't make much sense. A true example: a branchless compression function executed multiple times with the same input data was measured running at \emph{four} different speeds.

 Keep in mind that even at the base frequency, you may hit the thermal limits of the silicon and be down throttled.

@@ -940,7 +940,7 @@ Please don't ask about window decorations in Gnome. The current behavior is the

 Special considerations must be taken to run the Tracy server/profiler GUI on Windows on ARM.

-Ensure that the \textbf{"OpenCL, OpenGL, and Vulkan Compatibility Pack"} is installed (from the Microsoft Store), otherwise the GUI will fail to open.
+Ensure that the \textbf{\enquote{OpenCL, OpenGL, and Vulkan Compatibility Pack}} is installed (from the Microsoft Store), otherwise the GUI will fail to open.

 \subsubsection{Using an IDE}

@@ -955,7 +955,7 @@ The CMake build configuration will begin immediately. It is likely that you will
 After the build configuration phase is over, you may want to make some further adjustments to what is being built. The primary place to do this is in the \emph{Project Status} section of the CMake side panel. The two key settings there are also available in the status bar at the bottom of the window:

 \begin{itemize}
-\item The \emph{Folder} setting allows you to choose which Tracy utility you want to work with. Select "profiler" for the profiler's GUI.
+\item The \emph{Folder} setting allows you to choose which Tracy utility you want to work with. Select \enquote{profiler} for the profiler's GUI.
 \item The \emph{Build variant} setting is used to toggle between the debug and release build configurations.
 \end{itemize}

@@ -1016,7 +1016,7 @@ void Graphics::Render()
 \subsection{Crash handling}
 \label{crashhandling}

-On selected platforms (see section~\ref{featurematrix}) Tracy will intercept application crashes\footnote{For example, invalid memory accesses ('segmentation faults', 'null pointer exceptions'), divisions by zero, etc.}. This serves two purposes. First, the client application will be able to send the remaining profiling data to the server. Second, the server will receive a crash report with the crash reason, call stack at the time of the crash, etc.
+On selected platforms (see section~\ref{featurematrix}) Tracy will intercept application crashes\footnote{For example, invalid memory accesses (\enquote{segmentation faults}, \enquote{null pointer exceptions}), divisions by zero, etc.}. This serves two purposes. First, the client application will be able to send the remaining profiling data to the server. Second, the server will receive a crash report with the crash reason, call stack at the time of the crash, etc.

 This is an automatic process, and it doesn't require user interaction. If you are experiencing issues with crash handling you may want to try defining the \texttt{TRACY\_NO\_CRASH\_HANDLER} macro to disable the built in crash handling.

@@ -1050,6 +1050,8 @@ Memory & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faXm
 GPU zones (OpenGL) & \faCheck & \faCheck & \faCheck & \faPoo & \faPoo & & \faXmark \\
 GPU zones (Vulkan) & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & & \faXmark \\
 GPU zones (Metal) & \faXmark & \faXmark & \faXmark & \faCheck\textsuperscript{\emph{b}} & \faCheck\textsuperscript{\emph{b}} & \faXmark & \faXmark \\
+GPU zones (CUDA) & \faCheck & \faCheck & \faXmark & \faXmark & \faXmark & \faQuestion & \faXmark \\
+GPU zones (WebGPU) & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faQuestion & \faQuestion \\
 Call stacks & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faXmark \\
 Symbol resolution & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck & \faCheck \\
 Crash handling & \faCheck & \faCheck & \faCheck & \faXmark & \faXmark & \faXmark & \faXmark \\
@@ -1108,7 +1110,7 @@ FrameMarkStart("Audio processing");
 FrameMarkEnd("Audio processing");
 \end{lstlisting}

-Here, we pass two string literals with identical contents to two different macros. It is entirely up to the compiler to decide if it will pool these two strings into one pointer or if there will be two instances present in the executable image\footnote{\cite{ISO:2012:III} \S 2.14.5.12: "Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined."}. For example, on MSVC, this is controlled by \menu[,]{Configuration Properties,C/C++,Code Generation,Enable String Pooling} option in the project properties (optimized builds enable it automatically). Note that even if string pooling is used on the compilation unit level, it is still up to the linker to implement pooling across object files.
+Here, we pass two string literals with identical contents to two different macros. It is entirely up to the compiler to decide if it will pool these two strings into one pointer or if there will be two instances present in the executable image\footnote{\cite{ISO:2012:III} \S 2.14.5.12: \enquote{Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined.}}. For example, on MSVC, this is controlled by \menu[,]{Configuration Properties,C/C++,Code Generation,Enable String Pooling} option in the project properties (optimized builds enable it automatically). Note that even if string pooling is used on the compilation unit level, it is still up to the linker to implement pooling across object files.

 As you can see, making sure that string literals are properly pooled can be surprisingly tricky. To work around this problem, you may employ the following technique. In \emph{one} source file create the unique pointer for a string literal, for example:

@@ -1406,7 +1408,7 @@ It is valid to set the \texttt{Zone1} text or name \emph{only} in places \circle
 \subsubsection{Filtering zones}
 \label{filteringzones}

-Zone logging can be disabled on a per-zone basis by making use of the \texttt{ZoneNamed} macros. Each of the macros takes an \texttt{active} argument ('\texttt{true}' in the example in section~\ref{multizone}), which will determine whether the zone should be logged.
+Zone logging can be disabled on a per-zone basis by making use of the \texttt{ZoneNamed} macros. Each of the macros takes an \texttt{active} argument (\enquote{\texttt{true}} in the example in section~\ref{multizone}), which will determine whether the zone should be logged.

 Note that this parameter may be a run-time variable, such as a user-controlled switch to enable profiling of a specific part of code only when required.

@@ -1558,14 +1560,24 @@ Fast navigation in large data sets and correlating zones with what was happening

 If you want to include color coding of the messages (for example to make critical messages easily visible), you can use \texttt{TracyMessageC(text, size, color)} or \texttt{TracyMessageLC(text, color)} macros.

-Messages can also have different severity levels: \texttt{Trace}, \texttt{Debug}, \texttt{Info}, \texttt{Warning}, \texttt{Error} or \texttt{Fatal}. 
+Messages can also have different severity levels:
+
+\begin{itemize}
+\item \emph{Trace} -- Broadly tracks variable states and events in the software program.
+\item \emph{Debug} -- Describes variable states and details about specific internal events in the software that are useful for investigations.
+\item \emph{Info} -- Describes normal events, which inform on the expected progress and state of your software.
+\item \emph{Warning} -- Describes potentially dangerous situations caused by unexpected events and states.
+\item \emph{Error} -- Describes the occurrence of unexpected behavior. Does not interrupt the execution of the software.
+\item \emph{Fatal} -- Describes a critical event that will lead to a software failure/crash.
+\end{itemize}
+
 The \texttt{TracyMessage} macros will log messages with the severity \texttt{Info}. To log a message with a different severity, you may use the \texttt{TracyLogString} macro that regroups all the functionalities from the previous macros. We recommend writing your own macros, wrapping the different severities for easier use. You may provide a color of 0 if you do not want to set a color for this message.

 Examples:
 \begin{lstlisting}
 std::string dynStr = "Trace using a dynamic string, blue color, no callstack";
 TracyLogString( tracy::MessageSeverity::Trace, 0xFF, 0, dynStr.size(), dynStr.c_str() );
-TracyLogString( tracy::MessageSeverity::Warning, 0, TRACY_CALLSTACK, "Warning using a string litteral, no color, capturing the callstack to a depth of TRACY_CALLSTACK" );
+TracyLogString( tracy::MessageSeverity::Warning, 0, TRACY_CALLSTACK, "Warning using a string literal, no color, capturing the callstack to a depth of TRACY_CALLSTACK" );
 \end{lstlisting}


@@ -1607,8 +1619,6 @@ void operator delete(void* ptr) noexcept
 }
 \end{lstlisting}

-In some rare cases (e.g., destruction of TLS block), events may be reported after the profiler is no longer available, which would lead to a crash. To work around this issue, you may use \texttt{TracySecureAlloc} and \texttt{TracySecureFree} variants of the macros.
-
 \begin{bclogo}[
 noborder=true,
 couleur=black!5,
@@ -1642,10 +1652,12 @@ Sometimes an application will use more than one memory pool. For example, in add

 To mark that a separate memory pool is to be tracked you should use the named version of memory macros, for example \texttt{TracyAllocN(ptr, size, name)} and \texttt{TracyFreeN(ptr, name)}, where \texttt{name} is an unique pointer to a string literal (section~\ref{uniquepointers}) identifying the memory pool.

+Certain memory allocator designs (\enquote{arena allocators}) use an always-incrementing pointer to track the next region to allocate and do not support deallocation of individual objects. The only way to free memory with such an allocator is to simultaneously release all the objects that were allocated (reset the allocator state). You can mark such a mass-deallocation event in a memory pool with the \texttt{TracyMemoryDiscard(name)} macro.
+
 \subsection{GPU profiling}
 \label{gpuprofiling}

-Tracy provides bindings for profiling OpenGL, Vulkan, Direct3D 11, Direct3D 12, Metal, OpenCL and CUDA execution time on GPU.
+Tracy provides bindings for profiling OpenGL, Vulkan, Direct3D 11, Direct3D 12, Metal, OpenCL, CUDA and WebGPU execution time on GPU.

 Note that the CPU and GPU timers may be unsynchronized unless you create a calibrated context, but the availability of calibrated contexts is limited. You can try to correct the desynchronization of uncalibrated contexts in the profiler's options (section~\ref{options}).

@@ -1791,6 +1803,16 @@ Unlike other GPU backends in Tracy, there is no need to call \texttt{TracyCUDACo

 To stop profiling, call the \texttt{TracyCUDAStopProfiling(ctx)} macro.

+\subsubsection{WebGPU}
+
+WebGPU support is enabled by including the \texttt{public/tracy/TracyWebGPU.hpp} header file. Both major implementations of WebGPU (Dawn and wgpu-native) are supported.
+
+Before creating the WebGPU device, make sure to call \texttt{TracyWebGPUSetupDeviceDescriptor()} to let Tracy request the device features and extensions necessary for profiling. After the device is created, use the \texttt{TracyWebGPUContext()} macro to instantiate the \texttt{WebGPUQueueCtx} object required for GPU instrumentation. The object should later be cleaned up with the \texttt{TracyWebGPUDestroy()} macro. To set a custom name for the context, use the \texttt{TracyWebGPUContextName()} macro.
+
+To instrument a GPU zone, use the various \texttt{TracyWebGPU*Zone*()} macros. Note that WebGPU only offers command instrumentation at the \enquote{pass}-level. While command-level granularity is possible through implementation-specific WebGPU extensions, Tracy does not support it at the moment. Supply the corresponding WebGPU pass descriptor to the instrumentation macro \emph{before} creating the WebGPU pass encoder.
+
+You are required to periodically collect the GPU events using the \texttt{TracyWebGPUCollect()} macro. Good places for collection are: after synchronous waits, after event processing (\texttt{wgpuInstanceProcessEvents}), after present drawable calls (\texttt{wgpuSurfacePresent}), and inside the completion callback of command queues (\texttt{wgpuQueueOnSubmittedWorkDone}).
+
 \subsubsection{ROCm}

 On Linux, if rocprofiler-sdk is installed, tracy can automatically trace GPU dispatches and collect
@@ -1824,13 +1846,13 @@ sudo amd-smi set -g 0 -l stable_std

 Putting more than one GPU zone macro in a single scope features the same issue as with the \texttt{ZoneScoped} macros, described in section~\ref{multizone} (but this time the variable name is \texttt{\_\_\_tracy\_gpu\_zone}).

-To solve this problem, in case of OpenGL use the \texttt{TracyGpuNamedZone} macro in place of \texttt{TracyGpuZone} (or the color variant). The same applies to Vulkan, Direct3D 11/12 and Metal -- replace \texttt{TracyVkZone} with \texttt{TracyVkNamedZone}, \texttt{TracyD3D11Zone}/\texttt{TracyD3D12Zone} with \texttt{TracyD3D11NamedZone}/\texttt{TracyD3D12NamedZone}, and \texttt{TracyMetalZone} with \texttt{TracyMetalNamedZone}.
+To solve this problem, in case of OpenGL use the \texttt{TracyGpuNamedZone} macro in place of \texttt{TracyGpuZone} (or the color variant). The same applies to Vulkan, Direct3D 11/12, Metal and WebGPU -- replace \texttt{TracyVkZone} with \texttt{TracyVkNamedZone}, \texttt{TracyD3D11Zone}/\texttt{TracyD3D12Zone} with \texttt{TracyD3D11NamedZone}/\texttt{TracyD3D12NamedZone}, \texttt{TracyMetalZone} with \texttt{TracyMetalNamedZone}, and \texttt{TracyWebGPUZone} with \texttt{TracyWebGPUNamedZone}.

 Remember to provide your name for the created stack variable as the first parameter to the macros.

 \subsubsection{Transient GPU zones}

-Transient zones (see section~\ref{transientzones} for details) are available in OpenGL, Vulkan, and Direct3D 11/12 macros. Transient zones are not available for Metal at this moment.
+Transient zones (see section~\ref{transientzones} for details) are available in OpenGL, Vulkan, Direct3D 11/12 and WebGPU macros. Transient zones are not available for Metal at this moment.

 \subsection{Fibers}
 \label{fibers}
@@ -1877,7 +1899,7 @@ As you can see, there are two threads, \texttt{t1} and \texttt{t2}, which are si
 \subsection{Collecting call stacks}
 \label{collectingcallstacks}

-Capture of true calls stacks can be performed by using macros with the \texttt{S} postfix, which require an additional parameter, specifying the depth of call stack to be captured. The greater the depth, the longer it will take to perform capture. Currently you can use the following macros: \texttt{ZoneScopedS}, \texttt{ZoneScopedNS}, \texttt{ZoneScopedCS}, \texttt{ZoneScopedNCS}, \texttt{TracyAllocS}, \texttt{TracyFreeS}, \texttt{TracySecureAllocS}, \texttt{TracySecureFreeS}, \texttt{TracyMessageS}, \texttt{TracyMessageLS}, \texttt{TracyMessageCS}, \texttt{TracyMessageLCS}, \texttt{TracyGpuZoneS}, \texttt{TracyGpuZoneCS}, \texttt{TracyVkZoneS}, \texttt{TracyVkZoneCS}, and the named and transient variants.
+Capture of true call stacks can be performed by using macros with the \texttt{S} postfix, which require an additional parameter, specifying the depth of call stack to be captured. The greater the depth, the longer it will take to perform capture. Currently you can use the following macros: \texttt{ZoneScopedS}, \texttt{ZoneScopedNS}, \texttt{ZoneScopedCS}, \texttt{ZoneScopedNCS}, \texttt{TracyAllocS}, \texttt{TracyFreeS}, \texttt{TracyMessageS}, \texttt{TracyMessageLS}, \texttt{TracyMessageCS}, \texttt{TracyMessageLCS}, \texttt{TracyGpuZoneS}, \texttt{TracyGpuZoneCS}, \texttt{TracyVkZoneS}, \texttt{TracyVkZoneCS}, and the named and transient variants.

 Be aware that call stack collection is a relatively slow operation. Table~\ref{CallstackTimes} and figure~\ref{CallstackPlot} show how long it took to perform a single capture of varying depth on multiple CPU architectures.

@@ -2023,7 +2045,7 @@ void DbgHelpUnlock() { ReleaseMutex(dbgHelpLock); }
 }
 \end{lstlisting}

-At initilization time, tracy will attempt to preload symbols for device drivers and process modules. As this process can be slow when a lot of pdbs are involved, you can set the \texttt{TRACY\_NO\_DBGHELP\_INIT\_LOAD} environment variable to "1" to disable this behavior and rely on-demand symbol loading.
+At initialization time, Tracy will attempt to preload symbols for device drivers and process modules. As this process can be slow when a lot of PDBs are involved, you can set the \texttt{TRACY\_NO\_DBGHELP\_INIT\_LOAD} environment variable to \enquote{1} to disable this behavior and rely on on-demand symbol loading.

 \paragraph{Disabling resolution of inline frames}

@@ -2306,8 +2328,6 @@ Use the following macros in your implementations of \texttt{malloc} and \texttt{
 \begin{itemize}
 \item \texttt{TracyCAlloc(ptr, size)}
 \item \texttt{TracyCFree(ptr)}
-\item \texttt{TracyCSecureAlloc(ptr, size)}
-\item \texttt{TracyCSecureFree(ptr)}
 \end{itemize}

 Correctly using this functionality can be pretty tricky. You also will need to handle all the memory allocations made by external libraries (which typically allow usage of custom memory allocation functions) and the allocations made by system functions. If you can't track such an allocation, you will need to make sure freeing is not reported\footnote{It's not uncommon to see a pattern where a system function returns some allocated memory, which you then need to release.}.
@@ -2369,7 +2389,7 @@ To see how you should use this API, you should look at the reference implementat
 	couleur=black!5,
 	logo=\bcbombe
 	]{Important}
-A common mistake is to skip the zone "\texttt{isActive}" check. When using \texttt{TRACY\_ON\_DEMAND}, you need to read the value of \texttt{TracyCIsConnected} once, and check the same value for both \newline \texttt{\_\_\_tracy\_emit\_gpu\_zone\_begin\_alloc} and \texttt{\_\_\_tracy\_emit\_gpu\_zone\_end}. Tracy may otherwise receive a zone end without a zone begin.
+A common mistake is to skip the zone \enquote{\texttt{isActive}} check. When using \texttt{TRACY\_ON\_DEMAND}, you need to read the value of \texttt{TracyCIsConnected} once, and check the same value for both \newline \texttt{\_\_\_tracy\_emit\_gpu\_zone\_begin\_alloc} and \texttt{\_\_\_tracy\_emit\_gpu\_zone\_end}. Tracy may otherwise receive a zone end without a zone begin.
 \end{bclogo}

 \subsubsection{Fibers}
@@ -2867,8 +2887,8 @@ logo=\bclampe
 Use the following calls in your implementations of allocator/deallocator:

 \begin{itemize}
-\item \texttt{tracy\_memory\_alloc(ptr, size, name, depth, secure)}
-\item \texttt{tracy\_memory\_free(ptr, name, depth, secure)}
+\item \texttt{tracy\_memory\_alloc(ptr, size, name, depth)}
+\item \texttt{tracy\_memory\_free(ptr, name, depth)}
 \end{itemize}

 Correctly using this functionality can be pretty tricky especially in Fortran.
@@ -2924,7 +2944,7 @@ Tracy will perform an automatic collection of system data without user intervent

 Some profiling data can only be retrieved using the kernel facilities, which are not available to users with normal privilege level. To collect such data, you will need to elevate your rights to the administrator level. You can do so either by running the profiled program from the \texttt{root} account on Unix or through the \emph{Run as administrator} option on Windows\footnote{To make this easier, you can run MSVC with admin privileges, which will be inherited by your program when you start it from within the IDE.}. On Android, you will need to have a rooted device (see section~\ref{androidlunacy} for additional information).

-As this system-level tracing functionality is part of the automated collection process, no user intervention is necessary to enable it (assuming that the program was granted the rights needed). However, if, for some reason, you would want to prevent your application from trying to access kernel data, you may recompile your program with the \texttt{TRACY\_NO\_SYSTEM\_TRACING} define. If you want to disable this functionality dynamically at runtime instead, you can set the \texttt{TRACY\_NO\_SYSTEM\_TRACING} environment variable to "1".
+As this system-level tracing functionality is part of the automated collection process, no user intervention is necessary to enable it (assuming that the program was granted the rights needed). However, if, for some reason, you would want to prevent your application from trying to access kernel data, you may recompile your program with the \texttt{TRACY\_NO\_SYSTEM\_TRACING} define. If you want to disable this functionality dynamically at runtime instead, you can set the \texttt{TRACY\_NO\_SYSTEM\_TRACING} environment variable to \enquote{1}.

 \begin{bclogo}[
 noborder=true,
@@ -3076,7 +3096,7 @@ On Linux, Tracy will override the \texttt{dlclose} function call to prevent shar

 \subsubsection{Vertical synchronization}

-On Windows and Linux, Tracy will automatically capture hardware Vsync events, provided that the application has access to the kernel data (privilege elevation may be needed, see section~\ref{privilegeelevation}). These events will be reported as '\texttt{[x] Vsync}' frame sets, where \texttt{x} is the identifier of a specific monitor. Note that hardware vertical synchronization might not correspond to the one seen by your application due to desktop composition, command queue buffering, and so on. Also, in some instances, when there is nothing to update on the screen, the graphic driver may choose to stop issuing screen refresh. As a result, there may be periods where no vertical synchronization events are reported.
+On Windows and Linux, Tracy will automatically capture hardware Vsync events, provided that the application has access to the kernel data (privilege elevation may be needed, see section~\ref{privilegeelevation}). These events will be reported as \enquote{\texttt{[x] Vsync}} frame sets, where \texttt{x} is the identifier of a specific monitor. Note that hardware vertical synchronization might not correspond to the one seen by your application due to desktop composition, command queue buffering, and so on. Also, in some instances, when there is nothing to update on the screen, the graphic driver may choose to stop issuing screen refresh. As a result, there may be periods where no vertical synchronization events are reported.

 Use the \texttt{TRACY\_NO\_VSYNC\_CAPTURE} macro to disable capture of Vsync events.

@@ -3230,7 +3250,7 @@ If you want to look at the profile data in real-time (or load a saved trace file

 The \emph{\faWrench{}~Wrench} button opens the about dialog, which also contains a number of global settings you may want to tweak (section~\ref{aboutwindow}).

-The client \emph{address entry} field and the \faWifi{}~\emph{Connect} button are used to connect to a running client\footnote{Note that a custom port may be provided here, for example by entering '127.0.0.1:1234'.}. You can use the connection history button~\faCaretDown{} to display a list of commonly used targets, from which you can quickly select an address. You can remove entries from this list by hovering the \faArrowPointer{}~mouse cursor over an entry and pressing the \keys{\del} button on the keyboard.
+The client \emph{address entry} field and the \faWifi{}~\emph{Connect} button are used to connect to a running client\footnote{Note that a custom port may be provided here, for example by entering \enquote{127.0.0.1:1234}.}. You can use the connection history button~\faCaretDown{} to display a list of commonly used targets, from which you can quickly select an address. You can remove entries from this list by hovering the \faArrowPointer{}~mouse cursor over an entry and pressing the \keys{\del} button on the keyboard.

 If you want to open a trace that you have stored on the disk, you can do so by pressing the \faFolderOpen{}~\emph{Open saved trace} button.

@@ -3877,10 +3897,10 @@ You will find the zones with locks and their associated threads on this combined
 The left-hand side \emph{index area} of the timeline view displays various labels (threads, locks), which can be categorized in the following way:

 \begin{itemize}
-\item \emph{Light blue label} -- GPU context. Multi-threaded Vulkan, OpenCL, Direct3D 12 and Metal contexts are additionally split into separate threads.
+\item \emph{Light blue label} -- GPU context. Multi-threaded Vulkan, OpenCL, Direct3D 12, Metal and WebGPU contexts are additionally split into separate threads.
 \item \emph{Pink label} -- CPU data graph.
-\item \emph{White label} -- A CPU thread. It will be replaced by a bright red label in a thread that has crashed (section~\ref{crashhandling}). If automated sampling was performed, clicking the~\LMB{}~left mouse button on the \emph{\faGhost{}~ghost zones} button will switch zone display mode between 'instrumented' and 'ghost.'
-\item \emph{Green label} -- Fiber, coroutine, or any other sort of cooperative multitasking 'green thread.'
+\item \emph{White label} -- A CPU thread. It will be replaced by a bright red label in a thread that has crashed (section~\ref{crashhandling}). If automated sampling was performed, clicking the~\LMB{}~left mouse button on the \emph{\faGhost{}~ghost zones} button will switch zone display mode between \enquote{instrumented} and \enquote{ghost.}
+\item \emph{Green label} -- Fiber, coroutine, or any other sort of cooperative multitasking \enquote{green thread.}
 \item \emph{Light red label} -- Indicates a lock.
 \item \emph{Yellow label} -- Plot.
 \end{itemize}
@@ -3899,7 +3919,7 @@ In an example in figure~\ref{zoneslocks} you can see that there are two threads:

 Meanwhile, the \emph{Streaming thread} is performing some \emph{Streaming jobs}. The first \emph{Streaming job} sent a message (section~\ref{messagelog}). In addition to being listed in the message log, it is indicated by a triangle over the thread separator. When multiple messages are in one place, the triangle outline shape changes to a filled triangle.

-The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/Metal/OpenCL context in place of a thread name.
+The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/Metal/OpenCL/CUDA/WebGPU context in place of a thread name.

 Hovering the \faArrowPointer{} mouse pointer over a zone will highlight all other zones that have the exact source location with a white outline. Clicking the \LMB{}~left mouse button on a zone will open the zone information window (section~\ref{zoneinfo}). Holding the \keys{\ctrl} key and clicking the \LMB{}~left mouse button on a zone will open the zone statistics window (section~\ref{findzone}). Clicking the \MMB{}~middle mouse button on a zone will zoom the view to the extent of the zone.

@@ -4108,7 +4128,7 @@ In this window, you can set various trace-related options. For example, the time
 \begin{itemize}
 \item \emph{\faSignature{} Draw CPU usage graph} -- You can disable drawing of the CPU usage graph here.
 \end{itemize}
-\item \emph{\faEye{} Draw GPU zones} -- Allows disabling display of OpenGL/Vulkan/Metal/Direct3D/OpenCL zones. The \emph{GPU zones} drop-down allows disabling individual GPU contexts and setting CPU/GPU drift offsets of uncalibrated contexts (see section~\ref{gpuprofiling} for more information). The \emph{\faRobot~Auto} button automatically measures the GPU drift value\footnote{There is an assumption that drift is linear. Automated measurement calculates and removes change over time in delay-to-execution of GPU zones. Resulting value may still be incorrect.}.
+\item \emph{\faEye{} Draw GPU zones} -- Allows disabling display of OpenGL/Vulkan/Metal/Direct3D/OpenCL/CUDA/WebGPU zones. The \emph{GPU zones} drop-down allows disabling individual GPU contexts and setting CPU/GPU drift offsets of uncalibrated contexts (see section~\ref{gpuprofiling} for more information). The \emph{\faRobot~Auto} button automatically measures the GPU drift value\footnote{There is an assumption that drift is linear. Automated measurement calculates and removes change over time in delay-to-execution of GPU zones. Resulting value may still be incorrect.}.
 \item \emph{\faMicrochip{} Draw CPU zones} -- Determines whether CPU zones are displayed.
 \begin{itemize}
 \item \emph{\faGhost{} Draw ghost zones} -- Controls if ghost zones should be displayed in threads which don't have any instrumented zones available.
@@ -4158,7 +4178,7 @@ You can filter the message list in the following ways:

 \begin{itemize}
 \item By the originating thread in the \emph{\faShuffle{} Visible threads} drop-down.
-\item By matching the message text to the expression in the \emph{\faFilter{}~Filter messages} entry field. Multiple filter expressions can be comma-separated (e.g. 'warn, info' will match messages containing strings 'warn' \emph{or} 'info'). You can exclude matches by preceding the term with a minus character (e.g., '-debug' will hide all messages containing the string 'debug').
+\item By matching the message text to the expression in the \emph{\faFilter{}~Filter messages} entry field. Multiple filter expressions can be comma-separated (e.g., \enquote{warn, info} will match messages containing the strings \enquote{warn} \emph{or} \enquote{info}). You can exclude matches by preceding the term with a minus character (e.g., \enquote{-debug} will hide all messages containing the string \enquote{debug}).
 \item By message source, distinguishing between \emph{\faUser{}~User} messages and internal \emph{\faMicroscope{}~Tracy} diagnostics.
 \item By severity level: \emph{\faShoePrints{}~Trace}, \emph{\faBug{}~Debug}, \emph{\faInfo{}~Info}, \emph{\faTriangleExclamation{}~Warning}, \emph{\faCircleXmark{}~Error}, or \emph{\faSkullCrossbones{}~Fatal}.
 \end{itemize}
@@ -4623,7 +4643,7 @@ The zone information window displays detailed information about a single zone. T
 \begin{itemize}
 \item Basic source location information: function name, source file location, and the thread name.
 \item Timing information.
-\item If the profiler performed context switch capture (section~\ref{contextswitches}) and a thread was suspended during zone execution, a list of wait regions will be displayed, with complete information about the timing, CPU migrations, and wait reasons. If CPU topology data is available (section~\ref{cputopology}), the profiler will mark zone migrations across cores with 'C' and migrations across packages -- with 'P.' In some cases, context switch data might be incomplete\footnote{For example, when capture is ongoing and context switch information has not yet been received.}, in which case a warning message will be displayed.
+\item If the profiler performed context switch capture (section~\ref{contextswitches}) and a thread was suspended during zone execution, a list of wait regions will be displayed, with complete information about the timing, CPU migrations, and wait reasons. If CPU topology data is available (section~\ref{cputopology}), the profiler will mark zone migrations across cores with \enquote{C} and migrations across packages -- with \enquote{P}. In some cases, context switch data might be incomplete\footnote{For example, when capture is ongoing and context switch information has not yet been received.}, in which case a warning message will be displayed.
 \item Memory events list, both summarized and a list of individual allocation/free events (see section~\ref{memorywindow} for more information on the memory events list).
 \item List of messages that the profiler logged in the zone's scope. If the \emph{exclude children} option is disabled, messages emitted in child zones will also be included.
 \item Parent zones list, showing the hierarchy of parent zones that contain the current zone. Hovering the \faArrowPointer{}~mouse pointer over a parent zone will highlight it on the timeline view with a red outline. Clicking the \LMB{}~left mouse button on a zone will switch the zone info window to that zone. Clicking the \MMB{}~middle mouse button on a zone will zoom the timeline view to the zone's extent. Clicking the \RMB{}~right mouse button on a source file location will open the source file view window (if applicable, see section~\ref{sourceview}).
@@ -4660,7 +4680,7 @@ Clicking on the \emph{\faClipboard{}~Copy to clipboard} buttons will copy the ap

 This window shows the frames contained in the selected call stack. Information about the originating thread is included. Each frame is described by a function name, source file location, and originating image\footnote{Executable images are called \emph{modules} by Microsoft.} name. Function frames originating from the kernel are marked with a red color. Clicking the \LMB{}~left mouse button on either the function name of source file location will copy the name to the clipboard. Clicking the \RMB{}~right mouse button on the source file location will open the source file view window (if applicable, see section~\ref{sourceview}).

-A single stack frame may have multiple function call places associated with it. This happens in the case of inlined function calls. Such entries will be displayed in the call stack window, with \emph{inline} in place of frame number\footnote{Or '\faCaretRight{}'~icon in case of call stack tooltips.}.
+A single stack frame may have multiple function call places associated with it. This happens in the case of inlined function calls. Such entries will be displayed in the call stack window, with \emph{inline} in place of frame number\footnote{Or \enquote{\faCaretRight{}}~icon in case of call stack tooltips.}.

 If the call stack shows a crash (see section~\ref{crashhandling}), a red \emph{\faSkull{}~Crash} label will be displayed. Clicking it will center the timeline on the crash. Note that the crash stack may contain OS or Tracy frames where the crash was intercepted and processed.

@@ -4673,7 +4693,7 @@ Stack frame location may be displayed in the following number of ways, depending
 \item \emph{Symbol address} -- displays begin address of the function containing the frame address.
 \end{itemize}

-In some cases, it may not be possible to decode stack frame addresses correctly. Such frames will be presented with a dimmed '\texttt{[ntdll.dll]}' name of the image containing the frame address, or simply '\texttt{[unknown]}' if the profiler cannot retrieve even this information. Additionally, '\texttt{[kernel]}' is used to indicate unknown stack frames within the operating system's internal routines.
+In some cases, it may not be possible to decode stack frame addresses correctly. Such frames will be presented with a dimmed \enquote{\texttt{[ntdll.dll]}} name of the image containing the frame address, or simply \enquote{\texttt{[unknown]}} if the profiler cannot retrieve even this information. Additionally, \enquote{\texttt{[kernel]}} is used to indicate unknown stack frames within the operating system's internal routines.

 External frames from system libraries are hidden by default. Enabling the \emph{\faShieldHalved{}~External} option will show these frames, which can be useful for debugging issues in external code. When external frames are displayed, they are dimmed out.

@@ -4761,7 +4781,7 @@ Some modes may be unavailable in some circumstances (missing or outdated source

 \paragraph{Source mode}

-This is pretty much the source file view window, but with the ability to select one of the source files that the compiler used to build the symbol. Additionally, each source file line that produced machine code in the symbol will show a count of associated assembly instructions, displayed with an '\texttt{@}' prefix, and will be marked with grey color on the scroll bar. Due to how optimizing compilers work, some lines may seemingly not produce any machine code, for example, because iterating a loop counter index might have been reduced to advancing a data pointer. Some other lines may have a disproportionate amount of associated instructions, e.g., when the compiler applied a loop unrolling optimization. This varies from case to case and from compiler to compiler.
+This is pretty much the source file view window, but with the ability to select one of the source files that the compiler used to build the symbol. Additionally, each source file line that produced machine code in the symbol will show a count of associated assembly instructions, displayed with an \enquote{\texttt{@}} prefix, and will be marked with grey color on the scroll bar. Due to how optimizing compilers work, some lines may seemingly not produce any machine code, for example, because iterating a loop counter index might have been reduced to advancing a data pointer. Some other lines may have a disproportionate amount of associated instructions, e.g., when the compiler applied a loop unrolling optimization. This varies from case to case and from compiler to compiler.

 The \emph{Propagate inlines} option, available when sample data is present, will enable propagation of the instruction costs down the local call stack. For example, suppose a base function in the symbol issues a call to an inlined function (which may not be readily visible due to being contained in another source file). In that case, any cost attributed to the inlined function will be visible in the base function. Because the cost information is added to all the entries in the local call stacks, it is possible to see seemingly nonsense total cost values when this feature is enabled. To quickly toggle this on or off, you may also press the \keys{X} key.

@@ -4779,7 +4799,7 @@ logo=\bclampe
 ]{Local call stack}
 In some cases, it may be challenging to understand what is being displayed in the disassembly. For example, calling the \texttt{std::lower\_bound} function may generate multiple levels of inlined functions: first, we enter the search algorithm, then the comparison functions, which in turn may be lambdas that call even more external code, and so on. In such an event, you will most likely see that some external code is taking a long time to execute, and you will be none the wiser on improving things.

-The local call stack for an assembly instruction represents all the inline function calls \emph{within the symbol} (hence the 'local' part), which were made to reach the instruction. Deeper inspection of the local call stack, including navigation to the source call site of each participating inline function, can be performed through the context menu accessible by pressing the \RMB{}~right mouse button on the source location.
+The local call stack for an assembly instruction represents all the inline function calls \emph{within the symbol} (hence the \enquote{local} part), which were made to reach the instruction. Deeper inspection of the local call stack, including navigation to the source call site of each participating inline function, can be performed through the context menu accessible by pressing the \RMB{}~right mouse button on the source location.
 \end{bclogo}

 Selecting the \emph{\faGears{}~Raw code} option will enable the display of raw machine code bytes for each line. Individual bytes are displayed with interwoven colors to make reading easier.
@@ -4830,9 +4850,9 @@ In this mode, the source and assembly panes will be displayed together, providin

 \paragraph{Instruction pointer cost statistics}

-If automated call stack sampling (see chapter~\ref{sampling}) was performed, additional profiling information will be available. The first column of source and assembly views will contain percentage counts of collected instruction pointer samples for each displayed line, both in numerical and graphical bar form. You can use this information to determine which function line takes the most time. The displayed percentage values are heat map color-coded, with the lowest values mapped to dark red and the highest to bright yellow. The color code will appear next to the percentage value and on the scroll bar so that you can identify 'hot' places in the code at a glance.
+If automated call stack sampling (see chapter~\ref{sampling}) was performed, additional profiling information will be available. The first column of source and assembly views will contain percentage counts of collected instruction pointer samples for each displayed line, both in numerical and graphical bar form. You can use this information to determine which function line takes the most time. The displayed percentage values are heat map color-coded, with the lowest values mapped to dark red and the highest to bright yellow. The color code will appear next to the percentage value and on the scroll bar so that you can identify \enquote{hot} places in the code at a glance.

-By default, samples are displayed only within the selected symbol, in isolation. In some cases, you may, however, want to include samples from functions that the selected symbol called. To do so, enable the \emph{\faRightFromBracket{}~Child calls} option, which you may also temporarily toggle by holding the \keys{Z} key. You can also click the~\faCaretDown{}~drop down control to display a child call distribution list\footnote{The height of the list can be changed by dragging the separator bar.}, which shows each known function\footnote{You should remember that these are results of random sampling. Some function calls may be missing here.} that the symbol called. Make sure to familiarize yourself with section~\ref{readingcallstacks} to be able to read the results correctly. Each child call on the list has an attributed time cost, which is also displayed as a percentage of the child calls ("\%~Calls") and the percentage of the total symbol time ("\%~Total").
+By default, samples are displayed only within the selected symbol, in isolation. In some cases, you may, however, want to include samples from functions that the selected symbol called. To do so, enable the \emph{\faRightFromBracket{}~Child calls} option, which you may also temporarily toggle by holding the \keys{Z} key. You can also click the~\faCaretDown{}~drop down control to display a child call distribution list\footnote{The height of the list can be changed by dragging the separator bar.}, which shows each known function\footnote{You should remember that these are results of random sampling. Some function calls may be missing here.} that the symbol called. Make sure to familiarize yourself with section~\ref{readingcallstacks} to be able to read the results correctly. Each child call on the list has an attributed time cost, which is also displayed as a percentage of the child calls (\enquote{\%~Calls}) and the percentage of the total symbol time (\enquote{\%~Total}).

 The total number of collected samples is displayed in the UI under the~\emph{\faEyeDropper~Samples} label and converted to a time approximation at the~\emph{\faStopwatch~Time} label. The displayed values show the local count if child calls are disabled and the total count if the option is enabled. In either case, the number of samples attributed only to the child calls is displayed in parentheses with the + or - symbol and as a percentage of the total symbol time.

@@ -5009,7 +5029,7 @@ There are no ideal LLM providers, but here are some options:
 \begin{itemize}
 \item \emph{llama.cpp} (\url{https://github.com/ggml-org/llama.cpp}) -- Recommended as the easiest to use. Clone from git and build it yourself. By default it fits the model automatically to available memory. It is rapidly advancing with new features and model support. Most other providers use it to do the actual work, and they typically use an outdated release. The \url{https://llama.app/} site might provide easy way to install llama.
 \item \emph{llama-swap} (\url{https://github.com/mostlygeek/llama-swap}) -- Wrapper for llama.cpp that allows model selection. Recommended to augment the above.
-\item \emph{LM Studio} (\url{https://lmstudio.ai/}) -- It is easy to install on all platforms and has a GUI. But it is overwhelming when it comes to the number of options it offers. Some people may question the licensing. Its features lag a behind llama.cpp. Manual configuration of each model is required. To get it to work properly, go to it settings (using the gear icon in the bottom right corner of the program window), then select the Developer tab and enable "When applicable, separate \texttt{reasoning\_content} and \texttt{content} in API responses".
+\item \emph{LM Studio} (\url{https://lmstudio.ai/}) -- It is easy to install on all platforms and has a GUI. But it is overwhelming when it comes to the number of options it offers. Some people may question the licensing. Its features lag behind llama.cpp. Manual configuration of each model is required. To get it to work properly, go to its settings (using the gear icon in the bottom right corner of the program window), then select the Developer tab and enable \enquote{When applicable, separate \texttt{reasoning\_content} and \texttt{content} in API responses}.
 \end{itemize}

 \subsection{Model selection}
@@ -5029,7 +5049,7 @@ noborder=true,
 couleur=black!5,
 logo=\bclampe
 ]{Model quantization}
-Running a model with full 32-bit floating-point weights is not feasible due to memory requirements. Instead, the model parameters are quantized, for which 4 bits is typically the sweet spot. In general, the lower the parameter precision, the more "dumbed down" the model becomes. However, the loss of model coherence due to quantization is less than the benefit of being able to run a larger model.
+Running a model with full 32-bit floating-point weights is not feasible due to memory requirements. Instead, the model parameters are quantized, for which 4 bits is typically the sweet spot. In general, the lower the parameter precision, the more \enquote{dumbed down} the model becomes. However, the loss of model coherence due to quantization is less than the benefit of being able to run a larger model.
 \end{bclogo}

 \begin{bclogo}[
@@ -5037,9 +5057,9 @@ noborder=true,
 couleur=black!5,
 logo=\bclampe
 ]{Model size}
-Another thing to consider when selecting a model is its size, which is typically measured in billions of parameters (weights) and written as 4B, for example. The model size determines how much memory, computation, and time are required to run it. Generally, the larger the model, the "smarter" its responses will be.
+Another thing to consider when selecting a model is its size, which is typically measured in billions of parameters (weights) and written as 4B, for example. The model size determines how much memory, computation, and time are required to run it. Generally, the larger the model, the \enquote{smarter} its responses will be.

-Most modern models will be "Mixture of Experts", or MoE, and their size will be denoted, for example, 35B-A3B. This means that the model size is 35B, but only 3B parameters are active and used to compute the next token. In practice, this means that the model has knowledge closer to the full, dense 35B model but speed and GPU memory requirements closer to the fast 3B model.
+Most modern models will be \enquote{Mixture of Experts}, or \enquote{MoE}, and their size will be denoted, for example, 35B-A3B. This means that the model size is 35B, but only 3B parameters are active and used to compute the next token. In practice, this means that the model has knowledge closer to the full, dense 35B model but speed and GPU memory requirements closer to the fast 3B model.
 \end{bclogo}

 \begin{bclogo}[
@@ -5047,7 +5067,7 @@ noborder=true,
 couleur=black!5,
 logo=\bclampe
 ]{Context size}
-The model size only indicates the minimum memory requirement. For the model to operate properly, you also need to set the context size, which determines how much information from the conversation the model can "remember". This size is measured in tokens, and a very rough approximation is that each token is a combination of three or four letters.
+The model size only indicates the minimum memory requirement. For the model to operate properly, you also need to set the context size, which determines how much information from the conversation the model can \enquote{remember}. This size is measured in tokens, and a very rough approximation is that each token is a combination of three or four letters.

 Each token present in the context window may require a fairly large amount of memory, and that can quickly add up to gigabytes. Some modern models use solutions that greatly reduce context memory requirements, but that varies from model to model. If needed, the KV cache used for context can be quantized, just like model parameters. In this case, the recommended size per weight is 8 bits.

@@ -5058,7 +5078,7 @@ The realistic minimum required context size for Tracy to run the assistant is 10

 Sometimes Tracy needs to do some language processing where speed is more important than the smarts. The default setting is to use the chat model with the reasoning disabled, which is fine for most applications.

-It may be more convenient to use a small, quick model instead, in which case enable the \emph{Fast model} checkbox and choose the second model. To save precious GPU resources for the chat model, you may want to keep this model entirely in system RAM (set \texttt{-ngl 0} for llama.cpp or set "GPU offload" to 0 in LM Studio) and disable the KV cache offload to GPU (set \texttt{-nkvo} for llama.cpp or disable "Offload KV Cache to GPU Memory" in LM Studio).
+It may be more convenient to use a small, quick model instead, in which case enable the \emph{Fast model} checkbox and choose the second model. To save precious GPU resources for the chat model, you may want to keep this model entirely in system RAM (set \texttt{-ngl 0} for llama.cpp or set \enquote{GPU offload} to 0 in LM Studio) and disable the KV cache offload to GPU (set \texttt{-nkvo} for llama.cpp or disable \enquote{Offload KV Cache to GPU Memory} in LM Studio).

 \subsubsection{Embedding model}

@@ -5119,7 +5139,7 @@ The horizontal meter directly below shows how much of the context size has been

 The chat section contains the conversation with the automated assistant with alternating user and assistant turns. Clicking on the~\emph{\faUser{}~User} role icon removes the chat content up to the selected question. Similarly, clicking on the~\emph{\faRobot{}~Assistant} role icon removes the conversation content up to this point and generates another response from the assistant.

-The assistant may give preliminary replies to the user, for example, \emph{"I will now check the source of function foobar"}, followed by performing the actual check, then a continuation of the reply, such as \emph{"Now I can see that..."}. To make reading these tiered replies easier, only the most recent reply is printed in normal text, while the preliminary responses are dimmed out.
+The assistant may give preliminary replies to the user, for example, \enquote{I will now check the source of function foobar}, followed by performing the actual check, then a continuation of the reply, such as \enquote{Now I can see that...}. To make reading these tiered replies easier, only the most recent reply is printed in normal text, while the preliminary responses are dimmed out.

 Each assistant reply contains a note about the language model that was used and the time it took to generate the text.

@@ -5187,8 +5207,8 @@ You can customize the output with the following command line options:
  \item \texttt{-h, -\hspace{-1.25ex} -help} -- Display a help message
  \item \texttt{-f, -\hspace{-1.25ex} -filter <name>} -- Filter the zone names
  \item \texttt{-c, -\hspace{-1.25ex} -case} -- Make the name filtering case sensitive
-  \item \texttt{-s, -\hspace{-1.25ex} -sep <separator>} -- Customize the CSV separator (default is ``\texttt{,}'')
-  \item \texttt{-e, -\hspace{-1.25ex} -self} -- Use self time (equivalent to the ``Self time'' toggle in the profiler GUI)
+  \item \texttt{-s, -\hspace{-1.25ex} -sep <separator>} -- Customize the CSV separator (default is \enquote{\texttt{,}})
+  \item \texttt{-e, -\hspace{-1.25ex} -self} -- Use self time (equivalent to the \enquote{Self time} toggle in the profiler GUI)
  \item \texttt{-u, -\hspace{-1.25ex} -unwrap} -- Report each zone individually; this will discard the statistics columns and instead report the timestamp and duration for each zone entry
  \item \texttt{-g, -\hspace{-1.25ex} -gpu} -- Report each GPU zone event
  \item \texttt{-m, -\hspace{-1.25ex} -messages} -- Report only messages
--- a/profiler/CMakeLists.txt
+++ b/profiler/CMakeLists.txt
@@ -44,10 +44,16 @@ ExternalProject_Add(embed
 )

 function(Embed LIST NAME FILE)
+    cmake_parse_arguments(EMBED "TEXT" "" "" ${ARGN})
+    if(EMBED_TEXT)
+        set(EMBED_FLAGS -t)
+    else()
+        set(EMBED_FLAGS)
+    endif()
    add_custom_command(
        OUTPUT data/${NAME}.cpp data/${NAME}.hpp
        COMMAND ${CMAKE_COMMAND} -E make_directory data
-        COMMAND ${CMAKE_CURRENT_BINARY_DIR}/embed ${NAME} ${CMAKE_CURRENT_LIST_DIR}/${FILE} data/${NAME}
+        COMMAND ${CMAKE_CURRENT_BINARY_DIR}/embed ${EMBED_FLAGS} ${NAME} ${CMAKE_CURRENT_LIST_DIR}/${FILE} data/${NAME}
        DEPENDS embed ${CMAKE_CURRENT_LIST_DIR}/${FILE}
    )
    list(APPEND ${LIST} data/${NAME}.cpp)
@@ -146,10 +152,10 @@ set(PROFILER_FILES
    src/winmainArchDiscovery.cpp
 )

-Embed(PROFILER_FILES SystemPrompt src/llm/system.prompt.md)
-Embed(PROFILER_FILES SkillCallstack src/llm/skill.callstack.md)
-Embed(PROFILER_FILES SkillOptimization src/llm/skill.optimization.md)
-Embed(PROFILER_FILES ToolsJson src/llm/tools.json)
+Embed(PROFILER_FILES SystemPrompt src/llm/system.prompt.md TEXT)
+Embed(PROFILER_FILES SkillCallstack src/llm/skill.callstack.md TEXT)
+Embed(PROFILER_FILES SkillOptimization src/llm/skill.optimization.md TEXT)
+Embed(PROFILER_FILES ToolsJson src/llm/tools.json TEXT)

 Embed(PROFILER_FILES FontFixed src/font/FiraCode-Retina.ttf)
 Embed(PROFILER_FILES FontIcons src/font/Font\ Awesome\ 7\ Free-Solid-900.otf)
@@ -159,20 +165,20 @@ Embed(PROFILER_FILES FontItalic src/font/Roboto-Italic.ttf)
 Embed(PROFILER_FILES FontBoldItalic src/font/Roboto-BoldItalic.ttf)
 Embed(PROFILER_FILES FontEmoji src/font/NotoEmoji-Regular.ttf)

-Embed(PROFILER_FILES Manual ../manual/tracy.md)
+Embed(PROFILER_FILES Manual ../manual/tracy.md TEXT)

-Embed(PROFILER_FILES Text100Million src/achievements/100Million.md)
-Embed(PROFILER_FILES TextConnectToClient src/achievements/ConnectToClient.md)
-Embed(PROFILER_FILES TextFindZone src/achievements/FindZone.md)
-Embed(PROFILER_FILES TextFrameImages src/achievements/FrameImages.md)
-Embed(PROFILER_FILES TextGlobalSettings src/achievements/GlobalSettings.md)
-Embed(PROFILER_FILES TextInstrumentationIntro src/achievements/InstrumentationIntro.md)
-Embed(PROFILER_FILES TextInstrumentationStatistics src/achievements/InstrumentationStatistics.md)
-Embed(PROFILER_FILES TextInstrumentFrames src/achievements/InstrumentFrames.md)
-Embed(PROFILER_FILES TextIntro src/achievements/Intro.md)
-Embed(PROFILER_FILES TextLoadTrace src/achievements/LoadTrace.md)
-Embed(PROFILER_FILES TextSamplingIntro src/achievements/SamplingIntro.md)
-Embed(PROFILER_FILES TextSaveTrace src/achievements/SaveTrace.md)
+Embed(PROFILER_FILES Text100Million src/achievements/100Million.md TEXT)
+Embed(PROFILER_FILES TextConnectToClient src/achievements/ConnectToClient.md TEXT)
+Embed(PROFILER_FILES TextFindZone src/achievements/FindZone.md TEXT)
+Embed(PROFILER_FILES TextFrameImages src/achievements/FrameImages.md TEXT)
+Embed(PROFILER_FILES TextGlobalSettings src/achievements/GlobalSettings.md TEXT)
+Embed(PROFILER_FILES TextInstrumentationIntro src/achievements/InstrumentationIntro.md TEXT)
+Embed(PROFILER_FILES TextInstrumentationStatistics src/achievements/InstrumentationStatistics.md TEXT)
+Embed(PROFILER_FILES TextInstrumentFrames src/achievements/InstrumentFrames.md TEXT)
+Embed(PROFILER_FILES TextIntro src/achievements/Intro.md TEXT)
+Embed(PROFILER_FILES TextLoadTrace src/achievements/LoadTrace.md TEXT)
+Embed(PROFILER_FILES TextSamplingIntro src/achievements/SamplingIntro.md TEXT)
+Embed(PROFILER_FILES TextSaveTrace src/achievements/SaveTrace.md TEXT)

 set(INCLUDES "${CMAKE_CURRENT_BINARY_DIR}")
 set(LIBS "")
@@ -294,7 +300,19 @@ if(NOT EMSCRIPTEN)
 endif()

 if(EMSCRIPTEN)
-    target_link_options(${PROJECT_NAME} PRIVATE -pthread -sASSERTIONS=0 -sINITIAL_MEMORY=384mb -sALLOW_MEMORY_GROWTH=1 -sMAXIMUM_MEMORY=4gb -sSTACK_SIZE=1048576 -sWASM_BIGINT=1 -sPTHREAD_POOL_SIZE=8 -sEXPORTED_FUNCTIONS=_main,_nativeOpenFile,_tracy_paste_clipboard -sEXPORTED_RUNTIME_METHODS=ccall -sENVIRONMENT=web,worker --preload-file embed.tracy)
+    target_link_options(${PROJECT_NAME} PRIVATE
+        -pthread
+        -sASSERTIONS=0
+        -sINITIAL_MEMORY=384mb
+        -sALLOW_MEMORY_GROWTH=1
+        -sMAXIMUM_MEMORY=4gb
+        -sSTACK_SIZE=1048576
+        -sPTHREAD_POOL_SIZE=8
+        -sEXPORTED_FUNCTIONS=_main,_nativeOpenFile,_tracy_paste_clipboard
+        -sEXPORTED_RUNTIME_METHODS=ccall
+        -sENVIRONMENT=web,worker
+        --preload-file embed.tracy
+    )

    file(DOWNLOAD https://share.nereid.pl/i/embed.tracy ${CMAKE_CURRENT_BINARY_DIR}/embed.tracy EXPECTED_MD5 ca0fa4f01e7b8ca5581daa16b16c768d)
    file(COPY ${CMAKE_CURRENT_LIST_DIR}/wasm/index.html DESTINATION ${CMAKE_CURRENT_BINARY_DIR})
--- a/profiler/helpers/embed.cpp
+++ b/profiler/helpers/embed.cpp
@@ -1,17 +1,27 @@
 #include <stdint.h>
 #include <stdio.h>
+#include <string.h>
 #include <string>

 #include "../../public/common/tracy_lz4hc.hpp"

 static void Usage()
 {
-    fprintf( stderr, "Usage: embed <objectName> <source> <destination>\n" );
+    fprintf( stderr, "Usage: embed [-t] <objectName> <source> <destination>\n" );
    fprintf( stderr, "  destination should be without extension, will create cpp, hpp pair\n" );
+    fprintf( stderr, "  -t: treat source as text, convert line endings to unix\n" );
 }

 int main( int argc, char** argv )
 {
+    bool text = false;
+    if( argc >= 2 && strcmp( argv[1], "-t" ) == 0 )
+    {
+        text = true;
+        argc--;
+        argv++;
+    }
+
    if( argc < 4 )
    {
        Usage();
@@ -38,6 +48,16 @@ int main( int argc, char** argv )
    fread( data, 1, sz, src );
    fclose( src );

+    if( text )
+    {
+        size_t pos = 0;
+        for( size_t i=0; i<sz; i++ )
+        {
+            if( data[i] != '\r' ) data[pos++] = data[i];
+        }
+        sz = pos;
+    }
+
    const auto lz4szMax = tracy::LZ4_compressBound( sz );
    auto lz4data = new uint8_t[lz4szMax];
    const auto lz4sz = tracy::LZ4_compress_HC( (const char*)data, (char*)lz4data, sz, lz4szMax, 6 );
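Note on the `-t` conversion above: the loop compacts the buffer in place, dropping every `\r` byte and then shrinking `sz` to the number of bytes kept. A minimal standalone sketch of the same technique follows. The helper name `StripCarriageReturns` is hypothetical and is not part of `embed.cpp`; it only isolates the loop so its behavior can be checked on its own.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper, not present in embed.cpp: removes every '\r' byte from
// the buffer in place and returns the new length. Bytes are only ever moved
// towards the start of the buffer (pos <= i), so no scratch buffer is needed.
// Note that a lone '\r' (not followed by '\n') is dropped as well.
static size_t StripCarriageReturns( char* data, size_t sz )
{
    size_t pos = 0;
    for( size_t i=0; i<sz; i++ )
    {
        if( data[i] != '\r' ) data[pos++] = data[i];
    }
    return pos;
}
```

For instance, calling the helper on a buffer holding `"a\r\nb\r\nc"` (7 bytes) leaves `"a\nb\nc"` in the first 5 bytes and returns 5. Because the stripping happens before `LZ4_compressBound` is called, the embedded payload should be the same whether the source file was checked out with CRLF or LF line endings.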
--- a/profiler/src/BackendEmscripten.cpp
+++ b/profiler/src/BackendEmscripten.cpp
@@ -162,6 +162,15 @@ static ImGuiKey TranslateKeyCode( const char* code )
    return ImGuiKey_None;
 }

+static void UpdateKeyModifiers( const EmscriptenKeyboardEvent* e )
+{
+    ImGuiIO& io = ImGui::GetIO();
+    io.AddKeyEvent( ImGuiMod_Ctrl, e->ctrlKey );
+    io.AddKeyEvent( ImGuiMod_Shift, e->shiftKey );
+    io.AddKeyEvent( ImGuiMod_Alt, e->altKey );
+    io.AddKeyEvent( ImGuiMod_Super, e->metaKey );
+}
+
 Backend::Backend( const char* title, const std::function<void()>& redraw, const std::function<void(float)>& scaleChanged, const std::function<int(void)>& isBusy, RunQueue* mainThreadTasks )
 {
    constexpr EGLint eglConfigAttrib[] = {
@@ -243,6 +252,7 @@ Backend::Backend( const char* title, const std::function<void()>& redraw, const
        return EM_TRUE;
    } );
    emscripten_set_keydown_callback( EMSCRIPTEN_EVENT_TARGET_WINDOW, nullptr, EM_TRUE, [] ( int, const EmscriptenKeyboardEvent* e, void* ) -> EM_BOOL {
+        UpdateKeyModifiers( e );
        const auto code = TranslateKeyCode( e->code );
        if( code == ImGuiKey_None ) return EM_FALSE;
        ImGui::GetIO().AddKeyEvent( code, true );
@@ -250,6 +260,7 @@ Backend::Backend( const char* title, const std::function<void()>& redraw, const
        return EM_TRUE;
    } );
    emscripten_set_keyup_callback( EMSCRIPTEN_EVENT_TARGET_WINDOW, nullptr, EM_TRUE, [] ( int, const EmscriptenKeyboardEvent* e, void* ) -> EM_BOOL {
+        UpdateKeyModifiers( e );
        const auto code = TranslateKeyCode( e->code );
        if( code == ImGuiKey_None ) return EM_FALSE;
        ImGui::GetIO().AddKeyEvent( code, false );
--- a/profiler/src/profiler/TracyConfig.hpp
+++ b/profiler/src/profiler/TracyConfig.hpp
@@ -44,7 +44,7 @@ struct Config
    std::string llmSearchIdentifier;
    std::string llmSearchApiKey;
    std::string llmSearchBraveApiKey;
-    bool llmSeparateFastModel = true;
+    bool llmSeparateFastModel = false;
    bool llmAnnotateCallstacks = false;
    bool llmLimitToolReplySize = false;
    int llmMaxToolReplySizeValue = 48*1024;
--- a/profiler/src/profiler/TracyUserData.cpp
+++ b/profiler/src/profiler/TracyUserData.cpp
@@ -295,6 +295,7 @@ bool UserData::Load()
                LoadValue( v, "min", a->range.min );
                LoadValue( v, "max", a->range.max );
                LoadValue( v, "color", a->color );
+                a->range.active = true;
                m_annotations.emplace_back( std::move( a ) );
            }
        }
--- a/profiler/src/profiler/TracyView.hpp
+++ b/profiler/src/profiler/TracyView.hpp
@@ -49,7 +49,8 @@ constexpr const char* GpuContextNames[] = {
    "Metal",
    "Custom",
    "CUDA",
-    "Rocprof"
+    "Rocprof",
+    "WebGPU"
 };

 struct MemoryPage;
--- a/profiler/src/profiler/TracyView_Timeline.cpp
+++ b/profiler/src/profiler/TracyView_Timeline.cpp
@@ -299,6 +299,22 @@ void View::DrawTimeline()
            v->range.StartFrame();
            HandleRange( v->range, timespan, ImGui::GetCursorScreenPos(), w );
        }
+        if( IsMouseClicked( 0 ) )
+        {
+            const auto ty = ImGui::GetTextLineHeight();
+            for( auto& ann : m_annotations )
+            {
+                if( ann->range.min >= m_vd.zvEnd || ann->range.max <= m_vd.zvStart ) continue;
+                const auto aMin = ( ann->range.min - m_vd.zvStart ) * pxns;
+                const auto aMax = ( ann->range.max - m_vd.zvStart ) * pxns;
+                if( ImGui::IsMouseHoveringRect( linepos + ImVec2( aMin, lineh - ty * 1.5f ), linepos + ImVec2( aMax, lineh ) ) )
+                {
+                    m_selectedAnnotation = ann.get();
+                    ConsumeMouseEvents( 0 );
+                    break;
+                }
+            }
+        }
        HandleTimelineMouse( timespan, ImGui::GetCursorScreenPos(), w );
    }
    if( ImGui::IsWindowFocused( ImGuiHoveredFlags_ChildWindows | ImGuiHoveredFlags_AllowWhenBlockedByActiveItem ) )
@@ -360,9 +376,8 @@ void View::DrawTimeline()
    bool hover = ImGui::IsWindowHovered() && ImGui::IsMouseHoveringRect( wpos, wpos + ImVec2( w, h ) );
    draw = ImGui::GetWindowDrawList();

+    const auto scale = GetScale();
    const auto ty = ImGui::GetTextLineHeight();
-    const auto to = 9.f;
-    const auto th = ( ty - to ) * sqrt( 3 ) * 0.5;

    if( m_vd.drawGpuZones )
    {
@@ -415,17 +430,24 @@ void View::DrawTimeline()

    m_lockHighlight = m_nextLockHighlight;

+    const auto iconSize = ImGui::CalcTextSize( ICON_FA_NOTE_STICKY );
    for( auto& ann : m_annotations )
    {
        if( ann->range.min < m_vd.zvEnd && ann->range.max > m_vd.zvStart )
        {
-            uint32_t c0 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0x44000000 : 0x22000000 );
-            uint32_t c1 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0x66000000 : 0x44000000 );
-            uint32_t c2 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0xCC000000 : 0xAA000000 );
-            draw->AddRectFilled( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns, 0 ), linepos + ImVec2( ( ann->range.max - m_vd.zvStart ) * pxns, lineh ), c0 );
-            DrawLine( draw, linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + 0.5f, 0.5f ), linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + 0.5f, lineh + 0.5f ), ann->range.hiMin ? c2 : c1, ann->range.hiMin ? 2 : 1 );
-            DrawLine( draw, linepos + ImVec2( ( ann->range.max - m_vd.zvStart ) * pxns + 0.5f, 0.5f ), linepos + ImVec2( ( ann->range.max - m_vd.zvStart ) * pxns + 0.5f, lineh + 0.5f ), ann->range.hiMax ? c2 : c1, ann->range.hiMax ? 2 : 1 );
-            if( drawMouseLine && ImGui::IsMouseHoveringRect( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns, 0 ), linepos + ImVec2( ( ann->range.max - m_vd.zvStart ) * pxns, lineh ) ) )
+            uint32_t c0 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0x22000000 : 0x11000000 );
+            uint32_t c1 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0x88000000 : 0x66000000 );
+            uint32_t c2 = ( ann->color & 0xFFFFFF ) | ( m_selectedAnnotation == ann.get() ? 0xDD000000 : 0xBB000000 );
+
+            const auto aMin = ( ann->range.min - m_vd.zvStart ) * pxns;
+            const auto aMax = ( ann->range.max - m_vd.zvStart ) * pxns;
+
+            draw->AddRectFilled( linepos + ImVec2( aMin, 0 ), linepos + ImVec2( aMax, lineh ), c0 );
+            draw->AddRectFilled( linepos + ImVec2( aMin + 1, lineh - ty * 1.5f ), linepos + ImVec2( aMax - 1, lineh ), 0x88000000 );
+            DrawLine( draw, linepos + ImVec2( aMin + 0.5f, 0.5f ), linepos + ImVec2( aMin + 0.5f, lineh + 0.5f ), ann->range.hiMin ? c2 : c1, ann->range.hiMin ? 2 : 1 );
+            DrawLine( draw, linepos + ImVec2( aMax - 0.5f, 0.5f ), linepos + ImVec2( aMax - 0.5f, lineh + 0.5f ), ann->range.hiMax ? c2 : c1, ann->range.hiMax ? 2 : 1 );
+
+            if( drawMouseLine && ImGui::IsMouseHoveringRect( linepos + ImVec2( aMin, 0 ), linepos + ImVec2( aMax, lineh ) ) )
            {
                ImGui::BeginTooltip();
                if( ann->text.empty() )
@@ -442,27 +464,22 @@ void View::DrawTimeline()
                TextFocused( "Annotation length:", TimeToString( ann->range.max - ann->range.min ) );
                ImGui::EndTooltip();
            }
-            const auto aw = ( ann->range.max - ann->range.min ) * pxns;
-            if( aw > th * 4 )
-            {
-                draw->AddCircleFilled( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th * 2, th * 2 ), th, 0x88AABB22 );
-                draw->AddCircle( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th * 2, th * 2 ), th, 0xAAAABB22 );
-                if( drawMouseLine && IsMouseClicked( 0 ) && ImGui::IsMouseHoveringRect( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th, th ), linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th * 3, th * 3 ) ) )
-                {
-                    m_selectedAnnotation = ann.get();
-                }

+            const auto aw = ( ann->range.max - ann->range.min ) * pxns;
+            if( aw > ty + iconSize.x )
+            {
+                draw->AddText( linepos + ImVec2( aMin + ty * 0.5f, lineh - ty * 1.25f ), ann->color | 0xFF000000, ICON_FA_NOTE_STICKY );
                if( !ann->text.empty() )
                {
                    const auto tw = ImGui::CalcTextSize( ann->text.c_str() ).x;
-                    if( aw - th*4 > tw )
+                    if( aw > ty + iconSize.x + tw )
                    {
-                        draw->AddText( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th * 4, th * 0.5 ), 0xFFFFFFFF, ann->text.c_str() );
+                        draw->AddText( linepos + ImVec2( aMin + ty + iconSize.x, lineh - ty * 1.25f ), 0xFFFFFFFF, ann->text.c_str() );
                    }
                    else
                    {
-                        draw->PushClipRect( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns, 0 ), linepos + ImVec2( ( ann->range.max - m_vd.zvStart ) * pxns, lineh ), true );
-                        draw->AddText( linepos + ImVec2( ( ann->range.min - m_vd.zvStart ) * pxns + th * 4, th * 0.5 ), 0xFFFFFFFF, ann->text.c_str() );
+                        draw->PushClipRect( linepos + ImVec2( aMin + 1, lineh - ty * 1.5f ), linepos + ImVec2( aMax - 1, lineh ) );
+                        draw->AddText( linepos + ImVec2( aMin + ty + iconSize.x, lineh - ty * 1.25f ), 0xFFFFFFFF, ann->text.c_str() );
                        draw->PopClipRect();
                    }
                }
@@ -485,7 +502,6 @@ void View::DrawTimeline()
        draw->AddRect( ImVec2( wpos.x + px0, linepos.y ), ImVec2( wpos.x + px1, linepos.y + lineh ), 0x4488DD88 );
    }

-    const auto scale = GetScale();
    if( m_findZone.range.active && ( m_findZone.show || m_showRanges ) )
    {
        const auto px0 = ( m_findZone.range.min - m_vd.zvStart ) * pxns;
--- a/public/TracyClient.F90
+++ b/public/TracyClient.F90
@@ -861,43 +861,38 @@ module tracy
  end interface

  interface
-    subroutine impl_tracy_emit_memory_alloc_callstack(ptr, size, depth, secure) &
+    subroutine impl_tracy_emit_memory_alloc_callstack(ptr, size, depth) &
      bind(C, name="___tracy_emit_memory_alloc_callstack")
      import
      type(c_ptr), intent(in), value :: ptr
      integer(c_size_t), intent(in), value :: size
      integer(c_int32_t), intent(in), value :: depth
-      integer(c_int32_t), intent(in), value :: secure
    end subroutine impl_tracy_emit_memory_alloc_callstack
-    subroutine impl_tracy_emit_memory_alloc_callstack_named(ptr, size, depth, secure, name) &
+    subroutine impl_tracy_emit_memory_alloc_callstack_named(ptr, size, depth, name) &
      bind(C, name="___tracy_emit_memory_alloc_callstack_named")
      import
      type(c_ptr), intent(in), value :: ptr
      integer(c_size_t), intent(in), value :: size
      integer(c_int32_t), intent(in), value :: depth
-      integer(c_int32_t), intent(in), value :: secure
      type(c_ptr), intent(in), value :: name
    end subroutine impl_tracy_emit_memory_alloc_callstack_named
-    subroutine impl_tracy_emit_memory_free_callstack(ptr, depth, secure) &
+    subroutine impl_tracy_emit_memory_free_callstack(ptr, depth) &
      bind(C, name="___tracy_emit_memory_free_callstack")
      import
      type(c_ptr), intent(in), value :: ptr
      integer(c_int32_t), intent(in), value :: depth
-      integer(c_int32_t), intent(in), value :: secure
    end subroutine impl_tracy_emit_memory_free_callstack
-    subroutine impl_tracy_emit_memory_free_callstack_named(ptr, depth, secure, name) &
+    subroutine impl_tracy_emit_memory_free_callstack_named(ptr, depth, name) &
      bind(C, name="___tracy_emit_memory_free_callstack_named")
      import
      type(c_ptr), intent(in), value :: ptr
      integer(c_int32_t), intent(in), value :: depth
-      integer(c_int32_t), intent(in), value :: secure
      type(c_ptr), intent(in), value :: name
    end subroutine impl_tracy_emit_memory_free_callstack_named
-    subroutine impl_tracy_emit_memory_discard_callstack(name, secure, depth) &
+    subroutine impl_tracy_emit_memory_discard_callstack(name, depth) &
      bind(C, name="___tracy_emit_memory_discard_callstack")
      import
      type(c_ptr), intent(in), value :: name
-      integer(c_int32_t), intent(in), value :: secure
      integer(c_int32_t), intent(in), value :: depth
    end subroutine impl_tracy_emit_memory_discard_callstack
  end interface
@@ -1128,58 +1123,43 @@ contains
    tracy_connected = impl_tracy_connected() /= 0_c_int32_t
  end function tracy_connected

-  subroutine tracy_memory_alloc(ptr, size, name, depth, secure)
+  subroutine tracy_memory_alloc(ptr, size, name, depth)
    type(c_ptr), intent(in) :: ptr
    integer(c_size_t), intent(in) :: size
    character(kind=c_char, len=*), target, intent(in), optional :: name
    integer(c_int32_t), intent(in), optional :: depth
-    logical(1), intent(in), optional :: secure
    !
-    integer(c_int32_t) :: depth_, secure_
-    secure_ = 0_c_int32_t
+    integer(c_int32_t) :: depth_
    depth_ = 0_c_int32_t
-    if (present(secure)) then
-      if (secure) secure_ = 1_c_int32_t
-    end if
    if (present(depth)) depth_ = depth
    if (present(name)) then
-      call impl_tracy_emit_memory_alloc_callstack_named(ptr, size, depth_, secure_, c_loc(name))
+      call impl_tracy_emit_memory_alloc_callstack_named(ptr, size, depth_, c_loc(name))
    else
-      call impl_tracy_emit_memory_alloc_callstack(ptr, size, depth_, secure_)
+      call impl_tracy_emit_memory_alloc_callstack(ptr, size, depth_)
    end if
  end subroutine tracy_memory_alloc
-  subroutine tracy_memory_free(ptr, name, depth, secure)
+  subroutine tracy_memory_free(ptr, name, depth)
    type(c_ptr), intent(in) :: ptr
    character(kind=c_char, len=*), target, intent(in), optional :: name
    integer(c_int32_t), intent(in), optional :: depth
-    logical(1), intent(in), optional :: secure
    !
-    integer(c_int32_t) :: depth_, secure_
-    secure_ = 0_c_int32_t
+    integer(c_int32_t) :: depth_
    depth_ = 0_c_int32_t
-    if (present(secure)) then
-      if (secure) secure_ = 1_c_int32_t
-    end if
    if (present(depth)) depth_ = depth
    if (present(name)) then
-      call impl_tracy_emit_memory_free_callstack_named(ptr, depth_, secure_, c_loc(name))
+      call impl_tracy_emit_memory_free_callstack_named(ptr, depth_, c_loc(name))
    else
-      call impl_tracy_emit_memory_free_callstack(ptr, depth_, secure_)
+      call impl_tracy_emit_memory_free_callstack(ptr, depth_)
    end if
  end subroutine tracy_memory_free
-  subroutine tracy_memory_discard(name, depth, secure)
+  subroutine tracy_memory_discard(name, depth)
    character(kind=c_char, len=*), target, intent(in) :: name
    integer(c_int32_t), intent(in), optional :: depth
-    logical(1), intent(in), optional :: secure
    !
-    integer(c_int32_t) :: depth_, secure_
-    secure_ = 0_c_int32_t
+    integer(c_int32_t) :: depth_
    depth_ = 0_c_int32_t
-    if (present(secure)) then
-      if (secure) secure_ = 1_c_int32_t
-    end if
    if (present(depth)) depth_ = depth
-    call impl_tracy_emit_memory_discard_callstack(c_loc(name), depth_, secure_)
+    call impl_tracy_emit_memory_discard_callstack(c_loc(name), depth_)
  end subroutine tracy_memory_discard

  subroutine tracy_message(msg, color, depth)
--- a/public/client/TracyProfiler.cpp
+++ b/public/client/TracyProfiler.cpp
@@ -524,7 +524,7 @@ static const char* GetHostInfo()
    auto ptr = buf;
 #if defined _WIN32
 #  if defined TRACY_WIN32_NO_DESKTOP
-    auto GetVersion = &::GetVersionEx;
+    auto GetVersion = &::GetVersionExW;
 #  else
    auto GetVersion = (t_RtlGetVersion)GetProcAddress( GetModuleHandleA( "ntdll.dll" ), "RtlGetVersion" );
 #  endif
@@ -1408,9 +1408,30 @@ namespace
 // 1a. But s_queue is needed for initialization of variables in point 2.
 extern moodycamel::ConcurrentQueue<QueueItem> s_queue;

+// A producer token may be created before s_initTime is constructed (the dynamic loader
+// runs shared object initializers before any of the executable's constructors, and such
+// an initializer may emit a zone). Remember the time of such an early token creation, so
+// that the init time can be backdated accordingly and no event timestamp precedes the
+// trace epoch.
+static std::atomic<int64_t> s_earlyTokenTime { 0 };
+static bool s_initTimeConstructed = false;
+
 // 2. If these variables would be in the .CRT$XCB section, they would be initialized only in main thread.
 thread_local moodycamel::ProducerToken init_order(107) s_token_detail( s_queue );
-thread_local ProducerWrapper init_order(108) s_token { s_queue.get_explicit_producer( s_token_detail ) };
+
+static moodycamel::ConcurrentQueue<QueueItem>::ExplicitProducer* CreateProducerToken()
+{
+    auto ptr = s_queue.get_explicit_producer( s_token_detail );
+    if( !s_initTimeConstructed )
+    {
+        const auto t = Profiler::GetTime();
+        auto e = s_earlyTokenTime.load( std::memory_order_relaxed );
+        while( ( e == 0 || t < e ) && !s_earlyTokenTime.compare_exchange_weak( e, t, std::memory_order_relaxed ) ) {}
+    }
+    return ptr;
+}
+
+thread_local ProducerWrapper init_order(108) s_token { CreateProducerToken() };
 thread_local ThreadHandleWrapper init_order(104) s_threadHandle { detail::GetThreadHandleImpl() };

 #  ifdef _MSC_VER
@@ -1419,12 +1440,36 @@ thread_local ThreadHandleWrapper init_order(104) s_threadHandle { detail::GetThr
 #    pragma init_seg( ".CRT$XCB" )
 #  endif

-static InitTimeWrapper init_order(101) s_initTime { SetupHwTimer() };
+static int64_t GetInitTimeImpl()
+{
+    auto t = SetupHwTimer();
+    const auto e = s_earlyTokenTime.load( std::memory_order_relaxed );
+    if( e != 0 && e < t ) t = e;
+    s_initTimeConstructed = true;
+    return t;
+}
+static InitTimeWrapper init_order(101) s_initTime { GetInitTimeImpl() };
 std::atomic<int> init_order(102) RpInitDone( 0 );
 std::atomic<int> init_order(102) RpInitLock( 0 );
 thread_local bool RpThreadInitDone = false;
 thread_local bool RpThreadShutdown = false;
 moodycamel::ConcurrentQueue<QueueItem> init_order(103) s_queue( QueuePrealloc );
+
+#  ifndef _MSC_VER
+// An instrumented shared object may emit zones from its static initializers, which the
+// dynamic loader runs before any of the executable's constructors, including the
+// priority-ordered constructor of s_queue above. The main thread producer token (s_token)
+// is then lazily created against the zero-initialized queue memory, and the queue
+// constructor subsequently orphans it, making all zones emitted on the main thread
+// invisible to the consumer. Re-adopt such a producer here. If no zones were emitted up
+// to this point, this only triggers construction of s_token, which is a no-op repair.
+struct EarlyMainThreadTokenRepair
+{
+    EarlyMainThreadTokenRepair() { if( s_token.ptr ) s_queue.readopt_orphaned_producer( s_token.ptr ); }
+};
+static EarlyMainThreadTokenRepair init_order(104) s_earlyMainThreadTokenRepair;
+#  endif
+
 std::atomic<uint32_t> init_order(104) s_lockCounter( 0 );
 std::atomic<uint8_t> init_order(104) s_gpuCtxCounter( 0 );

@@ -2290,12 +2335,12 @@ void Profiler::CompressWorker()
                const auto w = fi->w;
                const auto h = fi->h;
                const auto csz = size_t( w * h / 2 );
-                auto etc1buf = (char*)tracy_malloc( csz );
-                CompressImageDxt1( (const char*)fi->image, etc1buf, w, h );
+                auto texbuf = (char*)tracy_malloc( csz );
+                CompressImageDxt1( (const char*)fi->image, texbuf, w, h );
                tracy_free( fi->image );

                TracyLfqPrepare( QueueType::FrameImage );
-                MemWrite( &item->frameImageFat.image, (uint64_t)etc1buf );
+                MemWrite( &item->frameImageFat.image, (uint64_t)texbuf );
                MemWrite( &item->frameImageFat.frame, fi->frame );
                MemWrite( &item->frameImageFat.w, w );
                MemWrite( &item->frameImageFat.h, h );
@@ -3409,34 +3454,68 @@ void Profiler::SendString( uint64_t str, const char* ptr, size_t len, QueueType
    AppendDataUnsafe( ptr, l16 );
 }

-void Profiler::SendSingleString( const char* ptr, size_t len )
+void Profiler::SendSingleString8( const char* ptr, size_t len )
+{
+    QueueItem item;
+    MemWrite( &item.hdr.type, QueueType::SingleStringData8 );
+
+    assert( len <= std::numeric_limits<uint8_t>::max() );
+    auto l8 = uint8_t( len );
+
+    NeedDataSize( QueueDataSize[(int)QueueType::SingleStringData8] + sizeof( l8 ) + len );
+
+    AppendDataUnsafe( &item, QueueDataSize[(int)QueueType::SingleStringData8] );
+    AppendDataUnsafe( &l8, sizeof( l8 ) );
+    AppendDataUnsafe( ptr, len );
+}
+
+void Profiler::SendSingleString16( const char* ptr, size_t len )
 {
    QueueItem item;
    MemWrite( &item.hdr.type, QueueType::SingleStringData );

+    // Lengths beyond the uint16_t range are intentionally not supported
+    assert( len > std::numeric_limits<uint8_t>::max() );
    assert( len <= std::numeric_limits<uint16_t>::max() );
-    auto l16 = uint16_t( len );
+    auto l16 = uint16_t( len - ProtocolOffset8Bit );

-    NeedDataSize( QueueDataSize[(int)QueueType::SingleStringData] + sizeof( l16 ) + l16 );
+    NeedDataSize( QueueDataSize[(int)QueueType::SingleStringData] + sizeof( l16 ) + len );

    AppendDataUnsafe( &item, QueueDataSize[(int)QueueType::SingleStringData] );
    AppendDataUnsafe( &l16, sizeof( l16 ) );
-    AppendDataUnsafe( ptr, l16 );
+    AppendDataUnsafe( ptr, len );
 }

-void Profiler::SendSecondString( const char* ptr, size_t len )
+void Profiler::SendSecondString8( const char* ptr, size_t len )
+{
+    QueueItem item;
+    MemWrite( &item.hdr.type, QueueType::SecondStringData8 );
+
+    assert( len <= std::numeric_limits<uint8_t>::max() );
+    auto l8 = uint8_t( len );
+
+    NeedDataSize( QueueDataSize[(int)QueueType::SecondStringData8] + sizeof( l8 ) + len );
+
+    AppendDataUnsafe( &item, QueueDataSize[(int)QueueType::SecondStringData8] );
+    AppendDataUnsafe( &l8, sizeof( l8 ) );
+    AppendDataUnsafe( ptr, len );
+}
+
+void Profiler::SendSecondString16( const char* ptr, size_t len )
 {
    QueueItem item;
    MemWrite( &item.hdr.type, QueueType::SecondStringData );

+    // Lengths beyond the uint16_t range are intentionally not supported
+    assert( len > std::numeric_limits<uint8_t>::max() );
    assert( len <= std::numeric_limits<uint16_t>::max() );
-    auto l16 = uint16_t( len );
+    auto l16 = uint16_t( len - ProtocolOffset8Bit );

-    NeedDataSize( QueueDataSize[(int)QueueType::SecondStringData] + sizeof( l16 ) + l16 );
+    NeedDataSize( QueueDataSize[(int)QueueType::SecondStringData] + sizeof( l16 ) + len );

    AppendDataUnsafe( &item, QueueDataSize[(int)QueueType::SecondStringData] );
    AppendDataUnsafe( &l16, sizeof( l16 ) );
-    AppendDataUnsafe( ptr, l16 );
+    AppendDataUnsafe( ptr, len );
 }

 void Profiler::SendLongString( uint64_t str, const char* ptr, size_t len, QueueType type )
@@ -4664,64 +4743,64 @@ TRACY_API void ___tracy_emit_zone_value( TracyCZoneCtx ctx, uint64_t value )
    }
 }

-TRACY_API void ___tracy_emit_memory_alloc( const void* ptr, size_t size, int32_t secure ) { tracy::Profiler::MemAlloc( ptr, size, secure != 0 ); }
-TRACY_API void ___tracy_emit_memory_alloc_callstack( const void* ptr, size_t size, int32_t depth, int32_t secure )
+TRACY_API void ___tracy_emit_memory_alloc( const void* ptr, size_t size ) { tracy::Profiler::MemAlloc( ptr, size ); }
+TRACY_API void ___tracy_emit_memory_alloc_callstack( const void* ptr, size_t size, int32_t depth )
 {
    if( depth > 0 && tracy::has_callstack() )
    {
-        tracy::Profiler::MemAllocCallstack( ptr, size, depth, secure != 0 );
+        tracy::Profiler::MemAllocCallstack( ptr, size, depth );
    }
    else
    {
-        tracy::Profiler::MemAlloc( ptr, size, secure != 0 );
+        tracy::Profiler::MemAlloc( ptr, size );
    }
 }
-TRACY_API void ___tracy_emit_memory_free( const void* ptr, int32_t secure ) { tracy::Profiler::MemFree( ptr, secure != 0 ); }
-TRACY_API void ___tracy_emit_memory_free_callstack( const void* ptr, int32_t depth, int32_t secure )
+TRACY_API void ___tracy_emit_memory_free( const void* ptr ) { tracy::Profiler::MemFree( ptr ); }
+TRACY_API void ___tracy_emit_memory_free_callstack( const void* ptr, int32_t depth )
 {
    if( depth > 0 && tracy::has_callstack() )
    {
-        tracy::Profiler::MemFreeCallstack( ptr, depth, secure != 0 );
+        tracy::Profiler::MemFreeCallstack( ptr, depth );
    }
    else
    {
-        tracy::Profiler::MemFree( ptr, secure != 0 );
+        tracy::Profiler::MemFree( ptr );
    }
 }
-TRACY_API void ___tracy_emit_memory_discard( const char* name, int32_t secure ) { tracy::Profiler::MemDiscard( name, secure != 0 ); }
-TRACY_API void ___tracy_emit_memory_discard_callstack( const char* name, int32_t secure, int32_t depth )
+TRACY_API void ___tracy_emit_memory_discard( const char* name ) { tracy::Profiler::MemDiscard( name ); }
+TRACY_API void ___tracy_emit_memory_discard_callstack( const char* name, int32_t depth )
 {
    if( depth > 0 && tracy::has_callstack() )
    {
-        tracy::Profiler::MemDiscardCallstack( name, secure != 0, depth );
+        tracy::Profiler::MemDiscardCallstack( name, depth );
    }
    else
    {
-        tracy::Profiler::MemDiscard( name, secure != 0 );
+        tracy::Profiler::MemDiscard( name );
    }
 }
-TRACY_API void ___tracy_emit_memory_alloc_named( const void* ptr, size_t size, int32_t secure, const char* name ) { tracy::Profiler::MemAllocNamed( ptr, size, secure != 0, name ); }
-TRACY_API void ___tracy_emit_memory_alloc_callstack_named( const void* ptr, size_t size, int32_t depth, int32_t secure, const char* name )
+TRACY_API void ___tracy_emit_memory_alloc_named( const void* ptr, size_t size, const char* name ) { tracy::Profiler::MemAllocNamed( ptr, size, name ); }
+TRACY_API void ___tracy_emit_memory_alloc_callstack_named( const void* ptr, size_t size, int32_t depth, const char* name )
 {
    if( depth > 0 && tracy::has_callstack() )
    {
-        tracy::Profiler::MemAllocCallstackNamed( ptr, size, depth, secure != 0, name );
+        tracy::Profiler::MemAllocCallstackNamed( ptr, size, depth, name );
    }
    else
    {
-        tracy::Profiler::MemAllocNamed( ptr, size, secure != 0, name );
+        tracy::Profiler::MemAllocNamed( ptr, size, name );
    }
 }
-TRACY_API void ___tracy_emit_memory_free_named( const void* ptr, int32_t secure, const char* name ) { tracy::Profiler::MemFreeNamed( ptr, secure != 0, name ); }
-TRACY_API void ___tracy_emit_memory_free_callstack_named( const void* ptr, int32_t depth, int32_t secure, const char* name )
+TRACY_API void ___tracy_emit_memory_free_named( const void* ptr, const char* name ) { tracy::Profiler::MemFreeNamed( ptr, name ); }
+TRACY_API void ___tracy_emit_memory_free_callstack_named( const void* ptr, int32_t depth, const char* name )
 {
    if( depth > 0 && tracy::has_callstack() )
    {
-        tracy::Profiler::MemFreeCallstackNamed( ptr, depth, secure != 0, name );
+        tracy::Profiler::MemFreeCallstackNamed( ptr, depth, name );
    }
    else
    {
-        tracy::Profiler::MemFreeNamed( ptr, secure != 0, name );
+        tracy::Profiler::MemFreeNamed( ptr, name );
    }
 }
 TRACY_API void ___tracy_emit_frame_mark( const char* name ) { tracy::Profiler::SendFrameMark( name ); }
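Note on the init-time backdating in the TracyProfiler.cpp hunks above: CreateProducerToken() records the earliest timestamp seen before s_initTime exists, and GetInitTimeImpl() later takes the smaller of that value and the freshly sampled time. The following is a minimal standalone sketch of that "atomic minimum with 0 as the unset marker" pattern. The names g_earliest, RecordEarlyTime and BackdatedInitTime are hypothetical stand-ins for illustration, not identifiers from Tracy, and plain integers replace Profiler::GetTime() and SetupHwTimer().

```cpp
#include <atomic>
#include <cstdint>

// ASSUMPTION: hypothetical stand-in for s_earlyTokenTime. 0 means "nothing recorded yet".
static std::atomic<int64_t> g_earliest { 0 };

// Store t if no value is recorded yet, or if t is earlier than the recorded one.
void RecordEarlyTime( int64_t t )
{
    auto e = g_earliest.load( std::memory_order_relaxed );
    // When compare_exchange_weak fails it reloads e with the current value,
    // so the "would t still improve on e?" condition is re-checked each round.
    while( ( e == 0 || t < e ) &&
           !g_earliest.compare_exchange_weak( e, t, std::memory_order_relaxed ) ) {}
}

// ASSUMPTION: hypothetical counterpart of GetInitTimeImpl(), with the timer
// value passed in. Returns the earlier of "now" and any recorded early time.
int64_t BackdatedInitTime( int64_t now )
{
    const auto e = g_earliest.load( std::memory_order_relaxed );
    return ( e != 0 && e < now ) ? e : now;
}
```

The loop only writes when the new value is strictly earlier, so concurrent callers can at most lower the stored time. One consequence of the sentinel choice is that a genuine timestamp of 0 is indistinguishable from "unset".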
--- a/public/client/TracyProfiler.hpp
+++ b/public/client/TracyProfiler.hpp
@@ -535,9 +535,9 @@ public:
        TracyLfqCommit;
    }

-    static tracy_force_inline void MemAlloc( const void* ptr, size_t size, bool secure )
+    static tracy_force_inline void MemAlloc( const void* ptr, size_t size )
    {
-        if( secure && !ProfilerAvailable() ) return;
+        if( !ProfilerAvailable() ) return;
 #ifdef TRACY_ON_DEMAND
        if( !GetProfiler().IsConnected() ) return;
 #endif
@@ -548,9 +548,9 @@ public:
        GetProfiler().m_serialLock.unlock();
    }

-    static tracy_force_inline void MemFree( const void* ptr, bool secure )
+    static tracy_force_inline void MemFree( const void* ptr )
    {
-        if( secure && !ProfilerAvailable() ) return;
+        if( !ProfilerAvailable() ) return;
 #ifdef TRACY_ON_DEMAND
        if( !GetProfiler().IsConnected() ) return;
 #endif
@@ -561,9 +561,9 @@ public:
        GetProfiler().m_serialLock.unlock();
    }

-    static tracy_force_inline void MemAllocCallstack( const void* ptr, size_t size, int32_t depth, bool secure )
+    static tracy_force_inline void MemAllocCallstack( const void* ptr, size_t size, int32_t depth )
    {
-        if( secure && !ProfilerAvailable() ) return;
+        if( !ProfilerAvailable() ) return;
        if( depth > 0 && has_callstack() )
        {
            auto& profiler = GetProfiler();
@@ -581,16 +581,16 @@ public:
        }
        else
        {
-            MemAlloc( ptr, size, secure );
+            MemAlloc( ptr, size );
        }
    }

-    static tracy_force_inline void MemFreeCallstack( const void* ptr, int32_t depth, bool secure )
+    static tracy_force_inline void MemFreeCallstack( const void* ptr, int32_t depth )
    {
-        if( secure && !ProfilerAvailable() ) return;
+        if( !ProfilerAvailable() ) return;
        if( !ProfilerAllocatorAvailable() )
        {
-            MemFree( ptr, secure );
+            MemFree( ptr );
            return;
        }
        if( depth > 0 && has_callstack() )
@@ -610,13 +610,13 @@ public:
        }
        else
        {
-            MemFree( ptr, secure );
+            MemFree( ptr );
        }
    }

-    static tracy_force_inline void MemAllocNamed( const void* ptr, size_t size, bool secure, const char* name )
+    static tracy_force_inline void MemAllocNamed( const void* ptr, size_t size, const char* name )
    {
-        if( secure && !ProfilerAvailable() ) return;
+        if( !ProfilerAvailable() ) return;
 #ifdef TRACY_ON_DEMAND
        if( !GetProfiler().IsConnected() ) return;
 #endif
@@ -628,9 +628,9 @@ public:
        GetProfiler().m_serialLock.unlock();
    }

-    static tracy_force_inline void MemFreeNamed( const void* ptr, bool secure, const char* name )
+    static tracy_force_inline void MemFreeNamed( const void* ptr, const char* name )
    {
-        if( secure && !ProfilerAvailable() ) return;
+        if( !ProfilerAvailable() ) return;
 #ifdef TRACY_ON_DEMAND
        if( !GetProfiler().IsConnected() ) return;
 #endif
@@ -642,9 +642,9 @@ public:
        GetProfiler().m_serialLock.unlock();
    }

-    static tracy_force_inline void MemAllocCallstackNamed( const void* ptr, size_t size, int32_t depth, bool secure, const char* name )
+    static tracy_force_inline void MemAllocCallstackNamed( const void* ptr, size_t size, int32_t depth, const char* name )
    {
-        if( secure && !ProfilerAvailable() ) return;
+        if( !ProfilerAvailable() ) return;
        if( depth > 0 && has_callstack() )
        {
            auto& profiler = GetProfiler();
@@ -663,13 +663,13 @@ public:
        }
        else
        {
-            MemAllocNamed( ptr, size, secure, name );
+            MemAllocNamed( ptr, size, name );
        }
    }

-    static tracy_force_inline void MemFreeCallstackNamed( const void* ptr, int32_t depth, bool secure, const char* name )
+    static tracy_force_inline void MemFreeCallstackNamed( const void* ptr, int32_t depth, const char* name )
    {
-        if( secure && !ProfilerAvailable() ) return;
+        if( !ProfilerAvailable() ) return;
        if( depth > 0 && has_callstack() )
        {
            auto& profiler = GetProfiler();
@@ -688,13 +688,13 @@ public:
        }
        else
        {
-            MemFreeNamed( ptr, secure, name );
+            MemFreeNamed( ptr, name );
        }
    }

-    static tracy_force_inline void MemDiscard( const char* name, bool secure )
+    static tracy_force_inline void MemDiscard( const char* name )
    {
-        if( secure && !ProfilerAvailable() ) return;
+        if( !ProfilerAvailable() ) return;
 #ifdef TRACY_ON_DEMAND
        if( !GetProfiler().IsConnected() ) return;
 #endif
@@ -705,9 +705,9 @@ public:
        GetProfiler().m_serialLock.unlock();
    }

-    static tracy_force_inline void MemDiscardCallstack( const char* name, bool secure, int32_t depth )
+    static tracy_force_inline void MemDiscardCallstack( const char* name, int32_t depth )
    {
-        if( secure && !ProfilerAvailable() ) return;
+        if( !ProfilerAvailable() ) return;
        if( depth > 0 && has_callstack() )
        {
 #  ifdef TRACY_ON_DEMAND
@@ -719,12 +719,12 @@ public:

            GetProfiler().m_serialLock.lock();
            SendCallstackSerial( callstack );
-            SendMemDiscard( QueueType::MemDiscard, thread, name );
+            SendMemDiscard( QueueType::MemDiscardCallstack, thread, name );
            GetProfiler().m_serialLock.unlock();
        }
        else
        {
-            MemDiscard( name, secure );
+            MemDiscard( name );
        }
    }

@@ -827,12 +827,12 @@ public:
    void RequestShutdown() { m_shutdown.store( true, std::memory_order_relaxed ); m_shutdownManual.store( true, std::memory_order_relaxed ); }
    bool HasShutdownFinished() const { return m_shutdownFinished.load( std::memory_order_relaxed ); }

-    void SendString( uint64_t str, const char* ptr, QueueType type ) { SendString( str, ptr, strlen( ptr ), type ); }
+    tracy_force_inline void SendString( uint64_t str, const char* ptr, QueueType type ) { SendString( str, ptr, strlen( ptr ), type ); }
    void SendString( uint64_t str, const char* ptr, size_t len, QueueType type );
-    void SendSingleString( const char* ptr ) { SendSingleString( ptr, strlen( ptr ) ); }
-    void SendSingleString( const char* ptr, size_t len );
-    void SendSecondString( const char* ptr ) { SendSecondString( ptr, strlen( ptr ) ); }
-    void SendSecondString( const char* ptr, size_t len );
+    tracy_force_inline void SendSingleString( const char* ptr ) { SendSingleString( ptr, strlen( ptr ) ); }
+    tracy_force_inline void SendSingleString( const char* ptr, size_t len ) { len <= 255 ? SendSingleString8( ptr, len ) : SendSingleString16( ptr, len ); }
+    tracy_force_inline void SendSecondString( const char* ptr ) { SendSecondString( ptr, strlen( ptr ) ); }
+    tracy_force_inline void SendSecondString( const char* ptr, size_t len ) { len <= 255 ? SendSecondString8( ptr, len ) : SendSecondString16( ptr, len ); }


    // Allocated source location data layout:
@@ -975,6 +975,11 @@ private:
    void CalibrateDelay();
    void ReportTopology();

+    void SendSingleString8( const char* ptr, size_t len );
+    void SendSingleString16( const char* ptr, size_t len );
+    void SendSecondString8( const char* ptr, size_t len );
+    void SendSecondString16( const char* ptr, size_t len );
+
    static tracy_force_inline void SendCallstackSerial( void* ptr )
    {
        if( has_callstack() )
--- a/public/client/TracyRingBuffer.hpp
+++ b/public/client/TracyRingBuffer.hpp
@@ -52,20 +52,8 @@ public:
    RingBuffer( const RingBuffer& ) = delete;
    RingBuffer& operator=( const RingBuffer& ) = delete;

-    RingBuffer( RingBuffer&& other )
-    {
-        memcpy( (char*)&other, (char*)this, sizeof( RingBuffer ) );
-        m_metadata = nullptr;
-        m_fd = 0;
-    }
-
-    RingBuffer& operator=( RingBuffer&& other )
-    {
-        memcpy( (char*)&other, (char*)this, sizeof( RingBuffer ) );
-        m_metadata = nullptr;
-        m_fd = 0;
-        return *this;
-    }
+    RingBuffer( RingBuffer&& other ) = delete;
+    RingBuffer& operator=( RingBuffer&& other ) = delete;

    bool IsValid() const { return m_metadata != nullptr; }
    int GetId() const { return m_id; }
--- a/public/client/tracy_concurrentqueue.h
+++ b/public/client/tracy_concurrentqueue.h
@@ -1210,6 +1210,21 @@ private:
        return static_cast<ExplicitProducer*>(token.producer);
    }

+    // If a producer token is created before the constructor of a statically allocated
+    // queue runs (which may happen due to the undefined order of static initialization
+    // across module boundaries), the constructor will orphan it by resetting the
+    // producer list. Such a producer is functional, as producer creation works on the
+    // zero-initialized queue memory, but the consumer is not able to see the data it
+    // enqueues. This method links the producer back into the list.
+    bool readopt_orphaned_producer(ExplicitProducer* producer)
+    {
+        for (auto ptr = producerListTail.load(std::memory_order_relaxed); ptr != nullptr; ptr = ptr->next_prod()) {
+            if (ptr == static_cast<ProducerBase*>(producer)) return false;
+        }
+        add_producer(static_cast<ProducerBase*>(producer));
+        return true;
+    }
+
    private:

 	//////////////////////////////////
--- a/public/common/TracyProtocol.hpp
+++ b/public/common/TracyProtocol.hpp
@@ -10,7 +10,7 @@ namespace tracy

 constexpr unsigned Lz4CompressBound( unsigned isize ) { return isize + ( isize / 255 ) + 16; }

-constexpr uint32_t ProtocolVersion = 79;
+constexpr uint32_t ProtocolVersion = 80;
 constexpr uint16_t BroadcastVersion = 3;

 using lz4sz_t = uint32_t;
@@ -155,6 +155,7 @@ struct BroadcastMessage_v0

 #pragma pack( pop )

+constexpr uint64_t ProtocolOffset8Bit  = (1ull << 8);
 constexpr uint64_t ProtocolOffset16Bit = (1ull << 16);
 constexpr uint64_t ProtocolOffset32Bit = (1ull << 16) + (1ull << 32);

--- a/public/common/TracyQueue.hpp
+++ b/public/common/TracyQueue.hpp
@@ -122,6 +122,8 @@ enum class QueueType : uint8_t
    CpuTopology,
    SingleStringData,
    SecondStringData,
+    SingleStringData8,
+    SecondStringData8,
    MemNamePayload,
    ThreadGroupHint,
    GpuZoneAnnotation,
@@ -390,7 +392,7 @@ enum class MessageSeverity : uint8_t
    Debug,   // Describes variable states and details about specific internal events in the software, that are useful for investigations.
    Info,    // Describes normal events, which inform on the expected progress and state of your software.
    Warning, // Describes potentially dangerous situations caused by unexpected events and states.
-    Error,   // Describes the occurance of unexpected behavior. Does not interrupt the execution of the software.
+    Error,   // Describes the occurrence of unexpected behavior. Does not interrupt the execution of the software.
    Fatal,   // Describes a critical event that will lead to a software failure/crash.
    COUNT
 };
@@ -492,7 +494,8 @@ enum class GpuContextType : uint8_t
    Metal,
    Custom,
    CUDA,
-    Rocprof
+    Rocprof,
+    WebGPU
 };

 enum GpuContextFlags : uint8_t
@@ -1039,6 +1042,8 @@ static constexpr size_t QueueDataSize[] = {
    sizeof( QueueHeader ) + sizeof( QueueCpuTopology ),
    sizeof( QueueHeader ),                                  // single string data
    sizeof( QueueHeader ),                                  // second string data
+    sizeof( QueueHeader ),                                  // single string data, 8 bit length
+    sizeof( QueueHeader ),                                  // second string data, 8 bit length
    sizeof( QueueHeader ) + sizeof( QueueMemNamePayload ),
    sizeof( QueueHeader ) + sizeof( QueueThreadGroupHint ),
    sizeof( QueueHeader ) + sizeof( QueueGpuZoneAnnotation ), // GPU zone annotation
--- a/public/tracy/Tracy.hpp
+++ b/public/tracy/Tracy.hpp
@@ -76,14 +76,9 @@
 #define TracyAlloc(x,y)
 #define TracyFree(x)
 #define TracyMemoryDiscard(x)
-#define TracySecureAlloc(x,y)
-#define TracySecureFree(x)
-#define TracySecureMemoryDiscard(x)

 #define TracyAllocN(x,y,z)
 #define TracyFreeN(x,y)
-#define TracySecureAllocN(x,y,z)
-#define TracySecureFreeN(x,y)

 #define ZoneNamedS(x,y,z)
 #define ZoneNamedNS(x,y,z,w)
@@ -101,14 +96,9 @@
 #define TracyAllocS(x,y,z)
 #define TracyFreeS(x,y)
 #define TracyMemoryDiscardS(x,y)
-#define TracySecureAllocS(x,y,z)
-#define TracySecureFreeS(x,y)
-#define TracySecureMemoryDiscardS(x,y)

 #define TracyAllocNS(x,y,z,w)
 #define TracyFreeNS(x,y,z)
-#define TracySecureAllocNS(x,y,z,w)
-#define TracySecureFreeNS(x,y,z)

 #define TracyMessageS(x,y,z)
 #define TracyMessageLS(x,y)
@@ -221,17 +211,12 @@
 #define TracyMessageC( txt, size, color ) tracy::Profiler::LogString( tracy::MessageSourceType::User, tracy::MessageSeverity::Info, color, TRACY_CALLSTACK, size, txt )
 #define TracyMessageLC( txt, color ) tracy::Profiler::LogString( tracy::MessageSourceType::User, tracy::MessageSeverity::Info, color, TRACY_CALLSTACK, txt )

-#define TracyAlloc( ptr, size ) tracy::Profiler::MemAllocCallstack( ptr, size, TRACY_CALLSTACK, false )
-#define TracyFree( ptr ) tracy::Profiler::MemFreeCallstack( ptr, TRACY_CALLSTACK, false )
-#define TracySecureAlloc( ptr, size ) tracy::Profiler::MemAllocCallstack( ptr, size, TRACY_CALLSTACK, true )
-#define TracySecureFree( ptr ) tracy::Profiler::MemFreeCallstack( ptr, TRACY_CALLSTACK, true )
+#define TracyAlloc( ptr, size ) tracy::Profiler::MemAllocCallstack( ptr, size, TRACY_CALLSTACK )
+#define TracyFree( ptr ) tracy::Profiler::MemFreeCallstack( ptr, TRACY_CALLSTACK )

-#define TracyAllocN( ptr, size, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, TRACY_CALLSTACK, false, name )
-#define TracyFreeN( ptr, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, TRACY_CALLSTACK, false, name )
-#define TracyMemoryDiscard( name ) tracy::Profiler::MemDiscardCallstack( name, false, TRACY_CALLSTACK )
-#define TracySecureAllocN( ptr, size, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, TRACY_CALLSTACK, true, name )
-#define TracySecureFreeN( ptr, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, TRACY_CALLSTACK, true, name )
-#define TracySecureMemoryDiscard( name ) tracy::Profiler::MemDiscardCallstack( name, true, TRACY_CALLSTACK )
+#define TracyAllocN( ptr, size, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, TRACY_CALLSTACK, name )
+#define TracyFreeN( ptr, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, TRACY_CALLSTACK, name )
+#define TracyMemoryDiscard( name ) tracy::Profiler::MemDiscardCallstack( name, TRACY_CALLSTACK )

 #define ZoneNamedS( varname, depth, active ) static constexpr tracy::SourceLocationData TracyConcat(__tracy_source_location,TracyLine) { nullptr, TracyFunction,  TracyFile, (uint32_t)TracyLine, 0 }; tracy::ScopedZone varname( &TracyConcat(__tracy_source_location,TracyLine), depth, active )
 #define ZoneNamedNS( varname, name, depth, active ) static constexpr tracy::SourceLocationData TracyConcat(__tracy_source_location,TracyLine) { name, TracyFunction,  TracyFile, (uint32_t)TracyLine, 0 }; tracy::ScopedZone varname( &TracyConcat(__tracy_source_location,TracyLine), depth, active )
@@ -246,17 +231,12 @@
 #define ZoneScopedCS( color, depth ) ZoneNamedCS( ___tracy_scoped_zone, color, depth, true )
 #define ZoneScopedNCS( name, color, depth ) ZoneNamedNCS( ___tracy_scoped_zone, name, color, depth, true )

-#define TracyAllocS( ptr, size, depth ) tracy::Profiler::MemAllocCallstack( ptr, size, depth, false )
-#define TracyFreeS( ptr, depth ) tracy::Profiler::MemFreeCallstack( ptr, depth, false )
-#define TracySecureAllocS( ptr, size, depth ) tracy::Profiler::MemAllocCallstack( ptr, size, depth, true )
-#define TracySecureFreeS( ptr, depth ) tracy::Profiler::MemFreeCallstack( ptr, depth, true )
+#define TracyAllocS( ptr, size, depth ) tracy::Profiler::MemAllocCallstack( ptr, size, depth )
+#define TracyFreeS( ptr, depth ) tracy::Profiler::MemFreeCallstack( ptr, depth )

-#define TracyAllocNS( ptr, size, depth, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, depth, false, name )
-#define TracyFreeNS( ptr, depth, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, depth, false, name )
-#define TracyMemoryDiscardS( name, depth ) tracy::Profiler::MemDiscardCallstack( name, false, depth )
-#define TracySecureAllocNS( ptr, size, depth, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, depth, true, name )
-#define TracySecureFreeNS( ptr, depth, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, depth, true, name )
-#define TracySecureMemoryDiscardS( name, depth ) tracy::Profiler::MemDiscardCallstack( name, true, depth )
+#define TracyAllocNS( ptr, size, depth, name ) tracy::Profiler::MemAllocCallstackNamed( ptr, size, depth, name )
+#define TracyFreeNS( ptr, depth, name ) tracy::Profiler::MemFreeCallstackNamed( ptr, depth, name )
+#define TracyMemoryDiscardS( name, depth ) tracy::Profiler::MemDiscardCallstack( name, depth )

 #define TracyMessageS( txt, size, depth ) tracy::Profiler::LogString( tracy::MessageSourceType::User, tracy::MessageSeverity::Info, 0, depth, size, txt )
 #define TracyMessageLS( txt, depth ) tracy::Profiler::LogString( tracy::MessageSourceType::User, tracy::MessageSeverity::Info, 0, depth, txt )
--- a/public/tracy/TracyC.h
+++ b/public/tracy/TracyC.h
@@ -64,14 +64,9 @@ typedef const void* TracyCSharedLockCtx;
 #define TracyCAlloc(x,y)
 #define TracyCFree(x)
 #define TracyCMemoryDiscard(x)
-#define TracyCSecureAlloc(x,y)
-#define TracyCSecureFree(x)
-#define TracyCSecureMemoryDiscard(x)

 #define TracyCAllocN(x,y,z)
 #define TracyCFreeN(x,y)
-#define TracyCSecureAllocN(x,y,z)
-#define TracyCSecureFreeN(x,y)

 #define TracyCFrameMark
 #define TracyCFrameMarkNamed(x)
@@ -98,14 +93,9 @@ typedef const void* TracyCSharedLockCtx;
 #define TracyCAllocS(x,y,z)
 #define TracyCFreeS(x,y)
 #define TracyCMemoryDiscardS(x,y)
-#define TracyCSecureAllocS(x,y,z)
-#define TracyCSecureFreeS(x,y)
-#define TracyCSecureMemoryDiscardS(x,y)

 #define TracyCAllocNS(x,y,z,w)
 #define TracyCFreeNS(x,y,z)
-#define TracyCSecureAllocNS(x,y,z,w)
-#define TracyCSecureFreeNS(x,y,z)

 #define TracyCMessageS(x,y,z)
 #define TracyCMessageLS(x,y)
@@ -295,31 +285,26 @@ TRACY_API int32_t ___tracy_connected(void);
 #define TracyCZoneValue( ctx, value ) ___tracy_emit_zone_value( ctx, value );


-TRACY_API void ___tracy_emit_memory_alloc( const void* ptr, size_t size, int32_t secure );
-TRACY_API void ___tracy_emit_memory_alloc_callstack( const void* ptr, size_t size, int32_t depth, int32_t secure );
-TRACY_API void ___tracy_emit_memory_free( const void* ptr, int32_t secure );
-TRACY_API void ___tracy_emit_memory_free_callstack( const void* ptr, int32_t depth, int32_t secure );
-TRACY_API void ___tracy_emit_memory_alloc_named( const void* ptr, size_t size, int32_t secure, const char* name );
-TRACY_API void ___tracy_emit_memory_alloc_callstack_named( const void* ptr, size_t size, int32_t depth, int32_t secure, const char* name );
-TRACY_API void ___tracy_emit_memory_free_named( const void* ptr, int32_t secure, const char* name );
-TRACY_API void ___tracy_emit_memory_free_callstack_named( const void* ptr, int32_t depth, int32_t secure, const char* name );
-TRACY_API void ___tracy_emit_memory_discard( const char* name, int32_t secure );
-TRACY_API void ___tracy_emit_memory_discard_callstack( const char* name, int32_t secure, int32_t depth );
+TRACY_API void ___tracy_emit_memory_alloc( const void* ptr, size_t size );
+TRACY_API void ___tracy_emit_memory_alloc_callstack( const void* ptr, size_t size, int32_t depth );
+TRACY_API void ___tracy_emit_memory_free( const void* ptr );
+TRACY_API void ___tracy_emit_memory_free_callstack( const void* ptr, int32_t depth );
+TRACY_API void ___tracy_emit_memory_alloc_named( const void* ptr, size_t size, const char* name );
+TRACY_API void ___tracy_emit_memory_alloc_callstack_named( const void* ptr, size_t size, int32_t depth, const char* name );
+TRACY_API void ___tracy_emit_memory_free_named( const void* ptr, const char* name );
+TRACY_API void ___tracy_emit_memory_free_callstack_named( const void* ptr, int32_t depth, const char* name );
+TRACY_API void ___tracy_emit_memory_discard( const char* name );
+TRACY_API void ___tracy_emit_memory_discard_callstack( const char* name, int32_t depth );

 TRACY_API void ___tracy_emit_logString( int8_t severity, int32_t color, int32_t callstack_depth, size_t size, const char* txt );
 TRACY_API void ___tracy_emit_logStringL( int8_t severity, int32_t color, int32_t callstack_depth, const char* txt );

-#define TracyCAlloc( ptr, size ) ___tracy_emit_memory_alloc_callstack( ptr, size, TRACY_CALLSTACK, 0 )
-#define TracyCFree( ptr ) ___tracy_emit_memory_free_callstack( ptr, TRACY_CALLSTACK, 0 )
-#define TracyCMemoryDiscard( name ) ___tracy_emit_memory_discard_callstack( name, 0, TRACY_CALLSTACK );
-#define TracyCSecureAlloc( ptr, size ) ___tracy_emit_memory_alloc_callstack( ptr, size, TRACY_CALLSTACK, 1 )
-#define TracyCSecureFree( ptr ) ___tracy_emit_memory_free_callstack( ptr, TRACY_CALLSTACK, 1 )
-#define TracyCSecureMemoryDiscard( name ) ___tracy_emit_memory_discard_callstack( name, 1, TRACY_CALLSTACK );
+#define TracyCAlloc( ptr, size ) ___tracy_emit_memory_alloc_callstack( ptr, size, TRACY_CALLSTACK )
+#define TracyCFree( ptr ) ___tracy_emit_memory_free_callstack( ptr, TRACY_CALLSTACK )
+#define TracyCMemoryDiscard( name ) ___tracy_emit_memory_discard_callstack( name, TRACY_CALLSTACK );

-#define TracyCAllocN( ptr, size, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, TRACY_CALLSTACK, 0, name )
-#define TracyCFreeN( ptr, name ) ___tracy_emit_memory_free_callstack_named( ptr, TRACY_CALLSTACK, 0, name )
-#define TracyCSecureAllocN( ptr, size, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, TRACY_CALLSTACK, 1, name )
-#define TracyCSecureFreeN( ptr, name ) ___tracy_emit_memory_free_callstack_named( ptr, TRACY_CALLSTACK, 1, name )
+#define TracyCAllocN( ptr, size, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, TRACY_CALLSTACK, name )
+#define TracyCFreeN( ptr, name ) ___tracy_emit_memory_free_callstack_named( ptr, TRACY_CALLSTACK, name )

 #define TracyCMessage( txt, size ) ___tracy_emit_logString( TracyMessageSeverityInfo, 0, TRACY_CALLSTACK, size, txt )
 #define TracyCMessageL( txt ) ___tracy_emit_logStringL( TracyMessageSeverityInfo, 0, TRACY_CALLSTACK, txt )
@@ -357,17 +342,12 @@ TRACY_API void ___tracy_emit_message_appinfo( const char* txt, size_t size );
 #define TracyCZoneCS( ctx, color, depth, active ) static const struct ___tracy_source_location_data TracyConcat(__tracy_source_location,TracyLine) = { NULL, __func__,  TracyFile, (uint32_t)TracyLine, color }; TracyCZoneCtx ctx = ___tracy_emit_zone_begin_callstack( &TracyConcat(__tracy_source_location,TracyLine), depth, active );
 #define TracyCZoneNCS( ctx, name, color, depth, active ) static const struct ___tracy_source_location_data TracyConcat(__tracy_source_location,TracyLine) = { name, __func__,  TracyFile, (uint32_t)TracyLine, color }; TracyCZoneCtx ctx = ___tracy_emit_zone_begin_callstack( &TracyConcat(__tracy_source_location,TracyLine), depth, active );

-#define TracyCAllocS( ptr, size, depth ) ___tracy_emit_memory_alloc_callstack( ptr, size, depth, 0 )
-#define TracyCFreeS( ptr, depth ) ___tracy_emit_memory_free_callstack( ptr, depth, 0 )
-#define TracyCMemoryDiscardS( name, depth ) ___tracy_emit_memory_discard_callstack( name, 0, depth )
-#define TracyCSecureAllocS( ptr, size, depth ) ___tracy_emit_memory_alloc_callstack( ptr, size, depth, 1 )
-#define TracyCSecureFreeS( ptr, depth ) ___tracy_emit_memory_free_callstack( ptr, depth, 1 )
-#define TracyCSecureMemoryDiscardS( name, depth ) ___tracy_emit_memory_discard_callstack( name, 1, depth )
+#define TracyCAllocS( ptr, size, depth ) ___tracy_emit_memory_alloc_callstack( ptr, size, depth )
+#define TracyCFreeS( ptr, depth ) ___tracy_emit_memory_free_callstack( ptr, depth )
+#define TracyCMemoryDiscardS( name, depth ) ___tracy_emit_memory_discard_callstack( name, depth )

-#define TracyCAllocNS( ptr, size, depth, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, depth, 0, name )
-#define TracyCFreeNS( ptr, depth, name ) ___tracy_emit_memory_free_callstack_named( ptr, depth, 0, name )
-#define TracyCSecureAllocNS( ptr, size, depth, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, depth, 1, name )
-#define TracyCSecureFreeNS( ptr, depth, name ) ___tracy_emit_memory_free_callstack_named( ptr, depth, 1, name )
+#define TracyCAllocNS( ptr, size, depth, name ) ___tracy_emit_memory_alloc_callstack_named( ptr, size, depth, name )
+#define TracyCFreeNS( ptr, depth, name ) ___tracy_emit_memory_free_callstack_named( ptr, depth, name )

 #define TracyCMessageS( txt, size, depth ) ___tracy_emit_logString( TracyMessageSeverityInfo, 0, depth, size, txt )
 #define TracyCMessageLS( txt, depth ) ___tracy_emit_logStringL( TracyMessageSeverityInfo, 0, depth, txt )
--- a/public/tracy/TracyOpenGL.hpp
+++ b/public/tracy/TracyOpenGL.hpp
@@ -1,7 +1,12 @@
 #ifndef __TRACYOPENGL_HPP__
 #define __TRACYOPENGL_HPP__

-#if !defined TRACY_ENABLE || defined __APPLE__
+#ifdef __APPLE__
+#define TRACY_OPENGL_DISABLE
+#warning "OpenGL timestamps are unreliable on Apple devices that still run OpenGL."
+#endif
+
+#if !defined TRACY_ENABLE || defined TRACY_OPENGL_DISABLE

 #define TracyGpuContext
 #define TracyGpuContextName(x,y)
@@ -98,17 +103,31 @@ public:
        , m_head( 0 )
        , m_tail( 0 )
    {
+        ZoneScopedC( Color::Red4 );
+
        assert( m_context != 255 );

-        glGenQueries( QueryCount, m_query );
+        if( !CheckFeature( "GL_ARB_timer_query" ) )
+        {
+            Profiler::LogString( MessageSourceType::Tracy, MessageSeverity::Warning, Color::Tomato, 0,
+                    "OpenGL context does not support GL_ARB_timer_query." );
+        }
+
+        GLint bits;
+        glGetQueryiv( GL_TIMESTAMP, GL_QUERY_COUNTER_BITS, &bits );
+        if( bits == 0 )
+        {
+            // all timestamp queries would resolve to 0 (and produce 0ns GPU zones).
+            // (this is the case for many TBDR GPUs, including Apple Silicon)
+            Profiler::LogString( MessageSourceType::Tracy, MessageSeverity::Warning, Color::Tomato, 0,
+                "OpenGL driver does not implement GL_TIMESTAMP precision." );
+        }
+        assert( bits > 0 );

        int64_t tgpu;
        glGetInteger64v( GL_TIMESTAMP, &tgpu );
        int64_t tcpu = Profiler::GetTime();

-        GLint bits;
-        glGetQueryiv( GL_TIMESTAMP, GL_QUERY_COUNTER_BITS, &bits );
-
 #ifdef TRACY_OPENGL_AUTO_CALIBRATION
        // The anchor above is never refreshed; advertise calibration and emit periodic
        // GpuCalibration events to correct CPU/GPU drift (see Recalibrate). Opt-in,
@@ -117,6 +136,8 @@ public:
        m_prevCalibration = GetHostTimeNs();
 #endif

+        glGenQueries( QueryCount, m_query );
+
        const float period = 1.f;
        const auto thread = GetThreadHandle();
        TracyLfqPrepare( QueueType::GpuNewContext );
@@ -194,6 +215,30 @@ public:
    }

 private:
+    // Returns whether the driver advertises a single extension (full GL_-prefixed token).
+    static bool CheckFeature( const char* feature )
+    {
+        GLint major = 0;
+        glGetIntegerv( GL_MAJOR_VERSION, &major );
+        if( glGetError() != GL_NO_ERROR ) major = 0;   // pre-3.0: enum not supported
+
+        if( major >= 3 )
+        {
+            GLint numExt = 0;
+            glGetIntegerv( GL_NUM_EXTENSIONS, &numExt );
+            for( GLint i = 0; i < numExt; i++ )
+            {
+                auto ext = (const char*)glGetStringi( GL_EXTENSIONS, i );
+                if( ext && strcmp( ext, feature ) == 0 ) return true;
+            }
+            return false;
+        }
+
+        // pre GL3 fallback (substring match: a longer extension name sharing this prefix also matches):
+        auto exts = (const char*)glGetString( GL_EXTENSIONS );
+        return exts && strstr( exts, feature ) != nullptr;
+    }
+
 #ifdef TRACY_OPENGL_AUTO_CALIBRATION
    // Monotonic host ns for the inter-calibration interval (cpuDelta), kept
    // separate from Profiler::GetTime() as in the D3D12/Vulkan backends.
--- a/public/tracy/TracyWebGPU.hpp
+++ b/public/tracy/TracyWebGPU.hpp
@@ -0,0 +1,971 @@
+#ifndef __TRACYWEBGPU_HPP__
+#define __TRACYWEBGPU_HPP__
+
+// WebGPU, unlike other graphics APIs, has many annoying restrictions that complicate
+// the design of the Tracy WebGPU back-end:
+// - there's no CPU/GPU clock calibration API
+// - submitting GPU commands that touch a buffer that the host is mapping is not permitted
+// - resolving timestamps requires destination offsets aligned to 256 bytes
+// - timestamps are only available at pass granularity (implementations may need to emulate this)
+// - spec mandates timestamps to be in nanoseconds (implementations may need to emulate this)
+
+#ifndef TRACY_ENABLE
+
+#define TracyWebGPUSetupDeviceDescriptor(deviceDescriptor)
+
+#define TracyWebGPUContext(instance, device, queue) nullptr
+#define TracyWebGPUDestroy(ctx)
+#define TracyWebGPUContextName(ctx, name, size)
+
+#define TracyWebGPUZone(ctx, encoder, passDesc, name)
+#define TracyWebGPUZoneC(ctx, encoder, passDesc, name, color)
+#define TracyWebGPUNamedZone(ctx, varname, encoder, passDesc, name, active)
+#define TracyWebGPUNamedZoneC(ctx, varname, encoder, passDesc, name, color, active)
+#define TracyWebGPUZoneTransient(ctx, varname, encoder, passDesc, name, active)
+
+#define TracyWebGPUZoneS(ctx, encoder, passDesc, name, depth)
+#define TracyWebGPUZoneCS(ctx, encoder, passDesc, name, color, depth)
+#define TracyWebGPUNamedZoneS(ctx, varname, encoder, passDesc, name, depth, active)
+#define TracyWebGPUNamedZoneCS(ctx, varname, encoder, passDesc, name, color, depth, active)
+#define TracyWebGPUZoneTransientS(ctx, varname, encoder, passDesc, name, depth, active)
+
+#define TracyWebGPUCollect(ctx)
+
+namespace tracy
+{
+    class WebGPUZoneScope {};
+}
+
+using TracyWebGPUCtx = void*;
+
+#else
+
+#include "Tracy.hpp"
+#include "../client/TracyProfiler.hpp"
+#include "../client/TracyCallstack.hpp"
+#include "../common/TracyAlign.hpp"
+#include "../common/TracyAlloc.hpp"
+
+#include <atomic>
+#include <mutex>
+#include <vector>
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <cassert>
+#include <chrono>
+#include <thread>
+
+#include <webgpu/webgpu.h>
+
+// piggy-back on WGPU_DAWN_TOGGLES_DESCRIPTOR_INIT to detect Dawn header
+#ifdef WGPU_DAWN_TOGGLES_DESCRIPTOR_INIT
+#define TRACY_WEBGPU_DAWN_NATIVE (1)
+#include <dawn/native/DawnNative.h>
+#else
+#define TRACY_WEBGPU_WGPU_NATIVE (1)
+#include <webgpu/wgpu.h>
+#endif
+
+#ifndef TRACY_WEBGPU_DEBUG_LEVEL
+#define TRACY_WEBGPU_DEBUG_LEVEL (0)
+#endif//TRACY_WEBGPU_DEBUG_LEVEL
+
+#if TRACY_WEBGPU_DEBUG_LEVEL
+#define TracyWebGPUDebug(...) __VA_ARGS__;
+#if defined(_MSC_VER)
+extern "C" int32_t IsDebuggerPresent(void);
+#define TracyWebGPUBreak() if (IsDebuggerPresent()) __debugbreak()
+#else
+#define TracyWebGPUBreak() ((void)0)
+#endif
+#define TracyWebGPUAssert(predicate, ...) if (predicate) {} else { __VA_ARGS__; TracyWebGPUBreak(); }
+#else
+#define TracyWebGPUDebug(...)
+#define TracyWebGPUBreak()
+#define TracyWebGPUAssert(predicate, ...) assert(predicate);
+#endif
+
+#define TracyWebGPULog(severity, msg) fprintf(stdout, "%s", msg), tracy::Profiler::LogString( tracy::MessageSourceType::Tracy, tracy::MessageSeverity::severity, tracy::Color::Red4, 0, msg );
+#define TracyWebGPUPanic(msg, ...) do { TracyWebGPULog(Error, msg); TracyWebGPUAssert(false && "TracyWebGPU: " msg); __VA_ARGS__; } while(false);
+
+namespace tracy
+{
+
+    class WebGPUQueueCtx
+    {
+        friend class WebGPUZoneScope;
+
+        uint8_t m_contextId = 255;  // 255 represents "invalid id"
+
+        std::mutex m_collectionMutex;
+
+        WGPUInstance m_instance = nullptr;
+        WGPUDevice m_device = nullptr;
+        WGPUQueue m_queue = nullptr;
+
+        struct ReadbackStage
+        {
+            WGPUBuffer buffer = nullptr;
+            std::atomic<uint64_t> copiedUpto {0};
+            std::atomic<WGPUMapAsyncStatus> mapStatus = {};
+            WGPUFuture pendingFuture = {};
+        };
+        static_assert(std::atomic<WGPUMapAsyncStatus>::is_always_lock_free, "WGPUMapAsyncStatus must be lock-free atomic");
+
+        WGPUQuerySet  m_querySet = nullptr;
+        WGPUBuffer    m_resolveBuffer = nullptr;
+        ReadbackStage m_readbackReel [3];
+        std::atomic<int> m_writeIdx {0};
+
+        using atomic_counter = std::atomic<uint64_t>;
+        atomic_counter m_queryCounter = 0;
+        atomic_counter m_previousCheckpoint = 0;
+
+        uint32_t m_queryLimit = 0;
+
+        std::vector<uint64_t> m_shadowBuffer;
+
+        using WallTime = std::chrono::steady_clock::time_point;
+        static tracy_force_inline auto GetWallTime() { return WallTime::clock::now(); }
+        static tracy_force_inline auto Milliseconds(int value) { return std::chrono::milliseconds(value); }
+
+        static bool WaitQueueIdle(WGPUQueue queue, WGPUInstance instance)
+        {
+            bool gpuDone = false;
+            WGPUQueueWorkDoneCallbackInfo doneCB = {};
+            doneCB.mode = WGPUCallbackMode_AllowProcessEvents;
+            doneCB.callback = [](WGPUQueueWorkDoneStatus, WGPUStringView, void* userData, void*) {
+                *static_cast<bool*>(userData) = true;
+            };
+            doneCB.userdata1 = &gpuDone;
+            wgpuQueueOnSubmittedWorkDone(queue, doneCB);
+
+            const auto deadline = GetWallTime() + Milliseconds(2000);
+            while (!gpuDone && GetWallTime() < deadline)
+                wgpuInstanceProcessEvents(instance);
+            return gpuDone;
+        }
+
+        static const uint64_t* MapBufferSync(WGPUBuffer buffer, WGPUInstance instance)
+        {
+            struct MapCtx { WGPUMapAsyncStatus status = {}; } ctx;
+            WGPUBufferMapCallbackInfo cbInfo = {};
+            cbInfo.mode      = WGPUCallbackMode_AllowProcessEvents;
+            cbInfo.callback  = [](WGPUMapAsyncStatus status, WGPUStringView, void* userData, void*) {
+                auto* ctx = static_cast<MapCtx*>(userData);
+                ctx->status = status;
+            };
+            cbInfo.userdata1 = &ctx;
+            size_t offset = 0;
+            size_t size = 2 * sizeof(uint64_t);
+            wgpuBufferMapAsync(buffer, WGPUMapMode_Read, offset, size, cbInfo);
+
+            const auto deadline = GetWallTime() + Milliseconds(2000);
+            while (ctx.status == 0 && GetWallTime() < deadline)
+                wgpuInstanceProcessEvents(instance);
+
+            if (ctx.status != WGPUMapAsyncStatus_Success) return nullptr;
+            auto data = wgpuBufferGetConstMappedRange(buffer, offset, size);
+            return static_cast<const uint64_t*>(data);
+        }
+
+        struct Calibration {
+            int64_t minCpuRange = ~uint64_t(0) >> 1;
+            struct Regression
+            {
+                int64_t n = 0;
+                int64_t mean_x = 0;
+                int64_t mean_y = 0;
+                int64_t S_xx = 0;
+                int64_t S_xy = 0;
+                void Update(int64_t x, int64_t y)
+                {
+                    n += 1;
+                    int64_t dx = x - mean_x;
+                    int64_t dy = y - mean_y;
+                    mean_x += dx / n;
+                    mean_y += dy / n;
+                    S_xx += dx * (x - mean_x);
+                    S_xy += dx * (y - mean_y);
+                }
+                double Slope() const { return double(S_xy) / S_xx; }
+                double Intercept() const { return mean_y - Slope() * mean_x; }
+            };
+            Regression cpuToGpuModel;   // cpu-ticks to gpu-ticks
+            Regression cpuRangeModel;   // cpu-tick interval uncertainty
+            Regression wallToGpuModel;  // nanoseconds to gpu-ticks
+            void GetReferenceTime(uint64_t& cpuTime, uint64_t& gpuTime) const
+            {
+                // the mean belongs to the regression line
+                cpuTime = cpuToGpuModel.mean_x;
+                gpuTime = cpuToGpuModel.mean_y;
+            }
+            double Period() const { return 1.0 / wallToGpuModel.Slope(); }    // ns/tick
+            bool AcceptX(const Regression& r, int64_t x, double threshold = 3.0) const {
+                if (r.n < 2) return true;
+                auto dx = x - r.mean_x;
+                if (dx <= 0) return true; // always accept "tighter" outliers
+                double variance = double(r.S_xx) / (r.n - 1);
+                if (variance == 0.0) return true;
+                // WARN: dx*dx "could" overflow, but very unlikely in practice
+                double zz = (double)(dx*dx) / variance;
+                return zz <= (threshold*threshold);
+            }
+            bool Update(WallTime twall0, WallTime twall1, uint64_t tcpu0, uint64_t tcpu1, uint64_t tgpu)
+            {
+                using namespace std::chrono;
+                int64_t cpuRange = tcpu1 - tcpu0;
+                cpuRangeModel.Update(cpuRange, 0);
+                if (!AcceptX(cpuRangeModel, cpuRange, 1.0)) return false;
+                // Process sample:
+                int64_t tcpu = tcpu0 + (tcpu1 - tcpu0) / 2; // mid-point
+                int64_t twall = duration_cast<nanoseconds>(
+                    (twall0 + (twall1 - twall0) / 2)        // mid-point
+                    .time_since_epoch()
+                ).count();
+                // incremental regression:
+                cpuToGpuModel.Update(tcpu, tgpu);
+                wallToGpuModel.Update(twall, tgpu);
+                TracyWebGPUDebug( fprintf(stderr, "----- (sample accepted! wall = %lld | cpu = %lld | gpu = %llu | period = %f)\n", (long long)twall, (long long)tcpu, (unsigned long long)tgpu, Period()) );
+                return true;
+            }
+        } m_calibration;
+
+        tracy_force_inline void SubmitQueueItem(tracy::QueueItem* item)
+        {
+#ifdef TRACY_ON_DEMAND
+            GetProfiler().DeferItem(*item);
+#endif
+            Profiler::QueueSerialFinish();
+        }
+
+        bool CalibrateClocks(uint64_t& outCpuTime, uint64_t& outGpuTime, double& period)
+        {
+            // WebGPU does not have any clock calibration API.
+            // This routine attempts to estimate a reasonable (cpuTime, gpuTime) correlation
+            // by sampling CPU and GPU timestamps around a "synchronous" draw call.
+            // Several samples are taken to tighten the estimation.
+
+            ZoneScoped;
+
+            WGPUShaderSourceWGSL wgslSrc = {};
+            wgslSrc.chain.sType = WGPUSType_ShaderSourceWGSL;
+            wgslSrc.code =
+            {
+                R"(
+                @vertex fn vs(@builtin(vertex_index) i: u32) -> @builtin(position) vec4f {
+                    var p = array(vec4f(-1,-1,.5,1), vec4f(3,-1,.5,1), vec4f(-1,3,.5,1));
+                    return p[i];
+                }
+                @fragment fn fs() -> @location(0) vec4f { return vec4f(0.0); }
+                )",
+                WGPU_STRLEN
+            };
+            WGPUShaderModuleDescriptor smDesc = {};
+            smDesc.nextInChain  = reinterpret_cast<WGPUChainedStruct*>(&wgslSrc);
+            WGPUShaderModule calibShader = wgpuDeviceCreateShaderModule(m_device, &smDesc);
+            if (!calibShader) { TracyWebGPUPanic("Failed to create calibration shader.", return false); }
+
+            WGPUTextureDescriptor texDesc = {};
+            texDesc.usage         = WGPUTextureUsage_RenderAttachment;
+            texDesc.dimension     = WGPUTextureDimension_2D;
+            texDesc.size          = { 1, 1, 1 };
+            texDesc.format        = WGPUTextureFormat_BGRA8Unorm;
+            texDesc.mipLevelCount = 1;
+            texDesc.sampleCount   = 1;
+            WGPUTexture tex = wgpuDeviceCreateTexture(m_device, &texDesc);
+            if (!tex) { wgpuShaderModuleRelease(calibShader); TracyWebGPUPanic("Failed to create calibration scratch texture.", return false); }
+            WGPUTextureView texView = wgpuTextureCreateView(tex, nullptr);
+            if (!texView) { wgpuTextureRelease(tex); wgpuShaderModuleRelease(calibShader); TracyWebGPUPanic("Failed to create calibration scratch texture view.", return false); }
+
+            WGPUColorTargetState colorTarget = {};
+            colorTarget.format    = WGPUTextureFormat_BGRA8Unorm;
+            colorTarget.writeMask = WGPUColorWriteMask_All;
+            WGPUFragmentState fragState = {};
+            fragState.module      = calibShader;
+            fragState.entryPoint  = { "fs", WGPU_STRLEN };
+            fragState.targetCount = 1;
+            fragState.targets     = &colorTarget;
+            WGPURenderPipelineDescriptor pipeDesc = {};
+            pipeDesc.vertex.module        = calibShader;
+            pipeDesc.vertex.entryPoint    = { "vs", WGPU_STRLEN };
+            pipeDesc.primitive.topology   = WGPUPrimitiveTopology_TriangleList;
+            pipeDesc.multisample.count    = 1;
+            pipeDesc.fragment             = &fragState;
+            WGPURenderPipeline calibPipeline = wgpuDeviceCreateRenderPipeline(m_device, &pipeDesc);
+            if (!calibPipeline) { wgpuTextureViewRelease(texView); wgpuTextureRelease(tex); wgpuShaderModuleRelease(calibShader); TracyWebGPUPanic("Failed to create calibration pipeline.", return false); }
+
+            uint32_t queryId = 0;
+            WGPUPassTimestampWrites anchorTs = {};
+            anchorTs.querySet                  = m_querySet;
+            anchorTs.beginningOfPassWriteIndex = queryId;
+            anchorTs.endOfPassWriteIndex       = queryId+1;
+
+            WGPURenderPassColorAttachment att = {};
+            att.view       = texView;
+            att.loadOp     = WGPULoadOp_Clear;
+            att.storeOp    = WGPUStoreOp_Store;
+            att.depthSlice = WGPU_DEPTH_SLICE_UNDEFINED;
+
+            WGPURenderPassDescriptor passDesc = {};
+            passDesc.colorAttachmentCount = 1;
+            passDesc.colorAttachments     = &att;
+            passDesc.timestampWrites      = &anchorTs;
+
+            // calibration loop
+            const auto deadline = GetWallTime() + Milliseconds(100);
+            for (int i = 0; i < 1000; ++i)
+            {
+                // loop while the time budget (100ms) allows, but ensure at least 6 iterations (the first one is discarded)
+                if ((GetWallTime() >= deadline) && (i > 5))
+                    break;
+
+                WGPUCommandEncoder enc = wgpuDeviceCreateCommandEncoder(m_device, nullptr);
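+                if (!enc) { wgpuRenderPipelineRelease(calibPipeline); wgpuTextureViewRelease(texView); wgpuTextureRelease(tex); wgpuShaderModuleRelease(calibShader); TracyWebGPUPanic("Failed to create command encoder for time calibration.", return false); }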
+
+                WGPURenderPassEncoder pass = wgpuCommandEncoderBeginRenderPass(enc, &passDesc);
+                wgpuRenderPassEncoderSetPipeline(pass, calibPipeline);
+                wgpuRenderPassEncoderDraw(pass, 3, 1, 0, 0);
+                wgpuRenderPassEncoderEnd(pass);
+                wgpuRenderPassEncoderRelease(pass);
+
+                WGPUBuffer readBackBuffer = m_readbackReel[0].buffer;
+                uint32_t byteOffset = queryId * sizeof(uint64_t);
+                uint32_t sizeInBytes = 2 * sizeof(uint64_t);
+                wgpuCommandEncoderResolveQuerySet(enc, m_querySet, queryId, 2, m_resolveBuffer, byteOffset);
+                wgpuCommandEncoderCopyBufferToBuffer(enc, m_resolveBuffer, byteOffset, readBackBuffer, byteOffset, sizeInBytes);
+
+                WGPUCommandBuffer cmd = wgpuCommandEncoderFinish(enc, nullptr);
+                wgpuCommandEncoderRelease(enc);
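+                if (!cmd) { wgpuRenderPipelineRelease(calibPipeline); wgpuTextureViewRelease(texView); wgpuTextureRelease(tex); wgpuShaderModuleRelease(calibShader); TracyWebGPUPanic("Failed to finish calibration command encoder.", return false); }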
+
+                WaitQueueIdle(m_queue, m_instance);
+                int64_t cpu [2] = {};
+                int64_t gpu [2] = {};
+                WallTime wall [2] = {};
+                cpu[0] = Profiler::GetTime();
+                wall[0] = GetWallTime();
+                wgpuQueueSubmit(m_queue, 1, &cmd);
+                wgpuCommandBufferRelease(cmd);
+                WaitQueueIdle(m_queue, m_instance);
+                wall[1] = GetWallTime();
+                cpu[1] = Profiler::GetTime();
+                auto gpuTimestamps = MapBufferSync(readBackBuffer, m_instance);
+                TracyWebGPUAssert(gpuTimestamps != nullptr);
+                gpu[0] = gpuTimestamps[0];
+                gpu[1] = gpuTimestamps[1];
+                wgpuBufferUnmap(readBackBuffer);
+                TracyWebGPUDebug(
+                    fprintf(stdout, "[%03d] CalibrateClocks() [CPU] %16lld | %16lld | /// %lld\n", i, cpu[0], cpu[1], cpu[1]-cpu[0]);
+                    fprintf(stdout,  "----------------------- [GPU] %16llu | %16llu | /// %lld\n",    gpu[0], gpu[1], gpu[1]-gpu[0]);
+                    uint64_t cpuTimeRef, gpuTimeRef;
+                    m_calibration.GetReferenceTime(cpuTimeRef, gpuTimeRef);
+                    if (gpu[0] < gpuTimeRef)
+                        fprintf(stdout, "!!!!! CalibrateClocks() -> WARNING!!! going backwards!\n%llu\n%llu\n%lld\n", gpuTimeRef, gpu[0], gpu[0] - gpuTimeRef);
+                );
+
+                // skip first sample since it is quite jittery (lazy initialization of WebGPU objects)
+                if (i == 0)
+                    continue;
+
+                m_calibration.Update(wall[0], wall[1], cpu[0], cpu[1], gpu[0]);
+            }
+
+            TracyWebGPUDebug(
+                fprintf(stdout, "##### CalibrateClocks() WALL = %lld | CPU = %lld | GPU = %lld | period = %f\n",
+                    m_calibration.wallToGpuModel.mean_x,
+                    m_calibration.cpuToGpuModel.mean_x,
+                    m_calibration.cpuToGpuModel.mean_y,
+                    m_calibration.Period());
+            );
+
+            wgpuRenderPipelineRelease(calibPipeline);
+            wgpuShaderModuleRelease(calibShader);
+            wgpuTextureViewRelease(texView);
+            wgpuTextureRelease(tex);
+
+            m_calibration.GetReferenceTime(outCpuTime, outGpuTime);
+            period = m_calibration.Period();
+            // assume 1 ns/tick if the period estimation is close enough to 1
+            if (std::abs(period - 1.0) < 0.001)
+                period = 1.0;
+
+            return true;
+        }
+
+    public:
+        class Requirements
+        {
+            private:
+#           if (TRACY_WEBGPU_DAWN_NATIVE)
+                WGPUDawnTogglesDescriptor dawnTogglesDesc = {};
+                static constexpr int NumExtras = 0;
+#           elif (TRACY_WEBGPU_WGPU_NATIVE)
+                static constexpr int NumExtras = 1;
+#           endif
+
+            public:
+            static constexpr int NumFeatures = 1 + NumExtras;
+            WGPUFeatureName  features [NumFeatures] = {};
+            WGPUChainedStruct* togglesDesc = nullptr;
+
+            Requirements()
+            {
+                this->features[0] = WGPUFeatureName_TimestampQuery;
+#               if (TRACY_WEBGPU_WGPU_NATIVE)
+                    this->features[1] = (WGPUFeatureName)WGPUNativeFeature_TimestampQueryInsideEncoders;
+#               endif
+#               if (TRACY_WEBGPU_DAWN_NATIVE)
+                    static const char* dawnDisabledToggles[] = { "timestamp_quantization" };
+                    static const char* dawnEnabledToggles[]  = { "disable_timestamp_query_conversion" };
+                    this->dawnTogglesDesc.chain.sType = WGPUSType_DawnTogglesDescriptor;
+                    this->dawnTogglesDesc.disabledToggles = dawnDisabledToggles;
+                    this->dawnTogglesDesc.disabledToggleCount = 1;
+                    this->dawnTogglesDesc.enabledToggles = dawnEnabledToggles;
+                    this->dawnTogglesDesc.enabledToggleCount  = 1;
+                    this->togglesDesc = reinterpret_cast<WGPUChainedStruct*>(&this->dawnTogglesDesc);
+#               endif
+            }
+
+            static bool VerifyDevice(WGPUDevice device)
+            {
+                if (device == nullptr)
+                    return false;
+                if (wgpuDeviceHasFeature(device, WGPUFeatureName_TimestampQuery) == WGPU_FALSE)
+                    return false;
+#               if (TRACY_WEBGPU_DAWN_NATIVE)
+                    bool hasDisableConversion = false, hasQuantization = false;
+                    for (const char* t : ::dawn::native::GetTogglesUsed(device))
+                    {
+                        if (strcmp(t, "disable_timestamp_query_conversion") == 0)
+                            hasDisableConversion = true;
+                        if (strcmp(t, "timestamp_quantization") == 0)
+                            hasQuantization = true;
+                    }
+                    return hasDisableConversion && !hasQuantization;
+#               elif (TRACY_WEBGPU_WGPU_NATIVE)
+                    if (wgpuDeviceHasFeature(device, (WGPUFeatureName)WGPUNativeFeature_TimestampQueryInsideEncoders) == WGPU_FALSE)
+                        return false;
+                    return true;
+#               endif
+                return false;
+            }
+
+            void ApplyToDeviceDescriptor(WGPUDeviceDescriptor& deviceDescriptor)
+            {
+                size_t userCount  = deviceDescriptor.requiredFeatureCount;
+                size_t totalCount = userCount + NumFeatures;
+                // NOTE: this allocation is never freed (the device descriptor keeps referencing it)
+                auto* mergedFeatures = static_cast<WGPUFeatureName*>(tracy_malloc(totalCount * sizeof(WGPUFeatureName)));
+                if (userCount > 0 && deviceDescriptor.requiredFeatures)
+                    memcpy(mergedFeatures, deviceDescriptor.requiredFeatures, userCount * sizeof(WGPUFeatureName));
+                memcpy(mergedFeatures + userCount, features, NumFeatures * sizeof(WGPUFeatureName));
+                deviceDescriptor.requiredFeatures     = mergedFeatures;
+                deviceDescriptor.requiredFeatureCount = totalCount;
+
+                if (togglesDesc)
+                {
+                    togglesDesc->next            = deviceDescriptor.nextInChain;
+                    deviceDescriptor.nextInChain = togglesDesc;
+                }
+            }
+        };
+
+        WebGPUQueueCtx(WGPUInstance instance, WGPUDevice device, WGPUQueue queue)
+        {
+            ZoneScopedC(Color::Red4);
+
+            if (!Requirements::VerifyDevice(device))
+                TracyWebGPUPanic("GPU profiling disabled because the device did not enable the necessary features.", return);
+
+            TracyWebGPUAssert(instance); wgpuInstanceAddRef(instance); m_instance = instance;
+            TracyWebGPUAssert(device);   wgpuDeviceAddRef(device);     m_device   = device;
+            TracyWebGPUAssert(queue);    wgpuQueueAddRef(queue);       m_queue    = queue;
+
+            // Setup Query Set: must have even size since queries are issued in pairs.
+            // (The WebGPU spec caps the query count at 4096, with no way to query a device limit.)
+            WGPUQuerySetDescriptor qsDesc = {};
+            qsDesc.type = WGPUQueryType_Timestamp;
+            qsDesc.count = 4096;
+            for (;;)
+            {
+                m_querySet = wgpuDeviceCreateQuerySet(m_device, &qsDesc);
+                if (m_querySet) break;
+                qsDesc.count /= 2;
+                if (qsDesc.count < 128) break;
+            }
+            if (m_querySet == nullptr)
+                TracyWebGPUPanic("Failed to create timestamp query set.", return);
+            m_queryLimit = qsDesc.count;
+
+            WGPUBufferDescriptor resolveDesc = {};
+            resolveDesc.usage = WGPUBufferUsage_QueryResolve | WGPUBufferUsage_CopySrc;
+            resolveDesc.size  = static_cast<uint64_t>(m_queryLimit) * sizeof(uint64_t);
+            m_resolveBuffer = wgpuDeviceCreateBuffer(m_device, &resolveDesc);
+            if (!m_resolveBuffer)
+                TracyWebGPUPanic("Failed to create timestamp resolve buffer.", return);
+
+            WGPUBufferDescriptor readbackDesc = {};
+            readbackDesc.usage = WGPUBufferUsage_CopyDst | WGPUBufferUsage_MapRead;
+            readbackDesc.size  = static_cast<uint64_t>(m_queryLimit) * sizeof(uint64_t);
+            for (auto& stage : m_readbackReel)
+            {
+                stage.buffer = wgpuDeviceCreateBuffer(m_device, &readbackDesc);
+                stage.copiedUpto = 0;
+                if (!stage.buffer) { TracyWebGPUPanic("Failed to create timestamp readback buffer.", return); }
+            }
+
+            uint64_t cpuTimestamp = 0;
+            uint64_t gpuTimestamp = 0;
+            double period = 0.0;  // in nanoseconds per gpu-tick
+            if (!CalibrateClocks(cpuTimestamp, gpuTimestamp, period))
+                TracyWebGPUPanic("Failed to calibrate CPU/GPU clocks.", return);
+
+            TracyWebGPUDebug( fprintf(stdout, "[WebGPUQueueCtx] cpuTimestamp: %llu | gpuTimestamp: %llu | period: %f\n", cpuTimestamp, gpuTimestamp, period) );
+            m_shadowBuffer.resize(m_queryLimit, gpuTimestamp);
+
+            // All setup completed: register the context.
+            m_contextId = GetGpuCtxCounter().fetch_add(1);
+            ZoneValue(m_contextId);
+
+            auto* item = Profiler::QueueSerial();
+            MemWrite(&item->hdr.type, QueueType::GpuNewContext);
+            MemWrite(&item->gpuNewContext.cpuTime, static_cast<int64_t>(cpuTimestamp));
+            MemWrite(&item->gpuNewContext.gpuTime, static_cast<int64_t>(gpuTimestamp));
+            MemWrite(&item->gpuNewContext.thread, static_cast<uint32_t>(0));
+            MemWrite(&item->gpuNewContext.period, static_cast<float>(period));
+            MemWrite(&item->gpuNewContext.context, static_cast<uint8_t>(GetId()));
+            MemWrite(&item->gpuNewContext.flags, GpuContextFlags(0));  // no calibration available
+            MemWrite(&item->gpuNewContext.type, GpuContextType::WebGPU);
+            SubmitQueueItem(item);
+        }
+
+        ~WebGPUQueueCtx()
+        {
+            // TODO: a few problems to address later during this final Collect():
+            // 1. ensure "partial" query batches are collected
+            // 2. ensure all readback stages are collected and empty
+            // 3. ensure readback buffers are not mapped before deleting them
+            Collect();
+
+            for (auto& stage : m_readbackReel)
+                if (stage.buffer) { wgpuBufferRelease(stage.buffer);     stage.buffer     = nullptr; }
+            if (m_resolveBuffer)  { wgpuBufferRelease(m_resolveBuffer);  m_resolveBuffer  = nullptr; }
+            if (m_querySet)       { wgpuQuerySetRelease(m_querySet);     m_querySet       = nullptr; }
+            if (m_queue)          { wgpuQueueRelease(m_queue);           m_queue          = nullptr; }
+            if (m_device)         { wgpuDeviceRelease(m_device);         m_device         = nullptr; }
+            if (m_instance)       { wgpuInstanceRelease(m_instance);     m_instance       = nullptr; }
+        }
+
+        tracy_force_inline uint8_t GetId() const
+        {
+            return m_contextId;
+        }
+
+        void Name(const char* name, uint16_t len)
+        {
+            auto ptr = (char*)tracy_malloc(len);
+            memcpy(ptr, name, len);
+
+            auto item = Profiler::QueueSerial();
+            MemWrite(&item->hdr.type, QueueType::GpuContextName);
+            MemWrite(&item->gpuContextNameFat.context, GetId());
+            MemWrite(&item->gpuContextNameFat.ptr, (uint64_t)ptr);
+            MemWrite(&item->gpuContextNameFat.size, len);
+            SubmitQueueItem(item);
+        }
+
+        void Collect(bool webgpuProcessEvents=false)
+        {
+#ifdef TRACY_ON_DEMAND
+            if (!GetProfiler().IsConnected()) return;
+#endif
+            if (!m_collectionMutex.try_lock()) return;
+            std::unique_lock<std::mutex> lock(m_collectionMutex, std::adopt_lock);
+
+            ZoneScopedC(Color::Red4);
+
+            if (Distance(m_previousCheckpoint, m_queryCounter) <= 0)
+                return;
+
+            // Current Readback "Reel" Stages:
+            const int state = m_writeIdx;
+            const int fillingIdx = (state + 0) % 3; // this is where instrumentation is pushing new queries
+            const int pendingIdx = (state + 1) % 3; // instrumentation is done here; ready to be collected
+            const int collectIdx = (state + 2) % 3; // this is where queries are being collected right now
+
+            // Ensure readback buffer has been mapped to the host
+            auto& collectStage = m_readbackReel[collectIdx];
+            if (collectStage.pendingFuture.id != 0)
+            {
+                if (webgpuProcessEvents)
+                    wgpuInstanceProcessEvents(m_instance);
+                if (collectStage.mapStatus == WGPUMapAsyncStatus{})
+                    return;  // callback hasn't fired yet
+                collectStage.pendingFuture = {};
+                if (collectStage.mapStatus != WGPUMapAsyncStatus_Success)
+                    TracyWebGPUPanic("Collect(): unable to map readback buffer.", return);
+            }
+
+            if (collectStage.mapStatus == WGPUMapAsyncStatus_Success)
+            {
+                const uint64_t* ts = static_cast<const uint64_t*>(
+                    wgpuBufferGetConstMappedRange(collectStage.buffer, 0,
+                        static_cast<uint64_t>(m_queryLimit) * sizeof(uint64_t)));
+                if (ts)
+                {
+                    uint64_t ticket = m_previousCheckpoint;
+                    const uint64_t end = collectStage.copiedUpto;
+                    TracyWebGPUDebug( fprintf(stdout, "[TWG] Collect [%d] (%llu, %llu)\n", collectIdx, ticket, end) );
+                    for (; Distance(ticket, end) > 0; ticket += 2)
+                    {
+                        const uint32_t slotB = RingIndex(ticket);
+                        const uint32_t slotE = slotB + 1;
+                        TracyWebGPUDebug(
+                            fprintf(stderr,
+                                "[TWG] slot B=%4u E=%4u ts[B]=%llu ts[E]=%llu shadow[E]=%llu ts-diff=%lld shadow-diff=%lld\n",
+                                slotB, slotE,
+                                ts[slotB], ts[slotE], m_shadowBuffer[slotE],
+                                Distance(ts[slotB], ts[slotE]),
+                                Distance(m_shadowBuffer[slotE], ts[slotE]));
+                        );
+                        if (Distance(m_shadowBuffer[slotE], ts[slotE]) <= 0)
+                            break; // GPU hasn't written this timestamp yet; retry next Collect()
+                        EmitGpuTime(ts[slotB], slotB);
+                        EmitGpuTime(ts[slotE], slotE);
+                    }
+                    m_previousCheckpoint = ticket;
+
+                    if (Distance(ticket, end) > 0)
+                        return; // still unresolved queries in this buffer; come back next Collect()
+                }
+
+                // All queries resolved (or getMappedRange failed): unmap and fall through to rotate.
+                wgpuBufferUnmap(collectStage.buffer);
+                collectStage.mapStatus = {};
+            }
+
+            // At this point, all queries in the collect buffer have been processed.
+            // (it's now time to "rotate" the buffers around...)
+
+            // Has any ResolveQueryBatch call landed in this reel stage since it was last recycled?
+            // (Are there any queries to resolve and collect at all?)
+            if (m_readbackReel[fillingIdx].copiedUpto <= m_previousCheckpoint)
+                return;
+
+            // Rotate/Cycle the Readback Pipeline State:
+            // the buffer that was just collected shall now be used for instrumentation
+            collectStage.copiedUpto = m_previousCheckpoint.load();
+            m_writeIdx = collectIdx;    // atomically commit the pipeline rotation
+
+            auto& nextToCollect = m_readbackReel[pendingIdx];
+            WGPUBufferMapCallbackInfo cbInfo = {};
+            cbInfo.mode = WGPUCallbackMode_AllowProcessEvents;
+            cbInfo.callback = [](WGPUMapAsyncStatus status, WGPUStringView, void* userData, void*)
+            {
+                auto* stage = static_cast<ReadbackStage*>(userData);
+                stage->mapStatus = status;
+            };
+            cbInfo.userdata1 = &nextToCollect;
+            nextToCollect.pendingFuture = wgpuBufferMapAsync(
+                nextToCollect.buffer, WGPUMapMode_Read, 0,
+                static_cast<uint64_t>(m_queryLimit) * sizeof(uint64_t), cbInfo);
+        }
+
+    private:
+        void EmitGpuTime(uint64_t gpuTimestamp, uint32_t queryId)
+        {
+            auto* item = Profiler::QueueSerial();
+            MemWrite(&item->hdr.type, QueueType::GpuTime);
+            MemWrite(&item->gpuTime.gpuTime, static_cast<int64_t>(gpuTimestamp));
+            MemWrite(&item->gpuTime.queryId, static_cast<uint16_t>(queryId));
+            MemWrite(&item->gpuTime.context, GetId());
+            Profiler::QueueSerialFinish();
+            m_shadowBuffer[queryId] = gpuTimestamp;
+        }
+
+        tracy_force_inline uint32_t RingCapacity() const { return m_queryLimit; }
+
+        tracy_force_inline uint32_t RingIndex(uint64_t t) const
+        {
+            return static_cast<uint32_t>(t % RingCapacity());
+        }
+
+        tracy_force_inline static int64_t Distance(uint64_t begin, uint64_t end)
+        {
+            return static_cast<int64_t>(end - begin);
+        }
+
+        tracy_force_inline uint64_t NextQueryId()
+        {
+            const uint64_t ticket = m_queryCounter.fetch_add(2, std::memory_order_relaxed);
+            if (Distance(m_previousCheckpoint, ticket)
+                >= static_cast<int64_t>(RingCapacity()))
+            {
+                TracyWebGPULog(Warning, "Too many pending GPU queries: stalling!");
+                Collect();
+            }
+            return ticket;
+        }
+    };
+
+    class WebGPUZoneScope
+    {
+        const bool m_active;
+        WebGPUQueueCtx* m_ctx = nullptr;
+        WGPUCommandEncoder m_encoder = nullptr;
+        uint64_t m_rawTicket = 0;
+        uint32_t m_queryId = 0;
+
+        WGPUPassTimestampWrites m_timestampWrites = {};
+
+        void ResolveQueryBatch(uint32_t queryBatchStartId)
+        {
+            // Ensure there are pending queries to resolve in the batch
+            auto& stage = m_ctx->m_readbackReel[m_ctx->m_writeIdx];
+            if (WebGPUQueueCtx::Distance(stage.copiedUpto, m_rawTicket) <= 0) return;
+
+            // 32 queries = 32 * 8 bytes = 256 bytes
+            TracyWebGPUAssert(queryBatchStartId % 32 == 0, return);
+            queryBatchStartId = m_ctx->RingIndex(queryBatchStartId);
+
+            const uint64_t blockOffset = static_cast<uint64_t>(queryBatchStartId) * sizeof(uint64_t);
+            wgpuCommandEncoderResolveQuerySet(
+                m_encoder,
+                m_ctx->m_querySet,
+                queryBatchStartId, 32,
+                m_ctx->m_resolveBuffer,
+                blockOffset // MUST be a multiple of (aligned to) 256...
+            );
+
+            auto readbackBuffer = stage.buffer;
+            wgpuCommandEncoderCopyBufferToBuffer(
+                m_encoder,
+                m_ctx->m_resolveBuffer,
+                blockOffset,
+                readbackBuffer,
+                blockOffset,
+                32 * sizeof(uint64_t)
+            );
+
+            // Advance this stage's high-water mark to cover the block just encoded.
+            // TODO: maybe we can use fetch_add to increment the atomic and not need
+            // to keep track of the raw ticket; Collect would need to derive the raw
+            // end ticket number.
+            const uint64_t blockEnd = m_rawTicket;
+            uint64_t prev = stage.copiedUpto;
+            while ((WebGPUQueueCtx::Distance(prev, blockEnd) > 0) &&
+                   !stage.copiedUpto.compare_exchange_weak(prev, blockEnd)) {}
+            TracyWebGPUDebug( fprintf(stdout, "[TWG] WebGPUZoneScope [%d] (%d,%d)\n", (int)m_ctx->m_writeIdx, queryBatchStartId, queryBatchStartId+32) );
+        }
+
+        tracy_force_inline void WriteQueueItem(const SourceLocationData* srcLocation, int32_t callstackDepth, uint32_t sourceLine, const char* sourceFile, size_t sourceFileLen, const char* functionName, size_t functionNameLen, const char* zoneName, size_t zoneNameLen)
+        {
+            if (!m_active) return;
+
+            const bool captureCallstack = callstackDepth > 0 && has_callstack();
+            const bool transientZone = srcLocation == nullptr;
+            uint64_t srcLocationAddr = reinterpret_cast<uint64_t>(srcLocation);
+
+            QueueItem* item = nullptr;
+            QueueType itemType;
+            if (transientZone)
+            {
+                srcLocationAddr = Profiler::AllocSourceLocation(sourceLine, sourceFile, sourceFileLen, functionName, functionNameLen, zoneName, zoneNameLen);
+                if (captureCallstack)
+                {
+                    item = Profiler::QueueSerialCallstack(Callstack(callstackDepth));
+                    itemType = QueueType::GpuZoneBeginAllocSrcLocCallstackSerial;
+                }
+                else
+                {
+                    item = Profiler::QueueSerial();
+                    itemType = QueueType::GpuZoneBeginAllocSrcLocSerial;
+                }
+            }
+            else
+            {
+                if (captureCallstack)
+                {
+                    item = Profiler::QueueSerialCallstack(Callstack(callstackDepth));
+                    itemType = QueueType::GpuZoneBeginCallstackSerial;
+                }
+                else
+                {
+                    item = Profiler::QueueSerial();
+                    itemType = QueueType::GpuZoneBeginSerial;
+                }
+            }
+
+            MemWrite(&item->hdr.type, itemType);
+            MemWrite(&item->gpuZoneBegin.cpuTime, Profiler::GetTime());
+            MemWrite(&item->gpuZoneBegin.srcloc, srcLocationAddr);
+            MemWrite(&item->gpuZoneBegin.thread, GetThreadHandle());
+            MemWrite(&item->gpuZoneBegin.queryId, static_cast<uint16_t>(m_queryId));
+            MemWrite(&item->gpuZoneBegin.context, m_ctx->GetId());
+            Profiler::QueueSerialFinish();
+        }
+
+        // Fills in m_timestampWrites and assigns its address to passDesc.timestampWrites.
+        // Works with both WGPURenderPassDescriptor and WGPUComputePassDescriptor.
+        template<typename PassDescriptor>
+        tracy_force_inline void InitBase(WebGPUQueueCtx* ctx, WGPUCommandEncoder encoder, PassDescriptor& passDesc)
+        {
+            m_ctx     = ctx;
+            m_encoder = encoder;
+
+            m_rawTicket = m_ctx->NextQueryId();
+            m_queryId   = m_ctx->RingIndex(m_rawTicket);
+
+            m_timestampWrites.querySet                  = m_ctx->m_querySet;
+            m_timestampWrites.beginningOfPassWriteIndex = m_queryId;
+            m_timestampWrites.endOfPassWriteIndex       = m_queryId + 1;
+            passDesc.timestampWrites                    = &m_timestampWrites;
+        }
+
+    public:
+        template<typename PassDescriptor>
+        tracy_force_inline WebGPUZoneScope(WebGPUQueueCtx* ctx, WGPUCommandEncoder encoder, PassDescriptor& passDesc, const SourceLocationData* srcLocation, bool active)
+#ifdef TRACY_ON_DEMAND
+            : m_active(active && GetProfiler().IsConnected())
+#else
+            : m_active(active)
+#endif
+        {
+            if (!m_active || !ctx) return;
+            InitBase(ctx, encoder, passDesc);
+            WriteQueueItem(srcLocation, 0, 0, nullptr, 0, nullptr, 0, nullptr, 0);
+        }
+
+        template<typename PassDescriptor>
+        tracy_force_inline WebGPUZoneScope(WebGPUQueueCtx* ctx, WGPUCommandEncoder encoder, PassDescriptor& passDesc, const SourceLocationData* srcLocation, int32_t depth, bool active)
+#ifdef TRACY_ON_DEMAND
+            : m_active(active && GetProfiler().IsConnected())
+#else
+            : m_active(active)
+#endif
+        {
+            if (!m_active || !ctx) return;
+            InitBase(ctx, encoder, passDesc);
+            WriteQueueItem(srcLocation, depth, 0, nullptr, 0, nullptr, 0, nullptr, 0);
+        }
+
+        template<typename PassDescriptor>
+        tracy_force_inline WebGPUZoneScope(WebGPUQueueCtx* ctx, uint32_t line, const char* source, size_t sourceSz, const char* function, size_t functionSz, const char* name, size_t nameSz, WGPUCommandEncoder encoder, PassDescriptor& passDesc, bool active)
+#ifdef TRACY_ON_DEMAND
+            : m_active(active && GetProfiler().IsConnected())
+#else
+            : m_active(active)
+#endif
+        {
+            if (!m_active || !ctx) return;
+            InitBase(ctx, encoder, passDesc);
+            WriteQueueItem(nullptr, 0, line, source, sourceSz, function, functionSz, name, nameSz);
+        }
+
+        template<typename PassDescriptor>
+        tracy_force_inline WebGPUZoneScope(WebGPUQueueCtx* ctx, uint32_t line, const char* source, size_t sourceSz, const char* function, size_t functionSz, const char* name, size_t nameSz, WGPUCommandEncoder encoder, PassDescriptor& passDesc, int32_t depth, bool active)
+#ifdef TRACY_ON_DEMAND
+            : m_active(active && GetProfiler().IsConnected())
+#else
+            : m_active(active)
+#endif
+        {
+            if (!m_active || !ctx) return;
+            InitBase(ctx, encoder, passDesc);
+            WriteQueueItem(nullptr, depth, line, source, sourceSz, function, functionSz, name, nameSz);
+        }
+
+        tracy_force_inline ~WebGPUZoneScope()
+        {
+            if (!m_active || !m_ctx) return;
+
+            const auto queryId = m_queryId + 1;
+
+            auto* item = Profiler::QueueSerial();
+            MemWrite(&item->hdr.type, QueueType::GpuZoneEndSerial);
+            MemWrite(&item->gpuZoneEnd.cpuTime, Profiler::GetTime());
+            MemWrite(&item->gpuZoneEnd.thread, GetThreadHandle());
+            MemWrite(&item->gpuZoneEnd.queryId, static_cast<uint16_t>(queryId));
+            MemWrite(&item->gpuZoneEnd.context, m_ctx->GetId());
+            Profiler::QueueSerialFinish();
+
+            if (m_queryId % 32 == 0)
+                ResolveQueryBatch(m_queryId-32);
+        }
+    };
+
+    static inline void DestroyWebGPUContext(WebGPUQueueCtx* ctx)
+    {
+        if (!ctx) return;
+        ctx->~WebGPUQueueCtx();
+        tracy_free(ctx);
+    }
+
+    static inline WebGPUQueueCtx* CreateWebGPUContext(WGPUInstance instance, WGPUDevice device, WGPUQueue queue)
+    {
+        auto* ctx = static_cast<WebGPUQueueCtx*>(tracy_malloc(sizeof(WebGPUQueueCtx)));
+        new (ctx) WebGPUQueueCtx{ instance, device, queue };
+        if (ctx->GetId() == 255)
+        {
+            DestroyWebGPUContext(ctx);
+            return nullptr;
+        }
+        return ctx;
+    }
+
+}
+
+#undef TracyWebGPUPanic
+#undef TracyWebGPULog
+#undef TracyWebGPUAssert
+#undef TracyWebGPUBreak
+#undef TracyWebGPUDebug
+#undef TRACY_WEBGPU_DEBUG_LEVEL
+
+using TracyWebGPUCtx = tracy::WebGPUQueueCtx*;
+
+#define TracyWebGPUSetupDeviceDescriptor(deviceDescriptor) tracy::WebGPUQueueCtx::Requirements TracyConcat(__tracy_wgpu_setup_, TracyLine); TracyConcat(__tracy_wgpu_setup_, TracyLine).ApplyToDeviceDescriptor(deviceDescriptor)
+
+#define TracyWebGPUContext(instance, device, queue) tracy::CreateWebGPUContext(instance, device, queue);
+#define TracyWebGPUDestroy(ctx) tracy::DestroyWebGPUContext(ctx);
+#define TracyWebGPUContextName(ctx, name, size) if (ctx) ctx->Name(name, size);
+
+#define TracyWebGPUUnnamedZone ___tracy_gpu_webgpu_zone
+#define TracyWebGPUSrcLocSymbol TracyConcat(__tracy_webgpu_source_location,TracyLine)
+#define TracyWebGPUSrcLocObject(name, color) static constexpr tracy::SourceLocationData TracyWebGPUSrcLocSymbol { name, TracyFunction, TracyFile, (uint32_t)TracyLine, color };
+
+#if defined TRACY_HAS_CALLSTACK && defined TRACY_CALLSTACK
+#  define TracyWebGPUZone(ctx, encoder, passDesc, name) TracyWebGPUNamedZoneS(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, TRACY_CALLSTACK, true)
+#  define TracyWebGPUZoneC(ctx, encoder, passDesc, name, color) TracyWebGPUNamedZoneCS(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, color, TRACY_CALLSTACK, true)
+#  define TracyWebGPUNamedZone(ctx, varname, encoder, passDesc, name, active) TracyWebGPUSrcLocObject(name, 0); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, TRACY_CALLSTACK, active };
+#  define TracyWebGPUNamedZoneC(ctx, varname, encoder, passDesc, name, color, active) TracyWebGPUSrcLocObject(name, color); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, TRACY_CALLSTACK, active };
+#  define TracyWebGPUZoneTransient(ctx, varname, encoder, passDesc, name, active) TracyWebGPUZoneTransientS(ctx, varname, encoder, passDesc, name, TRACY_CALLSTACK, active)
+#else
+#  define TracyWebGPUZone(ctx, encoder, passDesc, name) TracyWebGPUNamedZone(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, true)
+#  define TracyWebGPUZoneC(ctx, encoder, passDesc, name, color) TracyWebGPUNamedZoneC(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, color, true)
+#  define TracyWebGPUNamedZone(ctx, varname, encoder, passDesc, name, active) TracyWebGPUSrcLocObject(name, 0); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, active };
+#  define TracyWebGPUNamedZoneC(ctx, varname, encoder, passDesc, name, color, active) TracyWebGPUSrcLocObject(name, color); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, active };
+#  define TracyWebGPUZoneTransient(ctx, varname, encoder, passDesc, name, active) tracy::WebGPUZoneScope varname{ ctx, TracyLine, TracyFile, strlen(TracyFile), TracyFunction, strlen(TracyFunction), name, strlen(name), encoder, passDesc, active };
+#endif
+
+#ifdef TRACY_HAS_CALLSTACK
+#  define TracyWebGPUZoneS(ctx, encoder, passDesc, name, depth) TracyWebGPUNamedZoneS(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, depth, true)
+#  define TracyWebGPUZoneCS(ctx, encoder, passDesc, name, color, depth) TracyWebGPUNamedZoneCS(ctx, TracyWebGPUUnnamedZone, encoder, passDesc, name, color, depth, true)
+#  define TracyWebGPUNamedZoneS(ctx, varname, encoder, passDesc, name, depth, active) TracyWebGPUSrcLocObject(name, 0); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, depth, active };
+#  define TracyWebGPUNamedZoneCS(ctx, varname, encoder, passDesc, name, color, depth, active) TracyWebGPUSrcLocObject(name, color); tracy::WebGPUZoneScope varname{ ctx, encoder, passDesc, &TracyWebGPUSrcLocSymbol, depth, active };
+#  define TracyWebGPUZoneTransientS(ctx, varname, encoder, passDesc, name, depth, active) tracy::WebGPUZoneScope varname{ ctx, TracyLine, TracyFile, strlen(TracyFile), TracyFunction, strlen(TracyFunction), name, strlen(name), encoder, passDesc, depth, active };
+#else
+#  define TracyWebGPUZoneS(ctx, encoder, passDesc, name, depth) TracyWebGPUZone(ctx, encoder, passDesc, name)
+#  define TracyWebGPUZoneCS(ctx, encoder, passDesc, name, color, depth) TracyWebGPUZoneC(ctx, encoder, passDesc, name, color)
+#  define TracyWebGPUNamedZoneS(ctx, varname, encoder, passDesc, name, depth, active) TracyWebGPUNamedZone(ctx, varname, encoder, passDesc, name, active)
+#  define TracyWebGPUNamedZoneCS(ctx, varname, encoder, passDesc, name, color, depth, active) TracyWebGPUNamedZoneC(ctx, varname, encoder, passDesc, name, color, active)
+#  define TracyWebGPUZoneTransientS(ctx, varname, encoder, passDesc, name, depth, active) TracyWebGPUZoneTransient(ctx, varname, encoder, passDesc, name, active)
+#endif
+
+#define TracyWebGPUCollect(ctx) if (ctx) ctx->Collect();
+
+#endif
+
+#endif
--- a/python/bindings/ServerModule.cpp
+++ b/python/bindings/ServerModule.cpp
@@ -1033,14 +1033,15 @@ PYBIND11_MODULE( TracyServerBindings, m )
        // --- GPU contexts ---
        .def( "get_gpu_contexts", []( const Worker& w ) {
        static const char* gpuTypeStr[] = {
-            "Invalid", "OpenGL", "Vulkan", "OpenCL", "Direct3D12", "Direct3D11", "Metal", "Custom", "CUDA", "Rocprof" };
+            "Invalid", "OpenGL", "Vulkan", "OpenCL", "Direct3D12", "Direct3D11", "Metal", "Custom", "CUDA", "Rocprof", "WebGPU" };
+        static size_t numTypes = sizeof(gpuTypeStr) / sizeof(gpuTypeStr[0]);
        std::vector<GpuContextSummary> result;
        for( const auto* ctx : w.GetGpuData() )
        {
            if( !ctx ) continue;
            const std::string name = ctx->name.Active() ? w.GetString( ctx->name ) : "";
            const uint8_t typeIdx = (uint8_t)ctx->type;
-            const char* typeStr = typeIdx < 10 ? gpuTypeStr[typeIdx] : "Unknown";
+            const char* typeStr = typeIdx < numTypes ? gpuTypeStr[typeIdx] : "Unknown";
            result.push_back( GpuContextSummary{
                name, ctx->count, std::string( typeStr ), ctx->thread } );
        }
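Note on the hunk above: the hard-coded bound `10` is replaced by a bound computed from the table itself, so adding "WebGPU" as an eleventh entry cannot leave the check out of date. A minimal standalone sketch of the same pattern follows. The names `kGpuTypeStr`, `kNumTypes` and `GpuTypeName` are hypothetical and do not appear in the bindings, and `constexpr` is used here where the diff uses a plain `static size_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Same table as in the diff, including the new "WebGPU" entry.
static constexpr const char* kGpuTypeStr[] = {
    "Invalid", "OpenGL", "Vulkan", "OpenCL", "Direct3D12", "Direct3D11",
    "Metal", "Custom", "CUDA", "Rocprof", "WebGPU" };

// The bound is derived from the array, so it follows the table automatically.
static constexpr size_t kNumTypes = sizeof(kGpuTypeStr) / sizeof(kGpuTypeStr[0]);

// Hypothetical helper: maps a raw type index to its name, or "Unknown"
// when the index is past the end of the table.
inline std::string_view GpuTypeName(uint8_t typeIdx)
{
    return typeIdx < kNumTypes ? kGpuTypeStr[typeIdx] : "Unknown";
}
```

With the old fixed bound of 10, index 10 ("WebGPU") would have been reported as "Unknown" even though the table contains it.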
--- a/server/TracyWorker.cpp
+++ b/server/TracyWorker.cpp
@@ -3137,6 +3137,7 @@ void Worker::DispatchFailure( const QueueItem& ev, const char*& ptr )
    }
    else
    {
+        uint8_t sz8;
        uint16_t sz;
        switch( ev.hdr.type )
        {
@@ -3144,6 +3145,7 @@ void Worker::DispatchFailure( const QueueItem& ev, const char*& ptr )
            ptr += sizeof( QueueHeader );
            memcpy( &sz, ptr, sizeof( sz ) );
            ptr += sizeof( sz );
+            sz += ProtocolOffset8Bit;
            AddSingleStringFailure( ptr, sz );
            ptr += sz;
            break;
@@ -3151,9 +3153,24 @@ void Worker::DispatchFailure( const QueueItem& ev, const char*& ptr )
            ptr += sizeof( QueueHeader );
            memcpy( &sz, ptr, sizeof( sz ) );
            ptr += sizeof( sz );
+            sz += ProtocolOffset8Bit;
            AddSecondString( ptr, sz );
            ptr += sz;
            break;
+        case QueueType::SingleStringData8:
+            ptr += sizeof( QueueHeader );
+            memcpy( &sz8, ptr, sizeof( sz8 ) );
+            ptr += sizeof( sz8 );
+            AddSingleStringFailure( ptr, sz8 );
+            ptr += sz8;
+            break;
+        case QueueType::SecondStringData8:
+            ptr += sizeof( QueueHeader );
+            memcpy( &sz8, ptr, sizeof( sz8 ) );
+            ptr += sizeof( sz8 );
+            AddSecondString( ptr, sz8 );
+            ptr += sz8;
+            break;
        default:
            ptr += QueueDataSize[ev.hdr.idx];
            switch( ev.hdr.type )
@@ -3337,6 +3354,7 @@ bool Worker::DispatchProcess( const QueueItem& ev, const char*& ptr )
    }
    else
    {
+        uint8_t sz8;
        uint16_t sz;
        switch( ev.hdr.type )
        {
@@ -3344,6 +3362,7 @@ bool Worker::DispatchProcess( const QueueItem& ev, const char*& ptr )
            ptr += sizeof( QueueHeader );
            memcpy( &sz, ptr, sizeof( sz ) );
            ptr += sizeof( sz );
+            sz += ProtocolOffset8Bit;
            AddSingleString( ptr, sz );
            ptr += sz;
            return true;
@@ -3351,9 +3370,24 @@ bool Worker::DispatchProcess( const QueueItem& ev, const char*& ptr )
            ptr += sizeof( QueueHeader );
            memcpy( &sz, ptr, sizeof( sz ) );
            ptr += sizeof( sz );
+            sz += ProtocolOffset8Bit;
            AddSecondString( ptr, sz );
            ptr += sz;
            return true;
+        case QueueType::SingleStringData8:
+            ptr += sizeof( QueueHeader );
+            memcpy( &sz8, ptr, sizeof( sz8 ) );
+            ptr += sizeof( sz8 );
+            AddSingleString( ptr, sz8 );
+            ptr += sz8;
+            return true;
+        case QueueType::SecondStringData8:
+            ptr += sizeof( QueueHeader );
+            memcpy( &sz8, ptr, sizeof( sz8 ) );
+            ptr += sizeof( sz8 );
+            AddSecondString( ptr, sz8 );
+            ptr += sz8;
+            return true;
        default:
            ptr += QueueDataSize[ev.hdr.idx];
            return Process( ev );
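Note on the TracyWorker.cpp hunks above: both dispatch functions now handle two string encodings. The existing types read a 16-bit length and add `ProtocolOffset8Bit` to it, and the new `SingleStringData8` / `SecondStringData8` types read an 8-bit length and use it as-is. A minimal standalone sketch of that decoding step is below. `StringKind`, `ReadString` and `kOffset8Bit` are hypothetical names, and the value 256 is an assumption made for illustration only, since the real value of `ProtocolOffset8Bit` is not shown in this diff.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// ASSUMPTION: stand-in for ProtocolOffset8Bit; the real value is not in the diff.
constexpr uint16_t kOffset8Bit = 256;

// Hypothetical tag for the two length encodings handled by the dispatch code.
enum class StringKind : uint8_t { Len8, Len16 };

// Reads one length-prefixed string at ptr and advances ptr past it,
// mirroring the memcpy / advance / consume sequence in the diff.
inline std::string ReadString(StringKind kind, const char*& ptr)
{
    uint16_t sz;
    if (kind == StringKind::Len8)
    {
        uint8_t sz8;
        std::memcpy(&sz8, ptr, sizeof(sz8));
        ptr += sizeof(sz8);
        sz = sz8;                       // 8-bit length is used unchanged
    }
    else
    {
        std::memcpy(&sz, ptr, sizeof(sz));
        ptr += sizeof(sz);
        sz = static_cast<uint16_t>(sz + kOffset8Bit);  // stored length is biased
    }
    std::string out(ptr, sz);
    ptr += sz;
    return out;
}
```

Like the dispatch code, the sketch keeps the 16-bit length in a `uint16_t`, so the biased value wraps if the sum exceeds 65535. It also performs no bounds checking and trusts the sender's length field.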
Author	SHA1	Message	Date
Marcos Slomp	1de94aa856	add routine to check for GL features/extensions at run-time	2026-06-15 21:19:12 -07:00
Bartosz Taudul	ec1d5bd3d7	Merge pull request #1402 from wolfpld/slomp/webgpu-example-platform Switch webgpu example to SDL3, plus patch edge-case for wgpu-native	2026-06-15 23:48:14 +02:00
Marcos Slomp	69af195c98	edge-case bug-fix (could cause wgpu-native to panic)	2026-06-15 13:14:28 -07:00
Marcos Slomp	60699c4a92	fixing win32 builds with SDL3 + WebGPU	2026-06-15 13:14:28 -07:00
Marcos Slomp	cc45cf6046	switch to SDL3 (no cmake fetch, just find_package)	2026-06-15 13:14:28 -07:00
Bartosz Taudul	62560a6429	Add 8-bit length string transfers to the protocol.	2026-06-15 20:43:55 +02:00
Bartosz Taudul	f7b4e177ff	Change misleading etc1buf variable to texbuf.	2026-06-15 19:31:06 +02:00
Bartosz Taudul	084daf0516	Force inline send string strlen helpers.	2026-06-15 19:19:13 +02:00
Bartosz Taudul	a98956f2d9	Another typo.	2026-06-15 17:17:32 +02:00
Bartosz Taudul	ac6f0f88fa	Actually describe the message severity levels.	2026-06-15 17:09:42 +02:00
Bartosz Taudul	33fccb3530	Typos.	2026-06-15 17:08:39 +02:00
Bartosz Taudul	45576f6972	Merge pull request #1400 from wolfpld/slomp/gl-example adding OpenGL example (spinning triangle)	2026-06-14 21:39:54 +02:00
Marcos Slomp	17e13bc2e0	SDL2 -> SDL3	2026-06-14 12:18:06 -07:00
Marcos Slomp	ee0c73bf25	switch to SDL2 (no cmake fetch, just find_package)	2026-06-14 11:24:14 -07:00
Bartosz Taudul	343567a3f2	Regenerate markdown manual.	2026-06-14 17:31:49 +02:00
Bartosz Taudul	20b3535623	Use fancy quotes in the manual.	2026-06-14 17:31:32 +02:00
Bartosz Taudul	5298316480	Revert emscripten back to 5.0.7. There are threading problems with 6.0.0. Specifically, click on the red power off button to go back to the welcome screen, and the cleanup popup never goes away.	2026-06-14 16:24:13 +02:00
Bartosz Taudul	83719fb29b	WASM_BIGINT is enabled by default since emscripten 4.0.0.	2026-06-14 15:17:32 +02:00
Bartosz Taudul	f7d789eddb	Split emscripten link options to multiple lines.	2026-06-14 15:09:40 +02:00
Bartosz Taudul	3816b2485e	Bump used emscripten version to 6.0.0.	2026-06-14 15:06:53 +02:00
Bartosz Taudul	f8aa88d522	Explicitly disable shared libs for md4c. Fixes emscripten build.	2026-06-14 15:06:36 +02:00
Bartosz Taudul	b5ae187f76	Disable separate fast model by default.	2026-06-12 22:20:47 +02:00
Marcos Slomp	3f203806e2	X11 workaround check	2026-06-12 13:00:33 -07:00
Bartosz Taudul	15c6b49de2	Mark text embeds as TEXT.	2026-06-12 21:43:51 +02:00
Bartosz Taudul	a153f3a562	Extend Embed macro to support TEXT parameter enabling CRLF to LF conversion.	2026-06-12 21:43:12 +02:00
Bartosz Taudul	c2998310cf	Add CRLF to LF conversion support to embed.	2026-06-12 21:42:45 +02:00
Bartosz Taudul	a43b74ed8f	Update NEWS.	2026-06-12 21:08:33 +02:00
Bartosz Taudul	d3047f8069	Fix memory discard + callstack. Bug (High Severity): Wrong queue type in MemDiscardCallstack In the callstack path of MemDiscardCallstack, the wrong queue type is sent: SendMemDiscard( QueueType::MemDiscard, thread, name ); Every other callstack variant correctly uses its callstack queue type (MemAllocCallstack, MemFreeCallstack, etc.), but this one uses the non-callstack type. The SendMemDiscard assertion at line 1026 confirms MemDiscardCallstack is a valid value. Impact: The callstack captured by SendCallstackSerial() will be orphaned. The server processes the event via the non-callstack handler, leaving the callstack serial data unconsumed, which desynchronizes the serial queue and corrupts all subsequent events.	2026-06-12 20:30:59 +02:00
Bartosz Taudul	3804b2580a	Regenerate markdown manual.	2026-06-12 19:58:06 +02:00
Bartosz Taudul	329ac6c9f1	Document memory discard macro.	2026-06-12 19:57:45 +02:00
Bartosz Taudul	a091bb4ad2	Remove "secure" variant of alloc/free. Random crashes are not fun. Always use the "secure" code path.	2026-06-12 19:41:17 +02:00
Bartosz Taudul	86b5f43959	Provide proper test directory.	2026-06-12 19:23:28 +02:00
Marcos Slomp	39dc688340	adding Xrandr dependency	2026-06-12 08:44:46 -07:00
Marcos Slomp	832234838b	better comments and messages	2026-06-12 07:33:06 -07:00
Marcos Slomp	daba5acfbc	more explicit compiler warning message	2026-06-12 07:31:03 -07:00
Bartosz Taudul	07bfe3465e	Merge pull request #1356 from wolfpld/slomp/tracy-webgpu GPU: WebGPU back-end	2026-06-12 12:11:32 +02:00
Bartosz Taudul	0544440a34	Remove unused, extremely broken code.	2026-06-11 22:35:25 +02:00
Marcos Slomp	f287508772	addressing type conversion warning	2026-06-11 13:28:32 -07:00
Bartosz Taudul	f622b97436	Backdate init time when a producer token predates it. A zone emitted from a shared object initializer runs before the executable's constructors, so its timestamp precedes s_initTime, which the server uses as the trace epoch (baseTime). Such a zone converts to negative trace time and its end no longer satisfies IsEndValid(), which excludes it from statistics reconstruction and makes it render as never-ending. Record the current time when a producer token is created before s_initTime is constructed and use it as the init time, ensuring no event timestamp precedes the trace epoch.	2026-06-11 20:05:59 +02:00
Bartosz Taudul	dfded9d55d	Recover main thread producer orphaned by cross-module init order. ELF init_priority only orders constructors within a single module. All of a shared object's initializers run before any of the executable's, so an instrumented dependency .so emitting a zone from its static initializer creates the main thread producer token against the zero-initialized s_queue. The queue constructor then resets the producer list, orphaning that producer: every zone emitted on the main thread from that point on is enqueued into blocks no consumer ever iterates and silently lost, while sampling (worker thread producer) keeps working. Re-link such a producer right after the queue is constructed. In the common case, where nothing was emitted during shared object init, this merely constructs the main thread token eagerly.	2026-06-11 20:05:57 +02:00
Marcos Slomp	a2555fbb33	fixing Windows/Linux build	2026-06-11 07:37:58 -07:00
Bartosz Taudul	7180ea381f	Merge pull request #1401 from Lectem/fix/win32-non-desktop `TRACY_WIN32_NO_DESKTOP` should use `GetVersionExW` explicitly.	2026-06-11 13:01:24 +02:00
Clément Grégoire	0c74658dd3	`TRACY_WIN32_NO_DESKTOP` should use `GetVersionExW` explicitly. Since we use `RTL_OSVERSIONINFOW` we need to use W version explicitly	2026-06-11 12:06:34 +02:00
Marcos Slomp	debda1df55	scoping the GpuCtx constructor	2026-06-10 18:57:22 -07:00
Marcos Slomp	d98608b022	issue a Tracy warning message when timestamp queries are supported but not properly implemented	2026-06-10 18:52:57 -07:00
Marcos Slomp	eb88c6eba0	adding warning about TracyOpenGL usage on Apple devices	2026-06-10 18:52:09 -07:00
Marcos Slomp	e83429c926	replacing the various platform layers by RGFW	2026-06-10 18:38:48 -07:00
Bartosz Taudul	cc091a99a2	Support key modifiers on emscripten.	2026-06-10 23:39:08 +02:00
Marcos Slomp	1b207d3e2a	adding OpenGL example (spinning triangle)	2026-06-10 14:14:54 -07:00
Bartosz Taudul	f89709e99e	Prevent click-through when activating annotation.	2026-06-10 22:50:45 +02:00
Bartosz Taudul	a4c5f15312	Rewrite annotations drawing.	2026-06-10 22:42:05 +02:00
Bartosz Taudul	3455fd9f82	Fix regression making existing annotations non-editable after trace load.	2026-06-10 21:48:12 +02:00
Marcos Slomp	cfc046abcd	refactoring requirements	2026-06-10 11:09:51 -07:00
Marcos Slomp	0d848c3042	proper device descriptor chaining	2026-06-10 08:03:18 -07:00
Marcos Slomp	54270d3fd5	move window to top when launching from console	2026-06-10 06:24:53 -07:00
Marcos Slomp	1341f98c61	cleanup	2026-06-10 06:24:32 -07:00
Marcos Slomp	6fc279eef4	more descriptive API name	2026-06-10 06:23:57 -07:00
Marcos Slomp	28d3a91980	more changes to allow for null context	2026-06-09 16:26:51 -07:00
Marcos Slomp	0fbb2eaaa4	typo	2026-06-09 16:00:43 -07:00
Marcos Slomp	b27dab4584	remove "spontaneous" callback (better determinism)	2026-06-09 15:59:38 -07:00
Marcos Slomp	75bee5370f	cosmetics	2026-06-09 15:58:24 -07:00
Marcos Slomp	e7499458e9	allow scoped instrumentation to no-op with null context	2026-06-09 15:58:06 -07:00
Marcos Slomp	958cb8d7f8	WGPU_PATH fix	2026-06-09 12:56:34 -07:00
Marcos Slomp	59f17794a5	fixing MemWrite casts	2026-06-09 09:06:48 -07:00
Marcos Slomp	3b2c7dbacb	fixing webgpu lib linkage based on WGPU_PATH	2026-06-09 09:06:48 -07:00
Marcos Slomp	56ed480ed2	relocating webgpu example	2026-06-09 09:06:48 -07:00
Marcos Slomp	0572c86551	Wayland woes...	2026-06-09 09:06:48 -07:00
Marcos Slomp	6499e3383b	fix Linux build	2026-06-09 09:06:48 -07:00
Marcos Slomp	8278ace0c1	build fix	2026-06-09 09:06:48 -07:00
Marcos Slomp	5981eca141	adding webgpu example/demo	2026-06-09 09:06:48 -07:00
Marcos Slomp	1b2856b885	GPU context name	2026-06-09 09:06:48 -07:00
Marcos Slomp	118f18cf4b	updating docs	2026-06-09 09:06:48 -07:00
Marcos Slomp	bfbc1d3bee	missing interface, and more debugging	2026-06-09 09:06:48 -07:00
Marcos Slomp	831779508f	minor fixes/comments	2026-06-09 09:06:48 -07:00
Marcos Slomp	286309af3f	refactoring calibration estimations	2026-06-09 09:06:47 -07:00
Marcos Slomp	3db70a2237	refactoring	2026-06-09 09:06:47 -07:00
Marcos Slomp	da952f3f38	more refactoring	2026-06-09 09:06:47 -07:00
Marcos Slomp	efba4685ef	more cleanup and refactoring	2026-06-09 09:06:47 -07:00
Marcos Slomp	598984c45d	refactoring initial calibration	2026-06-09 09:06:47 -07:00
Marcos Slomp	860011c604	calibration stability	2026-06-09 09:06:47 -07:00
Marcos Slomp	0cdcbfc75d	refactoring query resolve	2026-06-09 09:06:47 -07:00
Marcos Slomp	e5d4be95df	getting rid of spontaneous callbacks	2026-06-09 09:06:47 -07:00
Marcos Slomp	7b3863d93d	redesign...	2026-06-09 09:06:47 -07:00
Marcos Slomp	de2a18d964	initial prototype for WebGPU back-end	2026-06-09 09:06:47 -07:00