Without this, using the popup is quite unintuitive: setting the range
apparently has no effect, because ranges are only shown if the ranges
window (or the window appropriate for each of the ranges) is open.
eval_guide.md referenced a tracy://catalog resource that was never
registered (only tracy://prompt and tracy://eval-guide exist), so an
agent following the guide would try to read a nonexistent resource.
Remove the references; the worked snippets the catalog described are
already inlined under "Common query patterns".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016EvfzHvUsDBSAwEzTfLTtA
Bind Worker::GetSections() as get_sections() so the TracySectionEnter /
TracySectionLeave instrumentation added to the client is reachable from
the MCP eval tool's ctx object. Returns a list of {start, end, text}
dicts with nanosecond timestamps, matching the existing list-of-dict
accessor convention. Documented in eval_guide.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016EvfzHvUsDBSAwEzTfLTtA
Tracy requires some textures to have repeat wrapping mode set. The ImGui
implementation of samplers doesn't make it easy to achieve. Disable use
of samplers and rely on texture flags, as done originally.
Bug (High Severity): Wrong queue type in MemDiscardCallstack
In the callstack path of MemDiscardCallstack, the wrong queue type is
sent:
SendMemDiscard( QueueType::MemDiscard, thread, name );
Every other callstack variant correctly uses its callstack queue type
(MemAllocCallstack, MemFreeCallstack, etc.), but this one uses the
non-callstack type. The SendMemDiscard assertion at line 1026 confirms
MemDiscardCallstack is a valid value.
Impact: The callstack captured by SendCallstackSerial() will be orphaned.
The server processes the event via the non-callstack handler, leaving the
callstack serial data unconsumed, which desynchronizes the serial queue
and corrupts all subsequent events.
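A minimal sketch of the corrected callstack path. `QueueType`, `SendCallstackSerial()` and `SendMemDiscard()` are simplified stubs, and `MemDiscardCallstackSketch` is a hypothetical name, not Tracy's actual code. The sketch shows that the event following the serialized callstack must carry the callstack queue type, so the server consumes the pending callstack.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for Tracy's queue type enumeration (hypothetical subset).
enum class QueueType : uint8_t { MemDiscard, MemDiscardCallstack };

static QueueType g_lastSent = QueueType::MemDiscard;  // records what the stub "sent"
static int g_pendingCallstacks = 0;                    // callstacks waiting for a consumer

// Stub: queue a callstack that the next *Callstack event is expected to consume.
static void SendCallstackSerial( int /*depth*/ ) { ++g_pendingCallstacks; }

// Stub mirroring the assertion mentioned above: both queue types are valid here.
static void SendMemDiscard( QueueType type, uint32_t /*thread*/, const char* /*name*/ )
{
    assert( type == QueueType::MemDiscard || type == QueueType::MemDiscardCallstack );
    g_lastSent = type;
    // Only the callstack variant makes the server consume the queued callstack.
    if( type == QueueType::MemDiscardCallstack ) --g_pendingCallstacks;
}

static void MemDiscardCallstackSketch( const char* name, int depth, uint32_t thread )
{
    SendCallstackSerial( depth );
    // Fixed: use the callstack queue type, matching MemAllocCallstack / MemFreeCallstack.
    SendMemDiscard( QueueType::MemDiscardCallstack, thread, name );
}
```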
A zone emitted from a shared object initializer runs before the
executable's constructors, so its timestamp precedes s_initTime, which
the server uses as the trace epoch (baseTime). Such a zone converts to
negative trace time and its end no longer satisfies IsEndValid(), which
excludes it from statistics reconstruction and makes it render as
never-ending.
Record the current time when a producer token is created before
s_initTime is constructed and use it as the init time, ensuring no event
timestamp precedes the trace epoch.
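A minimal sketch of the idea. `s_earlyTokenTime`, `OnProducerTokenCreated` and `InitTimeSketch` are hypothetical names, and `std::chrono::steady_clock` stands in for Tracy's own timer. A constant-initialized variable is valid even before any constructor runs, so it can capture the first producer-token timestamp. The epoch object then adopts that timestamp when it is present.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical timer stand-in; Tracy uses its own high-resolution clock.
static int64_t GetTimeSketch()
{
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now().time_since_epoch() ).count();
}

// Constant-initialized to zero, so usable before any dynamic initializer runs.
static int64_t s_earlyTokenTime = 0;
static bool s_initTimeConstructed = false;

// Called whenever a producer token is created.
static void OnProducerTokenCreated()
{
    // Remember the first token created before the epoch object exists.
    if( !s_initTimeConstructed && s_earlyTokenTime == 0 ) s_earlyTokenTime = GetTimeSketch();
}

struct InitTimeSketch
{
    int64_t val;
    // Prefer the early timestamp so no event can precede the trace epoch.
    InitTimeSketch() : val( s_earlyTokenTime != 0 ? s_earlyTokenTime : GetTimeSketch() )
    {
        s_initTimeConstructed = true;
    }
};
```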
ELF init_priority only orders constructors within a single module. All of
a shared object's initializers run before any of the executable's, so an
instrumented dependency .so emitting a zone from its static initializer
creates the main thread producer token against the zero-initialized
s_queue. The queue constructor then resets the producer list, orphaning
that producer: every zone emitted on the main thread from that point on
is enqueued into blocks no consumer ever iterates and silently lost,
while sampling (worker thread producer) keeps working.
Re-link such a producer right after the queue is constructed. In the
common case, where nothing was emitted during shared object init, this
merely constructs the main thread token eagerly.
* disabling LTO when building the profiler on macos via the github workflow
* diagnosing where in the linking stage it's getting stuck
* forward declarations
* compilation time report
* trying clang build analyzer...
* reducing number of parallel workers
* limiting parallel workers on windows/linux as well
* re-enabling LTO on macos
* reverting forward declaration header include (emscripten is failing with them).
* reverting act changes
* removing comments
Conform to the new repo layout (master moved repros under tests/, e.g.
tests/cuda/repro/graph). Relocate examples/RocprofOnDemandRepro to
tests/rocprof/repro/on_demand and replace the hand-written Makefile with
a CMake build mirroring the CUDA repro: builds the HIP reproducer, wires
it as a ctest target, and optionally builds the check_gpu_ctx_name
verification helper against the Tracy server library.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fixed batches of 1024 addresses could overflow the platform's command-line limit (cmd.exe on Windows fails with `La ligne de commande est trop longue.`, French for "The command line is too long.", since its limit is ~8191 characters). Build each batch by appending addresses until a length budget is reached instead. A single conservative budget of 8000 stays under the smallest limit on every platform, and keeps batches in the same ballpark as before (several hundred addresses per invocation).
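A minimal sketch of length-budgeted batching. It assumes each address is passed as a `0x`-prefixed hex argument preceded by one space, and that the fixed part of the command (tool path plus flags) counts against the same budget. `BatchAddresses` is a hypothetical helper name.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Split addresses into batches whose rendered argument list, plus the fixed
// command prefix, stays within `budget` characters.
static std::vector<std::vector<uint64_t>> BatchAddresses( const std::vector<uint64_t>& addrs,
                                                          size_t prefixLen, size_t budget = 8000 )
{
    std::vector<std::vector<uint64_t>> batches;
    std::vector<uint64_t> current;
    size_t len = prefixLen;
    for( uint64_t a : addrs )
    {
        char buf[32];
        // " 0x..." : one separating space plus the hex-formatted address.
        const size_t n = size_t( snprintf( buf, sizeof( buf ), " 0x%llx", (unsigned long long)a ) );
        if( !current.empty() && len + n > budget )
        {
            batches.push_back( std::move( current ) );
            current.clear();
            len = prefixLen;
        }
        current.push_back( a );
        len += n;
    }
    if( !current.empty() ) batches.push_back( std::move( current ) );
    return batches;
}
```

With 12-digit addresses (15 characters each) and a 100-character prefix, this yields 526 addresses per invocation.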
The new `-R` option of tracy-update sets every callstack frame back to `[unresolved]` / `[unknown]`. Since failed lookups leave frames untouched and the image-relative offset in `symAddr` survives patching, this makes it possible to chain several resolution passes over the same capture, each with different `-p` path substitutions (e.g. one pass per symbol directory).
The addr2line backend of tracy-update now builds on every platform, including Windows, and can be pointed at any addr2line-compatible executable:
- `-a`: path to a custom symbol resolution tool (e.g. `llvm-addr2line` or a cross-compilation toolchain's `addr2line`). Works on all platforms and takes precedence over the platform default (DbgHelp on Windows, the `addr2line` found in `PATH` elsewhere). Path-like values are validated up front so a wrong path fails with an actionable message instead of a cryptic, localized shell error.
- `-A`: extra arguments passed verbatim to the tool, e.g. `--relative-address` so `llvm-addr2line`/`llvm-symbolizer` accept the image-relative offsets Tracy records for images with a non-zero preferred base (PE, Mach-O).
- `-v`: verbose output while patching symbols.
OLDNAMES.lib may not be linked if you use /NODEFAULTLIB.
Note: This uses `_MSC_VER` as a gate rather than `_WIN32`, as MinGW apparently uses `fileno` and not `_fileno`.
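A minimal sketch of the gate. `TRACY_FILENO` is a hypothetical macro name used only for illustration. Under MSVC, the plain POSIX name resolves through OLDNAMES.lib, so the underscore name is called directly. Everything else, MinGW included per the note above, keeps `fileno`.

```cpp
#include <stdio.h>

// MSVC: `fileno` is an old-name alias supplied by OLDNAMES.lib, which
// /NODEFAULTLIB can drop, so call `_fileno` directly.
// Otherwise (including MinGW): use the POSIX `fileno`.
#ifdef _MSC_VER
#  define TRACY_FILENO( f ) _fileno( f )
#else
#  define TRACY_FILENO( f ) fileno( f )
#endif
```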
When zoomed in very far the panning resolution can be so small that it
is less than one unit. In order to continue panning, we store partial
pans so that they can accumulate across frames.
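A minimal sketch of the accumulator. It assumes the view position is stored in integer units (e.g. nanoseconds) and each frame's pan arrives as a floating-point delta. `PanState` and `ApplyPan` are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>

struct PanState
{
    int64_t viewStart = 0;   // integer view position (e.g. nanoseconds)
    double remainder = 0.0;  // sub-unit pan carried over between frames
};

// Apply a fractional pan: only whole units move the view, the rest is kept.
static void ApplyPan( PanState& s, double delta )
{
    s.remainder += delta;
    const double whole = std::trunc( s.remainder );  // truncation keeps the sign consistent
    s.viewStart += int64_t( whole );
    s.remainder -= whole;
}
```

Ten pans of 0.25 units move the view by two whole units, whereas truncating each delta on its own would never move it.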
Now that the ruler just shows the delta time across the view it doesn't
indicate where the view is currently looking.
The new position bar fills this role to allow orientating oneself.
It could be challenging to examine fine details within the flamegraph.
The flamegraph has been enhanced so that it allows zooming with the
mouse wheel, and then panning around with the right mouse button.
This provides a familiar experience to the timeline view.
When typing in e.g. "127.0.0.1", the first character "1" was accepted as a
valid address that does not immediately fail the connection attempt. As a
result, any further interaction with the UI (including completing the input)
was blocked by the "please wait" screen during the connection attempt.
Extension point so private/unsupported platforms can plug in their own implementations of the kernel/libc primitives Tracy depends on, without patching the `#if`/`#elif` chains.
Projects supply a platform header via `-DTRACY_PLATFORM_HEADER="\"my_platform.h\""` at build time. Tracy includes it in any TU that needs the hooks. The header toggles per-category `TRACY_HAS_CUSTOM_*` macros and declares matching `tracy::Platform*` functions.
Available hooks:
- `TRACY_HAS_CUSTOM_THREAD_ID` → `PlatformGetThreadId`
- `TRACY_HAS_CUSTOM_USER_INFO` → `PlatformGetHostname`, `PlatformGetUserLogin`, `PlatformGetUserFullName`
- `TRACY_HAS_CUSTOM_SAFE_COPY` → `PlatformSafeMemcpy`
- `TRACY_HAS_CUSTOM_ALLOCATOR` → `PlatformMalloc`, `PlatformFree`, `PlatformRealloc`, `PlatformAllocatorInit`, `PlatformAllocatorThreadInit`, `PlatformAllocatorFinalize`, `PlatformAllocatorThreadFinalize`
Each hook is wired as the first arm of its respective `#if`/`#elif` chain, so existing supported platforms are unaffected.
Template files in `examples/CustomPlatform/` and a new subsection in `manual/tracy.tex` document the mechanism.
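A minimal sketch of a platform header and of how a hook might sit as the first arm of a chain. The `uint32_t` return type of `PlatformGetThreadId`, the placeholder fallback arms and the `GetThreadIdSketch` wrapper are assumptions for illustration. The templates in `examples/CustomPlatform/` have the actual declarations.

```cpp
#include <cstdint>

// --- my_platform.h (passed via -DTRACY_PLATFORM_HEADER="\"my_platform.h\"") ---
#define TRACY_HAS_CUSTOM_THREAD_ID 1
namespace tracy
{
// Assumed signature; check the templates for the real one.
uint32_t PlatformGetThreadId();
}

// --- the project's implementation ---
namespace tracy
{
uint32_t PlatformGetThreadId()
{
    return 1234;  // placeholder; a real port would query its kernel here
}
}

// --- inside Tracy (sketch): the hook is checked before any built-in platform ---
static inline uint32_t GetThreadIdSketch()
{
#if defined TRACY_HAS_CUSTOM_THREAD_ID
    return tracy::PlatformGetThreadId();
#elif defined _WIN32
    return 0;  // existing Windows arm elided
#else
    return 0;  // other existing arms elided
#endif
}
```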
Move `tracy_set_option` and `tracy_set_option_value` from `CMakeLists.txt` into `cmake/options.cmake`. Add `tracy_set_option_value_as_string` for options whose value is embedded as a C string literal. All three accept an optional trailing target argument; when provided, the option is also propagated as a PUBLIC compile definition on that target.
Existing `set_option`/`set_option_value` are unchanged but will be replaced later by the `tracy_*` versions.
The function is about to dispatch between rpmalloc and a pluggable allocator hook, so the rpmalloc-specific name no longer fits. Pure rename plus a small consequence: the SymbolWorker call site no longer needs the TRACY_USE_RPMALLOC guard, since the no-op static-inline fallback in TracyAlloc.hpp makes InitAllocator() safe to call unconditionally.
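A minimal sketch of the fallback pattern described above. The declarations in the real `TracyAlloc.hpp` may differ, and `SymbolWorkerSketch` is a hypothetical stand-in for the call site. When no allocator needs initializing, `InitAllocator()` is an empty inline function, so callers need no `#ifdef`.

```cpp
#ifdef TRACY_USE_RPMALLOC
void InitAllocator();  // real initialization provided by the rpmalloc integration
#else
static inline void InitAllocator() {}  // no-op: safe to call unconditionally
#endif

// Call site (e.g. a worker thread entry) without a TRACY_USE_RPMALLOC guard.
static int SymbolWorkerSketch()
{
    InitAllocator();
    return 0;  // ... symbol resolution would follow here ...
}
```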
If as.ipMaxAsm.local is 0 and m_childCalls is false, GetHotnessColor(count, 0)
performs float(2 * count) / 0. The old code explicitly guarded against this
with the as.ipMaxAsm.local != 0 check.
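A minimal sketch of the guard. It assumes the hotness ratio is computed as `2 * count / maxCount`. `HotnessRatio` is a hypothetical helper, and the real `GetHotnessColor()` maps the ratio to a color rather than returning it.

```cpp
#include <cstdint>

// Returns the hotness ratio, or 0 when there is no maximum to normalize by,
// avoiding the float(2 * count) / 0 division described above.
static float HotnessRatio( uint32_t count, uint32_t maxCount )
{
    if( maxCount == 0 ) return 0.f;
    return float( 2 * count ) / maxCount;
}
```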
The "outdated" concept applies strictly to a chain of assistant replies with
nothing in between, i.e.:
"I will check this..." <- outdated
<tool call> <- not displayed
"Now I will do that..." <- outdated
<tool call> <- not displayed
"Let me consider..." <- outdated
<reasoning> <- not displayed
"Now I have the answer..."
The first three messages are at this point considered outdated, as the
model provided a more recent message.
Note that in a chain such as the one below there are NO outdated messages:
"How can I help..."
<user input>
"Ah, I see..."
<user input>
"You may try to..."
Similarly, if the tool calls or reasoning sections are explicitly enabled
in the chat UI, the messages are also not considered outdated.
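The rule above can be sketched as a backward scan. The `Kind` enum, the `showHidden` flag (standing in for the tool call / reasoning display toggles) and `MarkOutdated` are hypothetical names, not the chat UI's actual types.

```cpp
#include <cstddef>
#include <vector>

enum class Kind { Assistant, User, ToolCall, Reasoning };

// Returns, for each entry, whether it is an outdated assistant reply.
static std::vector<bool> MarkOutdated( const std::vector<Kind>& chat, bool showHidden )
{
    std::vector<bool> outdated( chat.size(), false );
    bool newerReplyInChain = false;  // a later assistant reply follows with nothing visible in between
    for( size_t i = chat.size(); i-- > 0; )
    {
        switch( chat[i] )
        {
        case Kind::Assistant:
            outdated[i] = newerReplyInChain;
            newerReplyInChain = true;
            break;
        case Kind::User:
            newerReplyInChain = false;  // user input breaks the chain
            break;
        case Kind::ToolCall:
        case Kind::Reasoning:
            if( showHidden ) newerReplyInChain = false;  // displayed sections break it too
            break;
        }
    }
    return outdated;
}
```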
Replace `#ifdef BSD` (which requires including `<sys/param.h>` first) with explicit checks for `__FreeBSD__`, `__NetBSD__`, `__OpenBSD__` and `__DragonFly__`, matching how these BSDs are already enumerated elsewhere in the codebase (OS name strings, thread id helpers, etc.).
This also avoids leaking the `sys/param.h` requirement through public headers (`TracySysTime.hpp`, `TracyCallstack.h`), where consumers would otherwise need it to correctly see `TRACY_HAS_SYSTIME` / `TRACY_HAS_CALLSTACK`.
`libbacktrace/config.h` is left as-is — it's third-party and only included from .c files where the `BSD` macro can still be picked up locally.
Note: for `setsockopt( m_sock, IPPROTO_IPV6, IPV6_V6ONLY, (const char*)&val, sizeof( val ) );` I added `__APPLE__` too, since this was the only place where it was not checked explicitly.
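A minimal sketch of the explicit check. `TRACY_SKETCH_IS_BSD` and `kIsBsd` are hypothetical names for illustration only. These macros are predefined by compilers targeting the respective systems, so unlike `BSD` they need no prior `<sys/param.h>` include.

```cpp
// Explicit per-BSD checks; no header has to be included first.
#if defined __FreeBSD__ || defined __NetBSD__ || defined __OpenBSD__ || defined __DragonFly__
#  define TRACY_SKETCH_IS_BSD 1
#else
#  define TRACY_SKETCH_IS_BSD 0
#endif

static constexpr bool kIsBsd = TRACY_SKETCH_IS_BSD != 0;
```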
This can happen notably when the user does not call ZoneEnd.
I used 256 arbitrarily as it seemed higher values would just make the UI freeze anyway due to perf reasons.
I added a warning in the notification area so that users can locate it.
Many of the zones would have a negative running time due to a missing `cs->IsEndValid()` check.
This could end up reporting context switches before the zone start, due to `cs->End()` returning -1.
This happened when systrace dropped events, or when using Fibers and `TracyFiberEnter` is called on the new thread once the fiber has been scheduled. (The manual actually does not really hint that this is wrong; we should probably fix the manual or the server code.)
In both cases, we assume the runtime of that context switch to be 0, since we have no actual information. Both options (counting full runtime or no runtime) are wrong, and most of the code handling `!cs->IsEndValid()` uses `Start` instead, so that's what I did. This is still a net improvement over displaying negative values. If we want to change this handling, we'd need to review the other places that do `it->IsEndValid() ? it->End() : it->Start()` as well.
It also seems two different concepts were being mixed:
1. Do we have any context switch data at all? (`it != ctx->v.end()`, i.e. `count != 0`)
2. Do we have complete data for the last context switch? (`eit != ctx->v.end()`)
This led to some places of the code not displaying or counting running time at all, notably when hovering a zone.
I think most of the time we wanted 1, as it reports correctly and assumes the last context switch is still running, which is a fair assumption if we didn't see one putting the thread to sleep.
I also fixed a case where we were overcounting runtime when range start was during a sleep.
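A minimal sketch of the zero-runtime fallback and range clipping described above. `CsSketch` (with `start`, `end` and `IsEndValid()`) and `RunningTime` are hypothetical stand-ins for Tracy's types. The sketch does not model the "last context switch still running" assumption. An invalid end falls back to `start`, so that entry contributes nothing. Each interval is clipped to the range, so a range that starts during a sleep does not count the sleep.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct CsSketch
{
    int64_t start;
    int64_t end;  // -1 when the switch-out was never seen
    bool IsEndValid() const { return end >= 0; }
};

// Sum the time the thread was running within [rangeStart, rangeEnd].
static int64_t RunningTime( const std::vector<CsSketch>& v, int64_t rangeStart, int64_t rangeEnd )
{
    int64_t total = 0;
    for( const auto& cs : v )
    {
        // Missing end: assume zero runtime, as other !IsEndValid() handlers do.
        const int64_t end = cs.IsEndValid() ? cs.end : cs.start;
        // Clip to the range; counting begins no earlier than the wake-up.
        const int64_t s = std::max( cs.start, rangeStart );
        const int64_t e = std::min( end, rangeEnd );
        if( e > s ) total += e - s;  // never negative
    }
    return total;
}
```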
- save_worker binding: wraps Worker::Write under
Worker::ObtainLockForMainThread() so live instances yield their
receive thread cooperatively for the save's duration — the same
pattern View::Save uses in the GUI.
- save_trace MCP tool: defaults to async_mode=True for multi-GB
traces; reuses the existing Task/executor machinery so callers
poll via the task tool. Path resolution mirrors load_capture.
- manual/tracy.tex: add save_trace bullet to the MCP tool list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cache is shared between image names and source file names, because the
underlying StringIdx storage makes indices unique. Both name sets should
be completely separate, but if you have conflicts here, you have much
more pressing problems to solve.
In terms of how much BuildFrameGraph execution time was spent in
IsFrameExternal:
1. no cache: 67%
2. global + shared_mutex: 84%
3. global + mutex: 80%
4. local: 41% (this commit)
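A minimal sketch of a call-local cache. It assumes the external/internal check can be keyed by a single string index because image-name and file-name indices never collide. `IsFrameExternalSlow`, `StringIdxSketch`, `CountExternal` and the `std::unordered_map` choice are hypothetical. A cache that lives only for one call needs no locking, which is consistent with it beating both global variants in the measurements above.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using StringIdxSketch = uint32_t;

static int g_slowCalls = 0;
// Stand-in for the expensive check (e.g. string matching on a path).
static bool IsFrameExternalSlow( StringIdxSketch idx )
{
    ++g_slowCalls;
    return idx % 2 == 1;  // placeholder rule for the sketch
}

// One cache per call: image-name and file-name indices share it safely
// because StringIdx values are unique.
static int CountExternal( const StringIdxSketch* frames, size_t n )
{
    std::unordered_map<StringIdxSketch, bool> cache;
    int external = 0;
    for( size_t i = 0; i < n; i++ )
    {
        auto it = cache.find( frames[i] );
        if( it == cache.end() ) it = cache.emplace( frames[i], IsFrameExternalSlow( frames[i] ) ).first;
        if( it->second ) external++;
    }
    return external;
}
```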
For consistency, always provide the OS-level sampling data, not the hardware
samples.
Always disable inline propagation, which is intended as a local area help
for a human. It does not make sense in the context of an LLM.
Two bugs fixed in the process:
1. X86_REG_BPL is now properly set (previously there were duplicate
X86_REG_BP entries).
2. maxLine is now properly calculated, instead of being set to the
last line value.
The end address is now readily available in lower_bound search, instead
of needing to be calculated constantly in the lambda.
The size can be equivalently calculated from the end address, but this
only happens once, after the symbol is found.
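A minimal sketch of searching by end address. It assumes symbols are sorted by address and do not overlap. `SymSketch` and `FindSymbol` are hypothetical names. The end address is stored, so the comparator does no arithmetic, and the size is derived once after a hit.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct SymSketch
{
    uint64_t start;
    uint64_t end;  // one past the last byte, stored instead of recomputing start + size
};

// Returns the symbol containing addr, or nullptr.
static const SymSketch* FindSymbol( const std::vector<SymSketch>& syms, uint64_t addr, uint64_t* size )
{
    // First symbol whose end lies beyond addr.
    auto it = std::lower_bound( syms.begin(), syms.end(), addr,
        []( const SymSketch& s, uint64_t a ) { return s.end <= a; } );
    if( it == syms.end() || addr < it->start ) return nullptr;
    *size = it->end - it->start;  // computed only once, after the symbol is found
    return &*it;
}
```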
Frames whose symbol data is shipped inline with the callstack payload
(sel=1, e.g. Lua-side stack entries) were being passed to
GetCanonicalPointer() in the AddCallstackAllocPayload() query loop,
tripping its sel==0 assertion. They have no native pointer to query
and were already registered in callstackFrameMap earlier in the same
function, so just skip them.
Regression from c704f909, which hoisted the per-call-site dedup into
QueryCallstackFrame(). Three of the four updated call sites were
equivalent before and after, because the old guard and the new one
keyed on the same value. The fourth, this one, was not: the old guard
tested the frame as-is and matched the entry inserted a few lines above,
short-circuiting before GetCanonicalPointer() ran. The new guard keys on
PackPointer(addr), so GetCanonicalPointer() must run first to compute
addr, and the assert fires.
The profiler will typically want to send bursts of queries (e.g. 3 queries
to retrieve source location strings, or multiple queries to get all the call
stack frames, etc.).
Each of these queries will be sent immediately, if available space in the
network buffer permits. Each of these sends is a separate syscall.
Remove this and instead batch all queries with the already existing network
buffer overflow handling functionality.
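A minimal sketch of the batching idea. `QueryBatcher`, `Queue` and `Flush` are hypothetical names, and a counter stands in for the socket send. Queries are appended to a buffer and sent together, either when the buffer would overflow or when the burst is done.

```cpp
#include <cstddef>
#include <vector>

class QueryBatcher
{
public:
    explicit QueryBatcher( size_t capacity ) : m_capacity( capacity ) { m_buf.reserve( capacity ); }

    void Queue( const void* data, size_t len )
    {
        // Only the overflow path sends mid-burst; otherwise just append.
        if( m_buf.size() + len > m_capacity ) Flush();
        const char* p = static_cast<const char*>( data );
        m_buf.insert( m_buf.end(), p, p + len );
    }

    void Flush()
    {
        if( m_buf.empty() ) return;
        ++m_sends;  // stand-in for one send() syscall covering the whole batch
        m_buf.clear();
    }

    size_t Sends() const { return m_sends; }

private:
    size_t m_capacity;
    size_t m_sends = 0;
    std::vector<char> m_buf;
};
```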
After investigating (downloading and installing) all publicly available SDKs at https://learn.microsoft.com/en-us/windows/apps/windows-sdk/downloads-archive I concluded the `TRACEHANDLE` deprecation started in `10.0.26100`.
This defines `PROCESSTRACE_HANDLE` and `CONTROLTRACE_ID` as the SDK does when using older versions. Using `WDK_NTDDI_VERSION` (and not `NTDDI_VERSION`, which may change based on `_WIN32_WINNT` or user input) seems to be the most reliable way to do it. While it says "WDK", it has been part of the SDK in `shared\sdkddkver.h`. Note it doesn't work for MinGW because it only updates half of its SDK files for some reason.
Tested with both 10.0.26100 and 10.0.22621.0 which is the last one I found without the new types.
Also changes CONTROLTRACE_ID to ULONG64 on mingw which is correct (type used by `TRACEHANDLE` too in mingw fe2763863a/mingw-w64-headers/include/evntrace.h (L60) )
* Add MCP server for AI-assisted trace analysis.
Introduce an optional Model Context Protocol (MCP) server that lets AI
assistants analyze Tracy captures and live sessions through Tracy's own
server engine. The server runs as a Python sidecar and talks to the
existing C++ analysis code through new pybind11 bindings.
- python/bindings/ServerModule.cpp: TracyServerBindings module exposing
Worker, file I/O, zones, GPU zones, frame data, plots, messages, locks,
source locations, and summary statistics (zone/GPU child stats, frame
timing, etc.).
- python/CMakeLists.txt: builds and installs TracyServerBindings alongside
TracyClientBindings.
- extra/mcp/tracy_mcp.py: FastMCP SSE singleton with dynamic port
discovery, PID-file based singleton detection, session-isolated worker
instances, synchronous and background eval, task polling, and a
shutdown tool to release the .pyd lock during development.
- extra/mcp/start_mcp.sh, .gitignore: launcher with local override hook;
ignores generated port/pid files.
- manual/tracy.md: documents building, running, and integrating the
server with an AI assistant.
* Improve Tracy MCP cold-start guidance.
Cold-start usability testing showed an LLM agent burned ~7 exploratory
calls discovering the ctx object model, time-unit conventions, and join
keys before producing useful analysis. Surface that information up front
through MCP resources and entry-point tool guidance.
- extra/mcp/eval_guide.md: new bindings-layer reference covering the
Worker object graph (zone / GPU zone / frame / thread / message /
plot / lock / memory entry points), nanosecond time units, ZoneStats
field semantics including self-time via get_child_zone_stats, the
opaque 'name (addr)[arch] <srcloc_id>' key format, and worked
examples translating common queries into ctx Python.
- extra/mcp/tracy_mcp.py: expose system.prompt.md and eval_guide.md as
MCP resources (tracy://prompt and tracy://eval-guide) so external
agents and Tracy Assist share the same guidance source. Resource
content is re-read per request — edits propagate without a server
restart.
- Point load_capture and live_connect return values plus the eval tool
description at the resources, so the agent reads them before its
first eval rather than introspecting blind.
- Expand load_capture docstring: name the path parameter explicitly,
show Windows path syntax, and direct agents to list_captures plus
TRACY_CAPTURES_DIR for capture discovery.
- Probe is_connected() briefly after Worker construction in
live_connect and surface an actionable error on silent handshake
failures (typically a Tracy client/server version mismatch or
TRACY_ON_DEMAND) instead of returning misleading success.
Reduces a fresh agent's cold-start overhead from 7 exploratory calls
to 4, where the remaining 4 are unavoidable harness/schema-fetch
overhead, not API-design friction.
* Detect Tracy protocol mismatches via UDP broadcast pre-flight.
Tracy clients announce themselves on UDP port 8086 every ~3 seconds with
a BroadcastMessage carrying the protocol version, listen port, and
program name (public/common/TracyProtocol.hpp). The Tracy GUI reads this
and refuses to attempt a TCP connection on protocol mismatch, surfacing
a precise error. live_connect previously had no equivalent check, so a
mismatch produced an opaque 2-second handshake timeout with no
diagnostic about what was wrong.
- Add a broadcast parser handling versions 0-3, with variable-length
programName (Tracy sends only the actual name + null terminator on
the wire, not the full 64-byte buffer).
- Add a non-blocking UDP listener that binds 8086 with SO_REUSEADDR
and waits up to 3.5s — enough to guarantee catching at least one
beat at the 3s broadcast cadence.
- Read our bindings' ProtocolVersion at startup by parsing
TracyProtocol.hpp, so the comparison stays in sync with the build
without new C++ wiring.
- live_connect runs the broadcast pre-flight before constructing
Worker. On a matched listen_port with a differing protocol_version,
it returns a single-line error naming the program, both versions,
and the remediation, without ever opening a TCP connection. If no
matching broadcast arrives, it falls through to the existing
handshake probe, which now reports any other broadcasts seen as a
hint (helpful when the target uses a non-default port).
* Add MCP Server section to LaTeX manual.
The markdown manual is auto-generated from the LaTeX source; add the
corresponding \subsection{MCP Server} so the two stay in sync.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Remove hand-written MCP section from tracy.md.
tracy.md is generated from tracy.tex via latex2md.sh. The MCP section
was previously written by hand directly in the markdown; now that the
LaTeX source has been updated, the markdown section should be
regenerated by running latex2md.sh rather than maintained manually.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous solution (75c173) didn't account for the fact that the text
to print may start with a space, in which case the text width calculation
results in 0. In effect, first word length was never greater than the
space left for printing, and the problem was still there.
Fix by walking through all initial spaces in firstWord.
If fwLen > left, then a line break is needed. In this case ignore initial
spaces in text.
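A minimal sketch of the fix. It assumes a fixed-width measure of one unit per character. `FirstWordLen` and `NextLineStart` are hypothetical helpers, not the actual text layout code. The first word's length includes any leading spaces, so it can no longer measure 0. When it exceeds the space left, a line break is taken and the leading spaces are dropped.

```cpp
#include <cstddef>
#include <string>

// Length of the first word including any leading spaces.
static size_t FirstWordLen( const std::string& text )
{
    size_t i = 0;
    while( i < text.size() && text[i] == ' ' ) i++;  // walk through initial spaces
    while( i < text.size() && text[i] != ' ' ) i++;  // then the word itself
    return i;
}

// Returns the text to continue printing with; sets *breakLine when the first
// word does not fit in `left`, in which case initial spaces are ignored.
static std::string NextLineStart( const std::string& text, size_t left, bool* breakLine )
{
    *breakLine = FirstWordLen( text ) > left;
    if( !*breakLine ) return text;
    const size_t firstNonSpace = text.find_first_not_of( ' ' );
    return firstNonSpace == std::string::npos ? std::string() : text.substr( firstNonSpace );
}
```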
TracyDebug fires from SysPower's ctor while it scans intel-rapl, which
runs as a Profiler member initializer -- before s_instance is set in
the Profiler ctor body. Under TRACY_MANUAL_LIFETIME without
TRACY_ON_DEMAND, the TracyInternalMessage path guarded this with
assert(ProfilerAvailable()), which aborted tracy-monitor whenever it
was run as root (only then is intel-rapl readable, so the log is
actually reached).
Soften the assert to an early-out, matching the TRACY_ON_DEMAND branch.
A TracyDebug issued before the profiler is up now silently skips
instead of aborting.
FindExternalImageRefresh already re-parsed /proc/<pid>/maps on miss,
but only one of three external decode paths used it. Switch the
DecodeCallstackPtrFastExternal and DecodeSymbolAddressExternal paths
over so symbol-name and file/line lookups stay fresh after the target
dlopens a library.
Rate-limit the re-parse to once per wall-clock second so samples
landing on permanently unresolvable regions (JIT, vDSO, stacks) do
not trigger a full parse each time.
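A minimal sketch of the once-per-second limit. `RefreshLimiter` is a hypothetical name, and it takes a caller-supplied clock reading so it can be exercised deterministically. The source says "wall-clock second"; this sketch assumes `std::chrono::steady_clock` as the time base.

```cpp
#include <chrono>

class RefreshLimiter
{
public:
    using Clock = std::chrono::steady_clock;

    // Returns true if a /proc/<pid>/maps re-parse may run now; at most once per second.
    bool TryAcquire( Clock::time_point now )
    {
        if( m_hasLast && now - m_last < std::chrono::seconds( 1 ) ) return false;
        m_last = now;
        m_hasLast = true;
        return true;
    }

private:
    Clock::time_point m_last{};
    bool m_hasLast = false;
};
```

On an image lookup miss, a caller would only re-parse when `limiter.TryAcquire( RefreshLimiter::Clock::now() )` returns true.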
perf_event_open(pid>0, cpu>=0) binds to a single task, so the previous
setup only sampled the target's main thread. In monitor mode, enumerate
/proc/<pid>/task/ and open one per-task event per existing thread with
cpu=-1; inherit=1 then covers every descendant. Self-profiling behavior
is preserved byte-for-byte: the iter list becomes (currentPid, i) for
each CPU, exactly what the old code did inline.
- Loop startup waitpid on EINTR; kill and reap the child on fatal error
or when interrupted, instead of leaking a ptrace-stopped process.
- Treat PTRACE_DETACH failure as fatal -- otherwise the child is stuck
stopped forever.
- Zero-initialize procName so the memcpy into ___tracy_magic_process_name
does not copy uninitialized stack past the NUL.
- Forward SIGINT to the child from the signal handler when in forked
mode, so Ctrl-C during a blocking waitpid unblocks cleanly.
- Preflight perf_event_open on the target before StartupProfiler so
permission failures surface with actionable guidance instead of
silently producing no samples.
- Also handle SIGHUP and SIGQUIT.
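A minimal Linux-only sketch of two of the pieces above. `ListTasks` and `WaitPidRetry` are hypothetical helper names. The first enumerates existing threads from `/proc/<pid>/task/`; each returned tid would then get its own `perf_event_open( &attr, tid, -1, -1, 0 )` with `attr.inherit = 1`, as described. The second retries `waitpid` when a signal handler interrupts it.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>
#include <vector>
#include <dirent.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Enumerate the thread ids of `pid` from /proc/<pid>/task/.
static std::vector<pid_t> ListTasks( pid_t pid )
{
    std::vector<pid_t> tids;
    const std::string path = "/proc/" + std::to_string( pid ) + "/task";
    DIR* dir = opendir( path.c_str() );
    if( !dir ) return tids;
    while( dirent* ent = readdir( dir ) )
    {
        if( ent->d_name[0] == '.' ) continue;  // skip "." and ".."
        tids.push_back( (pid_t)strtol( ent->d_name, nullptr, 10 ) );
    }
    closedir( dir );
    return tids;
}

// waitpid that retries when interrupted by a signal (EINTR).
static pid_t WaitPidRetry( pid_t pid, int* status, int options )
{
    pid_t r;
    do { r = waitpid( pid, status, options ); } while( r == -1 && errno == EINTR );
    return r;
}
```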
We redirect GetProfiler() (most likely used by any project consuming tracy, since it's used by `tracy::ScopedZone`) to its implementation which now has a different function name based on the macros that can impact ABI (and enabled/disabled).
That way, when linking with mismatched defines you'd get an error such as
> main.obj : error LNK2019: unresolved external symbol "int __cdecl GetProfiler_CFG_E0_OD0_DI0_ML0_F0_DHT0_TF0(void)" (?GetProfiler_CFG_E0_OD0_DI0_ML0_F0_DHT0_TF0@@YAHXZ) referenced in function "int __cdecl GetProfiler(void)" (?GetProfiler@@YAHXZ)
Or
>[build] /usr/bin/ld: CMakeFiles/app.dir/main.cpp.o: in function `GetProfiler()':
[build] /..../TracyProfiler.hpp:143: undefined reference to `GetProfiler_CFG_E1_OD0_DI0_ML0_F0_DHT0_TF0()'
Reason for going with acronym+0/1 instead of just acronym when enabled is for us to be able to tell users easily which define is wrong by just looking at the error if needed.
The only thing we don't really detect is user not having TRACY_ENABLE but tracy having been built with it. This is because macros become noops in that case, with no reference to `GetProfiler`.
There may be a way to do it by introducing a local variable into each TU, but I don't really like that idea.
We could also add pragma detect mismatch for a more user-friendly error on windows (https://learn.microsoft.com/en-us/cpp/preprocessor/detect-mismatch?view=msvc-170).
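A minimal sketch of the name-mangling idea, covering only two flags. It assumes `E` maps to `TRACY_ENABLE` and `OD` to `TRACY_ON_DEMAND`, as the linker errors above suggest. All `TRACY_SK_*` macros are hypothetical, and `int` mirrors the return type in those errors rather than Tracy's real `Profiler&`. A TU compiled with different defines references a symbol the library never defined, so the link fails.

```cpp
#include <cstring>

// Map each ABI-affecting define to 0/1 (only two shown; the real list is longer).
#ifdef TRACY_ENABLE
#  define TRACY_SK_E 1
#else
#  define TRACY_SK_E 0
#endif
#ifdef TRACY_ON_DEMAND
#  define TRACY_SK_OD 1
#else
#  define TRACY_SK_OD 0
#endif

// Two-level concatenation so macro arguments are expanded before pasting.
#define TRACY_SK_CAT_( a, b ) a##b
#define TRACY_SK_CAT( a, b ) TRACY_SK_CAT_( a, b )
#define TRACY_SK_ABI_NAME \
    TRACY_SK_CAT( TRACY_SK_CAT( TRACY_SK_CAT( GetProfiler_CFG_E, TRACY_SK_E ), _OD ), TRACY_SK_OD )

#define TRACY_SK_STR_( x ) #x
#define TRACY_SK_STR( x ) TRACY_SK_STR_( x )

// Library side: the implementation is emitted under the configuration-specific name.
int TRACY_SK_ABI_NAME() { return 42; }

// Header side: the public entry point forwards to the configuration-specific name.
inline int GetProfiler() { return TRACY_SK_ABI_NAME(); }
```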
This is to be consistent with what libtracefs does: 6fad6a14ba/src/tracefs-utils.c (L104)
In theory one may mount tracefs with another name, though unlikely.
This forces (re)using frequency values as input, which may be changed by the platform code later on. This way we have a single "source of truth" for the sample frequency.
Also removed the Win32 `GetSamplingInterval`, which was a wrapper around `GetSamplingPeriod` but whose value would be divided again anyway.
On 32-bit arm, phdr.p_vaddr is 32-bit, which causes a compilation
error because std::min expects both arguments to be of the same
type. Adding the static cast handles this case explicitly.
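A minimal sketch of the problem and the cast. It assumes the other `std::min` operand is 64-bit. `ClampedStart` is a hypothetical helper. `std::min` takes both arguments as the same `T`, so a 32-bit `p_vaddr` paired with a 64-bit value makes template deduction fail.

```cpp
#include <algorithm>
#include <cstdint>

static uint64_t ClampedStart( uint32_t p_vaddr, uint64_t current )
{
    // std::min( p_vaddr, current ) would not compile: T deduced as both uint32_t and uint64_t.
    return std::min( static_cast<uint64_t>( p_vaddr ), current );
}
```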
Without this, a late-connecting client receives the deferred
GpuNewContext but not the GpuContextName, so the GPU context appears
unnamed in the profiler.
Add check_gpu_ctx_name tool to verify context names in captured traces.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Minimal HIP program that demonstrates the assertion failure in
tracy-capture when connecting to a TRACY_ON_DEMAND + TRACY_ROCPROF
application. See examples/RocprofOnDemandRepro/README.md for details.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two issues prevented the rocprofiler GPU backend from working with
TRACY_ON_DEMAND:
1. GpuNewContext not deferred: When a Tracy client connects late (on-demand
mode), it never receives the GPU context creation message because the
GpuNewContext queue item was not buffered via DeferItem. This caused an
assertion failure (ctx == nullptr) in the capture/profiler when
processing GPU zone events. Add the same DeferItem pattern used by the
CUDA backend.
2. Kernel symbols dropped before init: The data->init guard at the top of
tool_callback_tracing_callback() blocked kernel symbol registrations
(CODE_OBJECT_DEVICE_KERNEL_SYMBOL_REGISTER) which happen at HIP init
time, before any Tracy client connects. Move the init guard after the
code_object block so symbols are always recorded, while dispatch and
memory-copy events are still gated on initialization.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The comment still described consuming/not-consuming cudaCallSiteInfo
entries, but memory CBIDs are no longer tracked so no entry exists.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the mutex-guarded empty check in OnBufferCompleted with an
std::atomic<bool> dirty flag. The mutex is now only acquired when
there is actual retirement work to do. Also update stale comment
on cudaGraphCurrentLaunch that said "let them leak".
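A minimal sketch of the dirty-flag pattern. `RetirementSketch` is a hypothetical type, and clearing the set stands in for the real retirement work. Writers set the flag while holding the mutex, so a buffer callback that sees `false` can skip the lock without missing work added earlier.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <set>

struct RetirementSketch
{
    std::mutex lock;
    std::set<uint32_t> pending;        // guarded by lock
    std::atomic<bool> dirty{ false };  // true while pending may be non-empty

    void Add( uint32_t id )
    {
        std::lock_guard<std::mutex> g( lock );
        pending.insert( id );
        dirty.store( true, std::memory_order_release );
    }

    // Called for every completed buffer; the common case takes no lock.
    size_t OnBufferCompleted()
    {
        if( !dirty.load( std::memory_order_acquire ) ) return 0;
        std::lock_guard<std::mutex> g( lock );
        const size_t n = pending.size();
        pending.clear();  // stand-in for the actual retirement work
        dirty.store( !pending.empty(), std::memory_order_release );
        return n;
    }
};
```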
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
operator[] on ConcurrentHashMap returns a reference after releasing the
read lock — the subsequent assignment happens with no lock held. This is
a latent data race if the map is ever accessed from multiple threads.
insert_or_assign performs the lookup and assignment atomically under a
single write lock, which is the correct pattern for a ConcurrentHashMap.
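A minimal sketch of the locked alternative, using `std::shared_mutex` (C++17). `ConcurrentHashMapSketch` is a hypothetical class, not Tracy's `ConcurrentHashMap`. A racy `operator[]` would return a reference into the map and release the lock before the caller's assignment runs. Here lookup and assignment happen under one exclusive lock.

```cpp
#include <mutex>
#include <shared_mutex>
#include <unordered_map>
#include <utility>

template<typename K, typename V>
class ConcurrentHashMapSketch
{
public:
    // Lookup and assignment happen under one exclusive (write) lock.
    void insert_or_assign( const K& k, V v )
    {
        std::unique_lock<std::shared_mutex> l( m_mtx );
        m_map.insert_or_assign( k, std::move( v ) );
    }

    // Readers copy the value out while holding the shared (read) lock.
    bool find( const K& k, V& out ) const
    {
        std::shared_lock<std::shared_mutex> l( m_mtx );
        auto it = m_map.find( k );
        if( it == m_map.end() ) return false;
        out = it->second;
        return true;
    }

private:
    mutable std::shared_mutex m_mtx;
    std::unordered_map<K, V> m_map;
};
```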
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add `using GraphID = uint32_t` typedef and use it throughout for
graphId-typed variables (PersistentState, matchGraphActivityToAPICall,
getGraphIdFromRecord, retirement set, buffer loop).
- Move matchError from matchGraphActivityToAPICall to caller sites
(KERNEL, MEMCPY, MEMSET handlers). Keeping the error at the caller
provides more debugging context about which activity kind failed.
Remove the now-unnecessary `kind` parameter from the function.
- Replace insert_or_assign with operator[] assignment in
matchGraphActivityToAPICall. Access to graphLaunchCache is
single-threaded (CUPTI worker), so the simpler syntax is sufficient.
Remove the insert_or_assign method from ConcurrentHashMap entirely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cudaMalloc/cudaFree (and driver equivalents) were tracked in
cbidRuntimeTrackers/cbidDriverTrackers, creating a cudaCallSiteInfo
entry on each API call. But the MEMORY2 handler never calls
matchActivityToAPICall (and never calls EmitGpuZone) — it only needs
the address, size, and timestamp from the activity record itself. Since
no activity handler consumes these entries, they leaked indefinitely.
Remove the 6 memory API CBIDs from both tracker maps so no entry is
created. This eliminates the leak with no change in visible behavior:
the MEMORY2 handler already operates independently of cudaCallSiteInfo.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without retirement, the cache grows by one entry per unique exec handle
ever launched and never shrinks. While bounded by the number of distinct
execs in the application, long-running programs creating and destroying
many exec handles accumulate stale entries indefinitely.
Retirement mechanism:
- At cudaGraphExecDestroy (ENTER, while handle is still valid): call
cuptiGetGraphExecId to translate exec handle → graphId and add to a
pending-retirement set. Works for both runtime (cudaGraphExecDestroy)
and driver (cuGraphExecDestroy) APIs. No new subscription needed —
the existing cuptiEnableDomain already routes all API callbacks here.
- Deferral in OnBufferCompleted: erasure is not done immediately because
cudaGraphExecDestroy does not wait for GPU completion. CUPTI may still
have undelivered activity records for the last launch in its internal
buffers. We defer the erase until a full buffer arrives that contains
no records bearing the retired graphId, indicating all in-flight
records have been delivered.
- getGraphIdFromRecord: new helper that extracts the graphId field from
CONCURRENT_KERNEL / MEMCPY / MEMSET activity records (the three kinds
that carry a graphId) for use in the per-buffer tracking.
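A minimal sketch of the deferred erase. `GraphCacheSketch` is a hypothetical type with a placeholder `int` payload. It assumes the graphId has already been resolved from the exec handle at destroy time. A retired entry is erased only when a completed buffer carries no record with that graphId.

```cpp
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

using GraphID = uint32_t;

struct GraphCacheSketch
{
    std::unordered_map<GraphID, int> cache;  // graphId -> cached launch info (placeholder)
    std::set<GraphID> pendingRetire;         // exec destroyed, records may still be in flight

    // At cudaGraphExecDestroy (ENTER).
    void OnExecDestroy( GraphID id ) { pendingRetire.insert( id ); }

    // Per completed buffer: graphIds carried by its kernel/memcpy/memset records.
    void OnBufferCompleted( const std::vector<GraphID>& recordGraphIds )
    {
        std::set<GraphID> seen( recordGraphIds.begin(), recordGraphIds.end() );
        for( auto it = pendingRetire.begin(); it != pendingRetire.end(); )
        {
            // No record for this graphId in the buffer: treat its records as delivered.
            if( seen.count( *it ) == 0 )
            {
                cache.erase( *it );
                it = pendingRetire.erase( it );
            }
            else
            {
                ++it;
            }
        }
    }
};
```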
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In the JSON exception catch handler, m_jobsLock.lock() is called directly
on the mutex instead of through the jobsLock unique_lock. When the function
returns, jobsLock's destructor runs but it doesn't own the lock (it was
unlocked earlier at line 1134), and m_jobsLock is never released. This
causes a permanent deadlock the next time anything tries to acquire
m_jobsLock.
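A minimal sketch of the correct re-lock. `ProcessJob`, `g_jobs` and the thrown `std::runtime_error` are hypothetical stand-ins for the real job handling and JSON exception. Re-acquiring through the `unique_lock` makes it own the mutex again, so its destructor releases the mutex on return.

```cpp
#include <mutex>
#include <stdexcept>

static std::mutex m_jobsLock;
static int g_jobs = 0;

static void ProcessJob( bool fail )
{
    std::unique_lock<std::mutex> jobsLock( m_jobsLock );
    g_jobs++;
    jobsLock.unlock();  // do the slow work without holding the lock
    try
    {
        if( fail ) throw std::runtime_error( "parse error" );  // stand-in for the JSON exception
    }
    catch( const std::exception& )
    {
        // Calling m_jobsLock.lock() here would lock the mutex behind jobsLock's back;
        // jobsLock would not own it, so it would never be unlocked.
        jobsLock.lock();  // re-acquire through the unique_lock so it is released on return
        g_jobs--;
    }
}
```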
Currently including the Tracy.hpp header from a set of installed Tracy
headers will result in the following error:
In file included from <...>/tracy/include/tracy/tracy/Tracy.hpp:133:
In file included from <...>/tracy/include/tracy/tracy/../client/TracyLock.hpp:9:
In file included from <...>/tracy/include/tracy/tracy/../client/TracyProfiler.hpp:18:
<...>/tracy/include/tracy/tracy/../client/../common/TracyQueue.hpp:6:10: fatal error: 'TracyTaggedUserlandAddress.hpp' file not found
6 | #include "TracyTaggedUserlandAddress.hpp"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Apparently introduced in f981330, which included the
TracyTaggedUserlandAddress.hpp header in TracyQueue.hpp without adding
it to the list of installed common headers. Fixed by making the necessary
CMake change to install the header.
Ran into this issue while integrating Tracy as a dependency within
Blender[^1], where we use the latest main instead of stable for WoA
support, and use the install target to harvest the static lib and
headers for our libraries.
[^1]: https://projects.blender.org/blender/blender/pulls/156661
The documentation states that Tracy is disabled by default, but the
build system defaults were ON/true. Change CMake and Meson defaults to
OFF/false. Projects that need profiling enabled must now opt in
explicitly. Add explicit TRACY_ENABLE=ON / tracy_enable=true to CI
steps and the test project to preserve existing behavior.
Tests whether CUPTI recycles graphId values after cudaGraphExecDestroy,
which would be the only scenario where the graphLaunchCache in TracyCUDA
could serve stale entries for a non-matching exec handle.
Result (H100, CUDA 12, CUPTI): graphId is a monotonically increasing
counter that is never recycled. 22 create/instantiate/launch/destroy
cycles produced unique IDs ranging from 2 to 65 (incrementing by 3 per
cycle — one unit per node created during graph construction).
This confirms that the stale-cache concern raised in code review is not
a real risk in practice: two distinct exec handles always have distinct
graphIds.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests two questions:
1. Does relaunching the same cudaGraphExec produce a new correlationId
each time, or is it reused?
2. Do two different cudaGraphExec handles from the same cudaGraph share
a graphId?
Results on H100, CUDA 13.1:
- Each launch of the same exec handle gets a strictly unique, monotonically
increasing correlationId. CPU callback corrId == GPU activity corrId.
This is formally documented in cupti_activity.h:
"Each graph launch is assigned a unique correlation ID that is
identical to the correlation ID in the driver API activity record
that launched the graph."
- graphId identifies the exec handle (instantiation), not the graph
definition. Two cudaGraphInstantiate calls on the same graph produce
different graphIds.
These findings confirm that the cudaGraphCurrentLaunch cache in
matchGraphActivityToAPICall is always refreshed by the first activity
of each new launch before the graphId fallback path is ever used.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Calling matchActivityToAPICall in the MEMORY2 handler was consuming
the cudaCallSiteInfo entry for the graph launch correlationId. If a
graph mixes alloc nodes with kernel/memcpy nodes, all activities share
the same correlationId — consuming it here would cause
matchGraphActivityToAPICall to fail for the kernel/memcpy records that
follow, silently dropping their GPU zones.
Since apiCall is never used by the MEMORY2 handler (only address, size,
and timestamp from the activity record are needed), remove the call
entirely and leave the entry for the kernel/memcpy to consume and cache.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CUpti_ActivityMemory3 has no graphId field, so graph-launched alloc
nodes and pre-profiling allocations can't be correlated to an API call.
The handler only needs address, size, and timestamp from the activity
record — apiCall is never used. Remove the early return so memory
tracking works in all cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CUpti_ActivityMemory3 has no graphId field, so matchGraphActivityToAPICall
cannot be applied. Graph-launched cudaGraphAddMemAllocNode emits multiple
MEMORY2 records sharing the launch correlationId; only the first is
tracked, subsequent ones fire a spurious matchError.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix wrong comments on graph launch tracker entries: they claimed
correlation works "via CUPTI_ACTIVITY_KIND_GRAPH_TRACE", but that
approach was rejected (GRAPH_TRACE suppresses per-kernel records).
The actual mechanism is the shared correlationId across all nodes
in one graph launch.
- Fix ConcurrentHashMap::fetch() missing its read lock — a pre-existing
data race now exercised by the new graph correlation hot path.
- Cache PersistentState::Get().cudaGraphCurrentLaunch in a local ref
inside matchGraphActivityToAPICall instead of calling Get() twice.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cudaGraphCurrentLaunch cache update was acquiring the write lock
twice (once for erase, once for emplace). Wrapping
std::unordered_map::insert_or_assign under a single write lock lets
the caller do it in one operation.
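Both locking fixes (the read lock in `ConcurrentHashMap::fetch()` and the single write lock around `insert_or_assign`) follow the usual reader/writer pattern. The class below is a sketch built on `std::shared_mutex`. Tracy's actual `ConcurrentHashMap` interface may differ, and returning `std::optional` from `fetch()` is an assumption.

```cpp
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <unordered_map>
#include <utility>

template<typename K, typename V>
class ConcurrentHashMap {
public:
    // Readers take a shared lock: many may run at once, but never alongside a writer.
    std::optional<V> fetch(const K& key) const {
        std::shared_lock<std::shared_mutex> lock(m_mutex);
        auto it = m_map.find(key);
        if (it == m_map.end()) return std::nullopt;
        return it->second;
    }

    // One exclusive lock covers the whole update, instead of erase + emplace under two locks.
    void insert_or_assign(const K& key, V value) {
        std::unique_lock<std::shared_mutex> lock(m_mutex);
        m_map.insert_or_assign(key, std::move(value));
    }

private:
    mutable std::shared_mutex m_mutex;
    std::unordered_map<K, V> m_map;
};
```

Besides saving a lock round-trip, the single lock also closes the window in which a concurrent `fetch()` could observe the key as missing between the erase and the emplace.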
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVCC 13.1 defaults to a PTX version incompatible with the installed
driver (580.105.08), causing kernels to silently fail with "provided
PTX was compiled with an unsupported toolchain". Use -arch=native so
NVCC auto-detects the target GPU (H100, sm_90) at build time.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The kernel, memcpy, and memset cases all had identical logic for
handling graph-launched activities. Extract it into a single helper
next to matchActivityToAPICall.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CUPTI discovery: all kernels launched by one cuGraphLaunch share the
same correlationId as the launch call itself. GRAPH_TRACE was the
wrong approach — enabling it suppresses per-kernel CONCURRENT_KERNEL
records entirely, replacing them with graph-level summaries.
New approach:
- Drop CUPTI_ACTIVITY_KIND_GRAPH_TRACE (it conflicts with CONCURRENT_KERNEL)
- Drop two-pass buffer processing (no longer needed)
- On the first kernel/memcpy/memset from a graph launch, matchActivityToAPICall
succeeds (consuming the cuGraphLaunch entry) and the result is cached in
cudaGraphCurrentLaunch[graphId]
- Subsequent operations from the same launch find the cached entry via graphId
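The lookup order is condensed in the sketch below, which uses simplified maps keyed by correlationId and graphId. `GraphCorrelator`, `APICallInfo::site` and `Match` are hypothetical stand-ins for `cudaCallSiteInfo`, `cudaGraphCurrentLaunch` and `matchGraphActivityToAPICall`. Where this sketch returns `std::nullopt`, the real code would report `matchError`.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

struct APICallInfo { std::string site; }; // hypothetical: CPU call site of the launch

struct GraphCorrelator {
    std::unordered_map<uint32_t, APICallInfo> callSites;     // correlationId -> call site
    std::unordered_map<uint32_t, APICallInfo> currentLaunch; // graphId -> call site of latest launch

    std::optional<APICallInfo> Match(uint32_t correlationId, uint32_t graphId) {
        auto it = callSites.find(correlationId);
        if (it != callSites.end()) {
            APICallInfo info = it->second;
            callSites.erase(it);                           // first activity of the launch consumes the entry
            currentLaunch.insert_or_assign(graphId, info); // and refreshes the per-graph cache
            return info;
        }
        auto cached = currentLaunch.find(graphId);         // later activities of the same launch
        if (cached != currentLaunch.end()) return cached->second;
        return std::nullopt;
    }
};
```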
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The repro uses cudaGraphLaunch (runtime API) not cuGraphLaunch (driver
API). Add cudaGraphLaunch_v10000 and its _ptsz variant to
cbidRuntimeTrackers so that graphs launched via the runtime API also
get their CPU call site captured for GPU zone correlation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the synthetic APICallInfo hack with proper correlation via
CUPTI_ACTIVITY_KIND_GRAPH_TRACE. When cuGraphLaunch fires an API
callback, its correlationId is stored in cudaCallSiteInfo. The
GRAPH_TRACE activity record carries the same correlationId plus the
graphId, which lets us build a graphId→APICallInfo map. Kernel/memcpy/
memset activities then look up this map via their graphId field.
Key changes:
- Add cuGraphLaunch/cuGraphLaunch_ptsz to cbidDriverTrackers so the
API callback machinery captures the CPU call site
- Enable CUPTI_ACTIVITY_KIND_GRAPH_TRACE and handle it in
DoProcessDeviceEvent to populate cudaGraphCurrentLaunch[graphId]
- Add cudaGraphCurrentLaunch map to PersistentState
- Two-pass buffer processing in OnBufferCompleted so GRAPH_TRACE
records (which complete last on GPU) are processed before the
kernel/memcpy/memset records that depend on them
- Replace graphId=0 fallback in kernel/memcpy/memset with proper
cudaGraphCurrentLaunch lookup; fall through to matchError if
the graphId is not found
- Update repro to include TracyCUDA headers and properly test
GPU zone correlation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This is often a source of missing symbols, or of confusion about why they are not being resolved. A debug log will help diagnose such cases.
- Add external_file field to backtrace_state struct
- Add backtrace_create_state_for_file() function that marks
  state for external ELF files not loaded in current process
- In backtrace_initialize, use external_file flag to:
  - Pass exe=0 to elf_add for ET_DYN files (allows DWARF loading)
  - Skip dl_iterate_phdr enumeration (avoids noise from current process)
This enables symbol resolution from arbitrary ELF files on disk,
with caller responsible for address translation.
Minimal reproducer showing that CUDA Graph-launched kernels produce
0 GPU zones in Tracy. The repro creates a simple graph (2 kernels +
1 memcpy), launches it 10 times, and expects ~30 GPU zones. Without
the fallback patch, all activity records are dropped by matchError().
Tested on NVIDIA H100, CUDA 13.1.
When kernels are launched via CUDA Graphs (cuGraphLaunch), CUPTI delivers
CONCURRENT_KERNEL, MEMCPY, and MEMSET activity records but no
corresponding API callback fires for the individual operations. This
means matchActivityToAPICall() always fails, and every GPU activity
record is silently dropped by matchError().
Fix this by falling back to a synthetic APICallInfo using the GPU
timestamps from the activity record when no API correlation exists.
This produces correct GPU zones with kernel names and timing — just
without the CPU-to-GPU launch correlation arrow.
Tested on NVIDIA H100 with CUDA 13.1: before this fix, 0 GPU zones
appeared for CUDA Graph workloads; after, all kernel and memcpy zones
are visible in the Tracy timeline.
Use '<pid>_<ip>_<port>' string as client ID instead of IP+port hash.
This allows:
- Same program restarting (new PID) to be recognized as new client
- Multiple instances of same program (different PIDs) to capture separately
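For illustration, assuming the PID, IP address string and port have already been parsed from the broadcast, the ID could be built as below. `MakeClientId` is a hypothetical helper, not the function in the commit.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: builds the "pid_ip_port" client ID described above.
std::string MakeClientId(uint64_t pid, const std::string& ip, uint16_t port) {
    return std::to_string(pid) + "_" + ip + "_" + std::to_string(port);
}
```

Because the PID is part of the key, a restarted program gets a new ID even if it reuses the same address and port.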
- Add subsection about tracy-capture-daemon in the capturing section
- Move merge tool documentation to follow capture daemon section
- Both tools are now documented as part of the capture workflow
A discovery-and-capture daemon that listens for UDP broadcasts from
Tracy clients, automatically connects to discovered clients, and
captures each to a separate file.
Features:
- Continuous discovery until Ctrl+C
- Per-client capture threads
- Terminal display with per-client stats
- Output files named: <program>_<ip>_<port>.tracy
- Collision handling with _1, _2 suffix
- Graceful shutdown on signal
Based on the multicapture design by Grégoire Roussel, but simplified
to output separate files instead of merging (use tracy-merge for that).
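The collision rule for output names might look like the sketch below. `PickOutputName` is hypothetical, and placing the `_1`, `_2` suffix before the `.tracy` extension is an assumption. The existence check can be injected, so the logic can be tested without touching the disk.

```cpp
#include <cstdint>
#include <filesystem>
#include <functional>
#include <string>

// Hypothetical: returns program_ip_port.tracy, or program_ip_port_N.tracy if taken.
std::string PickOutputName(const std::string& program, const std::string& ip, uint16_t port,
    const std::function<bool(const std::string&)>& exists =
        [](const std::string& p) { return std::filesystem::exists(p); })
{
    const std::string base = program + "_" + ip + "_" + std::to_string(port);
    std::string name = base + ".tracy";
    for (int i = 1; exists(name); ++i)   // try _1, _2, ... until a free name is found
        name = base + "_" + std::to_string(i) + ".tracy";
    return name;
}
```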
Co-authored-by: Grégoire Roussel <gregoire.roussel@wandercraft.eu>
Extract common output functions from capture.cpp into a proper library:
- InitTerminalDetection()/IsTerminal() - terminal detection
- AnsiPrintf() - printf with ANSI escape codes
- WaitForConnection() - blocks until connected, returns error code
- PrintCaptureProgress() - prints throughput/memory/time stats
- PrintWorkerFailure() - prints failure details with callstack
Functions are declared in CaptureOutput.hpp and implemented in
CaptureOutput.cpp. Both tracy-capture and future tools can share
this code.
Co-authored-by: Grégoire Roussel <gregoire.roussel@wandercraft.eu>
- Always export plots (remove -p/--export-plots option)
- Add plot name disambiguation: prefix with process name
- Include PID in name when same process/plot appears in multiple traces
Merges multiple .tracy files into a single combined trace using the
Import API. Each trace's threads are remapped using compound TIDs
encoding (pid << 32) | tid to prevent collisions.
Thread names are prefixed with process name. If the same process/thread
name combination appears in multiple traces, PID is included to
disambiguate (e.g., myapp[12345]/MainThread).
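The TID encoding and the display-name rule are sketched below, with some assumptions. PIDs and TIDs are taken as 32-bit values, `CompoundTid` and `DisplayThreadName` are hypothetical names, and the `process/thread` form for the unambiguous case is inferred from the example above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical: pack a 32-bit PID and TID into one 64-bit thread key, so equal TIDs
// coming from different traces no longer collide.
uint64_t CompoundTid(uint32_t pid, uint32_t tid) {
    return (uint64_t(pid) << 32) | tid;
}

// Hypothetical: prefix with the process name; add the PID only when the
// process/thread combination appears in more than one trace.
std::string DisplayThreadName(const std::string& process, uint32_t pid,
                              const std::string& thread, bool ambiguous) {
    if (ambiguous) return process + "[" + std::to_string(pid) + "]/" + thread;
    return process + "/" + thread;
}
```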
Co-authored-by: Grégoire Roussel <gregoire.roussel@wandercraft.eu>
Move broadcast message parsing logic from profiler/src/main.cpp into
server/TracyBroadcast.cpp/hpp. This reduces code duplication and enables
reuse by other tools (e.g., multi-capture).
ParseBroadcastMessage() handles all broadcast protocol versions (0-3) and
returns std::optional<BroadcastMessage>. ClientUniqueID() generates a unique
identifier from IP address and port.
Co-authored-by: Grégoire Roussel <gregoire.roussel@wandercraft.eu>
Add option to configure the maximum tool reply size in the advanced
settings of the LLM assistant. The limit can be enabled/disabled via
a checkbox, with the value stored in bytes. The context-based limit
is now displayed alongside the configured limit for transparency.
External symbols checkbox now appears before kernel symbols, and kernel
is disabled when external is not selected since kernel symbols are a
subset of external symbols.
Introduce native Windows ARM64 support across ToyPathTracer,
profiler, and server code paths when building with MSVC (_M_ARM64).
Key changes:
- MathSimd.h/Maths.h:
  - Fix NEON movemask constants for MSVC/ARM64 by loading from a uint32_t[]
    via vld1q_u32() and using vdupq_n_u32() for highbit.
- enkiTS/TaskScheduler.cpp:
  - Provide a Pause() implementation on _M_ARM64 using __yield().
- profiler/winmain.cpp:
  - Restrict AVX feature checks to x86/x64 and skip them on ARM64.
- server/TracyPopcnt.hpp:
  - Implement TracyCountBits using ARM NEON intrinsics.
  - Implement TracyLzcnt using _BitScanReverse64().
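The two `TracyPopcnt.hpp` operations can be sketched as below. The function names, the 64-bit argument type and the portable fallbacks are assumptions made for illustration. The NEON path (per-byte count followed by a horizontal add) is one common way to express a population count with NEON intrinsics. The `_BitScanReverse64()` path matches what the commit describes.

```cpp
#include <cstdint>
#if defined(_M_ARM64) || defined(__aarch64__)
#  include <arm_neon.h>
#endif
#if defined(_MSC_VER)
#  include <intrin.h>
#endif

// Hypothetical CountBits: number of set bits in v.
inline int CountBits(uint64_t v) {
#if defined(_M_ARM64) || defined(__aarch64__)
    // vcnt_u8: popcount of each byte; vaddv_u8: sum of the 8 byte counts
    return vaddv_u8(vcnt_u8(vcreate_u8(v)));
#else
    int n = 0;
    for (; v; v &= v - 1) ++n; // clear the lowest set bit each iteration
    return n;
#endif
}

// Hypothetical Lzcnt: number of leading zero bits, 64 for v == 0.
inline int Lzcnt(uint64_t v) {
    if (v == 0) return 64;
#if defined(_MSC_VER) && (defined(_M_ARM64) || defined(_M_X64))
    unsigned long idx;
    _BitScanReverse64(&idx, v); // idx = position of the highest set bit
    return 63 - int(idx);
#else
    int n = 0;
    while (!(v & (uint64_t(1) << 63))) { v <<= 1; ++n; }
    return n;
#endif
}
```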
The chat regenerate/trash button handling had a monolithic lock that
blocked the worker thread during token counting. The token counting
is performed by calling TracyLlmApi::Tokenize(), which issues a
blocking HTTP POST request to the LLM API - on the UI thread.
To avoid blocking both the worker and freezing the UI during HTTP
calls, the lock was split into m_jobsLock and m_chatLock. However,
this introduced a race condition.
For the assistant role (regenerate message action), the code:
1. Stops the current job and queues a new one (under m_jobsLock)
2. Releases the lock
3. Performs token counting via HTTP calls (no lock held)
4. Re-acquires m_jobsLock and stops m_currentJob again
Between steps 2 and 4, the worker thread can complete the old job
and pick up the newly queued job. The stop at step 4 then incorrectly
stops the new regenerate job instead of the old one.
Under the original monolithic lock, the worker couldn't access
m_currentJob until after step 4, so the stop always applied to the
correct job. After the split, the second stop became harmful for
the assistant case.
The fix makes the second stop conditional on the user role only (trash
action), since that's the case where no new job is queued and we
genuinely need to stop any running generation before the user edits
their message.
Listen for browser paste events and directly inject clipboard text via
AddInputCharactersUTF8. Also suppress character input when Ctrl/Meta is
held to prevent 'v' from being typed on Cmd+V.
On macOS with Retina displays, dpiScale is 2.0 but gets overridden to
userScale alone under __APPLE__. The early-return check compared
prevScale against the pre-override value (dpiScale * userScale), which
for 50% zoom evaluates to 2.0 * 0.5 = 1.0, matching the previous
prevScale of 1.0 and causing an early return before the scale change
could take effect. Moving the __APPLE__ override before the early-return
check ensures the comparison uses the actual effective scale.
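Reduced to plain numbers, the corrected check looks like the sketch below. `EffectiveScale` and `ScaleChanged` are hypothetical helpers, and `isApple` stands in for the `__APPLE__` branch.

```cpp
// Hypothetical: the scale that is actually applied, with the __APPLE__ override folded in.
float EffectiveScale(float dpiScale, float userScale, bool isApple) {
    return isApple ? userScale : dpiScale * userScale;
}

// The early-return test must compare against the effective scale, not dpiScale * userScale.
bool ScaleChanged(float prevScale, float dpiScale, float userScale, bool isApple) {
    return EffectiveScale(dpiScale, userScale, isApple) != prevScale;
}
```

With `prevScale = 1.0`, `dpiScale = 2.0` and `userScale = 0.5`, the pre-override product is 1.0 and no change would be detected. The effective macOS scale is 0.5, so the change is correctly applied.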
Ideally I think we might want to have TRACY_HW_TIMER mean only TSC/CNTVCT, and define TRACY_TIMER_FALLBACK for platforms that don't have them or have a special case such as iOS.
But for now, keep the same behaviour.
The Emscripten backend was missing Platform_SetClipboardTextFn/GetClipboardTextFn,
so ImGui::SetClipboardText was a no-op in the browser.
Hook navigator.clipboard.writeText for writes and keep a local copy for reads.
The "sort" button for threads uses an unstable sort, so threads with
the same name will shuffle around without ever stabilizing.
Use `id` as a tiebreaker after comparing the names.
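The comparator below sketches the fix with a hypothetical `ThreadEntry` type. Names decide first and the thread id breaks ties, so the resulting order no longer depends on how `std::sort` happens to arrange equal elements.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct ThreadEntry { std::string name; uint64_t id; }; // hypothetical

void SortThreads(std::vector<ThreadEntry>& threads) {
    std::sort(threads.begin(), threads.end(), [](const ThreadEntry& a, const ThreadEntry& b) {
        if (a.name != b.name) return a.name < b.name;
        return a.id < b.id; // thread ids are unique, so no two entries compare equal
    });
}
```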
The valid check here is for the source file index being set (active).
The line number might be zero.
This fixes the "black" unknown source location in assembly listing.
Increase the context quota from 70% to 80%, as the previous value was too
conservative with large contexts.
Add a minimum bound of 4K to make small context workable.
TracyDebug() is in the profiler init path, and logging the message causes
the profiler to be initialized (the second time), which deadlocks in
GetProfilerData().
LM Studio will prefix the assistant's content with '\n\n'. While this is
not a problem when proper text follows (the markdown parser will ignore
this), the empty check does fail.
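One way to make the empty check robust is to treat whitespace-only content as empty. `IsBlank` is a hypothetical helper, and the set of whitespace characters it checks is an assumption.

```cpp
#include <string_view>

// Hypothetical helper: true if the reply contains nothing but whitespace.
bool IsBlank(std::string_view s) {
    return s.find_first_not_of(" \t\r\n") == std::string_view::npos;
}
```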
A typical use case would be $(HOME)/.cache/cpm/somelib/file.h.
Special care is needed to avoid filtering out dot-dot path elements: /../
While these have been normalized for some time now on the client-side, old
traces might still contain the dot-dot elements.
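Assuming the filter rejects paths that contain a dot-prefixed (hidden) element, the exception for `.` and `..` could be written as follows. `HasHiddenElement` is hypothetical, and treating both `/` and `\` as separators is an assumption.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical: true if any path element starts with '.', except "." and "..".
bool HasHiddenElement(std::string_view path) {
    std::size_t pos = 0;
    while (pos <= path.size()) {
        std::size_t end = path.find_first_of("/\\", pos);
        if (end == std::string_view::npos) end = path.size();
        std::string_view elem = path.substr(pos, end - pos);
        if (!elem.empty() && elem[0] == '.' && elem != "." && elem != "..") return true;
        pos = end + 1;
    }
    return false;
}
```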
Before this commit, vertical scroll was always discrete. At least on
Wayland, this caused extremely fast scrolling on touchpads (that send
lots of small axis events) and on mice with high-resolution wheels (that
also send lots of small axis events). After this commit, all of this
scrolling works correctly, at a speed matching regular wheels.
Regular mice send a value of 15 for one wheel tick, not 8.
This currently doesn't change anything about vertical scrolling since
it's handled discretely, but that will change in the next commit.
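The conversion can be sketched as below. It assumes about 15 axis units per regular wheel tick (as stated above), that positive Wayland vertical axis values mean scrolling down, and that Dear ImGui expects a positive wheel value for scrolling up. `AxisToWheelY` is hypothetical.

```cpp
// Assumption: about 15 axis units per regular wheel tick.
constexpr double kAxisUnitsPerTick = 15.0;

// Hypothetical helper: flips the sign for ImGui and keeps the fraction, so many small
// events add up to the same distance as one regular tick instead of one tick each.
float AxisToWheelY(double axisValue) {
    return static_cast<float>(-axisValue / kAxisUnitsPerTick);
}
```

The result would be passed to `ImGui::GetIO().AddMouseWheelEvent( 0.0f, AxisToWheelY( value ) )`. An event of 7.5 units from a high-resolution wheel then scrolls half a tick rather than a full one.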
In most cases this is not needed. However, some models, like Gemma3 or
Devstral, require that user and assistant messages alternate.
The only case where this can happen in Tracy is when an attachment is added:
```json
[
  {
    "role": "user",
    "content": "<attachment>\n..."
  },
  {
    "role": "user",
    "content": "Tell me something about..."
  }
]
```
It is trivial to glue these messages together. This is only done when sending
the data in the REST request, as the chat rendering logic expects these to be
separate, and doing it "properly" would be too much unnecessary work.
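The gluing step can be sketched as follows. `ChatMessage` and `MergeConsecutiveRoles` are hypothetical, and joining the contents with a blank line is an assumption. The function builds a new list, leaving the stored chat history untouched.

```cpp
#include <string>
#include <vector>

struct ChatMessage { std::string role; std::string content; }; // hypothetical

// Merge runs of messages with the same role into one message.
std::vector<ChatMessage> MergeConsecutiveRoles(const std::vector<ChatMessage>& in) {
    std::vector<ChatMessage> out;
    for (const auto& m : in) {
        if (!out.empty() && out.back().role == m.role)
            out.back().content += "\n\n" + m.content; // separator is an assumption
        else
            out.push_back(m);
    }
    return out;
}
```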
Nemotron 3 Nano outputs these spaces in the text. The currently used font
(or is it ImGui?) is not able to render them, and draws a replacement character
instead.
Previously reasoning and tool calls were rendered first, followed by
content. The tool call response was then rendered in the second reasoning
section. This made the tool call and response disjoint and, with the
current reasoning hiding logic, not visible at the same time.
AltGr (right alt) is used to enter characters, at least with the Polish
keymap. ImGui uses alt to switch between some of the controls. Keep the
interaction on the (left) Alt key, and leave AltGr to do composition.
This is not removed later on, mainly to enable caching of previous replies
(changing any earlier message would require processing everything that's
later), but also to give the LLM a sense of passing time.
This also replaces `___tracy_emit_message*` with `___tracy_emit_logString`.
The `TracyCMessage*` defines no longer include the `;`, which may be a breaking change, even though a semicolon has already been required since 0.9.0. See #493 and #592.
This now uses a single path for all cases (filter/threads changed, new message).
Note that this no longer checks `m_messageFilter.IsActive()` to skip the message filter. `ImGui::PassFilter` already early-outs, and if performance is a concern, we might as well start by caching the result of `VisibleMsgThread`, which always does a hashmap lookup.
There are two changes to the protocol:
- `QueueMessageLiteral*` were changed and what used to be addresses are now addresses+metadata
- Other messages now send `QueueMessage*Metadata` with added metadata.
This will later be used to store and transmit message sources, level, etc.
Extended Tracy's CUPTI callback registration to track CUDA Driver API
memory operations that were previously missing.
Registering these events allows users to trace applications that call the
Driver API directly.
- In the connection state, retrieve the FrameImage while holding the data lock.
- Use the actual image data pointer as the caching key instead of the address of ImageCache, which may change during execution (unstable).
- Fix the image tooltip scale in the Messages window.
- Free the connection image.
OfflineSymbolResolverDbgHelper.cpp uses IMAGEHLP_LINE but
SymGetLineFromAddr64 expects IMAGEHLP_LINE64. On 64-bit Windows these are typedef'd to the same thing, but on 32-bit they're different.
1. Fix a build error where the header file `#include <tracy/Tracy.hpp>` could not be found.
2. When the TRACY_ENABLE macro is not defined, fix the build error where the `tracy::CUDACtx` type could not be found.
Atomic variables need to be initialized with direct-initialization
instead of copy-initialization; otherwise compilation fails with an error
about a missing constructor for the atomic type.
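The example below uses a hypothetical `Counters` struct. Braces (or parentheses) direct-initialize the atomic. The `= 0` form is copy-initialization, which before C++17 needs the atomic's deleted copy constructor and is therefore rejected.

```cpp
#include <atomic>
#include <cstdint>

struct Counters {
    std::atomic<uint32_t> hits{ 0 };     // direct-initialization: fine in C++11 and later
    std::atomic<uint32_t> misses{ 0 };
    // std::atomic<uint32_t> hits = 0;   // copy-initialization: ill-formed before C++17
};
```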
The imgui-docking >= v1.92.3 GLFW back-end requires linking against the X11 libraries. This fixes link failures when building Tracy with the LEGACY option turned on.
Resolve a compilation issue when building against an older version of libcurl.
If an older libcurl is installed locally, the profiler build fails with:
```
29 | curl_easy_setopt( curl, CURLOPT_CA_CACHE_TIMEOUT, 604800L );
| ^~~~~~~~~~~~~~~~~~~~~~~~
| CURLOPT_DNS_CACHE_TIMEOUT
include/curl/curl.h:3109:68: note: expanded from macro 'curl_easy_setopt'
3109 | #define curl_easy_setopt(handle,opt,param) curl_easy_setopt(handle,opt,param)
| ^
include/curl/curl.h:1398:11: note: 'CURLOPT_DNS_CACHE_TIMEOUT' declared here
1398 | CURLOPT(CURLOPT_DNS_CACHE_TIMEOUT, CURLOPTTYPE_LONG, 92),
| ^
tracy/profiler/src/profiler/TracyLlmApi.cpp:144:46: error: use of undeclared identifier 'CURL_WRITEFUNC_ERROR'; did you mean 'CURLE_WRITE_ERROR'?
144 | if( !v.callback( json ) ) return CURL_WRITEFUNC_ERROR;
| ^~~~~~~~~~~~~~~~~~~~
| CURLE_WRITE_ERROR
include/curl/curl.h:524:3: note: 'CURLE_WRITE_ERROR' declared here
524 | CURLE_WRITE_ERROR, /* 23 */
| ^
2 errors generated.
```
Both CURLOPT_CA_CACHE_TIMEOUT and CURL_WRITEFUNC_ERROR were added in
https://curl.se/ch/7.87.0.html
Checking the version now avoids such issues:
```
-- Checking for module 'libcurl>=7.87.0'
-- Package dependency requirement 'libcurl >= 7.87.0' could not be satisfied.
Package 'libcurl' has version '7.84.0', required version is '>= 7.87.0'
-- CPM: Adding package libcurl@ (curl-8_17_0 to ...)
```
This makes the user manual available outside of the LLM context.
The code is also more readable, as splitting the manual into sections and
splitting section content into chunks fit for embeddings is now separated.
A bug has been fixed, where the above splits were mixed up for the last
manual section, producing invalid data.
The unembedded manual contents are no longer held in memory. The only
use case for this was to calculate the manual contents hash. The hash is
now precalculated and cached.
My previous fix triggered another issue: apparently at least GCC expects
the `_Pragma` operator to be placed in its own statement (after a
semicolon). The current macro simply dumped the expression and the
_Pragma together, triggering an error. Putting a semicolon after `Expr`
fixes the issue (actually double-checked after a `git clean -fdx`),
although slightly changing the API (the semicolon after the wrapped
macros is now optional).
It looks like the actual macro defined by GCC is `__GNUC__`.
From my testing, `__GNU__` does not seem to be defined on either my
musl machine or a Fedora Linux VM. I have no idea if it's a typo or set
by specific libraries, but testing for `__GNUC__` actually suppresses
shadowing errors on my end.
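The sketch below shows such a wrapper using a hypothetical `SUPPRESS_SHADOW` macro, not Tracy's own. The `_Pragma` operator that follows the wrapped expression comes after `Expr;`, so GCC sees it as a separate statement. The guard tests `__GNUC__`, which Clang defines as well. A trailing semicolon at the call site becomes optional.

```cpp
// Hypothetical wrapper; __GNUC__ is defined by both GCC and Clang.
#if defined(__GNUC__)
#  define SUPPRESS_SHADOW(Expr)                       \
      _Pragma("GCC diagnostic push")                  \
      _Pragma("GCC diagnostic ignored \"-Wshadow\"")  \
      Expr;                                           \
      _Pragma("GCC diagnostic pop")
#else
#  define SUPPRESS_SHADOW(Expr) Expr;
#endif

int Example(int value) {
    SUPPRESS_SHADOW(int scoped = value * 2)   // no trailing semicolon needed
    {
        SUPPRESS_SHADOW(int scoped = 1);      // shadows the outer 'scoped'; the extra ';' is harmless
        (void)scoped;
    }
    return scoped;
}
```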
The extended model was read 4 bits too low, and so was the extended family.
The extended family was also not properly masked, so the reserved bits bled into it.
Fixes #1193
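For reference, here is a decoder for the CPUID leaf 1 `EAX` signature, using the field positions and masks from the Intel and AMD documentation. `DecodeCpuid1Eax` is a hypothetical name, not Tracy's function.

```cpp
#include <cstdint>

struct CpuSignature { uint32_t family, model, stepping; };

// Hypothetical decoder for CPUID leaf 1, EAX.
CpuSignature DecodeCpuid1Eax(uint32_t eax) {
    const uint32_t stepping  =  eax        & 0xF;  // bits 3:0
    const uint32_t model     = (eax >> 4)  & 0xF;  // bits 7:4
    const uint32_t family    = (eax >> 8)  & 0xF;  // bits 11:8
    const uint32_t extModel  = (eax >> 16) & 0xF;  // bits 19:16
    const uint32_t extFamily = (eax >> 20) & 0xFF; // bits 27:20; the mask drops reserved bits 31:28

    CpuSignature sig{ family, model, stepping };
    if (family == 0xF) sig.family = family + extFamily;
    if (family == 0x6 || family == 0xF) sig.model = (extModel << 4) + model;
    return sig;
}
```

For example, `0x00870F10` decodes to family 0x17, model 0x71, and `0x000906EA` to family 0x6, model 0x9E.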
- Introduce both s_imageCache and s_krnlCache on all platforms, even if unused (they will be reused later to unify platform handling)
- This means that userland images, which used to be unsorted, are now sorted
The name was a bit misleading, as it could be mistaken to mean "the cache contains the address" rather than "has an image with this start address", i.e., it could be mistaken for GetImageForAddress( startAddress ) != nullptr.
This was causing issues in the Infos -> Trace Statistics window, as `GetCallstackFrameCount` uses `m_pendingCallstackFrames`. Just in case, initialize all those variables where they are declared instead of in the constructor.
This disables the warnings for MSVC, GCC and Clang in the ZoneScopedXX
macros.
The warnings produced are both a false positive since they didn't find a
bug *and* they don't happen in user written code, so the user couldn't
even do much about it. The previous workaround of using ZoneNamedXXX is
a poor solution since the Zone(Text|Name|etc.) macros all rely on the
`___tracy_scoped_zone` name.
`ZoneNameF(fmt, ...)` was error-prone: if the format is inconsistent
with arguments, the code compiled fine without errors or warnings.
Use gcc/clang `__attribute__((format(printf, fmt_idx, arg_idx)))`
function attribute, so that `ZoneNameF(fmt, ...)` can now check
at compilation time the consistency between the format and the
arguments.
See https://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Function-Attributes.html
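Here is a minimal sketch of the attribute on a hypothetical free function, `FormatName`. The indices are 1-based. For a non-static member function, the implicit `this` counts as argument 1, so both indices shift by one. With `-Wformat` (enabled by `-Wall`), a call such as `FormatName( "%s", 42 )` is then flagged at compile time.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

#if defined(__GNUC__)
#  define PRINTF_LIKE(fmtIdx, argIdx) __attribute__((format(printf, fmtIdx, argIdx)))
#else
#  define PRINTF_LIKE(fmtIdx, argIdx)
#endif

// Declaration carries the attribute: argument 1 is the format, checking starts at argument 2.
std::string FormatName(const char* fmt, ...) PRINTF_LIKE(1, 2);

std::string FormatName(const char* fmt, ...) {
    char buf[256];
    va_list args;
    va_start(args, fmt);
    const int n = vsnprintf(buf, sizeof(buf), fmt, args);
    va_end(args);
    if (n < 0) return {};
    // vsnprintf returns the untruncated length; clamp to what fits in buf
    const std::size_t len = static_cast<std::size_t>(n) < sizeof(buf) ? static_cast<std::size_t>(n) : sizeof(buf) - 1;
    return std::string(buf, len);
}
```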
In Profiler::DequeueSerial, if AppendData fails partway through m_serialDequeue, the elements could be freed again in Profiler::ClearSerial, which leads to memory corruption in rpmalloc.
This provides some instructions and tips for the manual. Also:
* Made the calibration feature a CMake option
* Cleaned up some minor code issues
* Fixed an issue with the calibration
* Incremented patch number
add_custom_command DEPENDS can only see the outputs of other commands or target names, not targets installed through ExternalProject_Add.
Add the `embed` target (created by `ExternalProject_Add`) as a dependency for commands that use the utility.
Callstack frames will now have nullptr as the value in the callstackFrameMap
map, as a way to signal that a query for given key is already pending.
Duplicate queries should no longer happen.
@slomp provided an alternative implementation, which produced the following
results:
Queries made: 195,778
Duplicate queries skipped: 9,518,910
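The sentinel scheme can be sketched with a hypothetical `FrameResolver`. An entry holding `nullptr` means a query is in flight, so a second request for the same address is skipped instead of being sent again. `FrameData` and the method names are assumptions, not Tracy's types.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>

struct FrameData { int line; }; // hypothetical resolved-frame payload

class FrameResolver {
public:
    // Returns true if a new query was issued for addr.
    bool Request(uint64_t addr) {
        // nullptr marks "query pending"; an existing entry (pending or resolved) means skip
        if (!m_frames.try_emplace(addr, nullptr).second) return false;
        ++m_queriesSent;
        return true;
    }
    void OnResolved(uint64_t addr, int line) {
        m_frames[addr] = std::make_unique<FrameData>(FrameData{ line });
    }
    bool IsPending(uint64_t addr) const {
        auto it = m_frames.find(addr);
        return it != m_frames.end() && it->second == nullptr;
    }
    int QueriesSent() const { return m_queriesSent; }

private:
    std::unordered_map<uint64_t, std::unique_ptr<FrameData>> m_frames;
    int m_queriesSent = 0;
};
```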
Co-authored-by: Marcos Slomp <slomp@adobe.com>
Send embedding requests in batches. This shaves off ~4% of run time.
Save a ready-to-use database instead of intermediate data that needs to be
processed.
Previously, the reminder was prepended to each user message, and every copy
persisted in the chat history. This change injects the system reminder into
the assistant output, without storing it in the history.
The change also adds an initial <think> tag to each of the assistant replies
to force the thinking process.
The source is now the markdown text, not latex. Each embedding is now
a single paragraph from the manual. The search now returns whole sections
from the manual matching the query.
The following command was used:
```
pandoc --wrap=none --reference-location=block --number-sections -s tracy.tex -o tracy.md
```
This should be part of the build process, but that would require pandoc,
which users may not have installed.
The ollama tool_calls do not preserve the context of why the tool was
called. The feature as it is right now doesn't seem to be designed for
the LLM to make queries, but rather to act as an agent that is supposed
to call some functions that will directly provide their output to the
user.
Instead, let's encode the tool calling protocol in the system prompt.
Synchronizes the GPU timeline periodically. This is needed to counter
network time updates that cause drift in the GPU and CPU timeline.
Signed-off-by: Eric Eaton <erieaton@amd.com>
This value is not used for anything, it was just a number displayed in
the UI without much meaning to anyone.
Operations on the queue during early init may not work correctly, stopping
some programs from running past the calibration loop.
```yaml
# AllowShortEnumsOnASingleLine: true # Broken for some reason, even in the latest versions of clang-format... So don't use it, or it may change formatting in the future.
AllowShortLambdasOnASingleLine: All
BreakConstructorInitializers: BeforeComma
BreakStringLiterals: false
ColumnLimit: 120
SpaceAfterTemplateKeyword: false
AlwaysBreakTemplateDeclarations: Yes
# Allman seems to break lambda formatting for some reason with `ColumnLimit: 0`. See https://github.com/llvm/llvm-project/issues/50275
# Even though it is supposed to have been fixed, the issue still remains in 20.1.8 (and is very much present in 18.x, which is the one shipped by VS2022 and VSCode clangd as of 2025-07-27).
# Things work fine with `BasedOnStyle: Microsoft`, so use that instead
#BreakBeforeBraces: Allman
ColumnLimit: 0
# We'd like to use LeftWithLastLine but it's only available in >=19.x
```
```cmake
set_option(TRACY_DISALLOW_HW_TIMER "Disallow hardware timer (may be useful on VMs). Requires TRACY_TIMER_FALLBACK=ON" OFF TracyClient)
set_option(TRACY_LIBUNWIND_BACKTRACE "Use libunwind backtracing where supported" OFF TracyClient)
set_option(TRACY_SYMBOL_OFFLINE_RESOLVE "Instead of full runtime symbol resolution, only resolve the image path and offset to enable offline symbol resolution" OFF TracyClient)
set_option(TRACY_LIBBACKTRACE_ELF_DYNLOAD_SUPPORT "Enable libbacktrace to support dynamically loaded elfs in symbol resolution after the first symbol resolve operation" OFF TracyClient)
```
### A real time, nanosecond resolution, remote telemetry, hybrid frame and sampling profiler for games and other applications.
Tracy supports profiling CPU (Direct support is provided for C, C++, Lua, Python and Fortran integration. At the same time, third-party bindings to many other languages exist on the internet, such as [Rust](https://github.com/nagisa/rust_tracy_client), [Zig](https://github.com/tealsnow/zig-tracy), [C#](https://github.com/clibequilibrium/Tracy-CSharp), [OCaml](https://github.com/imandra-ai/ocaml-tracy), [Odin](https://github.com/oskarnp/odin-tracy), etc.), GPU (All major graphics/compute APIs: OpenGL, Vulkan, Direct3D 11/12, Metal, OpenCL, CUDA, WebGPU.), memory allocations, locks, context switches, automatically attribute screenshots to captured frames, and much more.
- [Documentation](https://github.com/wolfpld/tracy/releases/latest/download/tracy.pdf) for usage and build process instructions
- [Releases](https://github.com/wolfpld/tracy/releases) containing the documentation (`tracy.pdf`) and compiled Windows x64 binaries (`Tracy-<version>.7z`) as assets
```
}
#endif

    InitIsStdoutATerminal();
    InitTerminalDetection();
    bool overwrite = false;
    const char* address = "127.0.0.1";
@@ -165,27 +115,9 @@ int main( int argc, char** argv )
    printf( "Connecting to %s:%i...", address, port );
    fflush( stdout );
    tracy::Worker worker( address, port, memoryLimit );
    while( !worker.HasData() )
    {
        const auto handshake = worker.GetHandshakeStatus();
        if( handshake == tracy::HandshakeProtocolMismatch )
        {
            printf( "\nThe client you are trying to connect to uses incompatible protocol version.\nMake sure you are using the same Tracy version on both client and server.\n" );
            return 1;
        }
        if( handshake == tracy::HandshakeNotAvailable )
        {
            printf( "\nThe client you are trying to connect to is no longer able to sent profiling data,\nbecause another server was already connected to it.\nYou can do the following:\n\n 1. Restart the client application.\n 2. Rebuild the client application with on-demand mode enabled.\n" );
            return 2;
        }
        if( handshake == tracy::HandshakeDropped )
        {
            printf( "\nThe client you are trying to connect to has disconnected during the initial\nconnection handshake. Please check your network configuration.\n" );
```