`system-profile`¶

Pack: ARIS skills

Category: drafting

Field: —

License: MIT

Updated: 2026-05-18

Stages: paper-drafting

↗ view SKILL.md on source · GitHub stars

System Profile¶

Profile the specified target and summarize the results. Target: $ARGUMENTS

Instructions¶

You are a profiling assistant. Based on the user's target, choose appropriate profiling strategies, including writing instrumentation code when needed, then run profiling, analyze results, and produce a summary.

Step 1: Determine the profiling target¶

Parse $ARGUMENTS to understand what to profile. Examples: - A Python script or module - A running process (PID or service name) - A specific function or code block - An entire framework or system (e.g., "autogen", "vllm serving") — profile its end-to-end execution, identify bottlenecks across components - "gpu" / "interconnect" / "memory" for focused profiling

If $ARGUMENTS is empty or unclear, ask the user.

Step 2: Choose profiling methods¶

Select from external tools and/or code instrumentation as appropriate. Don't limit yourself to the examples below — use whatever makes sense for the target.

External tools (check availability first): - CPU: cProfile, py-spy, line_profiler, perf stat, /usr/bin/time -v - Memory: tracemalloc, memory_profiler, memray - GPU: nvidia-smi, nvidia-smi dmon, nvitop, torch.profiler, nsys - Interconnect: nvidia-smi topo -m, nvidia-smi nvlink, NCCL_DEBUG=INFO - System: strace -c, iostat, vmstat

Code instrumentation — when external tools are insufficient, write and insert profiling code into the target. Typical scenarios: - Timing specific code blocks (wall time vs CPU time) - Measuring CPU-GPU or GPU-GPU transfer size, frequency, and bandwidth - Tracking memory allocation across CPU and GPU to detect redundancy - Wrapping NCCL collectives to measure latency and throughput - Adding CUDA event timing around kernels

Design the instrumentation based on what you observe in the code — don't use a fixed template.

Step 3: Key dimensions to investigate¶

Depending on the target, focus on some or all of these:

CPU overhead - Context switching (voluntary / involuntary) - CPU utilization: ratio of CPU time to wall time - Per-function execution time hotspots

Memory overhead - CPU and GPU memory usage (allocated vs reserved vs peak) - Redundant replication: same data living on both CPU and GPU - Per-device allocation balance in multi-GPU setups

Interconnect & communication - CPU-GPU transfer: frequency, per-transfer size, total volume, bandwidth achieved - GPU-GPU transfer: P2P bandwidth, NVLink vs PCIe topology impact - NCCL collectives: operation type, message size distribution, latency - Communication-to-computation ratio

GPU compute - SM utilization, kernel launch overhead - Memory bandwidth utilization vs peak

Step 4: Instrumentation guidelines¶

When inserting code into the target: 1. Read and understand the target code first 2. Prefer wrapping (decorator, context manager, standalone runner) over inline edits 3. If inline edits are necessary, mark them clearly (e.g., # [PROFILE] comments) 4. Minimize observer effect — don't instrument tight inner loops; sample instead 5. Collect results into a structured log, don't scatter print statements

Step 5: Run profiling¶

Check available tools and hardware topology
Run the chosen methods, capture all output
Save artifacts (flamegraphs, traces, logs) to ./profile_output/

Step 6: Produce the report¶

Part A — Profiling results (structured tables by dimension, as applicable): - CPU overhead table - Memory overhead table (with redundancy column) - Interconnect table (transfer type / frequency / size / latency / bandwidth) - Hotspots / bottleneck identification - Actionable recommendations ranked by expected impact

Part B — Instrumentation changelog (MANDATORY): List every file that was modified or created for profiling purposes:

File	Change type	What was added/modified	Line(s)
...	modified	...	...
...	created	...	—

This allows the user to review and revert all instrumentation changes. Offer to clean up (remove all instrumentation) when the user is done.

system-profile¶