
NOTICE: Existing SQLite export found: docs/trace-artifacts/2026-05-15-dsv4-deepep/nsys-single-decode-token-uninit/trace.sqlite
        It is assumed file was previously exported from: docs/trace-artifacts/2026-05-15-dsv4-deepep/nsys-single-decode-token-uninit/trace.nsys-rep
        Consider using --force-export=true if needed.

Processing [docs/trace-artifacts/2026-05-15-dsv4-deepep/nsys-single-decode-token-uninit/trace.sqlite] with [/opt/nvidia/nsight-systems/2023.2.3/host-linux-x64/reports/cuda_api_sum.py]... 

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)   StdDev (ns)                 Name               
 --------  ---------------  ---------  ---------  ---------  --------  ---------  -----------  ----------------------------------
     36.4    1,321,254,467     37,235   35,484.2    2,360.0       166  1,342,646     75,183.6  cuMemFreeAsync                    
     31.7    1,153,234,518     37,235   30,971.8    8,725.0       114    758,841     42,520.3  cuMemAllocAsync                   
     10.9      396,508,246     38,791   10,221.7    7,594.0     2,679    109,395      7,186.6  cudaLaunchKernel                  
      8.8      320,617,570      1,044  307,105.0  179,998.5    11,251  5,908,210    382,276.2  cuMemcpyDtoHAsync_v2              
      6.1      221,017,031     25,891    8,536.4    5,674.0       111    211,368      7,505.7  cuMemsetD8Async                   
      1.1       39,610,880      6,896    5,744.0    2,921.5       508     90,507      6,976.9  cudaEventRecord                   
      0.9       34,255,819      2,084   16,437.5   12,047.5     4,564    102,430     11,594.7  cuMemcpyDtoDAsync_v2              
      0.9       34,228,860      6,192    5,527.9    2,755.5       550     64,614      6,834.5  cudaStreamWaitEvent               
      0.9       31,697,812      3,096   10,238.3    7,586.5     3,079     88,675      7,226.5  cuLaunchKernelEx                  
      0.8       29,674,977      2,076   14,294.3   10,242.0     3,119     84,404     10,717.9  cuMemcpyHtoDAsync_v2              
      0.6       21,147,008      6,616    3,196.3    1,426.0       243     55,548      4,679.8  cudaStreamGetCaptureInfo_v2_v11030
      0.4       15,583,435         16  973,964.7  958,105.5    34,068  2,301,808    809,299.1  cuStreamSynchronize               
      0.2        6,930,696      3,096    2,238.6      650.0       130     60,138      4,296.1  cudaGetFuncBySymbol_v11000        
      0.1        5,113,762        344   14,865.6   11,879.5     4,457     50,063      9,073.3  cudaMemsetAsync                   
      0.1        2,244,146          8  280,518.3  256,251.0   146,785    458,056    111,754.5  cuMemGetInfo_v2                   
      0.0           21,184          1   21,184.0   21,184.0    21,184     21,184          0.0  cuCtxSynchronize                  
      0.0            6,035          1    6,035.0    6,035.0     6,035      6,035          0.0  cuProfilerStart                   
      0.0            3,013          1    3,013.0    3,013.0     3,013      3,013          0.0  cuProfilerStop                    

