
NOTICE: Existing SQLite export found: /root/arle-nsys-token-current/docs/trace-artifacts/2026-05-15-dsv4-deepep/nsys-single-token-live/trace.sqlite
        It is assumed file was previously exported from: /root/arle-nsys-token-current/docs/trace-artifacts/2026-05-15-dsv4-deepep/nsys-single-token-live/trace.nsys-rep
        Consider using --force-export=true if needed.

Processing [/root/arle-nsys-token-current/docs/trace-artifacts/2026-05-15-dsv4-deepep/nsys-single-token-live/trace.sqlite] with [/opt/nvidia/nsight-systems/2023.2.3/host-linux-x64/reports/cuda_api_sum.py]... 

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                 Name               
 --------  ---------------  ---------  -----------  -----------  --------  ----------  -----------  ----------------------------------
     29.4    1,201,678,107     37,436     32,099.5      2,616.0       461   1,146,225     67,531.0  cuMemFreeAsync                    
     29.1    1,189,240,572     39,183     30,350.9     10,186.0       560  11,005,195    120,154.4  cuMemAllocAsync                   
     13.7      558,192,352      1,935    288,471.5     83,948.0    11,565  15,454,929    643,376.8  cuMemcpyDtoHAsync_v2              
      9.5      388,167,897     40,345      9,621.2      6,994.0     2,671      75,817      6,903.0  cudaLaunchKernel                  
      9.3      378,402,631     41,629      9,089.9      6,066.0     1,192     315,640      7,780.4  cuMemsetD8Async                   
      2.1       85,268,037         16  5,329,252.3  5,250,586.0   348,670  10,710,038  5,018,753.4  cuStreamSynchronize               
      1.6       63,681,012      3,780     16,846.8     12,433.0     3,977     714,028     16,067.0  cuMemcpyDtoDAsync_v2              
      1.4       56,636,488      7,584      7,467.9      4,088.5       485      71,109      8,121.8  cudaEventRecord                   
      1.3       53,097,236      6,880      7,717.6      4,248.5       565      61,439      8,320.8  cudaStreamWaitEvent               
      1.0       41,278,365      3,440     11,999.5      8,938.5     3,371     473,498     11,169.4  cuLaunchKernelEx                  
      0.6       26,344,629      1,935     13,614.8      9,333.0     2,509      80,061     10,242.8  cuMemcpyHtoDAsync_v2              
      0.6       22,853,438      6,960      3,283.5      1,422.5       244      93,629      5,116.4  cudaStreamGetCaptureInfo_v2_v11030
      0.3       11,614,332      3,440      3,376.3      1,024.0       140      49,236      5,386.4  cudaGetFuncBySymbol_v11000        
      0.1        4,553,467        344     13,236.8     10,112.5     4,613      49,777      7,916.4  cudaMemsetAsync                   
      0.0        1,181,346          8    147,668.3    147,952.0    71,924     231,869     67,552.8  cuMemGetInfo_v2                   
      0.0          121,389         40      3,034.7      1,685.0       654      20,285      4,281.6  cudaStreamIsCapturing_v10000      
      0.0           23,392          2     11,696.0     11,696.0     8,075      15,317      5,120.9  cuCtxSynchronize                  
      0.0            5,627          1      5,627.0      5,627.0     5,627       5,627          0.0  cuProfilerStart                   
      0.0            1,545          1      1,545.0      1,545.0     1,545       1,545          0.0  cuProfilerStop                    
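
The summary rows above are easier to reason about when grouped by API family. The following is a minimal sketch, not part of the nsys tooling, that parses rows in this layout and sums total time per family. The names `SAMPLE`, `parse_rows`, `family_of`, and `total_by_family`, as well as the family labels, are hypothetical helpers introduced for illustration; the four sample rows are copied from the table above, and the parser assumes every data row has exactly nine whitespace-separated fields with the API name last.

```python
from collections import defaultdict

# Four rows copied verbatim from the cuda_api_sum table above.
SAMPLE = """\
    29.4    1,201,678,107     37,436     32,099.5      2,616.0       461   1,146,225     67,531.0  cuMemFreeAsync
    29.1    1,189,240,572     39,183     30,350.9     10,186.0       560  11,005,195    120,154.4  cuMemAllocAsync
    13.7      558,192,352      1,935    288,471.5     83,948.0    11,565  15,454,929    643,376.8  cuMemcpyDtoHAsync_v2
     9.5      388,167,897     40,345      9,621.2      6,994.0     2,671      75,817      6,903.0  cudaLaunchKernel
"""


def parse_rows(text):
    """Return one dict per data row; header, separator, and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue  # not a data row (assumption: API names contain no spaces)
        try:
            nums = [float(f.replace(",", "")) for f in fields[:8]]
        except ValueError:
            continue  # e.g. the dashed separator line
        rows.append({
            "name": fields[8],
            "time_pct": nums[0],
            "total_ns": int(nums[1]),
            "calls": int(nums[2]),
        })
    return rows


def family_of(name):
    """Hypothetical grouping of API names into coarse families."""
    if "MemAlloc" in name or "MemFree" in name:
        return "alloc/free"
    if "Memcpy" in name:
        return "memcpy"
    if "Launch" in name:
        return "launch"
    return "other"


def total_by_family(rows):
    """Sum total time in nanoseconds per family."""
    totals = defaultdict(int)
    for row in rows:
        totals[family_of(row["name"])] += row["total_ns"]
    return dict(totals)


if __name__ == "__main__":
    for family, ns in sorted(total_by_family(parse_rows(SAMPLE)).items(),
                             key=lambda kv: -kv[1]):
        print(f"{family:12s} {ns / 1e6:10.1f} ms")
```

Applied to the full table, this kind of grouping makes the dominant cost visible at a glance: the `Time (%)` column shows `cuMemFreeAsync` and `cuMemAllocAsync` together at 58.5% of API time (29.4 + 29.1).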

