Common SQLite examples

This section lists recepies to frequently asked questions on “how-to’s” with NVIDIA Nsight Systems data exported in SQLite format.

General notes

The data exported into SQLite database don’t contain any indexes by default and are as close to the contents of the report as possible. Therefore, for better readability and speed of execution, queries below will alter tables.

Common helper commands

When utilizing sqlite3 commandline tool, it’s helpful to have data printed as named columns, this can be done with:

.mode column
.headers on

Default column width is determined by the data in the first row of results. If this doesn’t work out well, you can specify widths manually:

.width 10 20 50

Obtaining sample report

CLI interface of Nsight Systems was used to profile radixSortThrust CUDA sample, then the resulting .nsys-rep file was exported using the nsys export.

nsys profile --trace=cuda,osrt radixSortThrust
nsys export --type sqlite report1.nsys-rep

Serialized process and thread identifiers

NVIDIA Nsight Systems stores identifiers where events originated in serialized form. For events that have globalTid or globalPid fields exported, use the following code to extract numeric TID and PID:

SELECT globalTid / 0x1000000 % 0x1000000 AS PID, globalTid % 0x1000000 AS TID FROM TABLE_NAME;

NOTE: globalTid field includes both TID and PID values, while globalPid only the PID value.

Correlate CUDA kernel launches with CUDA API kernel launches

ALTER TABLE CUPTI_ACTIVITY_KIND_RUNTIME ADD COLUMN name TEXT;
ALTER TABLE CUPTI_ACTIVITY_KIND_RUNTIME ADD COLUMN kernelName TEXT;

UPDATE CUPTI_ACTIVITY_KIND_RUNTIME SET kernelName =
    (SELECT value FROM StringIds
    JOIN CUPTI_ACTIVITY_KIND_KERNEL AS cuda_gpu
        ON cuda_gpu.shortName = StringIds.id
        AND CUPTI_ACTIVITY_KIND_RUNTIME.correlationId = cuda_gpu.correlationId);

UPDATE CUPTI_ACTIVITY_KIND_RUNTIME SET name =
    (SELECT value FROM StringIds WHERE nameId = StringIds.id);

-- Select 10 longest CUDA API ranges that resulted in kernel execution.
SELECT name, kernelName, start, end FROM CUPTI_ACTIVITY_KIND_RUNTIME
    WHERE kernelName IS NOT NULL ORDER BY end - start LIMIT 10;

Results:

name                    kernelName               start      end      
----------------------  -----------------------  ---------  ---------
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  658863435  658868490
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  609755015  609760075
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  632683286  632688349
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  606495356  606500439
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  603114486  603119586
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  802729785  802734906
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  593381170  593386294
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  658759955  658765090
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  681549917  681555059
cudaLaunchKernel_v7000  RadixSortScanBinsKernel  717812527  717817671

Remove ranges overlapping with overhead

-- Use the this query to count CUDA API ranges overlapping with the overhead ones.
-- Replace "SELECT COUNT(*)" with "DELETE" to remove such ranges.

SELECT COUNT(*) FROM CUPTI_ACTIVITY_KIND_RUNTIME WHERE rowid IN
(
    SELECT cuda.rowid
    FROM PROFILER_OVERHEAD as overhead
    INNER JOIN CUPTI_ACTIVITY_KIND_RUNTIME as cuda ON
    (cuda.start BETWEEN overhead.start and overhead.end)
    OR (cuda.end BETWEEN overhead.start and overhead.end)
    OR (cuda.start < overhead.start AND cuda.end > overhead.end)
);

Results:

COUNT(*)
--------
1095    

Find CUDA API calls that resulted in original graph node creation

SELECT graph.graphNodeId, api.start, graph.start as graphStart, api.end,
    api.globalTid, api.correlationId, api.globalTid,
    (SELECT value FROM StringIds where api.nameId == id) as name
FROM CUPTI_ACTIVITY_KIND_RUNTIME as api
JOIN
    (
        SELECT start, graphNodeId, globalTid from CUDA_GRAPH_EVENTS
        GROUP BY graphNodeId
        HAVING COUNT(originalGraphNodeId) = 0
    ) as graph
ON api.globalTid == graph.globalTid AND api.start < graph.start AND api.end > graph.start
ORDER BY graphNodeId;

Results:

graphNodeId  start      graphStart  end        globalTid        correlationId  globalTid        name                         
-----------  ---------  ----------  ---------  ---------------  -------------  ---------------  -----------------------------
1            584366518  584378040   584379102  281560221750233  109            281560221750233  cudaGraphAddMemcpyNode_v10000
2            584379402  584382428   584383139  281560221750233  110            281560221750233  cudaGraphAddMemsetNode_v10000
3            584390663  584395352   584396053  281560221750233  111            281560221750233  cudaGraphAddKernelNode_v10000
4            584396314  584397857   584398438  281560221750233  112            281560221750233  cudaGraphAddMemsetNode_v10000
5            584398759  584400311   584400812  281560221750233  113            281560221750233  cudaGraphAddKernelNode_v10000
6            584401083  584403047   584403527  281560221750233  114            281560221750233  cudaGraphAddMemcpyNode_v10000
7            584403928  584404920   584405491  281560221750233  115            281560221750233  cudaGraphAddHostNode_v10000  
29           632107852  632117921   632121407  281560221750233  144            281560221750233  cudaMemcpyAsync_v3020        
30           632122168  632125545   632127989  281560221750233  145            281560221750233  cudaMemsetAsync_v3020        
31           632131546  632133339   632135584  281560221750233  147            281560221750233  cudaMemsetAsync_v3020        
34           632162514  632167393   632169297  281560221750233  151            281560221750233  cudaMemcpyAsync_v3020        
35           632170068  632173334   632175388  281560221750233  152            281560221750233  cudaLaunchHostFunc_v10000    

Backtraces for OSRT ranges

-- Adding text columns makes results of the query below more human-readable.
-- These steps are completely optional.
ALTER TABLE OSRT_API ADD COLUMN name TEXT;
UPDATE OSRT_API SET name = (SELECT value FROM StringIds WHERE OSRT_API.nameId = StringIds.id);

ALTER TABLE OSRT_CALLCHAINS ADD COLUMN symbolName TEXT;
UPDATE OSRT_CALLCHAINS SET symbolName = (SELECT value FROM StringIds WHERE symbol = StringIds.id);

ALTER TABLE OSRT_CALLCHAINS ADD COLUMN moduleName TEXT;
UPDATE OSRT_CALLCHAINS SET moduleName = (SELECT value FROM StringIds WHERE module = StringIds.id);

-- Print backtrace of the longest OSRT range
SELECT globalTid / 0x1000000 % 0x1000000 AS PID, globalTid % 0x1000000 AS TID,
    start, end, name, callchainId, stackDepth, symbolName, moduleName
FROM OSRT_API LEFT JOIN OSRT_CALLCHAINS ON callchainId == OSRT_CALLCHAINS.id
WHERE OSRT_API.rowid IN (SELECT rowid FROM OSRT_API ORDER BY end - start DESC LIMIT 1)
ORDER BY stackDepth LIMIT 10;

Results:

PID    TID    start      end        name                    callchainId  stackDepth  symbolName                      moduleName                                 
-----  -----  ---------  ---------  ----------------------  -----------  ----------  ------------------------------  -------------------------------------------
19163  19176  360897690  860966851  pthread_cond_timedwait  88           0           pthread_cond_timedwait@GLIBC_2  /lib/x86_64-linux-gnu/libpthread-2.27.so   
19163  19176  360897690  860966851  pthread_cond_timedwait  88           1           0x7fbc983b7227                  /usr/lib/x86_64-linux-gnu/libcuda.so.418.39
19163  19176  360897690  860966851  pthread_cond_timedwait  88           2           0x7fbc9835d5c7                  /usr/lib/x86_64-linux-gnu/libcuda.so.418.39
19163  19176  360897690  860966851  pthread_cond_timedwait  88           3           0x7fbc983b64a8                  /usr/lib/x86_64-linux-gnu/libcuda.so.418.39
19163  19176  360897690  860966851  pthread_cond_timedwait  88           4           start_thread                    /lib/x86_64-linux-gnu/libpthread-2.27.so   
19163  19176  360897690  860966851  pthread_cond_timedwait  88           5           __clone                         /lib/x86_64-linux-gnu/libc-2.27.so         

Profiled processes output streams

ALTER TABLE ProcessStreams ADD COLUMN filename TEXT;
UPDATE ProcessStreams SET filename = (SELECT value FROM StringIds WHERE ProcessStreams.filenameId = StringIds.id);

ALTER TABLE ProcessStreams ADD COLUMN content TEXT;
UPDATE ProcessStreams SET content = (SELECT value FROM StringIds WHERE ProcessStreams.contentId = StringIds.id);

-- Select all collected stdout and stderr streams.
select globalPid / 0x1000000 % 0x1000000 AS PID, filename, content from ProcessStreams;

Results:

PID    filename                                                 content                                                                                        
-----  -------------------------------------------------------  -----------------------------------------------------------------------------------------------
19163  /tmp/nvidia/nsight_systems/streams/pid_19163_stdout.log  /home/user_name/NVIDIA_CUDA-10.1_Samples/6_Advanced/radixSortThrust/radixSortThrust Starting...
                                                                                                                                                               
                                                                GPU Device 0: "Quadro P2000" with compute capability 6.1                                       
                                                                                                                                                               
                                                                                                                                                               
                                                                Sorting 1048576 32-bit unsigned int keys and values                                            
                                                                                                                                                               
                                                                radixSortThrust, Throughput = 401.0872 MElements/s, Time = 0.00261 s, Size = 1048576 elements  
                                                                Test passed                                                                                    

19163  /tmp/nvidia/nsight_systems/streams/pid_19163_stderr.log                                                                                                 

Thread summary

Please note, that Nsight Systems applies additional logic during sampling events processing to work around lost events. This means that the results of the below query might differ slightly from the ones shown in “Analysis summary” tab.

-- Thread summary calculated using CPU cycles (when available)
SELECT
    globalTid / 0x1000000 % 0x1000000 AS PID,
    globalTid % 0x1000000 AS TID,
    ROUND(100.0 * SUM(cpuCycles) / 
        (
            SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS 
            GROUP BY globalTid / 0x1000000000000 % 0x100
        ),
        2
    ) as CPU_utilization,
    (SELECT value FROM StringIds WHERE id = 
        (
            SELECT nameId FROM ThreadNames 
            WHERE ThreadNames.globalTid = COMPOSITE_EVENTS.globalTid
        )
    ) as thread_name
FROM COMPOSITE_EVENTS
GROUP BY globalTid
ORDER BY CPU_utilization DESC
LIMIT 10;

Results:

PID    TID    CPU_utilization  thread_name        
-----  -----  ---------------  -------------------
19163  19163  98.4             radixSortThrust    
19163  19168  1.35             CUPTI worker thread
19163  19166  0.25             [NS]               

Thread running time may be calculated using scheduling data, when PMU counter data was not collected.

CREATE INDEX sched_start ON SCHED_EVENTS (start);

-- Intermediate table
CREATE TABLE CPU_USAGE AS
SELECT
    first.globalTid as globalTid,
    (SELECT nameId FROM ThreadNames WHERE ThreadNames.globalTid = first.globalTid) as nameId,
    sum(second.start - first.start) as total_duration,
    count() as ranges_count
FROM SCHED_EVENTS as first
LEFT JOIN SCHED_EVENTS as second
ON second.rowid =
    (
        SELECT rowid
        FROM SCHED_EVENTS
        WHERE start > first.start AND globalTid = first.globalTid
        ORDER BY start ASC
        LIMIT 1
    )
WHERE first.isSchedIn != 0
GROUP BY first.globalTid
ORDER BY total_duration DESC;

-- Thread summary calculated using scheduled time
SELECT
    globalTid / 0x1000000 % 0x1000000 AS PID,
    globalTid % 0x1000000 AS TID,
    (SELECT value FROM StringIds where nameId == id) as thread_name,
    ROUND(100.0 * total_duration / (SELECT SUM(total_duration) FROM CPU_USAGE), 2) as CPU_utilization
FROM CPU_USAGE
ORDER BY CPU_utilization DESC;

Results:

PID    TID    thread_name          CPU_utilization
-----  -----  -------------------  ---------------
19163  19163  radixSortThrust      93.74          
19163  19169  radixSortThrust      3.22           
19163  19168  CUPTI worker thread  2.46           
19163  19166  [NS]                 0.44           
19163  19172  radixSortThrust      0.07           
19163  19167  [NS Comms]           0.05           
19163  19176  radixSortThrust      0.02           
19163  19170  radixSortThrust      0.0            

Function table

These examples demonstrate how to calculate Flat and BottomUp (for top level only) views statistics.

ALTER TABLE SAMPLING_CALLCHAINS ADD COLUMN symbolName TEXT;
UPDATE SAMPLING_CALLCHAINS SET symbolName = (SELECT value FROM StringIds WHERE symbol = StringIds.id);

ALTER TABLE SAMPLING_CALLCHAINS ADD COLUMN moduleName TEXT;
UPDATE SAMPLING_CALLCHAINS SET moduleName = (SELECT value FROM StringIds WHERE module = StringIds.id);


-- Flat view
SELECT symbolName, moduleName, ROUND(100.0 * sum(cpuCycles) /
    (SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS), 2) AS flatTimePercentage
FROM SAMPLING_CALLCHAINS
LEFT JOIN COMPOSITE_EVENTS ON SAMPLING_CALLCHAINS.id == COMPOSITE_EVENTS.id
GROUP BY symbol, module
ORDER BY flatTimePercentage DESC
LIMIT 5;

-- BottomUp view (top level only)
SELECT symbolName, moduleName, ROUND(100.0 * sum(cpuCycles) /
    (SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS), 2) AS selfTimePercentage
FROM SAMPLING_CALLCHAINS
LEFT JOIN COMPOSITE_EVENTS ON SAMPLING_CALLCHAINS.id == COMPOSITE_EVENTS.id
WHERE stackDepth == 0
GROUP BY symbol, module
ORDER BY selfTimePercentage DESC
LIMIT 5;

Results:

symbolName                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   moduleName                                                                           flatTimePercentage
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  -----------------------------------------------------------------------------------  ------------------
[Max depth]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [Max depth]                                                                          99.92             
thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> > thrust::system::detail::sequential::for_each<thrust::system::cpp::detail::tag, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> >, thrust::detail::unary_transform_functor<thrust::system::detail::generic::sequence_detail::sequence_functor<unsigned int> > >(thrust::system::detail::sequential::execution_policy<thrust::system::cpp::detail::tag>&, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> >, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> >, thrust::detail::unary_transform_functor<thrust::system::detail::generic::sequence_detail::sequence_functor<unsigned int> >)  /home/user_name/NVIDIA_CUDA-10.1_Samples/6_Advanced/radixSortThrust/radixSortThrust  24.17             
thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> > thrust::for_each<thrust::system::cpp::detail::tag, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> >, thrust::detail::unary_transform_functor<thrust::system::detail::generic::sequence_detail::sequence_functor<unsigned int> > >(thrust::detail::execution_policy_base<thrust::system::cpp::detail::tag> const&, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> >, thrust::zip_iterator<thrust::tuple<thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> >, thrust::detail::unary_transform_functor<thrust::system::detail::generic::sequence_detail::sequence_functor<unsigned int> >)                                       /home/user_name/NVIDIA_CUDA-10.1_Samples/6_Advanced/radixSortThrust/radixSortThrust  24.17             
thrust::detail::normal_iterator<unsigned int*> thrust::system::detail::generic::transform<thrust::system::cpp::detail::tag, thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::system::detail::generic::sequence_detail::sequence_functor<unsigned int> >(thrust::execution_policy<thrust::system::cpp::detail::tag>&, thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::system::detail::generic::sequence_detail::sequence_functor<unsigned int>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             /home/user_name/NVIDIA_CUDA-10.1_Samples/6_Advanced/radixSortThrust/radixSortThrust  24.17             
thrust::detail::normal_iterator<unsigned int*> thrust::transform<thrust::system::cpp::detail::tag, thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::system::detail::generic::sequence_detail::sequence_functor<unsigned int> >(thrust::detail::execution_policy_base<thrust::system::cpp::detail::tag> const&, thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::counting_iterator<long, thrust::use_default, thrust::use_default, long>, thrust::detail::normal_iterator<unsigned int*>, thrust::system::detail::generic::sequence_detail::sequence_functor<unsigned int>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   /home/user_name/NVIDIA_CUDA-10.1_Samples/6_Advanced/radixSortThrust/radixSortThrust  24.17             
symbolName                                                                                                                                                                                       moduleName                                                                           selfTimePercentage
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  -----------------------------------------------------------------------------------  ------------------
0x7fbc984982b6                                                                                                                                                                                   /usr/lib/x86_64-linux-gnu/libcuda.so.418.39                                          5.29              
0x7fbc982d0010                                                                                                                                                                                   /usr/lib/x86_64-linux-gnu/libcuda.so.418.39                                          2.81              
thrust::iterator_facade<thrust::detail::normal_iterator<unsigned int*>, unsigned int, thrust::system::cpp::detail::tag, thrust::random_access_traversal_tag, unsigned int&, long>::derived()     /home/user_name/NVIDIA_CUDA-10.1_Samples/6_Advanced/radixSortThrust/radixSortThrust  2.23              
thrust::iterator_facade<thrust::detail::normal_iterator<unsigned int*>, unsigned int, thrust::system::cpp::detail::tag, thrust::random_access_traversal_tag, unsigned int&, long>::operator++()  /home/user_name/NVIDIA_CUDA-10.1_Samples/6_Advanced/radixSortThrust/radixSortThrust  1.55              
void thrust::iterator_core_access::increment<thrust::detail::normal_iterator<unsigned int*> >(thrust::detail::normal_iterator<unsigned int*>&)                                                   /home/user_name/NVIDIA_CUDA-10.1_Samples/6_Advanced/radixSortThrust/radixSortThrust  1.55              

DX12 API frame duration histogram

The example demonstrates how to calculate DX12 CPU frames durartion and construct a histogram out of it.

CREATE INDEX DX12_API_ENDTS ON DX12_API (end);

CREATE TEMP VIEW DX12_API_FPS AS SELECT end AS start,
    (SELECT end FROM DX12_API
        WHERE end > outer.end AND nameId == (SELECT id FROM StringIds
            WHERE value == "IDXGISwapChain::Present")
        ORDER BY end ASC LIMIT 1) AS end
FROM DX12_API AS outer
    WHERE nameId == (SELECT id FROM StringIds WHERE value == "IDXGISwapChain::Present")
ORDER BY end;

-- Number of frames with a duration of [X, X + 1) milliseconds
SELECT
    CAST((end - start) / 1000000.0 AS INT) AS duration_ms,
    count(*)
FROM DX12_API_FPS
WHERE end IS NOT NULL
GROUP BY duration_ms
ORDER BY duration_ms;

Results:

duration_ms  count(*)
-----------  --------
3            1       
4            2       
5            7       
6            153     
7            19      
8            116     
9            16      
10           8       
11           2       
12           2       
13           1       
14           4       
16           3       
17           2       
18           1       

GPU context switch events enumeration

GPU context duration is between first BEGIN and a matching END event.

SELECT (CASE tag WHEN 8 THEN "BEGIN" WHEN 7 THEN "END" END) AS tag,
    globalPid / 0x1000000 % 0x1000000 AS PID,
    vmId, seqNo, contextId, timestamp, gpuId FROM FECS_EVENTS
WHERE tag in (7, 8) ORDER BY seqNo LIMIT 10;

Results:

tag    PID    vmId  seqNo  contextId  timestamp   gpuId
-----  -----  ----  -----  ---------  ----------  -----
BEGIN  23371  0     0      1048578    56759171    0    
BEGIN  23371  0     1      1048578    56927765    0    
BEGIN  23371  0     3      1048578    63799379    0    
END    23371  0     4      1048578    63918806    0    
BEGIN  19397  0     5      1048577    64014692    0    
BEGIN  19397  0     6      1048577    64250369    0    
BEGIN  19397  0     8      1048577    1918310004  0    
END    19397  0     9      1048577    1918521098  0    
BEGIN  19397  0     10     1048577    2024164744  0    
BEGIN  19397  0     11     1048577    2024358650  0    

Resolve NVTX category name

The example demonstrates how to resolve NVTX category name for NVTX marks and ranges.

WITH
  event AS (
    SELECT *
    FROM NVTX_EVENTS
    WHERE eventType IN (34, 59, 60) -- mark, push/pop, start/end
  ),
  category AS (
    SELECT
      category,
      domainId,
      text AS categoryName
    FROM NVTX_EVENTS
    WHERE eventType == 33 --  new category
  )
SELECT
  start,
  end,
  globalTid,
  eventType,
  domainId,
  category,
  categoryName,
  text
FROM event JOIN category USING (category, domainId)
ORDER BY start;

Results:

start     end       globalTid        eventType  domainId  category  categoryName                 text             
--------  --------  ---------------  ---------  --------  --------  ---------------------------  -----------------
18281150  18311960  281534938484214  59         0         1         FirstCategoryUnderDefault    Push Pop Range A 
18288187  18306674  281534938484214  59         0         2         SecondCategoryUnderDefault   Push Pop Range B 
18294247            281534938484214  34         0         1         FirstCategoryUnderDefault    Mark A           
18300034            281534938484214  34         0         2         SecondCategoryUnderDefault   Mark B           
18345546  18372595  281534938484214  60         1         1         FirstCategoryUnderMyDomain   Start End Range A
18352924  18378342  281534938484214  60         1         2         SecondCategoryUnderMyDomain  Start End Range B
18359634            281534938484214  34         1         1         FirstCategoryUnderMyDomain   Mark A           
18365448            281534938484214  34         1         2         SecondCategoryUnderMyDomain  Mark B           

Rename CUDA kernels with NVTX

The example demonstrates how to map innermost NVTX push-pop range to a matching CUDA kernel run.

ALTER TABLE CUPTI_ACTIVITY_KIND_KERNEL ADD COLUMN nvtxRange TEXT;
CREATE INDEX nvtx_start ON NVTX_EVENTS (start);


UPDATE CUPTI_ACTIVITY_KIND_KERNEL SET nvtxRange = (
    SELECT NVTX_EVENTS.text
    FROM NVTX_EVENTS JOIN CUPTI_ACTIVITY_KIND_RUNTIME ON
        NVTX_EVENTS.eventType == 59 AND
        NVTX_EVENTS.globalTid == CUPTI_ACTIVITY_KIND_RUNTIME.globalTid AND
        NVTX_EVENTS.start <= CUPTI_ACTIVITY_KIND_RUNTIME.start AND
        NVTX_EVENTS.end >= CUPTI_ACTIVITY_KIND_RUNTIME.end
    WHERE
        CUPTI_ACTIVITY_KIND_KERNEL.correlationId == CUPTI_ACTIVITY_KIND_RUNTIME.correlationId
    ORDER BY NVTX_EVENTS.start DESC LIMIT 1
);

SELECT start, end, globalPid, StringIds.value as shortName, nvtxRange
FROM CUPTI_ACTIVITY_KIND_KERNEL JOIN StringIds ON shortName == id
ORDER BY start LIMIT 6;

Results:

start      end        globalPid          shortName      nvtxRange
---------  ---------  -----------------  -------------  ---------
526545376  526676256  72057700439031808  MatrixMulCUDA           
526899648  527030368  72057700439031808  MatrixMulCUDA  Add      
527031648  527162272  72057700439031808  MatrixMulCUDA  Add      
527163584  527294176  72057700439031808  MatrixMulCUDA  My Kernel
527296160  527426592  72057700439031808  MatrixMulCUDA  My Range 
527428096  527558656  72057700439031808  MatrixMulCUDA           

Select CUDA calls with backtraces

ALTER TABLE CUPTI_ACTIVITY_KIND_RUNTIME ADD COLUMN name TEXT;
UPDATE CUPTI_ACTIVITY_KIND_RUNTIME SET name = (SELECT value FROM StringIds WHERE CUPTI_ACTIVITY_KIND_RUNTIME.nameId = StringIds.id);

ALTER TABLE CUDA_CALLCHAINS ADD COLUMN symbolName TEXT;
UPDATE CUDA_CALLCHAINS SET symbolName = (SELECT value FROM StringIds WHERE symbol = StringIds.id);

SELECT globalTid % 0x1000000 AS TID,
    start, end, name, callchainId, stackDepth, symbolName
FROM CUDA_CALLCHAINS JOIN CUPTI_ACTIVITY_KIND_RUNTIME ON callchainId == CUDA_CALLCHAINS.id
ORDER BY callchainId, stackDepth LIMIT 11;

Results:

TID    start      end        name           callchainId  stackDepth  symbolName                                            
-----  ---------  ---------  -------------  -----------  ----------  ------------------------------------------------------
11928  168976467  169077826  cuMemAlloc_v2  1            0           0x7f13c44f02ab                                        
11928  168976467  169077826  cuMemAlloc_v2  1            1           0x7f13c44f0b8f                                        
11928  168976467  169077826  cuMemAlloc_v2  1            2           0x7f13c44f3719                                        
11928  168976467  169077826  cuMemAlloc_v2  1            3           cuMemAlloc_v2                                         
11928  168976467  169077826  cuMemAlloc_v2  1            4           cudart::driverHelper::mallocPtr(unsigned long, void**)
11928  168976467  169077826  cuMemAlloc_v2  1            5           cudart::cudaApiMalloc(void**, unsigned long)          
11928  168976467  169077826  cuMemAlloc_v2  1            6           cudaMalloc                                            
11928  168976467  169077826  cuMemAlloc_v2  1            7           cudaError cudaMalloc<float>(float**, unsigned long)   
11928  168976467  169077826  cuMemAlloc_v2  1            8           main                                                  
11928  168976467  169077826  cuMemAlloc_v2  1            9           __libc_start_main                                     
11928  168976467  169077826  cuMemAlloc_v2  1            10          _start                                                

SLI Peer-to-Peer query

The example demonstrates how to query SLI Peer-to-Peer events with resource size greater than value and within a time range sorted by resource size descending.

-- Show SLI Peer-to-Peer events with resource size greater than some value and
-- within a time range sorted by resource size descending
SELECT *
FROM SLI_P2P
WHERE resourceSize > 98304 AND start > 1568063100 AND end < 1579468901
ORDER BY resourceSize DESC;

Results:

start       end         eventClass  globalTid          gpu  frameId  transferSkipped  srcGpu  dstGpu  numSubResources  resourceSize  subResourceIdx  smplWidth  smplHeight  smplDepth  bytesPerElement  dxgiFormat  logSurfaceNames  transferInfo  isEarlyPushManagedByNvApi  useAsyncP2pForResolve  transferFuncName  regimeName  debugName  bindType
----------  ----------  ----------  -----------------  ---  -------  ---------------  ------  ------  ---------------  ------------  --------------  ---------  ----------  ---------  ---------------  ----------  ---------------  ------------  -------------------------  ---------------------  ----------------  ----------  ---------  --------
1570351100  1570351101  62          72057698056667136  0    771      0                256     512     1                1048576       0               256        256         1          16               2                            3             0                          0                                                                       
1570379300  1570379301  62          72057698056667136  0    771      0                256     512     1                1048576       0               64         64          64         4                31                           3             0                          0                                                                       
1572316400  1572316401  62          72057698056667136  0    773      0                256     512     1                1048576       0               256        256         1          16               2                            3             0                          0                                                                       
1572345400  1572345401  62          72057698056667136  0    773      0                256     512     1                1048576       0               64         64          64         4                31                           3             0                          0                                                                       
1574734300  1574734301  62          72057698056667136  0    775      0                256     512     1                1048576       0               256        256         1          16               2                            3             0                          0                                                                       
1574767200  1574767201  62          72057698056667136  0    775      0                256     512     1                1048576       0               64         64          64         4                31                           3             0                          0                                                                       

Generic events

Syscall usage histogram by PID:

SELECT json_extract(data, '$.common_pid') AS PID, count(*) AS total
FROM GENERIC_EVENTS WHERE PID IS NOT NULL AND typeId = (
  SELECT typeId FROM GENERIC_EVENT_TYPES
  WHERE json_extract(data, '$.Name') = "raw_syscalls:sys_enter")
GROUP BY PID
ORDER BY total DESC
LIMIT 10;

Results:

PID   total
----  -----
5551  32811
9680  3988 
4328  1477 
9564  1246 
4376  1204 
4377  1167 
4357  656  
4355  655  
4356  640  
4354  633  

Fetching generic events in JSON format

Text and JSON export modes don’t include generic events. Use the below queries (without LIMIT clause) to extract JSON lines representation of generic events, types and sources.

SELECT json_insert('{}',
    '$.sourceId', sourceId,
    '$.data', json(data)
)
FROM GENERIC_EVENT_SOURCES LIMIT 2;

SELECT json_insert('{}',
    '$.typeId', typeId,
    '$.sourceId', sourceId,
    '$.data', json(data)
)
FROM GENERIC_EVENT_TYPES LIMIT 2;

SELECT json_insert('{}',
    '$.rawTimestamp', rawTimestamp,
    '$.timestamp', timestamp,
    '$.typeId', typeId,
    '$.data', json(data)
)
FROM GENERIC_EVENTS LIMIT 2;

Results:

json_insert('{}',                                                                                              
---------------------------------------------------------------------------------------------------------------
{"sourceId":72057602627862528,"data":{"Name":"FTrace","TimeSource":"ClockMonotonicRaw","SourceGroup":"FTrace"}}
json_insert('{}',                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{"typeId":72057602627862547,"sourceId":72057602627862528,"data":{"Name":"raw_syscalls:sys_enter","Format":"\"NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)\", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]","Fields":[{"Name":"common_pid","Prefix":"int","Suffix":""},{"Name":"id","Prefix":"long","Suffix":""},{"Name":"args","Prefix":"unsigned long","Suffix":"[6]"},{"Name":"common_type","Prefix":"unsigned short","Suffix":""},{"Name":"common_flags","Prefix":"unsigned char","Suffix":""},{"Name":"common_preempt_count","Prefix":"unsigned char","Suffix":""}]}}
{"typeId":72057602627862670,"sourceId":72057602627862528,"data":{"Name":"irq:irq_handler_entry","Format":"\"irq=%d name=%s\", REC->irq, __get_str(name)","Fields":[{"Name":"common_pid","Prefix":"int","Suffix":""},{"Name":"irq","Prefix":"int","Suffix":""},{"Name":"name","Prefix":"__data_loc char[]","Suffix":""},{"Name":"common_type","Prefix":"unsigned short","Suffix":""},{"Name":"common_flags","Prefix":"unsigned char","Suffix":""},{"Name":"common_preempt_count","Prefix":"unsigned char","Suffix":""}]}}                                                                                         
json_insert('{}',                                                                                                                                                                                      
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{"rawTimestamp":1183694330725221,"timestamp":6236683,"typeId":72057602627862670,"data":{"common_pid":"0","irq":"66","name":"327696","common_type":"142","common_flags":"9","common_preempt_count":"0"}}
{"rawTimestamp":1183694333695687,"timestamp":9207149,"typeId":72057602627862670,"data":{"common_pid":"0","irq":"66","name":"327696","common_type":"142","common_flags":"9","common_preempt_count":"0"}}