Changing the CPU calculation of the TBMEs to be GPU accelerated proved to be quite straightforward, actually. I had done most of the hard thinking when GPU accelerating the OBME calculation. The structure of the loops had to be slightly altered and I had to get rid of the `Indices` data structure inside the kernels. I solved this by chucking a bunch of constant device arrays into the global scope of `hamiltonian_device.cpp`, global scope because I had a lot of trouble defining the constant device arrays in a separate file. The mentioned constant arrays contain correctly laid-out indices for the creation and annihilation operators in the shell model hamiltonian, as well as the corresponding coupled $J$ and $M$ values. These values are pre-calculated in the `generate_indices` function and greatly speed up the process of generating the TBMEs. All of these values are completely constant during the entire run time of the program and they are therefore aptly placed in the constant memory of the GPU (which apparently works a bit differently on AMD compared to Nvidia). The basis states are also constant during the entire run time of the program, but have to be placed in dynamic memory on the device because of the potentially huge size of the array. Well, they're not completely constant since they have to be calculated at the beginning of the program, but once they are calculated they do not change.
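Roughly, the constant arrays look something like this (a sketch only: the array names, element types and the `MAX_INDICES` bound here are illustrative placeholders, not the actual declarations in `hamiltonian_device.cpp`, and I'm assuming HIP):

```c++
#include <hip/hip_runtime.h>
#include <cstdint>
#include <vector>

constexpr size_t MAX_INDICES = 4096;    // Placeholder upper bound on the number of index combinations.

// File-scope constant device arrays, readable from any kernel in this file.
__constant__ uint16_t dev_creation_orb_indices[MAX_INDICES];
__constant__ uint16_t dev_annihilation_orb_indices[MAX_INDICES];
__constant__ int16_t  dev_j_coupled[MAX_INDICES];
__constant__ int16_t  dev_m_coupled[MAX_INDICES];

// Host side: upload the values produced by generate_indices, once, at start-up.
void upload_indices_to_constant_memory(
    const std::vector<uint16_t> &creation_orb_indices,
    const std::vector<uint16_t> &annihilation_orb_indices,
    const std::vector<int16_t> &j_coupled,
    const std::vector<int16_t> &m_coupled
)
{
    hipMemcpyToSymbol(HIP_SYMBOL(dev_creation_orb_indices), creation_orb_indices.data(), creation_orb_indices.size()*sizeof(uint16_t));
    hipMemcpyToSymbol(HIP_SYMBOL(dev_annihilation_orb_indices), annihilation_orb_indices.data(), annihilation_orb_indices.size()*sizeof(uint16_t));
    hipMemcpyToSymbol(HIP_SYMBOL(dev_j_coupled), j_coupled.data(), j_coupled.size()*sizeof(int16_t));
    hipMemcpyToSymbol(HIP_SYMBOL(dev_m_coupled), m_coupled.data(), m_coupled.size()*sizeof(int16_t));
}
```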
One of the challenges of GPU acceleration is to appropriately map thread indices to the indices of the data that each thread needs for its calculation. Since the hamiltonian is a 2D matrix (stored as a 1D array, however) I've chosen a 2D layout for the threads. The following calculation makes sure that a sufficient number of blocks is spawned to cover the entire hamiltonian matrix:
```c++
const size_t block_dim = 8;    // Can't be larger than 32 (1024 threads per block).
const dim3 threads_per_block_twobody(block_dim, block_dim);
const dim3 blocks_per_grid_twobody(
    ceil(m_dim/((double)block_dim)),
    ceil(m_dim/((double)block_dim))
);
```
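To put numbers on it (using the $M_\text{dim} = 6116$ benchmark dimension from further down, just as a back-of-the-envelope check, not code from the program):

```c++
// ceil(6116/8.0) = 765 blocks per grid dimension, so the grid is
// 765 x 765 blocks of 8 x 8 threads = 37 454 400 threads in total,
// slightly more than the 6116^2 = 37 405 456 matrix elements they cover.
// The surplus threads fall outside the matrix and must not write to H.
const size_t m_dim = 6116;
const size_t block_dim = 8;
const size_t blocks_per_dim = (m_dim + block_dim - 1)/block_dim;                      // 765
const size_t n_threads_total = blocks_per_dim*blocks_per_dim*block_dim*block_dim;     // 37 454 400
```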
On the kernel side, each thread calculates its own row and column position in the hamiltonian matrix, as well as a flattened index to correctly index the contiguous, row-major flattened version of the hamiltonian matrix:
```c++
const size_t col_idx = blockIdx.x*blockDim.x + threadIdx.x;
const size_t row_idx = blockIdx.y*blockDim.y + threadIdx.y;
const size_t idx = row_idx*m_dim + col_idx;
```
The row and col indices are used to fetch the correct basis states for the thread's position in the matrix:
```c++
const uint64_t left_state = dev_basis_states[row_idx];
const uint64_t right_state = dev_basis_states[col_idx];

H[idx] = calculate_twobody_matrix_element(
    n_orbitals,
    n_indices,
    left_state,
    right_state
);
```
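Pieced together, the kernel looks roughly like this. This is a simplified sketch: the kernel name, the parameter list and the element type of `H` are illustrative, and the signature of `calculate_twobody_matrix_element` is guessed from the call above. The bounds check handles the surplus threads from the rounded-up grid.

```c++
#include <hip/hip_runtime.h>
#include <cstdint>

// Assumed signature, based on the call site above; defined elsewhere.
__device__ double calculate_twobody_matrix_element(
    const size_t n_orbitals, const size_t n_indices,
    const uint64_t left_state, const uint64_t right_state
);

__global__ void twobody_matrix_kernel(
    double *H,                          // Flattened m_dim x m_dim hamiltonian (row-major).
    const uint64_t *dev_basis_states,   // Basis states in device global memory.
    const size_t m_dim,
    const size_t n_orbitals,
    const size_t n_indices
)
{
    const size_t col_idx = blockIdx.x*blockDim.x + threadIdx.x;
    const size_t row_idx = blockIdx.y*blockDim.y + threadIdx.y;

    // The grid is rounded up with ceil, so some threads land outside the matrix.
    if ((row_idx >= m_dim) || (col_idx >= m_dim)) return;

    const size_t idx = row_idx*m_dim + col_idx;
    const uint64_t left_state = dev_basis_states[row_idx];
    const uint64_t right_state = dev_basis_states[col_idx];

    H[idx] = calculate_twobody_matrix_element(n_orbitals, n_indices, left_state, right_state);
}
```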
Recall that the hamiltonian matrix is self-adjoint (Hermitian). At this point in the software development there are no complex numbers involved, so the matrix can be taken as a symmetric matrix. In either case, however, there is no need to explicitly calculate both triangles of the matrix. That's good news because there are $m^2$ elements in the complete matrix and only $m(m + 1)/2$ elements in the upper triangle, and consequently
$$ \lim_{m \rightarrow \infty} \frac{m(m + 1)}{2m^2} = \frac{1}{2} $$
Removing half of the calculations is good news for the performance! Storing the upper triangle as a 1D array required a re-think of how the threads are mapped to array indices. I went back to a 1D thread structure, which is quite straightforward to calculate:
```c++
const size_t threads_per_block_twobody = 64;
const size_t blocks_per_grid_twobody =
    (n_elements_upper_triangle + threads_per_block_twobody - 1)/threads_per_block_twobody;
```
and trivial to map the thread indices to a flat 1D index:
```c++
const size_t idx = blockIdx.x*blockDim.x + threadIdx.x;
```
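For completeness, `n_elements_upper_triangle` in the launch configuration above is just the number of elements in the upper triangle, diagonal included, i.e. the $m(m + 1)/2$ count from the formula above, something like:

```c++
// Number of elements in the upper triangle, diagonal included: m(m + 1)/2.
const size_t n_elements_upper_triangle = m_dim*(m_dim + 1)/2;
```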
Now, there is a part of this approach which was not trivial at all, namely mapping the flat 1D upper triangle indices to the corresponding row and col indices in the 2D version of the hamiltonian matrix. I need the row and col indices for the threads to fetch the correct basis states. I wrote two solutions to this problem (likely not in the most efficient way), the first of which is a straightforward iteration over all the indices of the upper triangle array:
```c++
size_t lim = m_dim;         // Row 0 has `lim` upper triangular elements (initial values shown here for completeness).
size_t counter = 0;
size_t prev_offset = 0;
size_t row_idx = 0;

for (size_t flat_tri_idx = 0; flat_tri_idx < n_elements_upper_triangle; flat_tri_idx++)
{
    /*
    Convert an index from the flat upper triangle array into an index
    of the flat H array. If the H array is

        H = [[ 0  1  2  3]
             [ 4  5  6  7]
             [ 8  9 10 11]
             [12 13 14 15]]

    the flat triangle array will be

        tri = [0, 1, 2, 3, 5, 6, 7, 10, 11, 15]

    tri[5] with value 6 corresponds to element H[1][2].
    */
    if (counter >= lim)
    {
        /*
        Row 0 has lim number of upper triangular elements, row 1 has
        lim - 1 elements, row 2 lim - 2, ...

        The first triangular element of row 0 is offset by 0
        The first triangular element of row 1 is offset by 0 + 1
        The first triangular element of row 2 is offset by 0 + 1 + 2
        ...
        */
        lim--;
        counter = 0;
        prev_offset = row_idx + prev_offset;
    }
    row_idx = m_dim - lim;
    size_t flat_idx = flat_tri_idx + row_idx + prev_offset;
    counter++;

    H[flat_idx] = H_upper_triangle[flat_tri_idx];
}
```
This approach works fine if you need to do something with every single element of the upper triangle array, like when a single CPU thread copies value by value as in the code above. However, a single GPU thread is only supposed to use a single index in the upper triangle array and its corresponding mapping to the “real” 2D hamiltonian, and it is wasteful for each GPU thread to iterate from 0 every time an index map is needed. I made another solution which “only” needs to iterate over the number of rows (or cols, it's the same number), which is considerably fewer iterations than iterating over all of the upper triangle elements:
```c++
__device__ inline void flat_tri_idx_to_2d_indices_ascending(
    const size_t n_rows,
    const size_t flat_tri_idx,
    size_t &row_idx,
    size_t &col_idx
)
{
    /*
    Find the row, col indices of the original matrix from the flat index
    of the upper triangle (with diagonal). Search for row in ascending order.
    */
    size_t cum_n_triangular_elements = 0;
    for (row_idx = 0; row_idx < n_rows; row_idx++)
    {
        const size_t n_triangular_elements_row = n_rows - row_idx;
        cum_n_triangular_elements += n_triangular_elements_row;    // Cumulative number of triangular elements at row `row_idx`.

        if (flat_tri_idx < cum_n_triangular_elements) break;
    }
    col_idx = n_rows - (cum_n_triangular_elements - flat_tri_idx);
}
```
This solution is better when a thread wants to do something with a single index; the aforementioned solution is better when a thread wants to do something with all the indices.
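Tied together with the 1D launch configuration from before, the upper-triangle kernel then looks roughly like this (again a simplified sketch, with an illustrative kernel name and parameter list):

```c++
__global__ void twobody_upper_triangle_kernel(
    double *H_upper_triangle,           // Flat upper triangle, m_dim*(m_dim + 1)/2 elements.
    const uint64_t *dev_basis_states,
    const size_t m_dim,
    const size_t n_elements_upper_triangle,
    const size_t n_orbitals,
    const size_t n_indices
)
{
    const size_t flat_tri_idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (flat_tri_idx >= n_elements_upper_triangle) return;    // Surplus threads from the rounded-up grid.

    // Recover the 2D position of this upper triangle element.
    size_t row_idx, col_idx;
    flat_tri_idx_to_2d_indices_ascending(m_dim, flat_tri_idx, row_idx, col_idx);

    const uint64_t left_state = dev_basis_states[row_idx];
    const uint64_t right_state = dev_basis_states[col_idx];

    H_upper_triangle[flat_tri_idx] = calculate_twobody_matrix_element(
        n_orbitals, n_indices, left_state, right_state
    );
}
```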
I will experiment with restructuring the basis states array, and maybe I can ditch these triangle index to 2D index mappings completely. Maybe it's better to pre-calculate the indices and just look them up when needed instead?
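As a sketch of that lookup idea (nothing here exists in the current code, and the names are just illustrative): build two arrays of length $m(m + 1)/2$ on the host, one with the row index and one with the col index for every flat triangle index, and copy them to the device next to the basis states.

```c++
#include <cstdint>
#include <vector>

// Pre-calculate the row/col mapping once on the host; each GPU thread would
// then just read its own entry instead of running the search loop.
void generate_triangle_index_lookup(
    const size_t m_dim,
    std::vector<uint32_t> &row_lookup,
    std::vector<uint32_t> &col_lookup
)
{
    const size_t n_elements_upper_triangle = m_dim*(m_dim + 1)/2;
    row_lookup.resize(n_elements_upper_triangle);
    col_lookup.resize(n_elements_upper_triangle);

    size_t flat_tri_idx = 0;
    for (size_t row_idx = 0; row_idx < m_dim; row_idx++)
    {
        for (size_t col_idx = row_idx; col_idx < m_dim; col_idx++)
        {
            row_lookup[flat_tri_idx] = row_idx;
            col_lookup[flat_tri_idx] = col_idx;
            flat_tri_idx++;
        }
    }
}
```

The trade-off is memory: for $M_\text{dim} = 6116$ each table holds about 18.7 million entries, roughly 150 MB of device memory in total at 4 bytes per entry, so this trades memory for arithmetic and would need benchmarking.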
The performance gains of calculating TBMEs on the GPU are substantial. For the following benchmark I'm using the `w.snt` interaction (`sd` model space) with 3 valence protons and 3 valence neutrons, resulting in $M_\text{dim} = 6116$. The CPU calculation is parallelised over 32 (virtual, SMT) cores on an AMD 7950X; all cores are completely saturated during the entirety of the calculation. The GPU calculation uses 8×8 threads per block, which was slightly faster than 16×16 and 32×32, and much faster than 4×4.
| Processor      | Time [s] | Speed-up |
|----------------|----------|----------|
| CPU            | 185.954  | 1        |
| GPU            | 1.964    | 94.630   |
| GPU (triangle) | 1.256    | 147.972  |
As you can see, the speed-up is huge: the GPU code is 95 times faster than the CPU code. This is of course dependent on how good (or bad) I was at writing the code for the CPU. However, I did several optimisations on the CPU code before using it as a basis for the GPU code, for example using primitive data types for the basis states and pre-generating a bunch of indices. Optimising the GPU code further so that it only calculates the upper triangle improved the performance even more, to 148 times faster than the CPU code.