==== Constant values and device memory ====
Changing the CPU calculation of the TBMEs to be GPU accelerated proved to be quite straightforward. I had done most of the hard thinking when GPU accelerating the OBME calculation. The structure of the loops had to be slightly altered and I had to get rid of the ''…''

==== Thread indices for all elements in H ====
One of the challenges of GPU acceleration is to appropriately map thread indices to the indices of the data that each thread needs for its calculation. Since the hamiltonian is a 2D matrix (stored as a 1D array), I've chosen a 2D layout for the threads. The following calculation makes sure that a sufficient number of blocks are spawned to cover the entire hamiltonian matrix.

<code c++>
const size_t block_dim = 8;  // Can't be larger than 32 (32*32 = 1024 threads per block).
const dim3 threads_per_block_twobody(block_dim, block_dim);
const dim3 blocks_per_grid_twobody(
    ceil(m_dim/static_cast<double>(block_dim)),  // Cast so the division does not truncate before ceil.
    ceil(m_dim/static_cast<double>(block_dim))
);
</code>

On the kernel side, each thread calculates its own row and column position in the hamiltonian matrix, as well as a flattened index to correctly index the contiguous row-major flattened version of the hamiltonian matrix:

<code c++>
const size_t col_idx = blockIdx.x*blockDim.x + threadIdx.x;
const size_t row_idx = blockIdx.y*blockDim.y + threadIdx.y;
const size_t idx = row_idx*m_dim + col_idx;
</code>

The row and col indices are used to fetch the correct basis states for the threads' matrix element calculations:

<code c++>
const uint64_t left_state = dev_basis_states[row_idx];
const uint64_t right_state = dev_basis_states[col_idx];

H[idx] = calculate_twobody_matrix_element(
    n_orbitals,
    n_indices,
    left_state,
    right_state
);
</code>

==== Thread indices for only the upper triangle of H ====
Recall that the hamiltonian matrix is self-adjoint (hermitian). At this point in the software development there are no complex numbers involved, so the matrix can be taken as a symmetric matrix. In either case, there is no need to explicitly calculate both triangles of the matrix. That's good news, because there are $m^2$ elements in the complete matrix and only $m(m + 1)/2$ elements in the upper triangle, and consequently

$$
\lim_{m \rightarrow \infty} \frac{m(m + 1)}{2m^2} = \frac{1}{2}
$$

Removing half of the calculations is good news for the performance! Storing the upper triangle as a 1D array required a re-thinking of how the threads are mapped to array indices. I went back to a 1D thread structure, which is quite straight-forward to calculate:

<code c++>
const size_t threads_per_block_twobody = 64;
const size_t blocks_per_grid_twobody = (n_elements_upper_triangle + threads_per_block_twobody - 1)/threads_per_block_twobody;
</code>

and trivial to map the thread indices to a flat 1D index:

<code c++>
const size_t idx = blockIdx.x*blockDim.x + threadIdx.x;
</code>

Now, there is a part of this approach which was not trivial at all, namely mapping the flat 1D upper triangle indices to the corresponding row and col indices in the 2D version of the hamiltonian matrix. I need the row and col indices for the threads to fetch the correct basis states. I made two solutions for this problem (likely not the most efficient ones), the first of which is a straight-forward iteration over all the indices of the upper triangle array.

<code c++>
// Loop state (initial values inferred; the original snippet starts mid-function):
size_t counter = 0;      // Triangle elements handled so far in the current row.
size_t lim = m_dim;      // Number of triangle elements in the current row.
size_t prev_offset = 0;  // Accumulated offset described in the comment below (0, 0+1, 0+1+2, ...).
size_t row_idx = 0;

for (size_t flat_tri_idx = 0; flat_tri_idx < n_elements_upper_triangle; flat_tri_idx++)
{ /*
    Convert an index from the flat upper triangle array into an
    index of the flat H array. If the H array is

    H = [[ 0  1  2  3]
         [ 4  5  6  7]
         [ 8  9 10 11]
         [12 13 14 15]]

    the flat triangle array will be

    tri = [0, 1, 2, 3, 5, 6, 7, 10, 11, 15]

    tri[5] with value 6 corresponds to element H[1][2].
    */
    if (counter >= lim)
    { /*
        Row 0 has lim number of upper triangular elements, row 1 has
        lim - 1 elements, row 2 lim - 2, ...

        The first triangular element of row 0 is offset by 0
        The first triangular element of row 1 is offset by 0 + 1
        The first triangular element of row 2 is offset by 0 + 1 + 2
        ...
        */
        lim--;
        counter = 0;
        prev_offset = row_idx + prev_offset;
    }

    row_idx = m_dim - lim;
    size_t flat_idx = flat_tri_idx + row_idx + prev_offset;
    counter++;

    H[flat_idx] = H_upper_triangle[flat_tri_idx];
}
</code>

This approach works fine if you need to do something with every single element of the upper triangle array, for example when a single CPU thread copies value-by-value as in the code above. However, a single GPU thread is only supposed to use a single index of the upper triangle array and its corresponding "2D" row and col indices. For that I wrote a device function:

<code c++>
__device__ inline void flat_tri_idx_to_2d_indices_ascending(const size_t n_rows, const size_t flat_tri_idx, size_t& row_idx, size_t& col_idx)
{ /*
    Find the row, col indices of the original matrix from the flat index
    of the upper triangle (with diagonal). Search for row in ascending
    order.
    */
    size_t cum_n_triangular_elements = 0;
    for (row_idx = 0; row_idx < n_rows; row_idx++)
    {
        const size_t n_triangular_elements_row = n_rows - row_idx;
        cum_n_triangular_elements += n_triangular_elements_row;

        if (flat_tri_idx < cum_n_triangular_elements) break;
    }
    col_idx = n_rows - (cum_n_triangular_elements - flat_tri_idx);
}
</code>

This solution is better when a thread needs only a single index; the earlier solution is better when a thread iterates over all the indices.

I will experiment with restructuring the basis states array, and maybe I can ditch these triangle index to 2D index mappings completely. Maybe it's better to pre-calculate the indices and just look them up when needed instead?

==== Performance and timing ====
The performance gains of calculating TBMEs on the GPU are substantial. For the following benchmark I'm using the ''…''

^ Processor ^ Time ^ Speed-up ^
| CPU | 185.954 | 1 |
| GPU | 1.964 | 94.630 |
| GPU (triangle) | | ≈148 |

As you can see, the speed-up is huge: the GPU code is 95 times faster than the CPU code. This is of course dependent on how well (or badly) I wrote the CPU code. However, I did several optimisations on the CPU code before using it as a basis for the GPU code, for example using primitive data types for the basis states and pre-generating a bunch of indices. Optimising the GPU code further so that it only calculates the upper triangle improved the performance even more, to 148 times faster than the CPU code.

~~DISCUSSION~~