gemm_batch

Computes groups of matrix-matrix products with general matrices.

Syntax

Group API

event gemm_batch(queue &exec_queue, transpose *transa, transpose *transb, std::int64_t *m, std::int64_t *n, std::int64_t *k, T *alpha, T **a, std::int64_t *lda, T **b, std::int64_t *ldb, T *beta, T **c, std::int64_t *ldc, std::int64_t group_count, std::int64_t *group_size, const vector_class<event> &dependencies = {})

Strided API

event gemm_batch(queue &exec_queue, transpose transa, transpose transb, std::int64_t m, std::int64_t n, std::int64_t k, T alpha, T *a, std::int64_t lda, std::int64_t stridea, T *b, std::int64_t ldb, std::int64_t strideb, T beta, T *c, std::int64_t ldc, std::int64_t stridec, std::int64_t batch_size, const vector_class<event> &dependencies = {})
void gemm_batch(queue &exec_queue, transpose transa, transpose transb, std::int64_t m, std::int64_t n, std::int64_t k, T alpha, buffer<T, 1> &a, std::int64_t lda, std::int64_t stridea, buffer<T, 1> &b, std::int64_t ldb, std::int64_t strideb, T beta, buffer<T, 1> &c, std::int64_t ldc, std::int64_t stridec, std::int64_t batch_size)

gemm_batch supports the following precisions and devices.

T                       Devices Supported
float                   Host, CPU, and GPU
half                    Host, CPU, and GPU
double                  Host, CPU, and GPU
std::complex<float>     Host, CPU, and GPU
std::complex<double>    Host, CPU, and GPU

Description

The gemm_batch routines perform a series of matrix-matrix operations with general matrices. They are similar to their gemm counterparts, but operate on groups of matrices. All matrices within a group share the same parameters.

The operation for the strided API is defined as

for i = 0 … batch_size – 1
    A, B and C are matrices at offset i * stridea, i * strideb, i * stridec in a, b and c.
    C = alpha * op(A) * op(B) + beta * C
end for
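
The strided loop above can be sketched in plain C++ as a reference for the indexing. This is only an illustration, not the library's implementation: the name ref_gemm_batch_strided is hypothetical, and the sketch assumes column major storage with op(A) = A and op(B) = B.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (not part of the library).
// Assumes column major storage and no transposition of A or B.
template <typename T>
void ref_gemm_batch_strided(std::int64_t m, std::int64_t n, std::int64_t k,
                            T alpha, const T *a, std::int64_t lda, std::int64_t stridea,
                            const T *b, std::int64_t ldb, std::int64_t strideb,
                            T beta, T *c, std::int64_t ldc, std::int64_t stridec,
                            std::int64_t batch_size) {
    // Leading dimensions must be positive and cover one stored column.
    assert(lda > 0 && ldb > 0 && ldc > 0);
    assert(lda >= m && ldb >= k && ldc >= m);
    for (std::int64_t i = 0; i < batch_size; ++i) {
        // Matrix i of each operand starts at offset i * stride.
        const T *A = a + i * stridea;
        const T *B = b + i * strideb;
        T *C = c + i * stridec;
        for (std::int64_t col = 0; col < n; ++col) {
            for (std::int64_t row = 0; row < m; ++row) {
                T acc = T(0);
                for (std::int64_t p = 0; p < k; ++p) {
                    // Column major: element (r, c) is stored at r + c * ld.
                    acc += A[row + p * lda] * B[p + col * ldb];
                }
                T &out = C[row + col * ldc];
                // When beta is zero the old value of C is not read.
                out = (beta == T(0)) ? alpha * acc : alpha * acc + beta * out;
            }
        }
    }
}
```

For example, with two 2 x 2 matrices per operand stored back to back, lda = ldb = ldc = 2 and stridea = strideb = stridec = 4.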

The operation for the group API is defined as

idx = 0
for i = 0 … group_count – 1
    transa, transb, m, n, k, alpha, beta, lda, ldb, ldc and group_size are taken at position i in their respective arrays.
    for j = 0 … group_size – 1
        A, B and C are the matrices at position idx in their respective arrays
        C = alpha * op(A) * op(B) + beta * C
        idx := idx + 1
    end for
end for

where:

  • op(X) is one of op(X) = X, or op(X) = X^T, or op(X) = X^H

  • alpha and beta are scalars

  • A, B, and C are matrices

  • In the strided API, the a, b and c buffers contain all the input matrices. Consecutive matrices are separated by the corresponding stride parameter, which may equal the exact size of one matrix or be larger. The total number of matrices in each of the a, b and c buffers is given by the batch_size parameter.

Here, op(A) is m x k, op(B) is k x n, and C is m x n.
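
The group loop can likewise be sketched in plain C++ to show how idx walks the pointer arrays while the per-group parameters are read at position i. The name ref_gemm_batch_group is hypothetical, and the sketch assumes column major storage with op(A) = A and op(B) = B for every group; it is an illustration rather than the library's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (not part of the library).
// Assumes column major storage and no transposition in any group.
template <typename T>
void ref_gemm_batch_group(const std::int64_t *m, const std::int64_t *n, const std::int64_t *k,
                          const T *alpha, const T *const *a, const std::int64_t *lda,
                          const T *const *b, const std::int64_t *ldb, const T *beta,
                          T *const *c, const std::int64_t *ldc,
                          std::int64_t group_count, const std::int64_t *group_size) {
    std::int64_t idx = 0;  // running position in the a, b and c pointer arrays
    for (std::int64_t i = 0; i < group_count; ++i) {
        // Parameters at position i apply to every matrix of group i.
        assert(lda[i] >= m[i] && ldb[i] >= k[i] && ldc[i] >= m[i]);
        for (std::int64_t j = 0; j < group_size[i]; ++j, ++idx) {
            const T *A = a[idx];
            const T *B = b[idx];
            T *C = c[idx];
            for (std::int64_t col = 0; col < n[i]; ++col) {
                for (std::int64_t row = 0; row < m[i]; ++row) {
                    T acc = T(0);
                    for (std::int64_t p = 0; p < k[i]; ++p) {
                        acc += A[row + p * lda[i]] * B[p + col * ldb[i]];
                    }
                    T &out = C[row + col * ldc[i]];
                    // When beta is zero the old value of C is not read.
                    out = (beta[i] == T(0)) ? alpha[i] * acc
                                            : alpha[i] * acc + beta[i] * out;
                }
            }
        }
    }
}
```

With group_size = {2, 1}, positions 0 and 1 of the pointer arrays belong to group 0 and position 2 belongs to group 1.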

Input Parameters

Strided API

transa

Specifies op(A), the transposition operation applied to the matrices A. See Data Types for more details.

transb

Specifies op(B), the transposition operation applied to the matrices B. See Data Types for more details.

m

Number of rows of op(A) and C. Must be at least zero.

n

Number of columns of op(B) and C. Must be at least zero.

k

Number of columns of op(A) and rows of op(B). Must be at least zero.

alpha

Scaling factor for the matrix-matrix products.

a

Buffer holding the input matrices A. Must have size at least stridea*batch_size.

lda

Leading dimension of the A matrices. If matrices are stored using column major layout, lda must be at least m if A is not transposed, and at least k if A is transposed. If matrices are stored using row major layout, lda must be at least k if A is not transposed, and at least m if A is transposed. It must be positive.

stridea

Stride between the different A matrices. If matrices are stored using column (respectively, row) major layout, stridea must be at least lda*k (respectively, lda*m) if A is not transposed and at least lda*m (respectively, lda*k) if A is transposed.

b

Buffer holding the input matrices B. Must have size at least strideb*batch_size.

ldb

Leading dimension of the B matrices. If matrices are stored using column major layout, ldb must be at least k if B is not transposed, and at least n if B is transposed. If matrices are stored using row major layout, ldb must be at least n if B is not transposed, and at least k if B is transposed. It must be positive.

strideb

Stride between the different B matrices. If matrices are stored using column (respectively, row) major layout, strideb must be at least ldb*n (respectively, ldb*k) if B is not transposed and at least ldb*k (respectively, ldb*n) if B is transposed.

beta

Scaling factor for the matrices C.

c

Buffer holding input/output matrices C. Must have size at least stridec*batch_size.

ldc

Leading dimension of C. If matrices are stored using column major layout, ldc must be at least m. If matrices are stored using row major layout, ldc must be at least n. It must be positive.

stridec

Stride between the different C matrices. If matrices are stored using column (respectively, row) major layout, stridec must be at least ldc*n (respectively, ldc*m).

batch_size

Specifies the number of matrix multiply operations to perform.
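
The leading dimension and stride bounds above all follow one rule: the leading dimension must cover the contiguous dimension of the matrix as stored, and the stride must cover the leading dimension times the other stored dimension. A minimal sketch of that rule follows; the layout enum, the extent struct, and the helper names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types and helpers, not part of the library.
enum class layout { col_major, row_major };

struct extent {
    std::int64_t rows;
    std::int64_t cols;
};

// Shape of X as stored in memory, given that op(X) is rows x cols.
inline extent stored_shape(std::int64_t rows, std::int64_t cols, bool transposed) {
    return transposed ? extent{cols, rows} : extent{rows, cols};
}

// Smallest valid leading dimension: the contiguous stored dimension,
// but never less than 1 because the leading dimension must be positive.
inline std::int64_t min_ld(extent s, layout l) {
    const std::int64_t ld = (l == layout::col_major) ? s.rows : s.cols;
    return ld > 0 ? ld : 1;
}

// Smallest valid stride for a given leading dimension.
inline std::int64_t min_stride(extent s, layout l, std::int64_t ld) {
    assert(ld >= min_ld(s, l));
    return ld * ((l == layout::col_major) ? s.cols : s.rows);
}
```

For A, pass the shape of op(A), which is m x k; for B, pass k x n; for C, pass m x n with transposed set to false.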

Group API

transa

Array of size group_count. Each element i in the array specifies op(A), the transposition operation applied to the A matrices in group i. See Data Types for more details.

transb

Array of size group_count. Each element i in the array specifies op(B), the transposition operation applied to the B matrices in group i. See Data Types for more details.

m

Array of size group_count of number of rows of op(A) and C. Each must be at least zero.

n

Array of size group_count of number of columns of op(B) and C. Each must be at least zero.

k

Array of size group_count of number of columns of op(A) and rows of op(B). Each must be at least zero.

alpha

Array of size group_count containing scaling factors for the matrix-matrix products.

a

Array of size total_batch_count of pointers to the A matrices, where total_batch_count is the sum of all group_size entries. If matrices are stored using column (respectively, row) major layout, the array allocated for each A matrix of group i must be of size at least lda_i*k_i (respectively, lda_i*m_i) if A is not transposed and lda_i*m_i (respectively, lda_i*k_i) if A is transposed.

lda

Array of size group_count of leading dimensions of the A matrices. If matrices are stored using column major layout, lda_i must be at least m_i if A is not transposed, and at least k_i if A is transposed. If matrices are stored using row major layout, lda_i must be at least k_i if A is not transposed, and at least m_i if A is transposed. Each must be positive.

b

Array of size total_batch_count of pointers to the B matrices. If matrices are stored using column (respectively, row) major layout, the array allocated for each B matrix of group i must be of size at least ldb_i*n_i (respectively, ldb_i*k_i) if B is not transposed and ldb_i*k_i (respectively, ldb_i*n_i) if B is transposed.

ldb

Array of size group_count of leading dimensions of the B matrices. If matrices are stored using column major layout, ldb_i must be at least k_i if B is not transposed, and at least n_i if B is transposed. If matrices are stored using row major layout, ldb_i must be at least n_i if B is not transposed, and at least k_i if B is transposed. Each must be positive.

beta

Array of size group_count containing scaling factors for the C matrices.

c

Array of size total_batch_count of pointers to the C matrices. If matrices are stored using column (respectively, row) major layout, the array allocated for each C matrix of group i must be of size at least ldc_i*n_i (respectively, ldc_i*m_i).

ldc

Array of size group_count of leading dimensions of the C matrices. If matrices are stored using column major layout, ldc_i must be at least m_i. If matrices are stored using row major layout, ldc_i must be at least n_i. Each must be positive.

group_count

Number of groups. Must be at least 0.

group_size

Array of size group_count. The element group_size[i] is the number of matrices in the group i. Each element in group_size must be at least 0.
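
Because the a, b and c pointer arrays are flat, it can help to compute their required length and the position where each group starts. The sketch below assumes total_batch_count is the sum of the group_size entries; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: required length of the a, b and c pointer arrays.
inline std::int64_t total_batch_count(const std::vector<std::int64_t> &group_size) {
    return std::accumulate(group_size.begin(), group_size.end(), std::int64_t{0});
}

// Hypothetical helper: index of the first matrix of each group
// in the a, b and c pointer arrays.
inline std::vector<std::int64_t> group_offsets(const std::vector<std::int64_t> &group_size) {
    std::vector<std::int64_t> offsets(group_size.size());
    std::int64_t running = 0;
    for (std::size_t g = 0; g < group_size.size(); ++g) {
        assert(group_size[g] >= 0);  // each group size must be at least 0
        offsets[g] = running;
        running += group_size[g];
    }
    return offsets;
}
```

An empty group is allowed and simply shares its offset with the next group.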

dependencies

List of events to wait for before starting computation, if any. If omitted, defaults to no dependencies.

Output Parameters

Strided API

c

Output buffer, overwritten by batch_size matrix multiply operations of the form alpha*op(A)*op(B) + beta*C.

Group API

c

Output array of pointers to C matrices, overwritten by total_batch_count matrix multiply operations of the form alpha*op(A)*op(B) + beta*C.

Notes

If beta = 0, matrix C does not need to be initialized before calling gemm_batch.
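
One way to read this note: when beta is zero, the old contents of C are ignored rather than multiplied by zero, so a NaN or Inf left in uninitialized memory cannot reach the result. The sketch below illustrates that convention for a single element; update_c and the self-check are hypothetical and do not describe how the library is implemented.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: update of one element of C, where acc is the
// corresponding element of op(A) * op(B).
template <typename T>
T update_c(T alpha, T acc, T beta, T c_old) {
    // Skip the old value entirely when beta is zero.
    return (beta == T(0)) ? alpha * acc : alpha * acc + beta * c_old;
}

// Small self-check: a NaN in the old value is ignored only when beta is zero.
inline bool nan_is_ignored_when_beta_is_zero() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    return update_c(2.0, 3.0, 0.0, nan) == 6.0 &&
           std::isnan(update_c(2.0, 3.0, 1.0, nan));
}
```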