Module amx

Available on x86-64 only.

Functions

ldtilecfg 🔒 ⚠
sttilecfg 🔒 ⚠
tcmmimfp16ps 🔒 ⚠
tcmmimfp16ps_internal 🔒 ⚠
tcmmrlfp16ps 🔒 ⚠
tcmmrlfp16ps_internal 🔒 ⚠
tcvtrowd2ps 🔒 ⚠
tcvtrowd2ps_internal 🔒 ⚠
tcvtrowd2psi 🔒 ⚠
tcvtrowps2bf16h 🔒 ⚠
tcvtrowps2bf16h_internal 🔒 ⚠
tcvtrowps2bf16hi 🔒 ⚠
tcvtrowps2bf16l 🔒 ⚠
tcvtrowps2bf16l_internal 🔒 ⚠
tcvtrowps2bf16li 🔒 ⚠
tcvtrowps2phh 🔒 ⚠
tcvtrowps2phh_internal 🔒 ⚠
tcvtrowps2phhi 🔒 ⚠
tcvtrowps2phl 🔒 ⚠
tcvtrowps2phl_internal 🔒 ⚠
tcvtrowps2phli 🔒 ⚠
tdpbf8ps 🔒 ⚠
tdpbf8ps_internal 🔒 ⚠
tdpbf16ps 🔒 ⚠
tdpbf16ps_internal 🔒 ⚠
tdpbhf8ps 🔒 ⚠
tdpbhf8ps_internal 🔒 ⚠
tdpbssd 🔒 ⚠
tdpbssd_internal 🔒 ⚠
tdpbsud 🔒 ⚠
tdpbsud_internal 🔒 ⚠
tdpbusd 🔒 ⚠
tdpbusd_internal 🔒 ⚠
tdpbuud 🔒 ⚠
tdpbuud_internal 🔒 ⚠
tdpfp16ps 🔒 ⚠
tdpfp16ps_internal 🔒 ⚠
tdphbf8ps 🔒 ⚠
tdphbf8ps_internal 🔒 ⚠
tdphf8ps 🔒 ⚠
tdphf8ps_internal 🔒 ⚠
tileloadd64 🔒 ⚠
tileloadd64_internal 🔒 ⚠
tileloaddrs64 🔒 ⚠
tileloaddrs64_internal 🔒 ⚠
tileloaddrst164 🔒 ⚠
tileloaddrst164_internal 🔒 ⚠
tileloaddt164 🔒 ⚠
tileloaddt164_internal 🔒 ⚠
tilemovrow 🔒 ⚠
tilemovrow_internal 🔒 ⚠
tilemovrowi 🔒 ⚠
tilerelease 🔒 ⚠
tilestored64 🔒 ⚠
tilestored64_internal 🔒 ⚠
tilezero 🔒 ⚠
tilezero_internal 🔒 ⚠
tmmultf32ps 🔒 ⚠
tmmultf32ps_internal 🔒 ⚠
__tile_cmmimfp16ps ⚠ Experimental amx-complex
Perform matrix multiplication of two tiles containing complex elements and accumulate the results into a packed single-precision tile. Each dword element in input tiles a and b is interpreted as a complex number with an FP16 real part and an FP16 imaginary part. Calculates the imaginary part of the result. For each possible combination of (row of a, column of b), it performs a set of multiplications and accumulations on all corresponding complex numbers (one from a and one from b). The imaginary part of the a element is multiplied with the real part of the corresponding b element, and the real part of the a element is multiplied with the imaginary part of the corresponding b element. The two accumulated results are added, and then accumulated into the corresponding row and column of dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_cmmrlfp16ps ⚠ Experimental amx-complex
Perform matrix multiplication of two tiles containing complex elements and accumulate the results into a packed single-precision tile. Each dword element in input tiles a and b is interpreted as a complex number with an FP16 real part and an FP16 imaginary part. Calculates the real part of the result. For each possible combination of (row of a, column of b), it performs a set of multiplications and accumulations on all corresponding complex numbers (one from a and one from b). The real part of the a element is multiplied with the real part of the corresponding b element, and the negated imaginary part of the a element is multiplied with the imaginary part of the corresponding b element. The two accumulated results are added, and then accumulated into the corresponding row and column of dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
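The arithmetic of the two complex intrinsics described above can be sketched as a scalar reference model. This is not the intrinsic itself: the names `Complex`, `cmm_real_model` and `cmm_imag_model` are hypothetical, elements are held as plain `f32` instead of FP16, and hardware rounding behaviour is not modelled.

```rust
/// One tile element: a complex number (re, im). The real instructions store
/// both halves as FP16; plain f32 is an assumption made to keep this short.
type Complex = (f32, f32);

/// Accumulates the real part of a * b into `dst`:
/// dst[m][n] += a.re * b.re - a.im * b.im, summed over k.
fn cmm_real_model(dst: &mut [Vec<f32>], a: &[Vec<Complex>], b: &[Vec<Complex>]) {
    for m in 0..dst.len() {
        for n in 0..dst[m].len() {
            for k in 0..b.len() {
                let (a_re, a_im) = a[m][k];
                let (b_re, b_im) = b[k][n];
                dst[m][n] += a_re * b_re - a_im * b_im;
            }
        }
    }
}

/// Accumulates the imaginary part of a * b into `dst`:
/// dst[m][n] += a.im * b.re + a.re * b.im, summed over k.
fn cmm_imag_model(dst: &mut [Vec<f32>], a: &[Vec<Complex>], b: &[Vec<Complex>]) {
    for m in 0..dst.len() {
        for n in 0..dst[m].len() {
            for k in 0..b.len() {
                let (a_re, a_im) = a[m][k];
                let (b_re, b_im) = b[k][n];
                dst[m][n] += a_im * b_re + a_re * b_im;
            }
        }
    }
}
```

For a single element pair (1 + 2i) and (3 + 4i), the real model accumulates -5 and the imaginary model accumulates 10, matching the ordinary complex product.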
__tile_cvtrowd2ps ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed 32-bit signed integer elements to packed single-precision (32-bit) floating-point elements. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_cvtrowps2bf16h ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16-bits within each 32-bit element of the returned vector. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_cvtrowps2bf16l ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16-bits within each 32-bit element of the returned vector. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
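The "high" and "low" placement used by the BF16 row conversions can be illustrated with a scalar sketch. The function names are hypothetical, and two points are assumptions not stated in the text above: the conversion rounds to nearest-even, and the unused 16 bits of each 32-bit lane are zero.

```rust
/// Converts an f32 to BF16 bits, rounding to nearest-even (assumed mode).
/// NaN inputs are forced to a quiet NaN so the payload cannot round to infinity.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF plus the lowest kept bit so that ties go to the even value.
    let round_bias = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(round_bias) >> 16) as u16
}

/// Models one converted row: each f32 becomes a 32-bit lane holding the BF16
/// value in either the high or the low half (other half assumed to be zero).
fn cvtrow_bf16_model(row: &[f32], high: bool) -> Vec<u32> {
    row.iter()
        .map(|&x| {
            let half = f32_to_bf16_bits(x) as u32;
            if high { half << 16 } else { half }
        })
        .collect()
}
```

With `high` set, a lane holding 1.0 becomes `0x3F80_0000`; with it cleared, the same element becomes `0x0000_3F80`.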
__tile_cvtrowps2phh ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16-bits within each 32-bit element of the returned vector. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_cvtrowps2phl ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16-bits within each 32-bit element of the returned vector. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_dpbf8ps ⚠ Experimental amx-fp8
Compute dot-product of BF8 (8-bit E5M2) floating-point elements in tile a and BF8 (8-bit E5M2) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
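To make the BF8 (E5M2) element format concrete, the following sketch decodes one byte into an `f32`. It assumes the conventional E5M2 interpretation (1 sign bit, 5 exponent bits with bias 15, 2 mantissa bits, IEEE-style subnormals, infinities and NaNs); the function name is hypothetical and how the hardware treats special values is not covered here.

```rust
/// Decodes one BF8 (E5M2) byte into an f32 under the conventional layout:
/// bit 7 = sign, bits 6..2 = exponent (bias 15), bits 1..0 = mantissa.
fn bf8_e5m2_to_f32(byte: u8) -> f32 {
    let sign = ((byte >> 7) as u32) << 31;
    let exp = ((byte >> 2) & 0x1F) as u32;
    let man = (byte & 0x03) as u32;
    match exp {
        // Zero or subnormal: value = man * 2^-16 (that is, man/4 * 2^-14).
        0 => {
            let magnitude = man as f32 * f32::from_bits((127 - 16) << 23);
            f32::from_bits(magnitude.to_bits() | sign)
        }
        // Infinity (man == 0) or NaN (man != 0).
        31 => f32::from_bits(sign | 0x7F80_0000 | (man << 21)),
        // Normal: rebias the exponent from 15 to 127 and widen the mantissa.
        _ => f32::from_bits(sign | ((exp + 127 - 15) << 23) | (man << 21)),
    }
}
```

Because every E5M2 value is exactly representable in single precision, the decode step is lossless; only the later accumulation rounds.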
__tile_dpbf16ps ⚠ Experimental amx-bf16
Compute dot-product of BF16 (16-bit) floating-point pairs in tiles a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_dpbhf8ps ⚠ Experimental amx-fp8
Compute dot-product of BF8 (8-bit E5M2) floating-point elements in tile a and HF8 (8-bit E4M3) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_dpbssd ⚠ Experimental amx-int8
Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_dpbsud ⚠ Experimental amx-int8
Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in a with corresponding unsigned 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_dpbusd ⚠ Experimental amx-int8
Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_dpbuud ⚠ Experimental amx-int8
Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding unsigned 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
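The four byte dot-product variants above differ only in how the bytes of a and b are sign-interpreted. A scalar reference model, with hypothetical names and tiles represented as nested vectors of raw bytes, might look like the sketch below. It assumes a is M rows of 4K bytes, b is K rows of 4N bytes, and dst is M by N 32-bit integers.

```rust
/// Widens a raw byte to i32, treating it as signed or unsigned.
fn widen(byte: u8, signed: bool) -> i32 {
    if signed { byte as i8 as i32 } else { byte as i32 }
}

/// Scalar model of the byte dot-product family. `a_signed` / `b_signed`
/// select the variant: (true, true) = ssd, (true, false) = sud,
/// (false, true) = usd, (false, false) = uud.
fn tile_dpb_model(
    dst: &mut [Vec<i32>],
    a: &[Vec<u8>],
    b: &[Vec<u8>],
    a_signed: bool,
    b_signed: bool,
) {
    for m in 0..dst.len() {
        for n in 0..dst[m].len() {
            let mut acc = dst[m][n];
            for k in 0..b.len() {
                // Each 32-bit element holds a group of 4 adjacent bytes.
                for i in 0..4 {
                    let x = widen(a[m][4 * k + i], a_signed);
                    let y = widen(b[k][4 * n + i], b_signed);
                    acc = acc.wrapping_add(x * y);
                }
            }
            dst[m][n] = acc;
        }
    }
}
```

The same input byte `0xFF` contributes -1 in a signed operand and 255 in an unsigned one, which is the whole difference between the variants in this model.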
__tile_dpfp16ps ⚠ Experimental amx-fp16
Compute dot-product of FP16 (16-bit) floating-point pairs in tiles a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_dphbf8ps ⚠ Experimental amx-fp8
Compute dot-product of HF8 (8-bit E4M3) floating-point elements in tile a and BF8 (8-bit E5M2) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_dphf8ps ⚠ Experimental amx-fp8
Compute dot-product of HF8 (8-bit E4M3) floating-point elements in tile a and HF8 (8-bit E4M3) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_loadd ⚠ Experimental amx-tile
Load tile rows from memory specified by base address and stride into destination tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
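The base-plus-stride addressing described above can be modelled on an ordinary byte slice. This is only a sketch of the addressing pattern: `tile_load_model` is a hypothetical name, `base` is an index into the slice instead of a pointer, and `rows` and `colsb` (bytes per row) stand in for the configured tile shape.

```rust
/// Copies `rows` rows of `colsb` bytes each out of `mem`.
/// Row r starts at byte offset `base + r * stride`.
fn tile_load_model(
    mem: &[u8],
    base: usize,
    stride: usize,
    rows: usize,
    colsb: usize,
) -> Vec<Vec<u8>> {
    (0..rows)
        .map(|r| {
            let start = base + r * stride;
            mem[start..start + colsb].to_vec()
        })
        .collect()
}
```

A stride larger than `colsb` lets the tile be a sub-block of a wider row-major matrix: the bytes between the end of one tile row and the start of the next are simply skipped.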
__tile_loaddrs ⚠ Experimental amx-movrs
Load tile rows from memory specified by base address and stride into destination tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler. Additionally, this intrinsic indicates the source memory location is likely to become read-shared by multiple processors, i.e., read in the future by at least one other processor before it is written, assuming it is ever written in the future.
__tile_mmultf32ps ⚠ Experimental amx-tf32
Perform matrix multiplication of two tiles a and b, containing packed single precision (32-bit) floating-point elements, which are converted to TF32 (tensor-float32) format, and accumulate the results into a packed single precision tile. For each possible combination of (row of a, column of b), it performs
__tile_movrow ⚠ Experimental amx-avx512 and avx10.2
Moves one row of tile data into a zmm vector register. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_stored ⚠ Experimental amx-tile
Store the tile specified by src to memory specified by base address and stride. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
__tile_stream_loadd ⚠ Experimental amx-tile
Load tile rows from memory specified by base address and stride into destination tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler. This intrinsic provides a hint to the implementation that the data will likely not be reused in the near future and the data caching can be optimized accordingly.
__tile_stream_loaddrs ⚠ Experimental amx-movrs
Load tile rows from memory specified by base address and stride into destination tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler. Provides a hint to the implementation that the data would be reused but does not need to be resident in the nearest cache levels. Additionally, this intrinsic indicates the source memory location is likely to become read-shared by multiple processors, i.e., read in the future by at least one other processor before it is written, assuming it is ever written in the future.
__tile_zero ⚠ Experimental amx-tile
Zero the tile specified by dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
_tile_cmmimfp16ps ⚠ Experimental amx-complex
Perform matrix multiplication of two tiles containing complex elements and accumulate the results into a packed single-precision tile. Each dword element in input tiles a and b is interpreted as a complex number with an FP16 real part and an FP16 imaginary part. Calculates the imaginary part of the result. For each possible combination of (row of a, column of b), it performs a set of multiplications and accumulations on all corresponding complex numbers (one from a and one from b). The imaginary part of the a element is multiplied with the real part of the corresponding b element, and the real part of the a element is multiplied with the imaginary part of the corresponding b element. The two accumulated results are added, and then accumulated into the corresponding row and column of dst.
_tile_cmmrlfp16ps ⚠ Experimental amx-complex
Perform matrix multiplication of two tiles containing complex elements and accumulate the results into a packed single-precision tile. Each dword element in input tiles a and b is interpreted as a complex number with an FP16 real part and an FP16 imaginary part. Calculates the real part of the result. For each possible combination of (row of a, column of b), it performs a set of multiplications and accumulations on all corresponding complex numbers (one from a and one from b). The real part of the a element is multiplied with the real part of the corresponding b element, and the negated imaginary part of the a element is multiplied with the imaginary part of the corresponding b element. The two accumulated results are added, and then accumulated into the corresponding row and column of dst.
_tile_cvtrowd2ps ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed 32-bit signed integer elements to packed single-precision (32-bit) floating-point elements.
_tile_cvtrowd2psi ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed 32-bit signed integer elements to packed single-precision (32-bit) floating-point elements.
_tile_cvtrowps2bf16h ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16-bits within each 32-bit element of the returned vector.
_tile_cvtrowps2bf16hi ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16-bits within each 32-bit element of the returned vector.
_tile_cvtrowps2bf16l ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16-bits within each 32-bit element of the returned vector.
_tile_cvtrowps2bf16li ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16-bits within each 32-bit element of the returned vector.
_tile_cvtrowps2phh ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16-bits within each 32-bit element of the returned vector.
_tile_cvtrowps2phhi ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16-bits within each 32-bit element of the returned vector.
_tile_cvtrowps2phl ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16-bits within each 32-bit element of the returned vector.
_tile_cvtrowps2phli ⚠ Experimental amx-avx512 and avx10.2
Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16-bits within each 32-bit element of the returned vector.
_tile_dpbf8ps ⚠ Experimental amx-fp8
Compute dot-product of BF8 (8-bit E5M2) floating-point elements in tile a and BF8 (8-bit E5M2) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
_tile_dpbf16ps ⚠ Experimental amx-bf16
Compute dot-product of BF16 (16-bit) floating-point pairs in tiles a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
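The "pairs" wording above means each 32-bit element of a and b holds two BF16 values. A scalar sketch of the accumulation follows; the names are hypothetical, tiles are nested vectors of raw BF16 bit patterns, and hardware details such as rounding and denormal handling are deliberately not modelled.

```rust
/// Widens BF16 bits to f32. BF16 is the upper 16 bits of an f32, so this is exact.
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Scalar model: a is M rows of 2K BF16 values, b is K rows of 2N BF16 values,
/// dst is M x N f32. Each (k, n) element of b supplies one pair.
fn tile_dpbf16ps_model(dst: &mut [Vec<f32>], a: &[Vec<u16>], b: &[Vec<u16>]) {
    for m in 0..dst.len() {
        for n in 0..dst[m].len() {
            for k in 0..b.len() {
                for i in 0..2 {
                    let x = bf16_bits_to_f32(a[m][2 * k + i]);
                    let y = bf16_bits_to_f32(b[k][2 * n + i]);
                    dst[m][n] += x * y;
                }
            }
        }
    }
}
```

For example, with a holding the pair (1.0, 2.0), b holding (3.0, 1.0) and dst starting at 0.5, the model leaves 0.5 + 1.0 * 3.0 + 2.0 * 1.0 = 5.5 in dst.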
_tile_dpbhf8ps ⚠ Experimental amx-fp8
Compute dot-product of BF8 (8-bit E5M2) floating-point elements in tile a and HF8 (8-bit E4M3) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
_tile_dpbssd ⚠ Experimental amx-int8
Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst.
_tile_dpbsud ⚠ Experimental amx-int8
Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in a with corresponding unsigned 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst.
_tile_dpbusd ⚠ Experimental amx-int8
Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst.
_tile_dpbuud ⚠ Experimental amx-int8
Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding unsigned 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst.
_tile_dpfp16ps ⚠ Experimental amx-fp16
Compute dot-product of FP16 (16-bit) floating-point pairs in tiles a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
_tile_dphbf8ps ⚠ Experimental amx-fp8
Compute dot-product of HF8 (8-bit E4M3) floating-point elements in tile a and BF8 (8-bit E5M2) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
_tile_dphf8ps ⚠ Experimental amx-fp8
Compute dot-product of HF8 (8-bit E4M3) floating-point elements in tile a and HF8 (8-bit E4M3) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
_tile_loadconfig ⚠ Experimental amx-tile
Load tile configuration from a 64-byte memory location specified by mem_addr. The tile configuration format is specified below, and includes the tile type palette, the number of bytes per row, and the number of rows. If the specified palette_id is zero, that signifies the init state for both the tile config and the tile data, and the tiles are zeroed. Any invalid configuration will result in a #GP fault.
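The 64-byte block mentioned above can be sketched as a plain struct. The field layout below follows Intel's published description of the tile configuration (palette id, start row, reserved bytes, sixteen 16-bit bytes-per-row fields, sixteen 8-bit row counts); it is an assumption relative to this page, which does not reproduce the format, and `TileConfig`, `set_tile` and `to_bytes` are hypothetical names.

```rust
/// Sketch of the 64-byte tile configuration block.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct TileConfig {
    palette_id: u8,     // byte 0: 0 = init state, 1 = standard tile palette
    start_row: u8,      // byte 1: restart position for interrupted load/store
    reserved: [u8; 14], // bytes 2..16: must be zero
    colsb: [u16; 16],   // bytes 16..48: bytes per row, one entry per tile
    rows: [u8; 16],     // bytes 48..64: number of rows, one entry per tile
}

impl TileConfig {
    /// Sets the shape of one tile register (index checked by array bounds).
    fn set_tile(&mut self, tile: usize, rows: u8, colsb: u16) {
        self.rows[tile] = rows;
        self.colsb[tile] = colsb;
    }

    /// Serialises the block to the little-endian byte image a loader would read.
    fn to_bytes(&self) -> [u8; 64] {
        let mut out = [0u8; 64];
        out[0] = self.palette_id;
        out[1] = self.start_row;
        for (i, c) in self.colsb.iter().enumerate() {
            out[16 + 2 * i..18 + 2 * i].copy_from_slice(&c.to_le_bytes());
        }
        out[48..64].copy_from_slice(&self.rows);
        out
    }
}
```

A typical configuration in this model sets `palette_id` to 1 and gives each tile in use a shape such as 16 rows of 64 bytes; the resulting 64-byte image is what an intrinsic like `_tile_loadconfig` would be pointed at.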
_tile_loadd ⚠ Experimental amx-tile
Load tile rows from memory specified by base address and stride into destination tile dst using the tile configuration previously configured via _tile_loadconfig.
_tile_loaddrs ⚠ Experimental amx-movrs
Load tile rows from memory specified by base address and stride into destination tile dst using the tile configuration previously configured via _tile_loadconfig. Additionally, this intrinsic indicates the source memory location is likely to become read-shared by multiple processors, i.e., read in the future by at least one other processor before it is written, assuming it is ever written in the future.
_tile_mmultf32ps ⚠ Experimental amx-tf32
Perform matrix multiplication of two tiles a and b, containing packed single precision (32-bit) floating-point elements, which are converted to TF32 (tensor-float32) format, and accumulate the results into a packed single precision tile. For each possible combination of (row of a, column of b), it performs
_tile_movrow ⚠ Experimental amx-avx512 and avx10.2
Moves one row of tile data into a zmm vector register.
_tile_movrowi ⚠ Experimental amx-avx512 and avx10.2
Moves one row of tile data into a zmm vector register.
_tile_release ⚠ Experimental amx-tile
Release the tile configuration to return to the init state, which releases all storage it currently holds.
_tile_storeconfig ⚠ Experimental amx-tile
Stores the current tile configuration to a 64-byte memory location specified by mem_addr. The tile configuration format is as specified in _tile_loadconfig, and includes the tile type palette, the number of bytes per row, and the number of rows. If tiles are not configured, all zeroes will be stored to memory.
_tile_stored ⚠ Experimental amx-tile
Store the tile specified by src to memory specified by base address and stride using the tile configuration previously configured via _tile_loadconfig.
_tile_stream_loadd ⚠ Experimental amx-tile
Load tile rows from memory specified by base address and stride into destination tile dst using the tile configuration previously configured via _tile_loadconfig. This intrinsic provides a hint to the implementation that the data will likely not be reused in the near future and the data caching can be optimized accordingly.
_tile_stream_loaddrs ⚠ Experimental amx-movrs
Load tile rows from memory specified by base address and stride into destination tile dst using the tile configuration previously configured via _tile_loadconfig. Provides a hint to the implementation that the data would be reused but does not need to be resident in the nearest cache levels. Additionally, this intrinsic indicates the source memory location is likely to become read-shared by multiple processors, i.e., read in the future by at least one other processor before it is written, assuming it is ever written in the future.
_tile_zero ⚠ Experimental amx-tile
Zero the tile specified by dst.