Available on x86-64 only.
Functions
- ldtilecfg
- sttilecfg
- tcmmimfp16ps
- tcmmimfp16ps_internal
- tcmmrlfp16ps
- tcmmrlfp16ps_internal
- tcvtrowd2ps
- tcvtrowd2ps_internal
- tcvtrowd2psi
- tcvtrowps2bf16h
- tcvtrowps2bf16h_internal
- tcvtrowps2bf16hi
- tcvtrowps2bf16l
- tcvtrowps2bf16l_internal
- tcvtrowps2bf16li
- tcvtrowps2phh
- tcvtrowps2phh_internal
- tcvtrowps2phhi
- tcvtrowps2phl
- tcvtrowps2phl_internal
- tcvtrowps2phli
- tdpbf8ps
- tdpbf8ps_internal
- tdpbf16ps
- tdpbf16ps_internal
- tdpbhf8ps
- tdpbhf8ps_internal
- tdpbssd
- tdpbssd_internal
- tdpbsud
- tdpbsud_internal
- tdpbusd
- tdpbusd_internal
- tdpbuud
- tdpbuud_internal
- tdpfp16ps
- tdpfp16ps_internal
- tdphbf8ps
- tdphbf8ps_internal
- tdphf8ps
- tdphf8ps_internal
- tileloadd64
- tileloadd64_internal
- tileloaddrs64
- tileloaddrs64_internal
- tileloaddrst164
- tileloaddrst164_internal
- tileloaddt164
- tileloaddt164_internal
- tilemovrow
- tilemovrow_internal
- tilemovrowi
- tilerelease
- tilestored64
- tilestored64_internal
- tilezero
- tilezero_internal
- tmmultf32ps
- tmmultf32ps_internal
- __tile_cmmimfp16ps Experimental amx-complex - Perform matrix multiplication of two tiles containing complex elements and accumulate the results into a packed single precision tile. Each dword element in input tiles a and b is interpreted as a complex number with FP16 real part and FP16 imaginary part. Calculates the imaginary part of the result. For each possible combination of (row of a, column of b), it performs a set of multiplications and accumulations on all corresponding complex numbers (one from a and one from b). The imaginary part of the a element is multiplied with the real part of the corresponding b element, and the real part of the a element is multiplied with the imaginary part of the corresponding b element. The two accumulated results are added, and then accumulated into the corresponding row and column of dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_cmmrlfp16ps Experimental amx-complex - Perform matrix multiplication of two tiles containing complex elements and accumulate the results into a packed single precision tile. Each dword element in input tiles a and b is interpreted as a complex number with FP16 real part and FP16 imaginary part. Calculates the real part of the result. For each possible combination of (row of a, column of b), it performs a set of multiplications and accumulations on all corresponding complex numbers (one from a and one from b). The real part of the a element is multiplied with the real part of the corresponding b element, and the negated imaginary part of the a element is multiplied with the imaginary part of the corresponding b element. The two accumulated results are added, and then accumulated into the corresponding row and column of dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_cvtrowd2ps Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed 32-bit signed integer elements to packed single-precision (32-bit) floating-point elements. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_cvtrowps2bf16h Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16 bits within each 32-bit element of the returned vector. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_cvtrowps2bf16l Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16 bits within each 32-bit element of the returned vector. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_cvtrowps2phh Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16 bits within each 32-bit element of the returned vector. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_cvtrowps2phl Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16 bits within each 32-bit element of the returned vector. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_dpbf8ps Experimental amx-fp8 - Compute dot-product of BF8 (8-bit E5M2) floating-point elements in tile a and BF8 (8-bit E5M2) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_dpbf16ps Experimental amx-bf16 - Compute dot-product of BF16 (16-bit) floating-point pairs in tiles a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_dpbhf8ps Experimental amx-fp8 - Compute dot-product of BF8 (8-bit E5M2) floating-point elements in tile a and HF8 (8-bit E4M3) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_dpbssd Experimental amx-int8 - Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_dpbsud Experimental amx-int8 - Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in a with corresponding unsigned 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_dpbusd Experimental amx-int8 - Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_dpbuud Experimental amx-int8 - Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding unsigned 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_dpfp16ps Experimental amx-fp16 - Compute dot-product of FP16 (16-bit) floating-point pairs in tiles a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_dphbf8ps Experimental amx-fp8 - Compute dot-product of HF8 (8-bit E4M3) floating-point elements in tile a and BF8 (8-bit E5M2) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_dphf8ps Experimental amx-fp8 - Compute dot-product of HF8 (8-bit E4M3) floating-point elements in tile a and HF8 (8-bit E4M3) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_loadd Experimental amx-tile - Load tile rows from memory specified by base address and stride into destination tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_loaddrs Experimental amx-movrs - Load tile rows from memory specified by base address and stride into destination tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler. Additionally, this intrinsic indicates the source memory location is likely to become read-shared by multiple processors, i.e., read in the future by at least one other processor before it is written, assuming it is ever written in the future.
- __tile_mmultf32ps Experimental amx-tf32 - Perform matrix multiplication of two tiles a and b, containing packed single precision (32-bit) floating-point elements, which are converted to TF32 (tensor-float32) format, and accumulate the results into a packed single precision tile. For each possible combination of (row of a, column of b), it performs
- __tile_movrow Experimental amx-avx512 and avx10.2 - Moves one row of tile data into a zmm vector register. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_stored Experimental amx-tile - Store the tile specified by src to memory specified by base address and stride. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- __tile_stream_loadd Experimental amx-tile - Load tile rows from memory specified by base address and stride into destination tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler. This intrinsic provides a hint to the implementation that the data will likely not be reused in the near future and the data caching can be optimized accordingly.
- __tile_stream_loaddrs Experimental amx-movrs - Load tile rows from memory specified by base address and stride into destination tile dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler. Provides a hint to the implementation that the data would be reused but does not need to be resident in the nearest cache levels. Additionally, this intrinsic indicates the source memory location is likely to become read-shared by multiple processors, i.e., read in the future by at least one other processor before it is written, assuming it is ever written in the future.
- __tile_zero Experimental amx-tile - Zero the tile specified by dst. The shape of the tile is specified in the struct of __tile1024i. The register of the tile is allocated by the compiler.
- _tile_cmmimfp16ps Experimental amx-complex - Perform matrix multiplication of two tiles containing complex elements and accumulate the results into a packed single precision tile. Each dword element in input tiles a and b is interpreted as a complex number with FP16 real part and FP16 imaginary part. Calculates the imaginary part of the result. For each possible combination of (row of a, column of b), it performs a set of multiplications and accumulations on all corresponding complex numbers (one from a and one from b). The imaginary part of the a element is multiplied with the real part of the corresponding b element, and the real part of the a element is multiplied with the imaginary part of the corresponding b element. The two accumulated results are added, and then accumulated into the corresponding row and column of dst.
- _tile_cmmrlfp16ps Experimental amx-complex - Perform matrix multiplication of two tiles containing complex elements and accumulate the results into a packed single precision tile. Each dword element in input tiles a and b is interpreted as a complex number with FP16 real part and FP16 imaginary part. Calculates the real part of the result. For each possible combination of (row of a, column of b), it performs a set of multiplications and accumulations on all corresponding complex numbers (one from a and one from b). The real part of the a element is multiplied with the real part of the corresponding b element, and the negated imaginary part of the a element is multiplied with the imaginary part of the corresponding b element. The two accumulated results are added, and then accumulated into the corresponding row and column of dst.
- _tile_cvtrowd2ps Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed 32-bit signed integer elements to packed single-precision (32-bit) floating-point elements.
- _tile_cvtrowd2psi Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed 32-bit signed integer elements to packed single-precision (32-bit) floating-point elements.
- _tile_cvtrowps2bf16h Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16 bits within each 32-bit element of the returned vector.
- _tile_cvtrowps2bf16hi Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16 bits within each 32-bit element of the returned vector.
- _tile_cvtrowps2bf16l Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16 bits within each 32-bit element of the returned vector.
- _tile_cvtrowps2bf16li Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed BF16 (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16 bits within each 32-bit element of the returned vector.
- _tile_cvtrowps2phh Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16 bits within each 32-bit element of the returned vector.
- _tile_cvtrowps2phhi Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the high 16 bits within each 32-bit element of the returned vector.
- _tile_cvtrowps2phl Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16 bits within each 32-bit element of the returned vector.
- _tile_cvtrowps2phli Experimental amx-avx512 and avx10.2 - Moves a row from a tile register to a zmm register, converting the packed single-precision (32-bit) floating-point elements to packed half-precision (16-bit) floating-point elements. The resulting 16-bit elements are placed in the low 16 bits within each 32-bit element of the returned vector.
- _tile_dpbf8ps Experimental amx-fp8 - Compute dot-product of BF8 (8-bit E5M2) floating-point elements in tile a and BF8 (8-bit E5M2) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
- _tile_dpbf16ps Experimental amx-bf16 - Compute dot-product of BF16 (16-bit) floating-point pairs in tiles a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
- _tile_dpbhf8ps Experimental amx-fp8 - Compute dot-product of BF8 (8-bit E5M2) floating-point elements in tile a and HF8 (8-bit E4M3) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
- _tile_dpbssd Experimental amx-int8 - Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst.
- _tile_dpbsud Experimental amx-int8 - Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of signed 8-bit integers in a with corresponding unsigned 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst.
- _tile_dpbusd Experimental amx-int8 - Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst.
- _tile_dpbuud Experimental amx-int8 - Compute dot-product of bytes in tiles with a source/destination accumulator. Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding unsigned 8-bit integers in b, producing 4 intermediate 32-bit results. Sum these 4 results with the corresponding 32-bit integer in dst, and store the 32-bit result back to tile dst.
- _tile_dpfp16ps Experimental amx-fp16 - Compute dot-product of FP16 (16-bit) floating-point pairs in tiles a and b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
- _tile_dphbf8ps Experimental amx-fp8 - Compute dot-product of HF8 (8-bit E4M3) floating-point elements in tile a and BF8 (8-bit E5M2) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
- _tile_dphf8ps Experimental amx-fp8 - Compute dot-product of HF8 (8-bit E4M3) floating-point elements in tile a and HF8 (8-bit E4M3) floating-point elements in tile b, accumulating the intermediate single-precision (32-bit) floating-point elements with elements in dst, and store the 32-bit result back to tile dst.
- _tile_loadconfig Experimental amx-tile - Load tile configuration from a 64-byte memory location specified by mem_addr. The tile configuration format is specified below, and includes the tile type palette, the number of bytes per row, and the number of rows. If the specified palette_id is zero, that signifies the init state for both the tile config and the tile data, and the tiles are zeroed. Any invalid configuration results in a #GP fault.
- _tile_loadd Experimental amx-tile - Load tile rows from memory specified by base address and stride into destination tile dst using the tile configuration previously configured via _tile_loadconfig.
- _tile_loaddrs Experimental amx-movrs - Load tile rows from memory specified by base address and stride into destination tile dst using the tile configuration previously configured via _tile_loadconfig. Additionally, this intrinsic indicates the source memory location is likely to become read-shared by multiple processors, i.e., read in the future by at least one other processor before it is written, assuming it is ever written in the future.
- _tile_mmultf32ps Experimental amx-tf32 - Perform matrix multiplication of two tiles a and b, containing packed single precision (32-bit) floating-point elements, which are converted to TF32 (tensor-float32) format, and accumulate the results into a packed single precision tile. For each possible combination of (row of a, column of b), it performs
- _tile_movrow Experimental amx-avx512 and avx10.2 - Moves one row of tile data into a zmm vector register.
- _tile_movrowi Experimental amx-avx512 and avx10.2 - Moves one row of tile data into a zmm vector register.
- _tile_release Experimental amx-tile - Release the tile configuration to return to the init state, which releases all storage it currently holds.
- _tile_storeconfig Experimental amx-tile - Stores the current tile configuration to a 64-byte memory location specified by mem_addr. The tile configuration format is as specified in _tile_loadconfig, and includes the tile type palette, the number of bytes per row, and the number of rows. If tiles are not configured, all zeroes will be stored to memory.
- _tile_stored Experimental amx-tile - Store the tile specified by src to memory specified by base address and stride using the tile configuration previously configured via _tile_loadconfig.
- _tile_stream_loadd Experimental amx-tile - Load tile rows from memory specified by base address and stride into destination tile dst using the tile configuration previously configured via _tile_loadconfig. This intrinsic provides a hint to the implementation that the data will likely not be reused in the near future and the data caching can be optimized accordingly.
- _tile_stream_loaddrs Experimental amx-movrs - Load tile rows from memory specified by base address and stride into destination tile dst using the tile configuration previously configured via _tile_loadconfig. Provides a hint to the implementation that the data would be reused but does not need to be resident in the nearest cache levels. Additionally, this intrinsic indicates the source memory location is likely to become read-shared by multiple processors, i.e., read in the future by at least one other processor before it is written, assuming it is ever written in the future.
- _tile_zero Experimental amx-tile - Zero the tile specified by tdest.
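
The int8 dot-product entries above (the signed-by-signed `dpbssd` variant and its signed/unsigned siblings) describe the operation in words; it may be easier to follow as a plain loop nest. The following is a minimal scalar sketch of the signed-by-signed case, not the intrinsic itself and not how the hardware is implemented. The function name `dpbssd_reference`, the dimension parameters `m`, `n`, `k`, and the flat row-major slice layout are all assumptions made for illustration, as is the choice of wrapping addition for the 32-bit accumulator.

```rust
/// Scalar reference model of the signed-by-signed int8 tile dot product.
///
/// Hypothetical helper for illustration only; it does not touch AMX state.
/// Assumed layout (row-major, flat slices):
/// - `a`:   `m` rows of `4 * k` signed bytes
/// - `b`:   `k` rows of `4 * n` signed bytes
/// - `dst`: `m` rows of `n` 32-bit accumulators
pub fn dpbssd_reference(dst: &mut [i32], a: &[i8], b: &[i8], m: usize, n: usize, k: usize) {
    assert_eq!(dst.len(), m * n);
    assert_eq!(a.len(), m * 4 * k);
    assert_eq!(b.len(), k * 4 * n);

    for row in 0..m {
        for kk in 0..k {
            for col in 0..n {
                let mut acc = dst[row * n + col];
                // One group of 4 adjacent byte pairs: 4 products summed
                // into the corresponding 32-bit element of dst.
                for i in 0..4 {
                    let x = i32::from(a[row * 4 * k + 4 * kk + i]);
                    let y = i32::from(b[kk * 4 * n + 4 * col + i]);
                    // Assumption of this sketch: accumulate with wraparound
                    // rather than panicking or saturating on overflow.
                    acc = acc.wrapping_add(x * y);
                }
                dst[row * n + col] = acc;
            }
        }
    }
}
```

With a 1x1 accumulator initialised to 10, `a = [1, 2, 3, 4]` and `b = [5, 6, 7, 8]`, the model adds 5 + 12 + 21 + 32 to the accumulator. The unsigned variants would differ only in how each byte is widened before the multiply.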