05 Jun 20:54

vpirogov

v3.12.1

c8c37fd

v3.12.1 Latest


This is a patch release containing the following changes to v3.12:

Enabled SYCL Graph record/replay mode support in Graph API on Intel GPUs (f014724, 3e19991, 982650a)
Fixed a performance regression in matmul with 4D shapes and N == 1 or M == 1 on x64 CPUs (de9498f, 540d53d)
Fixed a correctness issue in matmul primitive with ReLU post-op on RV64 CPUs (772ca13)
Fixed a segfault in s8/u8 depthwise convolution on x64 processors with Intel AVX10.2 instruction set support (7f413f9, dac7399)
Fixed a correctness issue in s8s8 convolution on x64 processors with Intel AVX-512 and Intel DL Boost instructions support (0dc2ca8)
Fixed an issue with incorrect memory use estimation of layer normalization, group normalization, and batch normalization primitives in benchdnn (881e9b6)
Fixed an assertion in benchdnn --graph driver (79b2593)


08 May 12:05

vgvozdeva

v3.12

80afa71

v3.12

Performance Optimizations

Intel 64/AMD64 Processors

Improved performance on future Intel Core Ultra processors with Intel AVX10.2 instruction set support (code name Nova Lake). These optimizations are now enabled by default on compatible processors.
Improved performance on future Intel Xeon processors with Intel AVX10.2 and Intel AMX instruction set support (code name Diamond Rapids). These optimizations are now enabled by default on compatible processors.
Improved performance of fp8 and int8 matmul with transposed source on processors with Intel AMX instruction set support.
Improved performance of bf16 and f16 matmul with transposed source on processors with Intel AVX2 instruction set support.

Intel Graphics

Introduced initial performance optimizations for future integrated GPUs based on Xe3p-LPG architecture.
Introduced initial performance optimizations for future discrete GPUs based on Xe3p-XPC architecture. This is a preview functionality not recommended for production use.
Improved f16 matmul performance on Intel Arc Graphics for Intel Core Ultra processor Series 3 (formerly Panther Lake).
Improved performance of matmul with host-side scalar arguments.
Improved matmul performance for cases with small M/N and large K.
Improved SDPA forward and backpropagation subgraph performance with Graph API.

AArch64 Processors

Improved f16 and f32 softmax performance across Arm Neoverse cores.
Improved eltwise performance on Arm Neoverse N1 cores.
Improved matmul and convolution performance on Arm Neoverse V2 cores.
Improved performance of multiple primitives by querying processor cache sizes.

RISC-V Processors

Improved f32 matmul, inner product, convolution, softmax and layer normalization primitives performance on processors with V extension support.
Improved f16 softmax primitive performance on processors with Zvfh extension support.

Functionality

Functional API

[experimental] Introduced grouped memory format and grouped matmul support to improve performance of AI models based on Mixture-of-Experts (MoE) architecture. This is an experimental feature that requires opt-in with ONEDNN_EXPERIMENTAL_GROUPED_MEMORY=ON build option. Optimized version of this functionality is implemented for Intel GPUs.
[experimental] Extended grouped matmul with optional execution-time hint DNNL_ARG_HINT_MAX_GROUP_SIZE to communicate the maximum size of the group across the variable dimension for the execution call.
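
As a rough illustration of what a grouped matmul computes for a Mixture-of-Experts layer, the sketch below is a plain Python reference, not the oneDNN API. It assumes tokens are already sorted by expert and that groups are described by cumulative row offsets; that layout and the helper names `grouped_matmul` and `max_group_size` are hypothetical and chosen only for this example. The value returned by `max_group_size` is the kind of quantity the notes describe the DNNL_ARG_HINT_MAX_GROUP_SIZE hint as communicating.

```python
def matmul(a, b):
    """Plain dense matmul on lists of lists: (m x k) @ (k x n) -> (m x n)."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def grouped_matmul(src, offsets, weights):
    """Multiply each row group of `src` by its own weight matrix.

    src:     (total_rows x k) rows, already sorted by group (assumed layout)
    offsets: cumulative end row of each group, e.g. [2, 2, 5]
    weights: one (k x n) matrix per group
    """
    out, start = [], 0
    for end, w in zip(offsets, weights):
        out.extend(matmul(src[start:end], w))  # an empty group adds no rows
        start = end
    return out


def max_group_size(offsets):
    """Largest group length along the variable (row) dimension."""
    starts = [0] + offsets[:-1]
    return max((end - start for start, end in zip(starts, offsets)), default=0)


# Three experts; the second one receives no tokens in this call.
src = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
offsets = [2, 2, 5]
weights = [
    [[1, 0], [0, 1]],  # expert 0: identity
    [[0, 0], [0, 0]],  # expert 1: unused here
    [[2, 0], [0, 2]],  # expert 2: doubles its input
]
dst = grouped_matmul(src, offsets, weights)
```

The group sizes vary from call to call, which is why a single dense matmul does not fit this pattern and why an upper bound on the group size can be useful to an implementation at execution time.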

Graph API

Introduced Dropout operation. Extended supported fusion patterns to enable fusion of Dropout with Matmul, Softmax, and elementwise operations.

Usability

Common

Extended information about primitive execution available in VTune Profiler with the same level of detail as reported by oneDNN verbose mode. This feature requires VTune Profiler 2025.7 or later.

Intel Graphics

[experimental] Introduced support for Level Zero runtime on Intel GPUs. New functionality includes Level Zero interoperability API and build knob ONEDNN_GPU_RUNTIME=ZE.

AArch64 Processors

Reduced memory usage of certain convolutions on Arm Neoverse V1/V2 cores.
Fixed a bug causing high memory usage and crashes in convolution with certain post-ops.

Validation

Extended benchdnn with support for integer masks in quantization attributes.
Improved consistency of benchdnn performance results when data compression is enabled by default on Intel Graphics.
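
For readers unfamiliar with integer masks in quantization attributes, the sketch below shows the usual oneDNN convention: if bit i of the mask is set, a dedicated scale is used for each index along tensor dimension i. It is a plain Python illustration, not benchdnn or library code. The helper names `scale_dims` and `scale_count` are hypothetical, and grouped scales are deliberately left out.

```python
def scale_dims(tensor_dims, mask):
    """Shape of the scales tensor implied by an integer mask.

    Bit i set   -> one scale per index along dimension i.
    Bit i clear -> a single scale shared along dimension i.
    """
    return [d if (mask >> i) & 1 else 1 for i, d in enumerate(tensor_dims)]


def scale_count(tensor_dims, mask):
    """Total number of scale values implied by the mask."""
    count = 1
    for d in scale_dims(tensor_dims, mask):
        count *= d
    return count


# Example: convolution weights laid out as (out_channels, in_channels, kh, kw).
weights_dims = [64, 128, 3, 3]
per_tensor = scale_count(weights_dims, 0)          # one common scale
per_out_channel = scale_count(weights_dims, 1)     # bit 0 -> out_channels
per_oc_ic = scale_count(weights_dims, 0b11)        # bits 0 and 1
```

A mask of 0 therefore means a single common scale, while setting more bits multiplies the number of scales by the sizes of the selected dimensions.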

Deprecated Functionality

The BLAS-like API, including the dnnl::sgemm, dnnl::gemm_u8s8s32, and dnnl::gemm_s8s8s32 functions, is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
f4_e3m0 data type is deprecated and will be removed in future releases.

Thanks to our Contributors

This release contains contributions from the project core team as well as Alexandre de Limas Santana @alexandrelimassantana, Andrei (Andrey) Khropov @andrey-khropov, Andrei Hutu @Anndrey24, Fadi Arafeh @fadara01, George Nash @georgen117, Kamil Wieloch @kwieloch-intel, Kasture Deeksha, MarkVeerasingam @MarkVeerasingam, Nikhil Gupta @nikhil-arm, @pmanczak, @vishwascm, and Xia Zhuozhao @xiazhuozhao.



17 Apr 23:08

vpirogov

v3.11.3

74d0475

v3.11.3

This is a patch release containing the following changes to v3.11.2:

Fixed undefined behavior in matmul implementation on Intel 64/AMD64 CPUs (bd117e4)
Fixed performance regression in f32 reorder on Intel 64/AMD64 CPUs (a4acece)
Fixed a segfault in binary primitive with large sizes on Intel GPUs (157cba5, 1a8bc11)
Fixed performance regression in f32 convolution with small number of input channels on processors with Intel AVX-512 instruction set support (e1f4a61)


17 Apr 23:35

vpirogov

v3.12-rc

5f1ac51

v3.12-rc Pre-release


Performance Optimizations

Intel 64/AMD64 Processors

Improved performance on future Intel Core Ultra processors with Intel AVX10.2 instruction set support (code name Nova Lake). These optimizations are now enabled by default on compatible processors.
Improved performance on future Intel Xeon processors with Intel AVX10.2 and Intel AMX instruction set support (code name Diamond Rapids). These optimizations are now enabled by default on compatible processors.
Improved performance of fp8 and int8 matmul with transposed source on processors with Intel AMX instruction set support.
Improved performance of bf16 and f16 matmul with transposed source on processors with Intel AVX2 instruction set support.

Intel Graphics

Introduced initial performance optimizations for future integrated GPUs based on Xe3p-LPG architecture.
Introduced initial performance optimizations for future discrete GPUs based on Xe3p-XPC architecture.
Improved f16 matmul performance on Intel Arc Graphics for Intel Core Ultra processor Series 3 (formerly Panther Lake).
Improved performance of matmul with host-side scalar arguments.
Improved matmul performance for cases with small M/N and large K.
Improved SDPA forward and backpropagation subgraph performance with Graph API.

AArch64 Processors

Improved f16 and f32 softmax performance across Arm Neoverse cores.
Improved eltwise performance on Arm Neoverse N1 cores.
Improved matmul and convolution performance on Arm Neoverse V2 cores.

RISC-V Processors

Improved f32 matmul, inner product, convolution, softmax and layer normalization primitives performance on processors with V extension support.
Improved f16 softmax primitive performance on processors with Zvfh extension support.

Functionality

Functional API

[experimental] Introduced grouped memory format and grouped matmul support to improve performance of AI models based on Mixture-of-Experts (MoE) architecture. This is an experimental feature that requires opt-in with ONEDNN_EXPERIMENTAL_GROUPED_MEMORY=ON build option. Optimized version of this functionality is implemented for Intel GPUs.
[experimental] Extended grouped matmul with optional execution-time hint DNNL_ARG_HINT_MAX_GROUP_SIZE to communicate the maximum size of the group across the variable dimension for the execution call.

Graph API

Introduced Dropout operation. Extended supported fusion patterns to enable fusion of Dropout with Matmul, Softmax, and elementwise operations.

Usability

Common

Extended information about primitive execution available in VTune Profiler with the same level of detail as reported by oneDNN verbose mode. This feature requires VTune Profiler 2025.7 or later.

Intel Graphics

[experimental] Introduced support for Level Zero runtime on Intel GPUs. New functionality includes Level Zero interoperability API and build knob ONEDNN_GPU_RUNTIME=ZE.

AArch64 Processors

Introduced support for the library to correctly query processor cache sizes.
Reduced memory usage of certain convolutions on Arm Neoverse V1/V2 cores.
Fixed a bug causing high memory usage and crashes in convolution with certain post-ops.

Validation

Extended benchdnn with support for integer masks in quantization attributes.
Improved consistency of benchdnn performance results when data compression is enabled by default on Intel Graphics.

Deprecated Functionality

The BLAS-like API, including the dnnl::sgemm, dnnl::gemm_u8s8s32, and dnnl::gemm_s8s8s32 functions, is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
f4_e3m0 data type is deprecated and will be removed in future releases.

Thanks to our Contributors

Contributors

andrey-khropov, Anndrey24, and 9 other contributors


30 Mar 01:11

vpirogov

v3.11.2

03c022d

v3.11.2

This is a patch release containing the following changes to v3.11.1:

Fixed an issue with unintentionally exposed internal symbols in Graph API (d369c5f, 8fc3ec3)
Fixed an integer overflow in memory descriptor size computation for humungous tensors (265704b, 98e2011)
Fixed a potential heap corruption in f32 GEMM kernels on x64 CPUs (6191181)
Added support for bf16 and f16 matmul with transposed source on x64 CPUs with Intel AVX2 instruction set support (5af82d4, f0c6428, eef1cf0)
Updated benchdnn to use non-compressible random data in performance benchmarking mode on Intel GPUs (8051b37)


16 Mar 22:54

vpirogov

v3.11.1

10dcd61

v3.11.1

This is a patch release containing the following changes to v3.11:

Fixed performance regression in bf16 matmul with int4 weights on Intel GPUs based on Xe2 architecture (d4d4d7a)
Fixed performance regression in inner product primitive with transposed weights on x64 CPUs (c5d2d09)
Updated benchdnn input files for matmul and convolution performance benchmarking (e80a1a8, 96d72a9, b9c9bce)
Fixed an out of registers issue in SDPA fusion with Graph API on Intel GPUs (ba81382)
Fixed integer overflow in softmax primitive implementation for Intel GPUs (4a711d7, b02cfa0, c557f33, ab64a9b)
Fixed incorrect results in f64 convolution weight gradient on Intel GPUs based on Xe-LPG architecture (adcb323, 3d1a7e4)
Removed in-place optimization for reorder in Graph API to avoid correctness issues (a6c3630)
Improved performance of int8, f16, and bf16 convolution on processors with Intel AMX support (a418949)
Fixed a correctness issue in f32 convolution with small number of input channels (3d1d9b4, ada85c5)
Fixed a correctness issue in matmul with binary post-op and non-trivial strides on x64 CPUs (f49f470, 265df18, 5892570)
Fixed benchdnn graph driver test to support non-trivial strides (0232763, 662cbb3)
Fixed a correctness issue in 3D grouped convolution weight gradient on Intel GPUs (8a7996b)
Fixed a page fault issue in f32 SDPA subgraph on Intel GPUs (98845e5)
Fixed a performance regression in bf16 matmul on x64 CPUs with Intel AMX instruction set support (5b886e8, f3a79e7, 52cc900, cf9a11e)
Fixed a segmentation fault in matmul on x64 processors with Intel AVX10.2 and Intel AMX instruction set support (98aea2f)
Fixed correctness issue in SDPA subgraph with non-trivial strides for mask on Intel GPUs (0ccdfba)


06 Feb 17:42

vgvozdeva

v3.11

fc61516

v3.11

Performance Optimizations

Intel 64/AMD64 Processors

Improved fp32 matmul performance with fp4 compressed weights.
Improved fp32 matmul performance for cases when one of the tensors has a trivial dimension on processors with Intel AVX-512 instruction set support.

Intel Graphics

Improved fp16/bf16 matmul performance for large tensor cases on Intel Graphics for Intel Core Ultra processor Series 3 (formerly Panther Lake).
Improved matmul performance for cases with 4-byte alignment on Intel GPUs based on Xe2 architecture.
Improved performance of fp16/bf16 matmul with mxfp4 weights.
Improved convolution performance with host-side scalar scales and zero points.
Improved matmul performance for LLM inference workloads on Intel GPUs based on Xe2/Xe3 architectures.
Improved f32 SDPA performance for small head sizes.

AArch64 Processors

Improved performance of bf16 matmul.
Improved performance of bf16/int8 convolutions.
Improved matmul performance for cases when one of the tensors has a trivial dimension.
Improved performance of s8/u8 eltwise post-ops on Arm(R) Neoverse(TM) V1 processors.
Improved f16 and bf16 eltwise performance with abs, relu, square, sqrt, clip, and clip_v2 algorithms.
Improved eltwise exp algorithm performance on Arm(R) Neoverse(TM) N1 processors.
Improved reorder primitive performance.

RISC-V Processors

Improved f32 matmul, inner product, convolution, softmax, batch normalization, layer normalization, and group normalization primitives performance.
Improved eltwise and binary primitives performance.
Improved f32 and fp16 pooling primitive performance.
Improved fp32 to u8 reorder primitive performance.

Functionality

Functional API

Introduced destination tensor dynamic quantization in matmul primitive following the Open Compute Microscaling (MX) formats specification. See the MXFP8 matmul tutorial for a quick introduction to MX capabilities in oneDNN.
Introduced support for NVFP4 quantization scheme. The changes include support for fp8_e4m3 grouped scales and dynamic quantization support for destination tensor with NVFP4-specific formula for scales computation.
Introduced support for dropout as a primitive attribute for matmul, softmax and eltwise primitives.
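
To make the idea of MX-style dynamic quantization more concrete, the sketch below derives one power-of-two scale per block of 32 values, following the conversion recipe suggested in the OCP Microscaling formats specification for an FP8 E4M3 element type. It is a plain Python illustration only: the function names are hypothetical, the handling of all-zero blocks is an assumption, and the computation performed inside oneDNN may differ in detail.

```python
import math

BLOCK = 32     # MX block size: 32 elements share one scale
E4M3_EMAX = 8  # exponent of the largest E4M3 normal value (448 = 1.75 * 2**8)


def mx_shared_exponent(block, emax=E4M3_EMAX):
    """Shared exponent for one block: floor(log2(amax)) - emax."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0  # assumption: use a neutral scale of 1.0 for all-zero blocks
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1
    _, e = math.frexp(amax)
    return (e - 1) - emax


def mx_scales(values, block=BLOCK):
    """One power-of-two scale per consecutive block of `block` values."""
    return [2.0 ** mx_shared_exponent(values[i:i + block])
            for i in range(0, len(values), block)]


# Two blocks: the first peaks at 448.0, the second at 1.0.
values = [448.0] + [1.0] * 31 + [1.0] + [0.25] * 31
scales = mx_scales(values)
```

Each element would then be divided by its block scale and converted to the element type with saturation; only the scale computation is shown here.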

Graph API

Introduced support for RMS Normalization operation.
Introduced support for output gradient of attention mask for SDPA and GQA training.

Intel Graphics

Introduced support for convolution with u8 weights.
Introduced support for 2D grouped scales in fp8 and dual zero points in matmul.
Extended support for 5D and 6D tensors in matmul with post-ops.

Intel 64/AMD64 Processors

Introduced support for different data types of source and destination in pooling forward propagation.

AArch64 Processors

Added limited support for the BRGEMM Microkernel API.
Added limited support for Windows on Arm builds with MSVC.

Usability

Common

Extended quantization attributes documentation to cover all quantization schemes supported by the library.
Added matmul fp8 quantization example demonstrating use of matmul primitive with fp8 source, destination, and weights.
Enabled ONEDNN_ENABLE_GRAPH_DUMP knob by default.

Intel 64/AMD64 Processors

Extended oneDNN threadpool runtime with an option to support asynchronous execution and updated all CPU implementations accordingly. This extension makes oneDNN compatible with OpenXLA "thunk" runtime.
Introduced ONEDNN_SAFE_RBP build knob that instructs x64 implementations to preserve value of rbp register for tools that rely on stack unwinding. This option may have visible performance impact on some workloads.

AArch64 Processors

Fixed a potential overflow on AArch64 builds with Arm Compute Library.
Significantly reduced memory consumption of convolution primitive with large spatial filters during primitive creation.

Intel Graphics

Removed build time dependency on OpenCL runtime in SYCL build configuration.

Validation

Deprecated Functionality

The BLAS-like API, including the dnnl::sgemm, dnnl::gemm_u8s8s32, and dnnl::gemm_s8s8s32 functions, is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.

Thanks to our Contributors

This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24, Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, czekun @ZackyLake, Deeksha Kasture @kasturedeeksha, Fadi Arafeh @fadara01, Gassan Salama @gassan-arm, Henry Gardiner @henry-gar, @jstachowintel, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw, Murray Steele @murste01, Narendra Bagria @narenbagria, Joseph Kuo @PershingSquare, @pmanczak, @vishwascm, Yejing Lai @Yejing-Lai, 夏卓昭 @xiazhuozhao



23 Jan 17:26

vgvozdeva

v3.11-rc

1936384

v3.11-rc Pre-release


Performance Optimizations

Intel 64/AMD64 Processors

Improved fp32 matmul performance with fp4 compressed weights.
Improved fp32 matmul performance for cases when one of the tensors has a trivial dimension on processors with Intel AVX-512 instruction set support.

Intel Graphics

Improved fp16/bf16 matmul performance for large tensor cases on Intel Arc graphics for Intel Core Ultra processor series 3 (formerly Panther Lake).
Improved matmul performance for cases with 4-byte alignment on Intel GPUs based on Xe2 architecture.
Improved performance of fp16/bf16 matmul with mxfp4 weights.
Improved convolution performance with host-side scalar scales and zero points.

AArch64 Processors

Improved performance of s8/u8 eltwise post-ops on Arm(R) Neoverse(TM) V1 processors.
Improved f16 and bf16 eltwise performance for abs, relu, square, sqrt, clip, and clip_v2.
Improved exp eltwise performance on Arm(R) Neoverse(TM) N1 processors.
Improved reorder primitive performance.
Added matmul optimizations for GEMVs.
Improved performance of bf16 matmul.
Improved performance of bf16/int8 convolutions.
Convolutions with large spatial filters now consume much less memory during primitive setup.

RISC-V Processors

Improved eltwise and binary primitives performance.
Improved f32 GEMM performance.
Improved f32 matmul, softmax, convolution and inner product primitives performance.
Improved f32 batch, group and layer normalization primitives performance.
Improved f32 and fp16 pooling primitive performance.
Improved fp32 to u8 reorder primitive performance.

Functionality

Functional API

Introduced destination tensor dynamic quantization in matmul primitive following the Open Compute Microscaling (MX) formats specification. See the MXFP8 matmul tutorial for a quick introduction to MX capabilities in oneDNN.
Introduced support for NVFP4 quantization scheme. The changes include support for fp8_e4m3 grouped scales and dynamic quantization support for destination tensor with NVFP4-specific formula for scales computation.
Introduced support for dropout as a primitive attribute for matmul, softmax and eltwise primitives.

Graph API

Introduced support for RMS Normalization operation.
Introduced support for output gradient of attention mask for SDPA and GQA training.

Intel Graphics

Introduced support for convolution with u8 weights.
Introduced support for 2D grouped scales in fp8 matmul.

Intel 64/AMD64 Processors

Introduced support for different data types of source and destination in pooling forward propagation.

AArch64 Processors

Added limited support for the BRGEMM Microkernel API.
Added limited support for Windows on Arm builds with MSVC.

Usability

Extended quantization attributes documentation to cover all quantization schemes supported by the library.
Added matmul fp8 quantization example demonstrating use of matmul primitive with fp8 source, destination, and weights.
Extended oneDNN threadpool runtime with an option to support asynchronous execution and updated all CPU implementations accordingly. This extension makes oneDNN compatible with OpenXLA "thunk" runtime.
Extended information about primitive execution available in VTune(TM) Profiler with the same level of detail as reported by oneDNN verbose mode. This feature requires VTune Profiler 2025.7 or later.
Introduced ONEDNN_SAFE_RBP build knob that instructs x64 implementations to preserve value of rbp register for tools that rely on stack unwinding. This option may have visible performance impact on some workloads.
Removed build time dependency on OpenCL runtime in SYCL build configuration.
ONEDNN_ENABLE_GRAPH_DUMP build knob is enabled by default.
Fixed a potential overflow on AArch64 builds with Arm Compute Library.

Deprecated Functionality

The BLAS-like API, including the dnnl::sgemm, dnnl::gemm_u8s8s32, and dnnl::gemm_s8s8s32 functions, is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.

Thanks to our Contributors

Contributors

keanucz, Anndrey24, and 17 other contributors


02 Dec 17:12

vgvozdeva

v3.10.2

f1d4719

v3.10.2

This is a patch release containing the following changes to v3.10.1:

Fixed a memory leak in Graph API related to host scalars use (0441245)
Fixed f16 matmul performance regression with int4 weights on Intel Arc graphics for Intel Core Ultra processors (Series 3) (789711c, a160247)
Fixed bf16 matmul performance regression on Intel Xeon processors with Intel AMX instruction set support (c29ec26)
Changed register allocation in BRGEMM kernel to avoid register conflicts and improve code safety (95d651b)
Fixed a crash related to incorrect caching of int8 convolution primitive on Intel GPUs (28ccca4, 0bc8060)
Fixed a bug preventing correct detection of Intel AVX10.2 instruction set on Intel Xeon processors (568171c)


19 Nov 00:07

vpirogov

v3.10.1

abbdd85

v3.10.1

This is a patch release containing the following changes to v3.10:

Fixed an issue with reorder primitive returning unimplemented for cases when only one scale mask is defined on AArch64 processors (be92457)
Fixed sporadic correctness issue in fp32 matmul on Intel GPUs based on Xe2 architecture (b4a761c)
Fixed correctness issue in fp16/bf16 matmul on Intel GPUs based on Xe3 architecture (48c114b)
Fixed performance regression in bf16 convolution weight gradient on Intel Arc Graphics B-series (3b6665b)
Improved convolution performance on AArch64 processors with SVE128 support (808227d)
Fixed regression in matmul primitive creation time on Intel GPUs (599ecb5)
Fixed potential overflow for matmul, convolution and inner product primitives with Arm Compute Library (be12d8c)
Fixed convolution performance regression on Intel Arc Graphics B-series (7e27159)


Releases: uxlfoundation/oneDNN
