Benchmarking Gaudi 2 with an 866M-parameter model
Intel’s Gaudi 2 silicon has outperformed Nvidia’s A100 80 GB and H100 in a fine-tuning performance benchmark for the Vision-Language AI model BridgeTower. The benchmark results show that Gaudi 2 performed 2.5x better than A100 80 GB and 1.4x better than H100.
Vision-Language AI models process and associate information across textual and visual modalities. The fast-growing image-generation market, led by Midjourney, Stability AI's Stable Diffusion, and Ideogram, is closely tied to this class of VL models.
Practitioners fine-tune models such as BridgeTower to compare how processors and other hardware handle real workloads. Habana attributes the significant speedups to a hardware-accelerated data-loading system, which addresses a common bottleneck in AI model fine-tuning, especially for VL models.
“Fine-tuning AI models can often be a slow and frustrating process,” Habana says. “Especially with vision models, it can be very frustrating waiting for the GPU to finish loading the data. This is often the limiting factor when fine-tuning the best vision models available today.”
The bottleneck arises from CPU-intensive operations like image decoding and augmentation, which can leave the AI accelerator idle while it waits for the CPU to finish processing data and send it over.
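To illustrate the kind of work involved, here is a minimal, hypothetical PyTorch sketch (not Habana's benchmark code) of an image-text dataset whose per-sample JPEG decoding and augmentation run entirely on the CPU. The dataset class, batch size, and image resolution are placeholder assumptions; the point is that with too few worker processes, the accelerator stalls on exactly these steps.

```python
# Sketch of CPU-side data-loading work that can starve an accelerator:
# each sample is decoded and augmented in Python worker processes before
# it ever reaches the device.
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

class CaptionedImageDataset(Dataset):
    """Hypothetical image-text dataset; paths and captions are placeholders."""
    def __init__(self, samples):
        self.samples = samples  # list of (image_path, caption) tuples
        self.augment = transforms.Compose([
            transforms.RandomResizedCrop(288),   # CPU-bound resize/crop
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, caption = self.samples[idx]
        image = Image.open(path).convert("RGB")  # JPEG decoding on the CPU
        return self.augment(image), caption

# With few workers, the accelerator idles while the CPU decodes and augments.
loader = DataLoader(CaptionedImageDataset(samples=[]), batch_size=48,
                    num_workers=2, pin_memory=True)
```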
Gaudi 2’s integrated hardware acceleration reduces the CPU’s load, allowing for improved performance. Habana benchmarked Gaudi 2 by fine-tuning a pre-trained BridgeTower checkpoint with 866M parameters.
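As a rough illustration, a large pre-trained BridgeTower checkpoint can be loaded for fine-tuning through Hugging Face Transformers. The checkpoint name and contrastive head below are assumptions, since the article only states that an 866M-parameter pre-trained model was fine-tuned.

```python
# Hedged sketch: loading a large pre-trained BridgeTower checkpoint for
# fine-tuning. The checkpoint name is an assumption, not confirmed by the
# benchmark write-up.
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

checkpoint = "BridgeTower/bridgetower-large-itm-mlm-itc"  # assumed checkpoint
processor = BridgeTowerProcessor.from_pretrained(checkpoint)
model = BridgeTowerForContrastiveLearning.from_pretrained(checkpoint)

# Report the parameter count of the loaded model.
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")
```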
The benchmark ran the same workload on eight devices each of the A100 80 GB, the H100, and Gaudi 2. In the best-case scenario, Gaudi 2 performed 1.79x better than the H100 and 2.23x better than the A100.
When two additional processes were used for data loading, Gaudi 2 still outperformed H100 by 1.3x and A100 by 2.23x. Increasing the number of data-loading processes beyond two resulted in diminishing returns.
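For context, when fine-tuning with Hugging Face's Trainer, the number of data-loading worker processes is set through the dataloader_num_workers training argument. The sketch below mirrors the two-worker configuration described above; the batch size and output path are placeholder assumptions rather than Habana's exact settings.

```python
# Sketch of requesting extra data-loading processes per device with the
# Hugging Face Trainer. Two workers matched the article's observation that
# going beyond two gave diminishing returns.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bridgetower-finetune",   # hypothetical output path
    per_device_train_batch_size=48,      # assumed batch size
    dataloader_num_workers=2,            # two extra data-loading processes
    dataloader_pin_memory=True,          # pin host memory for faster transfers
)
```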
“We were able to optimize the data loading on Gaudi 2 to be twice as fast as it was on H100 and 1.5x faster than on A100,” Habana explains. “This let us run bigger workloads and improve performance.”
Habana's hardware-accelerated data loading adds roughly another 10% of performance on top of its best software-dataloader result. The benchmark does not state whether comparable data-loading acceleration was applied to competing hardware such as AMD's Radeon Instinct accelerators, though it seems likely that similar optimizations exist there as well.
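On the Gaudi side, Habana maintains the Optimum Habana library, which mirrors the Trainer arguments shown above but targets HPU devices. The sketch below is an assumption about how such a run might be configured, not the exact benchmark setup, and the config name and batch size are placeholders; the hardware-accelerated (media-pipe) data loader itself is enabled in Habana's example scripts rather than through these arguments.

```python
# Hedged sketch of Gaudi-targeted training arguments via Optimum Habana.
from optimum.habana import GaudiTrainingArguments

gaudi_args = GaudiTrainingArguments(
    output_dir="bridgetower-finetune-gaudi",  # hypothetical output path
    use_habana=True,                          # run on Gaudi/HPU devices
    use_lazy_mode=True,                       # Habana's lazy graph mode
    gaudi_config_name="Habana/clip",          # assumed Gaudi config name
    per_device_train_batch_size=48,           # assumed batch size
    dataloader_num_workers=1,                 # CPU data-loading workers
)
```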
The competition is fierce, and underdogs have historically caught up to and even surpassed the favorites in similar races. The future of AI acceleration remains uncertain, with companies like Intel aiming to dethrone Nvidia.