MLCommons, the consortium behind the MLPerf family of machine learning benchmarks, is announcing this morning that the organization will be developing a new desktop AI benchmarking suite under the MLPerf banner. Helmed by the body’s newly-formed MLPerf Client working group, the task force will be developing a client AI benchmark suit aimed at traditional desktop PCs, workstations, and laptops. According to the consortium, the first iteration of the MLPerf Client benchmark suite will be based on Meta’s Llama 2 LLM, with an initial focus on assembling a benchmark suite for Windows.

The de facto industry standard benchmark for AI inference and training on servers and HPC systems, MLCommons has slowly been extending the MLPerf family of benchmarks to additional devices over the past several years. This has included assembling benchmarks for mobile devices, and even low-power edge devices. Now, the consortium is setting about covering the “missing middle” of their family of benchmarks with an MLPerf suite designed for PCs and workstations. And while this is far from the group’s first benchmark, it is in some respects their most ambitious effort to date.

The aim of the new MLPerf Client working group will be to develop a benchmark suitable for client PCs – which is to say, a benchmark that is not only sized appropriately for the devices, but is a real-world client AI workload in order to provide useful and meaningful results. Given the cooperative, consensus-based nature of the consortium’s development structure, today’s announcement comes fairly early in the process, as the group is just now getting started on developing the MLPerf Client benchmark. As a result, there are still a number of technical details about the final benchmark suite that need to be hammered out over the coming months, but to kick things off the group has already narrowed down some of the technical aspects of their upcoming benchmark suite.

Perhaps most critically, the working group has already settled on basing the initial version of the MLPerf Client benchmark around Meta's Llama 2 large language model, which is already used in other versions of the MLPerf suite. Specifically, the group is eyeing 7 billion parameter version of that model (Llama-2-7B), as that’s believed to be the most appropriate size and complexity for client PCs (at INT8 precision, the 7B model would require roughly 7GB of RAM). Past that however, the group still needs to determine the specifics of the benchmark, most importantly the tasks which the LLM will be benchmarked executing on.

With the aim of getting it on PCs of all shapes and sizes, from laptops to workstations, the MLPerf Client working group is going straight for mass market adoption by targeting Windows first – a far cry from the *nix-focused benchmarks they’re best known for. To be sure, the group does plan to bring MLPerf Client to additional platforms over time, but their first target is to hit the bulk of the PC market where Windows reigns supreme.

In fact, the focus on client computing is arguably the most ambitious part of the project for a group that already has ample experience with machine learning workloads. Thus far, the other versions of MLPerf have been aimed at device manufacturers, data scientists, and the like – which is to say they’ve been barebones benchmarks. Even the mobile very of the MLPerf benchmark isn’t very accessible to end-users, as it’s distributed as a source-code release intended to be compiled on the target system. The MLPerf Client benchmark for PCs, on the other hand, will be a true client benchmark, distributed as a compiled application with a user-friendly front-end. Which means the MLPerf Client working group is tasked with not only figuring out what the most representative ML workloads will be for a client, but then how to tie that together into a useful graphical benchmark.

Meanwhile, although many of the finer technical points of the MLPerf Client benchmark suite remain to be sorted out, talking to MLCommons representatives, it sounds like the group has a clear direction in mind on the APIs and runtimes that they want the benchmark to run on: all of them. With Windows offering its own machine learning APIs (WinML and DirectML), and then most hardware vendors offering their own optimized platforms on top of that (CUDA, OpenVino, etc), there are numerous possible execution backends for MLPerf Client to target. And, keeping in line with the laissez faire nature of the other MLPerf benchmarks, the expectation is that MLPerf Client will support a full gamut of common and vendor-proprietary backends.

In practice, then, this would be very similar to how other desktop client AI benchmarks work today, such as UL’s Procyon AI benchmark suite, which allows for plugging in to multiple execution backends. The use of different backends does take away a bit from true apples-to-apples testing (though it would always be possible to force fallback to a common API like DirectML), but it gives the hardware vendors room to optimize the execution of the model to their hardware. MLPerf takes the same approach to their other benchmarks right now, essentially giving hardware vendors free reign to come up with new optimizations – including reduced precision and quantization – so long as they don’t lose inference accuracy and fail meet the benchmark’s overall accuracy requirements.

Even the type of hardware used to execute the benchmark is open to change: while the benchmark is clearly aimed at leveraging the new field of NPUs, vendors are also free to run it on GPUs and CPUs as they see fit. So MLPerf Client will not exclusively be an NPU or GPU benchmark.

Otherwise, keeping everyone on equal footing, the working group itself is a who’s who of hardware and software vendors. The list includes not only Intel, AMD, and NVIDIA, but Arm, Qualcomm, Microsoft, Dell, and others. So there is buy-in from all of the major industry players (at least in the Windows space), which has been critical for driving the acceptance of MLPerf for servers, and will similarly be needed to drive acceptance of MLPerf client.

The MLPerf Client benchmark itself is still quite some time from release, but once it’s out, it will be joining the current front-runners of UL’s Procyon AI benchmark and Primate Labs’ Geekbench ML, both of which already offer Windows client AI benchmarks. And while benchmark development is not necessarily a competitive field, MLCommons is hoping that their open, collaborative approach will be something that sets them apart from existing benchmarks. The nature of the consortium means that every member gets a say (and a vote) on matters, which isn’t the case for proprietary benchmarks. But it also means the group needs a complete consensus in order to move forward.

Ultimately, the initial version of the MLPerf Client benchmark is being devised as more of a beginning than an end product in and of itself. Besides expanding the benchmark to additional platforms beyond Windows, the working group will also eventually be looking at additional workloads to add to the suite – and, presumably, adding more models beyond Llama 2. So while the group has a good deal of work ahead of them just to get the initial benchmark out, the plan is for MLPerf Client to be long-lived, long-supported benchmark as the other MLPerf benchmarks are today.

Source: MLCommons

POST A COMMENT

2 Comments

View All Comments

  • GeoffreyA - Wednesday, January 24, 2024 - link

    As more models are added in, it will be interesting to see performance against power consumption. May it whip the Llama's behind. Reply
  • PeachNCream - Thursday, January 25, 2024 - link

    "...covering the “missing middle” of their family of benchmarks with an MLPerf suite designed for PCs and workstations"

    It's mostly a demonstration of how much less significance PCs play in our daily lives to see yet another benchmark find its way to the platform as an absolute last gap. And it demonstrates the nonsense behind benchmarks in the first place which have, for decades, been nothing more than a way for man children to play a game of consumer electronics sausage-measuring-and-comparing with each other.
    Reply

Log in

Don't have an account? Sign up now