Are we really measuring progress, or just chasing bigger numbers? The breathless coverage of Large Language Model (LLM) benchmarks often feels like the latter. NVIDIA is touting performance gains on the new Blackwell architecture, and the numbers are impressive, but the real story isn’t the speed of the models themselves. It’s the increasingly urgent need of financial institutions to actually use this technology, and the very specific workloads that need is creating. The latest results from the Strategic Technology Analysis Center (STAC) benchmark, STAC-AI LANG6, illustrate this perfectly. The point isn’t how fast an LLM can theoretically process text; it’s how quickly it can dissect a 10-K filing and turn it into actionable intelligence.
For over 15 years, STAC has been the quiet authority on benchmarking financial workloads. Its latest creation, STAC-AI, focuses on the end-to-end retrieval-augmented generation (RAG) pipeline, with LANG6 specifically isolating LLM inference performance. This isn’t a generic chatbot test; it’s designed to mimic the tasks financial analysts actually perform: sifting through mountains of regulatory filings to identify risk, opportunity, and potential fraud. The benchmark uses the Llama 3.1 8B Instruct and Llama 3.1 70B Instruct models, fed data from the EDGAR4 and EDGAR5 datasets, which are essentially summaries and full texts of company 10-K filings from the past five years. EDGAR4 focuses on medium-length requests that summarize relationships between companies and economic factors, while EDGAR5 tackles long-context analysis of entire filings.
The STAC-AI LANG6 benchmark tests two key scenarios: batch mode (all requests submitted at once, measuring throughput) and interactive mode (requests arriving at random, measuring reaction time and words per second per user). This distinction is crucial. A high throughput number in batch mode is impressive, but it is useless if the system grinds to a halt when a trader needs a quick answer in a volatile market. The benchmark also rigorously checks output quality, ensuring the LLM isn’t just fast but accurate. What sets STAC-AI apart is its insistence on including chat templates and tokenization during inference, a realistic constraint that other benchmarks often gloss over and one that adds load to the CPU. This is a deliberate attempt to reflect real-world deployment challenges, where protecting system prompts is paramount.
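To make the difference between the two modes concrete, here is a minimal sketch of how the metrics named above could be computed from per-request timestamps. It is not STAC's harness: the `Completion` record, both function names, and the use of exponentially distributed gaps to model "random" arrivals are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Completion:
    """Hypothetical per-request record; field names are illustrative."""
    arrival_s: float       # when the request was submitted
    first_output_s: float  # when the first output reached the user
    done_s: float          # when the response finished
    words: int             # words in the generated response


def batch_metrics(completions):
    """Batch mode: every request is submitted at t=0, so throughput is
    total work divided by the time the last response finished."""
    wall = max(c.done_s for c in completions)
    return {
        "rps": len(completions) / wall,
        "wps": sum(c.words for c in completions) / wall,
    }


def interactive_metrics(completions):
    """Interactive mode: what each individual user experiences."""
    reaction = [c.first_output_s - c.arrival_s for c in completions]
    wps_user = [c.words / (c.done_s - c.first_output_s) for c in completions]
    return {
        "mean_reaction_s": sum(reaction) / len(reaction),
        "mean_wps_per_user": sum(wps_user) / len(wps_user),
    }


def random_arrivals(n, rate_per_s, seed=0):
    """Assumed arrival model: exponential gaps between requests."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate_per_s)
        arrivals.append(t)
    return arrivals


batch = [Completion(0.0, 0.5, 2.0, 100), Completion(0.0, 0.6, 4.0, 300)]
interactive = [Completion(1.0, 1.5, 3.5, 100), Completion(2.0, 3.0, 7.0, 200)]
print(batch_metrics(batch))
print(interactive_metrics(interactive))
```

The two functions deliberately share no output fields: a system can score well on aggregate `wps` while individual users see long reaction times, which is exactly the trade-off the interactive test is meant to expose.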
The headline result? NVIDIA Blackwell delivers significant speedups across the board. In batch mode, Blackwell achieved higher words per second (WPS) and requests per second (RPS) than the previous generation. Specifically, using the Llama 3.1 8B model with the EDGAR4 dataset, Blackwell achieved 8,237 WPS and 51.53 RPS, compared to 5,500 WPS and 32.9 RPS on the HPE ProLiant Compute DL384 Gen12 powered by the NVIDIA GH200 Grace Hopper Superchip, a gain of roughly 1.5x in WPS and 1.6x in RPS. A Nebius Cloud instance based on a single node of an NVIDIA GB200 NVL72 system showed even more dramatic improvements, though those results weren’t audited by STAC. However, the gains aren’t just about raw speed. Interactive mode testing revealed that the GB200 NVL72 maintains a better balance between throughput and user experience, offering lower latency and more consistent performance even under heavy load. This is where the rubber meets the road for financial applications: a trader needs answers now, not in a few seconds.
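The batch-mode comparison quoted above reduces to two ratios. The short calculation below uses only the four figures given in the text; the variable and function names are illustrative.

```python
# Batch-mode figures quoted in the text: Llama 3.1 8B on EDGAR4.
blackwell = {"wps": 8237.0, "rps": 51.53}
gh200 = {"wps": 5500.0, "rps": 32.9}


def speedup(new, old):
    """Ratio of new to old for every metric present in both results."""
    return {metric: new[metric] / old[metric] for metric in new if metric in old}


ratios = speedup(blackwell, gh200)
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.2f}x")
```

The ratios come out at about 1.50x for WPS and 1.57x for RPS. They apply to this one model and dataset pairing and should not be read as a general Blackwell-over-Hopper multiplier.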
This piece references the developer.nvidia.com report.
It’s tempting to focus solely on the Blackwell numbers, but the continued relevance of NVIDIA Hopper is a critical takeaway. Even three years after its release, Hopper remains a highly effective solution, delivering strong performance in both batch and interactive scenarios. This highlights a crucial point: the financial industry isn’t necessarily chasing the absolute latest and greatest hardware. Firms need reliable, performant solutions that fit within their existing infrastructure and budget. The Supermicro AS -5126GS-TNRT with two NVIDIA RTX PRO 6000 Blackwell GPUs offers a more accessible entry point to Blackwell’s capabilities, providing substantial aggregate GPU memory for larger models and concurrent jobs. The benchmark also emphasizes the importance of software optimization. Models were quantized using NVIDIA TensorRT Model Optimizer to FP8 (Hopper) and NVFP4 (Blackwell) formats, and run using the TensorRT LLM inference framework, demonstrating that efficient model execution is just as important as powerful hardware.
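A back-of-the-envelope calculation shows why the quantization format matters for fitting models into GPU memory. This is a rough sketch under stated assumptions, not a measurement from the benchmark: it uses nominal parameter counts (8 and 70 billion), counts weights only, and treats NVFP4 as exactly 4 bits per weight, ignoring the per-block scaling factors, the KV cache, and activations that a real deployment also has to hold.

```python
# Assumed nominal parameter counts; actual counts are slightly higher.
NOMINAL_PARAMS = {"Llama 3.1 8B": 8e9, "Llama 3.1 70B": 70e9}

# Assumed nominal bits per weight; scale-factor overhead is ignored.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "NVFP4": 4}


def weight_gb(params, bits):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


footprint = {
    model: {fmt: weight_gb(params, bits) for fmt, bits in BITS_PER_WEIGHT.items()}
    for model, params in NOMINAL_PARAMS.items()
}

for model, by_format in footprint.items():
    line = ", ".join(f"{fmt}: {gb:.0f} GB" for fmt, gb in by_format.items())
    print(f"{model} -> {line}")
```

Under these assumptions, the 70B model's weights shrink from about 140 GB at FP16 to 70 GB at FP8 and 35 GB at a 4-bit format, which frees memory for longer contexts or more concurrent requests.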
But here’s where the skepticism kicks in. These benchmarks, while rigorous, are still synthetic. They use carefully curated datasets and controlled conditions. The real world is messy. Financial data is often incomplete, inconsistent, and subject to manipulation. The success of LLMs in finance won’t be determined by benchmark scores, but by their ability to handle real-world complexity and deliver consistently accurate insights. Furthermore, the focus on inference performance obscures the significant challenges of data preparation, model training, and ongoing maintenance. A fast LLM is useless without high-quality data and a robust pipeline for keeping it up-to-date.
Looking ahead, the next 12-18 months will see a surge in attempts to replicate these STAC-AI results with custom datasets and tailored workloads. Financial institutions will be less interested in theoretical performance gains and more focused on demonstrating tangible ROI. Expect to see a proliferation of “benchmarking-as-a-service” offerings, promising to help firms optimize their LLM deployments for specific use cases. The crucial question won’t be how fast these models can run, but how much money they can make, or save, for their clients. And the firms that can answer that question convincingly will be the ones that truly win in the AI-powered financial revolution.






