Modern NVMe devices can sustain millions of IOPS through the efficient, parallel processing of many queues. Yet the software storage stack still typically processes each I/O request individually: allocating resources, acquiring locks, traversing driver layers, and completing requests one at a time. As device speeds increase, this per-request software overhead consumes a growing share of the CPU cycles spent on each I/O. We present a new batched I/O mechanism for the Windows storage stack that allows an entire set of I/O operations to be described, dispatched, and completed as a single unit. By amortizing fixed costs across many operations and giving each layer of the stack the opportunity to process work in bulk, batched I/O significantly reduces CPU overhead per operation and improves throughput at high queue depths. This talk will cover the motivation behind the design, the challenges of propagating batched work through a multi-layered driver stack, and early performance results demonstrating the impact on real workloads.