During our recent SNIA Data, Storage & Networking (DSN) webinar, “Post Training and Fine Tuning for LLMs in the Enterprise,” we answered several audience questions spanning model selection, training pipelines, deployment constraints, and systems‑level considerations. This post consolidates and expands those answers into a coherent guide, with the goal of clarifying how post‑training techniques fit together in practice. If you missed the live webinar, it’s now available here on demand, along with the presentation slides.

Whether you’re experimenting on a laptop or architecting large‑scale inference systems, these answers should help ground the discussion.

Q: Choosing a model: Where to start?

A: A common first challenge is selecting the right base model. The best approach is to begin by clearly defining your use case. For example, do you want a chatbot that answers customer questions in a specific way, a pipeline that periodically summarizes specific documents and shares them with you as a PowerPoint presentation, code generation, or something else? Once you have clearly defined the outcome you want, think about your deployment constraints, such as latency targets, cost ceilings, and whether inference will run on-device or in the cloud.

With the use case and constraints defined, you can start narrowing down candidate models. Model hubs like Hugging Face provide rich metadata, community benchmarks, and fine‑tuned variants that make initial filtering easier. From there, teams typically shortlist two or three candidate models and run lightweight evaluations on representative data. In practice, empirical testing on your workload almost always matters more than leaderboard rankings alone.
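
As an illustration, a lightweight comparison might look like the sketch below. It assumes the Hugging Face transformers library; the two shortlisted model names and the prompts are placeholders, so substitute the candidates and workload data you actually care about.

from transformers import pipeline

# Hypothetical shortlist -- substitute the models you are actually considering.
candidates = ["Qwen/Qwen2.5-0.5B-Instruct", "HuggingFaceTB/SmolLM2-360M-Instruct"]

# A handful of prompts that represent your real workload.
prompts = [
    "Summarize our Q3 support-ticket trends in three bullet points.",
    "Draft a polite reply to a customer asking about a delayed shipment.",
]

for name in candidates:
    generator = pipeline("text-generation", model=name)
    print(f"=== {name} ===")
    for prompt in prompts:
        result = generator(prompt, max_new_tokens=128, do_sample=False)
        print(result[0]["generated_text"])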

If you’ve just read this and still have no idea how to get started, use an LLM to help you. Open Claude.AI or ChatGPT, describe what you want to do, provide your requirements and constraints, and see what the AI suggests. This is a great way to get started when you don’t even know what you don’t know.

Q: Mid‑Training vs. Fine‑Tuning: What’s the difference?

A: These terms are often conflated, but they refer to different concepts. To make things as clear as possible, let’s take a step back and start with Pretraining.

Pretraining is the large-scale “base education” phase where a model learns general language patterns and broad world knowledge by predicting the next token across massive, mostly unlabeled datasets (e.g., web text, books, code).  

Mid-training (often called “continued pretraining”) extends that same next-token objective, but on more targeted corpora—typically domain-specific or enterprise-specific text—to shift the model’s knowledge, vocabulary, and style toward a particular area (finance, healthcare, internal docs, codebases, etc.).

Fine-tuning (post-training) focuses less on adding knowledge and more on changing behavior: making the model follow instructions, adopt a desired format/tone, and improve helpfulness and alignment using supervised examples (SFT) and/or preference-based methods (e.g., DPO/GRPO). 

A useful analogy: pretraining is learning a language by reading broadly; mid-training is reading a specialized library to become fluent in a domain; fine-tuning is coaching the model on how to respond.

Q: Do you need both SFT and LoRA?

A: SFT and LoRA work on two different dimensions. They aren’t two sequential steps.

SFT (Supervised Fine‑Tuning) describes what you train on (instruction → response examples) to teach instruction-following behavior.

LoRA (Low‑Rank Adaptation) describes how you apply updates (parameter‑efficient adapters) so you don’t have to update all model weights.

In practice: you can do SFT with full fine‑tuning (update all weights) or SFT with LoRA/QLoRA (update small adapter weights). LoRA is most attractive when GPU memory is limited, you want faster iteration, or you want to maintain multiple “persona/domain” adapters without cloning the full model.
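
To make the distinction concrete, here is a minimal sketch of SFT applied through LoRA adapters, assuming the Hugging Face transformers and peft libraries. The base model name, target modules, and hyperparameters are illustrative only, not a recommendation.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "HuggingFaceTB/SmolLM2-360M"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA: the "how" -- inject small low-rank adapter matrices into the attention
# projections instead of updating every base-model weight.
lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # layers that receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all weights

# SFT: the "what" -- an instruction/response pair trained with the ordinary
# next-token (causal LM) objective.
example = "### Instruction:\nSummarize this ticket.\n### Response:\nCustomer reports a login failure..."
batch = tokenizer(example, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()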


Q: How much data is needed for SFT?

A: There is no hard minimum, but practical thresholds do exist. For narrow or well‑defined tasks, a few thousand high‑quality instruction–response pairs can already produce meaningful improvements. Broader instruction‑following behavior typically requires tens of thousands of examples or more.

Importantly, data quality dominates quantity. Curated, diverse, and well‑structured examples consistently outperform much larger but noisy datasets.

Q: Where does distillation fit into the pipeline?

A: Distillation (knowledge distillation) is a compression and transfer technique where a larger, higher‑quality teacher model generates outputs (and sometimes intermediate signals) that a smaller student model learns to imitate. In LLMs, distillation is often response/sequence distillation (train on teacher‑generated answers) and increasingly reasoning‑trace distillation (train on the teacher’s step‑by‑step trajectories). 

Distillation usually happens after you have a strong teacher (often after SFT and preference optimization) when you want a smaller model that’s cheaper, faster, or easier to deploy. The student is commonly trained with an SFT‑style objective on teacher‑generated data, sometimes combined with classic logit matching or intermediate‑layer matching. 

Example: the distilled models DeepSeek released alongside R1 transfer reasoning behavior from the larger DeepSeek‑R1 teacher into smaller Qwen/Llama‑based students using large volumes of teacher‑generated samples. We didn’t fully explore this in the live webinar, but may in a future session if there’s interest.
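
For intuition, the toy sketch below shows the classic logit‑matching flavor of distillation in PyTorch. It is not DeepSeek’s actual recipe; the teacher and student model names are stand‑ins, and the sketch assumes both models share a tokenizer and vocabulary.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")  # stand-in teacher
student = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")  # stand-in student
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

batch = tokenizer("Explain why the sky is blue.", return_tensors="pt")
temperature = 2.0  # softening both distributions gives the student a richer signal

with torch.no_grad():
    teacher_logits = teacher(**batch).logits
student_logits = student(**batch).logits

# KL divergence between the softened teacher and student token distributions.
loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature**2
loss.backward()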

Q: How can the same question receive different answers for a CEO versus an engineer?

A: This is usually handled at inference time, not during training. Prompting techniques can inject metadata about the user’s role, goals, or constraints through system messages or structured templates.

More advanced systems combine prompting with routing or policy layers so that different user personas map to different response styles, levels of technical depth, or even different models. This approach preserves a single trained model while allowing contextual customization.
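
A minimal sketch of the prompting side, with purely illustrative role names and templates: the same user question is wrapped in a different system message depending on the caller’s persona before being sent to whichever chat model you deploy.

PERSONA_SYSTEM_PROMPTS = {
    "ceo": "You are advising an executive. Answer in three concise, business-impact-oriented bullet points and avoid implementation detail.",
    "engineer": "You are advising a senior engineer. Include technical detail, trade-offs, and concrete configuration or code where relevant.",
}

def build_messages(user_role: str, question: str) -> list[dict]:
    """Wrap the same question in a persona-specific system message."""
    system = PERSONA_SYSTEM_PROMPTS.get(user_role, "You are a helpful assistant.")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# The same question, framed differently for two audiences:
print(build_messages("ceo", "Why is our inference bill growing?"))
print(build_messages("engineer", "Why is our inference bill growing?"))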

Q: Can you run the demo you showed on an 8GB MacBook?

A: Yes, but within reasonable limits. The smaller 360M‑parameter example can be run and even fine‑tuned on CPU‑only systems, but it will take a long time (days). Larger models, such as those around 1.7B parameters, may not be able to run on an 8GB MacBook. While this won’t replace GPU‑backed infrastructure for large‑scale workloads, it makes post‑training experiments accessible for learning, prototyping, and debugging.
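
For reference, loading and prompting a small model on CPU can be as simple as the sketch below; the model name is an assumption and may differ from the one used in the demo.

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM2-360M-Instruct"   # assumed small model; fits comfortably in 8GB of RAM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("What is post-training?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))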

Q: Has ONNX gained popularity for inference?

A: Yes. Over the past few years, ONNX has solidified its position as a de facto standard for inference deployment across ML frameworks. Its framework-agnostic design allows models trained in PyTorch, TensorFlow, or other ecosystems to be exported into a common format that can be deployed consistently across environments.

One of the major drivers of ONNX adoption is its strong ecosystem of optimized runtimes and accelerators. ONNX Runtime provides high-performance inference on CPUs, GPUs, and specialized hardware, while maintaining a consistent execution model. At the same time, ONNX integrates tightly with backend-specific accelerators such as NVIDIA TensorRT, enabling graph-level optimizations, kernel fusion, and hardware-aware execution without locking teams into a single training framework.
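
As a concrete illustration, the sketch below exports a tiny PyTorch model to ONNX and runs it with ONNX Runtime on CPU; the toy model and file name are placeholders for your own trained network and artifact.

import torch
import onnxruntime as ort

# A tiny stand-in model; in practice this would be your trained network.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
model.eval()
dummy_input = torch.randn(1, 16)

# Export once from the training framework...
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["logits"])

# ...then serve anywhere ONNX Runtime runs (CPU, GPU, edge), independent of PyTorch.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(None, {"input": dummy_input.numpy()})[0]
print(logits.shape)  # (1, 4)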

This portability and flexibility make ONNX especially attractive for production environments. Teams can decouple model development from deployment, simplify CI/CD pipelines, and more easily target heterogeneous hardware (e.g., from cloud GPUs to edge devices) without retraining or major refactoring. As a result, ONNX is increasingly used not just as an interchange format, but as a core inference artifact in real-world systems.

In short, ONNX has moved beyond being a convenience layer and is now a foundational piece of many modern inference stacks, particularly for organizations that prioritize performance, hardware flexibility, and long-term maintainability.

Q: What is KV cache and why does it matter?

A: KV cache stores the key‑value tensors generated during attention, so they don’t need to be recomputed for each new token. This drastically reduces compute for long prompts and multi‑turn conversations. As context lengths scale, KV cache increasingly shifts the bottleneck from compute to memory capacity and bandwidth. This is why KV cache compression, offloading, and tiered memory placement have become active areas of systems research and engineering.
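
To see what the cache buys you, the sketch below uses a Hugging Face causal LM (the small model name is illustrative): the prompt’s keys and values are computed once, then only the newest token is fed on each subsequent step.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM2-360M"   # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

prompt = tokenizer("The KV cache matters because", return_tensors="pt")

with torch.no_grad():
    # First pass: process the whole prompt and keep its attention keys/values.
    out = model(**prompt, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1:].argmax(dim=-1)

    # Subsequent passes: feed only the newest token plus the cached keys/values,
    # so the prompt is never re-processed.
    out = model(input_ids=next_token, past_key_values=past, use_cache=True)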

For more information about KV cache, check out Inference Illustrated. It’s an interactive learning app that covers the main concepts from the paper “Attention Is All You Need” as well as the basics of KV cache and implementation considerations.

We discussed KV cache in our webinar, “AI Stack: AI Model Inferencing and Deployment Options.”

Q: Thoughts on multi‑tiered memory (DDR + SSD)?

A: Multi‑tier memory for inference (e.g., GPU HBM/DRAM as a hot tier, CPU DRAM as warm, SSD/NVMe as cold) is increasingly practical because KV cache growth pushes serving bottlenecks toward capacity and bandwidth. Frameworks are starting to treat KV cache like an OS-managed resource: allocate in blocks, keep hot blocks close to the GPU, and spill colder blocks to cheaper tiers.

Concrete examples include NVIDIA Dynamo’s KVBM, which manages KV blocks across GPU → pinned CPU → disk tiers, and production-oriented stacks like TensorRT‑LLM, which supports KV reuse with eviction/offload policies. In open ecosystems, vLLM’s PagedAttention reduces fragmentation by paging KV blocks, and LMCache extends KV cache beyond HBM into CPU RAM and local SSD (with published GKE benchmarks). Research systems like FlexGen and KVSwap go further by explicitly using CPU+disk as part of the inference memory hierarchy.
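
As a purely hypothetical toy (not how any of the frameworks above actually implement it), the sketch below captures the core idea: keep recently used KV blocks in a small hot tier and spill the least recently used ones to a larger, slower tier.

from collections import OrderedDict

class TieredKVStore:
    """Toy two-tier KV block store with LRU spill from the hot tier."""
    def __init__(self, hot_capacity: int):
        self.hot = OrderedDict()   # stands in for GPU HBM
        self.cold = {}             # stands in for CPU DRAM or SSD
        self.hot_capacity = hot_capacity

    def put(self, block_id, kv_block):
        self.hot[block_id] = kv_block
        self.hot.move_to_end(block_id)
        if len(self.hot) > self.hot_capacity:
            evicted_id, evicted = self.hot.popitem(last=False)  # least recently used
            self.cold[evicted_id] = evicted                     # spill to the slower tier

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        kv_block = self.cold.pop(block_id)   # promote back to the hot tier on reuse
        self.put(block_id, kv_block)
        return kv_block

store = TieredKVStore(hot_capacity=2)
for i in range(4):
    store.put(i, f"kv-block-{i}")
print(list(store.hot), list(store.cold))  # [2, 3] hot, [0, 1] spilled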

Post-Training Explorer Demo 

To explore these concepts visually and interactively, you can try the Post‑Training Explorer we showed during the live demo.  It's designed to help build intuition around how different post-training stages transform model behavior and includes a guided tour. Download it here:  https://provandal.github.io/post-training-explorer/

Closing Thoughts

Post‑training is where foundation models become products: aligned, specialized, and deployable under real‑world constraints. From SFT and LoRA to KV cache and memory tiering, the space sits at the intersection of machine learning and systems engineering.

We hope these clarifications help you navigate that space more confidently—and we look forward to continuing the discussion in upcoming posts and sessions.

Stay Connected!

This webinar is part of our “AI Stack” webinar series. We encourage you to register for our upcoming sessions and view our on-demand webinars in this series.

Up Next:

On-Demand:

We hope you will continue to participate in our educational webinars. Follow us for upcoming dates and topics on LinkedIn and @SNIA.