Can smaller models outperform their LLM teachers with enough data?
Explores whether student models trained on expanded teacher-generated labels can exceed teacher performance in production ranking tasks, and what data scale makes this possible.
LLMs rank better than smaller models, but their inference latency is unaffordable for retail search. The standard distillation move is to train a smaller student model on the teacher's labels. Walmart's setup adds a twist: the teacher LLM is first trained as a classification model with soft targets, and the student is then trained on a much larger dataset built by having the teacher label previously unlabeled query-product pairs.
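A minimal sketch of what training against soft targets can look like, assuming the student is optimized with cross-entropy against the teacher's predicted class probabilities. The note does not state the actual loss used, and the three-class layout and the names `softmax` and `soft_target_cross_entropy` are illustrative assumptions.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_cross_entropy(student_logits: List[float],
                              teacher_probs: List[float]) -> float:
    """Cross-entropy between the teacher's soft label and the student's prediction."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0)

# Hypothetical teacher output for one query-product pair:
# [P(irrelevant), P(partially relevant), P(exact match)]
teacher_probs = [0.1, 0.2, 0.7]

# A student whose logits reproduce the teacher's distribution...
aligned_loss = soft_target_cross_entropy(
    [math.log(p) for p in teacher_probs], teacher_probs)
# ...versus a student that confidently prefers the wrong class.
misaligned_loss = soft_target_cross_entropy([2.0, 0.0, -2.0], teacher_probs)
```

The loss is smallest when the student reproduces the teacher's full distribution, so the student learns graded relevance rather than only the top class.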
The empirical surprise: with enough augmented data, the student model outperforms the teacher. This contradicts the conventional distillation framing, in which the student only approximates the teacher and accepts a quality gap as the cost of speed. Why it happens: the teacher's labels act as an oracle for the student, and the augmented dataset contains query-product pairs the teacher never explicitly trained on. The student therefore sees more of the input distribution than the teacher did, with labels smoothed by the teacher's predictions, and that wider coverage lets it generalize better than the teacher on the actual evaluation distribution.
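The coverage argument can be sketched as a data flow: the teacher labels pairs it was never trained on, and the student trains on the union. Everything here is a toy under stated assumptions. `toy_teacher` is a hypothetical token-overlap stand-in for the fine-tuned LLM, and relabeling the teacher's own training pairs with teacher predictions is an assumption the note does not confirm.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]      # (query, product) text pair
SoftLabel = List[float]     # teacher's class probabilities

def build_student_dataset(
    teacher_train_pairs: Set[Pair],
    unlabeled_pairs: Iterable[Pair],
    teacher_predict: Callable[[str, str], SoftLabel],
) -> Dict[Pair, SoftLabel]:
    """Label every pair with the teacher, including pairs it never trained on."""
    dataset: Dict[Pair, SoftLabel] = {}
    for pair in list(teacher_train_pairs) + list(unlabeled_pairs):
        dataset[pair] = teacher_predict(*pair)
    return dataset

def toy_teacher(query: str, product: str) -> SoftLabel:
    """Hypothetical stand-in for the teacher: fraction of query tokens in the product."""
    q, p = set(query.lower().split()), set(product.lower().split())
    overlap = len(q & p) / max(len(q), 1)
    return [1.0 - overlap, overlap]   # [P(irrelevant), P(relevant)]

teacher_train_pairs = {("red shoes", "red running shoes")}
unlabeled_pairs = [("blue mug", "blue ceramic mug"), ("blue mug", "garden hose")]

student_data = build_student_dataset(teacher_train_pairs, unlabeled_pairs, toy_teacher)
# Pairs the student sees that the teacher was never trained on.
novel_pairs = [p for p in student_data if p not in teacher_train_pairs]
```

In this toy, two of the student's three training pairs are new to the teacher, which is the sense in which the student sees more of the input distribution.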
The architecture decision matters too. Bi-encoder retrieval allows precomputed item embeddings and approximate nearest-neighbor lookup; it is fast but less effective because query and item are encoded independently. Cross-encoder rerankers concatenate query and item, so attention runs across all tokens and captures interactions a bi-encoder can't. The two-stage retrieve-then-rerank funnel uses bi-encoders at the top of the funnel, where latency is tightest, and cross-encoders (now LLM-distilled) further down, where latency is more relaxed. The student model that exceeded its teacher was deployed in production, with significantly positive metrics.
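The two-stage funnel can be sketched with toy scorers. This is an illustration under loud assumptions, not the production system: `encode` is a bag-of-words stand-in for a bi-encoder tower, `cross_score` is a hand-written stand-in for a cross-encoder that rewards in-order query bigrams, and stage 1 uses brute-force search where production would use an approximate nearest-neighbor index.

```python
import math
from collections import Counter
from typing import List

def encode(text: str) -> Counter:
    """Toy bi-encoder tower: encodes one text on its own (bag of words)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cross_score(query: str, item: str) -> float:
    """Toy cross-encoder stand-in: scores the pair jointly, so it can reward
    query bigrams that appear in order in the item (a query-item interaction)."""
    q, i = query.lower().split(), item.lower().split()
    item_bigrams = set(zip(i, i[1:]))
    bigram_hits = sum(1 for bg in zip(q, q[1:]) if bg in item_bigrams)
    return cosine(Counter(q), Counter(i)) + bigram_hits

CATALOG = ["dog food bowl", "food for dog", "cat food", "garden hose"]
ITEM_EMBEDDINGS = {item: encode(item) for item in CATALOG}  # precomputed offline

def search(query: str, k_retrieve: int = 3, k_final: int = 2) -> List[str]:
    q_emb = encode(query)
    # Stage 1: cheap retrieval against precomputed item embeddings.
    candidates = sorted(
        CATALOG, key=lambda it: cosine(q_emb, ITEM_EMBEDDINGS[it]), reverse=True
    )[:k_retrieve]
    # Stage 2: rerank only the short list with the more expensive joint scorer.
    return sorted(candidates, key=lambda it: cross_score(query, it),
                  reverse=True)[:k_final]

results = search("dog food")
```

For the query "dog food", the independent encodings score "dog food bowl" and "food for dog" identically, while the joint scorer separates them because it sees word order across the pair.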
Inquiring lines that use this note as a source (30)
This note is a source for these synthesized inquiries.
- How do self-generated preference pairs from a strong teacher compare to human feedback?
- Where do LLMs succeed at generation but struggle with evaluation?
- Why do proprietary models improve with training while open-source models decline?
- Why does self-generated training data outperform externally sourced data?
- Why do pretrained LLM representations fail at task-specific relevance ranking?
- When does knowledge distillation produce student models superior to teachers?
- How do pseudo-relevance labels enable training without ground truth relevance judgments?
- Why does self-generated training data outperform externally curated domain examples?
- How can smaller models help select useful data for larger models?
- How do label constraints improve synthetic data without ground truth validation?
- Can smaller specialist models outperform large generalist models on domain tasks?
- What happens when LLMs grade other LLMs in closed evaluation loops?
- Does exposure to more domain-specific examples reduce LLM overconfidence?
- How does training data distribution determine what models can learn?
- Why do vector embeddings fail to measure task relevance in production RAG?
- Why do weaker models generate better training data than stronger models?
- Does training data format matter more than who generates it?
- Why do older datasets show higher LLM performance than newer ones?
- Why do weaker teacher models sometimes produce better training signals than stronger ones?
- What filtering criteria best identify student-compatible refinements from teacher models?
- Can self-training drift be prevented by applying student compatibility filtering?
- How does the ratio of synthetic to real training data affect model collapse?
- How does information asymmetry between teacher and student create the learning signal?
- Can smaller judge models better capture human preferences than larger prompted models?
- How should training data be constructed to preserve teacher-student information gaps?
- What makes policy self-distillation more effective than external teacher distillation?
- Can teachers trained under uncertainty constraints distill better generalizing students?
- Why should scaling laws be understood as properties of data distribution rather than training in general?
- Does pseudo-labeling from LLMs degrade classifier performance?
- Can intentional data-mixture design replace model scaling for rare task learning?
Related concepts in this collection (4)
- Can we distill LLM knowledge into graphs for real-time recommendations?
E-commerce needs sub-millisecond recommendations, but LLMs are too slow. Can we extract LLM insights offline into a knowledge graph that serves requests in production without sacrificing quality or explainability?
extends: same architectural pattern (distill LLM offline, serve smaller model online) applied to KG construction rather than ranking
- Can small language models handle most agent tasks?
Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
exemplifies: e-commerce ranking is the scoped repetitive task where SLM-first economics applies — student-exceeds-teacher reinforces the case
- Can reinforcement learning align summarization with ranking goals?
Generic LLM summaries optimize for readability, not ranking performance. Can training summarizers with downstream relevance scores as rewards fix this misalignment and produce summaries that actually help rankers match queries?
complements: both align LLM output to a specific downstream task — distillation aligns scoring; RL aligns summarization
- Can semantic knowledge shift model behavior like reinforcement learning does?
Can textual descriptions of successful reasoning patterns, prepended as context, achieve the same distribution shifts that RL achieves through parameter updates? This matters because it could eliminate the need for expensive fine-tuning on limited data.
complements: both compress LLM behavior into a cheaper substrate — ranking weights vs token-prior — preserving capability at lower cost
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Long-context LLMs Struggle with Long In-context Learning
- Knowledge Distillation for Enhancing Walmart E-commerce Search Relevance Using Large Language Models
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
Original note title
distilling LLM ranking into BERT cross-encoders enables production e-commerce search — augmented unlabeled data lets the student exceed the teacher