INQUIRING LINE

Can likelihood choice matter more than architectural depth for CF?

This explores whether, in collaborative filtering (CF) — the recommender technique that predicts what you'll like from patterns in what similar users liked — the choice of likelihood (the probabilistic loss you train against) can outweigh making the network deeper or bigger.


An honesty flag up front: this corpus contains no papers on collaborative filtering or recommender systems specifically, so it cannot settle the CF question on its own terms. What it does have, scattered across reasoning and fine-tuning work, is a recurring pattern that speaks directly to the spirit of your question: *what you optimize* often dominates *how big you build*.

The sharpest analog is the finding that small models trained with DPO can match large ones on function calling, because DPO's explicit negative examples target the exact failure mode that plain supervised fine-tuning misses (see: Can small models match large models on function calling?). Strip away the domain and that is close to your question: changing the training objective, from a likelihood that only rewards correct outputs to one that also penalizes wrong ones, closed a gap that scaling the model was supposed to close. For CF, this maps onto the long-running debate between pointwise likelihoods (predict the rating) and pairwise or ranking likelihoods (predict that you prefer A over B). By analogy rather than direct CF evidence, the corpus suggests the loss formulation can do heavy lifting that depth alone does not replicate.
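To make the pointwise versus pairwise distinction concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not something drawn from the corpus: the pointwise loss is taken to be a Bernoulli negative log-likelihood on implicit feedback, the pairwise loss is a BPR-style objective (a well-known ranking loss swapped in here as an example), and the scores and function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pointwise_nll(scores, labels):
    """Bernoulli negative log-likelihood: each (user, item) score is judged alone."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(scores)


def pairwise_nll(pos_scores, neg_scores):
    """BPR-style loss: only the margin between a liked and an unliked item matters."""
    total = 0.0
    for sp, sn in zip(pos_scores, neg_scores):
        total -= math.log(sigmoid(sp - sn))
    return total / len(pos_scores)


# Hypothetical model scores for two liked and two unliked items.
pos_a, neg_a = [2.0, 1.0], [0.0, -1.0]

# A second "model" that ranks identically but shifts every score up by 3.
shift = 3.0
pos_b = [s + shift for s in pos_a]
neg_b = [s + shift for s in neg_a]

labels = [1, 1, 0, 0]
point_a = pointwise_nll(pos_a + neg_a, labels)
point_b = pointwise_nll(pos_b + neg_b, labels)
pair_a = pairwise_nll(pos_a, neg_a)
pair_b = pairwise_nll(pos_b, neg_b)

print("pointwise:", point_a, point_b)  # differs between the two models
print("pairwise: ", pair_a, pair_b)    # identical: the ranking is unchanged
```

The two score sets induce the same ranking, so the pairwise loss cannot tell them apart, while the pointwise loss penalizes the shifted model for miscalibrated scores. The same model is therefore pushed in different directions depending on the likelihood, independent of how deep the network producing the scores is.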

There's a second, subtler reason objective choice can beat depth: identical performance metrics can hide completely different internal representations (see: Can models be smart without organized internal structure?). A deeper architecture might post the same offline accuracy while being internally fractured and brittle to distribution shift, which in recommenders is the everyday reality of new users and shifting catalogs. The likelihood you choose shapes what the model actually organizes its representation around, and that organization is invisible to the headline metric. Relatedly, the effect of an objective isn't even fixed across settings: preference tuning *reduces* diversity in code but *increases* it in creative writing, depending on what each domain rewards (see: Does preference tuning always reduce diversity the same way?). The lesson for CF is that a likelihood interacts with the data's incentive structure, so the right loss is a modeling decision, not a default.

The honest counterweight is that architecture is not inert. Conditional scaling laws that fold in architectural variables (hidden size, MLP-to-attention ratio) delivered real, measurable efficiency and accuracy gains (see: Can architecture choices improve inference efficiency without sacrificing accuracy?), and across reasoning tasks the training regime, not raw compute or size, is what makes a model's capacity productive (see: Can non-reasoning models catch up with more compute?). Read together, these point to the same conclusion: depth and likelihood aren't rivals so much as differently-leveraged knobs, and the cheaper, higher-leverage knob is frequently the objective. If you want to pursue the CF question rigorously, the corpus gives you the *principle* (objective often beats scale) but not the recommender-specific experiments; that's a gap worth filling with dedicated CF literature.


Sources (5 notes)

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an LLM researcher, evaluate this claim: objective/likelihood choice can dominate architectural depth in collaborative filtering systems. A curated library on reasoning and fine-tuning (2024–2026) found patterns suggesting objective often outperforms scale, but contained no CF-specific work.

What a curated library found — and when (dated claims, not current truth):
• Small models with DPO (explicit negatives in loss) matched large supervised-fine-tuned models on function calling, suggesting loss formulation closed a gap scaling alone couldn't (2024–2025)
• Identical offline accuracy metrics masked fundamentally different internal representations and robustness to distribution shift, implying objective choice shapes representation structure invisibly (2025)
• Preference tuning effects on diversity are domain-dependent (reduces code lexical diversity, increases creative writing diversity), hinting objective interacts with data incentive structure (2025)
• Conditional scaling laws incorporating architectural variables (hidden size, MLP-to-attention ratio) showed real gains, and training regime, not raw size, unlocks capacity in reasoning tasks (2025–2026)

Anchor papers (verify; mind their dates):
• 2410.18890 (Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks) — DPO on small models
• 2501.17161 (SFT Memorizes, RL Generalizes) — objective regime shifts
• 2510.18245 (Scaling Laws Meet Model Architecture) — architectural + training interaction
• 2605.19376 (Generative Recursive Reasoning) — regime boundaries

Your task:
(1) RE-TEST EACH CONSTRAINT: For DPO vs. supervised loss, has scaling or orchestration (e.g., in-batch negatives, hard-negative mining, memory-augmented ranking) since made supervised losses competitive? For representation brittleness under distribution shift, has newer evaluation of CF systems (e.g., cold-start, catalog churn) confirmed that objective choice mitigates it better than depth? Where does the constraint still appear to hold?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months: does any recent CF or ranking paper show architectural depth overwhelming objective choice?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Can hybrid losses (e.g., pointwise + ranking + contrastive) with shallow width now match pure-objective depth in cold-start robustness? (b) Does externalization (memory, caching, multi-agent orchestration per 2026-04) reshape the objective-vs-depth tradeoff by moving ranking outside the model?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
