INQUIRING LINE

Do scaling laws change when weight precision becomes a design variable?

This explores what happens to scaling laws — the predictable curves relating model size, data, and compute to performance — once the *precision* of each weight (how many bits it uses) is something you get to choose rather than a fixed assumption (usually 16-bit).


The sharpest answer in the corpus is that yes, the curve moves: BitNet b1.58 shows that LLMs trained natively with *ternary* weights (roughly 1.58 bits each) match full-precision FP16/BF16 models on perplexity and end-task benchmarks at the same parameter count, while cutting latency, memory, and energy (see: Can ternary weights match full precision model performance?). The striking part isn't just the compression; it's that the authors frame the result as defining a *new* scaling law, one with a different cost axis, and as an invitation to design hardware around 1-bit models. So precision isn't a lossy afterthought applied to a trained model. Made a first-class design variable, it redraws the relationship between size and capability.
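To make "roughly 1.58 bits" concrete: a weight that can take three values carries log2(3) ≈ 1.58 bits of information. The sketch below is an illustration only, not BitNet's training code. It rounds an existing list of weights to {-1, 0, +1} after the fact, whereas the result described above depends on training with ternary weights natively. The function name `ternarize`, the mean-absolute-value scale, and the sample weights are all assumptions made for this example.

```python
import math

def ternarize(weights, eps=1e-8):
    """Round a list of floats to {-1, 0, +1} with one shared scale (illustrative)."""
    # Assumed scale for this sketch: the mean absolute weight.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

# Information content of a three-state weight.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

weights = [0.9, -0.05, 0.4, -1.3, 0.02, -0.6]  # made-up values
ternary, scale = ternarize(weights)

print(f"bits per ternary weight: {BITS_PER_TERNARY_WEIGHT:.3f}")
print("ternary weights:", ternary)
```

Small weights collapse to 0 and large ones saturate at ±1, which is why a three-state weight is better thought of as a different parameterization than as a compressed 16-bit one.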

The deeper point is that scaling laws were never one fixed law; they're a template you can re-parameterize by whatever you let vary. The corpus has a clear example: when you fold *architectural* choices (hidden size, the MLP-to-attention ratio, grouped-query attention) into the scaling law, you can optimize for inference efficiency and get up to 42% more throughput *and* up to 2.1% higher accuracy than LLaMA-3.2 under the same training budget (see: Can architecture choices improve inference efficiency without sacrificing accuracy?). Precision is the same kind of move: it adds a dimension the original Chinchilla-style law held constant. Each new design variable doesn't break scaling laws; it gives you a richer surface to find better trade-offs on.
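The "template" idea can be sketched in a few lines. The first function below uses the familiar Chinchilla-style form, loss = E + A / N^alpha + B / D^beta, with placeholder constants that are not fitted values. The second function is a hypothetical extension, not taken from any of the notes: it assumes a design variable acts as an efficiency factor in (0, 1] that shrinks the effective parameter count. Both function names and the `efficiency` parameter are invented for illustration.

```python
def chinchilla_style_loss(n_params, n_tokens,
                          E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Loss as a function of parameters N and tokens D (placeholder constants)."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

def loss_with_design_variable(n_params, n_tokens, efficiency):
    """Same template, with one extra axis.

    `efficiency` is a hypothetical factor in (0, 1] standing in for a design
    choice such as precision; it rescales the effective parameter count.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return chinchilla_style_loss(n_params * efficiency, n_tokens)

N, D = 1e9, 2e10  # made-up model size and token count
baseline = chinchilla_style_loss(N, D)
reduced = loss_with_design_variable(N, D, efficiency=0.5)
print(f"baseline loss: {baseline:.4f}, with efficiency 0.5: {reduced:.4f}")
```

Under this assumed form, the BitNet-style claim would correspond to an efficiency close to 1 at a much lower cost per parameter, so the gain shows up on the cost axis rather than the loss axis.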

What's worth knowing is that the most interesting scaling shifts are happening *off* the parameter-count axis entirely. Inference-time compute can substitute for model size: smaller models given more thinking time match larger ones on hard prompts, which means pretraining and inference compute aren't independent resources (see: Can inference compute replace scaling up model size?). The same pattern recurs for research agents, where gains from additional *search steps* show diminishing returns in a pattern that mirrors reasoning tokens, a genuinely new inference-compute axis (see: Do search steps follow the same scaling rules as reasoning tokens?). Precision joins this family: it's one more dimension along which you can trade resources, and the lesson across all of them is that capability is governed by a multi-axis budget, not a single number.

There's also a hardware-shaped reason precision matters as a design variable, visible in the mobile work. On memory-bound devices, the bottleneck isn't the arithmetic done with the weights; it's *moving* them, so running one transformer block twice with shared weights can cost less latency than fetching a separate set of weights from memory (see: Does recomputing weights cost less than moving them on mobile?). Low-bit weights attack the same bottleneck from the other side: fewer bits per weight means less to move. This is exactly why BitNet's authors point toward custom 1-bit hardware: once precision is a design variable, it co-evolves with the chip, and the 'cost' term in the scaling law stops being abstract FLOPs and becomes bytes moved on real silicon.
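The "fewer bits means less to move" point is simple arithmetic, sketched below. The 3-billion-parameter model size is a made-up example, and the ternary figure is an information-theoretic lower bound of log2(3) bits per weight. Real storage formats may pack ternary values less tightly, so treat the ratio as a best case rather than a measured result.

```python
import math

def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Bytes needed to hold (or move) every weight once."""
    return n_params * bits_per_weight / 8

N_PARAMS = 3_000_000_000  # hypothetical model size

fp16_bytes = weight_bytes(N_PARAMS, 16)
ternary_bytes = weight_bytes(N_PARAMS, math.log2(3))  # ideal packing
ratio = fp16_bytes / ternary_bytes

print(f"FP16 weights:    {fp16_bytes / 1e9:.2f} GB")
print(f"ternary weights: {ternary_bytes / 1e9:.2f} GB (lower bound)")
print(f"reduction:       {ratio:.1f}x fewer bytes to move")
```

Because the parameter count cancels in the ratio, the best-case reduction of roughly 10x depends only on the two bit widths, not on model size.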

If you want to keep pulling this thread, the adjacent territory is everything that decouples performance from naive weight-counting: representation finetuning that intervenes on frozen activations instead of updating weights, hitting 10–50x better parameter efficiency than LoRA (see: Can editing hidden representations beat weight updates for finetuning?); finetuning's own multiplicative scaling law, where a larger base model helps more than more pretraining data (see: How should finetuning scale with model and data size?); and weight *sparsity*, a different bit-budget move that trades dense capacity for interpretable, modular circuits (see: Can sparse weight training make neural networks interpretable by design?). The throughline: 'how many parameters' was always a proxy. As precision, sparsity, architecture, and inference compute each become design variables, scaling laws don't dissolve; they multiply into a family, each describing a different face of the same cost-versus-capability surface.


Sources (8 notes)

Can ternary weights match full precision model performance?

BitNet b1.58 trains natively with ternary weights and matches FP16/BF16 performance on perplexity and end-task benchmarks at equal model size, while cutting latency, memory, and energy costs. The result enables a new scaling law and opens the path to hardware designed specifically for 1-bit LLMs.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice uses less latency than fetching separate weights, gaining accuracy with no parameter increase.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

How should finetuning scale with model and data size?

Systematic experiments across 1B–16B models reveal finetuning follows a power-based multiplicative scaling law. Larger base models improve finetuning more than more pretraining data, while increasing PET parameters provides minimal benefit.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher evaluating whether scaling-law constraints around weight precision have shifted since early 2024. The question remains: Do scaling laws change when weight precision becomes a design variable?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2025 and include:
  • BitNet (1.58-bit ternary weights) matches FP16/BF16 perplexity and task performance at identical parameter count while cutting latency, memory, energy (2024-02).
  • Precision is a first-class design variable; reframing it redraws the cost axis in scaling laws, inviting hardware co-design around 1-bit models (2024-02).
  • Inference-time compute (test-time reasoning) substitutes for model size on hard prompts; pretraining and inference compute are not independent resources but trade off against each other (2025-*).
  • Architectural variables (hidden size, MLP-to-attention ratio, grouped-query attention) folded into scaling laws yield up to 42% more throughput and up to 2.1% higher accuracy than LLaMA-3.2 under the same training budget (2025-10).
  • Representation finetuning on frozen activations achieves 10–50× better parameter efficiency than LoRA; weight sparsity trades dense capacity for interpretable circuits (2024-04, 2025-11).

Anchor papers (verify; mind their dates):
  • arXiv:2402.17764 (BitNet, Feb 2024)
  • arXiv:2510.18245 (Scaling Laws Meet Model Architecture, Oct 2025)
  • arXiv:2404.03592 (ReFT, Apr 2024)
  • arXiv:2511.13653 (Weight-sparse circuits, Nov 2025)

Your task:
  (1) RE-TEST EACH CONSTRAINT. For BitNet's ternary-weight claim, has post-2024 work confirmed this holds on larger models or harder domains? Do inference-compute and architectural-variable trade-offs hold in production systems? Have newer optimizers, quantization techniques, or hardware (TPU v6e, custom 1-bit silicon) relaxed or overturned any precision-budget bottleneck? Flag which constraints appear durable and which resolved.
  (2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially any showing precision matters *less* than claimed, or that a different axis (activation sparsity, dynamic routing) dominates the cost surface.
  (3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Can precision co-vary with LoRA rank in finetuning without synergistic loss?" or "Does weight sparsity + ternary precision yield better circuits than either alone?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
