Logical Reasoning and Internal Rules

Can argument schemes be organized by formal principles instead of lists?

Argumentation theory has accumulated 60+ overlapping scheme lists with no principled boundaries. Can a structured classification based on formal ordering principles replace this ad-hoc approach and provide a coherent target space for analysis?

Can retrieval-augmented language models forecast like human experts?

Can language models augmented with search and reasoning match or exceed the forecasting accuracy of competitive human crowd forecasters on events beyond their training data? This tests whether AI can handle genuine predictive judgment.

How do transformers perform analogical reasoning across domains?

Explores whether transformers solve analogy problems through a mechanism distinct from composition, and whether that mechanism relies on abstract relational structure rather than memorized computation patterns.

Can three axes organize all possible argument schemes?

Can a small set of orthogonal distinctions—subject vs. predicate, order level, and proposition types—capture the full space of valid argument structures? This matters because it could replace ad-hoc scheme lists with a systematic framework.
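As a rough illustration of what a "systematic framework" could look like in practice, the sketch below treats each axis as a small enumerated set and builds the classification space as their Cartesian product. The specific axis values (`fact`, `value`, `policy`), the choice to apply the proposition-type axis to both conclusion and premise, and the names `SchemeCell` and `enumerate_cells` are all assumptions made for the example; the note itself does not fix them.

```python
from dataclasses import dataclass
from itertools import product

# Axis values below are illustrative assumptions, not taken from the note.
ARGUMENT_FORMS = ("subject", "predicate")
ORDER_LEVELS = ("first-order", "second-order")
PROPOSITION_TYPES = ("fact", "value", "policy")


@dataclass(frozen=True)
class SchemeCell:
    """One hypothetical cell in the classification grid."""
    form: str
    order: str
    conclusion_type: str
    premise_type: str


def enumerate_cells():
    """Build every combination of the assumed axis values."""
    return [
        SchemeCell(form, order, conclusion, premise)
        for form, order, conclusion, premise in product(
            ARGUMENT_FORMS, ORDER_LEVELS, PROPOSITION_TYPES, PROPOSITION_TYPES
        )
    ]


cells = enumerate_cells()
print(len(cells), "cells under the assumed axis values")
```

The point of the sketch is only that a closed set of distinctions yields a finite, enumerable target space, which an open-ended scheme list does not.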

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.

Can LLMs reason creatively beyond conventional problem-solving?

Explores whether large language models can engage in truly creative reasoning that expands or redefines solution spaces, rather than just decomposing known problems. This matters because existing reasoning methods may miss creative capabilities entirely.

How do transformers learn to reason across multiple steps?

Does multi-hop reasoning in transformers emerge through distinct learning phases, and what geometric patterns in hidden representations explain when reasoning succeeds or fails?

Do large language models use one reasoning style or many?

Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.

Do large language models reason symbolically or semantically?

Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.

Do language models make rational strategic decisions in games?

Explores whether LLMs consistently apply game-theoretic reasoning to reach optimal strategies, and whether their performance holds as games become more complex. Understanding this matters for deploying LLMs in negotiation and competitive settings.

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? This tests whether logical correctness or structural format is the real driver of CoT's effectiveness.

Why does partial formalization outperform full symbolic logic?

Explores whether injecting some symbolic structure into natural language reasoning works better than completely formalizing problems. This matters because it could reveal the optimal balance between structure and semantics for LLM reasoning.

How much does the order of premises actually matter for reasoning?

When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.
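A minimal sketch of how such a test could be set up: the same premises are presented in their original order, in reverse, and in a few random shuffles, so that any accuracy difference can be attributed to ordering alone. The helper names `premise_orderings` and `build_prompt`, the toy premises, and the prompt layout are assumptions for illustration; the model call itself is deliberately left out.

```python
import random


def premise_orderings(premises, n_shuffles=3, seed=0):
    """Return the original order, its reverse, and n seeded random shuffles."""
    rng = random.Random(seed)
    variants = [list(premises), list(reversed(premises))]
    for _ in range(n_shuffles):
        shuffled = list(premises)
        rng.shuffle(shuffled)
        variants.append(shuffled)
    return variants


def build_prompt(premises, question):
    """Number the premises and append the question (hypothetical layout)."""
    lines = [f"{i}. {p}" for i, p in enumerate(premises, start=1)]
    return "\n".join(lines) + f"\nQuestion: {question}"


# Toy deduction task; the logical content is identical in every variant.
premises = ["If A then B.", "If B then C.", "A is true."]
prompts = [build_prompt(v, "Is C true?") for v in premise_orderings(premises)]

for prompt in prompts:
    print(prompt, end="\n\n")
```

Each prompt would then be sent to the model under test and scored against the same gold answer; a model that reasons over the premises abstractly should score the same on every variant.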

Does reasoning ability actually degrade with longer inputs?

Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

Do first-order and second-order arguments unify classical and modern divisions?

Does the formal distinction between first-order and second-order arguments map onto both the classical internal-external topoi divide and the modern reasonable-fallacious distinction? If so, it would reveal a single structural axis underlying two separate critical traditions.