Foundation AI Papers (II)
Paper • 2404.19733 • Published • 47
Better & Faster Large Language Models via Multi-token Prediction
Paper • 2404.19737 • Published • 73
Note well ...
ORPO: Monolithic Preference Optimization without Reference Model
Paper • 2403.07691 • Published • 62
KAN: Kolmogorov-Arnold Networks
Paper • 2404.19756 • Published • 108
Note "Less scalable version" of AGI backend model
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Paper • 2303.02536 • Published • 1
Suppressing Pink Elephants with Direct Principle Feedback
Paper • 2402.07896 • Published • 9
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
Paper • 2310.01801 • Published • 3
Aligning LLM Agents by Learning Latent Preference from User Edits
Paper • 2404.15269 • Published • 1
Language-Image Models with 3D Understanding
Paper • 2405.03685 • Published • 1
Chain of Thoughtlessness: An Analysis of CoT in Planning
Paper • 2405.04776 • Published • 1
Memory Mosaics
Paper • 2405.06394 • Published • 2
The Consensus Game: Language Model Generation via Equilibrium Search
Paper • 2310.09139 • Published • 12
RLHF Workflow: From Reward Modeling to Online RLHF
Paper • 2405.07863 • Published • 67
PHUDGE: Phi-3 as Scalable Judge
Paper • 2405.08029 • Published • 1
Note LoRA fine-tune of a judge LM, using Prometheus's 10K feedback dataset. Turns the LLM into a classifier to increase 'overfitting' and get a slightly better-performing model based on Phi-3 (which arguably already has stronger performance than Mistral). Not that surprising, and using a large dataset to fine-tune on human preferences is boring. They did release code for the experiment, which is nice to have. The real gem is efficient alignment. (A minimal LoRA-classifier sketch follows below.)
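Below is a minimal sketch of the "turn the LLM into a classifier" setup described in the note, assuming the judge scores are cast as five classes and LoRA is applied to the attention projections. The checkpoint name and target module names are assumptions, not taken from the paper's released code.

```python
# Hypothetical sketch: LoRA fine-tuning of Phi-3 as a 5-way sequence classifier.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "microsoft/Phi-3-mini-4k-instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

# Sequence classification needs a pad token; reuse EOS if none is defined.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # classification head instead of next-token prediction
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # assumed Phi-3 attention projection names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters and the new head are trainable
```

Training then proceeds with a standard cross-entropy objective over the five score classes.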
ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models
Paper • 2405.09220 • Published • 24
Understanding the performance gap between online and offline alignment algorithms
Paper • 2405.08448 • Published • 14
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
Paper • 2405.05904 • Published • 6
Note A good way to avoid the penalty while being lazy is just to be generic, or to provide fake information.
Robust agents learn causal world models
Paper • 2402.10877 • Published • 2
How Far Are We From AGI
Paper • 2405.10313 • Published • 3
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
Paper • 2405.12130 • Published • 45
Note What is the difference again?
Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models
Paper • 2405.12939 • Published • 1
Note Majority vote is unreliable when the answer distribution is skewed. Variety in the prompts is used to elicit diversity in the evaluation distribution over which the majority vote is then taken; this is the key to AoR. (A minimal sketch of the voting idea follows below.)
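A minimal sketch of the voting idea in the note above (not the paper's full hierarchical AoR pipeline): sample answers under several prompt variants and take the majority vote over the pooled samples, so that no single prompt's skewed distribution dominates. The `generate` callable is an assumed helper.

```python
from collections import Counter

def diverse_majority_vote(question, prompt_variants, generate, samples_per_prompt=5):
    """`generate(prompt)` is an assumed helper returning one sampled answer string."""
    answers = []
    for template in prompt_variants:
        prompt = template.format(question=question)
        answers.extend(generate(prompt) for _ in range(samples_per_prompt))
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)  # answer plus its empirical vote share
```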
LoRA Learns Less and Forgets Less
Paper • 2405.09673 • Published • 87
Note Duh
The Platonic Representation Hypothesis
Paper • 2405.07987 • Published • 2
Note Intelligence has at least two levels. Level 1 is associative intelligence: the key to achieving it is a representation of concepts such that the 'distance' between representation vectors accurately reflects the closeness of those concepts; this can be achieved with Supervised Learning. Level 2 is deductive intelligence: the key to achieving it is searching for the right connections and reaching the correct conclusion robustly despite noisy input; this should be achieved with Reinforcement Learning.
AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct
Paper • 2405.14906 • Published • 23
Trans-LoRA: towards data-free Transferable Parameter Efficient Finetuning
Paper • 2405.17258 • Published • 14
Executable Code Actions Elicit Better LLM Agents
Paper • 2402.01030 • Published • 27
Contextual Position Encoding: Learning to Count What's Important
Paper • 2405.18719 • Published • 5
Note HUGE
Understanding Transformer Reasoning Capabilities via Graph Algorithms
Paper • 2405.18512 • Published • 1
What's the Magic Word? A Control Theory of LLM Prompting
Paper • 2310.04444 • Published • 1
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Paper • 2406.00888 • Published • 30
Calibrated Language Models Must Hallucinate
Paper • 2311.14648 • Published • 1
How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Paper • 2406.11813 • Published • 30
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Paper • 2405.20541 • Published • 20
Note Similar to LLM2LLM; reduces the selection cost by using a smaller LLM. But it goes back to the model-agnostic training design, which is suboptimal. (A minimal pruning sketch follows below.)
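A minimal sketch of perplexity-based pruning with a small reference model, assuming `gpt2` as a stand-in reference and keeping the lowest-perplexity fraction as one possible selection criterion; this is an illustration, not the paper's released code.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "gpt2"  # assumed stand-in for a small reference model
tokenizer = AutoTokenizer.from_pretrained(ref_name)
ref_model = AutoModelForCausalLM.from_pretrained(ref_name).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    loss = ref_model(ids, labels=ids).loss  # mean token-level cross-entropy
    return math.exp(loss.item())

def prune(corpus: list[str], keep_fraction: float = 0.5) -> list[str]:
    # Score every document with the small model, keep the lowest-perplexity slice.
    return sorted(corpus, key=perplexity)[: int(len(corpus) * keep_fraction)]
```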
To Believe or Not to Believe Your LLM
Paper • 2406.02543 • Published • 31
Note LLMs suffer from confirmation bias. Given a single-label query Q, the LLM's confidence in answering A can be tested by adding "Another answer to Q is B" to the prompt and checking the change in the LLM's confidence in giving A as the answer. More specifically, P(A) / (P(A) + P(B)) is adopted as the score. Such changes serve as a query-specific hallucination indicator. (A minimal sketch of the probe follows below.)
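A minimal sketch of the probe described in the note, with `model_prob(prompt, answer)` as an assumed helper that returns the model's probability of producing `answer` given `prompt`; the exact prompt wording is an assumption.

```python
def confidence_shift(question, answer_a, answer_b, model_prob):
    """Compare P(A)/(P(A)+P(B)) before and after asserting B in the prompt."""
    def score(prompt):
        p_a = model_prob(prompt, answer_a)
        p_b = model_prob(prompt, answer_b)
        return p_a / (p_a + p_b)  # normalised confidence in A over {A, B}

    baseline = score(question)
    injected = score(f"Another answer to this question is {answer_b}. {question}")
    # A large drop from baseline to injected flags a query-specific hallucination risk.
    return baseline, injected
```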
TextGrad: Automatic "Differentiation" via Text
Paper • 2406.07496 • Published • 26
Note Requires drastic simplification, as the current mechanism basically doesn't work; nonetheless, identifying the need for a "semantic gradient" is a correct insight.
In-Context Editing: Learning Knowledge from Self-Induced Distributions
Paper • 2406.11194 • Published • 15
Note Self-distillation loss with context: supervising learning with a distribution target (the model's own context-conditioned distribution) improves efficiency and also "battles knowledge collapsing". (A minimal sketch of the loss follows below.)
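A minimal sketch of the self-distillation idea in the note, assuming a Hugging Face-style causal LM: the target distribution is the model's own prediction when the new fact is given in context, and a KL term pulls the context-free distribution toward it on the answer tokens. Variable names and the loss shape are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def in_context_editing_loss(model, plain_ids, contextual_ids, answer_len):
    """KL(teacher || student) over the answer tokens only.

    plain_ids      - query + answer, without the new fact in context (student input)
    contextual_ids - the same query + answer, prefixed with the new fact (teacher input)
    answer_len     - number of answer tokens at the end of both sequences
    """
    # Logits at position i predict token i + 1, so the answer tokens are
    # predicted by the slice [-answer_len - 1 : -1].
    with torch.no_grad():
        teacher_logits = model(contextual_ids).logits[:, -answer_len - 1:-1, :]
    student_logits = model(plain_ids).logits[:, -answer_len - 1:-1, :]
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
```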
Improve Mathematical Reasoning in Language Models by Automated Process Supervision
Paper • 2406.06592 • Published • 24
Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts
Paper • 2406.12034 • Published • 13
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Paper • 2406.06469 • Published • 23
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning
Paper • 2406.14283 • Published • 2
Note Overpromises quite a lot. This one applies Aligner (a residually connected extra network on top of the LLM) to learn a reward model and generates under an MCTS structure.
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
Paper • 2406.11896 • Published • 18
Note This is the future. I am trying to build an OS version of this one too.
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
Paper • 2406.13542 • Published • 16
Note Arguably a spin-off from Voyager
HARE: HumAn pRiors, a key to small language model Efficiency
Paper • 2406.11410 • Published • 38
Instruction Pre-Training: Language Models are Supervised Multitask Learners
Paper • 2406.14491 • Published • 85
Note Trained an instruction synthesizer to generate QA pairs from raw text (an efficient way of getting around the GPT-4 rate limit, I suppose; 10K SFT examples for synthesizer training). The extra QA data is used for collective pre-training (raw corpus + QA pairs), which yields better performance. (A minimal sketch of the mixing step follows below.)
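A minimal sketch of the mixing step described in the note, assuming the synthesizer is wrapped as a callable that returns (question, answer) pairs for a document; the formatting template is an assumption.

```python
def build_pretraining_example(document: str, qa_pairs: list[tuple[str, str]]) -> str:
    """Concatenate the raw document with its synthesized instruction-response pairs."""
    qa_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return f"{document}\n\n{qa_text}"

def build_corpus(documents, synthesizer):
    # `synthesizer(doc)` is an assumed callable wrapping the fine-tuned QA generator.
    return [build_pretraining_example(doc, synthesizer(doc)) for doc in documents]
```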
Teaching Arithmetic to Small Transformers
Paper • 2307.03381 • Published • 17
Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network
Paper • 2406.15109 • Published • 1
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper • 2406.20094 • Published • 94
Unlocking Continual Learning Abilities in Language Models
Paper • 2406.17245 • Published • 28
ColPali: Efficient Document Retrieval with Vision Language Models
Paper • 2407.01449 • Published • 41
Flextron: Many-in-One Flexible Large Language Model
Paper • 2406.10260 • Published • 2