That score is seriously impressive: it beats the average human performance of 60.2% and undercuts the narrative that you need massive proprietary models to do abstract reasoning. They used a fine-tuned version of Mistral-NeMo-Minitron-8B and brought the per-task inference cost down to a tiny fraction of what OpenAI's o3 needs.

The methodology is really clever. They started by nuking the standard tokenizer, stripping it down to just 64 tokens so the model can never merge digits together and confuse itself. They also leaned heavily on test-time training: for each puzzle, the model fine-tunes itself for a few seconds on that puzzle's few example pairs before attempting the test input. And for the actual generation they ditched standard sampling for a depth-first search that prunes low-probability paths early, so no compute is wasted on obvious dead ends (sketched below).
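If you want the search idea in miniature, here's a rough, self-contained sketch. To be clear, this is my own illustration, not the authors' code: `toy_model` is a made-up stand-in for the fine-tuned network, and the tiny single-character vocabulary just mirrors the spirit of the stripped-down tokenizer.

```python
import math

EOS = "<eos>"

def dfs_decode(logprob_fn, prefix, cum_logprob, threshold, max_len, results):
    """Depth-first decoding: extend `prefix` one token at a time and
    abandon any branch whose cumulative log-probability falls below
    `threshold`, so dead ends are pruned before they eat compute."""
    if prefix and prefix[-1] == EOS:
        results.append((cum_logprob, prefix))  # complete candidate
        return
    if len(prefix) >= max_len:
        return
    # Visit higher-probability continuations first so strong candidates
    # are found early.
    for token, lp in sorted(logprob_fn(prefix).items(), key=lambda kv: -kv[1]):
        new_logprob = cum_logprob + lp
        if new_logprob < threshold:
            continue  # prune: log-probs only decrease as the path grows
        dfs_decode(logprob_fn, prefix + [token], new_logprob,
                   threshold, max_len, results)

# Toy stand-in for the fine-tuned model (entirely hypothetical). The tiny
# single-character vocabulary echoes the reduced tokenizer, where a digit
# is always one token and can never be merged with its neighbors.
def toy_model(prefix):
    if len(prefix) < 2:
        return {"0": math.log(0.6), "1": math.log(0.3), EOS: math.log(0.1)}
    return {"0": math.log(0.1), "1": math.log(0.1), EOS: math.log(0.8)}

results = []
dfs_decode(toy_model, [], 0.0, threshold=math.log(0.05), max_len=5,
           results=results)
for logp, seq in sorted(results, reverse=True):
    print(f"p={math.exp(logp):.3f}  tokens={seq}")
```

Every path whose running probability dips under the threshold gets cut immediately, which is why this is so much cheaper than sampling thousands of full completions and throwing most of them away.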

The most innovative part of the paper is their Product of Experts selection strategy. Once the model generates a candidate solution, they don't just trust it blindly: they re-evaluate its probability across different augmentations of the input, like rotating the grid or swapping colors. If the solution is actually correct, it should look plausible from every perspective, so they take the geometric mean of those per-view probabilities to filter out hallucinations. It's basically the model peer-reviewing its own work from different angles to make sure the logic holds up (see the sketch below).
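To make the aggregation concrete, here's a tiny sketch of why the geometric mean punishes inconsistency. Again, this is illustrative only: the rotations are just a subset of the paper's augmentation set, and the per-view probabilities are made-up numbers.

```python
import math

def rotate90(grid):
    """Rotate a list-of-lists grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def views(grid):
    """The four rotations of a grid. The paper's augmentation set is
    larger (flips, color permutations), but the aggregation is the same."""
    out = []
    for _ in range(4):
        out.append(grid)
        grid = rotate90(grid)
    return out

def geometric_mean(probs):
    """Geometric mean computed via the mean log-probability."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

# Sanity-check the augmentation helper on a 2x2 grid.
grid = [[1, 2],
        [3, 4]]
print(views(grid)[1])  # [[3, 1], [4, 2]], the grid rotated once

# Hypothetical per-view probabilities for two candidate solutions: a
# correct answer stays plausible from every angle, while a hallucination
# looks great from one view and collapses from the others.
per_view_probs = {
    "correct":       [0.6, 0.5, 0.6, 0.5],
    "hallucination": [0.9, 0.01, 0.9, 0.01],
}
for name, probs in per_view_probs.items():
    print(f"{name}: geometric mean = {geometric_mean(probs):.3f}")
# correct: 0.548, hallucination: 0.095
```

A single bad view drags the geometric mean toward zero, which is exactly the behavior you want: a hallucination that only looks right from one angle can't outrank an answer that is merely decent from every angle.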

What’s remarkable is that all of this was done with smart engineering rather than raw compute. You can literally run this tonight on your own machine.

The code is fully open-source: https://github.com/da-fr/Product-of-Experts-ARC-Paper

  • neon_nova@lemmy.dbzer0.com · 3 hours ago

    I don’t know much about running this on my own computer other than using ollama. Is that what you mean about running it on my own?