That score is seriously impressive: it beats the average human performance of 60.2% and undercuts the narrative that you need massive proprietary models to do abstract reasoning. They used a fine-tuned version of Mistral-NeMo-Minitron-8B and brought the per-task inference cost down to a tiny fraction of what OpenAI's o3 needs.

The methodology is really clever. They started by nuking the standard tokenizer, stripping it down to just 64 tokens so the model can never merge digits together and confuse itself. They also leaned heavily on test-time training: for each puzzle, the model fine-tunes itself for a few seconds on that puzzle's few example pairs before attempting the test input. And for the actual generation they ditched standard sampling for a depth-first search that prunes low-probability paths early, so no compute is wasted on obvious dead ends (sketched below).
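If you want the search idea in miniature, here's a rough, self-contained sketch. To be clear, this is my own illustration, not the authors' code: `toy_model` is a made-up stand-in for the fine-tuned network, and the tiny single-character vocabulary just mirrors the spirit of the stripped-down tokenizer.

```python
import math

EOS = "<eos>"

def dfs_decode(logprob_fn, prefix, cum_logprob, threshold, max_len, results):
    """Depth-first decoding: extend `prefix` one token at a time and
    abandon any branch whose cumulative log-probability falls below
    `threshold`, so dead ends are pruned before they eat compute."""
    if prefix and prefix[-1] == EOS:
        results.append((cum_logprob, prefix))  # complete candidate
        return
    if len(prefix) >= max_len:
        return
    # Visit higher-probability continuations first so strong candidates
    # are found early.
    for token, lp in sorted(logprob_fn(prefix).items(), key=lambda kv: -kv[1]):
        new_logprob = cum_logprob + lp
        if new_logprob < threshold:
            continue  # prune: log-probs only decrease as the path grows
        dfs_decode(logprob_fn, prefix + [token], new_logprob,
                   threshold, max_len, results)

# Toy stand-in for the fine-tuned model (entirely hypothetical). The tiny
# single-character vocabulary echoes the reduced tokenizer, where a digit
# is always one token and can never be merged with its neighbors.
def toy_model(prefix):
    if len(prefix) < 2:
        return {"0": math.log(0.6), "1": math.log(0.3), EOS: math.log(0.1)}
    return {"0": math.log(0.1), "1": math.log(0.1), EOS: math.log(0.8)}

results = []
dfs_decode(toy_model, [], 0.0, threshold=math.log(0.05), max_len=5,
           results=results)
for logp, seq in sorted(results, reverse=True):
    print(f"p={math.exp(logp):.3f}  tokens={seq}")
```

Every path whose running probability dips under the threshold gets cut immediately, which is why this is so much cheaper than sampling thousands of full completions and throwing most of them away.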

The most innovative part of the paper is their Product of Experts selection strategy. Once the model generates a candidate solution, they don't just trust it blindly: they re-evaluate its probability across different augmentations of the input, like rotating the grid or swapping colors. If the solution is actually correct, it should look plausible from every perspective, so they take the geometric mean of those per-view probabilities to filter out hallucinations. It's basically the model peer-reviewing its own work from different angles to make sure the logic holds up (see the sketch below).
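To make the aggregation concrete, here's a tiny sketch of why the geometric mean punishes inconsistency. Again, this is illustrative only: the rotations are just a subset of the paper's augmentation set, and the per-view probabilities are made-up numbers.

```python
import math

def rotate90(grid):
    """Rotate a list-of-lists grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def views(grid):
    """The four rotations of a grid. The paper's augmentation set is
    larger (flips, color permutations), but the aggregation is the same."""
    out = []
    for _ in range(4):
        out.append(grid)
        grid = rotate90(grid)
    return out

def geometric_mean(probs):
    """Geometric mean computed via the mean log-probability."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

# Sanity-check the augmentation helper on a 2x2 grid.
grid = [[1, 2],
        [3, 4]]
print(views(grid)[1])  # [[3, 1], [4, 2]], the grid rotated once

# Hypothetical per-view probabilities for two candidate solutions: a
# correct answer stays plausible from every angle, while a hallucination
# looks great from one view and collapses from the others.
per_view_probs = {
    "correct":       [0.6, 0.5, 0.6, 0.5],
    "hallucination": [0.9, 0.01, 0.9, 0.01],
}
for name, probs in per_view_probs.items():
    print(f"{name}: geometric mean = {geometric_mean(probs):.3f}")
# correct: 0.548, hallucination: 0.095
```

A single bad view drags the geometric mean toward zero, which is exactly the behavior you want: a hallucination that only looks right from one angle can't outrank an answer that is merely decent from every angle.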

What’s remarkable is that all of this was done with smart engineering rather than raw compute. You can literally run this tonight on your own machine.

The code is fully open-source: https://github.com/da-fr/Product-of-Experts-ARC-Paper

  • neon_nova@lemmy.dbzer0.com · 3 hours ago

    I don’t know much about running this on my own computer other than using ollama. Is that what you mean about running it on my own?