If so, are these programs that claim to ‘poison’ the training datasets effective?

  • fiat_lux 🆕 🏠@lemmy.zip
    22 days ago

    We can see that it’s solved by the fact that AI models continue to get better despite an increasing amount of AI-generated data being present in the world that training data is being drawn from.

    Even if it logically followed that model improvement means model collapse is a solved problem, which it absolutely doesn’t, the premise that models are improving to a significant degree is itself up for debate.

    [Figure: line graph of the MMLU-Pro (Massive Multitask Language Understanding) benchmark vs. time, 07-2023 to 01-2026, showing plateauing values]

    A lot of people really want to believe that AI is going to just “go away” somehow, and this notion of model collapse is a convenient way to support that belief

    Model collapse may for some people be an argument used to support a hope that AI will go away, but the reality of that hope does not alter the validity of the model collapse problem.

    You can tell it’s not a solved problem because researchers are still trying to quantify the risk and severity of collapse - as you can see even just from the abstracts in the links I provided.

    Some choice excerpts from the abstracts, for those who don’t want to click the links:

    Our results show that even the smallest fraction of synthetic data (e.g., as little as 1% of the total training dataset) can still lead to model collapse

    …we establish … that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions … are indeed necessary: Without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set.

    • XLE@piefed.social
      22 days ago

      It’s really interesting reading a conversation between somebody who knows what they’re talking about, providing sources, and a known troll (FaceDeer) who can only go “nuh-uh” and complain about ghosts.