FaceDeer

FaceDeer@fedia.io · 3 days ago

Alright, so instead of simply saying “include external data in your training run”, extend that to “and also filter the data to exclude erroneous stuff.” That’s a routine part of curating training data in real-world AI training as well, I was already writing a lot so I didn’t feel like adding more detail there would have enhanced it.

The basic point remains the same, that real world training accounts for the things that were necessary to force model collapse to happen in that old paper I linked. It’s a solved problem. We can see that it’s solved by the fact that AI models continue to get better, despite an increasing amount of AI-generated data being present in the world that training data is being drawn from. Indeed, most models these days use synthetic training data that is intentionally AI-generated.

A lot of people really want to believe that AI is going to just “go away” somehow, and this notion of model collapse is a convenient way to support that belief. So it’s very persistent and makes for great clickbait. But it’s just not so. If nothing else, the exact same training data that was used to create those earlier models is still around. AI models are never going to get worse than they are now because if they did get worse we’d just throw them out and go back to the earlier ones that worked better, perhaps re-training with the same data but better training techniques or model architectures.

FaceDeer@fedia.io · 3 days ago

Model collapse comes from using only training data generated by previous generations.

All that’s needed to avoid it is to add training data that isn’t directly from the previous “generation” of the LLM in question. The thing that causes model collapse is the loss of data from generation to generation, so you just need to keep the training data “fresh” with stuff that wasn’t directly generated by the earlier generation of your model.

You could do that with archived material you used for previous training runs. For more recent events you could do that with social media feeds. The Fediverse, for example, would probably be a perfectly fine source of new stuff. Sure, there’s some AI-generated stuff mixed in, but that’s not “poison.”

As I mentioned, the article that demonstrated model collapse did it using a very artificial set of circumstances. It’s not how real AI training is done.

FaceDeer@fedia.io · 3 days ago

Indeed! Plus, building that giant corruption-riddled eyesore will surely be cheaper than just having adequate security at the next White House Correspondents Dinner. They surely saved enough to pay for it by not having metal detectors at the hotel this time.

FaceDeer@fedia.io · 4 days ago

The main mechanism leading to model collapse in that paper, as I understand it, are the loss of “rare” elements in the training data as each generation of model omits things that just don’t happen to be asked of it. Like, if the original training data has just one single line somewhere that says “birds are nice”, but the first generation of model never happens to be asked what it thinks of birds then this bit of information won’t be present in the second generation. Over time the training data becomes homogenized. It probably also picks up an increasing load of false or idiosyncratic bits of information that were hallucinated and get reinforced due to random happenstance, it’s been a long time since I read the article and details slip my mind.

I’m really not seeing how human filtering would mimic this process, so I think it’s safe. The filtering is being done with intent in that case, not due to random drift as is done with a purely automated generation like was done in the paper.

FaceDeer@fedia.io · 4 days ago

Semantic quibbling is one of the least interesting kinds of internet debate, so replace the word “understanding” with whatever word makes you happy. I continued with “and talking about” right afterwards so you can just delete the word entirely and the sentence still works fine. You could have just kept reading.

Since you didn’t read the rest of my comment, I should note that the rest of it after that sentence is about the other issue that OP raised and not even about model collapse at all.

Anyway. The article about model collapse that I see still crop up every once in a while is this one. It’s not that it has “methodological errors”, though, it’s just that it uses a very artificial training protocol to illustrate model collapse that doesn’t align with how LLMs are actually trained in real life. It’s like demonstrating the effects of inbreeding in animals by crossing brothers and sisters for twenty generations straight - you’ll almost certainly see some strong evidence, but it’s not a pattern of breeding that you are actually going to see in the wild.

FaceDeer@fedia.io · 4 days ago

The AI will exist either way, and people who use that AI will discover these exploits with it. I’d rather it be Mozilla.

FaceDeer@fedia.io · 4 days ago

Only in trivial cases where the training data isn’t being curated properly. There was a paper done on the subject a few years back where “model collapse” was demonstrated by repeatedly training generation after generation of models on the output of previous generations, and sure enough, the results were bad. This result gets paraded around every once in a while to “prove” that AI is doomed. However, in the real world this is not remotely close to how AI is actually trained. You can prevent model collapse simply by enriching the training data with good data - stuff that is already archived, that can’t be “contaminated.”

Indeed, the best models these days are trained largely on synthetic data - data that’s been pre-processed by other AIs to turn it into stuff that makes for better training material. For example a textbook could be processed by an LLM to turn it into a conversation about the information in the textbook, with questions and answers, and the result is training data an AI that’s better at understanding and talking about the content than if it was just fed the raw text.

If so are these programs that claim to ‘poison’ the training datasets effective?

This is a separate issue from the usual “model collapse” argument. I assume you’re talking about stuff like Nightshade, which claim to put false patterns into images that cause AIs to miscategorize them. These techniques are also something that only works in a “toy” environment, these adversarial patterns are tailored to affect specific AIs and won’t work on other AIs they weren’t specifically designed for. So for example you might “poison” an image so that a classifier based on Dall-E would become confused by it, but a GPT-Image classifier wouldn’t care. The most obvious illustration of this is the fact that humans are a separate lineage of image classifier and these “poisonings” have no effect on us.

There’s also the added problem that these adversarial patterns tend to be fragile, they break if you resample the image to resize or crop it. Since that’s usually a routine part of preparing training data for an image AI it may end up making the poison ineffective even for image AIs that it was designed for.

Essentially, all these things are just added background noise of the sort that AI training operations already have mechanisms for dealing with. But they make people feel better, I suppose.

FaceDeer@fedia.io · 4 days ago

The US is already trying to throw its economic weight around bullying Canada, and we’ve already settled in to an effective economic defensive posture. Those trade deals with China are actually part of it, previously we were supporting various American initiatives to tariff China but the Americans tore up a bunch of agreements with us so we responded in kind. It’s unfortunate but they started it and we’re prepared to hold our own.

FaceDeer@fedia.io · 4 days ago

It’ll be good for consumers worldwide. America is not the whole world.

I, for example, am in Canada. We’ve established a bunch of very nice trade deals with China recently, we’re going to end up with access to a bunch of Chinese products that Americans can’t get due to their self-imposed trade war with China.

FaceDeer@fedia.io · 5 days ago

That “thought” thing that you applied to the situation, Donald doesn’t do that part.

FaceDeer@fedia.io · 6 days ago

Not without hanging herself.

FaceDeer@fedia.io · 6 days ago

He might, confabulation is not uncommon in advanced dementia. In the end it doesn’t really matter though. Just imagine Trump is some sort of alien robot spewing nonsense and focus on what America does.

FaceDeer@fedia.io · 6 days ago

Immediately after the big announcements about Mythos there were followups by other teams that were able to find most of the same vulnerabilities with other existing models. I think the main takeaway there was that it’s just a matter of actually looking. Anthropic’s advantage may have been in the framework that let them do so in industrial-scale quantity rather than the cleverness of the particular model they used.

This sort of security scan is still new and important to pay attention to, but it’s not something that’s unique to Anthropic or that can be kept “contained.” Shades of how GPT-2 was considered “too dangerous to release” back when it first appeared. Comical in hindsight, and impossible to prevent anyway.

FaceDeer@fedia.io · 8 days ago

Kimi K2.6 is close to Opus. It beats Opus 4.6 on the benchmarks, so if Opus 4.6 was sufficient for your needs then Kimi K2.6 should be on par.

If you literally can’t access Opus because Anthropic cut you off I suspect that matters more than a slight difference in benchmarks.

FaceDeer@fedia.io · 8 days ago

The one currently making the headlines is Kimi K2.6, on the benchmarks it’s just short of Opus 4.7. It’s a trillion-parameter model so it won’t run on desktop computers, but it’s something a company could run on reasonably buildable servers for their own use.

For local use, I’ve been finding Qwen3.6’s 35B parameter model to be uncannily good. Gemma4 is also good, that’s one of the Western ones. These models won’t do the sort of heavy lifting that Opus can do but you don’t need that heavy lifting for all tasks.

FaceDeer@fedia.io · 8 days ago

Ironically, this is a great case study to illustrate the value of Chinese models. They’ve released a number that are on par with Claude’s latest models under “open weight” licenses that would allow you to run them yourselves if you wanted to, or to hire some other third party to provide API access. It wouldn’t matter what the original company’s “usage policy” is in that case.

There are a couple of Western open models that aren’t bad either, but they tend to be aimed at a smaller and simpler use case than Claude.

FaceDeer@fedia.io · 10 days ago

A while back a friend of mine told me that Trump had “crossed a line” with his blasphemous Jesus image, and I quietly privately gritted my teeth. Seriously, that’s a line? He was never a Trump supporter to begin with but this was worse than all the child rape and the ruination he’s bringing to the world?

Same thing here. If US conservatives decide to turn on the Israelis over this, sure, I’ll be happy they turned on them. But that this is the thing that does it, and not all the genocide and apartheid and whatnot, that doesn’t exactly put them in my good books.

FaceDeer@fedia.io · 10 days ago

Careful, it might advance to the part where he actually does it.

FaceDeer@fedia.io · 10 days ago

Reminder to the crews of those ships: matresses are flammable.

FaceDeer@fedia.io · 11 days ago

I feel like glue wouldn’t last as long. I have no idea how long the nail was, but if it’s a couple of inches deep I have no idea how you’d be able to get it off without an angle grinder. With glue you can chip it off, or use solvent, or a blowtorch.