When are people going to realize that an LLM is not a calculator and doesn’t actually know anything?
Well, first the AI tech corporations need to stop advertising that AIs can do all this.
That it is not a calculator and is horrible at determinism is not debatable; however, its huge (and very biased) store of knowledge is its core feature.
The models themselves are actually deterministic. The non-determinism you see is artificially introduced at the sampling layer to make the output seem more human. It’s usually controlled by a setting called “temperature”, which when set to 0 will give completely reproducible results.
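As a minimal sketch of what that setting does at the sampling step (toy NumPy with made-up logits, not a real model):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Pick the next token id from raw model scores (logits)."""
    if temperature == 0:
        # Greedy decoding: always take the single most likely token,
        # so the same input always produces the same output.
        return int(np.argmax(logits))
    # Higher temperature flattens the distribution, so less likely
    # tokens get picked more often and output varies between runs.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])  # toy scores for a 4-token vocabulary
print(sample_next_token(logits, temperature=0))    # always 0
print(sample_next_token(logits, temperature=1.0))  # varies from run to run
```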
This is correct; I suppose you’re talking about sampling from the final softmax layer? When I said they are bad at determinism, I meant that reasoning over deterministic rules doesn’t produce deterministic output. For example, LLMs make logical deduction errors, calculation errors, etc.
How come it’s inaccurate about 40% of the time when I know the answer then? It’s a bullshit factory. A chatbot that’s fundamentally designed to sound like a person and be able to respond to any prompt. But truth isn’t any part of the fundamental architecture of an LLM.
Bullshit factory is very apt. I was using it for an open book exam and it gave answers entirely skewed to the way the question was asked.
For example, if I asked “is X bacteria a pathogen in Y disease”, it would say yes, it was a very bad pathogen.
If I asked “what effects does X bacteria have in this body system”, it said it was a beneficial bacteria.
Never trust the AI summary, you have to fully read the studies.
It does lie and hallucinate a lot, especially with biased context in the question (the bullshit part). The (biased) knowledge is hiding somewhere in its weights, it is just that it is sometimes quite hard to recover.
Your 40% depends a lot on how you ask the questions and the field of these questions. Humanity’s Last Exam is a more objective benchmark for measuring the broad knowledge of LLMs.
Dude, they fail that exam with even worse error rates than I see!
When you can verify it, it’s OFTEN and REGULARLY wrong. It’s stupid to trust it for anything you can’t personally verify.
The designed purpose of LLMs is to respond to human interaction, not to be correct. They are the showoff who pretends he can answer every question. They are the confident drunkard at the bar who will tell you anything that pops into their head. Intelligent, knowledgeable people say “I don’t know” when they don’t know. LLMs don’t do that. Ever. Trouble is, they don’t “know” anything. They’re a chatbot from the bottom up. Chatbot through and through. It’s their fundamental nature.
Yes there was knowledge and deep understanding in their training data. Also, I ate chicken curry for tea. However, I am not a chicken, I do not cluck, I haven’t started eating worms, I cannot produce any chicken, and my poop is not chicken either. My poop smells faintly of curry. So it is with LLMs and the knowledge and understanding in their training data.
They beat any human on that knowledge benchmark, completely unrelated to your 40% “test”. Try to answer any of the example questions on the main page.
I don’t need a metaphor; I know LLMs hallucinate, lie, and bullshit. That doesn’t invalidate my point.
Probably never. Just like people never realized how computers work, how networks work, how businesses work, how economies of scale work, how financial markets work, how…
We the people don’t give a shit about how anything works, for the most part. Exceptions include your narrowly focused expertise. We convince ourselves that we understand things, using top-down perspectives, because it’s easier than actually understanding things from a bottom-up perspective.
Even the strongest critics of AI can’t substantively explain how AI works. They use misnomers like “glorified autocomplete” to reason about its inaccuracy, rather than understanding the fundamental limitations of the approach used.
Imagine that: software that performs strictly language-specific operations can’t do math.
It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close — and the differences are large enough to cause a hypoglycaemic emergency.
OK I wonder if there’s something wrong with the photo.
The photo:

[image: a cheese sandwich on a plate]
WTF!!??
That’s like estimating the carbs in 2 slices of standard sandwich bread! Of course not all bread has the same amount of sugar, but a reasonable range based on an average should be a dead easy answer. I thought the headline sounded crazy, but read the article and it actually gets worse. I have said it many times before: these AI chatbots should not be legal, they put lives at risk.
To be fair there’s no way of knowing what the filling is, so the AI may be guessing based on that too
The apps are advertising that they can do this tho. Many of them are aggressively sponsoring YouTubers who advertise you can basically just wave your phone over the food and it takes away all the “work” from traditional calorie counting apps
Friendly reminder that LLMs don’t do math, they guess what number should come next, just like words.
It can probably link the image to the words “a photo of a sandwich on a plate”, and interpret the question as “how many calories are in a sandwich”, but from there it is just guessing at the syntax of an answer, not finding any truth.
It knows sandwiches have calories and those tend to be 3-4 digit numbers, but also all numbers kinda look the same, so what’s to say it’s not 2, 5, or 12 digits?
Tool-powered agents can do math though. The issue is the fuzziness of it trying to guess carbs. It doesn’t know weight, ingredients, or anything other than a picture. These tools can be useful but not for this. Maybe one day but not yet.
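To sketch what “tool-powered” means here (the tool-call shape is hypothetical, and the toy calculator below is just an illustration of delegating arithmetic to real code):

```python
import ast
import operator as op

# A toy "calculator tool": the model emits something like
# {"tool": "calc", "input": "2*17+6"} and the runtime does the exact math,
# instead of the model guessing digits token by token.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calc(expr: str) -> float:
    """Safely evaluate a basic arithmetic expression (no eval())."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

print(calc("2*17+6"))  # 40, computed exactly rather than predicted
```

The math becomes exact once it’s delegated, but the carb guess stays fuzzy: no tool can recover the weight or ingredients from one photo.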
Whoever claims an AI (LLM or agent) can do that, and charges their users for it, is lying to and defrauding them.
But the AI assumes itself infallible; at the least it could ask…
That’s true, it should ask follow-up questions, or at least clarify its assumptions
Nope, Claude and Gemini both guessed fewer carbs than are in the bread.
What in the picture indicates any form of filling?
What you can see is cheese, there is probably butter too, but those 2 have zero carbohydrates, so adding carbohydrates based on filling would be pure speculation.
There are no carbohydrates to see beyond the bread.
There is no evidence of any filling, as there is zero bulge in the bread.
The answer should be based on what can be seen, with a remark to that effect, and that there could possibly be more if it contains filling that isn’t visible. The AI could ask about a possible filling, instead of just making shit up with zero evidence.
To your point -
If a friend texted me the same picture and question, I would do exactly what you described. Try to give a calculated guess that wouldn’t change.
Unless I was lazy and Googled it.
Google’s carbohydrate tool says 8g, then the AI overview goes on to contradict that by saying “A standard cheese sandwich typically contains between 25 and 35g.”
They put lives at risk the same way every single product at your local home improvement store does. When you misuse a tool for a purpose it wasn’t intended and isn’t good at, you’re going to get bad results.
This is an issue for the educational system, not the legal system.
What if the packaging on every tool at home depot grossly misrepresented its capabilities and/or purpose?
This chainsaw cures cancer? Hot damn somebody call RFK!
Concrete mix goes great with pancakes, etc.
Does OpenAI claim ChatGPT is fit for those purposes? No.
The concrete itself will happily mix into your pancakes.
I think the whole point of this discussion is that the various peddlers of AI in fact do make wild claims about their capability.
My observation is that it’s largely the downstream AI consumers who repackage it irresponsibly. That said, I don’t hang on the words of Sam Altman, and it’s certain they are pushing the idea that AI is more capable than it is, but mostly what I see is them saying they built this thing, it does neat stuff, and it can probably do neat stuff for you, use your imagination.
I believe a lot of the folks developing these tools would be horrified at the irresponsible ways vendors and end users are using it.
Sam Altman is the face of OpenAI. He is responsible for misrepresenting the product he sells. If you’re going to sling blame around, then you had better observe the words of Sam Altman.
“The thing that I think will be most impactful on that five to ten year timeframe is AI will actually discover new science.”
This sick man is taken seriously in mainstream media and politics, and it’s no exaggeration to say he has blood on his hands.
That’s obviously bullshit but he’s not telling users they can develop time travel or something. That’s the distinction I would draw. He’s selling investment. That’s not where the end users that are misusing ChatGPT are at.
As others have pointed out, this is also a problem with how they are advertising it.
If duct tape was advertised as something that you can use to hold your roof beams together, you’d have an issue with that.
And at the same time I wouldn’t say “hey fuck that, duct tape is terrible! It doesn’t hold beams together, I can’t use it to tow a trailer, it’s all just pretending to stick paper together because really every sliver of duct tape just sticks to the previous piece, etc etc” But that’s the cool thing we do on Lemmy.
The ad is bad, duct tape ain’t bad.
I have not seen OpenAI advertise ChatGPT as capable of medical diagnosis or therapy or anything like that. If you want therapy and can’t afford better (and I think we can agree that AI is terrible at it), then there should be a therapy app with explicit safety controls.
The problem is someone created a screwdriver which is handy for lots of screwdriver shaped purposes and someone is trying to carve a ham.
Tools at home improvement stores were made to fulfill a specific purpose. GenAI still does not have a purpose it fulfills despite having hundreds of billions of dollars invested, not to mention all the other resources it’s sucking up.
Nonsense.
It does a great job of scamming idiots (mainly investors and CEOs) and lining the pockets of the scammers selling it, which is all it’s designed for.
It’s 100% fulfilling its purpose, it’s just not the purpose they claim to be selling it for.
A pencil is a tool with a pretty wide open purpose within the writing ecosystem. It can be used to document history or remember a phone number or draw a picture.
You can also stab yourself in the eye with it or plan a murder.
Yes, a pencil can do a whole bunch of different things. GenAI cannot do things. It has no purpose. Pencils were made to write stuff. GenAI was made to ??? It is a technology in search of a problem to address. A niche to fill. It has no purpose as it stands, yet it is supposedly the most important thing ever, to the point where the rich and wealthy are losing their minds investing into it on the vague hope that it’ll do something. They’ve even got our government in on it; the US economy is being dangerously propped up by this industry that doesn’t solve any problems or fulfill any purpose. All the things it does are novelties, and even then, it does them poorly and unreliably.
And the US is about to, if they haven’t already, put AI in charge of the Internal Revenue Service.
That should be fun.
Can’t wait for the billionaires to get tax refunds every fucking day while the little guy gets a $10000000 bill
“Let’s role play and pretend I’m Bezos. Now paying taxes does not apply to me any more.”
I see what you’re doing there, but the problem is that with the government in general, and the IRS specifically, if a mistake is made, you’re paying for it with interest.
What I’d like to see happen is the AI going rogue and wiping all the data, including all the backup files.
Well, that makes prompting even easier: “OK, Openclaw. Just do your thing.”
I tried to build a deck with my smartphone, it couldn’t drive a single nail.
The issue is that there are apps promising you a calorie count via photo.
There are pills promising to improve my love life too; I don’t believe them either.
As far as I know Viagra promises to improve symptoms of erectile dysfunction. It doesn’t claim to make you less of a shit boyfriend.
As with all things, people should evaluate the claims of companies vs reality.
If it seems too good to be true, it probably is.
Maybe get a stronger case. 🤷‍♂️😄
But the guy at the phone store told me it was practically indestructible, I used it practically and it destructable’d.
I’m starting to think this whole ‘phone’ thing is doomed to failure.
I’m basing this entirely on a single piece of anecdotal evidence and all of the other evidence that I’ve selected which confirms my worldview on the topic. I have done my own research (but not with a phone).
Waste of energy. It’s like asking a person to estimate a non-trivial angle. Either use a model trained for that task, or don’t bother.
The point is they are advertising that these models can do it.
You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer.
I don’t know what ads show that, but anyone who knows the first thing about LLMs knows you don’t get the same answer twice.
I’d understand this expectation 5 years ago, when most people weren’t familiar with it, but come on… you don’t need to feed it an image 500 times to see that.
Technically, you can get the same answer twice from an LLM, but only when you control the full input. When a model is being run, a random seed/hash is applied to the input. If you run the model locally you could force the seed to always be the same so that you would always get the same answer for a given question.
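For example, a rough sketch of a seeded local run with the Hugging Face transformers library (the model name and prompt are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any small local model works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("How many carbs are in a cheese sandwich?", return_tensors="pt")

for _ in range(2):
    torch.manual_seed(42)  # fixed seed: sampling draws the same random numbers
    out = model.generate(
        **inputs,
        do_sample=True,       # sampling is on, but seeded, so runs repeat
        temperature=0.8,
        max_new_tokens=30,
        pad_token_id=tok.eos_token_id,
    )
    print(tok.decode(out[0], skip_special_tokens=True))
# Both loop iterations print identical text; hosted APIs don't expose this control.
```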
Barely. Even with the code and seeds, it’s still a struggle to do that. There are plenty of questions from people running PyTorch and TensorFlow models who can’t reproduce results. Maybe you isolate enough variables that consecutive runs actually produce the same output, but the study is about commercial models. You’ll never get deterministic output from those.
The point is that:
- It is being used for this, even though it is obviously not capable of giving a reliable and realistic answer.
- It allows this usage, even though it is dangerous and not within its capabilities.
- Each model gives answers that vary wildly, something a human wouldn’t do. A human wouldn’t randomly give you an answer that is 10x higher for the same question.
They are non-deterministic by design.
LLMs are not deterministic like calculators. Wrong tool for the job.
I bought a small bag of cheap rice, and it didn’t help me to connect to God!
Try ergot wheat instead
If you supplied humans with the same image and asked for the same estimate I’d be curious to know the difference in results.
Mine would be: “I have no idea”, an answer the LLMs generally refuse to give by their nature (when they do decline to answer, it’s usually because something in the context indicates that refusing is the proper text).
If you really pressed them, they’d probably google each thing and sum the results, so the estimates would be as consistent as first google results.
LLMs have a tendency to emit a plausible answer without regard for facts one way or the other. We try to steer things by stuffing the context with facts roughly based on traditional ‘fact’ based measures, but if the context doesn’t have factual data to steer the output, the output is purely based on narrative consistency rather than data consistency. It may even do that if the context has fact based content in it sometimes.
Custom-built LLMs are awesome for specific purposes, in terms of dealing with data and providing resources; however, chatbots ain’t that.
Humans want to follow whatever makes sense to them, they use AI because it’s confident. AI just replaced their god.
Bruh, a couple of months ago I asked it (Gemini) to check the number of characters, including spaces, in a potential game character name, because I was working at the time and couldn’t stop to check my in-head count. It told me 21; I had counted 20. I thought I must have gotten distracted and miscounted. Later, when I had time to actually focus on the issue, it turned out the AI had miscounted a 20-character string (maybe counting the null terminating character?).
AI doesn’t see individual characters, it sees tokens, with most tokens being a word or part of a word. That’s why per-character questions have such a high failure rate.
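You can see this yourself with a tokenizer, e.g. OpenAI’s tiktoken library (the cl100k_base encoding here is just one example):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

for text in ["strawberry", "令牌", "12345.67"]:
    ids = enc.encode(text)
    # Decode each token id on its own; some pieces may print as '�'
    # because a single token can be a fragment of a multi-byte character.
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r}: {len(text)} chars but {len(ids)} tokens -> {pieces}")
```

The model only ever sees the token ids, so “count the characters” asks it about units it never directly observes.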
If it doesn’t understand the simple concept of the number of letters and spaces, it needs to be reprogrammed.
ETA: sorry folks, not gonna change my view and simp for shit A.I., continue with the downvotes.
It doesn’t understand anything though? It never will. It’s a probability machine. If you choose to believe its output, that’s on you. I use it as a coding assistant to get boring things done faster. Fire a prompt at claude code, grab a coffee, check out the diff. But that last step is crucial. Can’t trust AI output blindly.
The embedding layer after tokenization is not just a probability machine in the way you’re suggesting. You can argue that it is probabilistic with inferred sentiment, but too many people think it works like the text prediction on your phone, and that is just factually inaccurate.
Verify output of course, but saying “it doesn’t understand anything” and “probability machine” is a borderline erroneous short sell. At the level of tokens it “understands” relationships, and those relationships are not probabilistic, though they are fundamentally approximated based on a training corpus.
Can you explain how it’s more than probability? It’s using a neural network to guess the most likely next token, isn’t it?
You could also say that it chooses what the next word it says to you will be. It has a few words to choose from, which it has selected in relation to the previously spoken words, your question, and previous interactions (the context). The probability you’re talking about (a number) could also be seen as its preference among those words. I’m not sure the probability vocabulary/analogy is necessarily the best one. The best might be to not employ any analogy at all, but then you have to dig deeper into the subject to form an informed opinion. This series of videos explains it better than I do: https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
The fact that it uses a non-trivial neural network. If it were simply a frequency count, based on a corpus, of how often each word follows each other word, it wouldn’t be any stronger than keyboard word prediction. Making accurate suggestions requires the emergence of primitive reasoning about the semantics of the tokens; LLM neural networks (transformers) can be analyzed to find subnetworks dedicated to modeling reality. It is still probability, but saying it’s just probability is not a faithful description.
It’s still just predicting the next token, it’s just using more past data points than your keyboard. The rest of the phenomena are emergent from that. I think it’s important to keep that in mind given how much they can imitate human reasoning.
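For contrast, the keyboard-style “frequency count” baseline mentioned a couple of comments up really is just counting; here’s a minimal sketch (toy corpus, Python stdlib only):

```python
from collections import Counter, defaultdict

# The keyboard-prediction baseline: count, in a corpus, which word follows which.
corpus = "the cat sat on the mat the cat ate the fish".split()

follow = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow[prev][nxt] += 1

def predict_next(word: str) -> str:
    # Most frequent follower; no semantics, no context beyond a single word.
    return follow[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" (follows "the" twice vs. "mat"/"fish" once)
print(predict_next("cat"))  # "sat" (ties break by first occurrence in the corpus)
```

A transformer is still picking a next token, but it conditions on the entire context through learned representations rather than raw co-occurrence counts, which is where the qualitative difference comes from.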
How many letters are there in 令牌? It’s a simple question, right? You wouldn’t need to search for it to find out, would you?
ah right, and my eyes need to be recreated because they can’t see ultraviolet
If there’s anything I learned about counting jelly beans in a jar, the correct answer is the average.
AI gave you all the needed data, you just didn’t know how to use it.
People should read the top comments on Hackernews instead of anyone here, they’re more informed on the topic than Lemmy is
Yeah, if you’re after AI fanbois you should head over there. They’re not that bright, but if you check show and tell you can see what Claude’s been up to the last two days.
HN is full of techno fascists
Better yet, download Qwen 3.5/3.6, with a “raw” notepad like Mikupad. Try it yourself:
https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF
https://github.com/lmg-anon/mikupad
One might observe:

- Chat formatting, and how janky the “thinking” block is.
- How words are broken up into tokens, not characters.
- How particularly funky that gets with numbers.
- Precisely how sampling “randomizes” the answers by visualizing “all possible answers” with the logprobs display (a rough sketch of the idea follows below).
- And, thus, precisely how and why carb counting in ChatGPT fails, yet a measly local LLM on a desktop/phone could get it right with a little tooling or adjustment.

This is exactly what OpenAI/Anthropic don’t want you to do. They want users dumb and tethered, like a cloud subscription or social media platform. Not cognizant of how the tools they are peddling as magic lamps actually work. And why, and how, they’re often stupid.
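If Mikupad feels like too much setup, a rough transformers sketch shows the same “all possible answers” idea (gpt2 is just a stand-in for any local model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for any local model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("A cheese sandwich contains about", return_tensors="pt")
with torch.no_grad():
    # Scores over the whole vocabulary for just the *next* token.
    logits = model(**inputs).logits[0, -1]

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, 5)  # the top candidates, like a logprobs display
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode([int(i)])!r}: {p.item():.3f}")
```

Every “answer” is just one draw from a distribution like this, which is exactly why the carb counts wander from run to run.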