There’s a rumor going around that computers can now read your thoughts. That “AI” can pick out images you’re thinking about.
Here’s one of many similar breathless headlines: “The AI that can read your mind: Chilling Black Mirror-style machine recreates the image you’re thinking about by decoding your brain signals”.
Not a chance, thought I, and no thought-reading AI, either. There is no way a computer will be able to construct an image I am thinking of, if only because the images in our minds are fleeting, quickly morphing from one thing to another, and always somewhat vague. We none of us think like photographs, which display static images, even though we can fix brief fuzzy moments.
So something other than the headline must be true. And is. Let’s see what the people behind the headline did, which is interesting, and related to the excitement over “deep fakes” and the like.
The paper is “High-resolution image reconstruction with latent diffusion models from human brain activity” by Yu Takagi and Shinji Nishimoto. As is typical, it is full of jargon, strange grammar and obscure ideas. So let me translate.
They use something called “diffusion modeling”. Here is a great simplification of the steps for creating and using a diffusion model for images and “deep fakes”:
1. Take a digitized picture of a house, labeled house, so that we can separate house models from cat models. Add some “noise” to that house picture. “Noise” here means shifting pixels away from the colors they should have.
2. Take the known noisy image and the original image and apply a model that estimates what the error in each pixel is. This is quite easy, because pixels only have a small set of possible values, and because we know the true image, so we know exactly how much error is at each pixel.
3. Repeat step 2 many times for different noise sets, because although the possible noise sets are not infinite, they are still large. In the end you have a model which can take an image and remove the noise from it, to a certain degree, which is to say, not perfectly, by subtracting from the noisy image the estimate of the noise.
4. Now invert things. Feed just noise to the model and have the model try to remove it. Which it will! The end result will be a house, or something resembling the original picture. The image will not be a cat because the estimates are from the house model.
5. Add other images, also called house, and begin the process anew. Add images of cats and do the same thing. Add cats on houses. And so on. Collect the models and their labels.
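The steps above can be sketched in a few lines. This is a deliberately trivial toy, not a real diffusion model: the “model” here just learns the average clean image, and the “house” array is made up. But it shows the key trick: subtract the estimated noise, and out comes whatever the model was trained on.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up "house" image: a 4x4 grid of pixel intensities, standing in
# for a real labeled photograph.
house = np.array([
    [0.0, 0.9, 0.9, 0.0],
    [0.9, 0.5, 0.5, 0.9],
    [0.9, 0.5, 0.5, 0.9],
    [0.9, 0.9, 0.9, 0.9],
])

# Steps 1-3: make many noisy copies, and "train" the simplest imaginable
# noise estimator: one that learns the average clean image, so that its
# noise estimate is (noisy image) - (learned image).
noisy_copies = [house + rng.normal(0.0, 0.3, house.shape) for _ in range(500)]
true_noises = [noisy - house for noisy in noisy_copies]
learned_image = np.mean(
    [noisy - noise for noisy, noise in zip(noisy_copies, true_noises)], axis=0
)

def denoise(image):
    """Subtract the model's noise estimate from an image (step 3)."""
    estimated_noise = image - learned_image
    return image - estimated_noise

# Step 4: feed the model pure noise. What comes back is the house,
# because a house is all this model knows.
pure_noise = rng.normal(0.0, 1.0, house.shape)
reconstruction = denoise(pure_noise)
print(np.allclose(reconstruction, house))  # True
```

A real diffusion model removes the noise in many small steps, with a neural network doing the estimating; the toy collapses all of that into one subtraction, which is why it snaps straight back to its house.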
The thing to recognize is that, if one wants to make a “deep fake”, one must specify which model is being called on from this tremendous suite of models. If house is desired, and there are many house models, any can be picked. You can’t, however, just plug noise into some computer and get a picture of a clock tower back, because a model of something must be used, and if you want a clock tower you must say clock tower. There must be a mechanism to pick the right “target” model (like house).
That means if I hook an electronic phrenology device to your head, a.k.a. an fMRI, I could take its output, treat it as noise, and give it to, say, the house model. That model will produce a picture, to some degree of accuracy (or inaccuracy), of a house. It will not produce a picture of a cat, a clock tower, or anything else.
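To see why the label does the work, here is a toy sketch (all names and arrays are hypothetical, not the paper’s code): each label gets its own “model”, and whatever passes for noise, including an fMRI readout, comes back looking like whatever that model was trained on.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-label "models". Each one just remembers the average
# image it was trained on -- a crude stand-in for a diffusion model.
models = {
    "house": np.full((4, 4), 0.8),  # bright, blocky average
    "cat":   np.full((4, 4), 0.3),  # darker average
}

def generate(label, noise):
    """Denoise with the model chosen by label; the label decides the output."""
    learned_image = models[label]
    estimated_noise = noise - learned_image
    return noise - estimated_noise

# Treat an fMRI readout as the "noise" (random numbers standing in for it).
fmri_as_noise = rng.normal(size=(4, 4))

# Same "noise" in, but the label dictates what comes out.
print(np.allclose(generate("house", fmri_as_noise), models["house"]))  # True
print(np.allclose(generate("cat", fmri_as_noise), models["cat"]))      # True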
Okay so far?
All right, here’s what our authors did.
They took some images from the Natural Scenes Dataset (NSD), which “provides data acquired from a 7-Tesla fMRI scanner over 30–40 sessions during which [many subjects] viewed three repetitions of 10,000 images”. These were small pictures, around 425×425 pixels each. (Also, 7 Teslas is a very strong magnet!) Our authors took data from four people in this set.
Incidentally, the fMRI data itself is not “raw” but the result of other models applied over “regions of interest” in the brain. We can ignore these details.
Our authors split the viewing sessions for each person, one set for training the authors’ model and another for testing.
They checked accuracy by somebody saying, in effect, “Yes, the image from the model was similar enough to the image the person looked at.”
The model came in parts: one using just the fMRI data (“z”), one using just the name of the image (“c”, like house), and one using both put together (“z_c”). Here’s the result they showed, which they admit was picked as the best-looking result from just one person:
They would have done us a great service by also showing us the worst. Well, space in papers is limited.
The “ground truths” are the images the people stared at, “z” is the reconstruction from the fMRI alone, “c” the reconstruction from the text using fMRI “noise”, and “z_c” both of them together.
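A toy sketch of the three routes, with made-up arrays standing in for the paper’s actual machinery (the “z”, “c”, “z_c” names are theirs; everything else here is my invention):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up stand-ins: the average "house" a label model remembers, the
# image the subject actually saw, and a very noisy fMRI readout of it.
label_average = np.full((4, 4), 0.7)
image_seen = label_average + rng.normal(0.0, 0.2, (4, 4))
fmri_readout = image_seen + rng.normal(0.0, 0.8, (4, 4))

recon_z = fmri_readout                    # "z": fMRI alone -- noisy
recon_c = label_average                   # "c": label alone -- generic
recon_zc = 0.5 * recon_z + 0.5 * recon_c  # "z_c": both combined

def error(recon):
    """Mean squared distance from the image the subject saw."""
    return float(np.mean((recon - image_seen) ** 2))

print(error(recon_z), error(recon_c), error(recon_zc))
```

In this toy the label-only route already lands near the right answer, and mixing in the fMRI readout mostly adds noise back, which is the skeptic’s point: the heavy lifting is done by knowing what the image was called.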
Look at the top row, the second image to the right. A brownish blob. If you didn’t know it was supposed to be a train, you’d never guess. Looks more like a blue-eyed mamma bear with arms around a baby bear, both covered by a bear rug.
If you didn’t possess the known names of the images, and their models (formed as in the steps above), which the authors had and used, you’d get nowhere. Computers are not reading your minds, and they are not “seeing” images you are thinking of.
There’s more to it, of course, and it is clever work and interesting. But you’re in no danger of having your minds read by a computer.