NPR has an amusing report on what happened when ChatGPT was asked to explain or design rockets. "ChatGPT – the recently released chatbot from the company OpenAI – failed to accurately reproduce even the most basic equations of rocketry. It wasn't the only AI program to flunk the assignment. Others that generate images could turn out designs for rocket engines that looked impressive, but would fail catastrophically if anyone actually attempted to build them."
The bottom line is that the AI engines are trained to look at correlations between words/phrases/shapes/vectors. (A vector is a set of numbers. It could include a count of bushels of potatoes, Chase's holdings of rubles, the number of shark attacks this year, the phase of the moon, and the number of estimated voters in Sierra Leone. It isn't just a direction in 3-space.) ChatGPT is designed to use words to express those correlations in turn; others produce images for output.
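To make "correlations between vectors" concrete, here's a toy sketch in Python. The feature names and numbers are invented for illustration; a real model learns thousands of unlabeled dimensions, not bushels and rubles.

```python
import numpy as np

# Toy "vectors": each is just a list of numbers describing something.
# These features (bushels, rubles, shark attacks, moon phase, voters)
# are invented for illustration -- real language models use thousands
# of learned dimensions with no such human-readable labels.
item_a = np.array([120.0, 3.5e6, 14.0, 0.25, 2.1e6])
item_b = np.array([115.0, 3.2e6, 16.0, 0.30, 2.0e6])

# One common way to measure how strongly two vectors "line up":
# cosine similarity (1.0 = pointing the same way, 0.0 = unrelated).
def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(item_a, item_b))
```

Cosine similarity is just one convenient measure; the point is that "correlation between vectors" is ordinary arithmetic on lists of numbers, not understanding.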
I think we've been trying to duck philosophical questions about the nature of understanding or learning, and the nature of man. Instead we've used a purely mechanical model, trying to fit in with a purely mechanistic universe. It seems easier than trying to figure out what learning is (e.g. Socrates in Plato's Meno); it just doesn't seem to work very well. Some things are fairly easy to deal with this way--similar-sounding things aren't, as XKCD notes.
Perhaps it would help if we tried to categorize problems as "tractable using only correlations and statistics" and "not trivially tractable."
You could counter that a dozen years ago we would have classified chess as an example of the latter, and yet a computer, trained on randomly generated games scored only by chess's rule for victory, was able to learn to play skillful chess.
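Here's a toy sketch of that "randomly generated games scored only by the rules" idea, using tic-tac-toe as a stand-in, since a chess engine won't fit in a few lines. This is my illustration of the general approach, not the actual method any particular chess program used.

```python
import random

# Evaluate moves purely from random games scored by the game's own
# victory rule -- no human strategy is encoded anywhere below.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_playout(board, to_move):
    """Play random moves to the end; return 'X', 'O', or None for a draw."""
    board = board[:]
    while True:
        w = winner(board)
        if w or ' ' not in board:
            return w
        move = random.choice([i for i, s in enumerate(board) if s == ' '])
        board[move] = to_move
        to_move = 'O' if to_move == 'X' else 'X'

def estimate_move_values(board, player, playouts=1000):
    """Score each legal move by the fraction of random games it goes on to win."""
    opponent = 'O' if player == 'X' else 'X'
    values = {}
    for move in [i for i, s in enumerate(board) if s == ' ']:
        trial = board[:]
        trial[move] = player
        wins = sum(random_playout(trial, opponent) == player for _ in range(playouts))
        values[move] = wins / playouts
    return values

print(estimate_move_values([' '] * 9, 'X'))  # the centre square (4) usually scores highest
```

Real systems replace the random playouts with a learned evaluation plus search, but the scoring still comes only from the game's own victory rule.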
Look at the rocket assignment. It was given to an AI trained not on physics texts but on more general writing; with a different training set you could get accurate results. The chess program was trained on a selected set of rules--the rules of chess, not the rules and scoring of Go. To get accurate results from either project, you need humans to select the criteria for the training. Without a standard for accuracy, the rocket chatbot was drawing on whatever people on the net had written (IIRC; I could be wrong about the training), which was a mix of sloppiness and precision.
Tests showed that the ChatGPT system as initially offered had some dramatic biases on political topics--whether this was inadvertent (sloppy training set) or deliberate I don't know, though years in academia leave me with some suspicions.
With humans involved in the selection you can get invidious biases as well as accuracy. Without humans involved you can get rubbish.
If such a chatbot is going to be accurate and useful on some subject, the training set needs to be appropriate, and the data either needs to be accurate (on the whole) for that subject, or needs to cluster clearly enough that the system can crank out "there are disagreements, and here they are." It seems the system can also produce surprisingly good boiler-plate text to go along with the answers. And if you design it not to return certain classes of answers, you'll get side effects.
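Here's a toy sketch of the "clear enough clustering" case: group candidate answers and report the camps when they disagree. The sample answers and the two-cluster choice are invented, and scikit-learn stands in for whatever a real system would actually use.

```python
# Cluster short answer texts and report when they split into camps.
# The sample answers and the choice of two clusters are invented for
# illustration, not anything from an actual ChatGPT pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

answers = [
    "The engine's specific impulse is about 450 seconds.",
    "Specific impulse for this engine is roughly 450 s.",
    "The engine's specific impulse is closer to 300 seconds.",
    "Its specific impulse is around 300 s at sea level.",
]

vectors = TfidfVectorizer().fit_transform(answers)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

if len(set(labels)) > 1:
    print("There are disagreements, and here they are:")
    for cluster in sorted(set(labels)):
        print(f"  Camp {cluster}:")
        for text, label in zip(answers, labels):
            if label == cluster:
                print(f"    - {text}")
```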
Fresh insight? I haven't seen any. Possibly there are connections that it could find: A relates to B, but also sometimes to C; or A relates to B which relates to C. But I'd have to evaluate the claims myself to be sure. And how does my evaluation differ from that of the correlator?
That's the question, isn't it?
Under what conditions can we use an uninspected data set?
- If I want to collect information so I can evaluate it myself, I could ask the program to give me the associated extra information so I can evaluate the data set too. For example, trying to predict election results based on Facebook "likes" is futile unless you know how people not on Facebook differ from those who are--and how the people who bother with "likes" differ from those who don't (guilty). (A toy weighting sketch follows this list.)
- If my question does not matter, because it is for amusement, or because I forgot a detail and don't care about the rest of the answer.
- When there are multiple answers all equally valid. (What's the best fruit?)
- When there is a single universally known answer (Where does the Sun appear to rise?).
- When the correlations demand no new (to me) insight. I'm not sure how to quantify this one, because some insights are already out there, along with statements of some form of my problem. The AI will scoop them up or not depending on how its internal numbers shake out, not on whether the insights are coherent. I suspect this isn't quantifiable at all--it takes real understanding, and not mere correlations, to have insight.
- When I understand my own question. Questions that sound very similar can have wildly different answers. Does the AI have the training to distinguish similar questions, despite their strong correlations? And know when it doesn't matter?
- When I'll be satisfied with an average of what a bunch of people might say.
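Here is the toy weighting sketch promised in the first item above: a naive average over the "likes" sample versus an estimate reweighted by how common each group is in the whole electorate. Every number and group label is invented for illustration.

```python
# Toy post-stratification sketch.  All numbers are invented.
sample = {
    # group: (share of the "likes" sample, support for candidate A in that group)
    "heavy Facebook users": (0.80, 0.60),
    "light or non-users":   (0.20, 0.45),
}
population_share = {
    "heavy Facebook users": 0.40,
    "light or non-users":   0.60,
}

naive = sum(share * support for share, support in sample.values())
reweighted = sum(population_share[g] * support for g, (_share, support) in sample.items())

print(f"naive estimate:      {naive:.3f}")  # dominated by the heavy users
print(f"reweighted estimate: {reweighted:.3f}")
```

The catch is that the reweighting needs exactly the information the "likes" data doesn't contain: how the people who aren't there differ from the people who are.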
Going back to my test: write a poem in the style of "The Barrel-Organ" by Alfred Noyes. The original used long and strongly rhythmic lines--the AI didn't notice. The original flips scenes and meters between the city and the park--the AI picked a single different meter. The original speaks of the individual pains of passers-by--the AI was generic. It did pick up on one of the themes (people react to the music) and added a detail (the player packs up to go home). You could modify the AI to look for these kinds of details and register them--but would it then be suitable for writing a summary of cotton imports in Surinam? (Yes, with epicycles: keyword "summary" → "discard report-format details.")
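And here's a toy sketch of those epicycles: hand-written keyword rules bolted onto the generic generator. The keywords and rules are invented for illustration, not any real system's design.

```python
# Special-case keyword rules layered on top of a generic generator.
EPICYCLES = {
    "summary": "discard report-format details; keep figures and totals",
    "poem":    "attend to meter, rhyme, and scene changes",
}

def plan_response(prompt: str) -> str:
    """Pick extra hand-written rules based on keywords in the prompt."""
    rules = [rule for keyword, rule in EPICYCLES.items() if keyword in prompt.lower()]
    return "; ".join(rules) if rules else "generic correlation-driven text"

print(plan_response("Write a summary of cotton imports in Surinam"))
print(plan_response("Write a poem in the style of The Barrel-Organ"))
```

Every new kind of task needs another hand-written rule; the patches multiply, which is rather the point.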
I can see uses for such a system in shaping searches, and apparently the big tech boys see it too ("New Bing", and Google's "Bard"). I assume similar tools would be at the disposal of the Chinese government, and of any others that care to try to shape local opinion.