I’ll be teaching next week at the University of Minnesota’s Machine Learning Camp for high school students, which I founded back in 2018 with Dr. Melissa Lynn, Dr. Kaitlin Hill, and the MCFAM team. It’s a bit of a homecoming and a bookmark to my three years at CH Robinson. I looooooooove talking about the edge cases and problems and complications of data science, machine learning, and artificial intelligence, and with the explosion of ChatGPT and other Large Language Models (LLMs) into the common consciousness, as well as Midjourney, Stable Diffusion, and Dall-e for images, we’ve got a whole new set of examples to talk about! Examples I want to hit on: the lawyer who used ChatGPT with disastrous results, the “who is pregnant” problem in Large Language Models like ChatGPT, the marked downgrade in diversity in images from the image generators, and the pr0n-y problem with training on, um, well, certain kinds of fiction (not to mention similar problems on the image side). I don’t think I’ll manage to get to intellectual property problems per se (the “stolen data” problem).
Time is getting away from me, so this will be a multi-part set of posts. In this post I’ll tackle three misconceptions: that ChatGPT is a search engine, that ChatGPT is evaluated on truth, and that ChatGPT represents the real world.
Misconception 1: ChatGPT is a search engine. Related misconception: ChatGPT and other LLMs are trained to be truthful and are evaluated on the truth of their answers.
Fact 1: Nope, ChatGPT is not a search engine, and neither is Bard. The older version of ChatGPT was trained only on internet data up to about September 2021. ChatGPT now has plugins that allow “internet access” in some cases, and Bard can access the internet as well, but LLMs in general are trained and evaluated on plausibility, not truth.
A large language model estimates which words are most likely to follow a given string of words, based on its training data. Of course this is a bit oversimplified (you can adjust weights, add stochasticity, and so on), but the core point stands: what LLMs optimize for is plausibility/probability, not truth. The lawyer who used ChatGPT to assist with his research learned this the hard way. ChatGPT made up a number of cases that simply don’t exist. They don’t even make internal sense: one is supposedly a wrongful death suit about a guy, whose name changes in the first few pages, who missed his flight. He’s a fictional character invented by ChatGPT, but unlike a character invented by a human writer, he doesn’t even have the consistency to die! (ChatGPT says it’s a wrongful death suit, remember.)
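To make the plausibility-versus-truth point concrete, here’s a minimal sketch of next-token selection. Everything in it is invented for illustration: the prompt, the candidate continuations, their probabilities, and the single temperature knob standing in for the “adjust weights, add stochasticity” part. A real model scores tens of thousands of tokens at every step; the point is only that it ranks continuations by how likely they are to follow the prompt, not by whether they are true.

```python
import random

# Toy illustration of next-token selection. An LLM assigns each candidate
# continuation a probability of following the prompt, given its training data,
# and picks among the plausible ones. These candidates and numbers are made up.
prompt = "The court held that the airline was"
next_token_probs = {
    " liable":     0.45,  # plausible legal boilerplate
    " not liable": 0.40,  # equally plausible, says the opposite
    " negligent":  0.14,
    " fictional":  0.01,  # true of ChatGPT's invented cases, but implausible text
}

def sample_next_token(probs: dict[str, float], temperature: float = 1.0) -> str:
    """Sample a continuation in proportion to plausibility, sharpened or
    flattened by temperature (the 'add stochasticity' knob)."""
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights, k=1)[0]

print(prompt + sample_next_token(next_token_probs))        # usually something plausible
print(prompt + sample_next_token(next_token_probs, 0.1))   # low temperature: concentrates on the top-ranked options
```

Notice that nothing in this loop checks whether the chosen continuation is true. “Truth” never appears in the objective, only how often word sequences like this showed up in the training data.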
Misconception 2: ChatGPT is trained on real-life writing, so it reflects the world we live in, right? And since Stable Diffusion and other image-generation models are trained on images from the internet, they too represent the world, right?
Fact 2: Nope. When you skim “highest probability” paths off of life, you simply lose a lot of reality. Yes, ChatGPT overrepresents American experiences and writing — but *even given that* it loses a lot. Check out these examples from Kathryn Tewson (she’s got her own interesting AI story):
> Yes, I am going to keep going until I find The Most Masculine Job
>
> — Kathryn Tewson (@KathrynTewson) April 24, 2023
ChatGPT has specific gender associations for professions trained into it. Stable Diffusion, an image-generation model, shows the same phenomenon, *amplifying* disparities beyond reality: “Stable Diffusion depicts a different scenario, where hardly any women have lucrative jobs or occupy positions of power. Women made up a tiny fraction of the images generated for the keyword ‘judge’ — about 3% — when in reality 34% of US judges are women, according to the National Association of Women Judges and the Federal Judicial Center. In the Stable Diffusion results, women were not only underrepresented in high-paying occupations, they were also overrepresented in low-paying ones.”

The same phenomenon shows up with skin tone: Stable Diffusion “specifically overrepresented people with darker skin tones in low-paying fields. For example, the model generated images of people with darker skin tones 70% of the time for the keyword ‘fast-food worker,’ even though 70% of fast-food workers in the US are White. Similarly, 68% of the images generated of social workers had darker skin tones, while 65% of US social workers are White.”
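Here’s one hedged intuition for how “taking the highest-probability path” can amplify a skew rather than merely reproduce it. The sketch below is a toy, not a description of how Stable Diffusion (a diffusion model) actually samples; the 34%/66% split is the only number taken from the figures quoted above, and everything else is invented for illustration.

```python
import random
from collections import Counter

# Toy model: real-world share of women among US judges (34%, per the figures
# quoted above) vs. what comes out of two generation strategies. This is an
# intuition pump, not a claim about Stable Diffusion's actual sampler.
real_world = {"woman": 0.34, "man": 0.66}

def generate(strategy: str) -> str:
    if strategy == "match the data":
        # draw in proportion to the real-world distribution
        return random.choices(list(real_world), weights=list(real_world.values()))[0]
    # "most plausible": always emit the single most probable option
    return max(real_world, key=real_world.get)

for strategy in ("match the data", "most plausible"):
    counts = Counter(generate(strategy) for _ in range(1000))
    print(f"{strategy:>15}: {dict(counts)}")
# match the data : roughly {'man': 660, 'woman': 340}
# most plausible : {'man': 1000} -- the 34% share collapses entirely
```

Real image generators don’t literally take an argmax like this, but any training or sampling pressure toward the “most typical” rendering of a prompt pushes in the same direction, which is consistent with the amplification the quoted numbers show.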