From Frustration to Freedom

How AI Can Help Scientists Let Go of False Theories

Normal Frustration

It happened to me last week. I was working on some complex data manipulation, pre-processing, and merging problems. My analytical goal was simple, but before I could work on it, I needed my data cleaned and arranged in a certain way.

When we scientists spend endless hours on a computer program that makes calculations, a visualization that shows some relation, or a statistical model that disentangles distributions, we can become very frustrated. We spend a lot of time (and I mean days, weeks, months) figuring out how to measure and calculate and model stuff. And that is the more interesting part. Before we even get to modeling, we usually need to clean, arrange, merge, and whatnot.

Sometimes we approach these problems with a specific goal in mind. Often, this goal is testing a theory. We approach our data with certain expectations (some may say prejudices). In order to find out whether the theory is right or wrong, we go to enormous effort.

And then we find … nothing. This happened to me today, and it happened to thousands of scientists today. I had spent many hours on a complex process of data cleaning, wrangling, reshaping, and merging to arrive at a measure of association between two variables. I had a grandiose theoretical idea. My expectation was a colossal (positive) correlation between two time series. What I found was an association close to zero. I found nothing.
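
To make the tedium concrete, here is a minimal sketch in Python (pandas) of the kind of reshape-merge-correlate pipeline I mean. The file names, column names, and the simple Pearson correlation are illustrative assumptions, not my actual analysis.

```python
# Illustrative sketch with made-up file and column names: align two monthly
# time series and check the association I expected to be strongly positive.
import pandas as pd

a = pd.read_csv("series_a.csv", parse_dates=["date"])  # hypothetical source A
b = pd.read_csv("series_b.csv", parse_dates=["date"])  # hypothetical source B

# Collapse both sources to a common monthly index before merging.
a["month"] = a["date"].dt.to_period("M")
b["month"] = b["date"].dt.to_period("M")
merged = pd.merge(
    a.groupby("month", as_index=False)["value_a"].mean(),
    b.groupby("month", as_index=False)["value_b"].mean(),
    on="month",
    how="inner",
)

# The moment of truth: a simple measure of association between the two series.
print(merged["value_a"].corr(merged["value_b"]))
```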

The Frustration of Finding Nothing After Thinking There Really Should Be Something

Finding nothing can be frustrating. Finding the opposite of what you expected can be just as frustrating, but at least there is something. I personally find it worse to expect a pattern and find randomness. And I bet all of us working in science know these issues.

But I was not frustrated this time. Finding nothing when you expect something for theoretical reasons does not feel great, but the emotional toll was very small this time. I had experienced massive frustration before. I will tell you about my student days shortly; those were the days of data-analysis-related heartbreak. But not this time.

Enter ChatGPT: A Game-Changing Assistant for Data Preparation

Now comes the twist. ChatGPT helped me with the complex task of data preparation! It was not an easy job, because my requests and prompts were either extremely specific or pretty vague. Together, however, we managed to arrive at a data set ready for analysis. I do not know how long it would have taken without ChatGPT.

When I think back to my early student days, the same task might easily have taken me two or three times as long. I started my undergraduate studies in 2014. When I did not know how to perform a certain task in R, Python, or Stata, I googled my problem. Sometimes I would find something on the usual forums (thank God for Stack Overflow), or maybe someone had a solution on their GitHub. Sometimes I did not find anything. There were textbooks on programming languages, data processing, and statistical modeling as thick as bricks (and equally interesting), but they usually offered only standard solutions for standard problems.

Some tasks took me multiple weeks to solve. Some of the problems you work on are simply unique, so no one on the internet has a solution for you. And then, when I had taken more than a week to produce some calculation, I often found “nothing” (as in: not what I was hoping for). And then I went crazy. How can that be? I have this theory! It is inspired by Marx! It must be right! There was this study which indicated there should be a statistical association. And so on.

When Scientists Can’t Let Go

This happens often, in all scientific disciplines I presume. And when it happens to you, you go looking for a “fix”. You use a different measure. You use a different model. You aggregate the data. You look into subgroups. You exclude outliers. You add new variables to existing models. You consider finally finding out what “Bayesian” statistics is, because you sincerely believe becoming a “Bayesian” will solve your problem.

But your problem is this: you suffer from the sunk cost fallacy. It is the fallacy of believing that just because we have invested a lot of energy in something, there must be a payout. Often there is not, and we need to let things go. But that is hard.

You have invested a tremendous amount of your time to produce a measurement of something. And the measurement tells you that this something does not exist. But you want it to exist, you need it to exist, because you went through the effort of producing the measurement only because you believed that something existed. But it doesn’t. And you become furious.

Does this sound familiar? We do not need to go into the whole debate about p-hacking and HARKing. It is pretty emotional already: the roller coaster of figuring out how to do the analysis, only for the analysis to tell you there is nothing going on.

Freedom to Focus on Analysis

My impression is this: with ChatGPT, many tedious tasks of data preparation and wrangling go much faster. I am ready to analyze the data much sooner. And that is what I want. Analyze. I am not a data producer, and I am not a data engineer who enjoys cleaning and preparing.

In my research, I use “found data” or “administrative data”: records of human behavior and interactions that occur naturally as products of human and organizational actions. I use data generated by employees who record their interactions with clients for their employer's internal reporting system. I use government data on work which the national employment agencies collect as part of their governance efforts. I know these data are often not ready for analysis, because they were not produced for the eager eyes of sociologists. They need to be cleaned, adapted, reshaped, merged, and saturated with external information to become really interesting.
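
As a hedged illustration of what that saturation step can look like, here is a sketch in Python (pandas) with invented file names and columns: case-level records enriched with external regional statistics.

```python
# Hypothetical sketch: enrich administrative case records with external
# regional information before any analysis. All names are invented.
import pandas as pd

cases = pd.read_csv("case_records.csv")      # one row per recorded client interaction
regions = pd.read_csv("regional_stats.csv")  # e.g. labor market indicators by region

# Harmonize the merge key, which is rarely clean in administrative data.
cases["region_id"] = cases["region_id"].astype(str).str.strip()
regions["region_id"] = regions["region_id"].astype(str).str.strip()

# A left join keeps every case, even those without matching external information.
enriched = cases.merge(regions, on="region_id", how="left")

# Share of cases that could not be matched: worth knowing before modeling.
print(enriched["unemployment_rate"].isna().mean())
```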

But I am not interested in cleaning. Some of my colleagues may value the skills acquired via these tasks, because they want to feel like “real” data scientists or something like that. I do not. I want to write. Full stop. I need something to write about, and thus I read sociological literature, generate some ideas about the world, and look at some data to see whether the ideas correspond to reality.

The more time I invest before I can take that look, the more attached I become to my expectations. And everyone who has ever engaged with sociological theory knows these expectations are usually not even very clear. Sociological theories lack clarity and precision. However, sometimes I have looked into my tediously prepared data with (vague) expectations and was horrified. Nothing was as I hoped it would be. And then panic.

Humans Making Measurements

Breaking Free from Theory: How AI Enables A Different Research Style

This is not even so much about having strong theories (prejudices). There are calls in my discipline to move away from the deductive approach and towards abduction. Instead of approaching the world and the data we have about it with the goal to “kill an idea” (as Popper framed it), we are invited to generate new and surprising results in the light of existing theories. Psychologically, even minor expectations combined with a huge investment may lead to the same search for “fixes” as I described above.

With ChatGPT, I can reduce the time spent on data preparation. I have only a rough estimate of how large this reduction is, but my feeling is that it cuts the time by about 50%. That is enormous. It also frees a lot of time to actually engage with what I see. To analyze more deeply. Not to chase what I was hoping to find just because the sunk cost fallacy tells me to. Emotionally, I will be more able to discard a false theory, even if I have subscribed to it. Even if it is “my favorite” for some reason. The emotional cost will be lower.

AI Assistance Doesn’t Replace Data Expertise

Obviously, data preparation teaches you a lot about the data you have. And you need to know your data very well. In the sense of: Where do the data come from? Who made them? What do they feel like? Where is the variation? Which data are missing? Is there a pattern to the missingness? Which variables are there, and what do they represent? Are observations nested?

So I am a huge proponent of exploratory data analysis. And sometimes exploratory data analysis involves creating new data structures, merging external data sources, and so on. Just as mindless modeling only gets you so far, mindless cleaning and preparing with the aid of ChatGPT only gets you so far. You need to know the data. Be the expert on your data. But do not be afraid to hire an assistant. ChatGPT is not much different from a research assistant in that sense (I wonder what ChatGPT might do to the undergraduate research assistant labor market …).
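
For illustration, here are a few of the quick checks I would still run myself even when an assistant wrote the wrangling code. This is a sketch in Python (pandas); the file and column names are hypothetical.

```python
# Quick exploratory checks on a prepared data set. Names are hypothetical.
import pandas as pd

df = pd.read_csv("prepared_data.csv")

# Where is the variation? Basic summaries per variable.
print(df.describe(include="all"))

# Which data are missing, and is there a pattern to the missingness?
print(df.isna().mean().sort_values(ascending=False))  # share missing per column
print(df.groupby("region_id")["outcome"].apply(lambda s: s.isna().mean()))  # by group

# Are observations nested? E.g. how many recorded interactions per client?
print(df.groupby("client_id").size().describe())
```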

The Future of Discovery

And maybe, maybe, by spending more time actually analyzing the data and playing around with it, we will find something new, some real discovery. Instead of trying to save our theories like madmen, because we have invested so much time and energy in testing them that they have become just too important to fail. Using ChatGPT to avoid the sunk cost fallacy will make us more ready to discard false theories about the world. Because that is what science is about.

Written on February 11, 2025