Why We Seek Validation

Itamar Zigel

The scientific method, humanity’s most successful approach to uncovering how the world works, relies less on “truth” than on validity. In fact, “valid” often serves as a sharper, more useful stand-in for the vague concept of “true”.

Why? Because the philosophy of science reminds us that there is no access to absolute truth outside the paradigm we’re operating in. Paradigms themselves are built on human assumptions and shared beliefs - and those change. We once believed in the luminiferous ether, or in the four classical elements. Today, those are historical curiosities, not “truth”.

And yet we can’t live in epistemological despair. We build satellites, run computers that depend on quantum mechanics, and develop new therapies. We clearly know something about the world. What we do in practice is not prove absolute truths but assess how valid a claim is. That’s how science moves forward, and why assessing validity is foundational.

Why We Need Proxies

Take Karl Popper’s classic example: “all swans are white”. You can never prove it universally true, since the next swan you see might be black. But every new white swan you observe increases the validity of the claim, while spotting just one black swan refutes it.
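
To make that intuition concrete, here is a toy Bayesian sketch of our own - not Popper’s formalism, and not qed’s method - in which each white swan nudges confidence in the claim upward while a single black swan collapses it:

```python
# Toy Bayesian illustration: confidence in "all swans are white".
# `rarity` is an assumed chance that a random swan is white even if
# the universal claim is false.

def update(confidence: float, swan_is_white: bool, rarity: float = 0.99) -> float:
    if not swan_is_white:
        return 0.0  # one counterexample refutes a universal claim outright
    # Bayes' rule: P(H | white) = P(white | H) * P(H) / P(white)
    p_white = confidence * 1.0 + (1 - confidence) * rarity
    return confidence / p_white

confidence = 0.5
for _ in range(100):
    confidence = update(confidence, swan_is_white=True)
print(f"after 100 white swans: {confidence:.3f}")  # ~0.732, creeping toward 1.0
print(f"after one black swan: {update(confidence, swan_is_white=False):.3f}")  # 0.000
```

The asymmetry is the point: confirmation accumulates slowly, while a single refutation is decisive.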

So validity is not binary. It grows stronger or weaker as claims are tested against reality. Reproducibility in science is a great example of a validity proxy. Many landmark studies have reported results that later failed to replicate - take Caspi et al. (2003) on the serotonin transporter gene (5-HTTLPR) and stress. While its validity is in question, the study still shaped how we think about depression and drove decades of discovery. Validity, not perfection, is what science runs on.

qed’s Proxies of Validity

At qed, we’ve worked with philosophers of science to make validity practical for AI. Our system evaluates claims along a set of proxies, including:

  • Consistency & Cohesion - do the claims support one another, or contradict other claims?
  • Simplicity - does the explanation rely on fewer assumptions than alternatives?
  • Predictive Power - does the claim generate testable outcomes, even in theory?
  • Refutability - can the claim be clearly proven wrong if false?

These proxies are not sufficient, and they don’t capture “truth”, yet they let us measure how well a claim holds up within its scientific context.
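
As a rough illustration of how such proxies might feed a single estimate - the names and weights below are hypothetical, not qed’s actual scoring model - each proxy can be scored independently and then combined:

```python
from dataclasses import dataclass

# Hypothetical illustration only - not qed's actual scoring model.
# Each proxy is scored in [0, 1] and combined with assumed weights.

@dataclass
class ProxyScores:
    consistency: float       # coheres with other claims
    simplicity: float        # fewer assumptions than alternatives
    predictive_power: float  # generates testable outcomes
    refutability: float      # can be clearly proven wrong if false

WEIGHTS = {"consistency": 0.3, "simplicity": 0.2,
           "predictive_power": 0.3, "refutability": 0.2}

def validity_estimate(p: ProxyScores) -> float:
    """Weighted combination of proxy scores; weights are illustrative."""
    return sum(WEIGHTS[name] * getattr(p, name) for name in WEIGHTS)

claim = ProxyScores(consistency=0.8, simplicity=0.6,
                    predictive_power=0.9, refutability=0.7)
print(f"indicative validity: {validity_estimate(claim):.2f}")  # 0.77
```

In practice such a combination would be learned rather than hand-weighted; the fixed weights exist only to make the structure concrete.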

Normal vs. Weird Science

Most science operates within paradigms (what Thomas Kuhn, the renowned historian and philosopher who popularized the term “paradigm shift”, called “Normal Science”). Here, qed excels at providing grounded, critical review.

But scientific revolutions rarely emerge from within the paradigm. They begin with “weird” thinking: ideas that are unconventional, uncomfortable, even absurd. Traditional AI models struggle with this, because they optimize for patterns inside the existing framework.

At qed, we believe such breakthroughs will always rely on human ingenuity. We build qed to empower scientists - not to replace them. Our goal is to surface the scientific elements that don’t quite fit within humanity’s current understanding of the world - the outliers that may one day redefine it - thus bringing scientists closer to the frontiers of human knowledge.

The Way We Assess Validity at qed Science

This is the foundation of how we approach review: not chasing absolute truth, but systematically evaluating validity. Even in the new era of highly capable LLMs, teaching a machine to reliably estimate the validity of a claim remains conceptually hard. 

We took a stab at it right from the start. Early on, we realized that building the algorithms would be as hard as measuring their quality. To do that, we needed a dataset of high- and low-validity claims - a benchmark that would let us test whether our models could tell them apart.

That turned out to be surprisingly difficult. Even in high-quality outlets, reviewer comments can’t easily serve as ground truth. Many critiques are never fully addressed in the final paper, which might mean that the reviewed claims were valid all along - or that the weakness simply went unresolved. 

Our approach was to look deeper: analyzing everything from how authors respond to reviewer feedback to how those responses correlate with a paper’s final outcome. By considering the dynamic between critique, revision, and influence, we began to see a fuller picture of what validity really means in practice.

We then set out to build the algorithms based on these insights. For example, one useful framing is what we call “internal” and “external” validity: internal validity checks a claim against the supporting evidence presented in the paper itself, while external validity checks it against the relevant scientific corpus. Both pose conceptual challenges: internal validity requires identifying and ranking logical gaps between claims and their supporting evidence, while external validity requires making assumptions about the validity of claims in other papers.
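
A minimal sketch of that split - hypothetical interfaces with placeholder logic, not qed’s implementation - might look like this:

```python
from dataclasses import dataclass

# Hypothetical sketch of the internal/external split - not qed's implementation.

@dataclass
class Claim:
    text: str
    supporting_evidence: list[str]  # evidence presented in the same paper

def internal_validity(claim: Claim) -> list[str]:
    """Identify logical gaps between a claim and its in-paper evidence.

    Real gap detection would need a trained model; this placeholder only
    flags claims that cite no supporting evidence at all.
    """
    if not claim.supporting_evidence:
        return [f"no supporting evidence for: {claim.text!r}"]
    return []

def external_validity(claim: Claim, corpus: dict[str, float]) -> float:
    """Score a claim against prior claims with *assumed* validity scores.

    This is the circularity the text mentions: to check a claim externally,
    we must already assume something about the validity of other papers.
    """
    related = [score for text, score in corpus.items()
               if any(word in text for word in claim.text.lower().split())]
    return sum(related) / len(related) if related else 0.5  # 0.5 = no signal
```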

Understanding and ranking gaps proved particularly hard. To train our models, we built on opt-in feedback from thousands of expert scientists who helped identify gaps and rank their severity, and we translated that feedback into a claim validity index that serves as a training and evaluation signal. As more scientists adopt qed to strengthen their work, we get better at this. By teaching machines to understand philosophical “validity”, we aspire to build the foundation for how scientists will work in the AI era.
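
For illustration only, here is one naive way such ranked feedback could be folded into an index - the scale and formula are assumptions, not qed’s method:

```python
from statistics import mean

# Naive illustration - qed's actual index construction is not public.

def claim_validity_index(severity_ratings: list[int], max_severity: int = 5) -> float:
    """Map expert gap-severity ratings (1 = minor, max = fatal) to [0, 1].

    1.0 means no gaps were reported; 0.0 means every rater saw a fatal flaw.
    """
    if not severity_ratings:
        return 1.0  # no expert identified a gap in this claim
    return 1.0 - mean(severity_ratings) / max_severity

# Example: three experts flag the same gap with severities 2, 3, and 4.
print(claim_validity_index([2, 3, 4]))  # 0.4
```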

One thing is clear to us: in a fittingly scientific manner, this effort will never be complete.
