
From Coverage to Precision: A Dual Approach to AI Review Evaluation

Enav W. | Meir M.
June 2025

GenAI evaluation presents persistent challenges across applications. Drawing from human reviews, we found that one metric can simultaneously measure a system's coverage and be used to improve its precision.

The task of generating expert-level critical reviews for life-science papers is at the heart of q.e.d's mission. With the extensive progress made recently in AI, generating high-quality reviews that boost the quality of scientists' research has become a realistic goal. However, as in many other AI-driven tasks, turning an ok-ish demo into a high-quality production system requires an efficient process of improvement iterations, which in turn demands access to very high-quality labeled data. While q.e.d generates its gold data through intensive collaboration with life-science researchers, relying solely on manually labeled data is not enough, and alternative, scalable methods are required.

q.e.d's agentic AI review platform is designed to critically assess life-science manuscripts. When a manuscript is submitted, the system analyzes its structure, claims, and supporting evidence to identify what it terms "gaps": potential logical flaws, instances where the observed results do not fully support the stated conclusions. In this blog we focus on two challenges: evaluating the quality of system-generated gaps and generating high-quality training data.

We evaluate our system using two key metrics: precision (whether the comments we make are genuinely useful and actionable) and recall (whether we capture all significant aspects that deserve attention in our test dataset). Aside from human-labeled data, our evaluation dataset also consists of open peer reviews from published papers and reviewed preprints. This is a valuable knowledge repository both for assessing our system's performance and for enhancing its capabilities. However, human reviews have inherent limitations: for example, Schroter, S. et al. (2008) showed that expert reviewers identify fewer than one-third of major issues deliberately inserted into manuscripts. This is why journals typically assign 2-3 reviewers per manuscript, relying on their collective assessment to broaden coverage of critical issues.
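In concrete terms, the two metrics reduce to simple ratios over matched comments. The sketch below uses hypothetical function and parameter names, purely to make the definitions explicit; it is not q.e.d's actual evaluation code.

```python
# Minimal sketch (hypothetical names): precision and recall over review gaps.
# A system gap counts toward precision if experts judge it useful and actionable;
# a ground-truth comment counts toward recall if some system gap covers it.

def precision(system_gaps_judged_useful: int, system_gaps_total: int) -> float:
    """Fraction of system-generated gaps that experts found genuinely actionable."""
    return system_gaps_judged_useful / system_gaps_total if system_gaps_total else 0.0

def recall(ground_truth_covered: int, ground_truth_total: int) -> float:
    """Fraction of ground-truth review comments covered by at least one system gap."""
    return ground_truth_covered / ground_truth_total if ground_truth_total else 0.0
```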

For simplicity, we separate the universe of review comments into four categories: rigor, impact, novelty, and style/presentation. Here, we focus on measuring the quality of rigor-related comments. Early iterations showed that asking researchers to label the quality of a review comment is tricky. For example, a positive sentiment in a comment about the importance of an experiment sometimes led researchers to label it as high quality, even when they did not agree that the gap was truly a gap in the first place. To avoid such label noise, we partition every comment into three components: (i) background, which gives context for the identified gap; (ii) the research gap itself; and (iii) a plan to address the gap (optional). This framework also allows us to normalize manual reviews, making them easier to compare against the system's output.
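The three-part structure can be pictured as a small record per comment. The field names below are illustrative, not q.e.d's actual schema.

```python
# A minimal sketch of the three-part comment structure described above
# (field names are illustrative assumptions, not q.e.d's actual schema).
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewComment:
    background: str             # context for the identified gap
    gap: str                    # the research gap itself
    plan: Optional[str] = None  # optional plan to address the gap
    category: str = "rigor"     # rigor | impact | novelty | style/presentation
```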

Collecting high-quality labeled data poses significant challenges, but even when diverse, reliable ground truth data is available, measuring accuracy between two sets of comments (ground truth versus system output) remains extremely difficult. Traditional LLM evaluation systems face a core issue: there are often multiple valid answers that differ not just in how semantically equivalent statements are worded, but also in their organizational structure, how points are divided, and how ideas are grouped together. Comments that are scientifically identical may use completely different language.

To solve this problem, we developed a new evaluation approach called coverage. This metric assesses how well one collection of review comments covers the content of another collection, capturing the fundamental meaning behind each critique. The metric acknowledges that a single identified issue might not correspond directly to any single comment in the comparison set, so it evaluates coverage across multiple comments to determine whether the essential scientific concern has been addressed. This approach proves valuable for measuring system accuracy. We also discovered an unexpected benefit: when applied correctly, this metric can automatically generate labeled training data on a large scale.
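To make the idea concrete, here is a hedged sketch of how such a coverage score could be computed with an LLM judge. The `llm_yes_no` callable is a placeholder for whatever LLM call the reader has available; the prompt wording and function names are assumptions, not q.e.d's implementation.

```python
# Sketch of the coverage idea: for each comment in the "reference" set, an LLM
# judge decides whether its essential concern is addressed anywhere in the
# "candidate" set, possibly spread across several candidate comments.
from typing import Callable, List

def coverage(reference: List[str], candidate: List[str],
             llm_yes_no: Callable[[str], bool]) -> float:
    """Fraction of reference comments whose core concern appears in the candidate set."""
    if not reference:
        return 0.0
    candidate_block = "\n".join(f"- {c}" for c in candidate)
    covered = 0
    for ref in reference:
        prompt = (
            "Reference review comment:\n" + ref + "\n\n"
            "Candidate review comments:\n" + candidate_block + "\n\n"
            "Is the essential scientific concern of the reference comment addressed "
            "anywhere in the candidate comments (possibly across several of them)? "
            "Answer yes or no."
        )
        if llm_yes_no(prompt):
            covered += 1
    return covered / len(reference)
```

Measuring recall then amounts to computing the coverage of the human reviews (reference) by the system's gaps (candidate).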

To test the quality of the generated gaps, we ask domain experts to label each gap as "major", "minor", or "no gap". To improve the performance of our system, we use the LLM-as-a-judge pattern, training a model to predict the expert feedback for a given gap. Relying solely on domain experts is challenging, both for reasons of scale and because system-generated gaps constantly improve: the distribution of gaps labeled a few months back is sometimes no longer valid for predicting the quality of newly generated ones. Hence, we need a way to get feedback on the quality of currently generated gaps, continuously and automatically.
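One way to picture the judge's training data is as one record per expert-labeled gap. The record format below is an illustrative assumption; the post does not specify the actual schema.

```python
# A sketch of a training record for the gap-quality judge
# (format and field names are illustrative assumptions).
import json

LABELS = ("major", "minor", "no gap")

def to_training_record(gap_text: str, manuscript_context: str, expert_label: str) -> str:
    """Serialize one expert-labeled gap as a JSON line for judge fine-tuning."""
    assert expert_label in LABELS
    return json.dumps({
        "input": (
            f"Manuscript context:\n{manuscript_context}\n\n"
            f"Generated gap:\n{gap_text}\n\n"
            "How would a domain expert rate this gap? (major / minor / no gap)"
        ),
        "label": expert_label,
    })
```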

It turns out that such training data can be obtained by reversing the direction of the gap coverage test. Instead of testing how well human comments are covered by the system's review, we measure to what extent each system gap is covered by the human review. Our assumption, validated externally, is that system gaps that are well covered by the human review tend to be of better quality than those that are not matched by humans. While this is not always the case (review quality varies), at scale it provides a valuable signal. It is important to note that we believe this will not hold once AI systems become significantly better than human experts, but until then we find the generated labels very useful for training a model that predicts the quality of generated gaps.
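A hedged sketch of this reversed direction: each system gap is checked against the full human review and turned into a weak quality label. As before, `llm_yes_no` is a placeholder for an LLM call, and the prompt and labels are illustrative, not the production pipeline.

```python
# Reverse coverage as weak labeling: score each system gap by whether the
# human review raises the same concern, and use that as a scalable label source.
from typing import Callable, List, Tuple

def weak_labels(system_gaps: List[str], human_comments: List[str],
                llm_yes_no: Callable[[str], bool]) -> List[Tuple[str, str]]:
    """Weakly label each system gap as 'covered' or 'uncovered' by the human review."""
    human_block = "\n".join(f"- {c}" for c in human_comments)
    labeled = []
    for gap in system_gaps:
        prompt = (
            "System-generated gap:\n" + gap + "\n\n"
            "Human review comments:\n" + human_block + "\n\n"
            "Is the essential concern of the system gap raised anywhere in the human "
            "review (possibly across several comments)? Answer yes or no."
        )
        labeled.append((gap, "covered" if llm_yes_no(prompt) else "uncovered"))
    return labeled
```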

To summarize, somewhat surprisingly, the same machinery that measures the system's recall can also be used to improve its precision: standard coverage for the former and reverse coverage for the latter.

To meet demand, we're gradually opening up access. Please fill in your details below and we'll get in touch soon!
