---
title: Closed Source Models
status: Proposed
start_date: TBD
excerpt: >
  Independent evaluation of the claims made by commercial AI providers
  about their models' capabilities in vulnerability research.
---

# Closed Source Models

Independent evaluation of the claims made by commercial AI providers about their models' capabilities in vulnerability research.
## Why independent evaluation matters
Commercial AI providers, including Anthropic, OpenAI, and Google, are making significant claims about their models' ability to find vulnerabilities in software. Some of these claims are supported by published benchmarks. Most benchmarks, however, are designed and run by the providers themselves.
Project Lacewing is in a unique position to verify these claims independently. DIVD has the domain expertise, the track record, and the independence to run rigorous, unbiased evaluations, and to publish the results openly, regardless of what they show.
## What we will do
This sub-project conducts structured, independent evaluations of commercially available AI models on real vulnerability research tasks. The output is not a marketing comparison; it is actionable intelligence for the security community.
### Define a reproducible evaluation framework
Before testing any model, we will define a clear evaluation framework: which tasks, what ground truth, and what scoring criteria. The framework will be published so that others can replicate or extend our findings.
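As a rough illustration of what a reproducible task definition could contain, the sketch below bundles the input, the known ground truth, and the output constraints into a single record. The names and fields (`EvalTask`, `GroundTruthFinding`) are hypothetical, not a published DIVD schema:

```python
from dataclasses import dataclass, field


@dataclass
class GroundTruthFinding:
    """A known vulnerability the model is expected to rediscover."""
    cwe_id: str        # e.g. "CWE-89" (SQL injection)
    file_path: str     # location of the flaw in the test material
    description: str   # short human-readable summary


@dataclass
class EvalTask:
    """One reproducible evaluation task: input, ground truth, constraints."""
    task_id: str
    source_path: str                  # code or binary handed to the model
    prompt: str                       # the instruction given to the model
    ground_truth: list[GroundTruthFinding] = field(default_factory=list)
    max_output_tokens: int = 4096     # fixed per task so runs stay comparable
```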
### Test against real vulnerability research tasks
Evaluations will be grounded in the kind of work DIVD actually does: analysing source code and binaries, reasoning about attack surfaces, identifying vulnerability patterns, and producing findings in a structured format. Where possible we will use historical DIVD cases, appropriately anonymised, as test material, since we already know the ground truth.
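Because the ground truth for historical cases is known in advance, first-pass scoring can be mechanical. The matcher below is a deliberately naive sketch that counts a model finding as correct only when it names the same CWE in the same file as a ground-truth finding; a real evaluation would need fuzzier matching and manual adjudication:

```python
def score_findings(
    reported: list[tuple[str, str]],      # (cwe_id, file_path) from the model
    ground_truth: list[tuple[str, str]],  # (cwe_id, file_path) known in advance
) -> dict[str, float]:
    """Exact-match precision/recall of model findings against ground truth."""
    truth = set(ground_truth)
    hits = {pair for pair in reported if pair in truth}
    precision = len(hits) / len(reported) if reported else 0.0
    recall = len(hits) / len(truth) if truth else 0.0
    return {"precision": precision, "recall": recall}


# e.g. a model that reports the right CWE in the right file, plus one miss:
print(score_findings(
    reported=[("CWE-89", "app/db.py"), ("CWE-79", "app/views.py")],
    ground_truth=[("CWE-89", "app/db.py")],
))  # {'precision': 0.5, 'recall': 1.0}
```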
### Evaluate cost and operational fit
Raw capability is only part of the picture. For Project Lacewing to use a commercial model sustainably, the cost per task must be viable at scale on a non-profit budget. We will evaluate each model's performance relative to its cost, and assess whether each provider's usage policies allow the kind of security research DIVD conducts.
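A back-of-envelope cost model makes the trade-off concrete. The per-million-token prices used here are placeholders; real prices vary by provider and change over time:

```python
def cost_per_true_positive(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    true_positives: int,
) -> float:
    """Total API spend for a run, divided by confirmed findings."""
    spend = (input_tokens * usd_per_m_input
             + output_tokens * usd_per_m_output) / 1_000_000
    return spend / true_positives if true_positives else float("inf")


# Hypothetical run: 2M input tokens and 200k output tokens at $3 / $15 per
# million tokens, yielding 4 confirmed findings -> $9.00 total, $2.25 each.
print(cost_per_true_positive(2_000_000, 200_000, 3.0, 15.0, 4))  # 2.25
```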
### Publish openly
All findings will be published under a Creative Commons licence. Providers whose models perform well have nothing to fear. The security community deserves to know which tools actually work.
## What we need
- Security researchers: people with experience in vulnerability research who can help design and run evaluations.
- Funding: commercial API access at research scale is not free, and token costs for a thorough evaluation across multiple models are significant.
- Legal review: commercial model terms of service vary in how they treat security research use cases, and we will need to navigate this carefully.