Improving OpenAI Codex with Repo-Specific Context

Three weeks ago we published results showing Codeset improved Claude models by 7–10 percentage points on real coding tasks. That eval covered three tiers of the Anthropic model family. What it didn't answer was whether the gain was specific to how Claude processes context — or whether it was something more fundamental.

So we ran the same evaluation on OpenAI Codex running GPT-5.4.

The setup

We used the same benchmarks and, crucially, the same task subset as the Claude eval.

codeset-gym-python is our public dataset of software engineering tasks. We used the same 150-task subset from the original evaluation — same tasks, same verifiers, same evaluation harness. Each task comes from a real GitHub issue and is verified by a test suite that confirms the solution.

SWE-Bench Pro is the widely used benchmark of real GitHub issues across JavaScript, TypeScript, Go, and Python repositories. We randomly sampled 400 tasks to keep evaluation costs tractable.
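For readers who want to draw a comparable subset themselves, here is a minimal sketch of reproducible random sampling. The pool of task IDs, the seed, and the helper name `sample_tasks` are assumptions for illustration; the post does not say how the 400 tasks were actually drawn.

```python
import random


def sample_tasks(task_ids, k=400, seed=0):
    """Draw k task IDs without replacement, reproducibly."""
    # A local RNG keeps the draw independent of global random state.
    rng = random.Random(seed)
    # Sorting the pool first makes the result independent of input order.
    return sorted(rng.sample(sorted(task_ids), k))


# Hypothetical pool of IDs; the real benchmark's IDs and size differ.
pool = [f"task-{i:04d}" for i in range(1000)]
subset = sample_tasks(pool, k=400, seed=0)
print(len(subset), "tasks sampled")
```

Fixing the seed and sorting the pool means the same subset can be regenerated later, which is what allows a baseline run and a context-augmented run to be scored on identical tasks.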

Codeset context was extracted once per sample from the task's base commit, before any work was performed, so the agent sees it before writing its first line of code.

Results

GPT-5.4 resolution rate:

codeset-gym-python (150 tasks): 60.7% baseline, 66.0% with Codeset (+5.3pp)

SWE-Bench Pro (400 tasks): 56.5% baseline, 58.5% with Codeset (+2.0pp)
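The arithmetic behind these figures can be checked with a short sketch. The rates and task counts are the ones reported above. The relative gain and the implied number of resolved tasks are derived here by rounding and are not figures published in the evaluation. The helper names are illustrative.

```python
def pp_delta(baseline_pct, treated_pct):
    """Absolute change in percentage points."""
    return round(treated_pct - baseline_pct, 1)


def relative_gain(baseline_pct, treated_pct):
    """Change expressed as a percentage of the baseline rate."""
    return round(100 * (treated_pct - baseline_pct) / baseline_pct, 1)


def implied_resolved(pct, n_tasks):
    """Nearest whole number of resolved tasks consistent with a rounded rate."""
    return round(pct * n_tasks / 100)


# (task count, baseline %, with-Codeset %) as reported above.
results = {
    "codeset-gym-python": (150, 60.7, 66.0),
    "SWE-Bench Pro": (400, 56.5, 58.5),
}

for name, (n, base, treated) in results.items():
    # Derived estimate only: the difference in implied resolved-task counts.
    extra = implied_resolved(treated, n) - implied_resolved(base, n)
    print(name, pp_delta(base, treated), relative_gain(base, treated), extra)
```

A percentage-point delta is the absolute difference between two rates, while the relative gain scales that difference by the baseline, so the two should not be read interchangeably.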

What the numbers mean

The improvement points the same way on both benchmarks (+5.3pp and +2.0pp), which matters because the two benchmarks test different things. codeset-gym-python tasks come from our own curated dataset of software engineering tasks. SWE-Bench Pro is an independent benchmark across multiple languages and repository types. Seeing improvement on both, using the same Codeset context pipeline, makes a dataset-specific effect unlikely.

The context gap is model-agnostic. Claude models improved by 7–10pp. GPT-5.4 improved by 5.3pp on the same task set. The mechanism is the same in both cases: the agent enters the task with structured knowledge it would otherwise have to reconstruct from scratch, or miss entirely. That knowledge includes historical bug patterns, co-change relationships, known pitfalls, and the exact tests to run. It lives in a codebase's git history, and Codeset surfaces it. The gain showed up in both model families we tested.

The SWE-Bench Pro delta is in line with prior results. When we published the Claude eval, we noted that a +2.7pp improvement on SWE-Bench Pro was comparable to what you'd get from an incremental GPT-5 model upgrade (5.2→5.4: +2.1pp on the same benchmark). The GPT-5.4 result here — +2.0pp — is consistent with that range. Codeset provides a context boost that competes with a model-tier increment, at a fraction of the cost.

The gain is additive, not compensatory. Codeset doesn't paper over a weak model. GPT-5.4 is OpenAI's current flagship on coding tasks. Adding structured context still improves it by 5.3pp. The same was true for Claude Opus 4.5 — the best model in that family — which improved by 7.3pp.

The full evaluation artifacts — task results, per-sample outputs, and raw scores — are available at github.com/codeset-ai/codeset-release-evals.
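With per-task outcomes in hand, the two runs can be compared task by task instead of only by headline rate. The sketch below assumes each run has been loaded into a dict mapping task ID to a resolved flag. That layout, the helper name `paired_summary`, and the four toy outcomes are all hypothetical, and the repository's actual file format may differ.

```python
def paired_summary(baseline, treated):
    """Compare two runs over the same tasks.

    baseline, treated: dicts mapping task ID -> True if the task was resolved.
    """
    if baseline.keys() != treated.keys():
        raise ValueError("runs must cover the same task IDs")
    n = len(baseline)
    return {
        "baseline_rate": 100 * sum(baseline.values()) / n,
        "treated_rate": 100 * sum(treated.values()) / n,
        # Tasks resolved only with the added context.
        "gained": sorted(t for t in baseline if treated[t] and not baseline[t]),
        # Tasks resolved only without it.
        "lost": sorted(t for t in baseline if baseline[t] and not treated[t]),
    }


# Made-up outcomes for illustration only, not evaluation data.
baseline = {"a": True, "b": False, "c": False, "d": True}
treated = {"a": True, "b": True, "c": True, "d": False}
summary = paired_summary(baseline, treated)
print(summary)
```

Listing gained and lost tasks separately shows whether a net delta comes from a few one-sided flips or from larger churn in both directions.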

Codeset is available today. Point it at your GitHub repository, and every session your agent runs — whether it's Claude Code, OpenAI Codex, or any other tool — starts with context your team built over years. $5 per repo, one-time payment.

Use code CODESETLAUNCH for a free repo analysis.
