SWE-Bench Verified Is Now Live on Codeset
SWE-Bench Verified is now available on the Codeset platform, making the most widely used benchmark for evaluating real-world software engineering tasks easier than ever to run.
Introducing SWE-Bench Verified on the Codeset Platform
We're excited to announce that SWE-Bench Verified is now available on the Codeset platform!
SWE-Bench Verified is the most widely adopted benchmark for evaluating code agents on real-world software engineering tasks, and it is now easier than ever to run. With just a few lines of code, you can programmatically set up a session with any of the 500 samples in SWE-Bench Verified, interact with it, and verify whether the instance has been solved.
import os

from codeset import Codeset

# Initialize the client with your API key
client = Codeset(api_key=os.getenv("CODESET_API_KEY"))

# Create a session for a task
session = client.sessions.create(
    dataset="swe-bench-verified",
    sample_id="astropy__astropy-12907",
)

# Interact with the environment
response = client.sessions.execute_command(
    session_id=session.session_id,
    command="ls -lah",
)
print(response.stdout)

# Apply a candidate fix by replacing the buggy line in separable.py
result = client.sessions.str_replace(
    session_id=session.session_id,
    file_path="astropy/modeling/separable.py",
    str_to_replace="cright[-right.shape[0]:, -right.shape[1]:] = 1",
    str_to_insert="cright[-right.shape[0]:, -right.shape[1]:] = right",
)
print(result)

# Run the tests and verify the solution
verify = client.sessions.verify.start(session_id=session.session_id)
assert verify.result.is_success
print("Verification successful!")

# Close the session when finished
client.sessions.close(session_id=session.session_id)

You can start running evaluations immediately and integrate them directly into your benchmarking pipeline. Whether you are already using SWE-Bench Verified or just getting started, Codeset provides the easiest way to run and scale your evaluations.
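For example, a minimal evaluation loop might look like the sketch below. It uses only the session calls shown above; the run_agent function is a placeholder for whatever agent or patching logic you want to evaluate, and the sample IDs are illustrative.

import os

from codeset import Codeset

client = Codeset(api_key=os.getenv("CODESET_API_KEY"))

# Illustrative subset of SWE-Bench Verified instance IDs
SAMPLE_IDS = [
    "astropy__astropy-12907",
    "django__django-10097",
]


def run_agent(client, session_id):
    # Placeholder for your agent: inspect the repo, edit files, etc.
    # Here we just list the repository root as a stand-in.
    response = client.sessions.execute_command(
        session_id=session_id,
        command="ls -lah",
    )
    print(response.stdout)


results = {}
for sample_id in SAMPLE_IDS:
    session = client.sessions.create(
        dataset="swe-bench-verified",
        sample_id=sample_id,
    )
    try:
        run_agent(client, session.session_id)
        verify = client.sessions.verify.start(session_id=session.session_id)
        results[sample_id] = verify.result.is_success
    finally:
        client.sessions.close(session_id=session.session_id)

solved = sum(results.values())
print(f"Solved {solved}/{len(results)} instances")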
Sign up today and get $5 in free credits to try running your evaluations on the Codeset platform!
Fixing Issues with the SWE-Bench Dataset
While integrating SWE-Bench Verified into our platform, we encountered several problematic instances that have also been identified by the community:
- psf__requests-1724: #324, #484
- django__django-10097: #487
- astropy__astropy-8707: #484
- astropy__astropy-8872: #484
Additionally, in our own version of the SWE-Bench Verified dataset on HuggingFace, we have fixed issues with the PASS_TO_PASS and FAIL_TO_PASS test lists for the following instances (see the loading sketch after the list):
- astropy__astropy-7606
- matplotlib__matplotlib-22719
- django__django-7530: #502
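To inspect these corrections yourself, you can load the dataset from HuggingFace and print the test lists directly. A minimal sketch, assuming the dataset repository is named codeset/swe-bench-verified and follows the upstream SWE-Bench schema (an instance_id column and a test split); substitute the actual repository name from our HuggingFace page.

from datasets import load_dataset

# NOTE: the repository name and split are placeholders; use the actual
# Codeset dataset on HuggingFace.
ds = load_dataset("codeset/swe-bench-verified", split="test")

# Look up one of the corrected instances and print its test lists
row = next(r for r in ds if r["instance_id"] == "astropy__astropy-7606")
print(row["FAIL_TO_PASS"])
print(row["PASS_TO_PASS"])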
We believe maintaining a high-quality benchmark is essential for meaningful evaluations, and we're committed to helping improve SWE-Bench for the entire community.