codeset - A platform for training and evaluating agentic models

Welcome to Building in Public, a series of articles where I (Nuno), or my co-founder (André), will keep you updated about our journey with Codeset. In this first article, we will share the story of how Codeset was born from a research project and the mission that drives us forward.

Turning a Research Project Into a Startup

In 2023, I was trying to understand what I wanted to do professionally. I had finished my Master's at the end of 2022 and completed a four-month internship, which ended in November 2022, at a startup that used AI to make informed investment decisions in the energy market. I was offered a full-time position there, but I decided not to take it, as I did not feel fulfilled. The startup was great, with a good work-life balance, good pay, and an interesting job, but I was not able to feel ownership over the work I was doing.

In early 2023, I felt lost. I tried to come up with some product concepts with my brother, but we were not able to find a market for them. Meanwhile, I kept having conversations with João, our scientific advisor at Codeset, about starting a PhD. This was a conversation we had been having since I finished my Master's, but at the time the low pay and the 4-year commitment did not sit well with me. However, I did really enjoy doing research in software engineering, supervising students, and writing papers, and I felt that research gave me the sense of ownership that I yearned for. So after a while, I decided that I still had time later in life to get a big paycheck, and I finally decided to give it a try, with the mindset that if I did not like it, I could just leave at any point.

I applied for a grant to cover my pay and tuition in May 2023, but the results were only published in July 2023 and I could only start in September. So I had four months with nothing to do. I could have vibed the whole summer like any normal young person, but that is not my style. So it was with great excitement that I heard about the project that André proposed to me around June 2023. We could say that this was the embryonic stage of Codeset.

There was a challenge at the International Conference on Automated Software Engineering (ASE) to find methods to collect labeled datasets of software engineering tasks. The proposed solutions and their respective datasets would then be evaluated, and the winners would receive prize money. André was working on Automated Program Repair and these datasets would be useful for his research. I, on the other hand, was going to study topics related to Infrastructure-as-Code starting in September, but I also had a real interest in Automated Program Repair and I really enjoyed doing data mining, so I decided to accept his challenge. We have known each other since the second week of our Bachelor's and we have a very strong friendship. Moreover, we both trust each other's technical skills, which is why André decided to invite me.

We did not end up submitting to that challenge, as we were not able to have our work ready for the deadline, but we ended up submitting two short papers: one to the International Conference on Software Engineering (ICSE), the most prestigious software engineering conference, and another to a conference co-located with ICSE called Mining Software Repositories (MSR). The one submitted to ICSE was a tool called GitBug-Actions, which allowed users to collect datasets of software engineering tasks by leveraging GitHub Actions workflows. The other paper was GitBug-Java, a benchmark of Java software engineering tasks. These two works are the pillars of what Codeset is today. Both were presented at these conferences in 2024 in Lisbon.

A year passed without us touching the GitBug-Actions project. By the end of 2024, DeepSeek was finally able to use reinforcement learning to significantly improve the performance of Large Language Models (LLMs) while drastically reducing the costs of training. This was a big step for AI, as it not only allowed more efficient training but also helped a new type of AI grow significantly in popularity: agentic AI. Agents learn best when they can interact with an environment that provides feedback, receiving rewards for correct actions and penalties for mistakes. This is a concept that has been used in "traditional" AI for a long time, but this was one of the first times that it was applied so successfully to LLMs.

However, this brought a new challenge. Until now, LLMs had been trained solely on static textual data, but reinforcement learning requires an environment where actions can be taken and evaluated with reward signals. Compared with textual data, these environments are much harder not only to create but also to use. In the particular case of code agents, one must be able to collect snapshots of software repositories with issues, reproduce their development environment with all its dependencies, and, more importantly, execute their tests (or any other relevant oracle), which are used as the reward signals. One must also have infrastructure available that can run these environments easily at scale.
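To make the idea of tests as reward signals a bit more concrete, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions, not how Codeset or GitBug-Actions are implemented: the names `evaluate` and `StepResult` are hypothetical, the repository snapshot is assumed to be a local directory that already has its dependencies installed, and the reward is simply 1.0 when the test command exits with code 0 and 0.0 otherwise.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StepResult:
    reward: float  # 1.0 if the oracle passed, 0.0 otherwise
    output: str    # combined stdout/stderr, usable as feedback for the agent


def evaluate(test_command: list[str], repo_dir: str, timeout_s: float = 60.0) -> StepResult:
    """Run the oracle (for example, a test suite) inside a repository snapshot.

    Hypothetical helper: a real environment would also isolate the run
    (containers, resource limits) and restore the snapshot afterwards.
    """
    try:
        proc = subprocess.run(
            test_command,
            cwd=repo_dir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hanging test run is treated as a failure.
        return StepResult(reward=0.0, output="timed out")
    passed = proc.returncode == 0
    return StepResult(reward=1.0 if passed else 0.0, output=proc.stdout + proc.stderr)


if __name__ == "__main__":
    # Stand-in for a real test command such as ["pytest", "-q"].
    result = evaluate([sys.executable, "-c", "raise SystemExit(0)"], ".")
    print(result.reward)
```

Even in this toy version, most of the difficulty sits outside the function: obtaining a snapshot whose tests actually run, and running many of these evaluations in parallel.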

On February 12th, 2025, André sent me a message asking if I would be interested in creating a startup focused on generating datasets of software engineering tasks and training models, as we had the expertise and infrastructure to do it thanks to the GitBug-Java project. After some iterations, that idea transformed into what Codeset is today. Our mission is to accelerate the evolution of code agents, empowering developers by making them better at what they do, every day, in every task.

Both André and I use code agents daily and find them incredibly useful. Personally, it feels like being a software engineer cyborg. I can focus exclusively on the hard problems. These are the kinds of challenges that demand creativity: architectural decisions, algorithmic choices, or deeply intricate implementation details. Meanwhile, code agents deal with repetitive tasks. If you are writing a function and you have already defined what arguments it should receive, what it should return, and the high-level logic for the intermediate steps, actually writing it down is pretty much the same as for any other function you have ever written. The same goes for refactorings, logging, boilerplate code, and so many other software engineering tasks. For the foreseeable future, this is where code agents truly shine.

For me, the main productivity boost is not about writing code faster or smarter; it's about how code agents help me fight procrastination. I can stay engaged longer, as I can focus only on interesting and challenging tasks. I do not believe that code agents will replace software engineers anytime soon, at least not in the next 10 years, but I do believe that, in the short term, agents will evolve to handle all those not-so-interesting tasks accurately and effortlessly.

However, we are not there yet. Code agents still hallucinate, ignore coding standards, and often produce a lot (and I mean A LOT) of duplicated or nonsensical code. Sometimes their solutions are just plain wrong, even when the prompts are detailed and clear. And when it comes to other areas such as security or DevOps, the situation is even worse.

One of the main reasons for these limitations is data. Training code agents requires vast amounts of high-quality, executable, and reproducible data, but collecting it is difficult, time-consuming, and expensive. Datasets are scattered across platforms, often incomplete, and rarely easy to integrate. Setting up the infrastructure to use them adds yet another layer of complexity and cost.

Codeset exists to solve this. To build better code agents, we need to make large quantities of high-quality data easily accessible and ready to use. By bringing everything together (data, environments, and infrastructure) and making it available through just a few API calls, Codeset makes it faster, simpler, and far more affordable for companies and academic labs to train and improve their models. In the future, our goal is for anyone to be able to create their own personalized code agent, tailored to their codebase, with just a few clicks.