> All of that happens in persistent context windows that last for the entire engagement. Teammates accumulate knowledge and remember what they have attempted and accomplished, all guided by the operator and team lead, with the state database functioning as the ultimate source-of-truth.
Oh wow! How long do these engagements usually last, and how large do these context windows get?
I was thinking the orchestrator + second-in-command only have a small part of their context windows consumed checking for updates, reprioritizing tasks, and spawning agents, as opposed to actually doing the work each agent gets from its task.
How are you managing the orchestrator-to-agent communication (in-flight context about a task when spawning an agent)?
Testing, thus far, has focused on single-target CTFs, since they are relatively quick to execute end-to-end. The current goal is to stabilize the underlying pieces (MCP servers, teammates, skills) that will eventually support larger engagements. Total token usage per box ranges from a few hundred thousand to a couple million (almost all Sonnet), mostly on medium-hard HTB boxes.
Each discrete task assigned to a teammate should be accomplished in under 200k tokens, as a general rule of thumb. Those are tasks like:
* Enumerate this vhost for exploitable vulns
* Exploit LFI on this target
* Research version xyz of this software
* Use these creds to find AD attack paths
* Exploit this ACL misconfiguration
If I find them hitting auto-compactions or otherwise losing context, that means the task is too large for a single teammate.
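The rule of thumb above could be sketched roughly like this (a hypothetical illustration; `Task`, `TASK_TOKEN_BUDGET`, and the method names are mine, not the project's actual API): track each teammate's cumulative token usage per task, and treat crossing the ~200k budget as the signal to decompose the task.

```python
# Hypothetical sketch of the "one task per ~200k tokens" rule of thumb.
# These names are illustrative, not the project's real API.

from dataclasses import dataclass

TASK_TOKEN_BUDGET = 200_000  # rough per-teammate ceiling before auto-compaction risk


@dataclass
class Task:
    description: str
    tokens_used: int = 0

    def record_usage(self, tokens: int) -> None:
        self.tokens_used += tokens

    def needs_split(self) -> bool:
        # If a teammate is burning past the budget, the task was scoped
        # too big and should be decomposed into smaller discrete tasks.
        return self.tokens_used >= TASK_TOKEN_BUDGET


task = Task("Enumerate this vhost for exploitable vulns")
task.record_usage(180_000)
print(task.needs_split())  # False: still within budget
task.record_usage(30_000)
print(task.needs_split())  # True: time to decompose the task
```

The point is just that the budget check replaces waiting for an auto-compaction to tell you the task was too big.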
Agent teams handles the teammate-to-teammate messaging, although it has its flaws in its current (experimental) form. For example, when a teammate is executing its turn, it does not stop to check for new messages until it decides it is time to go "idle". That can take a long while, depending on the task, so teammates often miss important updates from others (research results, related findings, etc.). The solution (for now) is human intervention: pressing the escape key in the teammate's pane to force it to stop and check its inbox. That's pretty similar to how humans operate, honestly, now that I'm thinking about it... But it will get better.
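The idle-only inbox behavior described above could be sketched roughly like this (all names are hypothetical, not the Agent teams internals): messages queue up while a teammate works through a turn, and are only drained once it goes idle, unless an external interrupt, the escape-key equivalent, forces an early check.

```python
# Rough sketch of the idle-only inbox check (hypothetical names, not the
# Agent teams internals). Messages queue up during a turn and are only
# read when the teammate goes idle - or when the operator interrupts.

import queue


class Teammate:
    def __init__(self, name: str):
        self.name = name
        self.inbox = queue.Queue()
        self.interrupted = False

    def send(self, msg: str) -> None:
        self.inbox.put(msg)

    def interrupt(self) -> None:
        # Operator pressing escape in the teammate's pane.
        self.interrupted = True

    def run_turn(self, steps: list) -> list:
        seen = []
        for step in steps:
            if self.interrupted:
                self.interrupted = False
                seen.extend(self._drain_inbox())  # forced early check
            # ... do the actual work for `step` here ...
        seen.extend(self._drain_inbox())  # normal check: only once idle
        return seen

    def _drain_inbox(self) -> list:
        msgs = []
        while not self.inbox.empty():
            msgs.append(self.inbox.get())
        return msgs


t = Teammate("recon")
t.send("research result: exploit PoC found")
# Without an interrupt, the message is only read after the whole turn:
print(t.run_turn(["scan", "enumerate", "report"]))
```

Without `interrupt()`, the update sits in the queue for the entire turn, which is exactly the missed-update failure mode described above.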
That's about where I am landing as well: around 167k tokens per agent spawn for a dedicated task, Sonnet with /effort medium (I don't see improvement from medium -> high -> max).
Overall, didn't you run into issues with the models refusing to conduct these engagements, given the grey area between ethical hacking / pentesting and actually being a malicious user? How did you get around those constraints?