Multi-Agent AI Candidate Screening with Panel Discussion & Continuous Learning

4-layer screening platform

4 weeks to working demonstration

2-engineer team


Context

A screening platform that challenges its own conclusions before they reach a recruiter

The market is full of AI-powered applicant tracking systems (ATS) with rigid prompts and no learning feedback. You write the prompt once. The system applies it forever. The model never gets smarter about your role, company, or definition of a good candidate.

Worse, many of these tools are keyword matchers with an LLM glued on top. A single LLM talking to itself can only argue for its own conclusion. Weak assumptions and missed signals stay hidden.

A global workforce platform in the EOR and contractor-management space wanted to evaluate what a different approach could look like. We built TalentIQ as a working system to show responsible multi-agent screening: auditable reasoning, measured bias, structured criticism, and continuous learning from recruiter feedback.

Industry: HR Tech / Recruitment

Engagement Type: Capability demonstration partnership

Timeline: 4 weeks from kickoff to working demonstration

Team Size: 2 engineers (1 AI/ML, 1 full-stack)

The Problem

Four failure modes static AI screening cannot handle

Candidate screening at scale has four failure modes that compound. Static prompts do not learn. Single-perspective reasoning misses counterarguments. Opaque outputs are hard to defend. Bias risk needs measurement, not reassurance.

Static prompts that do not learn

A prompt written today does not know what your senior recruiter approved yesterday. The model never improves at scoring against your actual hiring bar.

Single-perspective reasoning

A system that only argues for its own conclusion hides its own weaknesses. Errors a human review panel would catch slip through.

Opaque outputs

Most AI screening tools output a number with no usable explanation. Recruiters do not trust it, and auditors cannot review it.

Bias risk

Pattern-matching on names, schools, employment gaps, or geography can leak demographic proxies if you are not measuring carefully.

What We Built

Multi-agent architecture on Microsoft AutoGen, organised as four screening layers

TalentIQ is built on Microsoft AutoGen and structured as four sequential layers of agent activity. Each layer has a distinct purpose, and each layer's output is challenged by the next.

This is a deliberate departure from one-agent, one-decision patterns. Structuring screening as a panel with built-in adversarial review lets the system catch reasoning errors that single-pass systems miss.

Layer 1

Signal Extraction

Parser Agent extracts and normalises candidate signals from the CV: skills, experience, education, and structured metadata. It produces a canonical candidate profile that downstream agents work from, so every later agent reasons over the same shared representation.

Parser Agent

Layer 2

Multi-Dimensional Evaluation

A panel of independent evaluators scores the candidate across distinct dimensions. Each evaluator works in isolation so its reasoning is not contaminated by the others.

Role-Match Agent, Stage-Fit Agent, Bias Auditor Agent

Layer 3

Panel Discussion & Adversarial Review

Specialised agents take adversarial stances against the Layer 2 outputs. The system stress-tests conclusions through AutoGen group chat. Every disagreement references specific upstream reasoning.

Critique Agent, Counter-Challenge Agent, Devil's Advocate Agent, Steelman Agent

Layer 4

Synthesis & Learning

Synthesiser Agent composes the final score and reasoning trace, integrating evaluator outputs and panel critiques. Learning Agent observes recruiter overrides and updates company-specific rule memory for future evaluations.

Synthesiser Agent, Learning Agent
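The four layers above can be sketched as a plain Python pipeline. This is a minimal illustration under stated assumptions, not TalentIQ's code: the `Profile` shape, the function names, and the placeholder scoring rules are all hypothetical, and the real system runs these steps as AutoGen agents rather than plain functions. The sketch shows only the two structural properties the text describes: every later step reads the same frozen profile, and Layer 2 evaluators never see each other's output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Canonical candidate profile produced by Layer 1 (hypothetical shape)."""
    skills: tuple[str, ...]
    years_experience: float


@dataclass
class Evaluation:
    agent: str
    score: float       # 0.0 to 1.0
    reasoning: str


def parse_cv(raw: dict) -> Profile:
    # Layer 1: normalise raw CV fields into one shared representation.
    skills = tuple(sorted({s.strip().lower() for s in raw["skills"]}))
    return Profile(skills=skills, years_experience=float(raw["years"]))


def role_match(profile: Profile, required: set[str]) -> Evaluation:
    # Layer 2 evaluator: receives only the profile, never another evaluator's output.
    hit = required & set(profile.skills)
    return Evaluation("role_match", len(hit) / len(required), f"matched {sorted(hit)}")


def stage_fit(profile: Profile, min_years: float) -> Evaluation:
    # Layer 2 evaluator: placeholder rule comparing experience to a minimum.
    score = min(profile.years_experience / min_years, 1.0)
    return Evaluation("stage_fit", score, f"{profile.years_experience} yrs vs {min_years}")


def screen(raw: dict, required: set[str], min_years: float) -> dict:
    profile = parse_cv(raw)                                    # Layer 1
    evals = [role_match(profile, required),                    # Layer 2
             stage_fit(profile, min_years)]
    # Layer 3 placeholder: every challenge cites the evaluation it disputes.
    challenges = [f"challenge to {e.agent}: {e.reasoning}" for e in evals if e.score < 0.5]
    final = sum(e.score for e in evals) / len(evals)           # Layer 4
    return {"score": final, "evaluations": evals, "challenges": challenges}
```

Freezing the profile is a small design choice that makes the shared-representation guarantee checkable: no evaluator can mutate what the next one reads.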

Panel Discussion & Adversarial Review

Specialised agents argue different positions on the same evidence

Layer 3 is where the system stress-tests its own conclusions. The panel runs as a structured conversation through AutoGen group-chat patterns. It is not a free-for-all.

Each agent's output references specific upstream reasoning. Every disagreement is logged, and dissent is preserved for recruiter and auditor review.

Figure: Shared evidence review. TalentIQ AI panel discussion screen showing parser, skill-gap, interview, decision, and QA review stages.
A panel only works when every agent is reasoning over the same candidate evidence.
This screen supports the adversarial review story directly: staged agent analysis, structured scoring, and a visible reasoning surface before a recruiter sees the result.

Role-Match Agent

Scores skills and experience against role requirements, with explicit reasoning per requirement.

Stage-Fit Agent

Evaluates whether the candidate's career stage and trajectory fit the role's seniority and growth profile.

Bias Auditor Agent

Runs counterfactual rewrites of the CV by swapping name, school, and geography signals. Flags score deltas above threshold.

Critique Agent

Challenges the reasoning of evaluation agents. Looks for weak reasoning, unjustified weighting, and conclusions that do not follow from the evidence.

Counter-Challenge Agent

Argues the opposite conclusion to whatever Layer 2 produced. The goal is to force the strongest possible case for the alternative.

Devil's Advocate Agent

Surfaces failure modes and red flags the evaluators may have missed or under-weighted: thin claims and role-fit assumptions that do not hold up.

Steelman Agent

Argues the strongest possible case for the candidate, ensuring weaker or non-obvious signals are not dismissed prematurely.

Synthesiser + Learning Agents

The Synthesiser integrates evaluations and critique into the final recommendation. The Learning Agent observes feedback and updates company-specific rule memory.

Final score with dissent preserved

Recruiters see not just the final score, but the strongest arguments against it. Dissent from the panel remains in the audit trail.
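One way to keep dissent attached to the final score is to make it part of the recommendation record itself. The sketch below is an assumption about how that could look, not the Synthesiser's actual logic: `PanelArgument`, `Recommendation`, and the `accept` callback are hypothetical names, and the real acceptance decision would come from the Synthesiser Agent's reasoning rather than a simple predicate.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PanelArgument:
    agent: str
    stance: str     # "for" or "against" the candidate
    shift: float    # proposed score adjustment, e.g. -0.10
    cites: str      # the upstream reasoning this argument responds to


@dataclass
class Recommendation:
    score: float
    accepted: list[PanelArgument] = field(default_factory=list)
    dissent: list[PanelArgument] = field(default_factory=list)


def synthesise(
    base_score: float,
    arguments: list[PanelArgument],
    accept: Callable[[PanelArgument], bool],
) -> Recommendation:
    """Apply accepted arguments to the score and keep rejected ones as dissent."""
    rec = Recommendation(score=base_score)
    for arg in arguments:
        if accept(arg):
            rec.score += arg.shift
            rec.accepted.append(arg)
        else:
            rec.dissent.append(arg)   # rejected, but still visible to the recruiter
    rec.score = max(0.0, min(1.0, rec.score))   # clamp to the 0..1 range
    return rec


# Illustrative values only, loosely modelled on a steelman vs. counter-challenge exchange.
panel = [
    PanelArgument("steelman", "for", 0.15, "stage_fit: trajectory is a stretch"),
    PanelArgument("counter_challenge", "against", -0.10, "steelman: postmortem ownership"),
]
rec = synthesise(0.55, panel, accept=lambda a: a.stance == "for")
```

Because rejected arguments are stored rather than discarded, the audit trail can show a recruiter the strongest case against the score alongside the score itself.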

Company-specific adaptation

The Learning Agent distils recruiter overrides into memory that Layer 2 and Layer 3 agents consult on future evaluations.

Representative Case

The under-titled backend engineer the panel did not miss

One mid-senior backend engineering candidate from the test batches showed why the panel layer mattered.

1. Initial under-score

The Role-Match and Stage-Fit agents initially under-scored the candidate. The CV listed modest titles (Senior Engineer, no formal lead designation), and Stage-Fit flagged the trajectory as a stretch for the target role.

2. Steelman pushback

The Steelman Agent surfaced three concrete signals the evaluators had under-weighted: ownership of a payments service running production traffic across two regions, mentoring responsibility for two junior ICs, and recurring ownership of incident postmortems.

3. Counter-challenge and synthesis

The Counter-Challenge Agent pressure-tested whether postmortem ownership was true leadership or just on-call rotation. The Synthesiser weighed both arguments, raised the final score, and preserved both reasoning lines in the audit trail.

4. Recruiter-aligned outcome

The recruiter decision matched the upgraded recommendation. This is the exact failure mode static-prompt ATS tools create at scale: strong candidates filtered out on title-string matching.

Key factors considered

Modest title: Senior Engineer
No formal lead designation
Stage-Fit flagged trajectory as stretch

Under-weighted signals

Ownership of a payments service across two regions
Mentoring responsibility for two junior ICs
Recurring ownership of incident postmortems

What happened

Counter-Challenge Agent pressure-tested the claims
Synthesiser weighed both sides fairly
Raised the final score
Preserved both reasoning lines in audit trail

Outcome

Recruiter decision matched the upgraded recommendation
Avoided false negative on strong candidate
A static-prompt ATS would have filtered this candidate out

Final outcome: the panel rescued a strong candidate

The panel discussion, adversarial review, and synthesis kept the candidate's signal visible past title-string matching.

Reasoning preserved in audit trail
Recruiter aligned with final decision
Strong candidate not missed

What We Got Wrong First

The system had to be tuned like a real review panel

The first version had two problems. Fixing them meant tuning the system like a real review panel, not a static prompt chain.

The panel skewed too critical

Three of the four Layer 3 agents were oriented toward challenge, with only the Steelman defending the candidate. The Synthesiser consistently pulled scores down, and false negatives climbed in the early cohort.

Rebalanced synthesis weighting so Steelman arguments were not structurally outvoted
Required Counter-Challenge and Devil's Advocate objections to anchor in specific candidate evidence, not general suspicion
Recruiter override rates on under-scored candidates fell sharply after the change
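The two fixes above can be illustrated with a small sketch. The source does not say how the weighting was rebalanced, so the approach here is an assumption chosen for clarity: average the proposed adjustments within each stance before combining them, so three challengers count as one "against" voice rather than three, and drop any objection that cites no candidate evidence. The dictionary keys are hypothetical.

```python
from collections import defaultdict


def balanced_shift(arguments: list[dict]) -> float:
    """Combine panel arguments so each stance counts once, not each agent.

    Each argument is a dict with keys:
      agent, stance ("for" / "against"), shift (signed adjustment),
      evidence (list of specific CV facts the argument cites).
    """
    by_stance = defaultdict(list)
    for arg in arguments:
        if arg["stance"] == "against" and not arg["evidence"]:
            continue  # objections must anchor in specific candidate evidence
        by_stance[arg["stance"]].append(arg["shift"])
    # Average within a stance first, then add the stance-level means.
    return sum(sum(shifts) / len(shifts) for shifts in by_stance.values())
```

With one supporting argument of +0.1 and three evidenced objections of -0.1 each, a naive sum gives -0.2, while this balanced version nets out to roughly zero.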

Panel latency was too high

Unconstrained AutoGen group chat ran long because agents kept finding new things to argue about, and per-candidate cost climbed with it.

Added structured turn limits per agent
Introduced a concise-disagreement output format
Roughly halved panel runtime without measurably changing decision quality
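A per-agent turn limit and a compact disagreement record can be sketched without any framework. Everything named here is an assumption: the limits, the `Disagreement` fields, and the `run_panel` helper are hypothetical. For context, AutoGen's `GroupChat` in the 0.2-series API offers a `max_round` setting that caps the conversation as a whole; a per-agent cap like this one would sit on top of that as custom logic.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_TURNS_PER_AGENT = 2    # assumed value
MAX_CLAIM_CHARS = 200      # assumed value


@dataclass
class Disagreement:
    agent: str
    target: str      # which upstream conclusion is disputed
    claim: str       # one short statement of the objection
    evidence: str    # the candidate fact the claim rests on


Agent = Callable[[list[Disagreement]], Optional[Disagreement]]


def run_panel(speakers: list[str], agents: dict[str, Agent]) -> list[Disagreement]:
    """Walk a speaking order, skipping any agent that has used up its turns."""
    turns = {name: 0 for name in agents}
    log: list[Disagreement] = []
    for name in speakers:
        if turns[name] >= MAX_TURNS_PER_AGENT:
            continue
        turns[name] += 1
        item = agents[name](log)      # the agent sees the log so far
        if item is None:              # nothing new to add
            continue
        item.claim = item.claim[:MAX_CLAIM_CHARS]   # enforce the concise format
        log.append(item)
    return log
```

Bounding both the number of turns and the size of each contribution bounds the token cost of the panel per candidate, which is where the runtime saving would come from.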

Stateful Memory & Continuous Learning

Every recruiter override can compound into future decisions

Unlike stateless prompt-and-response systems, TalentIQ agents maintain memory across evaluations. Memory includes the company's hiring history, recruiter feedback patterns, role-specific rules accumulated over time, and each agent's own past reasoning on similar candidates.

Memory is scoped per company. What one client's agents learn never leaks to another.
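A simple way to enforce that scoping is to never hand agents the whole store, only a handle bound to one company. The sketch below uses an in-memory dictionary and hypothetical class names; in a deployment backed by PostgreSQL and Qdrant, as listed in the tech stack, the equivalent would presumably be a mandatory company filter on every query, though the source does not describe the mechanism.

```python
class ScopedMemory:
    """Handle given to agents; it can reach only one company's rules."""

    def __init__(self, rules: dict[str, float]) -> None:
        self._rules = rules

    def get(self, rule: str, default: float = 0.0) -> float:
        return self._rules.get(rule, default)

    def set(self, rule: str, weight: float) -> None:
        self._rules[rule] = weight


class CompanyMemory:
    """Rule memory in which every read and write is keyed by company ID."""

    def __init__(self) -> None:
        self._rules: dict[str, dict[str, float]] = {}

    def scoped(self, company_id: str) -> ScopedMemory:
        # Each company gets its own dictionary; handles never share state.
        return ScopedMemory(self._rules.setdefault(company_id, {}))
```

An agent evaluating a candidate for one client receives `store.scoped(company_id)` and has no method that accepts a different company ID.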

Override learning

Recruiters can override scores with reasons. The Learning Agent observes overrides and updates company-specific rule memory.

Examples of learned rules

The system can learn that a client values open-source contributions more than the default rubric suggests, or penalises employment gaps less than the default.
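To make the idea of a learned rule concrete, here is a hedged sketch of how overrides might be distilled into per-signal adjustments. The thresholds, the override record shape, and the averaging rule are all assumptions for illustration; the source does not specify how the Learning Agent derives its rules.

```python
from collections import defaultdict
from statistics import mean

MIN_OVERRIDES = 3     # assumed: overrides needed before a rule is learned
LEARNING_RATE = 0.5   # assumed: fraction of the observed gap to absorb


def learn_rules(overrides: list[dict]) -> dict[str, float]:
    """Turn recruiter overrides into per-signal score adjustments.

    Each override is a dict with keys:
      signal (the reason the recruiter gave), agent_score, recruiter_score.
    """
    gaps = defaultdict(list)
    for o in overrides:
        gaps[o["signal"]].append(o["recruiter_score"] - o["agent_score"])
    return {
        signal: round(LEARNING_RATE * mean(deltas), 3)
        for signal, deltas in gaps.items()
        if len(deltas) >= MIN_OVERRIDES   # ignore one-off overrides
    }
```

If recruiters repeatedly raise scores and cite open-source work as the reason, the resulting rule is a positive adjustment for that signal; a single override citing employment gaps produces no rule until the pattern repeats.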

Batch-over-batch convergence

Across test batches, the gap between agent scores and recruiter overrides narrowed as override patterns accumulated in memory.
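Convergence of this kind can be tracked with a simple per-batch metric. The source does not name the metric used, so mean absolute gap is an assumption here, and both function names are hypothetical.

```python
from statistics import mean


def batch_gap(batch: list[tuple[float, float]]) -> float:
    """Mean absolute gap between agent score and recruiter score in one batch."""
    return mean(abs(agent - recruiter) for agent, recruiter in batch)


def is_converging(batches: list[list[tuple[float, float]]]) -> bool:
    """True if the gap shrinks from each batch to the next."""
    gaps = [batch_gap(b) for b in batches]
    return all(later < earlier for earlier, later in zip(gaps, gaps[1:]))
```

A strict batch-by-batch decrease is a demanding test; a production monitor might instead compare a rolling average, since individual batches can be noisy.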

Configurable Signal Exclusion

Recruiters can see what the system was allowed to consider

Configuration controls what each agent is allowed to use as a signal. These choices are visible in the audit trail per decision, so recruiters and auditors can see exactly what the system was permitted to consider.

Educational institution names can be disabled
Employment-gap penalties can be disabled
Location-based scoring can be disabled
Every scored candidate is bias-audited, not just a sample
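The exclusion switches listed above could be expressed as a small configuration object that both filters the profile and records what was filtered. This is a sketch under assumptions: the flag names, the profile field names, and the mapping between them are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalConfig:
    use_institution_names: bool = True
    use_employment_gaps: bool = True
    use_location: bool = True


# Maps each config flag to the profile fields it controls (field names assumed).
FIELDS_BY_FLAG = {
    "use_institution_names": ("school",),
    "use_employment_gaps": ("employment_gaps",),
    "use_location": ("city", "country"),
}


def apply_exclusions(profile: dict, config: SignalConfig) -> tuple[dict, list[str]]:
    """Return the profile agents may see, plus the excluded fields for the audit trail."""
    excluded = [
        name
        for flag, names in FIELDS_BY_FLAG.items()
        if not getattr(config, flag)
        for name in names
    ]
    visible = {k: v for k, v in profile.items() if k not in excluded}
    return visible, excluded
```

Returning the excluded list alongside the filtered profile means the audit record for each decision can state what the agents were not allowed to consider, rather than leaving it to be inferred from configuration history.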

Figure: Visible controls. TalentIQ past-runs review screen with score breakdown, matched skills, missing skills, and shortlist status for a candidate.
Recruiters need inspectable scoring, not a hidden number.
This review screen shows the level of transparency the system was designed for: scored dimensions, skill-match evidence, outcome state, and agent-attributed review history.

Key Design Decisions

Architecture choices that made the system auditable and defensible

Multi-agent over single-prompt

Single-prompt systems cannot separate concerns cleanly. Multi-agent architecture gives each agent a clear responsibility, a testable boundary, and an interpretable contribution to the final decision.

Adversarial review built into the architecture

A system that only argues for its own conclusion hides its weaknesses. The panel discussion layer forces every decision to survive structured criticism before it reaches the recruiter.

AutoGen over a custom orchestration layer

AutoGen handles agent communication, group-chat patterns, and role definitions out of the box. That let the team focus on the screening domain rather than reinventing orchestration.

Stateful memory over stateless inference

A screening system that does not learn from recruiter feedback cannot get better at scoring against a specific hiring bar. Per-company memory means every override compounds.

Inline bias auditing, not sampled

Most AI fairness tooling is bolted on after the fact. TalentIQ made the auditor a peer agent in Layer 2, so every scored candidate is audited.
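The counterfactual check the Bias Auditor performs (swap a name, school, or geography signal, re-score, and flag large deltas) can be sketched as follows. The threshold, the swap values, and the profile field names are assumptions, and `score` stands in for whatever evaluator is being audited.

```python
from typing import Callable

DELTA_THRESHOLD = 0.05   # assumed flagging threshold

# Assumed swap values; a real audit would draw these from a curated set.
COUNTERFACTUALS = {
    "name": ["Candidate A", "Candidate B"],
    "school": ["University X", "University Y"],
    "country": ["Country P", "Country Q"],
}


def audit(profile: dict, score: Callable[[dict], float]) -> list[dict]:
    """Re-score the profile with one proxy field swapped at a time; flag large deltas."""
    baseline = score(profile)
    flags = []
    for field_name, values in COUNTERFACTUALS.items():
        for value in values:
            if profile.get(field_name) == value:
                continue   # not a counterfactual if nothing changes
            variant = {**profile, field_name: value}
            delta = score(variant) - baseline
            if abs(delta) > DELTA_THRESHOLD:
                flags.append({"field": field_name, "value": value, "delta": round(delta, 3)})
    return flags
```

Swapping one field at a time keeps each flag attributable to a single signal. Because the audit needs only a scoring function and a profile, running it on every candidate rather than a sample costs a fixed number of extra scoring calls per candidate.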

Outcome

A concrete reference point for responsible, learning-based AI screening

TalentIQ showed that layered multi-agent screening can produce recommendations that have already gone through adversarial review before a recruiter sees them.

The system adapts to a company's hiring standards through recruiter feedback while remaining auditable and bias-aware.

Reasoning gaps surfaced

The panel discussion layer surfaced gaps in Layer 2 evaluator reasoning that single-pass systems would have missed entirely.

Scores moved on strong counter-cases

The Counter-Challenge Agent moved the Synthesiser's score in enough cases to show that adversarial review was affecting outcomes.

Good candidates were rescued

The Steelman Agent rescued candidates Layer 2 had under-weighted on surface signals, including the under-titled backend engineer.

Bias drift became visible

The bias auditor caught counterfactual score drift on early prototypes that would not have been visible without inline auditing.

Recruiter feedback converged

The gap between agent scores and recruiter overrides narrowed across batches as the Learning Agent absorbed override patterns.

A defensible reference architecture

The evaluating platform team received a baseline for what responsible, learning-based, adversarial AI screening would need to show compliance teams or external auditors.

Tech Stack

Architecture behind TalentIQ

AI Layer

Microsoft AutoGen orchestration, Group-chat panel discussion, Custom agent role definitions, Anthropic Claude, OpenAI models, Constrained reasoning schemas, Counterfactual generation pipeline

Backend

Python, FastAPI, Redis inter-agent state, Batch processing, Structured audit APIs, Override pattern analysis

Frontend

Angular, TypeScript, Recruiter review UI, Override workflow, Feedback capture, Decision trace views

Infrastructure

Docker containers, Configuration-as-code, Agent role configs, Signal exclusion configs, Learning threshold configs, Drift monitoring

Memory Systems

PostgreSQL audit trail, Structured company memory, Qdrant semantic retrieval, Per-company memory isolation, Recruiter feedback memory

Engagement Model: Capability demonstration partnership

Timeline: 4 weeks from kickoff to working demonstration

Team Size: 2 engineers (1 AI/ML, 1 full-stack)

Outcome: Reference architecture for responsible, learning-based, adversarial AI screening
