How We Built an AI System to Evaluate Competencies from Conversations
At Voicit, we generate interview reports for selection processes. One of our most complex features is competency evaluation through critical incidents - a system that analyzes conversations and determines a candidate’s competency level based on behavioral evidence.
This is the technical story of how we built it, the methodology behind it, and what we learned along the way.
Context: What We’re Solving
In Voicit, users can generate interview reports using report templates. These templates allow adding sections that extract specific information: work experience, salary range, technical skills, etc.
Among these sections, some are simple and others are complex. Competency evaluation through critical incidents is one of the complex ones.
Users can select competencies from our competency dictionary, which includes both Voicit-defined competencies and custom ones that teams can create and share. Each competency has:
- Name
- Definition or description
- Evaluation levels (e.g., Basic, Intermediate, Advanced, Expert)
The output for each competency includes:
- Detected level with its definition
- Level justification analyzing patterns and limitations detected in critical incidents
- Critical incidents used to determine the level
- Recommendations on what to probe further
The challenge was: how do you reliably extract behavioral evidence from a conversation and map it to competency levels in a way that's useful for combining with formal test results?
The Three-Phase Architecture
We broke down the problem into three distinct phases, each with a specific responsibility.
Phase 1: Extraction and Classification of Critical Incidents
Goal: Identify conversation fragments that demonstrate the presence (or absence) of a specific competency.
The AI system analyzes:
- The interview transcription
- The competency name
- The competency definition (from the user’s dictionary)
Key insight: Not all evidence is a critical incident, but every critical incident is evidence.
This distinction is crucial when working with AI systems, which tend to treat any piece of evidence as a valid critical incident. We needed solid behavioral evidence: complete behavioral narratives, not isolated mentions.
Each critical incident follows the SAR model (Situation-Action-Result) and is classified into a JSON object:
| Field | Role | Notes |
|---|---|---|
| impact | Overall episode assessment | “positive” / “negative” - reflects behavioral effectiveness |
| intensity | Strength level of the incident | “weak” / “moderate” / “strong” |
| intensity_reason | Justification for the intensity | Allows auditing and automated weighting |
| context.situation | Context or environment | Essential (defines the scenario) |
| context.task | Responsibility or objective | Essential (defines the person’s role) |
| behavior | Specific action(s) taken | Optional if candidate doesn’t detail clear behaviors |
| result | Observable consequence or impact | Optional, but valuable for calibrating effectiveness |
| learning | Reflection or derived learning | Optional, shows maturity or self-awareness |
| timeKeys | Temporal location | Very useful for auditing or reviewing audio/video excerpts |
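To make the structure concrete, here is a minimal sketch of how an extracted incident could be represented and validated in code. The field names follow the table above; the `CriticalIncident` class and `is_valid` helper are illustrative, not Voicit's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CriticalIncident:
    impact: str                      # "positive" / "negative"
    intensity: str                   # "weak" / "moderate" / "strong"
    intensity_reason: str            # justification, enables auditing
    situation: str                   # essential: defines the scenario
    task: str                        # essential: defines the person's role
    behavior: Optional[str] = None   # optional if no clear behavior is described
    result: Optional[str] = None     # optional, calibrates effectiveness
    learning: Optional[str] = None   # optional, shows self-awareness
    time_keys: list = field(default_factory=list)  # for audio/video review

    def is_valid(self) -> bool:
        """An incident needs at least the essential context fields."""
        return bool(self.situation.strip()) and bool(self.task.strip())

# Hand-written example (not real model output)
incident = CriticalIncident(
    impact="positive",
    intensity="strong",
    intensity_reason="Clear behavior with a measurable result",
    situation="Production outage during a product launch",
    task="On-call engineer responsible for restoring service",
    behavior="Coordinated a rollback and communicated status every 15 minutes",
    result="Service restored in 40 minutes with no data loss",
    time_keys=["00:12:30", "00:14:05"],
)
```

The `is_valid` check mirrors the essential/optional split in the table: situation and task are mandatory, everything else degrades gracefully.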
This classification ensures critical incidents have sufficient quality parameters to serve for subsequent level evaluation. That’s why there’s currently no intermediate phase to evaluate the quality of extracted incidents - the structure itself enforces quality.
Phase 2: Competency Level Evaluation
Goal: Determine the level achieved in a specific competency based on the critical incidents extracted in Phase 1, integrating both positive and negative evidence.
Methodological foundation: Based on BEI (Behavioral Event Interview) models and Critical Incident Technique (Flanagan, 1954).
The competency level is deduced from:
- Consistency of observed behaviors across different situations
- Complexity of contexts where behaviors manifest
- Degree of autonomy and impact demonstrated
- Learning capacity or transfer to new scenarios
The system analyzes all critical incidents (positive and negative) and contrasts them with the competency dictionary’s level definitions.
Output structure:
| Field | Role | Notes |
|---|---|---|
| level_label | Assigned level identifier | Name of existing levels in the competency within the dictionary. |
| level_definition | Description of assigned level | Definition of existing level in the competency within the dictionary. |
| confidence_score | Confidence degree (1–10) | 1 = very low confidence (little/weak evidence), 10 = maximum confidence (multiple solid, consistent incidents). |
| critical_gaps | List of critical deficiencies | Identifies areas without evidence or with insufficient evidence (e.g., “Lack of measurable results”, “No leadership behaviors observed”). |
| critical_incidents_justification | Link between incidents and level | List describing how each critical incident contributes to (or limits) the assigned level. |
| critical_incidents_justification[].incident_id | Unique incident identifier | ID from Phase 1, maintains traceability. |
| critical_incidents_justification[].content | Incident description and relevance | Interpretive summary describing what behavior or fact was relevant. |
| critical_incidents_justification[].relevance_to_level | Impact interpretation on assigned level | Explains how the incident reinforces or limits competency relative to selected level. |
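In code, a Phase 2 output could look like the example below, along with a structural check before handing it to Phase 3. Both the example values and the `is_well_formed` helper are our illustration, not Voicit's actual contract:

```python
# Hand-written example of a Phase 2 evaluation (not real model output)
evaluation = {
    "level_label": "Advanced",
    "level_definition": "Leads complex initiatives with autonomy and impact",
    "confidence_score": 7,
    "critical_gaps": ["No measurable results in cross-team settings"],
    "critical_incidents_justification": [
        {
            "incident_id": "ci-01",
            "content": "Led an incident response under time pressure",
            "relevance_to_level": "Shows autonomy consistent with Advanced",
        }
    ],
}

def is_well_formed(ev: dict) -> bool:
    """Check the structural contract before passing the output to Phase 3."""
    required = {"level_label", "level_definition", "confidence_score",
                "critical_gaps", "critical_incidents_justification"}
    if not required <= ev.keys():
        return False
    if not 1 <= ev["confidence_score"] <= 10:
        return False
    # Every justification entry must keep traceability to a Phase 1 incident
    return all({"incident_id", "content", "relevance_to_level"} <= j.keys()
               for j in ev["critical_incidents_justification"])
```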
Key criteria:
- Strong positive incidents reinforce high levels if they show observable behaviors with impact or autonomy
- Strong negative incidents can limit the maximum possible level if they affect essential aspects (ethics, leadership, results)
- If incidents are insufficient, ambiguous, or routine, a lower level is assigned and the evidence gap is documented
- The confidence_score reflects the model’s certainty (1–10) based on quantity, coherence, and intensity of available incidents
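One way to read these criteria is as a weighting over incident quantity and intensity. The heuristic below is purely our illustration of that reading (in the actual system the LLM assigns the score directly):

```python
def heuristic_confidence(incidents: list[dict]) -> int:
    """Illustrative mapping from incident quantity/intensity to a 1-10 score."""
    if not incidents:
        return 1  # no evidence at all: minimum confidence
    weights = {"weak": 1, "moderate": 2, "strong": 3}
    total = sum(weights.get(i.get("intensity", "weak"), 1) for i in incidents)
    # Cap at 10; a single weak incident stays at the bottom of the scale
    return max(1, min(10, total))

heuristic_confidence([{"intensity": "strong"},
                      {"intensity": "strong"},
                      {"intensity": "moderate"}])  # -> 8
```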
The result of this phase is not narrative, but structured and explanatory. It defines the achieved level, the reasons, and areas without sufficient evidence. This becomes the foundation for Phase 3.
Phase 3: Competency Summary Generation
Goal: Transform the structured evaluation from Phase 2 into an interpretive narrative summary that:
- Clearly presents facts supporting the level evaluation
- Synthesizes behavioral patterns, consistency, and transferability of the competency
- Highlights critical gaps identified
This summary is designed for the selection consultant or recruiter to support their professional judgment alongside formal competency tests, to reach a clear conclusion.
Summary structure:
Assigned level and definition
- Indicates final level along with its description
Level justification: patterns and limitations
- Explains recurring behaviors, how they relate, and what level of complexity or autonomy they imply
- Limitations found, relating them to the assigned competency level
Supporting evidence
- Summarizes behaviors, contexts, and results observed in the most representative critical incidents
- What they did (behavior)
- In what context and task
- What result they obtained
- What learning or development they showed
- Provides time references to find it in the conversation
Aspects to probe further
- Analysis of aspects needing deeper exploration to improve competency evaluation
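The structure above maps naturally onto a prompt template fed with Phase 2 output. The template below is a hypothetical sketch of that mapping, not Voicit's actual prompt:

```python
SUMMARY_PROMPT = """\
You are writing a competency summary for a recruiter.

Competency: {competency_name}
Assigned level: {level_label} - {level_definition}
Confidence (1-10): {confidence_score}
Critical gaps: {critical_gaps}

For each incident below, explain the behavior, its context and task,
the result obtained, and any learning shown, with time references:
{incidents}

Structure the summary as: assigned level and definition; level
justification (patterns and limitations); supporting evidence;
aspects to probe further.
"""

# Hand-written example values standing in for Phase 2 output
prompt = SUMMARY_PROMPT.format(
    competency_name="Leadership",
    level_label="Advanced",
    level_definition="Leads initiatives with autonomy and impact",
    confidence_score=7,
    critical_gaps="No cross-team evidence",
    incidents="- [00:12:30] Coordinated an incident response under pressure",
)
```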
How Selection Teams Actually Use This
An important part of this critical incident competency analysis is how selection teams use it.
Voicit gives them guidance on the competency level, together with the supporting evidence, which they can combine with their professional experience to:
- Contrast with their own conclusions
- Compare with competency test results
- Complement test results with detected critical incidents
This enables a more complete and objective final competency evaluation.
It's not about replacing human judgment - it's about giving consultants structured, traceable evidence to make better decisions.
The Surprising Learning: Reasoning Models Aren’t Always Better
One of the most interesting findings during development: reasoning LLMs are not necessary for this type of analysis.
They don’t improve results, and they add substantial latency.
For structured behavioral analysis with clear frameworks (like SAR model and competency dictionaries), traditional LLMs with good prompting outperform reasoning models in both quality and speed.
This was counterintuitive but consistent across our testing.
Implementation Summary
The critical incident competency evaluation section extracts critical incidents that include: impact, intensity, intensity reason, situation, task, behavior, result, learning, and time references.
With this data, competency level is evaluated based on the competency dictionary. The evaluation generates:
- Competency level and definition
- Evaluation confidence
- Critical limitations
- Level justification according to critical incidents
Finally, a summary for the consultant is created, showing the assigned level and its definition, justification, analyzed critical incidents, and recommendations to probe further based on detected limitations.
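Putting the three phases together, the orchestration can be sketched as a simple chain. Here `call_llm(request: dict) -> dict` is a hypothetical stand-in for whatever LLM client is used; the function names and request shapes are ours, not Voicit's API:

```python
def evaluate_competency(transcript: str, competency: dict, call_llm) -> dict:
    """Chain the three phases of the critical incident evaluation."""
    # Phase 1: extract and classify critical incidents (SAR model)
    incidents = call_llm({"phase": "extract",
                          "transcript": transcript,
                          "competency": competency})
    # Phase 2: map incidents to a level from the competency dictionary
    evaluation = call_llm({"phase": "evaluate",
                           "incidents": incidents,
                           "levels": competency["levels"]})
    # Phase 3: narrative summary for the consultant
    summary = call_llm({"phase": "summarize",
                        "evaluation": evaluation})
    return {"incidents": incidents,
            "evaluation": evaluation,
            "summary": summary}
```

Each phase consumes only the structured output of the previous one, which is what keeps the final summary traceable back to specific incidents.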
Why This Matters
Building this system taught us that AI doesn’t replace expertise - it structures it.
The methodology (BEI, Critical Incident Technique) existed long before LLMs. What AI enables is:
- Scale - Analyze hours of conversation in minutes
- Consistency - Apply the same framework uniformly
- Traceability - Link every conclusion to specific evidence
- Augmentation - Give consultants tools to make better decisions faster
The magic isn’t in the AI. It’s in combining solid methodology with AI’s ability to process and structure information at scale.
FAQs
Does the level definition change from one candidate to another?

No. The level definition remains fixed, according to the competency dictionary.
What varies is the justification: it adapts to the critical incidents and evidence observed in each interview, which are unique to each candidate.
How are competencies evaluated?

Competencies are analyzed based on critical incidents mentioned in the conversation. From these incidents we extract:
- Detected level and definition
- Justification of detected level based on critical incidents
- List of evidence
- Recommendations on what points to probe further to improve competency evaluation
This is another insight into how we’ve built product at Voicit. If you’re working on similar challenges with AI and structured analysis, I’d love to hear your approach.