If you interviewed for an AI role two years ago, you got grilled on transformers, attention heads, and fine-tuning tradeoffs. If you interview for one today, you will still get some of that, but the center of gravity has moved. Companies are not just hiring model builders anymore. They are hiring system architects who can design agents that reason, plan, call tools, recover from failures, and do it all without setting money on fire.
This shift has changed what gets asked in interviews. The questions are more about judgment than recall, more about production tradeoffs than theory, and more about system design than math. Below are 40 questions we have seen come up repeatedly in agentic AI interviews at both frontier labs and enterprise AI teams in 2026, organized from fundamentals through production.
Part 1: Agent Fundamentals
1. What actually makes something an "agent" versus a regular LLM app?
The minimum bar most interviewers use: an agent has a loop, tools, and some form of state. The model decides what to do next, invokes a tool, observes the result, and decides again. A single-turn "ask LLM, return response" pipeline is not an agent. A RAG pipeline with retrieval followed by generation is not an agent either - it is a fixed workflow. The distinguishing feature is that control flow is determined by the model, not by your code.
2. When should you NOT use an agent?
This is a trap question to weed out candidates who have only read marketing blog posts. Agents are expensive, slow, and harder to debug than deterministic code. If your task has a clear, stable sequence of steps, build a pipeline. Reach for an agent when the task genuinely requires branching based on information you cannot know at design time, or when the set of possible actions is too large to hardcode.
3. Explain the ReAct pattern.
Reason, Act, Observe. The model produces a thought about what to do, emits an action (typically a tool call), observes the output, and loops. The value is that reasoning is made explicit in the context, which both improves tool selection and gives you something to debug when things go wrong. Most modern agent frameworks are variations on this idea.
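The Reason-Act-Observe loop can be sketched in a few lines. This is a minimal illustration with a stubbed model and a hypothetical `lookup_weather` tool, not a real API - the point is that the model's output, not your code, decides the next step, and a turn cap guards against infinite loops:

```python
# Minimal ReAct-style loop. `fake_model` stands in for an LLM call;
# the tool and its name are illustrative placeholders.

def lookup_weather(city: str) -> str:
    """Hypothetical tool: returns canned weather data."""
    return f"Sunny, 22C in {city}"

TOOLS = {"lookup_weather": lookup_weather}

def fake_model(context: list[str]) -> dict:
    """Stand-in for the model: decides the next action from context."""
    if not any("Observation:" in line for line in context):
        return {"thought": "I need the weather first.",
                "action": "lookup_weather", "args": {"city": "Paris"}}
    return {"thought": "I have enough to answer.",
            "action": "finish", "args": {"answer": context[-1]}}

def react_loop(task: str, max_turns: int = 5) -> str:
    context = [f"Task: {task}"]
    for _ in range(max_turns):              # turn cap prevents infinite loops
        step = fake_model(context)          # Reason
        context.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["args"]["answer"]
        result = TOOLS[step["action"]](**step["args"])   # Act
        context.append(f"Observation: {result}")         # Observe
    return "Gave up: turn limit reached"

print(react_loop("What is the weather in Paris?"))
```

Because every thought and observation is appended to context, the full trajectory is available for debugging when a run goes wrong.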
4. What is the difference between a tool-using agent and a function-calling model?
Function calling is a model capability - the model emits structured output describing a function to invoke. A tool-using agent is a system built around that capability, including the runtime loop, tool execution, error handling, and state management. In practice, candidates conflate the two, and interviewers notice.
5. What are the typical failure modes of an agent in production?
The ones that come up most: infinite loops when the model cannot make progress, tool hallucination (calling tools that do not exist), parameter hallucination (calling real tools with invalid arguments), premature termination, and context window blowup. Being able to name these and describe mitigations is table stakes.
Part 2: Architecture and Design
6. Walk me through the architecture of an agent you have built.
Interviewers want to hear about specific choices and why you made them. Cover the control loop (single agent vs. multi-agent, framework used), memory strategy (short-term scratchpad, long-term store), tool interface, error handling, and observability. Generic answers ("I used LangChain to build a chatbot") fail this question.
7. When would you choose a single agent over a multi-agent system?
Start single. Multi-agent systems add coordination overhead, more failure modes, and much harder debugging. Move to multi-agent when you have genuinely distinct specializations (different tool sets, different system prompts that would conflict in one agent) or when parallelism meaningfully improves latency.
8. How do you design the tool interface an agent sees?
Tools should be small, composable, and have clear, unambiguous descriptions. A common mistake is exposing your internal API surface directly to the agent. Instead, design tools for the agent: one tool per task, descriptive names, strongly typed parameters, and error messages that help the model self-correct.
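As a sketch, here is what "design tools for the agent" can look like in practice. The tool name, data, and OpenAI-style JSON schema below are all illustrative assumptions - note the error message written so the model can self-correct:

```python
# A tool designed for the agent rather than mirroring an internal API.
# Names, data, and schema shape are illustrative.

def get_order_status(order_id: str) -> dict:
    """Look up the shipping status of a single order by its ID."""
    orders = {"A-100": "shipped"}          # stand-in for a real data store
    if order_id not in orders:
        # Error message phrased so the model can fix its next call
        return {"error": f"order_id '{order_id}' not found; "
                         "IDs look like 'A-123'"}
    return {"order_id": order_id, "status": orders[order_id]}

GET_ORDER_STATUS_SPEC = {
    "name": "get_order_status",
    "description": "Return the shipping status of one order. "
                   "Use when the user asks where their order is.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string",
                         "description": "Order ID, e.g. 'A-123'"},
        },
        "required": ["order_id"],
    },
}
```

One tool, one task, a description that says when to use it, and a typed parameter with an example format.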
9. What is the tradeoff between giving an agent many fine-grained tools versus a few high-level ones?
Many fine-grained tools give the model flexibility but increase the chance it picks the wrong one or chains them incorrectly. Few high-level tools are easier to reason about but less expressive. In practice, teams often start with high-level tools and add granularity only when they see the agent failing to accomplish tasks.
10. How do you handle long-running agent tasks?
Checkpointing is the core idea. Persist agent state after each step so you can resume, retry, or inspect. LangGraph, Temporal, and custom solutions on top of a database all work. The critical point in interviews is recognizing that an in-memory agent loop is not production-viable for tasks longer than a few seconds.
Part 3: LangGraph and Framework Questions
11. Why LangGraph over plain LangChain agents?
LangChain's original agent abstractions were opaque and hard to customize. LangGraph makes the state machine explicit: nodes, edges, and shared state. That explicitness matters in production because you can reason about exactly what happens on each step, add conditional routing, and checkpoint cleanly. If you are building anything non-trivial, the ceiling is higher.
12. What goes in LangGraph state and what does not?
State should include everything that matters across nodes: the conversation, intermediate results, counters, flags. Keep it serializable - no database connections, no open file handles. A good heuristic: if you could not pickle it and replay it tomorrow, it does not belong in state.
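The pickle heuristic can be made literal. The state shape below is an illustrative sketch, not a LangGraph API:

```python
# "Could you pickle it and replay it tomorrow?" as an actual test.
import pickle
from typing import TypedDict

class AgentState(TypedDict):
    messages: list[str]      # conversation history
    retry_count: int         # counters and flags belong in state
    done: bool

state: AgentState = {"messages": ["hi"], "retry_count": 0, "done": False}

# Serializable state survives a round trip, so it can be checkpointed
restored = pickle.loads(pickle.dumps(state))
assert restored == state

# A database connection or open file handle would fail this round trip,
# which is the signal that it belongs outside graph state.
```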
13. How do you implement human-in-the-loop with LangGraph?
Use interrupt nodes. The graph pauses, state is persisted, and a human reviews or edits before the graph resumes. The key insight is that the persistence layer (checkpointer) is what makes this work - without durable state, you cannot meaningfully pause and resume across a human review window.
14. Compare LangGraph, CrewAI, and AutoGen.
LangGraph is a state machine for agents with the most explicit control. CrewAI is higher-level, organized around roles and tasks, and shines when you want a team-of-agents metaphor. AutoGen emphasizes conversational multi-agent patterns with a Microsoft research pedigree. The honest answer is that the right choice depends on how much control you need and how much opinionation you want baked in.
15. When would you skip a framework entirely?
When your agent is simple enough that a framework adds more complexity than it removes. A single-agent loop with three tools and no memory can be 80 lines of Python. Frameworks earn their keep when you need checkpointing, streaming, multi-agent orchestration, or observability out of the box.
Part 4: Memory and Context
16. Explain the difference between short-term and long-term memory in an agent.
Short-term memory is the working context within a single task or session - typically just the conversation history plus scratchpad state. Long-term memory persists across sessions and is usually backed by a vector store or database. The two are accessed differently: short-term lives in the prompt, long-term is retrieved on demand.
17. How do you prevent context window blowup in a long-running agent?
Several strategies stack: summarization of older turns, selective retrieval instead of dumping everything, tool outputs stored externally with only pointers in context, and aggressive pruning of intermediate steps that are no longer relevant. Interviewers want to see that you think about context as a finite, expensive resource.
18. How would you design long-term memory for a user-facing agent?
A typical approach: vector embeddings of past interactions, metadata filtering by user and time, and a retrieval step at the start of each session that pulls the most relevant facts. The hard part is not storage - it is deciding what to remember and when to forget. Many teams implement explicit memory write tools the agent can call.
19. What is episodic vs. semantic memory in an agent context?
Episodic memory stores specific past events ("user asked about X on date Y"). Semantic memory stores distilled facts ("user prefers Python"). Production systems usually need both, and the tricky engineering work is converting episodic memory into semantic memory over time without losing important specifics.
20. How do you handle stale or contradictory information in memory?
Timestamping, confidence scores, and explicit invalidation. For user facts that can change (preferences, employment, location), you need a write path that replaces rather than appends. For facts that accumulate (project history), you keep them. This is where generic vector-store-as-memory breaks down and you end up needing a real schema.
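The replace-versus-append distinction is easy to show concretely. The key names and schema below are illustrative assumptions:

```python
# A memory write path: mutable facts are replaced, accumulating facts
# are appended, and everything is timestamped for staleness checks.
import time

MUTABLE_KEYS = {"location", "employer", "preferred_language"}

def write_fact(memory: dict, key: str, value: str) -> None:
    entry = {"value": value, "ts": time.time()}
    if key in MUTABLE_KEYS:
        memory[key] = [entry]            # replace: only the latest is true
    else:
        memory.setdefault(key, []).append(entry)   # append: history grows

memory: dict = {}
write_fact(memory, "location", "Berlin")
write_fact(memory, "location", "Lisbon")      # supersedes Berlin
write_fact(memory, "projects", "agent-eval")
write_fact(memory, "projects", "rag-search")  # both projects are kept

print(memory["location"][-1]["value"])
print(len(memory["projects"]))
```

A plain vector store gives you neither the replace semantics nor the timestamps, which is why mutable user facts eventually push you toward a real schema.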
Part 5: Multi-Agent Systems
21. What coordination patterns do you know for multi-agent systems?
The big ones: supervisor/worker (one agent routes tasks to specialists), hierarchical (trees of supervisors), peer-to-peer (agents negotiate directly), and pipeline (linear handoff). Most production systems end up looking like supervisor/worker because it is the easiest to reason about and debug.
22. How do agents communicate in a multi-agent system?
Through shared state, direct messages, or a message bus. In LangGraph, it is typically shared state. In CrewAI, it is task handoffs. The interesting design decision is whether agents see each other's full reasoning or just final outputs. More information helps coordination but costs tokens and increases the blast radius of one agent's errors.
23. How do you prevent agents from talking past each other or getting stuck in loops?
Explicit termination conditions, turn limits, and a supervisor that can break ties. Shared scratchpad helps agents see what has already been tried. The single biggest fix in practice is giving the supervisor the authority to say "stop, we are done" rather than letting agents negotiate their own termination.
24. Describe a case where multi-agent was worth the complexity.
Document processing pipelines with genuinely different specializations (extraction, classification, summarization, verification) benefit from multi-agent because each sub-agent can have a focused system prompt and toolset. Research agents with parallel exploration also benefit. The anti-pattern is splitting a task into two agents just because you can.
25. What is an "agent-to-agent" protocol and why does it matter?
Emerging standards (Google's A2A being the prominent one) let agents from different vendors interoperate. Think of it as a shared language for capability discovery, task delegation, and result exchange. In interviews for forward-looking teams, knowing the direction this is moving signals you are reading the field.
Part 6: Tools and Function Calling
26. How do you write a good tool description?
Clear purpose, explicit parameter types and constraints, and an example of correct usage. The model reads the description as part of deciding whether to call the tool, so ambiguity here directly causes bad tool selection. Test by reading it yourself and asking if you could use the tool correctly from the description alone.
27. How do you handle tool errors?
Return structured error messages that the model can read and react to. "Failed: user_id not found" is useful. A stack trace is not. The goal is for the agent to self-correct on the next iteration rather than crash. Also worth mentioning: some errors should abort the whole agent, and it is your job to distinguish.
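A sketch of that split between recoverable and fatal errors - the exception names and sample tool are illustrative:

```python
# Recoverable tool errors go back to the model as readable text;
# fatal ones abort the whole run.

class FatalToolError(Exception):
    """Errors that should stop the agent entirely (auth failure, etc.)."""

def call_tool(tool, **kwargs) -> str:
    try:
        return str(tool(**kwargs))
    except FatalToolError:
        raise                            # bubble up: abort the agent
    except (KeyError, ValueError) as e:
        # Recoverable: a message the model can act on next iteration
        return f"Tool error: {e}. Check the arguments and retry."

def get_user(user_id: str) -> dict:
    users = {"u1": {"name": "Ada"}}
    if user_id not in users:
        raise KeyError(f"user_id '{user_id}' not found")
    return users[user_id]

print(call_tool(get_user, user_id="u2"))   # readable error, no crash
print(call_tool(get_user, user_id="u1"))   # normal result
```

The `except` clauses are where you encode the distinction the answer above describes: what the model should retry versus what should page a human.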
28. How do you prevent an agent from calling a destructive tool accidentally?
Require confirmation for destructive operations, either through a separate confirmation tool, a human-in-the-loop interrupt, or role-based tool access. "The model said yes" is not an acceptable answer for anything that deletes data, spends money, or sends messages to users.
29. What is tool poisoning and how do you mitigate it?
An attacker embeds instructions in tool output (e.g., a web page the agent scrapes) that manipulate the agent's behavior. Mitigations: never trust tool output as instructions, separate system prompts from user content, sandbox untrusted content, and monitor for anomalous tool call patterns.
30. How do you decide what to expose as a tool versus hardcode in a workflow?
If it is the same every time, hardcode it. Expose as a tool when the call is conditional, parameterized in ways that benefit from model reasoning, or part of a space of possible actions. The bias should be toward hardcoding - agents are expensive, and every tool call is a chance for things to go wrong.
Part 7: Evaluation and Observability
31. How do you evaluate an agent?
Three layers: unit evals on individual tool calls and reasoning steps, trajectory evals on full agent runs, and outcome evals on whether the agent actually accomplished the task. Most teams underinvest in the middle layer and end up with agents that pass unit tests but fail in realistic scenarios.
32. What is a "golden dataset" for agents?
A curated set of realistic tasks with known-good trajectories or outcomes. You run the agent against it after every prompt or tool change, and compare results. The dataset needs to grow as you find new failure modes. This is the single highest-leverage practice for keeping an agent reliable as you iterate on it.
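The mechanics are simple enough to sketch: each case pairs a task with a check on the outcome, and the whole suite runs on every change. The agent stub and cases below are illustrative placeholders for real tasks and a real agent:

```python
# Minimal golden-dataset runner: tasks plus outcome checks.

GOLDEN_CASES = [
    {"task": "2+2", "check": lambda out: "4" in out},
    {"task": "capital of France", "check": lambda out: "Paris" in out},
]

def agent(task: str) -> str:             # stand-in for the real agent
    return {"2+2": "The answer is 4",
            "capital of France": "Paris is the capital"}.get(task, "unknown")

def run_golden(cases) -> dict:
    failures = [c["task"] for c in cases
                if not c["check"](agent(c["task"]))]
    return {"passed": len(cases) - len(failures), "failures": failures}

report = run_golden(GOLDEN_CASES)
print(report)
```

Outcome checks like these tolerate wording changes across model versions; trajectory-level checks (did it call the right tools in a sensible order?) layer on top of the same harness.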
33. How do you debug an agent that is behaving unexpectedly in production?
Traces. You need full visibility into every tool call, every model response, every state transition. LangSmith, Arize, and similar tools exist for this reason. "Check the logs" is not enough when a single request can involve 20 model calls and 15 tool invocations.
34. How do you know if your agent regression is from a prompt change or a model change?
Version everything: prompts, tool definitions, model version, framework version. Run your golden dataset on every change. If you cannot answer this question cleanly, you do not have a reproducible agent - you have a prototype.
35. What metrics do you track for agents in production?
Task success rate, turns-to-completion, token usage per task, tool call success rate, latency distribution, and cost per task. Also track model-specific metrics like refusal rate and format violation rate. The cost metric is the one candidates often forget, and it matters - agents can quietly 10x your bill.
Part 8: Production and Scale
36. How do you control cost in an agentic system?
Cap turns, use smaller models for routing and larger models only for complex reasoning, cache aggressively where outputs are deterministic, and prune context. Structural fixes beat prompt engineering here - picking the right model for each node saves more than any prompt optimization.
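The routing idea can be sketched as follows. The model names, prices, and keyword classifier are all illustrative - a real router would itself be a small-model call:

```python
# Route cheap requests to a small model; reserve the large one for
# genuinely complex reasoning. Names and prices are hypothetical.

MODELS = {"small": 0.15, "large": 5.00}   # illustrative $ per 1M tokens

def classify(task: str) -> str:
    # Stand-in for a small-model classification call
    keywords = ("plan", "analyze", "multi-step")
    return "complex" if any(k in task for k in keywords) else "simple"

def route(task: str) -> str:
    return "large" if classify(task) == "complex" else "small"

print(route("summarize this email"))
print(route("plan a multi-step migration"))
```

Even a rough router pays for itself quickly when most traffic is simple, which is why this structural fix beats tuning prompts on the expensive model.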
37. How would you handle a spike in traffic to an agentic system?
Queueing and backpressure. Agents have unpredictable latency, so synchronous request/response at scale is a losing game. Most production systems look like: user submits task, agent runs async, result is delivered via callback or polling. This is a different architecture from a standard web API, and candidates who have actually shipped agents know this.
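The submit/poll shape looks roughly like this. The in-process queue and worker below only illustrate the architecture - production systems use a durable broker and result store:

```python
# Submit/poll sketch: the client gets a task ID immediately, a worker
# runs the agent asynchronously, and results land in a store.
import queue
import threading
import uuid

tasks: queue.Queue = queue.Queue()
results: dict[str, str] = {}

def submit(task: str) -> str:
    task_id = str(uuid.uuid4())
    tasks.put((task_id, task))
    return task_id                       # client gets an ID, not an answer

def worker() -> None:
    while True:
        task_id, task = tasks.get()
        results[task_id] = f"done: {task}"   # stand-in for an agent run
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

tid = submit("summarize Q3 report")
tasks.join()                             # real clients poll or get a callback
print(results[tid])
```

The queue is also where backpressure lives: when it grows past a threshold, you reject or defer new submissions instead of letting latency blow up.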
38. How do you deploy a new agent version safely?
Shadow traffic first - run the new version in parallel without showing results to users, and compare it against the production version on the golden dataset and real traffic. Then canary with a small percentage of real users. Agents can fail in subtle ways that only show up at scale.
39. What are the main security risks of agents?
Prompt injection, especially via tool outputs. Unauthorized tool use (the agent is tricked into calling a tool it should not). Data exfiltration through tool calls. Confused deputy (the agent acts on behalf of a user but with higher privileges). The mitigation playbook: least-privilege tools, input sanitization, output validation, and audit logs.
40. If you had to make one agent production-ready tomorrow, what would you prioritize?
Observability, a golden eval set, and a kill switch. Everything else can be iterated on. You cannot improve what you cannot measure, you cannot measure without an eval set, and you cannot sleep at night without the ability to turn the agent off if it goes sideways. Candidates who answer "better prompts" miss the point.
Closing Thoughts
The common thread across every one of these questions is that agents are systems, not prompts. The candidates getting offers in 2026 are the ones who treat agent development like any other production engineering problem: with eval harnesses, observability, versioning, and a clear-eyed view of failure modes.
If you are preparing for agent engineering interviews, do not just read about LangGraph. Build something end to end, break it, fix it, and pay attention to where the wheels actually come off. That experience is what interviewers are trying to surface with these questions.
Want to practice agent engineering interviews with realistic scenarios? gitGood.dev offers AI mock interviews and 1,000+ practice questions across ML, systems design, and coding. Get feedback on your answers and sharpen your judgment before the real thing.