Fix for kagent controller locking up with bad mcp server config #1232

tzhukov · 2026-01-22T15:55:41Z

Description

Motivation:
kagent-controller locks up when an MCPServer with bad configuration(ie MCP server is up but cannot communicate with the control plane) is applied. This blocks new agents passing from Not Accepted state into Ready/Not Ready

Error seen in kagent-controller:

"msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes requeuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler"

What changed:
- Resolving reconcile function bug in MCPServerToolController that wasn`t allowing controller-runtime to hangle exponential backoff properly.
- Corrected error handling where error was logged but not returned, this was making MCPServer not registered in the tool server db, but reconcile appeared successful. '
- Added explicit validation in validateAgent to check that MCP server reference exists and has valid port configurations before accepting agent.
- Added testing for this behavior.

Testing

To replicate the issue I applied this MCPServer + RemoteMCPServer manifest:

---
apiVersion: kagent.dev/v1alpha1
kind: MCPServer
metadata:
  name: example-mcp-server
  namespace: kagent
  labels:
    app: example-mcp-server
spec:
  deployment:
    image: alpine:latest
    port: 3001
    cmd: sh
    args:
      - "-c"
      - "while true; do sleep 3600; done"
  transportType: stdio
  stdioTransport: {}

---
apiVersion: kagent.dev/v1alpha2
kind: RemoteMCPServer
metadata:
  name: example-remote-mcp-server
  namespace: kagent
  labels:
    app: example-remote-mcp-server
spec:
  description: "Example remote MCP server"
  protocol: STREAMABLE_HTTP
  url: "http://example-mcp-server.kagent.svc.cluster.local:8081/mcp"
  timeout: 30s
  sseReadTimeout: 5m0s
  terminateOnClose: true

After this I tried creating a new agent and observed it being stuck in Not Accepted state and the error message mentioned in motivation. The same workflow after the PR change result in agents being accepted.

Instead we are appropriately see error logs about the new RemoteMCPServer:

"failed to upsert tool server for remote mcp server","remoteMCPServer":"kagent/example-remote-mcp-server","error":"failed to fetch tools for toolServer kagent /example-remote-mcp-server: failed to connect client for toolServer kagent/example-remote-mcp-server: calling \"initialize\": rejected by transport: sending \"initialize\":  Post \"http://example-mcp-server.kagent.svc.cluster.local: 8081/mcp\": dial tcp 10.96.47.128:8081: i/o timeout"

…`t allowing controller-runtime to hangle exponential backoff properly. Corrected error handling where error was logged but not returned, this was making MCPServer not registered in the tool server db, but reconcile appeared successful. Added explicit validation in validateAgent to check that MCP server reference exists and has valid port configurations before accepting agent. Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

…d removing a test we don`t need Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

Copilot

Pull request overview

This PR fixes a critical bug in the MCPServerToolController reconciliation logic that was causing the kagent-controller to lock up when MCPServers with invalid configurations were applied. The fix ensures proper error handling and backoff behavior in the controller-runtime framework.

Changes:

Fixed the Reconcile function to return zero Result when errors occur, allowing controller-runtime to handle exponential backoff properly
Corrected error handling in ReconcileKagentMCPServer to return errors instead of just logging them
Added explicit validation for MCPServer port configurations before accepting agents

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
go/internal/controller/mcp_server_tool_controller.go	Fixed reconcile function to properly return errors with zero Result for controller-runtime backoff handling
go/internal/controller/reconciler/reconciler.go	Changed error handling to return errors instead of just logging them, ensuring failed conversions trigger requeue
go/internal/controller/translator/agent/adk_api_translator.go	Added validation logic to check MCPServer references exist and have valid port configurations before accepting agents
go/internal/controller/translator/agent/mcp_validation_test.go	Added comprehensive test coverage for MCPServer validation including invalid ports, missing servers, and multiple tools
go/internal/controller/reconciler/mcp_server_reconciler_test.go	Added reconciler tests to verify error propagation and proper handling of invalid MCPServer configurations

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-22T19:29:32Z

go/internal/controller/translator/agent/adk_api_translator.go

+				_, err = ConvertMCPServerToRemoteMCPServer(mcpServer)
+				if err != nil {
+					return fmt.Errorf("MCPServer %s/%s has invalid configuration: %w", agent.Namespace, tool.McpServer.Name, err)
+				}


The validation switch statement only handles MCPServer types and silently ignores other valid types like RemoteMCPServer, Service, etc. This means that if an agent references a non-existent RemoteMCPServer or Service, the validation passes here but will fail later during translation. Consider adding validation for RemoteMCPServer and Service types to fail early with a clear error message, or add a default case that validates the existence of the referenced resource regardless of type.

Suggested change

}

}

case schema.GroupKind{

Group: "",

Kind: "Service",

}:

// Validate that the referenced Service exists

svc := &corev1.Service{}

err := a.kube.Get(ctx, types.NamespacedName{Namespace: agent.Namespace, Name: tool.McpServer.Name}, svc)

if err != nil {

return fmt.Errorf("failed to get Service %s/%s: %w", agent.Namespace, tool.McpServer.Name, err)

}

case schema.GroupKind{

Group: "kagent.dev",

Kind: "RemoteMCPServer",

}:

// Validate that the referenced RemoteMCPServer exists

remote := &v1alpha1.RemoteMCPServer{}

err := a.kube.Get(ctx, types.NamespacedName{Namespace: agent.Namespace, Name: tool.McpServer.Name}, remote)

if err != nil {

return fmt.Errorf("failed to get RemoteMCPServer %s/%s: %w", agent.Namespace, tool.McpServer.Name, err)

}

Copilot · 2026-01-22T19:29:32Z

go/internal/controller/translator/agent/mcp_validation_test.go

+// TestMCPServerValidation_RemoteMCPServer tests that validation also works for RemoteMCPServer.
+func TestMCPServerValidation_RemoteMCPServer(t *testing.T) {
+	ctx := context.Background()
+	scheme := schemev1.Scheme


In TestMCPServerValidation_RemoteMCPServer, the RemoteMCPServer is added to the fake client but v1alpha1.AddToScheme is not called for the kmcp API types. However, RemoteMCPServer is part of v1alpha2, so this is actually fine. Consider adding a comment to clarify that RemoteMCPServer is in v1alpha2, not v1alpha1, to avoid confusion.

Suggested change

scheme := schemev1.Scheme

scheme := schemev1.Scheme

// RemoteMCPServer is defined in the kagent v1alpha2 API, so only v1alpha2.AddToScheme

// is required here; v1alpha1.AddToScheme for kmcp types is not needed in this test.

EItanya

Just the one comment, otherwise this looks good :)

EItanya · 2026-01-23T14:13:23Z

go/internal/controller/translator/agent/adk_api_translator.go

-		err = a.validateAgent(ctx, toolAgent, state.with(agent))
-		if err != nil {
-			return err
+		case v1alpha2.ToolProviderType_McpServer:


Is this part actually necessary? later on in translation we return an error for this exact scenario. iirc the point of this function is really just to pre-check recursion depth so that we don't have to do the entire translation. and fail towards the end, but honestly it's a bit confusing and maybe we should just check the recursion depth during the actual translation. Either way I don't think this actually changes anything

Yeah, it's redundant here, we do it later again in translateMCPServerTarget. It does fail fast this way, but that doesn't really add any extra benefit, I'm going to remove it.

…s accordingly Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

… on bad object config Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

EItanya · 2026-01-23T20:01:47Z

go/internal/controller/reconciler/reconciler.go

+	if _, err := status.ReconcileMCPServerStatus(ctx, a.kube, mcpServer, nil); err != nil {
+		return err


We can't do this here, we don't own this resource, it's owned by kmcp

EItanya · 2026-01-23T20:04:41Z

go/internal/controller/reconciler/reconciler.go

+	remoteService, err := agent_translator.ConvertServiceToRemoteMCPServer(service)
+	if err != nil {
+		// Check if this is a validation error
+		var validationErr *agent_translator.ValidationError


As I mentioned in that other comment, we can't update the status of this resource in kagent at all because this is owned by kmcp. Also, I think this error checking/logic should happen in the controller, not the reconciler.

…ests to test out this logic Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

Signed-off-by: Tim Zhukov <51675972+tzhukov@users.noreply.github.com>

tzhukov · 2026-01-23T20:35:18Z

go/internal/controller/service_controller.go

 func (r *ServiceController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
 	_ = log.FromContext(ctx)
-	return ctrl.Result{}, r.Reconciler.ReconcileKagentMCPService(ctx, req)
+


Moved validation/error handling logic into controller

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

tzhukov added 2 commits January 22, 2026 11:14

refactoring mcp server reconciler tests to use slices.ContainsFunc an…

6700aec

…d removing a test we don`t need Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

tzhukov force-pushed the tzhukov/1097-kagent-controller-locks branch from 4dae1a6 to 6700aec Compare January 22, 2026 16:15

tzhukov added 2 commits January 22, 2026 14:11

comment cleanup

695c1ed

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

cleaned up tests in mcp_validation_test.go to be more specific

22b8096

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

tzhukov marked this pull request as ready for review January 22, 2026 19:24

tzhukov requested review from EItanya, ilackarms and yuval-k as code owners January 22, 2026 19:24

Copilot AI review requested due to automatic review settings January 22, 2026 19:24

Copilot started reviewing on behalf of tzhukov January 22, 2026 19:24 View session

Copilot AI reviewed Jan 22, 2026

View reviewed changes

tzhukov changed the title ~~Resolving reconcile function bug in MCPServerToolController that wasn…~~ Fix for kagent controller locking up with bad mcp server config Jan 23, 2026

EItanya reviewed Jan 23, 2026

View reviewed changes

removing redundant validation in adk_api_translator and updating test…

e4b68ee

…s accordingly Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

tzhukov requested a review from EItanya January 23, 2026 15:46

adding extra error handling states to the reconciler to avoid backoff…

12e6b7f

… on bad object config Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

EItanya reviewed Jan 23, 2026

View reviewed changes

tzhukov and others added 2 commits January 23, 2026 15:25

moving error checking/validation logic into controller, adding unit t…

bb6746a

…ests to test out this logic Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

Merge branch 'main' into tzhukov/1097-kagent-controller-locks

b9b4235

Signed-off-by: Tim Zhukov <51675972+tzhukov@users.noreply.github.com>

tzhukov commented Jan 23, 2026

View reviewed changes

refactoring tests to remove redundancy and make them more compact

2f7acf7

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for kagent controller locking up with bad mcp server config #1232

Fix for kagent controller locking up with bad mcp server config #1232

Uh oh!

tzhukov commented Jan 22, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 22, 2026

Uh oh!

Copilot AI Jan 22, 2026

Uh oh!

EItanya left a comment

Uh oh!

EItanya Jan 23, 2026

Uh oh!

tzhukov Jan 23, 2026

Uh oh!

EItanya Jan 23, 2026

Uh oh!

EItanya Jan 23, 2026

Uh oh!

tzhukov Jan 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-				}
+				}
+			case schema.GroupKind{
+				Group: "",
+				Kind:  "Service",
+			}:
+				// Validate that the referenced Service exists
+				svc := &corev1.Service{}
+				err := a.kube.Get(ctx, types.NamespacedName{Namespace: agent.Namespace, Name: tool.McpServer.Name}, svc)
+				if err != nil {
+					return fmt.Errorf("failed to get Service %s/%s: %w", agent.Namespace, tool.McpServer.Name, err)
+				}
+			case schema.GroupKind{
+				Group: "kagent.dev",
+				Kind:  "RemoteMCPServer",
+			}:
+				// Validate that the referenced RemoteMCPServer exists
+				remote := &v1alpha1.RemoteMCPServer{}
+				err := a.kube.Get(ctx, types.NamespacedName{Namespace: agent.Namespace, Name: tool.McpServer.Name}, remote)
+				if err != nil {
+					return fmt.Errorf("failed to get RemoteMCPServer %s/%s: %w", agent.Namespace, tool.McpServer.Name, err)
+				}

		if _, err := status.ReconcileMCPServerStatus(ctx, a.kube, mcpServer, nil); err != nil {
		return err

Fix for kagent controller locking up with bad mcp server config #1232

Are you sure you want to change the base?

Fix for kagent controller locking up with bad mcp server config #1232

Uh oh!

Conversation

tzhukov commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

EItanya left a comment

Choose a reason for hiding this comment

Uh oh!

EItanya Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

tzhukov Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

EItanya Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

EItanya Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

tzhukov Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tzhukov commented Jan 22, 2026 •

edited

Loading