Skip to content

Conversation

@tzhukov
Copy link

@tzhukov tzhukov commented Jan 22, 2026

Description

  • Motivation:
    kagent-controller locks up when an MCPServer with bad configuration(ie MCP server is up but cannot communicate with the control plane) is applied. This blocks new agents passing from Not Accepted state into Ready/Not Ready

Error seen in kagent-controller:

"msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes requeuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler"
  • What changed:
    • Resolving reconcile function bug in MCPServerToolController that wasn`t allowing controller-runtime to hangle exponential backoff properly.
    • Corrected error handling where error was logged but not returned, this was making MCPServer not registered in the tool server db, but reconcile appeared successful. '
    • Added explicit validation in validateAgent to check that MCP server reference exists and has valid port configurations before accepting agent.
    • Added testing for this behavior.

Testing

To replicate the issue I applied this MCPServer + RemoteMCPServer manifest:

---
apiVersion: kagent.dev/v1alpha1
kind: MCPServer
metadata:
  name: example-mcp-server
  namespace: kagent
  labels:
    app: example-mcp-server
spec:
  deployment:
    image: alpine:latest
    port: 3001
    cmd: sh
    args:
      - "-c"
      - "while true; do sleep 3600; done"
  transportType: stdio
  stdioTransport: {}

---
apiVersion: kagent.dev/v1alpha2
kind: RemoteMCPServer
metadata:
  name: example-remote-mcp-server
  namespace: kagent
  labels:
    app: example-remote-mcp-server
spec:
  description: "Example remote MCP server"
  protocol: STREAMABLE_HTTP
  url: "http://example-mcp-server.kagent.svc.cluster.local:8081/mcp"
  timeout: 30s
  sseReadTimeout: 5m0s
  terminateOnClose: true

After this I tried creating a new agent and observed it being stuck in Not Accepted state and the error message mentioned in motivation. The same workflow after the PR change result in agents being accepted.

Instead we are appropriately see error logs about the new RemoteMCPServer:

"failed to upsert tool server for remote mcp server","remoteMCPServer":"kagent/example-remote-mcp-server","error":"failed to fetch tools for toolServer kagent /example-remote-mcp-server: failed to connect client for toolServer kagent/example-remote-mcp-server: calling \"initialize\": rejected by transport: sending \"initialize\":  Post \"http://example-mcp-server.kagent.svc.cluster.local: 8081/mcp\": dial tcp 10.96.47.128:8081: i/o timeout"

…`t allowing controller-runtime to hangle exponential backoff properly. Corrected error handling where error was logged but not returned, this was making MCPServer not registered in the tool server db, but reconcile appeared successful. Added explicit validation in validateAgent to check that MCP server reference exists and has valid port configurations before accepting agent.

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
…d removing a test we don`t need

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
@tzhukov tzhukov force-pushed the tzhukov/1097-kagent-controller-locks branch from 4dae1a6 to 6700aec Compare January 22, 2026 16:15
Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
@tzhukov tzhukov marked this pull request as ready for review January 22, 2026 19:24
Copilot AI review requested due to automatic review settings January 22, 2026 19:24
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical bug in the MCPServerToolController reconciliation logic that was causing the kagent-controller to lock up when MCPServers with invalid configurations were applied. The fix ensures proper error handling and backoff behavior in the controller-runtime framework.

Changes:

  • Fixed the Reconcile function to return zero Result when errors occur, allowing controller-runtime to handle exponential backoff properly
  • Corrected error handling in ReconcileKagentMCPServer to return errors instead of just logging them
  • Added explicit validation for MCPServer port configurations before accepting agents

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
go/internal/controller/mcp_server_tool_controller.go Fixed reconcile function to properly return errors with zero Result for controller-runtime backoff handling
go/internal/controller/reconciler/reconciler.go Changed error handling to return errors instead of just logging them, ensuring failed conversions trigger requeue
go/internal/controller/translator/agent/adk_api_translator.go Added validation logic to check MCPServer references exist and have valid port configurations before accepting agents
go/internal/controller/translator/agent/mcp_validation_test.go Added comprehensive test coverage for MCPServer validation including invalid ports, missing servers, and multiple tools
go/internal/controller/reconciler/mcp_server_reconciler_test.go Added reconciler tests to verify error propagation and proper handling of invalid MCPServer configurations

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

_, err = ConvertMCPServerToRemoteMCPServer(mcpServer)
if err != nil {
return fmt.Errorf("MCPServer %s/%s has invalid configuration: %w", agent.Namespace, tool.McpServer.Name, err)
}
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation switch statement only handles MCPServer types and silently ignores other valid types like RemoteMCPServer, Service, etc. This means that if an agent references a non-existent RemoteMCPServer or Service, the validation passes here but will fail later during translation. Consider adding validation for RemoteMCPServer and Service types to fail early with a clear error message, or add a default case that validates the existence of the referenced resource regardless of type.

Suggested change
}
}
case schema.GroupKind{
Group: "",
Kind: "Service",
}:
// Validate that the referenced Service exists
svc := &corev1.Service{}
err := a.kube.Get(ctx, types.NamespacedName{Namespace: agent.Namespace, Name: tool.McpServer.Name}, svc)
if err != nil {
return fmt.Errorf("failed to get Service %s/%s: %w", agent.Namespace, tool.McpServer.Name, err)
}
case schema.GroupKind{
Group: "kagent.dev",
Kind: "RemoteMCPServer",
}:
// Validate that the referenced RemoteMCPServer exists
remote := &v1alpha1.RemoteMCPServer{}
err := a.kube.Get(ctx, types.NamespacedName{Namespace: agent.Namespace, Name: tool.McpServer.Name}, remote)
if err != nil {
return fmt.Errorf("failed to get RemoteMCPServer %s/%s: %w", agent.Namespace, tool.McpServer.Name, err)
}

Copilot uses AI. Check for mistakes.
// TestMCPServerValidation_RemoteMCPServer tests that validation also works for RemoteMCPServer.
func TestMCPServerValidation_RemoteMCPServer(t *testing.T) {
ctx := context.Background()
scheme := schemev1.Scheme
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In TestMCPServerValidation_RemoteMCPServer, the RemoteMCPServer is added to the fake client but v1alpha1.AddToScheme is not called for the kmcp API types. However, RemoteMCPServer is part of v1alpha2, so this is actually fine. Consider adding a comment to clarify that RemoteMCPServer is in v1alpha2, not v1alpha1, to avoid confusion.

Suggested change
scheme := schemev1.Scheme
scheme := schemev1.Scheme
// RemoteMCPServer is defined in the kagent v1alpha2 API, so only v1alpha2.AddToScheme
// is required here; v1alpha1.AddToScheme for kmcp types is not needed in this test.

Copilot uses AI. Check for mistakes.
@tzhukov tzhukov changed the title Resolving reconcile function bug in MCPServerToolController that wasn… Fix for kagent controller locking up with bad mcp server config Jan 23, 2026
Copy link
Contributor

@EItanya EItanya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the one comment, otherwise this looks good :)

err = a.validateAgent(ctx, toolAgent, state.with(agent))
if err != nil {
return err
case v1alpha2.ToolProviderType_McpServer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part actually necessary? later on in translation we return an error for this exact scenario. iirc the point of this function is really just to pre-check recursion depth so that we don't have to do the entire translation. and fail towards the end, but honestly it's a bit confusing and maybe we should just check the recursion depth during the actual translation. Either way I don't think this actually changes anything

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's redundant here, we do it later again in translateMCPServerTarget. It does fail fast this way, but that doesn't really add any extra benefit, I'm going to remove it.

…s accordingly

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
@tzhukov tzhukov requested a review from EItanya January 23, 2026 15:46
… on bad object config

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
Comment on lines 404 to 405
if _, err := status.ReconcileMCPServerStatus(ctx, a.kube, mcpServer, nil); err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't do this here, we don't own this resource, it's owned by kmcp

remoteService, err := agent_translator.ConvertServiceToRemoteMCPServer(service)
if err != nil {
// Check if this is a validation error
var validationErr *agent_translator.ValidationError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in that other comment, we can't update the status of this resource in kagent at all because this is owned by kmcp. Also, I think this error checking/logic should happen in the controller, not the reconciler.

tzhukov and others added 2 commits January 23, 2026 15:25
…ests to test out this logic

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
Signed-off-by: Tim Zhukov <51675972+tzhukov@users.noreply.github.com>
func (r *ServiceController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
_ = log.FromContext(ctx)
return ctrl.Result{}, r.Reconciler.ReconcileKagentMCPService(ctx, req)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved validation/error handling logic into controller

Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants