-
Notifications
You must be signed in to change notification settings - Fork 389
Fix for kagent controller locking up with bad mcp server config #1232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix for kagent controller locking up with bad mcp server config #1232
Conversation
…`t allowing controller-runtime to hangle exponential backoff properly. Corrected error handling where error was logged but not returned, this was making MCPServer not registered in the tool server db, but reconcile appeared successful. Added explicit validation in validateAgent to check that MCP server reference exists and has valid port configurations before accepting agent. Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
…d removing a test we don`t need Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
4dae1a6 to
6700aec
Compare
Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical bug in the MCPServerToolController reconciliation logic that was causing the kagent-controller to lock up when MCPServers with invalid configurations were applied. The fix ensures proper error handling and backoff behavior in the controller-runtime framework.
Changes:
- Fixed the Reconcile function to return zero Result when errors occur, allowing controller-runtime to handle exponential backoff properly
- Corrected error handling in ReconcileKagentMCPServer to return errors instead of just logging them
- Added explicit validation for MCPServer port configurations before accepting agents
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| go/internal/controller/mcp_server_tool_controller.go | Fixed reconcile function to properly return errors with zero Result for controller-runtime backoff handling |
| go/internal/controller/reconciler/reconciler.go | Changed error handling to return errors instead of just logging them, ensuring failed conversions trigger requeue |
| go/internal/controller/translator/agent/adk_api_translator.go | Added validation logic to check MCPServer references exist and have valid port configurations before accepting agents |
| go/internal/controller/translator/agent/mcp_validation_test.go | Added comprehensive test coverage for MCPServer validation including invalid ports, missing servers, and multiple tools |
| go/internal/controller/reconciler/mcp_server_reconciler_test.go | Added reconciler tests to verify error propagation and proper handling of invalid MCPServer configurations |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _, err = ConvertMCPServerToRemoteMCPServer(mcpServer) | ||
| if err != nil { | ||
| return fmt.Errorf("MCPServer %s/%s has invalid configuration: %w", agent.Namespace, tool.McpServer.Name, err) | ||
| } |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validation switch statement only handles MCPServer types and silently ignores other valid types like RemoteMCPServer, Service, etc. This means that if an agent references a non-existent RemoteMCPServer or Service, the validation passes here but will fail later during translation. Consider adding validation for RemoteMCPServer and Service types to fail early with a clear error message, or add a default case that validates the existence of the referenced resource regardless of type.
| } | |
| } | |
| case schema.GroupKind{ | |
| Group: "", | |
| Kind: "Service", | |
| }: | |
| // Validate that the referenced Service exists | |
| svc := &corev1.Service{} | |
| err := a.kube.Get(ctx, types.NamespacedName{Namespace: agent.Namespace, Name: tool.McpServer.Name}, svc) | |
| if err != nil { | |
| return fmt.Errorf("failed to get Service %s/%s: %w", agent.Namespace, tool.McpServer.Name, err) | |
| } | |
| case schema.GroupKind{ | |
| Group: "kagent.dev", | |
| Kind: "RemoteMCPServer", | |
| }: | |
| // Validate that the referenced RemoteMCPServer exists | |
| remote := &v1alpha1.RemoteMCPServer{} | |
| err := a.kube.Get(ctx, types.NamespacedName{Namespace: agent.Namespace, Name: tool.McpServer.Name}, remote) | |
| if err != nil { | |
| return fmt.Errorf("failed to get RemoteMCPServer %s/%s: %w", agent.Namespace, tool.McpServer.Name, err) | |
| } |
| // TestMCPServerValidation_RemoteMCPServer tests that validation also works for RemoteMCPServer. | ||
| func TestMCPServerValidation_RemoteMCPServer(t *testing.T) { | ||
| ctx := context.Background() | ||
| scheme := schemev1.Scheme |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In TestMCPServerValidation_RemoteMCPServer, the RemoteMCPServer is added to the fake client but v1alpha1.AddToScheme is not called for the kmcp API types. However, RemoteMCPServer is part of v1alpha2, so this is actually fine. Consider adding a comment to clarify that RemoteMCPServer is in v1alpha2, not v1alpha1, to avoid confusion.
| scheme := schemev1.Scheme | |
| scheme := schemev1.Scheme | |
| // RemoteMCPServer is defined in the kagent v1alpha2 API, so only v1alpha2.AddToScheme | |
| // is required here; v1alpha1.AddToScheme for kmcp types is not needed in this test. |
EItanya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the one comment, otherwise this looks good :)
| err = a.validateAgent(ctx, toolAgent, state.with(agent)) | ||
| if err != nil { | ||
| return err | ||
| case v1alpha2.ToolProviderType_McpServer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this part actually necessary? later on in translation we return an error for this exact scenario. iirc the point of this function is really just to pre-check recursion depth so that we don't have to do the entire translation. and fail towards the end, but honestly it's a bit confusing and maybe we should just check the recursion depth during the actual translation. Either way I don't think this actually changes anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's redundant here, we do it later again in translateMCPServerTarget. It does fail fast this way, but that doesn't really add any extra benefit, I'm going to remove it.
…s accordingly Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
… on bad object config Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
| if _, err := status.ReconcileMCPServerStatus(ctx, a.kube, mcpServer, nil); err != nil { | ||
| return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't do this here, we don't own this resource, it's owned by kmcp
| remoteService, err := agent_translator.ConvertServiceToRemoteMCPServer(service) | ||
| if err != nil { | ||
| // Check if this is a validation error | ||
| var validationErr *agent_translator.ValidationError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in that other comment, we can't update the status of this resource in kagent at all because this is owned by kmcp. Also, I think this error checking/logic should happen in the controller, not the reconciler.
…ests to test out this logic Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
Signed-off-by: Tim Zhukov <51675972+tzhukov@users.noreply.github.com>
| func (r *ServiceController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { | ||
| _ = log.FromContext(ctx) | ||
| return ctrl.Result{}, r.Reconciler.ReconcileKagentMCPService(ctx, req) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved validation/error handling logic into controller
Signed-off-by: Timofey Zhukov-Khovanskiy <tzhukov87@gmail.com>
Description
kagent-controller locks up when an MCPServer with bad configuration(ie MCP server is up but cannot communicate with the control plane) is applied. This blocks new agents passing from
Not Acceptedstate intoReady/Not ReadyError seen in kagent-controller:
Testing
To replicate the issue I applied this MCPServer + RemoteMCPServer manifest:
After this I tried creating a new agent and observed it being stuck in
Not Acceptedstate and the error message mentioned in motivation. The same workflow after the PR change result in agents being accepted.Instead we are appropriately see error logs about the new RemoteMCPServer: