
Conversation


@challenger71498 challenger71498 commented Dec 26, 2025

This PR fixes a failure in Session cleanup caused by task cancellation via __aexit__ not being handled properly.

Motivation and Context

Session cleans up the receive streams in the finally clause of BaseSession::_receive_loop.

  • When the session is closed, or the async context is left (which calls __aexit__), it cancels the TaskGroup.
  • When a TaskGroup is cancelled, it cancels every child task.
  • The awaits used in the finally clause, such as MemoryObjectStream::send, run inside the coroutine _receive_loop, whose task belongs to the very TaskGroup being cancelled.
  • Therefore the cleanup itself is cancelled during close, leaving the receive streams unclosed and causing a stream leak.

I fixed this problem by adding a CancelScope with shield=True around the finally clause, which ensures the cleanup logic executes even when TaskGroup cancellation has been requested.

Also, while cleaning up the receive streams we iterate over them, so the stream dictionary must not change during the loop.

  • Mutating a dictionary while iterating over it raises a RuntimeError, which must not happen in production.
  • BaseSession as-is does not guard the dictionary against mutation during cleanup.
  • To handle this minor issue, I added a simple flag signalling that the session is cleaning up its resources.
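A minimal sketch of the flag idea (the names Session, remove_stream and FakeStream here are illustrative, not the SDK's):

```python
# Hypothetical sketch of the flag-based guard described above.
class FakeStream:
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class Session:
    def __init__(self) -> None:
        self._response_streams: dict[int, FakeStream] = {}
        self._closing = False

    def remove_stream(self, request_id: int) -> None:
        # While cleanup iterates the dict, removal is a no-op, so the
        # for-loop in close() never sees the dict change size.
        if self._closing:
            return
        self._response_streams.pop(request_id, None)

    def close(self) -> None:
        self._closing = True
        for stream in self._response_streams.values():
            stream.close()
        self._response_streams.clear()
```

Any concurrent remove_stream call during close() simply returns, and the dict is cleared in one place at the end.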

How Has This Been Tested?

I found this issue while debugging a problem where using ADK AgentTool with MCP causes an obscure error:

RuntimeError: Attempted to exit cancel scope in a different task than it was entered in

After a week of debugging, I realised that the complete fix for this problem is quite large and requires changes to both the MCP SDK and Google ADK; this fix is one part of it.

I added a simple test which triggers this issue.

  • In this test:
    • The server never responds.
    • We create a ClientSession and send 2 ping requests to the server asynchronously.
    • While the 2 tasks are still waiting (since the server never responds), we leave the async with block.
    • The ClientSession is closed.
    • Assert that the total count of response streams is 0, not 2.
    • Assert that the closed ClientSession fails to send a ping.
  • BaseSession as-is fails this test: the response stream count is still 2, since the cleanup logic is cancelled.

Breaking Changes

  • No breaking changes, since this is just a minor bugfix.
  • There may be a side effect: since the finally block now runs until every stream is closed, __aexit__ blocks until the finally block finishes.
    • If the finally block runs for too long, the user might think the session is hanging.
    • I don't expect this to be a real problem: the streams run in memory, not over an actual network, and the number of streams won't be huge (I'd expect ~1000 at most), so it should be fine.
    • If it does become a problem, we could run the closing logic in parallel. (I'd rather not make that large a change in this PR, though.)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed (no docs needed updating)

Additional context

This PR may resolve the issues below:

I think this problem is related to task leaks (or indefinite waiting) in Google ADK, since in ADK a tool call awaits forever until the receive stream responds.

The Session must close _response_streams properly when exiting the async with scope. The Session as-is fails the test, since the cleanup logic is forcefully cancelled.
Cleanup logic must not be cancellable via a CancelScope; therefore the shield=True parameter is required.
Since we use a for-loop over the _response_streams dict during cleanup, the dict must not change while cleanup runs. We can handle this easily with a simple flag.
copybara-service bot pushed a commit to google/adk-python that referenced this pull request Jan 15, 2026
…ager

Merge #4025

**Please ensure you have read the [contribution guide](https://github.com/google/adk-python/blob/main/CONTRIBUTING.md) before creating a pull request.**

### Link to Issue or Description of Change

**1. Link to an existing issue (if applicable):**

- Closes:
  - #3950
  - #3731
  - #3708

**2. Or, if no issue exists, describe the change:**

**Problem:**
- `ClientSession` of https://github.com/modelcontextprotocol/python-sdk uses AnyIO for async task management.
- An AnyIO TaskGroup must be entered and exited in the same task.
- Since `McpSessionManager` does not create a task per client, a client might be closed by a different task, causing the error: `Attempted to exit cancel scope in a different task than it was entered in`.

**Solution:**

I suggest 2 changes:

Handling the `ClientSession` in a single task
- To open and close a `ClientSession` in the same task, we need to wrap the whole lifecycle of the `ClientSession` in a single task.
- `SessionContext` wraps the initialization and disposal of the `ClientSession` in a single task, ensuring that the `ClientSession` is handled only by a dedicated task.

Add a timeout for `ClientSession`
- Since we now use a task per `ClientSession`, tasks should never leak.
- But `McpSessionManager` does not pass a timeout directly to `ClientSession` when the type is not STDIO.
  - There is only a timeout for the `httpx` client when the MCP type is SSE or StreamableHTTP.
  - But that timeout applies only to the `httpx` client, so if there is an issue in the MCP client itself (e.g. modelcontextprotocol/python-sdk#262), a tool call waits for the result **FOREVER**!
- To overcome this, I propagated `sse_read_timeout` to `ClientSession`.
  - `timeout` is too short for a tool call, since its default value is only 5 s.
  - `sse_read_timeout` was originally meant as the SSE read timeout (default 5 min, i.e. 300 s), but most server-side SSE implementations (e.g. FastAPI) send a ping periodically (about every 15 s, I assume), so under normal circumstances this timeout is rarely hit.
  - If the server does not send pings, this timeout effectively equals a tool-call timeout. It is therefore appropriate to use `sse_read_timeout` as the tool-call timeout.
  - Most tool calls should finish within 5 minutes, and the SSE timeout is adjustable if not.
- If this change is not acceptable, we could add a dedicated parameter for the tool-call timeout (e.g. `tool_call_timeout`).

### Testing Plan
- Although this does not change the interface itself, it changes the session management logic, so some existing tests are no longer valid.
  - I updated those tests, especially the ones that validate session state (e.g. checking whether `initialize()` was called).
  - Since the session is now encapsulated in `SessionContext`, we cannot validate the session's initialized state in `TestMcpSessionManager`; it is validated in `TestSessionContext` instead.
- Added a simple test reproducing the issue (`test_create_and_close_session_in_different_tasks`).
- Also added a test for the new component, `SessionContext`.

**Unit Tests:**

- [x] I have added or updated unit tests for my change.
- [x] All unit tests pass locally.

```plaintext
=================================================================================== 3689 passed, 1 skipped, 2205 warnings in 63.39s (0:01:03) ===================================================================================
```

**Manual End-to-End (E2E) Tests:**

_Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix._

### Checklist

- [x] I have read the [CONTRIBUTING.md](https://github.com/google/adk-python/blob/main/CONTRIBUTING.md) document.
- [x] I have performed a self-review of my own code.
- [x] I have commented my code, particularly in hard-to-understand areas.
- [x] I have added tests that prove my fix is effective or that my feature works.
- [x] New and existing unit tests pass locally with my changes.
- [x] I have manually tested my changes end-to-end.
- [ ] ~~Any dependent changes have been merged and published in downstream modules.~~ `no deps have been changed`

### Additional context
This PR is related to modelcontextprotocol/python-sdk#1817 since it also fixes endless tool call awaiting.

Co-authored-by: Kathy Wu <wukathy@google.com>
COPYBARA_INTEGRATE_REVIEW=#4025 from challenger71498:feat/task-based-mcp-session-manager f7f7cd0
PiperOrigin-RevId: 856438147

Labels

bug (Something isn't working), P1 (Significant bug affecting many users, highly requested feature)
