@Nasf-Fan
Contributor

This PR includes the following changes:

  1. When creating the CHK IV namespace, make the secondary group the same as the primary group. Otherwise, the CHK logic may hit DER_NONEXIST when communicating via IV.

  2. Integrate the CHK IV namespace create and destroy APIs, clean up the related logic, and redefine the version.

  3. Get the rank list and the IV namespace version from the CHK leader on rejoin. Adjust the CHK_REJOIN RPC accordingly.

  4. Remove the unsupported functionality for checking a specified 'phase'.

  5. Add a new test for the case where some engine(s) are lost before the checker starts.

  6. Use a dedicated ULT to handle dead rank events, so that this handling is not affected by checker start or stop. Then, even if the check scheduler has exited, a subsequent check query can still work against the latest rank list.
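
As a rough illustration of item 6 only, the sketch below models the idea in Python: a long-lived worker applies dead rank events to the rank list regardless of whether the scheduler is running, so a later query still sees the latest list. This is not the DAOS code. It uses an OS thread in place of a ULT, and `RankTracker`, `Scheduler`, and `query` are hypothetical names invented for this example.

```python
import queue
import threading


class RankTracker:
    """Toy model: a dedicated worker applies dead-rank events to the rank list."""

    def __init__(self, ranks):
        self._ranks = set(ranks)
        self._lock = threading.Lock()
        self._events = queue.Queue()
        # The worker is created once and is independent of any scheduler.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            rank = self._events.get()
            if rank is None:  # sentinel: shut the worker down
                self._events.task_done()
                return
            with self._lock:
                self._ranks.discard(rank)
            self._events.task_done()

    def report_dead(self, rank):
        """Enqueue a dead-rank event; safe to call at any time."""
        self._events.put(rank)

    def drain(self):
        """Block until every queued event has been applied."""
        self._events.join()

    def ranks(self):
        with self._lock:
            return sorted(self._ranks)

    def close(self):
        self._events.put(None)
        self._worker.join()


class Scheduler:
    """Stand-in for the check scheduler; it owns no rank-event handling."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


def query(tracker, scheduler):
    """A query reads the rank list from the tracker, not from the scheduler."""
    return {"scheduler_running": scheduler.running, "ranks": tracker.ranks()}
```

In this model, stopping the scheduler and then reporting a dead rank still updates what `query` returns, because the event path never goes through the scheduler.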

Test-tag: recovery

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'DAOS checker cannot completed on Aurora after some engines excluded'
Status is 'In Review'
Labels: 'scrubbed_2.6.5'
https://daosio.atlassian.net/browse/DAOS-17535

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17427/1/execution/node/1302/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17427/1/execution/node/1317/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7_t branch 2 times, most recently from 562010a to ed02f46 Compare January 24, 2026 03:57
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17427/3/execution/node/1283/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7_t branch 3 times, most recently from efa47b4 to a944ece Compare January 25, 2026 03:47
@daosbuild3
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17427/5/display/redirect

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7_t branch from a944ece to 3e235f3 Compare January 25, 2026 05:12
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17427/7/execution/node/1318/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17427/7/execution/node/1308/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7_t branch 2 times, most recently from e741d4e to f684f3f Compare January 25, 2026 17:14
@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17427/9/execution/node/1324/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17427/9/execution/node/1314/log

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-17535_7_t branch from f684f3f to ae6132f Compare January 26, 2026 05:45
@Nasf-Fan Nasf-Fan marked this pull request as ready for review January 26, 2026 13:58
@Nasf-Fan Nasf-Fan requested review from a team as code owners January 26, 2026 13:58
Comment on lines +43 to +45
pool:
  scm_size: 6G
  nvme_size: 80G

Do we need to specify these individually or could we just specify size: and let DAOS use the default ratio?
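
For reference, the single-size alternative might look like the fragment below. This is only a sketch: it assumes the test harness accepts a `size` key under `pool:` in the same way it accepts `scm_size` and `nvme_size`, and the value `86G` is a hypothetical total (simply the sum of the two sizes above), not a number taken from this PR. How that total is split between tiers would then depend on the default ratio.

```yaml
pool:
  # Hypothetical total; the SCM/NVMe split is left to the default ratio.
  size: 86G
```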
