Skip to content

Conversation

@tanabarr
Copy link
Contributor

@tanabarr tanabarr commented Jan 8, 2026

Set NVMe power management values for SSDs by setting the new
engine DAOS_NVME_POWER_MGMT environment variable to an integer
(sets register bits 0-4). Value will be applied by SPDK on devices
attached to an engine process. The value will not be reset on engine
exit.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr self-assigned this Jan 8, 2026
@tanabarr tanabarr requested review from a team as code owners January 8, 2026 12:25
@github-actions
Copy link

github-actions bot commented Jan 8, 2026

Ticket title is 'NVMe power control feature'
Status is 'Awaiting Verification'
https://daosio.atlassian.net/browse/DAOS-18431

@Michael-Hennecke Michael-Hennecke self-requested a review January 8, 2026 12:51
@daosbuild3
Copy link
Collaborator

@tanabarr tanabarr force-pushed the tanabarr/bio-powmanage-patch branch from 5e0062d to 0be43f7 Compare January 8, 2026 17:26
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested review from NiuYawei and kjacque January 8, 2026 17:35
@tanabarr tanabarr force-pushed the tanabarr/bio-powmanage-patch branch from 0be43f7 to 8b31c49 Compare January 8, 2026 17:38
Copy link
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks reasonable. Just trying to understand the right way to interact with the spdk stuff.

If we're letting users configure this generally, it might be good to add an explicit parameter to the config file with more intuitively named values, based on whatever the 0-4 stand for. Not necessary in this PR, but something to think about.

memset(&cmd, 0, sizeof(cmd));
cmd.opc = SPDK_NVME_OPC_SET_FEATURES;
cmd.cdw10 = SPDK_NVME_FEAT_POWER_MANAGEMENT;
cmd.cdw11 = bio_spdk_power_mgmt_val;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

union spdk_nvme_cmd_cdw11 has a member union spdk_nvme_feat_power_management feat_power_management that has multiple fields, not just the power state. So I think we should be using that to make sure we're setting the value correctly.

IMO we should also check the value taken from the env variable is in an acceptable range, preferably when we first ingest it. I had some trouble finding definitions for the power state values. Does SPDK define those somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I wait for @Michael-Hennecke to verify the PR. this is more of an enablement change that will unblock some testing. Once verified we can look at tidying things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we at least need to check the value. I was wondering about what would happen if we set the env variable to some value larger than 5 bits, potentially overflowing into the other bit field ("workload hint"). Here's the struct definition I found: https://spdk.io/doc/unionspdk__nvme__feat__power__management.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Michael-Hennecke thoughts on acceptable values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restricting the value to <= 0x1F to avoid overflowing the workload hint bits may prevent some experimentation capability. As an environment variable capability I would like to leave this unfiltered for the moment with the understanding that it should be used by someone who understands the values to be set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there someone ready to perform that experiment and see what happens? I would prefer to know what the worst case scenario consequences are before signing off, but if someone can verify it's harmless to toy with these fields with strange/invalid input ASAP, I'm fine with landing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Michael-Hennecke can you please comment with regard to @kjacque 's concerns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

@tanabarr tanabarr requested a review from kjacque January 12, 2026 12:50
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
bb->bb_dev_health.bdh_io_channel = channel;

/* Set NVMe power management */
rc = bio_set_power_mgmt(bb->bb_dev, channel);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't want this being skipped when health monitor is disabled? (see bypass_health_collect() at the beginning of this function).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it okay to move this function and the channel fetch to before the bypass check?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this function uses the io_channel created by bio_init_health_monitoring().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

cmd.cdw10 = SPDK_NVME_FEAT_POWER_MANAGEMENT;
cmd.cdw11 = bio_spdk_power_mgmt_val;

rc = spdk_bdev_nvme_admin_passthru(d_bdev->bb_desc, channel, &cmd, NULL, 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses bdh_io_channel but doesn't hold reference count, that could cause issue when bio_dev_health being finalized before the completion. See bio_fini_health_monitoring().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decoupled implementation from health monitoring, is it better now?

@tanabarr tanabarr requested a review from NiuYawei January 16, 2026 15:44
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/bio-powmanage-patch branch from cde98e1 to 2b38adf Compare January 16, 2026 15:47
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17355/7/execution/node/1334/log

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daosbuild3
Copy link
Collaborator

struct spdk_nvme_cmd cmd;
int rc = 0;

D_INFO("Power management 1\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these "Power management x" info messages in this function should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

int
bio_set_power_mgmt(const char *bdev_name, struct spdk_bdev *bdev)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks redundant to pass in both "bdev_name" & "bdev".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

if (!spdk_bdev_io_type_supported(bdev, SPDK_BDEV_IO_TYPE_NVME_ADMIN)) {
D_ERROR("Bdev NVMe admin passthru not supported for %s\n", bdev_name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to debug

}

/* Writable descriptor required for applying power management settings */
rc = spdk_bdev_open_ext(bdev_name, true, bdev_event_cb, NULL, &pm_ctx.bdev_desc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the bio_bdev_event_cb() should be used as the event callback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


D_INFO("Power management 3\n");
rc = spdk_bdev_nvme_admin_passthru(pm_ctx.bdev_desc, pm_ctx.bdev_io_channel, &cmd, NULL, 0,
set_power_mgmt_completion, &pm_ctx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bio_xsctxt_alloc() isn't completed at this moment, and NVMe polling ULT isn't created yet, you need to explicitly poll completion by your self (see xs_poll_completion())

So you don't need ABT_eventual for such self polling. You could refer to common_fini_cb() & xs_poll_completion().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


spdk_bdev_free_io(bdev_io);
spdk_put_io_channel(pm_ctx->bdev_io_channel);
spdk_bdev_close(pm_ctx->bdev_desc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not callback driven model anymore. The io_channel & dev_desc could be put/closed by caller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr
Copy link
Contributor Author

tanabarr commented Jan 24, 2026

@NiuYawei I tried to address your comments, I'm now getting SPDK_NVME_SC_INVALID_FIELD if I try to set value to 0x01, fine if setting to 0x00 but I think this maybe to do with the SSD not supporting changes. @Michael-Hennecke please could you verify on your side?
Logging example:

2026/01/25 23:05:46.136270 edaos-15 DAOS[241271/0/3] bio DBUG src/bio/bio_device.c:1166 bio_set_power_mgmt() Setting power state 0 on device Nvme_edaos-15.daos.hpc.amslabs.hpecorp.net_0_1_7n1

2026/01/25 23:05:46.136306 edaos-15 DAOS[241271/0/3] bio INFO src/bio/bio_device.c:1095 set_power_mgmt_completion() Power management value set on device Nvme_edaos-15.daos.hpc.amslabs.hpecorp.net_0_1_7n1

@daosbuild3
Copy link
Collaborator

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17355/9/testReport/

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested a review from NiuYawei January 25, 2026 23:09
…daos into tanabarr/bio-powmanage-patch

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
}

struct power_mgmt_context_t {
ABT_eventual eventual;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't useful, can be removed.

D_DEBUG(DB_MGMT, "Bdev NVMe admin passthru not supported for %s\n", bdev_name);
rc = -DER_NOTSUPPORTED;
goto out_desc;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'bdev' can be acquired by spdk_bdev_get_by_name(), these checking should be done before spdk_bdev_open_ext()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants