@jshook commented Dec 11, 2025

This is the next step-wise change to streamline dataset loading and usage. It does the following:

  • Virtualizes DataSetLoader and adds a central DataSets access point to simplify loading.
    • Uses Optional instead of checked IOExceptions.
  • Separates and modularizes the HDF5 and MFD loaders, and consolidates helper methods.
  • Organizes dataset types together in one package.
  • Updates callers, with minor cleanups.
  • Removes bench2d and its loader.

- virtualize DataSetLoader
- separate and modularize HDF5, MFD loaders and consolidate helper methods
- organize dataset types into a package together
- update callers, minor cleanups
- use Optional instead of checked IOExceptions
- add javadoc
- organize dataset classes together
- remove bench2d and loader
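
A minimal sketch of the pattern described above, for readers who have not opened the diff. The names `SketchDataSet`, `SketchLoader`, and `SketchDataSets` are hypothetical stand-ins, not the PR's actual classes, and the "ask each loader in order" strategy is an assumption about how a central access point might delegate to pluggable loaders.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the real DataSet type, which is not shown in this thread.
final class SketchDataSet {
    final String name;

    SketchDataSet(String name) {
        this.name = name;
    }
}

// Hypothetical loader abstraction: an empty Optional means
// "this loader does not know that dataset", rather than an error.
interface SketchLoader {
    Optional<SketchDataSet> loadDataSet(String name);
}

// Hypothetical central access point that delegates to a list of loaders.
final class SketchDataSets {
    private final List<SketchLoader> loaders;

    SketchDataSets(List<SketchLoader> loaders) {
        this.loaders = List.copyOf(loaders);
    }

    // Ask each loader in order and return the first hit, or empty if none has it.
    Optional<SketchDataSet> loadDataSet(String name) {
        for (SketchLoader loader : loaders) {
            Optional<SketchDataSet> found = loader.loadDataSet(name);
            if (found.isPresent()) {
                return found;
            }
        }
        return Optional.empty();
    }
}
```

With this shape, callers decide how to handle a missing dataset (for example with `orElseThrow`) instead of catching a checked IOException at every call site.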

github-actions bot commented Dec 11, 2025

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

  try {
-     DataSet ds = DataSetLoader.loadDataSet(datasetName);
+     DataSet ds = DataSets.loadDataSet(datasetName).orElseThrow(
+         () -> new IllegalStateException("Dataset " + datasetName + " not found")
Contributor

Why IllegalStateException here? I believe all the other checks just use RuntimeException

Contributor Author

good catch, fixing
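
For reference, one possible shape of that fix, assuming it only swaps the exception type to match the other checks. The follow-up commit is not shown in this thread, and `MissingDataSetSketch` with its always-empty `loadDataSet` is a hypothetical stand-in for `DataSets.loadDataSet`.

```java
import java.util.Optional;

final class MissingDataSetSketch {
    // Hypothetical lookup standing in for DataSets.loadDataSet; always empty here
    // so that the failure path can be exercised.
    static Optional<String> loadDataSet(String datasetName) {
        return Optional.empty();
    }

    // Unwrap the Optional, throwing a plain RuntimeException when the dataset is absent.
    static String require(String datasetName) {
        return loadDataSet(datasetName).orElseThrow(
                () -> new RuntimeException("Dataset " + datasetName + " not found"));
    }
}
```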

private static final String infraBucketName = "jvector-datasets-infratest";
private static final String fvecDir = "fvec";
private static final String bucketName = "astra-vector";
private static final List<String> bucketNames = List.of(bucketName, infraBucketName);
Contributor

We should probably just move everything to the infratest bucket to avoid having to use multiple buckets here, although that's not a comment on this PR. This is fine, just follows the existing logic.
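
To make the multi-bucket behaviour concrete, here is a rough sketch that assumes the buckets are tried in list order until one has the file. That ordering, the `findInAnyBucket` helper, and the `fetch` callback are all hypothetical; the actual download code is not shown in this thread.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

final class BucketFallbackSketch {
    private static final String infraBucketName = "jvector-datasets-infratest";
    private static final String bucketName = "astra-vector";
    private static final List<String> bucketNames = List.of(bucketName, infraBucketName);

    // fetch is a hypothetical (bucket, key) -> location lookup supplied by the caller.
    // Buckets are tried in list order (an assumption) and the first hit wins.
    static Optional<String> findInAnyBucket(
            String key, BiFunction<String, String, Optional<String>> fetch) {
        for (String bucket : bucketNames) {
            Optional<String> hit = fetch.apply(bucket, key);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }
}
```

Consolidating everything into a single bucket, as suggested above, would reduce this loop to one lookup.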

@MarkWolters left a comment

Left a few comments, but they are not necessarily things that need to be addressed in this PR. Overall, looks good.
