Virtualize and Modularize DataSetLoader logic #593
base: main
- virtualize DataSetLoader
- separate and modularize HDF5, MFD loaders and consolidate helper methods
- organize dataset types into a package together
- update callers, minor cleanups
- use Optional instead of checked IOExceptions
- add javadoc
- organize dataset classes together
- remove bench2d and loader
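For readers skimming the commit list, the "virtualize and modularize" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: every type name here (`SketchDataSet`, `SketchLoader`, `SketchHdf5Loader`, `SketchMfdLoader`, `SketchDataSets`) is hypothetical, and the matching rules inside each loader are invented for the example rather than taken from the PR.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the project's DataSet type; it only carries a name.
final class SketchDataSet {
    private final String name;
    SketchDataSet(String name) { this.name = name; }
    String name() { return name; }
}

// One implementation per storage format. An empty Optional means
// "this loader does not know the dataset", not an error.
interface SketchLoader {
    Optional<SketchDataSet> loadDataSet(String name);
}

// Assumption for the sketch: HDF5 datasets are recognized by file extension.
final class SketchHdf5Loader implements SketchLoader {
    @Override
    public Optional<SketchDataSet> loadDataSet(String name) {
        return name.endsWith(".hdf5")
                ? Optional.of(new SketchDataSet(name))
                : Optional.empty();
    }
}

// Assumption for the sketch: the second loader knows a fixed list of names.
final class SketchMfdLoader implements SketchLoader {
    private static final List<String> KNOWN = List.of("example-mfd-a", "example-mfd-b");

    @Override
    public Optional<SketchDataSet> loadDataSet(String name) {
        return KNOWN.contains(name)
                ? Optional.of(new SketchDataSet(name))
                : Optional.empty();
    }
}

// Facade that callers use: try each loader in order, return the first hit.
final class SketchDataSets {
    private static final List<SketchLoader> LOADERS =
            List.of(new SketchHdf5Loader(), new SketchMfdLoader());

    private SketchDataSets() {}

    static Optional<SketchDataSet> loadDataSet(String name) {
        for (SketchLoader loader : LOADERS) {
            Optional<SketchDataSet> found = loader.loadDataSet(name);
            if (found.isPresent()) {
                return found;
            }
        }
        return Optional.empty();
    }
}
```

With this shape, adding a new format means adding one more `SketchLoader` implementation to the list; callers keep using the single facade method.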
  try {
-     DataSet ds = DataSetLoader.loadDataSet(datasetName);
+     DataSet ds = DataSets.loadDataSet(datasetName).orElseThrow(
+         () -> new IllegalStateException("Dataset " + datasetName + " not found")
Why IllegalStateException here? I believe all the other checks just use RuntimeException.
good catch, fixing
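A minimal sketch of the pattern this thread is about, assuming the fix is to throw a plain `RuntimeException` as the reviewer suggests. The class `OptionalLookupSketch`, its methods, and the in-memory catalog are hypothetical and exist only to make the example runnable; they are not the PR's code.

```java
import java.util.Map;
import java.util.Optional;

final class OptionalLookupSketch {
    // Hypothetical in-memory catalog standing in for real dataset files.
    private static final Map<String, Integer> DIMENSIONS = Map.of("example-small", 128);

    private OptionalLookupSketch() {}

    // A missing dataset is an ordinary outcome, so it is reported as an
    // empty Optional instead of a checked exception.
    static Optional<Integer> findDimension(String datasetName) {
        return Optional.ofNullable(DIMENSIONS.get(datasetName));
    }

    // Callers that cannot continue without the dataset convert absence
    // into an unchecked exception at the call site.
    static int requireDimension(String datasetName) {
        return findDimension(datasetName).orElseThrow(
                () -> new RuntimeException("Dataset " + datasetName + " not found"));
    }
}
```

The lookup itself stays free of `throws` clauses, and each caller decides whether absence is fatal.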
private static final String infraBucketName = "jvector-datasets-infratest";
private static final String fvecDir = "fvec";
private static final String bucketName = "astra-vector";
private static final List<String> bucketNames = List.of(bucketName, infraBucketName);
We should probably just move everything to the infratest bucket to avoid having to use multiple buckets here, although that's not a comment on this PR. This is fine, just follows the existing logic.
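The multi-bucket logic mentioned here presumably amounts to trying each bucket in order until one has the file. A rough sketch of that idea, with the two bucket names taken from the snippet above and everything else assumed: `BucketFallbackSketch`, `findBucket`, and the fake bucket contents are hypothetical, and a real loader would query the object store rather than an in-memory map.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class BucketFallbackSketch {
    // Bucket names as they appear in the diff; the order decides which is tried first.
    static final List<String> BUCKET_NAMES =
            List.of("astra-vector", "jvector-datasets-infratest");

    // Hypothetical contents used in place of real remote listings.
    static final Map<String, Set<String>> FAKE_CONTENTS = Map.of(
            "astra-vector", Set.of("fvec/example-a"),
            "jvector-datasets-infratest", Set.of("fvec/example-b"));

    private BucketFallbackSketch() {}

    // Returns the first bucket that holds the key, or empty if none does.
    static Optional<String> findBucket(String key) {
        for (String bucket : BUCKET_NAMES) {
            if (FAKE_CONTENTS.getOrDefault(bucket, Set.of()).contains(key)) {
                return Optional.of(bucket);
            }
        }
        return Optional.empty();
    }
}
```

Consolidating everything into one bucket, as suggested, would reduce this loop to a single lookup.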
...es/src/main/java/io/github/jbellis/jvector/example/benchmarks/datasets/DataSetLoaderMFD.java
MarkWolters left a comment
Left a few comments, but they are not necessarily things that need to be addressed in this PR. Overall looks good.
This is the next step-wise change to streamline dataset loading and usage. It does the following: