NUTCH-2455 Use secondary sorting for memory-efficient HostDb integration in Generator #888
This PR is proposed as a fix for NUTCH-2455 and is also intended to supersede #254.
In essence, this PR implements scalable HostDb integration in the Generator using MapReduce secondary sorting, eliminating the need to load the entire HostDb into memory.
Problem
The previous implementation loaded the entire HostDb into memory at reducer startup. For crawls with millions of hosts, this caused:
Solution
Use MapReduce secondary sorting to stream HostDb entries through the pipeline:
FloatTextPair: Combines score and hostname to enable sorting
ScoreHostKeyComparator: Ensures HostDb entries arrive before CrawlDb entries
Key Components
FloatTextPair
ScoreHostKeyComparator
Sorting order:
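The exact order is defined by ScoreHostKeyComparator in the PR. As a minimal sketch of one ordering that would give the stated guarantee, the plain-Java example below (no Hadoop types) sorts keys by host and then by descending score, and tags HostDb records with a sentinel score so they lead each host's run. The class `ScoreHost`, the constant `HOSTDB_MARKER`, and the host-then-descending-score order are assumptions made for illustration; the real FloatTextPair and ScoreHostKeyComparator may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SecondarySortOrderSketch {
    // Hypothetical sentinel score that tags a HostDb record (an assumption, not taken from the PR).
    static final float HOSTDB_MARKER = Float.MAX_VALUE;

    // Stand-in for a (score, host) composite key such as FloatTextPair.
    static final class ScoreHost {
        final float score;
        final String host;

        ScoreHost(float score, String host) {
            this.score = score;
            this.host = host;
        }
    }

    // Assumed order: group by host, then highest score first, so the
    // sentinel-tagged HostDb key precedes every CrawlDb key of the same host.
    static final Comparator<ScoreHost> ORDER = (a, b) -> {
        int byHost = a.host.compareTo(b.host);
        return byHost != 0 ? byHost : Float.compare(b.score, a.score);
    };

    // Sorts a copy of the keys and renders each one as "host:hostdb" or "host:score".
    static List<String> sortedLabels(List<ScoreHost> keys) {
        List<ScoreHost> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        List<String> labels = new ArrayList<>();
        for (ScoreHost k : copy) {
            labels.add(k.host + (k.score == HOSTDB_MARKER ? ":hostdb" : ":" + k.score));
        }
        return labels;
    }

    public static void main(String[] args) {
        List<ScoreHost> keys = List.of(
            new ScoreHost(0.5f, "b.example"),
            new ScoreHost(2.0f, "a.example"),
            new ScoreHost(HOSTDB_MARKER, "b.example"),
            new ScoreHost(HOSTDB_MARKER, "a.example"),
            new ScoreHost(3.0f, "a.example"));
        System.out.println(sortedLabels(keys));
    }
}
```

In a real MapReduce job the comparator would be registered as the job's sort comparator, with partitioning by host so that a host's HostDb record and its CrawlDb records reach the same reducer.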
HostDbReaderMapper
Reads HostDb and emits with special key to ensure sorting before CrawlDb entries:
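As a rough illustration of why this ordering removes the need for an in-memory HostDb, the plain-Java sketch below tags HostDb records on the "map" side and then consumes the sorted stream while holding state for only one host at a time. Everything named here (`Entry`, `HOSTDB_MARKER`, `tagHostDb`, `crawlDb`, `select`, and the per-host URL limit carried as the payload) is hypothetical and not taken from the PR; the actual HostDbReaderMapper and the Generator's reduce side may work differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class HostDbStreamingSketch {
    // Hypothetical sentinel score marking a HostDb record (an assumption).
    static final float HOSTDB_MARKER = Float.MAX_VALUE;

    static final class Entry {
        final float score;
        final String host;
        final String payload; // URL for CrawlDb entries, per-host limit for HostDb entries

        Entry(float score, String host, String payload) {
            this.score = score;
            this.host = host;
            this.payload = payload;
        }
    }

    // Assumed sort: by host, then descending score, so the HostDb entry comes first.
    static final Comparator<Entry> ORDER = (a, b) -> {
        int byHost = a.host.compareTo(b.host);
        return byHost != 0 ? byHost : Float.compare(b.score, a.score);
    };

    // "Map side" for HostDb: emit the host's limit under the sentinel score.
    static Entry tagHostDb(String host, int maxCount) {
        return new Entry(HOSTDB_MARKER, host, Integer.toString(maxCount));
    }

    // "Map side" for CrawlDb: emit the URL under its real score.
    static Entry crawlDb(String url, String host, float score) {
        return new Entry(score, host, url);
    }

    // "Reduce side": walks the sorted stream, keeping only the current host's limit.
    static List<String> select(List<Entry> sorted, int defaultMax) {
        List<String> selected = new ArrayList<>();
        String currentHost = null;
        int limit = defaultMax;
        int taken = 0;
        for (Entry e : sorted) {
            if (!e.host.equals(currentHost)) { // new host: drop the previous host's state
                currentHost = e.host;
                limit = defaultMax;
                taken = 0;
            }
            if (e.score == HOSTDB_MARKER) {
                limit = Integer.parseInt(e.payload); // HostDb entry arrives before any URL
            } else if (taken < limit) {
                selected.add(e.payload);
                taken++;
            }
        }
        return selected;
    }

    static List<String> demo() {
        List<Entry> entries = new ArrayList<>();
        entries.add(crawlDb("http://b.example/1", "b.example", 0.5f));
        entries.add(crawlDb("http://a.example/2", "a.example", 2.0f));
        entries.add(tagHostDb("a.example", 1));
        entries.add(crawlDb("http://b.example/2", "b.example", 0.9f));
        entries.add(crawlDb("http://a.example/1", "a.example", 3.0f));
        entries.add(crawlDb("http://b.example/3", "b.example", 0.1f));
        entries.sort(ORDER); // stands in for the shuffle sort
        return select(entries, 2);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Here `a.example` has a HostDb entry limiting it to one URL, while `b.example` has none and falls back to the default of two, so memory use depends on a single host's metadata rather than on the size of the HostDb.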
Configuration
generate.hostdb
generate.max.count.expr
generate.fetch.delay.expr
Example JEXL Expressions
Performance
Backward Compatibility
If generate.hostdb is not set, behavior is unchanged
Testing