lewismc commented Jan 13, 2026

This PR proposes a fix for NUTCH-2455 and supersedes #254.

It implements scalable HostDb integration in the Generator using MapReduce secondary sorting, which removes the need to load the entire HostDb into memory.

Problem

The previous implementation loaded the entire HostDb into memory at reducer startup. For crawls with millions of hosts, this caused:

  • High memory consumption (O(HostDb size) per reducer)
  • OutOfMemoryError for large HostDbs
  • Startup latency while loading data

Solution

Use MapReduce secondary sorting to stream HostDb entries through the pipeline:

  1. Composite Key (FloatTextPair): Combines score and hostname to enable sorting
  2. Custom Comparator (ScoreHostKeyComparator): Ensures HostDb entries arrive before CrawlDb entries
  3. MultipleInputs: Reads both HostDb and CrawlDb in a single MapReduce job
  4. Streaming Reducer: Processes HostDb entries as they arrive, with no preloading required
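
The four steps above can be pictured with a small plain-Java simulation. This is only a sketch under stated assumptions, not the code in this PR: `StreamingSelectSketch`, `Entry`, and `select` are hypothetical names, no Hadoop classes are used, and the input list is assumed to be already sorted so that HostDb records precede CrawlDb records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StreamingSelectSketch {

    /** Hypothetical stand-in for one record reaching the reducer. */
    static final class Entry {
        final boolean fromHostDb;
        final String host;
        final String url;     // null for HostDb records
        final int hostLimit;  // only meaningful for HostDb records

        private Entry(boolean fromHostDb, String host, String url, int hostLimit) {
            this.fromHostDb = fromHostDb;
            this.host = host;
            this.url = url;
            this.hostLimit = hostLimit;
        }

        static Entry hostDb(String host, int limit) {
            return new Entry(true, host, null, limit);
        }

        static Entry crawlDb(String host, String url) {
            return new Entry(false, host, url, 0);
        }
    }

    /**
     * Walks the sorted input once. HostDb records only update the per-host
     * limit table; CrawlDb records are selected until their host's limit
     * (or the default limit for hosts without a HostDb record) is reached.
     */
    static List<String> select(List<Entry> sortedInput, int defaultLimit) {
        Map<String, Integer> limits = new HashMap<>();
        Map<String, Integer> counts = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (Entry e : sortedInput) {
            if (e.fromHostDb) {
                limits.put(e.host, e.hostLimit);
                continue;
            }
            int limit = limits.getOrDefault(e.host, defaultLimit);
            int seen = counts.merge(e.host, 1, Integer::sum);
            if (seen <= limit) {
                selected.add(e.url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Entry> input = Arrays.asList(
            Entry.hostDb("a.example", 1),
            Entry.crawlDb("a.example", "http://a.example/1"),
            Entry.crawlDb("b.example", "http://b.example/1"),
            Entry.crawlDb("a.example", "http://a.example/2"));
        System.out.println(select(input, 2));
    }
}
```

In this sketch the limit table only ever holds hosts that appear in the reducer's own input, which is the property the ordering guarantee is meant to provide.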

Key Components

FloatTextPair

public static class FloatTextPair implements WritableComparable<FloatTextPair> {
    public FloatWritable first;  // score (negative for HostDb)
    public Text second;          // hostname (empty for CrawlDb)
}
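
The snippet above omits the methods that a `WritableComparable` has to provide. As a rough illustration of what the serialization round trip might look like, here is a sketch that uses only `java.io` so it runs without Hadoop on the classpath. `ScoreHostPairSketch` is a hypothetical stand-in: it uses a plain `float` and `String` instead of `FloatWritable` and `Text`, and its natural ordering (score, then hostname) is an assumption rather than the PR's actual `compareTo`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ScoreHostPairSketch implements Comparable<ScoreHostPairSketch> {
    float score;  // negative sentinel for HostDb records
    String host;  // empty for CrawlDb records

    ScoreHostPairSketch(float score, String host) {
        this.score = score;
        this.host = host;
    }

    /** Same shape as Writable.write: fields are written in a fixed order. */
    void write(DataOutput out) throws IOException {
        out.writeFloat(score);
        out.writeUTF(host);
    }

    /** Same shape as Writable.readFields: fields are read back in the same order. */
    void readFields(DataInput in) throws IOException {
        score = in.readFloat();
        host = in.readUTF();
    }

    /** Assumed natural ordering: by score, then by hostname. */
    @Override
    public int compareTo(ScoreHostPairSketch other) {
        int byScore = Float.compare(score, other.score);
        return byScore != 0 ? byScore : host.compareTo(other.host);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ScoreHostPairSketch)) {
            return false;
        }
        ScoreHostPairSketch p = (ScoreHostPairSketch) o;
        return Float.compare(score, p.score) == 0 && host.equals(p.host);
    }

    @Override
    public int hashCode() {
        return 31 * Float.hashCode(score) + host.hashCode();
    }

    /** Serializes a pair to bytes and reads it back into a fresh instance. */
    static ScoreHostPairSketch roundTrip(ScoreHostPairSketch pair) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            pair.write(new DataOutputStream(bytes));
            ScoreHostPairSketch copy = new ScoreHostPairSketch(0f, "");
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping `equals`, `hashCode`, and `compareTo` consistent with each other matters here because the key is used both for sorting and for grouping.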

ScoreHostKeyComparator

Sorting order:

  1. HostDb entries first (non-empty hostname), sorted by hostname
  2. CrawlDb entries second (empty hostname), sorted by score descending
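
The ordering described above can be expressed as an ordinary `java.util.Comparator`. The following is a minimal sketch, assuming a hypothetical `Key` class with a `float` score and a `String` host; the real `ScoreHostKeyComparator` operates on `FloatTextPair` keys inside Hadoop and may be implemented differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class KeyOrderSketch {

    /** Hypothetical stand-in for the composite key. */
    static final class Key {
        final float score;
        final String host;  // empty for CrawlDb keys

        Key(float score, String host) {
            this.score = score;
            this.host = host;
        }
    }

    /** HostDb keys first (by hostname), then CrawlDb keys (by score, descending). */
    static final Comparator<Key> ORDER = (a, b) -> {
        boolean aIsHostDb = !a.host.isEmpty();
        boolean bIsHostDb = !b.host.isEmpty();
        if (aIsHostDb != bIsHostDb) {
            return aIsHostDb ? -1 : 1;
        }
        if (aIsHostDb) {
            return a.host.compareTo(b.host);
        }
        return Float.compare(b.score, a.score);  // arguments swapped for descending order
    };

    /** Sorts a copy of the keys and labels each by hostname or, if empty, by score. */
    static List<String> sortedLabels(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        List<String> labels = new ArrayList<>();
        for (Key k : copy) {
            labels.add(k.host.isEmpty() ? Float.toString(k.score) : k.host);
        }
        return labels;
    }
}
```

Note that the sketch decides which group a key belongs to from the hostname alone, so the ordering does not depend on the exact sentinel score chosen for HostDb keys.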

HostDbReaderMapper

Reads the HostDb and emits each entry with a special key so that it sorts before the CrawlDb entries:

context.write(new FloatTextPair(-Float.MAX_VALUE, hostname), entry);

Configuration

| Property | Description |
| --- | --- |
| generate.hostdb | Path to HostDb (enables feature) |
| generate.max.count.expr | JEXL expression for per-host URL limit |
| generate.fetch.delay.expr | JEXL expression for per-host fetch delay |

Example JEXL Expressions

<!-- Limit hosts with many failures to 10 URLs -->
<property>
  <name>generate.max.count.expr</name>
  <value>connectionFailures > 100 ? 10 : 1000</value>
</property>

<!-- Increase delay for unreliable hosts -->
<property>
  <name>generate.fetch.delay.expr</name>
  <value>connectionFailures > 50 ? 5000 : 1000</value>
</property>
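
To make the boundary behaviour of these two expressions explicit, the sketch below mirrors them in plain Java. It is illustrative only: the Generator evaluates the expressions through JEXL against HostDb fields, whereas `HostLimitSketch`, `maxCount`, and `fetchDelay` are hypothetical names introduced here.

```java
class HostLimitSketch {

    /** Mirrors: connectionFailures > 100 ? 10 : 1000 */
    static int maxCount(long connectionFailures) {
        return connectionFailures > 100 ? 10 : 1000;
    }

    /** Mirrors: connectionFailures > 50 ? 5000 : 1000 */
    static long fetchDelay(long connectionFailures) {
        return connectionFailures > 50 ? 5000 : 1000;
    }

    public static void main(String[] args) {
        // The comparison is strict, so a host with exactly 100 failures keeps the higher limit.
        System.out.println(maxCount(100) + " " + maxCount(101));
        System.out.println(fetchDelay(50) + " " + fetchDelay(51));
    }
}
```

Because both comparisons use `>`, hosts sitting exactly on the threshold are treated as reliable; use `>=` in the expression if the threshold value itself should trigger the stricter setting.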

Performance

| Aspect | Before | After |
| --- | --- | --- |
| Memory per reducer | O(H), where H = total hosts | O(P), where P = hosts in partition |
| Startup time | Load entire HostDb | None (streaming) |
| Scalability | Limited by JVM heap | Scales with cluster size |

Backward Compatibility

  • When generate.hostdb is not set, behavior is unchanged
  • Existing configurations continue to work
  • JEXL expressions are evaluated only when a HostDb is provided

Testing

  • Unit tests (9): FloatTextPair serialization, equality, comparison; ScoreHostKeyComparator ordering
  • Integration tests (3): Variable max count, variable fetch delay, backward compatibility
