NUTCH-2455 Use secondary sorting for memory-efficient HostDb integration in Generator #888

lewismc · 2026-01-13T05:12:34Z

This PR is proposed as a fix for NUTCH-2455 and is intended to supersede #254.

In essence this PR implements scalable HostDb integration in the Generator using MapReduce secondary sorting, eliminating the need to load the entire HostDb into memory.

Problem

The previous implementation loaded the entire HostDb into memory at reducer startup. For crawls with millions of hosts, this caused:

High memory consumption (O(HostDb size) per reducer)
OutOfMemoryError for large HostDbs
Startup latency while loading data

Solution

Use MapReduce secondary sorting to stream HostDb entries through the pipeline:

Composite Key (FloatTextPair): Combines score and hostname to enable sorting
Custom Comparator (ScoreHostKeyComparator): Ensures HostDb entries arrive before CrawlDb entries
MultipleInputs: Reads both HostDb and CrawlDb in a single MapReduce job
Streaming Reducer: Processes HostDb entries as they arrive, no preloading required
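The streaming idea can be illustrated without Hadoop. The following is a minimal sketch, not the code in this PR: the `Rec` record shape, the `select` method and the `defaultMax` parameter are hypothetical names introduced only for illustration. It assumes the input is already sorted so that every HostDb record precedes every CrawlDb record, which is what the secondary sort is meant to guarantee.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StreamingReducerSketch {
    /** One record of the sorted reducer input (hypothetical shape). */
    static final class Rec {
        final boolean fromHostDb;
        final String host;
        final int maxCount; // only meaningful for HostDb records
        final String url;   // only meaningful for CrawlDb records

        private Rec(boolean fromHostDb, String host, int maxCount, String url) {
            this.fromHostDb = fromHostDb;
            this.host = host;
            this.maxCount = maxCount;
            this.url = url;
        }

        static Rec host(String host, int maxCount) {
            return new Rec(true, host, maxCount, null);
        }

        static Rec url(String host, String url) {
            return new Rec(false, host, 0, url);
        }
    }

    /** Single pass over sorted input: HostDb records first, then CrawlDb records. */
    static List<String> select(Iterable<Rec> sortedInput, int defaultMax) {
        Map<String, Integer> limits = new HashMap<>(); // only hosts seen in this partition
        Map<String, Integer> counts = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (Rec r : sortedInput) {
            if (r.fromHostDb) {
                limits.put(r.host, r.maxCount); // remember the per-host limit as it streams in
                continue;
            }
            int limit = limits.getOrDefault(r.host, defaultMax);
            int seen = counts.merge(r.host, 1, Integer::sum);
            if (seen <= limit) {
                selected.add(r.url);
            }
        }
        return selected;
    }
}
```

In this sketch the reducer holds limits only for the hosts that reach it, so its memory use follows the partition size rather than the size of the whole HostDb.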

Key Components

FloatTextPair

public static class FloatTextPair implements WritableComparable<FloatTextPair> {
    public FloatWritable first;  // score (negative for HostDb)
    public Text second;          // hostname (empty for CrawlDb)
}
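The snippet above shows only the fields. As a rough illustration of what a serializable composite key involves, here is a self-contained sketch that uses the `java.io.DataOutput` and `java.io.DataInput` interfaces, the same types that Hadoop's `Writable.write` and `Writable.readFields` accept. `PairSketch` and `roundTrip` are hypothetical names; the class uses `float` and `String` in place of `FloatWritable` and `Text`, encodes the hostname with `writeUTF` (not the encoding `Text` uses), and its natural ordering is a placeholder, since the ordering that matters in this PR is defined by `ScoreHostKeyComparator`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class PairSketch implements Comparable<PairSketch> {
    float first;   // score
    String second; // hostname, empty for CrawlDb entries

    PairSketch() {
        this(0f, "");
    }

    PairSketch(float first, String second) {
        this.first = first;
        this.second = second;
    }

    void write(DataOutput out) throws IOException {
        out.writeFloat(first);
        out.writeUTF(second);
    }

    void readFields(DataInput in) throws IOException {
        first = in.readFloat();
        second = in.readUTF();
    }

    /** Placeholder natural order: score ascending, then hostname. */
    @Override
    public int compareTo(PairSketch o) {
        int c = Float.compare(first, o.first);
        return c != 0 ? c : second.compareTo(o.second);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof PairSketch && compareTo((PairSketch) o) == 0;
    }

    @Override
    public int hashCode() {
        return 31 * Float.hashCode(first) + second.hashCode();
    }

    /** Serializes a key to bytes and reads it back into a fresh instance. */
    static PairSketch roundTrip(PairSketch in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            in.write(new DataOutputStream(bytes));
            PairSketch copy = new PairSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```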

ScoreHostKeyComparator

Sorting order:

HostDb entries first (non-empty hostname), sorted by hostname
CrawlDb entries second (empty hostname), sorted by score descending
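The ordering described above can be written as a plain `java.util.Comparator`. This is a minimal sketch under stated assumptions, not the PR's `ScoreHostKeyComparator`: the `Key` class, the `ORDER` constant and the `sorted` helper are hypothetical, and a key counts as a HostDb entry exactly when its hostname is non-empty.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ScoreHostOrderSketch {
    /** Stand-in for the composite key (hypothetical, not FloatTextPair itself). */
    static final class Key {
        final float score;
        final String host; // empty for CrawlDb entries

        Key(float score, String host) {
            this.score = score;
            this.host = host;
        }

        boolean isHostDbEntry() {
            return !host.isEmpty();
        }
    }

    /** HostDb keys first (by hostname), then CrawlDb keys by descending score. */
    static final Comparator<Key> ORDER = (a, b) -> {
        if (a.isHostDbEntry() != b.isHostDbEntry()) {
            return a.isHostDbEntry() ? -1 : 1;
        }
        if (a.isHostDbEntry()) {
            return a.host.compareTo(b.host);
        }
        return Float.compare(b.score, a.score); // higher score sorts earlier
    };

    /** Returns a sorted copy so the caller's list is left untouched. */
    static List<Key> sorted(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        return copy;
    }
}
```

Sorting a mix of both kinds of key with `ORDER` places every HostDb key ahead of every CrawlDb key, which is the property the reducer relies on.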

HostDbReaderMapper

Reads the HostDb and emits each entry under a sentinel key (score -Float.MAX_VALUE, non-empty hostname) so that it sorts before all CrawlDb entries:

context.write(new FloatTextPair(-Float.MAX_VALUE, hostname), entry);

Configuration

Property	Description
`generate.hostdb`	Path to HostDb (enables feature)
`generate.max.count.expr`	JEXL expression for per-host URL limit
`generate.fetch.delay.expr`	JEXL expression for per-host fetch delay

Example JEXL Expressions

<!-- Limit hosts with many failures to 10 URLs -->
<property>
  <name>generate.max.count.expr</name>
  <value>connectionFailures > 100 ? 10 : 1000</value>
</property>

<!-- Increase delay for unreliable hosts -->
<property>
  <name>generate.fetch.delay.expr</name>
  <value>connectionFailures > 50 ? 5000 : 1000</value>
</property>
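To make the semantics of the two example expressions concrete, here is a plain-Java restatement of the same conditionals. It does not use JEXL and is not how the Generator evaluates them; `HostLimitSketch`, `maxCount` and `fetchDelay` are hypothetical names, and `connectionFailures` is the variable name taken from the examples above.

```java
final class HostLimitSketch {
    /** Mirrors: connectionFailures > 100 ? 10 : 1000 */
    static int maxCount(long connectionFailures) {
        return connectionFailures > 100 ? 10 : 1000;
    }

    /** Mirrors: connectionFailures > 50 ? 5000 : 1000 */
    static long fetchDelay(long connectionFailures) {
        return connectionFailures > 50 ? 5000 : 1000;
    }
}
```

Both comparisons are strict, so a host with exactly 100 connection failures still gets the higher URL limit, and a host with exactly 50 keeps the lower delay.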

Performance

Aspect	Before	After
Memory per reducer	O(H) where H = total hosts	O(P) where P = hosts in partition
Startup time	Load entire HostDb	None (streaming)
Scalability	Limited by JVM heap	Scales with cluster size

Backward Compatibility

When generate.hostdb is not set, behavior is unchanged
Existing configurations continue to work
JEXL expressions are evaluated only when a HostDb is provided

Testing

Unit tests (9): FloatTextPair serialization, equality, comparison; ScoreHostKeyComparator ordering
Integration tests (3): Variable max count, variable fetch delay, backward compatibility

okedoki and others added 10 commits December 8, 2017 16:54

fix for NUTCH-2455 more efficient usage of hostdb in generate

c1ce018

added id to output files

c3e1a6e

Revert "added id to output files"

e20973c

This reverts commit c3e1a6e.

fix of the partitioner bug for NUTCH-2455

16f26f1

Merge branch 'master' into NUTCH-2455

d2451af

formating change #3

709aa0e

master conflict solved for NUTCH-2455

d608868

bug fix for NUTCH-2455 hostdatum in generate wasnot coppied correctly…

767e2e7

… by reference, fixed with clone

fix for NUTCH-2455 lost line hostDatum = entry.hostdatum in the mergi…

6fe1afd

…ng process

NUTCH-2455 Use secondary sorting for memory-efficient HostDb integrat…

a50c958

…ion in Generator
