@YuriNachos

Summary

  • Adds <base> tag handling to HTML2Text and CustomHTML2Text
  • Updates baseurl when <base href="..."> is encountered
  • Resolves relative links against that base URL, as the HTML standard specifies

Fixes

Fixes #1680

Details

The HTML2Text class was ignoring the <base> tag, causing relative links
to be resolved against the initial page URL instead of the base URL
specified in the HTML document.

According to HTML standards, the <base> element specifies the base URL
for all relative URLs in a document. This is commonly used in:

  • Content Management Systems (CMS)
  • Single Page Applications (SPA)
  • Documentation sites
  • Sites with complex URL structures

Changes

  1. HTML2Text.handle_tag(): Added <base> tag detection and baseurl update
  2. CustomHTML2Text.handle_tag(): Added the same handling, applied before preserved-tag processing
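
The idea behind both changes can be sketched with the standard library alone. This is not the code from this PR: `BaseAwareLinkCollector` is a hypothetical stand-in for HTML2Text built on `html.parser.HTMLParser`, and its `page_url`, `base_seen`, and `links` attributes are illustrative names. It also assumes that only the first <base> with an href should count, which is what the HTML standard says but may differ from what the patched classes do.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseAwareLinkCollector(HTMLParser):
    """Hypothetical stand-in for HTML2Text: collects resolved <a href> targets."""

    def __init__(self, baseurl=""):
        super().__init__()
        self.page_url = baseurl   # URL of the page itself
        self.baseurl = baseurl    # URL that relative links are resolved against
        self.base_seen = False    # assumption: only the first <base href> is honoured
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and not self.base_seen and attrs.get("href"):
            # A relative <base href> is itself resolved against the page URL.
            self.baseurl = urljoin(self.page_url, attrs["href"])
            self.base_seen = True
            return
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.baseurl, attrs["href"]))


collector = BaseAwareLinkCollector("https://example.com/index.php/page.html")
collector.feed(
    '<html><head><base href="https://example.com/"></head>'
    '<body><a href="files/document.pdf">doc</a></body></html>'
)
print(collector.links)
```

Updating the base URL at the moment the tag is parsed works because <base> normally appears in <head>, before any link that depends on it.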

Example

Before:

  • Page URL: https://example.com/index.php/page.html
  • Base tag: <base href="https://example.com/">
  • Relative link: files/document.pdf
  • Resolved to: https://example.com/index.php/files/document.pdf

After:

  • Resolved to: https://example.com/files/document.pdf
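
The two results above can be reproduced with `urllib.parse.urljoin`. This is only a check of the URL arithmetic under the values given in the example, not a call into HTML2Text; the variable names are illustrative.

```python
from urllib.parse import urljoin

page_url = "https://example.com/index.php/page.html"
base_href = "https://example.com/"
relative_link = "files/document.pdf"

# Before: the <base> tag is ignored, so the page URL is the base.
before = urljoin(page_url, relative_link)

# After: the <base href> (itself resolved against the page URL) is the base.
after = urljoin(urljoin(page_url, base_href), relative_link)

print(before)  # https://example.com/index.php/files/document.pdf
print(after)   # https://example.com/files/document.pdf
```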

Test plan

  • Code review confirms the <base> tag is properly handled
  • Both HTML2Text and CustomHTML2Text updated
  • baseurl is updated when a <base> tag is encountered
  • Follows HTML standard behavior

🤖 Generated with Claude Code

Fixes unclecode#1680

The HTML2Text class was ignoring the <base> tag, causing relative links
to be resolved against the page URL instead of the base URL specified
in the <base href="..."> attribute.

Added <base> tag handling in both HTML2Text and CustomHTML2Text to update
self.baseurl when the tag is encountered, ensuring proper link resolution
according to HTML standards.

Co-Authored-By: Claude <noreply@anthropic.com>
@YuriNachos force-pushed the fix/issue-1680-html2text-base-tag branch from a3355fa to 2016d66 on January 17, 2026 11:16

Development

Successfully merging this pull request may close these issues.

[Bug]: html2text ignores <base> tag when resolving relative links
