Extract Clean Web Content with Defuddle.js

Category: JavaScript, Recommended | June 2, 2025
Author: kepano
Last Update: June 2, 2025
License: MIT

Defuddle is a web content extraction library that pulls the main content out of web pages by removing clutter and standardizing HTML. It works in both the browser and Node.js.

This library parses a given HTML document (or string), identifies & preserves the core article content, and strips away elements like sidebars, headers, footers, and other non-essential content.

It serves as an enhanced alternative to Mozilla Readability with more forgiving extraction algorithms and consistent output formatting.

Features:

  • Content Extraction: Removes sidebars, headers, footers, comments, and other non-essential elements
  • Mobile-Aware Detection: Uses page mobile styles to identify unnecessary elements
  • Metadata Extraction: Pulls author, title, description, publication date, and schema.org data
  • HTML Standardization: Normalizes headings, code blocks, footnotes, and math elements
  • Multiple Output Formats: Supports HTML and Markdown conversion
  • Performance Tracking: Includes parse time metrics and word count statistics
  • Bundle Options: Core, full (with math parsing), and Node.js-optimized versions
  • Debug Mode: Preserves attributes and structure for development analysis

Use Cases:

  • Building “Read It Later” Apps: If you’re creating a service where users can save articles to read later, Defuddle can provide that clean, distraction-free reading view.
  • Content Ingestion for AI/RAG Applications: When you need to feed webpage content into a Retrieval Augmented Generation (RAG) system or any LLM, you want the core text, not the surrounding noise. Defuddle gets you that cleaner text (see the sketch after this list).
  • Web Clipping Browser Extensions: This is where Defuddle originated, powering the Obsidian Web Clipper. If you’re building an extension that needs to grab the main content of a page, Defuddle is a solid choice.
  • Automated Content Processing: For tasks like scraping articles for analysis or converting web content into different formats (like Markdown for a knowledge base), Defuddle handles the initial cleanup.
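
Below is a minimal ingestion sketch for the RAG use case, written against the defuddle/node helper shown in step 4; the URL is an example and the chunking step is a naive placeholder, not part of Defuddle itself.

import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';

const url = 'https://example.com/some-article';
const dom = await JSDOM.fromURL(url);
const article = await Defuddle(dom, url, { markdown: true });

// Naive paragraph-level chunking of the Markdown output before embedding.
const chunks = article.content
  .split(/\n{2,}/)
  .map((text) => ({ source: url, title: article.title, text: text.trim() }))
  .filter((chunk) => chunk.text.length > 0);

console.log(`${chunks.length} chunks from "${article.title}" (${article.wordCount} words)`);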

How to use it:

1. Install Defuddle with NPM. Available bundles:

  • Core (defuddle): For browser use, no extra dependencies. Handles math content but without fallbacks for MathML/LaTeX conversion.
  • Full (defuddle/full): Includes more robust math parsing with mathml-to-latex and temml.
  • Node.js (defuddle/node): Optimized for Node.js with JSDOM, includes full math and Markdown conversion capabilities.
# NPM
$ npm install defuddle
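
Once installed, import whichever entry point matches the bundle you picked above. A rough sketch, assuming the core and full bundles share the same default export (the Node.js bundle uses a named export, as shown in step 4):

// Core bundle, for the browser:
import Defuddle from 'defuddle';
// Full bundle, with mathml-to-latex and temml:
// import Defuddle from 'defuddle/full';
// Node.js bundle (named export):
// import { Defuddle } from 'defuddle/node';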

2. If you’re planning to use it in a Node.js environment, you’ll also need JSDOM:

# NPM
$ npm install jsdom

3. In a browser environment, you can import Defuddle and pass it a DOM document object. The parse method returns an object with these properties:

  • content: Cleaned HTML or Markdown string
  • title: Article title
  • author: Author name
  • description: Article summary/description
  • domain: Source website domain
  • favicon: Website favicon URL
  • image: Main article image URL
  • metaTags: Raw meta tag data
  • parseTime: Processing time in milliseconds
  • published: Publication date string
  • site: Website name
  • schemaOrgData: Structured data from schema.org markup
  • wordCount: Total word count of extracted content
import Defuddle from 'defuddle';
// Assuming 'document' is the current page's document
const defuddleInstance = new Defuddle(document);
const article = defuddleInstance.parse();
console.log(article.content);
console.log(article.title);

4. For server-side processing, the setup is slightly different. You’ll use defuddle/node.

import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';
// To parse HTML from a string
const htmlString = '<html><body><article><h1>My Title</h1><p>Some content.</p></article></body></html>';
const articleFromString = await Defuddle(htmlString);
console.log(articleFromString.title);
// To parse HTML from a URL
const dom = await JSDOM.fromURL('https://example.com/some-article');
const articleFromUrl = await Defuddle(dom); // You can also pass the URL string directly
console.log(articleFromUrl.content);
// With options
const articleWithOptions = await Defuddle(dom, 'https://example.com/some-article', {
  debug: true,
  markdown: true
});
// This will be Markdown
console.log(articleWithOptions.content);

5. Available configuration options. You can pass an options object when instantiating Defuddle in the browser or when calling Defuddle in Node.js (a combined example follows the list):

  • debug (boolean): Enables more verbose logging and preserves more attributes in the HTML (useful for, well, debugging).
  • url (string): The original URL of the page, which can help with resolving relative links and metadata.
  • markdown (boolean): If true, the content property in the result will be Markdown.
  • separateMarkdown (boolean): If true, content remains HTML, and an additional contentMarkdown property is returned.
  • removeExactSelectors (boolean, default: true): Controls removal of elements matching precise selectors (e.g., specific ad classes).
  • removePartialSelectors (boolean, default: true): Controls removal of elements matching broader, partial selectors.
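
A brief sketch combining several of these options, using the Node.js helper from step 4 (the option names come from the list above; the values are illustrative):

import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';

const url = 'https://example.com/some-article';
const dom = await JSDOM.fromURL(url);

const result = await Defuddle(dom, url, {
  debug: false,
  separateMarkdown: true,       // keep HTML in `content`, add `contentMarkdown`
  removeExactSelectors: true,   // the default
  removePartialSelectors: true, // the default
});

console.log(result.content);         // cleaned HTML
console.log(result.contentMarkdown); // Markdown version of the same content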

FAQs

Q: Is Defuddle reliable for all websites?
A: No extraction library is 100% reliable for all websites due to the vast differences in web page structures.

Q: How does Defuddle handle paywalled content?
A: Defuddle can only parse the HTML content it’s given. If the main content is hidden behind a paywall and not present in the initial HTML (or the DOM passed to it), Defuddle won’t be able to extract it. It doesn’t bypass paywalls.

Q: How does the Markdown conversion work?
A: Defuddle uses a library like Turndown for HTML-to-Markdown conversion. The quality of the Markdown will depend on the cleanliness and structure of the HTML that Defuddle extracts. The HTML standardization step helps a lot here.

Q: What if the extracted metadata (author, date) is incorrect?
A: Metadata extraction relies on common HTML patterns, <meta> tags, and schema.org data. If a site doesn’t follow these conventions or has poorly structured metadata, Defuddle might struggle to get it right, similar to other libraries. The schemaOrgData field can be useful for debugging this, as it shows you the raw structured data it found.
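
A quick way to see what was actually found when metadata looks wrong, using the result fields listed in step 3:

import Defuddle from 'defuddle';

const article = new Defuddle(document).parse();
console.log(article.metaTags);      // raw meta tag data
console.log(article.schemaOrgData); // raw schema.org structured data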

Q: Can Defuddle handle single-page applications with dynamically loaded content?
A: Defuddle processes the DOM state at parse time, so it works with SPAs after content has loaded. For dynamically loaded content, you’ll need to trigger parsing after the relevant content appears in the DOM.
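
One way to do that is to wait for the article container to exist before parsing; a rough sketch (the '#article-root' selector is hypothetical and depends on the app you're targeting):

import Defuddle from 'defuddle';

// Resolve with a parsed result once an element matching `selector` exists.
function parseWhenReady(selector) {
  return new Promise((resolve) => {
    const observer = new MutationObserver(() => {
      if (document.querySelector(selector)) {
        observer.disconnect();
        resolve(new Defuddle(document).parse());
      }
    });
    observer.observe(document.body, { childList: true, subtree: true });

    // Handle the case where the content is already in the DOM.
    if (document.querySelector(selector)) {
      observer.disconnect();
      resolve(new Defuddle(document).parse());
    }
  });
}

parseWhenReady('#article-root').then((article) => {
  console.log(article.title, article.wordCount);
});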

Q: How does the mobile-aware detection actually work?
A: The library analyzes CSS media queries and mobile-specific styling to identify elements that are hidden or repositioned on smaller screens. Elements that disappear on mobile are often navigational or promotional content rather than primary article content. This heuristic significantly improves extraction accuracy on modern responsive sites.

Q: What happens to embedded media like videos and images?
A: Images within the main content area are preserved with their attributes intact. Embedded videos, iframes, and other rich media elements are retained if they appear to be part of the article content. Social media embeds and advertisement iframes are typically removed as part of the clutter detection.
