Skip to content

A fast, lightweight HTML to Markdown converter optimized for LLM consumption. Uses proven parsing libraries to deliver clean, well-structured markdown with intelligent content extraction and noise filtering.

License

Notifications You must be signed in to change notification settings

Nano-Collective/get-md

Repository files navigation

get-md

A fast, lightweight HTML to Markdown converter optimized for LLM consumption. Uses proven parsing libraries to deliver clean, well-structured markdown with intelligent content extraction and noise filtering.

Features

  • Lightning-fast: Converts HTML to Markdown in <100ms
  • Intelligent extraction: Uses Mozilla Readability to extract main content
  • LLM-optimized: Consistent formatting perfect for AI consumption
  • CLI included: Use from the command line or as a library
  • TypeScript: Full type definitions included
  • Zero downloads: No models to download, works instantly
  • Lightweight: Small package size (~10MB)
  • React Native compatible: Full support including content extraction!

Installation

npm install @nanocollective/get-md
# or
pnpm add @nanocollective/get-md
# or
yarn add @nanocollective/get-md

Quick Start

As a Library

import { convertToMarkdown } from "@nanocollective/get-md";

// From HTML string
const result = await convertToMarkdown("<h1>Hello</h1><p>World</p>");
console.log(result.markdown);
// # Hello
//
// World

// From URL (auto-detected)
const result = await convertToMarkdown("https://example.com");
console.log(result.metadata.title);

// From URL with custom timeout and headers
const result = await convertToMarkdown("https://example.com", {
  timeout: 10000,
  headers: { Authorization: "Bearer token" },
});

// Force URL mode if auto-detection fails
const result = await convertToMarkdown("example.com", { isUrl: true });

As a CLI

# From stdin
echo '<h1>Hello</h1>' | getmd

# From file
getmd input.html

# From URL
getmd https://example.com

# Save to file
getmd input.html -o output.md

API

convertToMarkdown(html, options?)

Convert HTML to clean, LLM-optimized Markdown.

Parameters:

  • html (string): Raw HTML string or URL to fetch
  • options (MarkdownOptions): Conversion options

Returns: Promise<MarkdownResult>

Options:

{
  // Content options
  extractContent?: boolean;       // Use Readability extraction (default: true)
  includeMeta?: boolean;          // Include YAML frontmatter (default: true)
  includeImages?: boolean;        // Include images (default: true)
  includeLinks?: boolean;         // Include links (default: true)
  includeTables?: boolean;        // Include tables (default: true)
  aggressiveCleanup?: boolean;    // Remove ads, nav, etc. (default: true)
  maxLength?: number;             // Max output length (default: 1000000)
  baseUrl?: string;               // Base URL for resolving relative links

  // URL fetch options (only used when input is a URL)
  isUrl?: boolean;                // Force treat input as URL (default: auto-detect)
  timeout?: number;               // Request timeout in ms (default: 15000)
  followRedirects?: boolean;      // Follow redirects (default: true)
  maxRedirects?: number;          // Max redirects to follow (default: 5)
  headers?: Record<string, string>; // Custom HTTP headers
  userAgent?: string;             // Custom user agent
}
```

## CLI Usage

```bash
getmd [input] [options]

Options:
  -o, --output <file>       Output file (default: stdout)
  --no-extract              Disable Readability content extraction
  --no-frontmatter          Exclude metadata from YAML frontmatter
  --no-images               Remove images from output
  --no-links                Remove links from output
  --no-tables               Remove tables from output
  --max-length <n>          Maximum output length (default: 1000000)
  --base-url <url>          Base URL for resolving relative links
  -v, --verbose             Verbose output
  -h, --help                Display help

Examples

Basic Conversion

import { convertToMarkdown } from "@nanocollective/getmd";

const html = `
  <article>
    <h1>My Article</h1>
    <p>This is a <strong>test</strong>.</p>
  </article>
`;

const result = await convertToMarkdown(html);
console.log(result.markdown);

With Metadata

// Metadata is included by default
const result = await convertToMarkdown(html);
console.log(result.markdown);
// ---
// title: "My Article"
// author: "John Doe"
// readingTime: 5
// ---
//
// # My Article
// ...

// To exclude metadata:
const resultNoMeta = await convertToMarkdown(html, { includeMeta: false });

CLI Examples

# Convert HTML file (frontmatter included by default)
getmd article.html -o article.md

# Fetch from URL
getmd https://blog.example.com/post -o post.md

# Remove images and links
getmd article.html --no-images --no-links -o clean.md

# Exclude frontmatter metadata
getmd article.html --no-frontmatter -o clean.md

React Native Support

get-md fully supports React Native including content extraction! We use happy-dom-without-node instead of JSDOM, which works across Node.js, React Native, and browser environments.

import { convertToMarkdown } from "@nanocollective/get-md";

// Works in React Native with full features!
const result = await convertToMarkdown(html, {
  extractContent: true, // Readability extraction works!
  includeMeta: true,
  // ... all other options work
});

All features work in React Native:

  • ✅ HTML to Markdown conversion
  • ✅ Mozilla Readability content extraction
  • ✅ Metadata extraction
  • ✅ Content cleaning and optimization
  • ✅ All formatting options

No special configuration needed!

Why get-md?

For LLMs

  • Consistent output: Deterministic markdown formatting helps LLMs learn patterns
  • Clean structure: Proper heading hierarchy, list formatting, and spacing
  • Noise removal: Automatically removes ads, navigation, footers, etc.
  • Fast processing: <100ms per document enables real-time workflows

vs Other Tools

  • Faster than LLM-based extractors: No model inference overhead
  • More reliable: Deterministic output, no hallucinations
  • Cheaper: No API costs
  • Privacy-friendly: Runs locally, no data sent to third parties

Community

We're a small community-led team building local and privacy-first AI solutions under the Nano Collective and would love your help! Whether you're interested in contributing code, documentation, or just being part of our community, there are several ways to get involved.

If you want to contribute to the code:

  • Read our detailed CONTRIBUTING.md guide for information on development setup, coding standards, and how to submit your changes.

If you want to be part of our community or help with other aspects like design or marketing:

  • Join our Discord server to connect with other users, ask questions, share ideas, and get help: Join our Discord server

  • Head to our GitHub issues or discussions to open and join current conversations with others in the community.

About

A fast, lightweight HTML to Markdown converter optimized for LLM consumption. Uses proven parsing libraries to deliver clean, well-structured markdown with intelligent content extraction and noise filtering.

Topics

Resources

License

Contributing

Stars

Watchers

Forks