If you’re working with Claude, Cursor, or other Model Context Protocol (MCP) clients, you’ve probably faced the challenge of connecting external knowledge sources. SiteMCP addresses this by fetching an entire website and serving it as an MCP server, with no complex setup required.
sitemcp is an open-source TypeScript tool, forked from sitefetch by egoist, that does one thing really well: it fetches an entire website (or parts of it) and spins up a local MCP server based on that content. You can then query the site’s information directly from your MCP-compatible client.
Features
- Website crawling and caching: Automatically explores and stores website content for faster future access
- Selectable crawling patterns: Target specific sections of websites using micromatch patterns
- Configurable concurrency: Adjust the number of simultaneous connections for faster scraping
- Custom content selection: Use CSS selectors to precisely target the content you need
- Tool name customization: Configure how the tool appears in your MCP client
- Local caching: Stores fetched content locally to avoid redundant downloads
- Multiple package manager support: Works with npm, pnpm, and bun
Use Cases
- Querying Documentation: This is the big one for me. Point sitemcp at a framework’s docs (like https://xxx.com/guide/) and ask questions about specific APIs or configurations directly within Cursor or Claude Desktop. No more endless tab switching and searching.
- Referencing Component Libraries: Similar to docs, if you’re working with something like DaisyUI (https://daisyui.com/components/), you can quickly ask for examples or usage details for specific components.
- Integrating Internal Wikis: If your team has an internal knowledge base accessible via HTTP, you could use sitemcp to make it queryable within your AI tools (assuming no complex authentication is required).
- Targeted Research: Need info only from a specific blog section or product category on a website? Use the --match flag to grab just that, creating a focused knowledge source for your AI.
How To Use It
1. Installation. You can run it directly without a global install using npx, bunx, or pnpx:
# Pick one:
npx sitemcp https://example.com
bunx sitemcp https://example.com
pnpx sitemcp https://example.com
Or, install it globally if you plan to use it often:
# Pick one:
npm install -g sitemcp
bun install -g sitemcp
pnpm install -g sitemcp
2. Basic Usage. Point it at the URL you want to fetch:
sitemcp https://scriptbyai.com
This will start fetching the site and make it available as an MCP server.
3. Speeding Things Up (Concurrency). For larger sites, fetching can take a while. You can try increasing the concurrency (default is often low, like 2 or 5):
sitemcp https://scriptbyai.com --concurrency 10
Be mindful not to overload the target server, though.
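To make the trade-off concrete, here is a minimal Python sketch of what a concurrency limit does in a crawler. It is not sitemcp's actual implementation (sitemcp is written in TypeScript); the crawl function and the simulated request are illustrative assumptions. The point is only that at most N requests are ever in flight, so a higher N finishes sooner but hits the target server harder.

```python
import asyncio


async def crawl(urls, concurrency):
    """Fetch every URL with at most `concurrency` requests in flight."""
    limit = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def fetch(url):
        nonlocal in_flight, peak
        async with limit:  # wait here until one of the slots frees up
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a real HTTP request
            in_flight -= 1
            return f"content of {url}"

    pages = await asyncio.gather(*(fetch(u) for u in urls))
    return pages, peak


# Hypothetical URLs, used only to drive the simulation.
urls = [f"https://example.com/page/{i}" for i in range(20)]
pages, peak = asyncio.run(crawl(urls, concurrency=5))
print(len(pages), "pages fetched; peak simultaneous requests:", peak)
```

With 20 pages and a limit of 5, the peak never exceeds 5 no matter how many pages are queued.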
4. Controlling the Tool Name. By default, sitemcp builds the MCP tool names from the domain name (e.g., indexOfScriptByAi, getDocumentOfScriptByAi). You can change this strategy:
# Use domain (default)
sitemcp https://scriptbyai.com -t domain
# Use subdomain: indexOfSubDomain / getDocumentOfSubDomain
sitemcp https://sub-domain.scriptbyai.com/ -t subdomain
# Use first pathname segment: indexOfFreeAi / getDocumentOfFreeAi
sitemcp https://scriptbyai.com/free-ai/ -t pathname
You can also limit the length if needed (some clients might truncate long names):
sitemcp https://scriptbyai.com -l 100 # Max 100 chars for tool name
5. Fetching Specific Pages. This is useful for large sites. Use the --match flag (you can use it multiple times) with micromatch patterns:
# Fetch only blog and tools sections from scriptbyai.com
sitemcp https://scriptbyai.com -m "/blog/**" -m "/tools/**"
Check the micromatch docs for pattern details. Patterns are tested against the URL’s pathname.
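If you want a feel for which URLs a pattern will keep, the Python sketch below approximates the idea: take the URL's pathname and test it against each glob. It handles only a small subset of glob syntax (** and *), so treat it as a rough mental model rather than micromatch's real behavior, which supports braces, negation, and edge cases this ignores. The helper names glob_to_regex and should_fetch are made up for illustration.

```python
import re
from urllib.parse import urlparse


def glob_to_regex(pattern):
    """Translate a tiny glob subset: `**` spans slashes, `*` stays in one segment."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def should_fetch(url, patterns):
    """Keep a URL if its pathname (query string ignored) matches any pattern."""
    path = urlparse(url).path
    return any(glob_to_regex(p).fullmatch(path) for p in patterns)


patterns = ["/blog/**", "/tools/**"]
for url in (
    "https://scriptbyai.com/blog/some-post/",
    "https://scriptbyai.com/tools/x?ref=1",
    "https://scriptbyai.com/about/",
):
    print(url, "->", "fetch" if should_fetch(url, patterns) else "skip")
```

Because only the pathname is tested, the domain and any query string play no part in the match.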
6. Refining Content Extraction. Sometimes, the automatic content extraction grabs sidebars or footers. If you inspect the page and find the main content lives inside an element with a specific class or ID (e.g., <main class="content">), you can tell sitemcp to look there:
sitemcp https://scriptbyai.com --content-selector ".content"
7. Configuring Your MCP Client. The key is to tell your MCP client how to run sitemcp. Here’s how you might configure Claude Desktop (the specific JSON structure might vary slightly for other clients like Cursor or Windsurf):
{
  "mcpServers": {
    "daisy-ui-components": {
      "command": "npx",
      "args": [
        "-y",
        "sitemcp",
        "https://daisyui.com",
        "-m",
        "/components/**",
        "--concurrency",
        "8"
      ]
    }
  }
}
Here, -y tells npx to execute without a confirmation prompt, -m restricts the fetch to the component pages, and --concurrency raises the number of simultaneous requests. You can swap npx for bunx or pnpx, or point "command" at the sitemcp binary directly if you installed it globally (in that case, drop "-y" and "sitemcp" from the args). Standard JSON does not allow comments, so keep notes like these out of the config file itself.
Important Tip: Fetching a large site can take time. It’s often better to run sitemcp once manually in your terminal with the desired flags (sitemcp https://yoursite.com -m "/docs/**"). Let it run and build the cache (~/.cache/sitemcp). Then, configure your MCP client to run the same command. The client will start the server much faster because the content is already cached.
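If you maintain several of these entries, generating the JSON can be less error-prone than editing it by hand. The Python sketch below rebuilds the same entry as the example above; the helper name mcp_server_entry is hypothetical, and it only uses the flags already shown in this article (-y, -m, --concurrency). Where your client keeps its config file is not covered here.

```python
import json


def mcp_server_entry(name, url, match=(), concurrency=None):
    """Build one `mcpServers` entry that launches sitemcp through npx."""
    args = ["-y", "sitemcp", url]
    for pattern in match:
        args += ["-m", pattern]  # one -m per pattern
    if concurrency is not None:
        args += ["--concurrency", str(concurrency)]  # args must be strings
    return {name: {"command": "npx", "args": args}}


config = {
    "mcpServers": mcp_server_entry(
        "daisy-ui-components",
        "https://daisyui.com",
        match=["/components/**"],
        concurrency=8,
    )
}
config_text = json.dumps(config, indent=2)
print(config_text)
```

The output is strict JSON, so it can be pasted into a client config without any comment-related parse errors.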
Pros
- No API key requirements – Works with any public website
- Simple integration – Easy to set up with popular MCP clients
- Selective crawling – Fetch only the content you need
- Smart caching – Speeds up repeated access to the same sites
- Customizable output – Control how content is presented to AI tools
- Open-source foundation – Built on sitefetch, an existing open-source site fetcher
Cons
- Performance limitations – Large sites can take time to fetch initially
- Content extraction quality varies – Some websites may not render correctly
- No automatic updates – You must manually refresh the cache for new content
- Limited by website structure – Heavily JavaScript-dependent sites may not work well
- Client compatibility – Works only with MCP-compatible AI tools
Related Resources
- Mozilla Readability – The content extraction library used by SiteMCP
- Micromatch – Pattern matching library used for page selection
- sitefetch – The original project that SiteMCP forked from
- Claude Desktop – A compatible MCP client
- What is Model Context Protocol (MCP) – Learn more about the underlying protocol
- Curated MCP servers – A directory of curated & open-source Model Context Protocol servers.
FAQs
Q: How is sitemcp different from just scraping a site with Python?
A: While both fetch web content, sitemcp is specifically designed to serve that content via the MCP protocol, making it directly usable by compatible AI clients. It also includes features tailored for this, like tool naming and integration commands, plus the use of Readability for cleaner content extraction focused on the main text. It’s less about raw data extraction and more about creating a queryable knowledge source for AI.
Q: Does it work with sites that heavily rely on JavaScript to render content?
A: It might struggle. sitemcp primarily fetches the HTML source and then uses Readability. If critical content is only rendered after complex JavaScript execution in the browser, sitemcp might not capture it accurately. It works best with sites where the main content is present in the initial HTML or rendered fairly simply.
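A quick way to see why: the Python sketch below extracts visible text from raw HTML without running any scripts, which is roughly the position any HTML-only fetcher is in. It uses Python's standard html.parser, not Readability, and the two sample pages are invented for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Content present in the initial HTML.
static_page = "<main><h1>Install</h1><p>Run the installer.</p></main>"
# Same heading, but only injected by JavaScript at runtime.
js_page = (
    '<div id="app"></div>'
    '<script>document.getElementById("app").textContent = "Install";</script>'
)

print(repr(visible_text(static_page)))
print(repr(visible_text(js_page)))
```

The static page yields its text, while the script-driven page yields nothing, because the content never exists until a browser executes the script.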
Q: Where exactly is the cache stored, and can I clear it?
A: By default, it’s in ~/.cache/sitemcp on Linux/macOS (likely C:\Users\YourUser\AppData\Local\sitemcp\Cache on Windows, though check the tool’s output). You can simply delete this directory to clear the cache. You can also run sitemcp with the --no-cache flag to bypass caching for a specific run.
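If you would rather script the cleanup, a small Python sketch like the one below does the same thing as deleting the directory by hand. It assumes the ~/.cache/sitemcp location mentioned above; the clear_cache helper is made up for illustration, and the demo runs against a temporary directory so nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path

# Default location as described above; adjust if your setup differs.
DEFAULT_CACHE = Path.home() / ".cache" / "sitemcp"


def clear_cache(cache_dir=DEFAULT_CACHE):
    """Delete the cache directory. Returns True if something was removed."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demo against a throwaway directory instead of the real cache.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "sitemcp"
    (demo / "example.com").mkdir(parents=True)
    removed = clear_cache(demo)        # directory existed, so it is deleted
    removed_again = clear_cache(demo)  # nothing left to delete
    print(removed, removed_again)
```

Calling clear_cache() with no argument would target the default location, so the next sitemcp run fetches everything again.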
Q: Can I use this for sites that require login/authentication?
A: Generally, no. sitemcp fetches publicly accessible URLs. It doesn’t include mechanisms for handling login forms, cookies, or authentication headers out of the box. It’s best suited for public documentation, blogs, or websites.










