content-extraction · GitHub Topics

🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

batch-processing claude content-extraction data-collection firecrawl firecrawl-ai llm-tools mcp-server model-context-protocol search-api web-crawler web-scraping javascript-rendering mcp

JavaScript 4.5 k

2 days ago

graphlit / graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

claude content-extraction data-collection llm-tools mcp-server model-context-protocol search-api unstructured-data web-crawler web-scraping

TypeScript 357

14 days ago

currentslab / extractnet

#Computer Science# A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

content-extraction webscraping web-scraping text-mining news Machine Learning Python

HTML 293

4 months ago

mvasilkov / readability2

Readability2 converts HTML to plain text.

JavaScript readability HTML plaintext content-extraction

TypeScript 108

7 years ago

tuffstuff9 / nextjs-pdf-parser

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

content-extraction filepond Next pdf-parser pdf-parsing

TypeScript 63

2 years ago

gregors / boilerpipe-ruby

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

content-extraction webscraping news

Ruby 43

5 years ago

oiwn / dom-content-extraction

#Web Crawler# DOM-Based Content Extraction via Text Density

scraping content-extraction

Rust 35

4 months ago

nikitautiu / learnhtml

#Computer Science# Web content extraction using machine learning

Deep Learning HTML content-extraction

HTML 34

5 years ago

spences10 / mcp-jinaai-reader

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

content-extraction documentation-tool llm-tools mcp model-context-protocol text-extraction web-scraping

JavaScript 30

5 months ago

pdfix / pdfix_sdk_example_cpp

Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

pdfua digital-signature pdf-converter pdf-manipulation extract-data watermark HTML metadata conversion converter tagging wcag sign pdf content-extraction Web Accessibility (a11y)

C++ 20

6 months ago

gdamdam / sumo

#Natural Language Processing# Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

Natural Language Processing content-extraction nltk entity-recognition semantic-analysis

Python 20

7 years ago

timoteostewart / benson

Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

content-extraction web-scraping productivity

Python 14

10 months ago

bencmc / youtube_video_summarizer

#Natural Language Processing# This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

content-extraction gpt-35-turbo natural Natural Language Processing openai Python text-processing video-processing youtube-api langchain-python Streamlit

Python 14

2 years ago

LandWhale2 / TD-Spider

#Web Crawler# Simple Web Crawler in Go via Text Density

Go web-crawler content-extraction data-mining Document Object Model (DOM) Open Source scraping

Go 13

2 years ago

peremenov / seize

Seize is a lightweight Node or browser web-page content extractor inspired by Arc90 Readability and Safari Reader

content-extraction Document Object Model (DOM) readability extract reader

HTML 12

8 years ago

kamjin3086 / Crawell

📸 Crawell: a free browser extension for one-click image and article extraction, Markdown conversion, and bulk download, with 100% local processing.

browser-extension Chrome Extension content-extraction edge-extension Firefox Extension Markdown privacy-first React Tailwind CSS TypeScript web-scraping

TypeScript 10

1 month ago

vakharwalad23 / mark-minion

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

TypeScript ai-powered cloudflare-worker content-extraction document-processing Puppeteer web-scraping

TypeScript 10

13 days ago

amirthfultehrani / Youtube-Transcript-Copier

A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.

Web Accessibility (a11y) Automation browser-extension clipboard content-extraction data-extraction Userscripts helper JavaScript productivity text-extraction Tools transcript utilities Video Web YouTube

JavaScript 9

7 months ago

zeoagency / mobile-first-indexing-tool

Mobile First Indexing Tool

Search Engine Optimization (SEO) content-extraction aws-lambda lighthouse

Python 9

3 years ago

pinkpixel-dev / web-scout-mcp

#Web Crawler# A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI a...

ai-assistant ai-tools cheerio content-extraction Crawler DuckDuckGo google-search mcp mcp-server web-crawler web-scraper web-scraping web-search

JavaScript 8

3 months ago