web-crawler · GitHub Topics

firecrawl / firecrawl

#Web crawler# Firecrawl is an API service that crawls a URL and converts it into clean markdown or structured data

AI crawler data Markdown scraper html-to-markdown LLM rag scraping web-crawler ai-scraping webscraping

TypeScript 57.18 k

1 hour ago

ScrapeGraphAI / Scrapegraph-ai

#Web crawler# Python scraper based on AI

scraping scraping-python automated-scraper LLM AI web-crawler web-scraping ai-scraping crawler html-to-markdown Markdown rag

Python 21.29 k

1 month ago

apify / crawlee

#Web crawler# Crawlee - a web scraping and browser automation library for Node.js

web-scraping web-crawling npm headless-chrome Puppeteer automation apify scraping crawling crawler headless scraper web-crawler JavaScript Node.js Playwright TypeScript

TypeScript 19.46 k

11 hours ago

crawlab-team / crawlab

#Web crawler# Distributed web crawler admin platform for managing spiders, regardless of language or framework.

webcrawler scrapy crawlab spiders-management Go scrapyd-ui spider crawler webspider web-crawler Docker platform crawling-tasks

Go 11.91 k

2 days ago

ssssssss-team / spider-flow

#Web crawler# A next-generation crawler platform that defines crawl flows graphically, so a crawler can be built without writing code.

spider crawler jsoup xpath web-spider webspider webcrawler web-crawler spider-flow

Java 10.9 k

2 years ago

BruceDone / awesome-crawler

#Web crawler# A collection of awesome web crawlers and spiders in different languages

web-crawler crawler web-scraper spider scraper Awesome Lists

6.95 k

1 year ago

adithya-s-k / omniparse

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

OCR omniparser parse-server parser-library vision-transformer web-crawler

Python 6.69 k

3 months ago

apify / crawlee-python

#Web crawler# Crawlee: a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works...

apify automation beautifulsoup crawler crawling headless headless-chrome pip Playwright Python scraper scraping web-crawler web-crawling web-scraping Hacktoberfest

Python 6.3 k

12 hours ago

firecrawl / firecrawl-mcp-server

🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

batch-processing claude content-extraction data-collection firecrawl firecrawl-ai llm-tools mcp-server model-context-protocol search-api web-crawler web-scraping javascript-rendering mcp

JavaScript 4.5 k

2 days ago

apache / nutch

#Web crawler# Apache Nutch is an extensible and scalable web crawler

Java nutch web-crawler crawling hadoop apache

Java 3.07 k

2 days ago

sjdirect / abot

#Web crawler# Cross-platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

C# crawler web-crawler Parsing spider spiders pluggable Unit testing netcore netcore2 netcore3 netstandard20 cross-platform

C# 2.29 k

1 year ago

jasonxtn / Argus

The Ultimate Information Gathering Toolkit

dns-lookup information-gathering OSINT recon-tools reconnaissance virustotal web-crawler whois-lookup

Python 2.27 k

1 year ago

xianhu / PSpider

#Web crawler# A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

crawler spider Python proxies web-spider multi-threading web-crawler python-spider multiprocessing

Python 1.84 k

3 years ago

MarginaliaSearch / MarginaliaSearch

#Search# Internet search engine for text-oriented websites. Indexing the small, old and weird web.

search engine no-cloud small-web internet-search indexer language-processing web-crawler alt-search self-hosted Java

HTML 1.48 k

2 days ago

gildas-lormeau / single-file-cli

#Web crawler# CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

CLI Node.js single-file web-archiving web-scraper web-scraping archiving scraping-websites crawler web-crawler Deno Dockerfile

JavaScript 974

3 months ago

Algebra-FUN / WeReadScan

A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs

Selenium weread web-crawler book-downloader

Python 972

2 years ago

apache / stormcrawler

#Web crawler# A scalable, mature and versatile web crawler based on Apache Storm

web-crawler distributed Java crawler

Java 931

3 days ago

webrecorder / browsertrix-crawler

#Web crawler# Run a high-fidelity browser-based web archiving crawler in a single Docker container

crawler crawling warc web-archiving web-crawler

TypeScript 872

2 days ago

postmodern / spidr

#Web crawler# A versatile Ruby web spidering library that can spider a single site, multiple domains, or specific links, or crawl indefinitely. Spidr is designed to be fast and easy to use.

spider Ruby crawler Web scraper web-scraping web-spider web-crawler web-scraper

Ruby 825

2 months ago

cxcscmu / Craw4LLM

#Web crawler# Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

crawler crawling large-language-models LLM pre-training pretraining web-crawler web-crawling

Python 638

7 months ago