GitHub Chinese Community
#warc

ArchiveBox / ArchiveBox

"Your own personal internet archive" (website archiving / crawler), a self-hosted time machine for websites

pocket, wget, browser-bookmarks, pinboard, Chromium, Firefox, backups, RSS, web-archiving, Python, wayback-machine, youtube-dl, self-hosted, headless-browser, digipres, warc
Python 24.05 k
1 month ago
internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Java, warc
Java 2.98 k
3 days ago
Rhizome-Conifer / conifer

Collect and revisit web pages.

web-archiving, archives, Python, Docker, warc
Python 1.5 k
5 months ago
ArchiveTeam / grab-site

#web crawler# The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

archiving, crawl, spider, crawler, warc
Python 1.49 k
23 days ago
webrecorder / archiveweb.page

A high-fidelity web archiving extension for Chrome and Chromium-based browsers

Chromium, extension, web-archiving, archiving, browser-extension, warc
TypeScript 1.02 k
15 days ago
webrecorder / browsertrix-crawler

#web crawler# Run a high-fidelity browser-based web archiving crawler in a single Docker container

crawler, crawling, warc, web-archiving, web-crawler
TypeScript 802
4 days ago
webrecorder / replayweb.page

Serverless replay of web archives directly in the browser

web-archiving, web-archive, wayback-machine, warc, service-worker
TypeScript 801
12 days ago
oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

IPFS, warc, web-archiving, Python, service-worker, Docker
Python 638
1 month ago
webrecorder / webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

warc, Electron, web-archiving
JavaScript 446
5 years ago
webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO

web-archiving, warc, Python
Python 416
6 months ago
Florents-Tselai / WarcDB

#web crawler# WarcDB: Web crawl data as SQLite databases.

crawling, SQLite, warc, command-line interface, database, web-archiving
Python 398
1 year ago
machawk1 / wail

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

web-archiving, Python, GUI, warc, pyinstaller
Roff 374
3 months ago
commoncrawl / news-crawl

#web crawler# News crawling with StormCrawler; stores content as WARC

crawler, news, warc, web-crawler
Java 346
4 months ago
bitextor / bitextor

#web crawler# Bitextor generates translation memories from multilingual websites

dictionaries, crawler, wget, Parsing, warc, corpus-tools, corpus-processing, machine-translation, neural-machine-translation, statistical-machine-translation
Python 293
7 months ago
webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder, designed to make web archiving easier and more accessible for all.

archiving, cloud, warc, web-archive, web-archiving, Kubernetes
TypeScript 280
4 days ago
harvard-lil / warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

artificial intelligence, rag, warc
Python 250
4 months ago
machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"

Chrome extension, warc, web-archiving
JavaScript 221
2 years ago
cocrawler / cocrawler

#web crawler# CoCrawler is a versatile web crawler built using modern tools and concurrency.

crawler, Python, async-python, warc, screenshot, concurrency, aiohttp
Python 191
3 years ago
cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

web-archiving, warc, Python
Python 175
5 months ago
helgeho / ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

Apache Spark, web-archiving, internet-archive, warc
Scala 150
16 days ago