warc · GitHub Topics

"Your own personal internet archive" (website archiving / crawler), a self-hosted website time machine

pocket wget browser-bookmarks pinboard Chromium Firefox backups RSS web-archiving Python wayback-machine youtube-dl self-hosted headless-browser digipres warc

Python 24.95 k

4 months ago

internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Java warc

Java 3.06 k

2 days ago

Rhizome-Conifer / conifer

Collect and revisit web pages.

web-archiving archives Python Docker warc

Python 1.52 k

8 months ago

ArchiveTeam / grab-site

#Web crawler# The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

archiving crawl spider crawler warc

Python 1.52 k

4 months ago

webrecorder / archiveweb.page

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!

Chromium extension web-archiving archiving browser-extension warc

TypeScript 1.06 k

14 hours ago

webrecorder / browsertrix-crawler

#Web crawler# Run a high-fidelity browser-based web archiving crawler in a single Docker container

crawler crawling warc web-archiving web-crawler

TypeScript 872

2 days ago

webrecorder / replayweb.page

Serverless replay of web archives directly in the browser

web-archiving web-archive wayback-machine warc service-worker

TypeScript 840

2 days ago

oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

IPFS warc web-archiving Python service-worker Docker

Python 643

16 days ago

webrecorder / webrecorder-player

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

warc Electron web-archiving

JavaScript 448

5 years ago

webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO

web-archiving warc Python

Python 430

9 months ago

Florents-Tselai / WarcDB

#Web crawler# WarcDB: Web crawl data as SQLite databases.

crawling SQLite warc command-line-interface database web-archiving

Python 406

1 year ago

machawk1 / wail

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

web-archiving Python GUI warc pyinstaller

Roff 377

6 months ago

commoncrawl / news-crawl

#Web crawler# News crawling with StormCrawler - stores content as WARC

crawler news warc web-crawler

Java 355

7 months ago

webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

archiving cloud warc web-archive web-archiving Kubernetes

TypeScript 328

2 days ago

bitextor / bitextor

#Web crawler# Bitextor generates translation memories from multilingual websites

dictionaries crawler wget Parsing warc corpus-tools corpus-processing machine-translation neural-machine-translation statistical-machine-translation

Python 295

10 months ago

harvard-lil / warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

artificial-intelligence rag warc

Python 258

7 months ago

machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"

Chrome extension warc web-archiving

JavaScript 223

2 years ago

cocrawler / cocrawler

#Web crawler# CoCrawler is a versatile web crawler built using modern tools and concurrency.

crawler Python async-python warc screenshot concurrency aiohttp

Python 190

3 years ago

cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

web-archiving warc Python

Python 183

8 months ago

helgeho / ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

Apache Spark web-archiving internet-archive warc

Scala 152

1 month ago