#Natural Language Processing#Module for automatic summarization of text documents and HTML pages.
Reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the live alternative)
#Web Crawler#Automatically extract the main text content (and more) from an HTML document
PHP library that determines which CSS is used by given HTML snippets.
Xtract-html is a tool for extracting the HTML display code from a website, which you can also use on your own website.
Xtract-htmlV2 is a tool for getting the HTML code from any website you choose; it is the successor to the previous version.
Media Graper is an open-source tool for Linux that extracts all the images, links, and videos from a webpage.
A simple extractor based on BeautifulSoup. You can use it to iterate through all the HTML files in the website root directory and extract the text, placeholders, and other text.
HTML‐to‐Anki Enhanced Human Explanation & Reasoning Tool (HEART). A Python CLI that leverages the OpenAI API to transform full UWorld vignettes into AI-enhanced Anki cards.