GitHub Chinese Community

text-extraction

adbar / trafilatura

#Web crawler# Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

web-scraping, text-extraction, natural language processing, text-mining, crawler, text-preprocessing, article-extractor, readability, scraping, html-to-markdown, corpus-tools, rss-feed, news-aggregator, rag, large language models
Python 4.36 k
16 days ago
miso-belica / sumy

#Natural language processing# Module for automatic summarization of text documents and HTML pages.

Python, lsa, textteaser, html-page, summarizer, pagerank-algorithm, reduction, text-extraction, html-extraction, html-extractor, summarization, summary, natural language processing
Python 3.6 k
1 year ago
unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)

Go, pdf, pdf-library, pdf-generation, pdf-document-processor, text-extraction, pdf-manipulation, signing, pdf-sign, pdf-generator
Go 2.82 k
1 month ago
Goldziher / kreuzberg

A text extraction library supporting PDFs, images, office documents and more

asyncio, docx, OCR, pdf, text-extraction
Python 1.85 k
6 days ago
chrismattmann / tika-python

#Natural language processing# Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Python, Parsing, text-extraction, mime, buffer, memex, text-recognition, detection, recognition, natural language processing, nlp-library, COVID-19, extraction
Python 1.59 k
2 months ago
whitelok / image-text-localization-recognition

#Computer science# A general list of resources for image text localization and recognition (a collection of papers and implementations for scene text localization and recognition)

text-recognition, text-detection, convolutional-neural-networks, deep learning, OCR, text-extraction, machine learning, Awesome Lists
952
2 years ago
miso-belica / jusText

Heuristic based boilerplate removal tool

Python, text-extraction, html-parser, html-parsing
Python 782
4 months ago
unidoc / unidoc

This repository has moved! https://github.com/unidoc/unipdf

Go, pdf, pdf-library, pdf-files, text-extraction, pdf-invoice
Go 708
6 years ago
ICIJ / datashare

A self-hosted search engine for documents.

named-entity-recognition, text-extraction, extract, investigative-journalism, elasticsearch, Docker, web-gui
Java 635
4 days ago
ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

text-extraction, R, rstats, pdf-files, r-package
C++ 535
3 months ago
cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

srt, subtitle, subtitles, text-extraction, Python, mit-license, tools, command-line interface, command-line-tool, Library
Python 512
1 year ago
shixzie / nlp

#Natural language processing# [UNMAINTAINED] Extract values from strings and fill your structs with nlp.

natural language processing, Parsing, Go, text-extraction, text
Go 389
8 years ago
flairNLP / fundus

#Web crawler# A very simple news crawler with a funny name

corpus, crawler, natural language processing, Python, RSS, scraper, sitemap, text-extraction, web-scraping, corpus-tools, dataset, image-classification
Python 388
4 days ago
iamarunbrahma / vision-parse

Parse PDFs into markdown using Vision LLMs

document-parser, pdf-parser, pdf-to-markdown, text-extraction
Python 386
4 months ago
pd3f / pd3f

#Computer science# 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

pdf, text-extraction, pdf-to-text, pipeline, machine learning, OCR, language-model, extract-text, parsr, Python
HTML 320
2 years ago
py-pdf / benchmarks

Benchmarking PDF libraries

benchmark, data-extraction, mupdf, pdf, pypdf2, text-extraction
Python 286
2 years ago
bookieio / breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

Python, text-mining, text-extraction, html-extraction, html-extractor, html-parsing
HTML 204
1 year ago
weareprestatech / hotpdf

hotpdf is a fast PDF parsing library, built on top of pdfminer.six, to extract text and find text within PDF documents

pdf, Python, text-extraction, text-search
Python 193
6 months ago
SapienzaNLP / extend

#Natural language processing# Entity Disambiguation as text extraction (ACL 2022)

natural language processing, Entity resolution, text-extraction, PyTorch, acl
Python 182
3 years ago
skylander86 / lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

text-extraction, aws-lambda, OCR, lambda-functions, pdf, tesseract
Python 177
7 years ago