text-extraction · GitHub Topics

#Web crawler# Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

web-scraping text-extraction natural-language-processing text-mining crawler text-preprocessing article-extractor readability scraping html-to-markdown corpus-tools rss-feed news-aggregator rag large-language-models

Python 4.67 k

2 days ago

miso-belica / sumy

#Natural language processing# Module for automatic summarization of text documents and HTML pages.

Python lsa textteaser html-page summarizer pagerank-algorithm reduction text-extraction html-extraction html-extractor summarization summary natural-language-processing

Python 3.62 k

6 days ago

unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure Go)

Go pdf pdf-library pdf-generation pdf-document-processor text-extraction pdf-manipulation signing pdf-sign pdf-generator

Go 2.91 k

12 days ago

Goldziher / kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

OCR text-extraction async document-intelligence mcp pandoc Python rag table-extraction tesseract

Python 2.35 k

5 hours ago

chrismattmann / tika-python

#Natural language processing# Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Python Parsing text-extraction mime buffer memex text-recognition detection recognition natural-language-processing nlp-library COVID-19 extraction

Python 1.62 k

5 months ago

whitelok / image-text-localization-recognition

#Computer science# A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition

text-recognition text-detection convolutional-neural-networks deep-learning OCR text-extraction machine-learning Awesome Lists

955

2 years ago

miso-belica / jusText

Heuristic based boilerplate removal tool

Python text-extraction html-parser html-parsing

Python 794

7 months ago

unidoc / unidoc

This repository has moved! https://github.com/unidoc/unipdf

Go pdf pdf-library pdf-files text-extraction pdf-invoice

Go 709

6 years ago

ICIJ / datashare

A self‑hosted search engine for documents. Help us improve Datashare by answering a survey on structured content: https://forms.gle/PYgusFsoBaMyzUec9

named-entity-recognition text-extraction extract investigative-journalism elasticsearch Docker web-gui

Java 657

4 days ago

ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

text-extraction R rstats pdf-files r-package

C++ 538

6 days ago

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

srt subtitle subtitles text-extraction Python mit-license tool command-line-interface command-line-tool Library

Python 522

1 year ago

iamarunbrahma / vision-parse

Parse PDFs into markdown using Vision LLMs

document-parser pdf-parser pdf-to-markdown text-extraction

Python 428

8 days ago

flairNLP / fundus

#Web crawler# A very simple news crawler with a funny name

corpus crawler natural-language-processing Python RSS scraper sitemap text-extraction web-scraping corpus-tools dataset image-classification

Python 401

5 hours ago

shixzie / nlp

#Natural language processing# [UNMAINTAINED] Extract values from strings and fill your structs with nlp.

natural-language-processing Parsing Go text-extraction text

Go 389

8 years ago

pd3f / pd3f

#Computer science# 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

pdf text-extraction pdf-to-text pipeline machine-learning OCR language-model extract-text parsr Python

HTML 327

2 years ago

py-pdf / benchmarks

Benchmarking PDF libraries

benchmark data-extraction mupdf pdf pypdf2 text-extraction

Python 310

2 months ago

Goldziher / html-to-markdown

HTML to markdown converter

html-converter markdown-converter rag text-extraction text-processing

Python 232

2 days ago

bookieio / breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

Python text-mining text-extraction html-extraction html-extractor html-parsing

HTML 205

1 year ago

weareprestatech / hotpdf

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

pdf Python text-extraction text-search

Python 196

9 months ago

SapienzaNLP / extend

#Natural language processing# Entity Disambiguation as text extraction (ACL 2022)

natural-language-processing Entity resolution text-extraction PyTorch acl

Python 182

3 years ago