pdf-to-text · GitHub Topics

Get your documents ready for gen AI

AI convert documents pdf tables document-parser document-parsing docx HTML Markdown pdf-converter pdf-to-json pdf-to-text pptx xlsx

Python 38.56 k

3 days ago

Unstructured-IO / unstructured

#Natural Language Processing# Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to...

deep-learning document-parsing machine-learning natural-language-processing OCR information-retrieval data-pipelines preprocessing pdf-to-text pdf pdf-to-json document-image-analysis donut document-image-processing document-parser docx langchain large-language-models

HTML 12.65 k

4 days ago

run-llama / llama_cloud_services

Knowledge Agents and Management in the Cloud

document Parsing pdf pdf-document-processor pptx structured-data document-parser document-parsing docx-to-markdown pdf-to-excel pdf-to-json pdf-to-text ppt-to-json tables ppt-to-markdown pdf-to-markdown

TypeScript 4.14 k

2 days ago

enoch3712 / ExtractThinker

#Natural Language Processing# ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

AI large-language-models natural-language-processing OCR openai Python document-image-analysis document-intelligence document-parsing document-processing langchain machine-learning pdf pdf-to-text

Python 1.4 k

18 days ago

Academic-Hammer / SciTSR

Table structure recognition dataset of the paper: Complicated Table Structure Recognition

table-structure-recognition pdf-to-text

Python 375

5 years ago

pd3f / pd3f

#Computer Science# 🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

pdf text-extraction pdf-to-text pipeline machine-learning OCR language-model extract-text parsr Python

HTML 327

2 years ago

shoryasethia / markdrop

#Large Language Models# A Python package for converting PDFs to markdown while extracting images and tables, and for generating descriptive text for the extracted tables/images using several LLM clients. And many more functio...

Open Source pypi-package image-to-text large-language-models pdf-to-markdown pdf-to-text table-to-text agents

Python 151

2 months ago

GiftMungmeeprued / document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts...

data-pipeline document-image-processing document-parser document-parsing langchain OCR pdf pdf-to-text preprocessing

149

2 months ago

NanoNets / ocr-python

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

OCR tesseract pdf Python pdf-to-json pdf-to-text image-to-text

Jupyter Notebook 113

3 years ago

nainiayoub / pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

pdf-to-text Streamlit streamlit-webapp text-extraction Python OCR ocr-python pdf

Python 88

1 year ago

datalogics / adobe-pdf-library-samples

Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library

OCR pdf pdf-conversion pdf-converter pdf-document pdf-generation pdf-lib pdf-manipulation pdf-merger pdf-parser pdf-to-text pdf-tools pdfa

C# 82

2 years ago

BitMiracle / Docotic.Pdf.Samples

C# and VB.NET samples for Docotic.Pdf library

pdf-library pdf-to-text pdf-signature pdf-generation extract-text net-core pdf-manipulation pdf-parser html-to-pdf

Visual Basic .NET 78

1 month ago

galkahana / pdf-text-extraction

CLI for extracting text from PDF files (and possibly tables)

pdf pdf-to-text

C++ 74

3 months ago

papercast-dev / papercast

#Natural Language Processing# A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, and PDF with GROBID and LangChain, then listen to them as a podcast. Customize your own pipelines...

arxiv Python dag natural-language-processing pdf-converter pdf-document-processor pipeline document-parser document-parsing pdf-to-text podcast tts

Python 52

6 months ago

mbzuai-oryx / KITAB-Bench

[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

arabic benchmark layout-detection OCR pdf-to-text table-detection vlms vqa

Python 49

4 months ago

iditectweb / converter

Standalone .NET converter library that requires neither Adobe Acrobat components nor Microsoft Office Interop assemblies to convert PDF, DOCX, XLSX, HTML, Image, CSV, RTF, and TXT files in .NET Framework

pdf-to-text html-to-pdf

C# 40

7 years ago