pdf-parser · GitHub Topics

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

extract-data layout-analysis OCR Parser pdf pdf-converter Python document-analysis pdf-parser pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag ai4science

Python 43.79 k

4 days ago

py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

pypdf2 pdf Python pdf-parser pdf-parsing pdf-manipulation pdf-documents help-wanted

Python 9.4 k

2 days ago

bytedance / Dolphin

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

document-analysis layout-analysis OCR Parser pdf pdf-converter pdf-parser Python vlm-ocr

Python 5.8 k

16 days ago

dromara / yft-design

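yft-design is a powerful, visually stunning online design tool built with Vue3, fabric.js, and Element Plus. An open-source version of 稿定设计 (Gaoding Design) based on fabric.js. An attractive, powerful online design tool with poster design and image editing features. Suited to many scenarios, such as poster generation, e-commerce product images, long-form article images, video/official-account cover edit...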

element-plus fabricjs canvas-editor clipper pdf-parser online-editor pdf-editor

TypeScript 1.42 k

1 month ago

yobix-ai / extractous

#Natural Language Processing# Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

extraction pdf tika unstructured unstructured-data data-pipelines docx etl etl-pipelines large-language-models machine-learning natural-language-processing OCR pdf-parser rag Rust

Rust 1.24 k

9 months ago

adithya-s-k / marker-api

Easily deployable 🚀 API to convert PDF to markdown quickly with high accuracy.

FastAPI pdf-converter pdf-files pdf-parser pdf-parsing API REST API

Python 894

1 year ago

drmingler / docling-api

Easily deployable and scalable backend server that efficiently converts various document formats (pdf, docx, pptx, html, images, etc) into Markdown. With support for both CPU and GPU processing, it is...

API FastAPI markdown-parser pdf-conversion pdf-converter pdf-parser pdf-parsing pdf-to-markdown

Python 688

6 months ago

ispras / dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic...

doc docx odt documents excel pdf txt OCR scanned-documents table-recognition HTML html-parser pdf-parser document-analysis

Python 595

2 days ago

NanoNets / docstrange

#Large Language Models# Extract and convert data from any document, image, PDF, Word document, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

large-language-models Markdown OCR pdf-to-markdown structured-data artificial-intelligence document-parser document-parsing pdf-parser pdf-to-json tables

Python 546

3 days ago

titipata / scipdf_parser

Python PDF parser for scientific publications: content and figures

pdf Parser pdf-parser

Python 431

1 year ago

iamarunbrahma / vision-parse

Parse PDFs into markdown using Vision LLMs

document-parser pdf-parser pdf-to-markdown text-extraction

Python 427

8 days ago

michelcrypt4d4mus / pdfalyzer

Analyze PDFs. With colors. And Yara.

malware-analysis pdf pdf-documents pdf-parser

YARA 316

6 days ago

lazyFrogLOL / llmdocparser

#Natural Language Processing# A package for parsing PDFs and analyzing their content using LLMs.

large-language-models natural-language-processing OCR rag chunking document-analysis pdf-parser

Python 270

1 year ago

sylphxltd / pdf-reader-mcp

An MCP server built with Node.js/TypeScript that allows AI agents to securely read PDF files (local or URL) and extract text, metadata, or page counts. Uses pdf-parse.

ai-agent mcp Node.js pdf pdf-parser stdio TypeScript

TypeScript 241

6 days ago