document-analysis · GitHub Topics

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

extract-data layout-analysis OCR Parser pdf pdf-converter Python document-analysis pdf-parser pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag ai4science

Python 43.79 k

4 days ago

bytedance / Dolphin

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

document-analysis layout-analysis OCR Parser pdf pdf-converter pdf-parser Python vlm-ocr

Python 5.8 k

16 days ago

ucbepic / docetl

#Large Language Models# A system for agentic LLM-powered data processing and ETL

data etl large-language-models Python data-pipelines elt workflow agents semantic-data document-processing unstructured-data unstructured-data-analysis document-analysis

Python 2.83 k

2 days ago

UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

pdfbox pdf pdf-document C# netstandard pdf-extractor pdf-document-processor pdf-files alto-xml hocr layout-analysis document-analysis page-xml pdf-generation

C# 2.19 k

7 hours ago

AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

C++ 1.77 k

5 months ago

NanoNets / docext

#Natural Language Processing# An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

Python 1.72 k

20 days ago

tstanislawek / awesome-document-understanding

#Natural Language Processing# A curated list of resources on the topic of Document Understanding (DU)

Awesome Lists machine-learning information-extraction key-information-extraction document-understanding robotic-process-automation document-analysis document-layout-analysis OCR natural-language-processing deep-learning pdf rpa pdf-documents document-intelligence unstructured-data document-ai

1.46 k

2 years ago

DocumindHQ / documind

Open-source platform for extracting structured data from documents using AI.

artificial-intelligence large-language-models Open Source pdf-extractor developer-tools OCR document-analysis extract-data Parser pdf pdf-converter pdf-extractor-llm

JavaScript 1.41 k

4 months ago

Yuliang-Liu / Curve-Text-Detector

#Computer Science# This repository provides training and test code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.

deep-learning document-analysis object-detection scene-text

Jupyter Notebook 648

5 years ago

ispras / dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic...

doc docx odt documents excel pdf txt OCR scanned-documents table-recognition HTML html-parser pdf-parser document-analysis

Python 595

2 days ago

wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

key-information-extraction document-analysis graph-neural-networks graph-learning document-understanding

Python 568

1 year ago

CybercentreCanada / assemblyline

AssemblyLine 4: File triage and malware analysis

malware-analysis malware-research malware-detection Cybersecurity incident-response Malware automation-framework cert cyber-security document-analysis framework Python security-automation security

Python 362

2 days ago

jpWang / LiLT

#Natural Language Processing# Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

natural-language-processing document-ai document-analysis document-understanding information-extraction multimodal-pre-trained-model

Python 354

3 years ago

lazyFrogLOL / llmdocparser

#Natural Language Processing# A package for parsing PDFs and analyzing their content using LLMs.

large-language-models natural-language-processing OCR rag chunking document-analysis pdf-parser

Python 270

1 year ago

pandora-analysis / pandora

Pandora is an analysis framework that determines whether a file is suspicious and presents the results conveniently

Cybersecurity document-analysis malware-detection

Python 269

10 days ago

masyagin1998 / robin

#Computer Science# RObust document image BINarization

Python OpenCV Keras neural-networks deep-learning OCR computer-vision document-analysis

Python 182

1 year ago

ppaanngggg / yolo-doclaynet

YOLO models trained on DocLayNet - power your Document Intelligence with Layout Analysis

document-analysis layout-analysis ultralytics yolo yolov8

Python 133

1 month ago

chriswolfvision / local_adaptive_binarization

Local adaptive image binarization

computer-vision document-analysis

C++ 126

3 years ago

mirabdullahyaser / Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit

#Natural Language Processing# A web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it i...

artificial-intelligence chat-application document-analysis generative-ai langchain large-language-models natural-language-processing openai-chatgpt question-answering retrieval-augmented-generation Streamlit gpt-3

Python 125

1 year ago

anisha2102 / docvqa

#Computer Science# Document Visual Question Answering

visual-question-answering computer-vision deep-learning document-analysis

Python 124

5 years ago