document-parsing · GitHub Topics

PaddlePaddle / PaddleOCR

PaddleOCR aims to build a rich, leading, and practical OCR tool library that helps users train better models and put them into production.

OCR crnn ocrlite database chineseocr pdf2markdown pp-ocr pp-structure document-parsing chatocr document-translation kie

Python 53.75 k

4 days ago

docling-project / docling

Get your documents ready for gen AI

artificial-intelligence convert documents pdf tables document-parser document-parsing docx HTML Markdown pdf-converter pdf-to-json pdf-to-text pptx xlsx

Python 38.56 k

3 days ago

Unstructured-IO / unstructured

#Natural language processing# Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to...

deep-learning document-parsing machine-learning natural-language-processing OCR information-retrieval data-pipelines preprocessing pdf-to-text pdf pdf-to-json document-image-analysis donut document-image-processing document-parser docx langchain large-language-models

HTML 12.65 k

4 days ago

run-llama / llama_cloud_services

Knowledge Agents and Management in the Cloud

document Parsing pdf pdf-document-processor pptx structured-data document-parser document-parsing docx-to-markdown pdf-to-excel pdf-to-json pdf-to-text ppt-to-json tables ppt-to-markdown pdf-to-markdown

TypeScript 4.14 k

2 days ago

enoch3712 / ExtractThinker

#Natural language processing# ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

artificial-intelligence large-language-models natural-language-processing OCR openai Python document-image-analysis document-intelligence document-parsing document-processing langchain machine-learning pdf pdf-to-text

Python 1.4 k

18 days ago

NanoNets / docstrange

#Large language models# Extract and convert data from any document, image, PDF, Word document, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

large-language-models Markdown OCR pdf-to-markdown structured-data artificial-intelligence document-parser document-parsing pdf-parser pdf-to-json tables

Python 546

3 days ago

edenai / edenai-apis

#Natural language processing# Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines

aggregator artificial-intelligence API computer-vision document-parsing image-processing machine-translation natural-language-processing OCR optical-character-recognition pre-trained-model Python speech-recognition speech-to-text text-to-speech video-recognition

Python 456

4 days ago

harishdeivanayagam / rowfill

#Large language models# Open-source unstructured data (PDFs, images, audio files) processing platform built for knowledge workers

document document-parsing langgraph llama large-language-models Next OCR ollama openai vision pdf pdfs unstructured unstructured-data

TypeScript 363

6 months ago

GiftMungmeeprued / document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts...

data-pipeline document-image-processing document-parser document-parsing langchain OCR pdf pdf-to-text preprocessing

149

2 months ago

AdemBoukhris457 / Docs_Parsing_Techniques

Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)

artificial-intelligence genai OCR document-parsing

Jupyter Notebook 63

13 days ago

papercast-dev / papercast

#Natural language processing# A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines...

arxiv Python dag natural-language-processing pdf-converter pdf-document-processor pipeline document-parser document-parsing pdf-to-text podcast tts

Python 52

6 months ago

CycloneBoy / pdf_table

A Unified Toolkit for Deep Learning-Based Table Extraction

artificial-intelligence document-parsing pdf layout-analysis OCR table table-recognition

Python 49

10 months ago

Unstructured-IO / community

#Computer science# Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

community data-pipeline deep-learning document-ai document-parsing machine-learning nlp-parsing ocr-python Open Source

2 years ago

docling-project / docling4j

Docling4j brings the functionalities of Docling in document understanding to Java® projects

artificial-intelligence document-parser document-parsing document-understanding documents Java pdf pdf-converter pdf-to-json

Java 16

5 months ago

aimagelab / mugat

Official implementation of our ECCVW paper "μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context"

document-parsing OCR transformer

Python 11

1 year ago

acenji / ats

Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring...

applicant-tracking-system ats document-parsing generative-ai keyword-extraction natural-language-processing Node.js React resume-analysis sorting-algorithms

JavaScript 8

5 months ago