document-processing

#Large Language Models# A system for agentic LLM-powered data processing and ETL

data etl large-language-models Python data-pipelines elt workflow agents semantic-data document-processing unstructured-data unstructured-data-analysis document-analysis

Python 2.83 k

1 day ago

enoch3712 / ExtractThinker

#Natural Language Processing# ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

artificial-intelligence large-language-models natural-language-processing OCR openai Python document-image-analysis document-intelligence document-parsing document-processing langchain machine-learning pdf pdf-to-text

Python 1.4 k

19 days ago

dhlab-epfl / dhSegment

Generic framework for historical document processing

Tensorflow segmentation historical-data Python document-processing

Python 379

4 years ago

ucbepic / TWIX

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents

document-data-extraction document-processing

Python 204

4 months ago

awslabs / project-lakechain

#Natural Language Processing# ⚡ Cloud-native, AI-powered document processing pipelines on AWS.

Amazon Web Services computer-vision document-processing generative-ai machine-learning natural-language-processing retrieval-augmented-generation Serverless Hacktoberfest aws-cdk

TypeScript 184

6 months ago

formkiq / formkiq-core

A full-featured Document Management Platform / Document Layer for your application, providing storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. Pleas...

amazon-web-services Amazon Web Services cloud-storage dms document-database document-management document-management-system document-processing headless Serverless OCR optical-character-recognition

Java 141

2 days ago

Tele-AI / doc-ops-mcp

MCP server for seamless document format conversion and processing

document-conversion document-processing docx-to-pdf file-converter markdown-converter pdf-conversion watermark pdf-processing

TypeScript 108

21 hours ago

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced info...

document-conversion document-processing information-retrieval pdf-parsing pdf-to-markdown Python rag retrieval-augmented-generation text-extraction pdf-converter

Python 94

10 months ago

awslabs / rhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

amazon-bedrock document-processing generative-ai multi-modal

Python 94

14 days ago

parsee-ai / parsee-core

#Large Language Models# Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized on PDFs and HTML files. Extensive support of tabular data extraction and multimodal queries.

document-processing large-language-models structured-data multimodal

Python 73

18 days ago

steindani / pandoc-include

An include filter for Pandoc

pandoc pandoc-filter Markdown document-processing

Haskell 62

5 years ago

PSPDFKit / nutrient-document-engine-mcp-server

A Model Context Protocol (MCP) server implementation that exposes document processing capabilities through natural language, supporting both direct human interaction and AI agent tool calling.

agentic-ai document-processing mcp-server

TypeScript 56

2 months ago

jmanhype / DSPy-Multi-Document-Agents

#Natural Language Processing# An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

artificial-intelligence distributed-systems document-processing knowledge-management natural-language-processing query-optimization vector-search

Python 45

1 year ago

aws-solutions / enhanced-document-understanding-on-aws

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates...

document-analysis document-processing

JavaScript 40

4 days ago

cburschka / lyx

Unofficial mirror of git://git.lyx.org/lyx.git (updates daily. not affiliated with lyx.org.)

mirror document-processing LaTeX

C++ 39

2 years ago

abdullahshafiq-20 / ResumeTex

ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaT...

automation developer-tools document-processing Express LaTeX Node.js Open Source pdf-parsing React resume Tailwind CSS TeX

JavaScript 37

13 days ago

kili-technology / awesome-datasets

#Natural Language Processing# A comprehensive list of annotated training datasets classified by use case.

awesome-public-datasets datasets Open Data dataset data open-datasets annotation natural-language-processing entity-extraction ner entity-recognition document-processing OCR

3 years ago

afrozas / proceedings

Semantic extraction from conference proceedings.

conferences semantic spaCy document-processing

Python 31

5 years ago

autollama / autollama

#Large Language Models# Anthropic's Contextual Retrieval implementation with visual chunk comparison. Preview context enrichment before/after embedding.

artificial-intelligence automation chatbot Docker document-processing embeddings knowledge-base large-language-models Node.js openai pdf-processing rag React semantic-search vector-database

HTML 24

14 days ago

MBAigner / PDFSegmenter

This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified and returned. Tables are retrieved formatted as a CSV.

pdf document-processing Python layout-analysis annotations CSV table

Python 23

5 years ago
