extract-data · GitHub Topics

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

extract-data layout-analysis OCR Parser pdf pdf-converter Python document-analysis pdf-parser pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag ai4science

Python 41.04 k

3 hours ago

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

mupdf xps pdf-documents epub OCR pdf fonts Python data-science extract-data table-extraction tesseract text-processing text-shaping

Python 7.68 k

3 days ago

bda-research / node-crawler

#Web crawler# Web Crawler/Spider for NodeJS + server-side jQuery ;-)

crawler JavaScript spider extract-data cheerio jQuery Node.js

TypeScript 6.77 k

2 months ago

meltano / meltano

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

DataOps elt Open Source data pipelines extract-data connectors integration tap loaders data-pipelines data-engineering

Python 2.15 k

6 days ago

DocumindHQ / documind

Open-source platform for extracting structured data from documents using AI.

artificial-intelligence large-language-models Open Source pdf-extractor developer-tools OCR document-analysis extract-data Parser pdf pdf-converter pdf-extractor-llm

JavaScript 1.36 k

3 months ago

markummitchell / engauge-digitizer

Extracts data points from images of graphs

image-analysis extract-data Utility Software

C++ 1.27 k

4 years ago

elixir-crawly / crawly

#Web crawler# Crawly, a high-level web crawling & scraping framework for Elixir.

Elixir Erlang scraper scraping scraping-websites extract-data spider crawler crawling

Elixir 1.04 k

15 days ago

slotix / dataflowkit

#Web crawler# Extract structured data from websites. Website scraping.

Go golang-library extract-data scraping-websites crawling scraper scraping cdp headless

Go 688

2 years ago

OmkarPathak / ResumeParser

A simple resume parser used for extracting information from resumes

resume-parser extract-data GUI Python Parser

Python 304

1 year ago

danschultzer / receipt-scanner

Receipt scanner extracts information from your PDF or image receipts - built in NodeJS

OCR optical-character-recognition extract-data extract-information

JavaScript 299

7 years ago

Qusic / TraceUtility

Extract data from .trace documents generated by Instruments

instruments extract-data reverse-engineering profiling Xcode

Objective-C 225

5 years ago

m92vyas / llm-reader

#Web crawler# Turn a webpage into LLM-friendly input text. Similar to Firecrawl and Jina Reader API. Makes RAG, AI web scraping, and image & webpage link extraction easy.

extract-data large-language-models llm-agent scraper scraping scraping-websites webscraping ai-agent-tools ai-agents firecrawl rag

Python 205

23 days ago