etl-pipeline · GitHub Topics

risingwavelabs / risingwave

The next-generation cloud-native streaming database

database stream-processing Rust PostgreSQL kafka materialized-view data-engineering apache-iceberg elt-pipeline etl-pipeline

Rust 8.34 k

18 hours ago

Zipstack / unstract

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

etl-pipeline llm-platform unstructured-data

Python 5.77 k

3 days ago

apache / streampark

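StreamX was created to make stream processing easier: a one-stop big data platform that unifies stream and batch processing and integrates the data lake and data warehouse.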

streaming streampark apache development-framework easy-to-use etl-pipeline operation-platform

Java 4.2 k

5 days ago

orchest / orchest

#Editor# Build data pipelines, the easy way 🛠️

data-science machine-learning pipelines ide Jupyter Notebook cloud self-hosted jupyterlab notebooks Docker Python data-pipelines deployment Kubernetes airflow dag etl etl-pipeline

TypeScript 4.14 k

2 years ago

apache / hamilton

#Computer Science# Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.

data-science Python dag data-engineering dataframe etl etl-framework etl-pipeline feature-engineering machine-learning pandas software-engineering data-analysis lineage llmops mlops orchestration Hacktoberfest rag

Jupyter Notebook 2.26 k

1 day ago

AlexIoannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

pyspark etl-job Python data-engineering Apache Spark data-science etl etl-pipeline

Python 1.99 k

3 years ago

san089 / Udacity-Data-Engineering-Projects

A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing, and Data Lake development.

data data-engineering data-engineering-pipeline etl-pipeline cassandra-database postgresql-database data-modeling data-warehouse data-lake airflow cluster Apache Cassandra infrastructure PostgreSQL Amazon Web Services aws-ec2 aws-sdk aws-s3 cloudformation

Python 1.73 k

3 years ago

san089 / goodreads_etl_pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

etl-pipeline etl-framework Apache Spark apache-airflow airflow redshift emr-cluster livy s3 data-lake scheduler data-migration data-engineering data-engineering-pipeline Python etl-job

Python 1.42 k

6 years ago

Open-Source-Legal / OpenContracts

#Large Language Models# Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!

agent agentic-ai etl etl-pipeline large-language-models unstructured-data vector-database prompt-engineering

TypeScript 924

3 days ago

stitchfix / hamilton

#Computer Science# A scalable general-purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton

Python pandas dag data-science data-engineering NumPy software-engineering etl-framework etl-pipeline etl feature-engineering dataframe data-platform machine-learning

Python 860

2 years ago

techascent / tech.ml.dataset

#Computer Science# A high-performance data processing system for Clojure

Clojure dataframe CSV xlsx datascience machine-learning dataset etl-pipeline Java

Clojure 719

3 days ago

SorellaLabs / brontes

A blazingly fast general-purpose blockchain analytics engine specialized in systematic MEV detection

Ethereum evm mev etl-pipeline Rust

Rust 636

2 months ago

Pravko-Solutions / FlashLearn

#Large Language Models# Integrate LLMs into any pipeline: fit/predict pattern, JSON-driven flows, and built-in concurrency support.

artificial-intelligence ai-agents concurrency large-language-models llm-agent Python agentic-ai-development ai-agents-framework etl-pipeline

Python 606

6 months ago

YotpoLtd / metorikku

A simplified, lightweight ETL Framework based on Apache Spark

big-data Apache Spark Scala etl-framework distributed-computing SQL etl etl-pipeline

Scala 589

2 years ago

trustgraph-ai / trustgraph

The agentic AI platform for enterprise. Built by data engineers for data engineers. Complete context engineering and LLM orchestration infrastructure. Run anywhere - local, cloud, or bare metal.

graphrag context context-engineering model-serving agentic-ai agentic-ai-development agentic-rag ai-native data data-engineering data-extraction etl-pipeline

Python 571

4 days ago

unbody-io / unbody

#Large Language Models# The Supabase of the AI era. A modular, open-source backend for building AI-native software, designed for knowledge, not static data.

agentic-ai ai-native backend chatbot data-ingestion developer-tools etl-pipeline generative-ai knowledge-base large-language-models rag vector-database

TypeScript 350

3 months ago

DataWithBaraa / sql-data-warehouse-project

A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.

data-analysis data-analytics data-cleaning data-engineering data-science data-warehouse data-warehousing datalake datascience datawarehouse etl etl-job etl-pipeline SQL sql-query sql-server

TSQL 318

5 months ago

ebonnal / streamable

concurrent & fluent interface for (async) iterables

data-engineering etl-pipeline etl reverse-etl collections streams fluent-interface immutability lazy-evaluation method-chaining visitor-pattern data Python asyncio concurrent-data-structure multiprocessing multithreading

Python 279

2 days ago

airscholar / e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...

apache-airflow apache-kafka Apache Spark big-data Apache Cassandra containerization data-engineering data-pipeline data-processing Docker etl-pipeline PostgreSQL real-time-analytics

Python 273

7 months ago

jitsucom / bulker

Service for bulk-loading data to databases with automatic schema management (Redshift, Snowflake, BigQuery, ClickHouse, Postgres, MySQL)

data-engineering datawarehouse etl etl-pipeline ingestion pipeline

Go 191

5 days ago