lakeFS - Data version control for your data lake | Git for data
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every d...
A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.
An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is license...
Personal Data Engineering Projects
Data API Framework for AI Agents and Data Apps
Generic Data Ingestion & Dispersal Library for Hadoop
Enterprise-grade, production-hardened, serverless data lake on AWS
Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required
#Large Language Models# 🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥
Use SQL to build ELT pipelines on a data lakehouse.
GigAPI is an infinite timeseries lakehouse for real-time data and sub-second queries, powered by DuckDB OLAP + Parquet Query Engine, Compactor w/ Cloud-Native Storage. Drop-in FDAP alternative ⭐
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Resources for video demonstrations and blog posts related to DataOps on AWS