#Large Language Models#Machine Learning Engineering Open Book
Slurm: A Highly Scalable Workload Manager
A DSL for data-driven computational pipelines
#Computer Science#dstack is an open-source container orchestrator that simplifies workload orchestration and drives GPU utilization for ML teams. It works with any GPU cloud, on-prem cluster, or accelerated hardware.
A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
Best practices & guides on how to write distributed PyTorch training code
Lightweight fast function pipeline (DAG) creation in pure Python for scientific workflows 🕸️🧪
#Computer Science#TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
A scheduler for GPU/CPU tasks
Create clusters of VMs on the cloud and configure them with Ansible.
Simplify HPC and Batch workloads on Azure
A Cross-Platform, Multi-Cloud High-Performance Computing Platform
An open-source toolkit for deploying and managing high-performance clusters for HPC, AI, and data analytics workloads.
#Computer Science#Run Slurm in Kubernetes
Prometheus exporter for performance metrics from Slurm.