#Large Language Models# Machine Learning Engineering Open Book
Slurm: A Highly Scalable Workload Manager
A DSL for data-driven computational pipelines
#Computer Science# dstack is an open-source control plane for running development, training, and inference jobs on GPUs, across hyperscalers, neoclouds, or on-prem.
A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
Best practices and guides on how to write distributed PyTorch training code
Lightweight fast function pipeline (DAG) creation in pure Python for scientific workflows 🕸️🧪
#Computer Science# TorchX is a universal job launcher for PyTorch applications. TorchX is designed for fast iteration during training and research, and supports E2E production ML pipelines when you're ready.
A scheduler for GPU/CPU tasks
Create clusters of VMs on the cloud and configure them with Ansible.
#Computer Science# Run Slurm in Kubernetes
Simplify HPC and Batch workloads on Azure
A Cross-Platform, Multi-Cloud High-Performance Computing Platform
An open-source toolkit for deploying and managing high-performance clusters for HPC, AI, and data analytics workloads.
Prometheus exporter for performance metrics from Slurm.