slurm · GitHub Topics

#LLM# Machine Learning Engineering Open Book

PyTorch slurm large-language-models llm machine-learning scalability transformers machine-learning-engineering mlops ai inference training

Python 15.07 k

2 days ago

SchedMD / slurm

Slurm: A Highly Scalable Workload Manager

slurm slurm-job-scheduler slurm-workload-manager

C 3.28 k

2 days ago

nextflow-io / nextflow

A DSL for data-driven computational pipelines

Bioinformatics workflow-engine pipeline pipeline-framework nextflow cloud Groovy slurm Amazon Web Services Docker singularity hpc reproducible-science reproducible-research dataflow

Groovy 3.14 k

1 day ago

dstackai / dstack

#Computer Science# dstack is an open-source control plane for running development, training, and inference jobs on GPUs across hyperscalers, neoclouds, or on-prem.

machine-learning Python gpu llm cloud orchestration fine-tuning training Kubernetes amd Docker inference Nvidia slurm containers

Python 1.89 k

15 hours ago

facebookincubator / submitit

Python 3.8+ toolbox for submitting jobs to Slurm

slurm Python clusters

Python 1.5 k

4 months ago

DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.

Common Workflow Language Python mesos slurm workflow pipeline Amazon Web Services Kubernetes

Python 915

2 days ago

PySlurm / pyslurm

Python Interface to Slurm

slurm cython Python hpc cluster

Cython 537

2 months ago

rackslab / Slurm-web

Open source web interface for Slurm HPC & AI clusters

dashboard hpc slurm webui ai

Python 480

5 days ago

LambdaLabsML / distributed-training-guide

Best practices & guides on how to write distributed pytorch training code

CUDA deepspeed distributed-training gpu gpu-cluster kuberentes nccl PyTorch slurm cluster mpi sharding

Python 475

7 months ago

pipefunc / pipefunc

Lightweight fast function pipeline (DAG) creation in pure Python for scientific workflows 🕸️🧪

pipeline-framework pipelines reproducible-research dag hpc parallel-computing slurm workflow-engine

Python 398

4 days ago

giovtorres / slurm-docker-cluster

A Slurm cluster using docker-compose

hpc slurm Docker Compose

Dockerfile 393

2 months ago

pytorch / torchx

#Computer Science# TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

PyTorch machine-learning Kubernetes slurm distributed-training pipelines components deep-learning Python ray airflow

Python 388

2 days ago