CUDA testing protocol
The hosted GitHub Actions runners do not provide a GPU, so the test
suite in tests.yml runs CPU-only. TorchKM's CUDA code paths are
exercised in two complementary ways:
- Self-hosted CUDA workflow: .github/workflows/cuda-tests.yml. Dispatches manually or on a weekly schedule onto a runner labelled [self-hosted, linux, cuda]. The job stays unscheduled until such a runner is registered, so it costs nothing in the meantime.
- Documented periodic protocol: this page. Run the protocol on a CUDA workstation, then commit the resulting log bundle under benchmarks/cuda-runs/. The protocol is the source of record while no self-hosted runner is online.
Either path produces the same artefact bundle, so reviewers can audit either one without having to know which was used.
When to run
- Before tagging a release. Every vX.Y.Z tag should be accompanied by a fresh CUDA log bundle on the release commit.
- On dependency bumps. PyTorch / CUDA bumps in setup.cfg.
- After non-trivial solver changes. Anything touching torchkm/cvk*.py, torchkm/cvknys*.py, or torchkm/functions.py.
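The path-based triggers above can be checked mechanically. The following is a minimal sketch, not part of the repository: the function name `needs_cuda_run` and the constant `CUDA_SENSITIVE_PATTERNS` are hypothetical, and treating any change to setup.cfg as a dependency bump is a deliberately coarse assumption.

```python
from fnmatch import fnmatch

# Patterns taken from the list above. setup.cfg stands in for "dependency
# bumps"; this is an assumption and will also flag unrelated edits to that file.
CUDA_SENSITIVE_PATTERNS = (
    "setup.cfg",
    "torchkm/cvk*.py",
    "torchkm/cvknys*.py",
    "torchkm/functions.py",
)


def needs_cuda_run(changed_paths):
    """Return the changed paths that match a CUDA-sensitive pattern."""
    return [
        path
        for path in changed_paths
        if any(fnmatch(path, pattern) for pattern in CUDA_SENSITIVE_PATTERNS)
    ]
```

One way to use it is to feed in the output of `git diff --name-only` between the last validated commit and the current one; a non-empty result suggests a fresh CUDA run is due.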
Required environment
| Component | Minimum tested | Notes |
|---|---|---|
| OS | Ubuntu 22.04 LTS | Other Linux distros are fine if CUDA drivers are properly installed. |
| Python | 3.10 | Match one of the versions in tests.yml. |
| NVIDIA driver | 535+ | Driver must support the CUDA build PyTorch was compiled against. |
| GPU | Any with ≥ 8 GB VRAM | The full suite peaks around 4 GB. |
| PyTorch | The version pinned by setup.cfg's install_requires | pip install -e ".[dev,examples,viz]" will resolve a CUDA wheel automatically. |
nvidia-smi must list a working device, and python -c "import torch; print(torch.cuda.is_available())" must print True.
Manual protocol
Run from a fresh clone of the commit you want to verify:
# 1) Set up an isolated environment
python -m venv .venv-cuda
source .venv-cuda/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev,examples,viz]"
# 2) Sanity check
python -c "import torch; assert torch.cuda.is_available(); \
print('cuda:', torch.version.cuda, 'device:', torch.cuda.get_device_name(0))"
# 3) Capture run metadata
# Keep the trailing slash: later steps append file names directly to RUN_DIR.
RUN_DIR="benchmarks/cuda-runs/$(date -u +%Y%m%dT%H%M%SZ)/"
mkdir -p "${RUN_DIR}"
{
  echo "Commit: $(git rev-parse HEAD)"
  echo "Ref: $(git rev-parse --abbrev-ref HEAD)"
  echo "Date (UTC): $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo
  echo "## nvidia-smi"
  nvidia-smi
  echo
  echo "## pip freeze"
  python -m pip freeze
} > "${RUN_DIR}runner-snapshot.txt"
# 4) Full suite with coverage (writes XML + log)
python -m pytest -q \
  --cov=torchkm \
  --cov-report=term-missing:skip-covered \
  --cov-report=xml:"${RUN_DIR}coverage-cuda.xml" \
  --junitxml="${RUN_DIR}junit-cuda.xml" \
  2>&1 | tee "${RUN_DIR}cuda-test-output.log"
# 5) Verbose CUDA-marked tests
python -m pytest -m cuda -v 2>&1 | tee "${RUN_DIR}cuda-marked-tests.log"
The protocol should leave the following files under ${RUN_DIR}:
runner-snapshot.txt # nvidia-smi + pip freeze + commit + ref + date
cuda-test-output.log # stdout from the full pytest run
cuda-marked-tests.log # stdout from `pytest -m cuda -v`
coverage-cuda.xml # cobertura-format coverage report
junit-cuda.xml # JUnit-format test report
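Before committing a bundle, it can help to confirm that all five files were actually written. The sketch below is illustrative only: `missing_artifacts` and `EXPECTED_FILES` are hypothetical names, and treating an empty file as missing is an added assumption rather than a rule stated by this protocol.

```python
from pathlib import Path

# File names copied from the listing above.
EXPECTED_FILES = (
    "runner-snapshot.txt",
    "cuda-test-output.log",
    "cuda-marked-tests.log",
    "coverage-cuda.xml",
    "junit-cuda.xml",
)


def missing_artifacts(run_dir):
    """Return the expected files that are absent or empty in run_dir."""
    run_dir = Path(run_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = run_dir / name
        # An empty file usually means the step that writes it failed early.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty list from `missing_artifacts(run_dir)` means the bundle is complete; any names returned point at the step that needs to be rerun.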
What "passing" means
A CUDA run is considered passing when all of the following hold:
- cuda-test-output.log ends with a pytest summary line that reports N passed and lists no failed or error counts (pytest omits counts that are zero). Skips are allowed, and environment-specific skips are fine, but neither test_torchkmsvc_cuda_smoke nor test_device_cuda should be among the skipped tests.
- cuda-marked-tests.log shows each @pytest.mark.cuda test executing on the GPU (i.e. the skipif-on-cpu marker did not trigger).
- coverage-cuda.xml reports overall coverage ≥ 90 %, the same threshold the CPU CI enforces. CUDA-only branches typically push this slightly higher than the CPU-only run.
If a run fails, file a GitHub issue referencing the run directory and
link the artefact bundle. Do not delete failing runs from
benchmarks/cuda-runs/ — they are part of the historical record.
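The failure, skip, and coverage criteria can also be read from the two XML reports instead of the logs. This is a rough sketch under stated assumptions: `junit_problems` and `coverage_ok` are hypothetical helpers, the JUnit file is assumed to follow pytest's usual `testcase` layout with `failure`, `error`, and `skipped` child elements, and the Cobertura `line-rate` attribute is used as a stand-in for overall coverage (it ignores branch coverage). Flagging a required test that is absent from the report is an extra, conservative check not required by this page.

```python
import xml.etree.ElementTree as ET

# Test names taken from the criteria above.
REQUIRED_TESTS = ("test_torchkmsvc_cuda_smoke", "test_device_cuda")
MIN_LINE_RATE = 0.90


def junit_problems(junit_xml):
    """Return a list of problems found in a JUnit report (path or file object)."""
    root = ET.parse(junit_xml).getroot()
    problems = []
    seen = set()
    for case in root.iter("testcase"):
        name = case.get("name", "")
        # Parametrised tests are reported as "name[param]"; compare the base name.
        base = name.split("[", 1)[0]
        seen.add(base)
        if case.find("failure") is not None or case.find("error") is not None:
            problems.append(f"failed: {name}")
        elif case.find("skipped") is not None and base in REQUIRED_TESTS:
            problems.append(f"required test skipped: {name}")
    for required in REQUIRED_TESTS:
        if required not in seen:
            problems.append(f"required test missing: {required}")
    return problems


def coverage_ok(coverage_xml):
    """Return True if the Cobertura report's line-rate meets the threshold."""
    root = ET.parse(coverage_xml).getroot()
    return float(root.get("line-rate", 0.0)) >= MIN_LINE_RATE
```

A run would then count as passing when `junit_problems(...)` returns an empty list and `coverage_ok(...)` is true. The GPU-execution criterion for cuda-marked-tests.log still needs a look at the verbose log itself.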
Committing the artefacts
benchmarks/cuda-runs/<UTC-timestamp>/ is checked in to the repo so
that reviewers can audit which commits have been validated on GPU and
when. Each directory should also contain a one-line STATUS file with
the verdict (PASS or FAIL) so the catalogue can be scanned
quickly:
echo "PASS" > "${RUN_DIR}STATUS"
git add benchmarks/cuda-runs/$(basename "${RUN_DIR}")
git commit -m "CUDA run $(basename "${RUN_DIR}"): PASS at $(git rev-parse --short HEAD)"
Logs are typically a few tens of kilobytes each, so the repository size impact is negligible. If the bundle ever grows beyond a couple of MB, move it to git LFS or to a release attachment and replace the in-tree directory with a stub linking to the artefact.
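Because every run directory carries a one-line STATUS file, the catalogue can be summarised with a few lines of code. The sketch below is an illustration, not an existing project tool: `read_catalogue` is a hypothetical name, and the "MISSING" placeholder for directories without a STATUS file is an assumption made for this example.

```python
from pathlib import Path


def read_catalogue(root="benchmarks/cuda-runs"):
    """Map each run directory name to its STATUS verdict, oldest first."""
    catalogue = {}
    # UTC timestamps in the form YYYYmmddTHHMMSSZ sort chronologically by name.
    for run_dir in sorted(Path(root).iterdir()):
        if not run_dir.is_dir():
            continue
        status_file = run_dir / "STATUS"
        if status_file.is_file():
            verdict = status_file.read_text().strip()
        else:
            verdict = "MISSING"  # placeholder, not a verdict defined by the protocol
        catalogue[run_dir.name] = verdict
    return catalogue
```

Printing the returned mapping gives a quick overview of which runs passed, which failed, and which still lack a verdict.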
Running the self-hosted workflow
Once a runner with labels self-hosted, linux, cuda is registered:
- Trigger via the Run workflow button on the Actions tab and pick a ref, or wait for the scheduled Monday 07:00 UTC run.
- The job uploads the same five files described above as a build artefact (cuda-test-artifacts-<run-id>, retained 90 days).
- After the run completes, download the artefact and commit it under benchmarks/cuda-runs/ so the in-repo audit trail remains the canonical record.
The workflow file itself (.github/workflows/cuda-tests.yml) is the
source of truth for which commands are run — keep this document and
the workflow file in sync if either is updated.