Statistical Computing: A tutorial for UCSC Graduate Students
16 Feb 2025
Hi, grads! Whether you're running Bayesian models, tuning machine learning algorithms, or analyzing massive datasets, this guide aims to help you take full advantage of UCSC's Hummingbird (HB) clusters.
What Are the Hummingbird Clusters?
The HB clusters are UCSC’s shared high-performance computing (HPC) systems, built to handle tasks that would overwhelm personal computers. Designed for researchers across disciplines, they offer:
- Parallel computing: Run thousands of simulations or MCMC chains simultaneously
- High-speed storage: Process TB-scale datasets (e.g., genomics, climate models)
- Specialized hardware: Access GPU nodes for deep learning
- Scalability: Transition seamlessly from small interactive tests to large batch jobs
Who Should Use HB?
These clusters are ideal for statistical work requiring:
- Computational intensity: Hierarchical models, bootstrapping, or optimization
- Reproducibility: Version-controlled environments for collaborative projects
- Speed: Accelerate workflows that take days on a laptop
What’s in This Guide?
You’ll learn to:
- Test code interactively (e.g., debug an RStan model)
- Submit batch jobs (e.g., parallelize 100 MCMC chains)
- Optimize resources (avoid memory crashes, leverage GPUs)
- Manage workflows (from data storage to result analysis)
Introduction to Cluster Computing for Stats Grads
Brief Glossary:
- Node: Dedicated server (128 cores/256GB RAM typical) - Think of it as a powerful workstation
- Core: Individual processing unit (Like a CPU thread) - Your basic computation unit
- GPU Node: Specialized nodes with 4 NVIDIA A100 GPUs (80GB VRAM each) for deep learning
- Scratch Space: 1PB high-speed temporary storage (Auto-cleaned every 14 days) - Perfect for intermediate results
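Before requesting resources, it helps to see what the cluster actually offers. The commands below are standard SLURM queries; the exact partitions and node specs they print depend on the current HB configuration, so treat the output as illustrative rather than a promise of specific hardware.
# List partitions, node counts, cores, memory, GPUs (gres), and time limits
sinfo -o "%P %D %c %m %G %l"
# Show detailed specs for one node (replace <nodename> with a name from sinfo)
scontrol show node <nodename>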
I. Interactive Development Sessions
When to Use Interactive:
- Debugging code
- Exploratory analysis
- Small simulations
- Model prototyping
- Visualization
Example 1: Debugging Bayesian Models in R
# Request interactive resources: 4 cores, 8GB RAM for 2 hours
srun --pty --mem=8G --cpus-per-task=4 --time=02:00:00 bash
# Load R environment with Bayesian stack
module load R/4.3.0
# Start R session with debugging capabilities
R
> library(rstan) # Load STAN interface
> debug(fit_model) # Set breakpoint in function
> source("hierarchical_bayes.R") # Run script until breakpoint
> where # Show call stack when breakpoint hits
Example 2: Interactive ML Development with Jupyter
# Request heavier resources for data exploration: 8 cores, 16GB RAM
srun --pty --mem=16G --cpus-per-task=8 --time=04:00:00 bash
# Load Python environment
module load python/3.11
# Start Jupyter Lab on cluster (no local browser)
python -m jupyter lab --no-browser --port=8889
# On your local machine, create SSH tunnel:
ssh -L 8889:localhost:8889 cruzid@hb.ucsc.edu
# Now access via http://localhost:8889 in local browser
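One caveat: the srun session above usually lands on a compute node, so Jupyter is listening there rather than on the login host. If the simple tunnel does not connect, a two-hop variant through the login node is the usual fix; <nodename> below is a placeholder for whatever hostname reports inside your interactive session.
# Inside your interactive session, find the compute node's name
hostname
# Option A (from your local machine): forward through the login node to that node.
# Requires starting Jupyter with --ip=0.0.0.0 so it is reachable beyond localhost.
ssh -L 8889:<nodename>:8889 cruzid@hb.ucsc.edu
# Option B: jump through the login node, if direct SSH to compute nodes is permitted
ssh -L 8889:localhost:8889 -J cruzid@hb.ucsc.edu cruzid@<nodename>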
II. Batch Processing for Production Workloads
When to Use Batch Processing:
- Long-running computations (>4 hours)
- Parameter sweeps/optimization runs
- Production model training/inference
- Final analyses requiring full resources
- Reproducible pipeline executions
Example 1: Large-Scale Bayesian Inference
#!/bin/bash
#SBATCH --job-name=stan_meta # Job identifier
#SBATCH --output=mcmc_%A_%a.log # Log file template (JobID_ArrayID)
#SBATCH --array=1-100 # Parallelize 100 independent chains
#SBATCH --cpus-per-task=4 # 4 cores per chain (for within-chain parallel)
#SBATCH --mem=16G # 16GB RAM per chain
#SBATCH --time=24:00:00 # 24hr max runtime
# Load environment
module load R/4.3.0
# Run STAN model with chain-specific data
Rscript run_stan.R --model hierarchical \
--data ${SLURM_ARRAY_TASK_ID} \
--iter 5000
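Once the script is saved (say as run_stan.sbatch - the filename is arbitrary), submitting and monitoring it takes one standard SLURM command each:
# Submit the array job
sbatch run_stan.sbatch
# Watch your jobs in the queue
squeue -u $USER
# Follow the log of one array element (substitute the real JobID and task index)
tail -f mcmc_JOBID_1.log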
Example 2: Distributed ML Training
#!/bin/bash
#SBATCH --job-name=xgb_ensemble # Job name
#SBATCH --nodes=2 # Use 2 physical servers
#SBATCH --ntasks-per-node=16 # 32 total tasks (16 per node)
#SBATCH --mem=128G # 128GB RAM per node (--mem is per node, so 256GB across 2 nodes)
#SBATCH --time=48:00:00 # 2-day max runtime
#SBATCH --gres=gpu:2 # Request 2 GPUs per node
# Load ML environment
module load python/3.11
# Train XGBoost ensemble with cross-validation:
# 1000 trees, tree-depth search over 3-10, GPU acceleration enabled
# (bash does not allow comments after a backslash line continuation, so they live up here)
python train_ensemble.py --n-estimators 1000 \
                         --depth-range 3-10 \
                         --gpu
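One detail worth flagging: with --nodes=2, a bare python invocation only runs on the first node; the second node is used only if the training script itself launches distributed workers (for example via mpi4py or Dask). If the hypothetical train_ensemble.py is MPI-aware, launching it with srun starts one process per allocated task across both nodes:
# Launch one Python process per allocated task, spread across both nodes
srun python train_ensemble.py --n-estimators 1000 --depth-range 3-10 --gpu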
Reproducible Environment Setup
Statistical Computing Environments
# R: Create project-specific environment with renv
module load R/4.3.0
R -e "renv::init()" # Initialize project
R -e "renv::install('brms')" # Install Bayesian regression models
# Python: Lock dependencies with conda-lock
module load miniconda3
conda create -n stats_proj python=3.11 # New environment
conda install -n stats_proj numpy pandas scikit-learn # Core stack
conda-lock lock --file environment.yml --platform linux-64 # Create reproducible lockfile
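One gap in the conda flow above: conda-lock needs an environment.yml to lock, which the conda create / conda install lines don't produce by themselves. Exporting the environment first closes that loop, and restoring from the lockfiles is then a one-liner on any node. These are standard renv and conda-lock commands; adjust the environment name to your project.
# Export the environment spec that conda-lock expects
conda env export -n stats_proj --from-history > environment.yml
# Later, on another node or machine: restore R packages recorded in renv.lock
R -e "renv::restore()"
# ...and recreate the Python environment from the lockfile
conda-lock install -n stats_proj conda-lock.yml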
Big Data Best Practices
- Chunked Processing: Read large CSVs in pieces, e.g. pandas.read_csv(..., chunksize=10**6) or dask.dataframe.read_csv(..., blocksize='64MB'), for memory-efficient ETL
- Memory Mapping: np.load('large_array.npy', mmap_mode='r') for out-of-core access to 100GB+ arrays (use numpy.memmap for raw binary files)
- Columnar Storage: pd.read_parquet('data.parquet') for fast I/O of structured data
Pro Tip: Always test workflows interactively before submitting batch jobs!
- Validate data loading in small sessions
- Profile memory usage (e.g., RStudio's profiling tools for R, or python -m memory_profiler for Python)
- Test single array job element before full submission
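For that last tip, here is a concrete pattern, reusing the hypothetical run_stan.sbatch array script from Section II (command-line --array overrides the directive inside the script):
# Run only array element 1 as a smoke test
sbatch --array=1 run_stan.sbatch
# Inspect the log and resource usage once it finishes
seff JOBID
# If it looks healthy, submit the full sweep
sbatch --array=1-100 run_stan.sbatch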
Performance Optimization Guide
Understanding Cluster Resources
- All compute happens on remote servers - your laptop just submits jobs
- Storage paths like /hb/home are network-mounted - accessible from all nodes
- Always test scripts with small resources first!
Memory Management Essentials
# For R: Profile memory usage with Valgrind
# This creates detailed memory usage reports
module load R/4.3.0
R -d "valgrind --tool=massif" -f bayesian_analysis.R
# After running, analyze with:
ms_print massif.out.* > memory_report.txt
# For Python: Track memory allocation
# First install the memray profiler in your environment
pip install memray
# Run profiling and generate report
python -m memray run -o profile.bin ml_pipeline.py
python -m memray stats --json profile.bin > memory_stats.json
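If you prefer a visual summary, memray can also render the same capture as an interactive HTML flamegraph (the output filename here is just an example):
# Generate an HTML flamegraph from the capture
python -m memray flamegraph -o memray_report.html profile.bin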
Why Memory Matters:
- Jobs exceeding requested memory get automatically killed
- Use 10-20% less than node maximums for safety
- Check memory usage with seff JOBID once a job finishes (for a still-running job, see the sstat example below)
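seff reads accounting data, so it is most informative after a job has completed. While a job is still running, sstat reports live usage, and sacct summarizes finished jobs; both are standard SLURM accounting tools, and the format fields shown are just a reasonable starting set.
# Live memory and CPU usage of a running job (.batch targets the batch step)
sstat --format=JobID,MaxRSS,AveCPU -j JOBID.batch
# Summary of a finished job, including peak memory and elapsed time
sacct --format=JobID,State,Elapsed,MaxRSS,ReqMem -j JOBID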
Parallel Computing Patterns Explained
Method | SLURM Directives | When to Use | Example Command |
---|---|---|---|
Embarrassing Parallel | --array=1-100 | Independent tasks (Bootstrap/permutation tests) | sbatch --array=1-100 job.sh |
MPI (Message Passing) | --nodes=4 --ntasks-per-node=16 | Inter-process communication (Gibbs sampling) | mpirun -np 64 ./model |
Multithreading | --cpus-per-task=32 | Shared-memory tasks (XGBoost/CV tuning) | export OMP_NUM_THREADS=32 |
Key Concept: Always match parallel method to your algorithm:
- Embarrassing Parallel: No data sharing between tasks
- MPI: Needs data exchange between processes
- Multithreading: Single process with multiple threads
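A detail that trips people up with the multithreading pattern: the thread count your code uses should match the cores you actually requested, otherwise you either oversubscribe the node or waste the allocation. Inside a job script, SLURM exposes the request as an environment variable, so a single export keeps the two in sync (this works for OpenMP-backed libraries such as XGBoost or data.table):
# Match OpenMP thread count to the cores granted by --cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK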
Getting Help with HPC
Cluster Support Channels
- Email Support: hummingbird@ucsc.edu
- Slack Channel
- Documentation at https://hummingbird.ucsc.edu/docs
# Find your job IDs
squeue -u $USER
# Check job efficiency (seff uses accounting data, so it is most accurate once the job has finished;
# for a running job, use sstat as shown in the memory section above)
seff JOBID
# Look for:
# - CPU Utilization: Should be >90% for good efficiency
# - Memory Usage: Should be < requested amount
Avoid These Common Mistakes
- Memory Overallocation: Bad: --mem=256G (max is 256GB/node). Good: --mem=230G (leave a ~10% margin)
- Ignoring Error Logs: Always check slurm-JOBID.err after failures
- Local Installs: Never use pip install --user - it can break cluster environments