Antonio Aguirre · Data Modelling · Bayesian Statistics · Machine Learning

Statistical Computing: A tutorial for UCSC Graduate Students

Hi, grads! Whether you’re running Bayesian models, tuning machine learning algorithms, or analyzing massive datasets, this guide aims to help you take advantage of the full potential of UCSC’s Hummingbird (HB) clusters.

What Are the Hummingbird Clusters?

The HB clusters are UCSC’s shared high-performance computing (HPC) systems, built to handle tasks that would overwhelm personal computers. Designed for researchers across disciplines, they offer:

  • Parallel computing: Run thousands of simulations or MCMC chains simultaneously
  • High-speed storage: Process TB-scale datasets (e.g., genomics, climate models)
  • Specialized hardware: Access GPU nodes for deep learning
  • Scalability: Transition seamlessly from small interactive tests to large batch jobs

Who Should Use HB?

These clusters are ideal for statistical work requiring:

  • Computational intensity: Hierarchical models, bootstrapping, or optimization
  • Reproducibility: Version-controlled environments for collaborative projects
  • Speed: Accelerate workflows that take days on a laptop

What’s in This Guide?

You’ll learn to:

  1. Test code interactively (e.g., debug an RStan model)
  2. Submit batch jobs (e.g., parallelize 100 MCMC chains)
  3. Optimize resources (avoid memory crashes, leverage GPUs)
  4. Manage workflows (from data storage to result analysis)

Introduction to Cluster Computing for Stats Grads

Brief Glossary:

  • Node: Dedicated server (128 cores/256GB RAM typical) - Think of it as a powerful workstation
  • Core: Individual processing unit (Like a CPU thread) - Your basic computation unit
  • GPU Node: Specialized nodes with 4 NVIDIA A100 GPUs (80GB VRAM each) for deep learning
  • Scratch Space: 1PB high-speed temporary storage (Auto-cleaned every 14 days) - Perfect for intermediate results

I. Interactive Development Sessions

When to Use Interactive:
  • Debugging code
  • Exploratory analysis
  • Small simulations
  • Model prototyping
  • Visualization

Example 1: Debugging Bayesian Models in R

# Request interactive resources: 4 cores, 8GB RAM for 2 hours
srun --pty --mem=8G --cpus-per-task=4 --time=02:00:00 bash

# Load R environment with Bayesian stack
module load R/4.3.0

# Start R session with debugging capabilities
R
> library(rstan)          # Load STAN interface
> debug(fit_model)        # Set breakpoint in function
> source("hierarchical_bayes.R")  # Run script until breakpoint
> where                  # Show call stack when breakpoint hits

Example 2: Interactive ML Development with Jupyter

# Request heavier resources for data exploration: 8 cores, 16GB RAM
srun --pty --mem=16G --cpus-per-task=8 --time=04:00:00 bash

# Load Python environment
module load python/3.11

# Start Jupyter Lab on cluster (no local browser)
python -m jupyter lab --no-browser --port=8889

# On your local machine, create an SSH tunnel through the login node.
# Jupyter runs on a compute node, so forward to that node's hostname
# (run `hostname` inside your srun session to find it):
ssh -L 8889:<compute-node>:8889 cruzid@hb.ucsc.edu
# Now open http://localhost:8889 in your local browser

II. Batch Processing for Production Workloads

When to Use Batch Processing:
  • Long-running computations (>4 hours)
  • Parameter sweeps/optimization runs
  • Production model training/inference
  • Final analyses requiring full resources
  • Reproducible pipeline executions

Example 1: Large-Scale Bayesian Inference

#!/bin/bash
#SBATCH --job-name=stan_meta          # Job identifier
#SBATCH --output=mcmc_%A_%a.log       # Log file template (JobID_ArrayID)
#SBATCH --array=1-100                 # Parallelize 100 independent chains
#SBATCH --cpus-per-task=4             # 4 cores per chain (for within-chain parallel)
#SBATCH --mem=16G                     # 16GB RAM per chain
#SBATCH --time=24:00:00               # 24hr max runtime

# Load environment
module load R/4.3.0

# Run STAN model with chain-specific data
Rscript run_stan.R --model hierarchical \
                   --data ${SLURM_ARRAY_TASK_ID} \
                   --iter 5000

Example 2: Distributed ML Training

#!/bin/bash
#SBATCH --job-name=xgb_ensemble       # Job name
#SBATCH --nodes=2                     # Use 2 physical servers
#SBATCH --ntasks-per-node=16          # 32 total tasks (16 per node)
#SBATCH --mem=64G                     # 64GB RAM per node (--mem is per node, so 128GB total)
#SBATCH --time=48:00:00               # 2-day max runtime
#SBATCH --gres=gpu:2                  # Request 2 GPUs per node

# Load ML environment
module load python/3.11

# Train XGBoost ensemble with cross-validation:
# 1000 trees, search tree depth 3-10, GPU acceleration enabled
python train_ensemble.py --n-estimators 1000 \
                         --depth-range 3-10 \
                         --gpu
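The script above calls train_ensemble.py, which is not shown in this guide. Here is a minimal, hypothetical sketch of what such a script could look like, assuming xgboost, scikit-learn, and pandas are installed in your project environment and that a file train.parquet with a "target" column exists (all of these names are placeholders):

#!/usr/bin/env python
# Hypothetical sketch of train_ensemble.py: XGBoost with a cross-validated depth search
import argparse

import pandas as pd
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

parser = argparse.ArgumentParser()
parser.add_argument("--n-estimators", type=int, default=1000)
parser.add_argument("--depth-range", type=str, default="3-10")   # e.g. "3-10"
parser.add_argument("--gpu", action="store_true")
args = parser.parse_args()

lo, hi = (int(x) for x in args.depth_range.split("-"))

df = pd.read_parquet("train.parquet")             # placeholder input file
X, y = df.drop(columns="target"), df["target"]    # placeholder column layout

model = XGBClassifier(
    n_estimators=args.n_estimators,
    tree_method="hist",
    device="cuda" if args.gpu else "cpu",         # GPU acceleration (XGBoost >= 2.0 API)
)

search = GridSearchCV(model, {"max_depth": list(range(lo, hi + 1))}, cv=5, n_jobs=-1)
search.fit(X, y)
print("Best depth:", search.best_params_, "CV score:", search.best_score_)

It would be invoked exactly as in the batch script: python train_ensemble.py --n-estimators 1000 --depth-range 3-10 --gpu.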

Reproducible Environment Setup

Statistical Computing Environments

# R: Create project-specific environment with renv
module load R/4.3.0
R -e "renv::init()"            # Initialize project
R -e "renv::install('brms')"   # Install Bayesian regression models

# Python: Lock dependencies with conda-lock
module load miniconda3
conda create -n stats_proj python=3.11  # New environment
conda install -n stats_proj numpy pandas scikit-learn  # Core stack
conda env export -n stats_proj --from-history > environment.yml  # Record top-level dependencies
conda-lock lock --file environment.yml --platform linux-64       # Create reproducible lockfile

Big Data Best Practices
  • Chunked Processing: stream large CSVs in pieces with pd.read_csv(..., chunksize=10**6), or let dask.dataframe partition them for you, for memory-efficient ETL
  • Memory Mapping: np.load('large_array.npy', mmap_mode='r') to work with 100GB+ arrays out of core
  • Columnar Storage: pd.read_parquet('data.parquet') for fast I/O of structured data (a short sketch of all three follows this list)
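A minimal sketch tying these three techniques together (file and column names are placeholders; assumes pandas, numpy, and pyarrow are available in your environment):

import numpy as np
import pandas as pd

# 1. Chunked processing: stream a huge CSV in 1M-row pieces instead of loading it whole
totals = []
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    totals.append(chunk["value"].sum())          # reduce each chunk, keep only the summary
print("grand total:", sum(totals))

# 2. Memory mapping: touch only the slices of a 100GB+ array you actually need
arr = np.load("large_array.npy", mmap_mode="r")  # nothing is read until you index it
print("mean of first block:", arr[:10_000].mean())

# 3. Columnar storage: read just the columns you need from a Parquet file
df = pd.read_parquet("data.parquet", columns=["subject_id", "outcome"])
print(df.head())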
Pro Tip: Always test workflows interactively before submitting batch jobs!
  • Validate data loading in small sessions
  • Profile memory usage with RStudio's profiler (R) or python -m memory_profiler (Python); a small sketch follows this list
  • Test a single array-job element (e.g., --array=1) before the full submission
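For the Python side of that profiling tip, a minimal sketch using memory_profiler (install it into your project environment first; the script and CSV names below are placeholders):

# profile_load.py
from memory_profiler import profile

import pandas as pd

@profile                                   # prints line-by-line memory usage for this function
def load_and_summarize():
    df = pd.read_csv("data_sample.csv")    # small test file for the interactive session
    return df.describe()

if __name__ == "__main__":
    load_and_summarize()

# Run inside an interactive session with:  python -m memory_profiler profile_load.py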

Performance Optimization Guide

Understanding Cluster Resources
  • All compute happens on remote servers - your laptop just submits jobs
  • Storage paths like /hb/home are network-mounted - accessible from all nodes
  • Always test scripts with small resources first!

Memory Management Essentials

# For R: Profile memory usage with Valgrind
# This creates detailed memory usage reports
module load R/4.3.0
R -d "valgrind --tool=massif" -f bayesian_analysis.R
# After running, analyze with:
ms_print massif.out.* > memory_report.txt

# For Python: Track memory allocation
# First install memory profiler in your environment
pip install memray
# Run profiling and generate report
python -m memray run -o profile.bin ml_pipeline.py
python -m memray stats --json profile.bin > memory_stats.json

Why Memory Matters:
  • Jobs that exceed their requested memory are automatically killed
  • Request 10-20% less than the node maximum to leave headroom for the OS
  • Check CPU and memory efficiency with seff JOBID (most accurate once the job finishes); a quick sizing sketch follows this list
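Before choosing --mem, a quick back-of-the-envelope estimate of your largest objects helps avoid both crashes and over-allocation. A sketch with made-up dimensions:

import numpy as np

n_obs, n_params = 1_000_000, 500                                  # hypothetical design matrix size
bytes_needed = n_obs * n_params * np.dtype(np.float64).itemsize
print(f"design matrix alone: {bytes_needed / 1e9:.1f} GB")        # ~4.0 GB

# Rule of thumb: request a few times your largest object to cover copies made
# during fitting, then check the actual peak with seff once the job finishes.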

Parallel Computing Patterns Explained

Method                  | SLURM Directives               | When to Use                                     | Example Command
Embarrassingly parallel | --array=1-100                  | Independent tasks (bootstrap/permutation tests) | sbatch --array=1-100 job.sh
MPI (message passing)   | --nodes=4 --ntasks-per-node=16 | Inter-process communication (Gibbs sampling)    | mpirun -np 64 ./model
Multithreading          | --cpus-per-task=32             | Shared-memory tasks (XGBoost/CV tuning)         | export OMP_NUM_THREADS=32
Key Concept: Always match the parallel method to your algorithm (a short sketch follows this list):
  • Embarrassingly parallel: no data sharing between tasks
  • MPI: processes must exchange data during the computation
  • Multithreading: a single process with multiple threads sharing memory
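As a concrete illustration, here is a hedged Python sketch that parallelizes independent bootstrap resamples across the cores SLURM granted a single task (read from SLURM_CPUS_PER_TASK); for independent tasks spread over many jobs, use the --array pattern from the batch example above instead:

import os
from multiprocessing import Pool

import numpy as np

# Toy data; in practice load your dataset here
data = np.random.default_rng(0).normal(size=10_000)

def bootstrap_mean(seed: int) -> float:
    """One independent resample: no data is exchanged between workers."""
    rng = np.random.default_rng(seed)
    return rng.choice(data, size=data.size, replace=True).mean()

# Use exactly the cores SLURM allocated (falls back to 1 outside a job)
n_cores = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))

if __name__ == "__main__":
    with Pool(processes=n_cores) as pool:
        estimates = pool.map(bootstrap_mean, range(1000))   # 1000 independent resamples
    print("bootstrap SE of the mean:", np.std(estimates))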

Getting Help with HPC

Cluster Support Channels

# Check job efficiency with seff (numbers are approximate while the job is still running)
seff JOBID                     # or: seff $(squeue -u $USER -h -o %i) for your active job
# Look for:
# - CPU Utilization: should be >90% for good efficiency
# - Memory Usage: should stay below the requested amount

Avoid These Common Mistakes
  • Memory Overallocation:
    Bad: --mem=256G (requesting the full node maximum leaves nothing for system overhead)
    Good: --mem=230G (leave roughly a 10% margin)
  • Ignoring Error Logs:
    Always check the job's log file (slurm-JOBID.out by default, or whatever --output/--error points to) after failures
  • Local Installs:
    Never use pip install --user: packages land in ~/.local and can shadow or conflict with the module-provided environments