Statistical Computing: A tutorial for UCSC Graduate Students
16 Feb 2025
Hi, grads! Whether you're running Bayesian models, tuning machine learning algorithms, or analyzing massive datasets, this guide aims to help you take full advantage of UCSC's Hummingbird (HB) clusters.
What Are the Hummingbird Clusters?
The HB clusters are UCSC’s shared high-performance computing (HPC) systems, built to handle tasks that would overwhelm personal computers. Designed for researchers across disciplines, they offer:
- Parallel computing: Run thousands of simulations or MCMC chains simultaneously
- High-speed storage: Process TB-scale datasets (e.g., genomics, climate models)
- Specialized hardware: Access GPU nodes for deep learning
- Scalability: Transition seamlessly from small interactive tests to large batch jobs
Who Should Use HB?
These clusters are ideal for statistical work requiring:
- Computational intensity: Hierarchical models, bootstrapping, or optimization
- Reproducibility: Version-controlled environments for collaborative projects
- Speed: Accelerate workflows that take days on a laptop
What’s in This Guide?
You’ll learn to:
- Test code interactively (e.g., debug an RStan model)
- Submit batch jobs (e.g., parallelize 100 MCMC chains)
- Optimize resources (avoid memory crashes, leverage GPUs)
- Manage workflows (from data storage to result analysis)
Introduction to Cluster Computing for Stats Grads
Brief Glossary:
- Node: Dedicated server (128 cores/256GB RAM typical) - Think of it as a powerful workstation
- Core: Individual processing unit (Like a CPU thread) - Your basic computation unit
- GPU Node: Specialized nodes with 4 NVIDIA A100 GPUs (80GB VRAM each) for deep learning
- Scratch Space: 1PB high-speed temporary storage (Auto-cleaned every 14 days) - Perfect for intermediate results
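Before requesting resources, it helps to see what the cluster actually offers. The commands below are standard SLURM queries; the exact partitions and node specs they print depend on the current HB configuration, so treat the output as illustrative rather than a promise of specific hardware.
# List partitions, node counts, cores, memory, GPUs (gres), and time limits
sinfo -o "%P %D %c %m %G %l"
# Show detailed specs for one node (replace <nodename> with a name from sinfo)
scontrol show node <nodename>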
I. Interactive Development Sessions
When to Use Interactive:
- Debugging code
- Exploratory analysis
- Small simulations
- Model prototyping
- Visualization
Example 1: Debugging Bayesian Models in R
# Request interactive resources: 4 cores, 8GB RAM for 2 hours
srun --pty --mem=8G --cpus-per-task=4 --time=02:00:00 bash
# Load R environment with Bayesian stack
module load R/4.3.0
# Start R session with debugging capabilities
R
> library(rstan) # Load STAN interface
> debug(fit_model) # Set breakpoint in function
> source("hierarchical_bayes.R") # Run script until breakpoint
> where # Show call stack when breakpoint hits
Example 2: Interactive ML Development with Jupyter
# Request heavier resources for data exploration: 8 cores, 16GB RAM
srun --pty --mem=16G --cpus-per-task=8 --time=04:00:00 bash
# Load Python environment
module load python/3.11
# Start Jupyter Lab on cluster (no local browser)
python -m jupyter lab --no-browser --port=8889
# On your local machine, create SSH tunnel:
ssh -L 8889:localhost:8889 cruzid@hb.ucsc.edu
# Now access via http://localhost:8889 in local browser
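One caveat: the srun session above usually lands on a compute node, so Jupyter is listening there rather than on the login host. If the simple tunnel does not connect, a two-hop variant through the login node is the usual fix; <nodename> below is a placeholder for whatever hostname reports inside your interactive session.
# Inside your interactive session, find the compute node's name
hostname
# Option A (from your local machine): forward through the login node to that node.
# Requires starting Jupyter with --ip=0.0.0.0 so it is reachable beyond localhost.
ssh -L 8889:<nodename>:8889 cruzid@hb.ucsc.edu
# Option B: jump through the login node, if direct SSH to compute nodes is permitted
ssh -L 8889:localhost:8889 -J cruzid@hb.ucsc.edu cruzid@<nodename>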
II. Batch Processing for Production Workloads
When to Use Batch Processing:
- Long-running computations (>4 hours)
- Parameter sweeps/optimization runs
- Production model training/inference
- Final analyses requiring full resources
- Reproducible pipeline executions
Example 1: Large-Scale Bayesian Inference
#!/bin/bash
#SBATCH --job-name=stan_meta # Job identifier
#SBATCH --output=mcmc_%A_%a.log # Log file template (JobID_ArrayID)
#SBATCH --array=1-100 # Parallelize 100 independent chains
#SBATCH --cpus-per-task=4 # 4 cores per chain (for within-chain parallel)
#SBATCH --mem=16G # 16GB RAM per chain
#SBATCH --time=24:00:00 # 24hr max runtime
# Load environment
module load R/4.3.0
# Run STAN model with chain-specific data
Rscript run_stan.R --model hierarchical \
--data ${SLURM_ARRAY_TASK_ID} \
--iter 5000
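Once the script is saved (say as run_stan.sbatch - the filename is arbitrary), submitting and monitoring it takes one standard SLURM command each:
# Submit the array job
sbatch run_stan.sbatch
# Watch your jobs in the queue
squeue -u $USER
# Follow the log of one array element (substitute the real JobID and task index)
tail -f mcmc_JOBID_1.log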
Example 2: Distributed ML Training
#!/bin/bash
#SBATCH --job-name=xgb_ensemble # Job name
#SBATCH --nodes=2 # Use 2 physical servers
#SBATCH --ntasks-per-node=16 # 32 total tasks (16 per node)
#SBATCH --mem=128G # 128GB RAM per node (--mem is per node, so 256GB across 2 nodes)
#SBATCH --time=48:00:00 # 2-day max runtime
#SBATCH --gres=gpu:2 # Request 2 GPUs per node
# Load ML environment
module load python/3.11
# Train XGBoost ensemble with cross-validation:
# 1000 trees, tree-depth search over 3-10, GPU acceleration enabled
# (bash does not allow comments after a backslash line continuation, so they live up here)
python train_ensemble.py --n-estimators 1000 \
                         --depth-range 3-10 \
                         --gpu
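One detail worth flagging: with --nodes=2, a bare python invocation only runs on the first node; the second node is used only if the training script itself launches distributed workers (for example via mpi4py or Dask). If the hypothetical train_ensemble.py is MPI-aware, launching it with srun starts one process per allocated task across both nodes:
# Launch one Python process per allocated task, spread across both nodes
srun python train_ensemble.py --n-estimators 1000 --depth-range 3-10 --gpu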
Reproducible Environment Setup
Statistical Computing Environments
# R: Create project-specific environment with renv
module load R/4.3.0
R -e "renv::init()" # Initialize project
R -e "renv::install('brms')" # Install Bayesian regression models
# Python: Lock dependencies with conda-lock
module load miniconda3
conda create -n stats_proj python=3.11 # New environment
conda install -n stats_proj numpy pandas scikit-learn # Core stack
conda-lock lock --file environment.yml --platform linux-64 # Create reproducible lockfile
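One gap in the conda flow above: conda-lock needs an environment.yml to lock, which the conda create / conda install lines don't produce by themselves. Exporting the environment first closes that loop, and restoring from the lockfiles is then a one-liner on any node. These are standard renv and conda-lock commands; adjust the environment name to your project.
# Export the environment spec that conda-lock expects
conda env export -n stats_proj --from-history > environment.yml
# Later, on another node or machine: restore R packages recorded in renv.lock
R -e "renv::restore()"
# ...and recreate the Python environment from the lockfile
conda-lock install -n stats_proj conda-lock.yml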
Big Data Best Practices
- Chunked Processing: Read large CSVs in pieces, e.g. pandas.read_csv(..., chunksize=10**6) or dask.dataframe.read_csv(..., blocksize='64MB'), for memory-efficient ETL
- Memory Mapping: np.load('large_array.npy', mmap_mode='r') for out-of-core access to 100GB+ arrays (use numpy.memmap for raw binary files)
- Columnar Storage: pd.read_parquet('data.parquet') for fast I/O of structured data
Pro Tip: Always test workflows interactively before submitting batch jobs!
- Validate data loading in small sessions
- Profile memory usage (e.g., RStudio's profiling tools for R, or python -m memory_profiler for Python)
- Test single array job element before full submission
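For that last tip, here is a concrete pattern, reusing the hypothetical run_stan.sbatch array script from Section II (command-line --array overrides the directive inside the script):
# Run only array element 1 as a smoke test
sbatch --array=1 run_stan.sbatch
# Inspect the log and resource usage once it finishes
seff JOBID
# If it looks healthy, submit the full sweep
sbatch --array=1-100 run_stan.sbatch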
Performance Optimization Guide
Understanding Cluster Resources
- All compute happens on remote servers - your laptop just submits jobs
- Storage paths like /hb/home are network-mounted - accessible from all nodes
- Always test scripts with small resources first!
Memory Management Essentials
# For R: Profile memory usage with Valgrind
# This creates detailed memory usage reports
module load R/4.3.0
R -d "valgrind --tool=massif" -f bayesian_analysis.R
# After running, analyze with:
ms_print massif.out.* > memory_report.txt
# For Python: Track memory allocation
# First install the memray profiler in your environment
pip install memray
# Run profiling and generate report
python -m memray run -o profile.bin ml_pipeline.py
python -m memray stats --json profile.bin > memory_stats.json
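If you prefer a visual summary, memray can also render the same capture as an interactive HTML flamegraph (the output filename here is just an example):
# Generate an HTML flamegraph from the capture
python -m memray flamegraph -o memray_report.html profile.bin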
Why Memory Matters:
- Jobs exceeding requested memory get automatically killed
- Use 10-20% less than node maximums for safety
- Check memory usage with seff JOBID once a job finishes (for a still-running job, see the sstat example below)
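seff reads accounting data, so it is most informative after a job has completed. While a job is still running, sstat reports live usage, and sacct summarizes finished jobs; both are standard SLURM accounting tools, and the format fields shown are just a reasonable starting set.
# Live memory and CPU usage of a running job (.batch targets the batch step)
sstat --format=JobID,MaxRSS,AveCPU -j JOBID.batch
# Summary of a finished job, including peak memory and elapsed time
sacct --format=JobID,State,Elapsed,MaxRSS,ReqMem -j JOBID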
Parallel Computing Patterns Explained
Method | SLURM Directives | When to Use | Example Command |
---|---|---|---|
Embarrassing Parallel | --array=1-100 | Independent tasks (Bootstrap/permutation tests) | sbatch --array=1-100 job.sh |
MPI (Message Passing) | --nodes=4 --ntasks-per-node=16 | Inter-process communication (Gibbs sampling) | mpirun -np 64 ./model |
Multithreading | --cpus-per-task=32 | Shared-memory tasks (XGBoost/CV tuning) | export OMP_NUM_THREADS=32 |
Key Concept: Always match parallel method to your algorithm:
- Embarrassing Parallel: No data sharing between tasks
- MPI: Needs data exchange between processes
- Multithreading: Single process with multiple threads
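A detail that trips people up with the multithreading pattern: the thread count your code uses should match the cores you actually requested, otherwise you either oversubscribe the node or waste the allocation. Inside a job script, SLURM exposes the request as an environment variable, so a single export keeps the two in sync (this works for OpenMP-backed libraries such as XGBoost or data.table):
# Match OpenMP thread count to the cores granted by --cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK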
Getting Help with HPC
Cluster Support Channels
- Email Support: hummingbird@ucsc.edu
- Slack Channel
- Documentation at https://hummingbird.ucsc.edu/docs
# Find your job IDs
squeue -u $USER
# Check job efficiency (seff uses accounting data, so it is most accurate once the job has finished;
# for a running job, use sstat as shown in the memory section above)
seff JOBID
# Look for:
# - CPU Utilization: Should be >90% for good efficiency
# - Memory Usage: Should be < requested amount
Avoid These Common Mistakes
- Memory Overallocation: Bad: --mem=256G (max is 256GB/node). Good: --mem=230G (leave a ~10% margin)
- Ignoring Error Logs: Always check slurm-JOBID.err after failures
- Local Installs: Never use pip install --user - it can break cluster environments