This is a single-header library, inspired by Sean Barrett's stb.h, that provides a collection of useful statistical functions.
============================================================================
You MUST
#define STB_STATS_DEFINE
in EXACTLY _one_ C or C++ file that includes this header, BEFORE the
include, like this:
#define STB_STATS_DEFINE
#include "stb_stats.h"
All other files should just #include "stb_stats.h" without the #define.
============================================================================
- stb_stats.h - Main header file with statistical functions
- examples/ - Example programs demonstrating library usage:
  - dim_reduce.c - Dimensionality reduction (PCA, t-SNE, UMAP)
  - deseq2_example.c - DESeq2-style differential expression analysis
  - spearman.c - Spearman's rank correlation calculator
- test_stb_stats.c - Comprehensive test suite
- test_isolated.c - Isolated tests for specific functions
Functions included are:
- stb_tsne (t-SNE: t-Distributed Stochastic Neighbor Embedding with Barnes-Hut approximation)
- stb_umap (UMAP: Uniform Manifold Approximation and Projection)
- stb_kdtree (KD-tree data structure for efficient nearest-neighbor search; used by t-SNE and UMAP)
- stb_adjust_pvalues_bh (apply Benjamini-Hochberg FDR correction to array of p-values), stb_log2_fold_change
- stb_moderated_ttest, stb_cosine_similarity, RLE normalization (stb_calc_geometric_scaling_factors and stb_meanvar_counts_to_common_scale)
- stb_shannon (Shannon's diversity index, Pielou evenness), stb_simpson (Simpson's Diversity Index), stb_jaccard (Jaccard similarity index), stb_bray_curtis (Bray–Curtis dissimilarity) and stb_create_htable (a simple hash table)
- stb_pdf_hypgeo (hypergeometric distribution probability density function) and stb_log_factorial (sped up with a lookup table)
- stb_fisher2x2 (simple Fisher's exact test for 2x2 contingency tables)
- stb_pdf_binom and stb_pdf_pois, the binomial and Poisson probability density functions
- stb_polygamma and stb_trigamma_inverse (gamma-related functions), and stb_fit_f_dist for moment estimation of the scaled F-distribution
- stb_qnorm and stb_qnorm_with_reference (also matrix variants) quantile normalization between columns with and without a reference
- stb_neugas Neural gas clustering algorithm
- stb_pca Principal Component Analysis
- stb_csm (confidence sequence method) for Monte Carlo simulations
- stb_kmeans k-means++ classical data clustering
- stb_qsort (Quicksort), could be used to replace current sorting method
- stb_cdf_gumbel, stb_pdf_gumbel, stb_icdf_gumbel and stb_est_gumbel, the cumulative distribution, probability density, and inverse cumulative distribution functions of the Gumbel distribution, and the ML estimator of the Gumbel parameters
- stb_kendall (Kendall's Rank correlation)
- stb_jenks Initial port of O(k×n×log(n)) Jenks-Fisher algorithm originally created by Maarten Hilferink
- stb_logistic_regression_L2 simple L2-regularized logistic regression
- stb_spearman (Spearman's Rank correlation)
- stb_invert_matrix, stb_transpose_matrix, stb_matrix_multiply, ..., stb_multi_linear_regression and stb_multi_logistic_regression
- stb_ksample_anderson_darling, stb_2sample_anderson_darling, (one sample) stb_anderson_darling
- stb_expfit (Exponential fitting), stb_polyfit (Polynomial fitting), stb_powfit (Power curve fitting), stb_linfit (Linear fitting)
- stb_trap, stb_trapezoidal (returns the integral (area under the curve) of a given function over a given interval)
- stb_lagrange (polynomial interpolation), stb_sum (Neumaier summation algorithm)
- stb_mann_whitney, stb_kruskal_wallis (Unfinished, needs a better way to handle Dunn's post-hoc test)
- stb_combinations
- stb_allocmat (simple allocation of 2d array, but might not work on all systems?!)
- stb_fgetln, stb_fgetlns
- stb_pcg32 (PCG-XSH-RR) and stb_xoshiro512 (xoshiro512**) Pseudo Random Number Generators
- stb_anova (One-Way ANOVA with Tukey HSD test and Scheffe T-statistics method (post-hoc) (Unfinished))
- stb_quartiles
- stb_histogram (very simple histogram), stb_print_histogram, ...
- stb_factorial
- stb_meanvar
- stb_ttest, stb_uttest
- stb_ftest
- stb_benjamini_hochberg
- stb_chisqr, stb_chisqr_matrix, stb_gtest, stb_gtest_matrix
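To illustrate one of the techniques named in the list, here is a minimal standalone sketch of Neumaier summation, the algorithm the list attributes to stb_sum. The function name `neumaier_sum` and its signature are assumptions made for this example; they are not the library's API, so check stb_stats.h for the actual declaration.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not part of stb_stats.h): Neumaier's variant of
 * Kahan summation. A separate compensation term collects the low-order
 * bits that each floating-point addition discards. */
double neumaier_sum(const double *x, size_t n)
{
    double sum = 0.0;
    double comp = 0.0; /* accumulated rounding error */

    for (size_t i = 0; i < n; i++) {
        double t = sum + x[i];
        if (fabs(sum) >= fabs(x[i]))
            comp += (sum - t) + x[i]; /* low-order bits of x[i] were lost */
        else
            comp += (x[i] - t) + sum; /* low-order bits of sum were lost */
        sum = t;
    }
    return sum + comp; /* apply the correction once, at the end */
}
```

A naive loop over {1.0, 1e100, 1.0, -1e100} loses both 1.0 terms, whereas the compensated version recovers them.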
A sample C program demonstrating DESeq2-style differential expression analysis using stb_stats.h for normalization and statistical testing.
Features:
- RLE (Relative Log Expression) normalization using geometric means
- Dispersion estimation using stb_fit_f_dist
- Moderated t-test for differential expression
- Multiple testing correction using Benjamini-Hochberg FDR
- Log2 fold change calculation
Input format:
- TAB-delimited count matrix (rows x columns format)
- First row: number of rows and columns
- Subsequent rows: count data (genes as rows, samples as columns)
Quick start:
make
# Example with first 3 samples as group 1 (columns 0-2) and next 3 as group 2 (columns 3-5)
./deseq2_example sample_counts.txt --g1-start 0 --g1-count 3 --g2-start 3 --g2-count 3 -o results.txt
Options:
- --g1-start N: Starting column index for group 1 (default: 1)
- --g1-count N: Number of samples in group 1 (default: 3)
- --g2-start N: Starting column index for group 2 (default: 4)
- --g2-count N: Number of samples in group 2 (default: 3)
- --fdr FLOAT: False discovery rate threshold (default: 0.05)
- -o FILE: Output file (default: stdout)
Output columns:
- Gene: Gene identifier
- baseMean: Average expression across all samples
- log2FoldChange: Log2 fold change between groups
- stat: Test statistic (moderated t-statistic)
- pvalue: P-value from statistical test
- padj: Adjusted p-value (Benjamini-Hochberg FDR)
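For readers unfamiliar with the padj column, the following is a minimal standalone sketch of the Benjamini-Hochberg step-up adjustment. The name `bh_adjust`, the `pv_entry` struct, and the signature are assumptions for illustration only; the library's own stb_adjust_pvalues_bh may be declared differently.

```c
#include <stdlib.h>

/* Hypothetical helper type: a p-value together with its original position. */
typedef struct { double p; size_t idx; } pv_entry;

static int cmp_pv(const void *a, const void *b)
{
    double x = ((const pv_entry *)a)->p, y = ((const pv_entry *)b)->p;
    return (x > y) - (x < y);
}

/* Hypothetical helper (not part of stb_stats.h).
 * Writes Benjamini-Hochberg adjusted p-values into padj, in the same
 * order as the input p. Returns 0 on success, -1 on allocation failure. */
int bh_adjust(const double *p, size_t n, double *padj)
{
    if (n == 0) return 0;
    pv_entry *e = malloc(n * sizeof *e);
    if (!e) return -1;

    for (size_t i = 0; i < n; i++) {
        e[i].p = p[i];
        e[i].idx = i;
    }
    qsort(e, n, sizeof *e, cmp_pv);

    /* Walk from the largest p-value down, scaling by n / rank and
     * enforcing monotonicity with a running minimum capped at 1. */
    double running_min = 1.0;
    for (size_t k = n; k-- > 0; ) {
        double adj = e[k].p * (double)n / (double)(k + 1);
        if (adj < running_min) running_min = adj;
        padj[e[k].idx] = running_min;
    }

    free(e);
    return 0;
}
```

A gene is then reported as significant when its padj falls below the chosen FDR threshold (0.05 by default, per the options above).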
A flexible C program demonstrating the usage of PCA, t-SNE, and UMAP for dimensionality reduction on tabular data.
Features:
- Support for PCA, t-SNE, and UMAP algorithms
- TAB-delimited and gzipped file support
- Row-major and column-major data orientations
- Automatic Z-score normalization
- Pre-PCA option for high-dimensional data
See DIM_REDUCE_README.md for detailed usage instructions.
Quick start:
make
./dim_reduce data.txt -a pca -o output.txt
./dim_reduce data.txt -a tsne --perplexity 30 -o output.txt
./dim_reduce data.txt -a umap --neighbors 15 -o output.txt
A sample C program for calculating Spearman's rank correlation coefficient between two data files.
Features:
- Reads tab-separated data files (e.g., HTSeq count files)
- Supports standard HTSeq format: gene_ID, gene_name, counts
- Optional header skipping
- Optional filtering by minimum value threshold
- Uses stb_spearman for efficient rank correlation calculation
Input format:
- TAB-delimited files with three columns: gene_ID, gene_name, and count values
- Commonly used for RNAseq data (HTSeq count files)
- Example format:
ENSG00000000003 TSPAN6 1234
ENSG00000000005 TNMD 567
Quick start:
make
./spearman sample1.txt sample2.txt
./spearman sample1.txt sample2.txt -s # Skip header
./spearman sample1.txt sample2.txt -m 10 # Filter with min value 10
./spearman sample1.txt sample2.txt -s -m 10 # Both options
Options:
- -s, --skip-header: Skip the first line (header) in both files
- -m, --min-val VALUE: Minimum value threshold for filtering (keeps pairs where at least one value meets the threshold)
- -h, --help: Show help message
Output:
- TAB-delimited format:
file1\tfile2\tcorrelation_coefficient
CITATION
If you use this Tool-Box in a publication, please reference:
Voshol, G.P. (2024). STB: A simple Statistics Tool Box (Version 1.26) [Software]. Available from https://github.com/gerbenvoshol/Statistics-Tool-Box