Statistics-Tool-Box

This is a single header file, inspired by Sean Barrett's stb.h, that provides a collection of useful statistical functions.

============================================================================

 You MUST

		#define STB_STATS_DEFINE

 in EXACTLY _one_ C or C++ file that includes this header, BEFORE the
 include, like this:

		#define STB_STATS_DEFINE
		#include "stb_stats.h"

 All other files should just #include "stb_stats.h" without the #define.

============================================================================

Repository Structure

  • stb_stats.h - Main header file with statistical functions
  • examples/ - Example programs demonstrating library usage:
    • dim_reduce.c - Dimensionality reduction (PCA, t-SNE, UMAP)
    • deseq2_example.c - DESeq2-style differential expression analysis
    • spearman.c - Spearman's rank correlation calculator
  • test_stb_stats.c - Comprehensive test suite
  • test_isolated.c - Isolated tests for specific functions

Functions included are:

  • stb_tsne (t-SNE: t-Distributed Stochastic Neighbor Embedding with Barnes-Hut approximation)
  • stb_umap (UMAP: Uniform Manifold Approximation and Projection)
  • stb_kdtree KD-tree data structure for efficient nearest neighbor search (used by t-SNE and UMAP)
  • stb_adjust_pvalues_bh (apply Benjamini-Hochberg FDR correction to array of p-values), stb_log2_fold_change
  • stb_moderated_ttest, stb_cosine_similarity, RLE normalization (stb_calc_geometric_scaling_factors and stb_meanvar_counts_to_common_scale)
  • stb_shannon (Shannon's diversity index, Pielou evenness), stb_simpson (Simpson's Diversity Index), stb_jaccard (Jaccard similarity index), stb_bray_curtis (Bray–Curtis dissimilarity) and stb_create_htable (a simple hash table)
  • stb_pdf_hypgeo (hypergeometric distribution probability density function), stb_log_factorial sped up using a lookup table
  • stb_fisher2x2 simple Fisher exact test for 2x2 contingency tables
  • stb_pdf_binom and stb_pdf_pois, the binomial and Poisson probability density functions
  • stb_polygamma, stb_trigamma_inverse gamma functions and stb_fit_f_dist for moment estimation of the scaled F-distribution
  • stb_qnorm and stb_qnorm_with_reference (also matrix variants) quantile normalization between columns with and without a reference
  • stb_neugas Neural gas clustering algorithm
  • stb_pca Principal Component Analysis
  • stb_csm (confidence sequence method) for Monte Carlo simulations
  • stb_kmeans k-means++ classical data clustering
  • stb_qsort (Quicksort), could be used to replace current sorting method
  • stb_cdf_gumbel, stb_pdf_gumbel, stb_icdf_gumbel and stb_est_gumbel, the (inverse) cumulative/probability density functions for the Gumbel distribution and the ML estimator of the Gumbel parameters
  • stb_kendall (Kendall's Rank correlation)
  • stb_jenks Initial port of O(k×n×log(n)) Jenks-Fisher algorithm originally created by Maarten Hilferink
  • stb_logistic_regression_L2 simple L2-regularized logistic regression
  • stb_spearman (Spearman's Rank correlation)
  • stb_invert_matrix, stb_transpose_matrix, stb_matrix_multiply, ..., stb_multi_linear_regression and stb_multi_logistic_regression
  • stb_ksample_anderson_darling, stb_2sample_anderson_darling, (one sample) stb_anderson_darling
  • stb_expfit (Exponential fitting), stb_polyfit (Polynomial fitting), stb_powfit (Power curve fitting), stb_linfit (Linear fitting)
  • stb_trap, stb_trapezoidal (returns the integral (area under the curve) of a given function and interval)
  • stb_lagrange (polynomial interpolation), stb_sum (Neumaier summation algorithm)
  • stb_mann_whitney, stb_kruskal_wallis (Unfinished, needs a better way to handle Dunn's post-hoc test)
  • stb_combinations
  • stb_allocmat (simple allocation of 2d array, but might not work on all systems?!)
  • stb_fgetln, stb_fgetlns
  • stb_pcg32 (PCG-XSH-RR) and stb_xoshiro512 (xoshiro512**) Pseudo Random Number Generators
  • stb_anova (One-Way ANOVA with Tukey HSD test and Scheffe T-statistics post-hoc method; unfinished)
  • stb_quartiles
  • stb_histogram (very simple histogram), stb_print_histogram, ...
  • stb_factorial
  • stb_meanvar
  • stb_ttest, stb_uttest
  • stb_ftest
  • stb_benjamini_hochberg
  • stb_chisqr, stb_chisqr_matrix, stb_gtest, stb_gtest_matrix
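
The list above describes stb_sum as using the Neumaier summation algorithm. For readers unfamiliar with it, here is a minimal standalone sketch of that technique. The function name neumaier_sum is hypothetical and this is not the code from stb_stats.h; it only illustrates how a compensation term recovers low-order bits that a plain running sum would drop.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not part of stb_stats.h): Neumaier's compensated sum. */
double neumaier_sum(const double *x, size_t n)
{
    double sum = 0.0;
    double comp = 0.0; /* accumulated rounding error */

    for (size_t i = 0; i < n; i++) {
        double t = sum + x[i];
        if (fabs(sum) >= fabs(x[i]))
            comp += (sum - t) + x[i]; /* low-order bits of x[i] were lost */
        else
            comp += (x[i] - t) + sum; /* low-order bits of sum were lost */
        sum = t;
    }
    return sum + comp; /* apply the correction once at the end */
}
```

A plain loop over {1.0, 1e100, 1.0, -1e100} returns 0.0 in double precision, while the compensated version returns 2.0. The sketch assumes the compiler is not run with value-changing floating-point optimizations such as -ffast-math.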

Example Programs

deseq2_example - DESeq2-style Differential Expression Analysis

A sample C program demonstrating DESeq2-style differential expression analysis using stb_stats.h for normalization and statistical testing.

Features:

  • RLE (Relative Log Expression) normalization using geometric means
  • Dispersion estimation using stb_fit_f_dist
  • Moderated t-test for differential expression
  • Multiple testing correction using Benjamini-Hochberg FDR
  • Log2 fold change calculation

Input format:

  • TAB-delimited count matrix (rows x columns format)
  • First row: number of rows and columns
  • Subsequent rows: count data (genes as rows, samples as columns)

Quick start:

make
# Example with first 3 samples as group 1 (columns 0-2) and next 3 as group 2 (columns 3-5)
./deseq2_example sample_counts.txt --g1-start 0 --g1-count 3 --g2-start 3 --g2-count 3 -o results.txt

Options:

  • --g1-start N: Starting column index for group 1 (default: 1)
  • --g1-count N: Number of samples in group 1 (default: 3)
  • --g2-start N: Starting column index for group 2 (default: 4)
  • --g2-count N: Number of samples in group 2 (default: 3)
  • --fdr FLOAT: False discovery rate threshold (default: 0.05)
  • -o FILE: Output file (default: stdout)

Output columns:

  • Gene: Gene identifier
  • baseMean: Average expression across all samples
  • log2FoldChange: Log2 fold change between groups
  • stat: Test statistic (moderated t-statistic)
  • pvalue: P-value from statistical test
  • padj: Adjusted p-value (Benjamini-Hochberg FDR)
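
To make the padj column concrete, here is a minimal standalone sketch of the Benjamini-Hochberg adjustment. It is not the library's stb_adjust_pvalues_bh; the function name bh_adjust and its signature are hypothetical. Each p-value is multiplied by n / rank, and a running minimum taken from the largest p-value downward keeps the adjusted values monotone.

```c
#include <stdlib.h>

typedef struct {
    double p;
    size_t idx; /* position in the caller's array */
} ranked_p;

static int cmp_ranked(const void *a, const void *b)
{
    double x = ((const ranked_p *)a)->p, y = ((const ranked_p *)b)->p;
    return (x > y) - (x < y);
}

/* Hypothetical helper: writes BH-adjusted p-values into padj, in input order.
 * Returns 0 on success, -1 if memory allocation fails. */
int bh_adjust(const double *p, size_t n, double *padj)
{
    if (n == 0)
        return 0;

    ranked_p *r = malloc(n * sizeof *r);
    if (!r)
        return -1;

    for (size_t i = 0; i < n; i++) {
        r[i].p = p[i];
        r[i].idx = i;
    }
    qsort(r, n, sizeof *r, cmp_ranked);

    /* Walk from the largest p-value down, enforcing monotonicity. */
    double running_min = 1.0;
    for (size_t k = n; k-- > 0;) {
        double adj = r[k].p * (double)n / (double)(k + 1);
        if (adj < running_min)
            running_min = adj;
        padj[r[k].idx] = running_min;
    }

    free(r);
    return 0;
}
```

For the p-values {0.01, 0.04, 0.03, 0.005}, this yields approximately {0.02, 0.04, 0.04, 0.02}. A gene is then reported as significant when its adjusted value falls at or below the chosen FDR threshold.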

dim_reduce - Generic Dimension Reduction Tool

A flexible C program demonstrating the usage of PCA, t-SNE, and UMAP for dimensionality reduction on tabular data.

Features:

  • Support for PCA, t-SNE, and UMAP algorithms
  • TAB-delimited and gzipped file support
  • Row-major and column-major data orientations
  • Automatic Z-score normalization
  • Pre-PCA option for high-dimensional data

See DIM_REDUCE_README.md for detailed usage instructions.

Quick start:

make
./dim_reduce data.txt -a pca -o output.txt
./dim_reduce data.txt -a tsne --perplexity 30 -o output.txt
./dim_reduce data.txt -a umap --neighbors 15 -o output.txt

spearman - Spearman's Rank Correlation Calculator

A sample C program for calculating Spearman's rank correlation coefficient between two data files.

Features:

  • Reads tab-separated data files (e.g., HTSeq count files)
  • Supports standard HTSeq format: gene_ID, gene_name, counts
  • Optional header skipping
  • Optional filtering by minimum value threshold
  • Uses stb_spearman for efficient rank correlation calculation

Input format:

  • TAB-delimited files with three columns: gene_ID, gene_name, and count values
  • Commonly used for RNAseq data (HTSeq count files)
  • Example format:
    ENSG00000000003	TSPAN6	1234
    ENSG00000000005	TNMD	567
    

Quick start:

make
./spearman sample1.txt sample2.txt
./spearman sample1.txt sample2.txt -s           # Skip header
./spearman sample1.txt sample2.txt -m 10        # Filter with min value 10
./spearman sample1.txt sample2.txt -s -m 10     # Both options

Options:

  • -s, --skip-header: Skip the first line (header) in both files
  • -m, --min-val VALUE: Minimum value threshold for filtering (keeps pairs where at least one value meets the threshold)
  • -h, --help: Show help message
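
The filtering rule described for -m can be sketched as follows. This is a hypothetical illustration, not the code in spearman.c: the name filter_pairs_min is invented, and reading "meets the threshold" as greater than or equal to the value is an assumption.

```c
#include <stddef.h>

/* Hypothetical helper: compacts x and y in place, keeping pair i when at
 * least one of x[i], y[i] is >= min_val. Returns the number of pairs kept. */
size_t filter_pairs_min(double *x, double *y, size_t n, double min_val)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (x[i] >= min_val || y[i] >= min_val) {
            x[kept] = x[i];
            y[kept] = y[i];
            kept++;
        }
    }
    return kept;
}
```

With this rule a pair is dropped only when both values fall below the threshold, so a gene that is expressed in just one of the two samples still contributes to the correlation.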

Output:

  • TAB-delimited format: file1\tfile2\tcorrelation_coefficient

CITATION

If you use this Tool-Box in a publication, please reference:

Voshol, G.P. (2024). STB: A simple Statistics Tool Box (Version 1.26) [Software]. Available from https://github.com/gerbenvoshol/Statistics-Tool-Box
