Archive for importance sampling

estimating evidence redux

Posted in Books, Statistics, University life on November 21, 2025 by xi'an

Following our arXival on the new version of our HPD based Gelfand & Dey estimator of evidence, I got pointed at Wang et al. (2018), which I had forgotten I had read at the time (as testified by an 'Og entry). Reading my own comments, I concur (with myself!) that the method is not massively compelling, since it requires a partition set that is strongly related with the targeted integral. The above illustration for a mixture, that is, for a pseudo posterior that is a mixture of two Gaussian components with known variance, also shows (in reverse) the curse of dimension and the need for finely tuned partitions, said partition corresponding to the myriad of sets on the rhs. With such a degree of partitioning, Riemann integration should also produce a perfect estimate, as shown by the zero error in the resulting estimator (Table 4).
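
Since the post refers to the Gelfand & Dey identity behind our estimator without restating it, here is a minimal sketch in a toy conjugate Normal model, using a uniform instrumental density over a crude 50% HPD interval; the model, the 50% level, and all variable names are illustrative choices of mine, not those of the arXival.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# toy conjugate model: one observation x ~ N(theta, 1), prior theta ~ N(0, 1),
# so the posterior is N(x/2, 1/2) and the evidence is the N(0, 2) density at x
x = 1.5
true_evidence = stats.norm(0, np.sqrt(2)).pdf(x)

# pretend this exact posterior sample is an MCMC output
theta = rng.normal(x / 2, np.sqrt(0.5), size=10**5)

# unnormalised posterior: prior times likelihood
log_ptilde = stats.norm(0, 1).logpdf(theta) + stats.norm(theta, 1).logpdf(x)

# Gelfand & Dey identity: E_posterior[ h(theta) / ptilde(theta) ] = 1/Z for any
# proper density h; here h is uniform over a crude 50% HPD interval estimated
# from the sample (central interval, as the posterior is symmetric and unimodal)
lo, hi = np.quantile(theta, [0.25, 0.75])
log_h = np.where((theta >= lo) & (theta <= hi), -np.log(hi - lo), -np.inf)

inv_Z = np.mean(np.exp(log_h - log_ptilde))
print(f"true evidence {true_evidence:.5f}, Gelfand & Dey estimate {1 / inv_Z:.5f}")
```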

finite variance goals

Posted in Books, Statistics, Travel, University life on November 8, 2025 by xi'an

During Johan Segers' seminar in Warwick, on the control variate improvements he developed with Rémi Leluc (whose PhD thesis committee I joined), Aymeric Dieuleveut, François Portier, and Aigerim Zhuman, I started wondering whether or not a control variate could turn an infinite variance Monte Carlo estimate into a finite variance one. And asked… ChatGPT about it, with the above reply, which is correct if not practical in the least, since the example provided therein was reverse-engineering an infinite variance rv into the sum of an infinite variance rv taken as the control variate and a finite variance rv. As summarised below, with a small numerical sketch after the list. In practice, this would mean replacing the integrand of interest with a much simpler integrand that shares the same asymptotic behaviour, not an easy task! (As an aside, I found out that enabling MathJax on this 'Og would cost me $40 a month!)

5. Summary

Theoretical possibility:
Yes — control variates can make an infinite-variance estimator finite, but only if the control’s sample path shares the same tail driver and its expectation is known.
In real-world Monte Carlo, when X is heavy-tailed, you usually:

  1. Split X = Y + (X-Y), where Y has known expectation and similar tails,

  2. Use Y as control variate, and

  3. Possibly combine with truncation, conditional expectation, or importance sampling for stability.
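
As a concrete (if artificial) instance of the split in point 1, here is the small numerical sketch, entirely of my own making: the integrand u^(-α)(1+u) on (0,1) has infinite Monte Carlo variance for α=3/4, but subtracting the control u^(-α), whose integral 1/(1-α) is known and which shares the same singularity at zero, leaves the finite-variance remainder u^(1-α).

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 0.75, 10**6
u = rng.random(n)

# integrand of interest on (0,1): h(u) = u^(-alpha) (1 + u), with exact integral
# 1/(1-alpha) + 1/(2-alpha); since 2*alpha >= 1, plain Monte Carlo has infinite variance
h = u**(-alpha) * (1 + u)

# control variate sharing the same singularity at 0, with known expectation 1/(1-alpha)
g = u**(-alpha)
mu_g = 1 / (1 - alpha)

plain = h.mean()                   # infinite-variance estimator
cv = (h - g).mean() + mu_g         # h - g = u^(1-alpha) has finite variance

exact = 1 / (1 - alpha) + 1 / (2 - alpha)
print(f"exact {exact:.4f}   plain MC {plain:.4f}   control variate {cv:.4f}")
```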

coupling-based approach to f-divergences diagnostics for MCMC

Posted in Books, Statistics, Travel, University life on October 27, 2025 by xi'an

Adrien Corenflos (University of Warwick) and Hai-Dang Dau (NUS) just arXived their paper on MCMC diagnostics that Adrien told me about last month, while in Warwick.

“This [f-divergence] bound is clearly suboptimal since it does not vary in t and does not take into account the mixing of the Markov chain. We present a scheme where the weights are ‘harmonized’ as the Markov chain progresses, reflecting its mixing through the notion of coupling.”

They start by opposing the classical ergodic average and the embarrassingly parallel estimates obtained from N parallel chains culled of their B initial values to the couplings used in standard diagnostics. Opting for the parallel perspective, maybe rekindling the diagnostic war of the early 1990s! The evaluation tool in the paper is based on f-divergences, like the χ² divergence, which naturally relates to the effective sample size when considering weighted atomic measures. When consistent, these weighted approximations produce upper bounds on the f-divergence, with exact convergence in case of independence.

In my opinion the most exciting part of the paper lies in the ability to modify these weights along MCMC iterations, since the naïve sequential importance sampling argument I also use in class keeps them constant! The trick is to (be able to) couple randomly chosen parallel chains, with the weights being averaged at each coupling event. The resulting algorithm preserves expectation (in the importance sampling sense) and consistency (in the particle sense). Furthermore, the f-divergence bound based on the weights can only decrease between iterations, which reminds me of interleaving, and the weights converge exponentially fast to uniform ones (under the strong assumption of a uniformly lower-bounded coupling probability). The paper concludes with interesting remarks on perfect sampling, Rao-Blackwellisation, control variates, and backward sampling.
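
To see why averaging weights at coupling events can only improve a χ²-type bound (and its effective sample size counterpart mentioned above), here is a toy sketch that merely mimics the harmonisation step on a fixed vector of weights; it is not the authors' algorithm, in which the chains also keep moving and couplings occur stochastically, and all names are mine.

```python
import numpy as np

rng = np.random.default_rng(2)

def chi2_proxy(w):
    """Relative variance of the normalised weights, N * sum(wbar**2) - 1,
    i.e. N/ESS - 1: zero when all weights are equal."""
    wbar = w / w.sum()
    return len(w) * np.sum(wbar**2) - 1.0

# toy stand-in for N parallel chains carrying (unnormalised) importance weights
N = 8
w = rng.gamma(shape=0.5, size=N)          # deliberately uneven initial weights
print(f"initial bound proxy: {chi2_proxy(w):.4f}")

# 'weight harmonisation': whenever two randomly chosen chains couple, their
# weights are replaced by their average; by convexity the sum of squared
# normalised weights, and hence the chi-square proxy, can only decrease
for t in range(10):
    i, j = rng.choice(N, size=2, replace=False)
    w[i] = w[j] = (w[i] + w[j]) / 2
    print(f"after coupling ({i}, {j}): {chi2_proxy(w):.4f}")
```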

A long-standing gap exists between the theoretical analysis of Markov chain Monte Carlo convergence, which is often based on statistical divergences, and the diagnostics used in practice. We introduce the first general convergence diagnostics for Markov chain Monte Carlo based on any f-divergence, allowing users to directly monitor, among others, the Kullback–Leibler and the χ² divergences as well as the Hellinger and the total variation distances. Our first key contribution is a coupling-based ‘weight harmonization’ scheme that produces a direct, computable, and consistent weighting of interacting Markov chains with respect to their target distribution. The second key contribution is to show how such consistent weightings of empirical measures can be used to provide upper bounds to f-divergences in general. We prove that these bounds are guaranteed to tighten over time and converge to zero as the chains approach stationarity, providing a concrete diagnostic.

mostly [14] M[ar]C[h] seminar

Posted in Books, Statistics, University life on March 8, 2025 by xi'an

gentle importance sampling

Posted in Books, pictures, Statistics on February 24, 2025 by xi'an

A new (and gentle!) survey by Luca Martino! And by Fernando Llorente. On importance sampling, with coverage of normalised and self-normalised versions. And their usage in different configurations (one vs several integrals, one vs several families of distributions). Some points relating to earlier remarks or musings of mine:

  • the fact that the optimal importance function does not lead to a zero variance importance estimator when the integrand f is not of constant sign (p.7) can be circumvented by first decomposing f as f⁺-f⁻, since both parts allow for a zero variance importance estimator, if formally requiring two different samples (of size zero!), a trick considered later on p.18 and repeated for the ratio in self-normalised importance (p.19); see the small sketch after this list
  • the special case when the integrand f is constant is not of practical interest but relevant for checking properties of different estimators. For instance, this case allowed George and me to spot a mistake in an early importance sampling paper, published in the same volume of the Comptes Rendus as an early paper of Lions and Villani.
  • the remark that self-normalised (SNIS) importance sampling can prove more efficient than (properly normalised) importance sampling, although the property that SNIS is always bounded should not be seen as a major point given that it is simply due to using a finite sample and hence a finite set of images of f
  • the case of integrals involving several target pdfs or several integrands is not necessarily of major interest if simulating different samples for each unidimensional integral can be implemented (again formally leading to zero variance at no cost)
  • the issue of merging several estimators in an optimal way is briefly mentioned in §5.4, a challenge Victor Elvira and I have been approaching over the past years, if not yet concluding satisfactorily (mea culpa)
  • when replacing the target with a noisy estimate (p.22), the fact that this estimate must be normalised is correct, but pales against the impact of using such an estimate, which may prove catastrophic. And unbiasedness is not particularly crucial in this setup, for the same reason
  • the section on evidence approximation (§7) is more standard, with the harmonic mean estimator being called reverse importance sampling, which brings us to the “elephant in the room”, namely that
  • the issue of infinite variance of some importance sampling estimators is not directly covered (except once in §8, p.34), thus perceiving importance sampling as a variance reduction method being somewhat misleading (unless the authors consider solely the optimal importance function, which is rarely of practical use)
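
As a concrete illustration of the f⁺-f⁻ trick from the first bullet (my own toy example, not one from the survey): for the integrand f(x)=x under a standard Normal target, the optimal proposal for the positive part is a Rayleigh(1) distribution, and the resulting importance weights are constant, hence of zero variance; the negative part follows by symmetry.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10**4

# target pi = N(0,1), integrand f(x) = x, not of constant sign; exact integral is 0
# decompose f = f+ - f- and give each part its own optimal proposal:
# q+(x) ∝ f+(x) pi(x) = x exp(-x**2/2) on x > 0, i.e. a Rayleigh(1) density
x = rng.rayleigh(scale=1.0, size=n)

pi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0,1) density at the draws
q_plus = x * np.exp(-x**2 / 2)                # Rayleigh(1) density at the draws
w = x * pi / q_plus                           # importance weights for f+ = x on x>0

print("I+ estimate:", w.mean(), "  exact:", 1 / np.sqrt(2 * np.pi))
print("weight variance:", w.var())            # numerically zero
# by symmetry the f- part gives I- = I+, so the estimate of ∫ f pi is exactly 0
```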

The paper concludes with an interesting notion that

“we suggest the analysis of the relevant connection between importance sampling and contrastive learning Gutmann and Hyvärinen (2012)”

that I also have been pointing out for a while. All in all, a useful summing-up that I will likely suggest to my students.