How to choose summary statistics

Many simulators produce high-dimensional outputs, for example a time series or an image. In the tutorial 04_embedding_networks, we discussed how a neural network can be used to learn summary statistics from such data. Another option is to hand-craft summary statistics. But how should one choose “good” summary statistics?
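As a minimal sketch of what hand-crafted summary statistics can look like for a time series, the function below reduces each trace to a handful of numbers. The function name `summarize_time_series` and the particular statistics (mean, standard deviation, maximum, lag-1 autocorrelation) are assumptions chosen for illustration, not a recommendation from this guide.

```python
import numpy as np


def summarize_time_series(x):
    """Reduce traces of shape (num_simulations, num_timesteps) to 4 numbers each."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    mean = x.mean(axis=1)
    std = x.std(axis=1)
    maximum = x.max(axis=1)
    # Lag-1 autocorrelation; the small constant avoids division by zero for flat traces.
    centered = x - mean[:, None]
    lag1 = (centered[:, :-1] * centered[:, 1:]).sum(axis=1) / (
        (centered**2).sum(axis=1) + 1e-12
    )
    return np.stack([mean, std, maximum, lag1], axis=1)


rng = np.random.default_rng(0)
traces = rng.normal(size=(100, 200))  # stand-in for simulator output
summaries = summarize_time_series(traces)
print(summaries.shape)  # (100, 4)
```

Which statistics are informative depends on the simulator, so a set like this is only a starting point.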

While there is no clear guide, we recommend the following workflow:

  • start with relatively few summary statistics and see if you can make inference work. Add more summary statistics after initial successes.

  • make sure that the summary statistics of the observation are covered by the summary statistics of the simulated data. To assess this, plot a histogram of each simulated summary statistic and mark the value of the observation on it (e.g., using the pairplot method). If, for some summary statistics, the observation is not covered (or lies at the very border, e.g. the MSE above), the trained neural network will struggle.

  • do not use an “error” (for example, a distance between simulated and observed data) as a summary statistic, an approach that is typically followed by inferring the posterior given error=0. This will almost never give good results.
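The coverage check from the second point can also be sketched numerically. This is a minimal example that assumes the summary statistics are already available as NumPy arrays. The helper `coverage_report` and its 1% tail threshold are hypothetical choices for illustration and are not part of sbi. A visual check with histograms or a pairplot is likely to remain the more informative diagnostic.

```python
import numpy as np


def coverage_report(simulated, observed, tail=0.01):
    """Flag summary statistics whose observed value lies in the tails of the simulated ones.

    simulated: array of shape (num_simulations, num_statistics)
    observed:  array of shape (num_statistics,)
    tail:      hypothetical threshold; fractions below `tail` or above 1 - `tail` are flagged
    """
    simulated = np.asarray(simulated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # Fraction of simulations whose statistic lies below the observed value.
    frac_below = (simulated < observed).mean(axis=0)
    poorly_covered = (frac_below < tail) | (frac_below > 1.0 - tail)
    return frac_below, poorly_covered


rng = np.random.default_rng(0)
simulated = rng.normal(size=(1000, 3))  # stand-in for simulated summary statistics
observed = np.array([0.0, 0.5, 10.0])   # third statistic lies far outside the simulated range

frac_below, poorly_covered = coverage_report(simulated, observed)
for i, (frac, flag) in enumerate(zip(frac_below, poorly_covered)):
    print(f"statistic {i}: fraction below observation = {frac:.3f}, poorly covered = {flag}")
```

Statistics flagged this way are the ones for which, following the point above, the trained neural network is likely to struggle.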