Running Master's Thesis
Description
Scientists want to discover governing equations from data in the presence of uncertainty, a task referred to as Symbolic Regression in the Machine Learning community. However, the space of mathematical expressions is too complex to be searched exhaustively, and a recent branch of research explores how to learn representations that capture the inherent structure of mathematical expressions and thereby ease the search problem. In this thesis, we want to leverage the representation learning capabilities of deep learning approaches [1, 2] while accounting for the inherent aleatoric and epistemic uncertainty with a truly Bayesian approach [4]. This combination (see Figure 1) is novel and a promising direction because Bayesian approaches naturally lend themselves to encoding prior knowledge through probabilistic priors. For example, environmental scientists want to characterize the retention of chemicals in soils through equations known as sorption isotherms. The structure of existing sorption isotherms in the literature can be used as a prior for symbolic regression.
Structured Latent Spaces: In particular, this thesis builds on the idea of a Grammar Variational Autoencoder (GVAE) [2], which learns a latent space for the inherent structure present in mathematical equations, as depicted in Figure 1, part 1. Here, the encoder and decoder networks are informed by (context-free) grammars so that only syntactically valid expressions are generated. To find an equation that explains a given set of data points, their approach fits a Gaussian Process (GP) over the latent space that estimates the root mean squared error (RMSE) of candidate equations on the data, and iterates to find its minimum.
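The grammar-informed decoding described above can be sketched as follows. This is a minimal illustration, not the implementation from [2] or from our repository: the toy grammar, the names RULES, CLOSING_RULE and decode, and the fallback used when the scores run out are all assumptions made for this example. The decoder network is replaced by a plain list of per-step rule scores; at each step, only rules whose left-hand side matches the leftmost open nonterminal are allowed.

```python
import random

# Toy context-free grammar (hypothetical, for illustration only).
# Each rule is (left-hand side, right-hand side symbols).
RULES = [
    ("S", ["S", "+", "T"]),  # 0
    ("S", ["T"]),            # 1
    ("T", ["T", "*", "F"]),  # 2
    ("T", ["F"]),            # 3
    ("F", ["(", "S", ")"]),  # 4
    ("F", ["x"]),            # 5
    ("F", ["c"]),            # 6
]
NONTERMINALS = {"S", "T", "F"}
# Assumption of this sketch: once the scores run out, every remaining
# nonterminal is closed with these short rules so decoding terminates.
CLOSING_RULE = {"S": 1, "T": 3, "F": 5}


def decode(logits):
    """Turn per-step rule scores into a syntactically valid expression.

    `logits` is a list of rows, one row per decoding step, with one score
    per grammar rule. Rules that do not match the leftmost open
    nonterminal are masked out before taking the argmax.
    """
    form = ["S"]  # current sentential form, expanded left to right
    step = 0
    while True:
        # Locate the leftmost nonterminal; stop when only terminals remain.
        pos = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        if pos is None:
            return " ".join(form)
        lhs = form[pos]
        if step < len(logits):
            # Mask: only rules with a matching left-hand side are candidates.
            allowed = [k for k, (head, _) in enumerate(RULES) if head == lhs]
            choice = max(allowed, key=lambda k: logits[step][k])
        else:
            choice = CLOSING_RULE[lhs]
        form[pos:pos + 1] = RULES[choice][1]
        step += 1


# Random scores stand in for the output of a decoder network.
random.seed(0)
scores = [[random.gauss(0.0, 1.0) for _ in RULES] for _ in range(8)]
print(decode(scores))
```

Because the mask is applied before the argmax, even a row that puts its highest score on a rule for the wrong nonterminal cannot produce an invalid expansion.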
Bayesian Reasoning: From a Bayesian perspective, scoring by RMSE can be related to a maximum likelihood solution. In contrast, our previous work on Regular Tree Priors [4] yields a way to pose a prior distribution over the space of equations. Encoding scientific knowledge with it (compare Figure 1, part 2) allows us to reason about the full posterior over equations that fit a given set of data points. To actually perform this inference in the latent space (compare Figure 1, part 3), we plan to apply a continuous sampler, Hamiltonian Monte Carlo (HMC), to GP estimates of the (unnormalized) posterior (instead of the RMSE), which is known in the literature as the GP-HMC algorithm [3]. This algorithm combines naturally with the existing GP estimates over the structured latent space from the previous section.
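The relation between RMSE and maximum likelihood can be made explicit under one additional assumption that is not stated above: i.i.d. Gaussian observation noise with fixed variance. The symbols below are introduced only for this illustration: e denotes a candidate equation with prediction function f_e, D = {(x_i, y_i)} is the data set of size N, and sigma is the assumed noise standard deviation.

```latex
\begin{align}
\mathrm{RMSE}(e)
  &= \sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - f_e(x_i)\bigr)^2}, \\
\log p(\mathcal{D} \mid e)
  &= -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \bigl(y_i - f_e(x_i)\bigr)^2
     - \frac{N}{2} \log\bigl(2\pi\sigma^2\bigr)
   = -\frac{N}{2\sigma^2}\, \mathrm{RMSE}(e)^2
     - \frac{N}{2} \log\bigl(2\pi\sigma^2\bigr), \\
\log p(e \mid \mathcal{D})
  &= \log p(\mathcal{D} \mid e) + \log p(e) - \log p(\mathcal{D}).
\end{align}
```

For fixed sigma, the log-likelihood is a decreasing function of the RMSE, so minimizing the RMSE yields the maximum likelihood equation. Adding the log-prior log p(e) and dropping the constant log p(D) gives the unnormalized log-posterior, which is the quantity a GP would estimate over the latent space in place of the RMSE.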
Summary of Work Packages
1. Get familiar with the related work on symbolic regression and Bayesian reasoning (see the references below as a starting point)
2. Understand our existing implementation of [1]
• https://github.com/TimPhillip/ac_grammar_vae.git
3. Replace the inference part with the GP-HMC algorithm [3], which requires a prior about equations:
(a) Express the prior about equations with a probabilistic context-free grammar
(b) Express the prior about equations with a regular tree prior [4] (optional step; implementation also available)
4. Use the scientific symbolic reasoning examples (+ prior knowledge) from [4] to test and benchmark the resulting method
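As an illustration of work package 3(a), the following minimal sketch shows how a probabilistic context-free grammar (PCFG) assigns a prior probability to an equation. The grammar, its rule probabilities, and the names PCFG, sample and log_prior are hypothetical and chosen only for this example; they are not taken from [4] or from the repository. The prior is defined over derivations (expression trees): the log-prior is the sum of the log-probabilities of the rules applied.

```python
import math
import random

# Hypothetical PCFG: each nonterminal maps to (probability, right-hand side)
# pairs. The probabilities are illustrative; favoring terminals keeps the
# expected expression size finite.
PCFG = {
    "E": [
        (0.2, ["E", "+", "E"]),  # rule 0
        (0.2, ["E", "*", "E"]),  # rule 1
        (0.3, ["x"]),            # rule 2
        (0.3, ["c"]),            # rule 3
    ],
}


def sample(rng, start="E"):
    """Sample a derivation by leftmost expansion.

    Returns the emitted tokens and the log-prior of the sampled derivation.
    """
    stack, tokens, log_p = [start], [], 0.0
    while stack:
        symbol = stack.pop()
        if symbol not in PCFG:  # terminal symbol: emit it
            tokens.append(symbol)
            continue
        rules = PCFG[symbol]
        prob, rhs = rng.choices(rules, weights=[p for p, _ in rules])[0]
        log_p += math.log(prob)
        stack.extend(reversed(rhs))  # leftmost symbol ends up on top
    return tokens, log_p


def log_prior(rule_indices, start="E"):
    """Log-prior of a leftmost derivation given as a list of rule indices.

    Assumes the indices describe a valid, complete derivation.
    """
    stack, log_p = [start], 0.0
    for idx in rule_indices:
        while stack[-1] not in PCFG:  # discard terminals already emitted
            stack.pop()
        prob, rhs = PCFG[stack.pop()][idx]
        log_p += math.log(prob)
        stack.extend(reversed(rhs))
    return log_p


rng = random.Random(0)
tokens, log_p = sample(rng)
print(" ".join(tokens), log_p)
# Derivation E -> E + E, then E -> x, then E -> c, i.e. the tree for "x + c".
print(log_prior([0, 2, 3]))
```

Since this toy grammar is ambiguous, the value returned is the prior of one expression tree, not of the flat token string. Adding such a log-prior to a log-likelihood gives the unnormalized log-posterior that the GP-HMC step in work package 3 needs.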
References
- [1] D. P. Kingma and M. Welling, Auto-encoding variational bayes, in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun, eds., 2014.
- [2] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato, Grammar variational autoencoder, in International conference on machine learning, PMLR, 2017, pp. 1945–1954.
- [3] C. E. Rasmussen, Gaussian processes to speed up hybrid monte carlo for expensive bayesian integrals, in Seventh Valencia international meeting, dedicated to Dennis V. Lindley, Oxford University Press, 2003, pp. 651–659.
- [4] T. Schneider, A. Totounferoush, W. Nowak, and S. Staab, Probabilistic regular tree priors for scientific symbolic reasoning, arXiv preprint arXiv:2306.08506, 2023.