Probabilistic programming overview

ericlippert · March 16, 2020, 5:35pm

User @osdotsystem asked me to say a little about probabilistic programming, which is my current area of interest in language design. I’ll do a very brief overview. For a very long (40 part!) blog series that gently explains the basics with lots of code samples in C#, see https://ericlippert.com/2019/01/31/fixing-random-part-1/.

I see probabilistic programming as having two main ideas, one more trivial and one more profound. The more trivial idea is simply the observation that the way we represent randomness in mainstream line-of-business programming languages lags behind the way we represent other important concepts such as function abstraction, sequences, nullability, futures, continuations and so on.

All the things I mentioned are monads and all have support in the type systems and language features of mainstream languages, but “probability distribution of T” is also a monad, and moreover, the “bind” operation of this monad is the application of a likelihood function to a probability distribution! That is, if we have a type representing a probability distribution of rainy days, and a likelihood function that takes a weather state and returns the probability of ice cream sales given that weather, then the bind operation on the monad gives us a probability distribution of ice cream sales.

Making these trivial improvements to type systems would be straightforward; we came up with abstractions for “sequence of T” and “nullable of T” and “task of T”; we can come up with a suitable type system abstraction of “probability distribution of T” and the language feature which use it.

The more profound problem that probabilistic programming languages try to solve is to build on that type system support to allow us to express probabilistic models and then infer what probability distribution the model represents. I think this is best illustrated with a simple example. What we want is the ability to represent a model like this:

We are going to mint a coin, but this mint produces unfair coins. We choose a random number between 0.0 and 1.0 based on some distribution – uniform, beta, whatever. This is called the prior distribution; it is what we believe to be true about the mint’s behaviour in general.
We mint a coin, but we do not know its fairness. We flip it ten times. It comes up heads eight times out of ten.
We now ask the question: based on these observations, what do we believe the true fairness is?

Just saying “we believe the true fairness is 0.8” ignores the prior. If the prior is that the mint almost always produces close-to-fair coins then it is more likely that we got a fair coin and it was just sheer luck that we got eight heads out of ten; that’s actually pretty likely for a fair coin. By knowing the prior and the observations we can do some math and determine a posterior distribution on the fairness of the coin, either an estimate or an exact distribution.

Why is solving the posterior distributions of coin flipping problems relevant to line-of-business programming languages? Of course we do not actually care about coin flips! What we care about is:

We have a prior on how much email is spam and what words appear in spam messages; given an observation of an actual email, what is the posterior distribution on it being spam?
We have a prior on how many user accounts are fake and what behaviours fake accounts engage in; given observations of new user behaviour, what is the posterior distribution on the user being fake?
We have a prior on how routers fail in a network. Given some observations of dropped packets, what is the posterior distribution of a particular router being faulty?

Let me end with a challenge for you that is unfortunately relevant. Suppose we have a disease that affects 1% of the population (our prior), and we have a test that is 99% accurate at detecting the disease. You are chosen at random from the population and given the test. It comes up positive; it says you have the disease. What is the probability that you have the disease? It is not 99%. What is it? This task is trivial for a probabilistic programming language, but even doctors who are trained to interpret tests get this very wrong all the time. You might be surprised at the true probability.

The basic idea here is to express a model of how the world works, input some observations, and produce a posterior distribution that tells us what we should think about the observed world. It’s a fascinating application of programming languages.

osdotsystem · March 16, 2020, 6:34pm

Great, now tell me, how do you apply it to making languages? What prior thing do you take into consideration? or does it mean you write functions to calculate the probalbilities? Thanks!

Edit: Thanks for post!

ericlippert · March 16, 2020, 7:41pm

It depends on a bunch of factors:

Are you attempting to integrate probabilistic features into an existing general-purpose language, or devising a new domain-specific language?
How does the inference work?

I give an example in my blog series where I work out the exact math for the simplest case: small categorical distributions. I show how to take advantage of LINQ syntax in C# to make the operators work, and then I show how you could add more targeted features to C# to make it look nicer for the user. However, in my series I never go into advanced inference techniques for more complex distributions where programming languages are concerned.

For a good discussion of those more advanced inference techniques see