DistML: A Shorthand Language for Probability Distributions

A July 2020 draft (marked Version 0.1) proposing a shorthand text syntax for writing probability distributions, building on the distribution syntax used in Guesstimate and Foretold. The author uses the placeholder name “DistML,” notes a live early prototype, and lays out motivation (portability, readability, brevity), open design questions (naming, scope, whether a separate Python parser is needed, plugin/extensibility), and a catalog of features: a list of continuous and discrete distributions, math functions, the “to” percentile-interval shorthand (e.g. “50 to 150”), order-of-magnitude suffixes, regular versus pointwise (“dot”) distribution operations, degenerate/point distributions, multimodal mixtures, variables, and externally parameterized functions. Later sections sketch many alternative syntaxes for transforms and pipelines, and discuss normalization and how summary statistics interact with sampling, with the author repeatedly flagging unresolved choices (one section is labeled “particularly messy”). The draft is a working specification rather than a finished standard, relies on several image-based figures for outputs, and closes with restored comments from Nuño Sempere and replies from Ozzie.

July 21, 2020

Ozzie Gooen

Version 0.1

Background & Motivation

The distribution syntax in Guesstimate seemed to be relatively successful in both Guesstimate and Foretold. We’re interested in improving this, formalizing it, and making it accessible to other platforms. It would be totally open source and have a fully permissive license (CC0, for instance).

You can play with an early version of the updated syntax here.

I’ve (Ozzie) been the main one figuring this out, but I’ve had a lot of help from several others, especially with the related ReasonML library.

For this document, I’ll refer to the syntax as “DistML.” This is a placeholder: I don’t want to use this name, and don’t want people to anchor on a different choice.

Having a short text syntax is useful for a bunch of reasons. It’s possible that UI editors will be more commonly used, but I think there will at least be some important cases where having a shorthand is preferred:

Reasons for a shorthand syntax

The shorthand is very flexible (unlike many non-code UI editors), so is good when flexibility is needed.
The shorthand is in plaintext, so is trivial to copy & paste between applications. It would be trivial for people to post their predictions on Twitter, Facebook, or in surveys, using this format.
The shorthand is relatively readable. If other apps exported to this format (where applicable), it would be relatively easy to understand as plain text. This is in comparison to X-Y coordinates or much longer snippets of code, which are quite messy.
The shorthand is very short, so is simple & fast to store and send via API.
Unlike writing PyMC3 code or Stan, this can be much simpler to use for simple descriptions of probability distributions.

One potential goal would be to advance, formalize, and standardize such a syntax. Ideally it would be easy to use in all Javascript (and later Python) applications.

I have a lot of uncertainty on the specifics, and am looking for other opinions and feedback. I’ve recently been working on the DistPlus library which helps support it in Javascript.

This syntax really represents a simple programming language or a DSL. That said, other common formats such as YAML, CSS, and SASS also have features of simple programming languages. Lua is often used for configuration in similar ways.

Key Properties

Dynamic

High-Level

Key Questions

1. Are others interested in making this (or similar) a small standard, something used in multiple applications by different application developers and similar?

Even if others don’t use it, I want to formalize it for the purposes of Foretold, Guesstimate, and future apps I work on. I’m curious who else may be interested.

2. What should the name be?

I’m quite ambivalent on the name. Please suggest other ideas if you have or come up with them.

Some options:

Distax (distribution + syntax)
DistLang
Guessdown

3. How advanced should we aim for it to be?

The more power that it has, the more expressive it can be, but this comes with additional complexity. For instance, it could aim for all of the functionality of a simple language like Lua.

4. Does it need a separate parser in Python?

We’re currently working on a Javascript implementation. It’s a fair amount of work to write this. It would be great to be able to call this from Python, but a complete Python implementation would be a fair bit of additional work.

Another option would be to support a translation of DistML formats to Python code equivalents.

4. Should we aim for it to be extensible with plugins, or to have all functionality out of the box?

There are some shorthands and/or sugar that may be better as adjustable features that get converted into the standard. For instance, the syntax “5 to 10” may be adjusted over time or configurable, so could be best not to be part of the standard. If a user writes this syntax, it will get converted to a string like, “lognormal(a,b)”, which is part of the standard.

Rather than “plugins” we could also have different options, like “DistML-core” vs. “DistML-quick” or similar, where “DistML-quick” would be a superset of “DistML-core” that has some extra shorthands or features, but can ideally be converted into “DistML-core”. This is similar to Markdown vs. Kramdown, though more about shorthands, than added features.

I also imagine in many cases apps will want additional functions and functionality, so that would be done separately. For example, in Guesstimate, you can type “=@Cities.NewYork.population” or similar to get the NYC population.

5. Should we aim to support mixtures of discrete and continuous distributions?

Our library currently supports this, but this does make several things more tricky.

Example:

mm(0,3,normal(5,1), [.4,.2,.3])

figure from DistML_ A shorthand language for probability distributions

This is a more generic format, but introduces a fair bit of complexity.

One nice thing is that these mixed distributions can be converted to fully continuous ones on the end of the client, using some assumptions, if needed.

6. Are there other things we should aim to standardize as well, or instead?

Simple Fundamentals

Common Distributions

Continuous

normal()
uniform()
lognormal()
beta()
exponential()
cauchy()
pareto()
triangular()
metalog()

Discrete

bernoulli()
binomial()
degenerate()

Note: the current tool doesn’t yet support metalog or the discrete distributions. Metalog seems a bit tricky, but doable, to add.

Functions

These functions are mostly inspired by the math.js library, which Guesstimate and Foretold initially used. We could also add the trigonometry functions easily enough.

Function	Notes
floor()	Converts continuous -> discrete
ceil()	Converts continuous -> discrete
log(x, [,base=10])
log10()
log2()
sqrt()
pdf()
inv()
cdf()
sample(x, [,n=1])
mean()
median()
mode()
percentiles(a, [percentiles])
std()
variance()
min()
max()

truncate()

truncateLessThan

truncateGreaterThan

Other names:

filterLessThan,

bounds(normal(5,2), {lower: 0, greater: 100})

Key Optional / Questionable Features

“To” syntax

This syntax is a quick way to write out a 90th percentile interval (a 90% confidence interval). It uses a lognormal distribution when the lower bound is above 0, and a normal distribution when it is at or below 0.

50 to 150

50to150 (not yet implemented)

figure from DistML_ A shorthand language for probability distributions
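As a rough illustration of how “50 to 150” could be converted to a named distribution, here is a minimal Python sketch. It assumes the two numbers are the 5th and 95th percentiles (a 90% interval) and that lognormal(a,b) takes the mean and standard deviation of the underlying normal; the draft does not fix either convention, and the helper name to_interval is hypothetical.

```python
import math
from statistics import NormalDist

# z-score of the 95th percentile of a standard normal (about 1.645).
Z_95 = NormalDist().inv_cdf(0.95)


def to_interval(low, high):
    """Hypothetical helper: map a 'low to high' 90% interval to (name, a, b).

    Assumption: low and high are the 5th and 95th percentiles.
    """
    if not low < high:
        raise ValueError("low must be less than high")
    if low > 0:
        # Lognormal: fit a normal to the logs of the two bounds.
        mu = (math.log(low) + math.log(high)) / 2
        sigma = (math.log(high) - math.log(low)) / (2 * Z_95)
        return ("lognormal", mu, sigma)
    # Lower bound at or below 0: fall back to a normal distribution.
    mean = (low + high) / 2
    std = (high - low) / (2 * Z_95)
    return ("normal", mean, std)


name, a, b = to_interval(50, 150)
print(f"{name}({a:.3f}, {b:.3f})")
```

Under these assumptions the 5th percentile of the resulting lognormal is exp(a - Z_95 * b), which recovers the lower bound of 50.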

Possible changes:

We may want to change this to an 80th percentile interval or smaller, or state this separately.
- It seems most people are overconfident, so a 90th percentile interval is wider than I expect most would really believe as the standard. I’d expect that they also don’t notice the exact percentiles, so they would give the same ranges if these numbers represented their 50th percentile intervals.
It could make sense to use a different distribution. In particular, the lognormal may not have a long enough tail.
- Maybe the Pearson or Metalog distributions, with a few defaults.
- Causal seems to use triangular distributions for the “to” syntax and give a few different options.
My guess is that “to” shouldn’t be part of any standard now, but instead converted to the standard notation. For example, “30 to 80” would be converted to “lognormal(a,b)”, where possible.

Orders of Magnitude

k,K -> thousand

m,M -> million

b,B -> billion

t,T -> trillion
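A minimal Python sketch of how these suffixes could be expanded before the rest of the expression is parsed. The function name parse_number and the exact number format accepted by the regular expression are assumptions for illustration, not part of the draft.

```python
import re

# Suffix multipliers from the list above; matching is case-insensitive.
SUFFIXES = {"k": 1e3, "m": 1e6, "b": 1e9, "t": 1e12}

# Assumed number format: optional sign, digits with an optional decimal point,
# then at most one suffix letter.
_NUMBER = re.compile(r"^\s*([-+]?\d*\.?\d+)\s*([kKmMbBtT]?)\s*$")


def parse_number(text):
    """Hypothetical helper: turn '3k' or '1.5M' into a float."""
    match = _NUMBER.match(text)
    if match is None:
        raise ValueError(f"not a number with an optional suffix: {text!r}")
    value, suffix = match.groups()
    return float(value) * SUFFIXES.get(suffix.lower(), 1.0)


print(parse_number("3k"))
print(parse_number("1.5M"))
```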

Possible changes:

We may not want this to be part of the standard, but rather convertible to the standard.
The lowercase letters are more correct to be used for other things, but in our use cases, those things don’t seem to be used much at all. (For instance, a lower case m should mean “milli”, not “million”).

Regular Distribution Operations

normal(5,1) * normal(10,2)

figure from DistML_ A shorthand language for probability distributions

normal(5,1) / normal(10,2)

normal(5,1) + normal(10,2)

normal(5,1)^normal(10,2)

normal(5,1) * normal(10,2)

log(normal(5,1),normal(10,2))

uniform(0,1) + uniform(0,1)

figure from DistML_ A shorthand language for probability distributions

These operations treat the two distributions as uncorrelated, and do the operations similar to how they would be done in Guesstimate (the functions act on the X axis, so to speak, instead of the Y axis). The most general way to handle this is with sampling.
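A minimal Python sketch of the sampling approach, assuming the two inputs are independent: draw one sample from each distribution, apply the operator to the pair, and repeat. The helper name combine_samples, the sample count, and the fixed seed are illustrative choices, not taken from the current implementation.

```python
import random
import statistics


def combine_samples(sample_a, sample_b, op, n=20_000, seed=0):
    """Apply op to independent draws from two samplers (acts on the X axis)."""
    rng = random.Random(seed)
    return [op(sample_a(rng), sample_b(rng)) for _ in range(n)]


# normal(5,1) * normal(10,2)
product = combine_samples(
    lambda rng: rng.gauss(5, 1),
    lambda rng: rng.gauss(10, 2),
    lambda a, b: a * b,
)

# uniform(0,1) + uniform(0,1)
total = combine_samples(
    lambda rng: rng.uniform(0, 1),
    lambda rng: rng.uniform(0, 1),
    lambda a, b: a + b,
)

# For independent inputs the mean of the product is the product of the means,
# so the first value should be close to 50 and the second close to 1.
print(statistics.mean(product))
print(statistics.mean(total))
```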

Pointwise Distribution Operations

normal(5,1) ./ normal(10,2)

normal(5,1) .+ normal(10,2)

figure from DistML_ A shorthand language for probability distributions

normal(5,1) .^ normal(10,2)

normal(5,1) .* normal(10,2)

figure from DistML_ A shorthand language for probability distributions

.log(normal(5,1),normal(10,2))

normal(5,1) .- normal(10,2)

This syntax performs dotwise combinations of distributions.
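To make the contrast with the regular operations concrete, here is a small Python sketch that evaluates two densities on a shared grid of x values and combines the y values at each point. The grid bounds, the grid size, and the helper names are assumptions for illustration.

```python
from statistics import NormalDist


def grid(lo, hi, n):
    """Evenly spaced x values from lo to hi, inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def pointwise(f, g, op, xs):
    """Combine two density functions on the Y axis at each x."""
    return [op(f(x), g(x)) for x in xs]


def integrate(ys, xs):
    """Trapezoid rule: total mass under the curve."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1)
    )


xs = grid(-10, 30, 4001)
a = NormalDist(5, 1).pdf
b = NormalDist(10, 2).pdf

summed = pointwise(a, b, lambda p, q: p + q, xs)      # normal(5,1) .+ normal(10,2)
multiplied = pointwise(a, b, lambda p, q: p * q, xs)  # normal(5,1) .* normal(10,2)

# Neither result integrates to 1, so both would need normalizing
# before they could be used as probability distributions.
print(integrate(summed, xs))
print(integrate(multiplied, xs))
```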

Possible changes

We may not want an infix at all to do this.
“.-” and “./” would need to be used carefully, as it’s possible that they would prevent the result from being a proper probability distribution.

Current Problems

We still haven’t figured out / decided the best way to do pointwise operations on mixtures of discrete and continuous distributions.
It’s not obvious how to handle floats. For instance, “normal(5,2) .* 5”. 5 could either mean a discrete distribution with mass at 5 (a degenerate distribution) like, x=5, or a line of y=5. My guess is that most people would assume the latter (the former would return a result that would rarely be useful), but this would break with other cases where it is used as a degenerate distribution.

Degenerate (point) Distributions

Degenerate distributions are distinct from dirac delta functions, though we can also call the ones we use dirac delta functions for simplicity.

Below, we use the multimodal syntax with 2 degenerate distributions at x=0 and x=3. This doesn’t require they use any syntax to convert them from floats to degenerate distributions, but we may want that later for specificity and consistency.

mm(0,3,normal(5,1), [.4,.2,.3])

figure from DistML_ A shorthand language for probability distributions

A different version could require a wrapping, like,

mm(delta(0),delta(3),normal(5,1), [.4,.2,.3])

We could require the use of the d() shorthand or other if this is common.

mm(d(0),d(3),normal(5,1), [.4,.2,.3])

We think that users will sometimes want floats to mean degenerate distributions, and sometimes to mean functions of y=n. It’s not clear if we should make assumptions for them, or leave things to be more explicit.

Multimodals / Mixtures

multimodal(normal(2, 1), uniform(5,8), [.2, .8])

mm(normal(2, 1), uniform(5,8), [.2, .8])

figure from DistML_ A shorthand language for probability distributions

mm(1, 2, normal(5,2), [.2, .8])

figure from DistML_ A shorthand language for probability distributions

This is a simple way to combine multiple distributions into a mixture, with the weights at the end. It’s been used quite heavily in Foretold inputs. One clear case has been when someone wants to assign most of the probability mass to one main distribution, but a few percent to a very wide distribution, “just in case.”
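A minimal sampling sketch of a mixture in Python. It assumes that bare floats are treated as point masses and that weights are relative (so they need not sum to 1); both are open questions in this draft. The name sample_mixture is hypothetical.

```python
import random
import statistics


def sample_mixture(components, weights, n=10_000, seed=0):
    """Draw n samples: pick a component by weight, then sample from it.

    Components are either callables taking an rng, or floats (point masses).
    """
    if len(components) != len(weights):
        raise ValueError("need exactly one weight per component")
    rng = random.Random(seed)
    # Wrap floats so every component can be called the same way.
    samplers = [c if callable(c) else (lambda rng, v=c: v) for c in components]
    chosen = rng.choices(samplers, weights=weights, k=n)
    return [draw(rng) for draw in chosen]


# mm(normal(2, 1), uniform(5,8), [.2, .8])
samples = sample_mixture(
    [lambda rng: rng.gauss(2, 1), lambda rng: rng.uniform(5, 8)],
    [0.2, 0.8],
)
# Expected mean is 0.2 * 2 + 0.8 * 6.5 = 5.6.
print(statistics.mean(samples))
```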

Possible changes:

We could rename it to “mixture”, with the shorthand of “mix” or “mixt”. This is in some ways more precise.
Having the “weights” be a distribution at the end is kind of awkward.

Variables

long_tail = 3 to 20000;
main_dist = 10 to 1000;
mm(main_dist, long_tail, [.9,.1])

For readability it would be nice to add simple variables.

Key Questions

Is this worth the complexity and potential expectations? It would ask that this notation use newlines.
Should we require variables to declare with the terms “var”, “let”, or “const”?
Should we allow for additional metadata, like name and/or description?
If there are multiple lines, can we assume the last one returns, or should we ask that users explicitly state “return”, as in “return mm(main_dist, long_tail, [.9,.1])”?

More Complicated Variable Possibilities

Accept external parameters, to act as functions.

This would also make it easier to pass in variables from other sources, or enable simple functions. For example, an input could take a time parameter t, which is the number of years since 0AD.

500*(normal(1.01, .01)^(t - 2020))
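One way to read this expression is as a function from t to a set of samples. The Python sketch below assumes sampling-based evaluation; the function name projected_value, the sample count, and the seed are illustrative.

```python
import random
import statistics


def projected_value(t, n=10_000, seed=0):
    """Sample 500 * (normal(1.01, .01) ^ (t - 2020)) for an external parameter t."""
    rng = random.Random(seed)
    return [500 * rng.gauss(1.01, 0.01) ** (t - 2020) for _ in range(n)]


# At t = 2020 the exponent is 0, so every sample is exactly 500.
print(statistics.median(projected_value(2020)))

# Ten years out, the median should sit near 500 * 1.01 ** 10.
print(statistics.median(projected_value(2030)))
```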

Accept external metadata

We could allow parameters to accept hashes with additional metadata, like in the following.

long_tail = {value: 3 to 20000, name: "Long Tail", description: "I think there's a lot of uncertainty on the total time"};
main_dist = 10 to 1000;
mm(main_dist, long_tail, [.9,.1])

Non-Distribution Functions

This takes a normal distribution and multiplies it pointwise by the simple equation (y=x^2).

normal(5,2) .* (y=x^2)

This function multiplies all of the normal distribution by 3 along the y axis.

normal(5,2) .* (y=3)

This could introduce a fair bit of complexity, but could also allow a much broader class of potential distributions. Some would not be computationally tractable, however, or not even proper distributions.

It’s not obvious if this format would make sense or could be parsed correctly and easily.

A different syntax for this could be something like

transform(normal(5,2), x => x^2)

transform(normal(5,2), y => y*3)

transform(normal(5,2), (x,y) => (x^2,y*3))

Other quick experiments:

transform(normal(5,2), {x,y} => {x:x*1x,y})

normal(5,2) |> ({x,y}) => {x: x+1, y: y}

normal(5,2) |> xmap(x => x*2)

normal(5,2) |> ymap(x => x + normal(2,1))

normal(5,2) |> ymap(y => y + pdf(normal(2,1),y))

yCombine(normal(5,2), 4, (y1, y2) => y1 + y2)

xCombine(normal(5,2), normal(5,1), (x1, x2) => x1 + x2, {correlation: 0.0});

normal(5,2) * 2

normal(5,2) .* 2

normal(5,2) |> ymap(*2)

Some other options / sketches:

f(normal(5,2), (x,y) => (x^2,y*3))

((x,y) => (x^2,y*3))(normal(5,2))

toNormal = r => {value: normal(r,2), name: “My Normal Distribution”}

rTransform = r .* normal(5,2)

{t} => t |> toNormal |> rTransform |> from(0,2) |> normalize

Explicitness vs. simplicity with distributions

Having a shorthand like this presents a tradeoff between explicitness and conciseness. Finding a reasonable medium is challenging.

Regular operations vs. pointwise operations, and floats

normal(5,2) * 3

This gets converted to:

normal(5,2) * delta(3)

normal(5,2) .* 3

This gets converted to something like:

normal(5,2) .* (y=3)

The reason for that is that the alternative methods seem quite unusual. For the latter,

normal(5,2) .* delta(3)

Would return a distribution of delta(3), which seems like an unusual thing to be interested in.

That said, it should be possible to make this explicit by writing out,

normal(5,2) .* delta(3)

If that is what is desired.

Normalization

A function like normal(5,2) .* normal(10,3) is not normalized. If it’s being submitted as a prediction, it is assumed it will be normalized in the end.

There can also be a normalize() function when users want to make this explicit, or want to normalize any part of the function.

Normalization & Multimodals

We can assume that all terms in multimodals should get auto normalized before they get scaled and pointwise-added. Maybe there could be an optional third parameter to not auto normalize.

This is to make sure users can write,

mm(normal(5,2), normal(5,2) .* normal(10,3), [.5,.5])

And have the first part get 50% of the mass, and the second get 50% of the mass. If the subparts were not normalized first, the second function would get significantly less.

This is equivalent to saying that the weights at the end represent the weights out of the total probability mass, instead of weights to pointwise multiply each term by.
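A grid-based Python sketch of this rule: each term is normalized to unit mass before it is scaled by its weight and pointwise-added. The helper names (trapezoid, normalize, mm_density) and the grid are assumptions for illustration only.

```python
from statistics import NormalDist


def trapezoid(ys, xs):
    """Total mass under a curve sampled on the grid xs."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1)
    )


def normalize(ys, xs):
    """Rescale a curve so that it integrates to 1."""
    total = trapezoid(ys, xs)
    return [y / total for y in ys]


def mm_density(terms, weights, xs):
    """Normalize each term, scale it by its share of the weights, then add pointwise."""
    total_weight = sum(weights)
    scaled = [
        [w / total_weight * y for y in normalize(term, xs)]
        for term, w in zip(terms, weights)
    ]
    return [sum(column) for column in zip(*scaled)]


xs = [-15 + i * 0.025 for i in range(2001)]  # grid from -15 to 35
first = [NormalDist(5, 2).pdf(x) for x in xs]
second = [NormalDist(5, 2).pdf(x) * NormalDist(10, 3).pdf(x) for x in xs]

mixed = mm_density([first, second], [0.5, 0.5], xs)
print(trapezoid(second, xs))  # raw mass of the pointwise product, far below 1
print(trapezoid(mixed, xs))   # total mass of the mixture after normalization
```

Without the normalize step, the second term would contribute only its small raw mass times 0.5 rather than half of the total.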

Random sampling decisions, and summary statistics

Note: this section is particularly messy right now.

1. normal(5,2)
2. normal(normal(10,3), 2)
3. normal(mean(normal(10,3)), 2)

Normal distributions take in two floats. If one of the inputs is a summary statistic of a distribution (as in #3), then this would ideally be calculated as such. In this case, the mean of a normal distribution could be solved symbolically, so hopefully it would be found to be 10.

The case of #2 is less obvious. We assume here that it means that we should sample from the inside distribution, normal(10,3), and for each sample, we then sample from the outside distribution, normal(normal(10,3), 2).
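A minimal Python sketch of this reading of normal(normal(10,3), 2): for each sample, first draw a mean from the inner distribution, then draw from a normal centered on that mean. The function name, sample count, and seed are illustrative.

```python
import random
import statistics


def sample_nested_normal(n=20_000, seed=0):
    """Sample normal(normal(10,3), 2) by drawing the inner mean first."""
    rng = random.Random(seed)
    return [rng.gauss(rng.gauss(10, 3), 2) for _ in range(n)]


samples = sample_nested_normal()
# The variances add (9 + 4 = 13), so the spread should be near sqrt(13).
print(statistics.mean(samples), statistics.stdev(samples))
```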

Here’s another interesting case:

normal(4,2) + normal(std(normal(1,2) + uniform(1,3)),2)

Every time there’s a summary statistic of a distribution, we resolve that function before continuing with sampling.

To solve this, we first calculate std(normal(1,2) + uniform(1,3)), then use that for the other calculations.

Use sampling for normal(1,2) + uniform(1,3).
Convert those samples into a shape using kernel density estimation or similar.
Take the standard deviation of that shape. (Here, assume it’s .5, for simplicity)
Use sampling to calculate the function, normal(4,2) + normal(.5,2)
Convert those samples into a shape.
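The steps above can be sketched in Python as follows. This is a simplification under stated assumptions: it takes the standard deviation directly from the samples rather than first converting them into a shape with kernel density estimation, and it returns raw samples instead of a final shape. The function name is hypothetical.

```python
import random
import statistics


def resolve_then_sample(n=20_000, seed=0):
    """Evaluate normal(4,2) + normal(std(normal(1,2) + uniform(1,3)), 2)."""
    rng = random.Random(seed)

    # Step 1: sample the inner expression normal(1,2) + uniform(1,3).
    inner = [rng.gauss(1, 2) + rng.uniform(1, 3) for _ in range(n)]

    # Steps 2-3 (simplified): resolve the summary statistic to a single float.
    inner_std = statistics.stdev(inner)

    # Step 4: sample the outer expression with the resolved float plugged in.
    outer = [rng.gauss(4, 2) + rng.gauss(inner_std, 2) for _ in range(n)]
    return inner_std, outer


inner_std, outer = resolve_then_sample()
print(inner_std)
print(statistics.mean(outer))
```

The key design point is that inner_std is computed once, before any outer sampling begins, rather than being re-estimated per sample.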

There can also be summary statistics of arrays or sets of items. These don’t work the same way.

Take, mean(normal(5,2), uniform(3,2)).

Maybe in the future we could have a special syntax like {} for the portion that should be done using sampling in one go. Therefore, to make things simple,

normal(4,2) + normal(mean(normal(1,2) + uniform(1,3)),2)

Would be converted to:

{normal(4,2) + normal(mean({normal(1,2) + uniform(1,3)}),2)}

If brackets are only applied around the entire function, as in,

{normal(4,2) + normal(mean(normal(1,2) + uniform(1,3)),2)},

Then it will work differently; the mean will be applied to samples of the subdistributions.

The {} is kind of messy as it’s also used for hashes in js, and other parameters we might want to use. It’s not clear what notation would be preferable at this point.

Comments from Nuño Sempere

Restored with permission (Nuño’s comments, with Ozzie’s replies).

On “These operations treat the two distributions as uncorrelated, and do the operations similar to how they would be done in Guesstimate (the functions act on the…”:

Nuño Sempere: I understand what this means, but it could be explained better.

[a comment by another collaborator omitted]

Ozzie Gooen: Good point

On “.log(normal(5,1),normal(10,2))”:

Nuño Sempere: ?

Ozzie Gooen: .log is the “dot” equivalent for log, with a particular base. The base would be the value of the second distribution.

On “;”:

Nuño Sempere: do you really want to add ; to the end of each line except the last?

Ozzie Gooen: I’d be fine making this optional. It can be nice to allow people to have multiple statements on the same line.

On “not even proper distributions”:

Nuño Sempere: What would be outputted here? In reasonML this would be a Some(x)/None variable.

Nuño Sempere: Also, in that case, maybe give a function to normalize the result of an operation.

Ozzie Gooen: I imagine that it would basically return one of a few types: [ | NormalizedDistribution(d) | Distribution(d) | Function(f) | Float | `Error(r)| ] Figuring this out is also important, of course.