Generating Random Numbers for Statistics with PostgreSQL

by Christoph Schiessl on PostgreSQL

When working with statistics, generating pseudo-random samples distributed according to a given probability distribution is a frequent requirement. For me, generating these numbers directly in PostgreSQL is very convenient and improves performance in some cases. In this article, we'll look at some of the most important continuous probability distributions and implement them as SQL functions.

Uniform distribution

Uniform distributions are easy because we can use PostgreSQL's built-in RANDOM() function. RANDOM() returns uniformly distributed DOUBLE PRECISION numbers in the 0.0 <= x < 1.0 range. Adjusting that range is simply a question of adding and multiplying with appropriate values:

-- random double precision value in the range 1.0 <= x < 20.0
SELECT 1 + 19 * RANDOM();

That being said, it is easy to encapsulate that behavior in an SQL function for later reuse:

-- generates `count` uniformly distributed random numbers in the range `min <= x < max`
CREATE OR REPLACE FUNCTION random_uniform(
    count INTEGER DEFAULT 1,
    min DOUBLE PRECISION DEFAULT 0.0,
    max DOUBLE PRECISION DEFAULT 1.0
    ) RETURNS SETOF DOUBLE PRECISION
      RETURNS NULL ON NULL INPUT AS $$
        BEGIN
            RETURN QUERY SELECT min + (max - min) * RANDOM()
                         FROM GENERATE_SERIES(1, count);
        END;
    $$ LANGUAGE plpgsql;

The function's relative frequency-histogram (based on one million values):

Histogram for Uniform Distribution

Relative frequencies have been obtained as follows: SELECT c, COUNT(r) / 1000000.0 FROM GENERATE_SERIES(0,49) AS c INNER JOIN random_uniform(1000000, 0, 50) AS r ON c = FLOOR(r) GROUP BY c ORDER BY c;

Exponential distribution

RANDOM() is the only way of generating random numbers supported by PostgreSQL. Therefore, the exponential distribution is more challenging because we must somehow derive these numbers from uniformly distributed numbers.

Since the cumulative distribution function of the exponential distribution is invertible, we can use a technique known as inverse transform sampling. All of this boils down to a simple mathematical expression that we can encapsulate in another SQL function:

-- generates `count` exponentially distributed random numbers with a given expected value
CREATE OR REPLACE FUNCTION random_exp(
    count INTEGER DEFAULT 1,
    mean DOUBLE PRECISION DEFAULT 1.0
    ) RETURNS SETOF DOUBLE PRECISION
      RETURNS NULL ON NULL INPUT AS $$
        DECLARE
            u DOUBLE PRECISION;
        BEGIN
            WHILE count > 0 LOOP
                u = RANDOM(); -- range: 0.0 <= u < 1.0
                IF u != 0.0 THEN
                    RETURN NEXT -LN(u) * mean;
                    count = count - 1;
                END IF;
            END LOOP;
        END;
    $$ LANGUAGE plpgsql;

The function's relative frequency-histogram (based on one million values):

histogram for exponential distribution

Relative frequencies have been obtained as follows: SELECT c, COUNT(r) / 1000000.0 FROM GENERATE_SERIES(0,49) AS c INNER JOIN random_exp(1000000, 5) AS r ON c = FLOOR(r) GROUP BY c ORDER BY c;

Normal distribution

The normal distribution is probably the most important of all continuous distributions.

We must again derive them from uniformly distributed numbers to generate normally distributed numbers. Fortunately, mathematicians have discovered many ways of doing exactly that: Central limit theorem, Box-Muller transform, Marsaglia polar method, ...

For now, I've decided to implement the Marsaglia polar method because it's reasonably simple to implement in SQL:

-- generates `count` normally distributed random numbers with a given mean and standard deviation
CREATE OR REPLACE FUNCTION random_normal(
    count INTEGER DEFAULT 1,
    mean DOUBLE PRECISION DEFAULT 0.0,
    stddev DOUBLE PRECISION DEFAULT 1.0
    ) RETURNS SETOF DOUBLE PRECISION
      RETURNS NULL ON NULL INPUT AS $$
        DECLARE
            u DOUBLE PRECISION;
            v DOUBLE PRECISION;
            s DOUBLE PRECISION;
        BEGIN
            WHILE count > 0 LOOP
                u = RANDOM() * 2 - 1; -- range: -1.0 <= u < 1.0
                v = RANDOM() * 2 - 1; -- range: -1.0 <= v < 1.0
                s = u^2 + v^2;

                IF s != 0.0 AND s < 1.0 THEN
                    s = SQRT(-2 * LN(s) / s);

                    RETURN NEXT mean + stddev * s * u;
                    count = count - 1;

                    IF count > 0 THEN
                        RETURN NEXT mean + stddev * s * v;
                        count = count - 1;
                    END IF;
                END IF;
            END LOOP;
        END;
    $$ LANGUAGE plpgsql;

The function's relative frequency-histogram (based on one million values) clearly shows that the generated numbers are indeed normally distributed (with an expected value of 25 and a standard deviation of 7):

Histogram for Normal Distribution

Relative frequencies have been obtained as follows: SELECT c, COUNT(r) / 1000000.0 FROM GENERATE_SERIES(0,49) AS c INNER JOIN random_normal(1000000, 25, 7) AS r ON c = FLOOR(r) GROUP BY c ORDER BY c;

That's it for today! Thank you for reading, and see you soon!

Christoph Schiessl

Hi, I'm Christoph Schiessl.

I help you build robust and fast Web Applications.


I'm available for hire as a freelance web developer, so you can take advantage of my more than a decade of experience working on many projects across several industries. Most of my clients are building web-based SaaS applications in a B2B context and depend on my expertise in various capacities.

More often than not, my involvement includes hands-on development work using technologies like Python, JavaScript, and PostgreSQL. Furthermore, if you already have an established team, I can support you as a technical product manager with a passion for simplifying complex processes. Lastly, I'm an avid writer and educator who takes pride in breaking technical concepts down into the simplest possible terms.

Continue Reading?

Here are a few more Articles for you ...


The Built-In all() Function

Learn how to use the built-in all() function in Python for boolean logic, with examples and different implementations.

By Christoph Schiessl on Python

The Built-In chr() and ord() Functions

Discover Python's built-in functions chr() and ord() for handling Unicode characters and converting between integers and characters.

By Christoph Schiessl on Python

How to <link> your Blog's Atom/RSS Feed from HTML Pages

Learn how to <link> Atom and RSS feeds from your HTML documents to make them discoverable for clients and, by extension, for your readers.

By Christoph Schiessl

Web App Reverse Checklist

Ready to Build Your Next Web App?

Get my Web App Reverse Checklist first ...


Software Engineering is often driven by fashion, but swimming with the current is rarely the best choice. In addition to knowing what to do, it's equally important to know what not to do. And this is precisely what my free Web App Reverse Checklist will help you with.

Subscribe below to get your free copy of my Reverse Checklist delivered to your inbox. Afterward, you can expect one weekly email on building resilient Web Applications using Python, JavaScript, and PostgreSQL.

By the way, it goes without saying that I'm not sharing your email address with anyone, and you're free to unsubscribe at any time. No spam. No commitments. No questions asked.