Sunday, August 16, 2015

Implementing John Conway's Game of Life in Microsoft SQL Server


I may be carbon-dating myself here, but one of the first things I implemented on my Apple IIe in the early 80s was John Conway's Game of Life. If you're not familiar with it, it's a mathematical game that simulates cellular automata with fixed rules. There's an infinite square grid containing spaces for cells. Every generation, cells can be born or they can die, according to very simple rules. Despite the simple rules, very complex structures can be created. See the wiki page for more information.

All the code here can be found on github: https://github.com/pauljchang/sandbox/blob/master/life/life.sql



The rules for the game are:

  • Any empty space with exactly three neighbours gives "birth" to a new cell.
  • Any cell with 1 or 0 neighbours "dies" from starvation and is removed.
  • Any cell with 4 or more neighbours "dies" from overcrowding and is removed.

For fun, I've implemented this in BASIC and C++, and maybe even Java, but I thought I'd share my MSSQL implementation, as it uses some interesting concepts.

First of all, instead of using an array, we are going to create a table that contains one row for every cell that is in the field. This may seem overkill, but for very sparse fields where there are just a few cells occupying a very large area, this can be surprisingly efficient, and more efficient than a very large, two-dimensional array.

So, let's see the table:

-- 
-- dbo.cell -- Table to hold all cells
-- 
-- Each cell has an entry in the dbo.cell table
-- It knows its (x,y) coordinates, as well as the generation number
-- The generation number is useful for diagnostics
-- and indicates how "old" a cell is
-- 
CREATE TABLE dbo.cell (
    x   SMALLINT NOT NULL
,   y   SMALLINT NOT NULL
,   gen SMALLINT NOT NULL
);
CREATE UNIQUE CLUSTERED INDEX KUX_cell    ON dbo.cell (y, x);
CREATE UNIQUE           INDEX UX_cell_x_y ON dbo.cell (x, y) INCLUDE (gen);
CREATE                  INDEX X_cell_gen  ON dbo.cell (gen, y, x);

You might notice that I clustered the table along the y-axis, which will facilitate printing out the field for display. I also included other indices that will help later on as we step through generations of cells.

There is also a "gen" field for the generation of the cell. This is interesting information, as to the age of the cells, and also helps us distinguish between newly "born" cells and cells that were already there.

Let's also seed the field with the classic R-Pentamino shape. This happens to be one of the smallest patterns that grows into a huge shape that doesn't stabilise for many generations. Here's what we want to seed:

   +---+
 1 | 00|
 0 |00 |
-1 | 0 |
   +---+

The numbers on the left are the y-axis, and I didn't think it was necessary to display the x-axis. Here's the code to seed the table:

INSERT INTO dbo.cell (x, y, gen)
VALUES
    ( 0,  1, 0)
,   ( 1,  1, 0)
,   (-1,  0, 0)
,   ( 0,  0, 0)
,   ( 0, -1, 0)
;

In order to step through, there are two calculations we must make. One, we have to see if any new cells will be born with the existing cells on the field. And two, we have to see if any of the existing cells (not the newly born cells) will die.

To do this, I use CTEs (Common Table Expressions), which are like temporary queries or views that live only for the lifespan of a query, but can be referenced by the query.

The first CTE I create is "delta", which allows me to look at neighbouring cells or spaces in all eight directions. Just one delta CTE can be referenced multiple times, which is very convenient.

WITH delta (num) AS (
              SELECT CAST(-1 AS SMALLINT) AS num
    UNION ALL SELECT CAST( 0 AS SMALLINT) AS num
    UNION ALL SELECT CAST( 1 AS SMALLINT) AS num
)
...

The next CTE is for generating a set of neighbouring empty spaces. This is based on the current set of cells, but looking in left, right, up, down, and diagonally for empty spaces. Here, we can see how "delta" is used. It's a pretty simple query -- return distinct neighbouring spaces that don't already have a cell.

...
,   empty_neighbour (x, y) AS (
        SELECT DISTINCT
            dbo.cell.x + delta_x.num AS x
        ,   dbo.cell.y + delta_y.num AS y
        FROM
            dbo.cell
            CROSS JOIN delta AS delta_x
            CROSS JOIN delta AS delta_y
        WHERE
            NOT EXISTS (
                SELECT *
                FROM
                    dbo.cell AS other_cell
                WHERE
                    other_cell.x = dbo.cell.x + delta_x.num
                AND other_cell.y = dbo.cell.y + delta_y.num
            )
    )
...

And another CTE, but this one counts how many neighbouring cells exist for each of those empty spaces. Again, we use "delta" to look in eight directions to count neighbouring cells.

...
,   neighbour_count (x, y, neighbour_count) AS (
        SELECT
            empty_neighbour.x
        ,   empty_neighbour.y
        -- This expression is here to eliminate the silly NULL agregation warning
        -- Otherwise, we could just COUNT(other_cell.gen)
        ,   COALESCE(SUM(CASE WHEN other_cell.gen IS NOT NULL THEN 1 ELSE 0 END), 0) AS neighbour_count
        FROM
            empty_neighbour
            CROSS JOIN delta AS delta_x
            CROSS JOIN delta AS delta_y
            LEFT JOIN dbo.cell AS other_cell
            ON  other_cell.x = empty_neighbour.x + delta_x.num
            AND other_cell.y = empty_neighbour.y + delta_y.num
        GROUP BY
            empty_neighbour.x
        ,   empty_neighbour.y
    )
...

I have a funny COALESCE(SUM(CASE...)) statement because I'm trying to avoid the annoying aggregation of NULL values warning. Just think of it as counting the number of neighbours.

And finally, the INSERT. We insert with the current generation number to distinguish the newly "born" cells from older cells. This will be useful in the DELETE query.


...
INSERT INTO dbo.cell (x, y, gen)
SELECT
    neighbour_count.x
,   neighbour_count.y
,   @gen AS gen
FROM
    neighbour_count
WHERE
    neighbour_count.neighbour_count = 3
ORDER BY
    neighbour_count.y
,   neighbour_count.x
;
...

So with the newly inserted cells, our field now looks like this:

   +---+
 1 |100|
 0 |00 |
-1 |10 |
   +---+

The "1" cells are the ones that we just inserted, based on the rule of empty spaces with three surrounding cells. But we still need to delete cells that are overcrowded or starved. In this case, the centre cell should be removed because it has four neighbours (not including the newly "born" ones).

Here, we merely count the neighbours of existing cells, minus newly "born" ones. Again, we use the "delta" CTE.

...
,   neighbour_count (x, y, neighbour_count) AS (
        SELECT
            dbo.cell.x
        ,   dbo.cell.y
        -- This expression is here to eliminate the silly NULL agregation warning
        -- Otherwise, we could just COUNT(other_cell.gen)
        ,   COALESCE(SUM(CASE WHEN other_cell.gen IS NOT NULL THEN 1 ELSE 0 END), 0) AS neighbour_count
        FROM
            dbo.cell
            CROSS JOIN delta AS delta_x
            CROSS JOIN delta AS delta_y
            LEFT JOIN dbo.cell AS other_cell
            ON  other_cell.x = dbo.cell.x + delta_x.num
            AND other_cell.y = dbo.cell.y + delta_y.num
            -- Don't count the cells we just created
            AND other_cell.gen < @gen
            -- We don't want to count the cell itself, just neighbours
            AND (   other_cell.x <> dbo.cell.x
                OR  other_cell.y <> dbo.cell.y
                )
        WHERE
            -- Don't count the cells we just created
            dbo.cell.gen < @gen
        GROUP BY
            dbo.cell.x
        ,   dbo.cell.y
    )
...

...and the final delete:

...
DELETE
    dbo.cell
FROM
    dbo.cell
    INNER JOIN neighbour_count
    ON  neighbour_count.x = dbo.cell.x
    AND neighbour_count.y = dbo.cell.y
    AND (   neighbour_count.neighbour_count <= 1
        OR  neighbour_count.neighbour_count >= 4
        )
;
...

Now our field looks like the following:

   +---+
 1 |100|
 0 |0  |
-1 |10 |
   +---+

If we step forward a few generations, we see the pattern grow:

   +----+
 2 |  2 |
 1 | 10 |
 0 |2  2|
-1 | 10 |
   +----+

   +----+
 2 | 32 |
 1 | 103|
 0 |2  2|
-1 | 10 |
   +----+

   +----+
 2 | 3 4|
 1 |4  3|
 0 |2  2|
-1 | 10 |
   +----+

   +-----+
 2 |  5  |
 1 |45 35|
 0 |2  2 |
-1 | 10  |
   +-----+

And after 100 generations, we see it grows, and keeps growing:

    +--------------------------------------------------+
 12 |                                           98     |
 11 |                                          9  9    |
 10 |                                  58      8  8    |
  9 |                                  45       98     |
  8 |                      89                          |
  7 |    0                  70     45           00 9   |
  6 |   989                0       69           49 80  |
  5 |  9  70                                           |
  4 |  80 07                                     069 89|
  3 | 950                              980       08  80|
  2 |  09 0                    06     0 6 89    00076  |
  1 |   0  0                   38     8   5 0  7 90    |
  0 |                              0  0 6   8  9 8     |
 -1 |   9  9                       9   98 9 0   0      |
 -2 |    00                        0     089           |
 -3 |                                                  |
 -4 |  09 80                                           |
 -5 |   7                                              |
 -6 |07     0     71                                   |
 -7 |88     0     52                                   |
 -8 |0                                                 |
 -9 | 9    0                                           |
-10 | 0  8                                             |
-11 |  090                                             |
    +--------------------------------------------------+

The code on github contains two stored procedures -- "print_cells" to display the field, and "step_cells" to step one generation. It also contains code to set up the initial field with the R-Pentamino example above. The code is written for MSSQL, but I've tried to stay within ANSI SQL-92, so it should be reasonably easy to port this to other RDBMSs.

I'm reasonably happy with this code. Unlike array implementations, which slow down as the patterns grow larger and occupy more rectangular space, this implementation grows O(n), linearly with the number of cells, n.

Is there a better way to implement this in SQL? Of course, we all know that SQL is likely not the best way to implement this, but within the bounds of SQL, how else can we improve this?

6 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. I thought I had my own version, but there's an error somewhere. I'll track it down and repost! Sorry for messing up the comments section.

    ReplyDelete
  3. Let's try this again. The big differences are 1) The way I calculate the new generation *includes* self, and produces the entire board instead of the diff. 2) This entire board is then merged to perform the diff, and rows that already exist in the dbo.Cell table are ignored during the merge.

    My display routine is having a problem past nvarchar(4000) so I'll be fixing that later and posting it! In the meantime, you can adapt either my or your script and use your print routine.


    CREATE TABLE dbo.Cell (
    X smallint NOT NULL,
    Y smallint NOT NULL,
    Generation smallint NOT NULL
    );

    CREATE UNIQUE CLUSTERED INDEX CI_Cell ON dbo.Cell (Y, X);
    CREATE UNIQUE INDEX UQ_Cell_X_Y ON dbo.Cell (X, Y);

    INSERT dbo.Cell (X, Y, Generation) VALUES
    ( 0, -1, 0),
    ( 1, -1, 0),
    (-1, 0, 0),
    ( 0, 0, 0),
    ( 0, 1, 0)
    ;

    IF object_id('dbo.AdvanceLife', 'P') IS NULL EXEC ('CREATE PROCEDURE dbo.AdvanceLife AS RETURN;');
    GO
    ALTER PROCEDURE dbo.AdvanceLife
    AS
    MERGE
    dbo.Cell C
    USING
    (
    SELECT
    P.X,
    P.Y,
    Generation = Max(C.Generation) + 1
    FROM
    dbo.Cell C
    CROSS APPLY (
    SELECT
    C.X + X.X,
    C.Y + Y.Y,
    CASE WHEN X.X = 0 AND Y.Y = 0 THEN 1 ELSE 0 END
    FROM
    (VALUES (-1), (0), (1)) X (X)
    CROSS JOIN (VALUES (-1), (0), (1)) Y (Y)
    ) P (X, Y, IsSelf)
    GROUP BY
    P.X,
    P.Y
    HAVING
    Count(*) * 2 - Max(IsSelf) BETWEEN 5 AND 7
    ) N
    ON C.X = N.X
    AND C.Y = N.Y
    WHEN NOT MATCHED BY SOURCE THEN
    DELETE
    WHEN NOT MATCHED BY TARGET THEN
    INSERT (X, Y, Generation)
    VALUES (N.X, N.Y, N.Generation)
    ;
    GO

    ReplyDelete
    Replies
    1. Hi, Erik,

      This is great! I should have thought about an "UPSERT" like this! Honestly, I haven't used code like that, other than very specific ETL code, but I'm glad you thought of it! Thanks for reading the blog and contributing!

      Delete
  4. This comment has been removed by the author.

    ReplyDelete