Or, how to optimise your code so it runs better…

Intro

A while back I started learning Z80 assembly language and decided to create a version of Conway’s Life. The code I wrote worked, but it wasn’t that efficient with the simulation being measured in seconds per frame rather than frames per second.

This is an explanation of what I did to make it run faster. Well it’s not, it’s actually an explanation of how to optimise code and my Life code is an example.

Watch the video!

Grab the code!

The source code and compiled binary can be obtained from my Github Repo here. Feel free to copy it and make modifications. If you fix or improve it, let me know!

Don’t prematurely optimise

Make it work first

It’s more important to have a working algorithm than one that’s perfect. Optimisation tends to create less readable code and less readable code is harder to debug. Make the code work first so that while optimising it you can more easily spot things that break.

Make it work better later

Once you have a working algorithm, maybe it needs rewriting as you now understand the problem better. It might feel like you wasted time making the first version, but without it you wouldn’t be here. Programming is an iterative process and every program is different and full of unique challenges that get very implementation specific.

You can’t polish a turd

but you can roll it in glitter to make it look prettier. It’s still a turd though.

A bad algorithm will always be a bad algorithm no matter how many clever tricks you use. In my case I went about writing Life as if it were a 2D grid because that’s how I was visualising it. Computers don’t store data as 2D grids, to them it’s all just one long list of data. Accessing data as a grid requres X,Y co-ordinates and that involves multiplication.

The Z80 can’t do multiplication by itself, you have to write code and that code involves loops. Without realising I was trading off readability of my code with execution speed. Nobody but me reads my code, but others will run it.

What I did wasn’t wrong or bad practise though. It’s far more important to make a working program than a fast one to begin with. People will notice a slow algorithm but put up with it however if the program doesn’t work properly they’ll soon complain.

Other people can help

I sort of knew that I needed to change the algorithm but couldn’t work out how. It wasn’t until someone left a comment on my video that I had an idea. They didn’t completely explain a better algorithm but they did give me a different way to tackle the problem. A way that didn’t involve any multiplication by the Z80.

Prototype, focus on the algorithm

Debugging code can be hard, you need to separate syntax and logic errors caused by trying to write the algorithm from the errors caused by programming the computer. Figure out the algorithm first - the general way you want to instruct the computer to perform your task. Don’t get bogged down in which variables or loops you need. Get the overall idea clear.

Once you have a working algorithm the process of programming the computer is much more straight forward.

I wrote a version of the code using Python on my PC where debugging is easier. It let me work out the general form of the algorithm separated from the specifics of instructing a Z80 CPU to do the task.

Don’t over generalise

We programmers love abstracting problems out to their most general form, it means two unrelated problems can often be solved with the same technique, or it makes something difficult easier to understand. “It’s just a 2D grid and we go around the grid looking at neighbouring squares” is easy to imagine and the same code can work for every square on the grid. Even those tricky corner and edge cases - we just need a bit of logic to recognise we’re in one of those situations.

Specific problems require their own specific solutions

That’s where my thinking was going wrong. I was trying to treat every grid on the board the same. Instead of doing this I should have been looking for patterns. The grid is split into nine distinct areas - amusingly the same nine areas a GUI window or layout is split into.

  • Four corners, each of which is different
  • Top and bottom rows
  • Left and right edges
  • The middle

The middle is the easiest place to code up and is what most Life implementations focus on. You take a square on the grid and then calculate the eight surrounding cells to work out how many neighbours exist. We’re still thinking in 2D here, but recognising the middle is different is an important step.

The four corners are special and each needs its own specific logic and there’s no way to generalise this. However it’s only four corners.

The edges also need their own specific logic, but this can be more generalised and applies to long runs of cells so there’s potential for more optimisation later.

Cheat if you can

The Z80 can’t multiply, but if we think about what I’m trying to achieve we realise it’s not necessary. The Life board is n x m cells in size and I choose that size based on the size of the terminal window. The user can’t change it.

Constants and the pre-processor are your friends

Since the number of cells never changes, I know it in advance. This means I can make use of constants when referring to the total number of cells. And since my assembler has a preprocessor that can do arithmetic using my PC, I can hard code many values.

For example the top-right cell is at index 79 in the list of cells because I made the grid 80x25 in size. This also means the bottom right corner is index (80 x 25)-1 which is multiplication, but if I make the assembler do it it can just insert 1999 in the code for me.

; Defines to avoid having to do any actual maths in the simulation
middleRowLength EQU ScreenWidth*(ScreenHeight-1)

CellA EQU 0
CellB EQU ScreenWidth-1
CellC EQU ScreenSize-ScreenWidth
CellD EQU ScreenSize-1

I used this everywhere. The code is full of hardcoded values and constants which is something we’re sort of taught not to do as the code shouldn’t be full of “magic numbers” that make things work. Except these numbers aren’t “magic” and they’re all labelled with descriptive names. It’s not really 80x25 it’s really ScreenWidth x ScreenHeight.

; Start at x-((width*2)-1) (3)
ld hl,de
ld bc,(ScreenWidth*2)-1    ; Assembler macro
sbc hl,bc
ld bc,(currentGrid)
add hl,bc   
ld a,(hl)   ; Add to A

Remove redundant code

After getting a decent algorithm sorted out, and code that works you can now start to look at ways to simplify the code. Algorithms are quite high level ways of thinking, and turning them into working code can often involve more steps. The Z80 has limited registers and deals with 8 bit numbers natively. I needed a 16 bit counter but there aren’t any built in instructions for comparing 16 bit numbers so I needed to write my own.

These types of specific situations are where you can optimise further.

Reorganise the instructions

Sometimes changing the order of instructions can make it easier to compute. When calculating the neighbours in the grid my original 2D thinking had me working from the top left neighbour and then across each row, calculating the index each time based on the centre.

After typing this out several times it soon became obvious to me that a better way would be to compute the top left neighbour and then simply incrementally move through the list to reach each neighbour. Some calculation is required to find the top left neighbour but every one after that is merely incrementing our position in the list or adding on a constant number of cells to reach the next row.

ld hl,de
; Get contents of Cell x-width-1
ld bc,ScreenWidth
sbc hl,bc   ; x-width
dec hl      ; -1
ld bc,(currentGrid)
add hl,bc : ld a,(hl)  ; Add to A

; Get contents of Cell x-width
inc hl : add a,(hl)   ; Add to A

; Get contents of Cell x-width+1
inc hl : add a,(hl)   ; Add to A

Doing this not only made the code easier to follow, but I removed hundreds of instructions in the process.

Apply the dark arts last

There does become a point where your algorithm is optimised but there’s still room for improvement. This is where the dark and dirty tricks specific to your system can be used. Tread carefully, the performance gains made here are likely to be small and are best used when trying to hit a specific code size or tight execution limits.

Optimisations of this nature are prone to undoing themselves or being blown away by new code that’s added later.

Instructions have side effects

One common task in my code was to accumulate a count of now many live neighbours each cell had. Each cell has eight neighbours. There are 2000 cells on the grid. This accumulation code runs 18,000 times per tick of the simulation.

One step in this is to clear out the accumulator and reset several CPU flags.

Originally I was using the fairly straight forward process of

ld a,0 ; load 0 into the accumulator
or a   ; this resets the carry flag

These two instructions take up 7 and 4 cycles respectively so 11 in total. This is nothing really, the Z80 has instructions that by themselves use 23 cycles.

However we can do better. The code after I clear the accumulator works out if the top left neighbour is alive and then adds 1 to the accumulator if it is, which is why it needs to be zero.

Except it doesn’t need to add because if this is the first value to go into the accumulator then I can directly load the result using ld a,1. The load instruction does take 7 cycles compared to the 4 of an add however before using add I first had to reset the accumulator to zero by loading 0 into it. So I can save the 4 cycles of the add instruction by removing it entirely.

Now there is an even more shorter way of doing this. It turns out xor a will both reset the accumulator to zero and clear the flags using only four cycles. So I could use this with the original add instruction and save three more cycles.

Assembly programming is full of these kinds of thoughts, at the start cycle counting sounds like some mystical dark art that few comprehend. After a while it becomes just another thing to think about and consider.

Don’t over optimise

While running my code I noticed odd behaviour cropping up when the simulation was running. I got what could only be described as screen tearing, presumably because the serial link between the RC2014 and my PC couldn’t keep up with the data being generated and something was overflowing.

There is a point in optimisation where the benefits gained are either so small it becomes an academic exercise, or other bottlenecks appear - some of which can’t be overcome.

Summary

Overall I managed to take code that was very inefficient and slow and re-write parts of it so that they now ran “too quickly”. This gave me some spare cycle time in which I could add a nice border to the screen.

What’s pleasing to see is that the simulation runs in constant time regardless of how many live or dead cells exist. A full screen of cells doesn’t slow the simulation down because every cell is being computed regardless of whether it is alive or not.

I’m sure there’s an even better algorithm out there that takes this observation into account and only looks at the live cells. Again though, there is a point in this where if the simulation runs too quickly it becomes unwatchable and needs slowing down to be pleasant to look at.