Sunday, February 17, 2019

Space Saving Techniques in Phantomas 2.0

Although most of my recent games have been for the MSX, last year Jordi Sureda and myself started working together on a game for the Amstrad CPC. The game is a remake of the Dinamic classic Phantomas 2. You can see a full walk through here of the game by JGonza (who was one of the beta testers of the game):


The game itself can be downloaded from Github, and if you want a physical copy of the game, you can buy it from matranet. The game is not very complex technically, and it works well mostly thanks to the graphics (made by Jordi Sureda). But there are a few interesting technical aspects. One of them in particular I think is particularly interesting since it's something that has to be done in every 8 bit project: how compress data as much as possible.

There are many techniques that people have used to reduce the space used by the room definition (e.g., using macro-tiles, as brilliantly explained in this video), or compressing them using a compression algorithm (like Pletter, or Exomizer). But there is one more thing that can be done, which is what I want to explain in this post: how to organize your data so that it gets maximally compressed when using a compression algorithm. Let me start by introducing a couple of basic ideas about compression algorithms.

Compression Algorithms in 8 bit Computers

A compression algorithm takes as input a sequence of symbols (say, of length n), and tries to generate another sequence of symbols of length smaller than n by exploiting regularities in the data. For example, if I have a sequence that is something like this "aaaaabbb", I could replace it by "5a3b", meaning that I have 5 times an "a", and then 3 times a "b". This is called "run length encoding", and works well when we are compressing sequences that have long sequences of repeated symbols. This was used in the classic PCX image format, popular in the Amiga and DOS times, where images contained usually large patches of the same color. The most common compression tool nowadays is the "zip" utility that comes with most OSs, which does use much more advanced compression than simple run length encoding (most modern compressors use dictionary-based approaches such as the LZW or DEFLATE algorithms).

These advanced compressors are usually too computationally intensive for 8 bit machines to handle, but there are tools, like Pletter (popular in the MSX world) or Exomizer (popular in the Amstrad CPC world), specifically designed for these machines. Usually, though, the way they are used is to have a compressor that runs on a modern PC to compress the data to fit it in a game, and then only the decompressor (which is not as computationally heavy) runs on the 8 bit machine.

For a comparison of how well these algorithms work, I refer you to an amazing empirical evaluation done by @newcoleco on the Atariage forums ( http://atariage.com/forums/topic/268244-data-compression-benchmark-dan-pletter-zx7-exomizer-pucrunch-megalz/ ). In summary: Exomizer compresses more, but decompression is much slower. So, if space is the biggest concern, use Exomizer, if decompression time is a concern (i.e., if you need to decompress things during gameplay), then Pletter is the better option.

For Phantomas 2.0, I considered both (for a few weeks I maintained two separate versions of the game using both compressors), but I ended deciding for Pletter, since Exomizer was too slow (it took about 4 seconds to decompress all the data I needed when changing rooms, which made the game unplayable). But in other games, Exomizer might be the best option. So, the important thing is to know your options!

Compressing the Rooms in Phantomas 2.0

There are many ways in which rooms can be defined in games. In Phantomas 2.0, we used a simple 2-plane grid-based approach: rooms are made out of 8x16 pixel tiles, and since the playable screen is 128x128 pixels wide (the game runs in Amstrad CPC mode 0, where pixels are twice as wide as tall) each room is 16x8 tiles in size. We used two planes, one for background tiles (drawn behind the player and enemies) and one for foreground tiles (drawn in front of the player and enemies). For example, here I show an example room, and the two planes it is made out of:

This representation for rooms uses quite a lot of space (ignoring enemy/item information), since there are 2 planes of 16*8 tiles, and we use one byte to specify each tile, each room is 16*8*2 = 256 byes in size (plus 20 or 30 for enemies/items). The game has 85 rooms, so that would be 24KB just for the rooms! (and remember that the Amstrad CPC only has 64 KB of RAM, and that 16KB of those are the video RAM, so, we can actually only use 48KB...). That was not going to work...

An option could have been going for the idea of macro tiles (used brilliantly in many of the productions by 4MHz). But macro tiles also have some limitations. So, since I did not want to impose any more restrictions on Jordi (who was creating the rooms), I decided I had to figure out a way to reduce space while maintaining this room representation. Of course, a first step is compressing them with Pletter. But there are 3 additional ideas that can be applied after that that can reduce space even more. So, let's get started!

Step 1: Direct Pletter Compression

Putting all 85 rooms together uses exactly 24316 bytes (286.07 bytes per room). Let's keep that as a baseline.

So, the first step is to just compress one by one, all the rooms using Pletter. The result is 12649 bytes (148.81) bytes per room. So, this is not bad, but we can do much better :)

Step 2: Exploiting Useles Bytes

Some bytes in the rooms are "useless". What I mean by "useless" is that whatever their value, the final room would look the same! For example, if there is a foreground tile that completely covers the background, then whatever value we have for the background tile is not going to affect the final room.

A first idea is to just set all of those background tiles covered completely by a foreground tile to 0. But a much better idea is to set them to the same value as the tile immediately to the left of them. If we are on the left-most tile, then we use the last background tile in the row before, and if we are on the top-left coordinate, then we just use "0". In this way, we are setting several bytes in a row with the same value, which results in a sequence that compressors like Pletter can compress better.

With this small change, the result is that the rooms (after compressing them, one by one with Pletter) use now only 11850 bytes (139.31 per room). With this simple trick, we have saved almost 800 bytes! not bad!

Step 3: |C(A+B)| < |C(A)| + |C(B)|

The next step might not be as obvious. But it is well known in information theory that if you have a compression algorithm C, and two sequences of symbols, A and B, the compression of the concatenation of A and B (which I will denote as "A+B") is usually smaller than the sum of the compression of A plus the compression of B. This is intuitive, since imagine that there is a subsequence of A that also happens to be in B. In this case, the compressor can exploit that when compressing A+B, however it cannot exploit it when compressing A or B separately, since there is no repetition. So, compressing two sequences together saves space most of the time.

Using that idea, the obvious thing to do is to just concatenate all the rooms, and compress the whole thing together, right? The problem with this is that you need to have enough space in RAM to decompress the rooms (recall they use 24316 bytes uncompressed). So, this was not an option. I know that some games (for example Operation Alexandra compressed the WHOLE game together!, but this was not an option for us, as the game does not fit in memory uncompressed :))

So, what we ended up doing is to compress the rooms by groups. We compressed rooms in groups of 4 (put 4 maps together, and then call Pletter to compress them). A little detail is that if you want good compression, you better place things that are similar together (as we saw above). So, I placed the background of all 4 maps together, then the foreground of the 4 maps, and then the item/enemy information.

With this idea, we get a significant reduction in space, and we are now at: 9926 bytes (116.18 bytes per room), which is already pretty good.

But we can do even better! :)

Step 4: Optimizing the Groupings

Some rooms look more similar to each other than some other rooms. So, I thought: if I put rooms that look similar in the same group when compressing them, would that save space? and the answer is yes! So, this is what I did: I wrote a program that first calculated the space needed when compressing all maps in groups of 4. Then, the program starts trying to permute maps (taking a map in a group and swapping it by a map in another group). After each swap, it recalculates the space, and if we have saved some space, then we keep the new grouping. Trying all possible map groupings is not feasible, but we can use a simple greedy search algorithm like this:
1) Start with the base grouping
2) Try every possible swap, if one reduces, keep it and continue from there
3) Once we have tried every possible swap, restart from the beginning again if we had any improvement, otherwise, terminate.

And this was another huge saving! After optimization, the rooms only use 9164 bytes (107.81 bytes per room)!!

Here's a little diagram showing a summary of the different ideas:



Conclusion

So, in summary, to get better compression:
1) try to put things that are similar together.
2) compress things together, better than compressing them separately
3) if you are compressing things together, try to optimize what groupings do you make!

Since these ideas worked very well, I used the same idea to store all the graphics of the game (in groups of 16 tiles), saving additional space (again, all the tiles uncompressed do not fit in RAM, so, we need to keep them compressed, and we only uncompress those needed for the current room). If you play the game, you might notice a small delay when changing rooms, which is when the game decompresses the necessary tiles for the next room. This is the part that was way too slow using Exomizer unfortunately...

The funniest thing is that after all of this space saving, I got a message from @Joseman on twitter, explaining me some details of the Amstrad CPC hardware I wasn't aware of (this was our first CPC game), and it turns out there are 4 full KB of extra space we didn't use! So, at the end there was plenty of unused space in the computer memory. But oh well, it was fun optimization anyway, and these ideas can be useful for other projects!

And that's it, I hope these techniques are useful to somebody else! If you have any additional space compression tricks, by all means I would like to hear about them! :D

Friday, February 15, 2019

The Flag Effect in XRacing

A few days ago I submitted my last MSX game to the MSXDev'18 competition. The game is called XRacing, and it is a top-down view car racing game inspired heavily in the Amiga game "Super Cars". A quick video of the intro and first level can be seen here:


I will explain the technique I used for the game scroll in another post, but in this post I wanted to focus on explaining how does the flag waving animation in the title screen work! I want to explain how it works, since the method achieved pretty nice results and it is not too complex, and it could be useful to other people trying to add similar effects to other MSX games!

My goal was to figure out a way to play a small "video loop" in the background that looked very smooth. I wanted to do this since I saw this video of a fire animation converted to the ZX Spectrum specs. I thought it would be amazing to have something like that on a game. So, I started thinking how could I convert a small video on my computer to play on the MSX, and how could I make it small enough to fit on a small game.

I was initially wanting to convert a fire animation (a real video of a fire, not one of those fire visual effects), but I struggled to find any that I could easily convert to a loop with a handful of frames (e.g., 16 frames or so). So, I remembered that back in the day, when we were working on the F-1 Spirit remake, we had this fancy flag animation in the background of the title screen. So, I thought that would be an easier one to convert! I started looking for real videos of a waving flag, but again I struggled to find one I could easily turn into a small 16 frame loop. So, I just wrote a little program to generate my own flag. It doesn't look real, but it did the trick.

Alright, so, before I describe the technique, let me take a brief moment to explain the technical specs of the MSX graphic chip, which will define the constraints of the approach!

The MSX VDP

Most first generation MSX computers shipped with the TMS9918 graphics chip, which I will just call "the VDP" (Video Display Processor). The VDP has several graphic modes, but I will focus on what is known as "Screen 2" in the MSX world, which is the most powerful graphic mode the VDP offers.

Screen 2 is a 256x192 pixel graphic mode, where the screen is divided into 8x8 pixel blocks called "patterns" or "tiles". The screen is thus 32 patterns wide (32*8 = 256) and 24 patterns tall (24*8 = 192), as is shown in this small diagram:


Each pattern is defined by two components a "pattern table entry" and a "color table entry", which are combined in the following way to draw the pattern:


So, as can be seen above, the "pattern table entry" (8 bytes per pattern) defines the "shape" of the pattern, and the "color table entry" (8 bytes per pattern) defines the color of the pixels. Two colors can be specified in each byte of the color table entry, which correspond to the 2 colors that can be used in the each of the 8 rows of the pattern. Each of the bits in the pattern table entry corresponds to a pixel in the pattern, and whether its a 0 or a 1 determines which of the two colors specified for the corresponding row will be drawn. This is where the "2 colors for each block of 8x1 pixels" comes in the MSX!

In order to draw something in the screen, what you need to do is to write data into the "video RAM" (VRAM from now on), and as a result, the content of the screen will change. The VDP has 16KB of VRAM, divided into a collection of areas:
- the Pattern Table: we can specify up to 768 different patterns, divided into 3 different banks (as shown in the picture above; more on this below). Each pattern is 8 bytes long, with each byte corresponding to one of the rows of the pattern (each bit corresponding to 1 pixel), as described above.
- the Color Table: also has 768 entries of 8 bytes each, specifying the colors of each of the 768 patterns as described above.
- the Name Table: finally, the name table is a 768 byte table that specifies which pattern to draw in each of the 768 positions of the screen. For example, if we define two patterns (a smiley face in pattern 0, and a sad face in pattern 1), if we write 0, 1, 0, 1, 0, 1, etc. in the first 32 bytes of the name table, we will see a row of smiley and sad faces. So, once we define a pattern, we can reuse it multiple times in the screen. There is only one limitation here, and it is that since we can only use 1 byte to specify which pattern to draw, we can only select from 256 patterns (since that's the maximum number of values we can represent with a single byte). Since there are 768 patterns, these are divided in "3 banks". In the top 8 rows of the screen (BANK 1), we can only select from patterns 0 - 255, in the middle rows, we can select from patterns 256 - 511 (BANK 2), and in the bottom 8 rows from 512 - 767 (BANK 3).
- and then a couple of tables for sprites (but we will ignore sprites for now :))

By looking at this structure you can already probably see why some MSX games have a very "chunky" scroll of 8x8 pixels! It's just because the easiest way to implement scroll in MSX is by just defining our graphics as patterns, and then just moving them with the name table (which lets us move them only in increments if 8 by 8 pixels!). If we want smoother scroll, then we have to figure out a way to do it, by using just the tools I described above!

Another thing to have in mind is the slow speed of the MSX CPU (Z80) and the even slower speed of data transfer to the VDP! The MSX Z80 has a clock of 3.58 MHz (3580000 ticks per second). If we want to have a smooth video/game that renders at 50 frames per second, we only have 71600 ticks per frame. If we have in mind that it takes usually 29 ticks to transfer a single byte to the VDP. So, we can only optimistically transfer about 2500 bytes per frame to the VDP, which is not enough to redraw a whole frame! 768 patterns * (8 bytes for pattern table + 8 bytes for color table) = 12288 bytes, which is the reason for it being so hard to have smooth scroll on MSX...

Ok, so that's the VDP. Now, how do we get a smooth animation on such a slow hardware?


Step 1: Generating the Video

To generate the source video, I just wrote a small Java script to render a flag waving loop with 16 frames of 256x192 pixels in black and white (I wanted to have gray-scale, but I'll explain below why I decided against it). The original video looks like this:



Step 2: Extracting the Patterns

The next step is to go one by one each of the 16 frames of the animation, and extract all the different patterns needed to display this animation. For doing this, we scan each 8x8 pixel block in each of the 16 frames, and we create a list of all the different patterns we find. Also, since once this is uploaded to the VDP, each of the 3 banks of the VDP is separate, we do this for each bank separately. With the video above, we find the following:

- BANK 1: 787 different patterns
- BANK 2: 510 different patterns
- BANK 3: 772 different patterns

So, assuming we could store all of these tiles in the VDP, what we would have to do is this:
1) Store all of these tiles in the game ROM: (787+510+772)*16 = 33104 bytes
2) Now, each frame of the animation would just be 768 bytes (the name table for each frame). Since we have 16 frames, this would be: 768*16 = 12288 bytes
3) Then, at run time, we just load all the patterns in the VDP initially. And at each game frame, we just keep uploading the name table that corresponds to the current frame. This is just 768 bytes per frame, which is doable.

Problems:
1) The VDP can only store 256 patterns per bank, and thus 787, 510 and 772 do not fit!
2) Even if they were to fit, that animation already would use most of the size of the ROM cartridge! (I was targeting a 48KB ROM)

So, we need to reduce that somehow :)


Step 3: Reducing the Set of Patterns Needed

If we were to just display the flag animation by itself, we could afford to have 256 patterns per bank. However, I wanted to have the game logo drawn on top of the flag, and maybe have some text message at the bottom of the screen. So, I had to reserve some patterns for that. In order to show the title image, I needed these patterns (not all patterns are strictly needed, but it was easier for me to have these patterns, which I reuse in other parts of the game):


If you count, those are 108 patterns  But I wasn't sure if I was going to need a few more, so, I assumed I might need 112 patterns  This left me with 256 - 112 = 144 patterns in each bank for the flag animation. So, how to reduce from 787 tiles to 144?

The answer is a standard machine learning family of algorithms called clustering algorithms. Clustering algorithms work as follows: the input is a large collection of points in an n-dimensional space, and the output is a "clustering" (a grouping of the input points into a collection of "buckets" called "clusters"), where the points in a cluster should look more similar to each other than to points in other clusters. So, what I did was to consider every pattern as a point in a 64-dimensional space (one dimension per pixel in the pattern), and I gave all 787 patterns for BANK 1 as input to a clustering algorithm (a very simple one called k-medoids), and I asked the algorithm to cluster them into 144 clusters. From each cluster, I then pick the most "representative" pattern (the "medoid"). I then do the same thing with BANK 2, and BANK 3. I then replace every pattern in the flag animation by the representative pattern of its cluster. This effectively reduces the number of patterns needed for the animation to just 144 per bank!

The resulting animation now looks like this:


Notice that it mostly looks like the same animation, except for a few glitches! So, this already looks promising! After that, I just manually replaced the medoid of a couple of clusters by a different one to fix those dots at the top of the animation, and the animation already looked much cleaner!

Step 4: Translate the Animation to MSX Data

Finally, now we can just take those 144 patterns per BANK, and save them to a binary file. Notice that all patterns have just black and white as color, so, I do not store the color table (assume that the color table for all patterns is identical, and just set to black and white). So, I need to store 144*3*8 = 3456 bytes for the patterns (much lower than before!). And after compressed using Pletter, it's just 1434 bytes.

I considered having gray in the flag to have shadows, but that forced me to store a color table, which used more space. So, I decided against it!

Once we have the patterns, for each animation frame (there are 16 frames), we store the corresponding "name table", i.e., which patterns have to be drawn in each of the 32x24 positions of the screen. This is 768 bytes per frame, so 768*16 = 12272 bytes. I actually never draw the bottom line of the animation, since I use that for text, so that saves some space, and I just need 11776 bytes. After compressing them with Pletter, it;s reduced to 5842 bytes.

Step 5: Playing it in the MSX

Once we have that, to play it back on the MSX, we just need to copy the patterns to the pattern table in the VDP, set the color tables to black and white, and then at each frame copy the corresponding name table to the VDP. The video could play easily at 50 or 60 frames per second (since it just needs to copy 32*23 = 736 bytes per frame), but the animation would be too fast. So, the game plays it at 25 or 30 frames per second depending on if it is running on a PAL or NTSC MSX.

The XRacing logo is then drawn on top by just overwriting the proper positions on the name table, and drawing a few sprites on top of it to achieve the 4 rounded corners, and a few other details (there are 19 sprites on the title screen).

And that's it! So, basically the idea is: 1) take an animation loop as input, 2) get the set of patterns needed for each bank, 3) use k-medoids to reduce the number of patterns to 144 per bank, 4) with that we can already display the animation on the MSX, since we just need to change the name table in each frame!

It's not rocket science, but what I find interesting is that there is nothing in the code hard-coded for the video being a flag waving animation. The idea could in principle work for any other small animation loop. So, I might use this technique in the future to display other types of animations :)

Ok, that's all for this post, which is already getting too long! Next time I want to explain how the in-game scroll works! which is a bit more involved :)