Sunday, October 11, 2020

Quick Technical Overview of The Menace from Triton (MSX Game)


One more game, one more post! My latest MSX game is called The Menace fro Triton, and it is a horizontal space shooter that takes inspiration from classics like Salamander or the Nemesis saga for MSX, as well as from more modern shooter games like Steredenn or Z-Exemplar. You can see a small game play video here:


I put a lot of work into this game, and I just wanted to explain a few of the interesting technical aspects I designed for it, just in case someone is interested. For this game, I set out to create an MSX game that used only 16KB of RAM, and with a cartridge of at most 48KB. I like setting constraints like these on the games I make for two main reasons: First, because it is fun! Without constraints, many things become easier (e.g., if I allow more RAM, I could have many memory buffers, making things easier; if I allow for MSX2, then I can use the additional capabilities of the MSX2 VDP to make the scroll easier, etc.), but I enjoy the "game" of figuring out if it's possible to make all I want to do within the constraints I set. Second (and most important), I have found that if I don't set a size limit on a game, I never finish them. My PC games go on forever in an endless spiral of adding more and more features. But with a small and finite amount of space, I know that as soon as the 48KB of the ROM are filled, that's it! No more content! So, that helps a lot in getting games actually finished. If I had decided to use a MegaROM, I'd probably would still be working on it, and at some point the project would get abandoned...

Anyway! While in previous games there was something obvious to talk about (3d rendering in ToP, content generation in XSpelunker and scroll in XRacing), the Menace from Triton does not have one key technical distinguishing feature, but it has a collection of smaller ones that might be interesting to explain. So, I wanted to talk about 4 technical details that at least I found interesting when designing this game:

  • Text rendering routines: not sure if you noticed but the font in the game is not monospace. Each letter has a different width. This is a bit more complicated to set up than the usual "1 letter per tile", but allowed me to fit much more (and better looking) text!
  • Content generation: I wanted to have lots of levels in the game, and I did not have enough space in the ROM for all of them. So, I resorted to content generation as a way to reduce space. The most interesting routine perhaps is the one that generates the star map.
  • Scroll and Agendas: the core technique for smooth scroll is nothing to write home about (same idea as used in many other games before), but there are a few details that I find interesting such as the way tiles are precomputed to minimize space, and how updating all the different game elements (scroll, content generation, enemies, bullets, explosions, etc.) is orchestrated at run-time while maintaining the smooth scroll.
  • Code compression: I really didn't want to do this, as it's a bit annoying to set up. But I ended up having to divide the source code into modules that are compressed in the ROM, and only decompressed when they are needed. This saved quite some space in the ROM, as, of all of my MSX games, this is the one that has the largest amount of source code, using too much space in the ROM.
There's a few other details I could talk about such as the custom music playback routines, or the scripts I use to generate the music data (I composed songs using EXCEL as the "tracker" for this game, which was surprisingly powerful, as I had all the power of modern copy/paste/insert/formulas/etc. at my disposal :) ).

Alright, so, let's just get started! Looks like this is going to be a long post!

Text Rendering in The Menace from Triton

In most MSX1 games, text is rendered in a very simple way: each character uses an 8x8 tile. In this way, drawing a character means simply sending a single byte to the VDP (the graphics chip) to change the name table. However, that limits text to 32 characters per line (as the MSX resolution is 256x192 pixels, i.e., 32x24 tiles). I knew there would be a lot of text in The Menace from Triton, so that was not good enough...

Last year, I worked with my friend Jordi Sureda in a point-and-click demo for Amstrad CPC, and text was very important, so, I worked on a nicer text rendering piece of code. Although we never released it (coming soon), I also built an MSX version of that demo. So, I already had the text rendering code from there, which I adapted to this game. The resulting text looks like this (this is a screenshot from the game intro, which you can see if you wait a few seconds in the title screen):


Rendering works as as follows:

1. Drawing the font: I wanted the font to use as little space as possible, so, I only used the characters actually used in the game. No lower case (I had upper and lower case for the point-and-click demo, but for this game, I just used upper case letters, numbers and some special symbols; 49 characters in total). The image I drew looks like this:


2. Encoding the font: the next step is to encode the font into some memory structure. I decided to encode the text by columns. So, each byte represents a column of pixels. This limits the fonts I can support to those that are 8 pixels tall. But this was an acceptable constraint. With this representation, a thin character such as an exclamation mark (that is just 2 pixels wide) is encoded as 2 bytes, a wider letter, like a "W" that is 6 pixels wide is encoded as 6 bytes, etc. Putting all the letters together one after another, results in the following character definition table, which contains, for each character, a byte for the character width, followed by bytes defining the characters:

Note: after drawing this figure, I noticed that I flipped the bits horizontally. But I was lazy to re-do it, so, just imagine a mirror of the left-table :)

A second small table shown above (right) stores the indexes where each character definition starts. In order to make indexes fit in a single byte (i.e., have values between 0 to 255), I store index/3, which means that each character needs to start in a position multiple of 3 in the character definition data table. Thus, there are gaps in the table (e.g., bytes at offsets 4 and 5, right before the exclamation mark). Storing index/2 would make this table smaller (saving about 30 bytes), but results in worse compression for the text later on, and thus index/3 was the best tradeoff. As we will see later, the index table is actually not needed, so, the font uses 349 bytes.

3. Encoding Text: Now, in order to encode text, instead of having strings like "HI!" encoded as ['H', 'I', '!'], text is directly stored by a sequence of indexes. So, the string "HI!" is stored as [62, 64, 2], which are the indexes form the index table of the 3 characters. 

There is a lot of text in The Menace from Triton (if you put together all the strings in the game, they add up to 3220 bytes). In order to compact them a little bit, as expected, text is compressed. I used aplib to compress text, which resulted in a slightly better compression than pletter. Since I do not have a buffer of size 3220 in RAM to decompress text to, I divided all the sentences into "banks" of at most 512 bytes each, resulting in 7 banks. In the same way as I explained for Phantomas 2.0, I optimize the distribution of sentences over the banks, and at the end it all compressed down to 2017 bytes.

4. Rendering Text: To render text, each individual sentence in the game is identified by two bytes: (bank number, sentence number within the bank). If I want to render, say, the 5th sentence in bank 2, I decompress bank 2, and then skip the first 4 sentences in the bank, to get the pointer to the 5th sentence. The first byte of the sentence contains its length in characters.

Then to actually render, I have a small RAM buffer (just enough to draw a whole line of text), where I render the text column by column, and then I copy it over to the VDP pattern table. I can then set the attribute table in the VDP to whichever color I want, and thus render text in any color. For this to happen, almost everywhere in the game (except in-game) I use "screen 2 bitmap mode" (where the name table is reset to 0, 1, 2, 3, 4, ..., 255 in all three banks, and I just draw directly to the pattern and attribute table. If you are curious, the rendering code can be found in this file ("draw_sentence_pre_render" and "draw_font_character" are the actual rendering functions).

Content Generation

There are two separate procedural content generators in the game: the star chart generator and the game level generator

The game level generator is fairly standard, and uses a classic "pattern-based" approach, where levels are made out of pre-defined patterns (of size 16x22 in this case). Each of these patterns has annotations to determine which type of pattern can follow (e.g. "empty top", "land at top and bottom", etc.), as well as which enemy waves make sense to spawn in each of them. Other than that, levels are just generated randomly by picking from the set of patterns that match the constraints, and then spawning enemy waves that match with the selected pattern. The land-based enemies (moai heads, turrets, etc.) are pre-defined in the patterns, but not always spawned. Each planet in the star chart has a difficulty level associated with it that determines the chance of spawning enemies: the higher the difficulty, the higher the chance of spawning more enemies. The difficulty of a level is defined by 13 numbers:
  • Enemy spawn probability: probability with which the land-based enemies are spawned in their marked positions in the map
  • Bullet speed
  • Fire rate for aggressive enemies, fire rate for medium aggressive enemies, fire rate for enemies that should not fire (even these fire in later levels of the game haha)
  • Fire delay (minimum delay enemies need to wait before they can fire again)
  • Enemy movement speed
  • Base enemy hit points, hit points for medium enemies, hit point for large enemies
  • Power pellet spawn probability (the probability with which normal enemies spawn power pellets), power pellet spawn probability for enemies that should spawn less power pellets.
  • Target level length
For example, the first level you play has values: [0.5, 2, 0.06, 0.004, 0, 4, 1, 1, 1, 4, 0.75, 0.5, 10], while the very last level (Triton) has values [0.7, 3, 0.19, 0.06, 0.04, 0, 2, 2, 6, 10, 0.375, 0.19, 40]. Difficulty is determined by the "column" of the star chart (the more you move to the right, the harder levels get). You can see all the different difficulty values in the "difficulty_d1_values_ROM", "difficulty_d3_values_ROM", etc. structures in this file.

The star chart generator is a bit more interesting, and I really struggled to find a way that could be coded compactly in assembler and that generated an interesting star chart... This is an example of how the star chart looks in the latest version (not yet on GitHub):


The challenge is to generate a connected graph that starts at the bottom-left (Ithaki) and finishes in the top-right (the Aigai Nebula), while offering multiple possible paths for the player to choose from. After trying many failed solutions, I settled for something quite simple actually, but that worked pretty well, so, I kept it! The algorithm works as follows:
  • The basic technique is to use a semi-random walk on a 8-connectivity grid. The "canvas" where to draw the star chart is a 11*5 grid, with 8-connectivity (up/down/left/right/diagonals). Given a start and and end destination in this grid, a "semi-random walk" starts at the start position, and moves straight to the target position but with some small chance, it moves in a different direction than the shortest path. 
  • Generating the star chart with this "semi-random walk" is as simple as this:
  • Step 1: defining the anchoring points
    • point A: Ithaki
    • point B: Aigai Nebula
    • point C: a random point in a 4x4 space at the center of the canvas
    • point D: a random point in the 4x2 top-left part of the canvas
    • point E: a random point in the 4x2 bottom-right part of the canvas
  • Step 2: generate the following semi-random walks, as shown in the figure below:
    • A-D
    • A-E
    • A-C
    • D-B
    • E-B
    • C-B
    • D-C
    • C-E
  • Step 3: place a random planet type at each visited grid cell (there's 4 planet types: Moai, Water, Tech, Temple), and connectors in all the connections visited by the walks.
  • The first boss (Polyphemus) is always spawned in the 3rd planet you visit, and the second and third bosses (Scylla and Charibdis) are spawned in random planets in two predefined columns of the map. The last boss (Triton) is always in the final planet, also called Triton.
And that's it! So, after considering more complex techniques like graph-based algorithms (e.g. Kruskal), I settled for this very simple method that does not use too much space when implemented in assembler, and generates star charts like those I imagined! So, if you see any star chart that looks particularly complicated, it's just because the semi-random walks some times take lots of detours to reach to the destination, creating loops, etc. (which is nice!). Too much randomness and it was a mess, so, I set the randomness to a fairly low probability (there is, for example, only a 0.023 chance of moving in the diametrically opposite direction of the target during a walk).

Scroll and Agendas

The basic scrolling technique used in Triton is standard: scroll modes 2 by 2 pixels, and all scrolling tiles are pre-rendered. So, at runtime all that needs to be updated is the name table. This is not anything new, and many games use this technique. But there are two details that I wanted to talk about: how are tiles pre-computed and the agendas used to ensure all runs smooth.

Pre-computing tiles: The MSX VDP has 3 tile banks with the patterns/attributes that correspond to the top/middle/bottom parts of the screen. In most scrolling engines either the same set of 256 pre-rendered tiles is used for all 3 banks, or a separate one is used for each bank.  In Triton I used 2 sets. One for the top of the screen, and another for the middle/bottom. This was a good compromise between variety and space. Using the Moai levels as an example, the tile sets look like this:

Pre-computing tiles is usually done off-line (usually pre-computed on a PC using some script), and then directly included in the ROM. However, that means storing a lot of tile information. And having 4 different level types, that was too much space, which I did not have. So, I went with a different approach. Each pre-rendered tile uses 16 bytes before compression, so, 2 sets of 256 tiles would be about 8KB before compression. Thus, instead of storing the pre-computed tiles, I stored a small set of base tiles, that look like this for the Moai levels (in reality I had a single set of base tiles including the tiles of all level types):

Each pre-rendered tile is the result of taking two base tiles and an offset, like this:


So, basically instead of 16 bytes, we just need 3 bytes per pre-rendered tile: left-tile, right-tile and offset. So, that's what I store. Then, just before starting a game, this table is used to calculate all the pre-rendered tiles directly on the MSX side. This table is just 256 * 3 bytes * 2 banks = 1536 bytes before compression. The real table is a bit bigger (as calculating the attributes of the pre-rendered tile is tricky in some edge cases and the assembler code to reconstruct it would have been too large, so, since the offset only needs 2 bits (as it can only take values 0, 2, 4, 6), I use the other 6 bits to store a bit more info, and in some special cases there some extra bytes, but those are implementation details ;) ).

Thus, I mostly only need to store a single base tile image (4KB) plus 1536*4 bytes. Thus, from 32KB before compression, we go to only 10KB before compression. Moreover, in addition to those tables, in both cases I need some additional tables to accelerate scroll calculations during rendering (2212 bytes in total after compression). In total, once all is compressed, I only need to actually use 6018 bytes for all the tile information used for all levels, which was compact enough! Without this approach I was using almost 10KB, so this saved about 4KB!

On the other hand, I need some code for pre-calculating the tiles, but that code is just about 150 bytes or so. Thus, it is worth it!

Agendas: the second interesting part of the scrolling engine is how to fit all of the calculations into the limited 58000 CPU cycles per frame available in 60Hz machines! These are the different things that need to be rendered/processed in the game, together with their average and maximum time they use (some are marked with question marks as I have not measured them recently, and I was lazy to measure them again explicitly for this blog post haha):
  • 27457 cycles (27457 max): Draw map
  • 3794 cycles (3809 max): Update sprites in the VDP
  • 849 cycles (2400 max): Choose next pattern for level generation
  • 16314 cycles (20014 max): decompress chosen pattern for level generation (level patterns are compressed as they do not fit in RAM or ROM decompressed, so, when one needs to be used in the game, it has to first be decompressed)
  • 7472 cycles (7472 max): copy decompressed pattern to the map buffer
  • 2138 cycles (5327 max): background star field scroll
  • 1157 cycles (1157 max): get keyboard/joystick input
  • 2179 cycles (8275 max): update the player ship/options
  • 6211 cycles (18047 max): update sprite-based enemies
  • 2782 cycles (23575 max): update enemy bullets (max is high, as it includes handling the effect of bullets colliding with player/walls)
  • ??? cycles (??? max): update tile-based enemies
  • 3766 cycles (8180 max): update player bullets
  • ??? cycles (??? max): update player secondary bullets (option bullets/missiles/torpedoes)
  • ??? cycles (??? max): update tile-based explosions
  • ??? cycles (??? max): update power pellets
  • ??? cycles (??? max): update player/enemy/bullet/power pellet/explosion positions when scroll loops around in the circular buffer
  • ??? cycles (??? max): music/SFX
If you add up all of those number, the total is about 85K cycles (with a max of about 150K), which was clearly not going to fit within one game frame :)

So, the solution used in the Menace from Triton was to use agendas. I have an agenda with 256 time slots, where each of the 256 time slots specifies which of the operations above is executed. So, not all operations are updated at every game cycle. It takes 16 game frames for the scroll to advance one whole tile, so 256 game frames is what is needed to advance 16 tiles (which is a whole level generation pattern). Thus, once each 256 cycles we need to choose a new pattern, decompress it and copy it to the game map circular buffer. I am not going to give the complete agenda, but here are a few hints:
  • Draw map is only done in the even cycles
  • The star field is only updated once every 4 cycles
  • Player is updated at every frame (thus, player moves at a smooth 50/60 fps)
  • Sprite-based enemies are only updated in the odd cycles (so, enemies move only at 25/30 fps)
  • Tile-based enemies are only updated once each 16 frames
  • Choosing the next PCG pattern and copying it to the circular buffer is spread across the 128 odd cycles:
    • in cycle 1 we choose the pattern
    • in cycle 3 we decompress it
    • cycles 5, 7, 9, etc. we go through the new chosen pattern (2 rows at each cycle) and spawn any tile-based enemies that need to be spawned
    • in cycles 131 and 133 the fully instantiated new chosen PCG pattern is copied to the circular map buffer (in 131 we make one copy, and in one edge case we need to copy the pattern twice; at the beginning and end of the circular buffer, for when scroll wraps around; this second copy if needed is done on cycle 133)
  • etc.
I really like this idea of an agenda, as it managed to reduce the average CPU required per frame to about 29K cycles (thus, there's CPU to spare in most frames!), and the maximum per frame in most cases stays below 50K. There are a few peaks of 100K (when text needs to be rendered, e.g. when a power pellet is picked up and the text corresponding to the next weapon in the power bar needs to be rendered). But those are one frame in a thousand, and are barely noticeable! The actual agenda used is encoded as code and not as an explicit table, but in games where space is not an issue, an actual table might be a better implementation. You can see some notes on the actual agenda for procedural generation operations in the "update_scroll_and_pcg" function comments in this file.

I thought this is a pretty general concept that could be useful in many many games, and I am definitively going to be using it in the future!

Code Compression

Even with all my efforts to fit the game within the 48KB of the ROM, it became clear it was not going to fit... So, I even considered having only 3 level types instead of the 4 I had initially planned... I even combined two different compressors (pletter and aplib), and compressed every piece of compressed data with both to see which of them compressed each file the best! At the end both of them are included in the game, as it turned out that I could save space by including both compressors, and choosing the best compressor for each piece of data than just picking either of the two! For example, interestingly for the level generation patterns, the best turned out to be to compress each pattern with pletter, then concatenate the compressed patterns, and compress the result with aplib. This was better than pletter-pletter, aplib-pletter, or aplib-aplib... In any case, I was desperate for space, so, I decided to try to compress the part of the game that was using the most space: code!

The code of the game uses about 27KB, which is way too much. That's more than half of the ROM! Compressing code is not uncommon in platforms like Spectrum/Amstrad where you can count on having a lot of RAM for decompressing the code to. For example, the Operation Alexandra game went all the way and THE WHOLE game is compressed in the tape, and decompressed right after the intro scene plays (overwriting it). But in MSX with only 16KB of RAM that's a bit tricky... So, I ended up doing this:
  • During gameplay it is when I am more RAM constrained, as the scrolling engine needs some fairly large buffers. So, the in-game code had to be in ROM.
  • An exception of this is while in boss-fights, as there is no more content generation going on, so, I don't need the content generation buffers.
  • So, I chose a set of "core" routines (in-game code, music, graphics, etc.) and that went to the ROM: this was about 16KB-17KB.
  • All the code for the intro/menu/star chart/password/weapon configuration menus is compressed in a single chunk. From 6291 bytes, this went down to 4232 bytes
  • Each of the 4 bosses code is also compressed, and only decompressed right before the boss fight (you might have noticed a slight pause when the boss is spawned): the code for each of the 4 bosses went down from: 756 / 1003 / 1476 / 749 bytes to 593 / 712 / 949 / 548 bytes!
All in all there are 10275 bytes of compressed code that compress down to 7034 bytes, saving a very nice chunk of 3241 bytes! I was very excited by this!

The one tricky thing is how to handle "calls/jumps" between the main part of the code and the compressed parts. So, what I did is this. The compressed pieces of code are always decompressed to the same address in RAM. So, I know where do they go. I inserted "hooks"/"entry points" in the compressed code so that I knew which addresses to jump/call. For example, the menus compressed code looks like this:
    include "constants.asm"
    include "autogenerated/text-constants.asm"
    include "../triton.sym"
keyboard_line_clicks:       equ keyboard_line_state+1

    org NON_GAME_COMPRESSED_CODE_START

    ; entry points:
    jp state_braingames_screen
    jp state_mission_screen_from_game_complete
    jp state_mission_screen_from_game_failed
    jp state_gameover_screen
    jp generate_minimap
    jp enable_nearby_minimap_paths
    jp get_minimap_pointer
    jp state_mission_boss_under_cursor

    include "state-story.asm"
    include "pcg-minimap.asm"
    include "state-title.asm"
    include "state-ending.asm"
    include "state-mission.asm"
    include "gfx-bitmap.asm"
    include "state-weapons.asm"
    include "state-gameover.asm"
    include "state-braingames.asm"
    include "state-password.asm"

Notice two things:

  • See that "include "../triton.sym"" statement? So, I compile the main part of the code. Such compilation generates a symbol table (triton.sym), which is then included by the compressed code, so the compressed code can call any function from the main part of the code. I then compile the compressed code, compress it with pletter/aplib, choose the one that best compresses it, and add it to the ROM.
  • The second part is that list of "jp ..." right before the list of includes that have the actual code. Those are the "hooks". Notice that the symbols present in the compressed code are not visible to the main code. So, I need a way to call those functions! So, if from the main part of the code, I want to call, say, the "generate_minimap" function, I just need to call it like this: "call NON_GAME_COMPRESSED_CODE_START+3*4", as each jump instruction uses 3 bytes. Of course, you I can define constants for these calls if I wanted, to make the code look prettier. 
Compiling the project then involves the following steps:
  • Create placeholder (empty) files corresponding to the compressed code (e.g., empty files with the right name, so that the assembling process works)
  • Compile the ROM (this just compiles the main part of the code, and includes the placeholder files instead of he actual compressed code). This also generates triton.sym.
  • Compile the compressed code (using triton.sym), and compress it, overwriting the placeholders.
  • Re-compile the ROM again, this time with the proper compressed code.
So, I need to compile the ROM twice every time I do a change, but that just takes a couple of seconds, so it's ok!

Finally, after submitting to the MSXDev competition, I also tried optimizing the code using my MDL assembler code optimizer, and that saved about 200 extra bytes. Thanks to those bytes I could add the new password saving system (with those 200 bytes, plus about 800 bytes of free space that were already in the ROM). In the final version of the game I only have 232 free bytes in RAM, and 0 free bytes in the ROM. MDL still suggests changes to save an additional 150 bytes or so when ignoring speed, but most of those are jp -> jr changes that would make some of the core routines slower, so, I don't want to make those changes unless I have no other choice! I have done just the minimum necessary MDL-suggested space optimizations to fit the password code (and the last few were really painful to accept hehe).

Conclusion

Anyway, that's it! So, as I mentioned above, I don't think there is any groundbreaking techniques used in the game, but there are lots of interesting (at least to me) little details here and there. I think the text rendering and the agendas are the part that I am happier about in the game, as those really made a difference! I'll definitively be reusing those in the future!

Monday, April 22, 2019

Smooth Scroll in XRacing

As I explained in a previous post, my latest MSX game is XRacing, a top-down racing game inspired in the classic Super Cars for the Commodore Amiga.

When I was planning XRacing, the one thing that I wanted the game to have was smooth scroll. As a Super Cars-style game, if the scroll was the usual choppy 8x8 MSX scroll, it would be very hard to play. At the end I ended up using a method for achieving this that worked pretty well! So, today I just wanted to give you an idea of how it works!

Ok, so, let's start with the basics. How do people usually implement scroll in MSX games? (MSX2 and later machines can do this by hardware, but let's forget about those for now, and focus on the classic MSX1 machines) There are basically 3 common techniques:

  • Scroll by tiles: As I explained in this post, the MSX graphic chip (the VDP) displays graphics by defining "tiles" (or "patterns"), which are 8x8 pixel small graphics. There are 3 banks of 256 tiles each. Each of the banks contains the tiles to be displayed in the top, middle and bottom thirds of the screen respectively. A "name table" is a small 768 byte array in the VDP that contains one byte per tile in the screen (at the standard 256x192 pixels MSX resolution, there are 32x24 = 768 tiles in the screen), and specifies which tile to be drawn in each position. So, to do tile by tile scroll, you just need to update this "name table", which just requires writing 768 bytes to the VDP per game frame. This is a very small amount of information, and thus can be done easily at 50/60 frames per second, and that's why it's used by most games. This is what was used in all Konami games like the Nemesis saga, Knightmare, etc. 
  • Bitmap-mode scroll: The other extreme is to ignore completely the name table, and use the video memory as if it was a bitmap (i.e., like if we were programming a game for the ZX spectrum or the Amstrad CPC, which did not have the concept of a name table). In this approach, the game just redraws the whole screen at each game frame, and thus can move the graphics pixel by pixel if desired. This allows for smooth scroll, but has the down side of having to write A LOT of data to the VDP in each game frame. Since each tile requires 16 bytes of data (8 bytes for the pattern, and 8 bytes for the colors), we need to write 16*768 = 12288 bytes per game frame. Now, the maximum speed at which we can transfer data to the VDP is one byte each 29 CPU cycles, so, that is 12288 * 29 = 356352 cycles. Since the Z80 CPU of the MSX runs at about 3.58MHz (3.58M cycles per second), this would result in at most 10 frames per second (ignoring the time we need for the game logic, and assuming maximum transfer speed, which is unlikely). Games that use this technique usually ignore the color table, and just use monochrome graphics, so that only half the data has to be transferred, and then, on top of that usually a large scoreboard is used to reduce the size of the playable area, reducing data transfer even more. Most ZX Spectrum game ports to the MSX use this technique, since in Spectrum this technique makes sense (Spectrum video memory is memory-mapped, and thus transfer speeds are much higher). To port the game to the MSX, they created a small buffer in RAM that acted as the ZX Spectrum video memory, and then wrote a routine that transferred its content to the VDP at each game frame (in this way the code for the ZX spectrum barely needs to be modified to run on an MSX), which is the routine responsible for the slowdown we see in most ZX Spectrum ports...
  • Pre-calculated scrolled tiles: this is a very interesting technique used by some more modern MSX games (and a few classic ones too). The idea is to have multiple versions of each tile in the tile banks. The different versions contain the same tile, but already scrolled 1 pixel to the left, 2 pixels to the left, 3 pixels to the left, etc. In this way, if we want to (for example) move the scroll 1 pixel to the left, we just need to update the name table to replace the current tiles by the versions of the tiles already scrolled 1 pixel to the left. This achieves smooth scroll at very fast speeds! The downside is that since we need to have multiple versions of each tile, the amount of different tiles we can have in a game level is much smaller, thus reducing graphic variety. Example games that used this technique are Lotus F3, or Uridium. As you can see Lotus F3 has very little graphical variety. Uridium has more variety, since it only requires horizontal scroll, and thus less versions of each tile are needed. This was also used in a few classic games like Circus Charlie, for example.
So, in summary "Scroll by tiles" is fast but choppy, "Bitmap-mode scroll" is smooth but slow, and "Pre-calculated scrolled tiles" is fast and smooth but reduces graphical variety. None of those options was good enough for what I wanted to do. So, I had to figure out a way to do better :)


Pre-calculating Scrolled Tiles

My method is based upon the pre-calculated scrolled tiles method. So, let's start with that. To illustrate all the concepts, I'm going to be using the actual graphics I used for the "grass" type of tracks in the game (there are two separate types of tracks: tracks with a grass background, and urban tracks with buildings). This is one of the first Formula 1 tracks in the game, called "St. Junipero" (named after my favorite Black Mirror episode :)):


Note that this is NOT how the track looks in the game, since I use a lot of sprites to add details to the track, but let's focus just on the tiles for now :) To author that track, this is the set of tiles I wanted to use:


Notice that there are only 51 different tiles used. It seems like there aren't that many, right? Well, you'll see that 51 is quite a large number for the "pre-calculated scrolled tiles" approach. Suppose that we wanted to use that map in a game with smooth scroll that moves in increments of 2 pixels (1 pixel smooth scroll would be even harder!). In order to do that, what we need to do is this:
  • For each tile in our set (51 tiles): find all the other tiles that appear at least once to the right of this tile in the map above. For each such pair of tiles, we need to generate the following pre-calculated tiles, and have them in our tile banks:
  • So, for each pair of tiles that appear next to each other (one to the right of the previous), we need, in addition to the two original tiles, three additional intermediate tiles corresponding to the intermediate scroll positions in which the tiles might appear.
  • If we wanted to have vertical scroll instead, We need to do the same thing for tiles that appear one of top of another, for vertical scroll.
  • If we wanted multi-directional scroll, then, instead of pairs of tiles, we need to account for combinations of horizontal + vertical scroll. So, in general, for each 2x2 block of tiles that appear in the map, we need to do what I described above. So, for each unique 2x2 tile combination, we would need to add 21 different pre-calculated tiles.
You can already see where this is going. If we want pre-calculated tiles for multidirectional scroll (even just for moving 2 pixels at a time, let alone pixel by pixel), we would need A LOT of tiles. If we remove duplicates (since lots of tiles generated with the previous method would look identical), here is the set of tiles we would need for the map above:


This is a total of 1196 tiles! And if we now create a separate map that has some unique tile combination not appearing in this map, we would then need even more tiles! Also, have in mind that I authored those tiles with this method in mind, so they are carefully drawn to minimize the number of tiles needed. Notice that all elements always have an even number of empty space surrounding them (for example the house has 4 empty pixels above (not 3, not 5), the cone has 2 empty pixels below, etc.). If you are not careful, you'd get even more tiles!

Now, in XRacing, I need the bottom part of the screen for a minimap, and some basic information about the game (speed, time, laps, car status, etc.). So, I only used the top two banks for the actual gameplay. Since maps scroll arbitrarily, I have the same tile set in each of the two banks. Thus, the maximum number of different tiles I can have loaded at once in the VDP is 256. This was clearly not going to work for XRacing :)


Scroll "On Rails"

My first idea was to force the scroll to be only either horizontal or only vertical. So, if we are scrolling horizontally, I would snap the y coordinate to a multiple of 8 pixels, and when scrolling vertically, I'd snap the x coordinate to a multiple of 8 pixels. With this idea, we would only need the following tiles:


That is "just" 389 tiles! Still too many, but we are getting closer to the magical 256 number! So, in order to reduce this number further, I used the idea of "rails".

Rather than allow the track to freely scroll, I would define "rails" in the track, and the camera can ONLY slide along those rails. For example, in the track above, these are the defined rails:


When the game starts, the player is in the left-most red-color rail, and thus the camera can only move vertically. As soon as the player reaches the top part of the map, the camera "snaps" to the horizontal blue-colored rail, and the camera can only move left-right, etc. The key idea is that not all tiles would need pre-calculated versions that scroll vertically or horizontally. For example, the little house in the tiles above was designed to only scroll vertically (there would be color clash if I scroll it horizontally!). So, I only add it to parts of the map associated with a vertical rail, and never to a horizontal rail. Likewise, there are graphics that I only ever associate with horizontal rails (like the yellow banner sign).

This reduces the set of tiles needed even further! I don't know how many tiles would I need for the map above, since I never calculated it. But when putting together ALL the XRacing maps that use this tile set, I only need 313 tiles, which is even closer to the 256 magic number! We are getting there!


Multiple Tile Groups

The last trick is to realize that not all tiles are needed at all time. For example the starting block (the blue semaphore at the beginning of a track), only appears once in the track! So, I decided to divide my tiles into smaller groups, and load/unload them during the game. The constraint is that we cannot upload too many tiles at once, or the player would notice a slowdown in the scroll. So, what I did is to define a large group of "common tiles" that would always be loaded, and then smaller groups of tiles, of which only one of them at a time would be loaded, and each time we switch from one rail to another, the rail has a tag which which group of tiles it needs to load.

This part was a bit tricky, and I wrote a small optimization algorithm using a classic hill-climbing approach to find the optimal grouping of tiles. I struggled quite a lot with this part, since it was hard to find a good grouping, and my algorithm kept finding local optima that were not good. Happy to provide details if anyone wants to replicate any of this. But suffice to say that there is an algorithm that given a set of maps can find these groups of tiles (the idea is to start with one group per rail, and then take tiles out of those groups into a "common group", trying to take out the tiles that appear most often). These were the groups of tiles that were found for XRacing:
  • A large common group of tiles with 199 tiles:

  • Two smaller groups of tiles with 57 tiles each:

For the urban tile set, there was 4 smaller groups of tiles instead of 2, but that was also manageable. So, in a regular game frame, I only need to update the two banks of the name tables (512 bytes to the VDP), and when the scroll snaps from one rail to the next, we need to upload an additional 57 tiles (57*16 = 912 bytes). I need to upload those to both bank 1 and bank 2, so, it's twice that amount, but that was also manageable. The game runs at 25/30 frames per second, so this was well within the margins of what I could handle!

Another tricky part was that some times, my optimization algorithm was not able to find a solution to the problem using only 256 tiles, and thus I had to constantly change the maps if there was a particular combination of tiles that only appeared in one map, to reduce the number of tiles actually needed. This was a bit painful, but eventually I got a set of maps for which my optimization algorithm could find a good grouping solution, and thus I could use for the game!


Sprites

Finally, in order to add a bit more graphic detail in the game, I use a collection of sprites in addition to tiles. In addition to the cars, and the small dots in the minimap, I use sprites in 3 different ways in XRacing:
  • To add graphic detail to the game. This is used for trees and signs, which have sprites to draw the outline of the trees, and the MSX logo on the signs. This can be seen in the images below.
  • To protect the scoreboard from sprites intruding into it I put 4 blank sprites in the top-left corner of the scoreboard, in that way the car, tree and MSX-logo sprites do not get drawn over the scoreboard. Just to clarify on this, the MSX can only display 4 sprites in each horizontal screen line at a time. So, if you have more than 4 sprites, the 5th sprite doesn't get displayed. I exploit this trick, and put 4 sprites in the scoreboard, so any other sprite that tries to be drawn there simply cannot be drawn :)

  • I use the same 4 sprite trick to make it look as if the cars go underneath the semaphore at game start. So, there are 4 sprites (black color) making up the semaphore, which serve both to add graphic detail, and to protect it from the car sprites



Conclusions

And that's basically it! So, the key question I wanted to answer was: can I take the "pre-calculated scrolled tiles" scrolling technique and push it so that I can increase graphic variety as much as possible? In order to do that I used the idea of "rails", which limit the number of tiles we need to pre-calculate and the idea of splitting the tiles into smaller groups that I can load/unload to the VDP at run time. This results in a scroll that is not fully multi-directional, but has the benefit of allowing me to have different graphics in the game (like the little houses, or trees, etc.) to decorate the map, which would not have been possible otherwise (look at the Lotus F3 example I pointed out above).

As a final note, once I made the game available, Artrag from the msx.org forums, pointed out to me that there is a way to push the VDP even further, and make the top two tile banks of the VDP share a single bank. In this way, we can have 2 banks of tiles, and switch from one to the other arbitrarily. So, to clarify: in XRacing the top part of the screen uses bank 1, and the middle bank 2. What Artrag is suggesting is that we can make both top and middle use bank 1, and then swap to bank 2 when we want. So, this would bring the number of total tiles we can have up to 512 (only 256 of them at once). I want to explore this possibility in my future games, so, it seems we have not yet found the "limits" of our beloved MSX computers! So, looking forward to better scrolling algorithms in the future! :)

Anyway, I hope you found this interesting, and if you have any different ideas or techniques to push the possibilities of doing smooth scroll in MSX games, I'd love to hear about them! :)

Sunday, February 17, 2019

Space Saving Techniques in Phantomas 2.0

Although most of my recent games have been for the MSX, last year Jordi Sureda and myself started working together on a game for the Amstrad CPC. The game is a remake of the Dinamic classic Phantomas 2. You can see a full walk through here of the game by JGonza (who was one of the beta testers of the game):


The game itself can be downloaded from Github, and if you want a physical copy of the game, you can buy it from matranet. The game is not very complex technically, and it works well mostly thanks to the graphics (made by Jordi Sureda). But there are a few interesting technical aspects. One of them in particular I think is particularly interesting since it's something that has to be done in every 8 bit project: how compress data as much as possible.

There are many techniques that people have used to reduce the space used by the room definition (e.g., using macro-tiles, as brilliantly explained in this video), or compressing them using a compression algorithm (like Pletter, or Exomizer). But there is one more thing that can be done, which is what I want to explain in this post: how to organize your data so that it gets maximally compressed when using a compression algorithm. Let me start by introducing a couple of basic ideas about compression algorithms.

Compression Algorithms in 8 bit Computers

A compression algorithm takes as input a sequence of symbols (say, of length n), and tries to generate another sequence of symbols of length smaller than n by exploiting regularities in the data. For example, if I have a sequence that is something like this "aaaaabbb", I could replace it by "5a3b", meaning that I have 5 times an "a", and then 3 times a "b". This is called "run length encoding", and works well when we are compressing sequences that have long sequences of repeated symbols. This was used in the classic PCX image format, popular in the Amiga and DOS times, where images contained usually large patches of the same color. The most common compression tool nowadays is the "zip" utility that comes with most OSs, which does use much more advanced compression than simple run length encoding (most modern compressors use dictionary-based approaches such as the LZW or DEFLATE algorithms).

These advanced compressors are usually too computationally intensive for 8 bit machines to handle, but there are tools, like Pletter (popular in the MSX world) or Exomizer (popular in the Amstrad CPC world), specifically designed for these machines. Usually, though, the way they are used is to have a compressor that runs on a modern PC to compress the data to fit it in a game, and then only the decompressor (which is not as computationally heavy) runs on the 8 bit machine.

For a comparison of how well these algorithms work, I refer you to an amazing empirical evaluation done by @newcoleco on the Atariage forums ( http://atariage.com/forums/topic/268244-data-compression-benchmark-dan-pletter-zx7-exomizer-pucrunch-megalz/ ). In summary: Exomizer compresses more, but decompression is much slower. So, if space is the biggest concern, use Exomizer, if decompression time is a concern (i.e., if you need to decompress things during gameplay), then Pletter is the better option.

For Phantomas 2.0, I considered both (for a few weeks I maintained two separate versions of the game using both compressors), but I ended deciding for Pletter, since Exomizer was too slow (it took about 4 seconds to decompress all the data I needed when changing rooms, which made the game unplayable). But in other games, Exomizer might be the best option. So, the important thing is to know your options!

Compressing the Rooms in Phantomas 2.0

There are many ways in which rooms can be defined in games. In Phantomas 2.0, we used a simple 2-plane grid-based approach: rooms are made out of 8x16 pixel tiles, and since the playable screen is 128x128 pixels wide (the game runs in Amstrad CPC mode 0, where pixels are twice as wide as tall) each room is 16x8 tiles in size. We used two planes, one for background tiles (drawn behind the player and enemies) and one for foreground tiles (drawn in front of the player and enemies). For example, here I show an example room, and the two planes it is made out of:

This representation for rooms uses quite a lot of space (ignoring enemy/item information), since there are 2 planes of 16*8 tiles, and we use one byte to specify each tile, each room is 16*8*2 = 256 byes in size (plus 20 or 30 for enemies/items). The game has 85 rooms, so that would be 24KB just for the rooms! (and remember that the Amstrad CPC only has 64 KB of RAM, and that 16KB of those are the video RAM, so, we can actually only use 48KB...). That was not going to work...

An option could have been going for the idea of macro tiles (used brilliantly in many of the productions by 4MHz). But macro tiles also have some limitations. So, since I did not want to impose any more restrictions on Jordi (who was creating the rooms), I decided I had to figure out a way to reduce space while maintaining this room representation. Of course, a first step is compressing them with Pletter. But there are 3 additional ideas that can be applied after that that can reduce space even more. So, let's get started!

Step 1: Direct Pletter Compression

Putting all 85 rooms together uses exactly 24316 bytes (286.07 bytes per room). Let's keep that as a baseline.

So, the first step is to just compress one by one, all the rooms using Pletter. The result is 12649 bytes (148.81) bytes per room. So, this is not bad, but we can do much better :)

Step 2: Exploiting Useles Bytes

Some bytes in the rooms are "useless". What I mean by "useless" is that whatever their value, the final room would look the same! For example, if there is a foreground tile that completely covers the background, then whatever value we have for the background tile is not going to affect the final room.

A first idea is to just set all of those background tiles covered completely by a foreground tile to 0. But a much better idea is to set them to the same value as the tile immediately to the left of them. If we are on the left-most tile, then we use the last background tile in the row before, and if we are on the top-left coordinate, then we just use "0". In this way, we are setting several bytes in a row with the same value, which results in a sequence that compressors like Pletter can compress better.

With this small change, the result is that the rooms (after compressing them, one by one with Pletter) use now only 11850 bytes (139.31 per room). With this simple trick, we have saved almost 800 bytes! not bad!

Step 3: |C(A+B)| < |C(A)| + |C(B)|

The next step might not be as obvious. But it is well known in information theory that if you have a compression algorithm C, and two sequences of symbols, A and B, the compression of the concatenation of A and B (which I will denote as "A+B") is usually smaller than the sum of the compression of A plus the compression of B. This is intuitive, since imagine that there is a subsequence of A that also happens to be in B. In this case, the compressor can exploit that when compressing A+B, however it cannot exploit it when compressing A or B separately, since there is no repetition. So, compressing two sequences together saves space most of the time.

Using that idea, the obvious thing to do is to just concatenate all the rooms, and compress the whole thing together, right? The problem with this is that you need to have enough space in RAM to decompress the rooms (recall they use 24316 bytes uncompressed). So, this was not an option. I know that some games (for example Operation Alexandra compressed the WHOLE game together!, but this was not an option for us, as the game does not fit in memory uncompressed :))

So, what we ended up doing is to compress the rooms by groups. We compressed rooms in groups of 4 (put 4 maps together, and then call Pletter to compress them). A little detail is that if you want good compression, you better place things that are similar together (as we saw above). So, I placed the background of all 4 maps together, then the foreground of the 4 maps, and then the item/enemy information.

With this idea, we get a significant reduction in space, and we are now at: 9926 bytes (116.18 bytes per room), which is already pretty good.

But we can do even better! :)

Step 4: Optimizing the Groupings

Some rooms look more similar to each other than some other rooms. So, I thought: if I put rooms that look similar in the same group when compressing them, would that save space? and the answer is yes! So, this is what I did: I wrote a program that first calculated the space needed when compressing all maps in groups of 4. Then, the program starts trying to permute maps (taking a map in a group and swapping it by a map in another group). After each swap, it recalculates the space, and if we have saved some space, then we keep the new grouping. Trying all possible map groupings is not feasible, but we can use a simple greedy search algorithm like this:
1) Start with the base grouping
2) Try every possible swap, if one reduces, keep it and continue from there
3) Once we have tried every possible swap, restart from the beginning again if we had any improvement, otherwise, terminate.

And this was another huge saving! After optimization, the rooms only use 9164 bytes (107.81 bytes per room)!!

Here's a little diagram showing a summary of the different ideas:



Conclusion

So, in summary, to get better compression:
1) try to put things that are similar together.
2) compress things together, better than compressing them separately
3) if you are compressing things together, try to optimize what groupings do you make!

Since these ideas worked very well, I used the same idea to store all the graphics of the game (in groups of 16 tiles), saving additional space (again, all the tiles uncompressed do not fit in RAM, so, we need to keep them compressed, and we only uncompress those needed for the current room). If you play the game, you might notice a small delay when changing rooms, which is when the game decompresses the necessary tiles for the next room. This is the part that was way too slow using Exomizer unfortunately...

The funniest thing is that after all of this space saving, I got a message from @Joseman on twitter, explaining me some details of the Amstrad CPC hardware I wasn't aware of (this was our first CPC game), and it turns out there are 4 full KB of extra space we didn't use! So, at the end there was plenty of unused space in the computer memory. But oh well, it was fun optimization anyway, and these ideas can be useful for other projects!

And that's it, I hope these techniques are useful to somebody else! If you have any additional space compression tricks, by all means I would like to hear about them! :D

Friday, February 15, 2019

The Flag Effect in XRacing

A few days ago I submitted my last MSX game to the MSXDev'18 competition. The game is called XRacing, and it is a top-down view car racing game inspired heavily in the Amiga game "Super Cars". A quick video of the intro and first level can be seen here:


I will explain the technique I used for the game scroll in another post, but in this post I wanted to focus on explaining how does the flag waving animation in the title screen work! I want to explain how it works, since the method achieved pretty nice results and it is not too complex, and it could be useful to other people trying to add similar effects to other MSX games!

My goal was to figure out a way to play a small "video loop" in the background that looked very smooth. I wanted to do this since I saw this video of a fire animation converted to the ZX Spectrum specs. I thought it would be amazing to have something like that on a game. So, I started thinking how could I convert a small video on my computer to play on the MSX, and how could I make it small enough to fit on a small game.

I was initially wanting to convert a fire animation (a real video of a fire, not one of those fire visual effects), but I struggled to find any that I could easily convert to a loop with a handful of frames (e.g., 16 frames or so). So, I remembered that back in the day, when we were working on the F-1 Spirit remake, we had this fancy flag animation in the background of the title screen. So, I thought that would be an easier one to convert! I started looking for real videos of a waving flag, but again I struggled to find one I could easily turn into a small 16 frame loop. So, I just wrote a little program to generate my own flag. It doesn't look real, but it did the trick.

Alright, so, before I describe the technique, let me take a brief moment to explain the technical specs of the MSX graphic chip, which will define the constraints of the approach!

The MSX VDP

Most first generation MSX computers shipped with the TMS9918 graphics chip, which I will just call "the VDP" (Video Display Processor). The VDP has several graphic modes, but I will focus on what is known as "Screen 2" in the MSX world, which is the most powerful graphic mode the VDP offers.

Screen 2 is a 256x192 pixel graphic mode, where the screen is divided into 8x8 pixel blocks called "patterns" or "tiles". The screen is thus 32 patterns wide (32*8 = 256) and 24 patterns tall (24*8 = 192), as is shown in this small diagram:


Each pattern is defined by two components a "pattern table entry" and a "color table entry", which are combined in the following way to draw the pattern:


So, as can be seen above, the "pattern table entry" (8 bytes per pattern) defines the "shape" of the pattern, and the "color table entry" (8 bytes per pattern) defines the color of the pixels. Two colors can be specified in each byte of the color table entry, which correspond to the 2 colors that can be used in the each of the 8 rows of the pattern. Each of the bits in the pattern table entry corresponds to a pixel in the pattern, and whether its a 0 or a 1 determines which of the two colors specified for the corresponding row will be drawn. This is where the "2 colors for each block of 8x1 pixels" comes in the MSX!

In order to draw something in the screen, what you need to do is to write data into the "video RAM" (VRAM from now on), and as a result, the content of the screen will change. The VDP has 16KB of VRAM, divided into a collection of areas:
- the Pattern Table: we can specify up to 768 different patterns, divided into 3 different banks (as shown in the picture above; more on this below). Each pattern is 8 bytes long, with each byte corresponding to one of the rows of the pattern (each bit corresponding to 1 pixel), as described above.
- the Color Table: also has 768 entries of 8 bytes each, specifying the colors of each of the 768 patterns as described above.
- the Name Table: finally, the name table is a 768 byte table that specifies which pattern to draw in each of the 768 positions of the screen. For example, if we define two patterns (a smiley face in pattern 0, and a sad face in pattern 1), if we write 0, 1, 0, 1, 0, 1, etc. in the first 32 bytes of the name table, we will see a row of smiley and sad faces. So, once we define a pattern, we can reuse it multiple times in the screen. There is only one limitation here, and it is that since we can only use 1 byte to specify which pattern to draw, we can only select from 256 patterns (since that's the maximum number of values we can represent with a single byte). Since there are 768 patterns, these are divided in "3 banks". In the top 8 rows of the screen (BANK 1), we can only select from patterns 0 - 255, in the middle rows, we can select from patterns 256 - 511 (BANK 2), and in the bottom 8 rows from 512 - 767 (BANK 3).
- and then a couple of tables for sprites (but we will ignore sprites for now :))

By looking at this structure you can already probably see why some MSX games have a very "chunky" scroll of 8x8 pixels! It's just because the easiest way to implement scroll in MSX is by just defining our graphics as patterns, and then just moving them with the name table (which lets us move them only in increments if 8 by 8 pixels!). If we want smoother scroll, then we have to figure out a way to do it, by using just the tools I described above!

Another thing to have in mind is the slow speed of the MSX CPU (Z80) and the even slower speed of data transfer to the VDP! The MSX Z80 has a clock of 3.58 MHz (3580000 ticks per second). If we want to have a smooth video/game that renders at 50 frames per second, we only have 71600 ticks per frame. If we have in mind that it takes usually 29 ticks to transfer a single byte to the VDP. So, we can only optimistically transfer about 2500 bytes per frame to the VDP, which is not enough to redraw a whole frame! 768 patterns * (8 bytes for pattern table + 8 bytes for color table) = 12288 bytes, which is the reason for it being so hard to have smooth scroll on MSX...

Ok, so that's the VDP. Now, how do we get a smooth animation on such a slow hardware?


Step 1: Generating the Video

To generate the source video, I just wrote a small Java script to render a flag waving loop with 16 frames of 256x192 pixels in black and white (I wanted to have gray-scale, but I'll explain below why I decided against it). The original video looks like this:



Step 2: Extracting the Patterns

The next step is to go one by one each of the 16 frames of the animation, and extract all the different patterns needed to display this animation. For doing this, we scan each 8x8 pixel block in each of the 16 frames, and we create a list of all the different patterns we find. Also, since once this is uploaded to the VDP, each of the 3 banks of the VDP is separate, we do this for each bank separately. With the video above, we find the following:

- BANK 1: 787 different patterns
- BANK 2: 510 different patterns
- BANK 3: 772 different patterns

So, assuming we could store all of these tiles in the VDP, what we would have to do is this:
1) Store all of these tiles in the game ROM: (787+510+772)*16 = 33104 bytes
2) Now, each frame of the animation would just be 768 bytes (the name table for each frame). Since we have 16 frames, this would be: 768*16 = 12288 bytes
3) Then, at run time, we just load all the patterns in the VDP initially. And at each game frame, we just keep uploading the name table that corresponds to the current frame. This is just 768 bytes per frame, which is doable.

Problems:
1) The VDP can only store 256 patterns per bank, and thus 787, 510 and 772 do not fit!
2) Even if they were to fit, that animation already would use most of the size of the ROM cartridge! (I was targeting a 48KB ROM)

So, we need to reduce that somehow :)


Step 3: Reducing the Set of Patterns Needed

If we were to just display the flag animation by itself, we could afford to have 256 patterns per bank. However, I wanted to have the game logo drawn on top of the flag, and maybe have some text message at the bottom of the screen. So, I had to reserve some patterns for that. In order to show the title image, I needed these patterns (not all patterns are strictly needed, but it was easier for me to have these patterns, which I reuse in other parts of the game):


If you count, those are 108 patterns  But I wasn't sure if I was going to need a few more, so, I assumed I might need 112 patterns  This left me with 256 - 112 = 144 patterns in each bank for the flag animation. So, how to reduce from 787 tiles to 144?

The answer is a standard machine learning family of algorithms called clustering algorithms. Clustering algorithms work as follows: the input is a large collection of points in an n-dimensional space, and the output is a "clustering" (a grouping of the input points into a collection of "buckets" called "clusters"), where the points in a cluster should look more similar to each other than to points in other clusters. So, what I did was to consider every pattern as a point in a 64-dimensional space (one dimension per pixel in the pattern), and I gave all 787 patterns for BANK 1 as input to a clustering algorithm (a very simple one called k-medoids), and I asked the algorithm to cluster them into 144 clusters. From each cluster, I then pick the most "representative" pattern (the "medoid"). I then do the same thing with BANK 2, and BANK 3. I then replace every pattern in the flag animation by the representative pattern of its cluster. This effectively reduces the number of patterns needed for the animation to just 144 per bank!

The resulting animation now looks like this:


Notice that it mostly looks like the same animation, except for a few glitches! So, this already looks promising! After that, I just manually replaced the medoid of a couple of clusters by a different one to fix those dots at the top of the animation, and the animation already looked much cleaner!

Step 4: Translate the Animation to MSX Data

Finally, now we can just take those 144 patterns per BANK, and save them to a binary file. Notice that all patterns have just black and white as color, so, I do not store the color table (assume that the color table for all patterns is identical, and just set to black and white). So, I need to store 144*3*8 = 3456 bytes for the patterns (much lower than before!). And after compressed using Pletter, it's just 1434 bytes.

I considered having gray in the flag to have shadows, but that forced me to store a color table, which used more space. So, I decided against it!

Once we have the patterns, for each animation frame (there are 16 frames), we store the corresponding "name table", i.e., which patterns have to be drawn in each of the 32x24 positions of the screen. This is 768 bytes per frame, so 768*16 = 12272 bytes. I actually never draw the bottom line of the animation, since I use that for text, so that saves some space, and I just need 11776 bytes. After compressing them with Pletter, it;s reduced to 5842 bytes.

Step 5: Playing it in the MSX

Once we have that, to play it back on the MSX, we just need to copy the patterns to the pattern table in the VDP, set the color tables to black and white, and then at each frame copy the corresponding name table to the VDP. The video could play easily at 50 or 60 frames per second (since it just needs to copy 32*23 = 736 bytes per frame), but the animation would be too fast. So, the game plays it at 25 or 30 frames per second depending on if it is running on a PAL or NTSC MSX.

The XRacing logo is then drawn on top by just overwriting the proper positions on the name table, and drawing a few sprites on top of it to achieve the 4 rounded corners, and a few other details (there are 19 sprites on the title screen).

And that's it! So, basically the idea is: 1) take an animation loop as input, 2) get the set of patterns needed for each bank, 3) use k-medoids to reduce the number of patterns to 144 per bank, 4) with that we can already display the animation on the MSX, since we just need to change the name table in each frame!

It's not rocket science, but what I find interesting is that there is nothing in the code hard-coded for the video being a flag waving animation. The idea could in principle work for any other small animation loop. So, I might use this technique in the future to display other types of animations :)

Ok, that's all for this post, which is already getting too long! Next time I want to explain how the in-game scroll works! which is a bit more involved :)

Friday, September 22, 2017

Raycasting in Tales of Popolon (MSX)

Today I wanted to talk about how raycasting works in my game Tales of Popolon, which I'll refer to as ToP for short. If you have never seen ToP, think of it as if you were to take the classic Wolfenstein 3D, but with a Greek theme, and running on an 8bit computer rather than on a PC. You can see a video of it here:


I am particularly proud of this game, since I put a lot of effort into making the raycasting engine run fast enough to run in the slow MSX CPU, and also using the MSX VDP (the graphics chip), that was definitively NOT designed for being used this way (the VDP better suits games with 2D graphics made out of 8x8 tiles).

Before I go on, I just wanted to take this chance to thank everyone that helped me complete this project. Starting from everyone that provided feedback, suggestions, and even help with the MSX2/MSX2+ and Turbo R support! And also special thanks to Matra for the fantastic physical edition of the game! (which you can find here).

Also, a small warning: this post is going to be quite technical, and I'm going to get into a lot of details of assembler programming. So, you are warned! :)

Ok, so, let's get right on to raycasting! Let me start by briefly explaining how raytracing works!

Raycasting Basics

The key idea behind raycasting is the following. Imagine that we stand in the middle of our living room and hold an empty picture frame in front of us (at arms length). Imagine now that we put a grid on top of the picture frame, so that the empty imaginary canvas is divided into a rectangular grid (each of the cells in this grid is one of the pixels in the screen). Now, if we want to know which color to paint each of the cells of this grid, we just need to "cast a ray" from the coordinates or our eye (assume we have a single eye, for simplicity) in the direction of each cell. The first object in our living room that such ray collides with after passing through the picture frame will determine the color of that cell.

So, to render a game using ray casting, we basically have to "cast a ray" for each pixel in the screen, calculate which object in the game gets "hit" by the ray, and paint the corresponding pixel of the color of that object.

Since that might be hard to visualize, let me show it with a small diagram (perspective might be totally off, since I drew it by hand in OminGraffle, hahaha):



Of course, if we want to do all of those calculations for a complex 3D scene, and expect the game to run at a decent frame rate, we would need a much more powerful computer than an MSX1. So, we will have to do lots of simplification assumptions. So, let me explain my simplifications in more detail (which are very similar to those used in Wolfenstein 3D, although not exactly the same).

Representing the Level Map


Simplification 1: the first simplification is how to represent the map. Rather than have 3D geometry, the map is represented as a 2 dimensional grid. Specifically, this is how I represent the very first level of ToP:

map:
db 2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1
db 2,2,2,2,0,0,2,0,0,0,0,0,0,0,0,1
db 2,2,2,2,0,0,2,1,1,0,1,1,1,1,0,1
db 6,0,0,0,0,0,5,1,1,0,0,1,0,1,0,1
db 2,2,2,2,0,0,2,1,1,1,0,1,0,0,0,1
db 1,1,2,2,0,0,2,1,1,1,0,1,1,1,1,1
db 1,1,1,2,2,3,2,1,1,1,0,1,1,1,0,1
db 1,0,1,1,1,0,1,1,0,0,0,0,0,6,0,1
db 1,0,1,1,1,0,1,1,1,1,0,1,1,1,1,1
db 1,8,1,1,1,0,0,0,0,0,0,0,0,0,0,1
db 1,0,1,1,1,1,0,1,1,1,1,1,1,1,1,1
db 1,0,0,0,0,0,0,1,1,1,1,2,2,2,2,2
db 1,1,1,0,1,1,1,1,0,0,0,0,0,2,2,2
db 1,1,1,0,1,0,0,0,0,0,0,0,0,3,0,5
db 1,1,1,0,0,0,1,1,0,0,0,0,0,2,0,2
db 1,1,1,1,1,1,1,1,1,1,1,2,2,2,4,2


As you can see, it is a 16x16 grid of cells. Where each cell is stored as 1 byte (all the levels of ToP are exactly the same size: 16x16 = 256 bytes). This is done so that I can represent map coordinates at cell resolution in a single byte, with the 4 most significant bytes being the Z coordinate, and the 4 least significant bytes being the X coordinate. (note: I am using Y to represent "up", so, the map grid is in X, Z coordinates)

The number in each cell represents what is there in that position of the map: 0 means corridor, 1 is a red wall, 2 is a blue wall, 3 is a door, 4 is a staircase, 5 is a stone face, 6 is a gate, etc.

ToP uses coordinates at two levels of resolution:

  • Cell coordinates: where 0, 0 is the top-left of the map, and 15, 15 the bottom-right of the map represent the coordinates of the map at increments of one "cell". Walls are placed at this resolution.
  • Pixel coordinates: each "cell" is 16x16 pixels in size, so maps are 256x256 "pixels" in size. The coordinates of the player, enemies, items, bullets, etc. are at this resolution. 
So, to get the cell coordinates of an enemy at position (x, z), we just need to do (x/16, z/16). 


Also, as you can see, walls in ToP always occupy the whole height of a level, from floor to ceiling. The engine does not support slopes, half-height walls, etc. That is the first and most important simplification (half height walls are actually trivial to add, but I ended up removing them, since they didn't look great, and didn't add anything to the game!).

To make this post an acceptable length, I will only focus on rendering the background (walls, ceiling and floor) or a level for now, ignoring the items, enemies, etc.

Casting Rays


Ok, so, the very first thing we need to do is to be able to cast the ray corresponding to a given pixel. Let us make the following assumptions:

  • We have a camera in the game world at coordiantes (cx, cy, cz). 
  • The camera is looking in a direction (fx, fy , fz). (the forward vector)
  • Now, we assume that the "viewport" would be placed in world coordinates at a fixed distance D in front of the camera. So, the coordinates of the pixel at the very center of the viewport are (cx, cy, cz) + D*(fx, fy, fz). Also, we know that (fx, fy, fz) is the normal vector to the viewport (which is now completely defined).
Now, what we need to do is to calculate the coordinates of each of the pixels in the "viewport" in game-world coordinates, and with that we can define the ray that starts in the camera and goes through each one of them. But this can start to get complicated already (not math-wise, but computation-wise, since the MSX will struggle with trigonometry, multiplications, etc.). So, let's start making simplification assumptions:

Simplification 2: the camera will ALWAYS be at half-distance between the floor and the ceiling. So cy = "height of the walls"/2. Also, this means that all the pixels in the top-half of the screen correspond to rays that "go up" (and will eventually collide with the ceiling), and all the pixels in the bottom-half of the screen "go down" (and will eventually collide with the floor). 

Simplification 3: All the pixels in the very center of the screen, for which the rays go forward at very far distances without colliding with ceiling/floor will be ignored! So, there is a horizontal band of pixels in the middle of the screen that are treated specially.

Simplification 4: the camera can never loop up or down, always forward. This means that fy = 0.

With those simplifications, the process is already much simpler than the general case of raycasting! So, considering just the top half of the screen, to determine the color of a pixel P, what we need to do is:
  1. Calculate the coordinates of pixel P in game-world coordinates: px, py, pz
  2. Calculate the game-world coordinates at which the ray that passes through px, py, pz collides with the ceiling (i.e., when y = "height of the walls"): (lx, ly, lz)
  3. See what object is in the map at cell coordinates (lx/16, lz/16), and use the color of that object to draw that pixel.
And basically that's it!.... except for a small problem with Step 3... Notice that step 3 assumes that the position where the ray collides with the ceiling contains the object we want to draw, but that is not always the case, for example, look at the following situations (side view):


  • ray 1 will collide at a position where there is no wall (so, it's a corridor, identified with a "0" in the example map I showed above), and thus will render with "corridor" color
  • ray 2 will collide at a position where there is a blue wall (number "2" in the example map), and thus will render blue color.
  • But ray 3, which should also be a wall, collides with the ceiling at a position where there is no wall, so, it will be rendered the "corridor" color, which is wrong.

To solve this problem, there are several solutions. But the one I chose for ToP was this:
  1. For each vertical column of pixels in the screen (remember, we are only considering the top half of the screen for now), start with the top pixel:
  2. If the current ray collided with a corridor, draw corridor color, move on to the next pixel and repeat from step 2 (unless we are done with this column, in which case, we terminate).
  3. If the current ray collided with a wall, then draw wall color for this pixel and for all the remaining pixels of this vertical column, then terminate.

In this way, as soon as "ray 2" collided with a position where there was a wall, "ray 3" would have not even been considered, since all the remaining pixels would be drawn as wall.

Now, as a consequence of Simplifications 2 and 4, once we have rendered the top half of the screen (and if we ignore the fact that there are textures), the bottom half of the screen is just a mirror image of the top, and we don't even need to cast rays! So, if we want to do raycasting without textures, we would be done at this point.

I will explain below how to calculate the "collision points" in an efficient way, but for now, let's move on... Suffices to say that we want the collision coordinates to be at the "pixel coordinates" resolution level (so that we can properly render textures).

Rendering Textures

Rendering a texture basically means calculating which of the pixels of the texture needs to be copied to each pixel in the screen. In ToP, in order to make this calculation fast enough, I used textures that are 16x16 pixels in size. So, for example, this texture representing one of those stone faces that talks to you in the game


is stored like this:

facetexture:
    db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    db 0,0,0,0,#f1,#f1,#f1,#f1,1,1,1,1,0,0,0,0
    db 0,0,0,#f1,#f1,1,1,1,1,1,1,1,1,0,0,0
    db 0,0,#f1,1,1,1,1,1,1,1,1,1,1,1,0,0
    db 0,0,#f1,1,1,1,1,1,1,1,1,1,1,1,0,0
    db 0,0,#f1,0,0,0,0,1,1,0,0,0,0,1,0,0
    db 0,0,1,1,0,#81,0,1,1,0,#81,0,1,1,0,0
    db 0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0
    db 0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0
    db 0,0,0,1,1,1,0,1,1,0,1,1,1,0,0,0
    db 0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0
    db 0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0
    db 0,0,0,1,0,1,1,1,1,1,1,0,1,0,0,0
    db 0,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0
    db 0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0
    db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

The pixels that have their 4 least significant bits to 0, are painted "black", those that have value "1" are painted of the "default texture color" (in this case grey), and those that have their 4 most significant bits different form 0 have their own color. I added multicolored textures later in the project, so this representation was just a hack, since I was running out of space in the ROM. But if I re-implement this again, I would probably just have the color directly (without having any "default color" thinguie).

So, when an object (wall, ceiling, floor) needs to be rendered, what needs to be done is to calculate the "offset" in the texture that needs to be rendered, i.e., the coordinates (tx, ty) of the pixel of the texture that needs to chosen.

Ceiling/Floor: Given that we have the collision point (lx, ly, lz) of each ray in "pixel coordinates", this is as simple as: tx = lx%16, and ty = lz%16.

Walls: Walls are more complicated, and "tx" and "ty" are each calculated via different methods:
  • tx: the "cell coordinates" of the ray collision point (lx, ly, lz) are (lx/16, lz/16), and we know that there is a wall in that position (otherwise, we would be rendering a ceiling/floor texture). But what we need to know is whether the ray entered in (lx/16, lz/16) from the north, east, south or west, and what is the "pixel coordinate" at which the ray exactly entered this cell. That entrance point is exactly what will determine "lx". So, if the entrance point is (ex, ez) in pixel coordinates, and the ray entered through the north or south, then tx = ex. If the ray entered through east/west, then tx = ez. To resolve this, ToP does this: we know that the previous screen pixel we rendered was showing either ceiling or floor (since if it were showing wall, this ray would have never been casted!). Let us call (lx0, ly0, lz0) to the ray collision point of the previous pixel. So, what we do is we use a Bresenham algorithm to move, pixel by pixel from (lx0, lz0) to (lx, lz) until we cross into the cell where the wall is, which is exactly the point where the ray entered the wall. In case that this is the very first pixel in the current column, then (lx0, lz0) = (cx, cz). 
  • ty: for this, what ToP does is this: let "n" be the screen y coordinate of the first pixel in the current column that is a wall (all the previous pixels were ceiling). If the vertical screen resolution is "m", then we know that the wall will be "m - (n*2)" pixels tall when rendered. So, ty is determined by starting with ty = 0 in the first pixel in this column that is a wall, and then advancing at increments of "16 / (m - (m*2))" pixels at each consecutive pixel in the column.

Comparing this to Wolfenstein 3D: one detail is that Wolfenstein 3D did NOT render floor/ceiling textures, but ToP does. So, in a sense ToP's engine is a little bit more advanced than Wolfenstein 3D's one :) (in other sense, it's more limited, of course, such as in rendering partially occluded sprites, for which ToP uses hardware sprites, which cannot be partially occluded easily)

Also, I noticed that this way of rendering textures would allow animations for free, since I just need to associate a different texture to a given object in different frames, and it'll look like an animation is playing. So, I added one wall type that has a small animation. I wanted to add more animations, but I ran out of ROM space... :( (notice that animations do not slow down the game at all, every wall could be displaying an animation at the same time, and it would not affect rendering time).

So, in summary, the rendering algorithm is as follows:

for(x = 0;x < screen_width; x++) {
  (lx0, lz0) = (cx, cz)
  for(y = 0;y < screen_height; y++) {
     (lx, ly, lz) = collisionCoordinatesOfRayForPixel(x, y);
     object = map[lx/16][lz/16];
     if (object == CORRIDOR) {
       putPixel(x, y, corridorTexture[lx%16][lz%16]);
         (lx0, lz0) = (lx, lz)
     } else {
       (rx,rz, direction) = crossOverPoint(lx0, lz0, lx, lz);
       if (direction == NORTH || direction == SOUTH) tx = rx%16;
                                                else tx = ry%16;
       ty = 0;
       for(y2 = y; y2 < screen_height; y2++) {
         putPixel(x, y2, objectTexture[object][tx][ty]);
         ty += 16 / (SCREEN_WIDTH - y*2);
       }
       break;  // we are done with the column
     }
   }
}

Doesn't look that complicated, right? Well, it's actually not! the complex part is to make it run fast. So, let's get on to it! How do we implement this on an MSX?

Organizing the VDP Memory for Raycasting


As you can see in the pseudo code above, pixels are rendered column by column. So, it is important that pixels that are one on top of another are contiguous in memory, so that if we have the pointer of the current pixel in say, HL, we can just do "inc hl", to move to the next pixel. So, since I used SCREEN 2 for the game, I organized the name table like this:



Where you can see that I use the top two banks for rendering the game, and the bottom one for the score board. So, the pattern and color table of banks 1 and 2 is set to all 0s at the beginning of the game, and the regular patterns (the txt font, etc.) are only kept in bank 3.

Since I don't use the whole screen to render the game, I left some black borders around (marked with pattern "#ff"). Then you see that the first pattern that is used for rendering is #00, then immediately below that is #01, etc.

Simplification 5: I reduced the resolution of the game, and each rendered pixel is actually a 2x2 pixel. This divides by 4 the number of rays I need to cast, making the game run 4 times faster. I will call each of the 2x2 larger pixels a "logical pixel" from now on, to avoid confusion.

So, each 8x8 pattern contains a block of 4x4 logical "pixels". So, the first four "logical pixel columns" (0, 1, 2, 3) are stored in patterns #00, #01, #02, #03, #04, #05, #06 and #07. The next four columns (4, 5, 6, 7) are stored in patterns #08, #09, #0A, #0B, #0C, #0D and #0F, etc.

I keep an offscreen buffer in RAM that contains both the pattern and the color table, which is what I render to. Once a complete game frame has been rendered to this buffer. I copy its contents over to the VDP (first pattern table, and then color table). Since there is a lot of info to copy, you might have noticed that the colors in the game some times flicker a bit (that's because it takes more than one screen refresh cycle to copy all the information over to the VDP!). I considered interlacing the copying, but, even if it made it look better, it made the game a bit slower too, and used more space in the ROM, so, I decided against it.

In order to render, the process is thus like this:
  1. Clear RAM buffer: the part corresponding to the pattern buffer to all 0s, and the part corresponding to the color buffer to the color of the ceiling or floor (in that way the color is already set, and I only need to change it when I render a wall). The background color for all the patterns is always kept to black, and I only change the foreground color.
  2. For each column of pixels in the screen:
    1. Update the pattern buffer: for this, at the beginning of each column, I calculate a "mask", which is #C0 for the first column, #30 for the second, #0C for the third, and #03 for the fourth (after that, the cycle repeats).  To draw a pixel I OR this mask with the current pattern if I want to set the pixel as "on", and if I don't need to draw a pixel (when the texture has value 0), then I just skip this pixel.
    2. Update the color buffer: since the color can only be changed for blocks of 8x1 pixels, this means that each block of "4x1 logical pixels" (remember that the "logical pixels" in ToP are blocks of 2x2 pixels) have the same color. So, there can be some color clash. What I do is simply, each time a pixel is drawn (when I OR the mask for the pattern buffer), I set the color in the color buffer to the pixel color (only when rendering walls, when rendering floor/ceiling, the color buffer does not need to be even touched).
  3. Once all the columns have been rendered, copy the RAM buffer over to the VDP.

I considered not clearing the RAM buffers, and just overwrite. But it turns out that the extra "AND" operation required to "erase a pixel", and having to write colors each time was slower. So, clearing the RAM buffer was the fastest solution.

The Game Loop


Rendering a whole game frame takes very long (I can only render between 2 to 3 frames per second, depending on the viewport size selected). So, characters and enemies would move too slow for making the game playable. The solution I used was to divide the game loop into "sub frames". Each "game frame update" is actually divided into four "sub-frame updates" like this:
  • Subframe 1: Update player character/enemies/bullets/etc. Clear RAM buffer, and render 1/4th of the columns.
  • Subframe 2: Update player character/enemies/bullets/etc. Render 1/4th of the columns.
  • Subframe 3: Update player character/enemies/bullets/etc. Render 1/4th of the columns.
  • Subframe 4: Update player character/enemies/bullets/etc. Render 1/4th of the columns, copy the RAM buffer to the VDP.
In this way, the game executes 4 updates for each complete frame rendered using raycasting, which feels more fluid.  You can seethe effect of this in that enemies/bullets move smoothly, even if the environment is rendered much slower.

Note: as pointed out by NYYRIKKI in the msx.org forums, a better strategy would have been to run the game update on the screen refresh interrupt, and let the renderer run loose in the background. In that way, I would not need any speed adjustments when running on faster machines (turbo R). So: noted for the future, in case I reimplement this again! :)


Precalculated Tables


The computations required for raycasting require many multiplications, divisions and calls to trigonometry functions, all of which are very slow on MSX. So, in order to make the renderer render more than 1 frame every 10 seconds, I used a bunch of pre-calculated tables. Specifically, I used these tables:

  • Raycasting collisions: this is all precalculated. Notice that since I do not allow the camera to rotate up and down. The rays for each of the pixels in the screen will always be the exact same ones. So, what is the point of calculating this every time?! What I have is two precalculated tables (ray_x_offs_table, ray_y_offs_table) of size 512 x 32, which contain this: 
    • ray_x_offs_table[a][b]: the distance in the "x axis" at which a ray casted at an angle "a" around the y axis and "b" around the x axis will collide with the ceiling. "a" is precalculated at a resolution such that "512" steps is a full revolution. When rendering adjacent columns, we just advance one step in this dimension. So, since the default rendering viewport is 80 logical pixels wide, that corresponds to a viewing angle of about 56 degrees (which is close to the usual 60 degrees). The vertical resolution is 32 steps, since the viewport is 64 logical pixels tall, which means that the top half is 32 pixels tall. So, [b] is basically indexed by the pixel row we are currently rendering. 
    • ray_y_offs_table[a][b]: is the same but in the "z axis".
The camera angle is kept in the resolution of "128 degrees is one full revolution". So, when I want to cast a ray for logical pixel "i, j", I check ray_x_offs_table[angle*4 + i][j], and to the result of that I just have to add the camera coordinates, and that gives me directly the ray collision coordinates. So, no need to do any complex math during raycasting!
  • Texture offset increments: if you check the algorithm above, one step says "ty += 16 / (SCREEN_WIDTH - y*2);", this is problematic since it has one division. I have a precalculated table called "texture_vertical_rate_table" that stores the value of "16 / (SCREEN_WIDTH - y*2)" indexed by "y". So, one more division saved!
  • Sin/Cos tables: for moving the player / bullets / enemies around, I need to make them move depending on the angle they are facing. So, I need to calculate sine and cosine. This is all precalculated in four tables: sin_table, cos_tablesin_table_x12, cos_table_x12. The last two, since there are things that move faster, and thus, it's convenient to have it precalculated with the multiplication built-in already.
  • Distances: this has to do with rendering enemies/bullets/items. When I need to render an enemy that I know is at a certain distance, I need to know where in the screen to render it. That would also have involved divisions. So, I have a precalculated table that tells me for an object that is at distance dx, dz (in the X and Z axis respectively), at which y coordinate do I need to render it in the screen.
  • atan2: this is a routine I got from a post that ARTRAG did in the msx.org forums, that also has some precalculated tables (see original post here). I use it to calculate the X coordinate where to render enemies / items in the screen.
You would think that after all of these simplifications, this would already run fast on an MSX, right?! Well, think again!!! After I implemented all of the above it took 5439255 cycles in average to render a single frame... if we recall that the Z80 in the MSX executes 3579545 cycles per second, that comes to about 0.66 frames per second... it was unplayable...

At this point I had spent quite some time working on the engine and I was about to just give up. "The MSX is too slow for this" I thought. But I started looking at videos of other raycasting games for the MSX and Spectrum. Other people had managed to do it! so, there had to be a way! The next few weeks were 100% dedicated to optimization of the whole thing!

Speed Optimization Techniques


There are lots of websites with a very large number of Z80 optimization tricks. I read all I could find, and every single little trick that I could apply to my code, I applied it! The one that I'm sad I did not find a place to apply it is this one (which blew my mind!): http://www.maxcoderz.org/forum/viewtopic.php?f=5&t=2579. So, I'm not going to go into detail of every single little thing, but just focus on those techniques that really made a big difference.

Moving if-then-else statements outside of inner loops: the first optimization (that saved about 400k or 500k cycles if I recall correctly) was the realization that every time I had a piece of code that looked like this:

while(some condition) {
  if (condition that does not change inside of the loop) {
    do X
  } else {
    do Y
  }
}

I should replace it with:

if (condition that does not change inside of the loop) {
  while(some condition) {
    do X
  }
} else {
  while(some condition) {
    do Y
  }
}

This increased the size of the code significantly, but made it much faster. If you look at the raycasting code, you will see that I did this, for example, for handling the four quadrants. The code for managing raycasting in each of the four quadrants (from 0 to 90 degrees, 90 to 180, etc.) around the Y axis is identical except for a couple of places where I need to either add or subtract. By taking the "if we are in quadrant 1" conditions outside, I had to replicate the loop 4 times, but each loop ran much faster.

Aligned Tables: imagine that you have a table in a memory position X and you need to index it by the content of the accumulator register A, what you would have to do is something like this:

ld hl,X
ld b,0
ld c,a
add hl,bc
ld a,(hl)

However, if you know that X is a multiple of 256, then you know that when you do "ld hl,X", "l" will be equal to 0. So, you can actually just do:

ld h,X/256
ld l,a
ld a,(hl)

Voilà! much faster! This is a very common trick, and not anything I invented, but it saved another few hundred K cycles as well! I tried to use aligned tables for everything I could: textures, maps, etc.

Maximizing register use: but by far the thing that saved more CPU cycles was removing as much as I could any access to memory. The Z80 might only have a few registers: AF, BC, DE, HL, IX and IY. But a few of those (BC, DE, HL) are actually replicated! Using the assembler instruction EXX we swap from BC, DE, HL to the "ghost" copies of those registers. So, in fact we have AF, BC1, BC2, DE1, DE2, HL1, HL2, IX and IY, which is a TON of registers! Also, the IX and IY registers can also be used as pairs 8 bit registers (IXL, IXH, IYL, IYH), they are slower than the others, but still faster than RAM!

One of the most time consuming optimization processes I undertook was to track the usage of ALL registers (including the ghost BC, DE, HL) throughout the whole raycasting code. So, by making extensive use of EXX, I managed to remove all usages of variables in RAM in all inner loops of the raycasting code. For example, when rendering a texture, I keep the pointers to the texture, etc. in the regular BC, DE, HL, and the pointers to the raycasting pattern and color RAM buffer in the ghost copies. Sounds simple, but it was a pain to track which registers were being used at all times... I think all of this saved at least 1 million cycles or so.

Top Half of the Screen is Analogous to the Bottom: when rendering the bottom half of a column in the screen, there is no point on repeating calculations, since it is almost a mirror of the top half. When rendering the ceiling texture on the top half, I store all the texture data in a 32 byte buffer (which, of course, is 256-aligned), and when rendering the bottom half, I just retrieve it from that buffer (raycast_floor_texture_buffer).

Floor/Ceiling Texture Optimizations: in the first version, I allowed arbitrary textures for the ceiling/floor. But it turns out that rendering the ceiling/floor was the most time consuming part of the whole renderer! (no wonder they did not include it in Wolfenstein 3D!!!). So, what I ended up doing is to remove the floor/ceiling textures, and replace them by hard-coded code, that given the (tx, ty) offset coordinates calculates whether the pixel should be painted or not. For example, the floor texture is generated by painting those pixels for which (tx+ty) & 0x80 != 0

Thanks to such optimization, the code that renders the floor, is as simple as this!

raycast_render_next_pixel_floor:
    ;; retrieve the texture stored during ceiling rendering (bc points to "raycast_floor_texture_buffer"):
    ld a,(bc)
    or a
    jp z,raycast_render_done_with_floor_pixel

    ;; texture is 1, render pixel:
    ld a,(hl)   ;; hl points to the raycasting pattern RAM buffer
    or d        ;; d has the "mask" corresponding to this column
    ld (hl),a   ;; render pixel
    inc hl
    ld (hl),a   ;; we render it twice, since logical pixels are 2x2
    inc hl
    dec c       ;; c is the number of pixels left to render in this column
    jp nz,raycast_render_next_pixel_floor
    jp raycast_render_done_with_column

raycast_render_done_with_floor_pixel:
    inc hl
    inc hl
    dec c
    jp nz,raycast_render_next_pixel_floor

This saved another very significant number of CPU cycles!

Saving Space in the ROM: finally at the end I was running out of memory, so, in addition to compressing everything with Pletter, etc. I had to reduce the space taken by all the precalculated tables. For that reason, if you look at the code, you'll see that all the sine / cosine tables are overlapped (since the sine / cosine functions are the same function, just one is shifted from the other). And also, I don't really have a ray_y_offs_table, since it's exactly the same table as ray_x_offs_table but reversed. So, I only store one of them and either add (for X) or subtract (for Y) to index it.

After all of these optimizations, at the default viewport size, the game takes 1550184 cycles in average to render a whole frame, less than 1/3 of the original time! and making it to about 2.3 frames per second! If you reduce the viewport size to the smallest size then it takes about 1053651 cycles in average, getting about 3.5 frames per second, which, compared to the original 0.66 is ridiculously smooth!!!

Summary


So, those are the basics. There are a bunch of things that I have not explained, like how I am rendering the enemies / items, or how does occlusion testing works (when an enemy is behind a wall, it should not be rendered, and occlusion testing is expensive CPU-wise), but this entry is already getting too long. I also implemented a fancy sprite pattern loading code that dynamically loads and unloads sprites onto the VDP based on what is being rendered that I thought was very cool. But I think I'll leave it here, since the basics are covered!

One of the main obstacles to making it faster is basically the VDP. Not having direct access to change bytes of the video memory is what forces me to have to have the RAM buffer, and then a copy, etc. I think that the same type of game could perhaps run a bit faster on computers such as the Spectrum or the Amstrad CPC, where the video memory is memory mapped. There would be some drawbacks, of course, such as color clash in Spectrum, which might force me to forget about coloring textures at a pixel-level, and force me to use texture-wide colors. Also, in Spectrum or Amstrad I would not be able to use hardware sprites, so, rendering enemies/items would be harder, since it cannot be done independently of the background. Another obstacle is memory. I think that with a bit more RAM I could have added more pre-calculated tables, but this was not an option for the MSX1...

I also think that things could be optimized a bit more on an MSX2 since some of the additional screen modes supported by the MSX2 would not require having to OR a mask to render pixels (right now I need to READ a pixel, OR the mask, and the WRITE it, requiring two memory accesses per pixel rendered! which is not ideal...). 

Ok, so that's it. That's how the raycasting-side of ToP works! If you made it all the way here, congrats!!! Since this one was a long one!! hahaha