I am particularly proud of this game, since I put a lot of effort into making the raycasting engine run fast enough on the slow MSX CPU, and also into using the MSX VDP (the graphics chip), which was definitely NOT designed to be used this way (the VDP is better suited to games with 2D graphics made out of 8x8 tiles).
Before I go on, I just wanted to take this chance to thank everyone who helped me complete this project, starting with everyone who provided feedback, suggestions, and even help with the MSX2/MSX2+ and Turbo R support! And also special thanks to Matra for the fantastic physical edition of the game (which you can find here)!
Also, a small warning: this post is going to be quite technical, and I'm going to get into a lot of details of assembler programming. So, you are warned! :)
Ok, so, let's get right on to raycasting! Let me start by briefly explaining how raycasting works!
Raycasting Basics
The key idea behind raycasting is the following. Imagine that we stand in the middle of our living room and hold an empty picture frame in front of us (at arm's length). Imagine now that we put a grid on top of the picture frame, so that the empty imaginary canvas is divided into a rectangular grid (each of the cells in this grid is one of the pixels on the screen). Now, if we want to know which color to paint each of the cells of this grid, we just need to "cast a ray" from the coordinates of our eye (assume we have a single eye, for simplicity) in the direction of each cell. The first object in our living room that such a ray collides with after passing through the picture frame determines the color of that cell. So, to render a game using raycasting, we basically have to "cast a ray" for each pixel on the screen, calculate which object in the game gets "hit" by the ray, and paint the corresponding pixel with the color of that object.
Since that might be hard to visualize, let me show it with a small diagram (perspective might be totally off, since I drew it by hand in OmniGraffle, hahaha):
Of course, if we wanted to do all of those calculations for a complex 3D scene and expected the game to run at a decent frame rate, we would need a much more powerful computer than an MSX1. So, we will have to make a lot of simplifying assumptions. Let me explain my simplifications in more detail (they are very similar to those used in Wolfenstein 3D, although not exactly the same).
Representing the Level Map
Simplification 1: the first simplification is how to represent the map. Rather than having 3D geometry, the map is represented as a two-dimensional grid. Specifically, this is how I represent the very first level of ToP:
map:
db 2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1
db 2,2,2,2,0,0,2,0,0,0,0,0,0,0,0,1
db 2,2,2,2,0,0,2,1,1,0,1,1,1,1,0,1
db 6,0,0,0,0,0,5,1,1,0,0,1,0,1,0,1
db 2,2,2,2,0,0,2,1,1,1,0,1,0,0,0,1
db 1,1,2,2,0,0,2,1,1,1,0,1,1,1,1,1
db 1,1,1,2,2,3,2,1,1,1,0,1,1,1,0,1
db 1,0,1,1,1,0,1,1,0,0,0,0,0,6,0,1
db 1,0,1,1,1,0,1,1,1,1,0,1,1,1,1,1
db 1,8,1,1,1,0,0,0,0,0,0,0,0,0,0,1
db 1,0,1,1,1,1,0,1,1,1,1,1,1,1,1,1
db 1,0,0,0,0,0,0,1,1,1,1,2,2,2,2,2
db 1,1,1,0,1,1,1,1,0,0,0,0,0,2,2,2
db 1,1,1,0,1,0,0,0,0,0,0,0,0,3,0,5
db 1,1,1,0,0,0,1,1,0,0,0,0,0,2,0,2
db 1,1,1,1,1,1,1,1,1,1,1,2,2,2,4,2
As you can see, it is a 16x16 grid of cells, where each cell is stored as 1 byte (all the levels of ToP are exactly the same size: 16x16 = 256 bytes). This is done so that I can represent map coordinates at cell resolution in a single byte, with the 4 most significant bits being the Z coordinate, and the 4 least significant bits being the X coordinate. (Note: I am using Y to represent "up", so the map grid is in X, Z coordinates.)
The number in each cell represents what is there in that position of the map: 0 means corridor, 1 is a red wall, 2 is a blue wall, 3 is a door, 4 is a staircase, 5 is a stone face, 6 is a gate, etc.
ToP uses coordinates at two levels of resolution:
- Cell coordinates: (0, 0) is the top-left of the map and (15, 15) the bottom-right; these represent positions on the map at increments of one "cell". Walls are placed at this resolution.
- Pixel coordinates: each "cell" is 16x16 pixels in size, so maps are 256x256 "pixels" in size. The coordinates of the player, enemies, items, bullets, etc. are at this resolution.
So, to get the cell coordinates of an enemy at position (x, z), we just need to do (x/16, z/16).
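Since all of this is just shifts and masks, here is a minimal C-style sketch of the coordinate handling described above (the constant and function names are mine, for illustration; the actual game does all of this in Z80 assembler):

#include <stdint.h>

/* Cell contents, as listed above (the constant names are mine). */
enum { CORRIDOR = 0, RED_WALL = 1, BLUE_WALL = 2, DOOR = 3,
       STAIRCASE = 4, STONE_FACE = 5, GATE = 6 };

/* Pack a cell coordinate (both components in 0..15) into a single byte:
   the 4 most significant bits are Z, the 4 least significant bits are X. */
uint8_t pack_cell(uint8_t cell_x, uint8_t cell_z) {
    return (uint8_t)((cell_z << 4) | (cell_x & 0x0F));
}

/* "Pixel coordinates" (0..255, since each cell is 16x16 pixels) to
   "cell coordinates" (0..15): just a division by 16, i.e. a 4-bit shift. */
uint8_t pixel_to_cell(uint8_t pixel_coord) {
    return (uint8_t)(pixel_coord >> 4);
}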
Also, as you can see, walls in ToP always occupy the whole height of a level, from floor to ceiling. The engine does not support slopes, half-height walls, etc. That is the first and most important simplification (half height walls are actually trivial to add, but I ended up removing them, since they didn't look great, and didn't add anything to the game!).
To keep this post an acceptable length, I will only focus on rendering the background (walls, ceiling and floor) of a level for now, ignoring the items, enemies, etc.
Casting Rays
Ok, so, the very first thing we need to do is to be able to cast the ray corresponding to a given pixel. Let us make the following assumptions:
- We have a camera in the game world at coordinates (cx, cy, cz).
- The camera is looking in a direction (fx, fy, fz). (the forward vector)
- Now, we assume that the "viewport" would be placed in world coordinates at a fixed distance D in front of the camera. So, the coordinates of the pixel at the very center of the viewport are (cx, cy, cz) + D*(fx, fy, fz). Also, we know that (fx, fy, fz) is the normal vector to the viewport (which is now completely defined).
Now, what we need to do is to calculate the coordinates of each of the pixels in the "viewport" in game-world coordinates, and with that we can define the ray that starts in the camera and goes through each one of them. But this can start to get complicated already (not math-wise, but computation-wise, since the MSX will struggle with trigonometry, multiplications, etc.). So, let's start making simplification assumptions:
Simplification 2: the camera will ALWAYS be at half-distance between the floor and the ceiling. So cy = "height of the walls"/2. Also, this means that all the pixels in the top-half of the screen correspond to rays that "go up" (and will eventually collide with the ceiling), and all the pixels in the bottom-half of the screen "go down" (and will eventually collide with the floor).
Simplification 3: all the pixels in the very center of the screen, for which the rays travel very far forward without colliding with the ceiling/floor, will be ignored! So, there is a horizontal band of pixels in the middle of the screen that is treated specially.
Simplification 4: the camera can never look up or down, only forward. This means that fy = 0.
With those simplifications, the process is already much simpler than the general case of raycasting! So, considering just the top half of the screen, to determine the color of a pixel P, what we need to do is:
- Calculate the coordinates of pixel P in game-world coordinates: px, py, pz
- Calculate the game-world coordinates at which the ray that passes through px, py, pz collides with the ceiling (i.e., when y = "height of the walls"): (lx, ly, lz)
- See what object is in the map at cell coordinates (lx/16, lz/16), and use the color of that object to draw that pixel.
And basically that's it!.... except for a small problem with Step 3... Notice that step 3 assumes that the position where the ray collides with the ceiling contains the object we want to draw, but that is not always the case, for example, look at the following situations (side view):
- ray 1 will collide at a position where there is no wall (so, it's a corridor, identified with a "0" in the example map I showed above), and thus will render with "corridor" color
- ray 2 will collide at a position where there is a blue wall (number "2" in the example map), and thus will render blue color.
- But ray 3, which should also be a wall, collides with the ceiling at a position where there is no wall, so, it will be rendered the "corridor" color, which is wrong.
There are several ways to solve this problem, but the one I chose for ToP was this:
- For each vertical column of pixels in the screen (remember, we are only considering the top half of the screen for now), start with the top pixel:
- If the current ray collided with a corridor, draw the corridor color, move on to the next pixel, and repeat this step (unless we are done with this column, in which case we terminate).
- If the current ray collided with a wall, then draw wall color for this pixel and for all the remaining pixels of this vertical column, then terminate.
In this way, as soon as "ray 2" collided with a position where there was a wall, "ray 3" would not even have been considered, since all the remaining pixels would be drawn as wall.
Now, as a consequence of Simplifications 2 and 4, once we have rendered the top half of the screen (and if we ignore the fact that there are textures), the bottom half of the screen is just a mirror image of the top, and we don't even need to cast rays! So, if we want to do raycasting without textures, we would be done at this point.
I will explain below how to calculate the "collision points" in an efficient way, but for now, let's move on... Suffice it to say that we want the collision coordinates to be at the "pixel coordinates" resolution level (so that we can properly render textures).
Rendering Textures
Rendering a texture basically means calculating which of the pixels of the texture needs to be copied to each pixel on the screen. In ToP, in order to make this calculation fast enough, I used textures that are 16x16 pixels in size. So, for example, this texture, representing one of those stone faces that talk to you in the game, is stored like this:
facetexture:
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
db 0,0,0,0,#f1,#f1,#f1,#f1,1,1,1,1,0,0,0,0
db 0,0,0,#f1,#f1,1,1,1,1,1,1,1,1,0,0,0
db 0,0,#f1,1,1,1,1,1,1,1,1,1,1,1,0,0
db 0,0,#f1,1,1,1,1,1,1,1,1,1,1,1,0,0
db 0,0,#f1,0,0,0,0,1,1,0,0,0,0,1,0,0
db 0,0,1,1,0,#81,0,1,1,0,#81,0,1,1,0,0
db 0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0
db 0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0
db 0,0,0,1,1,1,0,1,1,0,1,1,1,0,0,0
db 0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0
db 0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0
db 0,0,0,1,0,1,1,1,1,1,1,0,1,0,0,0
db 0,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0
db 0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
The pixels that have their 4 least significant bits set to 0 are painted "black", those that have value "1" are painted in the "default texture color" (in this case grey), and those that have their 4 most significant bits different from 0 have their own color. I added multicolored textures later in the project, so this representation was just a hack, since I was running out of space in the ROM. But if I were to re-implement this again, I would probably just store the color directly (without any "default color" thingie).
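To make that encoding concrete, here is a small C-style sketch of how one of those texture bytes could be decoded into a pixel color. This is just my reading of the description above (in particular, I am assuming the "own color" is the high nibble itself); the real code works directly on the nibbles in assembler:

#include <stdint.h>

#define COLOR_BLACK 0   /* hypothetical constant: pixel is not drawn */

/* Decode one texture byte into the color used to paint that pixel. */
uint8_t texture_pixel_color(uint8_t t, uint8_t default_color) {
    if ((t & 0x0F) == 0) return COLOR_BLACK;   /* low nibble 0: black pixel */
    if ((t & 0xF0) != 0) return (uint8_t)(t >> 4); /* high nibble set: own color */
    return default_color;                      /* value 1: default texture color */
}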
So, when an object (wall, ceiling, floor) needs to be rendered, what needs to be done is to calculate the "offset" in the texture that needs to be rendered, i.e., the coordinates (tx, ty) of the pixel of the texture that needs to be chosen.
Ceiling/Floor: Given that we have the collision point (lx, ly, lz) of each ray in "pixel coordinates", this is as simple as: tx = lx%16, and ty = lz%16.
Walls: Walls are more complicated, and "tx" and "ty" are each calculated via different methods:
- tx: the "cell coordinates" of the ray collision point (lx, ly, lz) are (lx/16, lz/16), and we know that there is a wall in that position (otherwise, we would be rendering a ceiling/floor texture). But what we need to know is whether the ray entered in (lx/16, lz/16) from the north, east, south or west, and what is the "pixel coordinate" at which the ray exactly entered this cell. That entrance point is exactly what will determine "lx". So, if the entrance point is (ex, ez) in pixel coordinates, and the ray entered through the north or south, then tx = ex. If the ray entered through east/west, then tx = ez. To resolve this, ToP does this: we know that the previous screen pixel we rendered was showing either ceiling or floor (since if it were showing wall, this ray would have never been casted!). Let us call (lx0, ly0, lz0) to the ray collision point of the previous pixel. So, what we do is we use a Bresenham algorithm to move, pixel by pixel from (lx0, lz0) to (lx, lz) until we cross into the cell where the wall is, which is exactly the point where the ray entered the wall. In case that this is the very first pixel in the current column, then (lx0, lz0) = (cx, cz).
- ty: for this, what ToP does is the following: let "n" be the screen y coordinate of the first pixel in the current column that is a wall (all the previous pixels were ceiling). If the vertical screen resolution is "m", then we know that the wall will be "m - (n*2)" pixels tall when rendered. So, ty is determined by starting with ty = 0 at the first pixel in this column that is a wall, and then advancing in increments of "16 / (m - (n*2))" pixels at each consecutive pixel in the column.
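Here is a rough C-style sketch of that Bresenham-style "cross over" search (a simplified version of what the crossOverPoint call in the pseudocode further below does; the names and the exact stepping are mine, and the real thing is a hand-tuned Z80 routine). It assumes (lx, lz) really is in a different cell than (lx0, lz0), which is guaranteed here since the previous pixel was corridor and the current one is a wall:

#include <stdlib.h>   /* abs() */

enum { SIDE_NORTH_SOUTH, SIDE_EAST_WEST };

/* Walk pixel by pixel from the previous collision point (lx0, lz0) towards the
   current one (lx, lz) until we step into a different map cell; that step is
   the point where the ray entered the wall cell, and which coordinate changed
   cell tells us whether we came in through north/south or east/west. */
void crossOverPoint(int lx0, int lz0, int lx, int lz,
                    int *ex, int *ez, int *side) {
    int dx = abs(lx - lx0), dz = abs(lz - lz0);
    int sx = (lx < lx0) ? -1 : 1;
    int sz = (lz < lz0) ? -1 : 1;
    int err = dx - dz;
    int x = lx0, z = lz0;
    int cell_x = lx0 / 16, cell_z = lz0 / 16;

    while (x / 16 == cell_x && z / 16 == cell_z) {
        int e2 = 2 * err;
        if (e2 > -dz) { err -= dz; x += sx; }
        if (e2 <  dx) { err += dx; z += sz; }
    }
    *ex = x;
    *ez = z;
    /* If the Z cell changed, we crossed a north/south edge (so tx = ex % 16);
       otherwise we crossed an east/west edge (so tx = ez % 16). */
    *side = (z / 16 != cell_z) ? SIDE_NORTH_SOUTH : SIDE_EAST_WEST;
}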
Comparing this to Wolfenstein 3D: one detail is that Wolfenstein 3D did NOT render floor/ceiling textures, but ToP does. So, in a sense ToP's engine is a little bit more advanced than Wolfenstein 3D's :) (in another sense, it's more limited, of course, such as in rendering partially occluded sprites, for which ToP uses hardware sprites, which cannot easily be partially occluded)
Also, I noticed that this way of rendering textures would allow animations for free, since I just need to associate a different texture to a given object in different frames, and it'll look like an animation is playing. So, I added one wall type that has a small animation. I wanted to add more animations, but I ran out of ROM space... :( (notice that animations do not slow down the game at all, every wall could be displaying an animation at the same time, and it would not affect rendering time).
So, in summary, the rendering algorithm is as follows:
for (x = 0; x < screen_width; x++) {
    (lx0, lz0) = (cx, cz)
    for (y = 0; y < screen_height; y++) {
        (lx, ly, lz) = collisionCoordinatesOfRayForPixel(x, y);
        object = map[lx/16][lz/16];
        if (object == CORRIDOR) {
            putPixel(x, y, corridorTexture[lx%16][lz%16]);
            (lx0, lz0) = (lx, lz)
        } else {
            (rx, rz, direction) = crossOverPoint(lx0, lz0, lx, lz);
            if (direction == NORTH || direction == SOUTH) tx = rx%16;
            else tx = rz%16;
            ty = 0;
            for (y2 = y; y2 < screen_height; y2++) {
                putPixel(x, y2, objectTexture[object][tx][ty]);
                ty += 16 / (screen_height - y*2);
            }
            break; // we are done with this column
        }
    }
}
Doesn't look that complicated, right? Well, it's actually not! The complex part is making it run fast. So, let's get on to it! How do we implement this on an MSX?
Organizing the VDP Memory for Raycasting
As you can see in the pseudo code above, pixels are rendered column by column. So, it is important that pixels that are one on top of another are contiguous in memory, so that if we have the pointer of the current pixel in say, HL, we can just do "inc hl", to move to the next pixel. So, since I used SCREEN 2 for the game, I organized the name table like this:
Here you can see that I use the top two banks for rendering the game, and the bottom one for the scoreboard. So, the pattern and color tables of banks 1 and 2 are set to all 0s at the beginning of the game, and the regular patterns (the text font, etc.) are only kept in bank 3.
Since I don't use the whole screen to render the game, I left some black borders around (marked with pattern "#ff"). Then you see that the first pattern that is used for rendering is #00, then immediately below that is #01, etc.
Simplification 5: I reduced the resolution of the game, so each rendered pixel is actually a 2x2 block of real pixels. This divides the number of rays I need to cast by 4, making the game run 4 times faster. I will call each of these 2x2 blocks a "logical pixel" from now on, to avoid confusion.
So, each 8x8 pattern contains a block of 4x4 logical "pixels". The first four "logical pixel columns" (0, 1, 2, 3) are stored in patterns #00, #01, #02, #03, #04, #05, #06 and #07. The next four columns (4, 5, 6, 7) are stored in patterns #08, #09, #0A, #0B, #0C, #0D, #0E and #0F, etc.
I keep an offscreen buffer in RAM that contains both the pattern and the color table, which is what I render to. Once a complete game frame has been rendered to this buffer, I copy its contents over to the VDP (first the pattern table, and then the color table). Since there is a lot of data to copy, you might have noticed that the colors in the game sometimes flicker a bit (that's because it takes more than one screen refresh cycle to copy all the information over to the VDP!). I considered interlacing the copying, but even though it made it look better, it also made the game a bit slower and used more space in the ROM, so I decided against it.
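In C-like terms, the "present" step boils down to something like the sketch below (the helper functions and buffer names are hypothetical; on the real machine this is an OUT/OTIR loop streaming bytes to the VDP ports):

#include <stdint.h>
#include <stddef.h>

extern uint8_t ram_pattern_buffer[];
extern uint8_t ram_color_buffer[];

/* Hypothetical helpers standing in for the Z80 loops that stream bytes to VRAM. */
void vdp_set_write_address(uint16_t vram_address);
void vdp_write_block(const uint8_t *src, size_t length);

/* Copy the offscreen RAM buffers to video memory: first patterns, then colors.
   Because the whole copy takes more than one screen refresh, patterns and
   colors can briefly be out of sync, which is the flicker mentioned above. */
void present_frame(size_t render_buffer_size,
                   uint16_t pattern_table_addr, uint16_t color_table_addr) {
    vdp_set_write_address(pattern_table_addr);
    vdp_write_block(ram_pattern_buffer, render_buffer_size);
    vdp_set_write_address(color_table_addr);
    vdp_write_block(ram_color_buffer, render_buffer_size);
}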
In order to render, the process is thus like this:
- Clear the RAM buffer: set the part corresponding to the pattern buffer to all 0s, and the part corresponding to the color buffer to the color of the ceiling or floor (that way the color is already set, and I only need to change it when I render a wall). The background color of all the patterns is always kept black, and I only change the foreground color.
- For each column of pixels in the screen:
- Update the pattern buffer: for this, at the beginning of each column, I calculate a "mask", which is #C0 for the first column, #30 for the second, #0C for the third, and #03 for the fourth (after that, the cycle repeats). To draw a pixel I OR this mask with the current pattern byte, and if I don't need to draw a pixel (when the texture has value 0), then I just skip it (there is a small C sketch of this after this list).
- Update the color buffer: since the color can only be changed for blocks of 8x1 pixels, each block of "4x1 logical pixels" (remember that the "logical pixels" in ToP are blocks of 2x2 pixels) has the same color. So, there can be some color clash. What I do is simple: each time a pixel is drawn (when I OR the mask into the pattern buffer), I set the color in the color buffer to the pixel color (only when rendering walls; when rendering floor/ceiling, the color buffer does not even need to be touched).
- Once all the columns have been rendered, copy the RAM buffer over to the VDP.
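Here is what that per-pixel write looks like in C-like terms. The function and its pointer parameters are hypothetical (in the real code all of this lives in Z80 registers, as discussed later); it is only meant to illustrate the mask trick and the color write described above:

#include <stdint.h>

/* The four masks mentioned above: one 8x8 pattern holds four "logical pixel"
   columns, each two real pixels wide. */
static const uint8_t column_masks[4] = { 0xC0, 0x30, 0x0C, 0x03 };

/* OR one logical pixel into the RAM pattern buffer. A logical pixel covers two
   real pixel rows, hence two consecutive bytes. For walls we also set the
   foreground color of that 8x1 block in the RAM color buffer (the background
   stays black), which is where the color clash mentioned above comes from. */
void plot_logical_pixel(uint8_t *pattern, uint8_t *color,
                        int logical_column, uint8_t wall_color, int is_wall) {
    uint8_t mask = column_masks[logical_column & 3];
    pattern[0] |= mask;
    pattern[1] |= mask;
    if (is_wall) *color = (uint8_t)((*color & 0x0F) | (wall_color << 4));
}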
I considered not clearing the RAM buffers and just overwriting them. But it turns out that the extra "AND" operation required to "erase a pixel", and having to write colors every time, was slower. So, clearing the RAM buffer was the fastest solution.
The Game Loop
Rendering a whole game frame takes a very long time (I can only render between 2 and 3 frames per second, depending on the viewport size selected). So, characters and enemies would move too slowly for the game to be playable. The solution I used was to divide the game loop into "sub-frames". Each "game frame update" is actually divided into four "sub-frame updates", like this:
- Subframe 1: Update player character/enemies/bullets/etc. Clear RAM buffer, and render 1/4th of the columns.
- Subframe 2: Update player character/enemies/bullets/etc. Render 1/4th of the columns.
- Subframe 3: Update player character/enemies/bullets/etc. Render 1/4th of the columns.
- Subframe 4: Update player character/enemies/bullets/etc. Render 1/4th of the columns, copy the RAM buffer to the VDP.
In this way, the game executes 4 updates for each complete frame rendered using raycasting, which feels more fluid. You can see the effect of this in that enemies/bullets move smoothly, even if the environment is rendered much more slowly.
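In pseudo-C (in the same spirit as the pseudocode earlier, with made-up function names that just stand for the steps listed above), one full "game frame" therefore looks like this:

#define RENDER_COLUMNS 80   /* default viewport width in logical pixels */

void game_frame(void) {
    for (int subframe = 0; subframe < 4; subframe++) {
        update_player_enemies_bullets_etc();      /* runs 4 times per rendered frame */
        if (subframe == 0) clear_ram_buffers();   /* pattern + color offscreen buffers */
        render_columns(subframe * (RENDER_COLUMNS / 4),
                       (subframe + 1) * (RENDER_COLUMNS / 4));
        if (subframe == 3) copy_ram_buffers_to_vdp();
    }
}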
Note: as pointed out by NYYRIKKI in the msx.org forums, a better strategy would have been to run the game update on the screen refresh interrupt, and let the renderer run loose in the background. In that way, I would not need any speed adjustments when running on faster machines (turbo R). So: noted for the future, in case I reimplement this again! :)
Precalculated Tables
The computations required for raycasting require many multiplications, divisions and calls to trigonometry functions, all of which are very slow on MSX. So, in order to make the renderer render more than 1 frame every 10 seconds, I used a bunch of pre-calculated tables. Specifically, I used these tables:
- Raycasting collisions: this is all precalculated. Notice that, since I do not allow the camera to rotate up and down, the rays for each of the pixels on the screen will always be exactly the same ones. So, what is the point of calculating this every time?! What I have is two precalculated tables (ray_x_offs_table, ray_y_offs_table) of size 512 x 32, which contain this:
- ray_x_offs_table[a][b]: the distance along the "x axis" at which a ray cast at an angle "a" around the y axis and "b" around the x axis will collide with the ceiling. "a" is precalculated at a resolution such that 512 steps is a full revolution. When rendering adjacent columns, we just advance one step in this dimension. So, since the default rendering viewport is 80 logical pixels wide, that corresponds to a viewing angle of about 56 degrees (which is close to the usual 60 degrees). The vertical resolution is 32 steps, since the viewport is 64 logical pixels tall, which means that the top half is 32 pixels tall. So, "b" is basically indexed by the pixel row we are currently rendering.
- ray_y_offs_table[a][b]: the same, but along the "z axis".
The camera angle is kept at a resolution where 128 steps is one full revolution. So, when I want to cast a ray for logical pixel (i, j), I check ray_x_offs_table[angle*4 + i][j], and to the result of that I just have to add the camera coordinates, which gives me the ray collision coordinates directly. So, no need to do any complex math during raycasting! (There is a small sketch of this lookup right after this list.)
- Texture offset increments: if you check the algorithm above, one step says "ty += 16 / (screen_height - y*2);", which is problematic since it involves a division. I have a precalculated table called "texture_vertical_rate_table" that stores the value of "16 / (screen_height - y*2)" indexed by "y". So, one more division saved!
- Sin/Cos tables: for moving the player / bullets / enemies around, I need to move them depending on the angle they are facing. So, I need to calculate sines and cosines. This is all precalculated in four tables: sin_table, cos_table, sin_table_x12, cos_table_x12. The last two exist because some things move faster, so it's convenient to have the multiplication already built into the precalculated values.
- Distances: this has to do with rendering enemies/bullets/items. When I need to render an enemy that I know is at a certain distance, I need to know where on the screen to render it. That would also have involved divisions. So, I have a precalculated table that tells me, for an object at distance (dx, dz) (along the X and Z axes respectively), at which y coordinate I need to render it on the screen.
- atan2: this is a routine I got from a post that ARTRAG did in the msx.org forums, that also has some precalculated tables (see original post here). I use it to calculate the X coordinate where to render enemies / items in the screen.
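Putting the first of those tables to work, the collisionCoordinatesOfRayForPixel() call from the pseudocode above reduces to two table lookups and two additions. The C below is just my reading of the description (the entry width and the wrap-around masking are assumptions):

#define RAY_ANGLE_STEPS 512   /* 512 steps = one full revolution in the ray tables */
#define RAY_ROWS 32           /* one entry per logical pixel row of the top half  */

extern int ray_x_offs_table[RAY_ANGLE_STEPS][RAY_ROWS];  /* X offset to the ceiling hit */
extern int ray_y_offs_table[RAY_ANGLE_STEPS][RAY_ROWS];  /* same, but along the Z axis  */

/* camera_angle uses 128 steps per revolution, hence the *4 to index the tables.
   (ly is omitted: for the top half it is always the ceiling height.) */
void collisionCoordinatesOfRayForPixel(int i, int j, int camera_angle,
                                       int cx, int cz, int *lx, int *lz) {
    int a = (camera_angle * 4 + i) & (RAY_ANGLE_STEPS - 1);
    *lx = cx + ray_x_offs_table[a][j];
    *lz = cz + ray_y_offs_table[a][j];
}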
You would think that after all of these simplifications, this would already run fast on an MSX, right?! Well, think again!!! After I implemented all of the above, it took 5439255 cycles on average to render a single frame... if we recall that the Z80 in the MSX executes 3579545 cycles per second, that comes to about 0.66 frames per second... it was unplayable...
At this point I had spent quite some time working on the engine and I was about to just give up. "The MSX is too slow for this" I thought. But I started looking at videos of other raycasting games for the MSX and Spectrum. Other people had managed to do it! so, there had to be a way! The next few weeks were 100% dedicated to optimization of the whole thing!
Speed Optimization Techniques
There are lots of websites with a very large number of Z80 optimization tricks. I read all I could find, and every single little trick that I could apply to my code, I applied! The one I'm sad I did not find a place to apply is this one (which blew my mind!): http://www.maxcoderz.org/forum/viewtopic.php?f=5&t=2579. So, I'm not going to go into detail about every single little thing, but just focus on the techniques that really made a big difference.
Moving if-then-else statements outside of inner loops: the first optimization (that saved about 400k or 500k cycles if I recall correctly) was the realization that every time I had a piece of code that looked like this:
while (some condition) {
    if (condition that does not change inside of the loop) {
        do X
    } else {
        do Y
    }
}
I should replace it with:
if (condition that does not change inside of the loop) {
    while (some condition) {
        do X
    }
} else {
    while (some condition) {
        do Y
    }
}
This increased the size of the code significantly, but made it much faster. If you look at the raycasting code, you will see that I did this, for example, for handling the four quadrants. The code for managing raycasting in each of the four quadrants (from 0 to 90 degrees, 90 to 180, etc.) around the Y axis is identical except for a couple of places where I need to either add or subtract. By taking the "if we are in quadrant 1" conditions outside, I had to replicate the loop 4 times, but each loop ran much faster.
Aligned Tables: imagine that you have a table at a memory position X and you need to index it by the contents of the accumulator register A. What you would have to do is something like this:
ld hl,X
ld b,0
ld c,a
add hl,bc
ld a,(hl)
However, if you know that X is a multiple of 256, then you know that when you do "ld hl,X", "l" will be equal to 0. So, you can actually just do:
ld h,X/256
ld l,a
ld a,(hl)
Voilà! much faster! This is a very common trick, and not anything I invented, but it saved another few hundred K cycles as well! I tried to use aligned tables for everything I could: textures, maps, etc.
Maximizing register use: but by far the thing that saved the most CPU cycles was removing as many memory accesses as I could. The Z80 might only have a few registers: AF, BC, DE, HL, IX and IY. But a few of those (BC, DE, HL) are actually replicated! Using the assembler instruction EXX, we swap BC, DE, HL with the "ghost" copies of those registers. So, in fact we have AF, BC1, BC2, DE1, DE2, HL1, HL2, IX and IY, which is a TON of registers! Also, the IX and IY registers can be used as pairs of 8-bit registers (IXL, IXH, IYL, IYH); they are slower than the others, but still faster than RAM!
One of the most time consuming optimization processes I undertook was to track the usage of ALL registers (including the ghost BC, DE, HL) throughout the whole raycasting code. So, by making extensive use of EXX, I managed to remove all usages of variables in RAM in all inner loops of the raycasting code. For example, when rendering a texture, I keep the pointers to the texture, etc. in the regular BC, DE, HL, and the pointers to the raycasting pattern and color RAM buffer in the ghost copies. Sounds simple, but it was a pain to track which registers were being used at all times... I think all of this saved at least 1 million cycles or so.
Top Half of the Screen is Analogous to the Bottom: when rendering the bottom half of a column of the screen, there is no point in repeating calculations, since it is almost a mirror of the top half. When rendering the ceiling texture on the top half, I store all the texture data in a 32-byte buffer (which, of course, is 256-aligned), and when rendering the bottom half, I just retrieve it from that buffer (raycast_floor_texture_buffer).
Floor/Ceiling Texture Optimizations: in the first version, I allowed arbitrary textures for the ceiling/floor. But it turns out that rendering the ceiling/floor was the most time-consuming part of the whole renderer! (no wonder they did not include it in Wolfenstein 3D!!!). So, what I ended up doing was removing the floor/ceiling textures and replacing them with hard-coded code that, given the (tx, ty) offset coordinates, calculates whether the pixel should be painted or not. For example, the floor texture is generated by painting those pixels for which (tx+ty) & 0x80 != 0
Thanks to this optimization, the code that renders the floor is as simple as this!
raycast_render_next_pixel_floor:
;; retrieve the texture stored during ceiling rendering (bc points to "raycast_floor_texture_buffer"):
ld a,(bc)
or a
jp z,raycast_render_done_with_floor_pixel
;; texture is 1, render pixel:
ld a,(hl) ;; hl points to the raycasting pattern RAM buffer
or d ;; d has the "mask" corresponding to this column
ld (hl),a ;; render pixel
inc hl
ld (hl),a ;; we render it twice, since logical pixels are 2x2
inc hl
dec c ;; c is the number of pixels left to render in this column
jp nz,raycast_render_next_pixel_floor
jp raycast_render_done_with_column
raycast_render_done_with_floor_pixel:
inc hl
inc hl
dec c
jp nz,raycast_render_next_pixel_floor
This saved another very significant number of CPU cycles!
Saving Space in the ROM: finally, I was running out of memory, so, in addition to compressing everything with Pletter, etc., I had to reduce the space taken by all the precalculated tables. For that reason, if you look at the code, you'll see that all the sine / cosine tables are overlapped (since sine and cosine are the same function, just shifted with respect to each other). Also, I don't really have a ray_y_offs_table, since it's exactly the same table as ray_x_offs_table but reversed. So, I only store one of them and either add (for X) or subtract (for Y) to index it.
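As an illustration of the overlap trick: with angles stored as "128 steps per revolution", cos(a) is just sin(a + 32 steps), so a single sine table extended by a quarter turn can serve both lookups. The sketch below is my own illustration of the idea (the table size, entry type and names are assumptions, not the actual ROM layout):

#include <stdint.h>

#define ANGLE_STEPS 128                 /* 128 steps = one full revolution */
#define QUARTER_TURN (ANGLE_STEPS / 4)  /* 90 degrees = 32 steps */

/* One table holding sine values for angles 0..159 steps; the cosine "table"
   is just the same data starting 32 entries in, because cos(a) = sin(a + 90 degrees). */
extern int8_t sin_table[ANGLE_STEPS + QUARTER_TURN];

int8_t game_sin(uint8_t angle) { return sin_table[angle & (ANGLE_STEPS - 1)]; }
int8_t game_cos(uint8_t angle) { return sin_table[(angle & (ANGLE_STEPS - 1)) + QUARTER_TURN]; }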
After all of these optimizations, at the default viewport size, the game takes 1550184 cycles on average to render a whole frame, less than 1/3 of the original time, which comes to about 2.3 frames per second! If you reduce the viewport to the smallest size, then it takes about 1053651 cycles on average, getting about 3.5 frames per second, which, compared to the original 0.66, is ridiculously smooth!!!
Summary
So, those are the basics. There are a bunch of things that I have not explained, like how I render the enemies / items, or how occlusion testing works (when an enemy is behind a wall, it should not be rendered, and occlusion testing is expensive CPU-wise), but this entry is already getting too long. I also implemented some fancy sprite pattern loading code that dynamically loads and unloads sprites onto the VDP based on what is being rendered, which I thought was very cool. But I think I'll leave it here, since the basics are covered!
One of the main obstacles to making it faster is basically the VDP. Not having direct access to change bytes of the video memory is what forces me to have the RAM buffer, and then a copy, etc. I think that the same type of game could perhaps run a bit faster on computers such as the Spectrum or the Amstrad CPC, where the video memory is memory-mapped. There would be some drawbacks, of course, such as color clash on the Spectrum, which might force me to forget about coloring textures at a pixel level and use texture-wide colors instead. Also, on the Spectrum or Amstrad I would not be able to use hardware sprites, so rendering enemies/items would be harder, since it could not be done independently of the background. Another obstacle is memory. I think that with a bit more RAM I could have added more precalculated tables, but this was not an option on the MSX1...
I also think that things could be optimized a bit more on an MSX2, since some of the additional screen modes supported by the MSX2 would not require ORing a mask to render pixels (right now I need to READ a pixel, OR the mask, and then WRITE it, requiring two memory accesses per pixel rendered! which is not ideal...).
Ok, so that's it. That's how the raycasting side of ToP works! If you made it all the way here, congrats!!! This was a long one!! hahaha