September 2023: Learning Assembler as a Joke
Spoiler, the joke wasn't funny for very long...
Clearly, most of my ideas are not great and are founded in frankly impractical reflexes. I've been aware of Assembler as a very basic programming language (?) for a while, but there was really no scenario in which I was going to learn it without an explicit need for it, and there are probably very few of those out there either way. Assembler is pretty deep in the weeds as far as programming languages go, far deeper than I'm used to thinking about. Technically, "getting" Assembler should also improve my C code, and might help me break fewer things whenever I force my operating system to do something some engineer somewhere tried very hard to prevent people from doing accidentally. Most Assembler tutorials out there start with a few theoretical computer science bits, which I won't recount, because I'll assume number systems are familiar to most people, and this isn't meant to be, strictly speaking, educational anyway. Instead, I'll jump straight into the construction of an assembly program.
An assembly program has three sections: data, bss and text. The first is for declaring initialized data and constants, the second for declaring variables, and the third is where the actual code happens. The first thing one might notice is how long a "hello world" program is in Assembler. Because we work on the level of memory, all of those assignments that would otherwise be implicit need to be spelled out by the user.
A central instruction for this is MOV (technically not case sensitive, but it's more legible if Assembler instructions are written in CAPS). It copies a value into a register or a memory location. Registers, small storage slots inside the processor itself, were a new, odd thing to me that I never had to think about before. They fall into three categories: General, Control and Segment. The general registers are further divided into Data, Pointer and Index registers. Data registers come up in the hello world program in a way similar to how one would use variables. Unfortunately, this doesn't really become clear from looking at an Assembler program, because they are named according to a scheme that doesn't show up a lot in modern programming languages.
The four 32-bit data registers are called EAX, EBX, ECX and EDX. True to the binary system, the lower 16 bits of each can be addressed on their own by omitting the leading "E", and the resulting 16-bit register can be split into a lower and a higher half by replacing the trailing "X" with an "L" or "H" respectively. Each of these registers has a conventional role as well, something like the stdout pointer in C. AX and DX, for example, are used for input/output operations; BX is the base register, used for indexed addressing, comparable to pointers to memory; and CX keeps track of counts in iterative loops. Keeping this in mind, let's go look at the hello world program.
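    section .data
        msg DB "Hello World!", 0xa  ; the text plus a newline character
        len EQU $ - msg             ; length of msg in bytes

    section .text
        global _start               ; make the entry point visible to the linker

    _start:
        MOV EDX, len                ; number of bytes to write
        MOV ECX, msg                ; address of the message
        MOV EBX, 1                  ; file descriptor 1 (stdout)
        MOV EAX, 4                  ; system call number 4 (sys_write)
        INT 0x80                    ; hand over to the kernel

        MOV EBX, 0                  ; exit status 0
        MOV EAX, 1                  ; system call number 1 (sys_exit)
        INT 0x80                    ; hand over to the kernel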
Note that all the data registers are loaded under the _start label in the text section. Let's start with the easiest one, the system call number in EAX. It tells the kernel which service is being requested the next time INT is invoked, which is what makes the kernel take the next step. In this case, it suffices to call the kernel, and that is done by passing 0x80 to INT. 4 is sys_write, and 1 is sys_exit. The remaining registers carry the arguments of the call, in the order EBX, ECX, EDX. stdout, represented by 1, is assigned to EBX, so the program knows where to do the sys_write to. The message goes into ECX for a reason C programmers should be intimately familiar with, namely that a string is basically an array, so what gets passed around is a pointer to its first byte. EDX needs the length, because that's just how it works in lower-level programming languages: if we print something, the computer will want to know how much it needs to print before it starts going.
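To make the register naming from earlier a bit more tangible, here is a minimal sketch of how EAX, AX, AH and AL overlap. It assumes NASM on 32-bit x86 Linux, like the hello world program. The values are arbitrary, and the result is handed back as the exit status so that it can be inspected from the shell.

```nasm
; Sketch: EAX, AX, AH and AL are views onto the same 32 bits.
section .text
    global _start

_start:
    MOV EAX, 0x11223344     ; fill the whole 32-bit register
                            ; AX is now 0x3344, AH is 0x33, AL is 0x44
    MOV AL, 0x07            ; overwrite only the lowest byte: EAX = 0x11223307
    MOV AH, 0x00            ; overwrite only the second byte: EAX = 0x11220007

    MOVZX EBX, AX           ; exit status = AX, zero-extended (7)
    MOV EAX, 1              ; sys_exit
    INT 0x80
```

Assuming the file is saved as regs.asm (a made-up name), it can be assembled with `nasm -f elf32 regs.asm` and linked with `ld -m elf_i386 regs.o -o regs`. Running `./regs` followed by `echo $?` should then show 7.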
Because Assembler works very, very close to the hardware, not only will it need pointers, it'll need registers to put those pointers in. Those are the Instruction Pointer (IP), Stack Pointer (SP) and Base Pointer (BP). The former two do exactly what they sound like; the last one is what we would understand as a "variable pointer" in C, if you've never learned it properly (like meeeee), since it is used to reach parameters and local variables on the stack. Each is traditionally paired with a segment register, so these pointers don't so much give a direct address in memory as an offset within that segment. Something similar can be found in the Index Registers, where the chief distinction is between the Source Index (SI) and the Destination Index (DI).
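One place where the source and destination index registers show up is in copying a block of bytes. The following is only a sketch under assumptions: it uses the 32-bit forms ESI and EDI on Linux with NASM, and it relies on the CLD and REP MOVSB instructions, which go a step beyond what is covered above. The names src, dst and len are made up for the example.

```nasm
; Sketch: copying bytes with the source and destination index registers.
section .data
    src DB "copied!", 0xa
    len EQU $ - src         ; 8 bytes including the newline

section .bss
    dst RESB len            ; uninitialized room for the copy

section .text
    global _start

_start:
    MOV ESI, src            ; Source Index: where to read from
    MOV EDI, dst            ; Destination Index: where to write to
    MOV ECX, len            ; number of bytes to copy
    CLD                     ; walk upwards through memory
    REP MOVSB               ; copy ECX bytes, advancing ESI and EDI each time

    MOV EDX, len            ; print the copy, not the original
    MOV ECX, dst
    MOV EBX, 1              ; stdout
    MOV EAX, 4              ; sys_write
    INT 0x80

    MOV EBX, 0              ; exit status 0
    MOV EAX, 1              ; sys_exit
    INT 0x80
```

The program prints "copied!" from the dst buffer, which only contains that text because the copy happened.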
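Many instructions leave a note about their outcome in a flag, for example whether a result was zero, negative, or too large to fit, and other flags give a convenient way to communicate general instructions to the processor. These flags, of course, have their own register. This register is 16 bits long (FLAGS), and is extended to 32 bits as EFLAGS on 32-bit processors.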
Variables will have to be dimensioned, of course, which I'm not technically against, but is something I'm not exactly used to. Where most programming languages that do this expect the variable type to be given, Assembler is really more interested in the size. The declaration exists to literally allocate space for the variable. The syntax is <name> <directive> <value>, where the directive is a short, 2-letter instruction starting with a "D" (DB, DW, DD, DQ or DT), informing the assembler to allocate 1, 2, 4, 8 or 10 bytes to the variable, respectively. If the data isn't initialized, then the "D" is replaced by "RES" for "Reserve" (RESB, RESW and so on), and the value gives the number of units to reserve. If the value is a repetition of the same element, then the syntax can be supplemented by TIMES N after the name. Of course, each value should still fit within the size the directive allows, although, on the flip side, values delimited by commas will instantiate an array. For non-strings, the TIMES directive can be used to produce an array as well.
If the variable in question is meant to be constant, then it can be declared with EQU rather than a variable directive. This has the benefit of accepting an expression (such as the result of a calculation) rather than just a value. For numeric constants, the %assign directive can be used to define, and later redefine, the constant. The %define directive works similarly to the way macros do, substituting an expression wherever the constant's name appears.
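The declarations described above can be sketched in one small program. This is an illustration rather than a canonical layout: it assumes NASM on 32-bit Linux, and every name in it (letter, year, primes, stars, buffer, COUNT, SYS_WRITE) is invented for the example. Only the stars are actually printed; the other declarations are there to show the syntax.

```nasm
%define SYS_WRITE 4             ; %define: text substitution, macro-style
%assign COUNT 3                 ; %assign: a numeric constant ...
%assign COUNT COUNT + 6         ; ... that may be redefined (now 9)

section .data
    letter  DB 'A'              ; 1 byte
    year    DW 2023             ; 2 bytes
    big     DD 123456           ; 4 bytes
    primes  DB 2, 3, 5, 7       ; comma-separated values form an array
    stars   TIMES COUNT DB '*'  ; nine '*' bytes in a row
            DB 0xa              ; newline directly after the stars
    starlen EQU $ - stars       ; constant from an expression: 10

section .bss
    buffer  RESB 16             ; 16 reserved, uninitialized bytes

section .text
    global _start

_start:
    MOV EDX, starlen            ; number of bytes to write
    MOV ECX, stars              ; address of the first star
    MOV EBX, 1                  ; stdout
    MOV EAX, SYS_WRITE          ; expands to 4
    INT 0x80

    MOV EBX, 0                  ; exit status 0
    MOV EAX, 1                  ; sys_exit
    INT 0x80
```

Running it prints a single line of nine asterisks.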
Most arithmetic outside of declarations is done using the arithmetic instructions. Besides the usual ADD(-ition) and SUB(-traction) instructions, Assembler includes INC(-rement) and DEC(-rement), which are useful for tracking counters, as in most programming languages. Occasionally, numbers need to be converted, which has to be done explicitly. The digits '0' to '9' sit next to each other in the ASCII table (codes 48 to 57), so subtracting '0' converts an ASCII digit into the number it stands for, while adding '0' converts it back. Multiplication is either unsigned (MUL) or signed (IMUL), a distinction that is necessary because the sign bit is interpreted differently, which also shows up in how the Carry and Overflow flags are set. Either way, a result has to be converted back to ASCII before it can be printed. Division (DIV/IDIV) works analogously, though the result is a quotient and a remainder. This means that the quotient is always truncated to a whole number.
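As a small worked example of the ASCII trick and of division, the sketch below adds two digit characters, increments the sum, prints it, and then divides 17 by 5. It assumes NASM on 32-bit Linux, the label result is made up, and the conversion only works as long as the sum stays a single digit.

```nasm
section .data
    result DB '0', 0xa      ; placeholder digit plus newline

section .text
    global _start

_start:
    MOV AL, '3'
    SUB AL, '0'             ; ASCII 51 - 48 = 3
    MOV BL, '4'
    SUB BL, '0'             ; ASCII 52 - 48 = 4
    ADD AL, BL              ; 3 + 4 = 7
    INC AL                  ; 8
    ADD AL, '0'             ; back to the character '8'
    MOV [result], AL        ; store it in front of the newline

    MOV EDX, 2              ; one digit plus newline
    MOV ECX, result
    MOV EBX, 1              ; stdout
    MOV EAX, 4              ; sys_write
    INT 0x80

    MOV AX, 17              ; dividend lives in AX for a byte-sized DIV
    MOV BL, 5               ; divisor
    DIV BL                  ; AL = 3 (quotient), AH = 2 (remainder)

    MOVZX EBX, AH           ; exit status = remainder
    MOV EAX, 1              ; sys_exit
    INT 0x80
```

The program prints 8, and `echo $?` afterwards should show the remainder 2.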
For all other programming, we of course can't quite do without loops and conditions. At least I can't. Because of how Assembler is structured, the branches implicit in a condition always live at different locations in the code, so checking a condition always has to be followed by a jump. Unconditional jumps are done using the JMP instruction. Whether the jump goes forwards or backwards is up to the programmer. Conditional jumps are written as j<condition>. A jump is basically shorthand for applying an offset to the value in the IP register. For most conditions, the first step is a comparison between two values, which is done using the CMP instruction. After the comparison, there should be a jump instruction to a label. Such a label is often called Lxx, where xx are numbers. Since labels can be inserted wherever without interfering with the program, the JMP instruction can be used not unlike the goto statement of C, the liberal use of which I'm admittedly prone to. The exact outcome of the comparison is picked up by the jump instruction, e.g. if the compared values are equal, then the jump instruction "JE" triggers. If the first value is greater, then "JG" does.
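The CMP-then-jump pattern described above might look like the following sketch, which compares two arbitrary numbers and prints one of three messages. It assumes NASM on 32-bit Linux; the message labels are invented, and L1 to L3 follow the Lxx naming mentioned above.

```nasm
section .data
    msg_gt DB "first is greater", 0xa
    len_gt EQU $ - msg_gt
    msg_eq DB "both are equal", 0xa
    len_eq EQU $ - msg_eq
    msg_lt DB "first is smaller", 0xa
    len_lt EQU $ - msg_lt

section .text
    global _start

_start:
    MOV EAX, 7              ; first value (arbitrary)
    MOV EBX, 5              ; second value (arbitrary)
    CMP EAX, EBX            ; sets the flags as if computing EAX - EBX
    JE  L1                  ; taken if the values are equal
    JG  L2                  ; taken if the first is greater (signed)

    MOV ECX, msg_lt         ; neither jump taken: first is smaller
    MOV EDX, len_lt
    JMP L3                  ; skip over the other branches

L1: MOV ECX, msg_eq
    MOV EDX, len_eq
    JMP L3

L2: MOV ECX, msg_gt
    MOV EDX, len_gt

L3: MOV EBX, 1              ; stdout
    MOV EAX, 4              ; sys_write
    INT 0x80

    MOV EBX, 0              ; exit status 0
    MOV EAX, 1              ; sys_exit
    INT 0x80
```

With 7 and 5 as inputs this prints "first is greater". Swapping or equalizing the two values exercises the other branches.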
For instructions that are meant to loop over a piece of code following some label, the LOOP instruction may be invoked. It assumes that the loop count is saved in the ECX register, decrements it with each iteration, and stops looping at the point at which ECX reaches zero. Otherwise, the counter register has to be chosen and decremented explicitly, along with the jump instruction.
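Both variants, LOOP and the spelled-out decrement plus jump, can be put side by side. The sketch below sums the numbers 5 down to 1 twice, assuming NASM on 32-bit Linux. The choice of EBX and EDX for the totals is arbitrary, and the result is returned as the exit status to keep system calls out of the loop body.

```nasm
section .text
    global _start

_start:
    ; Variant 1: LOOP does the counting
    MOV ECX, 5              ; loop count
    MOV EBX, 0              ; running total
L1: ADD EBX, ECX            ; adds 5, 4, 3, 2, 1
    LOOP L1                 ; ECX = ECX - 1, jump to L1 while ECX is not 0

    ; Variant 2: the same loop written out by hand
    MOV ECX, 5
    MOV EDX, 0
L2: ADD EDX, ECX
    DEC ECX                 ; explicit decrement, sets the zero flag at 0
    JNZ L2                  ; jump back while the result was not zero

    ; EBX and EDX both hold 15 here; EBX doubles as the exit status
    MOV EAX, 1              ; sys_exit
    INT 0x80
```

After running the program, `echo $?` should show 15.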
Emulating functions will become very necessary at some point. In Assembly, these are procedures, defined by
    proc_name:
        procedure body
        ...
        ret
and called using the CALL instruction. Return values are conventionally stored in the EAX register. Procedures usually access registers instead of taking parameters. They are good for simply shortening code, though of course we're very much missing the parameters. At this point, defining a macro might be more sensible. Macros are something I usually avoid, because I don't really get them. They usually seem very volatile, and if I write them, something breaks. Even further, the definition of macros can depend on the assembler, which would require me to have more than a basic knowledge of the assemblers I'm using. For NASM, they are defined by
    %macro name num_of_params
        <macro body>
    %endmacro
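To see a procedure and a macro next to each other, here is a minimal sketch assuming NASM on 32-bit Linux. The names write_string and sum are invented for the example. Inside the macro, %1 and %2 stand for its first and second parameter.

```nasm
; Macro with two parameters: print %2 bytes starting at address %1
%macro write_string 2
    MOV EDX, %2
    MOV ECX, %1
    MOV EBX, 1              ; stdout
    MOV EAX, 4              ; sys_write
    INT 0x80
%endmacro

section .data
    msg DB "Hello from a macro!", 0xa
    len EQU $ - msg

section .text
    global _start

; Procedure: expects its operands in EAX and EBX, returns the sum in EAX
sum:
    ADD EAX, EBX
    RET

_start:
    write_string msg, len   ; expands to the five instructions above

    MOV EAX, 4
    MOV EBX, 5
    CALL sum                ; EAX = 9 after the call

    MOV EBX, EAX            ; use the return value as the exit status
    MOV EAX, 1              ; sys_exit
    INT 0x80
```

The program prints the message and exits with status 9. The difference is visible in the design: the macro is pasted in wherever it is used and takes real parameters, while the procedure exists once in memory and communicates through registers.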
This is about as much of a primer as I can work through in a month that wants a lot of other time from me. I also don't honestly know what I need this for, but I guess now I could debug an Assembler script, if I wanted to. I might use this to break my system some time, and if that happens I'll mention it in that month's report.