Learning Assembly

Note on this tutorial

I wrote this tutorial many years ago (2001-2002). Because of the large number of links pointing to it I've left it up. However, I no longer support it. Please do not contact me asking for help with QBASIC, ASM, or other programming issues. Also, please do not contact me about minor errors in the text: the text of this tutorial is mostly unedited, and only the presentation and markup have been significantly updated (to valid XHTML 1.0 strict). If there are major errors which significantly interfere with readability, though, please feel free to bring them to my attention - Billy Wenge-Murphy
This tutorial is (c) 2001-2007 Billy Wenge-Murphy. All rights reserved. It may not be copied, reproduced, or redistributed in any form without permission. If you wish to share it, please link to it instead of reposting it.

While learning ASM, I found many tutorials very confusing, and they did not cover assembly in the detail that's necessary for such a complicated programming language. So, I wrote this rudimentary tutorial to ease the pain others may have learning ASM.

The problem with most beginner level tutorials is that they assume the reader has previous programming knowledge in one language or another. While I'll make comments that draw connections between programming in BASIC and ASM, I hope to write this in such a way that you can skip these remarks without affecting your learning, therefore making this a completely newbie-level tutorial.

First off, I believe it is very difficult to learn programming without programming as you learn. So, I suggest you have a copy of TASM, a necessary utility for writing assembly programs.

Also, before you start, it's important that you understand hexadecimal and binary.

2.1 - Introduction to programming

[Those with programming experience in any other language may want to ignore this section]

So what is programming anyway? Well, the basic idea is that a computer program is made up of a bunch of "instructions" that a computer follows. For the most part, a program is made by typing in a bunch of instructions that make much more sense to us than they do to the computer. Then, they are translated, "compiled" or "assembled" into a program that the computer can understand. This is why you need to download and install the software higher up on this page.
For our purposes, we can type these commands into a simple, standard text editor such as "Notepad". Actually, this is preferred - if you use a more advanced program like Microsoft Word, you'll have to make sure that you save it as "text only". So, if you can, use Notepad. It's standard with all versions of Windows.

2.2 - Your first program

Open up Notepad, or whatever you happen to have decided to type with. For a start, your programs should always have this skeleton:

.MODEL SMALL
.STACK 200H
.CODE
START:

END START

That is, all your programs should include these lines. Your whole program will go in lines between "start" and "end start".

If you type these lines into your file by hand instead of using copy and paste, it's very important that you notice the periods at the beginning of the first few lines, and the colon after START. Even the smallest dot is a very important piece in programming, so never overlook them.

Now, start and end start don't really mean much to a computer. But, to us, start is the beginning of something. End start doesn't make a lot of logical sense to us, but that's how it goes, so just grin and bear it. End Start tells where the end of the main program is. But right now, our program does absolutely nothing! So, we may want to learn about the different commands we can use in assembly.

2.3 - Interrupts

We can write a very simple program that puts just a character of text on the screen using just "interrupts". If you're familiar with any higher level languages, you can think of interrupts as essentially commands.

Each interrupt performs some complicated operation (or several), and all it requires is that you give it a small amount of information. In this case, we'll be using an interrupt that can put text characters on the screen. Because one interrupt may have many functions it can perform, we must tell it which one to do. Then, we give it the required information, and tell it to do its job. We can therefore do very complicated operations while being totally oblivious to how they work. Here is our example program:

.MODEL SMALL
.STACK 200H
.CODE
START:

Mov ah, 2
Mov dl, 1
Int 21h

mov ah, 4ch
mov al, 00h
int 21h

END START

At first, it may seem like a collection of completely arbitrary words and numbers. We soon realize, though, that it is a very concrete concept. Every part along the way does its important part. The tiny pieces of code result in one big program that does exactly what we expected. Here's a breakdown, line by line, of what the program does:

  1. We put the number 2 in a specific location in the computer's memory. Later, the computer will look at this number, which tells it which "function number" the interrupt should do. As mentioned before, most interrupts can do a variety of functions, so we must tell it which one to do. In this case, we want the DISPLAY OUTPUT function, which is function number 2. So, we put the number 2 in a specific place, just waiting for the computer to look it up later.

  2. We put the number 1 in a different specific place in memory. We've already specified that we want to use function 2 of the specific interrupt, which is DISPLAY OUTPUT. But what should it display? Well, different characters of text have different number codes assigned to them (this is unrelated to the base-whatever numbering stuff we talked about earlier, just to let you know). This code is called "ASCII".

    So, if we're going to be displaying text, we should specify what text. On the PC, character code 1 happens to be displayed as a little smiley face. (Strictly speaking, the smiley comes from the PC's own character set, code page 437, which gives pictures to the low codes that plain ASCII reserves as non-printing control codes.)

    After all of this, we've so far established that we want the computer to do some text displaying, and that the text we want to display is a smiley face.

  3. The two pieces of information we gave the computer would be worthless if we didn't do something with them. In this 3rd line, we tell the computer to use interrupt #21. As soon as this happens, it looks at the place in memory called "ah" and sees what's there, because it must know which of its numerous functions it should do. It ends up figuring out that it should display text, and ultimately, it should display a smiley face.
    Note that there's an "h" after the number 21. If we put an h after a number, it means that the number is not 21 in decimal. It's 21 in hexadecimal. Remember, these are not the same number. Think of this hexadecimal number as "two one"; if we convert it to decimal, we find that it's 33.
    But, it's common programming practice to use hexadecimal when referring to interrupts, rather than their decimal equivalents. So, unless you're a devout non-conformist, make it easy on yourself and think of this as "Int twenty-one h", not "Int thirty-three". It'll make it much easier for you to communicate about assembly, as everyone else calls interrupts by their hexadecimal numbers.

  4. The final commands end the program. They are necessary at the end of all your programs, unless you want awful things to happen. If you forget them, the results are unpredictable and will more than likely freeze up the computer.

There we have it! We've effectively written a program doing exactly what we expected from the outset.
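
If you'd like to check the numbers in this walkthrough without doing the conversions by hand, here is a minimal sketch in Python. Python is not part of this tutorial's toolchain; it is used here only as a calculator, and the helper name `hex_to_decimal` is made up for this illustration.

```python
def hex_to_decimal(hex_text):
    """Convert assembler-style hex text such as '21h' to a decimal integer."""
    digits = hex_text.lower().rstrip("h")  # drop the trailing 'h' marker
    return int(digits, 16)

interrupt_number = hex_to_decimal("21h")
exit_function = hex_to_decimal("4ch")

print(interrupt_number)  # 33
print(exit_function)     # 76
print(ord("A"))          # 65, the character code for a capital A
print(chr(66))           # B, the character with code 66
```

The same kind of lookup works for any character code you might want to put in dl before calling function 2.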

A couple things you should note

Firstly, the blank lines are just my style of separating code to make it easier to read. The assembler, which we'll explain how to use in just a second, doesn't care one way or the other if there are blank lines. You can take them out if you don't like them, or add more at will. It doesn't matter, because the only important part is the code: the commands involved in our program.

Secondly, in "Mov ah", ah is not a hexadecimal number, even though it looks like the hexadecimal digit A (10 in decimal) followed by the h marker. In this instance, it's the name of a place in memory, and the resemblance is just a coincidence. More is explained in the next section.

2.4 - The Registers

In the last section, one part of the program went unexplained. Primarily, these two lines:

Mov ah, 2
Mov dl, 1

First, let's explain "Mov". It appears to be shorthand for the word "Move". This makes a lot of sense. This command, unlike an interrupt, does a tiny, simple operation. However, it's a very important instruction in ASM.

The first one takes the number 2 and "Moves" it into a place the computer explicitly calls "ah". So we can deduce that the next command moves the number 1 into a place called "dl". It does. The MOV instruction can be used in other ways. For example, we could say:

Mov ah, dl

The computer would take whatever is in dl and move it into ah. Well, to say "move" is misleading, because it's not actually moved. Whatever is in dl stays there. But now, it's also in ah. Likewise, this would work:

Mov dl, ah

So what are ah and dl anyway? We know from many previous mentions that they're specific places in memory. They're called registers. The ones we're mainly concerned with right now are AX, BX, CX, and DX. They're made up of two 'pieces' each - hence, smaller registers. Ah, which we've already encountered, is one of the parts of ax. Ax also has another part called al. The h and l in ah and al mean "High" and "Low". They make up the higher and lower parts of the register ax. For example, if we did this:

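Mov ah, 1
Mov al, 0FFh

Then, if we looked at what is in ax, we would see it contained 01FF in hexadecimal. (FF is written 0FFh in the program because the assembler expects a hexadecimal number to end in h and to start with a digit; otherwise it would take FF for a name.)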

Why? Because the "High" part contains 1, or 01, and the "low" part contains FF. So, combined into the bigger register, they make 01FF.
So, we conclude that many registers, or at least the ones we care about, are made of 2 smaller parts. And, to find their values, we combine them (don't add them, though: 01 + FF = 100, not 01FF).

ah + al = ax
bh + bl = bx
ch + cl = cx
dh + dl = dx

One final thing to mention - al, bl, ah, bh, and so on, can each have a value of 0 to 255. So, when combined to make ax, bx, and so on, the range of possible values for those is 0 to 65,535.
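
The difference between combining and adding the two halves can be sketched outside the assembler. The following Python snippet is only an illustration (Python is not one of this tutorial's tools, and the function names `combine` and `split` are invented here); it treats the high and low halves as plain numbers.

```python
def combine(high, low):
    """Build a 16-bit register value from its high and low 8-bit halves."""
    return (high << 8) | low

def split(value):
    """Return the (high, low) halves of a 16-bit register value."""
    return (value >> 8) & 0xFF, value & 0xFF

ah, al = 0x01, 0xFF
ax = combine(ah, al)

print(f"{ax:04X}")          # 01FF  (combined)
print(f"{ah + al:X}")       # 100   (added, which is not the same thing)
print(split(0x01FF))        # (1, 255)
print(combine(0xFF, 0xFF))  # 65535, the largest value ax can hold
```

Shifting the high half left by 8 bits is just a way of saying "put it in front of the low half", which is what the hardware does for free.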

2.5 - Compiling our programs

We left off part 2.3 with a finished program. But, we never actually made a program out of it. Well, making a program is really quite simple: First save your program as something like "First.asm". Then, go to the folder where you have TASM and type this into the address bar:

>Tasm.exe First First.obj

As long as your program has no problems, this will make a file in the same directory called "first.obj". Then, type this into the address bar:

>Tlink First.obj

Finally, this will make a program called "First.exe"! Hoorah! Our first successful compile! (hopefully). Now, click on it to run it. If you have problems seeing it run because it opens and closes itself too fast... well.... enjoy!

3.1 - Memory

True, MOV, interrupts, and registers are very important, as you just read. However, there's not a whole lot that can be done using only them. To move on, we'll need to understand a little bit about the computer's memory. And to do this, we also need to know about memory in general. We'll start with how memory is divided up.
This can become quite complex, so just read through slowly, and go back over it if something confuses you.

Basically, a computer's memory is a piece of circuitry; most of the time many pieces. It has small points in the circuits called "transistors" that can either have an electric charge of 5v, or no charge. The millions of these that the computer has are where it stores everything.
Taking into account our previous knowledge of binary, we remember that in binary a digit can only be either 0 or 1. So, we could think of either a transistor with a charge, or one with no charge, as the same as 1 and 0 in binary. This turns out to be true. 1 and 0 represent each transistor of memory.
Each transistor is called a "bit". This is short for "BInary digiT".

Well, hexadecimal is also important in our discussion. You see, if we wanted to look at memory and it was all in binary form, it would be very cryptic - 1010111010000111001100011100011100111110.... and so on. So, to make memory easier to read, we can read it in hexadecimal numbers.

Well, recall that in hexadecimal the highest digit is F - which has a decimal equivalent of 15. In binary, that would take up four digits to show:

1111

is the same as F in hexadecimal.
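
Because one hexadecimal digit covers exactly four binary digits, converting between the two is just a matter of chopping the binary into groups of four. Here is a rough Python sketch of that idea (Python is used only as a scratchpad here, and `binary_to_hex` is a name made up for this example):

```python
def binary_to_hex(bits):
    """Convert a string of binary digits to hex, four digits at a time."""
    # Pad on the left with zeros so the length is a multiple of four.
    padded = bits.zfill((len(bits) + 3) // 4 * 4)
    groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    # Each group of four binary digits becomes exactly one hex digit.
    return "".join("0123456789ABCDEF"[int(group, 2)] for group in groups)

print(binary_to_hex("1111"))      # F
print(binary_to_hex("11111111"))  # FF
print(binary_to_hex("10101110"))  # AE
```

This is why a long run of binary digits shrinks to a quarter of its length when written in hexadecimal.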

Since our previous unit was called a "Bit", to keep in the same naming theme, 4 bits are called a "Nibble". Then, it just goes up from there.

8 bits = Byte
2 Bytes = Word
2 Words = Double Word (DWORD for short)

Then, for really big numbers, there's these:

1024 bytes = kilobyte (KB)
1024 KB = megabyte (MB)
1024 MB = gigabyte (GB)
1024 GB = terabyte
1024 terabytes = petabyte
1024 petabytes = exabyte

Terabyte can be TB, petabyte PB, etc, but these are not in common use. As of 2007, terabyte is only starting to come into common use as hard drives get larger.

In any case, we just need to deal with the terms "bit", "byte", and "word". They're the ones that'll come up most often in low-level programming.

As I mentioned briefly in the last section, the registers ah, al, and so on can only have a maximum value of 255. This may seem arbitrary at first - why not 999? Because they can only hold one byte. One byte is 8 bits, and the highest number we can make in binary with 8 bits is this:

11111111

That is FF in hexadecimal, or 255 in decimal. So when we put together ah and al, the highest number is 65535. Why? Well, each register can hold 1 byte, or 2 nibbles, or 8 bits - it's all the same amount. So, with two registers we have 2 bytes, or 4 nibbles, or 16 bits. Assuming we made the highest possible hexadecimal number with 4 nibbles, it would look like this:

FFFF

Punch that into your computer's calculator and convert it to decimal, and, surprise surprise: it equals 65535.
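
If you don't have a calculator handy, the same limits can be worked out from the number of bits alone: with n bits, the highest value is 2 to the power n, minus 1. A small Python sketch of that rule follows (Python is not part of this tutorial's tools, and `max_unsigned` is a made-up helper name):

```python
def max_unsigned(bits):
    """Largest value that fits in the given number of bits."""
    return 2 ** bits - 1

sizes = [("nibble", 4), ("byte", 8), ("word", 16), ("double word", 32)]
for name, bits in sizes:
    highest = max_unsigned(bits)
    print(f"{name}: {bits} bits, maximum {highest} (hex {highest:X})")
```

The byte and word lines of the output match the 255 and 65535 limits described above.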

3.2 - Addressing

In order to use the computer's memory - e.g. store numbers, text, etc - we have to understand how the computer goes about organizing it. It does this with something called Segments and Offsets. These are how we and the computer tell each other where things should be put and where they should be gotten from. Whenever we want to read or write to memory, we must use numbers pointing to the exact location of the BYTE we want to read; specifically, 2 numbers, called the Segment and Offset.
Usually these two numbers are WORDs (16 bits each). One points to the general area of memory, the "Segment". The other says how many bytes into that segment to go, and is known as the "Offset". Since the offset is a 16-bit number, each segment lets us use up to 64KB of memory at once (65536 bytes).

For example, say these numbers (hexadecimal) were stored somewhere in memory:

00 AB D2 AC 98 4E 67

and so on. Now, say we wanted to read those numbers. Well, the computer has millions of bytes of memory - so we must have some way of specifying what part of memory they're in. This is called their "Address". Think of a real life address, with the street name and the number. Essentially the street name is what "part" of the city you live in. We do the same with computer memory, but both parts are numbers. So, that data above may be located at:

FE00:0000

FE00 would be the "segment", or part. And 0000 would be how far in the data starts. So the address FE00:0000 would "point" to the hexadecimal number 00. In that case, FE00:0001 would point to the hexadecimal number AB. FE00:0002 points to D2, and so on. Bear this in mind as we cover just one more section before making use of what we now know about memory.
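
To make the street-name-and-number idea concrete, here is a toy model in Python. It is only a sketch: real memory is not a Python dictionary, Python is not one of this tutorial's tools, and the names `memory` and `read_byte` are invented for the illustration. It simply stores the example bytes in a "segment" FE00 and looks them up by offset.

```python
# Toy model: one segment of memory holding the example bytes at offset 0.
# This only mirrors the idea of "which segment" plus "how far into it".
memory = {
    0xFE00: bytes([0x00, 0xAB, 0xD2, 0xAC, 0x98, 0x4E, 0x67]),
}

def read_byte(segment, offset):
    """Return the byte stored at segment:offset in the toy memory."""
    return memory[segment][offset]

for offset in range(3):
    value = read_byte(0xFE00, offset)
    print(f"FE00:{offset:04X} -> {value:02X}")
```

Running it lists FE00:0000, FE00:0001 and FE00:0002 next to the bytes 00, AB and D2, matching the addresses worked out above.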

3.3 - The Register DS

The registers we've covered: AX, BX, CX, DX, and their smaller parts, are all called General Purpose Registers. There is another kind of register, called Segment Registers.

In this case we're discussing DS - not to be confused with DX - which stands for "Data Segment". Segment registers are used, not surprisingly, to point to segments of memory. They aren't usually used for holding data like the general purpose registers are.

Going back to the previous sections; say we wanted to print some text to the screen. We learned one method, but that would require that we print each individual character one after another!

There's a better way. There is another function of Int 21h that will print an entire string (a string is a bunch of text characters one after another). For the sake of simplicity, say that the text we want is stored at FE00:0000. This program will allow us to print it out to the screen:

.MODEL SMALL
.STACK 200H
.CODE
START:

Mov ax, 0FE00h   ; hex numbers need a leading digit and a trailing h
Mov ds, ax
Mov dx, 0
Mov ah, 09
Int 21h

mov ah, 4ch
mov al, 00h
int 21h

END START

So, what does this program do? Well, first, it puts fe00, the segment of the text, into AX. We use the MOV instruction to do this, which in this case 'moves' (or rather copies) fe00 into the ax register.

But, we wanted ds to have the segment. Well, that's one quirk of the segment registers - you're not allowed to change them directly. So, you can't just put a number right into DS. You can, however, put another register into them. So, we put the segment number first into ax. Now, we move it into ds. Then we put 0 into dx. This interrupt requires that we have the segment of the text in DS and the offset in DX. Since the offset is 0, we put 0 into dx.

Next, we put 9 into ah. Since int 21h has a lot of different functions it can do, we must specify which one we want. The one to print text, by specifying a segment and offset, is #9.

Finally, we use int 21 again, but this time to end the program.

In theory, this is just great. But, memory doesn't work like that. We generally don't just put whatever we want, where we want. At least not at this stage.

For example, when you run a program like this one (if you were to compile and run it, which I don't recommend), the computer picks out a free space in memory to load the program itself. You don't specify this. So, what have we accomplished then with segments and offsets if we can't use them?

We can, as you will see.

3.4 - Variables

DS was important to introduce in the previous section, because when you write a program, you can have things called "variables". And whatever you put in these variables is usually put in the "Data Segment", which is what DS points to.

When the compiler/assembler is done changing your program into something the computer can actually read, it doesn't actually use variables, but they make life a lot easier for programmers.

So, what are variables and how do we use them? A variable is where you can store data: strings (text), numbers, etc. They're called variables because, well, they can vary. Not only can they contain a number or something like that, but you can change them as much as you need during your program. This makes programs much more versatile and useful.

For example, let's rewrite that last program so that it does actually work:

.MODEL SMALL
.STACK 200H
.DATA            ; This is a new part! Make sure to include it

Textstring db "I'm a string$"

.CODE
START:

Mov ax, SEG Textstring
Mov ds, ax
Mov dx, OFFSET Textstring
Mov ah, 09
Int 21h

mov ah, 4ch
mov al, 00h
int 21h

END START

Wow. A lot of things to explain here. Let's start from the top downward. You'll notice there's a new part that should be included in the beginning. The part called .DATA declares what variables we have.

As always, the period in front of DATA is very important. Also, make sure that .DATA comes before .CODE, because .CODE says that everything after it is part of the code.

Again, we put the segment into ax first, since we can't move it straight into ds. One very convenient feature of the assembler is that we don't have to figure out the segment and offset that our variable is at. That's good, because as we said, the computer decides quite randomly where to load things, which would make it tough to find where our variables are in memory. So, by saying SEG Textstring, we move the segment of that variable into ax instead of what's actually in the variable. The same goes for OFFSET Textstring. It puts the offset of the variable Textstring into the register, instead of the actual variable.

One more unexplained part - What's with that line after .DATA?

Textstring db "I'm a string$"

Well, Textstring is the name of the variable - we must specify the name we want to call the variable first.

Next, db stands for "Define Byte(s)". It can be used whether we want our variable to be one byte long or multiple bytes. In this case, it's multiple bytes, because each character of text takes up one byte.

Finally, we tell the assembler what we want to be in the variable. This can be changed by your program, but we just tell it what we want the variable to start as.

One more little detail of int 21, function 9 is that the text you're printing must have a dollar sign at the end. It doesn't actually print a dollar sign on the screen, it just indicates where the text ends.

Go ahead and compile and run this program. Unlike the last one, it should work.
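
The dollar-sign rule for function 9 can be modeled in a few lines. The following Python sketch is only an imitation of the behavior described above, not how DOS itself is written (Python is not part of this tutorial's tools, and `text_before_dollar` is a hypothetical helper name): it starts at a given offset and collects characters until it reaches the first dollar sign.

```python
def text_before_dollar(data, offset=0):
    """Return the text starting at offset, stopping before the first '$'."""
    end = data.index(b"$", offset)  # raises ValueError if there is no '$'
    return data[offset:end].decode("ascii")

textstring = b"I'm a string$"
print(text_before_dollar(textstring))  # I'm a string
```

Notice what happens if the dollar sign is missing: the model raises an error, while the real function would just keep printing whatever happens to follow in memory until it runs into a dollar sign.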
