Compiler Notes, part 3
Embedded Assembly Language
Inventing a new form of syntax for every little operation that may ever be needed is extremely time consuming and makes a language very complicated and annoying. How many programs ever need to change the value of the Call Gate Base Register for instance?
A very useful alternative is the make it possible to write assembly language instructions directly in a normal program, perhaps as in this little snippet:
if (option=12) then
{ count=(count+1);
< LOAD R1, 0x80003800;
< SETSR R1, $CGBR }
This is easy to implement:
· Make sure your reader will accept (or whatever other symbol you choose) as a single symbol or operator. Remember that multi-character operators need special handling in reader.cpp
· Notice that the components of an assembly instruction do not follow the same patters as the symbols and operators of the rest of the program, so the text that comes after will need to be read in a different way
· Just as reader has a read() method that gets the next symbol of input, it can have a read_assembly() method gets a whole load of text including spaces, commas, and just about anything. The read_assembly() method should read everything that remains on the current input line, up to a ‘;’, ‘}’, or ‘\n’.
· When your parse_statement function sees as the first symbol of a statement, it simply uses read_assembly() to obtain the whole instruction, and returns a new tree node labelled as “assembly” and containing the instruction as its detail.
· When your translate_statement function finds an “assembly” node, all it has to do is print the detail string directly into the object file.
However, to be truly convenient, you will find the need for one added feature.
Perhaps you have designed a nice function to set up the call gate registers for you: you pass it both the address to be stored in $CGBR and the length to be stored in $CGLEN:
function set_cg(address, length)
{ < LOAD R1, [FP+2];
< SETSR R1, $CGBR;
< LOAD R1, [FP+3];
< SETSR R1, $CGLEN }
It is immediately clear from this example that having to work out the correct frame pointer offsets for local variables and parameters would be both annoying and unreliable. And this is exactly the sort of thing that compilers are supposed to do for you.
My usual trick is to pick on something bracket-like that never appears in assembly language, and use it to indicate the parts that I need help with, like this:
function set_cg(address, length)
{ < LOAD R1, [<address>];
< SETSR R1, $CGBR;
< LOAD R1, [<length>];
< SETSR R1, $CGLEN }
It is only a very small addition to the work done by translate_statement to implement this. When translate_statement reaches an “assembly” node, it doesn’t just print the string into the object file. First it scans the string to see if it contains or characters. If it does, everything up to (but not including) the is printed to the file normally. Then the characters between the and are extracted as a string, and looked up in the symbol table, so that the correct form of address may be printed. Finally, everything after the is printed normally. It is a very simple procedure.
Input and Output Convenience
The print statement was one of the first and easiest that we incorporated into the language. Just a couple of lines in each of the parser and the translator. All it will do is print integers in decimal, because it is always translated into "FAKEIT R1, $PRINTINT", which doesn't know how to do anything else.
The ability to print single characters given their ascii codes is just as easy to add. FAKEIT has a $PRINTCHAR option which does exactly that. It is not going to be possible to make this another facet of the print statement, perhaps being used like this:
local x;
x='A';
print x;
because there is no way the compiler could possibly tell that you want the value of x (which would be 65) to be printed as "A" rather than as "65". There is no need for that to be a problem. Just invent a new kind of statement, perhaps called "printchar". Parsing and translating would be exactly as they are for the existing print statement, except that $PRINTCHAR would be used instead of $PRINTINT. So this example:
local x;
x='A';
print x;
printchar x;
would print "65A".
There is also a basic input option. FAKEIT R1, $READCHAR takes one character of user input and returns its ascii code in the register. The easiest way to make use of that is to invent a new kind of expression, probably just one fixed word, that translates into a use of this instruction.
x=inchar;
something like that.
Whole strings are even more useful. The convenience of being able to write
x="One two three four";
printstring x;
or even
printstring "Error! Your program has gone wrong.";
is self evident. The assembler already knows how to deal with constant strings. You can do things like this
LOAD R1, string_17
FAKEIT R1, $PRINTSTR
...
string_17: .STRING "Greetings, Human.\n"
and it is not difficult to automate that in a compiler. There are really only three things to do.
- Make sure your reader knows what to do when it meets a double-quote character, and be careful of those \ things. And make sure your expression parser, when it receives a string from the reader, has the sense to create a sensibly tagged tree node containing the string.
- When the expression translator has to make use of a string, it needs to be able to generate a unique label to refer to that string, like "string_17" in the example above. You know how to do that. But be careful: defining that label has to be saved for later. The characters of the string can't be allowed to appear in the middle of executable instructions.
- That is also easy to handle. Just keep a list of all the strings that have been needed throughout the program, and at the very end when there is nothing else left to translate, just print all those string_n: definitions at the end of the assembly file.
Dynamic Memory Allocation
Properly implementing the new and delete operations (or malloc and free as they were in C) is a challenging and tricky problem. A stop-gap solution which provides a working new, but doesn't even attempt delete, is trivial. Without delete any program that uses dynamic memory will have a serious memory leak, so working out a solution might be an interesting exercise for you.
Perhaps you remember that at startup, the emulator puts some useful information in a few fixed memory locations. 0x100 contains the total amount of memory that exists, and 0x101 contains the first address that no code or data has been automatically loaded into. When a program starts, it should preserve these two values safely in variables. If it some point it needs 25 words of memory for something, it can simply use the 25 words starting from the first free memory location, and of course add 25 to its idea of where the first free location now is.
The equivalent of new would be a very small function. Tell it how many words you need, and so long as it hasn't run out of memory, it returns to you a pointer to the beginning of a lump of memory exactly that size. Programmers have to take responsibility for working out exactly how many words their data structures need to occupy, but someone programming at the systems level should have no trouble with that.
Arrays and Pointers
In our original version of the language, normal array variables were created by adding a size specification to a normal local or global declaration, as in
local A:10;
or
global data:300;
With the basic dynamic memory allocation system described above, they can now also be created on-the-fly, with a size that may not have been known when the program was compiled, and accessed through a pointer variable:
local x;
x = call new(size);
All we need to do is implement the essential pointer operations.
The only thing you ever really do with a pointer is to follow it. Sometimes pointers are followed in order to use that value that they point to, and sometimes, as in assignments, they are followed to find the place that a value should be stored. It is best to think of these as two distinct operations.
Invent an operator to represent the follow-the-pointer operation. You could go with *, as C and C++ do, or you could do something creative. For the sake of examples, I'll go with *.
Your expression parser must be able to deal with this new operator. If the first symbol seen by parse_expression is a *, it must be the pointer operator, as a multiplication sign would never appear first. After the *, things could get complicated unless we think of operator priorities. They have a reason for existing. Let's say * is going to be a very high priority operator. So high that the only things * could possibly be applied directly to are plain old variables, anything else must be put in parentheses.
Now parse_expression has a very easy job. After seeing a *, there are only two possibilities. If a '(' appears next, it must be followed by a whole expression then a closing ')'. So the parser just swallows up the '(', then calls itself again recursively to read the parenthetical expression, then makes sure the next symbol really is the matching ')'. If the * is followed by a variable name, then the expression is complete.
To handle these things, parse_expression would have an extra clause like this...
else if symbol is "*"
{ create new node n, tagged as "follow pointer"
read next symbol
if symbol is "("
{ call parse_expression
add the result to n
read next symbol
error if it isn't ')' }
else if the symbol is a variable
{ create a node representing the variable
add it to n }
When the expression translator comes across this "follow pointer" node, translation is nothing. In order to get the value of "*X" into register N, first get the value of X into register N, then complete the indirection with LOAD Rn, [Rn].
Assignment through a pointer is almost exactly the same. We already detect assignment statements when the first thing seen by parse_statement is a variable. Now there is an extra case. If the first thing seen by parse_statement is a *, we have detected an "assign through pointer" statement. The star must be followed by either a variable or an expression in parentheses, then the usual '=' and the value expression to complete the assignment. Translation of an "assign through pointer" statement follows a very similar pattern to a "follow pointer" expression.
*(expression1) = expression2
requires nothing more than
Get value of expression2 in R1
Get value of expression1 in R2
STORE R1, [R2]
Obtaining a Pointer
No additions are necessary, there are already two ways of obtaining a pointer. Recall that the dynamic memory allocation function discussed earlier will return as its value a pointer to the required amount of memory. For normal, non-dynamically allocated things, remind yourself of how variables are handled by looking quickly at the expression translator. You'll see that the part for dealing with things that appear to be variables has this sort of structure:
if (kind=='l') // local variable
use [FP+n]
if (kind=='L') // local array
use FP+n
if (kind=='g') // global variable
use [g_name]
if (kind=='G') // global array
use g_name
When the name of an array appears as an expression, its address is computed the same way as it is for a normal variable, except that the [ square brackets ] are not used. This means that the instruction operand evaluates to the address of an array, not its contents. In other words, the name of a global or local array already represents a pointer to the beginning of that array. No extra work is needed. Storing things in arrays just requires the use of + along with *:
int A[20]; º local A:20;
A[3]=99; º *(A+3) = 99;
A[4]=123; º *(A+4) = 123;
A[i*2]=A[i]*A[j]; º *(A+(i*2)) = (*(A+i)**(A+j));
If you want to be able to make pointers to regular, non-array, variables, then you will need a little something extra. Invent an operator to represent the "give me a pointer to" operation. C and C++ use for this purpose. It can only appear as an expression, and it can only reasonably be followed by a variable name, so adding it to the parser requires a small subset of the work needed to add the * operator.
What does the expression translator need to do with operations? All it can ever need to do is put the address of a variable into a particular register. How do you get the address of a local variable into register n?
LOAD Rn, FP+offset
How do you get the address of a global variable into register n?
LOAD Rn, g_name
It is just the same as a normal variable access but without the square brackets.
Strings
Strings are very easy if you don’t expect too much from them. Forget everything about C++ or java strings. Just let a string be a sequence of characters nicely packed four to a word with a zero byte at the end. We’ve already discussed the use of constant “quoted” strings: the assembler’s .STRING directive automatically packs the characters four per word and adds the final zero.
The equivalent of a string variable can be either a normal local or global array of the right size, or a lump of memory delivered by our primitive dynamic memory scheme. In all those cases, be careful to make sure the array is big enough - an 8 character string requires 3 words, not 2.
There are only two essential string operations: looking at the nth character, and setting the nth character. Everything else can be defined in terms of those two operations, just like in the <string.h> library used for C programming. The emulator provides special instructions for dealing with characters in strings, all you need to do is tell the compiler how to use them.