Compilers (Part 1)
Wednesday, January 16th, 2008Ok, I know I said I was going to stop, but I found a book about compilers. Through out the book, the talk about how compilers work, and work you through making most of a compiler for a stripped down version of java. The main thing they leave out is linking and assembling.
I started looking through the book, and find it rather easy (for me) to read, so I’m enjoying it. I figured I would start a new set on compilers, but this time, only every once in a while, not all at once like before.
First, a compiler isn’t a black box that does lots of stuff all at once. Rather, it has many portions that build on the previous one to do it all. The two main portions are the front end and the back end. The front end reads the input, does some magic that I’ll talk about later, and outputs to a psudo-assembly. This psudo-assembly isn’t something that an assembler can deal with, but it is similar to many kinds of assembly. It makes it easy for the back end to make it into assembly. The back end will read this psudo-assembly, and convert it to the desired assembly form, then assemble it, and link it, and turn it into machine code.
This two part arrangement allows for one to write one back end for the type of computer, with many front ends for the different languages. Or, you could make one frond end for your language, with many back ends, so that you can compile to many platforms with your new language. It might also be that there is a common psudo-assembly that you could use so you can make use of pre-existing back ends, or front ends, depending on if you are developing a language, or a architecture.
The front end consists of a part that will “tokenize” the code. This is basically looking at each word, rather than each letter. When programming a tokenizer, you define what your keywords will be and what blocks there are, like identifiers, literals, symbols, and comments. You can tell it to skip stuff, like comments, and also, have it pay attention or ignore white space. The tokenizer will basically figure out what the words are in the source. In my book, they use Regular Expressions to determine what is valid or not. They had an example of determine a valid identifier as “[a-zA-Z_][a-zA-Z_0-9]*” which is a word that starts with a letter or underscore, and can be followed by any combination of letter, numbers, or underscores. I’ll do more on Regular Expressions in another post.
The tokenizer will feed into a lexical analyzer that will determine if the syntax is mostly correct. This is where BNF comes in. Your BNF could define what the above tokens are, so some Compiler-Compilers do this. The one in the book separated it out. The Lexical Analyzer looks at each word and determines if you can derive what is there from the BNF. There are a few ways of doing this, but I’m not sure what they exactly are (I’m not there yet).
If it is valid, it passes it on to a section that determines if things are in the correct order. Like using a variable only after it’s been declared. While the BNF doesn’t define this specifically, this is important, and so this section has to check that.
After all that is done, it converts the tolkens into the psudo-assembly, kinda in sections. So this part of the code does this, so we use this block of psudo-assembly. After that, the back end will convert the psudo-assembly into real assembly for the target platform, then will link it to the libraries (not sure on this one), and assemble it into machine code.
