Numbers, Unary Operations, Variables

Lets Write a Compiler!

Our goal is to write a compiler which is a function:

compiler :: SourceProgram -> TargetProgram

In 131 TargetProgram is going to be a binary executable.

Lets write our first Compilers

SourceProgram will be a sequence of four tiny “languages”

  1. Numbers
  1. Numbers + Increment
  1. Numbers + Increment + Decrement
  1. Numbers + Increment + Decrement + Local Variables

Recall: What does a Compiler look like?

Compiler Pipeline
Compiler Pipeline

An input source program is converted to an executable binary in many stages:

Simplified Pipeline

Goal: Compile source into executable that, when run, prints the result of evaluating the source.

Approach: Lets figure out how to write

  1. A compiler from the input string into assembly,
  2. A run-time that will let us do the printing.
Simplified Compiler Pipeline with Runtime
Simplified Compiler Pipeline with Runtime

Next, lets see how to do (1) and (2) using our sequence of adder languages.

Adder-1

  1. Numbers

The “Run-time”

Lets work backwards and start with the run-time.

Here’s what it looks like as a C program main.c

#include <stdio.h>

extern int our_code() asm("our_code_label");

int main(int argc, char** argv) {
  int result = our_code();
  printf("%d\n", result);
  return 0;
}

Test Systems in Isolation

Key idea in SW-Engg:

Decouple systems so you can test one component without (even implementing) another.

Lets test our “run-time” without even building the compiler.

Testing the Runtime: A Really Simple Example

Given a SourceProgram

42

We want to compile the above into an assembly file forty_two.s that looks like:

section .text
global our_code_label
our_code_label:
  mov eax, 42
  ret

For now, lets just

$ nasm -f aout -o forty_two.o forty_two.s
$ clang -g -m32 -o forty_two.run forty_two.o main.c

On a Mac use -f macho instead of -f aout

We can now run it:

$ forty_two.run
42

Hooray!

The “Compiler”

Recall, that compilers were invented to avoid writing assembly by hand

First Step: Types

To go from source to assembly, we must do:

Simplified Pipeline
Simplified Pipeline

Our first step will be to model the problem domain using types.

Simplified Pipeline with Types
Simplified Pipeline with Types

Lets create types that represent each intermediate value:

Defining the Types: Text

Text is raw strings, i.e. sequences of characters

texts :: [Text]
texts =  
  [ "It was a dark and stormy night..."
  , "I wanna hold your hand..."
  , "12"
  ]

Defining the Types: Expr

We convert the Text into a tree-structure defined by the datatype

data Expr = Number Int

Note: As we add features to our language, we will keep adding cases to Expr.

Defining the Types: Asm

Lets also do this gradually as the x86 instruction set is HUGE!

Recall, we need to represent

section .text
global our_code_label
our_code_label:
  mov eax, 42
  ret

An Asm program is a list of instructions each of which can:

type Asm = [Instruction]

data Instruction
  = ILabel Text
  | IMov   Arg Arg
  | IRet

Where we have

data Register
  = EAX

data Arg
  = Const Int       -- a fixed number
  | Reg   Register  -- a register

Second Step: Transforms

Ok, now we just need to write the functions:

parse   :: Text -> Expr     -- 1. Transform source-string into AST
compile :: Expr -> Asm      -- 2. Transform AST into assembly
asm     :: Asm  -> Text     -- 3. Transform assembly into output-string

Pretty straightforward:

parse :: Text -> Expr
parse    = parseWith expr
  where
    expr = integer

compile :: Expr -> Asm
compile (Number n) =
  [ IMov (Reg EAX) (Const n)
  , IRet
  ]

asm :: Asm -> Text
asm is = L.intercalate "\n" [instr i | i <- is]

Where instr is a Text representation of each Instruction

instr :: Instruction -> Text
instr (IMov a1 a2) = printf "mov %s, %s" (arg a1) (arg a2)

arg :: Arg -> Text
arg (Const n) = printf "%d" n
arg (Reg r)   = reg r

reg :: Register -> Text
reg EAX = "eax"

Brief digression: Typeclasses

Note that above we have four separate functions that crunch different types to the Text representation of x86 assembly:

asm   :: Asm -> Text
instr :: Instruction -> Text
arg   :: Arg -> Text
reg   :: Register -> Text

Remembering names is hard.

We can write an overloaded function, and let the compiler figure out the correct implementation from the type, using Typeclasses.

The following defines an interface for all those types a that can be converted to x86 assembly:

class ToX86 a where
  asm :: a -> Text

Now, to overload, we say that each of the types Asm, Instruction, Arg and Register implements or has an instance of ToX86

instance ToX86 Asm where
  asm is = L.intercalate "\n" [asm i | i <- is]

instance ToX86 Instruction where
  asm (IMov a1 a2) = printf "mov %s, %s" (asm a1) (asm a2)

instance ToX86 Arg where
  asm (Const n) = printf "%d" n
  arg (Reg r)   = asm r

instance ToX86 Register where
  asm EAX = "eax"

Note in each case above, the compiler figures out the correct implementation, from the types…

Adder-2

Well that was easy! Lets beef up the language!

  1. Numbers + Increment

Repeat our Recipe

  1. Build intuition with examples,
  2. Model problem with types,
  3. Implement compiler via type-transforming-functions,
  4. Validate compiler via tests.

1. Examples

First, lets look at some examples.

Example 1

How should we compile?

add1(7)

In English

  1. Move 7 into the eax register
  2. Add 1 to the contents of eax

In ASM

mov eax, 7
add eax, 1    

Aha, note that add is a new kind of Instruction

Example 2

How should we compile

add1(add1(12))

In English

  1. Move 12 into the eax register
  2. Add 1 to the contents of eax
  3. Add 1 to the contents of eax

In ASM

mov eax, 12
add eax, 1
add eax, 1

Compositional Code Generation

Note correspondence between sub-expressions of source and assembly

Compositional Compilation
Compositional Compilation

We will write compiler in compositional manner

2. Types

Next, lets extend the types to incorporate new language features

Extend Type for Source and Assembly

Source Expressions

data Expr = ...
          | Add1   Expr

Assembly Instructions

data Instruction
  = ...
  | IAdd  Arg Arg

Examples Revisited

src1 = "add1(7)"

exp1 = Add1 (Number 7)

asm1 = [ IMov (EAX) (Const 7)
       , IAdd (EAX) (Const 1)
       ]
src2 = "add1(add1(12))"

exp2 = Add1 (Add1 (Number 12))

asm2 = [ IMov (EAX) (Const 7)
       , IAdd (EAX) (Const 1)
       ]

3. Transforms

Now lets go back and suitably extend the transforms:

parse   :: Text -> Expr     -- 1. Transform source-string into AST
compile :: Expr -> Asm      -- 2. Transform AST into assembly
asm     :: Asm  -> Text     -- 3. Transform assembly into output-string

Lets do the easy bits first, namely parse and asm

Parse

parse :: Text -> Expr
parse    = parseWith expr

expr :: Parser Expr
expr =  try primExpr
    <|> integer  

primExpr :: Parser Expr
primExpr = Add1 <$> rWord "add1" *> parens expr

Asm

To update asm just need to handle case for IAdd

instance ToX86 Instruction where
  asm (IMov a1 a2) = printf "mov %s, %s" (asm a1) (asm a2)
  asm (IAdd a1 a2) = printf "add %s, %s" (asm a1) (asm a2)

Note

  1. GHC will tell you exactly which functions need to be extended (Types, FTW!)
  2. We will not discuss parse and asm any more…

Compile

Finally, the key step is

compile :: Expr -> Asm
compile (Number n)
  = [ IMov (Reg EAX) (Const n)
    , IRet
    ]
compile (Add1 e)
  = compile e                    -- EAX holds value of result of `e` ...
 ++ [ IAdd (Reg EAX) (Const 1) ] -- ... so just increment it.

Examples Revisited

Lets check that compile behaves as desired:


ghci> (compile (Number 12)
[ IMov (Reg EAX) (Const 12) ]

ghci> compile (Add1 (Number 12))
[ IMov (Reg EAX) (Const 12)
, IAdd (Reg EAX) (Const 1)
]

ghci> compile (Add1 (Add1 (Number 12)))
[ IMov (Reg EAX) (Const 12)
, IAdd (Reg EAX) (Const 1)
, IAdd (Reg EAX) (Const 1)
]

Adder-3

You do it!

  1. Numbers + Increment + Double

Adder-4

  1. Numbers + Increment + Decrement + Local Variables

Local Variables

Local variables make things more interesting

Repeat our Recipe

  1. Build intuition with examples,
  2. Model problem with types,
  3. Implement compiler via type-transforming-functions,
  4. Validate compiler via tests.

Step 1: Examples

Lets look at some examples

Example: let1

let x = 10
in
    x

Need to store 1 variable – x

Example: let2

let x = 10         -- x = 10
  , y = add1(x)    -- y = 11
  , z = add1(y)    -- z = 12
in
    add1(z)        -- 13

Need to store 3 variable – x, y, z

Example: let3

let a = 10
  , c = let b = add1(a)
        in
           add1(b)
in
    add1(c)

Need to store 3 variables – a, b, c – but at most 2 at a time

Registers are Not Enough

A single register eax is useless:

There is only a fixed number (say, N) of registers:

Memory: Code, Globals, Heap and Stack

Here’s what the memory – i.e. storage – looks like:

Memory Layout
Memory Layout

Focusing on “The Stack”

Lets zoom into the stack region, which when we start looks like this:

Stack Layout
Stack Layout

The stack grows downward (i.e. to smaller addresses)

We have lots of 4-byte slots on the stack at offsets from the “stack pointer” at addresses:

How to compute mapping from variables to slots ?

The i-th stack-variable lives at address [EBP - 4 * i]

Required A mapping

Solution The structure of the lets is stack-like too…

Example: Let-bindings and Stacks

let x = 1           -- []
in                  -- [ x |-> 1 ]
    x
let x = 1           -- []
  , y = add1(x)     -- [x |-> 1]
  , z = add1(y)     -- [y |-> 2, x |-> 1]
in
    add1(z)         -- [z |- 3, y |-> 2, x |-> 1]

QUIZ

At what position on the stack do we store variable c ?

                          -- []
let a = 1                 -- [a |-> 1]
  , c =                   -- [a |-> 1]
        let b = add1(a)   --     [b |-> 2, a |-> 1]
        in add1(b)        -- [a |-> 1]
                          -- [c |-> 2, a |-> 1]
in
    add1(c)

A. 1 B. 2 C. 3 D. 4 E. not on stack!

haskell "ENVIRONMENT" -- [] let a = 1 -- [a |-> 1] , b = 2 -- [b |-> 2, a |-> 1] , c = -- let b = add1(a) -- [b |-> 3, b -> 2, a |-> 1] in add1(b) -- -- [b |-> 2, a |-> 1] in add1(b)

              -- ENV(n)
let x = STUFF 
              -- [x |-> n+1, ENV(n)]
in OTHERSTUFF 
              -- ENV(n)

Strategy

At each point, we have env that maps (previously defined) Id to StackPosition

Variable Use

To compile x given env

  1. Move [ebp - 4 * i] into eax

(where env maps x |-> i)

Variable Definition

To compile let x = e1 in e2 we

  1. Compile e1 using env (i.e. resulting value will be stored in eax)
  2. Move eax into [ebp - 4 * i]
  3. Compile e2 using env'

(where env' be env with x |-> i i.e. push x onto env at position i)

Example: Let-bindings to Asm

Lets see how our strategy works by example:

Example: let1

Convert let1 to Assembly
Convert let1 to Assembly

QUIZ: let2

When we compile

let x = 10 
  , y = add1(x)
in
  add1(y)

The assembly looks like

mov eax, 10                ; LHS of let x = 10
mov [ebp - 4*1], eax       ; save x on the stack 
mov eax, [ebp - 4*1]       ; LHS of   , y = add1(x) 
add eax, 1                 ; "" 
???
add eax, 1

What .asm instructions shall we fill in for ???

mov [ebp - 4 * 1], eax    ; A  
mov eax, [ebp - 4 * 1]

mov [ebp - 4 * 1], eax    ; B  

mov [ebp - 4 * 2], eax    ; C  

mov [ebp - 4 * 2], eax    ; D  
mov eax, [ebp - 4 * 2]

                          ; E  (empty! no instructions)

Example: let3

Lets compile

let a = 10 
  , c = let b = add1(a)
        in 
            add1(b)
in
    add1(c)

Lets figure out what the assembly looks like!

mov eax, 10                ; LHS of let a = 10
mov [ebp - 4*1], eax       ; save a on the stack 
??? 

Step 2: Types

Now, we’re ready to move to the implementation!

Lets extend the types for Source Expressions

type Id   = Text

data Expr = ...
          | Let Id Expr Expr    -- `let x = e1 in e2` modeled as is `Let x e1 e2`
          | Var Id             

Lets enrich the Instruction to include the register-offset [ebp - 4*i]

data Arg = ...
         | RegOffset Reg Int    -- `[ebp - 4*i]` modeled as `RegOffset EBP i`

Environments

Lets create a new Env type to track stack-positions of variables

data Env = [(Id, Int)]

data Maybe a = Nothing | Just a

lookupEnv :: Env -> Id -> Maybe Int
lookupEnv [] x              = Nothing 
lookupEnv ((y, n) : rest) x = if x == y 
                                then Just n 
                                else lookupEnv rest x 

pushEnv   :: Env -> Id -> (Int, Env)
pushEnv env x = (xn , env')
  where
    env'      = (x, xn) : env
    xn        = 1 + length env

compile env (Let x e1 e2) = 
  compile env e1
  ++ -- EAX hold the value of "x"
  [IMov (RegOffset EBP xn) EAX ]
  ++
  compile env' e2
  where
    (xn, env') = pushEnv env x

compile env (Var x) = [IMov EAX (RegOffset EBP xn)]
  where
    xn              = case lookupEnv env x of
                         Just n -> n
                         Nothing -> error "variable out of scope"

API:

push :: Id -> Env -> (Int, Env)
push x env = (i, (x, i) : env)
  where
    i         = 1 + length env

lookup :: Id -> Env -> Maybe Int
lookup x []             = Nothing
lookup x ((y, i) : env)
  | x == y              = Just i
  | otherwise           = lookup x env   

Step 3: Transforms

Ok, now we’re almost done. Just add the code formalizing the above strategy

Code

Variable Use

compileEnv env (Var x) = [ IMov (Reg EAX) (RegOffset EBP i) ]
  where
    i                  = fromMaybe err (lookup x env)
    err                = error (printf "Error: Variable '%s' is unbound" x)

Variable Definition

compileEnv env (Let x e1 e2 l)  = compileEnv env  e1  
                               ++ IMov (RegOffset EBP i) (Reg EAX)
                                : compileEnv env' e2
      where
        (i, env')               = pushEnv x env

Step 4: Tests

Lets take our adder compiler out for a spin!

Recap: We just wrote our first Compilers

SourceProgram will be a sequence of four tiny “languages”

  1. Numbers
  1. Numbers + Increment
  1. Numbers + Increment + Decrement
  1. Numbers + Increment + Decrement + Local Variables

Using a Recipe

  1. Build intuition with examples,
  2. Model problem with types,
  3. Implement compiler via type-transforming-functions,
  4. Validate compiler via tests.

Will iterate on this till we have a pretty kick-ass language.