This was the first ICFPC ever for me, and for other guys on the team as well. Amazing experience, and I’m very thankful to the organizers and to my friends who came to hack with me. It was a weekend full of fun and constant deadline pressure. I haven’t been solving non-trivial algorithmic problems since I can’t remember, and this experience was a great reminder.

{
    "contestScore": 547,
    "lightningScore": 249,
    "trainingScore": 117,
    "mismatches": 171,
    "numRequests": 5066,
    "cpuTotalTime": 4384.762
}

We scored 249 in the lightning round and 547 overall. It isn’t a great result, but it’s probably not too bad. We could do a bit better if we had some more time – in hindsight, we spent much time doing things that are not worth it. But on a larger scale I think our solution was just not good enough.

The source code for our solution is here.

Setup

We were a team of 3 uni friends who now are professional programmers. During the contest, we had 2 people on average working on the problem. Some minutes after the contest began, we were already on it (mind you, the contest began at 3AM local time.)

At our disposition we had two Core i5 laptops, one Core i5 desktop and one Core 2 Duo laptop.

We chose the F# programming language as our weapon. No particular reason, just the fact that F# is awesome.

Initial solution

I’m assuming you’ve read the problem description.

The idea of our solution was to run a search over candidate programs that were generated before querying the server for eval. After implementing this basic algorithm for the lightning round, we used the rest of the time to optimize it.

At first, we were generating all programs from a given set of operators that fit exactly in a specified length. We then filtered out programs that weren’t using each operator at least once (yes, one of the mistakes we made). The generation function looked like this:

let rec private emit'' = memoize emit'
and private emit' (len, (ops : Set<Op>), (vars : Set<BV.Name>)) = seq {
        match len with
        | 1 ->
            yield BV.Zero
            yield BV.One
            for v in vars do
                yield BV.Var v
        | l ->
            // tfold can only be emitted on top level
            let ops' = Set.remove TFold ops
            if l >= 5 then 
                yield! emitTFold l ops' vars (Set.contains TFold ops)
                yield! emitFold l ops' vars
            if l >= 4 then yield! emitIf l ops' vars
            if l >= 3 then yield! emitBinary l ops' vars
            if l >= 2 then yield! emitUnary l ops' vars
    }
and // ...

While our ‘player’ (the part that communicates with the server and implements the solving strategy) wasn’t ready yet, I ran the generation to store results on disk so that we can start running as soon as the 'player’ is ready.

Our solving strategy was as follows:

generation: generate all programs for a problem;
filtering:
- query the server for 256 random inputs;
- throw away all programs whose outputs don’t match with what server told us;
- repeat until we have 1 or 0 programs left, or if last filtering run didn’t remove any programs (this would mean that we have found a set of equivalent programs);
submission:
- submit a random program from our current list;
- if the server responds with a 'mismatch’ – go back to filtering phase.

At problem size 11 generation was already taking several minutes, and at size 12 we hit a brick wall – it took 30 to 50 minutes each! At this point we decided that we should start running our solver as it was 1.5 hours before the lighthing round deadline.

We managed to solve all problems up to (and including) size 10, and about a half of 11s before the time was up. It was scary to see a 'mismatch’ for the first time, but really everything went smoothly – we didn’t fail anything and I can’t remember any major mishaps we’ve had.

After hour 24 I left the solver running for problems of size 11, and ran the generator for size 12 on another computer (so that next morning we could submit some more solutions) and went to sleep.

Day 2. Optimizing generation

Sadly, the generator I left running overnight didn’t help much. It only managed to crunch through about 10 problems and then died with OutOfMemoryException. For some of our functions we used System.Collections.Generic.Dictionary-based memoization, which will throw an OOM exception if you put too many elements in it. But even not taking into account memory issues, the generator was excruciatingly slow, so we needed a better solution.

By this time we started thinking about search optimizations based on program structure analysis. For example, (and 0 expr) will always evaluate to 0 so it doesn’t matter what expr actually is: this means we can skip generating the expr subtree entirely.

With our current generation function it was hard to do during the generation process, and filtering programs out after everything is already generated wouldn’t give any generation speedup, and that’s what we currently needed.

I rewrote the generator to emit all programs given only operator order. It would then put operators in the exact order that was specified, only shuffling around constants and variables from sample to sample. This meant that we now also had to generate all possible operator order sequences, feed them to the generation function and then concatenate all results.

This simplified the code a lot, but more importantly it gave us the ability to put some program structure analysis in it. We were hoping to reduce the amount of programs generated by optimizing generation for obvious ifs, ands with 0, ors with 0xFFFFFFFFFFFFFFFF, folds with constant lambda, etc. It looked like this:

// binary
| And | Or | Xor | Plus ->
    for [rest1; rest2] in decompList 2 rest do // @TODO
        for e1 in emitForOpOrder (rest1, inFold) do
            let e2s =
                let all = emitForOpOrder (rest2, inFold)
                match op, e1 with
                // (and 0 e) -> 0
                | And, BV.ConstExpr 0UL -> Seq.firstOrNone all
                | Or, BV.ConstExpr 0xFFFFFFFFFFFFFFFFUL -> Seq.firstOrNone all
                | _ -> all
            for e2 in e2s do
                yield mapBinary op e1 e2
// ...

(As it turned out, there actually were other clever ways to analyze generated expressions. It’s a pity we didn’t came up with any of those.)

During generation, we were even trying to evaluate partial expression tree if it could be proven that it is a constant. If it was, then we could make more assumptions about the equivalence of this expression to others. For example, if we know that the condition argument to if0 is a constant expression, we can omit generation of either then or else subtree, depending on the conditional value:

// ...
| If0 ->
    // (if1 e1 e2 e3)
    for [rest1; rest2; rest3] in decompList 3 rest do
        for e1 in emitForOpOrder (rest1, inFold) do
            let e2s, e3s =
                let e2s = emitForOpOrder (rest2, inFold)
                let e3s = emitForOpOrder (rest3, inFold)
                match e1 with
                // (if0 0 e2 e3) -> e2
                | BV.ConstExpr 0UL -> e2s, Seq.firstOrNone e3s
                // (if0 1 e2 e3) -> e3
                | BV.ConstExpr x when x > 0UL -> Seq.firstOrNone e2s, e3s
                | _ -> e2s, e3s
            for e2 in e2s do
                for e3 in e3s do
                    yield BV.If0 (e1, e2, e3)

We spent most of the day on this, but we weren’t getting the speedup we hoped for.

Suddenly, after another rebuild, I noticed that the generation works 10 or even 100 times faster than before. We were happily baffled about this, but didn’t do any investigation as we were very tired. It’s actually kind of outrageous since I don’t remember doing any optimizations before rebuilding.

We settled for hypothesis that I was running a wrong binary the whole time (ugh), but really we cared less about this since with the new speed we could solve the 12s! And we did.

Then, we started trying the 13s, but this time brickwalled on memory usage. Somewhere around this time we began working on program compilation to speed up the filtering, as we figured that the slow part was the evaluation.

Day 3. Filtering optimization and desperate hacks

While trying 12s, we noticed that some of the problems took over 3 minutes (of allowed 5) to solve. Even if we could generate the required amount of programs for a 13-size problem in time and fit them all in memory, we might not make it on time while choosing the right one. So now we knew that we had to optimize our filtering stage somehow.

Many optimizations were tried out and implemented this day. Some of us worked on ways to reduce the overall amount of generated programs we get – less programs means less evaluations means faster filtering. Others tried different approaches to speed up the evaluation of a program.

We tried out three different 'compilation’ methods (one of which was the actual compilation to .NET bytecode).

At first, we tried to convert our expression tree into a nested closures: this would work the same as normal expression tree walking, but only with the walking path fixed in a tree of nested closures.

This gave us a speedup of 2x, but it still wasn’t enough – by this time, I think, we had our first failed problem because we didn’t make it on time. So then we tried to convert expression trees to RPN and evaluate those with a 'stack machine’.

That didn’t help much either – the speed was almost exactly the same. So the next thing to try would be to use F#’s quotations to construct the program, and then compile it to an actual function (.NET method). I’ve spent several hours on this, as juggling code quotations turned out to be much harder than I thought it would.

Nevertheless, this was actually one of the fun parts of the contest for me, as it is really cool to see an expression tree to be transformed in the actual F# code at runtime, and then compiled to a (probably) very efficient bytecode that should run super fast. Metaprogramming, especially practical metaprogramming – is fascinating.

The final code for compilation looked like this:

let compile (expr: BV.Expr) : (uint64 -> uint64) = 
    let st, derefSt = variable<uint64 []> "st"
    let rec compile' (expr : BV.Expr) =
        match expr with
        | Zero -> <@@ 0UL @@> | One -> <@@ 1UL @@>
        | Var "x" -> <@@ (%%derefSt : uint64 []).[0] @@>
        | Var "v_1" -> <@@ (%%derefSt : uint64 []).[1] @@>
        | Var "v_2" -> <@@ (%%derefSt : uint64 []).[2] @@>
        | Op1 (op, exp) ->
            let sub = compile' exp
            match op with
            | Not -> <@@ ~~~(%%sub : uint64) @@>
            | Shl1 -> <@@ (%%sub : uint64) <<< 1 @@>
            | Shr1 -> <@@ (%%sub : uint64) >>> 1 @@>
            | Shr4 -> <@@ (%%sub : uint64) >>> 4 @@>
            | Shr16 -> <@@ (%%sub : uint64) >>> 16 @@>
        | Op2 (op, expr1, expr2) ->
            let sub1, sub2 = compile' expr1, compile' expr2
            match op with
            | Xor -> <@@ (%%sub1 : uint64) ^^^ (%%sub2 : uint64) @@>
            | Plus -> <@@ (%%sub1 : uint64) + (%%sub2 : uint64) @@>
            | And -> <@@ (%%sub1 : uint64) &&& (%%sub2 : uint64) @@>
            | Or -> <@@ (%%sub1 : uint64) ||| (%%sub2 : uint64) @@>
        | If0 (cond, then', else') ->
            let c, t, e = compile' cond, compile' then', compile' else'
            <@@ if (%%c : uint64) = 0UL then (%%t : uint64) else (%%e : uint64) @@>
        // ...

    let inner = compile' expr
    let l = Quotations.Expr.Lambda(st, inner)
    let compiled = l.CompileUntyped()() :?> uint64 [] -> uint64
    (fun x -> compiled [|x; 0UL; 0UL|])

We had to use untyped quotations, couldn’t get it to work with typed. We also had to put all the type annotations manually, as we were getting invalid cast exceptions at runtime – I guess the type inference didn’t work here as one would expect. We used FSharp.PowerPack for quotation compilation, and it didn’t support local mutable vars, so we had to use an array for that. What would we need mutability for? fold, of course!

// ...
| Fold (inputExpr, initExpr, _, _, lambdaExpr) ->
    let input, init = compile' inputExpr, compile' initExpr
    let lambda = compile' lambdaExpr

    <@@
        let (input : uint64), (init : uint64) = %%input, %%init

        (%%derefSt : uint64 []).[2] <- init
        (%%derefSt : uint64 []).[1] <- (input >>> 0) &&& 0xffUL
        (%%derefSt : uint64 []).[2] <- (%%lambda : uint64)
        (%%derefSt : uint64 []).[1] <- (input >>> 8) &&& 0xffUL
        (%%derefSt : uint64 []).[2] <- (%%lambda : uint64)
        (%%derefSt : uint64 []).[1] <- (input >>> 16) &&& 0xffUL
        (%%derefSt : uint64 []).[2] <- (%%lambda : uint64)
        (%%derefSt : uint64 []).[1] <- (input >>> 24) &&& 0xffUL
        (%%derefSt : uint64 []).[2] <- (%%lambda : uint64)
        (%%derefSt : uint64 []).[1] <- (input >>> 32) &&& 0xffUL
        (%%derefSt : uint64 []).[2] <- (%%lambda : uint64)
        (%%derefSt : uint64 []).[1] <- (input >>> 40) &&& 0xffUL
        (%%derefSt : uint64 []).[2] <- (%%lambda : uint64)
        (%%derefSt : uint64 []).[1] <- (input >>> 48) &&& 0xffUL
        (%%derefSt : uint64 []).[2] <- (%%lambda : uint64)
        (%%derefSt : uint64 []).[1] <- (input >>> 56) &&& 0xffUL
        (%%derefSt : uint64 []).[2] <- (%%lambda : uint64)

        (%%derefSt : uint64 []).[2]
    @@>

As you can see, the loop is fully unrolled, hardcore-ugly style. The PowerPack quotation compiler didn’t support for loops, and Array.fold was too slow.

Miserably enough, this compilation turned out to be just as slow as the two previous attempts. It was very fast for expressions without fold, though. We abandoned this idea and moved on.

Thankfully, after staring at the evaluation code to find any other miniscule optimizations I could do, I noticed that the variable environment that is threaded around during recursive expression walking was designed to accept any number of variables and to adhere to lexical scope rules. But a \BV program can have 3 variables at most.

So this was done – I removed the actual Map<_, _> that was containing the evaluation context, and replaced it with a 3-tuple: (x, v_1, v_2). It speeded up the evaluation! But, only to the speed we had with compiled expression trees. So, not a great success.

Finale

We ran out of ideas and were running out of time. Only then we realized, that the program doesn’t have to be of exact length, and doesn’t have to contain each operator at least once.

So with this knowledge, our next desperate attempt would be to try and solve big problems with small programs. At first I tried to run our solver on about 20 of those manually, and most of the time it succeeded. So, we decided we should automate this and send it to battle. I should say, that by this point a 'failed’ problem was a normal thing, since we started feeling we didn’t have anything to lose.

I implemented a strategy like this for our solver:

from the list of all problems, choose the easiest unsolved;
try to solve it with a program of size 8;
if that didn’t work – try bigger size, up to 14 for problems without fold and up to 16 for problems with fold;
repeat until every problem is either solved or failed.

The 'easiness’ of a problem was determined, aside from it’s size, by the amount of operators it had, and by which operators. For example, fold problems were easier to solve, as there was a lot less possible combinations, because fold takes up a lot of program space.

I left this running on two computers and went to sleep. In the morning I checked the status, and saw a jolly good +100 points.

reverse microwave

M-type vs ICFPC 2013

Links:

Setup

Initial solution

Day 2. Optimizing generation

Day 3. Filtering optimization and desperate hacks

Finale