Introducing: A distributed debugger for Goblins with time travel! (aka Causeway Live!)

[UPDATED with new, easier to understand vat names: Player, Monster, and Cave!]

Debugging distributed programs is difficult, but we’ve been hard at work developing tools to make it easier within the Goblins object capability system. While this work is still in its early stages, we’re excited to share the progress we’ve made so far!

The new Goblins debugger allows programmers to inspect and debug distributed computations that happen across many vats (communicating event loops.) The time travel feature allows for visiting past events and inspecting program state at the time the event happened.

These tools are implemented as “meta-commands” for Guile’s REPL, complementing the tools that Guile already provides for sequential debugging. In true Scheme fashion, this means that debugging happens live, while the program is running, allowing programmers to fix bugs and try again without having to stop and recompile/reboot their program.

Seeing is believing, so let’s walk through building a simple program and using the new debugger to find and fix a small bug. We’re going to make the world’s simplest dungeon crawling RPG. It will allow a player to attack a monster to deal damage, and that’s about it.

The first step is to start up a Guile REPL and import the Goblins modules that we need:

scheme@(guile-user)> ,use (goblins)
scheme@(guile-user)> ,use (goblins actor-lib cell)
scheme@(guile-user)> ,use (goblins actor-lib let-on)
scheme@(guile-user)> ,use (goblins actor-lib methods)

The (goblins) module provides the core Goblins API. The other three modules are part of Goblins’ handy standard library of actor abstractions. In Guile’s REPL, expressions that start with a comma invoke a meta-command. Guile provides a number of these commands, like ,use. We’ll be seeing commands specific to Goblins later.

Next, we need to create some vats where our actors can live. Since there is overhead in collecting debug data, vat logging is currently opt-in on a per-vat basis. The #:log? #t keyword argument enables logging.

scheme@(guile-user)> (define p-vat (spawn-vat #:name 'Player  #:log? #t))
scheme@(guile-user)> (define m-vat (spawn-vat #:name 'Monster #:log? #t))
scheme@(guile-user)> (define c-vat (spawn-vat #:name 'Cave    #:log? #t))

Now we can define some actors to live in our vats. To do that, we first need to define constructor procedures. We only need two kinds of actors for this demo: Characters and dungeons. Characters have a name, some amount of hit points (HP), and a strength value. They can be damaged by attacks and die when their HP reaches 0.

scheme@(guile-user)>
(define (^character _bcom name max-hp strength)
  ;; HP is reduced when the character is attacked, so we use a cell
  ;; actor to store a counter that can be modified.
  (let ((hp (spawn ^cell max-hp)))
    (methods
     ;; Getters
     ((name) name)
     ((strength) strength)
     ((hp) ($ hp))
     ;; Deal combat damage
     ((damage amount)
      ($ hp (- ($ hp) amount))
      #t) ; attacks always hit
     ;; Check if character is dead
     ((dead?)
      (zero? ($ hp))))))

Dungeons are places where a player character fights a monster character. For simplicity, this is going to be a rather one-sided battle and we’ll only implement the player’s turn. A player turn starts by dealing damage to the monster and ends by returning a log summarizing what happened. If the monster is killed, the log will say so.

scheme@(guile-user)>
(define (^dungeon _bcom player monster)
  (methods
   ((player-turn)
    (let-on ((player-name (<- player 'name))
             (monster-name (<- monster 'name))
             (strength (<- player 'strength)))
      ;; Deal damage to the monster and check to see if it's dead.
      (let*-on ((hit? (<- monster 'damage strength))
                (dead? (<- monster 'dead?)))
        (let ((damage-msg (format #f "~a deals ~a damage to ~a!"
                                  player-name strength monster-name)))
          ;; Log a message if the monster is dead.
          (if dead?
              (let ((dead-msg (format #f "~a has been defeated!"
                                      monster-name)))
                (list damage-msg dead-msg))
              (list damage-msg))))))))

To show off the “distributed” part of “distributed debugger”, we’ll spawn actors across all three of the vats we defined earlier. The player character “Alice” will live in the Player vat, p-vat. The monster, a basilisk, will live in the Monster vat, m-vat. The dungeon will live in the Cave vat, c-vat. This simulates how, in a real game built using a distributed architecture, different actors in the game world would be hosted on many different computers.

scheme@(guile-user)> (define alice
                       (with-vat p-vat (spawn ^character "Alice" 100 30)))
scheme@(guile-user)> (define basilisk
                       (with-vat m-vat (spawn ^character "Basilisk" 20 15)))
scheme@(guile-user)> (define dungeon
                       (with-vat c-vat (spawn ^dungeon alice basilisk)))

With our actors spawned, we are ready to play this trivial game. But first, we’ll use a Goblins-specific REPL command to enter a new context for evaluating code:

scheme@(guile-user)> ,enter-vat c-vat
Entering vat 'Cave'.  Type ',q' to exit.  Type ',help goblins' for help.
goblins/Cave@(guile-user) [1]> 

Note the change in the REPL prompt to indicate that we are now in the Goblins vat named “Cave” and we are in a nested REPL session that is one level deep. The ,enter-vat command is provided by Goblins to allow code (and other Goblins REPL commands) to be evaluated within the context of a specific vat. This makes Goblins operators (such as spawn, $, and <-) “just work” since they require a vat in order to do their job.

We’re now ready to have Alice take a turn. Alice will attack the basilisk, dealing 30 damage. The basilisk only has 20 HP, so it should die and we should see a log message saying so. Getting a result requires communicating with the Player and Monster vats, so we’ll use the the ,vat-resolve command to wait for the asynchronous operation to complete and print out the results:

goblins/Cave@(guile-user) [1]> ,vat-resolve ($ dungeon 'player-turn)
("Alice deals 30 damage to Basilisk!")

Hmm, that’s not right! The log says that Alice dealt 30 damage, but it should also say that the basilisk was defeated. Let’s debug and see if we can spot the error. For starters, we can run ,vat-tail to see the most recent messages that the Cave vat has processed:

goblins/Cave@(guile-user) [1]> ,vat-tail
Churn 7:
  36: (receive listen #<local-promise>)
Churn 8:
  37: (receive message #<local-object ^resolver> fulfill #f)
  38: (receive message #<local-object ^on-listener> fulfill #f)
  39: (receive message #<local-object fulfilled-handler> #f)
  40: (receive message #<local-object ^resolver> fulfill ("Alice deals 30 damage to Basilisk!"))
  41: (receive message #<local-object ^resolver> fulfill ("Alice deals 30 damage to Basilisk!"))
  42: (receive message #<local-object ^resolver> fulfill ("Alice deals 30 damage to Basilisk!"))
  43: (receive message #<local-object ^on-listener> fulfill ("Alice deals 30 damage to Basilisk!"))
  44: (receive message #<local-object fulfilled-handler> ("Alice deals 30 damage to Basilisk!"))
  45: (receive message #<local-object finally-handler>)

The events have been printed in chronological order. There is some useful information in here, but only events in the Cave vat are shown and there’s nothing that describes how these events relate to one another. However, we now know that event 45 is the most recent. This event represents the terminal point of the ,vat-resolve command we evaluated above.

To view the events related to another event, across all relevant vats, we will use the ,vat-tree and ,vat-graph commands. ,vat-tree outputs a textual representation:

goblins/Cave@(guile-user) [1]> ,vat-tree 45
Vat Cave, 2: (message #<local-object ^call-with-vat>)
├▸ Vat Player, 4: (message #<local-object ^character> name)
│  └▸ Vat Cave, 10: (message #<local-object ^resolver> fulfill "Alice")
├▸ Vat Monster, 5: (message #<local-object ^character> name)
│  └▸ Vat Cave, 16: (message #<local-object ^resolver> fulfill "Basilisk")
│     └▸ Vat Cave, 19: (message #<local-object ^resolver> fulfill ("Alice" "Basilisk" 30))
│        └▸ Vat Monster, 23: (message #<local-object ^character> damage 30)
│           └▸ Vat Cave, 28: (message #<local-object ^resolver> fulfill #t)
│              └▸ Vat Monster, 32: (message #<local-object ^character> dead?)
│                 └▸ Vat Cave, 37: (message #<local-object ^resolver> fulfill #f)
│                    └▸ Vat Cave, 40: (message #<local-object ^resolver> fulfill ("Alice deals 30 damage to Basilisk!"))
│                       └▸ Vat Cave, 41: (message #<local-object ^resolver> fulfill ("Alice deals 30 damage to Basilisk!"))
└▸ Vat Player, 6: (message #<local-object ^character> strength)
   └▸ Vat Cave, 13: (message #<local-object ^resolver> fulfill 30)

And ,vat-graph outputs a graphical Lamport causality diagram (thanks to Graphviz):

goblins/Cave@(guile-user) [1]> ,vat-graph 45

In each of these views, we can see that the message (<- monster 'dead?) (with the debug representation of msg ^character dead? in the ,vat-graph output) resolves to #f when we expected #t. Did you miss it? Here’s the relevant bit of the ,vat-tree output above:

│              └▸ Vat Monster, 32: (message #<local-object ^character> dead?)
│                 └▸ Vat Cave, 37: (message #<local-object ^resolver> fulfill #f)

But why did this happen? The monster should have been dead. Let’s travel back in time and find out. The first step is to truly enter the debugger so we can walk backwards in time to when the dead? check happened. To do this, we’ll use the ,vat-debug command, which takes an event ID argument and starts a new sub-REPL with all the context we need to debug the problem.

goblins/Cave@(guile-user) [1]> ,vat-debug 45
Entering a new prompt. Type `,q' to exit.
goblins/Cave@(guile-user) [2]>

Note that the [1] in the prompt became a [2] because we are in yet another sub-REPL. To get our bearings, we’ll use the ,vat-trace command to get a distributed backtrace showing the direct causes of event 45:

goblins/Cave@(guile-user) [2]> ,vat-trace
In vat Cave:
  Churn 2:
    2: (message #<local-object ^call-with-vat>)
    4: (message #<local-object ^character> name)
In vat Monster:
  Churn 2:
    6: (message #<local-object ^resolver> fulfill "Basilisk")
In vat Cave:
  Churn 5:
    17: (message #<local-object ^on-listener> fulfill "Basilisk")
    18: (message #<local-object fulfilled-handler> "Basilisk")
    19: (message #<local-object ^resolver> fulfill ("Alice" "Basilisk" 30))
    20: (message #<local-object ^on-listener> fulfill ("Alice" "Basilisk" 30))
    21: (message #<local-object fulfilled-handler> ("Alice" "Basilisk" 30))
    22: (message #<local-object ^character> damage 30)
In vat Monster:
  Churn 3:
    24: (message #<local-object ^resolver> fulfill #t)
In vat Cave:
  Churn 7:
    29: (message #<local-object ^on-listener> fulfill #t)
    30: (message #<local-object fulfilled-handler> #t)
    31: (message #<local-object ^character> dead?)
In vat Monster:
  Churn 4:
    33: (message #<local-object ^resolver> fulfill #f)
In vat Cave:
  Churn 8:
    38: (message #<local-object ^on-listener> fulfill #f)
    39: (message #<local-object fulfilled-handler> #f)
    40: (message #<local-object ^resolver> fulfill ("Alice deals 30 damage to Basilisk!"))
    41: (message #<local-object ^resolver> fulfill ("Alice deals 30 damage to Basilisk!"))
    43: (message #<local-object ^on-listener> fulfill ("Alice deals 30 damage to Basilisk!"))
    45: (message #<local-object finally-handler>)

Much like a regular backtrace in Guile’s sequential debugger, output is in order from outermost call to innermost. In other words, the most recent event is the last event printed. Our debugging session starts focused on event 45, but we can shift our focus with the ,vat-up, ,vat-down, and ,vat-jump commands.

,vat-up moves up one event, from 45 to 43 in this case:

goblins/Cave@(guile-user) [2]> ,vat-up
Now in vat Cave, event 43:
  (receive message #<local-object ^on-listener> fulfill ("Alice deals 30 damage to Basilisk!"))

,vat-down moves down one event, back to event 45 in this case:

goblins/Cave@(guile-user) [2]> ,vat-down
Now in vat Cave, event 45:
  (receive message #<local-object finally-handler>)

,vat-jump moves to a particular event ID, so we can save ourselves the tedium of typing ,vat-up or ,vat-down a bunch of times if the event we’re trying to inspect is far away from our current position. In fact, event 31 looks particularly interesting, since that’s when the dead? check happened. Let’s jump there now:

goblins/Cave@(guile-user) [2]> ,vat-jump 31
Now in vat Cave, event 31:
  (send message #<local-object ^character> dead?)

What ,vat-trace is hiding, for brevity’s sake, is that this “send” event in the Cave vat has a corresponding “receive” event in the Monster vat. Let’s step down the trace and see where we end up:

goblins/Cave@(guile-user) [2]> ,vat-down
Now in vat Monster, event 32:
  (receive message #<local-object ^character> dead?)

We’re now focused on an event in vat Monster, where the basilisk lives. This means that we can travel back in time and send messages to the basilisk as it existed at event 32. To do this, we’ll use the ,vat-peek command. Let’s see what its HP looked like at this point:

goblins/Cave@(guile-user) [2]> ,vat-peek basilisk 'hp
-10

Hmm, zero should be the lowest HP value. The dead? method is specifically checking if the character’s HP is zero. It’s a safe bet that the bug is in the damage method, which happened around event 22, so let’s confirm our suspicion. We’ll jump there, move down into an event in the Monster vat, and check out the basilisk’s HP:

goblins/Cave@(guile-user) [2]> ,vat-jump 22
Now in vat Cave, event 22:
  (send message #<local-object ^character> damage 30)
goblins/Cave@(guile-user) [2]> ,vat-down
Now in vat Monster, event 23:
  (receive message #<local-object ^character> damage 30)
goblins/Cave@(guile-user) [2]> ,vat-peek basilisk 'hp
20

Thanks to time travel, we can see that the starting HP value is 20, like it should be. The basilisk’s health was indeed initialized correctly. Moving down one more event, we can see for sure that this is the point where we entered a problematic state:

goblins/Cave@(guile-user) [2]> ,vat-down
Now in vat Monster, event 24:
  (send message #<local-object ^resolver> fulfill #t)
goblins/Cave@(guile-user) [2]> ,vat-peek basilisk 'hp
-10

A note for the curious: Goblins uses transactional heaps to manage actor state, enabling us to take snapshots of a vat before each event is processed. Querying these snapshots allows users to travel through time while debugging. We’re using very simple code for demonstration purposes, but we believe that this feature will make debugging real programs, with complex logic and state, much more feasible.

So, what we’re missing is some code to enforce a lower bound on the HP value. Let’s update our character constructor to fix that issue (exiting the nested REPLs and getting back to the top-level Guile prompt):

goblins/Cave@(guile-user) [2]> ,quit
goblins/Cave@(guile-user) [1]> ,quit
scheme@(guile-user)>
(define (^character _bcom name max-hp strength)
  ;; HP is reduced when the character is attacked, so we use a cell
  ;; actor to store a counter that can be modified.
  (let ((hp (spawn ^cell max-hp)))
    (methods
     ;; Getters
     ((name) name)
     ((strength) strength)
     ((hp) ($ hp))
     ;; Deal combat damage
     ((damage amount)
      ;; HP can't go below 0.
      ($ hp (max (- ($ hp) amount) 0))
      #t) ; attacks always hit
     ;; Check if character is dead
     ((dead?)
      (zero? ($ hp))))))

A call to max has been added to the damage method to ensure that zero is the lowest HP value. All that’s left to do now is rebuild the actors and verify the fix:

scheme@(guile-user)> (define alice
                       (with-vat p-vat (spawn ^character "Alice" 100 30)))
scheme@(guile-user)> (define basilisk
                       (with-vat m-vat (spawn ^character "Basilisk" 20 15)))
scheme@(guile-user)> (define dungeon
                       (with-vat c-vat (spawn ^dungeon alice basilisk)))
scheme@(guile-user)> ,enter-vat c-vat
Entering vat 'Cave'.  Type ',q' to exit.  Type ',help goblins' for help.
goblins/Cave@(guile-user) [1]> ,vat-resolve ($ dungeon 'player-turn)
("Alice deals 30 damage to Basilisk!" "Basilisk has been defeated!")

We see the Basilisk has been defeated message now. Problem solved!

We hope this small demo has given you a taste of the exciting things to come in the next Guile Goblins release. If you want to use these tools right now, all the code used for this demo is available in the main branch of the Goblins Git repository. While it is already quite useful, right now the debugger only supports multiple vats in a single Guile process. In the near future, we plan to add network debugging via CapTP. We’d also like to take this opportunity to give credit to E’s Causeway debugger as a significant source of inspiration and prior art that was crucial for this work.

4 Likes

We’ve uploaded an edited version of the demo presentation for Friam…

[Link updated to point to our website instead of YouTube.]