Conundrum: Failure Opacity

Capability connections are, by their nature, specific-use security: the messaging layer is abstracted away. That makes it hard to debug and optimize both the architecture and the applications built on it.

How should we handle failure in general, and catastrophic failures (random network partitions and crashes) in particular? Clearly, core dumps, stack traces, and other traditional tools leak out-of-scope information.

Your last sentence is clearly true for most system designs, capability-oriented or not. Dealing with network failures and vulnerabilities was one of the considerations in the design of Client Utility (CU).

In the CU design, all network communication was handled by ordinary processes, each with its own clist. The only special privilege such a process had was the ability to talk on some communication channel. The main reason for this choice was security: the damage an attacker could do by completely taking over the process was limited to what its clist authorized.

While debugging did leak out-of-scope information, that leak was limited. In particular, debugging gave you no privileges that the process did not already have. You could still do things that the application designer may not have wanted done, but the overall system was protected. Of course, you were still dependent on the OS and network drivers, but I think that problem is inescapable.

Alan, could you supply any public links to CU’s design documentation/whitepapers/etc?

The overview document has more detail than you would ever care to learn. I’ve listed some of the key points below. There’s also a demo that runs on this architecture. (The movable desktop is an independent piece.) The prototype was written in 1996-7, which was before enterprises supported open source.

The top of page 27 (28 in your PDF reader) shows the basic request flow within a “logical machine.” An application selects an entry from its clist (Name Space) as the target of a request and sends the request to the runtime (Core). The Core looks up the destination (Handler) in the Repository and forwards the request. The request can also name other clist entries, which are then delegated to the handler.
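
To make that flow concrete, here is a minimal Python sketch. It is an illustration only, not CU code: the `Core` class, its method names, and the `res:` identifiers are all hypothetical, and it assumes a delegated entry shows up in the handler's clist under the same local name the sender used.

```python
class Core:
    """Hypothetical runtime: it alone holds the Repository and every clist."""

    def __init__(self):
        self.repository = {}  # resource id -> (owning app, handler function)
        self.clists = {}      # app -> {local name -> resource id}

    def register(self, owner, resource_id, handler):
        self.repository[resource_id] = (owner, handler)

    def grant(self, app, name, resource_id):
        self.clists.setdefault(app, {})[name] = resource_id

    def send(self, app, target, operation, delegate=()):
        clist = self.clists.get(app, {})
        # An app can only name what is already in its own clist.
        for name in (target, *delegate):
            if name not in clist:
                raise PermissionError(f"{app} holds no clist entry named {name!r}")
        # Look up the destination handler in the Repository.
        owner, handler = self.repository[clist[target]]
        # Entries named in the request are delegated to the handler's owner.
        for name in delegate:
            self.grant(owner, name, clist[name])
        return handler(operation)


core = Core()
core.register("printer-svc", "res:printer", lambda op: f"printer handled {op}")
core.register("file-svc", "res:report", lambda op: f"report handled {op}")
core.grant("editor", "printer", "res:printer")
core.grant("editor", "report", "res:report")

# The editor asks the printer to print and delegates its "report" entry.
result = core.send("editor", "printer", "print", delegate=["report"])
```

The application never sees a resource id or a handler; it only ever supplies names from its own clist, which is what keeps a compromised process confined to what its clist authorizes.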

The picture at the top of page 92 (93 in your PDF reader) shows the essence of the multi-machine interaction. A unique(?) feature of CU is that the Core doesn’t know about any other machines, only the handler does. The result is that you can add a new communication protocol simply by deploying an application that supports it.

Section 9.2, starting on page 96, describes failure handling.

CU has some interesting features, a few of which are actually good :grinning:.

Naming was path dependent, so EQ was not possible for references received on different paths. Our enterprise customers were very worried about that, but it turned out not to be a problem for their use cases. In fairness, this architecture was only used for some 18 months by a half dozen customers, so I can’t say that EQ is never needed.

Delegation to a client on another machine was by proxy by default. You had to ask explicitly to shorten delegation chains. (E, on the other hand, with its sturdy refs, shortens by default.) That choice was important back then because our boxes could handle only a modest number of connections.
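
A toy, single-process Python sketch of the difference between proxied and shortened delegation follows. All names here (`Resource`, `Proxy`, `delegate`, `shorten`) are hypothetical, and the hop counter merely stands in for machine-to-machine forwarding; a real shortening step would need the cooperation of the machines along the chain.

```python
class Resource:
    def __init__(self, name):
        self.name = name

    def invoke(self, op, hops=0):
        # Report how many forwarding hops the request took to get here.
        return f"{self.name}:{op}", hops


class Proxy:
    """Stands in for a reference that was delegated across a machine boundary."""

    def __init__(self, target):
        self.target = target

    def invoke(self, op, hops=0):
        return self.target.invoke(op, hops + 1)


def delegate(ref):
    # Default behavior: the recipient talks through the delegator.
    return Proxy(ref)


def shorten(ref):
    # Explicit request: collapse the chain to a direct reference.
    while isinstance(ref, Proxy):
        ref = ref.target
    return ref


db = Resource("db")
via_two_machines = delegate(delegate(db))
proxied = via_two_machines.invoke("read")
direct = shorten(via_two_machines).invoke("read")
```

With proxying, the recipient reuses the connection it already has to the delegator; shortening trades extra hops for an additional direct connection.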

We had a resource discovery mechanism based on name-value pairs that we called Vocabularies. The interesting thing is that vocabularies were themselves resources, access controlled by capabilities. You couldn’t even find out that something existed if you didn’t have a capability to a vocabulary it was advertised in.
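
As a rough illustration of capability-gated discovery, here is a small Python sketch. The `Vocabulary` class, the `discover` helper, and the sample attributes are all made up for the example; the only idea taken from CU is that a lookup can search just those vocabularies the caller holds a capability to.

```python
class Vocabulary:
    """Hypothetical vocabulary: advertisements are name-value pairs."""

    def __init__(self, name):
        self.name = name
        self._ads = []  # list of (attributes, resource)

    def advertise(self, resource, **attributes):
        self._ads.append((attributes, resource))

    def lookup(self, **query):
        # A resource matches when every queried name has the queried value.
        return [
            resource
            for attributes, resource in self._ads
            if all(attributes.get(k) == v for k, v in query.items())
        ]


def discover(clist, **query):
    """Search only the vocabularies reachable from this clist."""
    found = []
    for capability in clist.values():
        if isinstance(capability, Vocabulary):
            found.extend(capability.lookup(**query))
    return found


office = Vocabulary("office")
office.advertise("color-printer-3F", kind="printer", color=True)
public = Vocabulary("public")
public.advertise("lobby-printer", kind="printer", color=False)

employee_clist = {"office": office, "public": public}
guest_clist = {"public": public}

guest_view = discover(guest_clist, kind="printer")
employee_view = discover(employee_clist, kind="printer")
```

The guest cannot even learn that `color-printer-3F` exists, because there is no global registry to query; discovery starts from the clist.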

CU supported negative permissions. My worry was that an administrator would accidentally give a powerful capability to a guest user. The mechanism I used was based on split capabilities, one of the questionable features in the design. However, I think the mechanism can be implemented without them.

The question of how to surface distributed failures remains hard. Closely related problems occur when a vat is upgraded while it has active connections or messages in flight with counterparties. Consider these problems together. Both are kinds of boundary trauma that must be made visible to higher layers. They cannot be masked, though they can be confusingly swept under the rug until it hurts.
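
One possible shape for "made visible to higher layers" is sketched below in Python. This is an assumption-laden toy, not how any particular vat or CU works: `RemoteRef`, `BoundaryFailure`, the `transport` callable, and the epoch number used to detect an upgrade are all hypothetical. The idea shown is that a partition or a counterparty upgrade breaks the reference permanently and reports only a coarse failure kind, rather than retrying silently or passing along a stack trace.

```python
class BoundaryFailure(Exception):
    """Coarse, in-scope description of a broken boundary."""

    def __init__(self, kind):
        super().__init__(f"reference broken: {kind}")
        self.kind = kind


class RemoteRef:
    def __init__(self, transport, epoch):
        self._transport = transport  # callable: op -> (peer epoch, value)
        self._epoch = epoch          # peer incarnation seen at connect time
        self.broken = None

    def call(self, op):
        if self.broken is not None:
            raise self.broken        # failure is sticky, never masked
        try:
            epoch, value = self._transport(op)
        except ConnectionError:
            self.broken = BoundaryFailure("partition")
            raise self.broken from None  # drop the low-level details
        if epoch != self._epoch:
            self.broken = BoundaryFailure("upgrade")
            raise self.broken
        return value


peer = {"epoch": 1, "up": True}


def transport(op):
    if not peer["up"]:
        raise ConnectionError("link down")
    return peer["epoch"], f"ok:{op}"


ref = RemoteRef(transport, epoch=1)
first = ref.call("ping")

peer["epoch"] = 2  # the counterparty was upgraded underneath us
try:
    ref.call("ping")
    seen = None
except BoundaryFailure as failure:
    seen = failure.kind
```

Because the break is sticky, the higher layer has to decide explicitly how to reconnect and what to do about messages that were in flight; the reference will not quietly resume.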

This will be front and center when we hire our next team member. It’s important for so many reasons, including [network] debugging…
