Precise Symbolic State Merging

Why we need nondeterministic value summaries for state merging, and a time dimension to reconstruct execution traces

Last updated on Dec 21, 2020 9 min read

Summary

State merging is a standard technique for mitigating path explosion in symbolic execution. Nondeterministic value summaries are guarded sets of values with overlapping guards. Without those, we cannot perform fully precise state merging in all situations (do-while loops, unstructured code, nondeterministic programs). Reconstructing symbolic execution traces from states merged with nondeterministic value summaries requires some notion of time or origin.

Table of Contents

Symbolic Execution and Path Explosion

Testing and debugging are techniques appreciated by any reasonable programmer. They help to rule out major problems in an implementation and require little specialized knowledge. After the project runs for most inputs and passes the test suite, those techniques, however, reach their limits, as they can only ever show the presence, but not the absence of bugs; in other words: It’s likely you miss some border case that can make your program crash or more generally fail.

Automatic test case generation and, quite prominently, fuzzing techniques help to cover many missed inputs without user assistance. This is still testing, though. In general, it’s still possible (and for complex programs quite likely) that some equivalence class of inputs remains untested.

This is where static analysis enters the stage. If high confidence levels into the program under test need to be established, the program has to be analyzed independently of concrete inputs. One such analysis technique is Symbolic Execution. The principles are simple: Instead of calling a method with concrete inputs like 17 and 42, call it with symbolic inputs like $a_{0}$ and $b_{0}$ . The consequences are more difficult: Most prominently, concrete upper bounds for loops and recursive calls are usually missing, so that mitigation strategies such as loop invariants and method contracts have to be used to ensure termination of the analysis. Automatic discovery of such abstractions is a hot research topic, but not the topic of this article.

Symbolic execution maintains Symbolic (Execution) States consisting of a path condition updated by case distinctions and a symbolic store updated by assignments. Starting from an empty state $(\emptyset, -)$ , an assignment statement i = 17; updates this state to $(\emptyset, i \mapsto 17)$ . A subsequent if statement

if (k>0) {
  i=2*k; 
} else { 
  i=42;
}

induces a case distinction (since the truth value of the symbolic guard is unknown at this point) and produces two output states: $({k > 0}, i \mapsto 17)$ and $({k \leq 0}, i \mapsto 17)$ . Executing the respective branches updates those states to $({k > 0}, i \mapsto 2 \cdot k)$ and $({k \leq 0}, i \mapsto 42)$ .

Stopping Explosion: Symbolic State Merging

From there on, these branches are traditionally followed independently, forming a tree structure. Any code after the if statement is thus executed twice, which leads to an exponential blow-up¹. A well-known approach to avoid this explosion is Symbolic State Merging² ³, which brings back together branches after a split. The resulting data structure resembles the CFG of the executed code although usually, cycles are broken up by introducing abstraction (we get a directed acyclic graph).

For example, after the if statement above, we know that i attains the value $2 \cdot k$ if k is strictly positive and the value 42 otherwise. Using a conditional expression³ ⁴ we can directly express this as $(\emptyset, i \mapsto if (k > 0) then (2 \cdot k) else (42))$ . From this state on, we can continue executing any statements that come after the conditional, avoiding executing them twice.

Using conditional expressions for state merging is an example of fully precise state merging. Alternatively, one could use data abstractions as in Abstract Interpretation⁵. For example, since k is positive in the then branch, we know that the final value of i after the if statement is also positive. A possible, abstracted symbolic state would thus be $({c > 0}, i \mapsto c)$ for some fresh constant $c$ .

The Limits of Conditionals

The strength of symbolic execution is its high precision. Ideally, one would only introduce abstractions if really necessary (as for loop invariants). Furthermore, merging with conditional expressions works fully automatically, while abstract domains for data abstractions have to be chosen upfront.

There is however one limitation to using conditional expressions. Consider the following code with a loop:

do {
  i--;
} while (i > 0)

Assume that we want to repeatedly execute the loop symbolically, maybe with the aim to discover an invariant. When first entering the do loop, we’re in the symbolic state $(\emptyset, i \mapsto i_{0})$ (where $i_{0}$ is the initial value of i). After one iteration, the resulting state is $({i_{0} > 0}, i \mapsto i_{0} - 1)$ .

Can we merge those states using conditionals?

An empty path condition is equivalent to truth (the state is reached by all paths), so the result when merging the first and the second states is

$({true \lor i_{0} > 0}, i \mapsto if (t r u e) then (i_{0}) else (i_{0} - 1)) .$

This is, though, equivalent to $(\emptyset, i \mapsto i_{0})$ ! If we merge the states the “other way round”, we obtain a different result:

$({true \lor i_{0} > 0}, i \mapsto if (i_{0} > 0) then (i_{0} - 1) else (i_{0})) .$

This cannot be what we want, and does of course not meet soundness criteria of symbolic execution (“exhaustiveness” and “precision”⁴). Sound state merging is always symmetric (and associative). Indeed, the merge method using conditional expressions may only be applied if the input states are “distinguishable”⁴, i.e., if only ever one path condition can hold true at the same time.

Design Options for Precise Merging

Of course, we could resort to abstraction—or use a different kind of summation operator. Value Summaries have been proposed by Sen et al.² for precise symbolic state merging. A value summary is a guarded collection of values: $i \mapsto {(k > 0 | 2 \cdot k), (k \leq 0 | 42)}$ . The way Sen et al. use them suggests that they are basically thinking of conditionals, since path conditions are distinguishable as in the example.

This restriction, however, is not necessary. The full strength of summaries becomes apparent when relaxing them to Nondeterministic Value Summaries: If more than one guard is satisfiable by an initial assignment, the summary can attain either of the guarded values. In the extreme case, nondeterministic value summaries are sets: ${(t r u e | 17), (t r u e | 42)}$ is equivalent to ${17, 42}$ .

Applied to our loop example from above, we can summarize the initial state and the state after one iteration to

$({true \lor i_{0} > 0}, i \mapsto {(i_{0} > 0 | i_{0} - 1), (true | i_{0})})$

or equivalently

$(\emptyset, i \mapsto {(i_{0} > 0 | i_{0} - 1), i_{0}})$

using a nondeterministic value summary.

This strategy is symmetric and sound. It can be applied without restrictions, and is yet equivalent to merging with conditionals when path conditions are distinguishable. Furthermore, repeated merging yields “flatter” terms which are nicer to look at.

It is not strictly required to add value summaries as a new data type for right-hand sides of symbolic stores; alternatively, they could be encoded using fresh constants constrained by path conditions⁴, similarly to how we proceeded for data abstraction above. The downsides of this approach are, among others, the introduction of the constants, which can be troublesome for some constraint solvers, and the fact that path conditions no longer only consist of decisions taken during execution, but also of “artificial” constraints on the store.

The Problem with the Traces and the Time Dimension

Yet, how can I call this nondeterministic summary “fully precise”? Indeed, when regarding the resulting state in isolation, we do not know if i attains its original value or that value decreased by one, even if we know that the initial value is strictly positive. The notion “precise” stems from the fact that exactly all concrete states represented by the merged inputs are still represented by the output with the summary. So, the “problem” is a different one: If we’re in the (symbolic) initial state where $i_{0}$ could be either strictly positive or smaller or equal than 0, we cannot retrieve the exact value of i from the summary. In other words, we cannot reconstruct the (symbolic) execution trace from the summarized state.

By adding a “time dimension”, we can rescue this capability. Assume that a variable t tracks the number of times a statement has been the execution target. Adding this to the guards of the value summary yields

$({t ≐ 0 \lor (t ≐ 1 \land i_{0} > 0)}, i \mapsto {(t ≐ 1 | i_{0} - 1), (t ≐ 0 | i_{0})})$

The resulting states are distinguishable again, such that we even could use a conditional for merging. To reconstruct symbolic traces when using state merging with nondeterministic value summaries, we—generally—need some (abstract) notion of time. Usually, a properly used loop counter suffices.

Implementations of Value Summaries

Value summaries (used deterministically) have been proposed² and implemented⁶ by Sen et al. The implementation does not seem to be maintained any more. More recently, value summaries are used for fully precise, automatic symbolic state merging by the pluggabl tool⁷. pluggabl is a symbolic execution engine for JVM bytecode which automatically performs state merging at all merge points, aligning the analysis result to the CFG structure.

Precise merging with conditional expressions has been implemented in the KeY tool⁸. State merging in KeY is not done automatically, but requires either point-and-click interaction with the symbolic execution tree or merge annotations in source code.

Conclusion

Value summaries are guarded sets of values. In deterministic value summaries, all constraints are mutually exclusive (“distinguishable”); nondeterministic summaries allow for overlapping guards. They are needed in certain situations, like do-while loops, where path conditions of input states are not necessarily distinguishable. This is also useful for unstructured programming languages or nondeterministic code. To be able to perform fully precise symbolic state merging in all situations, nondeterministic value summaries are indispensable, be it as special data types or encoded into path conditions. Their nondeterministic nature only becomes a problem if we want to be able to reconstruct execution traces from merged states. Then, we need a notion of time (for deterministic programs) or some general notion of origin (for nondeterministic programs with several entry points).

Cristian Cadar, Koushik Sen (2013): Symbolic Execution for Software Testing: Three Decades Later. Commun. ACM 56(2), 82–90 ↩︎
Koushik Sen, George C. Necula, Liang Gong, Wontae Choi (2015): MultiSE: Multi-Path Symbolic Execution Using Value Summaries. ESEC/SIGSOFT FSE: 842-853 ↩︎
Dominic Scheurer, Reiner Hähnle, Richard Bubel (2016). A General Lattice Model for Merging Symbolic Execution Branches. Proc. 18th Intern. Conf. on Formal Engineering Methods (ICFEM).
PDF Cite DOI

↩︎
Dominic Steinhöfel (2020). Ever Change a Running System: Structured Software Reengineering Using Automatically Proven-Correct Transformation Rules. Ernst Denert Award for Software Engineering 2020: Practice Meets Foundations.
PDF Cite DOI URL

↩︎
Patrick Cousot, Radhia Cousot (1977): Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. POPL 1977: 238-252 ↩︎
https://github.com/SRA-SiliconValley/jalangi ↩︎
https://github.com/rindPHI/pluggabl ↩︎
https://www.key-project.org/2017/05/03/new-feature-state-merging-in-key/ ↩︎