Typestate analysis, sometimes called protocol analysis, is a form of program analysis employed in programming languages. It is most commonly applied to object-oriented languages. Typestates define valid sequences of operations that can be performed upon an instance of a given type. Typestates, as the name suggests, associate state information with variables of that type. This state information is used to determine at compile-time which operations are valid to be invoked upon an instance of the type. Operations performed on an object that would usually only be executed at run-time are performed upon the type state information which is modified to be compatible with the new state of the object.
Typestates are capable of representing behavioral type refinements such as "method A must be invoked before method B is invoked, and method C may not be invoked in between". Typestates are well-suited to representing resources that use open/close semantics by enforcing semantically valid sequences such as "open then close" as opposed to invalid sequences such as leaving a file in an open state. Such resources include filesystem elements, transactions, connections and protocols. For instance, developers may want to specify that files or sockets must be opened before they are read or written, and that they can no longer be read or written if they have been closed. The name "typestate" stems from the fact that this kind of analysis often models each type of object as a finite state machine. In this state machine, each state has a well-defined set of permitted methods/messages, and method invocations may cause state transitions. Petri nets have also been proposed as a possible behavioral model for use with refinement types.[1]
Typestate analysis was introduced by Rob Strom in 1983[2] in the Network Implementation Language (NIL) developed at IBM's Watson Lab.It was formalized by Strom and Yemini in a 1986 article[3] that described how to use typestate to track the degree of initialisation of variables, guaranteeing that operations would never be applied on improperly initialised data, and further generalized in the Hermes programming language.
In recent years, various studies have developed ways of applying the typestate concept to object-oriented languages.[4] [5]
Strom and Yemini (1986) required the set of typestates for a given type to be partially ordered such that a lower typestate can be obtained from a higher one by discarding some information. For example, an int
variable in C typically has the typestates "uninitialized" < "initialized", and a FILE*
pointer may have the typestates "unallocated" < "allocated, but uninitialized" < "initialized, but file not opened" < "file opened". Moreover, Strom and Yemini require that each two typestates have a greatest lower bound, i.e. that the partial order is even a meet-semilattice; and that each order has a least element, always called "⊥".
Their analysis is based on the simplification that each variable v is assigned only one typestate for each point in the program text; if a point p is reached by two different execution paths and v inherits different typestates via each path, then the typestate of v at p is taken to be the greatest lower bound of the inherited typestates. For example, in the following C snippet, the variable n
inherits the typestate "initialized" and "uninitialized" from the then
and the (empty) else
part, respectively, hence it has typestate "uninitialized" after the whole conditional statement.
Every basic operation[6] has to be equipped with a typestate transition, i.e. for each parameter the required and ensured typestate before and after the operation, respectively. For example, an operation fwrite(...,fd)
requires fd
to have typestate "file opened". More precisely, an operation may have several outcomes, each of which needs its own typestate transition. For example, the C code FILE *fd=fopen("foo","r")
sets fd
's typestate to "file opened" and "unallocated" if opening succeeds and fails, respectively.
For each two typestates t1 <· t2, a unique typestate coercion operation needs to be provided which, when applied to an object of typestate t2, reduces its typestate to t1, possibly by releasing some resources. For example, fclose(fd)
coerces fd
's typestate from "file opened" to "initialized, but file not opened".
A program execution is called typestate-correct if
A program text is called typestate-consistent if it can be transformed, by adding appropriate typestate coercions, to a program whose points can be statically labelled with typestates such that any path allowed by the control flow is typestate-correct. Strom and Yemini give a linear-time algorithm that checks a given program text for typestate-consistency, and computes where to insert which coercion operation, if any.
In order to achieve a precise and effective typestate analysis, it is necessary to address the problem of aliasing. Aliasing occurs when an object has more than one reference or pointer that points to it. For the analysis to be correct, state changes to a given object must be reflected in all references that point to that object, but in general it is a difficult problem to track all such references. This becomes especially hard if the analysis needs to be modular, that is, applicable to each part of a large program separately without taking the rest of the program into account.
As another issue, for some programs, the method of taking the greatest lower bound at converging execution paths and adding corresponding down-coercion operations appears to be inadequate.For example, before the return 1
in the following program,[8] all components x
, y
, and z
of coord
are initialized, but Strom's and Yemini's approach fails to recognize this, since each initialization of a struct component in the loop body has to be down-coerced at loop re-entry to meet the typestate of the very first loop entry, viz. ⊥. A related problem is that this example would require overloading of typestate transitions; for example, parse_int_attr("x",&coord->x)
changes a typestate "no component initialized" to "x component initialized", but also "y component initialized" to "x and y component initialized".
There are several approaches seeking to infer typestates out of programs (or even other artifacts such as contracts). Many of them can infer typestates at compile time [9] [10] [11] [12] and others mine the models dynamically.[13] [14] [15] [16] [17] [18]
Typestate is an experimental concept that has not yet crossed over into mainstream programming languages. However, many academic projects actively investigate how to make it more useful as an everyday programming technique. Two examples are the Plaid and Obsidian languages, which are being developed by Jonathan Aldrich's group at Carnegie Mellon University.[19] [20] Other examples include the Clara[21] language research framework, earlier versions of the Rust language, and the >>
keyword in ATS.[22]
+=
in C, and standard library routines, e.g.fopen
malloc
ed memory has been free
d. In most programming languages, a variable's lifetime may end before program termination; the notion of typestate-correctness has then to be sharpened accordingly.int parse_int_attr(const char *name,int *val)
initializes *val
whenever it succeeds