CodeSonar Python API

The CodeSonar API primarily provides access to the intermediate representation (IR) of a code base. The Python API is a safe and convenient interface to that IR.

A short guided tour is provided below (Exploring the Intermediate Representation).

Annotated example CodeSonar plug-ins in Python and the other supported API languages, including step-by-step instructions for installing and testing, are provided in the Plug-In Tutorial and AST Tutorial.

Intermediate Representation Quick Reference

Type            Description
project         an entire project
warning         warning instance
warningclass    warning class
compunit        compilation unit source file
sfile           source or header file
sfileinst       file instance (include tree node)
procedure       procedure
symbol          variable or procedure
point           program point (statement, roughly)
ast             abstract syntax tree
Visitors        visitors over the project’s internal representation
Metrics         metric function decorators
rpchandler()    remote procedure call handler decorator

Exploring the Intermediate Representation

CodeSonar ships with a plug-in that launches an interactive Python console once the analysis has started.

Important

The Python API is only available within a CodeSonar analysis process. In particular, it is not possible to run a normal Python binary and import the cs module.

Setting Up

  1. Save the following to a file named foo.c.

    /* foo.c */
    int foo(void)
    {
       int i = 1;
       int j = i + 1;
       int k = i + j;
       return k;
    }
    
  2. Run the following command to analyze foo.c (where hub:port is your hub location).

    codesonar analyze foo -preset python_debug_console -foreground hub:port gcc -c foo.c
    

    Running in foreground mode ensures that the analysis process is connected to the terminal from which you run the command.

    Once the analysis has reached the end of the “Linking” phase, you will be presented with a Python prompt.

  3. Try out the Python prompt.

    >>> 1 + 1
    2
    

    Note

    “Advanced” terminal features such as the arrow keys do not work in the interactive shell. We apologize for this; GNU readline’s license is not compatible with CodeSonar’s license.

  4. Import the cs module, which contains the Python API for CodeSonar.

    >>> import cs
    

    Most of the standard Python modules are available, with the exception of threading.

Intermediate Representation Tour

We will start with a tour of the intermediate representation.

The project object contains the intermediate representation of the entire project. Get the currently loaded project with project.current().

>>> p = cs.project.current()
>>> p
foo

The project is made of compilation units. Use project.compunits() to get an iterator over the compilation units (compunit) in the project.

>>> for cu in p.compunits():
...     print(cu.name())
...
System Initialization, Indirect & Undefined
/tmp/foo.c

System Initialization, Indirect & Undefined is a synthetic compilation unit that contains, among other things, a call to the function main (if one exists).

In many cases we are not interested in synthetic compilation units, or those generated for library functions. compunit.is_user() allows us to distinguish user-generated compilation units.

>>> for cu in (c for c in p.compunits() if c.is_user()):
...     print(cu)
...
/tmp/foo.c

We are interested in the root node of the include tree of the foo.c compilation unit. The root is a source file instance (sfileinst); we obtain it with compunit.get_sfileinst().

>>> sfi = cu.get_sfileinst()
>>> sfi
<cs.sfileinst /tmp/foo.c>

We can use sfileinst.read() to read a substring from this source file instance.

>>> sfi.read(2,0,4,0) # reading from line 2, column 0 through line 4, column 0
'int foo(void)\n{\n'

We can examine the procedures in foo.c, using compunit.procedures() to get an iterator over the procedures and procedure.name() to get the user-friendly name of each procedure.

>>> for proc in cu.procedures():
...     print(proc.name())
...
foo

In this case there is only one procedure: foo(). We can determine its location (file instance and line) with procedure.file_line().

>>> proc.file_line()
(<cs.sfileinst /tmp/foo.c>, 2)

Procedure foo() starts on line 2 of foo.c.

Procedures are composed of program points, which we can recover with procedure.points().

>>> proc.points()
<cs.point_set {<cs.point [entry] foo>, <cs.point [body] foo>, <cs.point [expression] i = 1>, 10 more...}>

We can convert the returned point_set into a native Python list using the list constructor:

>>> list(proc.points())
[<cs.point [entry] foo>, <cs.point [body] foo>, <cs.point [expression] i = 1>, <cs.point [expression] j = i + 1>, <cs.point [expression] k = i + j>, <cs.point [expression] foo$return = k>, <cs.point [return] foo>, <cs.point [exit] foo>, <cs.point [formal-out] >]

That’s a lot of program points for such a small function!

Every procedure starts with an entry point, which we can obtain with procedure.entry_point().

>>> proc.entry_point()
<cs.point [entry] foo>

Each program point has a point_kind such as point_kind.ENTRY or point_kind.EXPRESSION. To get a point’s kind, use point.get_kind().

>>> proc.entry_point().get_kind()
<cs.point_kind entry>

Enumeration classes like point_kind contain constants for all enumeration values, and define comparison operators.

>>> proc.entry_point().get_kind() == cs.point_kind.ENTRY
True

To see all the possible point kinds (and a few extra things) we can use Python’s built-in dir function:

>>> dir(cs.point_kind)
['ACTUAL_IN', 'ACTUAL_OUT', 'AUXILIARY', 'BODY', 'CALL_SITE', 'CONTROL_POINT', 'DECLARATION', 'ENTRY', 'EXCPT_EXIT', 'EXCPT_RETURN', 'EXIT', 'EXPRESSION', 'FORMAL_IN', 'FORMAL_OUT', 'GLOBAL_ACTUAL_IN', 'GLOBAL_ACTUAL_OUT', 'GLOBAL_FORMAL_IN', 'GLOBAL_FORMAL_OUT', 'HAMMOCK_EXIT', 'HAMMOCK_HEADER', 'INDIRECT_CALL', 'JUMP', 'LABEL', 'NORMAL_EXIT', 'NORMAL_RETURN', 'PHI', 'PI', 'RESERVED_000', 'RESERVED_002', 'RESERVED_003', 'RESERVED_004', 'RETURN', 'SWITCH_CASE', 'UNAVAILABLE', 'VARIABLE_INITIALIZATION', '__class__', '__cmp__', '__del__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattr__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__swig_destroy__', '__swig_getmethods__', '__swig_setmethods__', '__weakref__', 'name', 'participates_in_cfg']

The uppercase attributes are enumeration values. Enumeration classes typically have a small number of methods. The methods for point_kind include point_kind.__str__() and point_kind.participates_in_cfg().

>>> str(cs.point_kind.CALL_SITE)
'call-site'
>>> cs.point_kind.CALL_SITE.participates_in_cfg()
True
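As a small illustration of working with point kinds, the sketch below counts points per kind using the kind's string form. Because the cs module cannot be imported outside a CodeSonar analysis process, the block uses a hypothetical stand-in class (_FakePoint) that models only get_kind(); the helper name kind_histogram is also ours and is not part of the API.

```python
from collections import Counter


def kind_histogram(points):
    """Count points per kind, keyed by str() of each point's kind."""
    return Counter(str(pt.get_kind()) for pt in points)


class _FakePoint(object):
    """Hypothetical stand-in for cs.point; only get_kind() is modelled."""

    def __init__(self, kind):
        self._kind = kind

    def get_kind(self):
        # A plain string here; the real API returns a point_kind object.
        return self._kind


fake_points = [_FakePoint(k) for k in
               ("entry", "body", "expression", "expression", "exit")]
hist = kind_histogram(fake_points)
print(hist["expression"])
```

In the debug console, kind_histogram(proc.points()) should behave the same way, assuming the point_set can be iterated directly, as the list(proc.points()) call earlier suggests.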

Back to the entry point from before. We can get the points that come next in the control flow ordering with point.cfg_targets().

>>> list(proc.entry_point().cfg_targets())
[(<cs.point [expression] i = 1>, <cs.edge_label T>)]

This should look familiar. i = 1 was the first statement in procedure foo(). The edge label T is short for “true”. Program points that have only a single successor will use the T edge label.
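Since point.cfg_targets() yields (successor, edge label) pairs, a whole control flow graph can be explored with an ordinary breadth-first search. The following is a minimal sketch, not CodeSonar code: _FakePoint is a hypothetical stand-in that models only cfg_targets(), the "F" edge label is a placeholder, and the helper assumes that points are hashable.

```python
from collections import deque


def reachable_points(entry):
    """Return the points reachable from entry, in breadth-first order.

    Follows cfg_targets(), which yields (successor, edge_label) pairs.
    Assumption: point objects are hashable.
    """
    seen = {entry}
    order = [entry]
    queue = deque([entry])
    while queue:
        pt = queue.popleft()
        for succ, _label in pt.cfg_targets():
            if succ not in seen:
                seen.add(succ)
                order.append(succ)
                queue.append(succ)
    return order


class _FakePoint(object):
    """Hypothetical stand-in for cs.point; only cfg_targets() is modelled."""

    def __init__(self, text):
        self.text = text
        self.targets = []  # list of (point, label) pairs

    def cfg_targets(self):
        return iter(self.targets)


# A diamond-shaped graph: one branch point, two arms, one join.
cond, arm_a, arm_b, join = (_FakePoint(t)
                            for t in ("if (x)", "y = 1", "y = 2", "exit"))
cond.targets = [(arm_a, "T"), (arm_b, "F")]  # labels are placeholders
arm_a.targets = [(join, "T")]
arm_b.targets = [(join, "T")]
print([pt.text for pt in reachable_points(cond)])
```

The seen set keeps the search from revisiting the join point, and it would also stop the search from looping forever on a graph that contains a cycle.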

An easier way of getting the successor of a program point is to use the point.solitary_cfg_target() method. This method can be applied successively:

>>> pnt = proc.entry_point()
>>> while True:
...     pnt = pnt.solitary_cfg_target()
...     print(pnt)
...
[expression] i = 1
[expression] j = i + 1
[expression] k = i + j
[expression] foo$return = k
[return] foo
[exit] foo
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/ca0/dvitek/trunk/csurf/src/api/python/cs.py", line 3854, in solitary_cfg_target
    def solitary_cfg_target(self): return _cs.point_solitary_cfg_target(self)
cs.result: CS_PDG_VERTEX_HAS_ZERO_OR_MULTIPLE_SUCCESSORS

What happened here? We see above that a number of program points were dumped in order. However, the attempt to get the successor of the exit point raised an exception. Exit points have zero successors.

Most exceptions raised by the API have type result (ast_pattern.__init__() is a special case and raises ast_pattern_compilation_error exceptions). There are also a few places where standard Python exceptions such as KeyError and StopIteration are raised.

We can catch and examine this exception. This time, instead of iterating through the control flow graph (CFG) from the entry point to the exit point, we obtain the exit point directly with procedure.exit_point().

>>> try:
...     proc.exit_point().solitary_cfg_target()
... except cs.result as r:
...     print(r)
...     if r == cs.result.PDG_VERTEX_HAS_ZERO_OR_MULTIPLE_SUCCESSORS:
...         print('oops!')
...
CS_PDG_VERTEX_HAS_ZERO_OR_MULTIPLE_SUCCESSORS
oops!

Let’s go back and dig deeper into the first expression. point objects can contain multiple abstract syntax trees (ASTs, ast objects). point objects for C code may have normalized and unnormalized ASTs. Normalized ASTs use a smaller language and are therefore easier to analyze, so we request a normalized C AST associated with pnt by specifying ast_family.C_NORMALIZED as the family argument to point.get_ast().

>>> pnt = proc.entry_point().solitary_cfg_target()
>>> pnt
<cs.point [expression] i = 1>
>>> e = pnt.get_ast(cs.ast_family.C_NORMALIZED)
>>> e
<cs.ast [c:=] i = 1>

e is the root node of the ast for pnt. Every ast node has an ast_class, which we can obtain with ast.get_class().

>>> e.get_class()
<cs.ast_class c:=>
>>> e.get_class() == cs.ast_class.NC_NORMALASSIGN
True

This ast node has class cs.ast_class.NC_NORMALASSIGN: it represents a normalized C normal assignment.

Every AST has:

  • zero or more children which are part of the structure of the tree (so an AST-valued child may not be part of a cycle), and
  • zero or more attributes which are not part of the structure of the tree (so an AST-valued attribute may point anywhere in the AST).

We can get an AST’s children with ast.children().

>>> e.children()
(<cs.ast_field 1:i>, <cs.ast_field 2:1>)

Both children and attributes are represented by objects of class ast_field, which pairs an ordinal (ast_ordinal) and a value (several possible types, see ast_field_type). The ordinal serves as an identifier for the field.

AST e has two children.

  • The child with ordinal 1 corresponds to the left hand side of the assignment (i).
  • The child with ordinal 2 corresponds to the right hand side (1).

We can inspect the value in an ast_field with ast_field.value().

>>> e.children()[0].value()
<cs.ast [c:variable] i>

The array syntax provides a more efficient and convenient way to get a field using its ordinal:

>>> e[1]
<cs.ast [c:variable] i>

In addition to numeric ast ordinals, there are also symbolic ast ordinals. Both children of e have fields with symbolic ordinals.

>>> e[1].fields()
(<cs.ast_field name:i>, <cs.ast_field type:[c:integer] int>, <cs.ast_field storage-class:auto>, <cs.ast_field abs-loc:<cs.symbol i>>)

We can index an ast with a symbolic ordinal just as we can with a numeric ordinal.

>>> e[1][cs.ast_ordinal.NC_ABS_LOC]
<cs.symbol i>

This is the symbol for variable i.

ast objects are fairly complicated. The easiest way to inspect an ast is to invoke ast.dump():

>>> print(e.dump())
(c:=)-+-(c:variable)-+-name:"i"
      |              +-type:(c:integer)-+-size:4
      |              |                  +-is-const:false
      |              |                  +-is-volatile:false
      |              |                  +-is-near:false
      |              |                  +-is-far:false
      |              |                  +-is-unaligned:false
      |              |                  +-is-restrict:false
      |              |                  +-is-complete:true
      |              |                  `-integer-kind:int
      |              +-storage-class:auto
      |              `-abs-loc:i-6
      +-(c:integer-value-32)-+-value:1
      |                      +-type:(c:integer)-+-size:4
      |                      |                  +-is-const:false
      |                      |                  +-is-volatile:false
      |                      |                  +-is-near:false
      |                      |                  +-is-far:false
      |                      |                  +-is-unaligned:false
      |                      |                  +-is-restrict:false
      |                      |                  +-is-complete:true
      |                      |                  `-integer-kind:int
      |                      `-is-decimal-literal:true
      `-type:(c:integer)-+-size:4
                         +-is-const:false
                         +-is-volatile:false
                         +-is-near:false
                         +-is-far:false
                         +-is-unaligned:false
                         +-is-restrict:false
                         +-is-complete:true
                         `-integer-kind:int

Let’s look some more at the symbol for variable i. Symbols represent variables and procedures.

>>> sym = e[1][cs.ast_ordinal.NC_ABS_LOC]

This is a local variable (check with symbol.is_local()), so we can get its containing procedure with symbol.get_procedure().

>>> sym.is_local()
True
>>> sym.get_procedure()
<cs.procedure foo>

Determining the definition location of a symbol is similar to determining the definition location of a procedure: use symbol.file_line().

>>> sym.file_line()
(<cs.sfileinst /tmp/foo.c>, 4)

Let’s look for points that use symbol i.

Cross-referencing queries with project.token_search() search at a more lexical level, and can find many things besides variables (for example, uses of macros or types). They are efficient on very large code bases. xr_query objects can restrict search results along several dimensions. Here, we set up an xr_query object and use xr_query.add_term_filter() to restrict the search to tokens that match ‘i’. The token search returns an xr_query_iterator with one xr_tuple element for each occurrence of the token; we can iterate over these to print the file name (xr_tuple.get_file(), sfile.name()) and line (xr_tuple.get_line()) for each occurrence.

>>> q = cs.xr_query()
>>> q.add_term_filter('i')
>>> for x in p.token_search(q):
...     print('%s:%d' % (x.get_file().name(), x.get_line()))
...
/tmp/foo.c:4
/tmp/foo.c:5
/tmp/foo.c:6
/tmp/foo.c:4

If we look back at the source code for foo.c, we can see that i does indeed occur on lines 4, 5, and 6.
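The search output above reports line 4 twice. When only the distinct locations matter, the results can be collapsed into a sorted list of unique (file name, line) pairs. The following is a minimal sketch that runs outside CodeSonar: _FakeFile and _FakeTuple are hypothetical stand-ins modelling only sfile.name(), xr_tuple.get_file() and xr_tuple.get_line(), and occurrence_lines is our own helper name.

```python
def occurrence_lines(tuples):
    """Return sorted, de-duplicated (file name, line) pairs."""
    return sorted({(x.get_file().name(), x.get_line()) for x in tuples})


class _FakeFile(object):
    """Hypothetical stand-in for cs.sfile; only name() is modelled."""

    def __init__(self, name):
        self._name = name

    def name(self):
        return self._name


class _FakeTuple(object):
    """Hypothetical stand-in for cs.xr_tuple."""

    def __init__(self, sfile, line):
        self._sfile = sfile
        self._line = line

    def get_file(self):
        return self._sfile

    def get_line(self):
        return self._line


foo_c = _FakeFile("/tmp/foo.c")
# Same line numbers as the search output above: 4, 5, 6, 4.
results = [_FakeTuple(foo_c, n) for n in (4, 5, 6, 4)]
for name, line in occurrence_lines(results):
    print("%s:%d" % (name, line))
```

In the debug console the equivalent call would presumably be occurrence_lines(p.token_search(q)), since the helper only needs to iterate over the results once.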

Type quit() to end the tour and allow the analysis to complete:

>>> quit()

This tour only touched on a small portion of the intermediate representation. More information is available in the cs Class Reference.

Side Effects

  • If a method has side effects, then the documentation says so.
  • Methods with side effects modify the object they are invoked on (self), or global state (e.g., issuing a warning).
  • Methods never modify their parameters.
