CodeSonar Python API
The CodeSonar API is primarily concerned with providing access to the intermediate representation (IR) of a code base. The Python API provides a safe and convenient interface to the intermediate representation.
A short guided tour is provided below (Exploring the Intermediate Representation).
Annotated example CodeSonar plug-ins in Python and the other supported API languages, including step-by-step instructions for installing and testing, are provided in the Plug-In Tutorial and AST Tutorial.
Intermediate Representation Quick Reference
| Type | Description |
|---|---|
| project | an entire project |
| warning | warning instance |
| warningclass | warning class |
| compunit | compilation unit source file |
| sfile | source or header file |
| sfileinst | file instance (include tree node) |
| procedure | procedure |
| symbol | variable or procedure |
| point | program point (statement, roughly) |
| ast | abstract syntax tree |
| Visitors | visitors over the project’s internal representation |
| Metrics | metric function decorators |
| rpchandler() | remote procedure call handler decorator |
Exploring the Intermediate Representation
CodeSonar ships with a plug-in that launches an interactive Python console once the analysis has started.
- It is itself implemented as a CodeSonar plug-in using the Python API. You can see the plug-in source at $CSONAR/codesonar/plugins/python_debug_console.py.
- To load the console, use the ‘python_debug_console’ preset when you build and analyze a CodeSonar project.
Important
The Python API is only available within a CodeSonar analysis process. In particular, it is not possible to run a normal Python binary and import the cs module.
Setting Up
Save the following to a file named foo.c.
/* foo.c */
int foo(void)
{
    int i = 1;
    int j = i + 1;
    int k = i + j;
    return k;
}
Run the following command to analyze foo.c (where hub:port is your hub location).
codesonar analyze foo -preset python_debug_console -foreground hub:port gcc -c foo.c
Running in foreground mode ensures that the analysis process is connected to the terminal from which you run the command.
Once the analysis has reached the end of the “Linking” phase, you will be presented with a Python prompt.
Try out the Python prompt.
>>> 1 + 1
2
Note
“Advanced” terminal features such as the arrow keys do not work in the interactive shell. We apologize for this; GNU readline’s license is not compatible with CodeSonar’s license.
Import the cs module, which contains the Python API for CodeSonar.
>>> import cs
Most of the standard Python modules are available, with the exception of threading.
Intermediate Representation Tour
We will start with a tour of the intermediate representation.
The project object contains the intermediate representation
of the entire project. Get the currently loaded project with project.current().
>>> p = cs.project.current()
>>> p
foo
The project is made of compilation units.
Use project.compunits() to get an iterator over the compilation units (compunit) in the project.
>>> for cu in p.compunits():
... print(cu.name())
...
System Initialization, Indirect & Undefined
/tmp/foo.c
System Initialization, Indirect & Undefined is a synthetic
compilation unit that contains, among other things, a call to the
function main (if one exists).
In many cases we are not interested in synthetic compilation units, or
those generated for library functions. compunit.is_user() allows
us to distinguish user-generated compilation units.
>>> for cu in (c for c in p.compunits() if c.is_user()):
... print(cu)
...
/tmp/foo.c
We are interested in the root node of the include tree of the
foo.c compilation unit. The root is a source file instance
(sfileinst); we obtain it with
compunit.get_sfileinst().
>>> sfi = cu.get_sfileinst()
>>> sfi
<cs.sfileinst /tmp/foo.c>
We can use sfileinst.read() to read a substring from this source file instance.
>>> sfi.read(2,0,4,0) # reading from line 2, column 0 through line 4, column 0
'int foo(void)\n{\n'
We can examine the procedures in foo.c, using
compunit.procedures() to get an iterator over the procedures and
procedure.name() to get the user-friendly name of each procedure.
>>> for proc in cu.procedures():
... print(proc.name())
...
foo
In this case there is only one procedure: foo(). We can determine
its location (file instance and line) with
procedure.file_line().
>>> proc.file_line()
(<cs.sfileinst /tmp/foo.c>, 2)
Procedure foo() starts on line 2 of foo.c.
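The steps so far can be combined into a small helper. The following is a minimal sketch that relies only on the calls shown above (project.compunits(), compunit.is_user(), compunit.name(), compunit.procedures(), procedure.name(), and procedure.file_line()). The function name list_user_procedures is hypothetical and is not part of the cs module.

```python
def list_user_procedures(project):
    """Return (compilation unit name, procedure name, line) triples.

    `project` is expected to behave like the object returned by
    cs.project.current(). Only user compilation units are visited.
    """
    rows = []
    for cu in project.compunits():
        if not cu.is_user():
            continue  # skip synthetic and library compilation units
        for proc in cu.procedures():
            _sfi, line = proc.file_line()  # (file instance, line number)
            rows.append((cu.name(), proc.name(), line))
    return rows
```

In the debug console, list_user_procedures(cs.project.current()) should contain ('/tmp/foo.c', 'foo', 2) for the example project.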
Procedures are composed of program points, which we can recover with procedure.points().
>>> proc.points()
<cs.point_set {<cs.point [entry] foo>, <cs.point [body] foo>, <cs.point [expression] i = 1>, 10 more...}>
We can convert the returned point_set into a native Python list using the
list constructor:
>>> list(proc.points())
[<cs.point [entry] foo>, <cs.point [body] foo>, <cs.point [expression] i = 1>, <cs.point [expression] j = i + 1>, <cs.point [expression] k = i + j>, <cs.point [expression] foo$return = k>, <cs.point [return] foo>, <cs.point [exit] foo>, <cs.point [formal-out] >]
That’s a lot of program points for such a small function!
Every procedure starts with an entry point, which we can
obtain with procedure.entry_point().
>>> proc.entry_point()
<cs.point [entry] foo>
Each program point has a point_kind such as
point_kind.ENTRY or point_kind.EXPRESSION. To get a
point’s kind, use point.get_kind().
>>> proc.entry_point().get_kind()
<cs.point_kind entry>
Enumeration classes like point_kind contain constants for all
enumeration values, and define comparison operators.
>>> proc.entry_point().get_kind() == cs.point_kind.ENTRY
True
To see all the possible point kinds (and a few extra things) we can use Python’s builtin dir function:
>>> dir(cs.point_kind)
['ACTUAL_IN', 'ACTUAL_OUT', 'AUXILIARY', 'BODY', 'CALL_SITE', 'CONTROL_POINT', 'DECLARATION', 'ENTRY', 'EXCPT_EXIT', 'EXCPT_RETURN', 'EXIT', 'EXPRESSION', 'FORMAL_IN', 'FORMAL_OUT', 'GLOBAL_ACTUAL_IN', 'GLOBAL_ACTUAL_OUT', 'GLOBAL_FORMAL_IN', 'GLOBAL_FORMAL_OUT', 'HAMMOCK_EXIT', 'HAMMOCK_HEADER', 'INDIRECT_CALL', 'JUMP', 'LABEL', 'NORMAL_EXIT', 'NORMAL_RETURN', 'PHI', 'PI', 'RESERVED_000', 'RESERVED_002', 'RESERVED_003', 'RESERVED_004', 'RETURN', 'SWITCH_CASE', 'UNAVAILABLE', 'VARIABLE_INITIALIZATION', '__class__', '__cmp__', '__del__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattr__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__swig_destroy__', '__swig_getmethods__', '__swig_setmethods__', '__weakref__', 'name', 'participates_in_cfg']
The uppercase attributes are enumeration values. Enumeration classes typically have a small number of methods.
The methods for point_kind include point_kind.__str__() and point_kind.participates_in_cfg().
>>> str(cs.point_kind.CALL_SITE)
'call-site'
>>> cs.point_kind.CALL_SITE.participates_in_cfg()
True
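Because a point kind converts to a string with str(), the kinds in a procedure can be tallied with a standard Counter. This is a rough sketch, assuming only procedure.points(), point.get_kind(), and point_kind.__str__() as described above. The helper name count_point_kinds is hypothetical.

```python
from collections import Counter


def count_point_kinds(proc):
    """Count the program points of `proc`, keyed by str() of each point's kind."""
    return Counter(str(pnt.get_kind()) for pnt in proc.points())
```

In the console, `for kind, n in sorted(count_point_kinds(proc).items()): print('%s %d' % (kind, n))` would print one line per kind that occurs in the procedure.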
Back to the entry point from before. We can get the points
that come next in the control flow ordering with point.cfg_targets().
>>> list(proc.entry_point().cfg_targets())
[(<cs.point [expression] i = 1>, <cs.edge_label T>)]
This should look familiar. i = 1 was the first statement in
procedure foo(). The edge label T is short for “true”. Program
points that have only a single successor will use the T edge label.
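The (point, edge label) pairs returned by point.cfg_targets() are enough to list every control flow edge of a procedure. The sketch below is one possible way to do that. It assumes that points whose kind does not participate in the CFG should be skipped, and that str() of an edge label gives its short name (for example T). The helper name cfg_edges is hypothetical.

```python
def cfg_edges(proc):
    """Return (source, label, target) string triples for the CFG of `proc`."""
    edges = []
    for src in proc.points():
        # Assumption: only points whose kind participates in the CFG
        # have meaningful control flow successors.
        if not src.get_kind().participates_in_cfg():
            continue
        for dst, label in src.cfg_targets():
            edges.append((str(src), str(label), str(dst)))
    return edges
```

Iterating over all points of the procedure avoids having to track visited points, which a traversal starting from the entry point would need in the presence of loops.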
An easier way of getting the successor of a program point is to use
the point.solitary_cfg_target() method. This method can be
applied successively:
>>> pnt = proc.entry_point()
>>> while True:
... pnt = pnt.solitary_cfg_target()
... print(pnt)
...
[expression] i = 1
[expression] j = i + 1
[expression] k = i + j
[expression] foo$return = k
[return] foo
[exit] foo
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/ca0/dvitek/trunk/csurf/src/api/python/cs.py", line 3854, in solitary_cfg_target
def solitary_cfg_target(self): return _cs.point_solitary_cfg_target(self)
cs.result: CS_PDG_VERTEX_HAS_ZERO_OR_MULTIPLE_SUCCESSORS
What happened here? We see above that a number of program points were dumped in order. However, the attempt to get the successor of the exit point raised an exception. Exit points have zero successors.
Most exceptions raised by the API have type result
(ast_pattern.__init__() is a special case and raises
ast_pattern_compilation_error exceptions). There are also a
few places where standard Python exceptions such as KeyError and
StopIteration are raised.
We can catch and examine this exception. This time, instead of iterating through the control flow
graph (CFG) from the entry point to the exit point, we obtain the exit
point directly with procedure.exit_point().
>>> try:
... proc.exit_point().solitary_cfg_target()
... except cs.result as r:
... print(r)
... if r == cs.result.PDG_VERTEX_HAS_ZERO_OR_MULTIPLE_SUCCESSORS:
... print('oops!')
...
CS_PDG_VERTEX_HAS_ZERO_OR_MULTIPLE_SUCCESSORS
oops!
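The same try/except pattern lets the earlier loop stop cleanly instead of ending in a traceback. The following is a minimal sketch: the helper name follow_straight_line and its max_steps guard are hypothetical, and the exception class is passed in as a parameter (in the console it would be cs.result).

```python
def follow_straight_line(start, error_type, max_steps=10000):
    """Follow solitary_cfg_target() from `start` until it raises.

    `error_type` is the exception class that signals "no single successor".
    `max_steps` is a guard against cycles in the control flow graph.
    Returns the list of points reached, excluding `start` itself.
    """
    path = []
    pnt = start
    for _ in range(max_steps):
        try:
            pnt = pnt.solitary_cfg_target()
        except error_type:
            break  # zero or multiple successors: stop here
        path.append(pnt)
    return path
```

For example, follow_straight_line(proc.entry_point(), cs.result) would collect the points printed by the earlier loop. Note that this sketch treats every cs.result as a stop condition; a stricter version could compare the caught exception with cs.result.PDG_VERTEX_HAS_ZERO_OR_MULTIPLE_SUCCESSORS and re-raise anything else.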
Let’s go back and dig deeper into the first expression.
point objects can contain multiple abstract syntax trees
(ASTs, ast objects). point objects for C code may
have normalized and unnormalized ASTs.
Normalized ASTs use a smaller
language and are therefore easier to analyze, so we request a
normalized C AST associated with pnt by specifying
ast_family.C_NORMALIZED as the family argument to
point.get_ast().
>>> pnt = proc.entry_point().solitary_cfg_target()
>>> pnt
<cs.point [expression] i = 1>
>>> e = pnt.get_ast(cs.ast_family.C_NORMALIZED)
>>> e
<cs.ast [c:=] i = 1>
e is the root node of the ast for pnt.
Every ast node has an ast_class, which we can obtain with ast.get_class().
>>> e.get_class()
<cs.ast_class c:=>
>>> e.get_class() == cs.ast_class.NC_NORMALASSIGN
True
This ast node has class cs.ast_class.NC_NORMALASSIGN: it represents a normalized C normal
assignment.
Every AST has:
- zero or more children which are part of the structure of the tree (so an AST-valued child may not be part of a cycle), and
- zero or more attributes which are not part of the structure of the tree (so an AST-valued attribute may point anywhere in the AST).
We can get an AST’s children with ast.children().
>>> e.children()
(<cs.ast_field 1:i>, <cs.ast_field 2:1>)
Both children and attributes are represented by objects of class
ast_field, which pairs an ordinal (ast_ordinal) and
a value (several possible types, see ast_field_type). The
ordinal serves as an identifier for the field.
AST e has two children.
- The child with ordinal 1 corresponds to the left hand side of the assignment (i).
- The child with ordinal 2 corresponds to the right hand side (1).
We can inspect the value in an ast_field with ast_field.value().
>>> e.children()[0].value()
<cs.ast [c:variable] i>
The array syntax provides a more efficient and convenient way to get a field using its ordinal:
>>> e[1]
<cs.ast [c:variable] i>
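Since children form a tree, ast.children() and ast_field.value() are enough for a simple recursive walk. The sketch below counts the nodes reachable through child fields. It assumes that only AST-valued children expose a children() method, so any other field value is treated as a leaf. The helper name count_ast_nodes is hypothetical.

```python
def count_ast_nodes(node):
    """Count `node` plus every AST reachable from it through child fields."""
    total = 1
    for field in node.children():
        value = field.value()
        # Assumption: only AST-valued children have a children() method;
        # other field values (numbers, strings, ...) are not counted.
        if hasattr(value, "children"):
            total += count_ast_nodes(value)
    return total
```

The recursion terminates because children, unlike attributes, cannot be part of a cycle.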
In addition to numeric ast ordinals, there are also symbolic ast
ordinals. Both children of e have fields with symbolic ordinals.
>>> e[1].fields()
(<cs.ast_field name:i>, <cs.ast_field type:[c:integer] int>, <cs.ast_field storage-class:auto>, <cs.ast_field abs-loc:<cs.symbol i>>)
We can index an ast with a symbolic ordinal just as we can with a numeric ordinal.
>>> e[1][cs.ast_ordinal.NC_ABS_LOC]
<cs.symbol i>
This is the symbol for variable i.
ast objects are fairly complicated. The easiest way to
inspect an ast is to invoke ast.dump():
>>> print(e.dump())
(c:=)-+-(c:variable)-+-name:"i"
      |              +-type:(c:integer)-+-size:4
      |              |                  +-is-const:false
      |              |                  +-is-volatile:false
      |              |                  +-is-near:false
      |              |                  +-is-far:false
      |              |                  +-is-unaligned:false
      |              |                  +-is-restrict:false
      |              |                  +-is-complete:true
      |              |                  `-integer-kind:int
      |              +-storage-class:auto
      |              `-abs-loc:i-6
      +-(c:integer-value-32)-+-value:1
      |                      +-type:(c:integer)-+-size:4
      |                      |                  +-is-const:false
      |                      |                  +-is-volatile:false
      |                      |                  +-is-near:false
      |                      |                  +-is-far:false
      |                      |                  +-is-unaligned:false
      |                      |                  +-is-restrict:false
      |                      |                  +-is-complete:true
      |                      |                  `-integer-kind:int
      |                      `-is-decimal-literal:true
      `-type:(c:integer)-+-size:4
                         +-is-const:false
                         +-is-volatile:false
                         +-is-near:false
                         +-is-far:false
                         +-is-unaligned:false
                         +-is-restrict:false
                         +-is-complete:true
                         `-integer-kind:int
Let’s look some more at the symbol for variable i. Symbols represent variables and procedures.
>>> sym = e[1][cs.ast_ordinal.NC_ABS_LOC]
This is a local variable (check with symbol.is_local()), so we
can get its containing procedure with symbol.get_procedure().
>>> sym.is_local()
True
>>> sym.get_procedure()
<cs.procedure foo>
Determining the definition location of a symbol is similar
to determining the definition location of a procedure: use symbol.file_line().
>>> sym.file_line()
(<cs.sfileinst /tmp/foo.c>, 4)
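The three symbol queries above can be folded into a one-line summary. This is only a sketch: the helper name describe_symbol is hypothetical, and it assumes that str() of a file instance yields its path.

```python
def describe_symbol(sym):
    """Return a one-line description of a symbol's scope and definition site."""
    sfi, line = sym.file_line()  # (file instance, line number)
    if sym.is_local():
        scope = "local to %s" % sym.get_procedure().name()
    else:
        scope = "not local"
    # Assumption: formatting a file instance with %s gives its path.
    return "%s, defined at %s:%d" % (scope, sfi, line)
```

The helper calls symbol.get_procedure() only for local symbols, mirroring the check made in the tour.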
Let’s look for points that use symbol i.
Cross-referencing queries with project.token_search() provide
search at a more lexical level, and can search for many things besides
variables (for example, uses of macros or types). They are efficient
on very large code bases. xr_query objects can restrict
search results with respect to several dimensions. Here, we set up an
xr_query object and use xr_query.add_term_filter() to
restrict the search to tokens that match ‘i’. The token search
returns an xr_query_iterator with one xr_tuple
element for each occurrence of the token; we can iterate over these to
print the file name (xr_tuple.get_file(), sfile.name())
and line (xr_tuple.get_line()) for each occurrence.
>>> q = cs.xr_query()
>>> q.add_term_filter('i')
>>> for x in p.token_search(q):
... print('%s:%d' % (x.get_file().name(), x.get_line()))
...
/tmp/foo.c:4
/tmp/foo.c:5
/tmp/foo.c:6
/tmp/foo.c:4
If we look back at the source code for foo.c, we can see that i does indeed
occur on lines 4, 5, and 6.
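Token search results can contain several hits on the same line, as the repeated line 4 above shows. The following sketch groups hits by file and removes duplicate lines, using only project.token_search(), xr_tuple.get_file(), sfile.name(), and xr_tuple.get_line(). The helper name occurrences_by_file is hypothetical.

```python
def occurrences_by_file(project, query):
    """Map each file name to the sorted, de-duplicated lines that contain a hit."""
    lines = {}
    for hit in project.token_search(query):
        lines.setdefault(hit.get_file().name(), set()).add(hit.get_line())
    return dict((name, sorted(found)) for name, found in lines.items())
```

With the query q built above, occurrences_by_file(p, q) would be expected to give {'/tmp/foo.c': [4, 5, 6]} for the example project.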
Type quit() to end the tour and allow the analysis to complete:
>>> quit()
This tour only touched on a small portion of the intermediate representation. More information is available in the cs Class Reference.
Side Effects
- If a method has side effects, then the documentation says so.
- Methods with side effects modify the object they are invoked on (self), or global state (e.g., issuing a warning).
- Methods never modify their parameters.