05 May 2012: DXQL - Dexter Query Language
                        DXQL - Dexter Query Language
                            Felix Matenaar 2012

Or: "Towards a suitable querying language for static program analysis"

1. Motivation

    There are several static reverse engineering and analysis tools out there.
    They provide analysts with the capability to understand how a certain
    program works without the need of having access to the actual sourcecode.
    Therefor the main goal of such a software is to help the analyst do his
    job faster - and thus lower cost.

    Since non-trivial programs include a number of objects which are impossible
    to keep in mind for humans, most of such analysis software implements a
    search functionality.

    This document deals with the question how to design search functionality
    in such a context. Our solution to this problem is presented.

    Most program analysis software implements a plain text search. One can for example
    enter a search string and all functions in a program are listed that somehow include
    the search term as a substring. Sometimes keywords are additionally provided so express semantic
    characteristics. We all know that from the advanced google search. With keywords
    like "filetype:" we can limit results to certain file types.

    This is the concept google uses so how can this be wrong? ;)
    As always it depends on the context. Text search is suitable for documents and therefor
    also for (most parts) of the internet.

    However plain text search is not able to express relations among objects which are an important
    property in the context of static program analysis.    Imagine a user thinking about the
    following query:

        "give me the intersection of packages where classes are included which are used
         as field types in the class X and all packages which are directly defined in our program"

    This example shows that there is a need for an advanced query language

2. Requirements

    We derived several requirements from our experience in using known static program analysis

    1. easy to use

        Writing a search term in a textfield is done very fast. The same cannot hold for more
        complex query languages. However writing queries should not take so long that the
        user feels slowed down in his analysis process.
        Therefor we need to make the querying concept as simple as possible while not
        touching flexibility as far as needed in this context.

    2. flexible

        Users should be able to launch complex queries easily. All limitations should either
        rely in security reasons or evaluation complexity reasons (which to the end are security
        reasons too).

    3. limitation in computation complexity

        As long as the queries are launched on a standalone machine, complexity limitations can
        just be considered as convenience. However since dexlabs wants to provide a platform,
        potentially untrusted users may not exploit e.g. non-terminating query evaluations for
        a denial of service attack.

    4. security

        Querying concepts must assert that no user is able to exploit the querying language
        to read data which he should not have access to. Pretty obvious requirement at all
        but not very intuitive in general since most program analysis software is run in
        a trusted user environment.

3. Design

    Let us give you some impressions first before starting with the description of our concepts:
    In our program analysis scenario, we have got several example objects:
        - packages
        - classes
        - methods
        - class attributes (fields)
        - cross references (method1 call method2)

    Example Queries:
        set(c()); # set of all classes
        set(m(abstract=False)) # set of all methods which are not abstract
        xref of set(m()) # union of xrefs from and to all methods
        xref of namewith(name='send') in set(m())
            # as before but just considering methods with 'send' in methodname
        package of type of field of class of set(p(name='mainpackage'))
            # get package dependencies through attribute types of classes belonging to 'mainpackage'
        xref of method of set(c(name='mainclass')) as xrefs; \
             method of set(c(name='otherclass')) as callers; \
             xrefs and callers;
            # get all xrefs from/to methods in class 'mainclass'
              and intersect them with all methods of class 'otherclass'

    3.1 Overview

        The query language is based on a sequence of statements that are evaluated one after another.
        They are delimeted by a semicolon. Each statement yields a set which can be bound to a variable.
        The set yielded by the last statement is the result    set of the whole query.

        A statement consists of a sequence of processors (mappers and filters), a set
        constructor and an optional variable declaration. If a variable is declared,
        the set which the statement is evaluated to is bound to this variable and can be
        used in further statements.

        The evaluation of a statement to a set works as follows: a so called set initializer
        creates an initial set either by directly using a set constructor like set(c(...))
        for example using classes, by using a variable or by concatenation of variables or set
        constructors through set operations. This initial set is then processed from right to
        left through the processors. The Keyword "in" indicates a filter processor. The Keyword "of"
        indicates a mapper processor. Since each processors yields a set, they can be used in arbitrary
        order (constrained by typechecking explained below).

    3.2 Set Constructors

        Set constructors directly create a set of objects from a type, like 'c' for classes.
        Every argument is optional and is used like the 'where' statement in sql.
        Every argument must be assigned through a keyword, e.g. "set(c(name='unameit'))
        Variables or any other sets are not accepted as arguments for set constructors.

        A set constructor creates a set of objects that all have the same type.
        Therefor it is unambigous to say that a set has one type, namely the type of each object contained.

            variable and set(c()) # variable must be set of type class
            variable and set(c()) not set(m()) # not allowed, since set(c()) and set(m()) have different

    3.3 Operators

        DXQL knows three operators:
            and = intersection (a and b => all objects that are included in both, a and b)
            or  = union     (a or b => all objects that are either included in a or in b)
            not = exclusion (a not b => all objects that are included in a but not in b)

        Operations among sets of different types are not allowed.
        Operators can only be used in the set construction section of a statement
        Operators are evaluated from right to left.
        There are no parentheses.

        Example: set(c()) and foo not bar
                foo and bar are both variables specifying a set
                1. calculate foo not bar
                2. calculate set(c()) and 'result of 1'

    3.4 Processors

        Processors get an input set which is specified through the order in the statement
            proc2 of proc1 in set(c());
                proc1 uses the output of set(c()) as input
                proc2 uses the output of proc1 as input. Its output is yielded by the statement.

        Processors may accept arguments.
        Arguments must be given by keywords.
        Arguments may in addition to set constructors be variables (sets)

        Each processor has a specified set of set types which are allowed as input
        Each processor has a specified set of set types which may be produced as output

    3.5 Filter

        A filter is a processor.
        A filter yields a new set with the following constrains:
            1. the type of the input set must be the same as the type
               of the output set
            2. The output set must be a subset of the input set

        Filters are used to eliminate unwanted objects from a set after set construction.
        Filters are indicated through the keyword "in"

        Example applications:
            filter a set in objects for their name
                 ... namewith(name='test') in ...
            filter a set in classes for their package
                 ... exclude(inpackage='bar') in ...

    3.6 Mapper

        A mapper is a processor.
        A mapper yields a new set.
        The output set type may differ from the type of the input set

        Example applications:
            xref of method of set(c())

            method takes class sets as input and yields method sets
            xref takes method sets as input and yields method sets

    3.7 Security

            A DXQL statement terminates iff processors and set constructors
            terminate since there are no loops.
            Set constructors terminate because they operate on a finite amount of data.
            Thus all DXQL queries terminate iff processors terminate.
            This means that termination is independent from the user which is
            what we want.

        Type checking:
            Since every set must be initially constructed by a set
            constructor specifying the type, the type of a specific set
            is always known.

            Therefor we can do typechecking for operators and processors
            to prevent semantically meaningless or wrong processing.

        Data safety:
            Only set constructors may initially query arbitrary data,
            access must be checked at this point.
            As long as the data model does not include unintended relations
            between objects, processors can not be exploited.

4. Implementation

    The implementation has yet been prototyped and will be included in Dexter.
    Stay tuned for the launch of our platform!


5. Conclusion

    DXQL is a query language used in dexter for static program analysis.
    Its advantage lies in the variety of expressions a user can create to
    search through a program using filters, mappers and basic set operations while
    at the same time being easy to learn.
    Since every statement yields a set of objects, more complicated queries can
    be split up to smaller ones which are then concatenated.
    The simplicity of DXQL allows complex database queries by potentially untrusted
    users to be made secure. One just has to make sure that set constructors do not
    yield information which should not be available to the user. Furthermore no filter
    or mapper shall take such date into account or even yield them as a result.

6. Appendix

    6.1 Syntax Definition

        # Query Syntax Definition with pyparsing

        # suppressed / syntactic sugar
        LPAREN             = Suppress("(")
        RPAREN             = Suppress(")")

        # basic types
        identifier         = Word(alphas, alphanums, min=1)
        number             = Word(nums, min=1)
        string             = QuotedString("'")
        boolean            = Literal("true") | Literal("false") | Literal("True") | Literal("False")
        comma              = Suppress(",")
        semicolon          = Suppress(";")
        vardecl            = Suppress("as") + identifier
        setkeyword         = Suppress("set")
        typeIdentifier     = oneOf(" ".join(settypes))
        OP_OR              = Literal("or")
        OP_NOT             = Literal("not")
        OP_AND             = Literal("and")
        constant           = string | number | boolean
        argkeyword         = Word(alphas, min=1)
        setargtype         = constant
        procargtype        = constant | identifier

        # set construction and operations
        SetOperator        = OP_OR | OP_AND | OP_NOT
        SetArg             = argkeyword + Suppress("=") + setargtype
        SetConstructor     = typeIdentifier + Group(LPAREN + Optional(SetArg) + ZeroOrMore(comma + SetArg) + RPAREN)
        NewSet             = setkeyword + LPAREN + SetConstructor + RPAREN
        DefinedSet         = NewSet | identifier
        SetBuilder         = DefinedSet + ZeroOrMore(SetOperator + DefinedSet)

        Argument           = Group(argkeyword + Suppress("=") + procargtype)
        Arglist            = Optional(Argument) + ZeroOrMore(comma + Argument)
        Filter             = identifier + Optional(LPAREN + Arglist + RPAREN) + Suppress("in")
        Mapping            = identifier + Optional(LPAREN + Arglist + RPAREN) + Suppress("of")

        # structural rules
        SetProcessor       = Mapping | Filter
        SetExpr            = Group(ZeroOrMore(SetProcessor)) + SetBuilder + Optional(vardecl)
        S                  = OneOrMore(SetExpr + semicolon)

< Blog List