Query Chronicles

How CodeQL works: Summary

Sim4n6
July 01, 2023

You are reading Sim4n6's newsletter, a publication designed for ethical hackers. Each issue features a selected vulnerability-related topic and goes straight to the point of the concept to master.

This edition is a summary of How CodeQL works.

CodeQL Analysis Overview

As you read, keep in mind the following flow chart, which describes how CodeQL does its magic to unearth vulnerabilities.

A brief overview of CodeQL analysis steps

The CodeQL Database

First, the code base is processed by a proprietary extractor. For each input source file, the extractor produces relational data and a source reference.
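
To make "relational data" a bit more concrete, here is a toy sketch using Python's standard ast module. It is not CodeQL's extractor and does not reflect its real schema; the extract_calls helper, the sample source, and the (file, callee, line) row layout are all made up for illustration. The idea is only that source code becomes rows you can filter like a table.

```python
import ast

def extract_calls(source: str, filename: str) -> list:
    """Toy stand-in for an extractor (NOT CodeQL's): turn source code into
    (file, callee_name, line) rows, one per simple function call."""
    rows = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        # Only keep calls of the form name(...), e.g. eval(x)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            rows.append((filename, node.func.id, node.lineno))
    # Sort by line number so the output order is deterministic
    return sorted(rows, key=lambda row: row[2])

# Hypothetical three-line input file
SAMPLE = "import os\nuser = input()\neval(user)\n"

calls = extract_calls(SAMPLE, "app.py")

# A "query" over the relation: every call to eval
eval_calls = [row for row in calls if row[1] == "eval"]
print(eval_calls)
```

A real CodeQL database stores far richer relations than this, but the principle of querying code as data is the same.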

The result of this step is a CodeQL database, which is nothing more than a directory holding a queryable representation of the code base for a single programming language at a specific point in time.

The database directory holds more than that, though: the logs of the database creation and the results of running queries, among other things.

The following command creates a database named database.db/ from the ./src/ code base, which contains code written in Python.

codeql database create database.db/ --language="python" --source-root=./src/
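
If you prefer to script this step, a minimal sketch could look like the one below. It only assembles the same command shown above; build_create_command and run are hypothetical helper names, and run assumes the codeql CLI is installed and on your PATH.

```python
import shlex
import subprocess

def build_create_command(db_dir: str, language: str, source_root: str) -> list:
    """Assemble the argv for `codeql database create` (hypothetical helper)."""
    return [
        "codeql", "database", "create", db_dir,
        f"--language={language}",
        f"--source-root={source_root}",
    ]

def run(cmd: list) -> None:
    """Run the command; assumes the codeql CLI is available on PATH."""
    subprocess.run(cmd, check=True)

cmd = build_create_command("database.db/", "python", "./src/")
print(shlex.join(cmd))
# run(cmd)  # uncomment on a machine where the CodeQL CLI is installed
```

Passing the arguments as a list keeps paths with spaces from being split by a shell.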

Running the Query

As I understand it, the following steps are performed when you run a query on a CodeQL database:

First, the QL Compiler ingests the query, its related libraries, and the QL database schema. The latter is a text file that describes the column types and extensional relations of a raw QL dataset. The extractor copied this schema file into the database folder when you created the CodeQL database.
Then, the QL Compiler generates DIL (Datalog Intermediary Language), an intermediate representation between QL and relational algebra (RA). DIL is useful for advanced users as an aid for debugging query performance.
Finally, with a bit of computational voodoo, the Evaluator runs the compiled query against the database and produces the results, which can be reported in SARIF format.
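
To round off the steps above, here is a hedged sketch of how one might ask for SARIF output and then skim it. The query path queries/example.ql, the output file results.sarif, the rule ID py/example-rule, and the sample document are all hypothetical. The field names (runs, results, ruleId, locations, physicalLocation) come from the SARIF 2.1.0 format; a real report carries many more fields.

```python
import json

def build_analyze_command(db_dir: str, query: str, output: str) -> list:
    """Assemble the argv for `codeql database analyze` with SARIF output."""
    return [
        "codeql", "database", "analyze", db_dir, query,
        "--format=sarif-latest",
        f"--output={output}",
    ]

def load_sarif(path: str) -> dict:
    """Read a SARIF report (plain JSON) from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def summarize_sarif(sarif: dict) -> list:
    """Flatten a SARIF document into (rule_id, file, line, message) tuples."""
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Use the first reported location, if any
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append((
                result.get("ruleId"),
                physical.get("artifactLocation", {}).get("uri"),
                physical.get("region", {}).get("startLine"),
                result.get("message", {}).get("text"),
            ))
    return findings

# Hypothetical, heavily trimmed SARIF document for illustration only
SAMPLE_SARIF = {
    "version": "2.1.0",
    "runs": [{
        "results": [{
            "ruleId": "py/example-rule",
            "message": {"text": "Example finding."},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": "app.py"},
                    "region": {"startLine": 3},
                }
            }],
        }]
    }],
}

print(build_analyze_command("database.db/", "queries/example.ql", "results.sarif"))
print(summarize_sarif(SAMPLE_SARIF))
```

On a real report, summarize_sarif(load_sarif("results.sarif")) would give the same kind of flat list.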

Thank you for reading.

You can read all past editions ➡️ here.

@Sim4n6