Analyze batches of real-world projects with the declarative static analysis provided by CodeQL, supporting empirical studies and statistical analysis that reveal patterns in real-world code.
QLStat provides a comprehensive framework for large-scale empirical analysis of software projects using CodeQL. Key features include:
- Batch Processing: Clone, build, and analyze multiple repositories in parallel
- Flexible Configuration: YAML-based configuration for defining analysis targets and parameters
- Extensible Analysis: Support for custom external predicates (e.g., escape analysis data)
- Scalable Query Execution: Parallel execution of CodeQL queries across repositories
- Comprehensive Logging: Detailed logging at each stage of the analysis pipeline
- Data Collection: Aggregation of results from multiple repositories into unified datasets
- Language Support: Currently focused on Go, with extensibility for other languages supported by CodeQL
Create your stat.yaml config file according to example.yaml. The configuration supports several key sections:
- sources: Define repository sources with prefixes and specific repositories
- language: Specify the programming language for analysis (e.g., go)
- buildGrps: Configure build groups with timeout and build commands
- externalGenGrps: Generate external predicates (like escape analysis data)
- queryconfig: Set up query execution with parallelization options
- queryGrps: Define query groups with specific queries and target repositories
Run go run ./cmd/batch_clone_build stat.yaml to clone repositories and create CodeQL databases:
    go run ./cmd/batch_clone_build stat.yaml

Key options:
- -noclone: Skip cloning if repositories already exist
- -nobuild: Skip database creation if databases already exist
- -noextgen: Skip generation of external predicates
The tool supports three main phases:
- Cloning: Download repositories from specified sources
- Building: Create CodeQL databases using appropriate build commands
- External Predicate Generation: Generate additional data sources like escape analysis results
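To illustrate how the skip options relate to the three phases, here is a minimal sketch in Go. It is not QLStat's actual implementation: the function `planPhases` and the phase labels `clone`, `build`, and `extgen` are hypothetical names chosen for this example. Only the flag names `-noclone`, `-nobuild`, and `-noextgen` come from the options listed above.

```go
package main

import (
	"flag"
	"fmt"
)

// planPhases parses command-line style arguments and returns the phases
// that remain enabled, in execution order, plus the config file path.
// Hypothetical helper: it only mirrors the documented flag names.
func planPhases(args []string) (phases []string, config string, err error) {
	fs := flag.NewFlagSet("batch_clone_build", flag.ContinueOnError)
	noClone := fs.Bool("noclone", false, "skip cloning")
	noBuild := fs.Bool("nobuild", false, "skip database creation")
	noExtGen := fs.Bool("noextgen", false, "skip external predicate generation")
	if err := fs.Parse(args); err != nil {
		return nil, "", err
	}
	if !*noClone {
		phases = append(phases, "clone")
	}
	if !*noBuild {
		phases = append(phases, "build")
	}
	if !*noExtGen {
		phases = append(phases, "extgen")
	}
	// The first positional argument is taken to be the YAML config path.
	return phases, fs.Arg(0), nil
}

func main() {
	// Fixed example arguments, equivalent to passing "-noclone stat.yaml".
	phases, config, err := planPhases([]string{"-noclone", "stat.yaml"})
	if err != nil {
		panic(err)
	}
	fmt.Println("config:", config)
	for _, p := range phases {
		fmt.Println("would run phase:", p)
	}
}
```

In this sketch each flag removes exactly one phase, so rerunning with `-noclone -nobuild` would leave only external predicate generation.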
Create your queries in the qlsrc directory. Queries should follow CodeQL conventions and can leverage external predicates when needed.
Run go run ./cmd/codeql_qdriver -collect stat.yaml to execute queries on the created databases:
    go run ./cmd/codeql_qdriver -collect stat.yaml

Available options:
- -format: Specify output format (text, csv, json, bqrs); default: csv
- -decode-only: Only decode existing bqrs files without running queries
- -collect: Collect all CSV results into a single file with repository names
Results are processed in three stages:
- Query Execution: Run CodeQL queries on each database
- Decoding: Convert bqrs results to specified format (CSV, JSON, etc.)
- Collection: Aggregate results from all repositories into a single dataset
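The collection stage can be pictured with the following Go sketch, which merges per-repository CSV output into one table and adds a repository-name column. It is an illustration under stated assumptions, not QLStat's code: the function `collect`, the `repo` column name, and the sample repositories and rows are all hypothetical, and every input is assumed to share the same header row.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"sort"
	"strings"
)

// collect merges per-repository CSV text into a single table, prefixing
// each data row with the repository name. It assumes all inputs share
// the same header row (an assumption of this sketch).
func collect(perRepo map[string]string) ([][]string, error) {
	repos := make([]string, 0, len(perRepo))
	for r := range perRepo {
		repos = append(repos, r)
	}
	sort.Strings(repos) // deterministic output order

	var out [][]string
	for _, repo := range repos {
		rows, err := csv.NewReader(strings.NewReader(perRepo[repo])).ReadAll()
		if err != nil {
			return nil, fmt.Errorf("%s: %w", repo, err)
		}
		if len(rows) == 0 {
			continue // no header and no data for this repository
		}
		if out == nil {
			// Emit the shared header once, with the extra "repo" column.
			out = append(out, append([]string{"repo"}, rows[0]...))
		}
		for _, row := range rows[1:] {
			out = append(out, append([]string{repo}, row...))
		}
	}
	return out, nil
}

func main() {
	// Hypothetical per-repository query results.
	merged, err := collect(map[string]string{
		"org/b": "name,line\nfoo,3\n",
		"org/a": "name,line\nbar,7\nbaz,9\n",
	})
	if err != nil {
		panic(err)
	}
	w := csv.NewWriter(os.Stdout)
	if err := w.WriteAll(merged); err != nil {
		panic(err)
	}
}
```

Sorting the repository names keeps the merged dataset stable across runs, which makes later statistical comparisons easier to reproduce.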
QLStat supports extending CodeQL with escape analysis data through the escape adapter:
- Configure externalGenGrps in your YAML with genScript: goescape. Here goescape is actually the command go build -a -gcflags=all=-m=2 .
  - You can also specify your own script, with only one constraint: it must generate m2.log at $logRoot/extgen/path/to/repo/m2.log
- This generates escape analysis data during the build phase
- Reference the external predicate in your query group by listing movedToHeap under externals
- Use the external predicate in your CodeQL queries
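As a rough illustration of the kind of data an m2.log file carries, the Go sketch below extracts "moved to heap" diagnostics from compiler output. The Go compiler prints such lines in the form `file:line:col: moved to heap: name` when built with `-gcflags=-m`. How QLStat's escape adapter actually turns the log into the movedToHeap predicate is not described here, so the `heapMove` type, the `parseMoved` function, and the sample log are assumptions made for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// heapMove is a hypothetical record for one "moved to heap" diagnostic.
type heapMove struct {
	File string
	Line int
	Col  int
	Var  string
}

// movedRe matches lines such as "./main.go:6:2: moved to heap: x".
var movedRe = regexp.MustCompile(`^(.+?):(\d+):(\d+): moved to heap: (.+)$`)

// parseMoved scans compiler diagnostics and keeps only the
// "moved to heap" entries, ignoring every other line.
func parseMoved(log string) []heapMove {
	var moves []heapMove
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := movedRe.FindStringSubmatch(strings.TrimSpace(sc.Text()))
		if m == nil {
			continue
		}
		// The regexp guarantees digits, so conversion errors are not expected.
		line, _ := strconv.Atoi(m[2])
		col, _ := strconv.Atoi(m[3])
		moves = append(moves, heapMove{File: m[1], Line: line, Col: col, Var: m[4]})
	}
	return moves
}

func main() {
	// Illustrative log excerpt, not taken from a real QLStat run.
	sample := "./main.go:5:6: can inline newInt\n" +
		"./main.go:6:2: moved to heap: x\n"
	for _, mv := range parseMoved(sample) {
		fmt.Printf("%s:%d:%d %s\n", mv.File, mv.Line, mv.Col, mv.Var)
	}
}
```

Records like these could then be exported in whatever tabular form the external predicate expects; that export step is outside the scope of this sketch.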
For more details about how the escape analysis extension works, see Escape Analysis Documentation.
For detailed information about the storage structure and architecture, please refer to the Architecture Documentation.
@misc{qlstat,
author = {Qingwei Li},
title = {QLStat},
howpublished = {\url{https://github.com/Lslightly/QLStat}},
}