FormAI Dataset

Description

FormAI Dataset is a large collection of AI-generated compilable and independent C programs with vulnerability classification. Currently, there are two versions available: FormAI-v1, consisting of 112,000 compilable C programs generated by GPT-3.5-turbo, and FormAI-v2, recently released, which includes 331,000 compilable C programs generated utilizing various LLMs such as Google's GEMINI-pro, OpenAI's GPT-4, TII's 180 billion-parameter Falcon, CodeLLama2, and other compact models. These programs generated using dynamic zero-shot prompting technique and comprises programs with varying levels of complexity. Some programs handle complicated tasks such as network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Each program is labelled based on vulnerabilities present in the code using a formal verification method based on the Efficient SMT-based Bounded Model Checker (ESBMC). This strategy reduces false positive findings by providing concrete counterexamples for vulnerabilities and minimizes false negatives by completing the formal verification of the code within a specified threshold. The labeled samples can be utilized to train Large Language Models (LLMs) since they contain the exact program location of the software vulnerability. We utilized the Nicad tool (Automated Detection of Near-Miss Intentional Clones) to eliminate Type 1, Type 2, Type 3-1, and Type 3-2 clones from the dataset. However, we also provide the original C files of these clones because small modifications in the code can introduce new vulnerabilities. This can be highly beneficial when the dataset is used for fuzzing or identifying vulnerabilities in compiler tools like GCC or Clang.

Cite

FormAI-v1:

The original FromAI-v1 paper with 112,000 samples was presented at the 19th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2023) on December 8, 2023, in San Francisco, CA, USA.

Cite paper (ACM digital library): https://dl.acm.org/doi/10.1145/3617555.3617874

@inproceedings{10.1145/3617555.3617874,
author = {Tihanyi, Norbert and Bisztray, Tamas and Jain, Ridhi and Ferrag, Mohamed Amine and Cordeiro, Lucas C. and Mavroeidis, Vasileios},
title = {The FormAI Dataset: Generative AI in Software Security through the Lens of Formal Verification},
year = {2023},
isbn = {9798400703751},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3617555.3617874},
doi = {10.1145/3617555.3617874},
booktitle = {Proceedings of the 19th International Conference on Predictive Models and Data Analytics in Software Engineering},
pages = {33–43},
numpages = {11},
keywords = {Artificial Intelligence, Dataset, Formal Verification, Large Language Models, Software Security, Vulnerability Classification},
location = {<conf-loc>, <city>San Francisco</city>, <state>CA</state>, <country>USA</country>, </conf-loc>},
series = {PROMISE 2023}
}

FormAI-v2:

The FormAI-v2 dataset, which contains 331,000 C samples, has been published in the Empirical Software Engineering journal in the article titled "How Secure is AI-Generated Code: A Large-Scale Comparison of Large Language Models".

Cite paper (Springer - Empirical Software Engineering): https://link.springer.com/article/10.1007/s10664-024-10590-1

@article{tihanyi2024secure,
title={How secure is AI-generated code: a large-scale comparison of large language models},
author={Tihanyi, Norbert and Bisztray, Tamas and Ferrag, Mohamed Amine and Jain, Ridhi and Cordeiro, Lucas C.},
journal={Empirical Software Engineering},
volume={30},
number={47},
year={2025},
publisher={Springer},
doi={10.1007/s10664-024-10590-1},
url={https://doi.org/10.1007/s10664-024-10590-1}
}

Architecture

The design and methodology behind the creation of the FormAI-v2 dataset are depicted in the subsequent figure.

The dataset

The FormAI-v2 dataset is larger and more comprehensive than the FormAI-v1 dataset. Therefore, we strongly encourage the research community to transition from FormAI-v1 to FormAI-v2. The new dataset comprises 331,000 AI-generated C files from nine different models including Falcon2-11B, Falcon180B, GPT4o-mini, GPT-3.5, CodeLlama-13B, Llama2 13B, Gemini Pro 1.0, Mistral 7B, and Gemma 7B. All files are compilable and contain at least 50 lines of code. Furthermore, FormAI-v2 includes additional metrics that can be utilized in machine learning classification, such as cyclomatic complexity, which helps in understanding the code complexities of each generated program.

There are 4 files in the dataset:

a. FormAI-v2-DATASET.7z - This file contains all the compilable 331,000 C files.

b. FormAI-v2-DATASET_without_clones.7z : It contains 310531 C files (Type1, Type2, , Type3-1 and Type3-2 clones are all removed)

c. FormAI-v2-classification.7z.001 and FormAI-v2-classification.7z.002: It includes a file named FormAI-v2.json, which contains the labeled dataset with the vulnerability classification of the files, the precise location of the vulnerabilities, and the corresponding C code.

The new dataset uses the more comprehensive JSON format, unlike FormAI-v1 which used CSV. Here is an example of how a JSON entry looks in the dataset:

{
   "category": "VULNERABLE",
   "file_name": "gpt35-58832.c",
   "verification_finished": "yes",
   "vulnerable_line": 42,
   "column": 9,
   "function": "sanitize_url",
   "violated_property": "file gpt35-58832.c line 42 column 9 function sanitize_url",
   "stack_trace": "c:@F@sanitize_url at file gpt35-58832.c line 15 .. array bounds violated ..."
   "error_type": "array bounds violated: lower bound",
   "code_snippet": "<... C code snippet ...>",
   "source_code": "<... Full C code  ...>",
   "num_lines": 53,
   "cyclomatic_complexity": 3.0
}

Guidelines for Getting Started with FormAI-v2

The FormAI-v2 dataset is suitable for machine learning training, testing vulnerability detection softwares, and various tasks that require a substantial collection of C files. The validation of dataset compilation was conducted on a system running Ubuntu LTS version 22.04.03 with a Linux kernel version 5.19.0 and gcc 11.4.0

To begin using the FormAI-v2 dataset, follow the instructions below:

Start by obtaining the entire FormAI dataset from Github with the given command:

git clone https://github.com/FormAI-Dataset/FormAI-dataset/

The following sequence of commands includes an operating system upgrade and the installation of the build-essential package, a crucial requirement for the proper compilation of the dataset.
```
sudo apt update && sudo apt upgrade -y && sudo apt install build-essential -y
```

The subsequent step involves installing all the necessary dependencies to compile all the C files present in the dataset.

sudo apt install -y build-essential csvtool unzip 7zip libsqlite3-dev libfftw3-dev libssl-dev libncurses5-dev libncursesw5-dev libmysqlclient-dev libpq-dev libportaudio2 portaudio19-dev libpcap-dev libqrencode-dev libsdl2-dev freeglut3-dev libcurl4-openssl-dev

Unzip the .C samples from the compressed file (this action will generate a DATASETv2 directory containing all the C files and a JSON file called FormAI-v2.json):
```
7z x FormAI-dataset/FormAI-v2-DATASET.7z  && 7z x FormAI-dataset/FormAI-v2-classification.7z.001
```
One can verify the successful installation of the dependencies by testing 1000 files from the dataset. This command will verify that the first 1000 files from the dataset compile successfully (this process usually takes around 2-3 minutes).
```
find DATASETv2 -name "*.c" | head -n 1000 | xargs -I{} bash -c 'gcc {} -w -lcrypto -lfftw3 -pthread -lsqlite3 -lmysqlclient -lpq -lssl -lportaudio -lpcap -lqrencode -lSDL2 -lglut -lGLU -lGL -lcurl -lm -lncurses &>/dev/null &&  echo {}' | wc -l
```
NOTE: If the output is not equal to 1000, it indicates that there may be missing dependencies or that certain programs fail to compile, particularly in earlier versions of GCC. This situation can occur in Ubuntu 20.04 LTS, where the default GCC version is 9.4.0, which is older than 11.4.0. However, it's important to note that this should not impact the usability of the DATASET.
If the result is 1000 from the previous run, it indicates that all the tested files were compiled without issues. Our system is ready to use the provided files. To retrieve and compile a C code from the JSON file, use the command below (taking the 12345th JSON entry as an example):
```
gcc -x c <(jq -r '.[12345].source_code' FormAI-v2.json) -lcrypto -pthread -lsqlite3 -lmysqlclient -lpq -lssl -lportaudio -lpcap -lqrencode -lSDL2 -lglut -lGLU -lGL -lcurl -lm -o output
```
NOTE: FormAI-v2.json is more than 2.5GB, so extracting C code from JSON this way is not the most efficient method. However, it is suitable for demonstration purposes. For more efficient processing, using Python is recommended.
Each program is categorized according to the vulnerabilities found in its code, using a formal verification technique that leverages the Efficient SMT-based Bounded Model Checker (ESBMC).

ESBMC is a mature, permissively licensed open-source context-bounded model checker for verifying single- and multithreaded C/C++, Kotlin, and Solidity programs. It can automatically verify predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions.

The Github page for ESBMC can be found at the following link: https://github.com/esbmc/esbmc. ESBMC depends on numerous external tools and has a variety of intricate dependencies, from SMT solvers to more complex ones. Nevertheless, there's an option to utilize a pre-compiled binary version, which can be obtained using the command provided below:
```
wget https://github.com/esbmc/esbmc/releases/download/v7.5/ESBMC-Linux.zip && unzip ESBMC-Linux.zip && rm ESBMC-Linux.zip && chmod 777 bin/esbmc
```

If you wish to test ESBMC 7.5 on an individual file from the FormAI-v2 dataset, use the command below (taking falcon180b-1656.c as a reference):

bin/esbmc DATASETv2/falcon180b-1656.c  --overflow --memory-leak-check --timeout 30 --unwind 1 --multi-property --no-unwinding-assertions

The file appears to be vulnerable, revealing a buffer overflow in the scanf() function. Below is the output from ESBMC:

  [Counterexample]

  Enter a URL: 
  
  State 2 file falcon180b-1656.c line 67 column 5 function main thread 0
  ----------------------------------------------------
  Violated property:
    file falcon180b-1656.c line 67 column 5 function main
    buffer overflow on scanf
    0
  
  
  VERIFICATION FAILED

As a concluding action, you can confirm that this vulnerability is indeed present in the FormAI-v2 dataset by executing the command below:

jq -r '.[] | select(.file_name == "falcon180b-1656.c")' FormAI-v2.json

The outpus is the same as ESBMC found : buffer overflow on line 67

{
  "category": "VULNERABLE",
  "file_name": "falcon180b-1656.c",
  "verification_finished": "yes",
  "vulnerable_line": 67,
  "column": 5,
  "function": "main",
  "violated_property": "\n  file falcon180b-1656.c line 67 column 5 function main\n",
  "stack_trace": "\n  c:@F@main\n  buffer overflow on scanf\n  0",
  "error_type": "buffer overflow on scanf",
  "code_snippet": "}\n\nint main() {\n    char url[MAX_URL_LENGTH];\n    printf(\"Enter a URL: \");\n    scanf(\"%s\", url);\n    sanitize_url(url);\n    printf(\"Sanitized URL: %s\\n\", url);\n    return 0;\n}",
  "source_code": "//Falcon-180B DATASET v1.0 Category: URL Sanitizer ; Style: puzzling\n#include <stdio.h>\n#include <stdlib.h>\n#include <string.h>\n#include <ctype.h>\n\n#define MAX_URL_LENGTH 2048\n\nint is_valid_url_char(char c) {\n    if (isalnum(c) || c == '-') {\n        return 1;\n    }\n    switch(c) {\n        case '.':\n        case '_':\n        case '~':\n        case '+':\n        case '=':\n        case '%':\n        case '?':\n        case '#':\n        case '&':\n        case '/':\n        case ':':\n        case ';':\n        case ',':\n        case '@':\n        case '!':\n        case '$':\n        case '*':\n        case '(':\n        case ')':\n        case '\"':\n            return 1;\n        default:\n            return 0;\n    }\n}\n\nint is_valid_url(char* url) {\n    int i = 0;\n    while (url[i]!= '\\0' && i < MAX_URL_LENGTH) {\n        if (!is_valid_url_char(url[i])) {\n            return 0;\n        }\n        i++;\n    }\n    return 1;\n}\n\nvoid sanitize_url(char* url) {\n    int i = 0;\n    while (url[i]!= '\\0' && i < MAX_URL_LENGTH) {\n        if (!is_valid_url_char(url[i])) {\n            url[i] = '%';\n            i++;\n        } else {\n            i++;\n        }\n    }\n    url[MAX_URL_LENGTH - 1] = '\\0';\n}\n\nint main() {\n    char url[MAX_URL_LENGTH];\n    printf(\"Enter a URL: \");\n    scanf(\"%s\", url);\n    sanitize_url(url);\n    printf(\"Sanitized URL: %s\\n\", url);\n    return 0;\n}",
  "num_lines": 71,
  "cyclomatic_complexity": 8.0
}

Guidelines for Getting Started with FormAI-v1

All C programs in this dataset were generated by GPT-3.5-turbo. From the entire dataset 109,757 sample files can be compiled using only the command 'gcc -lm'. However, the remaining 3% of samples are more complex, utilizing external libraries such as OpenSSL, sqlite3, and others. After installing the required dependencies, all files should compile without any issues.

The FormAI dataset is suitable for machine learning training, testing vulnerability detection softwares, and various tasks that require a substantial collection of C files.

In this guide, we illustrate the process of classifying the samples in the FormAI dataset and how to extract the C source code from the categorized csv files. This serves as an ideal foundation for those looking to build upon these findings in future research.

The validation of dataset compilation was conducted on a system running Ubuntu LTS version 22.04.03 with a Linux kernel version 5.19.0 and gcc 11.4.0

Start by obtaining the entire FormAI dataset from Github with the given command:

git clone https://github.com/FormAI-Dataset/FormAI-dataset/

The following sequence of commands includes an operating system upgrade and the installation of the build-essential package, a crucial requirement for the proper compilation of the dataset.
```
sudo apt update && sudo apt upgrade -y && sudo apt install build-essential -y
```

The subsequent step involves installing all the necessary dependencies to compile the 112,000 C files present in the dataset.

sudo apt install -y build-essential csvtool unzip libsqlite3-dev libssl-dev libmysqlclient-dev libpq-dev libportaudio2 portaudio19-dev libpcap-dev libqrencode-dev libsdl2-dev freeglut3-dev libcurl4-openssl-dev

Unzip the .C samples from the compressed file (this action will generate a DATASET directory containing all the files):
```
unzip FormAI-dataset/FormAI_dataset_C_samples-V1.zip && unzip FormAI-dataset/FormAI_dataset_classification-V1.zip
```
One can verify the successful installation of the dependencies by testing 1000 files from the dataset. This command will verify that the first 1000 files from the dataset compile successfully (this process usually takes around 2-3 minutes).
```
find DATASET -name "*.c" | head -n 1000 | xargs -I{} bash -c 'gcc {} -w -lcrypto -pthread -lsqlite3 -lmysqlclient -lpq -lssl -lportaudio -lpcap -lqrencode -lSDL2 -lglut -lGLU -lGL -lcurl -lm &>/dev/null &&  echo {}' | wc -l
```
NOTE: If the output is not equal to 1000, it indicates that there may be missing dependencies or that certain programs fail to compile, particularly in earlier versions of GCC. This situation can occur in Ubuntu 20.04 LTS, where the default GCC version is 9.4.0, which is older than 11.4.0. However, it's important to note that this should not impact the usability of the DATASET.
If the result is 1000 from the previous run, it indicates that all the tested files were compiled without issues. Our system is ready to use the provided files. To retrieve and compile the C code from a specific row in the FormAI_dataset.csv, use the command below (taking row number 3679 as a example):
```
gcc -x c <(csvtool drop 3679 FormAI_dataset.csv | csvtool -t ',' head 1 - | csvtool -t ',' col 3 - | sed 's/^"//; s/"$//' | sed 's/""/"/g') -lcrypto -pthread -lsqlite3 -lmysqlclient -lpq -lssl -lportaudio -lpcap -lqrencode -lSDL2 -lglut -lGLU -lGL -lcurl -lm -o output
```
Each program is categorized according to the vulnerabilities found in its code, using a formal verification technique that leverages the Efficient SMT-based Bounded Model Checker (ESBMC).

ESBMC is a mature, permissively licensed open-source context-bounded model checker for verifying single- and multithreaded C/C++, Kotlin, and Solidity programs. It can automatically verify predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions.

The Github page for ESBMC can be found at the following link: https://github.com/esbmc/esbmc ESBMC depends on numerous external tools and has a variety of intricate dependencies, from SMT solvers to more complex ones. Nevertheless, there's an option to utilize a pre-compiled binary version, which can be obtained using the command provided below:
```
wget https://github.com/esbmc/esbmc/releases/download/v7.3/ESBMC-Linux.zip && unzip ESBMC-Linux.zip && rm ESBMC-Linux.zip && chmod 777 bin/esbmc
```

If you wish to test ESBMC on an individual file from the FormAI dataset, use the command below (taking FormAI_14569.c as a reference):

bin/esbmc DATASET/FormAI_14569.c  --overflow --memory-leak-check --timeout 30 --unwind 1 --multi-property --no-unwinding-assertions

The file appears to be vulnerable, revealing a buffer overflow in the scanf() function. Below is the output from ESBMC:

  State 6 file FormAI_14569.c line 24 column 9 function main thread 0
  ----------------------------------------------------
  Violated property:
    file FormAI_14569.c line 24 column 9 function main
    buffer overflow on scanf
    0

  VERIFICATION FAILED

As a concluding action, you can confirm that this vulnerability is indeed present in the FormAI dataset by executing the command below:
```
cat FormAI-dataset/FormAI_dataset_human_readable-V1.csv | grep 'FormAI_14569.c'
```

The outpus is the same as ESBMC found : FormAI_14569.c,VULNERABLE,main,24.0,buffer overflow on scanf

Disclaimer

SHA256 checksum of the files:

FormAI-v1 dataset files:

FormAI_dataset_C_samples-V1.zip : fc458020ad0e9b0999882d4bfdd27edfdee2f0ff5de22848e11352d64230ca47 FormAI_dataset_classification-V1.zip : 82b00611d71ff992d85b7bba20ebece580bfd055939b0142861326dff17ede74
FormAI_dataset_human_readable-V1.csv : 316d7145006f48ec42c9a89b1dccb2c5e2c05012c926af65c0269aba45d47830

FormAI-v2 dataset files:

FormAI-v2-DATASET.7z : b33ed43d318b24e22a32e7f094deaea21e3acd1f5899e3e2725682e0deaaaddf
FormAI-v2-DATASET_without_clones.7z : 3098694cee2f4eb37e19d4c2bb7f0c2a5e053c50169a9a4679b1c3b4f6dd03c4
FormAI-v2-classification.7z.001 : 512a344251c1b86ecef85ef8e47d16cc7fec259be10c578af21cd6444561c9c0
FormAI-v2-classification.7z.002 : dd5c7901641e1d0262c37f23700bdb20a4896ceeec4d8e0e9ab8a1499008d4ab

WARNING: BE CAREFUL WHEN RUNNING THE COMPILED CODES, SOME CAN CONNECT TO THE WEB, SCAN YOUR LOCAL NETWORK, OR DELETE A RANDOM FILE FROM YOUR FILE SYSTEM. ALWAYS CHECK THE SOURCE CODE AND THE COMMENTS IN THE FILE BEFORE RUNNING IT!!!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

FormAI Dataset

Description

Cite

Architecture

The dataset

Guidelines for Getting Started with FormAI-v2

Guidelines for Getting Started with FormAI-v1

Disclaimer

About

Uh oh!

Releases

Packages

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
.DS_Store		.DS_Store
FormAI-v2-DATASET.7z		FormAI-v2-DATASET.7z
FormAI-v2-DATASET_without_clones.7z		FormAI-v2-DATASET_without_clones.7z
FormAI-v2-classification.7z.001		FormAI-v2-classification.7z.001
FormAI-v2-classification.7z.002		FormAI-v2-classification.7z.002
FormAI_dataset_C_samples-V1.zip		FormAI_dataset_C_samples-V1.zip
FormAI_dataset_classification-V1.zip		FormAI_dataset_classification-V1.zip
FormAI_dataset_human_readable-V1.csv		FormAI_dataset_human_readable-V1.csv
LICENSE		LICENSE
README.md		README.md

License

FormAI-Dataset/FormAI-dataset

Folders and files

Latest commit

History

Repository files navigation

FormAI Dataset

Description

Cite

Architecture

The dataset

Guidelines for Getting Started with FormAI-v2

Guidelines for Getting Started with FormAI-v1

Disclaimer

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages