The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs

Claire Le Goues; Neal Holtschulte; Edward K. Smith; Yuriy Brun; Premkumar Devanbu; Stephanie Forrest; Westley Weimer

doi:10.1109/TSE.2015.2454513

Research output: Contribution to journal › Article › peer-review

198 Scopus citations

Abstract

The field of automated software repair lacks a set of common benchmark problems. Although benchmark sets are used widely throughout computer science, existing benchmarks are not easily adapted to the problem of automatic defect repair, which has several special requirements. Most important of these is the need for benchmark programs with reproducible, important defects and a deterministic method for assessing if those defects have been repaired. This article details the need for a new set of benchmarks, outlines requirements, and then presents two datasets, ManyBugs and IntroClass, consisting between them of 1,183 defects in 15 C programs. Each dataset is designed to support the comparative evaluation of automatic repair algorithms asking a variety of experimental questions. The datasets have empirically defined guarantees of reproducibility and benchmark quality, and each study object is categorized to facilitate qualitative evaluation and comparisons by category of bug or program. The article presents baseline experimental results on both datasets for three existing repair methods, GenProg, AE, and TrpAutoRepair, to reduce the burden on researchers who adopt these datasets for their own comparative evaluations.

Original language	English (US)
Article number	7153570
Pages (from-to)	1236-1256
Number of pages	21
Journal	IEEE Transactions on Software Engineering
Volume	41
Issue number	12
DOIs	https://doi.org/10.1109/TSE.2015.2454513
State	Published - Dec 1 2015
Externally published	Yes

Keywords

Automated program repair
INTROCLASS
MANYBUGS
benchmark
reproducibility
subject defect

ASJC Scopus subject areas

Software

Access to Document

10.1109/TSE.2015.2454513

Cite this

@article{de92a4e2e84344a4a7b6b8c25c754917,

title = "The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs",

abstract = "The field of automated software repair lacks a set of common benchmark problems. Although benchmark sets are used widely throughout computer science, existing benchmarks are not easily adapted to the problem of automatic defect repair, which has several special requirements. Most important of these is the need for benchmark programs with reproducible, important defects and a deterministic method for assessing if those defects have been repaired. This article details the need for a new set of benchmarks, outlines requirements, and then presents two datasets, ManyBugs and IntroClass, consisting between them of 1,183 defects in 15 C programs. Each dataset is designed to support the comparative evaluation of automatic repair algorithms asking a variety of experimental questions. The datasets have empirically defined guarantees of reproducibility and benchmark quality, and each study object is categorized to facilitate qualitative evaluation and comparisons by category of bug or program. The article presents baseline experimental results on both datasets for three existing repair methods, GenProg, AE, and TrpAutoRepair, to reduce the burden on researchers who adopt these datasets for their own comparative evaluations.",

keywords = "Automated program repair, INTROCLASS, MANYBUGS, benchmark, reproducibility, subject defect",

author = "{Le Goues}, Claire and Holtschulte, Neal and Smith, {Edward K.} and Brun, Yuriy and Devanbu, Premkumar and Forrest, Stephanie and Weimer, Westley",

note = "Publisher Copyright: {\textcopyright} 2015 IEEE.",

year = "2015",

month = dec,

day = "1",

doi = "10.1109/TSE.2015.2454513",

language = "English (US)",

volume = "41",

pages = "1236--1256",

journal = "IEEE Transactions on Software Engineering",

issn = "0098-5589",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "12",

}

TY - JOUR

T1 - The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs

AU - Le Goues, Claire

AU - Holtschulte, Neal

AU - Smith, Edward K.

AU - Brun, Yuriy

AU - Devanbu, Premkumar

AU - Forrest, Stephanie

AU - Weimer, Westley

PY - 2015/12/1

Y1 - 2015/12/1

AB - The field of automated software repair lacks a set of common benchmark problems. Although benchmark sets are used widely throughout computer science, existing benchmarks are not easily adapted to the problem of automatic defect repair, which has several special requirements. Most important of these is the need for benchmark programs with reproducible, important defects and a deterministic method for assessing if those defects have been repaired. This article details the need for a new set of benchmarks, outlines requirements, and then presents two datasets, ManyBugs and IntroClass, consisting between them of 1,183 defects in 15 C programs. Each dataset is designed to support the comparative evaluation of automatic repair algorithms asking a variety of experimental questions. The datasets have empirically defined guarantees of reproducibility and benchmark quality, and each study object is categorized to facilitate qualitative evaluation and comparisons by category of bug or program. The article presents baseline experimental results on both datasets for three existing repair methods, GenProg, AE, and TrpAutoRepair, to reduce the burden on researchers who adopt these datasets for their own comparative evaluations.

KW - Automated program repair

KW - INTROCLASS

KW - MANYBUGS

KW - benchmark

KW - reproducibility

KW - subject defect

UR - http://www.scopus.com/inward/record.url?scp=84961576983&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84961576983&partnerID=8YFLogxK

U2 - 10.1109/TSE.2015.2454513

DO - 10.1109/TSE.2015.2454513

M3 - Article

AN - SCOPUS:84961576983

SN - 0098-5589

VL - 41

SP - 1236

EP - 1256

JO - IEEE Transactions on Software Engineering

JF - IEEE Transactions on Software Engineering

IS - 12

M1 - 7153570

ER -
