SysGenSIM - Benchmark datasets

Pula-Magdeburg single-gene knockout benchmark dataset

A collection of single-gene knockout datasets has been produced as a benchmark for the network inference algorithms described in the paper "Reconstruction of large-scale regulatory networks based on perturbation graphs and transitive reduction: improved methods and their performance", BMC Systems Biology 2013, 7:73. The compendium consists of 270 datasets, simulated from 30 different 5000-gene networks under 9 noise configurations each.

For each network, a compressed archive (between 563 MB and 691 MB in size) is available for download. Each archive contains 19 files:

The list of unsigned edges encoding the directed interactions in the 5000-gene network.
The wild-type gene expression values, one file for each of the 9 noise configurations.
The matrix of expression values after the single-gene knockout of all genes in the network, one file for each of the 9 noise configurations.
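The file count follows from the layout above: one edge list, plus one wild-type file and one knockout matrix for each of the 9 noise configurations. The following Python sketch only illustrates that bookkeeping. The file names are hypothetical placeholders, since the actual names inside the archives are not stated here.

```python
# Hypothetical file names: the real names inside each archive may differ.
N_NETWORKS = 30
N_NOISE_CONFIGS = 9


def expected_files(n_noise=N_NOISE_CONFIGS):
    """Return the files one archive should contain under the layout described above."""
    files = ["edges.txt"]  # one edge list per network
    for k in range(1, n_noise + 1):
        files.append(f"wildtype_noise{k}.txt")  # wild-type expression values
        files.append(f"knockout_noise{k}.txt")  # single-gene knockout matrix
    return files


files_per_archive = len(expected_files())
datasets_total = N_NETWORKS * N_NOISE_CONFIGS

print(files_per_archive)  # 19
print(datasets_total)     # 270
```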

The data can be downloaded from these links:

Network 1 (666 MB)
Network 2 (662 MB)
Network 3 (689 MB)
Network 4 (682 MB)
Network 5 (687 MB)
Network 6 (690 MB)
Network 7 (691 MB)
Network 8 (669 MB)
Network 9 (679 MB)
Network 10 (685 MB)

Network 11 (673 MB)
Network 12 (663 MB)
Network 13 (653 MB)
Network 14 (674 MB)
Network 15 (610 MB)
Network 16 (668 MB)
Network 17 (679 MB)
Network 18 (683 MB)
Network 19 (657 MB)
Network 20 (680 MB)

Network 21 (664 MB)
Network 22 (665 MB)
Network 23 (654 MB)
Network 24 (563 MB)
Network 25 (677 MB)
Network 26 (681 MB)
Network 27 (661 MB)
Network 28 (643 MB)
Network 29 (653 MB)
Network 30 (635 MB)

The above SysGenSIM and DREAM4 networks have been used to evaluate this collection of scripts for network inference (a small bug was fixed on August 7th, 2013). The algorithms can also be easily adapted to reverse-engineer other gene networks from perturbation data.

StatSeq benchmark dataset

The StatSeq compendium consists of 72 datasets derived from 9 different in silico gene networks, each simulated under 8 different parameter settings, in order to investigate the performance of inference algorithms across various network sizes, population sizes, marker distances, and heritabilities. All datasets were simulated with SysGenSIM 1.0.2.

The networks differ in size (100, 1000, and 5000 genes), and each contains a large strongly connected component.

More detailed information about the compendium and the evaluation of predictions is available here: StatSeq dataset description.

For each dataset, the gold standard network and the simulated gene expression and genotype data are available for download:

100-gene networks (6.3 MB)
1000-gene networks (62.7 MB)
5000-gene networks (311.3 MB)
Median value of the heritability for each dataset

Predictions can be evaluated with the following MATLAB script: Evaluation script and gold standard networks (148.8 KB).
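As a rough illustration of what comparing predictions against a gold standard involves, here is a minimal Python sketch that computes precision and recall for a set of predicted directed edges. It is not the MATLAB script linked above, which may compute different scores. The representation of edges as (regulator, target) tuples and the function name `precision_recall` are assumptions made for this example.

```python
def precision_recall(predicted_edges, gold_edges):
    """Compare predicted directed edges against gold standard edges.

    Both arguments are iterables of (regulator, target) tuples; this
    representation is an assumption, not the format of the benchmark files.
    """
    predicted = set(predicted_edges)
    gold = set(gold_edges)
    true_positives = len(predicted & gold)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example with made-up gene names.
gold = [("G1", "G2"), ("G2", "G3"), ("G3", "G1")]
predicted = [("G1", "G2"), ("G2", "G1"), ("G3", "G1"), ("G1", "G3")]

precision, recall = precision_recall(predicted, gold)
print(precision)  # 0.5
print(recall)
```

Edge direction matters in this sketch: a predicted edge G2 -> G1 does not count as a match for the gold standard edge G1 -> G2.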

DREAM5 benchmark dataset

The DREAM5 SysGenA compendium is a collection of simulated datasets produced for the DREAM5 Systems Genetics In-silico Network subchallenge in 2010. The aim of the subchallenge was to reverse-engineer gene networks from systems genetics data.

The whole dataset has been simulated with a preliminary version of SysGenSIM.

The compendium consists of 15 datasets, corresponding to 15 different 1000-gene in silico networks, each with simulated gene expression and genotype data. Specifically, 5 networks have data for 100 recombinant inbred lines (RILs), 5 networks for 300 RILs, and 5 networks for 999 RILs. More information is available in the challenge description.

Data download:

Gene expression and genotype data (29.7 MB): from DREAM5 website (registration required) or from CRS4 mirror
Gold standard networks (178.3 KB)

Predictions were evaluated by calculating various scores (described here) with the following MATLAB script: Evaluation script.