SysGenSIM manual

From SysGenSIM
Revision as of 10:53, 15 June 2011 by Soranzo (Talk | contribs)

Jump to: navigation, search

SysGenSIM is a software package to simulate Systems Genetics (SG) experiments in model organisms, for the purpose of evaluating and comparing statistical and computational methods and their implementations for analyses of SG data (e.g., methods for Expression Quantitative Trait Loci (eQTL) mapping and network inference). SysGenSIM allows the user to select a variety of network topologies, genetic and kinetic parameters to simulate SG data (genotyping, gene expression and phenotyping) with large gene networks with thousands of nodes.

The data of the DREAM5 Systems Genetics Challenges, generated with SyGenSIM, are currently used as a benchmark for Systems Genetics data analysis.

The software is written in MATLAB® and provided with a user-friendly graphical user interface (GUI).

The graphical user interface

While experienced MATLAB® users can enter the code and modify the parameter values according to their choices (we have placed comments throughout the code to greatly facilitate such modifications), SysGenSIM provides a GUI which enables users to simply select the relevant parameter settings and perform simulations. The GUI is partitioned into multiple fields for

  1. specifying the gene networks;
  2. specifying the phenotypes;
  3. specifying the genetic map and genetic parameters;
  4. specifying the kinetic and noise parameters;
  5. specifying the output files;
  6. specifying the experiment settings (population sizes and number of experimental populations).

To start the GUI, in the MATLAB® shell execute the command:

sysgensim_gui

Sysgensim gui.png

Generating gene networks

Gene network gui.png

Network topology

Several choices are provided including these standard topology models:

  • Erdös-Renyi (ER) random graph (Erdös and Renyi, 1959), with equiprobable edge directions;
  • scale-free network (Barabasi and Albert, 1999), with hub’s edges generally directed outward (as it is more likely that hubs correspond to regulators regulating many than being a gene regulated by many): these networks are generated by sampling the out degree sequence from a power law and then nodes are randomly connected.

and several relevant novel topologies:

  • EIPO, which stands for exponential in-degree and power-law out-degree distributions, which is of particular interest as these distributions have been observed in real gene networks (Guelzim, et al., 2002). These networks are generated by sampling the in- and out-degree sequences from the distributions of interest and subsequently the nodes are connected according to their degrees;
  • Modular EIPO, which is similar to the EIPO but consists of modules (genes clustered into densely connected components) which is also an often observed property of biological networks (e.g., Barabasi and Oltvai, 2004; Hartwell, et al., 1999). These networks are created by generating a number (specified at Modules) of EIPO networks and then connecting them through edge rewiring (with a probability specified at Rewiring probability);
  • Modular ER, which are generated in the same way, but starting with ER modules instead of EIPO modules.

Network size

The size refers to the number of nodes (genes) in the network. On a powerful computer, a user may be able to choose up to 10,000 nodes, but 1000 gene networks will most certainly run very quickly on any reasonable PC.

Average node degree

Sets the average number of neighbors of a node in the networks. Selecting for example the value 4 means that a node will receive on average 2 inputs and 2 outputs. For the modular networks, a single, average degree common for all the modules can be specified, or an array of average degrees, one entry for each module of the network, can be provided here.

Sign assignment

The signs of the edges, determining which regulatory effects are activating or inhibiting, can be assigned randomly either node-wise or edge-wise.

With edge-wise assignment, each edge is randomly given a sign, independent of the other edges.

With node-wise assignment all the outgoing edges of each node are given the same sign. Through this approach nodes are either activators or inhibitors, which might be more realistic than the edge-wise approach.

Sign probability

Sets the probability for an edge to have a positive sign; hence the probability for an edge to have a negative sign is 1 minus this probability.

Modules

This only applies when having selected either the Modular EIPO or Modular ER. When selecting a number of M modules, the network is partitioned into M components of equal size. In addition one can select a series of module sizes to be used, for example when considering a 100-gene network, one could enter the following series of numbers in the field: 50 25 10 10 5, resulting in a network with 4 modules, one with 50 genes, one with 25 genes, two with 10 genes and a small one of 5 genes. Make sure that your series sums up to the total number of genes in the network!

Rewiring probability

This field only applies when the chosen network topology is Modular EIPO or Modular ER. The rewiring probability determines how modular the final network will be. The extremes of 0 and 1 will result in a network of completely disconnected modules and a network without any modular structure, respectively. We recommend values between 0.1 and 0.2, as in our investigations we have observed that by using these values the modularity Q is between 0.3 and 0.4, similar to those observed for several biological networks (Girvan and Newman, 2002).

Costume network

Here the user can upload a network, either by typing the path or browsing to the location on the disk. To be able to upload the network, one must select Load gene network in Network topology. The network must be provided in the format of an edge list corresponding to a directed network, followed by the signs of the edges, as follows:

(source node) (target node) (sign)

For example:

G1 G2 -1
G1 G3 1
G2 G1 1

The user must use G to denote genes! This toy example results in a network in which G1 is an inhibitor (-1) of G2, and an activator (1) of G3. G2 is an activator (1) of G1.

If there is no knowledge about the signs of the edges in the considered network, the file should contain only the first two columns:

G1 G2
G1 G3
G2 G1

In this case then the signs are internally assigned according to the selections for Sign assignment and Sign probability.

It is also possible to load a gene-phenotype network: one must select Load gene-phenotype network in Network topology. The network must be provided in the format of an edge list corresponding to a directed network, with or without signs of the edges, as follows:

(source node) (target node) (sign)

For example:

G1 P1 -1
G1 G3 1
P1 G1 1

The user must use G for genes and P for phenotypes!

Adding macroscopic phenotypes

Phenotype nodes

Here you select the number of continuous macroscopic phenotypes to be added to the network.

Direct causal genes

Here you select the number of genes which have directed inputs into the phenotype node(s).

Direct reactive genes

Here you select the number of genes which are directly affected by phenotype node(s).


Please note that this strategy of including one or more phenotypes is only a first provisional approach. SysGenSIM is under continuous development, and optimal inclusion of continuous and discrete phenotypes is a high priority.

Generating genetics

Marker positions

Choose to let SysGenSIM generate a genetic map based on specified parameters or load a custom map.

RIL type

SysGenSIM provides the two recombinant inbred line (RIL) types: selfing and sibling mating. For plants, RILs are typically produced by selfing, while for animal models (mouse) they are produced by brother-sister matings. This selection will change the way in which the genetic map distances selected at Distances (cM) are converted into probabilities of neighboring loci to have the same or different parental origin: Failed to parse (Missing <code>texvc</code> executable. Please see math/README to configure.): p(x_i = x_{i-1}) = \tfrac{1}{1 + 2r} for selfing, and Failed to parse (Missing <code>texvc</code> executable. Please see math/README to configure.): p(x_i = x_{i-1}) = \tfrac{1 + 2r}{1 + 6r} for sibling mating, where r is the recombination rate.

Mapping function

To convert map distance to recombination rates, the user can select either Haldane's Failed to parse (Missing <code>texvc</code> executable. Please see math/README to configure.): r = \tfrac{1}{2} (1 - e^{-2d}) or Kosambi's mapping function Failed to parse (Missing <code>texvc</code> executable. Please see math/README to configure.): r = \tfrac{1}{2} \left( \tfrac{e^{4d} - 1}{e^{4d} + 1} \right) .

Chromosomes

This is simply the number of chromosomes in the genome of the organism considered for simulation.

Markers per chromosome

Here the mean number of genetic markers genes per chromosome and its standard deviation are specified. The chromosomes are assigned a particular number of markers genes by sampling from a Gaussian distribution (see below for explanation).

Distances (cM)

Here the mean distance (in centimorgan, cM) between two genetic markers, and its standard deviation are specified. All distances between markers are assigned by sampling from a Gaussian distribution. The length of each chromosome will thus depend on the sampling of actual distances (as well as the numbers of markers/genes assigned). The length of chromosomes is thus not directly specifiable (except for user-defined maps).

Gene positions

SysGenSIM provides two options to generate the map positions of the genes (relative to the map positions of the markers):

  • Place genes at markers: the functional DNA variants in the network genes are identical to the set of DNA variants genotyped in the population. If the user specifies network size to be G = 2000 genes, for example, then under this option 2000 DNA variants will be generated according to the genetic map parameters specified by the user. In the current version of SysGensSIM, the association between the G DNA variant locations on the map and the positions of the G genes in the network is random. If you select this option the mean Markers per chromosome will be automatically set as Network size/Chromosomes. In this case the number of genes/markers on last chromosome is not sampled but set such that the total number of genes/markers equals Network size;
  • Distribute genes uniformly: the user specifies genetic map parameters or provides a map for a set of measured variants. A grid of evenly spaced gene positions (0.1 cM apart) is created, and a number of genes (i.e. functional variant), set at Network size, are randomly assigned to positions in this grid;
  • Custom genetic map: When you select Load at Marker positions you will be able to load your own genetic map. The input file must have the following format:
(chromosome) (marker i) (marker i+1) (cM distance)

For example:

C1 M1 M2 5
C1 M2 M3 4
C2 M21 M22 4
C2 M22 M23 7
...

The gene positions can then be generated by selecting Place genes at markers or Distribute genes uniformly.

Alternatively, the user could also provide a Marker-Gene map in which the positions of genes (i.e. functional variants) relative to the markers are also specified. One must select Load at Marker positions.

(chromosome) (marker i or gene j) (marker i+1 or gene j+1) (cM distance)

For example:

C1 M1 G1 2
C1 G1 M2 3
C1 M2 G2 1
C1 G2 G3 2
C1 G3 M3 2
...

The user must use G for genes and M for markers!

Cis-effect percentage

The variant is located either in the gene's promoter region affecting its own transcription rate (cis-variant, resulting in a less efficient transcription process, modeled through Zgc in equation (1) below), or in the coding region of a regulatory gene altering the strength of its regulatory effect (trans-variant, resulting in a less potent inhibition/activation, modeled through Zkt in equation (1) below). Here the percentage of loci having a cis effect is specified; hence the percentage of loci to have a trans effect is 100 minus this percentage.

The allelic effects of the loci are determined by the parameters Z (see equation (1) below). When the allele of the reference parent is present, then the corresponding Z = 1, while in presence of the other parent’s allele the Z is reduced. The value for each Z is assigned to each locus by sampling from a uniform distribution with boundaries to be specified in Z lower and Z upper bounds (lower and upper bound can be identical resulting in constant Z). Note that which parent is the reference parent differs among loci, and is randomly assigned, so that loci are more relatively active in one parent or the other (this information can be retrieved by selecting Genotype information in the Output files).

Genotyping error percentage

As genotyping data is subject to error (e.g. wrong SNP calling), a percentage of wrongly called genotype values can be set. After the simulation of the genotype data, these frequencies of genotypes are changed from 0 to 1 or from 1 to 0.

Selecting the kinetic and noise parameters

Steady-state gene expression profiles, based on topology and genotypes, are simulated with nonlinear ordinary differential equations (ODEs).The rate law used in SysGenSIM for transcription is not based on any explicit biochemical mechanism, but displays two main features of biochemical kinetics: saturation and cooperativity (Mendes, et al., 2003). We assume that RNA decay is a first order process.

Failed to parse (Missing <code>texvc</code> executable. Please see math/README to configure.): \frac{\operatorname{d}G_g}{\operatorname{d}t} = Z_g^c \cdot V_g \cdot \theta_g^{syn} \cdot \prod_k \left( 1 + A_{k,g} \frac{G_k^{h_{k,g}}}{G_k^{h_{k,g}} + (K_{k,g} / Z_k^t)^{h_{k,g}}} \right) - \lambda_g \cdot \theta_g^{deg} \cdot G_g       (1)

where Gg is mRNA concentration (gene-activity, gene expression levels) of gene g, Vg is its basal transcription rate, and λg is the degradation rate constant. The Gk are expression levels of genes which have directed edges into node Gg. Kk,g is the interaction strength, a Michaelis constant, representing the concentrations of input gene k at which its effect on the transcription rate of gene g is half of its maximum effect, hk,g is a cooperativity coefficient, and Ak,g is an element of matrix A encoding the signed network structure (Ak,g = -1 for inhibitor, Ak,g = 1 for activator, Ak,g = 0 for no effect). The biological variance parameters θgsyn and θgdeg represent non-genetic additional biological noise in the transcription and degradation rates, respectively, and are sampled from a normal distribution with mean 1 and user specified standard deviations prior to the calculation of each steady state. Zgc and Zkt are parameters that incorporate effects of DNA variants (specified in Genetic Parameters).

The values for parameters θgsyn and θgdeg, the experimental noise level (see below) and the Z values could be chosen such that the estimated heritabilities of the expression profiles are close to those found in real data. For example, in our previous work (Liu, et al., 2008), the simulated expression traits had an average heritability of 56%, close to what was observed in a yeast Systems Genetics experiment (Brem and Kruglyak, 2005). There we used a Transcription Synthesis Biological Variance standard deviation of 0.1 and Degradation Biological Variance standard deviation of 0 (no variation) and Z values were either 0.75 or 1 (no uniform Z range). Multiplicative gene expression measurement noise is added to the calculated gene expression steady state values to reflect the inaccuracy of typical experimental gene expression data.

To generate values for all kinetic and noise parameters in equation (1), there are four choices:

  • Constant, which assigns to all parameters of a certain type the exact same value (parameter #1);
  • Uniform, which samples the values from a uniform distribution with specifiable lower (parameter #1) and upper (parameter #2) boundaries;
  • Gaussian, which samples the values from a Gaussian distribution with specifiable mean (parameter #1) and standard deviation (parameter #2). Values lower than 0.1 are resampled;
  • Gamma, which samples the values from a Gamma distribution with specifiable scale (parameter #1) and shape (parameter #2). Values for parameter h are sampled from a Gamma distribution shifted by 1, to avoid values smaller than 1 (they would not have any physical meaning).

Selecting the output

Output files

The simulated data is written to files by selecting the Genotype Matrix, Gene Expression Matrix and Phenotype Matrix. All matrices are n × m, with n the number of markers, genes or phenotypes and m the number of individuals/lines. Also the network structure can be output by selecting Edge List. This output file has the same format as described above in Custom network. The Genetic Map can be selected as output and has the same format as described above in Custom Genetic Map.


If you are interested in visualizing your network and performing some network analyses, you can select to output a Pajek Network File, which you can load with the software Pajek. The Module List contains lists of genes pertaining to each module. The Topological Properties file provides some standard topological information, for example the in and out degree of each node, the network component each node belongs to (strongly connected, in- or out-component, tendrils or tubes) and other indices. In Genotype Information all the available information about the DNA variants is provided, including the chromosome number and location, and for functional variants the trans/cis status, the value of the corresponding Z parameter and which genotype (parental origin) was assigned the Z = 1 value.

The Simulation Summary file contains the chosen values of all simulation parameters, so this file should always be saved!

Output figures

SysGenSIM enables the user to create several figures to visualize properties of the simulated data. Node Degree Distributions provides a plot of the in- and out-degree distributions of the network. Parameter distributions provides plots of the distributions of the actual parameter values used in the simulation.

Gene Expression Distributions provides the distributions of gene expression values, their means and their variances.


The figure above shows the distributions of gene expression values, gene-specific variances, and means in simulated data (1000- gene scale-free network). To compare to real data we provide here a similar plot for gene expression data from a SG experiment in A. thaliana (West, et al., 2007). As in the A. thaliana data, the variance and mean distributions are fit well by a power-law.

Gene Correlation distributions: This option provides several plots: a plot to display the distributions of all etrait-etrait correlations and the distribution of etrait-etrait correlations of adjacent genes in the network and a plot showing the co-expression network node degree distributions at several correlation thresholds.

Additional plots display the distributions of all etrait-phenotype correlations and the distribution of etrait-phenotype correlations of genes adjacent to the phenotypes in the network.


The minimum, maximum, mean and median values of the correlations are written to a text file.

The Heritability Distribution displays the distribution of the gene expression heritabilities. Heritability of each gene expression level is calculated as its variance over the experiment without Biological variance and Measurement Error (hence the variance is only due to the genetics, i.e. due to varying the Zs) divided by its variance over the experiment with Biological variance and Measurement Error (the total variance, due to genetics as well as additional biological and experimental noise). When this option is selected, also a text file with the heritabilities for each phenotype and gene is provided.

This figure shows the distribution of the heritabilities of the expression traits in simulated data (1000 gene scale-free network) at different levels of non-genetic Biological Variance (BAsigma). Changing the parameter value for the Biological Variance can be used (in combination with Measurement Error) to achieve a distribution of heritabilites which is more similar to distributions observed in real SG data.

Running the simulations

Population Size: This value is the number of individuals (i.e. lines) in the RIL population. Existing RIL populations consist of tens (mouse; (Schadt, et al., 2003)), or around a hundred (yeast; (Brem and Kruglyak, 2005)), or several hundred lines (Arabidopsis (West, et al., 2007) Soybean (Zhou, et al., 2008)) .

The entire experiment can be repeated m times. SysGenSIM will simply run through all selected settings again.

Finally, pressing the button Run SysGenSIM will start the data generation.


Enjoy!

Andrea, Nicola, Ina and Alberto

Funding

This work was partly supported by The Regional Authorities of Sardinia to A. de la Fuente and NIH grant 1R01HG005254-01 to I. Hoeschele.

References

Barabasi, A.L. and Albert, R. (1999) Emergence of scaling in random networks. Science, 286, 509-512.
Barabasi, A.L. and Oltvai, Z.N. (2004) Network biology: understanding the cell's functional organization. Nat. Rev. Genet., 5, 101-113.
Brem, R.B. and Kruglyak, L. (2005) The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc. Natl. Acad. Sci. USA, 102, 1572-1577.
Erdös, P. and Renyi, A. (1959) On Random Graphs. Publ. Math. Debrecen, 6, 290–297.
Girvan, M. and Newman, M. E. (2002) Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99, 7821-7826.
Guelzim, N., et al. (2002) Topological and causal structure of the yeast transcriptional regulatory network. Nat. Genet., 31, 60-63.
Hartwell, L.H., et al. (1999) From molecular to modular cell biology. Nature, 402, C47-52.
Liu, B., de la Fuente, A. and Hoeschele, I. (2008) Gene network inference via structural equation modeling in genetical genomics experiments. Genetics, 178, 1763-1776.
Mendes, P., Sha, W. and Ye, K. (2003) Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics, 19 Suppl 2, ii122-129.
Schadt, E.E., et al. (2003) Genetics of gene expression surveyed in maize, mouse and man. Nature, 422, 297-302.
West, M.A., et al. (2007) Global eQTL mapping reveals the complex genetic architecture of transcript-level variation in Arabidopsis. Genetics, 175, 1441-1450.
Zhou, L., et al. (2008) Dissecting soybean resistance to Phytophthora by QTL analysis of host and pathogen expression profiles. International Plant and Animal Genome Conference XVI. San Diego.