Modeling Gene Regulatory Relationships to Investigate Phenotypic Differences Between Strains

In biotechnology it is often important to understand how highly similar organisms exhibit substantial differences in phenotype.  Whether in strain development, securing IP, or process development, getting a clear picture of how small genetic modifications affect behavior can be crucial.

We often study closely related organisms in which an experimental strain is derived from a wild strain by gene modification or random mutation, such as a crop plant strain that demonstrates an improved phenotype. Inspecting genome sequences alone is usually not sufficient to discover why the developed strain exhibits the more favorable phenotype: unexpectedly large differences in phenotype emerge even when the genomes are almost identical. What explains this?

Phenotypic differences cannot be traced back solely to changes at genetic loci; they are mediated by changes in regulatory networks and their dynamics. Minor changes in the gene regulation network, such as a single missing activation or repression, can snowball into dramatically different expression across many genes. In such a landscape, it is vital to understand the dynamics of gene expression and to model the regulation of important alleles.

At Mimetics, we study the dynamics of gene expression and have developed tools for elucidating the dynamic relationships between genes, including identifying which genes activate or repress other genes and in what sequence. As a simple example, consider the regulatory relationship between two genes, Gene A and Gene B. Suppose our hypothesis is that Gene A activates Gene B in one or both strains, and we want to test it.

Network inference algorithms can analyze the expression patterns of the two genes and compute a “score” for this relationship. Note that even in this simplest case, we must analyze this score in the context of the other regulatory relationships in the network and across both strains, but here we give an example for a single regulation and strain.

To test our hypothesis, we deploy Mimetics’ network inference pipeline, which produces a score of 15.1 for the relationship Gene A → Gene B in the first strain.

Here, lower scores indicate higher strength of the inferred regulatory relationship, but how strong is this score exactly? We cannot directly compare scores between different strains because they depend on the shape of the target curve as well as the surrounding gene network. Thus, Gene A → Gene B in a different strain may be less likely than in the first, even if it has a lower score. We can derive some information by comparing to other scores in the same network, but these results may be biased for various reasons; for example, it can make a big difference whether Gene B has a large number of activators or whether it has few or none.

We need to convert the initial score to a normalized score that can be compared objectively across genes and strains. To this end, we generate several hundred random curves. These curves are composed by combining segments of real gene expression values from the dataset to create randomized novel curves. We tested these curves to ensure that they closely mimic and are representative of the expression patterns in the dataset. We compute each random curve’s activation score for Gene B and construct the score distribution.
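The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mimetics' pipeline: the function names are hypothetical, `placeholder_score` is a simple stand-in for the real inference score (it only preserves the convention that lower means stronger), and the choices to use three segments per curve and to keep each segment at its original time position are assumptions made for the example.

```python
import random

def make_random_curve(curves, n_segments, rng):
    """Splice contiguous segments of real expression curves into one novel curve."""
    length = len(curves[0])
    # Pick cut points that split the time axis into n_segments pieces.
    cuts = sorted(rng.sample(range(1, length), n_segments - 1))
    bounds = [0] + cuts + [length]
    curve = []
    for start, end in zip(bounds, bounds[1:]):
        donor = rng.choice(curves)      # a real expression curve from the dataset
        curve.extend(donor[start:end])  # assumption: segment keeps its time position
    return curve

def placeholder_score(regulator, target):
    """Stand-in score, NOT the real one: mean squared difference after
    min-max scaling. Lower values mean a closer match (stronger relationship)."""
    def scale(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    r, t = scale(regulator), scale(target)
    return sum((a - b) ** 2 for a, b in zip(r, t)) / len(t)

def background_scores(curves, target, n_random=500, n_segments=3, seed=0):
    """Score several hundred random curves against the target gene."""
    rng = random.Random(seed)
    return [
        placeholder_score(make_random_curve(curves, n_segments, rng), target)
        for _ in range(n_random)
    ]
```

Calling `background_scores(curves, gene_b_curve)`, where `curves` is a list of equal-length expression time series and `gene_b_curve` is the target's series, returns the list of scores that forms the background distribution.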

In this example, the score for Gene A → Gene B is stronger than the scores of 97.2% of the tested random curves (recall that lower scores indicate a stronger relationship). Computing the CDF from the distribution of random-curve scores, we can refine our confidence to 98.2%.
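The two numbers correspond to two ways of reading the same background distribution: counting random curves directly, and evaluating a smooth CDF fitted to their scores. A minimal sketch follows. The function names are hypothetical, and the use of a normal distribution for the fit is an assumption made for illustration; the text does not say which distribution is used.

```python
from statistics import NormalDist, mean, stdev

def empirical_confidence(observed, background):
    """Fraction of random-curve scores that are weaker (higher) than the observed score."""
    return sum(s > observed for s in background) / len(background)

def fitted_confidence(observed, background):
    """Same quantity read from a CDF fitted to the background scores.
    Assumption: the background is modeled as a normal distribution."""
    dist = NormalDist(mean(background), stdev(background))
    # Lower scores are stronger, so confidence is the upper-tail mass.
    return 1.0 - dist.cdf(observed)
```

The empirical value can only change in steps of one over the number of random curves, whereas the fitted CDF gives a continuous estimate, which is one plausible reason the two figures differ slightly.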

This confidence score can help determine whether Gene A is unique in its ability to regulate Gene B or if it is among several strong candidates in our regulatory model. Further, this confidence score can be used to objectively determine the change in strength for this regulatory edge between strains. If there is a sufficient change in confidence between strains with and without a particular phenotype, then (along with other factors) the regulation Gene A → Gene B is a candidate for phenotypic differences between those strains. This allows us to test our hypothesis of whether Gene A regulates Gene B in both strains.

Achieving reliable confidence scores as discussed here is a necessity for accurate comparison of regulatory networks between different strains. A background distribution allowing for precise confidence estimation enables us to deliver data to clients that demonstrates the magnitude of change for regulator/target genes and helps to prioritize the actionable models that the analysis delivers. When studying similar strains, we can detect differences in the gene regulatory networks by selecting for those regulatory relationships with a high confidence score in one strain and a low score in the other. These differences in regulatory relationships help us to pinpoint the origins of altered gene expression and uncover the causal connection between genome variation and altered phenotypes.
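Once every edge has a normalized confidence in each strain, the selection step described above reduces to a simple filter. The sketch below assumes confidences are stored in dictionaries keyed by (regulator, target) pairs; the function name and the threshold values of 0.95 and 0.50 are hypothetical choices for illustration, not values taken from the analysis.

```python
def differential_edges(conf_a, conf_b, high=0.95, low=0.50):
    """Find edges with high confidence in one strain and low confidence in the other.

    conf_a, conf_b: dicts mapping (regulator, target) -> confidence in [0, 1].
    Thresholds are illustrative assumptions.
    """
    hits = []
    # Only edges scored in both strains can be compared.
    for edge in sorted(conf_a.keys() & conf_b.keys()):
        a, b = conf_a[edge], conf_b[edge]
        if (a >= high and b <= low) or (b >= high and a <= low):
            hits.append((edge, a, b))
    # Report the largest change in confidence first.
    hits.sort(key=lambda h: abs(h[1] - h[2]), reverse=True)
    return hits
```

Edges returned by such a filter are candidates only; as noted earlier, each still has to be interpreted in the context of the surrounding network.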