Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 24 additions & 7 deletions docs/functional-networks.rst
Original file line number Diff line number Diff line change
@@ -1,24 +1,31 @@
Tissue-specific Networks
===========================
In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, probabilistically weighted and integrated, focused on a particular tissue or process context.
In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, weighted and integrated, focused on a particular tissue, cell, or process context.

It is important to consider gene relationships within a tissue context as the precise actions of genes are frequently dependent on their tissue context, and human diseases result from the disordered interplay of tissue- and cell lineage–specific processes. These factors combine to make the understanding of tissue-specific gene functions, disease pathophysiology and gene-disease associations particularly challenging.
It is important to consider gene relationships within a tissue or cell type as the precise actions of genes are frequently dependent on their context, and human diseases result from the disordered interplay of tissue- and cell lineage–specific processes. These factors combine to make the understanding of tissue-specific gene functions, disease pathophysiology and gene-disease associations particularly challenging.

Tissue-specific network construction is described in the following publication: Greene, C. S., Krishnan, A., Wong, A. K., Ricciotti, E., Zelaya, R. A., Himmelstein, D. S., ... & Troyanskaya, O. G. (2015). `Understanding multicellular function and disease with human tissue-specific networks <https://www.nature.com/articles/ng.3259>`_. Nature Genetics.

Method
---------------------------
Briefly, functional integration relies on the construction of process-specific functional relationship networks. These are interaction networks in which each node represents a gene, each edge a functional relationship, and an edge between two genes is probabilistically weighted based on experimental evidence relating to those genes. We integrate evidence from many data sets, with each data set weighted in a process-specific manner.
Briefly, functional integration relies on the construction of process-specific functional relationship networks. These are interaction networks in which each node represents a gene, each edge a functional relationship, where an edge between two genes is a probability based on experimental evidence relating to those genes. We integrate evidence from many data sets, with each data set weighted in a process-specific manner.

One naïve Bayesian classifier is trained per biological area of interest (e.g. a tissue, or a specific biological process), using the appropriate gold standard for the biological context in addition to one global process-unaware classifier trained using the complete gold standard. Each classifier consisted of a class node predicting the binary presence or absence of a functional relationship (FR) between two genes and n nodes conditioned on FR, each representing the value of a data set.
For GIANT, one naïve Bayesian classifier is trained per biological area of interest (e.g. a tissue, or a specific biological process), using the appropriate gold standard for the biological context in addition to one global process-unaware classifier trained using the complete gold standard. Each classifier consisted of a class node predicting the binary presence or absence of a functional relationship (FR) between two genes and n nodes conditioned on FR, each representing the value of a data set.

Parameter regularization is performed as described in `Steck and Jaakkola (2002) <https://proceedings.neurips.cc/paper_files/paper/2002/file/1819932ff5cf474f4f19e7c7024640c2-Paper.pdf>`_ using mutual information between data sets to estimate a strength of prior belief for each data set. While a large amount of shared information does not guarantee a redundant data set, since the same subset of information could be shared many times, it provides a valuable quantitative estimate of data set uniqueness.
Parameter regularization is performed as described in Steck and Jaakkola (2002) using mutual information between data sets to estimate a strength of prior belief for each data set. While a large amount of shared information does not guarantee a redundant data set, since the same subset of information could be shared many times, it provides a valuable quantitative estimate of data set uniqueness.

MAGE constructs networks in two stages.
In stage 1 (representation learning), each dataset is converted into a gene graph with edges derived from coexpression or protein/gene interactions. MAGE trains a masked graph autoencoder that hides a fraction of edges and learns to reconstruct them using information from neighboring genes in the graph. The decoder outputs a reconstruction probability for each gene pair, which serves as dataset-level evidence for functional relatedness.

In stage 2 (context-specific integration), MAGE learns a tissue- or cell-type-specific mapping from dataset-level evidence to a functional relationship probability. This supervised model is trained using a tissue- or cell-type-specific functional gold standard derived from Gene Ontology biological process relationships together with tissue expression patterns. The output is a tissue- or cell-type-specific functional network where each edge weight is the predicted probability that two genes participate in shared biological processes in that context.

Data integration
---------------------------
We collected and integrated 987 genome-scale data sets encompassing approximately 38,000 conditions from an estimated 14,000 publications including both expression and interaction measurements. To integrate these data, we automatically assess each data set for its relevance to each of 144 tissue- and cell lineage–specific functional contexts. The resulting functional maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows us to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist.
GIANT integrates 987 genome-scale data sets encompassing approximately 38,000 conditions from an estimated 14,000 publications including both expression and interaction measurements. To integrate these data, we automatically assess each data set for its relevance to each of 144 tissue- and cell lineage–specific functional contexts. The resulting functional maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows us to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist.

* Gene co-expression: All gene expression data sets are from NCBI's Gene Expression Omnibus (GEO). Genes with more than 30% of values missing were removed, and remaining missing values were imputed using ten nearest neighbors. Non-log-transformed data sets were log transformed. Expression measurements were summarized to Entrez identifiers, and duplicate identifiers were merged. The Pearson correlation was calculated for each gene pair, normalized with Fisher's z transform, mean subtracted and divided by the standard deviation.
MAGE integrates 7,463 genome-scale datasets representing more than 250,000 experiments across multiple data types. These include protein–protein interaction resources, transcription factor binding motif information, perturbation and microRNA target profiles, and large collections of gene expression studies. Each dataset is processed into a graph representation, and the full collection of dataset-level edge evidence is then integrated into 289 tissue and cell-type networks.

* Gene co-expression: All gene expression data sets are from NCBI's Gene Expression Omnibus (GEO) for GIANT and refine.bio for MAGE. Genes with more than 30% of values missing were removed, and remaining missing values were imputed using ten nearest neighbors. Non-log-transformed data sets were log transformed. Expression measurements were summarized to Entrez identifiers, and duplicate identifiers were merged. The Pearson correlation was calculated for each gene pair, normalized with Fisher's z transform, mean subtracted and divided by the standard deviation.

* Protein-interaction: Interaction data are collected from BioGRID, IntAct, MINT, and MIPS.

Expand All @@ -29,6 +36,7 @@ We collected and integrated 987 genome-scale data sets encompassing approximatel

Evidence
---------------------------
For GIANT:
The "evidence" for an edge is measured as the contribution or "influence" of each dataset on the posterior classification probability. Each dataset contribution is calculated as the posterior probability of a functional relationship given only that dataset, minus the prior probablility.

Contribution of dataset D to an edge functional relationship prediction (FR)::
Expand All @@ -37,6 +45,15 @@ Contribution of dataset D to an edge functional relationship prediction (FR)::

Note that the contributions will not sum to 1.0, as each contribution is measured separately. Generally, individual gene expression datasets will not contribute much to the posterior probability but cumulatively can make a significant contribution.

For MAGE:
In each tissue- or cell-type-specific MAGE network, an edge between genes *u* and *v* is assigned a single score produced by the stage 2 (context-specific integration) gradient-boosting integration model (XGBoost). Each gene pair is represented by a 7,463-dimensional feature vector (one feature per dataset) derived from the stage 1 (representation learning) masked-edge reconstruction probabilities, and the boosting model maps these features to a predicted score between 0 and 1, where the score represents the probability of a functional relationship in that context.

The final network edge weight is the predicted score:
edge_weight(u, v) ∈ [0, 1]

Higher values indicate a higher predicted probability that the two genes participate in a functional relationship in the selected tissue or cell type.


Example
---------------------------

Expand Down