---
license: cc-by-nc-sa-4.0
---

# HNSCC MultiOmics Graph Attention Autoencoder

## Model Overview

In this study, we address the challenge of integrating heterogeneous multiomic data structures through Graph Attention Networks, focusing on four key data types from The Cancer Genome Atlas (TCGA): gene expression, mutations, methylation, and copy number alterations. These data, derived from diverse technologies, offer a comprehensive view of the genomic landscape. Our approach leverages these networks to capture natural feature interactions, particularly those best represented by gene-gene pathway interactions, thereby consolidating these four diverse data types into a single, refined value. Further evaluation of these learned representations, alongside clinical covariates, highlights their strength in the identification of distinct survival groups. Additionally, attention analysis of these groups not only aids in the identification of potential novel biomarkers for precision oncology but also leads to the distinction of poor survival groups with a high Area Under the Curve (AUC) exceeding 0.9. This study, while focusing on head and neck squamous cell carcinoma (HNSCC) due to its diverse tissue types, primarily serves to demonstrate the robust applicability of Graph Attention Networks in harmonizing multiomic data for enhanced biomarker identification in precision oncology.

## Dataset

The dataset used in this study is available at [HNSCC-MultiOmics-10-Cancer-Hallmark-Gene-Network](https://huggingface.co/datasets/VatsalPatel18/HNSCC-MultiOmics-10-Cancer-Hallmark-Gene-Network) and includes the following data types:

- Gene expression
- Mutations
- Methylation
- Copy number alterations

### Data Integration

Within the graph $G = (V, E)$, each gene was represented by a distinct node. The set of nodes $V$ was defined as $V = \{\text{gene}_1, \text{gene}_2, \ldots, \text{gene}_n\}$. A weighted edge in the graph signified the number of shared pathways between genes $i$ and $j$.
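
The edge weighting described above can be illustrated with a minimal sketch. The pathway names, gene names, and the `shared_pathway_edges` helper are hypothetical placeholders, not part of the released code or dataset; the sketch only assumes that each pathway is available as a set of member genes.

```python
from itertools import combinations

# Hypothetical toy gene sets (pathway name -> member genes); not real hallmark data.
pathways = {
    "pathway_a": {"gene_1", "gene_2", "gene_3"},
    "pathway_b": {"gene_1", "gene_3"},
    "pathway_c": {"gene_2", "gene_3", "gene_4"},
}

def shared_pathway_edges(pathways):
    """Return {(gene_i, gene_j): weight}, where weight is the number of
    pathways that contain both genes. Pairs sharing no pathway get no edge."""
    edges = {}
    for members in pathways.values():
        # Every unordered pair inside one pathway shares that pathway.
        for gene_i, gene_j in combinations(sorted(members), 2):
            edges[(gene_i, gene_j)] = edges.get((gene_i, gene_j), 0) + 1
    return edges

edges = shared_pathway_edges(pathways)
for (gene_i, gene_j), weight in sorted(edges.items()):
    print(f"{gene_i} -- {gene_j}: {weight} shared pathway(s)")
```

Storing only the pairs that share at least one pathway keeps the representation sparse, which is the form most graph libraries expect for weighted edge lists.
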
This relationship was mathematically captured as $e(\text{gene}_i, \text{gene}_j) = p$, where $p \in \mathbb{R}^1$. Consequently, the adjacency matrix $E$ with dimensions $|V| \times |V|$ was derived from this relationship.

### Node Features

The multi-omics data corresponding to each gene was depicted as a node feature. For gene $i$, node $i$ was associated with an embedded vector $h_i \in \mathbb{R}^{17}$. This vector was defined as:

$h_i$ = [gene expression, copy number alteration, frame shift deletion, frame shift insertion, in frame deletion, in frame insertion, missense mutation, nonsense mutation, nonstop mutation, silent, translation start site, 1st exon, 3'UTR, 5'UTR, body, TSS1500, TSS200]

## Model Architecture

In our model, each gene is represented as a node, with its features encapsulating a variety of continuous and discrete variables that span a 17-dimensional space ($\mathbb{R}^{17}$). These dimensions characterize different facets of gene behavior across multiple omic modalities, including gene expression, mutation types, methylation patterns, and copy number alterations. With the aim of distilling this multi-dimensional data into a unified representation, edges between genes are weighted based on shared pathways from the Cancer Hallmarks gene set. This set comprises 2,784 genes distributed across 10 pathways, offering a structured representation of cancer biology that is conducive to the extraction of a singular, informative metric potentially reflective of clinical outcomes such as survival.

## Training

Ideal gene-gene graph networks are constructed using the cancer hallmark geneset, resulting in 3,672,566 weighted edges among the 2,784 genes. This network construction was performed for each of the 430 patients in the TCGA-HNSCC cohort. A Graph Attention Autoencoder is trained on a dataset split, with 60% for training and the rest divided between validation and testing.
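
A patient-level split of this kind might look like the sketch below. The text does not state how the remaining 40% is divided, so the equal validation/test division, the random seed, the placeholder patient identifiers, and the `split_patients` helper are all assumptions made for illustration.

```python
import random

def split_patients(patient_ids, train_frac=0.6, seed=0):
    """Shuffle patients and split them into train/validation/test lists.

    Assumption: the patients left after the training share are divided
    equally between validation and test (the source gives no ratio).
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_train = round(train_frac * len(ids))
    remainder = ids[n_train:]
    n_val = len(remainder) // 2
    return ids[:n_train], remainder[:n_val], remainder[n_val:]

# Placeholder identifiers standing in for the 430 TCGA-HNSCC patients.
patient_ids = [f"patient_{k:03d}" for k in range(430)]
train_ids, val_ids, test_ids = split_patients(patient_ids)
print(len(train_ids), len(val_ids), len(test_ids))  # 258 86 86
```

Splitting by patient rather than by node keeps every gene of a given patient's graph in the same partition.
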
This model achieves a validation cosine similarity of 0.835 and a test set cosine similarity of 0.8, measured between each node's input multiomic features and their reconstructed counterparts. Latent features for each gene of every patient are extracted from the Graph Encoder, effectively reducing the gene dimensionality from $\mathbb{R}^{17}$ to $\mathbb{R}^1$, while encapsulating the influence of cancer hallmark pathways.

## Usage

To use this model for analysis, you can load the pre-trained model and perform various analyses as shown below:

```python
from OmicsConfig import OmicsConfig
from MultiOmicsGraphAttentionAutoencoderModel import MultiOmicsGraphAttentionAutoencoderModel
from Attention_Extracter import Attention_Extracter
from GraphAnalysis import GraphAnalysis

# Load the model configuration and weights
autoencoder_config = OmicsConfig.from_pretrained("./lc_models/MultiOmicsAutoencoder/trained_autoencoder/")
autoencoder_model = MultiOmicsGraphAttentionAutoencoderModel(autoencoder_config)

# Initialize the extracter
graph_data_dict_path = './data/graph_data_dictN.pth'
extracter = Attention_Extracter(graph_data_dict_path, autoencoder_model.encoder, gpu=False)

# Perform graph analysis
ga = GraphAnalysis(extracter)
ga.find_optimal_clusters(save_path='temp')
ga.cluster_data2(4)
ga.pca_tsne()
ga.visualize_clusters()
ga.cluster_data2(6)
ga.plot_kaplan_meier()
ga.perform_log_rank_test()
ga.generate_summary_table()
```

## Conclusion

This model and the associated dataset provide a robust framework for integrating and analyzing multi-omics data using Graph Attention Networks. The ability to identify distinct survival groups and potential novel biomarkers underscores its potential application in precision oncology. The dynamic nature of the attention mechanism in the GATv2Conv layer ensures that the model effectively captures intricate relationships within the graph-structured data.
By leveraging this model, researchers can gain deeper insights into the multi-dimensional interactions between different omic features and their impact on clinical outcomes. The Graph Attention Autoencoder's ability to distill complex multi-omics data into meaningful latent representations facilitates the discovery of new biomarkers and enhances the understanding of cancer biology.

### Future Work

Future work can explore the extension of this model to other types of cancers and additional omic data types. Incorporating temporal data and longitudinal studies could further enhance the model's predictive capabilities and provide a more comprehensive view of tumor progression and treatment response.

### References

- Brody, S., Alon, U., & Yahav, E. (2021). How Attentive are Graph Attention Networks? arXiv preprint arXiv:2105.14491.
- Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). Graph Attention Networks. arXiv preprint arXiv:1710.10903.
- Fey, M., & Lenssen, J. E. (2019). Fast Graph Representation Learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.
- Kipf, T. N., & Welling, M. (2016). Variational Graph Auto-Encoders. arXiv preprint arXiv:1611.07308.