---
title: README
emoji: 🔥
colorFrom: yellow
colorTo: gray
sdk: static
pinned: false
---
# Project PhenoSeq: Protein Network Analysis for Phenotypic Outcomes
PhenoSeq showed promising results on basic prediction tasks and identified key areas for improvement in modeling protein-phenotype relationships. The findings provide a foundation for future work in protein network analysis and phenotype prediction.

*This project is a step toward understanding protein-phenotype relationships and highlights areas for future research and development in computational biology.*
## Project Overview
PhenoSeq investigates how protein networks contribute to organism-scale phenotypes, particularly cancer growth and organism longevity. The project combines protein embeddings from ESM (Evolutionary Scale Modeling) with graph neural networks to predict phenotypic outcomes through protein-protein interactions (PPIs).
## Core Objectives
1. Develop predictive models for understanding biological drivers of complex diseases
2. Create frameworks for inferring oncogenic potential of genetic mutations
3. Analyze clinical significance of protein modifications using sequence embeddings
4. Establish connections between protein networks and phenotypic outcomes
## Data Sources
The project utilized three major public databases:
- DepMap: CRISPR-based experimental data measuring protein deletion effects on cancer cell proliferation
- TCGA: The Cancer Genome Atlas data
- Longevity Database: Species longevity information
## Methodological Approach
### Model Development
The team developed three distinct models:
1. **Baseline Model**
   - Fully connected network predicting CRISPR scores from embeddings
   - Achieved a Pearson correlation of 0.55 with ground truth
   - Outperformed a K-nearest neighbors baseline
   - Performance depended on how close a protein was to its nearest neighbors in the training set
2. **Cell Line-Specific Model**
   - Incorporated cell line identity through a one-hot encoding
   - Included mutation status (wild type vs. mutated)
   - Achieved a Pearson correlation of 0.44 with ground truth
   - Limited success in predicting cell line-specific differences
3. **PPI-Informed Model**
   - Integrated protein-protein interaction data
   - Results comparable to the cell line-specific model
   - Limited additional performance gain from PPI integration
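
The inputs to the cell line-specific model could be assembled roughly as follows. This is a minimal sketch, not the project's code: the function names, the 4-dimensional placeholder embedding, and the three-entry cell line index are assumptions made for illustration. It concatenates a protein embedding, a one-hot cell line vector, and a single wild type/mutated flag.

```python
from typing import Dict, List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of zeros with a single 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_features(
    embedding: Sequence[float],
    cell_line: str,
    cell_line_index: Dict[str, int],
    is_mutated: bool,
) -> List[float]:
    """Concatenate embedding + one-hot cell line + mutation flag (hypothetical layout)."""
    cell_vec = one_hot(cell_line_index[cell_line], len(cell_line_index))
    mutation_flag = [1.0 if is_mutated else 0.0]
    return list(embedding) + cell_vec + mutation_flag


# Illustrative values only: a real ESM embedding has far more dimensions,
# and the project covers many more cell lines than three.
cell_lines = {"A549": 0, "HeLa": 1, "MCF7": 2}
embedding = [0.12, -0.40, 0.88, 0.05]
features = build_features(embedding, "HeLa", cell_lines, is_mutated=True)
print(features)
```

With this layout, two mutations of the same protein in the same cell line differ only through their embeddings, since the one-hot vector and the flag are identical for both.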
### Additional Analyses
- Species Longevity Analysis
  - Challenges in cross-phylogenetic prediction
  - Limited success across different orders of the phylogenetic tree
- TCGA Patient Survival Analysis
  - Achieved significant correlations
  - Performance below initial expectations
## Key Findings
1. ESM3 embeddings contain valuable functional information
2. Simple models can outperform basic baselines
3. The current approach has limited ability to capture subtle effects
4. Predicting mutation-specific impacts remains challenging

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/gDGJH2ErnqGcHoF9DWuBc.png)

## Future Directions
1. Integration of additional data types:
   - Copy number variation
   - Transcriptomic information
2. Exploration of amino acid level embeddings
3. Enhanced signal processing methods
4. Improved model architectures
## Technical Achievements
- Successful implementation of protein embedding analysis
- Development of multiple predictive models
- Integration of complex biological datasets
- Novel approaches to phenotype prediction
## Limitations and Challenges
1. Limited success in cell line-specific predictions
2. Challenges in cross-phylogenetic predictions
3. Subtle effect detection limitations
4. Data integration complexities
## Impact and Applications
- Enhanced understanding of disease mechanisms
- Improved drug target identification
- Better prediction of genetic mutation effects
- Advanced protein function analysis
# PhenoSeq Longevity Analysis Component
This analysis revealed both the potential and the limitations of using protein sequence data to predict species longevity, and it highlighted the importance of taxonomic relationships in such predictions.
## Overview
The longevity analysis component of PhenoSeq investigated the relationship between protein sequences and species lifespan across different taxonomic orders, with a particular focus on Primates, Chiroptera (bats), and Cetacea (whales).
## Key Findings

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/vS8Fe-q1lY5Oiro4FPVEP.png)

### 1. Taxonomic Order Analysis
- The study examined lifespan distributions across multiple orders including:
  - Rodentia
  - Artiodactyla
  - Carnivora
  - Primates
  - Chiroptera
  - Cetacea
  - Diprotodontia
  - Perissodactyla
### 2. Prediction Performance
- Mean predictions across orders were relatively successful
- However, predictions within individual orders showed limited accuracy
- High-performing proteins were not well conserved between different orders

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/V9r5W8k5K9BgbuJfkf1XQ.png)

### 3. Model Architecture Insights
- Later layers in the neural network did not provide significant additional information
- Training curves showed convergence but with limitations in prediction accuracy
### 4. Protein Embedding Analysis
- Analysis of protein ALDOB showed that:
  - Nearest neighbor species in embedding space typically belonged to the same Order/Family
  - Strong taxonomic clustering was observed in the embedding space
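
A nearest-neighbor check of this kind can be sketched as follows. The source does not state which distance metric or embedding dimensionality was used, so cosine distance, the 2-dimensional toy vectors, and the species and order labels below are all assumptions for illustration.

```python
import math
from typing import Dict, List


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbor(query: str, embeddings: Dict[str, List[float]]) -> str:
    """Return the species (other than `query`) whose embedding is closest."""
    candidates = (s for s in embeddings if s != query)
    return min(candidates, key=lambda s: cosine_distance(embeddings[query], embeddings[s]))


def same_order_fraction(embeddings: Dict[str, List[float]], order_of: Dict[str, str]) -> float:
    """Fraction of species whose nearest neighbor belongs to the same order."""
    hits = sum(order_of[s] == order_of[nearest_neighbor(s, embeddings)] for s in embeddings)
    return hits / len(embeddings)


# Hypothetical per-species embeddings of one protein (placeholder values).
embeddings = {
    "primate_a": [1.0, 0.1],
    "primate_b": [0.9, 0.2],
    "bat_a": [0.1, 1.0],
    "bat_b": [0.2, 0.9],
}
order_of = {
    "primate_a": "Primates",
    "primate_b": "Primates",
    "bat_a": "Chiroptera",
    "bat_b": "Chiroptera",
}
print(same_order_fraction(embeddings, order_of))
```

A fraction close to 1 indicates that the embedding space is organized largely by taxonomy, which is the pattern reported for ALDOB.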
### 5. Hierarchical Prediction Accuracy
Correlation strength increased with taxonomic specificity:
- Order level: r = 0.8 (271 species across 12 orders)
- Family level: r = 0.92 (191 species across 27 families)
- Genus level: r = 0.97 (47 species across 15 genera)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/qHsUpGuLTIo3Nw3CHJDVM.png)

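
The source does not say how the order-, family-, and genus-level correlations were computed. One plausible reading, shown here purely as an assumption, is a leave-one-out taxon mean: each species' lifespan is predicted from the mean lifespan of the other species in the same taxon, and species that are alone in their taxon are dropped (which would explain why the species count shrinks at finer levels). The species names and lifespans are made up.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def leave_one_out_taxon_correlation(
    lifespans: Dict[str, float], taxon_of: Dict[str, str]
) -> Tuple[float, int]:
    """Correlate each lifespan with the mean of the other members of its taxon."""
    members = defaultdict(list)
    for species, taxon in taxon_of.items():
        members[taxon].append(species)

    observed, predicted = [], []
    for group in members.values():
        if len(group) < 2:  # no relatives to predict from
            continue
        total = sum(lifespans[s] for s in group)
        for s in group:
            observed.append(lifespans[s])
            predicted.append((total - lifespans[s]) / (len(group) - 1))
    return pearson(observed, predicted), len(observed)


# Made-up lifespans (years) and taxonomy for six hypothetical species.
lifespans = {"sp1": 3.0, "sp2": 4.0, "sp3": 30.0, "sp4": 12.0, "sp5": 14.0, "sp6": 35.0}
order_of = {"sp1": "X", "sp2": "X", "sp3": "X", "sp4": "Y", "sp5": "Y", "sp6": "Y"}
genus_of = {"sp1": "a", "sp2": "a", "sp3": "b", "sp4": "c", "sp5": "c", "sp6": "d"}

r_order, n_order = leave_one_out_taxon_correlation(lifespans, order_of)
r_genus, n_genus = leave_one_out_taxon_correlation(lifespans, genus_of)
print(f"order: r={r_order:.2f} over {n_order} species")
print(f"genus: r={r_genus:.2f} over {n_genus} species")
```

In this toy data the genus-level correlation is higher than the order-level one while covering fewer species, mirroring the direction of the reported trend.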
## Technical Limitations
- Limited success in cross-order predictions
- Difficulty in generalizing predictions across distant phylogenetic relationships
- Need for order/family-specific modeling approaches
## Key Insights
- Strong within-taxon predictions
- Decreasing accuracy with increasing phylogenetic distance
- Need for taxonomic stratification in prediction models
- High predictive power at genus level suggests strong genetic influence on longevity within closely related species

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/hzumPD8BXOEAnyLzrCE5T.png)

# PhenoSeq DepMap Analysis Component
This analysis demonstrated both the potential and the current limitations of using protein sequence data to predict cancer-relevant protein functions, and it highlighted areas for future improvement in protein-phenotype prediction models.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/_AJXp_IwAx9uzHjlXMLVT.png)

## Overview
The DepMap component investigated protein function in cancer through CRISPR-based knockout experiments, analyzing 9,353 proteins across 1,150 cell lines to understand their effects on cancer cell growth.
## Three Models
1. **Baseline Model**
   - Input: Average protein embedding across all cell lines
   - Output: Average CrisprScore across all cell lines
   - Architecture: Simple feedforward network using ESM3-open-small embeddings
   - Performance: Achieved Pearson correlation of 0.55
   - Outperformed KNN baseline across all K values
2. **Cell-line-specific Model**
   - Predicted CrisprScore effects for each protein-cell line combination
   - Performance: Achieved Pearson correlation of 0.44
   - Limited success in predicting protein-specific differences between cell lines
   - Poor correlation (r=0.01) for individual proteins like MYC across cancer types
3. **PPI-informed Model**
   - Incorporated protein-protein interaction networks
   - Aimed to predict CrisprScore effects by propagating signals through PPI networks
   - Results similar to cell-line-specific model
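
Propagating signals through a PPI network can be illustrated with a simple neighbor-averaging step. The sketch below is a stand-in under stated assumptions, not the project's model, whose architecture is not specified here: `propagate`, the mixing weight `alpha`, and the three toy proteins are hypothetical.

```python
from typing import Dict, List


def propagate(
    scores: Dict[str, float],
    neighbors: Dict[str, List[str]],
    alpha: float = 0.5,
    steps: int = 1,
) -> Dict[str, float]:
    """Blend each protein's score with the mean score of its PPI neighbors.

    alpha = 0 keeps the original scores; alpha = 1 replaces each score with
    the neighbor mean. Proteins without neighbors keep their own score.
    """
    current = dict(scores)
    for _ in range(steps):
        updated = {}
        for protein, own in current.items():
            nbrs = [n for n in neighbors.get(protein, []) if n in current]
            if nbrs:
                nbr_mean = sum(current[n] for n in nbrs) / len(nbrs)
                updated[protein] = (1 - alpha) * own + alpha * nbr_mean
            else:
                updated[protein] = own
        current = updated
    return current


# Toy chain P1 - P2 - P3: only P1 starts with a strong (negative) effect.
scores = {"P1": -1.0, "P2": 0.0, "P3": 0.0}
neighbors = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P2"]}
print(propagate(scores, neighbors, alpha=0.5, steps=1))
```

After one step the signal from P1 reaches its direct neighbor P2 but not P3; additional steps spread it further along the chain.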
## Key Findings

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/T-B8Wm66A-oepjyA562zv.png)

### Model Performance
- Baseline model showed strong general prediction capability
- Distance to nearest neighbors in the training set affected performance
- Larger networks did not necessarily improve performance
- The model appeared to learn generalizable signal rather than memorize the training set

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/wllLuMZmsRpjmZRr1EJ0S.png)

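
A KNN baseline of the kind the model was compared against, together with the distance to the nearest training neighbor, might look like the sketch below. Euclidean distance, the unweighted mean over neighbors, and the toy 2-dimensional data are assumptions; the source does not specify these details.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(
    query: Sequence[float],
    train_x: List[Sequence[float]],
    train_y: List[float],
    k: int,
) -> Tuple[float, float]:
    """Return (mean score of the k nearest training points, distance to the nearest one)."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the training set size")
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(query, pair[0]))
    nearest = ranked[:k]
    prediction = sum(score for _, score in nearest) / k
    return prediction, euclidean(query, nearest[0][0])


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equal-length sequences with non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Toy embeddings and scores (placeholder values, not DepMap data).
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [-1.0, -0.8, 0.1, 0.3]
test_x = [[0.0, 0.5], [5.5, 5.0], [3.5, 3.5]]
test_y = [-0.9, 0.2, -0.2]

results = [knn_predict(q, train_x, train_y, k=2) for q in test_x]
predictions = [pred for pred, _ in results]
for (pred, dist), truth in zip(results, test_y):
    print(f"nearest-neighbor distance={dist:.2f}  abs error={abs(pred - truth):.2f}")
print(f"Pearson r = {pearson(predictions, test_y):.2f}")
```

Printing the error next to the nearest-neighbor distance is one way to check whether accuracy degrades for proteins that sit far from anything in the training set.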
### Technical Insights
- Hyperparameter sweeps showed similar training patterns across:
  - Different numbers of layers
  - Various hidden dimensions
- Model struggled with fine-grained predictions of mutation effects
### Limitations
- Poor performance in predicting effects of small sequence differences
- Limited ability to distinguish between mutations of the same protein
- Challenges in cell-line-specific predictions
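
A pooled correlation can stay respectable even when per-protein correlations are near zero, because differences between proteins dominate the pooled statistic. The toy example below illustrates that effect as one possible explanation, not a confirmed diagnosis of the results above; the protein names, cell line names, and scores are invented.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str, float, float]  # (protein, cell_line, observed, predicted)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def pooled_correlation(records: List[Record]) -> float:
    """Correlation over all protein-cell line pairs at once."""
    return pearson([r[2] for r in records], [r[3] for r in records])


def per_protein_correlation(records: List[Record]) -> Dict[str, float]:
    """Correlation across cell lines, computed separately for each protein."""
    by_protein = defaultdict(list)
    for protein, _cell_line, observed, predicted in records:
        by_protein[protein].append((observed, predicted))
    return {
        protein: pearson([o for o, _ in pairs], [p for _, p in pairs])
        for protein, pairs in by_protein.items()
    }


# GENE_A is broadly essential (scores near -1), GENE_B is not (scores near 0).
# Predictions capture that gap but not the variation between cell lines.
records = [
    ("GENE_A", "line1", -1.2, -1.0),
    ("GENE_A", "line2", -1.0, -1.1),
    ("GENE_A", "line3", -0.8, -1.0),
    ("GENE_B", "line1", -0.1, 0.0),
    ("GENE_B", "line2", 0.0, 0.1),
    ("GENE_B", "line3", 0.1, 0.0),
]
print(f"pooled r = {pooled_correlation(records):.2f}")
for protein, r in per_protein_correlation(records).items():
    print(f"{protein}: r = {r:.2f}")
```

Reporting both statistics side by side separates "the model knows which proteins matter on average" from "the model knows in which cell lines a given protein matters".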
## Technical Details
- CrisprScore distribution showed varied effects of protein deletion
- Different proteins showed distinct patterns of effect across cell lines
- Model performance was consistent across different architectural choices
## Future Implications
- Need for improved mutation-specific prediction capabilities
- Potential for enhanced protein function understanding
- Opportunity for better cancer-specific protein effect prediction