Project PhenoSeq: Protein Network Analysis for Phenotypic Outcomes

PhenoSeq produced promising results on basic prediction tasks, such as predicting average CRISPR scores from protein embeddings, but it also exposed clear gaps in protein-phenotype relationship modeling, notably in cell line-specific, mutation-specific, and cross-order predictions. These findings provide a foundation for future work in protein network analysis and phenotype prediction.

The project is a step toward understanding protein-phenotype relationships and highlights areas for future research and development in computational biology.

Project Overview

PhenoSeq studies how protein networks contribute to organism-scale phenotypes, particularly cancer growth and organism longevity. The project combines protein embeddings from ESM (Evolutionary Scale Modeling) with graph neural networks to predict phenotypic outcomes through protein-protein interactions (PPIs).

Core Objectives

Develop predictive models for understanding biological drivers of complex diseases
Create frameworks for inferring oncogenic potential of genetic mutations
Analyze clinical significance of protein modifications using sequence embeddings
Establish connections between protein networks and phenotypic outcomes

Data Sources

The project utilized three major public databases:

DepMap: CRISPR-based experimental data measuring protein deletion effects on cancer cell proliferation
TCGA: The Cancer Genome Atlas data
Longevity Database: Species longevity information

Methodological Approach

Model Development

The team developed three distinct models:

Baseline Model
- Fully connected network predicting CRISPR scores from protein embeddings
- Achieved a Pearson correlation of 0.55 with ground truth
- Outperformed a K-nearest neighbors (KNN) baseline
- Performance depended on how close a protein was to its nearest neighbors in the training set
Cell Line-Specific Model
- Incorporated cell line identity through one-hot encoding
- Included mutation status (wild type vs. mutated)
- Achieved a Pearson correlation of 0.44 with ground truth
- Limited success in predicting cell line-specific differences
PPI-Informed Model
- Integrated protein-protein interaction data
- Results comparable to cell line-specific model
- Limited additional performance gain from PPI integration
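
The input to the cell line-specific model described above can be illustrated with a small sketch. It assumes the protein embedding, the one-hot cell line vector, and the mutation flag are simply concatenated; the source does not state how the features were combined, and the function names, cell line labels, and toy embedding are hypothetical.

```python
def one_hot(index, size):
    """Return a list of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_features(protein_embedding, cell_line, cell_lines, is_mutated):
    """Concatenate an embedding, a one-hot cell line vector, and a mutation flag.

    Assumption: plain concatenation; the project may have combined them differently.
    """
    cell_index = cell_lines.index(cell_line)
    mutation_flag = [1.0 if is_mutated else 0.0]
    return list(protein_embedding) + one_hot(cell_index, len(cell_lines)) + mutation_flag


# Hypothetical toy inputs: a 4-dimensional embedding and three made-up cell lines.
cell_lines = ["line_A", "line_B", "line_C"]
features = build_features([0.1, -0.3, 0.7, 0.2], "line_B", cell_lines, is_mutated=True)
print(features)
```

With real ESM embeddings the vector would be far longer, and a one-hot block over many cell lines would add one dimension per cell line.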

Additional Analyses

Species Longevity Analysis
- Challenges in cross-phylogenetic prediction
- Limited success across different orders of the phylogenetic tree
TCGA Patient Survival Analysis
- Achieved significant correlations
- Performance below initial expectations

Key Findings

ESM3 embeddings contain valuable functional information
Simple models can outperform basic baselines
The current approach is limited in capturing subtle effects
Mutation-specific impacts remain difficult to predict

Future Directions

Integration of additional data types:
- Copy number variation
- Transcriptomic information
Exploration of amino acid level embeddings
Enhanced signal processing methods
Improved model architectures

Technical Achievements

Successful implementation of protein embedding analysis
Development of multiple predictive models
Integration of complex biological datasets
Novel approaches to phenotype prediction

Limitations and Challenges

Limited success in cell line-specific predictions
Challenges in cross-phylogenetic predictions
Limited ability to detect subtle effects
Data integration complexities

Impact and Applications

Enhanced understanding of disease mechanisms
Improved drug target identification
Better prediction of genetic mutation effects
Advanced protein function analysis

PhenoSeq Longevity Analysis Component

This analysis revealed both the potential and limitations of using protein sequence data for predicting species longevity, highlighting the importance of taxonomic relationships in such predictions.

Overview

The longevity analysis component of PhenoSeq investigated the relationship between protein sequences and species lifespan across different taxonomic orders, with a particular focus on Primates, Chiroptera (bats), and Cetacea (whales).

Key Findings

1. Taxonomic Order Analysis

The study examined lifespan distributions across multiple orders including:
- Rodentia
- Artiodactyla
- Carnivora
- Primates
- Chiroptera
- Cetacea
- Diprotodontia
- Perissodactyla

2. Prediction Performance

Mean predictions across orders were relatively successful
However, predictions within individual orders showed limited accuracy
High-performing proteins were not well conserved between different orders

3. Model Architecture Insights

Later layers in the neural network did not provide significant additional information
Training curves showed convergence but with limitations in prediction accuracy

4. Protein Embedding Analysis

Analysis of protein ALDOB showed that:
- Nearest neighbor species in embedding space typically belonged to the same Order/Family
- Strong taxonomic clustering was observed in the embedding space
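
One way to quantify the taxonomic clustering described above is to measure how often a species' nearest neighbor in embedding space shares its order. The sketch below assumes cosine distance and uses made-up 2-D vectors in place of real ALDOB embeddings; the species names and the function `same_taxon_neighbor_fraction` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def same_taxon_neighbor_fraction(embeddings, taxa):
    """Fraction of species whose nearest neighbor has the same taxon label.

    embeddings: dict mapping species name -> embedding vector
    taxa: dict mapping species name -> taxon label (for example, its order)
    """
    matches = 0
    for species, vector in embeddings.items():
        nearest = min(
            (other for other in embeddings if other != species),
            key=lambda other: cosine_distance(vector, embeddings[other]),
        )
        if taxa[nearest] == taxa[species]:
            matches += 1
    return matches / len(embeddings)


# Made-up toy embeddings for one protein across four hypothetical species.
embeddings = {
    "bat_1": [1.0, 0.1],
    "bat_2": [0.9, 0.2],
    "whale_1": [0.1, 1.0],
    "whale_2": [0.2, 0.9],
}
taxa = {"bat_1": "Chiroptera", "bat_2": "Chiroptera",
        "whale_1": "Cetacea", "whale_2": "Cetacea"}
print(same_taxon_neighbor_fraction(embeddings, taxa))
```

A value close to 1 would indicate that embedding neighbors largely follow taxonomy, which is the pattern reported for ALDOB.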

5. Hierarchical Prediction Accuracy

Correlation strength increased with taxonomic specificity:

Order level: r = 0.8 (271 species across 12 orders)
Family level: r = 0.92 (191 species across 27 families)
Genus level: r = 0.97 (47 species across 15 genera)
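
The source does not say how these correlations were computed. One plausible reading, offered here only as an assumption, is a leave-one-out taxon mean: each species' lifespan is predicted from the mean lifespan of the other species in its order, family, or genus, and species with no relative at that level are dropped. The sketch below illustrates that reading with made-up lifespans; all names and numbers are hypothetical.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def leave_one_out_taxon_means(lifespans, taxa):
    """Predict each species' lifespan as the mean of the other species in its taxon.

    Species that are the only member of their taxon are skipped.
    Returns (predicted, observed) as parallel lists.
    """
    members = defaultdict(list)
    for species, taxon in taxa.items():
        members[taxon].append(species)
    predicted, observed = [], []
    for species, taxon in taxa.items():
        others = [lifespans[s] for s in members[taxon] if s != species]
        if not others:
            continue  # no relatives at this taxonomic level
        predicted.append(sum(others) / len(others))
        observed.append(lifespans[species])
    return predicted, observed


# Made-up lifespans (years) for five hypothetical species in three genera.
lifespans = {"a1": 10.0, "a2": 12.0, "b1": 40.0, "b2": 44.0, "c1": 5.0}
genus = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
predicted, observed = leave_one_out_taxon_means(lifespans, genus)
print(len(predicted), round(pearson(predicted, observed), 3))
```

Under this reading, the species count shrinks at finer taxonomic levels because fewer species have a relative in the same group, which would be consistent with the smaller sample at the genus level.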

Technical Limitations

Limited success in cross-order predictions
Difficulty in generalizing predictions across distant phylogenetic relationships
Need for order/family-specific modeling approaches

Key Insights

Strong within-taxon predictions
Decreasing accuracy with increasing phylogenetic distance
Need for taxonomic stratification in prediction models
High predictive power at genus level suggests strong genetic influence on longevity within closely related species

PhenoSeq DepMap Analysis Component

This analysis demonstrated both the potential and current limitations of using protein sequence data to predict cancer-relevant protein functions, highlighting areas for future improvement in protein-phenotype prediction models.

Overview

The DepMap component investigated protein function in cancer through CRISPR-based knockout experiments, analyzing 9,353 proteins across 1,150 different cell lines to understand their effects on cancer cell growth.

Three Models

Baseline Model

Input: Average protein embedding across all cell lines
Output: Average CrisprScore across all cell lines
Architecture: Simple feedforward network using ESM3-open-small embeddings
Performance: Achieved Pearson correlation of 0.55
Outperformed KNN baseline across all K values
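
For context, a KNN baseline of the kind mentioned above can be sketched in a few lines, together with the Pearson correlation used as the metric. This is a minimal illustration, not the project's code: it assumes Euclidean distance and unweighted averaging of neighbor scores, and the 2-D vectors and scores are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, train_embeddings, train_scores, k):
    """Predict a score as the mean score of the k nearest training embeddings."""
    ranked = sorted(
        range(len(train_embeddings)),
        key=lambda i: euclidean(query, train_embeddings[i]),
    )
    nearest = ranked[:k]
    return sum(train_scores[i] for i in nearest) / len(nearest)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up stand-ins for per-protein average embeddings and average scores.
train_embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
train_scores = [-1.0, -0.8, 0.0, 0.2]
test_embeddings = [[0.05, 0.0], [1.05, 1.0], [0.5, 0.5]]
test_scores = [-0.9, 0.1, -0.4]

# Evaluate the baseline for several values of K, as in a K sweep.
for k in (1, 2, 3):
    predictions = [
        knn_predict(q, train_embeddings, train_scores, k) for q in test_embeddings
    ]
    print(k, round(pearson(predictions, test_scores), 3))
```

Comparing the feedforward model against this kind of baseline for each K is one way to check that the network captures more than local similarity in embedding space.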

Cell-line-specific Model

Predicted CrisprScore effects for each protein-cell line combination
Performance: Achieved Pearson correlation of 0.44
Limited success in predicting protein-specific differences between cell lines
Poor correlation (r = 0.01) for individual proteins such as MYC across cancer types

PPI-informed Model

Incorporated protein-protein interaction networks
Aimed to predict CrisprScore effects by propagating signals through PPI networks
Results similar to cell-line-specific model
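
The idea of propagating signals through a PPI network can be illustrated with a simple neighbor-averaging step. This is a deliberately simplified stand-in for the graph neural network used in the project, not its actual architecture: the mixing weight `alpha`, the update rule, and the protein names are all assumptions made for illustration.

```python
def propagate(scores, edges, alpha=0.5, steps=1):
    """Blend each protein's score with the mean score of its PPI neighbors.

    scores: dict mapping protein -> initial signal
    edges: iterable of (protein_a, protein_b) undirected interactions;
           both endpoints must be keys of `scores`
    alpha: hypothetical weight given to the neighbor mean at each step
    """
    neighbors = {node: set() for node in scores}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    current = dict(scores)
    for _ in range(steps):
        updated = {}
        for node, value in current.items():
            if neighbors[node]:
                neighbor_mean = sum(current[n] for n in neighbors[node]) / len(neighbors[node])
                updated[node] = (1 - alpha) * value + alpha * neighbor_mean
            else:
                updated[node] = value  # isolated proteins keep their own signal
        current = updated
    return current


# Toy network: P1 interacts with P2; P3 has no interactions.
scores = {"P1": 1.0, "P2": 0.0, "P3": 0.0}
print(propagate(scores, [("P1", "P2")]))
```

A learned message-passing layer replaces the fixed averaging with trainable transformations, but the flow of information along edges is the same.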

Key Findings

Model Performance

Baseline model showed strong general prediction capability
Distance to nearest neighbors in training set affected performance
Larger networks didn't necessarily improve performance
Model demonstrated true learning rather than memorization
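
The effect of training set proximity noted above can be checked by comparing prediction error for test proteins that lie close to the training set against those that lie far from it. The sketch below assumes Euclidean distance and a single hypothetical distance threshold; the source does not describe how the analysis was actually done.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_training_distance(query, train_embeddings):
    """Distance from a test embedding to its closest training embedding."""
    return min(euclidean(query, t) for t in train_embeddings)


def mean_error_by_distance(distances, errors, threshold):
    """Mean absolute error for near (<= threshold) and far (> threshold) test points.

    Returns (near_mean, far_mean); a side with no points yields None.
    """
    near = [e for d, e in zip(distances, errors) if d <= threshold]
    far = [e for d, e in zip(distances, errors) if d > threshold]

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(near), mean(far)


# Made-up distances and absolute errors for three hypothetical test proteins.
distances = [0.1, 0.2, 2.0]
errors = [0.1, 0.3, 0.9]
print(mean_error_by_distance(distances, errors, threshold=1.0))
```

A larger error in the far group than in the near group would match the reported dependence on nearest-neighbor distance.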

Technical Insights

Hyperparameter sweeps showed similar training patterns across:
- Different numbers of layers
- Various hidden dimensions
Model struggled with fine-grained predictions of mutation effects
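
A sweep over layer counts and hidden dimensions like the one described above is typically set up as a grid of configurations. The values below are hypothetical, since the source does not list the ones actually used, and `build_sweep` is an illustrative helper rather than project code.

```python
from itertools import product

# Hypothetical sweep values; the source does not report the real ones.
num_layers_options = [1, 2, 4]
hidden_dim_options = [64, 256]


def build_sweep(num_layers_options, hidden_dim_options):
    """Return one config dict per (num_layers, hidden_dim) combination."""
    return [
        {"num_layers": n, "hidden_dim": h}
        for n, h in product(num_layers_options, hidden_dim_options)
    ]


configs = build_sweep(num_layers_options, hidden_dim_options)
for config in configs:
    print(config)
```

Each configuration would then be trained and evaluated with the same data split so that the resulting training curves are directly comparable.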

Limitations

Poor performance in predicting effects of small sequence differences
Limited ability to distinguish between mutations of the same protein
Challenges in cell-line-specific predictions

Technical Details

CrisprScore distribution showed varied effects of protein deletion
Different proteins showed distinct patterns of effect across cell lines
Model performance was consistent across different architectural choices

Future Implications

Need for improved mutation-specific prediction capabilities
Potential for enhanced protein function understanding
Opportunity for better cancer-specific protein effect prediction