AI Cancer Detection with Machine Learning

In this session from our March meeting, Round Rock High School sophomore Venika Kakarla presented her groundbreaking machine learning approach to detecting and analyzing hepatocellular carcinoma (HCC), the fatal liver cancer that claimed her grandfather's life.

The Problem: Hepatocellular Carcinoma

Hepatocellular carcinoma is the third leading cause of cancer-related deaths worldwide and presents multiple challenges:

It varies significantly between patients
It's difficult to diagnose in early stages
Traditional analysis methods are prohibitively expensive ($10,000+ per analysis)
Computational simulations require 2,400-2,500 hours to process multiple mutations

Biological Background: P53 Protein

The research focuses on the P53 protein, specifically its DNA binding region:

Plays a critical role in controlling tumor growth
Is mutated in approximately 50% of hepatocellular carcinoma patients
When it is mutated, cancer cells can replicate unchecked
Even a single amino acid change can significantly alter protein stability and binding affinity

The Multi-Algorithm Solution

Venika developed a machine learning pipeline that efficiently analyzes P53 mutations: a data preparation step followed by three models.

1. Data Preparation

Started with 1,400 cases of mutated P53 protein from the COSMIC database
Filtered down to 249 significant, unique point mutations (duplicates removed)
Created protein structures using AlphaFold
Generated adjacency matrices showing interactions between residues and DNA
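
The adjacency-matrix step can be sketched roughly as follows. This is a minimal illustration, not the presenter's code: it assumes each residue is represented by a single 3D coordinate (for example, a C-alpha position from a predicted structure) and that two residues count as interacting when they lie within a distance cutoff. The 8 angstrom cutoff, the function name `contact_adjacency`, and the toy coordinates are all assumptions made for the example.

```python
import numpy as np

def contact_adjacency(coords, cutoff=8.0):
    """Return a binary adjacency matrix: 1 where two points lie within `cutoff`."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances via broadcasting, shape (n, n).
    diffs = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    adj = (dist <= cutoff).astype(int)
    np.fill_diagonal(adj, 0)  # a residue is not its own neighbor
    return adj

# Toy example: four hypothetical residue positions, in angstroms.
toy_coords = [
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.0, 0.0],
    [30.0, 0.0, 0.0],
]
toy_adj = contact_adjacency(toy_coords)
print(toy_adj)
```

The first three points fall within the cutoff of one another, so they form a connected cluster, while the distant fourth point has no contacts. The same idea would extend to protein-DNA contacts by adding DNA atoms as extra rows and columns.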

2. Deep Neural Network (DNN)

Purpose: Classify whether mutations stabilize or destabilize the protein

Technical implementation:

Input layer processes features like hydrogen bond count and solvent accessible surface area
Hidden layers analyze whether mutations bury or expose residues
Dropout layer prevents overfitting
Output layer predicts stability classification with 96% accuracy
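
To make the layer sequence above concrete, here is a small NumPy sketch of the data flow: input features, a hidden ReLU layer, dropout, and a sigmoid output. The original model was built with TensorFlow; this stand-in only shows the structure. The layer sizes, the dropout rate, the random untrained weights, and the two feature rows are assumptions for illustration, so the printed probabilities carry no biological meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x  # dropout is a no-op at inference time
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def forward(features, params, training=False, rate=0.5):
    hidden = relu(features @ params["W1"] + params["b1"])
    hidden = dropout(hidden, rate, training, rng)
    logits = hidden @ params["W2"] + params["b2"]
    return sigmoid(logits)  # probability of the "destabilizing" class

n_features, n_hidden = 2, 8  # hypothetical sizes
params = {
    "W1": rng.normal(scale=0.5, size=(n_features, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(scale=0.5, size=(n_hidden, 1)),
    "b2": np.zeros(1),
}

# Two made-up mutations: [hydrogen bond count, solvent accessible surface area],
# both already scaled to the 0-1 range.
batch = np.array([[0.2, 0.9], [0.8, 0.1]])
probs = forward(batch, params, training=False)
print(probs.ravel())
```

Dropout is only active during training: it randomly silences hidden units so the network cannot rely on any single one, which is how it limits overfitting on a dataset of only a few hundred mutations.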

3. XGBoost Regression

Purpose: Predict numerical thermostability values

Technical implementation:

Uses an ensemble of decision trees built in sequence, each new tree correcting the errors of the ones before it
Analyzes features like solvent accessible surface area, hydrogen bonds, salt bridges
Achieved 99% accuracy in predicting stability values
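
The boosting idea can be illustrated with a short, hand-rolled sketch. This is not XGBoost itself, which adds regularization and many optimizations; it is plain gradient boosting with one-split trees ("stumps") on squared error, written in NumPy. The single feature, the target values, and the hyperparameters are made up for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a depth-1 regression tree (one threshold split) to the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = x <= threshold
        left_value = residual[left].mean()
        right_value = residual[~left].mean()
        error = np.sum((residual - np.where(left, left_value, right_value)) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_boosted(x, y, n_rounds=50, learning_rate=0.3):
    """Return (base, stumps, history); history holds the training MSE per round."""
    base = float(y.mean())
    pred = np.full(y.shape, base)
    stumps = []
    history = [float(np.mean((y - pred) ** 2))]
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)  # each new tree targets the current errors
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        history.append(float(np.mean((y - pred) ** 2)))
    return base, stumps, history

def predict_boosted(base, stumps, x, learning_rate=0.3):
    pred = np.full(np.shape(x), base, dtype=float)
    for stump in stumps:
        pred = pred + learning_rate * predict_stump(stump, x)
    return pred

# Toy data: one hypothetical scaled feature and a made-up stability target.
sasa = np.linspace(0.0, 1.0, 20)
stability = 2.0 * sasa - 1.0
base, stumps, history = fit_boosted(sasa, stability)
print(f"training MSE: {history[0]:.4f} -> {history[-1]:.4f}")
```

The training error shrinks with every round because each stump is fitted to what the previous stumps got wrong.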

4. Graph Convolutional Network (GCN)

Purpose: Predict DNA binding affinity of mutated proteins

Technical implementation:

Converts adjacency matrices into graph-based inputs
Analyzes protein structure in its natural graph form
Uses ReLU activation functions to capture non-linear relationships in residue interactions
Compares wild type and mutated proteins
Predicts interaction probability between mutant P53 and DNA sequences
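
The sketch below shows how a graph convolution turns an adjacency matrix into a prediction. It assumes the widely used propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W), then mean-pools the nodes and applies a sigmoid to get one score per graph. The original model used a graph library rather than raw NumPy, and the presentation does not say which GCN variant it used, so the rule, the three-node graphs, the one-hot features, and the random untrained weights are all assumptions.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(adj_norm, features, weights):
    """One graph convolution: mix neighbor features, project, apply ReLU."""
    return np.maximum(adj_norm @ features @ weights, 0.0)

def graph_score(adj, features, w1, w2):
    """One GCN layer, mean-pool over nodes, squash to a value in (0, 1)."""
    hidden = gcn_layer(normalize_adjacency(adj), features, w1)
    pooled = hidden.mean(axis=0)
    return float(1.0 / (1.0 + np.exp(-(pooled @ w2))))

rng = np.random.default_rng(0)

# Hypothetical three-node contact graphs; the "mutant" loses one contact.
wild_type = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
mutant = wild_type.copy()
mutant[1, 2] = mutant[2, 1] = 0.0

features = np.eye(3)              # one-hot node features, for illustration only
w1 = rng.normal(size=(3, 4))      # untrained weights
w2 = rng.normal(size=4)

wt_score = graph_score(wild_type, features, w1, w2)
mut_score = graph_score(mutant, features, w1, w2)
print(f"wild type: {wt_score:.3f}, mutant: {mut_score:.3f}")
```

Because the weights are random, the two scores only demonstrate that a change in the contact graph changes the output. A trained model would learn weights that make the score track measured binding behavior.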

Results and Significance

This innovative approach achieved remarkable results:

Reduced analysis time from 2,400-2,500 hours to just 3 minutes, a speedup of roughly 48,000 to 50,000 times
96-99% accuracy across different prediction tasks
Identified that the R249S mutation (arginine replaced by serine at position 249) decreases both stability and binding interaction probability
Found that a statistical majority of P53-related point mutations affect DNA interaction, leading to faster cancer progression
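
Point mutations such as R249S follow a compact notation: the one-letter code of the original amino acid, the residue position, then the one-letter code of the replacement. A small helper for decoding it might look like the sketch below. The function name `parse_point_mutation` and the returned tuple layout are illustrative choices, not part of the presented pipeline.

```python
import re

# Standard one-letter codes for the 20 amino acids.
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "Q": "glutamine", "E": "glutamic acid", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}

MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_point_mutation(code):
    """Split a code like 'R249S' into (wild-type name, position, mutant name)."""
    match = MUTATION_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a point mutation code: {code!r}")
    wild, position, mutant = match.groups()
    if wild not in AMINO_ACIDS or mutant not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid letter in {code!r}")
    return AMINO_ACIDS[wild], int(position), AMINO_ACIDS[mutant]

print(parse_point_mutation("R249S"))
```

Parsing the code into its parts makes it straightforward to deduplicate mutation records or to look up the affected residue in a structure.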

Technical Implementation

The model was built entirely in Google Colab using:

TensorFlow
PyTorch Geometric (for the GCN)
AlphaFold for structure prediction
Python for data processing and adjacency matrix creation

Future Work

Lab validation of computational predictions
Enhancement of the machine learning components
Creation of a web-based interface for researchers
Scaling the model for worldwide research use

Broader Applications

While focused on hepatocellular carcinoma, this approach has implications for:

Other cancers where P53 is mutated
Protein structure analysis in general
Computational biology and drug discovery
Personalized medicine approaches

Watch the Presentation

For the full presentation, watch the YouTube video.

Resources

Blog post with additional details
COSMIC Database - Source of mutation data
AlphaFold - Protein structure prediction
PyTorch Geometric - Graph Neural Network library
