Computational Drug discovery 💊

7 min readDec 7, 2021

A machine learning based Drug discovery model to predict the bioactivity of a protein or enzyme.

Introduction
Data Collection
Data analysis
Descriptors’ dataset making
Linear Regression Model
Model deployment

Introduction

It is estimated that a typical drug discovery cycle, from lead identification through to clinical trials, can take 14 years with cost of 800 million US dollars.🤯

But addition of computer aided drug design(CADD) could lead to a revolution by reduction in the cost of drug design and development by up to 50% and moroever, the application of CADD can be cost-effective using experiments to compare predicted and actual drug activity, the results from which can used iteratively to improve compound properties.

Everyone knows and have heard of various uses of ML , one of the most common one being Self driving cars, artificial intelligence , etc. But How does it help in Drug Discovery right! let us see.

(This model is inspired by Chanin Nantasenament aka Dataprofessor.)

CADD (computer aided drug design) is a technique which uses softwares for predicting the structure, value of properties of known, unknown, stable and molecular species using mathematical equations. Machine learning can be used in this for calculating and predicting different important variables that is to be used in Drug discovery of any known or unknown species.

Specifically my model predicts the Pic50 value that marks a potency of a compound with the target protein and also it can be modified to find predicted LogP that marks a permeability of a compound.

Data collection

Database:
For data collection we will be using the ChEMBL bioactivity data that is available for use.
The ChEMBL Database is a database that contains curated bioactivity data of more than 2 million compounds. It is compiled from more than 76,000 documents, 1.2 million assays and the data spans 13,000 targets and 1,800 cells and 33,000 indications.

Installing the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

! pip install chembl_webresource_client

Necessary libraries:
Pandas is a library available in python that can be used to analyse as well manipulate the data.
We can import libraries by following

import pandas as pd
from chembl_webresource_client.new_client import new_client

Choosing target and Retrieving the data from target protein
After import database we can choose a target protein from all the data available and then retrieve all the data of standard type IC50 from the target protein i.e. SARS coronavirus 3C-like proteinase.
Finally we can classify the Data into active , intermediate and inactive on the basis of their standard_value by following bulk of code.

bioactivity_class=[]for i in DF2.standard_value:if float(i) >= 10000:bioactivity_class.append("inactive")elif float(i) <= 1000:bioactivity_class.append("active")else:bioactivity_class.append("intermediate")

And at last we can combine our lists of canonical_smiles, standard_value, bioactivity_class and chembl_id into one dataframe , that can be saved in the form of csv file in our google drive or locally in our desktop.

#combining
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class, standard_value))
df3 = pd.DataFrame( data_tuples,  columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])
#saving
df3.to_csv('bioactivity_preprocessed_data.csv', index=False)

Data analysis

In this part we will be performing exploratory data analysis along with descriptor calculations.

Installing & importing libraries:
The rdkit library is a Python library that allows us to handle chemical structures and the calculation of their molecular properties (i.e. for quantitating the molecular features of each molecule that we can subsequently use in the development of a machine learning model)

#installing
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
! conda install -c rdkit rdkit -y
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')#importing
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

Calculating descriptors:
A data descriptor is a structure containing information that describes data. Descriptors are also used to hold information about data that is only fully known at run-time.

def lipinski(smiles, verbose=False):

    moldata= []
    for elem in smiles:
        mol=Chem.MolFromSmiles(elem) 
        moldata.append(mol)
       
    baseData= np.arange(1,1)
    i=0  
    for mol in moldata:        
       
        desc_MolWt = Descriptors.MolWt(mol)
        desc_MolLogP = Descriptors.MolLogP(mol)
        desc_NumHDonors = Lipinski.NumHDonors(mol)
        desc_NumHAcceptors = Lipinski.NumHAcceptors(mol)
           
        row = np.array([desc_MolWt,
                        desc_MolLogP,
                        desc_NumHDonors,
                        desc_NumHAcceptors])   
    
        if(i==0):
            baseData=row
        else:
            baseData=np.vstack([baseData, row])
        i=i+1      
    
    columnNames=["MW","LogP","NumHDonors","NumHAcceptors"]   
    descriptors = pd.DataFrame(data=baseData,columns=columnNames) return descriptors#defining lipinski data frame.
df_lipinski = lipinski(df3.canonical_smiles)

Combining df3 with df_lipinski

df_combined = pd.concat([df,df_lipinski], axis=1)

Converting IC50 to pIC50
To allow IC50 data to be more uniformly distributed, we will convert IC50 to the negative logarithmic scale which is essentially -log10(IC50)(pIC50)
This custom function pIC50() will accept a DataFrame as input and will:
1. Take the IC50 values from the standard_value column and converts it from nM to M by multiplying the value by 10−9
2. Take the molar value and apply -log10
3. Delete the standard_value column and create a new pIC50 column

def pIC50(input):pIC50 = []for i in input['standard_value_norm']:molar = i*(10**-9) # Converts nM to MpIC50.append(-np.log10(molar))input['pIC50'] = pIC50x = input.drop('standard_value_norm', 1)return x

Optional: Chemical space analysis of these lipinski descriptors
Here is the chemical analysis of descriptors just for the understanding of the concept more.

Bioactivity class

LogP vs Mw

pIC50 value

Similarly we did chemical analysis of rest descriptors.

Conclusion:
Taking a look at pIC50 values, the actives and inactives displayed statistically significant different, which is to be expected since we defined different threshold values.
So our code and data is running correct.

Descriptors dataset making

Installing PaDEL-Descriptor
PaDEL -Descriptor is a software to calculate molecular descriptors and fingerprints that are unique identifiers for their corresponding data and files, much like human fingerprints uniquely identify individual people.
Calculate fingerprint descriptors
we can calculate PaDEL descriptors by jus two short codes like this:

! cat padel.sh
! bash padel.sh

X and Y data matrices
We can then make X data matrix that contains all the fingerprints, and a y data that contains pIC50 of each compound.
after this we can combine these and save it as a csv file to actually help in model building.

Linear Regression model

In this part we will make a linear regression model that is in statistical modeling, is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables, that will in future help my program to predict the dependent variable of any compound.

Libraries(here we go again 😣)
Sklearn is a very useful librari in python that provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistence interface in Python.

import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

Removing low variance features
remove low variance features i.e. features almost equal to 0 or are constant so they dont play importance in model building.

from sklearn.feature_selection import VarianceThreshold
selection = VarianceThreshold(threshold=(.8 * (1 - .8)))    
X = selection.fit_transform(X)

Data split (80/20 ratio)
We will now proceed to performing data splitting using a split ratio of 80/20 (i.e. we do this by assigning the test_size parameter to 0.2) whereby 80% of the initial dataset will be used as the training set and the remaining 20% of the dataset will be used as the testing set.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

Model building(finally🥱)

model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, Y_train)
r2 = model.score(X_test, Y_test)
r2

Scatter plot between predicted and actual IC50

Y_pred = model.predict(X_test)import matplotlib.pyplot as plt

sns.set(color_codes=True)
sns.set_style("white")

ax = sns.regplot(Y_test, Y_pred, scatter_kws={'alpha':0.4})
ax.set_xlabel('Experimental pIC50', fontsize='large', fontweight='bold')
ax.set_ylabel('Predicted pIC50', fontsize='large', fontweight='bold')
ax.set_xlim(0, 12)
ax.set_ylim(0, 12)
ax.figure.set_size_inches(5, 5)
plt.show

Upon running the above code you will have a graph that shows the predicted vs actual value.

So now what right!!lets see

Model Deployment

Now in the last step we would put our model or what our machine learned into action by asking it QUESTIONS🥳(tbh it is the most exciting part of project).
Let us question the machine!!!!

Download the results as a pickle model

import picklepickle.dump(model, open(‘coronavirus_model.pkl’, ‘wb’))

Make a project folder in your desktop

Let’s do some frontend now
After making a app.py file i added some frontend code to make a working website with the help of software called Streamlit.
and Voila after running it on terminal we have our prediction app🤩