Descriptors

Molecular Descriptors

Molecular descriptors can be defined as mathematical representations of molecules’ properties that are generated by algorithms. Their numerical values quantitatively describe the physical and chemical information of the molecules. One example is LogP, a quantitative measure of a molecule’s lipophilicity. It is obtained by measuring how the molecule partitions between an aqueous phase and a lipophilic phase, usually water and n-octanol.
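As a minimal sketch of the arithmetic behind LogP, the snippet below takes the base-10 logarithm of the ratio of a compound's concentration in the octanol phase to its concentration in the water phase. The function name `log_p` and the example concentrations are hypothetical and only illustrate the definition; they are not part of Datamol.

```python
import math


def log_p(conc_octanol: float, conc_water: float) -> float:
    """Return log10 of the octanol/water concentration ratio (hypothetical helper)."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


# Hypothetical measurement: the compound is 100x more concentrated in octanol
# than in water, so it is lipophilic and its LogP is positive.
example_logp = log_p(conc_octanol=1.0, conc_water=0.01)
print(round(example_logp, 2))
```

A positive value means the compound prefers the lipophilic phase, and a negative value means it prefers water.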

Molecular descriptors can generally be classified in four ways:


Tutorial

In this tutorial, we’ll show how descriptors can be useful as filters in the drug discovery process. The tutorial was inspired by the TeachOpenCADD talktorial. We highly encourage you to read through the theory in that talktorial, and to understand ADME and why it matters in drug discovery, before diving in. It provides the background needed to fully understand the purpose of this tutorial.

This tutorial focuses on the following descriptors and their limits:

Molecular weight ≤ 500 Da
Number of hydrogen bond acceptors (HBAs) ≤ 10
Number of hydrogen bond donors (HBDs) ≤ 5
Calculated LogP (octanol-water partition coefficient) ≤ 5

These descriptors and their limits are collectively known as Lipinski’s rule of five (Ro5), a method used to estimate the oral bioavailability of a compound based solely on its chemical structure. If a molecule violates any of the rules listed above (e.g. a molecular weight of 700 Da), the compound is likely to exhibit poor absorption or permeation, so in this tutorial it is removed from the list.
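The four limits above can be sketched as a small check on a dictionary of descriptor values. The keys reuse the Datamol descriptor names that appear later in this tutorial (`mw`, `n_lipinski_hba`, `n_lipinski_hbd`, `clogp`). The helper names `ro5_violations` and `passes_ro5`, and the `max_violations` parameter, are hypothetical. The default of zero mirrors the strict filter used in this tutorial; a commonly cited relaxed variant tolerates one violation.

```python
# Upper limits of Lipinski's rule of five, keyed by Datamol descriptor name.
RO5_LIMITS = {
    "mw": 500.0,
    "n_lipinski_hba": 10,
    "n_lipinski_hbd": 5,
    "clogp": 5.0,
}


def ro5_violations(descriptors: dict) -> list:
    """Return the names of the Ro5 descriptors that exceed their limit."""
    return [name for name, limit in RO5_LIMITS.items() if descriptors[name] > limit]


def passes_ro5(descriptors: dict, max_violations: int = 0) -> bool:
    """Hypothetical helper: True if at most `max_violations` limits are exceeded."""
    return len(ro5_violations(descriptors)) <= max_violations


# Hypothetical compound that is too heavy but otherwise within the limits.
heavy = {"mw": 700.0, "n_lipinski_hba": 8, "n_lipinski_hbd": 3, "clogp": 4.2}
print(ro5_violations(heavy))                # ['mw']
print(passes_ro5(heavy))                    # False
print(passes_ro5(heavy, max_violations=1))  # True
```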

This tutorial walks through a real-world scenario: obtaining a dataset and then filtering its compounds according to the Ro5.

Part 1: Obtain a virtual screening library from Enamine.
- The DNA library is designed to identify novel active compounds against proteins that are essential for DNA stability. At 5530 compounds, it is one of Enamine’s smaller libraries. The same functions could be applied to some of the larger libraries using Datamol’s parallelization functions.
- Note: for this tutorial, we load a truncated version of the Enamine virtual screening (VS) library.
Part 2: Calculate the relevant molecular properties for the Ro5 for each compound in the list.
Part 3: Investigate compliance with the Ro5.
Part 4: Reveal the statistics for the dataset of compounds using the Ro5 as a filter. This answers our question: how many compounds fulfill vs. violate the Ro5?
- Finally, we display the data visually using Matplotlib and seaborn.

In [1]:

                
import datamol as dm

# Part 1: Obtain a list of molecules and visualize
# Load sdf downloaded from Enamine with the flag as_df set to True
# This will automatically create a 'smiles' column from the sdf file
data = dm.read_sdf("./data/Enamine_DNA_Libary_5530cmpds_20200831_SMALL.sdf", as_df=True)

data["mol"] = data["smiles"].apply(dm.to_mol)

mols = data["mol"].tolist()

dm.to_image(mols[:12], mol_size=(200, 150), use_svg=False)

Out[1]:

In [2]:

                
# Calculate a specific descriptor for a compound
n_aromatic_atoms = dm.descriptors.n_aromatic_atoms(dm.copy_mol(mols[0]))
print("Number of aromatic atoms in the compound is", n_aromatic_atoms)

mols[0]

Number of aromatic atoms in the compound is 6

Out[2]:

In [3]:

                
# Part 2: Calculate the relevant molecular properties for the Ro5 for the list

# Calculate many descriptors for a compound
dm.descriptors.compute_many_descriptors(mols[150])

Out[3]:

{'mw': 210.009913052,
 'fsp3': 0.125,
 'n_lipinski_hba': 5,
 'n_lipinski_hbd': 1,
 'n_rings': 2,
 'n_hetero_atoms': 6,
 'n_heavy_atoms': 14,
 'n_rotatable_bonds': 1,
 'n_radical_electrons': 0,
 'tpsa': 71.66999999999999,
 'qed': 0.7539078378657419,
 'clogp': 0.7626199999999999,
 'sas': 2.5248498164613675,
 'n_aliphatic_carbocycles': 0,
 'n_aliphatic_heterocyles': 0,
 'n_aliphatic_rings': 0,
 'n_aromatic_carbocycles': 0,
 'n_aromatic_heterocyles': 2,
 'n_aromatic_rings': 2,
 'n_saturated_carbocycles': 0,
 'n_saturated_heterocyles': 0,
 'n_saturated_rings': 0}

In [4]:

                
# Batch compute many descriptors for a list of compounds
df = dm.descriptors.batch_compute_many_descriptors(mols)
df

Out[4]:

	mw	fsp3	n_lipinski_hba	n_lipinski_hbd	n_rings	n_hetero_atoms	n_heavy_atoms	n_rotatable_bonds	n_radical_electrons	tpsa	...	sas	n_aliphatic_carbocycles	n_aliphatic_heterocyles	n_aliphatic_rings	n_aromatic_carbocycles	n_aromatic_heterocyles	n_aromatic_rings	n_saturated_carbocycles	n_saturated_heterocyles	n_saturated_rings
0	122.048013	0.000000	3	2	1	3	9	1	0	55.98	...	1.690816	0	0	0	0	1	1	0	0	0
1	133.063997	0.000000	3	3	2	3	10	0	0	54.70	...	2.795444	0	0	0	1	1	2	0	0	0
2	169.040675	0.000000	3	3	2	4	11	0	0	54.70	...	2.381662	0	0	0	1	1	2	0	0	0
3	152.004434	0.000000	3	1	2	4	10	0	0	45.75	...	2.591944	0	0	0	0	2	2	0	0	0
4	133.063997	0.000000	3	3	2	3	10	0	0	54.70	...	2.232651	0	0	0	1	1	2	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
876	337.092578	0.176471	4	1	2	7	24	5	0	55.40	...	1.761027	0	0	0	2	0	2	0	0	0
877	335.109291	0.384615	7	5	1	10	23	7	0	113.68	...	2.716945	0	0	0	1	0	1	0	0	0
878	426.216809	0.240000	7	1	5	7	32	5	0	65.77	...	2.215072	0	1	1	2	2	4	0	1	1
879	460.087308	0.086957	7	0	5	9	32	6	0	82.52	...	2.379198	0	0	0	2	3	5	0	0	0
880	312.163792	0.315789	3	1	3	4	23	5	0	32.34	...	1.851829	0	1	1	2	0	2	0	0	0

881 rows × 22 columns

In [5]:

                
# Part 3: Investigate compliance with Ro5

df = df[df["mw"] <= 500]
df = df[df["n_lipinski_hba"] <= 10]
df = df[df["n_lipinski_hbd"] <= 5]
df = df[df["clogp"] <= 5]
df

# 863 of the 881 compounds in the dataset satisfy all criteria in the rule of 5

Out[5]:

	mw	fsp3	n_lipinski_hba	n_lipinski_hbd	n_rings	n_hetero_atoms	n_heavy_atoms	n_rotatable_bonds	n_radical_electrons	tpsa	...	sas	n_aliphatic_carbocycles	n_aliphatic_heterocyles	n_aliphatic_rings	n_aromatic_carbocycles	n_aromatic_heterocyles	n_aromatic_rings	n_saturated_carbocycles	n_saturated_heterocyles	n_saturated_rings
0	122.048013	0.000000	3	2	1	3	9	1	0	55.98	...	1.690816	0	0	0	0	1	1	0	0	0
1	133.063997	0.000000	3	3	2	3	10	0	0	54.70	...	2.795444	0	0	0	1	1	2	0	0	0
2	169.040675	0.000000	3	3	2	4	11	0	0	54.70	...	2.381662	0	0	0	1	1	2	0	0	0
3	152.004434	0.000000	3	1	2	4	10	0	0	45.75	...	2.591944	0	0	0	0	2	2	0	0	0
4	133.063997	0.000000	3	3	2	3	10	0	0	54.70	...	2.232651	0	0	0	1	1	2	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
875	363.115381	0.157895	6	0	4	7	26	5	0	69.63	...	2.148096	0	0	0	2	2	4	0	0	0
876	337.092578	0.176471	4	1	2	7	24	5	0	55.40	...	1.761027	0	0	0	2	0	2	0	0	0
877	335.109291	0.384615	7	5	1	10	23	7	0	113.68	...	2.716945	0	0	0	1	0	1	0	0	0
878	426.216809	0.240000	7	1	5	7	32	5	0	65.77	...	2.215072	0	1	1	2	2	4	0	1	1
880	312.163792	0.315789	3	1	3	4	23	5	0	32.34	...	1.851829	0	1	1	2	0	2	0	0	0

863 rows × 22 columns
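The sequential filters above overwrite `df`, so the violating compounds are no longer available for counting. As an alternative sketch, the four conditions can be combined into a single boolean mask, which leaves the table intact and gives both counts directly. The `toy` DataFrame below is made-up illustration data that reuses the Datamol column names; it is not taken from the Enamine library.

```python
import pandas as pd

# Made-up descriptor values for four hypothetical compounds.
toy = pd.DataFrame(
    {
        "mw": [122.0, 512.3, 337.1, 460.1],
        "n_lipinski_hba": [3, 7, 4, 11],
        "n_lipinski_hbd": [2, 1, 1, 0],
        "clogp": [0.8, 3.1, 5.6, 2.4],
    }
)

# True where a compound satisfies all four Ro5 limits.
ro5_mask = (
    (toy["mw"] <= 500)
    & (toy["n_lipinski_hba"] <= 10)
    & (toy["n_lipinski_hbd"] <= 5)
    & (toy["clogp"] <= 5)
)

n_fulfilled = int(ro5_mask.sum())
n_violated = int((~ro5_mask).sum())
print(f"fulfill Ro5: {n_fulfilled}, violate Ro5: {n_violated}")

# The original table is untouched; select either subset when needed.
fulfilled = toy[ro5_mask]
violated = toy[~ro5_mask]
```

Comparisons against a missing value (NaN) evaluate to False, so a compound with a missing descriptor would be counted as a violation under this mask.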

In [10]:

                
df["mw"].isnull().sum()

Out[10]:

In [11]:

                
# Part 4: Reveal the statistics for the dataset of compounds using Ro5 as a filter. How many fulfill vs. violate Ro5?
# Plotting the RO5 descriptors

import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(ncols=4, figsize=(25, 4))
axs = axs.flatten()

sns.histplot(df, x="mw", ax=axs[0])
sns.histplot(df, x="n_lipinski_hba", ax=axs[1])
sns.histplot(df, x="n_lipinski_hbd", ax=axs[2])
sns.histplot(df, x="clogp", ax=axs[3])

Out[11]:

<AxesSubplot: xlabel='clogp', ylabel='Count'>

If you’re curious to learn more about other established rules and filters used in the drug discovery industry, the following list is a good starting point for a search:

Rules of CNS
BBB score
Rule of Egan
Rule-of-5
Beyond Rule-of-5
Rule-of-4
Ghose Filter
Zinc Rule
Rule of GSK (4/400)
Lead-Like Soft Rule
Oprea’s Rule
Pfizer Rule (3/75)
REOS Filter
Rule-of-3
Extended Rule-of-3
Veber Filter

References:

TeachOpenCADD - https://projects.volkamerlab.org/teachopencadd/talktorials/T002_compound_adme.html?highlight=descriptors
ADME criteria (Wikipedia and Mol Pharm. (2010), 7(5), 1388-1405)
What are lead compounds? (Wikipedia)
What is the LogP value? (Wikipedia)
Lipinski et al. “Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings.” (Adv. Drug Deliv. Rev. (1997), 23, 3-25)
