datamol.descriptors
Various molecular descriptors
Return a descriptor function by name, looked up first in `rdkit.Chem.Descriptors` and then in `rdkit.Chem.rdMolDescriptors`. A `ValueError` is raised if neither module defines the name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Descriptor name. | required |
Source code in datamol/descriptors/compute.py
def any_rdkit_descriptor(name: str) -> Callable:
    """Return a descriptor function by name either from
    `rdkit.Chem.Descriptors` or `rdkit.Chem.rdMolDescriptors`.
    Args:
        name: Descriptor name.
    """
    fn = getattr(Descriptors, name, None)
    if fn is None:
        fn = getattr(rdMolDescriptors, name, None)
    if fn is None:
        raise ValueError(f"Descriptor {name} not found.")
    return fn
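The lookup order in the listing above can be tried without RDKit installed. The following is a minimal sketch in which `primary` and `fallback` are hypothetical stand-ins for `Descriptors` and `rdMolDescriptors`, and `find_descriptor` is an illustrative name that is not part of datamol.

```python
from types import SimpleNamespace
from typing import Callable

# Hypothetical stand-ins for rdkit.Chem.Descriptors and rdkit.Chem.rdMolDescriptors.
primary = SimpleNamespace(MolWt=lambda mol: "MolWt from primary")
fallback = SimpleNamespace(
    MolWt=lambda mol: "MolWt from fallback",
    CalcNumRings=lambda mol: "CalcNumRings from fallback",
)


def find_descriptor(name: str) -> Callable:
    """Return the attribute `name` from `primary`, else from `fallback`."""
    fn = getattr(primary, name, None)
    if fn is None:
        fn = getattr(fallback, name, None)
    if fn is None:
        raise ValueError(f"Descriptor {name} not found.")
    return fn


# A name defined in both namespaces resolves to the first one consulted.
both = find_descriptor("MolWt")(None)
# A name defined only in the second namespace is still found.
second_only = find_descriptor("CalcNumRings")(None)
```

Because the first namespace is consulted first, a name that exists in both resolves to the first one; the second is only a fallback.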
Compute a list of opinionated molecular properties.
Parameters:
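Name | Type | Description | Default |
---|---|---|---|
mol | Mol | A molecule. | required |
properties_fn | Optional[Dict[str, Union[Callable, str]]] | A dict mapping property names to the functions that compute them. If None, a default set of properties is used. If a value is a string, `any_rdkit_descriptor()` is used to retrieve the descriptor function. | None |
add_properties | bool | Whether to also compute the default properties alongside those given in `properties_fn`. Entries in `properties_fn` take precedence over defaults with the same name. | True |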
Returns:
Type | Description |
---|---|
dict | Computed properties as a dict. |
Source code in datamol/descriptors/compute.py
def compute_many_descriptors(
    mol: Mol,
    properties_fn: Optional[Dict[str, Union[Callable, str]]] = None,
    add_properties: bool = True,
) -> dict:
    """Compute a list of opinionated molecular properties.
    Args:
        mol: A molecule.
        properties_fn: A dict of functions that compute properties. If None,
            a default list of properties is used. If the function is a string,
            `dm.descriptors.any_rdkit_descriptor()` is used to retrieve the descriptor
            function.
        add_properties: Whether to add the computed properties to the default list.
    Returns:
        Computed properties as a dict.
    """
    if properties_fn is None:
        properties_fn = _DEFAULT_PROPERTIES_FN
    elif add_properties:
        [properties_fn.setdefault(k, v) for k, v in _DEFAULT_PROPERTIES_FN.items()]
    props = {}
    for k, v in properties_fn.items():
        if isinstance(v, str):
            v = any_rdkit_descriptor(v)
        props[k] = v(mol)
    return props
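The way custom entries combine with the defaults can be sketched in plain Python. In this illustration strings stand in for molecules, `DEFAULTS` is a hypothetical substitute for `_DEFAULT_PROPERTIES_FN`, and `compute_many` is an illustrative name. Unlike the listing above, which calls `setdefault()` on the dict that was passed in, this sketch builds a new dict so the caller's mapping is left unchanged; that difference is a deliberate assumption of the sketch, not datamol behavior.

```python
from typing import Callable, Dict, Optional

# Hypothetical defaults; strings play the role of molecules in this sketch.
DEFAULTS: Dict[str, Callable] = {"length": len, "upper": str.upper}


def compute_many(
    item: str,
    properties_fn: Optional[Dict[str, Callable]] = None,
    add_properties: bool = True,
) -> dict:
    """Apply every property function to `item` and collect the results."""
    if properties_fn is None:
        properties_fn = DEFAULTS
    elif add_properties:
        # Defaults first, then custom entries, so custom entries win on name clashes.
        properties_fn = {**DEFAULTS, **properties_fn}
    return {name: fn(item) for name, fn in properties_fn.items()}


custom = {"length": lambda s: -1, "first": lambda s: s[0]}
merged = compute_many("CCO", custom)  # custom "length" overrides the default
only_custom = compute_many("CCO", custom, add_properties=False)
```

With `add_properties=True` the result contains the custom properties plus any default that was not overridden; with `add_properties=False` only the custom properties are computed.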
Compute a list of opinionated molecular properties on a list of molecules.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mols | List[rdkit.Chem.rdchem.Mol] | A list of molecules. | required |
properties_fn | Optional[Dict[str, Union[Callable, str]]] | A dict mapping property names to the functions that compute them. If None, a default set of properties is used. If a value is a string, `any_rdkit_descriptor()` is used to retrieve the descriptor function. | None |
add_properties | bool | Whether to also compute the default properties alongside those given in `properties_fn`. | True |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe of computed properties with one row per input molecule. |
Source code in datamol/descriptors/compute.py
def batch_compute_many_descriptors(
    mols: List[Mol],
    properties_fn: Optional[Dict[str, Union[Callable, str]]] = None,
    add_properties: bool = True,
    n_jobs: int = 1,
    batch_size: Optional[int] = None,
    progress: bool = False,
    progress_leave: bool = True,
) -> pd.DataFrame:
    """Compute a list of opinionated molecular properties on a list of molecules.
    Args:
        mols: A list of molecules.
        properties_fn: A dict of functions that compute properties. If None,
            a default list of properties is used. If the function is a string,
            `dm.descriptors.any_rdkit_descriptor()` is used to retrieve the descriptor
            function.
        add_properties: Whether to add the computed properties to the default list.
    Returns:
        A dataframe of computed properties with one row per input molecule.
    """
    compute_fn = functools.partial(
        compute_many_descriptors,
        properties_fn=properties_fn,
        add_properties=add_properties,
    )
    props = parallelized(
        compute_fn,
        mols,
        batch_size=batch_size,
        progress=progress,
        n_jobs=n_jobs,
        tqdm_kwargs=dict(leave=progress_leave),
    )
    return pd.DataFrame(props)
CalcExactMolWt( (Mol)mol [, (bool)onlyHeavy=False]) -> float : returns the molecule's exact molecular weight
C++ signature :
double CalcExactMolWt(RDKit::ROMol [,bool=False])
CalcFractionCSP3( (Mol)mol) -> float : returns the fraction of C atoms that are SP3 hybridized
C++ signature :
double CalcFractionCSP3(RDKit::ROMol)
CalcNumHBA( (Mol)mol) -> int : returns the number of H-bond acceptors for a molecule
C++ signature :
unsigned int CalcNumHBA(RDKit::ROMol)
CalcNumHBD( (Mol)mol) -> int : returns the number of H-bond donors for a molecule
C++ signature :
unsigned int CalcNumHBD(RDKit::ROMol)
CalcNumLipinskiHBA( (Mol)mol) -> int : returns the number of Lipinski H-bond acceptors for a molecule
C++ signature :
unsigned int CalcNumLipinskiHBA(RDKit::ROMol)
CalcNumLipinskiHBD( (Mol)mol) -> int : returns the number of Lipinski H-bond donors for a molecule
C++ signature :
unsigned int CalcNumLipinskiHBD(RDKit::ROMol)
CalcNumRings( (Mol)mol) -> int : returns the number of rings for a molecule
C++ signature :
unsigned int CalcNumRings(RDKit::ROMol)
CalcNumHeteroatoms( (Mol)mol) -> int : returns the number of heteroatoms for a molecule
C++ signature :
unsigned int CalcNumHeteroatoms(RDKit::ROMol)
Number of heavy atoms in a molecule.
Source code in rdkit/Chem/Lipinski.py
def HeavyAtomCount(mol):
    " Number of heavy atoms in a molecule."
    return mol.GetNumHeavyAtoms()
CalcNumRotatableBonds( (Mol)mol, (bool)strict) -> int : returns the number of rotatable bonds for a molecule.
C++ signature :
unsigned int CalcNumRotatableBonds(RDKit::ROMol,bool)
CalcNumRotatableBonds( (Mol)mol [, (NumRotatableBondsOptions)strict=rdkit.Chem.rdMolDescriptors.NumRotatableBondsOptions.Default]) -> int : returns the number of rotatable bonds for a molecule.
C++ signature :
unsigned int CalcNumRotatableBonds(RDKit::ROMol [,RDKit::Descriptors::NumRotatableBondsOptions=rdkit.Chem.rdMolDescriptors.NumRotatableBondsOptions.Default])
Both overloads document the strict argument as follows:
- strict = NumRotatableBondsOptions.NonStrict: simple rotatable bond definition.
- strict = NumRotatableBondsOptions.Strict (default): does not count things like amide or ester bonds.
- strict = NumRotatableBondsOptions.StrictLinkages: handles linkages between ring systems.
  - Single bonds between aliphatic ring Cs are always rotatable. This means that the central bond in CC1CCCC(C)C1-C1C(C)CCCC1C is now considered rotatable; it was not before.
  - Heteroatoms in the linked rings no longer affect whether or not the linking bond is rotatable.
  - The linking bond in systems like Cc1cccc(C)c1-c1c(C)cccc1 is now considered non-rotatable.
The number of radical electrons the molecule has (says nothing about spin state)
>>> NumRadicalElectrons(Chem.MolFromSmiles('CC'))
0
>>> NumRadicalElectrons(Chem.MolFromSmiles('C[CH3]'))
0
>>> NumRadicalElectrons(Chem.MolFromSmiles('C[CH2]'))
1
>>> NumRadicalElectrons(Chem.MolFromSmiles('C[CH]'))
2
>>> NumRadicalElectrons(Chem.MolFromSmiles('C[C]'))
3
Source code in rdkit/Chem/Descriptors.py
def NumRadicalElectrons(mol):
    """ The number of radical electrons the molecule has
    (says nothing about spin state)
    >>> NumRadicalElectrons(Chem.MolFromSmiles('CC'))
    0
    >>> NumRadicalElectrons(Chem.MolFromSmiles('C[CH3]'))
    0
    >>> NumRadicalElectrons(Chem.MolFromSmiles('C[CH2]'))
    1
    >>> NumRadicalElectrons(Chem.MolFromSmiles('C[CH]'))
    2
    >>> NumRadicalElectrons(Chem.MolFromSmiles('C[C]'))
    3
    """
    return sum(atom.GetNumRadicalElectrons() for atom in mol.GetAtoms())
CalcTPSA( (Mol)mol [, (bool)force=False [, (bool)includeSandP=False]]) -> float : returns the TPSA value for a molecule
C++ signature :
double CalcTPSA(RDKit::ROMol [,bool=False [,bool=False]])
Calculate the weighted sum of ADS mapped properties
Some examples from the QED paper, with reference values from Peter G's original implementation:
>>> m = Chem.MolFromSmiles('N=C(CCSCc1csc(N=C(N)N)n1)NS(N)(=O)=O')
>>> qed(m)
0.253...
>>> m = Chem.MolFromSmiles('CNC(=NCCSCc1nc[nH]c1C)NC#N')
>>> qed(m)
0.234...
>>> m = Chem.MolFromSmiles('CCCCCNC(=N)NN=Cc1c[nH]c2ccc(CO)cc12')
>>> qed(m)
0.234...
Source code in rdkit/Chem/QED.py
@setDescriptorVersion(version='1.1.0')
def qed(mol, w=WEIGHT_MEAN, qedProperties=None):
    """ Calculate the weighted sum of ADS mapped properties
    some examples from the QED paper, reference values from Peter G's original implementation
    >>> m = Chem.MolFromSmiles('N=C(CCSCc1csc(N=C(N)N)n1)NS(N)(=O)=O')
    >>> qed(m)
    0.253...
    >>> m = Chem.MolFromSmiles('CNC(=NCCSCc1nc[nH]c1C)NC#N')
    >>> qed(m)
    0.234...
    >>> m = Chem.MolFromSmiles('CCCCCNC(=N)NN=Cc1c[nH]c2ccc(CO)cc12')
    >>> qed(m)
    0.234...
    """
    if qedProperties is None:
        qedProperties = properties(mol)
    d = [ads(pi, adsParameters[name]) for name, pi in qedProperties._asdict().items()]
    t = sum(wi * math.log(di) for wi, di in zip(w, d))
    return math.exp(t / sum(w))
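The last two lines of the listing above compute exp(sum(w_i * ln d_i) / sum(w_i)), which is a weighted geometric mean of the ADS-mapped values `d`. The following is a minimal sketch of that aggregation step only; the desirability values and weights are made up for illustration and `weighted_geometric_mean` is a hypothetical helper name, not part of RDKit.

```python
import math
from typing import Sequence


def weighted_geometric_mean(d: Sequence[float], w: Sequence[float]) -> float:
    """Return exp(sum(w_i * ln d_i) / sum(w_i)); every d_i must be > 0."""
    t = sum(wi * math.log(di) for wi, di in zip(w, d))
    return math.exp(t / sum(w))


# Made-up desirability values and weights, for illustration only.
equal_weights = weighted_geometric_mean([0.25, 1.0], [1.0, 1.0])
# Giving the low value three times the weight pulls the score further down.
skewed_weights = weighted_geometric_mean([0.25, 1.0], [3.0, 1.0])
```

With equal weights the result is the ordinary geometric mean, here the square root of 0.25. Because the mean is taken in log space, a single low desirability lowers the score more than it would in an arithmetic mean.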
Wildman-Crippen LogP value
Uses an atom-based scheme based on the values in the paper: S. A. Wildman and G. M. Crippen JCICS 39 868-873 (1999)
Arguments
- inMol: a molecule
- addHs: (optional) toggles adding of Hs to the molecule for the calculation. If true, hydrogens will be added to the molecule and used in the calculation.
Source code in rdkit/Chem/Crippen.py
MolLogP = lambda *x, **y: rdMolDescriptors.CalcCrippenDescriptors(*x, **y)[0]
Source code in datamol/descriptors/descriptors.py
def _sasscorer(mol: Mol):
    sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
    try:
        import sascorer  # type:ignore
    except ImportError:
        raise ImportError(
            "Could not import sascorer. If you installed rdkit-pypi with `pip`, please uninstall it and reinstall rdkit with `conda` or `mamba`."
        )
    return sascorer.calculateScore(mol)
Number of NHs or OHs
Source code in rdkit/Chem/Lipinski.py
NHOHCount = lambda x: rdMolDescriptors.CalcNumLipinskiHBD(x)
Number of Nitrogens and Oxygens
Source code in rdkit/Chem/Lipinski.py
NOCount = lambda x: rdMolDescriptors.CalcNumLipinskiHBA(x)
GetFormalCharge( (Mol)arg1) -> int : Returns the formal charge for the molecule.
ARGUMENTS:
- mol: the molecule to use
C++ signature :
int GetFormalCharge(RDKit::ROMol)
CalcNumAliphaticCarbocycles( (Mol)mol) -> int : returns the number of aliphatic (containing at least one non-aromatic bond) carbocycles for a molecule
C++ signature :
unsigned int CalcNumAliphaticCarbocycles(RDKit::ROMol)
Source code in rdkit/Chem/Lipinski.py
_fn = lambda x, y=_cfn: y(x)
CalcNumAliphaticHeterocycles( (Mol)mol) -> int : returns the number of aliphatic (containing at least one non-aromatic bond) heterocycles for a molecule
C++ signature :
unsigned int CalcNumAliphaticHeterocycles(RDKit::ROMol)
Source code in rdkit/Chem/Lipinski.py
_fn = lambda x, y=_cfn: y(x)
CalcNumAliphaticRings( (Mol)mol) -> int : returns the number of aliphatic (containing at least one non-aromatic bond) rings for a molecule
C++ signature :
unsigned int CalcNumAliphaticRings(RDKit::ROMol)
Source code in rdkit/Chem/Lipinski.py
_fn = lambda x, y=_cfn: y(x)
CalcNumAromaticCarbocycles( (Mol)mol) -> int : returns the number of aromatic carbocycles for a molecule
C++ signature :
unsigned int CalcNumAromaticCarbocycles(RDKit::ROMol)
Source code in rdkit/Chem/Lipinski.py
_fn = lambda x, y=_cfn: y(x)
CalcNumAromaticHeterocycles( (Mol)mol) -> int : returns the number of aromatic heterocycles for a molecule
C++ signature :
unsigned int CalcNumAromaticHeterocycles(RDKit::ROMol)
Source code in rdkit/Chem/Lipinski.py
_fn = lambda x, y=_cfn: y(x)
CalcNumAromaticRings( (Mol)mol) -> int : returns the number of aromatic rings for a molecule
C++ signature :
unsigned int CalcNumAromaticRings(RDKit::ROMol)
Source code in rdkit/Chem/Lipinski.py
_fn = lambda x, y=_cfn: y(x)
CalcNumSaturatedCarbocycles( (Mol)mol) -> int : returns the number of saturated carbocycles for a molecule
C++ signature :
unsigned int CalcNumSaturatedCarbocycles(RDKit::ROMol)
Source code in rdkit/Chem/Lipinski.py
_fn = lambda x, y=_cfn: y(x)
CalcNumSaturatedHeterocycles( (Mol)mol) -> int : returns the number of saturated heterocycles for a molecule
C++ signature :
unsigned int CalcNumSaturatedHeterocycles(RDKit::ROMol)
Source code in rdkit/Chem/Lipinski.py
_fn = lambda x, y=_cfn: y(x)
CalcNumSaturatedRings( (Mol)mol) -> int : returns the number of saturated rings for a molecule
C++ signature :
unsigned int CalcNumSaturatedRings(RDKit::ROMol)
Source code in rdkit/Chem/Lipinski.py
_fn = lambda x, y=_cfn: y(x)
Calculate the number of aromatic atoms.
Source code in datamol/descriptors/descriptors.py
def n_aromatic_atoms(mol: Mol):
    """Calculate the number of aromatic atoms."""
    matches = mol.GetSubstructMatches(_AROMATIC_QUERY)
    return len(matches)
Calculate the aromatic proportion: the number of aromatic atoms divided by the total number of atoms.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | A molecule. | required |
Only heavy atoms are considered.
Source code in datamol/descriptors/descriptors.py
def n_aromatic_atoms_proportion(mol: Mol):
    """Calculate the aromatic proportion: # aromatic atoms/#atoms total.
    Args:
        mol: A molecule.
    Only heavy atoms are considered.
    """
    return n_aromatic_atoms(mol) / mol.GetNumHeavyAtoms()
Wildman-Crippen MR value
Uses an atom-based scheme based on the values in the paper: S. A. Wildman and G. M. Crippen JCICS 39 868-873 (1999)
Arguments
- inMol: a molecule
- addHs: (optional) toggles adding of Hs to the molecule for the calculation. If true, hydrogens will be added to the molecule and used in the calculation.
Source code in rdkit/Chem/Crippen.py
MolMR = lambda *x, **y: rdMolDescriptors.CalcCrippenDescriptors(*x, **y)[1]
Compute the number of rigid bonds in a molecule.
Rigid bonds are all bonds except single bonds that are not in a ring; that is, a bond counts as rigid if it is either not a single bond or part of a ring.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | A molecule. | required |
Returns:
Name | Description |
---|---|
n_rigid_bonds | Number of rigid bonds in the molecule. |
Source code in datamol/descriptors/descriptors.py
def n_rigid_bonds(mol: Mol):
    """Compute the number of rigid bonds in a molecule.
    Rigid bonds are bonds that are not single and not in rings.
    Args:
        mol: A molecule.
    Returns:
        n_rigid_bonds: number of rigid bonds in the molecule
    """
    non_rigid_bonds_count = from_smarts("*-&!@*")
    n_rigid_bonds = mol.GetNumBonds() - len(mol.GetSubstructMatches(non_rigid_bonds_count))
    return n_rigid_bonds
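The listing above subtracts the matches of the SMARTS `*-&!@*` (a single bond that is not in a ring) from the total bond count. The arithmetic can be sketched without RDKit by describing each bond as a `(bond_type, is_in_ring)` tuple; this representation and the name `count_rigid_bonds` are hypothetical and used here only for illustration.

```python
from typing import List, Tuple

# Each bond is (bond_type, is_in_ring); a hypothetical stand-in for RDKit bond objects.
Bond = Tuple[str, bool]


def count_rigid_bonds(bonds: List[Bond]) -> int:
    """Total number of bonds minus the single bonds that are not in a ring."""
    non_rigid = sum(
        1 for bond_type, in_ring in bonds if bond_type == "single" and not in_ring
    )
    return len(bonds) - non_rigid


# Toluene: six aromatic ring bonds plus one acyclic single bond to the methyl group.
toluene = [("aromatic", True)] * 6 + [("single", False)]
# But-2-ene (CC=CC): two acyclic single bonds and one acyclic double bond.
butene = [("single", False), ("double", False), ("single", False)]
# Cyclohexane: six single bonds, all in the ring.
cyclohexane = [("single", True)] * 6

rigid_counts = [count_rigid_bonds(m) for m in (toluene, butene, cyclohexane)]
```

Under this definition the ring single bonds of cyclohexane all count as rigid, while only the double bond of but-2-ene does.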
Compute the number of stereocenters in a molecule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | A molecule. | required |
Returns:
Name | Description |
---|---|
n_stereo_center | Number of stereocenters in the molecule. |
Source code in datamol/descriptors/descriptors.py
def n_stereo_centers(mol: Mol):
    """Compute the number of stereocenters in a molecule.
    Args:
        mol: A molecule.
    Returns:
        n_stereo_center: number of stereocenters in the molecule
    """
    n = 0
    try:
        rdmolops.FindPotentialStereo(mol, cleanIt=False)
        n = rdMolDescriptors.CalcNumAtomStereoCenters(mol)
    except:
        pass
    return n
Compute the number of charged atoms in a molecule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mol | Mol | A molecule. | required |
Returns:
Name | Description |
---|---|
n_charged_atoms | Number of charged atoms in the molecule. |
Source code in datamol/descriptors/descriptors.py
def n_charged_atoms(mol: Mol):
    """Compute the number of charged atoms in a molecule.
    Args:
        mol: A molecule.
    Returns:
        n_charged_atoms: number of charged atoms in the molecule
    """
    return sum([at.GetFormalCharge() != 0 for at in mol.GetAtoms()])
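The count above treats each atom with a non-zero formal charge as one charged atom, so it differs from the net formal charge of the molecule. A minimal sketch, assuming the per-atom formal charges are already available as a plain list of integers (`count_charged` is a hypothetical name used only here):

```python
from typing import Iterable


def count_charged(formal_charges: Iterable[int]) -> int:
    """Count atoms whose formal charge is non-zero, regardless of sign or magnitude."""
    return sum(charge != 0 for charge in formal_charges)


# Heavy-atom formal charges of zwitterionic glycine, [NH3+]CC(=O)[O-], in SMILES order.
glycine_zwitterion = [+1, 0, 0, 0, -1]

n_charged = count_charged(glycine_zwitterion)  # two charged atoms
net_charge = sum(glycine_zwitterion)  # the charges cancel
```

For a zwitterion the net charge is zero while the number of charged atoms is not, which is why the two descriptors are reported separately.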