Visualization
This tutorial highlights the main visualization features of Datamol.
import datamol as dm
First let's get a dataset.
data = dm.read_csv(
    "https://raw.githubusercontent.com/rdkit/rdkit/master/Data/NCI/first_200.tpsa.csv",
    comment="#",
    header=None,
)
data.columns = ["smiles", "tpsa"]
# Create a mol column
with dm.without_rdkit_log():
    data["mol"] = data["smiles"].apply(dm.to_mol)
# Patch the dataframe to render the molecules in it
dm.render_mol_df(data)
data.iloc[0]["mol"]
Now let's cluster the molecules and keep a single cluster (the one at index 1).
cluster_indices, cluster_mols = dm.cluster_mols(data["mol"].dropna().tolist(), cutoff=0.7)
mols = cluster_mols[1]
Display the molecules of the cluster, aligned on their maximum common substructure (MCS). This only requires setting the align flag in dm.to_image().
dm.to_image(mols, mol_size=(300, 200), align=True, use_svg=False)
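To make the alignment step less opaque, here is a minimal sketch of what a maximum common substructure is, using RDKit's rdFMCS.FindMCS directly. The three molecules are made up for illustration, and it is an assumption (not something this tutorial states) that align=True finds its MCS in a comparable way.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

# Illustrative molecules only: phenol, aniline and toluene share a benzene ring.
example_smiles = ["c1ccccc1O", "c1ccccc1N", "c1ccccc1C"]
example_mols = [Chem.MolFromSmiles(s) for s in example_smiles]

# Find the largest substructure present in every molecule.
result = rdFMCS.FindMCS(example_mols)

# The MCS is returned as a SMARTS pattern, which can be used as a query.
core = Chem.MolFromSmarts(result.smartsString)

# Atom indices of the shared core in each molecule; these are the atoms
# an alignment would superimpose.
matches = [mol.GetSubstructMatch(core) for mol in example_mols]

print(result.numAtoms, result.smartsString)
```

Once the shared core is known, each molecule can be drawn with that core in the same orientation, which is what makes a grid of related molecules easier to compare.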
Lasso Highlighting
The code below shows how to use the lasso highlight function. Its signature is:
def lasso_highlight_image(
    target_molecule: Union[str, dm.Mol],
    search_molecules: Union[str, List[str], dm.Mol, List[dm.Mol]],
    mol_size: Optional[Tuple[int, int]] = (300, 300)
) -> Image:
mol_size is the size of the returned image. The target molecule is accepted as a SMILES string or a mol object, and each substructure to search for as a SMARTS string or a mol object. The examples below also pass a use_svg keyword, which this signature does not list.
Image output is difficult to test automatically, so the edge cases are listed here with a brief description of each.
One edge case: you can search for at most 6 substructures unless more colors are added to the code.
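The substructure limit can be checked before calling the function. The following is a minimal sketch and not part of Datamol: normalize_queries and MAX_SUBSTRUCTURES are hypothetical names, and the value 6 is taken from the note above rather than read from the library.

```python
from typing import List, Union

# Assumption: the limit of 6 comes from the note above, not from the library.
MAX_SUBSTRUCTURES = 6


def normalize_queries(queries: Union[str, List[str]]) -> List[str]:
    """Return the queries as a list and check their count (hypothetical helper)."""
    # Accept a single SMARTS string as well as a list of them.
    if isinstance(queries, str):
        queries = [queries]
    if len(queries) > MAX_SUBSTRUCTURES:
        raise ValueError(
            f"{len(queries)} substructures requested, "
            f"but only {MAX_SUBSTRUCTURES} are supported"
        )
    return list(queries)
```

For example, normalize_queries("CONN") returns ["CONN"], so the same downstream code can handle both the single-substructure and the multi-substructure case.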
import datamol as dm
smi = "CO[C@@H](O)C1=C(O[C@H](F)Cl)C(C#N)=C1ONNC[NH3+]"
smarts_list = ["CONN", "N#CC~CO", "C=CON", "CONNCN"]
dm.lasso_highlight_image(smi, smarts_list, (400, 400), use_svg=True)
Alternatively, you may have only one substructure in mind:
smi = "CO[C@@H](O)C1=C(O[C@H](F)Cl)C(C#N)=C1ONNC[NH3+]"
smarts_list = "CONN"
dm.lasso_highlight_image(smi, smarts_list, (300, 300))