Datamol Filesystem Module
The filesystem module datamol.fs is not strictly related to molecules, but it makes it convenient to work with files both locally and remotely (AWS S3, GCS, HTTP, FTP, Git, etc.) in a transparent manner. Under the hood, the datamol.fs module is built on top of the fsspec library.
import tempfile
import datamol as dm
Destructive path manipulation
The examples below run locally for the purpose of the demo, but all the functions also support remote paths (S3, GCS, etc.).
First, let's create a temporary directory.
temp_dir = tempfile.mkdtemp()
dm.fs.exists(temp_dir)
True
Create a nested directory and check that it was created correctly.
subdir_path = dm.fs.join(temp_dir, "subdir1", "subsubdir293")
dm.fs.mkdir(subdir_path, exist_ok=True)
dm.fs.exists(subdir_path)
True
Copy a file from a source path to a destination path
destination_path = dm.fs.join(subdir_path, "Compound_000000001_000500000.sdf.gz")
dm.fs.copy_file(
source="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_000000001_000500000.sdf.gz",
destination=destination_path,
progress=True,
force=True,
)
0%| | 0.00/321M [00:00<?, ?B/s]
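To make the role of an overwrite flag such as force concrete, here is a minimal local-only sketch built on the standard library. The helper name copy_file_sketch and its behavior (raising an error when the destination exists and force is False) are assumptions made for illustration; this is not Datamol's implementation and it does not handle remote paths.

```python
import os
import shutil
import tempfile


def copy_file_sketch(source, destination, force=False, chunk_size=1024 * 1024):
    """Hypothetical local-only copy with an overwrite guard."""
    # Assumption: without force, an existing destination is treated as an error.
    if os.path.exists(destination) and not force:
        raise FileExistsError(f"{destination} already exists; pass force=True to overwrite.")

    # Stream the content in chunks so large files never sit fully in memory.
    with open(source, "rb") as src, open(destination, "wb") as dst:
        shutil.copyfileobj(src, dst, length=chunk_size)
    return destination


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "source.bin")
dst_path = os.path.join(workdir, "copy.bin")

with open(src_path, "wb") as f:
    f.write(b"example payload")

copy_file_sketch(src_path, dst_path)
copy_file_sketch(src_path, dst_path, force=True)  # overwriting must be requested explicitly
```

Streaming in chunks matters for files like the 321M archive above, since the whole payload never has to be held in memory at once.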
Now, let's copy a full directory tree to a given destination.
subdir2_path = dm.fs.join(temp_dir, "subdir2")
dm.fs.copy_dir(
source="https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/",
destination=subdir2_path,
progress=True,
)
0%| | 0/15 [00:00<?, ?it/s]
Let's check the files have been copied correctly.
dm.fs.glob(dm.fs.join(subdir2_path, "**"))
['file:///tmp/tmphlww_l88/subdir2/README',
 'file:///tmp/tmphlww_l88/subdir2/pubchem.asn',
 'file:///tmp/tmphlww_l88/subdir2/pubchem.xjb',
 'file:///tmp/tmphlww_l88/subdir2/pubchem.xsd',
 'file:///tmp/tmphlww_l88/subdir2/pubchem_deposit.pdf',
 'file:///tmp/tmphlww_l88/subdir2/pubchem_deposit.txt',
 'file:///tmp/tmphlww_l88/subdir2/pubchem_fingerprints.pdf',
 'file:///tmp/tmphlww_l88/subdir2/pubchem_fingerprints.txt',
 'file:///tmp/tmphlww_l88/subdir2/pubchem_pug.pdf',
 'file:///tmp/tmphlww_l88/subdir2/pubchem_pug.txt',
 'file:///tmp/tmphlww_l88/subdir2/pubchem_sdtags.pdf',
 'file:///tmp/tmphlww_l88/subdir2/pubchem_sdtags.txt',
 'file:///tmp/tmphlww_l88/subdir2/pug.dtd',
 'file:///tmp/tmphlww_l88/subdir2/pug.xsd',
 'file:///tmp/tmphlww_l88/subdir2/pug_soap.readme.txt']
Non-destructive path manipulation
Retrieve the paths matching a path pattern
dm.fs.glob("https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/**")[:5]
['https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/2021-11-25/',
 'https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/2021-11-25/ASN/',
 'https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/2021-11-25/ASN/Compound_021500001_022000000.asn.gz',
 'https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/2021-11-25/ASN/Compound_021500001_022000000.asn.gz.md5',
 'https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/2021-11-25/ASN/Compound_157000001_157500000.asn.gz']
Get the name of the file or directory for a given path
dm.fs.get_basename(
"https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/2021-12-03/SDF/Compound_013500001_014000000.sdf.gz"
)
'Compound_013500001_014000000.sdf.gz'
dm.fs.get_basename("https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/2021-12-03")
'2021-12-03'
Get the extension of a given path
dm.fs.get_extension("s3://an-s3-bucket-random/subdir1/subdir2/hello.txt")
'txt'
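For intuition, the basename and extension lookups above can be approximated with the standard library alone. The helpers basename_sketch and extension_sketch are hypothetical names; they only mimic the outputs shown above and are not Datamol's code. In particular, this sketch keeps only the last suffix of a name.

```python
import posixpath


def basename_sketch(path):
    """Last component of a slash-separated path, ignoring a trailing slash."""
    return posixpath.basename(path.rstrip("/"))


def extension_sketch(path):
    """Text after the final dot of the basename, without the dot itself."""
    _, ext = posixpath.splitext(basename_sketch(path))
    return ext.lstrip(".")


print(basename_sketch("https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/2021-12-03"))
print(extension_sketch("s3://an-s3-bucket-random/subdir1/subdir2/hello.txt"))
```

Because remote paths always use forward slashes, posixpath is used here instead of os.path, which would switch to backslash rules on Windows.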
Check whether a file or a directory exists
dm.fs.exists("https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/")
True
The same check returns False when the path does not exist
dm.fs.exists("gs://a-gcs-bucket-random/subdir1/subdir2/hello.txt")
False
Check whether a path is a file and exists
dm.fs.is_file(
"https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Daily/2021-12-03/SDF/Compound_013500001_014000000.sdf.gz"
)
True
Check whether a path is a directory and exists
dm.fs.is_dir("gs://a-gcs-bucket-random/subdir1/subdir2/")
False
Check whether a path is local or remote
dm.fs.is_local_path("/home/hello/a_subdir")
True
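One plausible way to tell local and remote paths apart is to look at the protocol prefix of the path. The sketch below uses only the standard library; protocol_sketch and is_local_path_sketch are hypothetical helpers, and the rule for single-letter schemes is an assumption made to cope with Windows drive letters, not something taken from Datamol.

```python
from urllib.parse import urlsplit


def protocol_sketch(path):
    """Return the URL scheme of a path, or 'file' when there is none."""
    scheme = urlsplit(path).scheme
    # Assumption: an empty or single-letter scheme (e.g. the drive letter
    # in C:\data) means the path lives on the local filesystem.
    if len(scheme) <= 1:
        return "file"
    return scheme


def is_local_path_sketch(path):
    """True when the path has no remote protocol prefix."""
    return protocol_sketch(path) == "file"


print(is_local_path_sketch("/home/hello/a_subdir"))
print(is_local_path_sketch("gs://a-gcs-bucket-random/subdir1/subdir2/"))
```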
Join paths together
data_dir = "gs://awesome-data-bucket/data_dir"
filename = "molecules.sdf"
dm.fs.join(data_dir, filename)
'gs://awesome-data-bucket/data_dir/molecules.sdf'
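As a point of comparison, forward-slash joining from the standard library gives the same result for this simple case. This is only a sketch of the idea, not how Datamol implements join, and the extra folder names raw and batch_01 are made up for the example.

```python
import posixpath

data_dir = "gs://awesome-data-bucket/data_dir"

# Joining always uses "/" regardless of the operating system,
# which is what remote object stores expect.
joined = posixpath.join(data_dir, "molecules.sdf")
nested = posixpath.join(data_dir, "raw", "batch_01", "molecules.sdf")

print(joined)
print(nested)
```

One caveat of posixpath.join is that a later component starting with "/" discards everything before it, so relative components are the safe choice when building remote paths this way.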
Get the size of a file (in bytes)
dm.fs.get_size(
"https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_000000001_000500000.sdf.gz"
)
336817141
Get the MD5 checksum of a file
dm.fs.md5("https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/README-Compound")
'b2d7b30c1466ab9582df47b2664d04b5'
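To illustrate what a size and an MD5 checksum represent, here is a small local sketch with the standard library. The helper md5_sketch is a hypothetical name and works on local files only; it reads the file in chunks, which is the usual way to hash large files without loading them into memory.

```python
import hashlib
import os
import tempfile


def md5_sketch(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a local file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops as soon as read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Write a small throwaway file to measure and hash.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"hello datamol")
    local_path = tmp.name

size_in_bytes = os.path.getsize(local_path)
checksum = md5_sketch(local_path)
print(size_in_bytes, checksum)

os.remove(local_path)
```

Comparing such a checksum with one published by the data provider is a common way to confirm that a download is intact.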
Cache directory
It's often convenient to get the path of a persistent cache folder. Unfortunately, this path changes depending on the OS you're working on. Datamol offers a function to easily retrieve the path of the "official" cache directory for the OS on which it's running.
dm.fs.get_cache_dir(app_name="datamol-demo", suffix="subdir1", create=False)
PosixPath('/home/hadim/.cache/datamol-demo/subdir1')
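To show why the cache location differs between operating systems, the sketch below picks a base folder from common per-platform conventions. The helper cache_dir_sketch, its platform and env parameters, and the exact folders chosen for Windows and macOS are assumptions for illustration; Datamol's own lookup may resolve different locations.

```python
import os
import sys
from pathlib import Path


def cache_dir_sketch(app_name, suffix=None, create=False, platform=sys.platform, env=None):
    """Hypothetical helper that picks a per-user cache folder by platform convention."""
    env = os.environ if env is None else env

    if platform.startswith("win"):
        # Assumption: per-user caches live under %LOCALAPPDATA% on Windows.
        base = Path(env.get("LOCALAPPDATA") or Path.home() / "AppData" / "Local")
    elif platform == "darwin":
        # Assumption: macOS keeps per-user caches in ~/Library/Caches.
        base = Path.home() / "Library" / "Caches"
    else:
        # XDG convention on Linux and other Unix-like systems.
        base = Path(env.get("XDG_CACHE_HOME") or Path.home() / ".cache")

    cache_dir = base / app_name
    if suffix is not None:
        cache_dir = cache_dir / suffix
    if create:
        cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


print(cache_dir_sketch("datamol-demo", suffix="subdir1", create=False))
```

On a Linux machine without XDG_CACHE_HOME set, this resolves to a folder under ~/.cache, which matches the shape of the output shown above.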