This is a repo created for the 2024 AC BO Hackathon
BO for Drug Discovery: What is the role of molecular representation?
Project leads
Contributors
See our video presentation. Comment, suggest, and vote for project 8 in the judging session!

mol2vec performed best in our Raw Feature benchmark with GP-based surrogates.
mordred's raw features could not be incorporated without preprocessing.
Graph representations and graph kernels were found to be highly resource-demanding in GP-BO and were therefore not investigated.
RF-based surrogates improved performance with the mordred and graph2vec featurizations, but they demand significantly more resources to train (because of the hyperparameter tuning step in each iteration) and exhibit high variability.
rdkit and mordred won the benchmark. When preserving 90% of the variance, rdkit and mordred had 46 and 50 features, respectively.
mol2vec, we would expect similar observation with graph
Physicochemical featurization with PCA is recommended overall for BO, considering its performance and its preservation of chemical information compared with other representations.
The Tanimoto kernel was used for the bit-string connectivity fingerprint. The Rational Quadratic kernel was used for the physicochemical featurizations. The choice of GP kernel greatly affects the surrogate's accuracy and thus needs further investigation, but that is beyond the scope (and resources) of the present project.
To replicate the work, the following dependencies are necessary:
To set up the environment, follow the steps:
conda create --name bofeat python=3.9
conda activate bofeat
pip install botorch deepchem numpy scikit-learn scikit-optimize torch gpytorch
pip install git+https://github.com/samoturk/mol2vec
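After the install commands above, one way to confirm that the packages resolve from the active interpreter is a short Python check. This is only a sketch: the import names in PACKAGES are assumed from the pip commands (scikit-learn imports as sklearn, scikit-optimize as skopt), and find_missing is a hypothetical helper, not part of this repo.

```python
import importlib.util
import sys

# Import names assumed from the pip commands above (not taken from the repo).
PACKAGES = ["botorch", "deepchem", "numpy", "sklearn",
            "skopt", "torch", "gpytorch", "mol2vec"]


def find_missing(names):
    """Return the import names that cannot be located by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The interpreter path should point inside the bofeat environment.
print("interpreter:", sys.executable)

missing = find_missing(PACKAGES)
print("missing:", ", ".join(missing) if missing else "none")
```

If the printed interpreter path is outside the bofeat environment, the packages were probably installed with a pip from a different environment.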
P.S. Installing the packages with the conda package manager is not recommended, as it caused incompatibility issues.
P.P.S. Make sure pip comes from the newly created bofeat environment. On a Unix-based OS, you can run which pip to check.
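As a note on the PCA step mentioned in the findings (keeping 90% of the variance), the selection of components can be sketched with plain NumPy. This is an illustrative sketch under stated assumptions, not the code used in this project: pca_keep_variance is a hypothetical helper, the data is synthetic, and no feature scaling is applied. scikit-learn's PCA with n_components=0.9 offers the same kind of selection.

```python
import numpy as np


def pca_keep_variance(X, threshold=0.90):
    """Project X onto the fewest principal components whose cumulative
    explained variance ratio reaches `threshold`."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each feature column
    # SVD of the centered data; rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = S**2 / np.sum(S**2)
    cumulative = np.cumsum(var_ratio)
    # First index where the cumulative ratio reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    k = min(k, len(S))
    return Xc @ Vt[:k].T, var_ratio[:k]


# Synthetic stand-in for a descriptor matrix: 200 "molecules", 40 features
# generated from 5 latent factors plus a little noise (assumed data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 40)) + 0.01 * rng.normal(size=(200, 40))

Z, ratios = pca_keep_variance(X, threshold=0.90)
print("reduced shape:", Z.shape)
print("variance kept:", ratios.sum())
```

The reduced matrix Z would then take the place of the raw descriptors as input to the surrogate model.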