BRICS
BRICS is a chemical fragment space. Starting from drug-like molecules and synthetically motivated fragmentation rules, BRICS describes a set of fragments and a rule set that specifies how these fragments can be combined into new molecules. Due to the rigorous selection of molecules and fragmentation rules, newly combined molecules from the BRICS space are mostly synthetically accessible and drug-like. The BRICS space can be searched directly with FTreesFS and FlexNovo.
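The idea of a fragment space, a set of fragments plus rules for which open links may be joined, can be sketched as follows. This is a toy illustration only: the link types `L1` to `L3`, the fragment names, and the compatibility table are hypothetical and are not the published BRICS rule set.

```python
from itertools import combinations_with_replacement

# Hypothetical compatibility rules between link types (NOT the real BRICS rules).
# Each entry is an unordered pair; frozenset(("L3",)) means L3 pairs with itself.
COMPATIBLE = {frozenset(("L1", "L2")), frozenset(("L3",))}

# Hypothetical fragments, each listed with the link types of its open attachment points.
FRAGMENTS = {
    "acyl": ["L1"],
    "amine": ["L2"],
    "aryl": ["L3", "L2"],
}


def can_join(link_a, link_b):
    """Return True if the rule set allows a bond between the two link types."""
    return frozenset((link_a, link_b)) in COMPATIBLE


def joinable_pairs(fragments):
    """List fragment pairs that share at least one compatible pair of links."""
    pairs = []
    for a, b in combinations_with_replacement(sorted(fragments), 2):
        links = [
            (la, lb)
            for la in fragments[a]
            for lb in fragments[b]
            if can_join(la, lb)
        ]
        if links:
            pairs.append((a, b, links))
    return pairs


if __name__ == "__main__":
    for a, b, links in joinable_pairs(FRAGMENTS):
        print(a, "+", b, "via", links)
```

A real fragment space additionally stores the fragment structures themselves and is searched without enumerating all products; the sketch only shows how a rule set constrains which fragments may be combined.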
Publications
- Degen, J., Wegscheid-Gerlach, C., Zaliani, A., Rarey, M. (2008) On the Art of Compiling and Using `Drug-Like` Chemical Fragment Spaces. ChemMedChem, 3:1503-1507.
Deep-Sea Protein Structure Dataset
The prediction of molecular protein adaptations is a key challenge in protein engineering. In particular, proteins of extremophiles often exhibit desirable properties, such as tolerance to extremely high temperature and/or pressure. A promising resource for such proteins is the deep sea, which is the largest extreme environment on Earth. In recent years, large-scale metagenomic projects have provided a growing amount of protein data from these environments. Not surprisingly, there is great interest in systematically analyzing the data currently available.
We compiled a data set of 1281 experimental protein structures from 25 deep-sea organisms from the Protein Data Bank (PDB) and paired them with orthologous proteins. This data set is one of the first to provide protein structure pairs for building data-driven methods and analyzing structural protein adaptations to the extreme environmental conditions in the deep sea. We thoroughly removed redundancy and processed the data set into cross-validation folds for easy use in machine learning. We also annotated the protein pairs with the environmental preferences of the deep-sea and decoy source organisms. In this way, thermophiles, mesophiles and piezophiles can be compared directly. The final data set includes 501 deep-sea protein chains and 8200 decoy protein chains that come from 20 different deep-sea and 1379 decoy organisms and form 17 148 pairs. For further details and a machine learning-based analysis of the data set, see [1].
Publications
- Sieg, J.; Sandmeier, C.C.; Meents, A.; Lemmen, C.; Streit, W.R.; Rarey, M. (2022) Analyzing structural features of proteins from deep-sea organisms. Proteins: Structure, Function, and Bioinformatics, 90(8):1521-1537.
Fragment Growing Validation Dataset
The self-growing set contains a few thousand test cases of ligands from protein-ligand complexes, each cut into a core and a fragment. The aim of the dataset is to measure a fragment growing tool's performance by evaluating whether it can recreate the original ligand given the protein, the core and the fragment, and whether the pose it generates is acceptable when compared to the crystal structure. This is similar to self-docking validation but for fragment growing workflows.
The cross-growing set contains a few hundred test cases in which ligands from one PDB structure are grown into a different PDB structure of the same binding site. The aim of the dataset is to measure a fragment growing tool's performance by checking whether this non-native ligand can be grown and whether the conformation of this ligand is comparable to the conformation of the aligned binding site. This is similar to cross-docking validation.
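Comparing a generated pose to the crystal structure is commonly done with a root-mean-square deviation (RMSD) over matched atoms. The following is a minimal sketch of such an evaluation, assuming both poses are given in the same coordinate frame with identical atom order; the function names and the 2.0 Å acceptance threshold are assumptions for illustration and are not taken from the dataset's publication.

```python
import math


def pose_rmsd(predicted, reference):
    """RMSD between two poses given as lists of (x, y, z) tuples in the same atom order."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(predicted, reference)
    )
    return math.sqrt(squared / len(reference))


def success_rate(cases, threshold=2.0):
    """Fraction of (predicted, reference) pose pairs with RMSD at or below the threshold.

    The default threshold of 2.0 Angstrom is an assumed, illustrative cutoff.
    """
    if not cases:
        raise ValueError("no test cases given")
    hits = sum(1 for pred, ref in cases if pose_rmsd(pred, ref) <= threshold)
    return hits / len(cases)


if __name__ == "__main__":
    crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    grown = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]
    print(pose_rmsd(grown, crystal))
```

The sketch does not account for molecular symmetry (equivalent atom orderings), which a real evaluation would need to handle.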
More information can be found in the associated publication and here:
Fragment Growing Validation Dataset
HELLS Dataset
The Hamburg Enumerated Lead-Like Set (HELLS) is a collection of 503,974,653 lead-like molecules generated from approved drug molecules. It was generated with FSees and BRICS fragmentation rules using the "Approved Drugs" set from DrugBank. The initial fragment space contained 1214 fragments from 1009 molecules. Of these, 183 fragments were selected as starting points because they contain at least two linkers and a ring of size five or more.
Publications
- Lauck, F.; Rarey, M. (2016) FSees: Customized Enumeration of Chemical Subspaces with Limited Main Memory Consumption. Journal of Chemical Information and Modeling, 56(9):1641-1653.
iRAISE Dataset
Structure-based computational target prediction methods identify potential protein targets for a bioactive compound. Methods based on protein−ligand docking still face many challenges, the greatest of which is probably the ranking of true targets in a large data set of protein structures. Currently, no standard data sets for evaluation exist, which makes it cumbersome to compare methods and demonstrate improvements. Therefore, we composed two data sets and evaluation strategies for a meaningful evaluation of new target prediction methods: a small data set consisting of three target classes for detailed proof-of-concept and selectivity studies, and a large data set based on the sc-PDB consisting of 7992 protein structures and 72 drug-like ligands from DrugBank, which allows statistical evaluation with performance metrics on a drug-like chemical space.
More information and download:
iRAISE Dataset
Publications
- Schomburg, K.T.; Rarey, M. (2014) Benchmark Data Sets for Structure-Based Computational Target Prediction. Journal of Chemical Information and Modeling, 54(8):2261-2274.
KnowledgeSpace
The KnowledgeSpace is a publicly available combinatorial fragment space containing over 10^15 molecules. The space is generated by applying reactions known from the literature to reagents of the eMolecules collection.
For more information about the KnowledgeSpace click here. To download the KnowledgeSpace in its topological fragment space representation click here.
mRAISE Dataset
One of the key features of 3D ligand-based virtual screening methods is the calculation and identification of biologically relevant molecular alignments. The mRAISE dataset contains 180 prealigned ligands for 11 diverse targets generated by identifying and aligning identical binding sites using SIENA. The dataset is designed for the validation and comparison of ligand-based virtual screening methods and has been used during the validation of mRAISE.
More information and download:
mRAISE Dataset
ReactionViewer Datasets
Computer-readable generic reaction schemes are a fundamental technique in the in silico drug design process. Due to their complexity and the richness of features represented in a single line, they can be challenging to work with even for experienced users. To generate a visualization of generic reaction patterns written as Reaction SMILES, Reaction SMARTS or in the SMIRKS language, we developed a novel method called ReactionViewer. The ReactionViewer is integrated into our web frontend https://smarts.plus.
In the following we provide the visualization of two datasets of reaction schemes for organic synthesis from recent publications. The first dataset comes from an open-source retrosynthetic planning software called AiZynthFinder (https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00472-1).
The tool is available with a set of reaction schemes used for the neural network policy training, among others. The reaction dataset is based on the publicly available US patent office data. The 46696 reaction schemes are written in a retrosynthetic manner. The second data set is extracted from Hartenfeller et al. (https://pubs.acs.org/doi/10.1021/ci200379p), who provided a set of robust organic reaction schemes available for in silico molecule design. The 58 reaction schemes are written in a forward synthetic manner.
We provide a complete visualization of all given reaction schemes for both datasets.
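Reaction SMILES-style strings encode a reaction in a single line as `reactants>agents>products`, with `.` separating the individual components on each side. As a rough illustration of this line notation and of the difference between forward and retrosynthetic writing direction, the following sketch splits such a string and swaps its two sides. The function names are hypothetical, and the sketch ignores details such as component-level grouping with parentheses in Reaction SMARTS, so it is not a substitute for a cheminformatics toolkit.

```python
def split_reaction_smiles(reaction):
    """Split 'reactants>agents>products' into three lists of component strings."""
    parts = reaction.strip().split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    # An empty side (for example, no agents) yields an empty list.
    return tuple(part.split(".") if part else [] for part in parts)


def reverse_direction(reaction):
    """Swap the left and right side, e.g. to rewrite a forward scheme retrosynthetically."""
    reactants, agents, products = split_reaction_smiles(reaction)
    return ">".join(".".join(side) for side in (products, agents, reactants))


if __name__ == "__main__":
    esterification = "CC(=O)O.OCC>>CC(=O)OCC"
    print(split_reaction_smiles(esterification))
    print(reverse_direction(esterification))
```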
Publications
- Schomburg, K., Ehrlich, H.-C., Stierand, K., Rarey, M. (2010) From Structure Diagrams to Visual Chemical Patterns. Journal of Chemical Information and Modeling, 50(9):1529-1535.
SIENA Dataset
Protein binding site ensembles are essential for a comprehensive analysis of protein flexibility. In structure-based design endeavors, they help to account for conformational degrees of freedom on the protein side. To automate the process of ensemble creation, we developed SIENA, a five-phase pipeline going from the whole PDB down to a structure ensemble for a protein of interest, including structure selection and superposition. SIENA is available as part of our ProteinsPlus server. Furthermore, we preprocessed the PDB and created a collection of over 180 protein structure ensembles that are ready to use.
More information and download:
https://www.zbh.uni-hamburg.de/siena
Publications
- Bietz, S.; Rarey, M. (2016) SIENA: Efficient Compilation of Selective Protein Binding Site Ensembles. Journal of Chemical Information and Modeling, 56(1):248-59.
SMARTS Dataset
The SMARTS Dataset is a collection of chemical patterns and matching molecules collected for the evaluation and benchmarking of substructure matching algorithms. The dataset contains several hundred SMARTS patterns collected from various sources. The compound sets have controlled hit rates enabling the analysis of run time depending on molecule size, substructure size, and some pattern features. Details can be found in the following paper:
SMARTS Dataset
Publications
- Ehrlich, H.-C., Rarey, M. (2012) Systematic benchmark of substructure search in molecular graphs - From Ullmann to VF2. Journal of Cheminformatics, 4(13).
SpaceGrow Dataset
Navigating reaction-driven combinatorial libraries, so-called chemical spaces, gives access to synthetically accessible compounds far beyond the reach of enumerable databases. The SpaceGrow dataset (cite 1) was compiled for 3D shape-based virtual screening applications to compare approaches for searching in chemical spaces with conventional approaches for searching in enumerable databases. The dataset comprises 160 ligands picked from a list of known drugs (cite 2) which were found in PDBbind structures (cite 3). For 56 ligands selected as references, ligands binding in the same active site were superimposed with respect to their native binding mode (cite 4) to form homologous ligand pairs. Both ligands of each pair were fragmented into a chemical space, the validation space, by cutting all acyclic bonds. When all molecules are enumerated, the validation library contains 34 134 molecules. The validation space and library are included in the dataset and can be used to benchmark tools regarding the rank and RMSD with which the binding pose of the reference ligand is reproduced. Furthermore, when searching with the reference ligand, the rank and RMSD of the homologous ligand pose can be evaluated.
Publications
- Bietz, S.; Rarey, M. (2016) SIENA: Efficient Compilation of Selective Protein Binding Site Ensembles. Journal of Chemical Information and Modeling, 56(1):248-59.
StructureProfiler
Three-dimensional protein structures play a vital role in drug design. Thorough examination of the available data before usage in experimentation and method validation is highly recommended. StructureProfiler assists in automatically profiling structures ranging from model characteristics like low R factor over active site features such as bond lengths in expected ranges to ligand properties such as electron density coverage and frequently seen torsion angles.
Publications
- Meyder, A.; Kampen, S.; Sieg, J.; Fährrolfes, R.; Friedrich, N.-O.; Flachsenberg, F.; Rarey, M. (2018) StructureProfiler: An all-in-one Tool for 3D Protein Structure Profiling. Bioinformatics, 35(5):874-876.
Torsion Library
Crystal structure databases offer ample opportunities to derive small molecule conformation preferences. We developed a comprehensive and extendable expert system enabling quick assessment of the probability of a given conformation to occur [1]. It is based on a hierarchical system of torsion patterns that cover a large part of druglike chemical space. Each torsion pattern has associated frequency histograms generated from CSD and PDB data and, derived from the histograms, traffic-light rules for frequently observed, rare, and highly unlikely torsion ranges. Please see the publications for details on the library and how it should be used.
The full histograms derived from CSD data could not be published for the 2013 and 2016 versions of the Torsion Library due to legal restrictions. This has changed with the 2022 version of the Torsion Library, which includes both CSD and PDB histograms. The full torsion pattern library, including the derived preferred angles and tolerances, is made available as supporting information of the corresponding publications [1][2][3]. We also plan to make further updates of the library and derived data available here.
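How preferred angles and tolerances can translate into a traffic-light assessment is sketched below. This is a minimal illustration under stated assumptions: each preferred angle is assumed to carry an inner tolerance (frequently observed, "green") and a wider outer tolerance (rare, "yellow"), with everything else counted as highly unlikely ("red"). The function names and the example peak values are hypothetical and are not taken from the library.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (range 0 to 180)."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def classify_torsion(angle, peaks):
    """Classify a torsion angle against (preferred_angle, inner_tol, outer_tol) peaks.

    Returns "green" if the angle lies within the inner tolerance of any peak,
    "yellow" if it only lies within an outer tolerance, and "red" otherwise.
    """
    label = "red"
    for preferred, inner_tol, outer_tol in peaks:
        deviation = angular_difference(angle, preferred)
        if deviation <= inner_tol:
            return "green"
        if deviation <= outer_tol:
            label = "yellow"
    return label


if __name__ == "__main__":
    # Hypothetical pattern with a single preferred angle at 180 degrees.
    example_peaks = [(180.0, 20.0, 30.0)]
    for torsion in (175.0, -155.0, 90.0):
        print(torsion, classify_torsion(torsion, example_peaks))
```

Because torsion angles are periodic, the deviation is computed on the circle, so that 180° and -180° are treated as the same angle.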
Publications
- Penner, P.; Guba, W.; Schmidt, R.; Meyder, A.; Stahl, M.; Rarey, M. (2022) The Torsion Library: Semiautomated Improvement of Torsion Rules with SMARTScompare. Journal of Chemical Information and Modeling, 62(7):1644–1653.
- Guba, W.; Meyder, A.; Rarey, M.; Hert, J. (2016) Torsion Library Reloaded: A New Version of Expert-Derived SMARTS rules for Assessing Conformations of Small Molecules. Journal of Chemical Information and Modeling, 56(1):1-5.
- Schärfer, C., Schulz-Gasch, T., Ehrlich, H.C., Guba, W., Rarey, M., Stahl, M. (2013) Torsion Angle Preferences in Drug-like Chemical Space: A Comprehensive Guide. Journal of Medicinal Chemistry, 56(6):2016-28.
Water Dataset
Water molecules play important roles in many biological processes, especially when mediating protein-ligand interactions. Despite many attempts in recent years, accurately predicting water molecules both structurally and energetically remains a grand challenge. One reason is certainly the lack of experimental data, since energetic contributions of water molecules can only be measured indirectly. However, on the structural side, the electron density clearly shows the positions of stable water molecules. This information has the potential to improve models of water structure and energy in proteins and protein interfaces. We have compiled a high-resolution subset of the Protein Data Bank containing 2.3 million water molecules. Furthermore, we have divided these water molecules into those that are well resolved and those with little supporting electron density. To perform this classification, we used the new measurement of electron density around an individual atom (EDIA), which enables the automatic quantification of experimental support. On the basis of this measurement, we have characterized the water molecules with a detailed profile of geometric and structural features. This data, which is freely available, can be applied not only to the modeling and validation of new water models in structural biology but also to molecular design.
Publications
- Nittinger, E.; Schneider, N.; Lange, G.; Rarey, M. (2015) Evidence of Water Molecules—A Statistical Evaluation of Water Molecules Based on Electron Density. Journal of Chemical Information and Modeling, 55(4):771-783.