BRICS Datasets

The BRICS ruleset allows the modeling of chemical spaces based on prominent chemical motifs of known inhibitors and commercially available compounds. BRICS consists of a representation of chemical environments for generating and recombining fragments based on a comprehensive and easily adaptable set of rules.
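To illustrate the idea of rule-based recombination, the following minimal sketch pairs fragments whose open attachment points carry compatible environment types. The link types, the compatibility table, and the fragment names are hypothetical placeholders and not the actual BRICS environments or rules; those are defined in the publication below.

```python
from itertools import combinations

# Hypothetical compatibility table: each entry is a set of link types
# that may be bonded to each other. A one-element set means the type
# is compatible with itself. These are NOT the real BRICS rules.
COMPATIBLE = {frozenset({"A", "B"}), frozenset({"C"})}

def can_join(type1, type2):
    """Return True if two link types may form a bond under the toy rules."""
    return frozenset({type1, type2}) in COMPATIBLE

def joinable_pairs(fragments):
    """List pairs of distinct fragments sharing at least one compatible link pair.

    fragments maps a fragment name to the list of link types at its
    open attachment points.
    """
    pairs = []
    for (name1, links1), (name2, links2) in combinations(sorted(fragments.items()), 2):
        if any(can_join(a, b) for a in links1 for b in links2):
            pairs.append((name1, name2))
    return pairs

# Made-up fragments with made-up link types.
fragments = {
    "frag1": ["A"],
    "frag2": ["B"],
    "frag3": ["C"],
    "frag4": ["A", "C"],
}
pairs = joinable_pairs(fragments)
```

Adapting such a ruleset amounts to editing the compatibility table, which is one way to read the "easily adaptable" property described above.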
More information and download:
BRICS Datasets
Publications
- Degen, J.; Wegscheid-Gerlach, C.; Zaliani, A.; Rarey, M. (2008) On the Art of Compiling and Using 'Drug-Like' Chemical Fragment Spaces. ChemMedChem, 3(10):1503-1507.
Deep-Sea Protein Structure Dataset

We compiled a dataset of 1281 experimental protein structures from 25 deep-sea organisms available in the Protein Data Bank and paired them with orthologous proteins. This dataset is one of the first to provide protein structure pairs for building data-driven methods and analyzing structural protein adaptations to extreme environmental conditions in the deep sea. We thoroughly removed redundancy and split the dataset into cross-validation folds for easy use in machine learning. We also annotated the protein pairs with the environmental preferences of the deep-sea and decoy source organisms.
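As a rough illustration of how pre-assigned cross-validation folds can be consumed, the sketch below turns a mapping from protein pair to fold index into train/test splits. The mapping layout and the pair identifiers are assumptions made for this example; the actual file format is documented on the dataset page.

```python
from collections import defaultdict

def iter_cv_splits(fold_of_pair):
    """Yield (held_out_fold, train_ids, test_ids) once per fold.

    fold_of_pair maps a protein-pair identifier to its fold index.
    Every pair is used for testing exactly once.
    """
    by_fold = defaultdict(list)
    for pair_id, fold in fold_of_pair.items():
        by_fold[fold].append(pair_id)

    for held_out in sorted(by_fold):
        test_ids = sorted(by_fold[held_out])
        train_ids = sorted(
            pair_id
            for fold, ids in by_fold.items()
            if fold != held_out
            for pair_id in ids
        )
        yield held_out, train_ids, test_ids

# Hypothetical fold assignment; the identifiers are made up.
example = {"pair_a": 0, "pair_b": 0, "pair_c": 1, "pair_d": 2}
splits = list(iter_cv_splits(example))
```

Because the folds are fixed in advance, results from different methods remain comparable, and redundant structures that were grouped into the same fold cannot leak between training and test data.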
More information and download:
Deep-Sea Protein Structure Dataset
Publications
- Sieg, J.; Sandmeier, C.C.; Meents, A.; Lemmen, C.; Streit, W.R.; Rarey, M. (2022) Analyzing structural features of proteins from deep-sea organisms. Proteins: Structure, Function, and Bioinformatics, 90(8):1521-1537.
Fragment Growing Validation Dataset

Here, we provide datasets to validate computational methods for fragment growing. The basis for these datasets is the PDBbind refined set, as its structural considerations render it a high-quality dataset for structure-based validation. Below are summaries of the two separate collections. The associated publication contains comprehensive descriptions of the dataset generation.
More information and download:
Fragment Growing Validation Dataset
Publications
- Penner, P.; Martiny, V.; Gohier, A.; Gastreich, M.; Ducrot, P.; Brown, D.; Rarey, M. (2020) Shape-Based Descriptors for Efficient Structure-Based Fragment Growing. Journal of Chemical Information and Modeling, 60(12):6269-6281.
HELLS Dataset

The Hamburg Enumerated Lead-Like Set (HELLS) is a collection of 503,974,653 lead-like molecules generated by recombination of fragments from approved drug molecules. It was generated using the "Approved Drugs" set from DrugBank, the enumeration tool FSees, and BRICS fragmentation rules.
More information and download:
HELLS Dataset
Publications
- Lauck, F.; Rarey, M. (2016) FSees: Customized Enumeration of Chemical Subspaces with Limited Main Memory Consumption. Journal of Chemical Information and Modeling, 56(9):1641-1653.
iRAISE Datasets

We composed two datasets and evaluation strategies for a meaningful evaluation of new target prediction methods, i.e., a small dataset consisting of three target classes for detailed proof-of-concept and selectivity studies, and a large dataset based on the sc-PDB consisting of 7992 protein structures and 72 drug-like ligands from DrugBank, allowing statistical evaluation with performance metrics on a drug-like chemical space.
More information and download:
iRAISE Datasets
Publications
- Schomburg, K.T.; Rarey, M. (2014) Benchmark Data Sets for Structure-Based Computational Target Prediction. Journal of Chemical Information and Modeling, 54(8):2261-2274.
KnowledgeSpace

The KnowledgeSpace is a publicly available combinatorial fragment space containing over 10^15 molecules. It was generated by applying 117 reactions known from the literature to reagents of the eMolecules collection.
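The size of such a space follows from simple counting: each reaction contributes the product of its reagent-pool sizes, and the contributions are summed over all reactions. The sketch below shows this arithmetic on made-up pool sizes. It yields an upper bound, since it assumes every reagent combination gives a product and ignores molecules reachable by more than one route.

```python
from math import prod

def space_size(reactions):
    """Upper bound on the number of products in a combinatorial space.

    reactions is a list of tuples; each tuple holds the reagent-pool
    sizes for the components of one reaction.
    """
    return sum(prod(pools) for pools in reactions)

def stored_reagents(reactions):
    """Number of reagents that actually have to be stored."""
    return sum(sum(pools) for pools in reactions)

# Hypothetical pool sizes, not taken from the KnowledgeSpace.
toy_space = [
    (1000, 2000),     # two-component reaction
    (500, 300, 40),   # three-component reaction
]
total = space_size(toy_space)         # 2,000,000 + 6,000,000
stored = stored_reagents(toy_space)   # 3,000 + 840
```

Even in this toy case, 3,840 stored reagents encode 8,000,000 products, which is why fragment spaces can describe far more molecules than could ever be enumerated explicitly.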
More information and download:
KnowledgeSpace
Publications
- Bellmann, L.; Klein, R.; Rarey, M. (2022) Calculating and Optimizing Physicochemical Property Distributions of Large Combinatorial Fragment Spaces. Journal of Chemical Information and Modeling, 62(11):2800-2810.
LOBSTER Dataset

LOBSTER ("Ligand Overlays from Binding SiTe Ensemble Representatives") is a dataset of ligand overlays designed to evaluate small molecule superposition tools.
More information and download:
LOBSTER Dataset
Publications
- Hönig, S.M.N.; Gutermuth, T.; Ehrt, C.; Lemmen, C.; Rarey, M. (2024) Combining crystallographic and binding affinity data towards a novel dataset of small molecule overlays. Journal of Computer-Aided Molecular Design, 39(1):2.
mRAISE Dataset

One of the key requirements of 3D ligand-based virtual screening methods is the calculation and identification of biologically relevant molecular alignments. The mRAISE dataset contains 180 (166 unique) prealigned ligands for 11 diverse targets. Biologically meaningful alignments were generated by identifying and superposing identical binding sites using the comparison method SIENA. The dataset is designed for the validation and comparison of ligand-based virtual screening methods and has been used for validating mRAISE.
More information and download:
mRAISE Dataset
Publications
- von Behren, M.; Bietz, S.; Nittinger, E.; Rarey, M. (2016) mRAISE: an alternative algorithmic approach to ligand-based virtual screening. Journal of Computer-Aided Molecular Design, 30(8):583-594.
NAOMInova Datasets

Here, users can download precompiled NAOMInova databases. The database of PDBbind structures was dedicated to analyzing carbonyl interaction patterns. The second database comprises 408 carbonic anhydrase structures. This database was used to analyze the interactions of carbonic anhydrase with its ligands and generate ideas for extending known ligands.
More information and download:
NAOMInova Datasets
Publications
- Inhester, T.; Nittinger, E.; Sommer, K.; Schmidt, P.; Bietz, S.; Rarey, M. (2017) NAOMInova: Interactive Geometric Analysis of Noncovalent Interactions in Macromolecular Structures. Journal of Chemical Information and Modeling, 57(9):2132-2142.
- Nittinger, E.; Inhester, T.; Bietz, S.; Meyder, A.; Schomburg, K.T.; Lange, G.; Klein, R.; Rarey, M. (2017) Large-Scale Analysis of Hydrogen Bond Interaction Patterns in Protein-Ligand Interfaces. Journal of Medicinal Chemistry, 60(10):4245-4257.
PELIKAN Datasets

Here, precompiled PELIKAN databases with protein-ligand complexes of different datasets can be downloaded. Please note that these are large binary files in a proprietary database format for use with the PELIKAN software.
More information and download:
PELIKAN Datasets
Publications
- Inhester, T.; Bietz, S.; Hilbig, M.; Schmidt, R.; Rarey, M. (2017) Index-based Searching of Interaction Patterns in Large Collections of Protein-Ligand Interfaces. Journal of Chemical Information and Modeling, 57(2):148-158.
ReactionViewer Datasets

Computer-readable generic reaction schemes are a fundamental technique in the in silico drug design process. Due to their complexity and the richness of features represented in a single line, they can be challenging to work with even for experienced users. To generate a visualization of generic reaction patterns written as Reaction SMILES, Reaction SMARTS or in the SMIRKS language, we developed a novel method, called ReactionViewer. The ReactionViewer is based on the SMARTScompareViewer technology and integrated into our web frontend https://smarts.plus.
We provide the visualization of two datasets of reaction schemes for organic synthesis from recent publications.
More information and download:
ReactionViewer Datasets
Publications
- Dolfus, U.; Briem, H.; Rarey, M. (2022) Visualizing Generic Reaction Patterns. Journal of Chemical Information and Modeling, 62(19):4680-4689.
- Schomburg, K.; Ehrlich, H.-C.; Stierand, K.; Rarey, M. (2010) From Structure Diagrams to Visual Chemical Patterns. Journal of Chemical Information and Modeling, 50(9):1529-1535.
SAVI-Space

SAVI-Space is a combinatorial encoding of the billion-compound Synthetically Accessible Virtual Inventory (SAVI) library based on 53 reaction rules from the SAVI knowledge base. The dataset enables efficient virtual screening through a memory-efficient, reaction-driven data structure that encodes transformation rules as reaction SMARTS.
SAVI-Space
Publications
- Korn, M.; Judson, P.; Klein, R.; Lemmen, C.; Nicklaus, M.C.; Rarey, M. (2025) SAVI Space - combinatorial encoding of the billion-size synthetically accessible virtual inventory. Scientific Data, 12(1):1064.
SIENA Dataset

Protein binding site ensembles are essential for a comprehensive analysis of protein flexibility. In structure-based design endeavors, they assist in considering conformational degrees of freedom in protein structures. SIENA is an automated five-phase pipeline for creating a conformational ensemble for a protein binding site of interest from the Protein Data Bank (PDB). The process enables on-the-fly structure selection and superposition.
Using SIENA, we created the non-intersecting binding site ensemble dataset (NBSE). The set comprises 182 ensembles with more than 9000 aligned PDB structures. Moreover, it includes alignments, ligand files in SDF format, and reduced ensembles.
More information and download:
SIENA Dataset
Publications
- Bietz, S.; Rarey, M. (2016) SIENA: Efficient Compilation of Selective Protein Binding Site Ensembles. Journal of Chemical Information and Modeling, 56(1):248-259.
SMARTS Dataset

SMILES ARbitrary Target Specification (SMARTS) is a language for formulating chemical patterns, such as substructures in molecules. To evaluate algorithms that search for chemical patterns in molecules, we present a collection of SMARTS expressions extracted from various literature sources and a collection of SMARTS-molecule pairs created from the ZINC database. Additionally, a test case with a highly symmetric SMARTS-SMILES pair and a subset of the ZINC Lead-Like database is provided.
More information and download:
SMARTS Dataset
Publications
- Ehrlich, H.-C.; Rarey, M. (2012) Systematic Benchmark of Substructure Search in Molecular Graphs - From Ullmann to VF2. Journal of Cheminformatics, 4:13.
SpaceGrow Dataset

Navigating reaction-driven combinatorial libraries, known as chemical fragment spaces, gives access to synthetically accessible compounds far beyond the reach of enumerable databases. The SpaceGrow dataset was compiled for 3D shape-based virtual screening applications to compare approaches that search chemical spaces with conventional approaches that search enumerable databases.
More information and download:
SpaceGrow Dataset
Publications
- Hönig, S.M.N.; Flachsenberg, F.; Ehrt, C.; Neumann, A.; Schmidt, R.; Lemmen, C.; Rarey, M. (2024) SpaceGrow: efficient shape-based virtual screening of billion-sized combinatorial fragment spaces. Journal of Computer-Aided Molecular Design, 38:13.
StructureProfiler Dataset

The StructureProfiler Dataset is a dataset identified by StructureProfiler for all structures in the PDB as of February 2018. It consists of a list with the PDB code, HET Code, Chain ID, and InfileID for each ligand that passes all requested quality tests for the ligand, its active site, and the overall structure in the “Combined” StructureProfiler configuration.
More information and download:
StructureProfiler Dataset
Publications
- Meyder, A.; Kampen, S.; Sieg, J.; Fährrolfes, R.; Friedrich, N.-O.; Flachsenberg, F.; Rarey, M. (2018) StructureProfiler: An All-In-One Tool for 3D Protein Structure Profiling. Bioinformatics, 35(5):874-876.
Torsion Library

Crystal structure databases offer ample opportunities to derive small molecule conformation preferences. We developed a comprehensive and expandable expert system that enables users to quickly assess the occurrence likelihood of a given molecular conformation. It relies on a hierarchical system of torsion patterns covering a large part of drug-like chemical space. Each torsion pattern has associated frequency histograms generated from data stored in the Cambridge Structural Database and the Protein Data Bank. From the histograms, we derived traffic-light rules for frequently observed, rare, and highly unlikely torsion angle ranges. Users can find details on the library and its usage in the corresponding publications.
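The traffic-light idea can be sketched in a few lines: compute the torsion angle defined by four atom positions, then label it by its angular distance to the nearest preferred angle. The peak positions and both tolerance values below are hypothetical placeholders chosen for illustration; the actual rules and their format are described in the publications.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def torsion_deg(p0, p1, p2, p3):
    """Torsion angle in degrees around the p1-p2 bond, in [-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def angular_distance(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def classify(angle, peaks, inner_tol, outer_tol):
    """Label an angle green (frequent), yellow (rare), or red (highly unlikely)."""
    nearest = min(angular_distance(angle, peak) for peak in peaks)
    if nearest <= inner_tol:
        return "green"
    if nearest <= outer_tol:
        return "yellow"
    return "red"

# Hypothetical rule: preferred angles at 0 and 180 degrees.
PEAKS, INNER, OUTER = [0.0, 180.0], 20.0, 40.0

# Four points in a planar trans arrangement (made-up coordinates).
angle = torsion_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
label = classify(angle, PEAKS, INNER, OUTER)
```

The distance is computed on the circle, so an angle of -170 degrees counts as 10 degrees away from a peak at 180 degrees.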
More information and download:
Torsion Library
Publications
- Penner, P.; Guba, W.; Schmidt, R.; Meyder, A.; Stahl, M.; Rarey, M. (2022) The Torsion Library: Semiautomated Improvement of Torsion Rules with SMARTScompare. Journal of Chemical Information and Modeling, 62(7):1644-1653.
- Guba, W.; Meyder, A.; Rarey, M.; Hert, J. (2016) Torsion Library Reloaded: A New Version of Expert-Derived SMARTS Rules for Assessing Conformations of Small Molecules. Journal of Chemical Information and Modeling, 56(1):1-5.
- Schärfer, C.; Schulz-Gasch, T.; Ehrlich, H.-C.; Guba, W.; Rarey, M.; Stahl, M. (2013) Torsion Angle Preferences in Drug-like Chemical Space: A Comprehensive Guide. Journal of Medicinal Chemistry, 56(6):2016-2028.
Water Dataset

We have compiled a high-resolution subset of the Protein Data Bank, containing 2.3 million water molecules. Furthermore, we have discriminated between well-resolved water molecules and those lacking supporting electron density. To perform this classification, we measured the electron density around individual atoms (EDIAscorer), enabling the automatic quantification of experimental support. Finally, we have characterized the water molecules with a detailed profile of geometric and structural features.
More information and download:
Water Dataset
Publications
- Nittinger, E.; Schneider, N.; Lange, G.; Rarey, M. (2015) Evidence of Water Molecules - A Statistical Evaluation of Water Molecules Based on Electron Density. Journal of Chemical Information and Modeling, 55(4):771-783.