Objective

Structuring chemical datasets on a modern, secure, and scalable tech stack, grounded in information retrieval principles, turns raw data into actionable knowledge. This foundation supports experimentation and scientific exploration, integrates workflows, and feeds advanced simulations and machine learning models. Ultimately, it accelerates the discovery and development of new products, providing a strategic advantage in scientific innovation.


Data Pipelining and Governance

WiP

Data Exploration Tech Stack

WiP

Information Retrieval for Chemical Data

WiP

Eligible Datasets


| Dataset Name | Description | Format(s) | Size | Update Frequency | Reference/Link |
|---|---|---|---|---|---|
| NIST TDE | Critically evaluated thermophysical and thermochemical property data for pure compounds, mixtures, and reactions, widely used for chemical engineering and materials science | CSV, XML, JSON, proprietary export formats | ~GBs | Periodic (annually or as new data is available) | NIST TDE |
| OPoly26 | Large-scale open dataset of 26M+ unique polymer structures with computed properties and rich metadata for polymer informatics and machine learning | CSV, JSON, SDF, HDF5, SMILES, SELFIES | ~1.2 TB | Batch releases (every few months) | arXiv:2512.23117 |
| PolyInfo | Polymer properties, structures, synthesis routes | CSV, SDF, JSON | ~GBs | Periodic | https://polymer.nims.go.jp/ |
| Polymer Genome | Polymer property predictions, descriptors | CSV, JSON | ~GBs | Periodic | https://www.polymergenome.org/ |
| Materials Project | Inorganic materials, properties, structures | JSON, CSV | ~TBs | Weekly | https://materialsproject.org/ |
| QM9 | Small organic molecules, quantum properties | XYZ, CSV, JSON | ~GBs | Static | https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv |
| Open Catalyst Project | Catalysts, surface reactions, DFT calculations | HDF5, JSON | ~TBs | Periodic | https://opencatalystproject.org/ |
| PubChem | Chemical structures, properties, bioactivity | SDF, CSV, JSON | ~TBs | Daily | https://pubchem.ncbi.nlm.nih.gov/ |
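
To make the catalog above usable by pipeline code, its rows could be kept as structured records rather than as page text. The following is a minimal sketch under stated assumptions: the `DatasetEntry` class, its field names, and the `needs_scheduled_sync` rule are hypothetical illustrations and not an agreed schema. The values are copied from a subset of the table rows.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog row. Field names are illustrative, not an agreed schema."""
    name: str
    formats: Tuple[str, ...]
    size: str
    update_frequency: str
    reference: str


# A subset of the rows in the table above, copied as listed there.
CATALOG: List[DatasetEntry] = [
    DatasetEntry("PolyInfo", ("CSV", "SDF", "JSON"), "~GBs", "Periodic",
                 "https://polymer.nims.go.jp/"),
    DatasetEntry("Materials Project", ("JSON", "CSV"), "~TBs", "Weekly",
                 "https://materialsproject.org/"),
    DatasetEntry("QM9", ("XYZ", "CSV", "JSON"), "~GBs", "Static",
                 "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv"),
    DatasetEntry("Open Catalyst Project", ("HDF5", "JSON"), "~TBs", "Periodic",
                 "https://opencatalystproject.org/"),
    DatasetEntry("PubChem", ("SDF", "CSV", "JSON"), "~TBs", "Daily",
                 "https://pubchem.ncbi.nlm.nih.gov/"),
]


def datasets_with_format(catalog: List[DatasetEntry], fmt: str) -> List[str]:
    """Return names of datasets that list `fmt` among their formats (case-insensitive)."""
    wanted = fmt.upper()
    return [d.name for d in catalog if wanted in (f.upper() for f in d.formats)]


def needs_scheduled_sync(entry: DatasetEntry) -> bool:
    """Hypothetical rule: any dataset not marked 'Static' needs a recurring ingest job."""
    return entry.update_frequency.strip().lower() != "static"


if __name__ == "__main__":
    print(datasets_with_format(CATALOG, "sdf"))
    print([d.name for d in CATALOG if needs_scheduled_sync(d)])
```

With records like these, a pipeline could pick the datasets that need a structure-file parser (those offering SDF) or decide which sources need recurring ingestion, without re-reading the table by hand.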


References

In Search of Better Search https://dl.acm.org/doi/10.1145/3760247
