Storage Backends
A StorageBackend is an object used to store data generated as part of property calculations, and to retrieve that
data for use in future calculations.
In general, data stored in a storage backend is split into two parts:
A JSON-serialized representation of the data object (a BaseStoredData instance, or an instance of a subclass), which contains lightweight information such as the state and composition of a system.
A directory-like structure (either an actual directory, or a NetCDF-like compressed archive) of ancillary files which do not lend themselves to being serialized within a JSON object, such as simulation trajectories. The data object refers to these files by file name.
The ancillary directory-like structure is not required if the data may be suitably stored in the data object itself.
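As a rough illustration of this two-part layout, the sketch below writes a lightweight JSON record next to a directory of ancillary files using only the Python standard library. The function name store_two_part, the field names and the on-disk layout are assumptions made for this example, not the format any particular backend uses.

```python
import json
import tempfile
from pathlib import Path


def store_two_part(root, key, metadata, ancillary_files):
    """Write a JSON record plus an optional ancillary directory (hypothetical layout)."""
    root = Path(root)
    record = dict(metadata)

    if ancillary_files:
        # Large files live in a directory named after the key...
        ancillary_dir = root / key
        ancillary_dir.mkdir(parents=True)
        for file_name, contents in ancillary_files.items():
            (ancillary_dir / file_name).write_bytes(contents)
        # ...and the JSON record refers to them by file name only.
        record["ancillary_files"] = sorted(ancillary_files)

    json_path = root / f"{key}.json"
    json_path.write_text(json.dumps(record, indent=2))
    return json_path


with tempfile.TemporaryDirectory() as root:
    json_path = store_two_part(
        root,
        key="example_key",
        metadata={"substance": "O", "temperature_kelvin": 298.15},
        ancillary_files={"trajectory.dcd": b"\x00\x01\x02"},
    )
    stored_record = json.loads(json_path.read_text())
    trajectory_exists = (Path(root) / "example_key" / "trajectory.dcd").is_file()
```

If ancillary_files is empty, only the JSON record is written, which mirrors the case where the data fits in the data object itself.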
Data Storage / Retrieval
Each piece of data stored in a backend must be an object which inherits from the BaseStoredData class, and will be
assigned a unique key. This key is useful for tracking provenance if the data is re-used in future calculations, and
can also be used to retrieve the data from the storage system.
In addition to retrieval using the data keys, each backend offers the ability to perform a 'query' to retrieve data
which matches a set of given criteria. Data queries are implemented via BaseDataQuery objects, which expose
different options for querying for specific types of data (such as simulation data, trained models, etc.).
A query may be used, for example, to match all simulation data that was generated for a given Substance in a
particular phase:
# Look for all simulation data generated for liquid water
substance_query = SimulationDataQuery()
substance_query.substance = Substance.from_components("O")
substance_query.property_phase = PropertyPhase.Liquid
found_data = backend.query(substance_query)
The returned found_data will be a dictionary whose keys are tuples and whose values are lists of tuples. Each key is a
tuple of the values which were matched, for example the matched thermodynamic state or the matched substance. Each
value is a list of tuples, where the first item in each tuple is the unique key of the found data object, the second
item is the data object itself, and the third item is the file path to the ancillary data directory (or None if none
is present).
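To make the shape of the returned dictionary concrete, the sketch below builds a stand-in for found_data by hand and iterates over it. The keys, placeholder data objects and paths are all made up for illustration; a real query would return actual data objects.

```python
# A hand-built stand-in for the dictionary returned by a query. Each key is a
# tuple of the matched values; each value is a list of
# (unique key, data object, ancillary directory path or None) tuples.
found_data = {
    ("O", "Liquid"): [
        ("key_1", {"note": "placeholder data object"}, "stored_data/key_1"),
        ("key_2", {"note": "placeholder data object"}, None),
    ],
}

keys_with_ancillary_data = []

for matched_values, data_tuples in found_data.items():
    for storage_key, data_object, ancillary_path in data_tuples:
        # Only some entries come with an ancillary data directory.
        if ancillary_path is not None:
            keys_with_ancillary_data.append(storage_key)
```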
See the Data Classes and Queries page for more information about the available data classes, queries and their details.
Implementation
A StorageBackend must, at a minimum, implement the following structure:
class MyStorageBackend(StorageBackend):

    def _store_object(self, object_to_store, storage_key=None, ancillary_data_path=None):
        ...

    def _retrieve_object(self, storage_key, expected_type=None):
        ...

    def _object_exists(self, storage_key):
        ...
where
_store_object() must store a BaseStoredData object as well as optionally its ancillary data directory, and return a unique key assigned to that object.
_retrieve_object() must return the BaseStoredData object which has been assigned a given key if the object exists in the system, as well as the file path to the ancillary data directory if it exists.
_object_exists() should return whether any object still exists in the storage system with a given key.
All of these methods are called under a reentrant thread lock, and so may be considered thread safe.
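A minimal in-memory sketch of this structure might look as follows. So that the example runs on its own, StorageBackend and BaseStoredData here are simplified stand-ins defined in the block, not the library's classes. The public wrapper names (store, retrieve, exists), the key scheme and the (None, None) return for a missing or mistyped object are likewise assumptions made for illustration.

```python
import threading
import uuid


class BaseStoredData:
    """Simplified stand-in for the library's data base class."""

    def __init__(self, payload):
        self.payload = payload


class StorageBackend:
    """Simplified stand-in base class: wraps each hook in a reentrant lock."""

    def __init__(self):
        self._lock = threading.RLock()

    def store(self, object_to_store, ancillary_data_path=None):
        with self._lock:
            return self._store_object(
                object_to_store, ancillary_data_path=ancillary_data_path
            )

    def retrieve(self, storage_key, expected_type=None):
        with self._lock:
            # The lock is reentrant, so calling another hook here is safe.
            if not self._object_exists(storage_key):
                return None, None
            return self._retrieve_object(storage_key, expected_type)

    def exists(self, storage_key):
        with self._lock:
            return self._object_exists(storage_key)


class InMemoryStorageBackend(StorageBackend):
    """Keeps everything in a dictionary; nothing is written to disk."""

    def __init__(self):
        super().__init__()
        self._objects = {}

    def _store_object(self, object_to_store, storage_key=None, ancillary_data_path=None):
        # Assign a fresh unique key unless the caller supplied one.
        if storage_key is None:
            storage_key = uuid.uuid4().hex
        self._objects[storage_key] = (object_to_store, ancillary_data_path)
        return storage_key

    def _retrieve_object(self, storage_key, expected_type=None):
        stored_object, ancillary_data_path = self._objects[storage_key]
        if expected_type is not None and not isinstance(stored_object, expected_type):
            return None, None
        return stored_object, ancillary_data_path

    def _object_exists(self, storage_key):
        return storage_key in self._objects


backend = InMemoryStorageBackend()
key = backend.store(BaseStoredData({"substance": "O"}), ancillary_data_path=None)
stored_object, ancillary_path = backend.retrieve(key, expected_type=BaseStoredData)
```

Because the stand-in base class takes the lock before calling each hook, the three private methods can mutate the dictionary without any locking of their own.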