Data Classes and Queries
HashableStoredData - data which is readily hashable and can be quickly queried for in a storage backend. The prime example of such data is ForceFieldData, whose hash can be easily computed from the file representation of a force field.
ReplaceableData - data which should be replaced in a storage backend when new data of the same type, but with a higher information content, is stored in the backend. An example is storing a piece of StoredSimulationData which was generated for the same Substance and at the same ThermodynamicState as an existing piece of data, but which stores many more uncorrelated configurations.
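The difference between the two categories above can be sketched with a toy in-memory store. Everything in this example (`ToyBackend`, `HashableRecord`, `ReplaceableRecord` and the `information_content` field) is hypothetical and only illustrates the behaviour described; it is not the library's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashableRecord:
    """Hypothetical stand-in for hashable data such as a force field."""

    contents: str


@dataclass(frozen=True)
class ReplaceableRecord:
    """Hypothetical stand-in for replaceable data such as simulation output."""

    key: str  # what the data describes, e.g. substance + state
    information_content: float  # e.g. the number of uncorrelated samples


class ToyBackend:
    def __init__(self):
        self._hashable = {}  # record -> record, looked up by hash
        self._replaceable = {}  # key -> record

    def store(self, record):
        """Store a record and return the record now held by the backend."""
        if isinstance(record, HashableRecord):
            # Records with identical contents hash and compare equal, so a
            # duplicate is found with a single dictionary lookup.
            return self._hashable.setdefault(record, record)

        existing = self._replaceable.get(record.key)
        # Only replace when the new record carries strictly more information.
        if existing is None or (
            record.information_content > existing.information_content
        ):
            self._replaceable[record.key] = record
        return self._replaceable[record.key]
```

Storing the same hashable record twice returns the original object, while storing a replaceable record only displaces an existing one when it carries more information.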
Every data class must be paired with a corresponding data query class which inherits from the
BaseDataQuery class. In addition, each data object must implement a
to_storage_query() function which returns the data query
which would uniquely match that data object. The
to_storage_query() function is used heavily by storage backends when checking
if a piece of data already exists within the backend.
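The pairing of a data class with a query class can be sketched as follows. The names `ToyQuery`, `ToyData` and `already_stored` are hypothetical, as is the convention that an attribute left as `None` matches any value; the sketch only shows how a `to_storage_query()` style method lets a backend check for existing data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyQuery:
    """Hypothetical query; an attribute left as None matches any value."""

    substance: Optional[str] = None
    temperature: Optional[float] = None

    def matches(self, data) -> bool:
        return (
            self.substance is None or self.substance == data.substance
        ) and (
            self.temperature is None or self.temperature == data.temperature
        )


@dataclass
class ToyData:
    """Hypothetical data object paired with ToyQuery."""

    substance: str
    temperature: float

    def to_storage_query(self) -> ToyQuery:
        # Fill in every attribute so the query matches only equivalent data.
        return ToyQuery(substance=self.substance, temperature=self.temperature)


def already_stored(stored, new_data) -> bool:
    """Check whether equivalent data is present in a list of stored objects."""
    query = new_data.to_storage_query()
    return any(query.matches(item) for item in stored)
```

A partially filled query such as `ToyQuery(substance="CO")` matches every stored object for that substance, whereas the query returned by `to_storage_query()` matches only data equivalent to the object that produced it.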
Force Field Data
The ForceFieldData class is used to store
ForceFieldSource objects within the storage backend. It is a hashable
storage object which allows for rapidly checking whether any calculations have previously been performed for
a particular force field source.
It has a corresponding
ForceFieldQuery class which can be used to query for particular force field sources within
a storage backend.
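Hash-based lookup of a force field can be sketched as below, assuming the force field is available as a serialized string. The function names are hypothetical and SHA-256 is an arbitrary choice for illustration; the hashing scheme actually used by the storage backend may differ.

```python
import hashlib


def force_field_hash(serialized_force_field: str) -> str:
    """Return a hex digest of the text representation of a force field."""
    return hashlib.sha256(serialized_force_field.encode("utf-8")).hexdigest()


def find_force_field_id(stored_hashes: dict, serialized_force_field: str):
    """Return the id stored for this force field, or None if it is new.

    `stored_hashes` is a hypothetical mapping of hash -> force field id.
    """
    return stored_hashes.get(force_field_hash(serialized_force_field))
```

Because the lookup only compares digests, checking whether a force field has been seen before does not require loading or comparing the stored force field files themselves.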
Cached Simulation Data
The StoredSimulationData class is used to store the data generated by a single molecular simulation. The data object
primarily records the
ThermodynamicState that the simulation was run at, as well as
provenance about the calculation and the force field parameters used (as the key of the force field in the storage
system). Further, the object records the file names of the topology, trajectory and statistics files generated by the
simulation - these files should be stored in an associated ancillary data directory.
Cached simulation data is considered replaceable, whereby data which has the lowest statistical inefficiency is preferred. The philosophy here is that we should store the maximum number of samples (i.e. the maximum number of uncorrelated samples for the property which has the shortest correlation time) which will be useful for future calculations, such that future calculations can simply discard the data which cannot be used (i.e. is likely correlated).
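Discarding correlated data can be sketched as follows, assuming the stored frames are evenly spaced and that a statistical inefficiency (in units of frames) has already been estimated for the property of interest. The function name and the fixed-stride rule are illustrative assumptions rather than the library's own subsampling routine.

```python
import math


def uncorrelated_indices(n_frames: int, statistical_inefficiency: float) -> list:
    """Return indices of frames spaced far enough apart to be treated as
    uncorrelated, using a stride of ceil(statistical_inefficiency)."""
    if statistical_inefficiency < 1.0:
        raise ValueError("the statistical inefficiency cannot be less than 1")

    stride = math.ceil(statistical_inefficiency)
    return list(range(0, n_frames, stride))
```

With this rule a property with a statistical inefficiency of 2.5 frames keeps every third stored frame, while a property with a longer correlation time applies a larger stride to the same stored data.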
It has a corresponding
SimulationDataQuery class which can be used to query for simulation data which matches a set
of particular criteria within a storage backend, which in part includes querying for data collected:
at a given thermodynamic_state (i.e. temperature and pressure).
for a given property_phase (e.g. gas, liquid, liquid+gas coexisting, …).
using a given set of force field parameters identified by their unique force_field_id assigned by the storage system.
Included is not only the ability to find data generated for a particular
substance (e.g. only data for methanol),
but also the ability to return data for each component of a given substance by setting the
substance_query attribute to a
SubstanceQuery which has the
components_only attribute set to true:
    # Import paths below assume the openff.evaluator package layout.
    from openff.evaluator.datasets import PropertyPhase
    from openff.evaluator.storage import LocalFileStorage
    from openff.evaluator.storage.query import SimulationDataQuery, SubstanceQuery
    from openff.evaluator.substances import Substance

    # Load an existing storage backend
    storage_backend = LocalFileStorage()

    # Define a system of 50% water and 50% methanol.
    full_substance = Substance.from_components("O", "CO")

    # Look for all simulation data generated for the full substance
    data_query = SimulationDataQuery()

    data_query.substance = full_substance
    data_query.property_phase = PropertyPhase.Liquid

    full_substance_data = storage_backend.query(data_query)

    # Now look for all of the pure data which has been stored for both pure
    # water and pure methanol.
    pure_substance_query = SubstanceQuery()
    pure_substance_query.components_only = True

    data_query.substance_query = pure_substance_query

    component_data = storage_backend.query(data_query)
This is particularly useful when retrieving data for use in the calculation of excess properties (such as the enthalpy of mixing), where such calculations require information about both the full mixture and the pure components.
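To show why both sets of data are needed, the sketch below evaluates an excess property as the mixture value minus the mole-fraction-weighted sum of the pure component values, which is the usual definition of a molar enthalpy of mixing. The function name and the numbers in the usage note are made up for illustration and are not taken from the library or from any simulation.

```python
import math


def excess_property(mixture_value, pure_values, mole_fractions):
    """Return the mixture value minus the mole-fraction-weighted sum of the
    pure component values.

    `pure_values` and `mole_fractions` are dictionaries keyed by component.
    """
    if not math.isclose(sum(mole_fractions.values()), 1.0):
        raise ValueError("mole fractions must sum to 1")

    ideal = sum(
        mole_fractions[name] * pure_values[name] for name in mole_fractions
    )
    return mixture_value - ideal
```

For example, `excess_property(-41.0, {"O": -45.0, "CO": -35.0}, {"O": 0.5, "CO": 0.5})` combines one value for the full mixture with one value for each pure component, which mirrors the two queries described above.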