# Data Classes and Queries

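All data which is to be stored within a StorageBackend must inherit from the BaseStoredData class. More broadly, two types of data are typically expected to be stored: hashable data, such as the force field data described below, whose presence in a backend can be checked rapidly, and replaceable data, such as the cached simulation data described below, where a stored object may be superseded by a preferred one.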

Every data class must be paired with a corresponding data query class which inherits from the BaseDataQuery class. In addition, each data object must implement a to_storage_query() function which returns the data query that would uniquely match that data object. The to_storage_query() function is used heavily by storage backends when checking whether a piece of data already exists within the backend.
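The pairing of a data class with a query class can be sketched with stand-in classes. This is a minimal, self-contained illustration and not the library's implementation: `ToyData`, `ToyQuery`, `ToyBackend` and all of their attributes are hypothetical names, and only the `to_storage_query()` method name is taken from the text above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ToyQuery:
    """Hypothetical stand-in for a class derived from BaseDataQuery."""

    label: Optional[str] = None

    def matches(self, data: "ToyData") -> bool:
        # Assumption: an unset (None) attribute places no constraint.
        return self.label is None or self.label == data.label


@dataclass(frozen=True)
class ToyData:
    """Hypothetical stand-in for a class derived from BaseStoredData."""

    label: str

    def to_storage_query(self) -> ToyQuery:
        # Return the query which would uniquely match this object.
        return ToyQuery(label=self.label)


class ToyBackend:
    """Hypothetical in-memory backend, used only to show the lookup."""

    def __init__(self) -> None:
        self._stored: Dict[str, ToyData] = {}

    def has_object(self, data: ToyData) -> bool:
        # Ask the object for its own query, then look for a match.
        query = data.to_storage_query()
        return any(query.matches(stored) for stored in self._stored.values())

    def store(self, data: ToyData) -> str:
        # Reuse the existing key if an equivalent object is already stored.
        query = data.to_storage_query()
        for key, stored in self._stored.items():
            if query.matches(stored):
                return key
        key = f"key-{len(self._stored)}"
        self._stored[key] = data
        return key
```

In this sketch, storing an equivalent object twice returns the same key, which is the kind of duplicate check that a `to_storage_query()` method makes possible.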

## Force Field Data

The ForceFieldData class is used to store ForceFieldSource objects within the storage backend. It is a hashable storage object, which allows for rapidly checking whether any calculations have previously been performed for a particular force field source.

It has a corresponding ForceFieldQuery class which can be used to query for particular force field sources within a storage backend.

## Cached Simulation Data

The StoredSimulationData class is used to store the data generated by a single molecular simulation. The data object primarily records the Substance, PropertyPhase and ThermodynamicState that the simulation was run at, as well as provenance about the calculation and the force field parameters used (as the key of the force field in the storage system). Further, the object records the file names of the topology, trajectory and statistics files generated by the simulation. These files should be stored in an associated ancillary data directory.

Cached simulation data is considered replaceable, whereby data which has the lowest statistical inefficiency is preferred. The philosophy here is that we should store the maximum number of samples which will be useful for future calculations (i.e. the maximum number of uncorrelated samples for the property which has the shortest correlation time), such that future calculations can simply discard the data which cannot be used (i.e. is likely correlated).
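One way to picture the replacement rule is the sketch below. It assumes that each record carries a sample count and a statistical inefficiency (both attribute names are hypothetical here), and it keeps whichever record offers more uncorrelated samples. The criterion actually applied by a storage backend may differ.

```python
from dataclasses import dataclass


@dataclass
class ToySimulationRecord:
    """Hypothetical stand-in for a cached simulation record."""

    number_of_samples: int
    # Assumed meaning: the number of consecutive samples per effectively
    # independent sample (always >= 1).
    statistical_inefficiency: float

    @property
    def uncorrelated_samples(self) -> float:
        return self.number_of_samples / self.statistical_inefficiency


def preferred_record(existing, candidate):
    """Return the record expected to hold more uncorrelated samples.

    Ties are resolved in favour of the record which is already stored.
    """
    if candidate.uncorrelated_samples > existing.uncorrelated_samples:
        return candidate
    return existing
```

For two records of equal length, the one with the lower statistical inefficiency holds more uncorrelated samples and so is the one retained by this sketch.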

It has a corresponding SimulationDataQuery class which can be used to query for simulation data which matches a set of particular criteria within a storage backend, which in part includes querying for data collected:

• at a given thermodynamic_state (i.e. temperature and pressure).

• for a given property_phase (e.g. gas, liquid, liquid+gas coexisting, …).

• using a given set of force field parameters, identified by the unique force_field_id assigned to them by the storage system.
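How several criteria might combine in a single query can be illustrated with a toy filter. The classes below are hypothetical stand-ins, not the library's `SimulationDataQuery`. The sketch assumes that an attribute left as `None` places no constraint and that all attributes which are set must match.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ToyRecord:
    """Hypothetical stand-in for a stored simulation record."""

    thermodynamic_state: Tuple[float, float]  # (temperature, pressure)
    property_phase: str
    force_field_id: str


@dataclass
class ToySimulationQuery:
    """Hypothetical query: every attribute which is set must match."""

    thermodynamic_state: Optional[Tuple[float, float]] = None
    property_phase: Optional[str] = None
    force_field_id: Optional[str] = None

    def matches(self, record: ToyRecord) -> bool:
        for name in ("thermodynamic_state", "property_phase", "force_field_id"):
            wanted = getattr(self, name)
            # Assumption: None means "do not filter on this attribute".
            if wanted is not None and wanted != getattr(record, name):
                return False
        return True


def run_query(records: List[ToyRecord], query: ToySimulationQuery) -> List[ToyRecord]:
    """Return every record which satisfies all of the set criteria."""
    return [record for record in records if query.matches(record)]
```

Setting only `property_phase` on such a query would return every record in that phase, while also setting `force_field_id` narrows the result to records generated with that parameter set.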

A query can find data generated for a particular substance (e.g. only data for methanol). It can also return data for each component of a given substance by setting the substance_query attribute to a SubstanceQuery which has the components_only attribute set to True:

```python
# Load an existing storage backend
storage_backend = LocalFileStorage()

# Define a system of 50% water and 50% methanol.
full_substance = Substance.from_components("O", "CO")

# Look for all simulation data generated for the full substance
data_query = SimulationDataQuery()

data_query.substance = full_substance
data_query.property_phase = PropertyPhase.Liquid

full_substance_data = storage_backend.query(data_query)

# Now look for all of the pure data which has been stored for both pure
# water and pure methanol.
pure_substance_query = SubstanceQuery()
pure_substance_query.components_only = True

data_query.substance_query = pure_substance_query
component_data = storage_backend.query(data_query)
```


This is particularly useful when retrieving data for use in the calculation of excess properties (such as the enthalpy of mixing), since such calculations require information about both the full mixture and the pure components.
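The need for both sets of data follows from the usual definition of a molar excess property: the value for the mixture minus the mole-fraction-weighted sum of the pure component values. The helper below is a minimal sketch of that arithmetic only. The function name and its arguments are hypothetical and are not part of the storage API, and all values are assumed to be molar quantities in the same units.

```python
from typing import Dict


def excess_property(
    mixture_value: float,
    component_values: Dict[str, float],
    mole_fractions: Dict[str, float],
) -> float:
    """Return mixture_value - sum_i(x_i * component_value_i).

    The dictionaries are keyed by a component identifier, for example a
    SMILES pattern such as "O" or "CO".
    """
    if set(component_values) != set(mole_fractions):
        raise ValueError("Each component needs both a value and a mole fraction.")
    if abs(sum(mole_fractions.values()) - 1.0) > 1e-8:
        raise ValueError("The mole fractions must sum to one.")

    # Mole-fraction-weighted contribution of the pure components.
    ideal_value = sum(
        mole_fractions[component] * component_values[component]
        for component in mole_fractions
    )
    return mixture_value - ideal_value
```

For the enthalpy of mixing, `mixture_value` would come from data for the full substance and `component_values` from the data returned by the components-only query.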