Tutorial 01 - Loading Data Sets

In this tutorial we will explore the framework's utilities for loading and manipulating data sets of physical property measurements. The tutorial will cover:

  • Loading a data set of density measurements from NIST's ThermoML Archive

  • Filtering the data set down using a range of criteria, including temperature, pressure, and composition

  • Supplementing the data set with enthalpy of vaporization (\(\Delta H_{v}\)) data sourced directly from the literature

For the sake of clarity, all warnings will be disabled in this tutorial:

[1]:
import warnings
warnings.filterwarnings('ignore')
import logging
logging.getLogger("openforcefield").setLevel(logging.ERROR)

Extracting Data from ThermoML

For anyone unfamiliar with the ThermoML archive: it is a fantastic database of physical property measurements which have been extracted from data published in the

  • Journal of Chemical and Engineering Data

  • Journal of Chemical Thermodynamics

  • Fluid Phase Equilibria

  • Thermochimica Acta

  • International Journal of Thermophysics

journals. It includes data for a wealth of different physical properties, from simple densities and melting points to activity coefficients and osmotic coefficients, all of which are freely available. As such, it serves as a valuable resource against which to benchmark and optimise molecular force fields.

The Evaluator framework has built-in support for extracting this wealth of data, storing it in easy-to-manipulate Python objects, and automatically re-computing those properties using an array of calculation techniques, such as molecular simulations and, in future, trained surrogate models.

This support is provided by the ThermoMLDataSet object:

[2]:
from propertyestimator.datasets.thermoml import ThermoMLDataSet

The ThermoMLDataSet object offers two main routes for extracting data from the archive:

  • extracting data directly from the NIST ThermoML web server

  • extracting data from a local ThermoML XML archive file

Here we will be extracting data directly from the web server. To pull data from the web server we need to specify the digital object identifiers (DOIs) of the data we wish to extract. These correspond to the DOIs of the publications from which the data was originally sourced.

For this tutorial we will be extracting data using the following DOIs:

[3]:
data_set = ThermoMLDataSet.from_doi(
    "10.1016/j.fluid.2013.10.034",
    "10.1021/je1013476",
)
RDKit WARNING: [16:59:25] Enabling RDKit 2019.09.2 jupyter extensions
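
Because from_doi is given bare DOI strings, a quick format check before making any web requests can catch copy-and-paste slips early. The sketch below is only an illustration and is not part of the framework: the helper name `looks_like_doi` is hypothetical, and the regular expression is an assumed pattern that matches the general shape of most modern DOIs without being a complete validator.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
# This covers most modern DOIs but is not a complete DOI validator.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(candidate: str) -> bool:
    """Return True if the string has the general shape of a DOI (hypothetical helper)."""
    return _DOI_PATTERN.match(candidate) is not None


dois = ["10.1016/j.fluid.2013.10.034", "10.1021/je1013476"]

# Collect any strings which do not look like bare DOIs, e.g. full URLs.
malformed = [doi for doi in dois if not looks_like_doi(doi)]
print(malformed)
```

A full URL such as one beginning with `https://doi.org/` would be flagged by this check, which may or may not be what the framework itself expects.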

We can inspect the data set to see how many properties were loaded:

[4]:
len(data_set)
[4]:
275

and how many different substances those properties were measured for:

[5]:
len(data_set.substances)
[5]:
254

We can also easily check which types of properties were loaded in:

[6]:
print(data_set.property_types)
{'Density', 'EnthalpyOfMixing'}

Filtering the Data Set

The data set object we just created provides many methods which allow us to filter the data down, retaining only those measurements which are of interest to us.

The first thing we will do is filter out all of the measurements which aren’t density measurements:

[7]:
data_set.filter_by_property_types("Density")
print(data_set.property_types)
{'Density'}
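
As the printed output shows, the filter methods modify the data set in place rather than returning a new one. The toy class below is not the framework's code: `ToyDataSet` and its methods are hypothetical stand-ins which mimic this behaviour with plain Python, to show why it can be worth taking a copy before filtering if the unfiltered data is still needed.

```python
import copy


class ToyDataSet:
    """Hypothetical stand-in for a data set that filters itself in place."""

    def __init__(self, property_types):
        # Each entry is just the type name of one stored measurement.
        self._property_types = list(property_types)

    def __len__(self):
        return len(self._property_types)

    @property
    def property_types(self):
        return set(self._property_types)

    def filter_by_property_types(self, *allowed_types):
        # Overwrite the stored list, discarding everything that is not allowed.
        self._property_types = [
            name for name in self._property_types if name in allowed_types
        ]


full_set = ToyDataSet(["Density", "Density", "EnthalpyOfMixing"])

densities = copy.deepcopy(full_set)  # keep the original untouched
densities.filter_by_property_types("Density")

print(len(full_set), full_set.property_types)
print(len(densities), densities.property_types)
```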

Next we will filter out all measurements which were not made close to ambient conditions (around 298 K and 1 atm):

[8]:
from propertyestimator import unit

print(f"There were {len(data_set)} properties before filtering")

data_set.filter_by_temperature(
    min_temperature=298.0 * unit.kelvin, max_temperature=298.2 * unit.kelvin
)

data_set.filter_by_pressure(
    min_pressure=0.999 * unit.atmosphere, max_pressure=1.001 * unit.atmosphere
)

print(f"There are now {len(data_set)} properties after filtering")
There were 213 properties before filtering
There are now 9 properties after filtering
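
The filter bounds are given in atmospheres, whereas the tabulated pressures later in this tutorial appear in kilopascals. A short worked conversion shows why a measurement at 101.325 kPa survives the filter. The sketch uses plain floats and the defined value 1 atm = 101.325 kPa rather than the framework's unit module. The helper name `within` is hypothetical, and treating the bounds as inclusive is an assumption.

```python
KPA_PER_ATM = 101.325  # 1 standard atmosphere is defined as 101.325 kPa


def within(value, lower, upper):
    """Return True if lower <= value <= upper (hypothetical helper, inclusive bounds assumed)."""
    return lower <= value <= upper


# Convert the pressure bounds used in the filter from atm to kPa.
min_pressure_kpa = 0.999 * KPA_PER_ATM
max_pressure_kpa = 1.001 * KPA_PER_ATM

print(f"Pressure window: {min_pressure_kpa:.3f} kPa to {max_pressure_kpa:.3f} kPa")
print(within(101.325, min_pressure_kpa, max_pressure_kpa))  # a measurement at 1 atm
print(within(298.15, 298.0, 298.2))  # a measurement at 298.15 K
```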

Note

Here we have made use of the propertyestimator.unit module to attach units to the temperatures and pressures we are filtering by. This module simply exposes a UnitRegistry from the fantastic pint library. Pint provides full support for attaching units to values and is used extensively throughout this framework.

Finally, we will filter out all measurements which were not made for either ethanol (CCO) or isopropanol (CC(C)O):

[9]:
data_set.filter_by_smiles("CCO", "CC(C)O")
print(f"There are now {len(data_set)} properties after filtering")
There are now 2 properties after filtering
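
To illustrate what a composition filter of this kind does, the sketch below keeps only those substances whose components all appear in an allowed set. This is an assumption about the filter's semantics rather than the framework's actual implementation: the helper `only_contains` is hypothetical, the example substances are invented, and SMILES patterns are compared as plain strings with no canonicalisation.

```python
def only_contains(components, allowed_smiles):
    """Return True if every component SMILES is in the allowed set.

    Assumption: patterns are compared as plain strings, with no canonicalisation.
    """
    return all(smiles in allowed_smiles for smiles in components)


allowed = {"CCO", "CC(C)O"}

# Each tuple lists the components of one hypothetical substance.
substances = [("CCO",), ("CC(C)O",), ("CCO", "O"), ("CO",)]

# The ethanol + water mixture and pure methanol are both discarded.
kept = [components for components in substances if only_contains(components, allowed)]
print(kept)
```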

We will convert the filtered data to a pandas DataFrame to more easily visualize the final data set:

[10]:
pandas_data_set = data_set.to_pandas()
pandas_data_set[
    ["Temperature", "Pressure", "Component 1", "Density Value", "Source"]
].head()
[10]:
   Temperature  Pressure     Component 1  Density Value       Source
0  298.15 K     101.325 kPa  CC(C)O       782.7 kg / m ** 3   10.1016/j.fluid.2013.10.034
1  298.15 K     101.325 kPa  CCO          785.07 kg / m ** 3  10.1021/je1013476

Through filtering, we have cut the data set down from over 250 property measurements to just 2. Many more filters are available. These, along with more information about the data set object, can be found in the API documentation of PhysicalPropertyDataSet (from which the ThermoMLDataSet class inherits).

Adding Extra Data

For the final part of this tutorial, we will be supplementing our newly filtered data set with some enthalpy of vaporization (\(\Delta H_{v}\)) measurements sourced directly from the literature (as opposed to from the ThermoML archive).

We will be sourcing values of the \(\Delta H_{v}\) of ethanol and isopropanol, summarised in the table below, from the publication "Enthalpies of vaporization of some aliphatic alcohols":

Compound     Temperature / \(K\)  \(\Delta H_{v}\) / \(kJ mol^{-1}\)  \(\delta \Delta H_{v}\) / \(kJ mol^{-1}\)
Ethanol      298.15               42.26                               0.02
Isopropanol  298.15               45.34                               0.02

To create the new \(\Delta H_{v}\) measurements, we will first define the state (namely temperature and pressure) that the measurements were recorded at:

[11]:
from propertyestimator.thermodynamics import ThermodynamicState

thermodynamic_state = ThermodynamicState(
    temperature=298.15 * unit.kelvin, pressure=1.0 * unit.atmosphere
)

the substances that the measurements were recorded for:

[12]:
from propertyestimator.substances import Substance

ethanol = Substance.from_components("CCO")
isopropanol = Substance.from_components("CC(C)O")

and the source of these measurements (defined as the DOI of the publication):

[13]:
from propertyestimator.datasets import MeasurementSource

source = MeasurementSource(doi="10.1016/S0021-9614(71)80108-8")

We will combine this information with the measured values to create an object which encodes each of the \(\Delta H_{v}\) measurements:

[14]:
from propertyestimator.datasets import PropertyPhase
from propertyestimator.properties import EnthalpyOfVaporization

ethanol_hvap = EnthalpyOfVaporization(
    thermodynamic_state=thermodynamic_state,
    phase=PropertyPhase.Liquid,
    substance=ethanol,
    value=42.26 * unit.kilojoule / unit.mole,
    uncertainty=0.02 * unit.kilojoule / unit.mole,
    source=source
)
isopropanol_hvap = EnthalpyOfVaporization(
    thermodynamic_state=thermodynamic_state,
    phase=PropertyPhase.Liquid,
    substance=isopropanol,
    value=45.34 * unit.kilojoule / unit.mole,
    uncertainty=0.02 * unit.kilojoule / unit.mole,
    source=source
)
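
As a quick sanity check on the numbers entered above, the relative uncertainty of each measurement and its value in kcal / mol can be worked out by hand. This sketch uses plain floats rather than the framework's unit module and assumes the thermochemical calorie (1 kcal = 4.184 kJ). The dictionary layout is purely illustrative.

```python
KJ_PER_KCAL = 4.184  # thermochemical calorie (assumed convention)

# (value, uncertainty) in kJ / mol, taken from the table above.
measurements = {
    "ethanol": (42.26, 0.02),
    "isopropanol": (45.34, 0.02),
}

for name, (value, uncertainty) in measurements.items():
    relative = uncertainty / value  # dimensionless fraction
    in_kcal = value / KJ_PER_KCAL  # convert kJ / mol to kcal / mol
    print(f"{name}: {in_kcal:.2f} kcal / mol, relative uncertainty {relative:.3%}")
```

Both uncertainties are well below 0.1 % of the measured value, so a typo in either number (for example a misplaced decimal point) would stand out immediately in this kind of check.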

These properties can then be added to our data set:

[15]:
data_set.add_properties(ethanol_hvap, isopropanol_hvap)

If we print the data set again using pandas we should see that our new measurements have been added:

[16]:
pandas_data_set = data_set.to_pandas()
pandas_data_set[
    ["Temperature",
     "Pressure",
     "Component 1",
     "Density Value",
     "EnthalpyOfVaporization Value",
     "Source"
     ]
].head()
[16]:
   Temperature  Pressure     Component 1  Density Value       EnthalpyOfVaporization Value  Source
0  298.15 K     101.325 kPa  CC(C)O       782.7 kg / m ** 3   NaN                           10.1016/j.fluid.2013.10.034
1  298.15 K     101.325 kPa  CC(C)O       NaN                 0.0 kJ / mol                  10.1016/S0021-9614(71)80108-8
2  298.15 K     101.325 kPa  CCO          785.07 kg / m ** 3  NaN                           10.1021/je1013476
3  298.15 K     101.325 kPa  CCO          NaN                 0.0 kJ / mol                  10.1016/S0021-9614(71)80108-8

That concludes the first tutorial. In the next tutorial we will be …

See Also

For more information about data sets in the Evaluator framework, check out the data set and ThermoML documentation.