The datasets used in this data engineering pipeline were taken from the paper [Sierepeklis, O., & Cole, J. M. (2022). A thermoelectric materials database auto-generated from the scientific literature using ChemDataExtractor. Scientific Data, 9(1), 648](https://www.nature.com/articles/s41597-022-01752-1). This is the first automatically generated database of thermoelectric materials and their properties from existing literature. Two datasets are included: "main_tedb.csv" is the main dataset, which contains all properties of the thermoelectric materials; the other contains machine-learning predictions of the property values.
+
+Changes include:
+- Standardized the DOI format
+- Only keeping valid data according to the paper
+- Removed rows where the data was misaligned due to invalid formatting of the CSV
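The DOI standardization listed above can be illustrated with a small Python sketch. This is not the Jayvee code from the pipeline: the function name `standardize_doi`, the prefix list, and the exact regular expression are assumptions chosen for illustration, based only on the `10.xxxx/yyyy` shape of a DOI.

```python
import re
from typing import Optional

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", and a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Hypothetical prefixes that may precede a bare DOI in raw data.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def standardize_doi(raw: str) -> Optional[str]:
    """Return the DOI in bare 10.xxxx/yyyy form, or None if it does not match."""
    candidate = raw.strip()
    lowered = candidate.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            # Drop the resolver or "doi:" prefix and keep the identifier itself.
            candidate = candidate[len(prefix):].strip()
            break
    return candidate if DOI_PATTERN.match(candidate) else None
```

A row whose DOI cell returns `None` (for example because a temperature value ended up in that column) would be a candidate for removal as misaligned.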
src/models/19658787/ThermoelectricMaterialsModel.jv
9 additions & 24 deletions
@@ -1,29 +1,17 @@
-/*
-The datasets used in this data engineering pipeline were taken from the paper Sierepeklis, O., & Cole, J. M. (2022). A thermoelectric materials database auto-generated from the scientific literature using ChemDataExtractor. Scientific Data, 9(1), 648. This is the first
-automatically generated database of thermoelectric materials and their properties from existing literature. The database was evaluated to have a precision of 82.25%. Here we have 2 datasets: one is the main dataset
-that contains all the properties of the thermoelectric materials, "main_tedb.csv", and the other one contains the machine learning predictions of the ZT, thermal conductivity, Seebeck coefficient, electrical conductivity
-& power factor.
-
-The inf_tedb.csv has 18509 rows before this data pipeline & 18336 rows after the execution of the pipeline. The main_tedb.csv database had 19707 rows before & 14617 rows after the execution. The main reason for the difference in the
-row count is that many of the values in the main_tedb dataset were misplaced in the wrong columns. As a result, after standardisation, all those records were filtered out. For example, a temperature value
-appeared in the ZT column and vice versa. First, we changed many datatypes from text to more appropriate ones; all the datatypes were text before. Second, we introduced many constraints according to the
-theory of the paper: we created allow lists for the models, the model types, the temperature and the access types. We also standardized the DOI format in this pipeline: 10.xxxx/yyyy.
-
-We also added transform blocks to remove the opening and closing braces from the Value column, so that it can later be treated as a numerical value rather than as text. More details about all the changes are
-given right before the code that makes each change.
-*/
-
-// The standard format of a DOI in this paper and in general is 10.xxxx/xxxx, and we have kept that as a regex constraint for the DOI columns.
-
 use {
 RemoveOpeningBrace,
 RemoveClosingBrace
 } from "./../../shared/transforms.jv";


 /*
-
According to the paper, Each record contains a chemical entity and one of the seminal thermoelectric properties: thermoelectric fgure of merit, ZT; thermal conductivity, κ; Seebeck coefcient, S;
26
-
electrical conductivity, σ; power factor, PF. Hence, we set an AllowConstraint on model.
8
+
Each record contains a chemical entity and one of the seminal thermoelectric properties:
The built-in framework of ChemDataExtractor was used to define the necessary model types. Thermal conductivity is reported only as total, electronic, or lattice contributions.
-refer to the paper (page: 3)
-*/
+// Thermal conductivity is reported only as total, electronic, or lattice contributions; see page 3.
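The allow-list idea behind this comment can be sketched in Python as follows. This is an illustration rather than the Jayvee constraint used in the model file: the helper name `is_allowed_contribution` is hypothetical, and the exact spelling and casing of the labels in the dataset are assumptions.

```python
# Contribution labels named in the comment; their exact spelling in the CSV is an assumption.
ALLOWED_THERMAL_CONTRIBUTIONS = frozenset({"total", "electronic", "lattice"})


def is_allowed_contribution(label: str) -> bool:
    """Check a thermal-conductivity contribution label against the allow list."""
    # Normalize whitespace and case before the membership test.
    return label.strip().lower() in ALLOWED_THERMAL_CONTRIBUTIONS
```

Records whose label fails this check would be filtered out, which matches the "only keeping valid data" goal of the pipeline.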