Commit 2ac013d

docs: 📝 Docs for 19658787
1 parent 0643db1 commit 2ac013d

2 files changed, with 21 additions and 24 deletions

src/models/19658787/README.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+# A Thermoelectric Materials Database Auto-Generated from the Scientific Literature using ChemDataExtractor
+
+Source: https://figshare.com/articles/dataset/A_Thermoelectric_Materials_Database_Auto-Generated_from_the_Scientific_Literature_using_ChemDataExtractor/19658787
+
+License: MIT, https://opensource.org/license/MIT
+
+The datasets used in this data engineering pipeline were taken from the paper [Sierepeklis, O., & Cole, J. M. (2022). A thermoelectric materials database auto-generated from the scientific literature using ChemDataExtractor. Scientific Data, 9(1), 648](https://www.nature.com/articles/s41597-022-01752-1). This is the first automatically generated database of thermoelectric materials and their properties from existing literature. Two datasets are included: "main_tedb.csv" is the main dataset and contains all properties of the thermoelectric materials; the other contains values predicted by machine learning.
+
+Changes include:
+- Standardized the DOI format
+- Kept only data that is valid according to the paper
+- Removed rows where the data was misaligned due to invalid formatting of the CSV
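
The cleaning steps listed in the README can be illustrated with a short Python sketch. This is not the pipeline itself, which is written in Jayvee; it only shows the idea under stated assumptions. The column names `value` and `doi`, the use of square brackets as the braces around values, and the exact DOI regex are all hypothetical choices made for illustration.

```python
import re

# Assumed regex for the "10.xxxx/yyyy" DOI shape; the pipeline's own regex may differ.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def parse_value(raw: str):
    """Strip one opening and one closing square bracket, then parse a float.

    Returns None when the remaining text is not numeric, which is how a
    misaligned cell (for example, text that landed in a numeric column)
    would show up. Square brackets are an assumption of this sketch.
    """
    text = raw.strip()
    if text.startswith("["):
        text = text[1:]
    if text.endswith("]"):
        text = text[:-1]
    try:
        return float(text)
    except ValueError:
        return None


def clean_rows(rows):
    """Keep only rows whose value parses and whose DOI matches the pattern."""
    cleaned = []
    for row in rows:
        value = parse_value(row.get("value", ""))
        doi = row.get("doi", "").strip()
        if value is None or not DOI_PATTERN.match(doi):
            continue  # drop misaligned or malformed rows
        cleaned.append({**row, "value": value, "doi": doi})
    return cleaned
```

Rows that fail either check are dropped rather than repaired, which matches the README's description of removing misaligned rows instead of guessing their intended layout.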

src/models/19658787/ThermoelectricMaterialsModel.jv

Lines changed: 9 additions & 24 deletions
@@ -1,29 +1,17 @@
-/*
-The datasets used in this data engineering pipeline were taken from the paper Sierepeklis, O., & Cole, J. M. (2022). A thermoelectric materials database auto-generated from the scientific literature using ChemDataExtractor. Scientific Data, 9(1), 648. This is the first
-automatically generated database of thermoelectric materials and their properties from existing literature. The database was evaluated to have a precision of 82.25%. Here we have 2 datasets, one is the Main dataset
-that contains all the properties of the thermoelectric-materials: "main_tedb.csv" and the other one contains the machine learning predictions of the ZT, Thermal conductivity, Seebeck coefficient, Electrical conductivity
-& Power factor.
-
-The inf_tedb.csv has 18509 columns before this data pipeline & 18336 columns after the execution of the pipeline. The main_tedb.csv database had 19707 before & 14617 rows after the execution. The main difference in the
-count of the rows is because a lot of the values in the main_tedb dataset was misplaced in different columns. As a result, after standardisation, all those records were filtered out. For example, temperature value
-was in ZT column and vice-versa. The changes that we introduced are changing a lot of datatypes, from text to appropriate ones. All the datatypes were text before. Secondly, we introduced a lot of constraints according to the
-theory of the paper, like we created an allow list for the models as well as model types, the temperature and the Access types. And we also standardized the doi format in this pipeline: 10.xxxx/yyyy.
-
-We also added transform blocks to remove the opening and closing braces from the Value column, so that it can be later treated as a numerical value, rather than a text. More details about all the changes have been
-mentioned right before the code that causes the change.
-*/
-
-// The standard format of doi in this paper and in general is 10.xxxx/xxxx, and we have kept that as a regexconstraint for the doi columns.
-
 use {
   RemoveOpeningBrace,
   RemoveClosingBrace
 } from "./../../shared/transforms.jv";


 /*
-According to the paper, Each record contains a chemical entity and one of the seminal thermoelectric properties: thermoelectric figure of merit, ZT; thermal conductivity, κ; Seebeck coefficient, S;
-electrical conductivity, σ; power factor, PF. Hence, we set an AllowConstraint on model.
+Each record contains a chemical entity and one of the seminal thermoelectric properties:
+- Thermoelectric figure of merit
+- ZT
+- thermal conductivity, κ
+- Seebeck coefficient, S
+- electrical conductivity, σ
+- power factor, PF
 */
 constraint AllowedModels oftype AllowlistConstraint {
   allowlist: [
@@ -40,10 +28,7 @@ valuetype Model oftype text {
   ];
 }

-/*
-The built-in framework of ChemDataExtractor was used to define the necessary models types. The thermal conductivity has been mentioned to be only in (total, electronic, and lattice contributions).
-refer to the paper (page: 3)
-*/
+// According to the paper (page 3), thermal conductivity is reported only as total, electronic, or lattice contributions.
 constraint AllowedModelTypes oftype AllowlistConstraint {
   allowlist: [
     "electronic",
@@ -58,7 +43,7 @@ valuetype ModelType oftype text {
 }


-// Access type is about the reference of the row. It can be either paid or open datasource. According to the paper, we have kept it open and payment.
+// Access type of the source article; according to the paper, it must be either "open" or "payment".
 constraint AllowedAccessTypes oftype AllowlistConstraint {
   allowlist: [
     "open",