Commit 1cdf7de

docs: 📝 Docs for 14079863
1 parent 9d12187 commit 1cdf7de

2 files changed

+16
-30
lines changed

src/models/14079863/README.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
# Auto-generated Database of Semiconductor Band Gaps Using ChemDataExtractor
2+
3+
Source: https://figshare.com/articles/dataset/Auto-generated_Database_of_Semiconductor_Band_Gaps_Using_ChemDataExtractor/14079863
4+
5+
License: MIT, https://opensource.org/license/MIT
6+
7+
This dataset is derived from the database of semiconductor band gap records released with [Dong, Q., & Cole, J. M. (2022). Auto-generated database of semiconductor band gaps using chemdataextractor. Scientific Data, 9(1), 193.](https://www.nature.com/articles/s41597-022-01294-6). That work presents an auto-generated database of 100,236 semiconductor band gap records with their associated temperature information, extracted from 128,776 journal articles. The database was produced using ChemDataExtractor version 2.0.
8+
9+
10+
The source database contains 100,236 records, but only a few remain after the pipeline runs. The main cause is invalid formatting in the source CSV: many rows contain a line break in the "Text" column, which shifts values into the wrong columns. Future work in Jayvee should add functionality to remove those line breaks.
11+
12+
Changes include:
13+
- Standardized the DOI format
14+
- Filtered out invalid data
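
The line-break problem described in the README could in principle be worked around by cleaning the CSV before it reaches the pipeline. The following is a minimal Python sketch, not part of this commit, and it rests on an assumption the README does not confirm: that the offending "Text" values are wrapped in double quotes, so that a standards-aware CSV parser can still tell where each field ends. The function name `flatten_line_breaks` is hypothetical.

```python
import csv
import io
import re


def flatten_line_breaks(raw_csv: str) -> str:
    """Return CSV text in which no field contains a line break.

    Assumption: fields containing line breaks are double-quoted in the
    source, so csv.reader can parse them as a single field.
    """
    # newline="" keeps line endings untranslated, as the csv module expects.
    reader = csv.reader(io.StringIO(raw_csv, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Replace each run of CR/LF characters inside a field with one space.
        writer.writerow([re.sub(r"[\r\n]+", " ", field) for field in row])
    return out.getvalue()


sample = 'Name,Text\nGaAs,"band gap\nof 1.42 eV"\n'
cleaned = flatten_line_breaks(sample)
```

After this step every record occupies exactly one physical line, so a line-oriented reader would no longer see shifted columns. If the source fields are not quoted, this approach does not apply and the rows would need a different repair.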

src/models/14079863/SemiconductorBandGapsModel.jv

Lines changed: 2 additions & 30 deletions
@@ -1,31 +1,3 @@
1-
/*
2-
This database is created from the paper cited as "Dong, Q., & Cole, J. M. (2022). Auto-generated database of semiconductor band gaps using chemdataextractor. Scientific Data, 9(1), 193.". This work presents an
3-
auto-generated database of 100,236 semiconductor band gap records, extracted from 128,776 journal articles with their associated temperature information. The database was produced using
4-
ChemDataExtractor version 2.0, a ‘chemistry-aware’ software toolkit. Evaluation of the database shows a weighted precision of 84% and a weighted recall of 65%.
5-
6-
The database held 100236 records before the data engineering pipeline and only 29 after it. The large difference in row count arises because, in most rows,
7-
the "Composition" column is null, which cannot be valid for this dataset, so we removed every record with a null "Composition". In addition, in many rows the values of columns such as Temperature_value and Temperature_Unit have shifted into adjacent columns. As a result,
8-
we treated those rows as noise and filtered them out. This is the main reason for the reduction in the number of records.
9-
10-
Our other changes include: removing opening ("[") and closing ("]") brackets from numerical values such as "Value" and "Temperature_Value", removing "(" and ")" from "Raw_Unit", standardizing the DOI Reference column to
11-
a standard format of 10.1007/xxxx, and adding "AllowedListConstraint" to "TemperatureUnitType".
12-
13-
All changes are described in detail immediately before the code responsible for them in the pipeline.
14-
*/
15-
16-
17-
// Removed white spaces from the Composition formula and kept only those rows in the format of {'element_scientific_name': units}
18-
constraint NonEmptyText oftype LengthConstraint {
19-
minLength: 1;
20-
maxLength: 9007199254740991;
21-
}
22-
valuetype Composition oftype text {
23-
constraints: [
24-
NonEmptyText
25-
];
26-
}
27-
28-
291
use {
302
BracesRemover
313
} from "./../../shared/composite-blocktypes.jv";
@@ -94,7 +66,7 @@ pipeline SemiconductorBandGapsPipeline {
9466
header: true;
9567
columns: [
9668
"Name" oftype text,
97-
"Composition" oftype Composition,
69+
"Composition" oftype text,
9870
"Value" oftype text,
9971
"Unit" oftype text,
10072
"Raw_value" oftype text,
@@ -118,4 +90,4 @@ pipeline SemiconductorBandGapsPipeline {
11890
table: "SemiConductorBandGaps";
11991
file: "./SemiConductorBandGaps.sqlite";
12092
}
121-
}
93+
}
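
For readers unfamiliar with the cleaning steps this model performs, the DOI standardization and bracket removal can be illustrated outside Jayvee. The Python sketch below is only an illustration under stated assumptions: the regular expression is a common heuristic for DOIs rather than the pattern the pipeline uses, and `standardize_doi` and `strip_brackets` are hypothetical names that do not correspond to the `BracesRemover` block, whose implementation is not shown in this diff.

```python
import re
from typing import Optional

# Heuristic DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This pattern is an assumption, not taken from the pipeline.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def standardize_doi(reference: str) -> Optional[str]:
    """Extract the bare DOI (e.g. 10.1007/xxxx) from a reference string."""
    match = DOI_PATTERN.search(reference)
    return match.group(0) if match else None


def strip_brackets(value: str) -> str:
    """Remove surrounding whitespace and square brackets from a value."""
    return value.strip().strip("[]")


doi = standardize_doi("https://doi.org/10.1007/s00339-004-2710-4")
number = strip_brackets("[1.42]")
```

Returning `None` for strings without a recognizable DOI lets a caller filter such rows out explicitly, which mirrors the "filtered out invalid data" step only in spirit.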
