3 changes: 3 additions & 0 deletions Climmate_academy/Readme.md
@@ -0,0 +1,3 @@
This document contains the encyclopedia of a publication called **Climate Academy**.

It contains the keyphrases from the publication.
7,282 changes: 7,282 additions & 0 deletions Climmate_academy/climatereport.html


10 changes: 10 additions & 0 deletions Cross Chaptor 7/ccp6_processed.html


63 changes: 63 additions & 0 deletions Cross Chaptor 7/html_id cleaning.py
@@ -0,0 +1,63 @@
import requests
from bs4 import BeautifulSoup
import re

def fetch_html(url):
    """
    Fetches HTML content from a given URL.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        return response.text
    else:
        print(f"Error: Unable to fetch page (Status Code: {response.status_code})")
        return None

def clean_html(html_content):
    """
    Removes unnecessary elements and attributes from the parsed HTML.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove <script> and <style> tags
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Remove framework-generated attributes (data-*, aria-*, event handlers),
    # inline styles, classes, and non-essential ids
    for tag in soup.find_all(True):
        attrs_to_remove = [attr for attr in tag.attrs if re.match(r"^(data-|aria-|on)", attr)]
        for attr in attrs_to_remove:
            del tag[attr]
        if 'class' in tag.attrs:
            del tag['class']
        if 'style' in tag.attrs:
            del tag['style']
        if 'id' in tag.attrs and not tag['id'].startswith('item-'):
            del tag['id']

    return str(soup)


def save_html(content, filename="ccp6_processed.html"):
    """
    Saves cleaned HTML to a file.
    """
    with open(filename, "w", encoding="utf-8") as file:
        file.write(content)
    print(f"Processed HTML saved as {filename}")


# IPCC webpage URL (replace with the specific URL you want)
ipcc_url = "https://www.ipcc.ch/report/ar6/wg2/chapter/ccp6/"

# Fetch, clean and save HTML
html_content = fetch_html(ipcc_url)
if html_content:
    cleaned_html = clean_html(html_content)
    save_html(cleaned_html)
25 changes: 25 additions & 0 deletions Daily_report.md
@@ -0,0 +1,25 @@
## Date : 16/04/2025 Wednesday

## Component
Resolving issues with `docanalysis` in the **Google Colab environment**
### **Current task** : Getting `docanalysis` to work in the **Google Colab** environment
### **Current status** : Unable to install `docanalysis` in Colab


## Date : 04/06/2025 Wednesday

- **Task**: Tested `pygetpapers` for downloading research articles using a query.
- **Initial Command**:
```bash
python -m pygetpapers.pygetpapers --query '"wildlife" AND "biodiversity"' --pdf --limit 5 --output downloaded_file --api openalex --output Wildlife
```

- **Output** :
```bash
pygetpapers.py: error: unrecognized arguments: AND biodiversity'
```
- **Reason** : The query string was not correctly quoted for the shell, so the logical operator `AND` and the quoted search terms were split into separate arguments.

- **Solution** :
```bash
python -m pygetpapers.pygetpapers --query "\"wildlife\" AND \"biodiversity\"" --pdf --limit 5 --output Wildlife --api openalex
```
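The quoting failure above can be reproduced without running `pygetpapers` at all, using Python's standard-library `shlex` module, which mimics POSIX shell word splitting (an illustrative sketch; the literal strings below stand in for what the shell receives in each case):

```python
import shlex

# If the quotes around the terms are lost before argument parsing,
# the query is split into four separate tokens:
broken = shlex.split('--query "wildlife" AND "biodiversity"')
print(broken)   # ['--query', 'wildlife', 'AND', 'biodiversity']

# With escaped inner quotes, the whole query survives as one argument:
fixed = shlex.split('--query "\\"wildlife\\" AND \\"biodiversity\\""')
print(fixed)    # ['--query', '"wildlife" AND "biodiversity"']
```

The first result matches the error message: `AND` and `biodiversity` arrive as unrecognized extra arguments instead of staying inside the single `--query` value.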
25 changes: 25 additions & 0 deletions Dictionary creation/README.MD
@@ -0,0 +1,25 @@
# Dictionary Management

## Overview

This document outlines the process of managing the dictionary created from wordlists extracted from IPCC chapters. It includes quality checks, rectification, and updates to ensure consistency, accuracy, and usability.

## Tasks Involved

### 1. Managing the Dictionary
- Organizing and maintaining the dictionary derived from IPCC chapter wordlists.
- Ensuring proper structuring and accessibility for further processing and analysis.

### 2. Quality Check & Rectification
- Identifying and *removing repetitive words* to avoid redundancy.
- Removing words that *do not have an associated Wikipedia link* to maintain relevance.
- Identifying *ambiguous words* and linking them to *Wiktionary* for better clarity.
- Ensuring the dictionary remains well-structured and meaningful for users.
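The deduplication and Wikipedia-link checks above could be sketched as follows (a minimal illustration, not the project's actual code; the function names `clean_wordlist` and `has_wikipedia_page` are hypothetical, and the Wikipedia check assumes the public MediaWiki REST summary endpoint, which returns 404 for missing articles):

```python
import requests

def clean_wordlist(words):
    """Remove repeated words (case-insensitively), keeping first-seen order."""
    seen = set()
    unique = []
    for word in words:
        key = word.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(word.strip())
    return unique

def has_wikipedia_page(term):
    """Return True if an English Wikipedia article exists for the term."""
    url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + term.replace(" ", "_")
    response = requests.get(url, headers={"User-Agent": "dictionary-qc"})
    return response.status_code == 200
```

Terms for which `has_wikipedia_page` returns `False` would then be candidates for removal or for a Wiktionary link instead.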

### 3. Updating the Dictionary Code
- Making necessary modifications to the code responsible for generating and managing the dictionary.
- Documenting all updates and improvements for future reference.
- Ensuring changes are reflected in the latest version of the dictionary.

## Contribution & Feedback
If you have suggestions for improvements or encounter any issues, feel free to raise an issue or submit a pull request.
10 changes: 9 additions & 1 deletion README.md
@@ -1,3 +1,11 @@
# internship_sC

This repository tracks the interns' work progress and discussions for different tasks.
## Project overview
I am currently working on **IPCC Working Group 2, Cross-Chapter Paper 6: Polar Regions.**
The focus of this project is to develop resources that enhance the understanding of climate-related terminology and concepts.

## Objectives
* Wordlist
* Dictionary
* Table of contents of my IPCC chapter
* Network graph of the IPCC main page and Synthesis Report