PDF Parser - 

## Overall ambition :
We should be able to parse a PDF and extract [this](https://docs.google.com/spreadsheets/d/1q-aWJFUAlFHr_yNOwv5XNPgbg-Ghk7DuBiKdYm05MV4/edit#gid=1851889542) structure from it.
This requires the following key capabilities :
1. Process PDFs in multiple languages: English, Odia and Hindi.
2. Create [chunks](#What_is_a_good_chunk) with headings based on how the PDF is structured. We should recognize which text is a heading and which is content, and then convert both into the structure above.
3. Process images and tables and convert them into chunks that can be passed to an LLM to answer questions about them.

# Where are we on this now : 

## Chunking 

### Free text chunking :
  We can already chunk free (unstructured) text. The code is [here](https://github.com/Samagra-Development/ai-tools/tree/restructure/src/chunking/MPNet).


### Structured pdf chunking 
We have looked at two approaches for chunking structured PDFs :
1. Using Deepdoc detection to extract the text, headings and structure of each page and converting the result into JSON : [here](https://colab.research.google.com/drive/1Ki7GgF5tvGR9RQW28ajk5ncx3CAHFfYY)
2. Using PyMuPDF to get the boundaries of the text in the PDF and then using those to tell headings apart from content : [here](https://colab.research.google.com/drive/1oB_Iat_sVYeJZtFeV6_XKDgci1JHRrbu)
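
To make the second approach concrete, here is a minimal sketch of separating headings from content. It is not the code from the notebook above. It assumes the input has the same layout as PyMuPDF's `page.get_text("dict")` output (blocks, then lines, then spans with `text` and `size`), and it assumes that a heading is any span whose font size is noticeably larger than the most common body size. The function names and the `ratio` threshold are hypothetical.

```python
from collections import Counter


def classify_spans(page_dict, ratio=1.15):
    """Label each text span as 'heading' or 'content' by font size.

    page_dict mirrors PyMuPDF's page.get_text("dict") layout:
    blocks -> lines -> spans, each span carrying "text" and "size".
    With PyMuPDF installed it could come from:
        fitz.open(path)[0].get_text("dict")
    """
    spans = [
        span
        for block in page_dict.get("blocks", [])
        if block.get("type", 0) == 0  # 0 = text block, 1 = image block
        for line in block.get("lines", [])
        for span in line.get("spans", [])
        if span["text"].strip()
    ]
    if not spans:
        return []

    # Assumption: the body font size is the one covering the most characters.
    chars_per_size = Counter()
    for span in spans:
        chars_per_size[round(span["size"], 1)] += len(span["text"])
    body_size = chars_per_size.most_common(1)[0][0]

    return [
        ("heading" if span["size"] >= body_size * ratio else "content",
         span["text"].strip())
        for span in spans
    ]


def group_by_heading(labelled):
    """Turn [(label, text), ...] into [{"heading": ..., "content": ...}, ...]."""
    sections = []
    current = {"heading": None, "content": []}
    for label, text in labelled:
        if label == "heading":
            # A new heading closes the section collected so far.
            if current["heading"] is not None or current["content"]:
                sections.append(current)
            current = {"heading": text, "content": []}
        else:
            current["content"].append(text)
    if current["heading"] is not None or current["content"]:
        sections.append(current)
    return [
        {"heading": s["heading"], "content": " ".join(s["content"])}
        for s in sections
    ]
```

Font size alone will miss headings that are only bold or only numbered, and a heading split across several spans would come out as several headings, so this is a starting point rather than a complete detector.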



## What is a good chunk
- Should be around 100 to 200 words long.
- The text in a chunk should stay on one topic so that the chunk makes semantic sense on its own.
- The topic of a chunk should be distinct from the topics of other chunks.
- Ideally it should cover a small topic in its entirety. It may cover several small topics, but none of them should also be part of another chunk.

For example : 
Bad Chunk : 
```
Here is a list of links : 
Cab booking  :  http:/sdjnsdkgj/  
Hotel form :  http:/sfjgkjnfsgn/  
```
This is a bad chunk because :
1. The chunk is too small.
2. The links cover multiple topics at once. The cab booking link should be part of the chunk that describes how to book a cab. Similarly, the hotel form link should be part of the chunk that describes hotel booking.
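
The size criterion is the easiest one to check automatically. The sketch below is an illustration, not existing project code: it greedily merges consecutive pieces of text into chunks of at most 200 words and reports chunks that fall outside the 100 to 200 word range. The function names are hypothetical, and the check says nothing about whether a chunk stays on one topic.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def pack_sections(sections, max_words=200):
    """Greedily merge consecutive sections into chunks of at most max_words.

    A single section longer than max_words is kept as its own oversized
    chunk, since this sketch never splits inside a section.
    """
    chunks, current = [], []
    for section in sections:
        candidate = current + [section]
        if current and word_count(" ".join(candidate)) > max_words:
            # Adding this section would overflow, so close the current chunk.
            chunks.append(" ".join(current))
            current = [section]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


def size_problems(chunks, min_words=100, max_words=200):
    """Return (index, word_count) for every chunk outside the target range."""
    return [
        (i, word_count(chunk))
        for i, chunk in enumerate(chunks)
        if not min_words <= word_count(chunk) <= max_words
    ]
```

Running `size_problems` on the bad chunk above would flag it as too short, which is a hint that its links should be merged into the chunks that describe cab and hotel booking.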

