ENH: Add Native Support for Reading CoNLL Files in Pandas #63092

### Feature Type

- [x] Adding new functionality to pandas

- [ ] Changing existing functionality in pandas

- [ ] Removing existing functionality in pandas


### Problem Description

I wish I could use pandas to read `.conll` files directly. CoNLL is a standard family of formats in NLP for tasks like named entity recognition, part-of-speech tagging, and dependency parsing. Currently, I need to write custom parsers or use workarounds that are error-prone and not reusable across projects.

### Feature Description

Add a lightweight `pandas.read_conll` function built on existing pandas infrastructure:

```python
# Implementation would leverage existing read_csv capabilities
def read_conll(filepath_or_buffer, columns=None, group_by_sentence=False,
               comment_char='#', **kwargs):
    """
    Read CoNLL format files into DataFrame.

    Builds on pd.read_csv with CoNLL-specific preprocessing:
    - Tab-separated values with comment skipping
    - Sentence boundary detection via blank lines
    - Optional sentence_id assignment
    - No external dependencies required
    """
```

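One way the preprocessing listed in the docstring could work is sketched below. This is an illustration under stated assumptions, not the proposed pandas implementation: the name `read_conll_sketch` and its `sep` parameter are hypothetical, the lines are parsed in plain Python instead of being delegated to `read_csv`, only whole-line comments are skipped, and fields are split on any whitespace by default so that both tab- and space-separated files load.

```python
import io

import pandas as pd


def read_conll_sketch(filepath_or_buffer, columns=None, group_by_sentence=False,
                      comment_char="#", sep=None):
    """Hypothetical stand-in for the proposed pandas.read_conll."""
    # Accept either an open text buffer or a path on disk.
    if hasattr(filepath_or_buffer, "read"):
        lines = filepath_or_buffer.read().splitlines()
    else:
        with open(filepath_or_buffer, encoding="utf-8") as handle:
            lines = handle.read().splitlines()

    rows, sentence_ids = [], []
    sentence_id = 0
    in_sentence = False
    for line in lines:
        if not line.strip():
            # A blank line closes the current sentence, if one is open,
            # so repeated blank lines do not inflate the counter.
            if in_sentence:
                sentence_id += 1
                in_sentence = False
            continue
        if comment_char and line.startswith(comment_char):
            continue  # assumption: only whole-line comments are skipped
        rows.append(line.split(sep))  # sep=None splits on any whitespace
        sentence_ids.append(sentence_id)
        in_sentence = True

    df = pd.DataFrame(rows, columns=columns)
    if group_by_sentence:
        df["sentence_id"] = sentence_ids
    return df


# Usage with an in-memory excerpt of the sample file from this issue.
sample = (
    "# Document: example.txt\n"
    "EU NNP B-NP B-ORG\n"
    "rejects VBZ B-VP O\n"
    "\n"
    "The DT B-NP O\n"
    ". . O O\n"
)
df = read_conll_sketch(
    io.StringIO(sample),
    columns=["token", "pos", "chunk", "ner"],
    group_by_sentence=True,
)
```

Unlike the signature proposed above, this sketch does not supply default column names; with `columns=None` the resulting columns are labeled with integers.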
### Alternative Solutions

1. Using `read_csv` with manual processing: Requires handling comments, blank lines, and sentence boundaries manually, which is fragile and error-prone.

   ```python
   df = pd.read_csv('data.conll', sep='\t', comment='#', skip_blank_lines=False)
   ```

2. Third-party libraries: Packages like `conllu` require additional dependencies and conversion steps to get DataFrames.
3. Custom parsers: Writing project-specific parsers is time-consuming and not reusable.

### Additional Context

Sample file in standard CoNLL-2003 format:

```
# Document: example.txt
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
. . O O

The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
. . O O
```

Current pandas result (problematic):

Option 1: Loses sentence boundaries

```python
df = pd.read_csv('data.conll', sep='\t', comment='#', names=['token', 'pos', 'chunk', 'ner'])
# Result: All sentences merged, blank lines silently skipped
```

Option 2: Creates messy NaN rows

```python
df = pd.read_csv('data.conll', sep='\t', comment='#', skip_blank_lines=False,
                 names=['token', 'pos', 'chunk', 'ner'])
# Result: Blank lines become NaN rows, requires manual sentence ID processing
```
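The manual sentence ID processing that Option 2 requires might look like the following sketch. It assumes a small inline sample separated by single spaces, as the sample file is displayed in this issue (a tab-separated file would use `sep='\t'` instead). The names `raw`, `is_blank`, and `result` are illustrative.

```python
import io

import pandas as pd

# Inline excerpt of the sample file; fields are separated by single spaces here.
sample = (
    "EU NNP B-NP B-ORG\n"
    "rejects VBZ B-VP O\n"
    "\n"
    "The DT B-NP O\n"
    ". . O O\n"
)

raw = pd.read_csv(
    io.StringIO(sample),
    sep=" ",
    comment="#",
    skip_blank_lines=False,  # keep blank lines so they show up as all-NaN rows
    names=["token", "pos", "chunk", "ner"],
)

# A row that is NaN in every column marks a sentence boundary.
is_blank = raw.isna().all(axis=1)

# Counting the boundaries seen so far yields a sentence id per row.
# Note: two consecutive blank lines would advance the id twice.
raw["sentence_id"] = is_blank.cumsum()

# Drop the boundary rows and renumber the index.
result = raw.loc[~is_blank].reset_index(drop=True)
```

Every project that reads CoNLL data this way has to repeat these steps, which is the duplication a built-in reader would remove.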

Desired result with `read_conll`:

```python
df = pd.read_conll('data.conll', group_by_sentence=True)
```

```
        token  pos chunk     ner  sentence_id
0          EU  NNP  B-NP   B-ORG            0
1     rejects  VBZ  B-VP       O            0
2      German   JJ  B-NP  B-MISC            0
3           .    .     O       O            0
4         The   DT  B-NP       O            1
5    European  NNP  I-NP   B-ORG            1
6  Commission  NNP  I-NP   I-ORG            1
7           .    .     O       O            1
```
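To illustrate why a `sentence_id` column is convenient, the sketch below rebuilds per-sentence token and tag lists from a frame shaped like the desired result. The frame is constructed by hand here, since `pd.read_conll` does not exist yet, and the variable names are illustrative.

```python
import pandas as pd

# Hand-built frame with the same shape as the desired read_conll output
# (pos and chunk columns omitted for brevity).
df = pd.DataFrame(
    {
        "token": ["EU", "rejects", "German", ".",
                  "The", "European", "Commission", "."],
        "ner": ["B-ORG", "O", "B-MISC", "O",
                "O", "B-ORG", "I-ORG", "O"],
        "sentence_id": [0, 0, 0, 0, 1, 1, 1, 1],
    }
)

# Collect the tokens and NER tags of each sentence into Python lists.
grouped = df.groupby("sentence_id")
tokens_by_sentence = grouped["token"].agg(list).to_dict()
tags_by_sentence = grouped["ner"].agg(list).to_dict()
```

With this layout, sentence-level inputs for a tagger can be produced with a single `groupby` instead of a custom loop over blank-line markers.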
