
ENH: Add Native Support for Reading CoNLL Files in Pandas #63092

@nocoding03

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could use pandas to read .conll files directly. CoNLL formats are standard in NLP for tasks like named entity recognition, part-of-speech tagging, and dependency parsing. Currently, I need to write custom parsers or use workarounds that are error-prone and not reusable across projects.

Feature Description

Add a lightweight pandas.read_conll function built on existing pandas infrastructure:
# Implementation would leverage existing read_csv capabilities
def read_conll(filepath_or_buffer, columns=None, group_by_sentence=False,
               comment_char='#', **kwargs):
    """
    Read a CoNLL-format file into a DataFrame.

    Builds on pd.read_csv with CoNLL-specific preprocessing:
    - Tab-separated values with comment skipping
    - Sentence boundary detection via blank lines
    - Optional sentence_id assignment
    - No external dependencies required
    """
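
To make the proposal concrete, here is a minimal sketch of what such a reader might do. It is a hypothetical stand-in, not a pandas API: the name read_conll_sketch, the default column names, and the sep parameter are assumptions made for illustration. It parses lines directly instead of delegating to read_csv (a swapped-in approach, so the result does not depend on how read_csv treats blank and commented lines), and it accepts only a path or an open text buffer.

```python
import io

import pandas as pd

# Tab-separated copy of the sample from this issue.
SAMPLE = (
    "# Document: example.txt\n"
    "EU\tNNP\tB-NP\tB-ORG\n"
    "rejects\tVBZ\tB-VP\tO\n"
    "German\tJJ\tB-NP\tB-MISC\n"
    ".\t.\tO\tO\n"
    "\n"
    "The\tDT\tB-NP\tO\n"
    "European\tNNP\tI-NP\tB-ORG\n"
    "Commission\tNNP\tI-NP\tI-ORG\n"
    ".\t.\tO\tO\n"
)


def read_conll_sketch(filepath_or_buffer, columns=None, group_by_sentence=False,
                      comment_char="#", sep=None):
    """Hypothetical stand-in for the proposed pandas.read_conll.

    sep=None splits on any run of whitespace; pass "\t" for files that
    are strictly tab-separated.
    """
    if columns is None:
        # Assumed CoNLL-2003 layout, matching the sample in this issue.
        columns = ["token", "pos", "chunk", "ner"]

    if hasattr(filepath_or_buffer, "read"):
        lines = filepath_or_buffer.read().splitlines()
    else:
        with open(filepath_or_buffer, encoding="utf-8") as handle:
            lines = handle.read().splitlines()

    rows, sentence_ids = [], []
    sentence_id, sentence_open = 0, False
    for line in lines:
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if sentence_open:
                sentence_id += 1
                sentence_open = False
            continue
        if comment_char and line.lstrip().startswith(comment_char):
            # Whole-line comments only. Caveat: a token line that itself
            # starts with the comment character would also be dropped.
            continue
        rows.append(line.split(sep))
        sentence_ids.append(sentence_id)
        sentence_open = True

    df = pd.DataFrame(rows, columns=columns)
    if group_by_sentence:
        df["sentence_id"] = sentence_ids
    return df


df = read_conll_sketch(io.StringIO(SAMPLE), group_by_sentence=True)
print(df["sentence_id"].tolist())  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Tracking whether a sentence is currently open means that leading comments and runs of consecutive blank lines do not create empty sentences or gaps in the numbering.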

Alternative Solutions

  1. Using read_csv with manual processing: Requires handling comments, blank lines, and sentence boundaries manually, which is fragile and error-prone.
    df = pd.read_csv('data.conll', sep='\t', comment='#', skip_blank_lines=False)
  2. Third-party libraries: Packages like conllu require additional dependencies and conversion steps to get DataFrames.
  3. Custom parsers: Writing project-specific parsers is time-consuming and not reusable.

Additional Context

Sample file:
Standard CoNLL-2003 format:
# Document: example.txt
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
. . O O

The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
. . O O

Current pandas result (problematic):
Option 1: Loses sentence boundaries
df = pd.read_csv('data.conll', sep='\t', comment='#', names=['token', 'pos', 'chunk', 'ner'])
# Result: All sentences merged, blank lines silently skipped
Option 2: Creates messy NaN rows
df = pd.read_csv('data.conll', sep='\t', comment='#', skip_blank_lines=False,
                 names=['token', 'pos', 'chunk', 'ner'])
# Result: Blank lines become NaN rows, requires manual sentence ID processing
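
The manual sentence-ID step that Option 2 leaves to the user can be sketched as follows. It assumes a tab-separated file and the four column names used above; the helper name add_sentence_ids is hypothetical. Blank lines are detected as all-NaN rows, a running count of them gives a raw sentence number, and pd.factorize renumbers the result densely from 0 so that consecutive blank lines do not leave gaps.

```python
import io

import pandas as pd

COLUMNS = ["token", "pos", "chunk", "ner"]

# Tab-separated copy of the sample from this issue.
sample = (
    "# Document: example.txt\n"
    "EU\tNNP\tB-NP\tB-ORG\n"
    "rejects\tVBZ\tB-VP\tO\n"
    "German\tJJ\tB-NP\tB-MISC\n"
    ".\t.\tO\tO\n"
    "\n"
    "The\tDT\tB-NP\tO\n"
    "European\tNNP\tI-NP\tB-ORG\n"
    "Commission\tNNP\tI-NP\tI-ORG\n"
    ".\t.\tO\tO\n"
)


def add_sentence_ids(df):
    """Replace all-NaN separator rows with a dense, 0-based sentence_id column."""
    blank = df.isna().all(axis=1)       # True for rows that came from blank lines
    raw_id = blank.cumsum()             # increases by one at every separator row
    tokens = df.loc[~blank].copy()      # keep only real token rows
    tokens["sentence_id"] = pd.factorize(raw_id[~blank])[0]
    return tokens.reset_index(drop=True)


raw = pd.read_csv(io.StringIO(sample), sep="\t", comment="#",
                  skip_blank_lines=False, names=COLUMNS)
df = add_sentence_ids(raw)
print(df["sentence_id"].tolist())  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Even with this helper, the read_csv defaults remain a hazard for token data: comment='#' discards the remainder of any line from the first '#' onward, and tokens such as NA or null are parsed as missing values unless keep_default_na=False is passed.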

Desired result with read_conll:
df = pd.read_conll('data.conll', group_by_sentence=True)
        token  pos  chunk     ner  sentence_id
0          EU  NNP   B-NP   B-ORG            0
1     rejects  VBZ   B-VP       O            0
2      German   JJ   B-NP  B-MISC            0
3           .    .      O       O            0
4         The   DT   B-NP       O            1
5    European  NNP   I-NP   B-ORG            1
6  Commission  NNP   I-NP   I-ORG            1
7           .    .      O       O            1
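
One reason the sentence_id column matters: once it exists, sentence-level operations become ordinary groupby calls. The sketch below builds the desired frame by hand (read_conll does not exist yet) and regroups tokens and NER tags per sentence.

```python
import pandas as pd

# The desired result from above, constructed manually for illustration.
df = pd.DataFrame({
    "token": ["EU", "rejects", "German", ".",
              "The", "European", "Commission", "."],
    "pos": ["NNP", "VBZ", "JJ", ".", "DT", "NNP", "NNP", "."],
    "chunk": ["B-NP", "B-VP", "B-NP", "O", "B-NP", "I-NP", "I-NP", "O"],
    "ner": ["B-ORG", "O", "B-MISC", "O", "O", "B-ORG", "I-ORG", "O"],
    "sentence_id": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Rebuild each sentence as a single space-joined string.
sentences = df.groupby("sentence_id")["token"].agg(" ".join)
print(sentences.tolist())
# ['EU rejects German .', 'The European Commission .']

# Collect the NER tag sequence for each sentence.
tags = df.groupby("sentence_id")["ner"].agg(list)
print(tags.loc[1])
# ['O', 'B-ORG', 'I-ORG', 'O']
```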

Metadata

    Labels

    Enhancement; IO Format Request (Request for a new format to support.); Needs Triage (Issue that has not been reviewed by a pandas team member)
