Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I wish I could use pandas to directly read .conll files, a standard format in NLP for tasks like named entity recognition, part-of-speech tagging, and dependency parsing. Currently, I need to write custom parsers or rely on error-prone workarounds that are not reusable across projects.
Feature Description
Add a lightweight pandas.read_conll function built on existing pandas infrastructure:
# Implementation would leverage existing read_csv capabilities
def read_conll(filepath_or_buffer, columns=None, group_by_sentence=False,
               comment_char='#', **kwargs):
    """
    Read CoNLL format files into a DataFrame.

    Builds on pd.read_csv with CoNLL-specific preprocessing:
    - Tab-separated values with comment skipping
    - Sentence boundary detection via blank lines
    - Optional sentence_id assignment
    - No external dependencies required
    """
Alternative Solutions
- Using read_csv with manual processing: Requires handling comments, blank lines, and sentence boundaries manually, which is fragile and error-prone.
df = pd.read_csv('data.conll', sep='\t', comment='#', skip_blank_lines=False)
- Third-party libraries: Packages like conllu require additional dependencies and conversion steps to get DataFrames.
- Custom parsers: Writing project-specific parsers is time-consuming and not reusable.
Additional Context
Sample file:
Standard CoNLL-2003 format:
# Document: example.txt
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
. . O O

The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
. . O O
Current pandas result (problematic):
Option 1: Loses sentence boundaries
df = pd.read_csv('data.conll', sep='\t', comment='#', names=['token', 'pos', 'chunk', 'ner'])
# Result: All sentences merged, blank lines silently skipped
Option 2: Creates messy NaN rows
df = pd.read_csv('data.conll', sep='\t', comment='#', skip_blank_lines=False,
                 names=['token', 'pos', 'chunk', 'ner'])
# Result: Blank lines become NaN rows, requires manual sentence ID processing
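For reference, the manual sentence ID processing that Option 2 requires might look like the sketch below. It assumes blank lines arrive as rows in which every field is NaN. `assign_sentence_ids` is a hypothetical helper name, not a pandas function, and the input frame is built by hand so the example does not depend on a file.

```python
import pandas as pd


def assign_sentence_ids(df):
    """Drop all-NaN separator rows and number the sentences between them."""
    is_blank = df.isna().all(axis=1)
    # A sentence starts at a non-blank row whose predecessor is blank
    # (the first row counts as preceded by a blank).
    prev_blank = is_blank.shift(1, fill_value=True).astype(bool)
    starts = ~is_blank & prev_blank
    sentence_id = starts.cumsum() - 1

    out = df.loc[~is_blank].copy()
    out["sentence_id"] = sentence_id[~is_blank]
    return out.reset_index(drop=True)


# Frame shaped like the Option 2 result: row 2 stands in for a blank line.
raw = pd.DataFrame({
    "token": ["EU", "rejects", None, "The"],
    "ner": ["B-ORG", "O", None, "O"],
})
tagged = assign_sentence_ids(raw)
```

Every project that reads CoNLL data this way ends up carrying some variant of this helper, which is the duplication the proposal aims to remove.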
Desired result with read_conll:
df = pd.read_conll('data.conll', group_by_sentence=True)
        token  pos  chunk     ner  sentence_id
0          EU  NNP   B-NP   B-ORG            0
1     rejects  VBZ   B-VP       O            0
2      German   JJ   B-NP  B-MISC            0
3           .    .      O       O            0
4         The   DT   B-NP       O            1
5    European  NNP   I-NP   B-ORG            1
6  Commission  NNP   I-NP   I-ORG            1
7           .    .      O       O            1
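One reason the sentence_id column is useful is that sentences can then be regrouped with ordinary groupby calls. A small sketch, with the frame built by hand to mirror the desired result, since read_conll is only a proposal here:

```python
import pandas as pd

# Hand-built stand-in for the desired read_conll output.
df = pd.DataFrame({
    "token": ["EU", "rejects", "German", ".",
              "The", "European", "Commission", "."],
    "ner": ["B-ORG", "O", "B-MISC", "O",
            "O", "B-ORG", "I-ORG", "O"],
    "sentence_id": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Collect the tokens of each sentence into a list, one entry per sentence.
sentences = df.groupby("sentence_id")["token"].agg(list)
```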