pyconll

A minimal, pure-Python, dependency-free library for parsing CoNLL files.

pyconll provides a basic, low-level wrapper around the CoNLL annotation scheme. Its small API allows for easy automation with little ramp-up time, and it works well as a building block for larger computational linguistics (CL) systems.
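
To make "wrapper around the CoNLL annotation scheme" concrete, here is a minimal, dependency-free sketch of what a single token line looks like once split into fields. It assumes the CoNLL-U variant of the format, with ten tab-separated columns per token. The helper name `parse_token_line` is hypothetical and this is not pyconll's implementation; it only illustrates the kind of structure the library exposes as token attributes.

```python
# Illustrative only: a dependency-free look at one CoNLL-U token line.
# This is not pyconll's code; parse_token_line is a hypothetical helper.
CONLLU_FIELDS = ('id', 'form', 'lemma', 'upos', 'xpos',
                 'feats', 'head', 'deprel', 'deps', 'misc')


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a field dict."""
    values = line.rstrip('\n').split('\t')
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError('expected 10 tab-separated fields, got %d' % len(values))
    # An underscore marks an unspecified value; map it to None.
    # Simplification: a literal underscore form or lemma is not special-cased.
    return {name: (None if value == '_' else value)
            for name, value in zip(CONLLU_FIELDS, values)}


line = '1\tLord\tlord\tNOUN\tNN\tNumber=Sing\t0\troot\t_\t_'
token = parse_token_line(line)
print(token['form'], token['upos'])
```

A real parser also has to handle comment lines, multiword token ranges, and empty nodes, which is the bookkeeping a library like pyconll takes care of.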

Code Sample


                        # Make sure 'Lord of the Rings' has 'Lord' and 'Rings' annotated as regular NOUN.
                        import pyconll
                        import pyconll.util

                        UD_ENGLISH_TRAIN = './ud/train.conll'
                        NGRAM = 'Lord of the Rings'.split()

                        train = pyconll.load_from_file(UD_ENGLISH_TRAIN)

                        # util#find_ngrams provides an iterator of tuples where the first element is the
                        # sentence where the ngram was found, and the second element is the numeric index
                        # of the ngram occurrence in the sentence.
                        for sentence, i in pyconll.util.find_ngrams(train, NGRAM):
                            tokens = sentence[i:i + len(NGRAM)]
                            tokens[0].upos = 'NOUN'
                            tokens[-1].upos = 'NOUN'

                        # After making sure these occurrences are properly handled, go through the
                        # corpus and count the unique lemmas annotated as proper nouns (PROPN).
                        proper_nouns = set()
                        for sentence in train:
                            for token in sentence:
                                if token.upos == 'PROPN':
                                    proper_nouns.add(token.lemma)

                        print(len(proper_nouns))
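
The sample above relies on the ngram search yielding a sentence together with the index where the match starts. As a rough illustration of that contract, the sketch below runs the same idea over plain lists of word forms, with no dependencies. `find_ngrams_in_forms` is a hypothetical helper written for this example, not pyconll's implementation, and it compares forms case-sensitively by assumption.

```python
# Illustrative only: a plain-Python sketch of the ngram search idea.
# find_ngrams_in_forms is a hypothetical helper, not part of pyconll.
def find_ngrams_in_forms(sentences, ngram):
    """Yield (sentence, start_index) for every occurrence of ngram.

    sentences is an iterable of lists of word forms; ngram is a list of forms.
    """
    n = len(ngram)
    if n == 0:
        return
    for sentence in sentences:
        # Slide a window of length n across the sentence.
        for i in range(len(sentence) - n + 1):
            if sentence[i:i + n] == ngram:
                yield sentence, i


corpus = [
    'I read Lord of the Rings twice'.split(),
    'The lord of the manor left'.split(),
]
matches = list(find_ngrams_in_forms(corpus, 'Lord of the Rings'.split()))
for sentence, i in matches:
    print(i, sentence[i:i + 4])
```

Yielding the start index alongside the sentence is what lets the caller slice out the matched tokens and edit only the first and last one, as the sample does.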
                    

Install


                        pip install pyconll
                    

Want to find out more?