Python Stemming and Lemmatization
Author: --    Published: 2019-11-20

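In natural language processing, we often encounter two or more words that share a common root. For example, the three words agreed, agreeing, and agreeable share the same root. A search involving any of these words should treat them as the same word, namely the root word. It is therefore important to link every word to its root. The nltk library provides methods that perform this linking and output the root word.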

The following program uses the Porter stemming algorithm to perform stemming.

import nltk
from nltk.stem.porter import PorterStemmer

# word_tokenize needs the NLTK tokenizer data, e.g. nltk.download('punkt')
porter_stemmer = PorterStemmer()

word_data = "it originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# First, split the text into word tokens
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the stem of each word
for w in nltk_tokens:
    print("actual: %s  stem: %s" % (w, porter_stemmer.stem(w)))

Running the example code above produces the following result:

actual: it  stem: it
actual: originated  stem: origin
actual: from  stem: from
actual: the  stem: the
actual: idea  stem: idea
actual: that  stem: that
actual: there  stem: there
actual: are  stem: are
actual: readers  stem: reader
actual: who  stem: who
actual: prefer  stem: prefer
actual: learning  stem: learn
actual: new  stem: new
actual: skills  stem: skill
actual: from  stem: from
actual: the  stem: the
actual: comforts  stem: comfort
actual: of  stem: of
actual: their  stem: their
actual: drawing  stem: draw
actual: rooms  stem: room
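
To make the search motivation concrete, here is a minimal sketch of a stem-based lookup. It assumes a tiny made-up corpus; the names `documents`, `build_stem_index`, and `search` are hypothetical helpers for illustration and are not part of nltk. Tokens are split on whitespace so that no tokenizer data is needed.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus for illustration: document id -> text.
documents = {
    "doc1": "readers prefer learning new skills",
    "doc2": "the drawing rooms",
}


def build_stem_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Simple whitespace split; a real pipeline might use word_tokenize.
        for token in text.lower().split():
            index[stemmer.stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that share a stem with the query."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


index = build_stem_index(documents)
print(search(index, "learn"))  # ['doc1'], because "learning" stems to "learn"
print(search(index, "room"))   # ['doc2'], because "rooms" stems to "room"
```

Because both the indexed text and the query pass through the same stemmer, a query for learn matches a document that only contains learning, which is the behavior the introduction asks for.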

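Lemmatization is similar to stemming, but it brings context to the words. It goes a step further and links each word to its dictionary form (the lemma), which is always a valid word, whereas a stem may be a truncated form that is not a word. For example, if a paragraph contains the words cars and car, it links them both to car. In the program below, the WordNet lexical database is used for lemmatization.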

import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet data, e.g. nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "it originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
for w in nltk_tokens:
    # lemmatize() treats each word as a noun unless a pos argument is given
    print("actual: %s  lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))

Running the code above produces the following result:

actual: it  lemma: it
actual: originated  lemma: originated
actual: from  lemma: from
actual: the  lemma: the
actual: idea  lemma: idea
actual: that  lemma: that
actual: there  lemma: there
actual: are  lemma: are
actual: readers  lemma: reader
actual: who  lemma: who
actual: prefer  lemma: prefer
actual: learning  lemma: learning
actual: new  lemma: new
actual: skills  lemma: skill
actual: from  lemma: from
actual: the  lemma: the
actual: comforts  lemma: comfort
actual: of  lemma: of
actual: their  lemma: their
actual: drawing  lemma: drawing
actual: rooms  lemma: room
