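In natural language processing, we often meet two or more words that share a common root. For example, agreed, agreeing and agreeable all derive from the same root word. A search involving any of these words should treat them as forms of that one root word. Linking each word to its root is therefore an important step. The NLTK library provides methods that perform this linking and output the root form of each word.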
The following program uses the Porter stemming algorithm to perform stemming.
import nltk
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
word_data = "it originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# First, split the sentence into word tokens (requires the NLTK tokenizer data to be installed)
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the stem of each token
for w in nltk_tokens:
    print("actual: %s stem: %s" % (w, porter_stemmer.stem(w)))
Running the example above produces the following result:
actual: it stem: it
actual: originated stem: origin
actual: from stem: from
actual: the stem: the
actual: idea stem: idea
actual: that stem: that
actual: there stem: there
actual: are stem: are
actual: readers stem: reader
actual: who stem: who
actual: prefer stem: prefer
actual: learning stem: learn
actual: new stem: new
actual: skills stem: skill
actual: from stem: from
actual: the stem: the
actual: comforts stem: comfort
actual: of stem: of
actual: their stem: their
actual: drawing stem: draw
actual: rooms stem: room
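To tie the stemmer back to the search use case described at the start, here is a minimal sketch that groups words by their Porter stem, so that any member of a group can stand in for the others. The helper name group_by_stem and the small vocabulary are assumptions made for illustration; only PorterStemmer and its stem() method come from NLTK.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer


def group_by_stem(words):
    """Map each Porter stem to the list of original words that reduce to it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


# Hypothetical mini-vocabulary; str.split() is used so no tokenizer data is needed.
vocabulary = "learning learned learns skills skill rooms room".split()
groups = group_by_stem(vocabulary)
for stem, members in groups.items():
    print("%s: %s" % (stem, ", ".join(members)))
```

A search index built this way would store the stem as the key, then stem the query term with the same stemmer before looking it up.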
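Lemmatization is similar to stemming, but it takes vocabulary into account: instead of stripping suffixes, it looks each word up and returns its dictionary form (the lemma), so the result is always a real word. For example, it maps rooms to room and comforts to comfort. The following program uses the WordNet lexical database for lemmatization. Note that lemmatize() treats every word as a noun unless told otherwise, which is why verb forms such as learning and drawing come back unchanged in the output.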
import nltk
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
word_data = "it originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# Split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)
# Look up the lemma of each token (treated as a noun by default)
for w in nltk_tokens:
    print("actual: %s lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))
Running the code above produces the following result:
actual: it lemma: it
actual: originated lemma: originated
actual: from lemma: from
actual: the lemma: the
actual: idea lemma: idea
actual: that lemma: that
actual: there lemma: there
actual: are lemma: are
actual: readers lemma: reader
actual: who lemma: who
actual: prefer lemma: prefer
actual: learning lemma: learning
actual: new lemma: new
actual: skills lemma: skill
actual: from lemma: from
actual: the lemma: the
actual: comforts lemma: comfort
actual: of lemma: of
actual: their lemma: their
actual: drawing lemma: drawing
actual: rooms lemma: room