Python Data Science Tutorial: Stemming and Lemmatization (NLTK Library, Porter Stemming Algorithm)

2018-09-30 09:17:56, Backend Development

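In natural language processing, we often encounter two or more words that share a common root. For example, agreed, agreeing, and agreeable all come from the same root word, agree. A search involving any of these words should treat them as the same root word, so it becomes important to link every word to its root. The NLTK library provides methods that perform this linking and output the root form.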

The following program uses the Porter Stemming algorithm to perform stemming.

import nltk
from nltk.stem.porter import PorterStemmer

# word_tokenize needs NLTK's Punkt tokenizer data, downloaded once with
# nltk.download('punkt') ('punkt_tab' on recent NLTK releases).
porter_stemmer = PorterStemmer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# First, split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the stem of each token
for w in nltk_tokens:
    print("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))

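Running the example code above produces the following output (recent NLTK releases lowercase each token before stemming by default, so there the first stem prints as it rather than It):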

Actual: It  Stem: It
Actual: originated  Stem: origin
Actual: from  Stem: from
Actual: the  Stem: the
Actual: idea  Stem: idea
Actual: that  Stem: that
Actual: there  Stem: there
Actual: are  Stem: are
Actual: readers  Stem: reader
Actual: who  Stem: who
Actual: prefer  Stem: prefer
Actual: learning  Stem: learn
Actual: new  Stem: new
Actual: skills  Stem: skill
Actual: from  Stem: from
Actual: the  Stem: the
Actual: comforts  Stem: comfort
Actual: of  Stem: of
Actual: their  Stem: their
Actual: drawing  Stem: draw
Actual: rooms  Stem: room
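
To connect the stemming output back to the search use case described at the start, here is a minimal sketch of grouping words by their Porter stem so that a query matches every word sharing its stem. The helper names group_by_stem and matches and the sample vocabulary are assumptions made for illustration; they are not part of NLTK. Only PorterStemmer.stem comes from the library.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer


def group_by_stem(words):
    """Map each Porter stem to the original words that reduce to it.

    Hypothetical helper for illustration; not an NLTK function.
    """
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


def matches(query, words):
    """Return the words that share a stem with the query term."""
    stemmer = PorterStemmer()
    return group_by_stem(words).get(stemmer.stem(query), [])


# Illustrative vocabulary, not taken from the article's sentence.
vocabulary = ["learn", "learning", "learned", "reader", "readers", "rooms"]
print(group_by_stem(vocabulary))
print(matches("learns", vocabulary))
```

In this sketch, learn, learning, and learned all fall under the stem learn, so a query for learns finds all three even though that exact form never appears in the vocabulary. The stemmer needs no downloaded data, which is why plain word lists are used here instead of nltk.word_tokenize.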

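Lemmatization is similar to stemming, but it takes context into account. Instead of simply stripping suffixes, it uses a vocabulary and the word's part of speech to return a valid dictionary form, the lemma. For example, it reduces rooms to room, and it can map an irregular form such as better to good when told that the word is an adjective. The following program uses the WordNet lexical database for lemmatization.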

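import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet corpus, downloaded once with
# nltk.download('wordnet').
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# Split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)
# Look up the lemma of each token (treated as a noun by default)
for w in nltk_tokens:
    print("Actual: %s  Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))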

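Running the code above produces the following output. Because lemmatize treats every token as a noun unless told otherwise, verb forms such as originated and learning are left unchanged: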

Actual: It  Lemma: It
Actual: originated  Lemma: originated
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: idea  Lemma: idea
Actual: that  Lemma: that
Actual: there  Lemma: there
Actual: are  Lemma: are
Actual: readers  Lemma: reader
Actual: who  Lemma: who
Actual: prefer  Lemma: prefer
Actual: learning  Lemma: learning
Actual: new  Lemma: new
Actual: skills  Lemma: skill
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: comforts  Lemma: comfort
Actual: of  Lemma: of
Actual: their  Lemma: their
Actual: drawing  Lemma: drawing
Actual: rooms  Lemma: room
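
The output above leaves learning and drawing untouched because no part of speech was supplied. As a sketch of how that context could be passed in, the block below maps Penn Treebank tags to the single-letter part-of-speech codes that WordNetLemmatizer.lemmatize accepts through its pos argument. The helpers penn_to_wordnet and lemmatize_tagged are hypothetical names introduced here, and the tag-prefix mapping is a common simplification rather than anything the article prescribes.

```python
from nltk.stem import WordNetLemmatizer


def penn_to_wordnet(tag):
    """Translate a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter.

    Hypothetical helper; the prefix rules are a simplifying assumption.
    """
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # noun, which is also the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatizer=None):
    """Lemmatize (word, Penn tag) pairs, passing each word's POS along."""
    # The WordNet corpus is only needed once lemmatize() is actually called.
    lemmatizer = lemmatizer or WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]
```

With the WordNet corpus installed, calling lemmatize_tagged([("learning", "VBG"), ("rooms", "NNS")]) returns learn and room, since learning is now looked up as a verb. The tags in that call are written by hand; in practice they could come from a tagger such as nltk.pos_tag, which requires its own downloaded model.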

Source: 菜鸟学院, https://www.cainiaoxueyuan.com/bc/6045.html