NLP Introduction -- NLTK

import nltk

Counting word frequencies

freq = nltk.FreqDist(tokens)
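
A minimal sketch of how this is typically used, assuming tokens is a list of words (for example produced by word_tokenize, shown further down); the sample sentence is chosen only for illustration, and most_common is a standard FreqDist method:

import nltk
from nltk.tokenize import word_tokenize

# hypothetical sample sentence, used only to have something to count
tokens = word_tokenize("the cat sat on the mat and the cat slept")
freq = nltk.FreqDist(tokens)
print(freq.most_common(3))   # the three most frequent tokens with their counts, e.g. ('the', 3) first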

Handling stop words

from nltk.corpus import stopwords
stopwords.words('english')
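
Loading the list is only half of the job; the usual pattern is to turn it into a set and filter your tokens against it. A rough sketch, with the sample sentence and variable names chosen only for illustration (requires the stopwords corpus, downloaded once with nltk.download('stopwords')):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))   # set membership tests are faster than list lookups
tokens = word_tokenize("This is a simple example showing the removal of stop words")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # roughly: ['simple', 'example', 'showing', 'removal', 'stop', 'words']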

Tokenizing text with NLTK

Text cannot be processed until it has been tokenized; tokenization means breaking larger pieces of text into smaller units.

You can tokenize a paragraph into sentences and a sentence into individual words; NLTK provides a sentence tokenizer and a word tokenizer for each of these tasks.

sentence_tokenizer

from nltk.tokenize import sent_tokenize   # Punkt-based sentence splitter; download its models once with nltk.download('punkt')
mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))
['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

word_tokenizer

from nltk.tokenize import word_tokenize   # splits a sentence into word and punctuation tokens
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))
['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']

Stemming

In linguistic morphology and information retrieval, stemming is the process of removing affixes from a word to obtain its stem; for example, the stem of working is work.

Search engines use this technique when indexing pages: people write many different forms of the same word, and stemming reduces them all to a common root.

Many stemming algorithms can handle this; the most common is the Porter stemming algorithm. NLTK provides it through a class called PorterStemmer.

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('working'))
print(stemmer.stem('worked'))
work
work
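
In practice the stemmer is applied to every token of a tokenized text, which is how the pieces above fit together. A minimal sketch, with the sentence chosen only for illustration:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
tokens = word_tokenize("The workers worked late and are still working")
print([stemmer.stem(t) for t in tokens])
# the inflected forms 'worked' and 'working' both collapse to the stem 'work'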