13th March, 2019
Textual data generally starts out as nothing more than a sequence of characters. Most text-analysis processes, however, operate on the individual words in the data set, so the text must first be parsed by tokenization. Tokenization gives later processing steps a reliable unit to work with. The step is often straightforward because the text is already stored in a machine-readable format, but some problems remain, such as removing punctuation marks and handling characters like brackets and hyphens. In short, tokenization divides a large quantity of text into smaller parts called tokens.

Natural language processing is used for building applications such as text classification, intelligent chatbots, sentiment analysis, language translation, etc.

The Natural Language Toolkit (NLTK) has a very important module, tokenize, which provides two key functions:

  1. word_tokenize
  2. sent_tokenize

Tokenization of words

word_tokenize() is used to split a sentence into words. The output of word tokenization can be stored in a DataFrame for better text understanding in machine learning applications (a sketch of this appears after the example below). It can also feed further text-cleaning steps such as punctuation removal, numeric-character removal, or stemming, and the resulting tokens are the units that are later converted into numeric features.

Install the NLTK package from the Anaconda prompt:

pip install nltk

Download the necessary data in a Python script:

import nltk
nltk.download()
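
Calling nltk.download() with no arguments opens an interactive downloader. For this tutorial only a couple of resources are needed, and they can be fetched directly (resource names as used by recent NLTK releases; they may differ in other versions):

import nltk

# Fetch only the resources this tutorial uses
nltk.download('punkt')                        # pretrained sentence/word tokenizer models
nltk.download('averaged_perceptron_tagger')   # default tagger behind nltk.pos_tag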

Steps:

  1. Import the word_tokenize function from the NLTK library.
  2. Initialize a variable to store the text data (here the variable is text).
  3. word_tokenize breaks the text into individual words and punctuation marks, as seen in the output.

Example:

from nltk.tokenize import word_tokenize

text = "The shining sun, the warm breeze, the cool surf, and palms blowing in the breeze are great!"

print(word_tokenize(text))

Output:

['The', 'shining', 'sun', ',', 'the', 'warm', 'breeze', ',', 'the', 'cool', 'surf', ',', 'and', 'palms', 'blowing', 'in', 'the', 'breeze', 'are', 'great', '!']
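
As mentioned earlier, the tokens can be stored in a DataFrame and cleaned further. Below is a minimal sketch, assuming pandas is installed (pandas is not part of NLTK), that keeps one token per row and drops pure-punctuation tokens:

import string

import pandas as pd
from nltk.tokenize import word_tokenize

text = "The shining sun, the warm breeze, the cool surf, and palms blowing in the breeze are great!"

# One token per row makes inspection and cleaning easy
df = pd.DataFrame({'token': word_tokenize(text)})

# Drop tokens that are just punctuation marks (',' and '!' here)
df = df[~df['token'].isin(list(string.punctuation))]
print(df['token'].tolist())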

Tokenization of sentences

Sentence tokenization is different from word tokenization: it splits text into sentences, using punctuation marks such as full stops, question marks, and exclamation points as boundaries.

Steps:

  1. Import the sent_tokenize function.
  2. sent_tokenize parses the text at sentence-ending punctuation and breaks it into individual sentences.

Example:

from nltk.tokenize import sent_tokenize

text = "How awful! What a chaos!."

print(sent_tokenize(text))

Output:

['How awful!', 'What a chaos!.']
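
The two tokenizers are often combined: first split the text into sentences, then split each sentence into words. A short sketch using the same example text:

from nltk.tokenize import sent_tokenize, word_tokenize

text = "How awful! What a chaos!."

# Word-tokenize each sentence separately
for sentence in sent_tokenize(text):
    print(word_tokenize(sentence))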

POS Tagging

Part-of-speech (POS) tagging is responsible for reading text in a language and assigning a specific part-of-speech tag to each word.
Steps:

  1. Tokenize the text using word_tokenize.
  2. Apply nltk.pos_tag(tokenize_text).
  3. The POS tagger adds linguistic (mostly grammatical) part-of-speech information, such as verb, noun, or preposition, to the sub-sentential units (tokens).

Example:

import nltk
from nltk.tokenize import word_tokenize

text = "I got the concert tickets!"

tokenize_text = word_tokenize(text)

print(nltk.pos_tag(tokenize_text))

Output:

[('I', 'PRP'), ('got', 'VBD'), ('the', 'DT'), ('concert', 'JJ'), ('tickets', 'NNS'), ('!', '.')]
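
The tags make it easy to filter tokens by grammatical category. Here is a minimal sketch that keeps only the nouns (the Penn Treebank noun tags all start with 'NN'); the exact tags you get depend on the tagger's output:

import nltk
from nltk.tokenize import word_tokenize

text = "I got the concert tickets!"
tagged = nltk.pos_tag(word_tokenize(text))

# Keep tokens tagged as nouns: NN, NNS, NNP, NNPS
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)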

Some examples of POS tags are as below:

Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ adjective (large)
JJR adjective, comparative (larger)
JJS adjective, superlative (largest)
LS list marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun, plural (desks)
NNP proper noun, singular (Sarah)
NNPS proper noun, plural (Indians or Americans)
PDT predeterminer (all, both, half)
POS possessive ending (parent's)
PRP personal pronoun (hers, herself, him, himself)
PRP$ possessive pronoun (her, his, mine, my, our)
RB adverb (occasionally, swiftly)
RBR adverb, comparative (greater)
RBS adverb, superlative (biggest)
RP particle (about)
TO infinitive marker (to)
UH interjection (goodbye)
VB verb, base form (ask)
VBG verb, gerund (judging)
VBD verb, past tense (pleaded)
VBN verb, past participle (reunified)
VBP verb, present tense, not 3rd person singular (wrap)
VBZ verb, present tense, 3rd person singular (bases)
WDT wh-determiner (that, what)
WP wh-pronoun (who)
WRB wh-adverb (how)
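
NLTK can also print the official definition and examples for any of these tags via nltk.help.upenn_tagset (this needs the 'tagsets' resource to be downloaded first):

import nltk
nltk.download('tagsets')   # documentation for the Penn Treebank tag set

# Look up one tag; calling it with no argument lists every tag
nltk.help.upenn_tagset('JJ')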