13th March, 2019
At the initial stage, textual data is generally just a sequence of characters. Most text-analysis processes operate on the words contained in the data set, so the first parsing step is tokenization, which divides a large quantity of text into smaller parts called tokens. A tokenizer thus provides a reliable basis for processing documents. The step may seem trivial when the text is already stored in a machine-readable format, but some problems remain, such as removing punctuation marks, while characters like brackets and hyphens are handled well.
Natural language processing is used for building applications such as text classification, intelligent chatbots, sentiment analysis, language translation, etc.
The Natural Language Toolkit (NLTK) has a very important module, tokenize, which further comprises several sub-modules.
word_tokenize() is used to split a sentence into words. The output of word tokenization can be stored in a DataFrame for better text understanding in machine learning applications. It can also feed further text-cleaning steps such as punctuation removal, numeric character removal, or stemming. Word tokenization is also a prerequisite for converting text into numeric features.
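As a small illustration of the cleaning steps mentioned above, punctuation tokens can be filtered out of a token list with plain Python (a minimal sketch; the token list below is assumed to be the kind of output word_tokenize produces):

```python
import string

# Tokens as word_tokenize would typically produce them (assumed input)
tokens = ['The', 'shining', 'sun', ',', 'the', 'cool', 'surf', '!']

# Keep only tokens that are not pure punctuation characters
words = [t for t in tokens if t not in string.punctuation]
print(words)  # ['The', 'shining', 'sun', 'the', 'cool', 'surf']
```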
Install the NLTK package from the Anaconda prompt:
pip install nltk
Download the necessary NLTK data in a Python script:
import nltk
nltk.download()  # opens the interactive downloader
# or download specific resources directly:
nltk.download('punkt')  # models used by word_tokenize and sent_tokenize
nltk.download('averaged_perceptron_tagger')  # used by pos_tag
Example:
from nltk.tokenize import word_tokenize
text = "The shining sun, the warm breeze, the cool surf, and palms blowing in the breeze are great!"
print(word_tokenize(text))
Output:
['The', 'shining', 'sun', ',', 'the', 'warm', 'breeze', ',', 'the', 'cool', 'surf', ',', 'and', 'palms', 'blowing', 'in', 'the', 'breeze', 'are', 'great', '!']
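Note how word_tokenize separates punctuation into tokens of its own; a plain str.split(), which only cuts on whitespace, would leave punctuation glued to the words:

```python
text = "The shining sun, the warm breeze, the cool surf, and palms blowing in the breeze are great!"

# str.split() only cuts on whitespace, so 'sun,' keeps its comma
print(text.split()[:3])  # ['The', 'shining', 'sun,']
```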
Sentence tokenization is different from word tokenization: sent_tokenize() splits text into sentences at sentence-ending punctuation such as periods, question marks, and exclamation points.
Example:
from nltk.tokenize import sent_tokenize
text = "How awful! What a chaos!"
print(sent_tokenize(text))
Output:
['How awful!', 'What a chaos!']
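Under the hood, sent_tokenize is more sophisticated than a plain split on punctuation (it is trained to handle cases like abbreviations), but the core idea can be sketched with a regular expression. This is an illustrative approximation, not NLTK's actual algorithm:

```python
import re

text = "How awful! What a chaos!"

# Split after ., ! or ? followed by whitespace
# (naive: would wrongly split after 'Dr.' in 'Dr. Smith')
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)  # ['How awful!', 'What a chaos!']
```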
Part-of-speech (POS) tagging is responsible for reading text in a language and assigning a specific tag (part of speech) to each word.
Example:
import nltk
from nltk.tokenize import word_tokenize

text = "I got the concert tickets!"
tokenized_text = word_tokenize(text)
print(nltk.pos_tag(tokenized_text))
Output:
[('I', 'PRP'), ('got', 'VBD'), ('the', 'DT'), ('concert', 'JJ'), ('tickets', 'NNS'), ('!', '.')]
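A tagged list like the one above can be filtered by tag, for example to pull out just the nouns (the tagged output shown is hard-coded here for illustration):

```python
# Tagged output as shown above
tagged = [('I', 'PRP'), ('got', 'VBD'), ('the', 'DT'),
          ('concert', 'JJ'), ('tickets', 'NNS'), ('!', '.')]

# NN* tags cover singular, plural, and proper nouns
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)  # ['tickets']
```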
Some examples of POS tags are as below:
Abbreviation | Meaning |
---|---|
CC | coordinating conjunction |
CD | cardinal digit |
DT | determiner |
EX | existential there |
FW | foreign word |
IN | preposition/subordinating conjunction |
JJ | adjective (large) |
JJR | adjective, comparative (larger) |
JJS | adjective, superlative (largest) |
LS | list marker |
MD | modal (could, will) |
NN | noun, singular (cat, tree) |
NNS | noun plural (desks) |
NNP | proper noun, singular (Sarah) |
NNPS | proper noun, plural (Indians or Americans) |
PDT | predeterminer (all, both, half) |
POS | possessive ending (parent's) |
PRP | personal pronoun (hers, herself, him, himself) |
PRP$ | possessive pronoun (her, his, mine, my, our) |
RB | adverb (occasionally, swiftly) |
RBR | adverb, comparative (greater) |
RBS | adverb, superlative (biggest) |
RP | particle (about) |
TO | infinitive marker (to) |
UH | interjection (goodbye) |
VB | verb (ask) |
VBD | verb past tense (pleaded) |
VBG | verb gerund (judging) |
VBN | verb past participle (reunified) |
VBP | verb, present tense, not 3rd person singular (wrap) |
VBZ | verb, present tense with 3rd person singular (bases) |
WDT | wh-determiner (that, what) |
WP | wh- pronoun (who) |
WRB | wh- adverb (how) |
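The table above can also be used programmatically, for example to map fine-grained Penn Treebank tags to coarse word classes. The mapping below is a small illustrative subset, not a complete one:

```python
# Partial mapping from tag prefixes to coarse word classes (illustrative)
coarse = {'JJ': 'adjective', 'NN': 'noun', 'RB': 'adverb', 'VB': 'verb'}

def word_class(tag):
    # Match on the first two characters, so JJR, NNS, VBD, etc. also resolve
    return coarse.get(tag[:2], 'other')

print(word_class('NNS'))  # noun
print(word_class('VBD'))  # verb
print(word_class('CC'))   # other
```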