Character strings can turn up at every stage of a data science project. For example, we might have to clean messy string input before analysis.
Pattern matching is one of the most important families of functions in stringr.
Install the stringr and rebus packages. rebus provides the START and END shortcuts, which build regular expressions that match the start and end of a string.
library(stringr)
library(rebus)
x <- c("cat", "coat", "scotland", "tic toc")
Match the strings that start with “c”
str_view(x, pattern = START %R% "c")
Match the strings that start with “ca”
str_view(x, pattern = START %R% "ca")
Match the strings that end with “at”
str_view(x, pattern = "at" %R% END)
Match the string that is exactly “coat”
str_view(x, pattern = START %R% "coat" %R% END)
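As a rough cross-check of the anchors above, the sketch below compares each rebus pattern with the plain regular expression it is assumed to stand for (`^` for START, `$` for END). It uses `str_detect()` instead of `str_view()` so the result can be inspected as a logical vector; the variable names are illustrative only.

```r
library(stringr)
library(rebus)

x <- c("cat", "coat", "scotland", "tic toc")

# rebus anchors next to their assumed plain-regex equivalents
starts_with_c_rebus <- str_detect(x, START %R% "c")
starts_with_c_plain <- str_detect(x, "^c")

ends_with_at_rebus <- str_detect(x, "at" %R% END)
ends_with_at_plain <- str_detect(x, "at$")

exactly_coat_rebus <- str_detect(x, START %R% "coat" %R% END)
exactly_coat_plain <- str_detect(x, "^coat$")

# Each pair should agree element by element
identical(starts_with_c_rebus, starts_with_c_plain)
identical(ends_with_at_rebus, ends_with_at_plain)
identical(exactly_coat_rebus, exactly_coat_plain)
```

Only "cat" and "coat" start with "c" and end with "at", and only "coat" satisfies the fully anchored pattern.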
Match any single character between “c” and “t”.
Notice that ANY_CHAR will match a space character (“c t” in “tic toc”). It will also match digits or punctuation symbols,
but ANY_CHAR only ever matches exactly one character, which is why we get no match in “coat”.
str_view(x, pattern = "c" %R% ANY_CHAR %R% "t")
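To make the one-character rule concrete, here is a small sketch on a made-up vector `y` (not part of the original data) that puts a space, a digit, a punctuation symbol, two characters, and no character between “c” and “t”.

```r
library(stringr)
library(rebus)

# Hypothetical inputs chosen only for illustration
y <- c("c t", "c1t", "c-t", "coat", "ct")

one_char_between <- str_detect(y, "c" %R% ANY_CHAR %R% "t")
one_char_between
# [1]  TRUE  TRUE  TRUE FALSE FALSE
```

The last two elements fail because “coat” has two characters between “c” and “t”, and “ct” has none.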
Match any character followed by a “t”
str_view(x, pattern = ANY_CHAR %R% "t")
Match a “t” followed by any character
str_view(x, pattern = "t" %R% ANY_CHAR)
Match a string with exactly four characters
str_view(x, pattern = START %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% END)
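Chaining ANY_CHAR four times works, but it gets unwieldy for longer strings. As an alternative sketch (not the approach used above), the same check can be written with a plain regex quantifier or with `str_length()`; both are swapped-in techniques shown only for comparison.

```r
library(stringr)

x <- c("cat", "coat", "scotland", "tic toc")

# Plain-regex form: exactly four characters between the anchors
four_by_regex <- str_detect(x, "^.{4}$")

# Length-based form: no regular expression needed
four_by_length <- str_length(x) == 4

x[four_by_regex]
# [1] "coat"
```

The length-based form also counts line breaks as characters, so the two can differ on multi-line strings.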
We can pass a regular expression as the pattern argument to any stringr function that accepts one.
Test the pattern “a” followed by any character
pattern <- "a" %R% ANY_CHAR
str_view(x, pattern)
Find the strings that contain the pattern, then count them
names_with_a <- str_subset(x, pattern)
names_with_a
[1] "cat" "coat" "scotland"
length(names_with_a)
[1] 3
Extract just the part of each string that matches the pattern (str_extract returns only the first match per string). Here “an” appears once and “at” appears twice.
part_with_a <- str_extract(x, pattern)
part_with_a
[1] "at" "at" "an" NA
table(part_with_a)
part_with_a
an at
 1  2
Does any word have the pattern more than once?
count_of_a <- str_count(x, pattern)
count_of_a
[1] 1 1 1 0
table(count_of_a)
count_of_a
0 1
1 3
count_of_a <- str_count(c("look at cat", "coat", "scotland", "tic toc"), pattern)
count_of_a
[1] 2 1 1 0
table(count_of_a)
count_of_a
0 1 2
1 2 1
Note that in the above example the pattern matches twice in the first string, “look at cat”.
Which strings contain the pattern? (This returns a logical vector.)
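To see both matches in “look at cat”, one option is `str_extract_all()`, which returns every match per string instead of only the first. The sketch below reuses the same pattern; the vector name `y` is illustrative.

```r
library(stringr)
library(rebus)

pattern <- "a" %R% ANY_CHAR
y <- c("look at cat", "coat", "scotland", "tic toc")

# str_extract() keeps only the first match in each string
first_match <- str_extract(y, pattern)
first_match
# [1] "at" "at" "an" NA

# str_extract_all() returns a list with every match per string
all_matches <- str_extract_all(y, pattern)
all_matches[[1]]
# [1] "at" "at"

# The number of matches per string agrees with str_count()
match_counts <- lengths(all_matches)
match_counts
# [1] 2 1 1 0
```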
with_a <- str_detect(x, pattern)
with_a
[1] TRUE TRUE TRUE FALSE
What fraction of the strings contain the pattern? Here 3 out of 4 do.
mean(with_a)
[1] 0.75
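The fraction works because R treats TRUE as 1 and FALSE as 0 in arithmetic, so `mean()` of a logical vector is the proportion of TRUE values. A brief sketch tying this back to the earlier steps, assuming the same `x` and `pattern`:

```r
library(stringr)
library(rebus)

x <- c("cat", "coat", "scotland", "tic toc")
pattern <- "a" %R% ANY_CHAR

with_a <- str_detect(x, pattern)

# TRUE counts as 1, so sum() gives the number of matching strings
n_with_a <- sum(with_a)
n_with_a
# [1] 3

# ... and mean() gives the proportion
frac_with_a <- mean(with_a)
frac_with_a
# [1] 0.75

# Indexing with the logical vector gives the same strings as str_subset()
identical(x[with_a], str_subset(x, pattern))
# [1] TRUE
```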