Character strings can turn up in all stages of a data science project. For example, we might have to clean messy string input before analysis.

Pattern matching is one of the most important families of functions in stringr.

Install the stringr and rebus packages. rebus provides the START and END shortcuts, which specify regular expressions that match the start and end of a string.

library(stringr)
library(rebus)

x <- c("cat", "coat", "scotland", "tic toc")

Match the strings that start with “c”

str_view(x, pattern = START %R% "c")

Match the strings that start with “ca”

str_view(x, pattern = START %R% "ca")

Match the strings that end with “at”

str_view(x, pattern = "at" %R% END)

Match the string that is exactly “coat”

str_view(x, pattern = START %R% "coat" %R% END)
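Because str_view() shows its matches visually, it can help to check the same anchored patterns with str_detect(), which returns a logical vector. The following is a minimal sketch, assuming that START and END correspond to the regex anchors ^ and $; the vector x is repeated so the block runs on its own, and the result names are only illustrative.

```r
library(stringr)
library(rebus)

x <- c("cat", "coat", "scotland", "tic toc")

# Strings that start with "c" (plain regex equivalent: "^c")
starts_with_c <- str_detect(x, START %R% "c")

# Strings that end with "at" (plain regex equivalent: "at$")
ends_with_at <- str_detect(x, "at" %R% END)

# Strings that are exactly "coat" (plain regex equivalent: "^coat$")
is_coat <- str_detect(x, START %R% "coat" %R% END)

x[starts_with_c]  # "cat"  "coat"
x[ends_with_at]   # "cat"  "coat"
x[is_coat]        # "coat"
```

Neither “scotland” nor “tic toc” is selected by the first pattern: both contain a “c”, but not at the start of the string.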

Match any single character between “c” and “t”

Notice that ANY_CHAR will match a space character (“c t” in “tic toc”). It will also match numbers or punctuation symbols, but ANY_CHAR only ever matches one character, which is why we get no match in “coat”.

str_view(x, pattern = "c" %R% ANY_CHAR %R% "t")
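To see exactly which characters were matched, the same pattern can be passed to str_extract(). This is a small sketch, assuming ANY_CHAR behaves like the regex “.”; the name pattern_c_t is only illustrative.

```r
library(stringr)
library(rebus)

x <- c("cat", "coat", "scotland", "tic toc")

# "c", then exactly one character of any kind, then "t"
# (plain regex equivalent: "c.t")
pattern_c_t <- "c" %R% ANY_CHAR %R% "t"

# First match in each string, or NA where nothing matches
matched <- str_extract(x, pattern_c_t)
matched  # "cat" NA "cot" "c t"
```

“coat” returns NA because two characters sit between its “c” and “t”, while the space in “tic toc” counts as a single character.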

Match any character followed by a “t”

str_view(x, pattern = ANY_CHAR %R% "t")

Match a “t” followed by any character

str_view(x, pattern = "t" %R% ANY_CHAR)

Match a string with exactly four characters

str_view(x, pattern = START %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% END)
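Repeating ANY_CHAR four times becomes unwieldy for longer strings. As a hedged alternative, the same idea can be written with a plain regex quantifier, or checked without a pattern at all using str_length(); both are sketches of equivalent approaches rather than the method used above.

```r
library(stringr)

x <- c("cat", "coat", "scotland", "tic toc")

# "^.{4}$": start of string, exactly four characters, end of string
four_by_regex <- str_subset(x, "^.{4}$")

# The same check based on string length, with no pattern involved
four_by_length <- x[str_length(x) == 4]

four_by_regex   # "coat"
four_by_length  # "coat"
```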

We can pass a regular expression as the pattern argument to any stringr function that has the pattern argument.

Test pattern “a” followed by any character

pattern <- "a" %R% ANY_CHAR 
str_view(x, pattern)  

Count the strings that contain the pattern

names_with_a <- str_subset(x, pattern)
names_with_a
[1] "cat"      "coat"     "scotland"
length(names_with_a)
[1] 3

Extract just the part of each string that matches the pattern. Here “an” appears once and “at” appears twice.

part_with_a <- str_extract(x, pattern)
part_with_a
[1] "at" "at" "an" NA  
table(part_with_a)
part_with_a
an at 
 1  2 
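The counts above add up to 3 even though x has 4 elements, because table() drops NA values by default. A small sketch showing how to keep the unmatched string in the tally, using the plain regex "a." on the assumption that it is equivalent to "a" %R% ANY_CHAR:

```r
library(stringr)

x <- c("cat", "coat", "scotland", "tic toc")

# "a." is "a" followed by any single character
part_with_a <- str_extract(x, "a.")

# useNA = "ifany" adds an <NA> column when missing values are present
counts_with_na <- table(part_with_a, useNA = "ifany")
counts_with_na

# Number of strings with no match at all
n_unmatched <- sum(is.na(part_with_a))
n_unmatched  # 1
```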

Does any string contain the pattern more than once?

count_of_a <- str_count(x, pattern)
count_of_a
[1] 1 1 1 0
table(count_of_a)
count_of_a
0 1 
1 3 
count_of_a <- str_count(c("look at cat", "coat", "scotland", "tic toc"), pattern)
count_of_a
[1] 2 1 1 0
table(count_of_a)
count_of_a
0 1 2 
1 2 1 

Note that in the example above, the pattern matches twice in the first string: once in “at” and once in “cat”.
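str_extract() returns only the first match in each string, so it cannot show both occurrences. A minimal sketch using str_extract_all(), which returns a list with one character vector per input string; the name phrases is only illustrative, and "a." is assumed to be equivalent to the rebus pattern used above.

```r
library(stringr)

phrases <- c("look at cat", "coat", "scotland", "tic toc")

# One list element per input string, holding every match found in it
all_matches <- str_extract_all(phrases, "a.")

all_matches[[1]]      # "at" "at"
lengths(all_matches)  # 2 1 1 0
```

The lengths of the list elements agree with the result of str_count().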

Which strings contain the pattern? (str_detect returns a logical vector)

with_a <- str_detect(x, pattern)
with_a
[1]  TRUE  TRUE  TRUE FALSE

What fraction of the strings contain the pattern? Here 3 out of 4 do.

mean(with_a)
[1] 0.75
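mean() works on a logical vector because R treats TRUE as 1 and FALSE as 0 in arithmetic. The short sketch below spells out the same calculation step by step, again assuming the plain regex "a." in place of the rebus pattern.

```r
library(stringr)

x <- c("cat", "coat", "scotland", "tic toc")
with_a <- str_detect(x, "a.")

# TRUE counts as 1, so sum() gives the number of matching strings
n_matching <- sum(with_a)
n_matching  # 3

# Dividing by the number of strings gives the same value as mean(with_a)
fraction_matching <- n_matching / length(with_a)
fraction_matching  # 0.75
```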