For those who are learning R and may be well-versed in SQL, package sqldf
is very useful because it enables us to use SQL commands in R..One who has basic SQL skills can manipulate data frames in R using their SQL skills.
The main function in the package sqldf
is sqldf()
, which takes a single argument which is an SQL select
statement.
Let us see how we can run SQL queries in R usingsqldf() function
.
#Install and load sqldf package
install.packages('sqldf')
library(sqldf)
#Import basic_salary data
salary_data<-read.csv("basic_salary.csv", header=T)
head(salary_data, n=3)
## First_Name Last_Name Grade Location ba ms
## 1 Alan Brown GR1 DELHI 17990 16070
## 2 Agatha Williams GR2 MUMBAI 12390 6630
## 3 Rajesh Kolte GR1 MUMBAI 19250 14960
Display columns ‘First_Name’ and ‘ba’ of salary_data
# Column Subsetting
subcols<-"select First_Name,ba from salary_data"
subset_cols<-sqldf(subcols)
subset_cols
## First_Name ba
## 1 Alan 17990
## 2 Agatha 12390
## 3 Rajesh 19250
## 4 Ameet 14780
## 5 Neha 19235
## 6 Sagar 13390
## 7 Aaron 23280
## 8 John 13500
## 9 Sneha 20660
## 10 Gaurav 13760
## 11 Adela 13660
## 12 Anup 11960
Calculate sum of variable ‘ba’ by variable ‘Location’
agg_sum<-"select Location,sum(ba) as sum_of_ba from salary_data group by Location"
aggdf<-sqldf(agg_sum)
aggdf
Location sum_of_ba
1 DELHI 80850
2 MUMBAI 113005
You can use the aggregation functions like sum(), count(), avg() with select statement.
as clause is used to change the name of the column in the output.
group by clause is used with select statement for aggregation in SQL.
Currently, sqldf does not support right joins or full outer joins. Let’s perform left and inner join on sal_data and bonus_data.
#Import sal_data and bonus_data
sal_data<-read.csv("sal_data.csv", header=T)
bonus_data<-read.csv("bonus_data.csv", header=T)
head(sal_data, n=3)
## Employee_ID First_Name Last_Name Basic_Salary
## 1 E-1001 Mahesh Joshi 16860
## 2 E-1002 Rajesh Kolte 14960
## 3 E-1004 Priya Jain 12670
head(bonus_data, n=3)
## Employee_ID Bonus
## 1 E-1001 16070
## 2 E-1003 15200
## 3 E-1004 13490
# Left Join
leftjoin_string<-"select sal_data.*,bonus_data.Bonus from sal_data left join bonus_data on sal_data.Employee_ID = bonus_data.Employee_ID"
sal_join_bonus1<-sqldf(leftjoin_string)
sal_join_bonus1
## Employee_ID First_Name Last_Name Basic_Salary Bonus
## 1 E-1001 Mahesh Joshi 16860 16070
## 2 E-1002 Rajesh Kolte 14960 NA
## 3 E-1004 Priya Jain 12670 13490
## 4 E-1005 Sneha Joshi 15660 NA
## 5 E-1007 Ram Kanade 15850 NA
## 6 E-1008 Nishi Honrao 15950 15880
## 7 E-1009 Hameed Singh 15120 NA
# Inner Join
innerjoin_string<-"select sal_data.*, bonus_data.Bonus from sal_data inner join bonus_data on sal_data.Employee_ID = bonus_data.Employee_ID"
sal_join_bonus2<-sqldf(innerjoin_string)
sal_join_bonus2
## Employee_ID First_Name Last_Name Basic_Salary Bonus
## 1 E-1001 Mahesh Joshi 16860 16070
## 2 E-1004 Priya Jain 12670 13490
## 3 E-1008 Nishi Honrao 15950 15880
Note: basic_salary, sal_data and bonus_data are csv files, saved in working directory. If your data is not saved in working directory then specify the path while importing it.