Introduction to data.table in R

Sat, 05 Dec 2020 11:49:47 -0500

One of the most exciting packages in the R universe is data.table. This package allows you to create data.table objects as an alternative to data.frame. The benefit of using data.table is the speed and efficiency when working with large datasets, and a concise syntax that can be really easy to use once you get used to it.

Load data.table package and dataset

Let's start by installing and loading the package. For this demonstration we will use the iris dataset that comes preinstalled with R.

install.packages("data.table") # only needed the first time
library(data.table)
data <- setDT(copy(iris))
# setDT() converts by reference, and the built-in iris object cannot be modified in place, so we convert a copy

In this case we didn’t import our data from a csv file, but if you have a large dataset that you need to import, the fread() function from data.table allows for extremely quick importing of data from either a csv file or a URL.
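As a minimal sketch of fread(), the snippet below first writes iris to a temporary csv file and then reads it back in. The temporary file and the object names are made up for this illustration; fwrite() is data.table's companion function for writing files.

```r
library(data.table)

# Hypothetical file: write iris to a temporary csv so there is something to import
tmp_csv <- tempfile(fileext = ".csv")
fwrite(iris, tmp_csv)

# fread() detects the separator and column types and returns a data.table
iris_from_csv <- fread(tmp_csv)
class(iris_from_csv)
```

Passing a URL instead of a file path works the same way.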

Also, here we converted an existing data frame to a data table object using setDT(). We could also use as.data.table() on an existing object, or create our own data table from scratch by inputting data with data.table().
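The other two ways of creating a data table might look like the sketch below. The plants table and its values are invented purely for illustration.

```r
library(data.table)

# as.data.table() returns a new data.table and leaves the original data frame untouched
iris_dt <- as.data.table(iris)

# data.table() builds a table from scratch; these values are made up for illustration
plants <- data.table(
  plant_id = 1:3,
  species  = c("setosa", "versicolor", "virginica")
)
plants
```

Unlike setDT(), as.data.table() makes a copy, so it can be used on iris directly.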

Data Table Syntax

The main syntax that data.table uses is DT[i, j, by], where i selects rows, j selects or computes on columns, and by defines the groups to compute within.

This syntax allows data to be manipulated extremely easily, often in only one line of code! It may seem pretty foreign, so let's go through some explanations and examples.
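To give a feel for how the three parts fit together, here is a rough sketch that uses all of them at once. The object names and the cutoff of 3 are arbitrary choices for the example.

```r
library(data.table)
dt <- as.data.table(iris)

# i:  keep rows with a sepal width below 3
# j:  compute the mean petal length
# by: do the computation separately for each species
narrow_sepals <- dt[Sepal.Width < 3,
                    .(mean_petal_length = mean(Petal.Length)),
                    by = Species]
narrow_sepals
```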

Filter rows

Let’s say we only want the 15th row of our dataset. Or only data for flowers of the versicolor species. Or only data for the versicolor species with a sepal width less than 3. This is all easy to do in the “i” part of a data table.

#15th row
row_15 <- data[15]
row_15

# versicolor species
versicolor <- data[Species == "versicolor"]
head(versicolor)

#versicolor species with a sepal width less than 3
versicolor_sepal3 <- data[Species == "versicolor" & Sepal.Width < 3]
head(versicolor_sepal3)

Other special operators include %in% (matches any value in a vector) and %between% (falls within a range, endpoints included).


setosa_virginica <- data[Species %in% c("setosa", "virginica")]
pl_1.2_1.6 <- data[Petal.Length %between% c(1.2, 1.6)]
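As a quick check of what %between% does, the sketch below compares it with the equivalent pair of comparisons. The object names are made up for the example.

```r
library(data.table)
dt <- as.data.table(iris)

# %between% includes both endpoints, so these two filters select the same rows
with_between <- dt[Petal.Length %between% c(1.2, 1.6)]
with_compare <- dt[Petal.Length >= 1.2 & Petal.Length <= 1.6]
all.equal(with_between, with_compare)
```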

Working with columns

Let's say you want to keep all rows of your dataset, but only the Species and Sepal Width columns. Note that you leave the i section empty by just adding a comma before the j section.


species_sepalw <- data[, c("Species", "Sepal.Width")]
#.() is data.table shorthand for list(), so this selects the same two columns by name
species_sepalw <- data[, .(Species, Sepal.Width)]
head(species_sepalw)

It is also easy to do computations on columns in data.table. Let’s find the median sepal width.


median_sw <- data[, median(Sepal.Width)]
median_sw
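The same idea extends to several computations at once. In this sketch, the column names median_sw, mean_sw and sd_sw are just labels chosen for the example.

```r
library(data.table)
dt <- as.data.table(iris)

# Wrapping several named expressions in .() returns one column per summary
sepal_width_summary <- dt[, .(
  median_sw = median(Sepal.Width),
  mean_sw   = mean(Sepal.Width),
  sd_sw     = sd(Sepal.Width)
)]
sepal_width_summary
```

Because j returns a list here, the result is a one-row data table rather than a single number.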

The := Operator: Creating New Columns

Using the “:=” operator in the j section allows you to create new columns using existing ones. When creating columns by reference, you don’t have to assign the result to a new object, because the column will be directly added to the current data table. For example, let’s compute sepal area (length x width)


data[, Sepal.Area := Sepal.Length * Sepal.Width]
head(data)

Notice how the new column name is on the left of the operator, and the operation is on the right.
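Two related patterns are worth sketching: adding several columns in one call with the functional form of :=, and removing a column by assigning NULL. Petal.Area is a made-up column added only for this illustration.

```r
library(data.table)
dt <- as.data.table(iris)

# The functional form `:=`() adds several columns in one call
dt[, `:=`(
  Sepal.Area = Sepal.Length * Sepal.Width,
  Petal.Area = Petal.Length * Petal.Width
)]

# Assigning NULL removes a column, again without reassigning dt
dt[, Petal.Area := NULL]
names(dt)
```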

Group by variables

We can use the last section, “by”, to group rows by one or more variables. For example, we can get the mean sepal length by species.


mean_by_species <- data[, .(mean.sepal.length = mean(Sepal.Length)), by = Species]
mean_by_species

You can also combine the := operator with “by” to create a new column whose values are computed within each group (defined by one variable or multiple variables).


data[, mean.sepal.width := mean(Sepal.Width), by = Species]
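Grouping by more than one variable could look something like this sketch. Wide.Sepal is a hypothetical flag created on the fly, and the cutoff of 3 is an arbitrary choice.

```r
library(data.table)
dt <- as.data.table(iris)

# Group by species and by a derived TRUE/FALSE flag defined inside by
mean_by_species_width <- dt[, .(mean.sepal.length = mean(Sepal.Length)),
                            by = .(Species, Wide.Sepal = Sepal.Width > 3)]
mean_by_species_width
```

Each combination of species and flag that occurs in the data gets its own row.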

The .N special symbol

The .N symbol stands for the number of rows in the current selection, so it functions similarly to nrow(). For example, here we can get the number of rows where the sepal width is less than 3 and the sepal length is less than 5.


less_3_5 <- data[Sepal.Width < 3 & Sepal.Length < 5, .N]
less_3_5

.N becomes really useful when you also use “by”. For example, let’s get the number of observations per species.


obs <- data[, .N, by = Species]
obs
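Two more uses of .N, sketched with made-up object names: picking the last row, and sorting the per-group counts by chaining a second set of brackets.

```r
library(data.table)
dt <- as.data.table(iris)

# In i, .N is the total number of rows, so this returns the last row
last_row <- dt[.N]

# Chain a second [] to sort the counts; the count column is named N by default
obs_sorted <- dt[, .N, by = Species][order(-N)]
obs_sorted
```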

There are lots of other features of data.table that make it useful to use for data manipulation, but here I have gone over the basics and the operations that I use most often. Notice how easy it is to manipulate tables and combine multiple operations in only one line of code! If you are looking for more resources on using data.table you can:

Visit the data.table page on CRAN
Complete a DataCamp tutorial. Here is one called Data Manipulation with data.table in R
Check out the CRAN vignette called Introduction to data.table

Happy analyzing!

The Importance of Reproducible Code

Mon, 30 Nov 2020 19:34:08 -0500

After I started conducting data analysis in R, I quickly realized that making your code easily reproducible is one of the most important steps to improving your research workflow. Trust me, I learned some of these lessons the hard way. The major benefits of making your code easy to understand and easy to reproduce are:

You can share code easily with others and they will be able to run it and won’t be completely lost
A few months down the line you can come back to your code and actually understand what the heck was going on
You’re bound to make fewer errors
It makes writing and reading code a much more pleasant experience

My top tips for making your code readable and reproducible in RStudio:

Start by creating an R project (.Rproj file) in a new folder on your computer. This allows you to have a separate directory for each project you work on. You can then add your script and any files you might need into the folder with the .Rproj file and have a simple working directory. Then when loading files into your code, you can use relative file paths. If you need to share the project, you can easily zip it and send it to someone else, and all the necessary files will already be in there (with no need to change the file paths)!

I like to use RMarkdown for my scripts. With RMarkdown you can integrate code “chunks” with plain text, making it easy to organize and explain your code. You can also “knit” the document to create either a PDF or an html file. A great resource for understanding RMarkdown is this cheat sheet.

Make sure to load all packages at the top of your code. If you need to install the packages, make sure to delete or comment out that code.

Make sure your code runs in order. If you aren’t sure whether it does, clear your global environment, restart your session and see if the script runs as a whole. It can sometimes be tempting to test stuff out and run things line by line, but this will make your work much harder for you or anyone else to reproduce. Another benefit of using RMarkdown is that your document won’t knit properly unless it can run all together.

Add comments to your code. This will make it so much easier for others and future you to understand what you were doing. I like to split my code up into sections and add a general explanation of what I am doing and why at the beginning of each section. Then within the body of the code, I will use #comments to explain the individual steps. You don’t need to comment literally every line, but make sure to annotate what is not obvious.

Try to make the names of your objects informative and intuitive. It’s also best not to make them too similar to each other in case you mix them up. For example, having two tables called data_clean and data_cleaner will probably lead to some confusion.
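Put together, a script that follows several of these tips might look roughly like the sketch below. The folder data, the file measurements.csv and the object names are all hypothetical; the sketch creates the file itself only so that it runs as written, which a real project would not need to do.

```r
# ---- Setup: load every package at the top ----
# install.packages("data.table")  # run once, then keep commented out
library(data.table)

# ---- Import: relative paths resolve from the project folder ----
# Hypothetical input file, created here only to keep the sketch runnable
dir.create("data", showWarnings = FALSE)
fwrite(iris, file.path("data", "measurements.csv"))
measurements <- fread(file.path("data", "measurements.csv"))

# ---- Summarize: mean sepal length per species ----
species_means <- measurements[, .(mean_sepal_length = mean(Sepal.Length)),
                              by = Species]
species_means
```

Because the path is relative, the script runs unchanged on any computer where the project folder is the working directory.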

These are just some tips that I’ve found useful, but make sure to do what works for you! Good luck and happy coding 😄

Posts | Sarah Berger