Kickstarting R
Getting data in and out of R

The Manual Method

R uses the assignment operator <- to give a data (or other) object its values. ->, which points the other way, can also be used, although the assignment is then from left to right. A very common mistake (carried over from languages that use the = sign for both comparison and assignment) is to mix up assignment and comparison in R, where equality is tested with ==.

> x<-2

assigns the value of 2 to x.
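
As a small illustration of the difference between assigning and comparing, the following sketch may help. The object name w is made up for the example.

```r
x <- 2        # assign 2 to x (right to left)
3 -> w        # assign 3 to w (left to right)

x == 2        # comparison, not assignment: TRUE
x == w        # comparison: FALSE, and x is left unchanged
```

Neither comparison alters x or w; only <- and -> change what an object holds.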

> y<-c(1,2,3,4,5)

assigns the vector of values shown to y. Note here that you must use the c ("combine") function. However, once you have assigned the value of y, you may then assign its value to other data objects.

> z<-y

The cryptically named c will also combine character strings into a vector

> names<-c("Abe","Bob","Con")

or several lists into a single list.
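
A few more uses of c, sketched with made-up object names (longer, people, mixed, both). Note the coercion when numbers and strings are combined in one vector.

```r
y <- c(1, 2, 3, 4, 5)

longer <- c(y, 6, 7)                  # extends y to a numeric vector of length 7
people <- c("Abe", "Bob", "Con")      # character vector of length 3
mixed  <- c(y[1:2], "Abe")            # numbers are coerced to character strings
both   <- c(list(1, "a"), list(TRUE)) # two lists joined into one list of length 3

length(longer)
mixed
length(both)
```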

The underscore character '_' also acted as an assignment operator until R v1.8.0. This was a real bummer if you used underscores in place of spaces in naming objects. Fortunately, it has been officially evicted from the pantheon of operators, but may still bedevil users of earlier versions.

Getting pre-existing data into R

Manually entering data is only suitable for small data sets. How do you get your rectangular data file or spreadsheet or database table into R? The foreign package will import many different data formats.

One of the most straightforward ways to retrieve data is through plain text. Almost all applications used for handling data will export data as a delimited file in ASCII text, and this gives us a rough and ready way to get the vast majority of data into R.

First, export the data, usually using a command like Save As... and selecting ASCII text, CSV or just text.

Some spreadsheets export numeric fields with embedded spaces. These are usually read in as factors, which is often not what you want. Stripping out any embedded spaces with:

tr -d '\40' < old.dat > new.dat

will usually fix things up. Text editors may also be used if they have a search and replace facility, by searching for spaces and replacing them with nothing.
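
If the values are already in R as character strings, the same cleanup can be sketched with gsub. The vector raw below holds made-up values purely for illustration.

```r
# Hypothetical values as they might arrive from a spreadsheet export
raw <- c("1 234", "56", " 7 890 ")

# Remove every space, then convert the result to numbers
clean <- as.numeric(gsub(" ", "", raw, fixed = TRUE))
clean
```

Like the tr command, this removes all spaces, so it should only be applied to columns that are meant to be numeric.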

You may have a choice of delimiters (the characters that separate the data values). Commas are often used (as in the CSV format), so say you have a comma-delimited data file named infert.dat that looks like this


and want to import it.

> infert<-read.table("/home/jim/infert.dat",header=T,sep=",")

What read.table does is try to read data from the file named as the first argument. If header is specified as T (TRUE), the first line is read as the column names of the data frame to which the values are assigned. header defaults to F (FALSE), so it can be left out when the file has no column labels. If we had used something like TAB as the delimiter, sep would have been given as a C-style escape sequence beginning with a backslash ("\t"), because a TAB character cannot easily be typed on the command line. As the file was comma-delimited, the comma was put directly between the quotes. Escape sequences will come in handy quite often.
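
To see the escape sequence at work, here is a minimal sketch of reading a TAB-delimited file. The file is a temporary one written on the spot, and its contents and the name people are invented for the example.

```r
# Write a tiny TAB-delimited file to a temporary location
tab_file <- tempfile(fileext = ".dat")
writeLines(c("name\tage", "Abe\t34", "Bob\t27"), tab_file)

# "\t" stands for the TAB character in the sep argument
people <- read.table(tab_file, header = TRUE, sep = "\t")
str(people)

unlink(tab_file)   # remove the temporary file
```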

Getting it out again

The function write.table() performs the opposite transformation, writing out an R data frame object into a rectangular data file. There are other output options like write() to write out a matrix to a data file, and the functions in the foreign package that let you write out data in proprietary formats.
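
A round trip might look something like the following sketch. The data frame scores and the temporary file are made up for the example; row.names=FALSE and quote=FALSE are choices made here to keep the output file plain, not requirements.

```r
scores <- data.frame(name = c("Abe", "Bob", "Con"),
                     score = c(3.5, 4, 2.25))

out_file <- tempfile(fileext = ".csv")

# Write a comma-delimited file with a header line, no row names, no quotes
write.table(scores, out_file, sep = ",", row.names = FALSE, quote = FALSE)

written <- readLines(out_file)   # look at the raw text that was written
written

# Read it back in to check that nothing was lost
back <- read.table(out_file, header = TRUE, sep = ",")
```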

Squeezing in big data sets

R uses a memory-based model to process data. This means that the amount of data that can be handled is critically dependent upon how much memory is available. Earlier versions required the user to increase the amount of memory available when starting up, but memory is now allocated dynamically. However, if you still run out of memory while trying to import a large data set, you may be able to overcome the problem. Using scan() to import the file will use less memory. scan isn't as easy to use, and you have to enter the column names separately.

> infert<-data.frame(scan("/home/jim/infert.dat",list("",0,0,0,0,0,0,0),sep=",",skip=1))
Read 128 lines
> names(infert)<-c("education","age","parity","induced","case","spontaneous","stratum","pooled.stratum")

Note how the assignment operator was used to assign the names to the data frame.
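
As a variation, scan will carry over the names of the list given as its what argument, which saves the separate naming step. The sketch below uses a three-column temporary file with invented values rather than the full infert.dat.

```r
# A small comma-delimited file with a header line (values are made up)
dat_file <- tempfile(fileext = ".dat")
writeLines(c("education,age,parity",
             "0-5yrs,26,6",
             "6-11yrs,42,1"), dat_file)

# "" marks a character field, 0 a numeric field; skip=1 jumps the header
fields <- scan(dat_file,
               what = list(education = "", age = 0, parity = 0),
               sep = ",", skip = 1, quiet = TRUE)

small <- data.frame(fields)
names(small)

unlink(dat_file)   # remove the temporary file
```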

Going beyond scan(), there are methods to store your data in a database table and access the table using the appropriate interface. This enables the user to access huge amounts of data by only processing it in bits.
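
The idea of processing data in bits can be sketched without a database at all. The example below is not one of the database interfaces mentioned above; it swaps in a plain file connection and reads a few lines at a time, keeping only a running total in memory. The file, the chunk size of 4 and all object names are assumptions made for illustration.

```r
# Build a small comma-delimited file to stand in for a large one
big_file <- tempfile(fileext = ".dat")
write.table(data.frame(id = 1:10, value = (1:10) * 2), big_file,
            sep = ",", row.names = FALSE, quote = FALSE)

con <- file(big_file, open = "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]   # header line

total <- 0
rows <- 0
repeat {
  lines <- readLines(con, n = 4)          # next chunk of at most 4 lines
  if (length(lines) == 0) break           # nothing left to read
  chunk <- read.table(text = lines, sep = ",", col.names = col_names)
  total <- total + sum(chunk$value)       # keep only the summary
  rows <- rows + nrow(chunk)
}
close(con)
unlink(big_file)

total
rows
```

Only one chunk is held as a data frame at any time, which is the same principle a database interface relies on when it fetches a query result in pieces.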

For more information, see An Introduction to R: Reading data from files, and the documentation from the foreign, RmSQL, RMySQL, RODBC and ROracle packages.
