data management
Getting data in and out of R (Part 4)
Myth 3b: You can’t get data from R into Excel.
You guessed it – getting data from R into Excel works just as easy the other way round via the clipboard (see the previous post). It’s not (yet) possible to select and copy all data from an object in the spreadsheet view, but that’s no problem because there are other convenient solutions.
As we’ve used read.table()
to read the clipboard’s contents, we can use the function write.table()
and copy data into it. For the sake of compactness I will just give an example for matrices, because it works the same for vectors.
Let’s generate some random data first
> set.seed(1) > x = data.frame(a=rnorm(10),b=rnorm(10),c=rnorm(10),d=letters[1:10]) > x
a b c d 1 1.35867955 -0.1645236 0.3981059 a 2 -0.10278773 -0.2533617 -0.6120264 b 3 0.38767161 0.6969634 0.3411197 c 4 -0.05380504 0.5566632 -1.1293631 d 5 -1.37705956 -0.6887557 1.4330237 e 6 -0.41499456 -0.7074952 1.9803999 f 7 -0.39428995 0.3645820 -0.3672215 g 8 -0.05931340 0.7685329 -1.0441346 h 9 1.10002537 -0.1123462 0.5697196 i 10 0.76317575 0.8811077 -0.1350546 j
When we look at the help for write.table()
we see a lot of parameters to tweak:
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"))
The parameter file
is set to an empty string (""
) by default what tells R to print the output in the console. This is great to try new parameters because you don’t have to paste something to Excel to see the results, so I will not write to "clipboard"
for the moment.
> write.table(x,file="")
"a" "b" "c" "d" "1" 1.35867955152904 -0.164523596253587 0.398105880367068 "a" [ . . . output omitted . . . ] "10" 0.763175748457544 0.881107726454215 -0.135054603880824 "j"
Looks good, but wait … the header contains four variables while there are five columns of data! The parameter row.names = TRUE
is responsible for this, chances are that you don’t need this vector because there’s some other ID-variable in your data, so let’s drop it.
> write.table(x,file="",row.names=FALSE)
"a" "b" "c" "d" 1.35867955152904 -0.164523596253587 0.398105880367068 "a" [ . . . output omitted . . . ] 0.763175748457544 0.881107726454215 -0.135054603880824 "j"
Now the number of columns in the header and data match, but Excel will still not be able to read it, because the delimiter is set to " "
. We need to change this parameter to tab using sep="\t"
.
> write.table(x,file="",row.names=FALSE,sep = "\t")
"a" "b" "c" "d" 1.35867955152904 -0.164523596253587 0.398105880367068 "a" [ . . . output omitted . . . ] 0.763175748457544 0.881107726454215 -0.135054603880824 "j"
At this point we could already send the data to the clipboard (file="clipboard"
) and paste it into Excel without problems, but for the sake of completeness I will show some other parameters that you might need as well.
If you have missing values these will be copied as "NA"
and Excel can’t interpret it, so set na=""
to leave these cells empty. If you don’t want to copy variable names you can use col.names=FALSE
. Usually you also don’t need strings to be quoted – quote=FALSE
. And last but certainly not least some of you will need to change the decimal sign from point to comma dec=","
.
Writing your own I/O functions
One seeming restriction of this approach might be the rather long command you need to enter in order to set all parameters the right way, e.g. write.table(x, file="clipboard", row.names=FALSE, sep = "\t", na="", quote=FALSE)
.
To remedy this just write your own function:
toExcel = function(x,varnames=TRUE){
write.table(x, file="clipboard", row.names=varnames,
sep = "\t", na="", quote=FALSE)
}
Now you have a nice little function and eliminated the problem of forgetting or messing up parameters when entering this looooong command.
It’s also a good idea to do something similar for the import function from my previous post:
fromExcel = function(varnames=TRUE){
read.table("clipboard", header=varnames, sep = "\t")
}
Note that both functions have an argument (varnames=TRUE
) to switch reading/writing of variable names on and off.
The only problem that remains is that you’d have to enter the definition of toExcel()
and fromExcel()
in every R session manually. To make both available in every started R session, open the file Rprofile.site
in the /etc/
folder of your R distribution and append the functions’ definitions at the end. Now restart R and have fun with it.
Getting data in and out of R (Part 3)
Myth 3a: You can’t get data from Excel into R easily.
Nope. Although R can read data from all kinds of files (txt
, csv
, xls
, dta
, sav
, …), there’s a quick way of transferring your from Excel to R.
Caveat: I haven’t tested how this works for other operating systems and/or spreadsheet programs – I have used Windows XP and Excel 2003 for all examples presented below.
Vectors
Method A
Getting a column of data into R is no big deal. Select what you want, copy it and enter data.object = scan()
. Instead of entering single values as we did in the previous post, just paste the copied contents and submit a blank line to terminate scan()
:
> data.object = scan() 1: 0.61 2: 0.83 3: 0.63 4: 0.48 5: 0.67 6: 0.61 7: 0.06 8: 0.48 9: 0.51 10: 0.49 11:
Read 10 items
> data.object
[1] 0.61 0.83 0.63 0.48 0.67 0.61 0.06 0.48 0.51 0.49
If you have strings to read, remember specifying scan(what="character")
(see part 2).
Method B
Instead of scanning all values, you can directly read the clipboard’s contents like this:
> data.clipb = readClipboard() > data.clipb
[1] "0.61" "0.83" "0.63" "0.48" "0.67" "0.61" "0.06" "0.48" "0.51" "0.49"
Important: note that the created object will be character
, so you have to convert the object to numeric
.
> class(data.clipb)
[1] "character"
> data.clipb = as.numeric(data.clipb)
Matrices
To read matrices from Excel, you need the read.table()
function (for help and options see ?read.table
). You don’t have to pay attention to data types because R creates a data.frame
and assigns suitable types to each column (i.e. variable).
Without Variable Names
By default, read.table()
has the option header=FALSE
, this means that the first row is not interpreted as variable names, so I only select and copy the values of three numeric and one string variable. Now create a new object and assign read.table(file="clipboard")
to it.
> data.noheader = read.table( + file="clipboard") > data.noheader
V1 V2 V3 V4 1 0.25 0.80 0.46 a 2 0.44 0.22 0.87 b 3 0.72 0.99 0.17 c 4 0.56 0.13 0.07 d 5 0.88 0.36 0.09 e
We see that the data were read successfully. To check the view internal structure of the data frame, type
> str(data.noheader)
'data.frame': 5 obs. of 4 variables: $ V1: num 0.25 0.44 0.72 0.56 0.88 $ V2: num 0.8 0.22 0.99 0.13 0.36 $ V3: num 0.46 0.87 0.17 0.07 0.09 $ V4: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
From this we see that our data frame has four variables with five observations. Three variables are numeric and for the string variable, a factor with five levels was created.
With Variable Names
If you have variable names in the first row of your data set, you just have to add the option header=TRUE
:
> data.header = read.table( + file="clipboard",header=TRUE) > data.header
var1 var2 var3 sometext 1 0.25 0.80 0.46 a 2 0.44 0.22 0.87 b 3 0.72 0.99 0.17 c 4 0.56 0.13 0.07 d 5 0.88 0.36 0.09 e
Problems with decimals
One problem I have not covered up to this point is the usage of other delimiters and decimal point characters. For instance, if you live in Germany, your operating system will most likely interpret decimal characters as a comma. So if you had problems using the scan()
and read.table()
function this may be the reason. To circumvent related problems, both functions have an option dec
that is dec="."
by default. Adjust this setting to whatever character you use as a decimal point and it should work fine.
> comma.values = scan(dec=",") 1: 1,1 2: 2,2 3: 3,3 4: 4,4 5:
Read 4 items
> comma.values
[1] 1.1 2.2 3.3 4.4
Problems with delimiters
read.table()
assumes that columns are separated by a tabulator – and that’s fine for Windows because copied matrices will have this format, but if you need another delimiter (e.g. comma, semicolon or a blank) use the option sep=","
. I copied this to my clipboard in plain text:
1;2;3;4 5;6;7;8 9;9;9;9
Specifying the delimiters in R gives you the correct results
> read.table("clipboard",sep=";")
V1 V2 V3 V4 1 1 2 3 4 2 5 6 7 8 3 9 9 9 9
Getting data in and out of R (Part 2)
In this post I want to continue where I left off last time. I showed the fix()
and edit()
functions to view and edit data frames. “But what about vectors” you might ask …
Myth 2: Inputting and editing vectors in R is tedious!
Everyone working with R is familiar with the c()
function that combines its arguments into a vector. If you have to enter a vector manually you might do it this way:
> vec = c(1,2,3,4,10,20,30,40,100,200,300,400)
If this were the only way to create vectors it would really be annoying (imagine more data and decimal values – mix up some dots and commas and your vector’ll be a mess), but we can use a function called scan()
that is more convenient for manually inputting vectors.
Basically it works like this: (1) Create a new object that will contain the data and assign scan()
to it. (2) Enter your data hitting ‘enter’ after each datum. (3) When you are finished, submit a blank line (i.e. hit ‘enter’ once more) and R tells you how many items were read and assigned to the new object.
> vec.scan = scan() 1: 1 2: 2 3: 3 4: 4 5: 10 6: 20 7: 30 8: 40 9: 100 10: 200 11: 300 12: 700 13:
Read 12 items
I obviously made type error – element 12 should be ‘400’. I could use indices to correct this cell (vec.scan[12] = 400
) or fix()
(resp. edit()
, see Getting data in and out of R – Part 1) which provides a more intuitive way of editing. Typing fix(vec.scan)
opens a window where I can replace 700 with 400. Close it and you’re done.
What about characters?
No problem, scan()
is a very versatile function (as we will see in subsequent posts) that can handle the data types logical
, integer
, numeric
, complex
, character
, raw
and list
(see ?scan
). So if you want to enter strings manually this could look like this:
> text.scan = scan(what="character") 1: scan() 2: can 3: read 4: text 5: and 6: also 7: special 8: characters 9: \/+-#$§%&_@ 10:
Read 9 items
> text.scan
[1] "scan()" "can" "read" "text" "and" [6] "also" "special" "characters" "\\/+-#$§%&_@"
Note that you don’t have to quote text and additionally R escapes special characters as the backslash!
Beware of blanks!
By default R breaks up elements separated by blanks.
> scan(what="character") 1: this will not be one element 7:
Read 6 items [1] "this" "will" "not" "be" "one" "element"
You can see that although there’s only one line of text, each word becomes a separate element. If you want each line to be a single element you can tell scan()
to separate the input by another character, like the “new line” escape sequence:
> scan(what="character",sep="\n") 1: this is one element! 2:
Read 1 item [1] "this is one element!"
The line was read as one element now.
Getting data in and out of R (Part 1)
I guess one major thing that scares people away from R (and makes them turn to ‘fluffy’ packages like SPSS etc.) is the seeming lack of the familiar spreadsheet user interface.
A couple of persons told me that they have trouble getting their data into SPSS, so they think they’d never be able to do it in R.
I have to disagree with them and below is a small list of I/O-related myths about R that need to be busted.
Myth 1: There are no spreadsheets in R!
No, there is a way to view your data in spreadsheet format and you can even input/edit your data using the same view.
Let’s generate a data frame containing seven variables with five (random) observations:
> set.seed(55)
> d = data.frame(v1 = rnorm(5), v2 = rnorm(5), v3 = rnorm(5), v4 = rnorm(5),
v5 = rnorm(5), v6 = rnorm(5), v7 = rnorm(5))
If we print this object, we get something like:
> d
v1 v2 v3 v4 v5 v6 1 0.120139084 1.1885185 -0.04891095 -0.3662780 -1.5192720 0.960344 2 -1.812376850 -0.5053439 -0.84323377 2.3553639 1.4971178 -0.692181 3 0.151582984 -0.0992344 -2.07527077 1.0933772 0.8196153 1.405998 4 -1.119221005 0.3053532 -0.36076315 0.2858410 1.0660504 -1.633539 5 0.001908206 0.1984097 -0.63768966 0.9936578 0.7337559 0.261831 v7 1 1.5647544 2 0.3145893 3 -0.9346850 4 -0.1251366 5 -0.5267137
This kind of display is cumbersome because the set is split across variables (and we just have five observations and seven variables!). To destroy the first myth, use the edit()
function on the data frame and you will see your data in a neat spreadsheet.
> edit(d)
But R can do a lot more. Suppose you want to enter some data manually. Create a new object and assign an empty data frame to it.
> dat.man = edit(data.frame())
The spreadsheet appears and you can change the variable names by clicking on them. I typed in the following data and viewed them again in R:
> dat.man
X Y 1 1 5 2 2 4 3 3 3 4 4 2 5 8 1
If you spot a mistake in your data input, you’d have to assign the correct value to that cell. In larger sets it can be annoying to get the indices right, so you can use the edit()
function to do this ‘graphically’.
dat.man = edit(dat.man)
opens the editor again and you can go to the cell and edit the value. Note that you have to assign the edit()
function’s result to an object. By using the fix()
function you don’t have to reassign the edited data to the original object, so these two commands do the same:
> dat.man = edit(dat.man)
> fix(dat.man)