R

A quick start to the R statistical computing program...

See also R: Learning by Doing and More R scripts.

Econometrics / Financial Markets / Econophysics

See also Focus: Time Series Analysis

Debian Installation

sudo apt-get install r-recommended r-cran-rmysql

Updating

To update to R 2.7 on Ubuntu Hardy:

Add

deb http://stat.ethz.ch/CRAN/bin/linux/ubuntu hardy/

to your /etc/apt/sources.list file.

For the key, see stat.ethz.ch.

Then

sudo apt-get update
sudo apt-get install r-recommended

and choose "package manager's version" (unless you have local changes in your configuration).

Package Installation

To be able to use a package in an R session, e.g.,

library(packageName)

you need to install it on your system.

Do this either from within a terminal session with:

sudo apt-get install packageName

or from within an R session:

install.packages("packageName")

or download package from the web (e.g., cran.r-project.org: packages) and:

sudo R CMD INSTALL packageName.tar.gz

Note that the downloaded .tar.gz file can also be extracted (tar -xzf name.tar.gz) and the resulting directory moved to /usr/lib/R/library/ before installing.

Also, you may need the following packages:

sudo apt-get install build-essential r-base-dev refblas3-dev libgfortran1-dev gfortran

Graphical Frontend

See for instance the Eclipse plugin StatET.

Basic Usage

Start, Help and Quit

To start R, type

R

into a terminal.

Use

?command

for help, or

example(command)

for worked examples on the keyword command.

To quit, type

q()

or

q("no")

to quit without being prompted to save the workspace image.

To see the current working directory, use

getwd()

and change it

setwd('/home/usr/R/')

Note that on Linux machines the Tab autocompletion works.

To clear all objects stored in the current R session (the workspace), use

rm(list = ls(all = TRUE))

Loading Data

mydata <- read.table("/directory/source.dat")

reads multi-column data; use the option header=TRUE if the file has a header row; a URL can be given instead of a file path.
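
For instance, a minimal sketch (the URL is a placeholder) reading a file with a header row:

mydata <- read.table("http://example.org/source.dat", header=TRUE)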

mydata <- scan("/directory/source.dat")

reads a single column of data; again, a URL can be given instead of a file path.

mydata <- read.csv("/directory/source.csv", header=TRUE, row.names=1)

should read the CSV file using the first row as column labels and the first column as row labels. If you have problems with the row/column sizes, use

mydata <- read.csv("/directory/source.csv", header=TRUE)

Note that by default, if no headers are used, the variable names for the columns are V1, V2, ...

To access a variable, use

mydata$varName

or for a matrix, e.g.,

mydata[2,4]

Typical problems arise if numbers and strings are mixed in the data, which can make whole columns be read as character data. Check with

typeof(mydata)

and

typeof(mydata[1,1])
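
As a small illustration (hypothetical values), a single non-numeric entry turns a whole column into character data; check and convert like this:

v <- c("1.2", "3.4", "n/a")   # one bad entry in an otherwise numeric column
typeof(v)                     # "character"
as.numeric(v)                 # 1.2 3.4 NA, with a coercion warning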

Saving/loading data from the file system

save(obj,file="file.rda")
load("file.rda")

Executing R scripts

Write the commands into a file /home/usr/dir/myStuff.R and start an R session. To execute the .R file, use

source("/home/usr/dir/myStuff.R")
source("/home/usr/dir/myStuff.R", echo = TRUE)

Note that on Linux machines the Tab autocompletion works.

To print intermediate output to the console from the script, add lines like

cat("output of command: ", command, "\n")

Alternatively, you can use the emacs extension ESS.

Alternatively, under Linux, you can execute an R script from the terminal

R --no-save < myR.r > output.txt

output.txt will contain the contents of myR.r plus anything that the script outputs to the screen.

Summary

summary(myData)

or

str(myData)

or

names(myData)

or

class(myData)

or

typeof(myData)

or

is(myData)

Conversion

myDouble <- as.double(myString)

works for

as.array(x), as.data.frame(x), as.numeric(x), as.logical(x), as.complex(x), as.character(x), ...

Time

strptime(t1, format="%Y-%m-%d %H:%M:%S"), as.numeric(x, units="secs"), ...


To extract information from summary()

E.g.,

res <- lm(ldat$V2~ldat$V1)
summary(res)

yields

Call:
lm(formula = ldat$V2 ~ ldat$V1) 

Residuals:
     Min        1Q    Median        3Q       Max
-0.630654 -0.019070  0.003826  0.028311  0.083986

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.375669   0.013509   27.81   <2e-16 ***
ldat$V1     1.078772   0.002402  449.20   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04796 on 498 degrees of freedom
Multiple R-Squared: 0.9975,     Adjusted R-squared: 0.9975
F-statistic: 2.018e+05 on 1 and 498 DF,  p-value: < 2.2e-16

To get the intercept and exponent with their corresponding standard errors:

c(summary(res)$coefficients[1,1], summary(res)$coefficients[1,2])
c(summary(res)$coefficients[2,1], summary(res)$coefficients[2,2])

i.e.,

[1] 0.37566855 0.01350871
[1] 1.078772495 0.002401521

Executing String

mystr <- "Whatever"
eval(parse(text=mystr))

Functions

f <- function(x) x^2
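
A slightly fuller sketch with a default argument and a call:

g <- function(x, p = 2) x^p   # hypothetical helper
g(3)                          # 9
g(3, p = 3)                   # 27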

Lists

list[n]

a list containing just element n (single brackets return a sub-list).

list[[n]]

n-th element of list.

list[["mine"]]

element of list with name "mine".

 list[[2]]$varName

in a list of data frames (or similar), the variable "varName" of the second element.
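
A small sketch (hypothetical list) contrasting the two forms of indexing:

l <- list(a = 1:3, mine = "hello")
l[1]         # a list of length one containing the vector 1:3
l[[1]]       # the vector 1:3 itself
l[["mine"]]  # "hello"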

Vectors

If vec is a one-column matrix,

vec[,]

drops the dimension and returns a plain vector; for an explicit transpose (row vector) use t(vec).

Combine

v <- c(1,3,443,2)
v2 <- seq(2,8,by=2)

Access

v[3]
v[1:3]
v[c(2,4)]
v[-c(2,4)]
v[v>200]

If ldat has two data columns (V1, V2), then

X <- ldat$V1[ldat$V1 > Min]
Y <- ldat$V2[ldat$V1 > Min]

will become the columns where the corresponding V1-values are greater than some value Min. Similarly

X <- ldat$V1[ldat$V1 > Min & ldat$V1 < Max]
Y <- ldat$V2[ldat$V1 > Min & ldat$V1 < Max]

Get index with

which(v == 2)

Sort vector T:

ind <- order(T)
T <- T[ind]

Vector of strings

S <- character(10)

Matrices

Create

mat1 <- cbind(v,v2)
mat2 <- rbind(v,v2,v+v2)

or

mat <- matrix(0, nrow=2, ncol=3)

Arithmetic on matrices is done elementwise; for real matrix operations, enclose the operator in '%', e.g., %*% for matrix multiplication:

mat2%*%mat1
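
For example, contrasting elementwise and matrix multiplication with mat1 (4x2) and mat2 (3x4) from above:

mat1 * mat1    # elementwise: squares each entry
mat2 %*% mat1  # matrix product: (3x4) %*% (4x2) gives a 3x2 matrix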

Create a new column from an old one (this form works on a data frame; for a plain matrix, use cbind(mat, newcol=log(mat[,"col"]))):

mat[,"newcol"] <- log(mat[,"col"])

To access columns 1 to 3 of a matrix

mat[,1:3]

To extract rows 20 to 120 of a matrix

mat[20:120,]

See all rows of column 2

mat[,2]

Note that when loading data using read.table the default names for the columns are V1, etc. This means that the following two expressions are identical

data[,2]
data$V2

Complex Data Structures

If you have something like this from the output of str(x):

List of 6
$ x1: num 31
$ n0: num 19
$ n1: num 7
$ n2: num 3
$ n4: num 1
$ n9: num 1

i.e., is(x) returns

[1] "list"   "vector"

then you have to do the following to plot the values.

y <- unlist(x)
plot(y,xaxt='n')

This uses the names of x as x-axis labels:

axis(1, at = seq(1,length(y),by=1), labels = names(x), padj=1.5)

Time

For differences in days or months, the following works

difftime(t1, t2, units="days")
length(seq(t1, t2, by='1 day')) - 1
as.double(t1 - t2)
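
For completeness, a minimal sketch with t1 and t2 defined as Date objects (hypothetical dates):

t1 <- as.Date("2008-07-01")
t2 <- as.Date("2008-07-12")
as.double(t1 - t2)   # -11 (days)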

However, for hours, minutes (mins) and seconds (secs) the only thing I got to work was

as.numeric(t1 - t2, units='hours') 

NB: as.Date() drops the time-of-day part, e.g.,

str1 <- "2004-08-02 05:59:39"
t1 <- as.Date(str1)

so for sub-day differences parse the full strings and use

as.numeric(strptime(str1, format="%Y-%m-%d %H:%M:%S") - strptime(str2, format="%Y-%m-%d %H:%M:%S"), units='hours')

Alternate creation of a date instance

mydate <- as.Date("12.07.08","%d.%m.%y")

To convert a date into seconds since the epoch:

format(mydate,"%s")
format(as.POSIXct("2008-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="GMT",origin="1970-01-01 00:00:00"),"%s")

To set timezone:

Sys.setenv(TZ='GMT')

Output and Plotting

New plot window:

X11()

Save plots to file:

postscript()

or

pdf()

and

plot(x,y,col="red")
dev.off()

will create a file Rplots.ps or Rplots.pdf.

Alternatively

jpeg("pic.jpg",  width = 800, height = 600)

or

png()

To control the size and shape of the plot (here a square 6 in x 6 in figure), use

postscript("myPlt.ps", height=6, width=6, horizontal=F, onefile=F)

To generate multi-figure plots, e.g.,

par(mfrow=c(2,3))

arranges the following plots on a grid of two rows and three columns.

To add a second plot to an existing one, type

par(new=TRUE)

and then issue the second plot command, as in the sketch below.
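
A minimal sketch (hypothetical data); suppressing the axes and labels of the second plot avoids overprinting them:

x  <- 1:100
y1 <- cumsum(rnorm(100))
y2 <- cumsum(rnorm(100))
plot(x, y1, type="l")
par(new=TRUE)
plot(x, y2, type="l", col="red", axes=FALSE, xlab="", ylab="")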

To scale the y axis, use

plot(...,ylim=c(min,max))

Log plots

plot(f, xmin, xmax, log="xy")

To get grid lines

grid()

Minor ticks are available in the Hmisc package (cran.r-project.org)

 minor.tick(nx=10, ny=10, tick.ratio=0.6) 

Alternatively, use

axis(2, at = 1:10, cex.axis=0.2)

For logarithmically spaced minor ticks, see this example: R:_Learning_by_Doing#V3.

To control the spacing between the axis labels (xlab, ylab) and the x- and y-axis, use the graphical parameter

mgp

See this example: R:_Learning_by_Doing#V3.
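
A minimal sketch (the values are illustrative); the three entries of mgp set the margin lines for the axis title, the tick labels and the axis line:

par(mgp=c(2.2, 0.8, 0))
plot(1:10, xlab="x label", ylab="y label")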

To set font size of axis labels

 plot(..., cex.axis=1.5)

more formatting parameters for the main title, axis labeling and line width:

plot(..., cex.main=1.7, cex.lab=1.5, lwd=2)

And for the text command

text(x, y, 'Text', col='red', cex = 1.5)

or with more formatting

myTxt <- paste(c("E =", format(0.5, digits = 5), " +/- ", format(0.05, digits = 3)), collapse="")
text(200, 0.02, myTxt, col='red', cex=1.5, adj=0)

Rotate labels

par(las=n)   # n = 0, 1, 2 or 3

format() also gives scientific notation

format(0.05, digits = 3, scientific = TRUE)

If you have the problem that setting cex.lab cuts off the y-axis labeling, you need to make the margins bigger, e.g., set

par(mar=c(5,5,4,2))

No tick labels for y-axis

plot(..., yaxt="n")

Print

dev.print()

only works if you have lpr working.

To add additional text to a plot, use

text(x, y, [label])

where [label] can be text or a character vector and x, y can be vectors as well.

To customize axis, deactivate them in the plot

plot(..., yaxt="n", xaxt="n")

then customize (y-axis) with

axis(2, at = Y, cex.axis=2)

where Y can be a sequence

Y <- seq(val1, val2, by=val3)

or a rounded number

Y <- round(val, digits=1)

Or for log scale

X <- 10^(seq(-4,2,by=1)) 
axis(1, at = X, cex.axis=1.5)

More on graphics in the RGraphics package.

Inset Plots

One solution is to use the gridBase library and

library(gridBase)
X11(width=8,height=8)

# produce outer ("main") plot

   plot(x,y,xaxs="i",yaxs="i")
   vp <- baseViewports()
   pushViewport(vp$inner,vp$figure,vp$plot)

# push viewport that will contain the inset

   pushViewport(viewport(x=0.1,y=0.9,width=.5,height=.5,just=c("left","top")))

# now either define viewport to contain the whole inset figure

   par(fig=gridFIG(),new=T) # or gridPLT() # ...or just the plotting area (coordinate system)

   par(plt=gridPLT(),new=T)

# draw frame around selected area (for illustration only)

   grid.rect(gp=gpar(lwd=3,col="red"))

# plot inset figure

   plot(x,y,xaxs="i",yaxs="i",xlab="",ylab="")

# pop all viewports from stack
popViewport(4) 

from [1].

Multiline Plots

Graphs with many lines are hard to read, and by default R places the symbols on the individual lines very close to each other. If you want bigger spacing between the symbols, see the r-help thread below and the sketch that follows.

From https://stat.ethz.ch/pipermail/r-help/2008-February/155240.html
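
One common approach (a sketch, not necessarily the exact code from the thread above) is to draw the lines and then overlay symbols only at every k-th point:

x  <- 1:200
y1 <- sin(x/20)
y2 <- cos(x/20)
plot(x, y1, type="l", ylim=range(c(y1, y2)))
lines(x, y2, lty=2)
idx <- seq(1, length(x), by=10)   # wider spacing between symbols
points(x[idx], y1[idx], pch=1)
points(x[idx], y2[idx], pch=2)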

R and LaTeX

If you want to use LaTeX for labeling figures, use quote. E.g.,

xlab=quote(Z==frac(mu[1]-mu[2],sigma/sqrt(n)))

Alternatively, you can replace any labeling in a figure in the LaTeX document using psfrag.

system Command

system

allows Unix commands to be executed, e.g.,

system('ls -lat')

If you want to save the output as a variable in R, use

t1 <- try(system("ls", intern = TRUE))

Examples

Learning by Doing

Either look at the examples listed below or open R: Learning by Doing and play with the commands there.

Tricks

while(dev.cur() != 1) dev.off() 

closes all open graphics windows.

x <- c()

resets the variable x to an empty value (NULL).

List of Files

filelist <- list.files(path = ".", pattern = "*.dat")

Removing Elements

You have two vectors

x <- c(1,2,3,4)
y <- c(3,2,1,0)

and you would like to remove the zero entry.

Then find the indices corresponding to the zero with

ind <- which(y==0)

and create new vector omitting this index with

x <- x[-ind]
y <- y[-ind]

PDF

plot(density(myData$V1))

General syntax for non-parametric kernel density estimation:

density(dat[,1], bw=0.00000001, kernel="epanechnikov", n=400, from=-0.1, to=0.1, adjust=1, cut=1)

Histogram

hist(myData$V1)

or

hist(myData$V1, breaks=1000)

Fancier stuff:

dist1 <- hist(myData$V1,100)
plot((dist1$mids),log(dist1$counts))

CDF

plot(ecdf(myData$V1))

Power Law

Fit a straight line (linear model; lm) to a dataset.

ldata <- log(myData)
plot(ldata)
results <- lm(ldata$V2~ldata$V1)
abline(results$coef)

Plotting and outputing values:

# Get data
dat <- read.table("/home/me/mydata.dat")
ldat <- log(dat) 

# Select interval
Min <- -7.5
Max <- -4.6
X <- ldat$V1[ldat$V1 >= Min & ldat$V1 <= Max]
Y <- ldat$V2[ldat$V1 >= Min & ldat$V1 <= Max]

# Compute and plot
plot(X, Y, xlab = 'log(delta x)', ylab = 'log(delta t)', main = 'Title')
res <- lm(Y~X)
abline(res$coef)

# Error propagation
A <- summary(res)$coefficients[1,1]
dA <- summary(res)$coefficients[1,2]
B <- summary(res)$coefficients[2,1]
dB <- summary(res)$coefficients[2,2]
C <- exp(-A/B)
C.err <- sqrt((-C/B*dA)^2+(A/B^2*C*dB)^2)

# Formatting the output
Etxt <- c("E =", B, "+/-", dB)
Ctxt <- c("C =", C, "+/-", C.err)
xmin <- min(X)
ymax <- max(Y)
xrange <- max(X) - min(X)
yrange <- max(Y) - min(Y)
xmin <- xmin - xrange/xmin*0.1
text(c(xmin, xmin + xrange*0.18, xmin + xrange*0.36, xmin + xrange*0.55), ymax, Etxt)
text(c(xmin, xmin + xrange*0.18, xmin + xrange*0.36, xmin + xrange*0.55), ymax+yrange/ymax*0.3, Ctxt)

Discussion

Note that it is your job to state the hypothesis, i.e., what model you would like to fit the data to. If you have a curvature in your data, the linear model should be extended with a quadratic term:

resQuad <- lm(Y~X+I(X^2))

The I() tells R to treat the enclosed expression arithmetically rather than as formula syntax. To get the fitted curve of the quadratic model

A <- summary(resQuad)$coefficients[1,1]
dA <- summary(resQuad)$coefficients[1,2]
B <- summary(resQuad)$coefficients[2,1]
dB <- summary(resQuad)$coefficients[2,2]
C <- summary(resQuad)$coefficients[3,1]
dC <- summary(resQuad)$coefficients[3,2]
sq <- function(x) A + B*x + C*x^2
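
To overlay the fitted quadratic on the existing plot, a sketch:

curve(sq, from=min(X), to=max(X), add=TRUE, col="blue")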

If you compare

summary(resQuad)

with

summary(res)

you will get a slightly higher R-squared value

summary(...)$r.squared

for the quadratically extended model on the curved data. However, if you just look at the values of

summary(res)

for the curved data, you won't really be able to tell that it's a bad fit if you don't plot the data vs. the model. So, as mentioned, you need to decide what model makes the most sense to fit the data to.

If you execute

plot(res)

you get four plots. The first one, "Residuals vs Fitted", tells you whether the variance of the errors is constant or whether there is a systematic dependency; the best you can get is randomly scattered dots around y=0. The second plot, "Normal Q-Q", tells you whether your errors are normally distributed; the dots should lie on the diagonal. Note that if the errors are not normally distributed, the "Pr(>|t|)" values in summary() will be meaningless.

Confidence Intervals

newData <- data.frame(X=seq(min(X), max(X), by=(max(X)-min(X))/(length(X)-1)))
conf.lim <- predict(res, newdata=newData, interval='confidence')

Plot

matlines(newData$X, conf.lim)
matlines(newData$X, conf.lim, lty=c(1,4,4), col=c("black","red","red"))

NB: prediction intervals (for new observations, wider than confidence intervals) are obtained with

 pred.lim <- predict(res, newdata=newData, interval='prediction')

Correlations

Autocorrelation:

acf(x)

Cross-correlation:

ccf(x,y)

More Stuff

  • par()
  • rug(x)
  • matplot(m1, m2)
  • apply(mat, 1, fct)
  • round
  • sample
  • solve
  • step
  • class
  • sapply
  • earth; library(earth); the newer MARS implementation
  • gam; library(mgcv)
  • rbinom, rpois, rhyper
  • runif, rnorm, rlnorm, rt, rF, rchisq, rgamma


Replace a column in a data set with transformed values, e.g.,

newdata <- subset(transform(mydata,logC1=log(C1)),select=-C1)

where column C1 is replaced by its log value in the data frame mydata.

Error Propagation in Scaling Laws

General equation for a function f of n variables with uncertainties x_j \pm \Delta x_j

\Delta f = \Delta f \left(x_1, x_2, ..., x_n, \Delta x_1, \Delta x_2, ..., \Delta x_n \right) = \left( \sum_{i=1}^n \left(\frac{\partial f}{\partial x_i}\Delta x_i \right)^2 \right)^{1/2}.

The scaling law

y = \left( \frac{x}{C} \right)^E,

can be transformed to a linear expression

Y = A + BX,

where Y = ln(y), X = ln(x), E = B, and C = exp(-A/B).


In measuring the linear model, the standard errors are ΔA and ΔB. The errors for the original parameters are thus

ΔE = ΔB,
\Delta C = \Delta f = \sqrt{\left(\frac{\partial f}{\partial A}\Delta A \right)^2 + \left(\frac{\partial f}{\partial B}\Delta B \right)^2} ,

where C = f(A,B) = exp(-A/B), and

\frac{\partial f}{\partial A} = -\frac{1}{B} e^{-A/B},
\frac{\partial f}{\partial B} = \frac{A}{B^2} e^{-A/B}.

To be precise, this computation only works if A,B \sim \mathcal{N} (\mu_{A,B}, \sigma_{A,B}^2). And even then the relation is only approximate:

Var[\exp(-A/B)] \approx \left(\frac{\partial f}{\partial A}\Big|_{(\mu_A, \mu_B)} \sigma_A \right)^2 + \left(\frac{\partial f}{\partial B}\Big|_{(\mu_A, \mu_B)} \sigma_B \right)^2

Recall also that if you execute

plot(res)

in R, the first plot should show randomly scattered dots around y = 0, and in the second plot the dots should lie on the diagonal, for the assumption of normally distributed errors with constant variance to hold.

R Code
 
dat <- read.table("/home/jbg/docs/dev/it_vol/overshoot/chf-jpy.dat")
ldat <- log(dat)
plot(ldat, xlab = 'log(delta x)', ylab = 'log(delta x)', main = 'CHF_JPY DC Overshoot')
res <- lm(ldat$V2~ldat$V1)
abline(res$coef)
# Error propagation
A <- summary(res)$coefficients[1,1]
dA <- summary(res)$coefficients[1,2]
B <- summary(res)$coefficients[2,1]
dB <- summary(res)$coefficients[2,2]
C <- exp(-A/B)
C.err <- sqrt((-C/B*dA)^2+(A/B^2*C*dB)^2)
# Formatting the output
Etxt <- c("E =", B, "+/-", dB)
Ctxt <- c("C =", C, "+/-", C.err)
xmin <- min(ldat[,1])
ymax <- max(ldat[,2])
range <- max(ldat[,1]) - min(ldat[,1])
xmin <- xmin - range/xmin*0.1
text(c(xmin, xmin - range/xmin*1.5, xmin - range/xmin*3, xmin - range/xmin*4.75), ymax, Etxt)
text(c(xmin, xmin - range/xmin*1.7, xmin - range/xmin*3.35, xmin - range/xmin*5.05), ymax-0.5, Ctxt)

R and MySQL

As an alternative to the Debian package (r-cran-rmysql), you can install RMySQL manually, either from the shell or from within R:

  • Package: R CMD INSTALL pkg-name_X.Y-Z.tar.gz; package from cran.r-project.org
  • R command: install.packages("RMySQL")

Syntax

From within R:

library(RMySQL)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, dbname="database", user="user", password="pwd", host="host")

To close the connection and free resources:

dbClearResult(res)
dbDisconnect(con)
dbUnloadDriver(drv)

Queries

dbListTables(con)
res <- dbSendQuery(con, "SELECT * FROM GAU_GAU LIMIT 3;")
data <- fetch(res, n = -1)

Export

Given a data frame mydata

dbWriteTable(con, "table1", mydata, overwrite = TRUE)

writes the content to "table1", using the column names as fields.

For a time series object, e.g.,

sim <- fbmSim(100)

you need to convert the data content to a data frame

mysim <- sim[1:100]
mysim.df <- as.data.frame(mysim)

and write to MySQL with

dbWriteTable(con, "table1", mysim.df, overwrite = TRUE)

Extensions

Java

  • Communication between R and Java is bidirectional: JRI handles Java->R, and rJava handles R->Java. When talking to R, one has to use an API. It looks like they provide a high level API. Interestingly, it should be possible from R to push objects back directly into the JVM.
  • JGR uses both JRI and rJava to provide a graphical front end.
  • sjava
  • omegahat.org

GUI