R
From TechWiki
A quick start to the R statistical computing program...
See also R: Learning by Doing.
Contents |
About
Reference Card
Web Resources
- Homepage: r-project.org
- Wiki: wiki.r-project.org
My Resources
Tutorials
- cran.r-project.org: index
- uni-erlangen.de: PDF
- ethz.ch: PDF
- Example datasets: stat.ethz.ch
Econometrics / Financial Markets / Econophysics
- Rmetrics (econophysics]): itp.phys.ethz.ch
- cran.r-project.org: fSeries documentation
- berkeley.edu: fSeries documentation
See also Focus: Time Series Analysis
Debian Installation
sudo apt-get install r-recommended r-cran-rmysql
See also:
Updating
To 2.7 on hardy.
Add
deb http://stat.ethz.ch/CRAN/bin/linux/ubuntu hardy/
to your /etc/apt/sources.list file.
For the key, see stat.ethz.ch.
Then
sudp apt-get update sudo apt-get install r-recommended
and choose "package manager's version" (unless you have local changes in your configuration).
Package Installation
To be able to use packages in a R session, e.g.,
library(packageName)
you need to install it on your system.
Do this either from within a terminal session with:
sudo apt-get install packageName
or from within a R session:
install.packages(packageName)
or download package from the web (e.g., cran.r-project.org: packages) and:
sudo R CMD INSTALL packageName.tar.gz
Note that the downloaded gz file can be extracted and untared (tar -xzf name.tar.gz) and moved to /usr/lib/R/library/ before installing.
Also, you may need the following packages:
sudo apt-get install build-essential r-base-dev refblas3-dev libgfortran1-dev gfortran
Basic Usage
Start, Help and Quit
To start R, type
R
into a terminal.
Use
?command
for help, or
example(command)
for illustrations on the keyword command.
To quit, type
q()
or
q("no")
to not be prompted to save workspace image.
To see current directory
getwd()
and change it
setwd('/home/usr/R/')
Note that on Linux machines the Tab autocompletion works.
To clear all parameters stored in the current R session
rm(list = ls(all = TRUE))
Loading Data
mydata <- read.table("/directory/source.dat")
works for multi-column data; option header=TRUE; alternatively give URL instead of directory path.
mydata <- scan("/directory/source.dat")
one column of data; alternatively give URL instead of directory path.
mydata <- read.csv("/directory/source.csv", col.names=1, row.names=1)
should read the CSV file using the first row, column as labels. If you have problems with the row/column sizes, use
mydata <- read.csv("/directory/source.csv", col.names=, header=T)
Note that by default, if no headers are used, the variable names for the columns is V1, ...
To access a variable, use
mydata$varName
or for a matrix, e.g.,
mydata[2,4]
Typical problems arise if you mix numbers and strings in the data. This can result in weird problems. Check with
typeof(mydata)
and
typeof(mydata[1,1])
Saving/loading data from the file system
save(obj,file="file.rda")
load("file.rda")
Executing R scripts
Write the commands into a file /home/usr/dir/myStuff.R and start an R session. To execute the .R file, use
source("/home/usr/dir/myStuff.R")
source("/home/usr/dir/myStuff.R", echo = TRUE)
Note that on Linux machines the Tab autocompletion works.
To see command line output, put
cat("output of command", command)
Alternatively, you can use the emacs extension ESS.
Alternatively, under Linux, you can execute an R script from the terminal
R --no-save < myR.r > output.txt
output.txt will contain the contents of myR.r plus anything that the script outputs to the screen.
Summary
summary(myData)
or
str(myData)
or
names(myData)
or
class(myData)
or
typeof(myData)
or
is(myData)
Conversion
myDouble <- as.double(myString)
works for
as.array(x), as.data.frame(x), as.numeric(x), as.numeric(x), as.logical(x), as.complex(x), as.character(x), ...
Time
strptime(t1, format="%Y-%m-%d %H:%M:%S"), as.numeric(x, units="secs"), ...
To extract information from summary()
E.g.,
res <- lm(ldat$V2~ldat$V1) summary(res)
yields
Call:
lm(formula = ldat$V2 ~ ldat$V1)
Residuals:
Min 1Q Median 3Q Max
-0.630654 -0.019070 0.003826 0.028311 0.083986
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.375669 0.013509 27.81 <2e-16 ***
ldat$V1 1.078772 0.002402 449.20 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04796 on 498 degrees of freedom
Multiple R-Squared: 0.9975, Adjusted R-squared: 0.9975
F-statistic: 2.018e+05 on 1 and 498 DF, p-value: < 2.2e-16
To get the intercept and exponent with their corresponding standard errors:
c(summary(res)$coefficients[1,1], summary(res)$coefficients[1,2]) c(summary(res)$coefficients[2,1], summary(res)$coefficients[2,2])
i.e.,
[1] 0.37566855 0.01350871 [1] 1.078772495 0.002401521
Functions
f <- function(x) x^2
Lists
list[n]
list with elements n.
list[[n]]
n-th element of list.
list[["mine"]]
element of list with name "mine".
list[[2]]$varName
in list of vectors, second element with variable "varName".
Vectors
If vec is a column vector
vec[,]
is its transpose (row vector).
Combine
v <- c(1,3,443,2) v2 <- seq(2,8,by=2)
Access
v[3] v[1:3] v[c(2,4)] v[-c(2,4)] v[v>200]
If ldat has two data columns (V1, V2), then
X <- ldat$V1[ldat$V1 > Min] Y <- ldat$V2[ldat$V1 > Min]
will become the columns where the corresponding V1-values are greater than some value Min. Similarly
X <- ldat$V1[ldat$V1 > Min & ldat$V1 < Max] Y <- ldat$V2[ldat$V1 > Min & ldat$V1 < Max]
Get index with
which(v == 2)
Sort vector T:
ind <- order(T) T <- T[ind]
Matrices
Create
mat1 <- cbind(v,v2) mat2 <- rbind(v,v2,v+v2)
or
mat <- matrix(0, nrow=2, ncol=3)
Computation is done elementwise, so for matrix operations enclose operand with '%'
mat2%*%mat1
Create new column from old one
mat[,"newcol"] <- log(mat[,"col"])
To access columns 1 to 3 of a matrix
mat[1:3]
To extract rows 20 to 120 of a matrix
mat[20:120,]
See all rows of column 2
mat[,3]
Note that when loading data using load.table the default names for the columns id V1, etc. This means that the following two expressions are identical
data[,2] data$V2
Time
For days, months the following works
as.difftime(t1, t2, units="days") length(seq(t1, t2,by='1 day')) - 1 as.double(t1 - t2)
However, for hours, minutes (mins) and seconds (secs) the only thing I got to work was
as.numeric(t1 - t2, units='hours')
NB:
str1 <- "2004-08-02 05:59:39" t1 <- as.Date(str1)
and then use
as.numeric(strptime(t1, format="%Y-%m-%d %H:%M:%S") - strptime(t2, format="%Y-%m-%d %H:%M:%S"), units='hours')
Alternate creation of a date instance
mydate <- as.Date("12.07.08","%d.%m.%y")
To convert date into milliseconds:
format(mydate,"%s")
format(as.POSIXct("2008-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="GMT",origin="1970-01-01 00:00:00"),"%s")
To set timezone:
Sys.setenv(TZ='GMT')
Output and Plotting
New plot window:
X11()
Save plots to file:
postscript()
or
pdf()
and
plot(x,y,col="red") dev.off()
will create a file Rplots.ps or Rplots.pdf.
Alternatively
jpeg("pic.jpg", width = 800, height = 600)
or
png()
To generate rectangular plots, use
postscript("myPlt.ps", height=6, width=6, horizontal=F, onefile=F)
To generate multi-figure plots, e.g.,
par(mfrow=c(2,3))
puts three plots next to each other on two rows.
To add a second plot to an existing one, type
par(new=TRUE)
and the the second plot command.
To scale the y axis, use
plot(...,ylim=c(min,max))
Log plots
plot(f, xmin, xmax, log="xy")
To get grid lines
grid()
Minor ticks are available in the Hmisc package (cran.r-project.org)
minor.tick(nx=10, ny=10, tick.ratio=0.6)
Alternatively, use
axis(2, at = 1:10, cex.axis=0.2)
For logarithmically spaced minor ticks, see this example: R:_Learning_by_Doing#V3.
To control spacing between labels (xlab, ylab) and x-, y-axis, use
mpg
See this example: R:_Learning_by_Doing#V3.
To set font size of axis labels
plot(..., cex.axis=1.5)
more formatting parameters for the main title, axis labeling and line width:
plot(..., cex.main=1.7, cex.lab=1.5, lwd=2)
And for the text command
text(x, y, 'Text', col='red', cex = 1.5)
or with more formatting
myTxt <- paste(c("E =", format(0.5, digits = 5), " +/- ", format(0.05, digits = 3)), collapse="")
text(200, 0.02, myTxt, col='red', cex=1.5, 0)
Rotate labels
par(las={1,2,3,4})
Format also gives scientific notation
format(0.05, digits = 3, sci=TRUE)
If you have the problem that setting cex.lab cuts off the y-axis labeling, you need to make the margins bigger, e.g., set
par(mar=c(5,5,4,2))
No tick labels for y-axis
plot(..., yaxt="n")
dev.print()
only works if you have lpr working.
To add additional text to a plot, use
text(x, y, [label])
where [label] can be text or a character vector and x, y can be vectors as well.
To customize axis, deactivate them in the plot
plot(..., yaxt="n", xaxt="n")
then customize (y-axis) with
axis(2, at = Y, cex.axis=2)
where Y can be a sequence
X <- seq(val1, val2, by=val3)
or a number
X <- round(val),digits=1)
Or for log scale
X <- 10^(seq(-4,2,by=1)) axis(1, at = X, cex.axis=1.5)
More on graphics in RGraphics library.
Inset Plots
One solution is to use the gridBase library and
library(gridBase)
X11(width=8,height=8)
# produce outer ("main") plot
plot(x,y,xaxs="i",yaxs="i")
vp <- baseViewports()
pushViewport(vp$inner,vp$figure,vp$plot)
# push viewport that will contain the inset
pushViewport(viewport(x=0.1,y=0.9,width=.5,height=.5,just=c("left","top")))
# now either define viewport to contain the whole inset figure
par(fig=gridFIG(),new=T) # or gridPLT() # ...or just the plotting are (coordinate system)
par(plt=gridPLT(),new=T)
# draw frame around selected area (for illustration only)
grid.rect(gp=gpar(lwd=3,col="red"))
# plot inset figure
plot(x,y,xaxs="i",yaxs="i",xlab="",ylab="")
# pop all viewports from stack
popViewport(4)
from [1].
Multiline Plots
Graphs with many lines are hard to understand. R by default places the symbols of the lines very close to each other. If you want a bigger spacing between symbols on individual lines, do this
From https://stat.ethz.ch/pipermail/r-help/2008-February/155240.html
R and LaTeX
If you want to use LaTeX for labeling figures, use quote. E.g.,
xlab=quote(Z==frac(mu[1]-mu[2],sigma/sqrt(n)))
Alternatively, you can replace any labeling in a figure in the LaTeX document using psfrag.
Resources
- http://wiki.r-project.org/rwiki/doku.php?id=tips:tips#graphics
- http://wiki.r-project.org/rwiki/doku.php?id=tips:graphics-base:errbars
- http://wiki.r-project.org/rwiki/doku.php?id=tips:graphics-base:pch_groups
Examples
Learning by Doing
Either look at the examples listed below or open
and play with the commands there.
Tricks
while(dev.cur() != 1) dev.off()
closes all aopen windows.
x <- c()
empties the value of the variable x.
Removing Elements
You have two vectors
x <- (1,2,3,4) y <- c(3,2,1,0)
and you would like to remove the zero entry.
Then find the indices corresponding to the zero with
ind <- which(y==0)
and create new vector omitting this index with
x <- x[-ind] y <- y[-ind]
plot(density(myData$V1))
General syntax for non-parametric kernel density estimation:
density(dat[,1], bw=0.00000001, kernel = "epanechnikov", n=400,from =-.1, to = .1, adjust = 1, cut = 1)
Histogram
hist(myData$V1)
or
hist(myData$V1, breaks=1000)
CDF
plot(ecdf(myData$V1))
Power Law
Fit a straight line (linear model; lm) to a dataset.
ldata <- log(myData) plot(ldata) results <- lm(ldata$V2~ldata$V1) abline(results$coef)
Plotting and outputing values:
# Get data
dat <- read.table("/home/me/mydata.dat")
ldat <- log(dat)
# Select interval
Min <- -7.5
Max <- -4.6
X <- ldat$V1[ldat$V1 >= Min & ldat$V1 <= Max]
Y <- ldat$V2[ldat$V1 >= Min & ldat$V1 <= Max]
# Compute and plot
plot(X, Y, xlab = 'log(delta x)', ylab = 'log(delta t)', main = 'Title')
res <- lm(Y~X)
abline(res$coef)
# Error propagation
A <- summary(res)$coefficients[1,1]
dA <- summary(res)$coefficients[1,2]
B <- summary(res)$coefficients[2,1]
dB <- summary(res)$coefficients[2,2]
C <- exp(-A/B)
C.err <- sqrt((-C/B*dA)^2+(A/B^2*C*dB)^2)
# Formatting the output
Etxt <- c("E =", B, "+/-", dB)
Ctxt <- c("C =", C, "+/-", C.err)
xmin <- min(X)
ymax <- max(Y)
xrange <- max(X) - min(X)
yrange <- max(Y) - min(Y)
xmin <- xmin - xrange/xmin*0.1
text(c(xmin, xmin + xrange*0.18, xmin + xrange*0.36, xmin + xrange*0.55), ymax, Etxt)
text(c(xmin, xmin + xrange*0.18, xmin + xrange*0.36, xmin + xrange*0.55), ymax+yrange/ymax*0.3, Ctxt)
Discussion
Note that it is your job to state the hypothesis, i.e., what model you would like to fit the data to. If you have a curvature in your data, the linear model should be extended with a quadratic term:
resQuad <- lm(Y~X+I(X^2))
The I is R syntax for adding an arithmetic term. To get the fitted curve
A <- summary(res)$coefficients[1,1] dA <- summary(res)$coefficients[1,2] B <- summary(res)$coefficients[2,1] dB <- summary(res)$coefficients[2,2] C <- summary(res)$coefficients[3,1] dC <- summary(res)$coefficients[3,2] sq <- function(x) A + B*x + C*x^2
If you compare
summary(resQuad)
with
summary(res)
you will get a slightly higher R-squared value
summary(...)$r.squared
for the quadratically extended model and the curved plot. However, if you just look at the values of
summary(res)
for the curved data, you won't really be able to tell that it's a bad fit if you don't plot the data vs. the model. So, as mentioned, you need to decide what model makes the most sense to fit the data to.
If you execute
plot(res)
you get four plots. The first one "Residual vs Fitted" tells you if the variance of the error is constant, or if there is a systematic dependency. The best you can get is randomly scattered dots around y=0. The second plot "Normal Q-Q) tells you if your errors are normally distributed. The dots should be on the diagonal for that. Note that if the errors are not normally distributed, the "Pr(>|t|)" values in summary() will be meaningless.
Confidence Intervals
newData <- data.frame(x=seq(min(X), max(X), by=(max(X)-min(X))/(length(X)-1))) conf.lim <- predict(res, newdata=newData, interval='confidence')
Plot
matlines(newData$x, conf.lim)
matlines(newData$x, conf.lim, lty=c(1,4,4), col=c("black","red","red"))
NB
pred.lim <- predict(res, newdata=newData, interval='prediction')
Correlations
Autocorrelation:
acf(x)
Cross-correlation:
ccf(x,y)
More Stuff
- par()
- rug(x)
- matplot(m1, m2)
- apply(mat, 1, fct)
- round
- sample
- solve
- step
- class
- sapply
- earth; library(earth); actuall new MARS implementation
- gam; library(mgcv)
- rbinom, rpois, rhyper
- runif, rnorm, rlnorm, rt, rF, rchisq, rgamma
Replace a column in a data set with transformed values, e.g.,
newdata <- subset(transform(mydata,logC1=log(C1)),select=-C1)
where column C1 is replaces with its log value in the data frame mydata.
Error Propagation in Scaling Laws
General equation for a function f of n variables with uncertainties
The scaling law
can be transformed to a linear expression
- Y = A + BX,
where Y = ln(y), X = ln(x), E = B, and C = exp( − A / B).
In measuring the linear model, the standard errors are ΔA and ΔB. The errors for the original parameters are thus
- ΔE = ΔB,
where C = f(A,B) = exp( − A / B), and
To be precise, this computation only works if
. And even then the relation is only approximate:
Recall also, that if you execute
plot()
in R, the first plot should have randomly scattered dots around
and the second plot should show the dots on the diagonal for the assumption of the errors being normally distributed with constant variance to be correct.
R Code
dat <- read.table("/home/jbg/docs/dev/it_vol/overshoot/chf-jpy.dat")
ldat <- log(dat)
plot(ldat, xlab = 'log(delta x)', ylab = 'log(delta x)', main = 'CHF_JPY DC Overshoot')
res <- lm(ldat$V2~ldat$V1)
abline(res$coef)
# Error propagation
A <- summary(res)$coefficients[1,1]
dA <- summary(res)$coefficients[1,2]
B <- summary(res)$coefficients[2,1]
dB <- summary(res)$coefficients[2,2]
C <- exp(-A/B)
C.err <- sqrt((-C/B*dA)^2+(A/B^2*C*dB)^2)
# Formatting the output
Etxt <- c("E =", B, "+/-", dB)
Ctxt <- c("C =", C, "+/-", C.err)
xmin <- min(ldat[,1])
ymax <- max(ldat[,2])
range <- max(ldat[,1]) - min(ldat[,1])
xmin <- xmin - range/xmin*0.1
text(c(xmin, xmin - range/xmin*1.5, xmin - range/xmin*3, xmin - range/xmin*4.75), ymax, Etxt)
text(c(xmin, xmin - range/xmin*1.7, xmin - range/xmin*3.35, xmin - range/xmin*5.05), ymax-0.5, Ctxt)
And MySQL
- Debian package (i.e., use apt-get) r-cran-rmysql
- Manuals
Alternatively, you can mess around with packages from within R:
- Package: R CMD INSTALL pkg-name_X.Y-Z.tar.gz; package from cran.r-project.org
- R command: install.packages("RMySQL")
Syntax
From within R:
library(RMySQL)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, "database", "user", "pwd", "host")
To close the connection and free resources:
dbClearResult(res) dbDisconnect(con) dbUnloadDriver(drv)
Queries
dbListTables(con) res <- dbSendQuery(con, "SELECT * FROM GAU_GAU LIMIT 3;") data <- fetch(res, n = -1)
Export
Given a data frame mydata
dbWriteTable(con, "table1", mydata, overwrite = TRUE)
writes the content to "table1", using the column names as fields.
For a time series object, e.g.,
sim <- fbmSim(100)
you need to convert the data content to a data frame
mysim <- sim[1:100] mysim.df <- as.data.frame(mysim)
and write to MySQL with
dbWriteTable(con, "table1", mysim.df, overwrite = TRUE)
Extensions
Java
- Communication between R and Java is bidirectional: JRI handles Java->R, and rJava handles R->Java. When talking to R, one has to use an API. It looks like they provide a high level API. Interestingly, it should be possible from R to push objects back directly into the JVM.
- JGR uses both JRI and rJava to provide a graphical front end.
- sjava
- omegahat.org
