12
votes

Goal

I want to make my data analysis reproducible by having each chunk depend on all previous chunks. So, if there are 3 chunks and I change something in the 1st chunk, the subsequent 2 chunks should re-run so that their outputs reflect the change. I want to set this in the global chunk options at the top of the document so that I don't have to use dependson multiple times.

Problems

The outputs of a chunk don't change if it is not modified and cache=TRUE. For chunks that contain their code inline, I can make them depend on all previous chunks by putting the following at the top of the document:

```{r setup, echo=FALSE}
# set global chunk options: 
library(knitr)
opts_chunk$set(cache=TRUE, autodep = TRUE)
dep_auto()
```

With this setup, if any chunk is changed, all subsequent chunks are re-run. But this does not work if I use source() in chunks to read R scripts. The following is an example document:

---
title: "Untitled"
output: html_document
---
```{r setup, echo=FALSE}
# set global chunk options: 
library(knitr)
opts_chunk$set(cache=TRUE, autodep = TRUE)
dep_auto()
```


# Create Data
```{r}
#source("data1.R")
x <- data.frame(col1 = 4:10, col2 = 6:12)
x
```

# Summaries
```{r}
#source("data2.R")

median1.of.x <- sapply(x, function(x) median(x)-1)

sd.of.x <- sapply(x, sd)

plus.of.x <- sapply(x, function(x) mean(x)+1)

jj <- rbind(plus.of.x, sd.of.x, median1.of.x)

```

```{r}
jj
```

Now, if I change either of the first 2 chunks, the third chunk gives the correct output after knitting. But if I instead put the first chunk's code in a source file data1.R and the second chunk's code in data2.R, keeping the global chunk options the same as before, changes made in the source files are not reflected correctly in the output of the third chunk. For example, changing x to x <- data.frame(col1 = 5:11, col2 = 6:12) should yield:

 > jj
                 col1      col2
plus.of.x    9.000000 10.000000
sd.of.x      2.160247  2.160247
median1.of.x 8.000000  9.000000 

But when I use source() as described above, the knitted document reports:

 jj
##                col1      col2
## mean.of.x  5.000000  9.000000
## sd.of.x    2.160247  2.160247
## minus.of.x 6.000000 10.000000 

What settings do I need to change to use source() correctly in knitr documents?

3
When you use the source method, you are commenting out the x <- data.frame() line, correct? - rawr
Knitr isn't very suitable for the kind of declarative workflow you need to make this happen. I'd recommend make & makefiles, or if you want to stay completely within R, the excellent remake package. - Ben
@rawr Yes, I only keep the source command and comment out all others. - umair durrani
@Ben I'll look into remake. But is my goal here impossible in knitr? - umair durrani

3 Answers

14
votes

When you use source(), knitr is unable to analyze the possible objects to be created from it; knitr must be able to see the full source code to analyze the dependencies among code chunks. There are two approaches to solve your problem:

  1. Tell the second chunk that it depends on the value of x by adding an arbitrary chunk option that uses the value of x, e.g. ```{r cache.extra = x}; then whenever x changes, the cache of this code chunk will be automatically invalidated;
  2. Let knitr see the full source code; you can pass the source code to a code chunk via the chunk option code, e.g. ```{r code = readLines('data1.R')} (same for data2.R); then dep_auto() should be able to figure out that x was created in the first chunk and used in the second chunk, so the second chunk must depend on the first chunk.
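
As a rough illustration of the two approaches, here is how the chunks from the question might look. The chunk labels (load-data, summaries, report) are made up for this sketch, and hashing the scripts with tools::md5sum() is an extra assumption not mentioned above, added so that editing a script also invalidates the cache of the chunk that sources it. For the first approach:

```{r load-data, cache.extra = tools::md5sum("data1.R")}
# Re-run whenever the contents of data1.R change
source("data1.R")
```

```{r summaries, cache.extra = list(x, tools::md5sum("data2.R"))}
# Re-run whenever x or the contents of data2.R change
source("data2.R")
```

```{r report, cache.extra = jj}
# jj is created inside source(), so autodep cannot link this chunk to the
# previous one; passing jj to cache.extra makes the link by hand
jj
```

For the second approach, the chunk bodies stay empty and knitr fills them from the scripts, so autodep can see where x and jj are created and used:

```{r load-data, code = readLines("data1.R")}
```

```{r summaries, code = readLines("data2.R")}
```

```{r report}
jj
```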
2
votes

I found that this works (knitr 1.17):

<<..., dependson=all_labels()>>=
...
@
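
The chunk above uses Rnw syntax. In an R Markdown document like the one in the question, the same option would presumably be written as below (the label report is made up for this sketch). Note that dependson only re-runs a chunk when the cache of a chunk it depends on is invalidated, so on its own it would not notice an edit to a file read with source().

```{r report, dependson = knitr::all_labels()}
jj
```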
0
votes

I think that, by default, chunks do depend on previous chunks, and the author went to great lengths to make each chunk start with the same environment that the previous one ended with (although there are numerous ways of screwing this up, like sourcing files with caching turned on...). I can't recall the syntax, but you can include knitr chunks from external documents. There is also a trick to reuse knitr chunks in the same doc in a function-like manner by reusing the label, and you may be able to build some nonlinear dependency from this.

But why not set cache to FALSE when you don't want caching? Sourcing seems like a bad idea, but I can't put my finger on why. I would make the knitr workflow linear, put the logic in functions, and turn off caching if the same function call can return different things for the same input parameters.
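
For reference, one way to pull chunks from an external script is knitr::read_chunk(), which may be the syntax meant above. A minimal sketch, assuming data1.R and data2.R are given chunk markers (the labels create-data and summaries are made up here). The script data1.R would start with a marker line, and data2.R would get a "## ---- summaries ----" marker in the same way:

```r
## ---- create-data ----
x <- data.frame(col1 = 4:10, col2 = 6:12)
x
```

In the document, an uncached chunk reads the scripts, and empty chunks with matching labels receive the code:

```{r read-scripts, cache=FALSE}
knitr::read_chunk("data1.R")
knitr::read_chunk("data2.R")
```

```{r create-data}
```

```{r summaries}
```

Because knitr then sees the code itself, editing a script changes the corresponding chunk, and autodep can track x and jj across chunks.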

Another trick that might be useful to you is the recently added ability to knit a document using input parameters. This could possibly extract some logic from your knitr doc, which I think is the avoidable root of your problems.
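
A minimal sketch of what knitting with input parameters could look like, assuming a made-up parameter start and a made-up file name analysis.Rmd (both hypothetical). The header declares the parameter and the chunk reads it through params; cache.extra is added here so that a new parameter value invalidates the cached chunk:

---
title: "Untitled"
output: html_document
params:
  start: 4
---

```{r create-data, cache.extra = params$start}
# params$start comes from the YAML header or from the render() call
x <- data.frame(col1 = params$start:(params$start + 6), col2 = 6:12)
x
```

The document can then be rendered with a different value from the console:

```r
rmarkdown::render("analysis.Rmd", params = list(start = 5))
```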