%\VignetteIndexEntry{parathyroidExons and parathyroidGenes}
\documentclass{article}

\usepackage{Sweave}
\usepackage[a4paper]{geometry}
\usepackage{hyperref,graphicx}
\usepackage{cite}
\usepackage{color}
\usepackage{url}

\usepackage[utf8]{inputenc}

\SweaveOpts{keep.source=FALSE,eps=FALSE,height=4.5,width=4} 
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{{\small\texttt{#1}}}
\newcommand{\fixme}[1]{{\textbf{Fixme:} \textit{\textcolor{blue}{#1}}}}

\title{\textsf{\textbf{Creation of \Robject{parathyroidGenes} and \Robject{parathyroidExons}}}}
\author{Michael Love}
\begin{document}
\maketitle
\begin{abstract}
This vignette describes the construction of the data objects in the \Rpackage{parathyroid} package,
which are a \Rclass{CountDataSet} and an \Rclass{ExonCountSet}.
Note that the \Rclass{ExonCountSet} is beginning to be deprecated (as of March 2014 and \Rpackage{DEXSeq} version 1.10.0).
Another Bioconductor data package, \Rpackage{parathyroidSE}, describes the creation of \Rclass{SummarizedExperiment}
objects from the same RNA-Seq experiment files.
\end{abstract}

\tableofcontents

\section{Dataset description}

We downloaded the RNA-Seq data from the publication of Haglund et al. \cite{Haglund2012Evidence}.  The paired-end sequencing was performed on primary cultures from parathyroid tumors of 4 patients at 2 time points over 3 conditions (control, treatment with diarylpropionitrile (DPN) and treatment with 4-hydroxytamoxifen (OHT)).  DPN is a selective estrogen receptor $\beta$ 1 agonist and OHT is a selective estrogen receptor modulator.  One sample (patient 4, 24 hours, control) was omitted by the paper authors due to low quality.  

\section{Downloading the data}

The raw sequencing data is publicly available from the NCBI Gene Expression Omnibus under accession number GSE37211\footnote{\url{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211}}.  The read sequences in FASTQ format were extracted from the NCBI short read archive file (.sra files), using the sra toolkit\footnote{\url{http://www.ncbi.nlm.nih.gov/books/NBK56560/}}.

\section{Aligning and counting reads}

The sequenced reads in the FASTQ files were aligned using TopHat version 2.0.4\footnote{\url{http://tophat.cbcb.umd.edu/}} with default parameters to the GRCh37 human reference genome using the Bowtie index available at the Illumina iGenomes page\footnote{\url{http://tophat.cbcb.umd.edu/igenomes.html}}.  An example call for alignment (substituting the SRR number for \texttt{file}):

\begin{verbatim}
  tophat2 -o file_tophat_out -p 8 genome file_1.fastq file_2.fastq 
  samtools index file_tophat_out/accepted_hits.bam 
  samtools view file_tophat_out/accepted_hits.bam | \
      sort -k1,1 -k2,2n > \
  file_tophat_out/accepted_hits_sorted.sam 
\end{verbatim}

For counting reads in genes, we used \texttt{htseq-count} from the HTSeq Python package\footnote{\url{http://www-huber.embl.de/users/anders/HTSeq/}}.  We counted reads in genes, using the GTF file packaged with the GRCh37 Illumina iGenome, Ensembl release 66 downloaded 9 March 2012, under the \texttt{Ensembl} directory.  We used the minimum read quality setting \texttt{-a 10}, which is the same setting used by the \Rpackage{DEXSeq} python script described in the next paragraph, and \texttt{-s no}, because the assay is not strand-specific.  An example call for read counting:

\begin{verbatim} 
  htseq-count -a 10 -s no file_tophat_out/accepted_hits_sorted.sam \ 
      Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf > \ 
      file_tophat_out/file_gene_counts.txt
\end{verbatim}

For counting reads in exons, we used Python scripts provided in the \Rpackage{DEXSeq} packge.  For features we again used the Ensembl gene annotations.  We use the options \texttt{-p yes -s no} to indicate the reads are paired-end, and the assay is not strand-specific. An example call for preparing the exon annotations and read counting:

\begin{verbatim}
  python DEXSeq/inst/python_scripts/dexseq_prepare_annotation.py \  
      Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \ 
      Hsap.GRCh37.DEXSeq.gff  
  python DEXSeq/inst/python_scripts/dexseq_count.py -p yes -s no \ 
      Hsap.GRCh37.DEXSeq.gff file_tophat_out/accepted_hits_sorted.sam \ 
      file_tophat_out/file_exon_counts.txt 
\end{verbatim}

\section{Obtaining sample annotations from GEO}

In order to provide phenotypic data for the samples, we used the \Rpackage{GEOquery} package to parse the series matrix file downloaded from the NCBI Gene Expression Omnibus under accession number GSE37211.  We included this file as well in the package, and read it in locally in the code below.

<<>>=
library("parathyroid")
library("GEOquery")
gse37211 <- getGEO(filename=system.file("extdata/GSE37211_series_matrix.txt",
                               package="parathyroid",mustWork=TRUE))
samples <- pData(gse37211)[,c("characteristics_ch1","characteristics_ch1.2",
                              "characteristics_ch1.3","relation")]
colnames(samples) <- c("patient","treatment","time","experiment")
samples$patient <- sub("patient: (.+)","\\1",samples$patient)
samples$treatment <- sub("agent: (.+)","\\1",samples$treatment)
samples$time <- sub("time: (.+)","\\1",samples$time)
samples$experiment <- sub("SRA: http://www.ncbi.nlm.nih.gov/sra\\?term=(.+)","\\1",
                          samples$experiment)
samples
@

\section{Matching GEO experiments with SRA runs}

The sample information from GEO must be matched to the individual runs from the Short Read Archive (the FASTQ files), as some samples are spread over multiple sequencing runs.  The run information can be obtained from the Short Read Archive using the \Rpackage{SRAdb} package (note that the first step involves a large download of the SRA metadata database).  We included the conversion table in the package.

<<eval=FALSE>>=
library("SRAdb")
sqlfile <- getSRAdbFile()
sra_con <- dbConnect(SQLite(),sqlfile)
conversion <- sraConvert(in_acc = samples$experiment, out_type = 
                         c("sra","submission","study","sample","experiment","run"), 
                         sra_con = sra_con)
write.table(conversion,file="inst/extdata/conversion.txt")
@ 

We used the \Rfunction{merge} function to match the sample annotations to the run information.  We ordered the \Rclass{data.frame} \Robject{samplesFull} by the run number and then set all columns as factors.

<<>>=
conversion <- read.table(system.file("extdata/conversion.txt",
                                     package="parathyroid",mustWork=TRUE))
samplesFull <- merge(samples, conversion)
samplesFull <- samplesFull[order(samplesFull$run),]
samplesFull <- as.data.frame(lapply(samplesFull, factor))
@ 

\section{Creating the \Rclass{CountDataSet} \Robject{parathyroidGenes}}

We used the function \Rfunction{newCountDataSetFromHTSeqCount} of the \Rpackage{DESeq} package to read in our gene-level counts.  We create a \Rclass{data.frame} \Robject{sampleTable}, with the sample names as the first column and the count files as the second column. The count files are not included in the package due to memory constraints. 

<<eval=FALSE>>=
library("DESeq")
genecountfiles <- paste(samplesFull$run,"_gene_counts.txt",sep="")
sampleTable <- cbind(samplenames=samplesFull$run, 
                     countfiles = genecountfiles, 
                     samplesFull)
parathyroidGenes <- newCountDataSetFromHTSeqCount(sampleTable,directory=".")
@ 

We included experiment data, PubMed ID and protocol data from the NCBI Gene Expression Omnibus.

<<eval=FALSE>>=
expdata = new("MIAME",
  name="Felix Haglund",
  lab="Science for Life Laboratory Stockholm",
  contact="Mikael Huss",
  title="DPN and Tamoxifen treatments of parathyroid adenoma cells",
  url="http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211",
  abstract="Primary hyperparathyroidism (PHPT) is most frequently present in postmenopausal women. Although the involvement of estrogen has been suggested, current literature indicates that parathyroid tumors are estrogen receptor (ER) alpha negative. Objective: The aim of the study was to evaluate the expression of ERs and their putative function in parathyroid tumors. Design: A panel of 37 parathyroid tumors was analyzed for expression and promoter methylation of the ESR1 and ESR2 genes as well as expression of the ERalpha and ERbeta1/ERbeta2 proteins. Transcriptome changes in primary cultures of parathyroid adenoma cells after treatment with the selective ERbeta1 agonist diarylpropionitrile (DPN) and 4-hydroxytamoxifen were identified using next-generation RNA sequencing. Results: Immunohistochemistry revealed very low expression of ERalpha, whereas all informative tumors expressed ERbeta1 (n = 35) and ERbeta2 (n = 34). Decreased nuclear staining intensity and mosaic pattern of positive and negative nuclei of ERbeta1 were significantly associated with larger tumor size. Tumor ESR2 levels were significantly higher in female vs. male cases. In cultured cells, significantly increased numbers of genes with modified expression were detected after 48 h, compared to 24-h treatments with DPN or 4-hydroxytamoxifen, including the parathyroid-related genes CASR, VDR, JUN, CALR, and ORAI2. Bioinformatic analysis of transcriptome changes after DPN treatment revealed significant enrichment in gene sets coupled to ER activation, and a highly significant similarity to tumor cells undergoing apoptosis. Conclusions: Parathyroid tumors express ERbeta1 and ERbeta2. Transcriptional changes after ERbeta1 activation and correlation to clinical features point to a role of estrogen signaling in parathyroid function and disease.")
pubMedIds(expdata) <- "23024189"
n <- nrow(samplesFull)
protocoldata <- AnnotatedDataFrame(data.frame(treatment=rep("cells were plated and treated with 100 nM DPN (Tocris Bioscience), or 100 nM OHT (Sigma-Aldrich, Stockholm, Sweden) for 24h or 48h, respectively. Untreated cells cultured in parallel were used as controls.",n), growth=rep("Tissue for cell culturing was obtained from four parathyroid adenomas collected directly at surgery. All samples were from female, postmenopausal patients with chief cell adenoma with no preoperative signs of malignancy or kidney failure. Isolation of cells and cell culturing were performed essentially as previously described by Lu et al. In brief, tissues were cut and digested into cells with 1.5 mg/mL collagenase type II. About 1×106 cells were plated",n),extracted_molecule=rep("total RNA",n),extraction=rep("Illumina TruSeq RNA",n),library_strategy=rep("RNA-Seq",n),library_source=rep("transcriptomic",n),library_selection=rep("cDNA",n),instrument_model=rep("Illumina HiSeq 2000",n),data_processing=rep("Alignment to GRCh37 using TopHat-2.0.4, default settings for paired-end reads",n),row.names=samplesFull$run))
experimentData(parathyroidGenes) <- expdata
protocolData(parathyroidGenes) <- protocoldata
@ 

\section{Creating the \Rclass{ExonCountSet} \Robject{parathyroidExons}}

We used the \Rfunction{read.HTSeqCounts} function of the \Rpackage{DEXSeq} package to assemble an \Rclass{ExonCountSet} using the count files and the gene annotation file produced by the Python scripts.

<<eval=FALSE>>=
library("DEXSeq")
exoncountfiles <- paste(samplesFull$run,"_exon_counts.txt",sep="")
annotationfile <- "Hsap.GRCh37.DEXSeq.gff"
design <- data.frame(lapply(samplesFull, factor))
parathyroidExons <- read.HTSeqCounts(countfiles = exoncountfiles, 
                                     design = design,
                                     flattenedfile = annotationfile)
sampleNames(parathyroidExons) <- pData(parathyroidExons)$run
experimentData(parathyroidExons) <- expdata
protocolData(parathyroidExons) <- protocoldata
@ 

<<eval=FALSE>>=
save(parathyroidGenes,file="data/parathyroidGenes.RData")
save(parathyroidExons,file="data/parathyroidExons.RData")
@

\section{Session information}

<<results=tex, print=TRUE>>=
toLatex(sessionInfo())
@ 

\bibliography{library}
\bibliographystyle{plain}
\end{document}