stringsAsFactors trouble in R
Recently, I was reading a counts matrix into R:
> counts <- read.table("htseq_counts_matrix.txt")
> head(counts)
Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000000001 4676 2314
ENSMUSG00000000003 0 0
ENSMUSG00000000028 558 172
ENSMUSG00000000031 25272 6334
ENSMUSG00000000037 28 118
ENSMUSG00000000049 3 0
Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000000001 3422 4300
ENSMUSG00000000003 0 0
ENSMUSG00000000028 328 581
ENSMUSG00000000031 7878 22343
ENSMUSG00000000037 212 21
ENSMUSG00000000049 2 3
Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000000001 2221 857
ENSMUSG00000000003 0 0
ENSMUSG00000000028 158 159
ENSMUSG00000000031 27124 3655
ENSMUSG00000000037 5 32
ENSMUSG00000000049 0 0
I required the counts for just two gene IDs. The two IDs were stored in a separate file named 'some_genes.txt'.
> some_genes <- read.table("some_genes.txt")
> some_genes <- some_genes$V1
> some_genes
[1] ENSMUSG00000043587 ENSMUSG00000040276
Levels: ENSMUSG00000040276 ENSMUSG00000043587
The 2 IDs were ENSMUSG00000043587 and ENSMUSG00000040276. I tried to subset the counts matrix to retain the counts of just these IDs, but something unexpected happened.
> some_genes_counts <- counts[some_genes, ]
> some_genes_counts
Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000000003 0 0
ENSMUSG00000000001 4676 2314
Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000000003 0 0
ENSMUSG00000000001 3422 4300
Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000000003 0 0
ENSMUSG00000000001 2221 857
Instead of the counts for the 2 IDs, I got the counts of 2 unrelated genes, which happen to be rows 2 and 1 of the counts matrix! When I investigated further:
> typeof(some_genes)
[1] "integer"
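read.table() had converted the IDs in 'some_genes' to a factor, which R stores internally as integer codes (here 2 and 1, the positions of the two IDs among the sorted levels). Those integer codes, not the ID strings, were being used as the row index to subset the counts matrix. This is totally undesirable! To fix this, I included 'stringsAsFactors = FALSE' while reading the file into R.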
> some_genes <- read.table("some_genes.txt", stringsAsFactors = FALSE)
> some_genes <- some_genes$V1
> some_genes
[1] "ENSMUSG00000043587" "ENSMUSG00000040276"
> some_genes_counts <- counts[some_genes, ]
> some_genes_counts
Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000043587 385 510
ENSMUSG00000040276 5 942
Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000043587 836 365
ENSMUSG00000040276 1021 4
Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000043587 44 229
ENSMUSG00000040276 12 515
> typeof(some_genes)
[1] "character"
This time, I obtained the counts of the right set of IDs.
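The behaviour is easy to reproduce without the original files. The snippet below is only a minimal sketch: the data frame, its values, and the names 'geneA', 'geneB', 'geneC' are made up for illustration and stand in for the real counts matrix.

```r
# Toy stand-in for the counts matrix (values are invented for illustration).
counts_toy <- data.frame(
  sample_a = c(10, 20, 30),
  sample_b = c(1, 2, 3),
  row.names = c("geneA", "geneB", "geneC")
)

# A factor sorts its levels alphabetically: geneB is level 1, geneC is level 2.
ids <- factor(c("geneC", "geneB"))

# The integer codes stored inside the factor.
codes <- as.integer(ids)

# Indexing with the factor uses the integer codes, so this picks rows 2 and 1.
by_code <- counts_toy[ids, ]

# Indexing with a character vector matches against the row names instead.
by_name <- counts_toy[as.character(ids), ]

print(codes)
print(rownames(by_code))
print(rownames(by_name))
```

Here 'by_code' contains geneB and geneA, neither of which matches the request for geneC and geneB, while 'by_name' contains the rows that were actually asked for.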
In versions of R before 4.0.0, stringsAsFactors defaults to TRUE because factors are convenient when fitting models with the glm() or lm() function. But in this case, the default is just deceptive. An alternative to including stringsAsFactors = FALSE for every file you read into R is to set options(stringsAsFactors = FALSE) at the beginning of the R script. From R 4.0.0 onwards the default is FALSE, so strings are no longer converted to factors unless you ask for it. You can read more about this issue here.
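If you would rather not depend on a default or a global option at all, the conversion can also be avoided per call. The following is a sketch under assumptions: it writes a temporary file with the two IDs from this post as a stand-in for 'some_genes.txt', and the variable names are mine.

```r
# Temporary stand-in for 'some_genes.txt' (one ID per line).
ids_file <- tempfile(fileext = ".txt")
writeLines(c("ENSMUSG00000043587", "ENSMUSG00000040276"), ids_file)

# Option 1: tell read.table() to read every column as character.
some_genes_df <- read.table(ids_file, colClasses = "character")
some_genes_chr <- some_genes_df$V1

# Option 2: skip the data frame and read the lines as a character vector.
some_genes_lines <- readLines(ids_file)

# Option 3: if a factor has already been created, convert it back to its
# labels before using it as an index.
some_genes_factor <- factor(some_genes_chr)
some_genes_fixed <- as.character(some_genes_factor)

print(typeof(some_genes_chr))
print(some_genes_fixed)
```

All three give a character vector of IDs in the order they appear in the file, which can be passed to counts[some_genes, ] safely regardless of the R version or the stringsAsFactors setting.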