stringsAsFactors trouble in R

Recently, I read a counts matrix into R:

> counts <- read.table("htseq_counts_matrix.txt")
> head(counts)
                   Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000000001                     4676                     2314
ENSMUSG00000000003                        0                        0
ENSMUSG00000000028                      558                      172
ENSMUSG00000000031                    25272                     6334
ENSMUSG00000000037                       28                      118
ENSMUSG00000000049                        3                        0
                   Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000000001                    3422                    4300
ENSMUSG00000000003                       0                       0
ENSMUSG00000000028                     328                     581
ENSMUSG00000000031                    7878                   22343
ENSMUSG00000000037                     212                      21
ENSMUSG00000000049                       2                       3
                   Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000000001                    2221                     857
ENSMUSG00000000003                       0                       0
ENSMUSG00000000028                     158                     159
ENSMUSG00000000031                   27124                    3655
ENSMUSG00000000037                       5                      32
ENSMUSG00000000049                       0                       0

I required the counts for just two gene IDs. The two IDs were stored in a separate file named 'some_genes.txt'.

> some_genes <- read.table("some_genes.txt")
> some_genes <- some_genes$V1
> some_genes
[1] ENSMUSG00000043587 ENSMUSG00000040276
Levels: ENSMUSG00000040276 ENSMUSG00000043587

The 2 IDs were ENSMUSG00000043587 and ENSMUSG00000040276. I tried to subset the counts matrix to retain the counts of just these IDs, but something unexpected happened.

> some_genes_counts <- counts[some_genes, ]
> some_genes_counts
                   Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000000003                        0                        0
ENSMUSG00000000001                     4676                     2314
                   Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000000003                       0                       0
ENSMUSG00000000001                    3422                    4300
                   Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000000003                       0                       0
ENSMUSG00000000001                    2221                     857

Instead of the counts for the 2 IDs, I got the counts of 2 unrelated genes (the second and first rows of the matrix)! When I investigated further:

> typeof(some_genes)
[1] "integer"

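The IDs in 'some_genes' had been read in as a factor (note the 'Levels:' line in the earlier output), and R stores a factor as integer codes. Those codes, 2 and 1, were being used as row positions to subset the counts matrix, which is why I got the second and first rows. This is totally undesirable! To fix this, I included 'stringsAsFactors = F' while reading the file into R.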

> some_genes <- read.table("some_genes.txt", stringsAsFactors = F)
> some_genes <- some_genes$V1
> some_genes
[1] "ENSMUSG00000043587" "ENSMUSG00000040276"
> some_genes_counts <- counts[some_genes, ]
> some_genes_counts
                   Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000043587                      385                      510
ENSMUSG00000040276                        5                      942
                   Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000043587                     836                     365
ENSMUSG00000040276                    1021                       4
                   Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000043587                      44                     229
ENSMUSG00000040276                      12                     515
> typeof(some_genes)
[1] "character"

This time, I obtained the counts of the right set of IDs.
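The behaviour can be reproduced without the original files. The snippet below is a minimal sketch: the data frame, its values, and the gene names (geneA to geneD) are made up for illustration, and the factor is built directly with factor() rather than read from a file.

```r
# Toy stand-in for the counts matrix (made-up values, hypothetical gene IDs).
counts <- data.frame(
  sample_a = c(10, 20, 30, 40),
  sample_b = c(1, 2, 3, 4),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# IDs held as a factor, as happens when strings are converted to factors.
# Levels are sorted alphabetically, so geneC gets code 1 and geneD gets code 2.
ids_factor <- factor(c("geneD", "geneC"))

# The factor is stored as integer codes ...
codes <- as.integer(ids_factor)   # 2 1

# ... and those codes are used as row positions, not as names.
wrong <- counts[ids_factor, ]
rownames(wrong)                   # "geneB" "geneA"

# Character IDs are matched against the row names instead.
right <- counts[as.character(ids_factor), ]
rownames(right)                   # "geneD" "geneC"
```

Because the factor codes depend only on the alphabetical order of the IDs, the rows returned by the wrong subset have nothing to do with the genes requested.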

In R versions before 4.0.0, read.table() and data.frame() default to stringsAsFactors = TRUE, because factors are convenient when fitting models with the glm() or lm() functions. In this case, though, the default is just very deceiving. Instead of passing stringsAsFactors = FALSE for every file you read into R, you can set options(stringsAsFactors = FALSE) once at the beginning of the R script. From R 4.0.0 onwards the default is FALSE, so strings are no longer converted to factors unless you ask for it. You can read more about this issue here.
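Two habits avoid the problem regardless of the R version or the global option. The sketch below passes the two IDs from this post inline through the text argument of read.table() instead of reading 'some_genes.txt'; the variable names are only illustrative.

```r
# The two gene IDs from this post, supplied inline instead of from a file.
id_text <- "ENSMUSG00000043587\nENSMUSG00000040276"

# Stating the argument explicitly does not rely on the version's default.
ids <- read.table(text = id_text, stringsAsFactors = FALSE)$V1
typeof(ids)                       # "character"

# If a factor has already been created, convert it back before indexing.
ids_factor <- factor(ids)
ids_fixed <- as.character(ids_factor)
identical(ids_fixed, ids)         # TRUE
```

Converting with as.character() recovers the original strings in their original order, so the result can be used safely as a row-name index.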