Jun 2, 2018 2 min read Bioinformatics

stringsAsFactors trouble in R

Recently, I was reading a counts matrix into R:

> counts <- read.table("htseq_counts_matrix.txt")
> head(counts)
                   Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000000001                     4676                     2314
ENSMUSG00000000003                        0                        0
ENSMUSG00000000028                      558                      172
ENSMUSG00000000031                    25272                     6334
ENSMUSG00000000037                       28                      118
ENSMUSG00000000049                        3                        0
                   Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000000001                    3422                    4300
ENSMUSG00000000003                       0                       0
ENSMUSG00000000028                     328                     581
ENSMUSG00000000031                    7878                   22343
ENSMUSG00000000037                     212                      21
ENSMUSG00000000049                       2                       3
                   Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000000001                    2221                     857
ENSMUSG00000000003                       0                       0
ENSMUSG00000000028                     158                     159
ENSMUSG00000000031                   27124                    3655
ENSMUSG00000000037                       5                      32
ENSMUSG00000000049                       0                       0

I needed the counts for just two gene IDs, which were stored in a separate file named 'some_genes.txt'.

> some_genes <- read.table("some_genes.txt")
> some_genes <- some_genes$V1
> some_genes
[1] ENSMUSG00000043587 ENSMUSG00000040276
Levels: ENSMUSG00000040276 ENSMUSG00000043587

The 2 IDs were ENSMUSG00000043587 and ENSMUSG00000040276. I tried to subset the counts matrix to retain the counts of just these IDs, but something unexpected happened.

> some_genes_counts <- counts[some_genes, ]
> some_genes_counts
                   Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000000003                        0                        0
ENSMUSG00000000001                     4676                     2314
                   Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000000003                       0                       0
ENSMUSG00000000001                    3422                    4300
                   Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000000003                       0                       0
ENSMUSG00000000001                    2221                     857

Instead of the counts for the two IDs I asked for, I got the first two rows of the counts matrix, in reverse order! When I investigated further:

> typeof(some_genes)
[1] "integer"

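read.table() had converted the IDs in 'some_genes' to a factor, which R stores as integer codes (one per level, with the levels sorted alphabetically). When a factor is used as an index, R uses those codes rather than the labels, so counts[some_genes, ] returned rows 2 and 1 of the counts matrix. This is totally undesirable! To fix this, I included 'stringsAsFactors = FALSE' while reading the file into R.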

> some_genes <- read.table("some_genes.txt", stringsAsFactors = FALSE)
> some_genes <- some_genes$V1
> some_genes
[1] "ENSMUSG00000043587" "ENSMUSG00000040276"
> some_genes_counts <- counts[some_genes, ]
> some_genes_counts
                   Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000043587                      385                      510
ENSMUSG00000040276                        5                      942
                   Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000043587                     836                     365
ENSMUSG00000040276                    1021                       4
                   Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000043587                      44                     229
ENSMUSG00000040276                      12                     515
> typeof(some_genes)
[1] "character"

This time, I obtained the counts of the right set of IDs.
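
The mechanism behind the wrong rows can be reproduced on a toy data frame. This is only a minimal sketch: the gene names, sample names and values below are made up for illustration and are not from the real counts matrix.

```r
# Toy stand-in for the counts matrix (hypothetical values, not the real data).
counts <- data.frame(
  sample_a = c(10, 20, 30, 40),
  sample_b = c(1, 2, 3, 4),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# A factor holding two gene IDs, like the one read.table() produced above.
wanted <- factor(c("geneD", "geneC"))

levels(wanted)       # levels are sorted alphabetically: "geneC" "geneD"
as.integer(wanted)   # underlying integer codes: 2 1

# Indexing with the factor uses the integer codes, so rows 2 and 1 come back.
by_factor <- counts[wanted, ]
rownames(by_factor)  # "geneB" "geneA"

# Indexing with the character values matches the row names instead.
by_name <- counts[as.character(wanted), ]
rownames(by_name)    # "geneD" "geneC"
```

Wrapping an existing factor in as.character() is therefore enough to get name-based subsetting without re-reading the file.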

At the time of writing, R set stringsAsFactors = TRUE by default (the default changed to FALSE in R 4.0.0) because factors are convenient when fitting models with glm() or lm(). But in this case, the default is just misleading. An alternative to including stringsAsFactors = FALSE for every file you read into R is to set options(stringsAsFactors = FALSE) at the beginning of the R script. You can read more about this issue here.
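
A more defensive option is to make the subsetting step itself insensitive to how the IDs were read. The sketch below is one possible approach, not something from the original analysis: subset_by_id() is a hypothetical helper, and the toy data is made up. It also shows the colClasses argument of read.table() as another way to keep a column as character.

```r
# Hypothetical helper (not from the original post): subset rows by ID,
# coercing factors to character and failing loudly on unknown IDs.
subset_by_id <- function(mat, ids) {
  ids <- as.character(ids)
  missing_ids <- setdiff(ids, rownames(mat))
  if (length(missing_ids) > 0) {
    stop("IDs not found in row names: ", paste(missing_ids, collapse = ", "))
  }
  mat[ids, , drop = FALSE]
}

# Toy data standing in for the counts matrix (made-up values).
toy_counts <- data.frame(
  sample_a = c(10, 20, 30, 40),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Works the same whether the IDs arrive as a factor or as character.
from_factor <- subset_by_id(toy_counts, factor(c("geneD", "geneC")))
from_char   <- subset_by_id(toy_counts, c("geneD", "geneC"))

# colClasses forces the column to character regardless of stringsAsFactors.
ids_read <- read.table(text = "geneD\ngeneC", colClasses = "character")$V1
from_read <- subset_by_id(toy_counts, ids_read)
```

Checking for missing IDs up front means a typo in the gene list stops the script instead of silently producing rows of NA.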
