stringsAsFactors trouble in R

Aarthi Ramakrishnan

Recently, I was reading a counts matrix into R:

> counts <- read.table("htseq_counts_matrix.txt")
> head(counts)
                   Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000000001                     4676                     2314
ENSMUSG00000000003                        0                        0
ENSMUSG00000000028                      558                      172
ENSMUSG00000000031                    25272                     6334
ENSMUSG00000000037                       28                      118
ENSMUSG00000000049                        3                        0
                   Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000000001                    3422                    4300
ENSMUSG00000000003                       0                       0
ENSMUSG00000000028                     328                     581
ENSMUSG00000000031                    7878                   22343
ENSMUSG00000000037                     212                      21
ENSMUSG00000000049                       2                       3
                   Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000000001                    2221                     857
ENSMUSG00000000003                       0                       0
ENSMUSG00000000028                     158                     159
ENSMUSG00000000031                   27124                    3655
ENSMUSG00000000037                       5                      32
ENSMUSG00000000049                       0                       0

I needed the counts for just two gene IDs, which were stored in a separate file named 'some_genes.txt'.

> some_genes <- read.table("some_genes.txt")
> some_genes <- some_genes$V1
> some_genes
[1] ENSMUSG00000043587 ENSMUSG00000040276
Levels: ENSMUSG00000040276 ENSMUSG00000043587

The 2 IDs were ENSMUSG00000043587 and ENSMUSG00000040276. I tried to subset the counts matrix to retain the counts of just these IDs, but something unexpected happened.

> some_genes_counts <- counts[some_genes, ]
> some_genes_counts
                   Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000000003                        0                        0
ENSMUSG00000000001                     4676                     2314
                   Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000000003                       0                       0
ENSMUSG00000000001                    3422                    4300
                   Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000000003                       0                       0
ENSMUSG00000000001                    2221                     857

Instead of the counts for the 2 IDs, I got the counts for two unrelated genes, which happen to be rows 2 and 1 of the counts matrix! When I investigated further:

> typeof(some_genes)
[1] "integer"

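read.table() had converted the IDs in 'some_genes' to a factor, which R stores internally as integer codes (here 2 and 1, the positions of the two IDs among the alphabetically sorted levels). When a factor is used as an index, R uses those integer codes rather than the printed labels, so the counts matrix was being subset by row position instead of by row name. This is totally undesirable! To fix this, I included 'stringsAsFactors = FALSE' while reading the file into R.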

> some_genes <- read.table("some_genes.txt", stringsAsFactors = F)
> some_genes <- some_genes$V1
> some_genes
[1] "ENSMUSG00000043587" "ENSMUSG00000040276"
> some_genes_counts <- counts[some_genes, ]
> some_genes_counts
                   Pool5_RNAseq_12_S33_L006 Pool5_RNAseq_19_S34_L006
ENSMUSG00000043587                      385                      510
ENSMUSG00000040276                        5                      942
                   Pool5_RNAseq_4_S29_L006 Pool5_RNAseq_5_S30_L006
ENSMUSG00000043587                     836                     365
ENSMUSG00000040276                    1021                       4
                   Pool5_RNAseq_6_S31_L006 Pool5_RNAseq_7_S32_L006
ENSMUSG00000043587                      44                     229
ENSMUSG00000040276                      12                     515
> typeof(some_genes)
[1] "character"

This time, I obtained the counts of the right set of IDs.
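
The same behaviour can be reproduced without any input files. The following is a minimal sketch on a made-up table: the gene names, sample names and values are all hypothetical, and the factor is built by hand to mimic what read.table() returned in my session.

```r
# Toy counts table with gene IDs as row names (all values are made up).
counts_toy <- data.frame(
  sample_a = c(10, 20, 30),
  sample_b = c(11, 21, 31),
  row.names = c("geneA", "geneB", "geneC")
)

# A factor of IDs, built by hand to mimic a column read with stringsAsFactors = TRUE.
wanted <- factor(c("geneC", "geneB"))

# class() reveals the factor; typeof() only shows the underlying integer storage.
print(class(wanted))
print(typeof(wanted))

# Levels are sorted alphabetically, so geneB is code 1 and geneC is code 2.
codes <- as.integer(wanted)

# Indexing with the factor uses the integer codes as row positions.
by_factor <- counts_toy[wanted, ]

# Converting to character first makes R match on row names instead.
by_name <- counts_toy[as.character(wanted), ]

print(rownames(by_factor))
print(rownames(by_name))
```

Checking class() rather than only typeof() makes the cause easier to spot, and as.character() is a quick way to rescue a vector that has already been read in as a factor.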

In versions of R before 4.0.0, read.table() and data.frame() default to stringsAsFactors = TRUE because this setting is convenient while running models using the glm() or lm() function. But in this case, it is just very deceiving. An alternative to including stringsAsFactors = FALSE for every file you read into R is to set options(stringsAsFactors = FALSE) at the beginning of the R script. From R 4.0.0 onward the default is FALSE, so strings are no longer converted to factors unless you ask for it. You can read more about this issue here.
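
Another option, which does not depend on the global setting at all, is to state the column type when reading the file and to check it before subsetting. This is only a sketch: the text = argument stands in for a real file, and the IDs in it are hypothetical.

```r
# Stand-in for the contents of a gene list file (hypothetical IDs, one per line).
gene_file_text <- "geneC\ngeneB\n"

# colClasses = "character" keeps the column as plain strings,
# whatever the stringsAsFactors default is in the running R version.
genes <- read.table(text = gene_file_text, colClasses = "character")$V1

# Fail loudly if the IDs are not plain strings before using them as a row index.
stopifnot(is.character(genes))

print(genes)
```

With a real file, the first argument would be the file name instead of text =. The stopifnot() check is cheap, and it turns a silent wrong subset into an immediate error.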
