With the development of single-cell technologies, numerous cell samples carrying various biological signals have had their transcriptomes sequenced at the single-cell level across different platforms. Analysing all these data requires efficient integration tools, as well as computational simulators that can assess the performance of integration methods. Although existing single-cell RNA-seq (scRNA-seq) simulators can simulate library size, biological effects and batch effects separately, they do not currently capture the associations among these three factors. Here we present GLMsim, the first scRNA-seq simulator to simultaneously capture library size, biology and unwanted variation, together with their associations, via a generalized linear model. Our simulator captures most of the essential characteristics of single-cell data, apart from gene-gene associations in gene expression, and thus simulates data that resemble the original real data. Further, our method is robust to outliers, since we provide ways to handle outlier values efficiently. Our simulator can simulate scRNA-seq datasets from different platforms and with complex biology and batch conditions, which helps biologists choose appropriate protocols for their experiments.
GLMsim is a simulator with several applications. First, it can quantitatively benchmark different scRNA-seq integration methods and assess their ability to retain biology and remove batch effects. Currently, none of the existing simulators can produce a dataset with associations between batch and biology for evaluating integration methods. Beyond benchmarking, GLMsim also enables us to explore the assumptions of a model or method: it can simulate scRNA-seq datasets under a variety of scenarios so that we can examine whether a method works on them. Moreover, GLMsim can provide guidance for differential expression analyses after multiple scRNA-seq datasets have been integrated. GLMsim benefits scRNA-seq analysis by generating faithful synthetic data that help compare different methods and support developers in studying their novel methods.
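To make the idea of jointly modelling library size, biology and batch concrete, the following is a minimal illustrative sketch in Python. It is not GLMsim's implementation: the function name `simulate_counts`, the negative binomial count distribution, the log link, and every numeric value are assumptions chosen only for illustration. It generates counts from a per-gene generalized linear model in which batch membership is associated with cell type and library size is associated with batch.

```python
import numpy as np


def simulate_counts(n_cells=300, n_genes=50, dispersion=0.5, seed=0):
    """Toy GLM-style count simulator (hypothetical; not the GLMsim API)."""
    rng = np.random.default_rng(seed)

    # Biology: two cell types, coded 0/1.
    biology = rng.integers(0, 2, size=n_cells)

    # Batch is associated with biology: type-1 cells fall mostly in batch 1.
    batch = rng.binomial(1, np.where(biology == 1, 0.8, 0.2))

    # Log library size is associated with batch (batch 1 is sequenced deeper).
    log_lib = rng.normal(loc=9.0 + 0.5 * batch, scale=0.3, size=n_cells)

    # Per-gene coefficients (assumed values, for illustration only).
    beta0 = rng.normal(-9.0, 1.0, size=n_genes)      # intercept
    beta_lib = rng.normal(1.0, 0.1, size=n_genes)    # library size effect
    beta_bio = rng.normal(0.0, 0.8, size=n_genes)    # biological effect
    beta_batch = rng.normal(0.0, 0.4, size=n_genes)  # batch effect

    # Linear predictor on the log scale: cells in rows, genes in columns.
    log_mu = (
        beta0[None, :]
        + np.outer(log_lib, beta_lib)
        + np.outer(biology, beta_bio)
        + np.outer(batch, beta_batch)
    )
    mu = np.exp(log_mu)

    # Negative binomial counts drawn as a gamma-Poisson mixture
    # with mean mu and variance mu + dispersion * mu**2.
    shape = 1.0 / dispersion
    rate = rng.gamma(shape=shape, scale=mu / shape)
    counts = rng.poisson(rate)
    return counts, biology, batch, log_lib
```

Calling `counts, biology, batch, log_lib = simulate_counts()` returns a cells-by-genes count matrix together with the covariates that generated it, so the known batch and biology labels can serve as ground truth when judging whether an integration method removes batch effects while retaining biology.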