simulate_reads.py¶

simulate_reads.chooseInsert(seq, args, READ_COUNTER, max_attempts=5)[source]¶: choose the insert start and end positions based on a predetermined distribution of insert sizes

simulate_reads.createAltAllele(seq, SNP_frequency)[source]¶: simulate the mutation process to generate a heterozygous allele

simulate_reads.createRandomSequence(mu=3392, sigma=2600, m=300, M=10000)[source]¶: create a random ATCG string with random length. Note: use the normal distribution to choose the length (mean: 3392bp, s.d 2600bp), but discard lengths outside the range [m,M]

simulate_reads.createRandomSequenceIndex(n)[source]¶: constructs a dictionary {seq_id: sequence} of random ATCG sequences (for default values see the randomGene() function

simulate_reads.createRandomSequenceLibrary(n_seqs=400)[source]¶: generate a random sequence library and save to file

simulate_reads.createReadLibrary(df, reads_file_1, reads_file_2, args)[source]¶: loop through a dataframe containing sequences (ref and alt alleles) and expression levels (ref and alt counts) and create two dictionaries of simulated NGS reads (fwd and reverse read pairs)

simulate_reads.createReadPair(seq, reads_file_1, reads_file_2, READ_COUNTER, args)[source]¶: simulate a single pair of reads in an NGS experiment

simulate_reads.createTestInsert(s_fwd, args)[source]¶: create an insert (for testing purposes only)

simulate_reads.defineCountsAndSequences(df, args)[source]¶: loop through each gene (transcript) constructing a second allele (based on a defined mutation rate) and read counts for each allele (divide up the total reads between ref & alt alleles: proportions depend on whether the gene has been assigned as ASE or not)

simulate_reads.fastWrap(text, width=None)[source]¶: much quicker than textwrap.wrap

simulate_reads.fasta(seq_index, wrap=None)[source]¶: format a dictionary of {label:sequence} pairs into a fasta string

simulate_reads.fastq(seq_index, args, wrap=None)[source]¶: format a dictionary of {label:sequence} pairs into a fastq string

simulate_reads.indexFasta(fasta, limit=None)[source]¶: re-written to account for some entries having ‘>’ in the title string!

simulate_reads.invert(i, s)[source]¶

return a coordinate corresponding to a position on the reverse strand that matches a coordinate on the forward strand, i.e.

0     5
| --> | 
AAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTT
      |        |
      9    <-- 0

simulate_reads.mutateString(ref_seq, SNP_frequency)[source]¶

based on a specified frequency, return a list of positions to mutate

Note: there are many possible ways this could be done

simulate_reads.parseHeaders(s)[source]¶

custom fasta header parser specific to this dataset

might not work with other datasets

simulate_reads.prepareData(args)[source]¶: index the reference transcriptome, assign read counts, then decide which genes are going to be ASE

simulate_reads.qualScore(args)[source]¶

select a quality string at random from a real RNA-Seq dataset

NOTE: It is assumed that the available quality strings are >= the length of the sequence strings. There is an assert statement in createReadLibrary() which enforces this at the time of args.qualdata construction.

simulate_reads.randomDecision(probability=0.5)[source]¶: returns a random True/False based on: <probability> chance of returning True 1 - <probability> chance of returning False

simulate_reads.readQualscores(args)[source]¶: basic function for extracting quality strings from a fastq file

simulate_reads.simulateSequencingExperiment(args)[source]¶

create a simulated NGS dataset containing allelic inbalance in the expression of a defined proportion of genes

this function can be modified to output multiple sequencing simulations from a single simulated RNA sample, but with varying read counts to assess the effect of increasing sequencing depth

simulate_reads.test_1(args)[source]¶: plot insert sizes from an invariant pseudo-transcriptome

simulate_reads.test_2(args)[source]¶: plot insert sizes from an actual transcriptome

simulate_reads.test_3(args)[source]¶

model the insert size distribution of an actual transcriptome

>>> from scipy import stats

# choose some parameters >>> a, loc, scale = 1.3, -0.1, 2.2 # draw a sample >>> sample = stats.skewnorm(a, loc, scale).rvs(1000)

# estimate parameters from sample >>> ae, loce, scalee = stats.skewnorm.fit(sample) >>> ae 1.2495366661560348 >>> loce -0.039775813819310835 >>> scalee 2.1126121580965536

NOTE: the file ‘ERR2353209.insert_size_metrics.txt’ has insert data based on mapping RNA-Seq reads to a genome. Therefore, some reported inserts will span splice junctions and thus be recorded as longer than they actually are. It was generated by running picard::CollectInsertSizeMetrics

I do not plan to modify it, however one way would be to map reads to the transcriptome instead, then re-measure the insert sizes.

simulate_reads.test_4(args)[source]¶