simulate_reads.py¶
-
simulate_reads.
chooseInsert
(seq, args, READ_COUNTER, max_attempts=5)[source]¶ choose the insert start and end positions based on a predetermined distribution of insert sizes
-
simulate_reads.
createAltAllele
(seq, SNP_frequency)[source]¶ simulate the mutation process to generate a heterozygous allele
-
simulate_reads.
createRandomSequence
(mu=3392, sigma=2600, m=300, M=10000)[source]¶ create a random ATCG string with random length. Note: use the normal distribution to choose the length (mean: 3392bp, s.d 2600bp), but discard lengths outside the range [m,M]
-
simulate_reads.
createRandomSequenceIndex
(n)[source]¶ constructs a dictionary {seq_id: sequence} of random ATCG sequences (for default values see the randomGene() function
-
simulate_reads.
createRandomSequenceLibrary
(n_seqs=400)[source]¶ generate a random sequence library and save to file
-
simulate_reads.
createReadLibrary
(df, reads_file_1, reads_file_2, args)[source]¶ loop through a dataframe containing sequences (ref and alt alleles) and expression levels (ref and alt counts) and create two dictionaries of simulated NGS reads (fwd and reverse read pairs)
-
simulate_reads.
createReadPair
(seq, reads_file_1, reads_file_2, READ_COUNTER, args)[source]¶ simulate a single pair of reads in an NGS experiment
-
simulate_reads.
defineCountsAndSequences
(df, args)[source]¶ loop through each gene (transcript) constructing a second allele (based on a defined mutation rate) and read counts for each allele (divide up the total reads between ref & alt alleles: proportions depend on whether the gene has been assigned as ASE or not)
-
simulate_reads.
fasta
(seq_index, wrap=None)[source]¶ format a dictionary of {label:sequence} pairs into a fasta string
-
simulate_reads.
fastq
(seq_index, args, wrap=None)[source]¶ format a dictionary of {label:sequence} pairs into a fastq string
-
simulate_reads.
indexFasta
(fasta, limit=None)[source]¶ re-written to account for some entries having ‘>’ in the title string!
-
simulate_reads.
invert
(i, s)[source]¶ return a coordinate corresponding to a position on the reverse strand that matches a coordinate on the forward strand, i.e.
0 5 | --> | AAAAAAAAAAAAAAAA TTTTTTTTTTTTTTTT | | 9 <-- 0
-
simulate_reads.
mutateString
(ref_seq, SNP_frequency)[source]¶ based on a specified frequency, return a list of positions to mutate
Note: there are many possible ways this could be done
-
simulate_reads.
parseHeaders
(s)[source]¶ custom fasta header parser specific to this dataset
might not work with other datasets
-
simulate_reads.
prepareData
(args)[source]¶ index the reference transcriptome, assign read counts, then decide which genes are going to be ASE
-
simulate_reads.
qualScore
(args)[source]¶ select a quality string at random from a real RNA-Seq dataset
NOTE: It is assumed that the available quality strings are >= the length of the sequence strings. There is an assert statement in createReadLibrary() which enforces this at the time of args.qualdata construction.
-
simulate_reads.
randomDecision
(probability=0.5)[source]¶ returns a random True/False based on: <probability> chance of returning True 1 - <probability> chance of returning False
-
simulate_reads.
readQualscores
(args)[source]¶ basic function for extracting quality strings from a fastq file
-
simulate_reads.
simulateSequencingExperiment
(args)[source]¶ create a simulated NGS dataset containing allelic inbalance in the expression of a defined proportion of genes
this function can be modified to output multiple sequencing simulations from a single simulated RNA sample, but with varying read counts to assess the effect of increasing sequencing depth
-
simulate_reads.
test_3
(args)[source]¶ model the insert size distribution of an actual transcriptome
>>> from scipy import stats
# choose some parameters >>> a, loc, scale = 1.3, -0.1, 2.2 # draw a sample >>> sample = stats.skewnorm(a, loc, scale).rvs(1000)
# estimate parameters from sample >>> ae, loce, scalee = stats.skewnorm.fit(sample) >>> ae 1.2495366661560348 >>> loce -0.039775813819310835 >>> scalee 2.1126121580965536
NOTE: the file ‘ERR2353209.insert_size_metrics.txt’ has insert data based on mapping RNA-Seq reads to a genome. Therefore, some reported inserts will span splice junctions and thus be recorded as longer than they actually are. It was generated by running picard::CollectInsertSizeMetrics
I do not plan to modify it, however one way would be to map reads to the transcriptome instead, then re-measure the insert sizes.