Dear Jessie,
MetaPhlAn 3's default output consists of relative abundances: dividing each by 100 therefore gives a representation of the probability of
each bug. Keeping this in mind, you can easily compute Shannon entropy (as Shannon diversity) with the standard formula
-sum( p * log(p) ), summed over each probability p and excluding the zero values. If you don't feel confident, you can use the functions in scipy or scikit-bio, or test your own against them.
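For instance, a minimal sketch (the abundance values are made up, and it assumes the profile sums to 100 so that the scipy cross-check agrees):

import numpy as np
from scipy.stats import entropy

def shannon_diversity(rel_abundances):
    # MetaPhlAn relative abundances are percentages, so divide by 100
    p = np.asarray(rel_abundances, dtype=float) / 100.0
    p = p[p > 0]  # exclude the zero values before taking the log
    return -np.sum(p * np.log(p))

abunds = [60.0, 25.0, 10.0, 5.0, 0.0]  # made-up species-level profile
assert np.isclose(shannon_diversity(abunds), entropy(abunds))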
As for rarefaction, the most straightforward method is to directly subsample the fastq file of each sample down to the desired number of reads. You can count the number of reads in the file with a fast function:
def rawpycount(filename):
    # count newlines in binary chunks; a fastq record is 4 lines, so reads = lines / 4
    f = openr(filename, 'rb')
    f_gen = _make_gen(f.read)
    return sum(buf.count(b'\n') for buf in f_gen)
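The openr and _make_gen helpers above are not from any library; a minimal sketch of what they could look like, assuming the input may be plain or gzip/bz2 compressed:

import bz2
import gzip

def _make_gen(reader):
    # yield fixed-size binary chunks until EOF
    b = reader(2 ** 16)
    while b:
        yield b
        b = reader(2 ** 16)

def openr(filename, mode='rt'):
    # open plain, gzipped, or bzipped files transparently
    if filename.endswith('.gz'):
        return gzip.open(filename, mode)
    if filename.endswith('.bz2'):
        return bz2.open(filename, mode)
    return open(filename, mode)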
and then print out the 4-line fastq records that were randomly chosen:
import sys
import numpy as np

def random_sample(par):
    # total number of reads: each fastq record spans exactly 4 lines
    N, n = rawpycount(par['inp_f']) // 4, par['nreads']
    if N < n:
        sys.exit(1)
    # draw n distinct record indices, without replacement
    sample = set(np.random.choice(N, n, replace=False))
    kept, line = 0, True
    f, i, c = openr(par['inp_f'], 'rt'), -1, 0
    while line and kept < n:
        line = f.readline()
        if c % 4 == 0:
            i += 1  # index of the current read
            if i in sample:
                kept += 1
                print(line.rstrip())
                # emit the remaining 3 lines of the record
                for _ in range(3):
                    print(f.readline().rstrip())
                c += 3
        c += 1
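For example, to rarefy a (hypothetical) sample.fastq.gz down to 1,000,000 reads, redirecting stdout to the rarefied file:

random_sample({'inp_f': 'sample.fastq.gz', 'nreads': 1000000})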
Another method is to identify the fraction of reads that corresponds to the desired number, and then use a "fake coin".
Suppose you want to rarefy to 40%: for each fastq record you generate a random number between 0 and 1; if it is <= 0.4,
you keep the read, otherwise you discard it (note that the number of reads you keep is then only approximately the target).
This method can also be applied to the bowtie2 output from MetaPhlAn, which saves time:
matching_reads_subset = [m for m, rand in zip(bowtie_matching_reads, np.random.rand(tot_matches))
                         if rand <= chosen_fraction_wrt_min]
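A minimal sketch of the same coin-flip idea applied directly to a fastq stream (it reuses the hypothetical openr helper from above):

import random

def coin_flip_rarefy(filename, fraction):
    # keep each 4-line fastq record with probability `fraction`
    f = openr(filename, 'rt')
    while True:
        record = [f.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        if random.random() <= fraction:
            print(''.join(record), end='')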
Bye