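Amplicon annotation usually starts with clustering. But if you only have a few thousand sequences, clustering may not give a usable result, or you may simply want to know which species each individual sequence belongs to. In either case you can align the sequences with blastn and then summarize the results.
The summarizing code is as follows: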
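import re

# Read the blastn text report; the "Query=", ">" and "Identities =" lines are what gets parsed
with open("results_clean_batch1.txt", "r", encoding="utf-8") as file:
    text = file.read()

# One block per query sequence, each starting at "Query="
matches = re.findall(r'(Query=.*?)(?=Query=|$)', text, re.DOTALL)

results = []
for match in matches:
    id_match = re.search(r'Query= (\S+)', match)
    if id_match is None:
        continue  # no query ID in this block, nothing to report
    query_results = [id_match.group(1)]

    # Split the block at each hit header (">accession description") so that
    # every identity value is read from the same hit as its accession
    for hit in re.split(r'\n>', match)[1:]:
        hit_header = re.match(r'(\S+) (.+?)\nLength', hit, re.DOTALL)
        identity = re.search(r'Identities = (.*?),', hit)
        if hit_header is None or identity is None:
            continue
        accession = hit_header.group(1)
        # Long descriptions wrap over several lines; join them into one
        species_name = ' '.join(hit_header.group(2).split())
        # A hit with several alignments contributes only its first identity
        query_results.extend([accession, f"Species: {species_name}", f"Identity: {identity.group(1)}"])

    results.append(query_results)

# Output to a file
with open('output_batch2.txt', 'w', encoding="utf-8") as f:
    # Add header (it names three matches; a row holds as many matches as the report lists for that query)
    header = "Sequence ID\tMatch 1 Accession\tMatch 1 Species\tMatch 1 Identity\tMatch 2 Accession\tMatch 2 Species\tMatch 2 Identity\tMatch 3 Accession\tMatch 3 Species\tMatch 3 Identity"
    f.write(header + '\n')
    # Write the results
    for result in results:
        f.write('\t'.join(result) + '\n')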
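To see what the regular expressions actually pick up, here is a minimal sketch that runs the same kind of parsing on a tiny, made-up report fragment instead of a file. The accessions, species names and scores in `SAMPLE_REPORT` are invented for illustration, and `parse_report` is a hypothetical helper name, not part of the script above. The sketch assumes the report uses blastn's default pairwise text layout, and it reads the identity from within each hit's own text, so a hit with two alignments still contributes a single identity value.

```python
import re

# Made-up fragment in blastn's default pairwise text layout (hypothetical IDs).
SAMPLE_REPORT = """\
Query= seq1

Length=250

>ACC0001.1 Examplea speciesa strain X 16S ribosomal RNA gene, partial
sequence
Length=1500

 Score = 450 bits (243),  Expect = 1e-125
 Identities = 248/250 (99%), Gaps = 0/250 (0%)
 Strand=Plus/Plus

 Score = 80 bits (43),  Expect = 3e-15
 Identities = 50/60 (83%), Gaps = 2/60 (3%)
 Strand=Plus/Minus

>ACC0002.1 Examplea speciesb 16S ribosomal RNA gene
Length=1480

 Score = 430 bits (232),  Expect = 2e-119
 Identities = 244/250 (98%), Gaps = 1/250 (0%)
 Strand=Plus/Plus

Query= seq2

Length=240

***** No hits found *****
"""


def parse_report(text):
    """Return one row per query: [query_id, accession, species, identity, ...]."""
    rows = []
    # One block per query, each starting at "Query=".
    for block in re.findall(r'(Query=.*?)(?=Query=|$)', text, re.DOTALL):
        id_match = re.search(r'Query= (\S+)', block)
        if id_match is None:
            continue
        row = [id_match.group(1)]
        # Each hit starts with ">" at the beginning of a line.
        for hit in re.split(r'(?m)^>', block)[1:]:
            head = re.match(r'(\S+) (.+?)\nLength', hit, re.DOTALL)
            ident = re.search(r'Identities = (.*?),', hit)  # first alignment only
            if head and ident:
                species = ' '.join(head.group(2).split())  # unwrap long descriptions
                row += [head.group(1), species, ident.group(1)]
        rows.append(row)
    return rows


for row in parse_report(SAMPLE_REPORT):
    print('\t'.join(row))
```

A query without hits, like `seq2` here, ends up as a row that contains only its ID, which makes it easy to spot sequences that need a second look.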