在处理fasta序列时,常常会遇到一条序列多行排列的现象,如下所示:
$cat test.fasta>test_1TGGGGAATCTTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGAGCGATGAAGGCCTTAGGGTTGTAAAGCTCTTTCAGCTGGGAAGATAATGACGGTACCAGCAGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTGTTAAGTCGGGGGTGAAATCCCGGGGCTCAACCCCGGAACTGCCTCCGATACTGGCAATCTTGAGATCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG>test_3TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTGCGGGTTGTAAAGCACTTTCAGTAGAGAAGAAATGCCCATGGTTAATACCCGTGGGTCTTGACGTAACCTACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGGTCAGTCGGATGTGAAAGCCCTAGGCTCAACCTGGGAATGGCATTCGATACTGCCTGACTAGAGTATGGTAGAGGGAAGTGGAATTTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAACACCAGTGGCGAAGGCGACTTCCTGGGCCAATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG>test_4TGGGGAATTTTGGGCAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGGAGGATGAAGGCCCTCGGGTCGTAAACTCCTGTCCTAGGGGAAGAAAAAAATGACGGTACCCTTGGAGGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGGGGGGGGGGAGCGGTGTTCGGAATTACTGGGCGTAAAGGGCGCGCAGGCGGCCTGGGAAGTCTTGGGTGAAAGCCCCCAGCTCAACTGGGGAATGGCCTGAGAAACCACTAGGCTGGAGTGCTGGAGAGGGAAGCGGAATTCCCGGTGGAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGAGGCGAAGGCGGCTTCCTGGACAGACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACGGG>test_5TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTCACCGGTGAAGATAATGACGGTAACCGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGACTATTAAGTCAGGGGTGAAATCCCGGGGCTCAACCCCGGAACTGCCTTTGATACTGGTAGTCTTGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG>test_6GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACACCGCGTGCGGGATGAAGGCCCTCGGGTTGTAAACCGCTTTCAGGAGGGACGAAAATGACGGTACCTCCAGAAGAAGGCCCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGCCAAACGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCGGTTCAACAAGTCGATCGTGAAAGCCCGGGGCTCAACCCCGGGACGCCGGTCGAAACTGTTGTGACTAGGGTCCGGTAGAGGTGAGTGGAATTCTCGGTGTAGCGGTGGAATGCGCAGATATCGAGAGGAACACCAGTTGCGAAGGCGGCTCACTGGGCCGGTACCGACGCTAAGGAGCGAAAGCGTGGGGAGCAAACAGG
我的一个简单处理方法是,【整体读入-->分隔符分割为列表-->字符串合并列表】,代码如下:
seq_file=open("test.fasta") seq_list=seq_file.read().split(">")for seq in seq_list : if seq : seq_name=seq.split("\n")[0] seq_fa="".join(seq.split("\n")[1:]) print ">" + seq_name + "\n" + seq_fa
打印结果为:
>test_1TGGGGAATCTTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGAGCGATGAAGGCCTTAGGGTTGTAAAGCTCTTTCAGCTGGGAAGATAATGACGGTACCAGCAGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTGTTAAGTCGGGGGTGAAATCCCGGGGCTCAACCCCGGAACTGCCTCCGATACTGGCAATCTTGAGATCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG>test_3TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTGCGGGTTGTAAAGCACTTTCAGTAGAGAAGAAATGCCCATGGTTAATACCCGTGGGTCTTGACGTAACCTACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGGTCAGTCGGATGTGAAAGCCCTAGGCTCAACCTGGGAATGGCATTCGATACTGCCTGACTAGAGTATGGTAGAGGGAAGTGGAATTTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAACACCAGTGGCGAAGGCGACTTCCTGGGCCAATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG>test_4TGGGGAATTTTGGGCAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGGAGGATGAAGGCCCTCGGGTCGTAAACTCCTGTCCTAGGGGAAGAAAAAAATGACGGTACCCTTGGAGGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGGGGGGGGGGAGCGGTGTTCGGAATTACTGGGCGTAAAGGGCGCGCAGGCGGCCTGGGAAGTCTTGGGTGAAAGCCCCCAGCTCAACTGGGGAATGGCCTGAGAAACCACTAGGCTGGAGTGCTGGAGAGGGAAGCGGAATTCCCGGTGGAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGAGGCGAAGGCGGCTTCCTGGACAGACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACGGG>test_5TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTCACCGGTGAAGATAATGACGGTAACCGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGACTATTAAGTCAGGGGTGAAATCCCGGGGCTCAACCCCGGAACTGCCTTTGATACTGGTAGTCTTGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG>test_6GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACACCGCGTGCGGGATGAAGGCCCTCGGGTTGTAAACCGCTTTCAGGAGGGACGAAAATGACGGTACCTCCAGAAGAAGGCCCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGCCAAACGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCGGTTCAACAAGTCGATCGTGAAAGCCCGGGGCTCAACCCCGGGACGCCGGTCGAAACTGTTGTGACTAGGGTCCGGTAGAGGTGAGTGGAATTCTCGGTGTAGCGGTGGAATGCGCAGATATCGAGAGGAACACCAGTTGCGAAGGCGGCTCACTGGGCCGGTACCGACGCTAAGGAGCGAAAGCGTGGGGAGCAAACAGG