GFF/GTF是基因组注释中常用的文件格式,两者的主要差别在最后一列。 第八列中的frame,用来表示当前特征在CDS编码中的相关信息。
1. frame 可选值
- 0, 0表示当前编码的CDS是从当前特征的第一个碱基开始解码,即该CDS的第一个碱基是完整密码子的第一个碱基。
- 1, 1表示当前编码的CDS是从第二个碱基开始解码,即该CDS的第二个碱基是完整密码子的第一个碱基。
- 2, 2表示当前编码的CDS是从第三个碱基开始解码,即该CDS的第三个碱基是完整密码子的第一个碱基。
2. frame 计算
定义函数
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
| def b(length, frame): '''frame calculate formula length - frame: 前面的CDS特征长度减去它自己的frame值,表示在前面CDS特征上的第一个完整的密码子开始的长度。 ((length-frame) % 3): 前面CDS特征3'端最后一个完整密码子之后剩下的碱基数。 (3- ((length-frame) % 3)): 是除去特征的3'端所代表的碱基后,密码子中剩余的碱基数目。 (3- ((length-frame) % 3)) % 3: 把3变成0,因为三个碱基组成一个完整的密码子,1和2保持不变。 params: length: previous feature length (5'->3') frame: previous feature frame (5'->3') return: next feature frame ''' return (3- ((length-frame) % 3)) % 3
def cal_frame(lengths, strand): if strand == '-': lengths = lengths[::-1] frames = [] for i in range(len(lengths)): if i == 0: frames.append(0) else: frame = b(lengths[i-1], frames[i-1]) frames.append(frame)
return frames
if __name__ == '__main__': length_list = [278,778,231] strand = '-' print(cal_frame(length_list, strand))
|
对于正链 +
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
| $ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff | grep AT1G01950.1 | awk '$3~/CDS/' Chr1 Araport11 CDS 325473 325641 . + 0 ID=AT1G01950:CDS:1;Parent=AT1G01950.1;Name=ARK2:CDS:1 Chr1 Araport11 CDS 325913 326013 . + 2 ID=AT1G01950:CDS:2;Parent=AT1G01950.1;Name=ARK2:CDS:2 Chr1 Araport11 CDS 326106 326228 . + 0 ID=AT1G01950:CDS:3;Parent=AT1G01950.1;Name=ARK2:CDS:3 Chr1 Araport11 CDS 326332 326517 . + 0 ID=AT1G01950:CDS:4;Parent=AT1G01950.1;Name=ARK2:CDS:4 Chr1 Araport11 CDS 326594 326815 . + 0 ID=AT1G01950:CDS:6;Parent=AT1G01950.1;Name=ARK2:CDS:6 Chr1 Araport11 CDS 326931 327084 . + 0 ID=AT1G01950:CDS:7;Parent=AT1G01950.1;Name=ARK2:CDS:7 Chr1 Araport11 CDS 327237 327386 . + 2 ID=AT1G01950:CDS:8;Parent=AT1G01950.1;Name=ARK2:CDS:8 Chr1 Araport11 CDS 327488 327576 . + 2 ID=AT1G01950:CDS:9;Parent=AT1G01950.1;Name=ARK2:CDS:9 Chr1 Araport11 CDS 327663 327866 . + 0 ID=AT1G01950:CDS:10;Parent=AT1G01950.1;Name=ARK2:CDS:10 Chr1 Araport11 CDS 328000 328107 . + 0 ID=AT1G01950:CDS:11;Parent=AT1G01950.1;Name=ARK2:CDS:11 Chr1 Araport11 CDS 328188 328322 . + 0 ID=AT1G01950:CDS:12;Parent=AT1G01950.1;Name=ARK2:CDS:12 Chr1 Araport11 CDS 328424 328555 . + 0 ID=AT1G01950:CDS:14;Parent=AT1G01950.1;Name=ARK2:CDS:14 Chr1 Araport11 CDS 328898 329030 . + 0 ID=AT1G01950:CDS:17;Parent=AT1G01950.1;Name=ARK2:CDS:17 Chr1 Araport11 CDS 329119 329211 . + 2 ID=AT1G01950:CDS:18;Parent=AT1G01950.1;Name=ARK2:CDS:18 Chr1 Araport11 CDS 329311 329436 . + 2 ID=AT1G01950:CDS:19;Parent=AT1G01950.1;Name=ARK2:CDS:19 Chr1 Araport11 CDS 329573 329698 . + 2 ID=AT1G01950:CDS:20;Parent=AT1G01950.1;Name=ARK2:CDS:20 Chr1 Araport11 CDS 329781 329924 . + 2 ID=AT1G01950:CDS:21;Parent=AT1G01950.1;Name=ARK2:CDS:21 Chr1 Araport11 CDS 330027 330155 . + 2 ID=AT1G01950:CDS:22;Parent=AT1G01950.1;Name=ARK2:CDS:22 Chr1 Araport11 CDS 330243 330403 . + 2 ID=AT1G01950:CDS:23;Parent=AT1G01950.1;Name=ARK2:CDS:23
$ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff | grep AT1G01950.1 | awk '$3~/CDS/' | awk '{print$5-$4+1}' | xargs | sed 's/ /,/g'
$ $
|
对于负链 -
1 2 3 4 5 6 7 8 9
| $ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff | grep AT1G01090.1 | awk '$3~/CDS/' Chr1 Araport11 CDS 47705 47982 . - 2 ID=AT1G01090:CDS:3;Parent=AT1G01090.1;Name=PDH-E1 ALPHA:CDS:3 Chr1 Araport11 CDS 48075 48852 . - 0 ID=AT1G01090:CDS:2;Parent=AT1G01090.1;Name=PDH-E1 ALPHA:CDS:2 Chr1 Araport11 CDS 48936 49166 . - 0 ID=AT1G01090:CDS:1;Parent=AT1G01090.1;Name=PDH-E1 ALPHA:CDS:1
$ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff | grep AT1G01090.1 | awk '$3~/CDS/' | awk '{print$5-$4+1}' | xargs | sed 's/ /,/g'
$ $
|