aibiology

Artificial intelligence in biology

0%

GFF和GTF中frame计算过程

GFF/GTF是基因组注释中常用的文件格式,两者的主要差别在最后一列。 第八列中的frame,用来表示当前特征在CDS编码中的相关信息。

1. frame 可选值

  • 0, 0表示当前编码的CDS是从当前特征的第一个碱基开始解码,即该CDS的第一个碱基是完整密码子的第一个碱基。
  • 1, 1表示当前编码的CDS是从第二个碱基开始解码,即该CDS的第二个碱基是完整密码子的第一个碱基。
  • 2, 2表示当前编码的CDS是从第三个碱基开始解码,即该CDS的第三个碱基是完整密码子的第一个碱基。

2. frame 计算

定义函数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
def b(length, frame):
'''frame calculate formula
length - frame: 前面的CDS特征长度减去它自己的frame值,表示在前面CDS特征上的第一个完整的密码子开始的长度。
((length-frame) % 3): 前面CDS特征3'端最后一个完整密码子之后剩下的碱基数。
(3- ((length-frame) % 3)): 是除去特征的3'端所代表的碱基后,密码子中剩余的碱基数目。
(3- ((length-frame) % 3)) % 3: 把3变成0,因为三个碱基组成一个完整的密码子,1和2保持不变。

params:
length: previous feature length (5'->3')
frame: previous feature frame (5'->3')
return:
next feature frame
'''
return (3- ((length-frame) % 3)) % 3

def cal_frame(lengths, strand):
if strand == '-':
lengths = lengths[::-1]
frames = []
for i in range(len(lengths)):
if i == 0:
frames.append(0)
else:
frame = b(lengths[i-1], frames[i-1])
frames.append(frame)

return frames

if __name__ == '__main__':
# + example
# length_list = [169,101,123,186,222,154,150,89,204,108,135,132,133,93,126,126,144,129,161]
# strand = "+"

# - example
length_list = [278,778,231]
strand = '-'
print(cal_frame(length_list, strand))

对于正链 +

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
$ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff | grep AT1G01950.1 | awk '$3~/CDS/'
Chr1 Araport11 CDS 325473 325641 . + 0 ID=AT1G01950:CDS:1;Parent=AT1G01950.1;Name=ARK2:CDS:1
Chr1 Araport11 CDS 325913 326013 . + 2 ID=AT1G01950:CDS:2;Parent=AT1G01950.1;Name=ARK2:CDS:2
Chr1 Araport11 CDS 326106 326228 . + 0 ID=AT1G01950:CDS:3;Parent=AT1G01950.1;Name=ARK2:CDS:3
Chr1 Araport11 CDS 326332 326517 . + 0 ID=AT1G01950:CDS:4;Parent=AT1G01950.1;Name=ARK2:CDS:4
Chr1 Araport11 CDS 326594 326815 . + 0 ID=AT1G01950:CDS:6;Parent=AT1G01950.1;Name=ARK2:CDS:6
Chr1 Araport11 CDS 326931 327084 . + 0 ID=AT1G01950:CDS:7;Parent=AT1G01950.1;Name=ARK2:CDS:7
Chr1 Araport11 CDS 327237 327386 . + 2 ID=AT1G01950:CDS:8;Parent=AT1G01950.1;Name=ARK2:CDS:8
Chr1 Araport11 CDS 327488 327576 . + 2 ID=AT1G01950:CDS:9;Parent=AT1G01950.1;Name=ARK2:CDS:9
Chr1 Araport11 CDS 327663 327866 . + 0 ID=AT1G01950:CDS:10;Parent=AT1G01950.1;Name=ARK2:CDS:10
Chr1 Araport11 CDS 328000 328107 . + 0 ID=AT1G01950:CDS:11;Parent=AT1G01950.1;Name=ARK2:CDS:11
Chr1 Araport11 CDS 328188 328322 . + 0 ID=AT1G01950:CDS:12;Parent=AT1G01950.1;Name=ARK2:CDS:12
Chr1 Araport11 CDS 328424 328555 . + 0 ID=AT1G01950:CDS:14;Parent=AT1G01950.1;Name=ARK2:CDS:14
Chr1 Araport11 CDS 328898 329030 . + 0 ID=AT1G01950:CDS:17;Parent=AT1G01950.1;Name=ARK2:CDS:17
Chr1 Araport11 CDS 329119 329211 . + 2 ID=AT1G01950:CDS:18;Parent=AT1G01950.1;Name=ARK2:CDS:18
Chr1 Araport11 CDS 329311 329436 . + 2 ID=AT1G01950:CDS:19;Parent=AT1G01950.1;Name=ARK2:CDS:19
Chr1 Araport11 CDS 329573 329698 . + 2 ID=AT1G01950:CDS:20;Parent=AT1G01950.1;Name=ARK2:CDS:20
Chr1 Araport11 CDS 329781 329924 . + 2 ID=AT1G01950:CDS:21;Parent=AT1G01950.1;Name=ARK2:CDS:21
Chr1 Araport11 CDS 330027 330155 . + 2 ID=AT1G01950:CDS:22;Parent=AT1G01950.1;Name=ARK2:CDS:22
Chr1 Araport11 CDS 330243 330403 . + 2 ID=AT1G01950:CDS:23;Parent=AT1G01950.1;Name=ARK2:CDS:23

$ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff | grep AT1G01950.1 | awk '$3~/CDS/' | awk '{print$5-$4+1}' | xargs | sed 's/ /,/g'

$ # 169,101,123,186,222,154,150,89,204,108,135,132,133,93,126,126,144,129,161
$ # [0, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2]

对于负链 -

1
2
3
4
5
6
7
8
9
$ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff  | grep AT1G01090.1 | awk '$3~/CDS/'
Chr1 Araport11 CDS 47705 47982 . - 2 ID=AT1G01090:CDS:3;Parent=AT1G01090.1;Name=PDH-E1 ALPHA:CDS:3
Chr1 Araport11 CDS 48075 48852 . - 0 ID=AT1G01090:CDS:2;Parent=AT1G01090.1;Name=PDH-E1 ALPHA:CDS:2
Chr1 Araport11 CDS 48936 49166 . - 0 ID=AT1G01090:CDS:1;Parent=AT1G01090.1;Name=PDH-E1 ALPHA:CDS:1

$ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff | grep AT1G01090.1 | awk '$3~/CDS/' | awk '{print$5-$4+1}' | xargs | sed 's/ /,/g'

$ # 278,778,231
$ # [0, 0, 2]