序列比对中的动态规划算法

Posted on 2023-12-04 Edited on 2023-12-26

动态规划

动态规划(dynamic programming), 是一种求解多阶段决策过程最优化问题的算法。核心思想就是: (1),将一个大问题拆解成一堆子问题，(2),求解出子问题的解，存储起来当再次需要求解此问题时候可以被重复利用，避免重复的计算过程。能够使用DP必须满足下面三个特征:
(1), 最优子结构性质。一个问题的最优解包含其子问题的最优解。
(2), 重叠子问题性质。在求解子问题的过程中，会有大量的子问题的是重复的，可以将子问题的结果存储，后续直接查询，避免重复计算。 (3), 无后效性。子问题的解只与之前的阶段有关，与后面的阶段无关。

1 算法解释

对于序列比对问题，我们想知道两个序列之间的存在的差异，首先我们想了解全局比对的差异。对于全局比对，前人开发出Needleman-Wunsch算法，该算法是基于动态规划的全局比对算法。

\[ F(i, j)=max \left\{ \begin{aligned} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) + s(x_i, -) \\ F(i, j-1) + s(-, y_j) \end{aligned} \right. \]

$F(i-1, j-1) + s(x_i, y_j)$ 前半部分$F(i-1, j-1)$表示前面$i-1$和$j-1$比对计算完成的情况下，比对两条序列的当前碱基$x_i$和$y_j$, $s(x_i, y_j)$表示当前的替换分数。

$F(i-1, j) + s(x_i, -)$ 前半部分$F(i-1, j)$表示前面$i-1$和$j$比对计算完成的情况下，比对$x$序列的当前碱基$x_i$和$y$序列空白$-$, $s(x_i, -)$表示$x$序列当前碱基和gap的替换分数。

$F(i, j-1) + s(-, y_j)$ 前半部分$F(i, j-1)$表示前面$i$和$j-1$比对计算完成的情况下，
比对$y$序列的当前碱基$y_j$和$x$序列空白$-$, $s(-, y_j)$表示$y$序列当前碱基和gap的替换分数。

2 罚分矩阵

常见的罚分矩阵有几种：

等价矩阵：碱基相同为1，不同为0
转换-颠换矩阵：嘌呤(AG)之间为转换，嘧啶(TC)之间为转换，嘧啶和嘌呤之间为颠换
BLAST矩阵：碱基相同为5，碱基不同为-4

NCBI Blast matrices

这里实现的代码采用自定的一个矩阵，具体如下：
对角线为自身比对上的分数0，即match。
嘌呤A/G之间比对分数为2，即mismatch。
嘧啶T/C之间比对分数为2，即mismatch。
嘧啶(T/C)和嘌呤(A/G)之间的比对分数为4, 即mismatch。
空白罚分为8，即gap。
Gap一般分2种:
Gap opening(只有出现一个就罚分)，
Gap extension(在一空位的扩大，拓展空位罚分，长度越长，罚分越多)

# 因为我们采用的是正数罚分矩阵，
# 所以最佳比对是罚分最小的
   A  T  G  C  -
A  0  4  2  4  8
T  4  0  4  2  8
G  2  4  0  4  8
C  4  2  4  0  8
-  8  8  8  8  8

3 全局比对python实现

alphabet = ['A', 'T', 'G', 'C']
score = [[0, 4, 2, 4, 8],
         [4, 0, 4, 2, 8],
         [2, 4, 0, 4, 8],
         [4, 2, 4, 0, 8],
         [8, 8, 8, 8, 8]]

import numpy as np 

def globalAlignment(x, y):
    """
    Needleman-Wunsch algorithm
    x : (str) sequence a
    y : (str) sequence b
    """
    # 创建一个距离矩阵，记录比对过程
    D = []
    # 初始化矩阵，全部填充0
    for i in range(len(x)+1):
        D.append([0] * (len(y)+1))
     
    
    # 初始化第一列，都填上空白的罚分，随着位置累加
    for i in range(1, len(x)+1):
        # 二维数组的第二索引为0
        D[i][0] = D[i-1][0] + score[alphabet.index(x[i-1])][-1]
        
    # 初始化第一行,都填上空白的罚分，随着位置累加
    for j in range(1, len(y)+1):
        # 二维数组的第一索引为0 
        D[0][j] = D[0][j-1] + score[-1][alphabet.index(y[j-1])]
    
    # 填充剩余的矩阵 
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            # 水平距离，x(i)和y(j-1)比对上，yj位置gap罚分
            # D[i][j-1]未知，后一项罚分已知
            distHor = D[i][j-1] + score[-1][alphabet.index(y[j-1])]
            # 垂直距离，x(i-1)和y(j)比对上，xi位置gap罚分
            # D[i-1][j]未知，后一项罚分已知
            distVer = D[i-1][j] + score[alphabet.index(x[i-1])][-1]
            # 对角线距离，x(i-1)和y(j-1)比对上，xi和yj错配罚分
            # D[i-1][j-1]未知，后一项罚分已知
            distDiag = D[i-1][j-1] + score[alphabet.index(x[i-1])][alphabet.index(y[j-1])]
            # 取最小值的比对路线，记录比对值, 该值在后续上面的三个距离计算中重复使用，避免循环暴力递归
            # 这里才是动态规划算法的核心所在。
            D[i][j] = min(distHor, distVer, distDiag) # 即问题分解成上述三个子问题，罚分最小才是最佳比对,视罚分函数而定
    # 返回右下角的值, 最佳的比对值
    return D, D[-1][-1]

x = 'TATGTCATGC'
y = 'AATGTCATGC'
D, value = globalAlignment(x,y)
import numpy as np 
d = np.array(D)
np.diagonal(d)
print(value)
print(D)

4 局部比对

在生物学序列比对中，由于内含子或者motif的因素的存在，我们常常需要算出局部最佳的比对情况，类似最大子串问题，因此需要局部比对，前人实现了smith-waterman算法来解决这个问题。和全局比对相比，局部比对在算法上只有一点不同。

\[ F(i, j)= max \left\{ \begin{aligned} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) + s(x_i, -) \\ F(i, j-1) + s(-, y_j) \\ 0 \end{aligned} \right. \]

5 局部比对python实现

import numpy

def costFunction(xc, yc):
    ''' 自定义一个罚分矩阵 
    2 to match, -6 to gap, -4 to mismatch 
    '''
    # 比对上match 
    if xc == yc:
        return 2
    # 空白 gap
    if xc == '-' or yc == '-': 
        return -6
    # 错配 mismatch 
    return -4

def localAlignment(x, y, s):
    '''
    Smith-Waterman algorithm
    Calculate local alignment values of sequences x and y using
    dynamic programming.  Return maximal local alignment value. 
    '''
    # 初始化二维数组记录子问题的解
    D = numpy.zeros((len(x)+1, len(y)+1), dtype=int)
    # 开始循环比对
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            # 这里选最大值是因为match的罚分是正数, gap和mismatch的是负数
            D[i, j] = max(D[i-1, j-1] + s(x[i-1], y[j-1]), # 对角线，  前i-1和j-1比对完成, 计算i,j位置的罚分
                          D[i-1, j  ] + s(x[i-1], '-'),    # 垂直方向，前i-1和j比对完成，计算i和gap的罚分
                          D[i  , j-1] + s('-',    y[j-1]), # 水平方向，前i和j-1比对完成，计算gap和j的罚分
                          0)                               # 空值
    argmax = numpy.where(D == D.max()) # 最大值的索引
    return D, int(D[argmax])

x, y = 'GGTATGCTGGCGCTA', 'TATATGCGGCGTTT'
D, best = localAlignment(x, y, exampleCost)
print(D)
print("Best score=%d, in cell %s" % (best, numpy.unravel_index(numpy.argmax(D), D.shape)))

SAM文件flags解释

Posted on 2023-12-03 Edited on 2024-01-07

SAM flag

SAM文件是二进制比对文件，其中FLAG值记录了该read的比对信息。 FLAG巧妙地采用二进制来存储信息，解读FLAG即可确定read属性，所以samtools常常依据FLAG来过滤处理SAM/BAM。

Python code

import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input', type=int, required=True,
                           help='flags in SAM')
    return parser.parse_args()


def flags_cal(flag: int):
    falgs = [0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x100, 0x200, 0x400, 0x800]
    flags_dict = {
        0x1: ['PAIRED', '..paired-end (or multiple-segment) sequencing technology'],
        0x2: ['PROPER_PAIR', '..each segment properly aligned according to the aligner'],
        0x4: ['UNMAP', '..segment unmapped'],
        0x8: ['MUNMAP', '..next segment in the template unmapped'],
        0x10: ['REVERSE', '..SEQ is reverse complemented'],
        0x20: ['MREVERSE', '..SEQ of the next segment in the template is reversed'],
        0x40: ['READ1', '..the first segment in the template'],
        0x80: ['READ2', '..the last segment in the template'],
        0x100: ['SECONDARY', '..secondary alignment'],
        0x200: ['QCFAIL', '..not passing quality controls'],
        0x400: ['DUPLICATE', '..PCR or optical duplicate'],
        0x800: ['SUPPLEMENTARY', '..supplementary alignment']
        }
    binary = bin(flag)
    index_falgs = list(map(int, list(binary[2:][::-1])))
    index_list = []
    hex_list = []
    print()
    print(f'Flags: {flag}')
    print()
    print("Hex\tDec\tProperty\tInformation") 
    Props = []
    for index,i in enumerate(index_falgs):
        if i == 1:
            index_list.append(index_list)
            hexs = falgs[index]
            ints = int(hexs)
            info = flags_dict[hexs]
            Props.append(info[0])
            print("{:<#x}\t{:<5d}\t{:<11s}\t{:<50s}".format(hexs, ints, info[0], info[1]))
            # print("%#x %s %s %s "%(hexs, hexs, info[0], info[1]))
    print()
    print("{}\t{:<5d}\t{}".format(hex(flag), flag, ",".join(Props)))


if __name__ == '__main__':
    args = parse_args()
    flags_cal(args.input)

执行程序即可解析

$ python flags_explain.py -i 1294 
Flags: 1294

Hex	Dec	Property	Information
0x2	2    	PROPER_PAIR	..each segment properly aligned according to the aligner
0x4	4    	UNMAP      	..segment unmapped                                
0x8	8    	MUNMAP     	..next segment in the template unmapped           
0x100	256  	SECONDARY  	..secondary alignment                             
0x400	1024 	DUPLICATE  	..PCR or optical duplicate                        

0x50e	1294 	PROPER_PAIR,UNMAP,MUNMAP,SECONDARY,DUPLICATE

## htslib

htslib

/*! @abstract the read is paired in sequencing, no matter whether it is mapped in a pair */
#define BAM_FPAIRED        1
/*! @abstract the read is mapped in a proper pair */
#define BAM_FPROPER_PAIR   2
/*! @abstract the read itself is unmapped; conflictive with BAM_FPROPER_PAIR */
#define BAM_FUNMAP         4
/*! @abstract the mate is unmapped */
#define BAM_FMUNMAP        8
/*! @abstract the read is mapped to the reverse strand */
#define BAM_FREVERSE      16
/*! @abstract the mate is mapped to the reverse strand */
#define BAM_FMREVERSE     32
/*! @abstract this is read1 */
#define BAM_FREAD1        64
/*! @abstract this is read2 */
#define BAM_FREAD2       128
/*! @abstract not primary alignment */
#define BAM_FSECONDARY   256
/*! @abstract QC failure */
#define BAM_FQCFAIL      512
/*! @abstract optical or PCR duplicate */
#define BAM_FDUP        1024
/*! @abstract supplementary alignment */
#define BAM_FSUPPLEMENTARY 2048

GFF和GTF中frame计算过程

Posted on 2023-12-01 In file format

GFF/GTF是基因组注释中常用的文件格式，两者的主要差别在最后一列。第八列中的frame，用来表示当前特征在CDS编码中的相关信息。

1. frame 可选值

0, 0表示当前编码的CDS是从当前特征的第一个碱基开始解码，即该CDS的第一个碱基是完整密码子的第一个碱基。
1, 1表示当前编码的CDS是从第二个碱基开始解码，即该CDS的第二个碱基是完整密码子的第一个碱基。
2, 2表示当前编码的CDS是从第三个碱基开始解码，即该CDS的第三个碱基是完整密码子的第一个碱基。

2. frame 计算

定义函数

def b(length, frame):
    '''frame calculate formula
    length - frame: 前面的CDS特征长度减去它自己的frame值，表示在前面CDS特征上的第一个完整的密码子开始的长度。
    ((length-frame) % 3): 前面CDS特征3'端最后一个完整密码子之后剩下的碱基数。
    (3- ((length-frame) % 3)): 是除去特征的3'端所代表的碱基后，密码子中剩余的碱基数目。
    (3- ((length-frame) % 3)) % 3: 把3变成0，因为三个碱基组成一个完整的密码子，1和2保持不变。
    
    params: 
      length: previous feature length (5'->3')
      frame: previous feature frame (5'->3')
    return: 
       next feature frame
    '''
    return (3- ((length-frame) % 3)) % 3

def cal_frame(lengths, strand):
    if strand == '-':
        lengths = lengths[::-1]
    frames = []
    for i in range(len(lengths)):
        if i == 0:
            frames.append(0)
        else:
            frame = b(lengths[i-1], frames[i-1])
            frames.append(frame)

    return frames

if __name__ == '__main__':
    # + example
    # length_list = [169,101,123,186,222,154,150,89,204,108,135,132,133,93,126,126,144,129,161]
    # strand = "+"
    
    # - example
    length_list = [278,778,231]
    strand = '-'
    print(cal_frame(length_list, strand))

对于正链 +

$ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff | grep AT1G01950.1 | awk '$3~/CDS/'
Chr1	Araport11	CDS	325473	325641	.	+	0	ID=AT1G01950:CDS:1;Parent=AT1G01950.1;Name=ARK2:CDS:1
Chr1	Araport11	CDS	325913	326013	.	+	2	ID=AT1G01950:CDS:2;Parent=AT1G01950.1;Name=ARK2:CDS:2
Chr1	Araport11	CDS	326106	326228	.	+	0	ID=AT1G01950:CDS:3;Parent=AT1G01950.1;Name=ARK2:CDS:3
Chr1	Araport11	CDS	326332	326517	.	+	0	ID=AT1G01950:CDS:4;Parent=AT1G01950.1;Name=ARK2:CDS:4
Chr1	Araport11	CDS	326594	326815	.	+	0	ID=AT1G01950:CDS:6;Parent=AT1G01950.1;Name=ARK2:CDS:6
Chr1	Araport11	CDS	326931	327084	.	+	0	ID=AT1G01950:CDS:7;Parent=AT1G01950.1;Name=ARK2:CDS:7
Chr1	Araport11	CDS	327237	327386	.	+	2	ID=AT1G01950:CDS:8;Parent=AT1G01950.1;Name=ARK2:CDS:8
Chr1	Araport11	CDS	327488	327576	.	+	2	ID=AT1G01950:CDS:9;Parent=AT1G01950.1;Name=ARK2:CDS:9
Chr1	Araport11	CDS	327663	327866	.	+	0	ID=AT1G01950:CDS:10;Parent=AT1G01950.1;Name=ARK2:CDS:10
Chr1	Araport11	CDS	328000	328107	.	+	0	ID=AT1G01950:CDS:11;Parent=AT1G01950.1;Name=ARK2:CDS:11
Chr1	Araport11	CDS	328188	328322	.	+	0	ID=AT1G01950:CDS:12;Parent=AT1G01950.1;Name=ARK2:CDS:12
Chr1	Araport11	CDS	328424	328555	.	+	0	ID=AT1G01950:CDS:14;Parent=AT1G01950.1;Name=ARK2:CDS:14
Chr1	Araport11	CDS	328898	329030	.	+	0	ID=AT1G01950:CDS:17;Parent=AT1G01950.1;Name=ARK2:CDS:17
Chr1	Araport11	CDS	329119	329211	.	+	2	ID=AT1G01950:CDS:18;Parent=AT1G01950.1;Name=ARK2:CDS:18
Chr1	Araport11	CDS	329311	329436	.	+	2	ID=AT1G01950:CDS:19;Parent=AT1G01950.1;Name=ARK2:CDS:19
Chr1	Araport11	CDS	329573	329698	.	+	2	ID=AT1G01950:CDS:20;Parent=AT1G01950.1;Name=ARK2:CDS:20
Chr1	Araport11	CDS	329781	329924	.	+	2	ID=AT1G01950:CDS:21;Parent=AT1G01950.1;Name=ARK2:CDS:21
Chr1	Araport11	CDS	330027	330155	.	+	2	ID=AT1G01950:CDS:22;Parent=AT1G01950.1;Name=ARK2:CDS:22
Chr1	Araport11	CDS	330243	330403	.	+	2	ID=AT1G01950:CDS:23;Parent=AT1G01950.1;Name=ARK2:CDS:23

$ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff | grep AT1G01950.1 | awk '$3~/CDS/' | awk '{print$5-$4+1}' | xargs | sed 's/ /,/g'

$ # 169,101,123,186,222,154,150,89,204,108,135,132,133,93,126,126,144,129,161
$ # [0, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2]

对于负链 -

$ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff  | grep AT1G01090.1 | awk '$3~/CDS/'
Chr1	Araport11	CDS	47705	47982	.	-	2	ID=AT1G01090:CDS:3;Parent=AT1G01090.1;Name=PDH-E1 ALPHA:CDS:3
Chr1	Araport11	CDS	48075	48852	.	-	0	ID=AT1G01090:CDS:2;Parent=AT1G01090.1;Name=PDH-E1 ALPHA:CDS:2
Chr1	Araport11	CDS	48936	49166	.	-	0	ID=AT1G01090:CDS:1;Parent=AT1G01090.1;Name=PDH-E1 ALPHA:CDS:1

$ cat Araport11_GFF3_genes_transposons.Mar92021.correct.gff  | grep AT1G01090.1 | awk '$3~/CDS/' | awk '{print$5-$4+1}' | xargs | sed 's/ /,/g'

$ # 278,778,231
$ # [0, 0, 2]

python函数递归

Posted on 2023-11-17 In python

python 函数递归 (recursion)

Python中能使用循环替代递归的尽量使用，但是有时候为了代码的可读性、简洁性必须使用递归。

例如求一个嵌套列表的和

a = [1,[2,[3,4],5],6,[7,8]] 
def sumtree(L: list) -> int:
    total = 0
    for x in L:
        if not isinstance(x, list):
            total += x          # 对于数字直接求和
        else:
            total += sumtree(x) # 递归调用自身，循环子列表
    return total

阶乘

# 1*2*3*4*5

def calc_factorial(x: int) -> int:
    if x == 1:
        return 1
    else:
        return (x * calc_factorial(x-1))

递归的优点

代码干净整洁，可读性强
将复杂的任务分解成简单的子任务

递归缺点

递归占用大量内存和时间
递归函数调试困难

ggplot桑基图和冲击图

Posted on 2023-11-16

桑基图是描述一组数据到另一组数据的流向图，可以观察数据流向的比例关系。冲击图是一种特殊类型的流图。

ggsankey

ggsankey是基于ggplot2开发的一个包，用于可视化桑基图和冲击图。

1 2	# install.packages("devtools") devtools::install_github("davidsjoberg/ggsankey")

ggsankey 构图元素

- 每一列表示一个stage, 每个stage有若干个node组成 - 相邻两个stage之间的node存在flow流的关系

图形参数

fill设置填充色，分为node.fii和flow.fill
color设置边框颜色，可分为node.color和flow.color
width设置node宽度
flow.alpha设置flow的透明度
space设置组内node的间距

测试数据

library(ggsankey)
library(dplyr)
library(ggplot2)

df <- mtcars %>%
  make_long(cyl, vs, am, gear, carb)

ggplot(df, aes(x = x, 
               next_x = next_x, 
               node = node, 
               next_node = next_node,
               fill = factor(node))) +
  geom_sankey()

标注node名称

主题设置

ggplot(df, aes(x = x, next_x = next_x, node = node, next_node = next_node, fill = factor(node), label = node)) +
  geom_sankey(flow.alpha = .6,
              node.color = "gray30") +
  geom_sankey_label(size = 3, color = "white", fill = "gray40") +
  scale_fill_viridis_d() +
  theme_sankey(base_size = 18) +
  labs(x = NULL) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = .5)) +
  ggtitle("Car features")

geom_alluvial冲击图冲积图与桑基图非常相似，但节点之间没有空间，并且以y = 0开始，而不是以x轴为中心。

ggplot(df, aes(x = x, next_x = next_x, node = node, next_node = next_node, fill = factor(node), label = node)) +
  geom_alluvial(flow.alpha = .6) +
  geom_alluvial_text(size = 3, color = "white") +
  scale_fill_viridis_d() +
  theme_alluvial(base_size = 18) +
  labs(x = NULL) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = .5)) +
  ggtitle("Car features")

统计学中的正态性检验

Posted on 2022-03-22 Edited on 2022-03-23

统计学检验的前提条件

大多数统计学检验（相关性，回归，t-test, ANOVA）都需要数据符合正态分布，这些检验也叫参数检验，因为它们依赖数据分布。在进行参数检验之前，我们要保证这些数据分布的假设能满足；不能满足的话，就进行非参检验。

预安装的R包

# dplyr
install.packages('dplyr')

# ggpubr
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")

install.packages("ggpubr")

密度图和QQ图，可视化查看是否满足正态分布

library("dplyr")
library("ggpubr")

set.seed(1234)
dplyr::sample_n(ToothGrowth, 10)

library("ggpubr")
ggdensity(ToothGrowth$len, 
          main = "Density plot of tooth length",
          xlab = "Tooth length")

1 2	library(ggpubr) ggqqplot(ToothGrowth$len)

正态性检验

前面可视化的查看是否符合正态分布通常是不可靠的，需要我们进行显著性检验，比较我们的数据分布和正太数据分布的差异。有两种正态性检验方法：Kolmogorov-Smirnov(K-S)检验和Shapiro-Wilk's 检验。零假设是：样本的数据分布是正态分布，检验如果显著，则数据是非正态分布。注意：正态性检验对样本数量是非常敏感的，小样本数目容易通过正态检验，因此要结合数据的可视化和检验来确定数据的分布。

1	shapiro.test(ToothGrowth$len)

统计学中的均值比较

Posted on 2022-03-11 Edited on 2022-03-14

均值比较（compare means）

在统计学中，我们常常需要比较两组或者多组之间的均值是否存在差异，计算差异的显著性，同时也要用图形的方式呈现。这里推荐一个比较实用的R包，ggpubr (publication ready plot in ggplot2)。 ggpubr作图上与ggplot2无缝对接，同时也可以添加显著性的P值，对文章发表非常实用。

ggpubr

# Instrall from CRAN
install.packages("ggpubr")

# or from github 
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")

均值比较的方法

t检验 T-test (需要假设数据分布为正态分布，有参检验）
Wilcoxon test (无参检验）
ANOVA test (需要假设数据分布为正态分布，有参检验）
Kruskal-Wallis test （无参检验）

作图数据集

> data("ToothGrowth")
> head(ToothGrowth)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5

比较两个独立的样本

p <- ggboxplot(ToothGrowth, x = "supp", y = "len",
          color = "supp", palette = "jco",
          add = "jitter")
#  Add p-value
p + stat_compare_means()
# Change method
p + stat_compare_means(method = "t.test")

成对样本的比较

ggpaired(ToothGrowth, x = "supp", y = "len",
         color = "supp", line.color = "gray", line.size = 0.4,
         palette = "jco")+
  stat_compare_means(paired = TRUE)

多组样本之间的比较（>=2）

# Default method = "kruskal.test" for multiple groups
ggboxplot(ToothGrowth, x = "dose", y = "len",
          color = "dose", palette = "jco")+
  stat_compare_means()

# Change method to anova
ggboxplot(ToothGrowth, x = "dose", y = "len",
          color = "dose", palette = "jco")+
  stat_compare_means(method = "anova")

指定比较组合

my_comparisons <- list( c("0.5", "1"), c("1", "2"), c("0.5", "2") )
ggboxplot(ToothGrowth, x = "dose", y = "len",
          color = "dose", palette = "jco")+ 
  stat_compare_means(comparisons = my_comparisons)+ # Add pairwise comparisons p-value
  stat_compare_means(label.y = 50)     # Add global p-value

指定bar的位置

ggboxplot(ToothGrowth, x = "dose", y = "len",
          color = "dose", palette = "jco")+ 
  stat_compare_means(comparisons = my_comparisons, label.y = c(29, 35, 40))+
  stat_compare_means(label.y = 45)

指定一个参考，进行多重比较

ggboxplot(ToothGrowth, x = "dose", y = "len",
          color = "dose", palette = "jco")+
  stat_compare_means(method = "anova", label.y = 40)+      # Add global p-value
  stat_compare_means(label = "p.signif", method = "t.test",
                     ref.group = "0.5")                    # Pairwise comparison against reference

对于全局（base-mean），进行多重比较

ggboxplot(ToothGrowth, x = "dose", y = "len",
          color = "dose", palette = "jco")+
  stat_compare_means(method = "anova", label.y = 40)+      # Add global p-value
  stat_compare_means(label = "p.signif", method = "t.test",
                     ref.group = ".all.")

一个真实的例子 DEPDC1 在proliferation组中显著上调，在Hyperdiploid 和 Low bone disease中显著下调。

# Load myeloma data from GitHub
myeloma <- read.delim("https://raw.githubusercontent.com/kassambara/data/master/myeloma.txt")
# Perform the test
compare_means(DEPDC1 ~ molecular_group,  data = myeloma,
              ref.group = ".all.", method = "t.test")

ggboxplot(myeloma, x = "molecular_group", y = "DEPDC1", color = "molecular_group", 
          add = "jitter", legend = "none") +
  rotate_x_text(angle = 45)+
  geom_hline(yintercept = mean(myeloma$DEPDC1), linetype = 2)+ # Add horizontal line at base mean
  stat_compare_means(method = "anova", label.y = 1600)+        # Add global annova p-value
  stat_compare_means(label = "p.signif", method = "t.test",
                     ref.group = ".all.")                      # Pairwise comparison against all

指定参数 hide.ns = TRUE, 隐藏ns。

ggboxplot(myeloma, x = "molecular_group", y = "DEPDC1", color = "molecular_group", 
          add = "jitter", legend = "none") +
  rotate_x_text(angle = 45)+
  geom_hline(yintercept = mean(myeloma$DEPDC1), linetype = 2)+ # Add horizontal line at base mean
  stat_compare_means(method = "anova", label.y = 1600)+        # Add global annova p-value
  stat_compare_means(label = "p.signif", method = "t.test",
                     ref.group = ".all.", hide.ns = TRUE)      # Pairwise comparison against all

分页

compare_means(len ~ supp, data = ToothGrowth, 
              group.by = "dose")

# Box plot facetted by "dose"
p <- ggboxplot(ToothGrowth, x = "supp", y = "len",
          color = "supp", palette = "jco",
          add = "jitter",
          facet.by = "dose", short.panel.labs = FALSE)
# Use only p.format as label. Remove method name.
p + stat_compare_means(label = "p.format")

1	p + stat_compare_means(label = "p.signif", label.x = 1.5)

成对组合在一张图中

p <- ggboxplot(ToothGrowth, x = "dose", y = "len",
               color = "supp", palette = "jco",
               add = "jitter")
p + stat_compare_means(aes(group = supp))

1	p + stat_compare_means(aes(group = supp), label = "p.format")

在分组后，成对比较

# Box plot facetted by "dose"
p <- ggpaired(ToothGrowth, x = "supp", y = "len",
          color = "supp", palette = "jco", 
          line.color = "gray", line.size = 0.4,
          facet.by = "dose", short.panel.labs = FALSE)
# Use only p.format as label. Remove method name.
p + stat_compare_means(label = "p.format", paired = TRUE)

其他图形

柱形图和线图

一个变量

ggbarplot(ToothGrowth, x = "dose", y = "len", add = "mean_se")+
  stat_compare_means() +                                         # Global p-value
  stat_compare_means(ref.group = "0.5", label = "p.signif",
                     label.y = c(22, 29))                   # compare to ref.group
# Line plot of mean +/-se
ggline(ToothGrowth, x = "dose", y = "len", add = "mean_se")+
  stat_compare_means() +                                         # Global p-value
  stat_compare_means(ref.group = "0.5", label = "p.signif",
                     label.y = c(22, 29))

两个变量

ggbarplot(ToothGrowth, x = "dose", y = "len", add = "mean_se",
          color = "supp", palette = "jco", 
          position = position_dodge(0.8))+
  stat_compare_means(aes(group = supp), label = "p.signif", label.y = 29)
ggline(ToothGrowth, x = "dose", y = "len", add = "mean_se",
       color = "supp", palette = "jco")+
  stat_compare_means(aes(group = supp), label = "p.signif", 
                     label.y = c(16, 25, 29))

逻辑回归

Posted on 2021-09-07 Edited on 2021-09-08

逻辑回归的前世今生

与传统的线性回归不同，逻辑回归在传统的线性回归的输出端加上一个logit函数，使得输出有连续变成离散，每个取值代表不同的类别，从回归任务转成了分类任务，所以逻辑回归在名字上还保留的回归字眼，但是却成了分类的代表之一。
一句话总结：逻辑回归假设数据分布服从伯努利分布，通过极大似然函数的方法，运用梯度下降来求解参数，来实现二分类。

1.逻辑回归的基本假设

第一个基本假设

假设数据分布服从伯努利分布。伯努利分布，对于抛硬币来说， $p$是硬币为正的概率，$1-p$为硬币为负的概率；同样，在逻辑回归的模型中，假设$h_{\theta}(x)$ 为样本正的概率，$1-h_{\theta}(x)$为样本为负的概率。整个模型如下： \[ h_\theta(x;\theta) = p \]

第二个假设

假设样本为正的概率为： \[ p = \frac{1}{1+e^{-\theta^{T}x}} \]

所以最终的表达式为： \[ h_\theta(x;\theta) = \frac{1}{1+e^{-\theta^{T}x}} \]

2.逻辑回归的损失函数

逻辑回归采用最大似然函数作为损失函数。

逻辑回归中$y\in {0,1}$。约定$\hat{y}=P(y=1|x)$。

对一个样本

可知 \[ P(y|x) = \begin{cases} \hat{y} , y=1, \\ 1 - \hat{y}, y=0 \end{cases} \]

写成一个函数 \[ P(y|x) = \hat{y}^{y}(1-\hat{y})^{1-y} \]

$\log$变换不改变函数的单调性，同时乘法变成加法简化计算，所以上式换个马甲，整理如下： \[ \log(P(y|x)) = y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \]

我们的目的就是最大化上述函数，使得模型更加拟合我们的数据。使用梯度下降法来求解，因此需要对上式取反，求最大值转化成求最小值。 \[ L(w,b) = -[y\log\hat{y} + (1-y)\log(1-\hat{y})] \]

全部样本 (training)

在全部训练集m上训练求得全部的损失函数。假设样本独立同分布，概率分布函数为每个样本的连乘。 \[ P(y|x_1,x_2,...,x_m) = \prod_{i=1}^{m} P(y|x_i) \] 上式取对数，连成转化成加法，加速计算，使用极大似然法求得概率最大时对应的参数。采取一个样本的计算处理，最小化函数取反，得:

\[ J(w,b) = -\sum_{i=1}^{m}y^{i}\log{\hat{y}^i} + (1-y^i)\log({1-\hat{y}^{i}}) \]

方差分析ANOVA

Posted on 2021-07-13 Edited on 2022-03-11

ANOVA

ANOVA, "Analysis of Variance",方差分析通常用来进行三组或者三组以上的独立样本的之间的均值检验。

ANOVA test

方差分析具备的条件 1. 在每个因子水平的数据中，观察值是独立的，随机的获取的。 2. 每个因子的数据是正态分布的。 3. 这些正太分布的数据有相同的方差。

One-way ANOVA test

One-way ANOVA test是独立两个样本t-test均值比较的拓展，它常常用来进行多组的比较。在one-way ANOVA中，会用一个变量(factor variable) 将数据分成多个组别，然后在来进行One-way方差分析。

One-way 的假设

零假设（Null hypothesis）: 不同组之间的均值无差异。
备择假设（Alternative hypothesis）：至少有一个样本均值和其他的不相同。

Two-way ANOVA

用来判断双因素影响变量，以及两个变量之间的是否能相互影响来相应变量。这时候来分组的因子变量（factor variable）有2个。

Two-way 假设

零假设： 1. 因子A分成的不同组别之间的均值没有差异 2. 因子B分成的不同组别之间的均值没有差异 3. 因子A和B之间没有相互影响备择假设：假设1和2他们的均值不同等。假设3，A和B之间存在相互作用

tanh激活函数

Posted on 2021-05-10

tanh

tanh 函数是机器学习常用的一种激活函数，公式如下： \[ f(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} \] tanh 的函数取值为(-1,1)，相比sigmoid的是(0,1). 两函数的图形形状类似。

求导

根据乘法法则，可知 \[ (uv)^{\prime} = u^{\prime}v + uv^{\prime} \]

所以对于$tanh(x)$

\[\begin{equation} \begin{aligned} tanh(x)^{\prime} &= \frac{d}{dx}(\frac{e^x-e^{-x}}{e^x+e{-x}}) \\ &= \frac{1}{e^x+e^{-x}}\frac{d}{dx}(e^x-e^{-x}) + (e^x-e^{-x})\frac{d}{dx}(\frac{1}{e^x+e^{-x}}) \\ &= \frac{e^x+e^{-x}}{e^x+e^{-x}} + (e^x-e^{-x})\frac{1}{(e^x+e^{-x})^2}(-1)(e^x-e^{-x}) \\ &= 1 - (\frac{e^x-e^{-x}}{e^x+e^{-x}})^2\\ &= 1 - (tanh(x))^2 \end{aligned} \end{equation}\]