Using vcftools

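vcftools is a tool for converting and filtering VCF and BCF files. Many of its filters and calculations could be reimplemented in a custom Perl or Python script, but such scripts generally run slower than vcftools itself.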

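Oddly, the command itself does not print a detailed list of options on Linux, so the parameters have to be looked up in the online manual (if the man page was installed along with the program, man vcftools shows the same content).
Reference: vcftools manual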

Basic options

Input options

--vcf <input_filename> A VCF file in version 4.0, 4.1, or 4.2 format

--gzvcf <input_filename> A gzip-compressed VCF file

--bcf <input_filename> A BCF2 file
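Since the three input options differ only in the file type they expect, a small wrapper can pick the right one. The following is a minimal sketch: `vcf_input_flag` is a hypothetical helper (not part of vcftools), it decides purely by file extension, and `input_file.vcf.gz` is a placeholder name. The command is printed rather than executed.

```shell
# Hypothetical helper (not part of vcftools): choose the input option
# from the file extension. Assumes files are named consistently.
vcf_input_flag() {
    case "$1" in
        *.vcf.gz) printf '%s\n' "--gzvcf" ;;
        *.bcf)    printf '%s\n' "--bcf" ;;
        *)        printf '%s\n' "--vcf" ;;
    esac
}

input="input_file.vcf.gz"   # placeholder file name
flag=$(vcf_input_flag "$input")

# Print the command that would be run instead of running it.
printf 'vcftools %s %s --freq --out freq_check\n' "$flag" "$input"
```

Deciding by extension is only a convenience; a file with a misleading name would still need the correct option chosen by hand.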

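Output options

--out <output_prefix> Prefix used for the names of all output files (default: out)

--stdout Write the output to standard output so that it can be piped or redirected

--temp <temporary_directory> Directory for temporary files (this is not the directory where results are written)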

Filtering options

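Filtering by position

--chr <chromosome>

--not-chr <chromosome>
Include or exclude sites on the matching chromosome

--from-bp <integer>

--to-bp <integer>
Lower and upper bounds of the range of sites to process; these two options must be used together with --chr

--positions <filename>

--exclude-positions <filename>
Include or exclude a set of sites based on a list of positions in a file. Each line of the input file should contain a (tab-separated) chromosome and position
...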
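The positions file format described above is easy to get wrong (spaces instead of tabs, extra columns), so it can help to generate and check it before filtering. A minimal sketch, assuming made-up coordinates and a placeholder input named `input_file.vcf`; the vcftools call only runs if both the program and that file are present.

```shell
# Placeholder coordinates; replace with real chromosome/position pairs.
positions_file="SNP_list.txt"
printf '1\t10583\n1\t10611\n2\t20045\n' > "$positions_file"

# Sanity check: count lines that do not have exactly two tab-separated fields.
bad_lines=$(awk -F'\t' 'NF != 2 { n++ } END { print n + 0 }' "$positions_file")
echo "malformed lines: $bad_lines"

# Run only when vcftools and the (placeholder) input file exist.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --positions "$positions_file" \
        --recode --out subset
fi
```

Swapping `--positions` for `--exclude-positions` would drop the listed sites instead of keeping them.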

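Filtering by site ID

--snp <string> Include SNPs whose ID matches the string, for example a dbSNP ID (which makes this mostly useful for human genomes); the option can be given multiple times

--snps <filename>

--exclude <filename>
Include or exclude the SNPs listed in a file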

Filtering by variant type

--keep-only-indels Keep only indel sites

--remove-indels Remove indel sites

Filtering by FILTER flag

--remove-filtered-all Removes all sites with a FILTER flag other than PASS.

--keep-filtered <string>

--remove-filtered <string>

Filtering by INFO

--keep-INFO <string>

--remove-INFO <string>

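Filtering by allele

--maf <float> Minimum minor allele frequency (MAF) a site must have to be kept

--max-maf <float> Maximum MAF a site may have to be kept

Many options are omitted here; see the vcftools website for the full list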
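To make the MAF thresholds concrete, the quantity being compared can be computed by hand on a toy file. This is a sketch under stated assumptions, not a description of how vcftools computes it internally: the VCF content is made up, sites are assumed biallelic, GT is assumed to be the first FORMAT field, and `site_maf` is a hypothetical helper name.

```shell
# Tiny made-up VCF (3 diploid samples), written with printf so that the
# tab separators are explicit.
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
    printf '1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1\n'
    printf '1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t0/0\t0/1\n'
    printf '1\t300\t.\tG\tA\t50\tPASS\t.\tGT\t0/0\t./.\t0/0\n'
} > toy.vcf

# Hypothetical helper: print CHROM, POS and the minor allele frequency,
# counting only called alleles (missing alleles "." are skipped).
site_maf() {
    awk -F'\t' '
        /^#/ { next }
        {
            alt = 0; called = 0
            for (i = 10; i <= NF; i++) {
                split($i, f, ":")            # GT assumed to be first
                n = split(f[1], a, "[/|]")   # unphased or phased alleles
                for (j = 1; j <= n; j++) {
                    if (a[j] == ".") continue
                    called++
                    if (a[j] != "0") alt++
                }
            }
            if (called == 0) next
            freq = alt / called
            maf = (freq > 0.5) ? 1 - freq : freq
            printf "%s\t%s\t%.3f\n", $1, $2, maf
        }' "$1"
}

site_maf toy.vcf
```

On this toy data the three sites get MAF 0.500, 0.167 and 0.000, so a threshold such as `--maf 0.2` would be expected to retain only the first site.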

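Filtering by genotype values

--min-meanDP <float>

--max-meanDP <float> Filter sites by mean sequencing depth

--hwe <float>

--max-missing <float> Completeness (call rate) threshold, between 0 and 1: 0 allows sites that are completely missing, 1 allows no missing data at all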
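Because the name --max-missing suggests the opposite of what the value means, a small worked example may help. The sketch below computes the per-site call rate (fraction of non-missing genotypes) on made-up data and compares it with a threshold. It assumes GT is the first FORMAT field and treats any genotype starting with "." as missing; `call_rate_filter` is a hypothetical helper, and the keep/drop rule is my reading of the option, not code taken from vcftools.

```shell
# Made-up genotypes for 4 samples at 3 sites.
{
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4\n'
    printf '1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1\t0/0\n'
    printf '1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t./.\t0/1\t0/0\n'
    printf '1\t300\t.\tG\tA\t50\tPASS\t.\tGT\t./.\t./.\t0/0\t./.\n'
} > toy_missing.vcf

threshold=0.75   # illustrative value for --max-missing

# Hypothetical helper: print CHROM, POS, call rate and a keep/drop verdict.
call_rate_filter() {
    awk -F'\t' -v min_rate="$2" '
        /^#/ { next }
        {
            called = 0
            for (i = 10; i <= NF; i++) {
                split($i, f, ":")
                if (f[1] !~ /^\./) called++   # genotype is not missing
            }
            rate = called / (NF - 9)          # NF - 9 = number of samples
            verdict = (rate >= min_rate) ? "keep" : "drop"
            printf "%s\t%s\t%.2f\t%s\n", $1, $2, rate, verdict
        }' "$1"
}

call_rate_filter toy_missing.vcf "$threshold"
```

With the threshold at 0.75, the sites with call rates 1.00 and 0.75 are marked keep and the site with call rate 0.25 is marked drop; raising the threshold to 1 would leave only the fully genotyped site.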

Filtering by individual (sample)

--indv <string>

--remove-indv <string>

--keep <filename>

--remove <filename>

--max-indv <integer>
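The files passed to --keep and --remove list one individual ID per line, and those IDs have to match the sample columns in the VCF header. One way to avoid typos is to derive the list from the header itself. A minimal sketch, assuming a made-up header and a hypothetical helper `list_samples`; the final vcftools command is shown as a comment because `input_file.vcf` is a placeholder.

```shell
# Made-up VCF header with three samples.
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
} > header_only.vcf

# Hypothetical helper: print one sample ID per line, taken from the
# #CHROM header line (sample columns start at field 10).
list_samples() {
    awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1"
}

# Build a keep list containing every sample except S2.
list_samples header_only.vcf | grep -v '^S2$' > keep.txt
cat keep.txt

# The list could then be used like this (placeholder file names):
# vcftools --vcf input_file.vcf --keep keep.txt --recode --out kept
```

The same file passed to `--remove` would have the opposite effect and drop S1 and S3.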

Genotype filtering options

--remove-filtered-geno-all Exclude genotypes whose flag is neither '.' nor 'PASS'

--remove-filtered-geno <string> Exclude genotypes with the given flag

--minGQ <float> Exclude genotypes with a GQ below this value

--minDP <float>

--maxDP <float>

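Statistics

Nucleotide diversity statistics

--site-pi Nucleotide diversity calculated per site, for all SNPs

--window-pi <integer>

--window-pi-step <integer>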

FST calculation

--weir-fst-pop <filename>

--fst-window-size <integer>

--fst-window-step <integer>
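An FST comparison needs one --weir-fst-pop file per population, each listing the individuals of that population one per line. The sketch below prepares two such files and checks that no individual is listed in both before calling vcftools. Sample IDs, the window settings (100 kb windows, 50 kb step) and `input_file.vcf` are illustrative placeholders, and the vcftools call only runs if the program and the input file are present.

```shell
# Made-up population assignments, one individual ID per line.
printf 'S1\nS2\n' > pop1.txt
printf 'S3\nS4\n' > pop2.txt

# Guard against an individual appearing in both populations.
overlap=$(sort pop1.txt pop2.txt | uniq -d | awk 'END { print NR }')
echo "individuals listed in both files: $overlap"

# Run only when the lists are disjoint and vcftools plus the
# (placeholder) input file are available.
if [ "$overlap" -eq 0 ] && command -v vcftools >/dev/null 2>&1 \
        && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf \
        --weir-fst-pop pop1.txt --weir-fst-pop pop2.txt \
        --fst-window-size 100000 --fst-window-step 50000 \
        --out pop1_vs_pop2
fi
```

Leaving out the two window options would give per-site estimates instead of windowed ones.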

Other statistics

--het

--hardy

--site-quality Extracts the QUAL value of each site in the VCF file

--missing-indv

--missing-site Calculates the missing rate of each site

vcftools --vcf test.recode.vcf --missing-site  --out ms
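The per-site missingness report can be post-processed with standard tools. The following sketch lists sites whose missing fraction exceeds a cutoff. It assumes the report is a tab-separated table with a header line containing a column named F_MISS, which is how I recall the .lmiss output being laid out; the table here is made up, so check the column names against a real ms.lmiss file. `high_missing_sites` is a hypothetical helper.

```shell
# Made-up table imitating the assumed layout of a .lmiss report.
{
    printf 'CHR\tPOS\tN_DATA\tN_GENOTYPE_FILTERED\tN_MISS\tF_MISS\n'
    printf '1\t100\t8\t0\t0\t0\n'
    printf '1\t200\t8\t0\t2\t0.25\n'
    printf '1\t300\t8\t0\t6\t0.75\n'
} > toy.lmiss

# Hypothetical helper: print CHR, POS and F_MISS for sites above the
# cutoff. The F_MISS column is located by name, not by position.
high_missing_sites() {
    awk -F'\t' -v max_miss="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "F_MISS") col = i
            next
        }
        col && $col > max_miss { print $1 "\t" $2 "\t" $col }
    ' "$1"
}

high_missing_sites toy.lmiss 0.2
```

On the toy table this prints the sites at positions 200 and 300. The first two output columns are already in the chromosome/position format expected by --exclude-positions.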

--SNPdensity <integer> Calculates the SNP density in bins of the given size

... there are too many options to list here; see the reference manual for details

Output format

--recode

--recode-bcf

--recode-INFO <string>

--recode-INFO-all

--contigs <string>

Format conversion

--012

--IMPUTE

--ldhat

--ldhat-geno

--BEAGLE-GL

--BEAGLE-PL

--plink

vcftools --vcf all.filter.vcf --plink --out aa

--plink-tped

--chrom-map <filename>

Comparison options

DIFF VCF FILE

--diff <filename>

--gzdiff <filename>

--diff-bcf <filename>

--diff-site

--diff-indv

--diff-site-discordance

--diff-indv-discordance

--diff-indv-map <filename>

--diff-discordance-matrix

--diff-switch-error

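Examples

1. Output the allele frequency of every site on chromosome 1 in the input VCF file

vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis

2. Write a new VCF file from the input VCF file with all indel sites removed

vcftools --vcf input_file.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only

3. Write a file comparing the sites in two VCF files

vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz --diff-site --out in1_v_in2

4. Write a new VCF file to standard output with every site that has a FILTER flag other than PASS removed, then compress it with gzip

vcftools --gzvcf input_file.vcf.gz --remove-filtered-all --recode --stdout | gzip -c > output_PASS_only.vcf.gz

5. Output the Hardy-Weinberg p-value for each site in a BCF file, restricted to sites with no missing genotypes

vcftools --bcf input_file.bcf --hardy --max-missing 1.0 --out output_noMissing

6. Output the nucleotide diversity at a list of positions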

zcat input_file.vcf.gz | vcftools --vcf - --site-pi --positions SNP_list.txt --out nucleotide_diversity

Reposted from: https://www.jianshu.com/p/badd24cbc538


Published 2019-04-26 14:41
Category: Software tools