5 关于使用nohup vcftools过滤vcf文件的问题

我用了组学大讲堂GWAS分析课程中的示例数据rice.raw.vcf.gz,并使用vcftools软件进行过滤,并试验了使用nohup和不使用nohup后得到的文件结果比较,使用nohup得到的vcf文件结果有异常不能进行后续分析,具体如下:

1)不使用nohup时的代码

vcftools --gzvcf rice.raw.vcf.gz --recode --recode-INFO-all \
    --stdout --maf 0.05 --max-missing 0.9 --minDP 2 --maxDP 1000 \
    --minQ 30 --minGQ 0 --min-alleles 2 --max-alleles 2 \
    --remove-indels | gzip > myresult/normal.vcf.gz &

2)使用nohup时的代码

nohup vcftools --gzvcf rice.raw.vcf.gz --recode --recode-INFO-all \
    --stdout --maf 0.05 --max-missing 0.9 --minDP 2 --maxDP 1000 \
    --minQ 30 --minGQ 0 --min-alleles 2 --max-alleles 2 \
    --remove-indels | gzip > myresult/nohup1.vcf.gz &

这两个文件得到的结果不一样:

1)不用nohup得到的normal.vcf.gz文件共120218行

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine.GenotypeGVCFs=<ID=GenotypeGVCFs,Version=3.6-0-g89b7209,Date="Thu Aug 06 01:28:08 CST 2020",Epoch=1596648488645,CommandLineOptions="analysis_type=GenotypeGVCFs input_file=[] showFullBamList=>
##GATKCommandLine=<ID=CombineGVCFs,CommandLine="CombineGVCFs  --output /share/nas1/zhenzq/project/reseq_177/ana/3.SNPCalling/variant_calling/cohort.1.20.2.g.vcf --variant /share/nas1/zhenzq/project/reseq_177>
##GVCFBlock0-1=minGQ=0(inclusive),maxGQ=1(exclusive)
##GVCFBlock1-2=minGQ=1(inclusive),maxGQ=2(exclusive)
##GVCFBlock10-11=minGQ=10(inclusive),maxGQ=11(exclusive)
##GVCFBlock11-12=minGQ=11(inclusive),maxGQ=12(exclusive)
...

2)使用nohup得到的nohup1.vcf.gz文件共120169行

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
        --gzvcf rice.raw.vcf.gz
        --recode-INFO-all
        --maf 0.05
        --max-alleles 2
        --maxDP 1e+03
        --min-alleles 2
        --minDP 2
        --minGQ 0
        --minQ 30
        --max-missing 0.9
        --recode
        --remove-indels
        --stdout
Using zlib version: 1.2.11
Warning: Expected at least 2 parts in FORMAT entry: ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another>
Warning: Expected at least 2 parts in FORMAT entry: ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects reco>
Warning: Expected at least 2 parts in FORMAT entry: ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
Warning: Expected at least 2 parts in FORMAT entry: ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT all>
Warning: Expected at least 2 parts in INFO entry: ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT all>
Warning: Expected at least 2 parts in INFO entry: ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT al>
Warning: Expected at least 2 parts in INFO entry: ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT al>
After filtering, kept 176 out of 176 Individuals
Outputting VCF file...
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
...
问题:
1)nohup1.vcf.gz文件对于下游分析肯定是不能用的,为什么多了`##`前面那些行的内容?
2)nohup1.vcf.gz文件其中的Warning部分是什么意思?为什么不使用nohup时候给出这些Warning?
3)nohup1.vcf.gz文件前面多了一些内容,直觉上应该是比不使用nohup得到的normal.vcf.gz文件行数多,但测试下来反而少,也就是后面正式snp分型的内容肯定有出入。是不是和前面的Warning有关?正常的normal.vcf.gz是不是“真”地没有问题?
4)如果使用`nohup`命令时要得到normal.vcf.gz同样的文件,正确的写法是什么?
恳请各位老师和有类似经验解决的大牛不吝赐教!谢谢!
请先 登录 后评论

1 个回答

omicsgene - 生物信息
擅长:重测序,遗传进化,转录组,GWAS
nohup vcftools --gzvcf rice.raw.vcf.gz --recode --recode-INFO-all \
    --stdout --maf 0.05 --max-missing 0.9 --minDP 2 --maxDP 1000 \
    --minQ 30 --minGQ 0 --min-alleles 2 --max-alleles 2 \
    --remove-indels 2>vcftools.log| gzip - >myresult/nohup1.vcf.gz &

试试上面的这个命令,并且看看这个文章:https://www.omicsclass.com/article/1725

请先 登录 后评论