What is a HAP file?
HAP is a text file format. A HAP file contains several meta-information lines that begin with ##, one line containing only the # sign, one header line, and the data lines. Each data line contains the information for one SNP, including the chromosome ID, the position of the SNP on the chromosome, the rs ID, and the haplotypes of each individual.
#CHROM POS ID NA00001 NA00002
1 126113 . A:A A:A
1 535131 . G:T G:T
1 567239 . C:C C:C
1 570254 . G:A G:A
1 592368 . G:A G:A
The meta-information lines are the lines that begin with the ## string. A single ‘fileformat’ line is required, and it must be the first line in the data file. For example, for HAP version 1.0, this line should read: ##fileformat=HAPv1.0 The other meta-information lines are not required for AncestryHub to function, but we strongly encourage users to use them to provide any informative messages.
The header line names the 3 mandatory columns (CHROM, POS, ID) and additional columns of sample IDs. Duplicate sample IDs are not allowed. The header line is tab-delimited.
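The header rules above can be checked before upload. The following is a minimal sketch, not part of AncestryHub: the function name check_hap_header is hypothetical, and it assumes the file has already been read into a list of lines.

```python
def check_hap_header(lines):
    """Return the sample IDs from a HAP header, or raise ValueError.

    Hypothetical helper: it only checks the rules described in this section.
    """
    lines = [ln.rstrip("\n") for ln in lines]

    # The 'fileformat' line is required and must be the first line.
    if not lines or not lines[0].startswith("##fileformat="):
        raise ValueError("first line must be a ##fileformat line")

    # The header line starts with a single '#' followed by CHROM.
    header = next((ln for ln in lines if ln.startswith("#CHROM")), None)
    if header is None:
        raise ValueError("no header line found")

    # The header line is tab-delimited; the first 3 columns are mandatory.
    columns = header.lstrip("#").split("\t")
    if columns[:3] != ["CHROM", "POS", "ID"]:
        raise ValueError("header must start with CHROM, POS, ID")

    # Duplicate sample IDs are not allowed.
    samples = columns[3:]
    duplicates = sorted({s for s in samples if samples.count(s) > 1})
    if duplicates:
        raise ValueError("duplicate sample IDs: " + ", ".join(duplicates))
    return samples
```

Calling check_hap_header(open("my_data.hap").readlines()) on a valid file would return the list of sample IDs; my_data.hap is a placeholder file name.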
All data lines are tab-delimited.
block within the HAP file. This information is required.
chromosome (Integer, Required) This information is required. Please note that the current version, AncestryHub 1.0, only supports GRCh37 (hg19) coordinates.
identifier available, then a dot (‘.’) can be put here to represent a missing value.
separated by ‘:’. For diploid calls, it should be A:A, A:G, etc. For haploid calls, e.g. on the Y chromosome, male non-pseudoautosomal X, or mitochondrial chromosomes, only one allele should be given. Triploid calls are not supported by this version of AncestryHub. For missing values, where an allele call cannot be made at a given locus, ‘.’ should be specified for each missing allele, for example ‘.:.’ for a diploid genotype and ‘.’ for a haploid genotype. In all cases, missing values are specified with a dot (‘.’).
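To illustrate how the data-line fields fit together, here is a small sketch of a parser. The function names parse_genotype and parse_data_line are hypothetical, and the choice to represent a missing value as None is an assumption made for this example.

```python
def parse_genotype(field):
    """Split a genotype field such as 'A:G' into a tuple of alleles.

    A missing allele ('.') is returned as None. Only haploid (one allele)
    and diploid (two alleles) calls are accepted.
    """
    alleles = field.split(":")
    if len(alleles) not in (1, 2):
        raise ValueError(f"unsupported ploidy in {field!r}")
    return tuple(None if allele == "." else allele for allele in alleles)


def parse_data_line(line):
    """Parse one tab-delimited HAP data line into a dictionary."""
    chrom, pos, snp_id, *genotypes = line.rstrip("\n").split("\t")
    return {
        "chrom": chrom,                              # kept as text in this sketch
        "pos": int(pos),                             # position on the chromosome
        "id": None if snp_id == "." else snp_id,     # '.' means no identifier
        "genotypes": [parse_genotype(g) for g in genotypes],
    }
```

For example, parse_genotype(".:.") gives (None, None) and parse_genotype("A:G") gives ("A", "G").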
How to create a gz compressed file?
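One possible way, sketched here with Python's standard gzip module, is to copy the file into a gzip stream. The function name gzip_file and the default of appending ".gz" to the file name are assumptions for illustration. On Linux or macOS the gzip command-line tool does the same job.

```python
import gzip
import shutil


def gzip_file(src_path, dest_path=None):
    """Compress src_path with gzip and return the path of the compressed file.

    Hypothetical helper: if dest_path is not given, '.gz' is appended to
    the source file name. The original file is left in place.
    """
    dest_path = dest_path or src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the data without loading it all
    return dest_path
```

Calling gzip_file("my_data.hap") would write my_data.hap.gz next to the original; the file name is a placeholder.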
What are the common issues with the input files?
1) File size. We require that each file is not larger than 1 GB;
HAP format check
Quality control statistics: duplicate sites, SNPs removed, non-SNP sites, monomorphic sites, MAF check.
Check the number of SNPs for each chromosome or specific area (2000 SNPs are required for each chromosome or specific area).
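The per-chromosome SNP count can be checked locally before upload. This is a rough sketch, not AncestryHub's own check: the function names are hypothetical, and reading the 2000-SNP requirement as a minimum per chromosome is an assumption.

```python
from collections import Counter


def snps_per_chromosome(lines):
    """Count data lines (one SNP each) per chromosome in a HAP file."""
    counts = Counter()
    for line in lines:
        # Skip meta-information lines, the header line, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t", 1)[0]  # first tab-delimited column
        counts[chrom] += 1
    return counts


def chromosomes_below_threshold(counts, minimum=2000):
    """Return the chromosomes with fewer SNPs than the assumed minimum."""
    return sorted(chrom for chrom, n in counts.items() if n < minimum)
```

Passing the lines of a file to snps_per_chromosome and the result to chromosomes_below_threshold lists the chromosomes that would likely fail this check.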