What is a HAP file?

The HAP format
HAP is a text file format. It contains several meta-information lines beginning with ##, one line containing only the # sign, one header line, and then the data lines. Each data line contains the information for one SNP, including the chromosome ID, the position of the SNP on the chromosome, the rs ID, and the haplotypes of each individual.
1. An example of the HAP format.
##fileformat=HAPv1.0
#CHROM POS ID NA00001 NA00002
1 126113 . A:A A:A
1 535131 . G:T G:T
1 567239 . C:C C:C
1 570254 . G:A G:A
1 592368 . G:A G:A
2. Meta-information lines.
The meta-information lines are the lines that begin with the ## string. A single ‘fileformat’ line is required, and it must be the first line in the data file. For example, for HAP version 1.0, this line should read:
##fileformat=HAPv1.0
The other meta-information lines are not required for AncestryHub to function, but we strongly encourage users to use these lines to provide any informative messages.
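As a rough illustration of the rule above, the following Python sketch collects the leading ## lines and checks that the first one is the fileformat line. The function names are hypothetical and are not part of AncestryHub; the expected value HAPv1.0 is taken from the example above.

```python
def read_meta_lines(lines):
    """Return the leading '##' lines with the '##' prefix removed."""
    meta = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("##"):
            break  # meta-information lines end at the first non-'##' line
        meta.append(line[2:])
    return meta


def check_fileformat(lines, expected="HAPv1.0"):
    """Raise ValueError unless the first line is '##fileformat=<expected>'."""
    meta = read_meta_lines(lines)
    if not meta or meta[0] != "fileformat=" + expected:
        raise ValueError("first line must be ##fileformat=" + expected)
    return meta
```

Passing an open file object as lines works as well, since only the leading lines are read before the loop stops.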
3. Header line syntax.
The header line names the 3 mandatory columns (CHROM, POS, ID) and additional columns of sample IDs. Duplicate sample IDs are not allowed. The header line is tab-delimited.
4. Data lines.
All data lines are tab-delimited.
1) CHROM: The ID of the chromosome where a SNP is located. All entries for a specific CHROM should form a contiguous block within the HAP file. This information is required.
2) POS: The position of a SNP along a chromosome (integer). Positions must be sorted numerically in increasing order within each chromosome. This information is required. Please note that the current version of AncestryHub (1.0) only supports GRCh37 (hg19) coordinates.
3) ID: The rs ID of each SNP. Please note that no identifier should be present in more than one data line. If there is no identifier available, a dot (‘.’) can be put here to represent a missing value.
4) Haplotypes: An individual's two haplotypes should be put in one column named with the corresponding sample ID, separated by ‘:’. For diploid calls, it should be A:A, A:G, etc. For haploid calls, e.g. on the Y chromosome, male non-pseudoautosomal X, or the mitochondrial chromosome, only one allele should be given. Triploid calls are not supported by this version of AncestryHub. Where an allele call cannot be made at a given locus, ‘.’ should be specified for each missing allele, for example ‘.:.’ for a diploid call and ‘.’ for a haploid call.
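The header and data-line rules above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the header is assumed to start with a single # as in the example, and this is not the checker that AncestryHub itself runs.

```python
MANDATORY_COLUMNS = ["CHROM", "POS", "ID"]


def parse_header(line):
    """Return the sample IDs named in a tab-delimited header line."""
    fields = line.rstrip("\r\n").lstrip("#").split("\t")
    if fields[:3] != MANDATORY_COLUMNS:
        raise ValueError("header must start with CHROM, POS, ID")
    samples = fields[3:]
    if len(set(samples)) != len(samples):
        raise ValueError("duplicate sample IDs are not allowed")
    return samples


def parse_data_line(line, n_samples):
    """Split one data line into (chrom, pos, snp_id, calls)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3 + n_samples:
        raise ValueError("expected %d columns, found %d" % (3 + n_samples, len(fields)))
    chrom, pos, snp_id = fields[:3]
    if not (pos.isascii() and pos.isdigit()):
        raise ValueError("POS must be an integer: %r" % pos)
    calls = []
    for call in fields[3:]:
        alleles = tuple(call.split(":"))
        # One allele (haploid) or two (diploid); triploid calls are rejected.
        if len(alleles) > 2 or "" in alleles:
            raise ValueError("bad haplotype call: %r" % call)
        calls.append(alleles)
    return chrom, int(pos), snp_id, calls


def check_order_and_ids(records):
    """Check CHROM contiguity, non-decreasing POS, and unique IDs."""
    finished_chroms = set()
    seen_ids = set()
    prev_chrom, prev_pos = None, None
    for chrom, pos, snp_id, _calls in records:
        if chrom != prev_chrom:
            if chrom in finished_chroms:
                raise ValueError("entries for CHROM %s are not contiguous" % chrom)
            if prev_chrom is not None:
                finished_chroms.add(prev_chrom)
            prev_chrom, prev_pos = chrom, None
        if prev_pos is not None and pos < prev_pos:
            raise ValueError("positions are not sorted on CHROM %s" % chrom)
        prev_pos = pos
        if snp_id != ".":  # '.' marks a missing ID and may repeat
            if snp_id in seen_ids:
                raise ValueError("duplicate ID: %s" % snp_id)
            seen_ids.add(snp_id)
```

In this sketch, equal positions on the same chromosome are not rejected; that choice is an assumption, and such sites would be caught by a separate duplicate-site check.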

How do I create a gz-compressed file?

If you need help creating files in this format, please click here.
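For readers who have Python 3 available, one possible way to produce a gzip-compressed copy of a file uses the standard-library gzip module. This is a minimal sketch; the function name gzip_file and the file name sample.hap are hypothetical.

```python
import gzip
import shutil


def gzip_file(src_path, dest_path=None):
    """Write a gzip-compressed copy of src_path and return its path."""
    if dest_path is None:
        dest_path = src_path + ".gz"
    # Stream the bytes so that large files are not loaded into memory.
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path


# Example (hypothetical file name):
# gzip_file("sample.hap")  # writes sample.hap.gz next to the original
```

On systems with the gzip command-line tool, running gzip sample.hap achieves a similar result, replacing the original file with sample.hap.gz.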

What are the common issues with input files?

1) File size. Each file must be no larger than 1 GB.
2) File format. We will check the following:

  • HAP format check
  • Quality control statistics: duplicate sites, SNPs removed, NonSNP sites, monomorphic sites, MAF check.
  • Check the number of SNPs for each chromosome or specific area (2000 SNPs are required for each chromosome or specific area).
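To check a file before uploading, the checks listed above could be approximated along the following lines. This Python sketch rests on several assumptions that the text does not confirm: the size limit is read as 1024^3 bytes, the 2000-SNP requirement is treated as a minimum, MAF is taken as the frequency of the second most common allele, and all function names are hypothetical. The thresholds and definitions that AncestryHub actually applies may differ.

```python
import os
from collections import Counter

MAX_BYTES = 1024 ** 3   # assumption: the size limit means 1024^3 bytes
MIN_SNPS = 2000         # assumption: 2000 is a minimum per chromosome


def file_size_ok(path, max_bytes=MAX_BYTES):
    """True if the file at path is no larger than max_bytes."""
    return os.path.getsize(path) <= max_bytes


def allele_counts(calls):
    """Count non-missing alleles in calls such as ['A:G', '.:G', 'A']."""
    counts = Counter()
    for call in calls:
        for allele in call.split(":"):
            if allele != ".":
                counts[allele] += 1
    return counts


def is_monomorphic(calls):
    """True if at most one distinct allele is observed at the site."""
    return len(allele_counts(calls)) <= 1


def minor_allele_frequency(calls):
    """Frequency of the second most common allele (0.0 if there is none)."""
    counts = allele_counts(calls)
    if len(counts) < 2:
        return 0.0
    return sorted(counts.values())[-2] / sum(counts.values())


def duplicate_sites(records):
    """Return (chrom, pos) pairs that occur more than once.

    Each record is a tuple (chrom, pos, calls).
    """
    seen = Counter((chrom, pos) for chrom, pos, _calls in records)
    return sorted(site for site, n in seen.items() if n > 1)


def chromosomes_below_minimum(records, minimum=MIN_SNPS):
    """Return chromosomes that have fewer than minimum SNPs."""
    per_chrom = Counter(chrom for chrom, _pos, _calls in records)
    return sorted(c for c, n in per_chrom.items() if n < minimum)
```

Sites flagged by is_monomorphic or with a low minor_allele_frequency value are the kind that the quality-control step would be expected to report.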