DIRECTORY STRUCTURE:
sequences/${ENCODE_REGION}/${COMMON_NAME}.${ENCODE_REGION}.fa
alignments/${ALIGNER}/*.maf
alignments/${ALIGNER}/${CONSERVATION}/
Each FASTA file will have all the sequence entries for a given species/region.
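As a minimal sketch, the FASTA path pattern above can be assembled as follows. The function name fasta_path is hypothetical and not part of this data set; the example values 'rat' and 'ENr223' are taken from the sample header in this file.

```python
from pathlib import PurePosixPath


def fasta_path(common_name: str, encode_region: str) -> PurePosixPath:
    """Build sequences/${ENCODE_REGION}/${COMMON_NAME}.${ENCODE_REGION}.fa.

    Hypothetical helper: it only mirrors the path pattern listed under
    DIRECTORY STRUCTURE and does not check that the file exists.
    """
    file_name = f"{common_name}.{encode_region}.fa"
    return PurePosixPath("sequences") / encode_region / file_name


print(fasta_path("rat", "ENr223"))
```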
HEADER STRUCTURE:
>${COMMON_NAME}|${ENCODE_REGION}|${FREEZE_DATE}|${NCBI_TAXON_ID}|${ASSEMBLY_PROVIDER}|${ASSEMBLY_DATE}|${ASSEMBLY_ID}|${CHROMOSOME}|${CHROMOSOME_START}|${CHROMOSOME_END}|${CHROMOSOME_LENGTH}|${STRAND}|${ACCESSION}.${VERSION}|${NUM_BASES}|${NUM_N}|${THIS_CONTIG_NUM}|${TOTAL_NUM_CONTIGS}|${COMMENT}
Where:
${COMMON_NAME} like 'baboon' or 'dusky_titi'
${ENCODE_REGION} like 'ENm001' or 'ENr223'
${FREEZE_DATE} like 'AUG-2004'; latest date for inclusion in this freeze of the set of sequences encompassing the ENCODE regions
${NCBI_TAXON_ID} like '9555' or '9523'
${ASSEMBLY_PROVIDER} like 'NISC' or 'RGSC'
${ASSEMBLY_DATE} like 'NOV-2003' or '21-JUN-2003'; Date associated with the specific sequence assembly represented in this ENCODE freeze
${ASSEMBLY_ID} like 'rn3' or 'panTro1'
${CHROMOSOME} like 'chr1' or 'chr19_random'
${CHROMOSOME_START} [1 based]
${CHROMOSOME_END} [1 based]
${CHROMOSOME_LENGTH} length of entire ${CHROMOSOME}
${STRAND} as in '+' or '-' indicating whether the sequence came from the top or bottom DNA strand
${ACCESSION}.${VERSION} like 'NT_107546.1'
${NUM_BASES} Total number of called bases in the sequence entry, including N's
${NUM_N} Total number of N's in the sequence entry
${THIS_CONTIG_NUM} ID of sequence contig (see next variable).
${TOTAL_NUM_CONTIGS} Total number of sequence contigs syntenic to a human region.
${COMMENT} free-text comment, like 'This is an example I hope we all agree on.'
Example:
>rat|ENr223|APR-2005|10116|Baylor HGSC v. 3.1|JUN-2003|rn3|chr8|77886141|77906031|129061546|+|.|19891|0|1|3|This is an example I hope we all agree on.
Not all fields need to contain information. For example, when
${ASSEMBLY_PROVIDER} = NISC, there will be no ${ASSEMBLY_ID} and no
chromosome coordinates (${CHROMOSOME}, ${CHROMOSOME_START}, ${CHROMOSOME_END}).
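The header layout described above could be read with a sketch like the following. The names FIELDS and parse_header are hypothetical, and ACCESSION_VERSION is an illustrative key for the combined ${ACCESSION}.${VERSION} field. Two assumptions are made that this file does not state: a '|' may appear inside ${COMMENT} (so the split is capped at 17 separators), and unpopulated fields are returned as whatever string is present (empty, or '.' as in the rat example) rather than being converted.

```python
FIELDS = [
    "COMMON_NAME", "ENCODE_REGION", "FREEZE_DATE", "NCBI_TAXON_ID",
    "ASSEMBLY_PROVIDER", "ASSEMBLY_DATE", "ASSEMBLY_ID", "CHROMOSOME",
    "CHROMOSOME_START", "CHROMOSOME_END", "CHROMOSOME_LENGTH", "STRAND",
    "ACCESSION_VERSION", "NUM_BASES", "NUM_N", "THIS_CONTIG_NUM",
    "TOTAL_NUM_CONTIGS", "COMMENT",
]


def parse_header(line: str) -> dict:
    """Split one '>'-prefixed header line into a field-name -> string mapping."""
    if not line.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    # Assumption: cap the split so any '|' inside COMMENT stays in the comment.
    values = line[1:].rstrip("\r\n").split("|", len(FIELDS) - 1)
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    # Values are kept as strings because fields may be unpopulated.
    return dict(zip(FIELDS, values))


example = (
    ">rat|ENr223|APR-2005|10116|Baylor HGSC v. 3.1|JUN-2003|rn3|chr8|"
    "77886141|77906031|129061546|+|.|19891|0|1|3|"
    "This is an example I hope we all agree on."
)
record = parse_header(example)
# Start and end are 1-based, so the span is end - start + 1.
span = int(record["CHROMOSOME_END"]) - int(record["CHROMOSOME_START"]) + 1
print(record["ASSEMBLY_ID"], record["CHROMOSOME"], span)
```

For the rat example, the computed span equals ${NUM_BASES}; that is an observation about this one header, not a rule stated here. Numeric fields should be converted with int() only after checking that they are populated.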
Data Release Terms
------------------
All data in this directory and any subdirectories is subject to the terms
of the ENCODE Project Data Release Policy of the National Human Genome Research
Institute. This policy is posted at:
http://www.genome.gov/12513440
http://genome.ucsc.edu/encode/terms.html