Schema for N-SCAN - N-SCAN Gene Predictions

Database: bosTau4	Primary Table: nscanGene	Row Count: 10,433
Format description: A gene prediction with some additional info.

field	example	SQL type	info	description
`bin`	585	`smallint unsigned`	range	Indexing field to speed chromosome range queries.
`name`	chr1.001.1	`varchar(255)`	values	Name of gene (usually transcript_id from GTF)
`chrom`	chr1	`varchar(255)`	values	Reference sequence chromosome or scaffold
`strand`	+	`char(1)`	values	+ or - for strand
`txStart`	27302	`int unsigned`	range	Transcription start position
`txEnd`	52879	`int unsigned`	range	Transcription end position
`cdsStart`	27302	`int unsigned`	range	Coding region start
`cdsEnd`	52879	`int unsigned`	range	Coding region end
`exonCount`	5	`int unsigned`	range	Number of exons
`exonStarts`	27302,27880,37987,41438,52360,	`longblob`		Exon start positions
`exonEnds`	27574,28125,38141,41864,52879,	`longblob`		Exon end positions
`score`	0	`int`	range
`name2`	chr1.001	`varchar(255)`	values	Alternate name (e.g. gene_id from GTF)
`cdsStartStat`	incmpl	`enum('none','unk','incmpl','cmpl')`	values	Status of CDS start annotation (none, unknown, incomplete, or complete)
`cdsEndStat`	cmpl	`enum('none','unk','incmpl','cmpl')`	values	Status of CDS end annotation (none, unknown, incomplete, or complete)
`exonFrames`	1,0,2,0,0,	`longblob`		Exon frame {0,1,2}, or -1 if no frame for exon
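
The `bin` field is described above only as an indexing aid. As a rough illustration, the sketch below computes a bin number from a feature's start and end, assuming the standard UCSC binning scheme (finest bins of 128 kb, each coarser level 8 times wider). The scheme itself is not stated on this page, and the names `BIN_OFFSETS` and `bin_from_range` are chosen here for illustration. Under that assumption the function reproduces the `bin` values shown in the sample rows.

```python
# Assumption: the standard UCSC binning scheme (not documented on this page).
BIN_OFFSETS = (585, 73, 9, 1, 0)  # first bin number at each level, finest level first
FIRST_SHIFT = 17                  # finest bins span 2**17 bases (128 kb)
NEXT_SHIFT = 3                    # each coarser level is 2**3 = 8 times wider


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin that fully contains the half-open range [start, end)."""
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        # The range straddles two bins at this level, so try the next coarser level.
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError(f"range {start}-{end} does not fit the standard scheme")


print(bin_from_range(27302, 52879))      # chr1.001.1 -> 585
print(bin_from_range(61889, 158265))     # chr1.002.1 -> 73
print(bin_from_range(1043635, 1073698))  # chr1.005.1 -> 9
```

A range query can then restrict its search to the bins that could overlap the region of interest instead of scanning every row on the chromosome.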

Connected Tables and Joining Fields


	bosTau4.nscanPep.name (via nscanGene.name)
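
The join listed above can be sketched with an in-memory SQLite database standing in for the real MySQL tables. Only the joining field `name` comes from this page. The `seq` column of `nscanPep` and the peptide string are assumptions made for illustration, and the peptide is a made-up placeholder rather than a real prediction.

```python
import sqlite3

# In-memory stand-in for bosTau4; only a few nscanGene columns are reproduced.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nscanGene (name TEXT, chrom TEXT, strand TEXT,
                            txStart INTEGER, txEnd INTEGER);
    CREATE TABLE nscanPep (name TEXT, seq TEXT);  -- 'seq' is an assumed column
    """
)
conn.execute(
    "INSERT INTO nscanGene VALUES ('chr1.001.1', 'chr1', '+', 27302, 52879)"
)
# Placeholder peptide, NOT the real predicted protein.
conn.execute("INSERT INTO nscanPep VALUES ('chr1.001.1', 'MPLACEHOLDER')")

# Join each gene prediction to its peptide via the shared name field.
rows = conn.execute(
    """
    SELECT g.name, g.chrom, g.txStart, g.txEnd, p.seq
    FROM nscanGene AS g
    JOIN nscanPep AS p ON p.name = g.name
    """
).fetchall()
print(rows)
```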

Sample Rows

bin	name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	name2	cdsStartStat	cdsEndStat	exonFrames
585	chr1.001.1	chr1	+	27302	52879	27302	52879	5	27302,27880,37987,41438,52360,	27574,28125,38141,41864,52879,	chr1.001	incmpl	cmpl	1,0,2,0,0,
73	chr1.002.1	chr1	-	61889	158265	61889	158259	10	61889,82665,86614,102563,103498,114362,115700,116377,117418,156978,	62896,83073,87162,102719,103557,114544,115807,116503,117528,158265,	chr1.002	cmpl	cmpl	1,1,2,2,0,1,2,2,0,0,
73	chr1.003.1	chr1	-	742357	913377	742357	913370	34	742357,746210,748787,749873,752001,756112,762322,764366,773063,796877,798999,802512,805036,808906,813208,814523,818392,820594,83 ...	742506,746384,748870,749995,752214,756280,762421,764615,773185,797069,799117,802634,805082,809073,813291,814729,818552,820842,83 ...	chr1.003	cmpl	cmpl	1,1,2,0,0,0,0,0,1,1,0,1,0,1,2,0,2,0,1,2,0,1,0,0,0,1,2,2,1,1,2,1,1,0,
592	chr1.004.1	chr1	+	995223	1041991	995229	1041991	29	995223,997164,998806,999699,1001812,1003535,1012150,1013410,1015778,1017532,1018682,1021170,1021760,1023954,1024968,1026074,1027 ...	995250,997236,998884,999772,1001857,1003604,1012284,1013522,1015877,1017654,1018766,1021220,1021845,1024035,1025172,1026253,1027 ...	chr1.004	cmpl	cmpl	0,0,0,0,1,1,1,0,1,1,0,0,2,0,0,0,2,1,2,2,0,0,0,2,1,2,2,0,0,
9	chr1.005.1	chr1	-	1043635	1073698	1043635	1073692	11	1043635,1044313,1046385,1050565,1051692,1058902,1059284,1061333,1062791,1070450,1073615,	1043794,1044429,1046533,1050682,1051803,1059091,1059431,1061494,1068695,1070617,1073698,	chr1.005	cmpl	cmpl	0,1,0,0,0,0,0,1,1,2,0,
593	chr1.006.1	chr1	+	1077491	1170764	1077498	1170764	27	1077491,1078725,1079298,1082717,1083798,1084369,1085572,1086149,1086399,1089540,1091651,1092040,1092492,1093733,1094277,1096433, ...	1077643,1078821,1079473,1082829,1083867,1084495,1085660,1086235,1086568,1089772,1091746,1092150,1092691,1093985,1094430,1096640, ...	chr1.006	cmpl	cmpl	0,1,1,2,0,0,0,1,0,1,2,1,0,1,1,1,1,1,0,1,0,1,0,2,1,1,2,
74	chr1.007.1	chr1	-	1187595	1427689	1187595	1427683	28	1187595,1191711,1201180,1208700,1279959,1281418,1282055,1285850,1286116,1289134,1290786,1291034,1293457,1300723,1374868,1380979, ...	1187826,1191869,1201389,1208833,1280191,1281564,1282206,1286008,1286316,1289249,1290928,1291192,1293639,1300844,1375004,1381137, ...	chr1.007	cmpl	cmpl	0,1,2,1,0,1,0,1,2,1,0,1,2,1,0,1,0,1,2,0,0,1,0,1,2,1,1,0,
74	chr1.008.1	chr1	+	1855392	2024108	1855399	2024108	53	1855392,1874834,1876720,1881229,1882950,1884045,1884704,1885435,1889216,1892797,1892983,1894988,1896272,1897076,1897937,1898386, ...	1856038,1875180,1876849,1881406,1883172,1884149,1884922,1885625,1889340,1892897,1893099,1895188,1896417,1897198,1898014,1898453, ...	chr1.008	cmpl	cmpl	0,0,1,1,1,1,0,2,0,1,2,1,0,1,0,2,0,0,2,1,1,2,0,0,2,0,2,0,0,1,1,1,2,2,0,0,1,2,2,0,1,1,1,0,1,0,0,1,1,0,0,0,0,
9	chr1.009.1	chr1	-	2067085	2132655	2067085	2132649	8	2067085,2078420,2080221,2091167,2104672,2115101,2118386,2132639,	2067462,2078510,2080302,2091311,2104825,2115225,2118583,2132655,	chr1.009	cmpl	cmpl	1,1,1,1,1,0,1,0,
75	chr1.010.1	chr1	+	2194767	2267792	2194773	2267792	37	2194767,2195828,2202084,2204526,2206179,2209303,2213458,2215420,2218300,2219177,2219997,2221472,2223433,2225020,2225821,2226746, ...	2194925,2195961,2202181,2204612,2206305,2209428,2213654,2215558,2218474,2219307,2220114,2221585,2223553,2225137,2225960,2226917, ...	chr1.010	cmpl	cmpl	0,2,0,1,0,0,2,0,0,0,1,1,0,0,0,1,1,0,0,0,1,2,0,2,1,1,2,0,1,0,1,0,0,2,2,1,1,
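
As a minimal sketch of how a row like those above might be read, the following parses the first sample row into typed fields and derives exon lengths. The column list follows the Sample Rows header exactly as displayed here, and the helper names `parse_row` and `exon_lengths` are hypothetical.

```python
COLUMNS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "name2", "cdsStartStat", "cdsEndStat",
    "exonFrames",
]
INT_COLUMNS = ("bin", "txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount")
LIST_COLUMNS = ("exonStarts", "exonEnds", "exonFrames")

# First sample row, tab-separated as displayed above.
ROW = (
    "585\tchr1.001.1\tchr1\t+\t27302\t52879\t27302\t52879\t5\t"
    "27302,27880,37987,41438,52360,\t27574,28125,38141,41864,52879,\t"
    "chr1.001\tincmpl\tcmpl\t1,0,2,0,0,"
)


def parse_row(line: str) -> dict:
    """Split one tab-separated row and convert numeric and list fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    for key in INT_COLUMNS:
        record[key] = int(record[key])
    for key in LIST_COLUMNS:
        # Lists carry a trailing comma, so empty items are skipped.
        record[key] = [int(item) for item in record[key].split(",") if item]
    if not len(record["exonStarts"]) == len(record["exonEnds"]) == record["exonCount"]:
        raise ValueError("exon lists do not match exonCount")
    return record


def exon_lengths(record: dict) -> list:
    """Length of each exon; starts are 0-based, so length is end minus start."""
    return [end - start for start, end in zip(record["exonStarts"], record["exonEnds"])]


record = parse_row(ROW)
print(record["name"], exon_lengths(record))  # chr1.001.1 [272, 245, 154, 426, 519]
```

Rows whose exon lists are truncated with "..." in this display would fail the exonCount check, which is intended: the full lists are needed for any real analysis.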

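Note: all start coordinates in our database are 0-based, not 1-based, and end coordinates are exclusive. The first sample row (txStart 27302, txEnd 52879) therefore spans 52879 - 27302 = 25,577 bases.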
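
To make the coordinate note concrete, here is a small sketch that converts a table interval to the 1-based, inclusive form used in position strings. It assumes the usual UCSC convention that table end coordinates are exclusive. The function names are illustrative only.

```python
def to_one_based(start: int, end: int) -> tuple:
    """Convert a 0-based, half-open interval to a 1-based, inclusive one."""
    return start + 1, end


def interval_length(start: int, end: int) -> int:
    """Number of bases covered by a 0-based, half-open interval."""
    return end - start


def position_string(chrom: str, start: int, end: int) -> str:
    """Format a table interval as a 1-based position string."""
    first, last = to_one_based(start, end)
    return f"{chrom}:{first}-{last}"


# txStart and txEnd of chr1.001.1 from the sample rows.
print(position_string("chr1", 27302, 52879))  # chr1:27303-52879
print(interval_length(27302, 52879))          # 25577
```

Only the start shifts by one. The end stays the same because an exclusive 0-based end and an inclusive 1-based end name the same base.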

N-SCAN (nscanGene) Track Description

Description

This track shows gene predictions using the N-SCAN gene structure prediction software provided by the Computational Genomics Lab at Washington University in St. Louis, MO, USA.

Methods

N-SCAN PASA-EST

N-SCAN combines biological-signal modeling in the target genome sequence with information from a multiple-genome alignment to generate de novo gene predictions. It extends the TWINSCAN target-informant genome pair to allow for an arbitrary number of informant sequences as well as richer models of sequence evolution. N-SCAN models the phylogenetic relationships between the aligned genome sequences, context-dependent substitution rates, insertions, and deletions.

To create predictions for cow, N-SCAN uses mouse (mm9) as the informant.

N-SCAN PASA-EST incorporates EST alignments into N-SCAN. Similar to the conservation sequence models in TWINSCAN, separate probability models are developed for EST alignments to genomic sequence in exons, introns, splice sites, and UTRs, reflecting the EST alignment patterns in these regions. N-SCAN PASA-EST is more accurate than N-SCAN while retaining the ability to discover novel genes to which no ESTs align.

No manual annotation was performed to generate any of the gene models.

Credits

Thanks to Michael Brent's Computational Genomics Group at Washington University in St. Louis for providing these data.

Special thanks for this implementation of N-SCAN to Aaron Tenney in the Brent lab, and Robert Zimmermann, currently at Max F. Perutz Laboratories in Vienna, Austria.

References

Gross SS, Brent MR. Using multiple alignments to improve gene prediction. In Proc. 9th Int'l Conf. on Research in Computational Molecular Biology (RECOMB '05):374-388 and J Comput Biol. 2006 Mar;13(2):379-93.

Korf I, Flicek P, Duan D, Brent MR. Integrating genomic homology into gene structure prediction. Bioinformatics. 2001 Jun 1;17(90001):S140-8.

van Baren MJ, Brent MR. Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res. 2006 May;16(5):678-85.

Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003 Oct 1;31(19):5654-66.