ANSWERS TO HOMEWORK 3
- The result of searching the unmasked human genomic sequence for the pattern "p1=TATA p2=15...100 p3=ATG" (which is the same pattern as "TATA 15...100 ATG") is here.
- The result of searching the masked human genomic sequence for the same pattern is here. In the masked genomic sequence, every nucleotide in a repeated region has been replaced with "N". Any occurrence of the pattern that lies inside one of these repeated regions cannot be recognized by the PatScan program. So the genomic sequence masked by RepeatMasker yields fewer matches to the pattern than the unmasked sequence.
- After running BLASTX with the masked genomic sequence, two human protein sequences had an e-value below 0. If you used the combination of GENSCAN and BLASTP to find protein sequences, only one human protein sequence had an e-value below 0. The coding region sequences predicted by the GENOMESCAN program with these two different sets of proteins are at the links for BLASTX and for GENSCAN and BLASTP. If you compare these two GENOMESCAN results with the blast 2 program, you will find that the sequences are identical. You can get a graphical view of the result from the link at the top of the page.
- Click here to see the locations of the coding regions predicted by the MZEF program. Comparing the table in the MZEF result with the one in the GENOMESCAN result, the exons that both programs predicted at the same locations were: 24473-24612, 37700-37788, and 54211-54401. MZEF missed 6 exons: 23778-23866, 28856-28958, 31443-31488, 32810-32886, 46159-46327, and 57494-57804. MZEF also failed to predict the 2 kb coding region between 34 kb and 36 kb of the masked genomic sequence.
- The pairwise alignment showed that the coding sequence predicted by GENOMESCAN was identical to the sequence found by experiments, except for one gap at position 148 in the predicted sequence. However, the insertion did not cause a frameshift. Recall from the previous class that in a blast search the repeated sequences are masked in the query sequence but not in the target sequences. So, for these questions, it is better to do the blast2 search without the filter parameter. The total length of the genomic sequence in the question is 60 kb. The genomic sequence found by experiments runs from 4 kb to 83 kb, and the exons located around 60 kb are 57494-57804 (predicted by GENOMESCAN) and 61038-61125. The second of these exons starts past the end of the 60 kb sequence, so it could not be predicted. This is the reason why the predicted protein sequence is shorter than the one found by experiments.
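
As a rough illustration of the pattern search in the first two answers, the sketch below approximates "TATA 15...100 ATG" with a Python regular expression. It is not PatScan and may not report exactly the same matches. The names find_matches and mask_region and the toy sequence are made up for this example, and the spacer is assumed to accept any of A, C, G, T, or N.

```python
import re

# Approximation of the PatScan pattern "TATA 15...100 ATG" (assumption:
# the spacer may be any 15 to 100 characters from A, C, G, T, N).
# The lookahead lets matches that overlap each other all be reported.
PATTERN = re.compile(r"(?=(TATA[ACGTN]{15,100}?ATG))")


def find_matches(seq):
    """Return (0-based start, matched text) for the shortest match at each start."""
    return [(m.start(1), m.group(1)) for m in PATTERN.finditer(seq.upper())]


def mask_region(seq, start, end):
    """Hypothetical stand-in for repeat masking: replace seq[start:end] with N."""
    return seq[:start] + "N" * (end - start) + seq[end:]


# Toy sequence (not real genomic data): one TATA, a 20-base spacer, then ATG.
unmasked = "GG" + "TATA" + "C" * 20 + "ATG" + "GG"
masked = mask_region(unmasked, 0, 6)  # pretend the first 6 bases are a repeat

print(len(find_matches(unmasked)))
print(len(find_matches(masked)))
```

In this toy case, masking the TATA removes the only match, which mirrors why the masked genomic sequence gives fewer hits than the unmasked one.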
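
The comparison of MZEF and GENOMESCAN exons in the fourth answer can also be done with a few lines of Python rather than by eye. This is only a sketch: compare_exons is a hypothetical helper, the GENOMESCAN list holds just the nine exons named in that answer, and the MZEF list holds just the three shared exons, because the full result tables are not reproduced in the text.

```python
# Exons named in the MZEF answer. The GENOMESCAN list here is only the
# nine exons mentioned in that answer, not the full result table.
genomescan_exons = [
    (23778, 23866), (24473, 24612), (28856, 28958),
    (31443, 31488), (32810, 32886), (37700, 37788),
    (46159, 46327), (54211, 54401), (57494, 57804),
]

# Only the MZEF exons that the answer reports as shared are listed; any
# other MZEF predictions are not given in the text and are left out.
mzef_exons = [(24473, 24612), (37700, 37788), (54211, 54401)]


def compare_exons(reference, other):
    """Return (exons with identical coordinates in both, exons only in reference)."""
    ref, oth = set(reference), set(other)
    return sorted(ref & oth), sorted(ref - oth)


shared, missed = compare_exons(genomescan_exons, mzef_exons)
print("shared:", shared)
print("missed by MZEF:", missed)
```

Exact coordinate matching is a deliberately strict choice here; exons that overlap but differ at one boundary would count as missed.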