Homework 2: DNA analysis
Due: at 11pm on Wednesday, January 18, 2017.
Submit via this turnin page. (REQUIRED survey)
- gain experience writing Python code using loops, conditionals (if statements), functions and string manipulation
- become familiar with running Python programs from the command line and using command line parameters to locate data files
- write Python code to analyze DNA data sets
Although you will fill in the body of one function and call it, you do not need to write additional functions for this assignment. You are encouraged to examine the provided function , which opens and reads in data files, but you do not need to understand anything about opening and reading files in order to do this assignment. As of lecture on 1/11, you know everything you need to do this entire assignment. For all of our assignments (except your final project) you should NOT use parts of Python not yet discussed in class or the course readings.
Advice from previous students about this assignment: 14wi, 15sp
You will use, modify, and extend a program to compute the GC content of DNA data. The GC content of DNA is the percentage of nucleotides that are either G or C.
DNA can be thought of as a sequence of nucleotides. Each nucleotide is adenine, cytosine, guanine, or thymine. These are abbreviated as A, C, G, and T. A nucleotide is also called a nucleotide base, nitrogenous base, nucleobase, or just a base.
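As a small illustration of the definition above, here is a minimal sketch that computes GC content for a short string. The function name gc_fraction and the sample sequence are made up for illustration and are not part of the provided program. The sketch assumes the string contains only the letters A, C, G, and T, and it returns a fraction between 0 and 1 rather than a percentage.

```python
def gc_fraction(seq):
    """Return the fraction of characters in seq that are G or C.

    Assumes seq is non-empty and contains only A, C, G, and T.
    """
    gc_count = 0
    total_count = 0
    for base in seq:
        total_count = total_count + 1
        if base == 'G' or base == 'C':
            gc_count = gc_count + 1
    # float() keeps the division from truncating under Python 2.
    return float(gc_count) / total_count


print(gc_fraction("GATTACA"))  # 2 of the 7 bases are G or C
```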
Biologists have multiple reasons to be interested in GC content:
- GC content can identify genes within the DNA, and can identify types of genes. Genes tend to have higher GC content than other parts of the DNA. Genes with longer coding regions have even higher GC content.
- Regions of DNA with higher GC content require higher temperatures for certain chemical reactions, such as when copying/duplicating the DNA.
- GC content can be used in determining classification of species.
If you are curious, Wikipedia has more information about GC content. That reading is optional and is not required to complete this assignment.
Your program will read data files produced by a high-throughput sequencer — a machine that takes as input some DNA, and produces as output a file containing a sequence of nucleotides.
Here are the first 8 lines of output from a particular sequencer:
    @SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722
    CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
    +SOLEXA-1GA-2_2_FC30DNN:1:2:574:1722
    hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
    @SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745
    GTGGGGGTGATGTCCACGATTACGCCGACCGGCTGG
    +SOLEXA-1GA-2_2_FC30DNN:1:2:478:1745
    hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
The nucleotide data is in the second line, the sixth line, the tenth line, etc. Your program will not use the rest of the file, which provides information about the sequencer and the sequencing process that created the nucleotide data.
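To make the "second, sixth, tenth line" pattern concrete, the sketch below picks out every fourth line starting from the second and joins the pieces into one string. It is only an illustration of the layout described above: the function name nucleotide_lines and the shortened sample records are hypothetical, and the provided file-reading function may work differently.

```python
def nucleotide_lines(lines):
    """Join lines 2, 6, 10, ... (counting from 1) into one string."""
    pieces = []
    line_number = 0
    for line in lines:
        line_number = line_number + 1
        # Lines 2, 6, 10, ... leave a remainder of 2 when divided by 4.
        if line_number % 4 == 2:
            pieces.append(line.strip())
    return "".join(pieces)


# Two made-up records in the four-line layout shown above.
sample_lines = [
    "@read1\n",
    "CCCC\n",
    "+read1\n",
    "hhhh\n",
    "@read2\n",
    "GTGG\n",
    "+read2\n",
    "hhhh\n",
]
print(nucleotide_lines(sample_lines))
```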
Problem 1: Obtain the files, add your name
Obtain the files you need by downloading the homework2.zip file. (This is a large download — be patient.)
Unzip the file to create a directory/folder. You will do your work here. The directory/folder contains:
- , a partial Python program that you will complete
- , a file where you will answer textual questions
- , a directory that contains the data that you will process:
- files, which are output from DNA sequencers; this is the data that the program analyzes
- , a directory containing example runs of the final result of your program.
You will do your work by modifying two files — and — and then submitting the modified versions. Add your name to the top of each of these files.
Each problem will ask you to make some changes to the program (or to write text in the file, or both). When you do so, you will generally add to the program. Do not remove changes from earlier problems when you work on later problems; your final program should solve all the problems.
In either file, keep the number of characters on a particular line below 80, otherwise your files become hard to read. One technique to do this in Python is to break large equations into smaller ones by storing subexpressions in variables. Breaking things down into smaller computations can also help with debugging.
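For example, the technique of storing subexpressions in variables might look like the sketch below. The counts are made-up numbers, and the variable names gc_total and acgt_total are chosen only for illustration.

```python
# Made-up counts, standing in for values a program has already computed.
g_count = 3
c_count = 2
a_count = 4
t_count = 1

# Each line does one small step, so no line grows past 80 characters,
# and each intermediate value can be printed while debugging.
gc_total = g_count + c_count
acgt_total = a_count + c_count + g_count + t_count
gc_content = float(gc_total) / acgt_total

print(gc_content)
```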
By the end of the assignment, we would like to produce output of the exact form:
    GC-content: ___
    AT-content: ___
    G count: ___
    C count: ___
    A count: ___
    T count: ___
    Sum of G+C+A+T counts: ___
    Total count: ___
    Length of nucleotides: ___
    AT/GC Ratio: ___
    GC Classification: ___
Where ___ is replaced by values that you will calculate. Of course, the exact values in each category will vary depending on the input data that you are using. We expect the formatting of your program output to exactly match this.
Check your output - You should validate your program's output using the Diff Checker before submitting your assignment. You can compare your output to the files given in the directory of the homework2 files. Drag a file to a window in the Diff Checker. Select text in a command prompt window and paste it into the other window and select "Find Difference!".
Editing text files - You will submit as a text file. Plain text is the standard for communicating information among programmers, because it can be read on any computer without installing proprietary software. You can edit text files using Canopy or another text editor. If you use a word processor, then be sure to save the files as text. Windows users should never use Notepad, as Notepad will mangle the line endings in the file; WordPad or Notepad++ are better alternatives.
Command Prompt - In the past some students have had trouble running from the command line when folder names had spaces in them. You may find it easiest to avoid this. Also, be sure that you have configured Enthought Python as the Python version used on your command line. When you type: you should see:
To cut and paste things from the Windows command prompt, left click and drag the mouse to select text from anywhere in the command prompt window. When you have highlighted what you want, hit return. Now you can use control-V to paste into Diff Checker or a text file.
You do not need to modify or for this problem.
Problem 2: Run the program
When writing programs that analyze data (or any other type of program) it is important to check the correctness of your programs. One way to do this is by comparing the output of your program to a computation done in some other way, such as by hand or by a different program. We have provided the file for this purpose. This file is small enough that you can easily open it up in a text editor and calculate the GC content by hand. Then, run your program to verify that it provides the correct answer for this file.
You should be able to open up and other input and output files in a text editor of your choice (although just clicking on a file is not likely to work since the file extension is not one that is known to your operating system). You can open up files in the Canopy editor. In Canopy, just go to the file menu, select Open and navigate to the data folder, which may first appear empty. Next to file name, switch from "All supported files" to "All files". Select the file you want to open.
For this assignment you will run your program by opening a shell or command prompt (*NOT* Canopy's Python interpreter). Follow the directions found on this page which will teach you the basics of command-line navigation for your operating system. You should navigate to your directory, then type the following command:
On Mac/Linux:
    python dna_analysis.py data/test-small.fastq
On Windows:
    python dna_analysis.py data\test-small.fastq
If you get a "can't open file 'dna_analysis.py'" error or a "No such file or directory" error, then perhaps you are not in your directory, or you mistyped the file name.
After you have confirmed that your program runs correctly on , run your program on each of the 6 real files provided, by executing 6 commands such as:
    python dna_analysis.py data/sample_1.fastq
or, if you are a Windows user:
    python dna_analysis.py data\sample_1.fastq
Run your program on different data files by changing to a different file name in the commands above. Be patient — you are processing a lot of data, and it might take a minute or so to run.
(If you are interested, and are from Streptococcus pneumoniae TIGR4, and is from Human herpesvirus 5.)
If you have already used the Output Comparison Tool (referenced at the bottom of the page), you might notice that some of your results are different than the example results. Don't worry about this — this issue will be resolved in Problem 6.
Cut and paste the line of output produced by your program regarding GC-content when run on into your file. (Note, this could take a minute or so to run.) For example, your answer might look like:
    GC-content: 0.42900139393
(Note that this is not the answer you should expect to get, this is just an example of the format that your answer should be in.)
Problem 3: Remove some lines
- In your program, comment out this line:
    gc_count = 0
  by prefixing it with the # character. Save the file and then re-run the program, just as you did for Problem 2. In , explain what happened, and why it happened.
- Now, restore the line to its original state by removing the # that you added. What would happen if you commented out this line instead? (Feel free to try it!)
    nucleotides = filename_to_string(filename)
  Explain what happens and why in .
Problem 4: Compute AT content
Augment your program so that, in addition to computing and printing the GC content, it also computes and prints the AT content. The AT content is the percentage of nucleotides that are A or T.
Two ways to compute the AT content are:
- Copy the existing loop that examines each nucleotide and modify it. You will now have two loops, one of which computes the GC count and one of which computes the AT count. OR
- Add more statements into the existing loop, so that one loop computes both the GC count and the AT count.
You may use whichever approach you prefer. Add whatever new variables you need.
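The second approach, one loop that updates two counters, can be sketched as follows. This is a generic illustration on a made-up string, not the structure of the provided program; the variable at_count is a name chosen for this example.

```python
seq = "GATTACA"  # made-up sample data

gc_count = 0
at_count = 0
for base in seq:
    # Each base is examined once and updates at most one counter.
    if base == 'G' or base == 'C':
        gc_count = gc_count + 1
    elif base == 'A' or base == 'T':
        at_count = at_count + 1

print(gc_count)
print(at_count)
```

With the first approach, the same two counters would be filled in by two separate loops, at the cost of walking through the data twice.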
Check your work by manually computing the AT content for file , then comparing it to the output of running your program on .
Run your program on . Cut-and-paste the relevant line of output into .
Problem 5: Count nucleotides
Augment your program so that it also computes and prints the number of A nucleotides, the number of T nucleotides, the number of G nucleotides, and the number of C nucleotides. Add whatever new variables you need.
When doing this, add at most one extra loop to your program. You can solve this part without adding any new loops at all, by reusing an existing loop. At this point you should also feel free to modify the code we have given you if another structure of if statements makes more sense to you. We just caution you against looping through the data more times than you need to as this could cause your code to run very slowly.
Check your work by manually computing the results for file , then comparing them to the output of running your program on .
Run your program on . Cut-and-paste the relevant lines of output into (the lines that indicate the G count, C count, A count, and T count).
Problem 6: Sanity-check the data
For each of the eleven files you have been given, calculate and print the following three quantities:
- the sum of: the A count, the C count, the G count, and the T count (store this in a new variable called )
- the variable (total number of nucleotides)
- the length of the string variable. You can compute this with .
In other words, compute the three quantities for and then do the same for , etc.
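As a rough sketch of how the three quantities relate, the example below computes them for a short made-up string. The names sum_counts and total_count are hypothetical, and total_count here is assumed to count every character the loop visits, which may not match how your program computes it. The letter X is an invented stand-in for any character the counting code does not expect.

```python
nucleotides = "GATTXACA"  # made-up data; X is an invented unexpected character

a_count = 0
c_count = 0
g_count = 0
t_count = 0
total_count = 0  # hypothetical: incremented once per character visited

for base in nucleotides:
    total_count = total_count + 1
    if base == 'A':
        a_count = a_count + 1
    elif base == 'C':
        c_count = c_count + 1
    elif base == 'G':
        g_count = g_count + 1
    elif base == 'T':
        t_count = t_count + 1

sum_counts = a_count + c_count + g_count + t_count  # hypothetical name

print(sum_counts)
print(total_count)
print(len(nucleotides))
```

In this toy case the sum of the four counts is smaller than the other two quantities, because one character matched none of the four branches.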
For at least one file, at least one of these quantities will be different from the other two. In your file, state which file(s) and which quantities differ. (If all three quantities are equal for each file, then your code contains a mistake.) In your file, write a short paragraph that explains why these differ.
Explaining why (or debugging your code if all three quantities were the same in all files) might require you to do some detective work.
This exercise is meant to expose you to a situation you might encounter when processing a data file of your own (say on your Final project). When your program does not give the results you expect, there are two likely sources of the problem. One is that your program contains a bug! Check your code carefully to be sure you are calculating all values correctly. We will talk about testing in more detail later but for now, try walking through your code with a very small data set and calculating values by hand. A second source of unexpected results that is very common with data files is that there is something you were assuming about the contents of the data files that was an incorrect assumption. This could include things like assuming each line would contain a certain number of characters or words, or that all characters would be uppercase or lowercase, or that values might only be in a certain range. If you wrote your program assuming something about your data files that was not correct, your program may not give correct results.
To track down a wrong assumption about a data file, think about ways you can modify your program to help you determine what is happening. This could include having it print out values when they do not meet some assumption you are making about the file. You could also try just loading a data file into a text editor and examining it with your eyes to see if you see something you did not expect. (Although if you try this approach we strongly suggest that you start with the smallest data file for which the three quantities are not all the same.) Another approach would be to modify your program, or create a new program, to compute the three quantities for each line of a data file separately (as opposed to for the file as a whole as you have been doing): if the quantities differ for an entire file, then they must differ for at least one specific line in that file. Examining that/those line(s) will help you understand the problem.
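The per-line approach described above might be sketched like this. The function name suspicious_lines and the sample reads are invented for illustration; the sketch assumes you already have the nucleotide lines as a list of strings, which is not how the provided program stores its data.

```python
def suspicious_lines(lines):
    """Return (line_number, line) pairs whose A/C/G/T count differs
    from the line's length."""
    result = []
    line_number = 0
    for line in lines:
        line_number = line_number + 1
        acgt_count = 0
        for ch in line:
            if ch == 'A' or ch == 'C' or ch == 'G' or ch == 'T':
                acgt_count = acgt_count + 1
        # A mismatch means the line holds a character we did not count.
        if acgt_count != len(line):
            result.append((line_number, line))
    return result


reads = ["GATTACA", "GGXCC", "ACGT"]  # made-up reads; X is invented
print(suspicious_lines(reads))
```

Printing only the mismatching lines keeps the output short enough to read, even for a large data file.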
If all of the three quantities that you measured in problem 6 are the same, then it would not matter which one you used in the denominator when computing the GC content. However, you saw that the three quantities are not all the same. In , state which of these quantities should be used in the denominator and which should not, and why.
If your program incorrectly computed the GC content (which should be equal to (G+C)/(A+C+G+T)), then state that fact in your file. Then, go back and correct your program, and also update any incorrect answers elsewhere in your file. It is fine to change the code we provided you if needed.
If you are unsure if you are calculating things correctly, now would be a good time to validate your program's output using the Diff Checker. You can compare your output to the files given in the directory of the homework2 files. You have not yet completed the assignment, so your output will not be identical. But things like GC-content, AT-content and individual counts should be identical. You will produce the last two lines of output in the files in Problem 7 and Problem 8 below.
Problem 7: Compute the AT/GC ratio
Sometimes biologists use the AT/GC ratio, defined as (A+T)/(G+C), rather than the GC-content, which is defined as (G+C)/(A+C+G+T).
Modify your program so that it also computes the AT/GC ratio.
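The formula above translates directly into code. The sketch below uses a hypothetical helper named at_gc_ratio and made-up counts; it assumes the G and C counts are not both zero.

```python
def at_gc_ratio(a_count, t_count, g_count, c_count):
    """Return (A+T)/(G+C). Assumes g_count + c_count is not zero."""
    at_total = a_count + t_count
    gc_total = g_count + c_count
    # float() keeps the division from truncating under Python 2.
    return float(at_total) / gc_total


print(at_gc_ratio(4, 1, 3, 2))  # (4+1)/(3+2)
```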
Check your work by manually computing the results for file . Compare them to the output of running your program on .
Run your program on . Cut-and-paste the relevant lines of output into (the line that indicates the AT/GC ratio).
Problem 8: Categorize organisms
The GC content can be used to categorize microorganisms.
Fill in the body of the function to return the classification of the organism ("high", "moderate" or "low") described in the data file given using these classifications:
If the GC content is above 60%, the organism is considered “high GC content”.
If the GC content is below 40%, the organism is considered “low GC content”.
Otherwise, the organism is considered “moderate GC content”.
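The three rules above map onto a single if/elif/else chain. The sketch below is one possible shape, not the required solution: the function name classify_gc is hypothetical, the GC content is assumed to be a fraction between 0 and 1 (so 60% is 0.6), and the exact strings your program must print should be checked against the provided example output.

```python
def classify_gc(gc_content):
    """Return "high", "low", or "moderate" for a GC content given as a
    fraction between 0 and 1 (an assumption of this sketch)."""
    if gc_content > 0.6:
        return "high"
    elif gc_content < 0.4:
        return "low"
    else:
        # Exactly 0.4 and exactly 0.6 both fall through to here.
        return "moderate"


print(classify_gc(0.72))
print(classify_gc(0.38))
print(classify_gc(0.5))
```

Because the rules say "above 60%" and "below 40%", the boundary values themselves are treated as moderate.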
Biologists can use GC content for classifying species, for determining the melting temperature of the DNA (useful for both ecology and experimentation, for example PCR is more difficult on organisms with high GC content), and for other purposes. Here are some examples:
The GC content of Streptomyces coelicolor A3(2) is 72%.
The GC content of Yeast (Saccharomyces cerevisiae) is 38%.
The GC content of Thale Cress (Arabidopsis thaliana) is 36%.
The GC content of Plasmodium falciparum is 20%.
Again, test that your program works on some data files with known outputs. The file has low GC content. We have provided four other test files, whose names explain their GC content: , , , .
You will find a "skeleton" or "stub" for the function near the top of , just before where the main program begins. There is an assignment statement inside of the function body that is there only as a placeholder. You should edit/remove it once you have added your code. The function takes the gc_content as an input parameter and returns an appropriate string indicating the GC Classification. Once you have filled in the body of the function you should call the function from your main program in the appropriate place and use the string it returns to print out a message that matches what is expected.
After your program works for all the test files, run it on . Cut-and-paste just the relevant line of output from your program into .
Submit your work
You are almost done!
We recommend doing a quick search on this web page for to confirm that each place we asked you to answer a question, you have answered it.
At the bottom of your file, in the “Collaboration” part, state which students or other people (besides the course staff) helped you with the assignment, or that no one did.
Submit the following files via this turnin page.
Please be sure to validate your program's output using the Diff Checker before submitting your assignment. You can compare your output to the files given in the directory of the homework2 files. Be sure that both the values are correct AND that the messages you are printing are formatted correctly and have no typos.
Answer a REQUIRED survey asking how much time you spent and other reflections on this assignment.
Now you are done!
Biologists in the 1940s had difficulty in accepting DNA as the genetic material because of the apparent simplicity of its chemistry. DNA was known to be a long polymer composed of only four types of subunits, which resemble one another chemically. Early in the 1950s, DNA was first examined by x-ray diffraction analysis, a technique for determining the three-dimensional atomic structure of a molecule (discussed in Chapter 8). The early x-ray diffraction results indicated that DNA was composed of two strands of the polymer wound into a helix. The observation that DNA was double-stranded was of crucial significance and provided one of the major clues that led to the Watson-Crick structure of DNA. Only when this model was proposed did DNA's potential for replication and information encoding become apparent. In this section we examine the structure of the DNA molecule and explain in general terms how it is able to store hereditary information.
A DNA Molecule Consists of Two Complementary Chains of Nucleotides
A DNA molecule consists of two long polynucleotide chains composed of four types of nucleotide subunits. Each of these chains is known as a DNA chain, or a DNA strand. Hydrogen bonds between the base portions of the nucleotides hold the two chains together (Figure 4-3). As we saw in Chapter 2 (Panel 2-6, pp. 120-121), nucleotides are composed of a five-carbon sugar to which are attached one or more phosphate groups and a nitrogen-containing base. In the case of the nucleotides in DNA, the sugar is deoxyribose attached to a single phosphate group (hence the name deoxyribonucleic acid), and the base may be either adenine (A), cytosine (C), guanine (G), or thymine (T). The nucleotides are covalently linked together in a chain through the sugars and phosphates, which thus form a “backbone” of alternating sugar-phosphate-sugar-phosphate (see Figure 4-3). Because only the base differs in each of the four types of subunits, each polynucleotide chain in DNA is analogous to a necklace (the backbone) strung with four types of beads (the four bases A, C, G, and T). These same symbols (A, C, G, and T) are also commonly used to denote the four different nucleotides, that is, the bases with their attached sugar and phosphate groups.
DNA and its building blocks. DNA is made of four types of nucleotides, which are linked covalently into a polynucleotide chain (a DNA strand) with a sugar-phosphate backbone from which the bases (A, C, G, and T) extend. A DNA molecule is composed of two (more...)
The way in which the nucleotide subunits are linked together gives a DNA strand a chemical polarity. If we think of each sugar as a block with a protruding knob (the 5′ phosphate) on one side and a hole (the 3′ hydroxyl) on the other (see Figure 4-3), each completed chain, formed by interlocking knobs with holes, will have all of its subunits lined up in the same orientation. Moreover, the two ends of the chain will be easily distinguishable, as one has a hole (the 3′ hydroxyl) and the other a knob (the 5′ phosphate) at its terminus. This polarity in a DNA chain is indicated by referring to one end as the 3′ end and the other as the 5′ end.
The three-dimensional structure of DNA—the double helix—arises from the chemical and structural features of its two polynucleotide chains. Because these two chains are held together by hydrogen bonding between the bases on the different strands, all the bases are on the inside of the double helix, and the sugar-phosphate backbones are on the outside (see Figure 4-3). In each case, a bulkier two-ring base (a purine; see Panel 2-6, pp. 120–121) is paired with a single-ring base (a pyrimidine); A always pairs with T, and G with C (Figure 4-4). This complementary base-pairing enables the base pairs to be packed in the energetically most favorable arrangement in the interior of the double helix. In this arrangement, each base pair is of similar width, thus holding the sugar-phosphate backbones an equal distance apart along the DNA molecule. To maximize the efficiency of base-pair packing, the two sugar-phosphate backbones wind around each other to form a double helix, with one complete turn every ten base pairs (Figure 4-5).
Complementary base pairs in the DNA double helix. The shapes and chemical structure of the bases allow hydrogen bonds to form efficiently only between A and T and between G and C, where atoms that are able to form hydrogen bonds (see Panel 2-3, pp. 114–115) (more...)
The DNA double helix. (A) A space-filling model of 1.5 turns of the DNA double helix. Each turn of DNA is made up of 10.4 nucleotide pairs and the center-to-center distance between adjacent nucleotide pairs is 0.34 nm. The coiling of the two strands around (more...)
The members of each base pair can fit together within the double helix only if the two strands of the helix are antiparallel, that is, only if the polarity of one strand is oriented opposite to that of the other strand (see Figures 4-3 and 4-4). A consequence of these base-pairing requirements is that each strand of a DNA molecule contains a sequence of nucleotides that is exactly complementary to the nucleotide sequence of its partner strand.
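The pairing and antiparallel rules above are enough to derive one strand from the other. The following sketch is an illustration only: the function name partner_strand and the sample sequence are made up, and both input and output are assumed to be written in the 5′ to 3′ direction.

```python
def partner_strand(strand):
    """Return the complementary strand, written 5' to 3'.

    Assumes strand contains only A, C, G, and T, written 5' to 3'.
    """
    pairs = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    complement = ""
    for base in strand:
        complement = complement + pairs[base]
    # The partner runs in the opposite direction, so reverse it to
    # read it 5' to 3' as well.
    return complement[::-1]


print(partner_strand("GATTACA"))
```

Applying the function twice returns the original sequence, which reflects the fact that each strand fully determines its partner.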
The Structure of DNA Provides a Mechanism for Heredity
Genes carry biological information that must be copied accurately for transmission to the next generation each time a cell divides to form two daughter cells. Two central biological questions arise from these requirements: how can the information for specifying an organism be carried in chemical form, and how is it accurately copied? The discovery of the structure of the DNA double helix was a landmark in twentieth-century biology because it immediately suggested answers to both questions, thereby resolving at the molecular level the problem of heredity. We discuss briefly the answers to these questions in this section, and we shall examine them in more detail in subsequent chapters.
DNA encodes information through the order, or sequence, of the nucleotides along each strand. Each base—A, C, T, or G—can be considered as a letter in a four-letter alphabet that spells out biological messages in the chemical structure of the DNA. As we saw in Chapter 1, organisms differ from one another because their respective DNA molecules have different nucleotide sequences and, consequently, carry different biological messages. But how is the nucleotide alphabet used to make messages, and what do they spell out?
As discussed above, it was known well before the structure of DNA was determined that genes contain the instructions for producing proteins. The DNA messages must therefore somehow encode proteins (Figure 4-6). This relationship immediately makes the problem easier to understand, because of the chemical character of proteins. As discussed in Chapter 3, the properties of a protein, which are responsible for its biological function, are determined by its three-dimensional structure, and its structure is determined in turn by the linear sequence of the amino acids of which it is composed. The linear sequence of nucleotides in a gene must therefore somehow spell out the linear sequence of amino acids in a protein. The exact correspondence between the four-letter nucleotide alphabet of DNA and the twenty-letter amino acid alphabet of proteins—the genetic code—is not obvious from the DNA structure, and it took over a decade after the discovery of the double helix before it was worked out. In Chapter 6 we describe this code in detail in the course of elaborating the process, known as gene expression, through which a cell translates the nucleotide sequence of a gene into the amino acid sequence of a protein.
The relationship between genetic information carried in DNA and proteins.
The complete set of information in an organism's DNA is called its genome, and it carries the information for all the proteins the organism will ever synthesize. (The term genome is also used to describe the DNA that carries this information.) The amount of information contained in genomes is staggering: for example, a typical human cell contains 2 meters of DNA. Written out in the four-letter nucleotide alphabet, the nucleotide sequence of a very small human gene occupies a quarter of a page of text (Figure 4-7), while the complete sequence of nucleotides in the human genome would fill more than a thousand books the size of this one. In addition to other critical information, it carries the instructions for about 30,000 distinct proteins.
The nucleotide sequence of the human β-globin gene. This gene carries the information for the amino acid sequence of one of the two types of subunits of the hemoglobin molecule, which carries oxygen in the blood. A different gene, the α-globin (more...)
At each cell division, the cell must copy its genome to pass it to both daughter cells. The discovery of the structure of DNA also revealed the principle that makes this copying possible: because each strand of DNA contains a sequence of nucleotides that is exactly complementary to the nucleotide sequence of its partner strand, each strand can act as a template, or mold, for the synthesis of a new complementary strand. In other words, if we designate the two DNA strands as S and S′, strand S can serve as a template for making a new strand S′, while strand S′ can serve as a template for making a new strand S (Figure 4-8). Thus, the genetic information in DNA can be accurately copied by the beautifully simple process in which strand S separates from strand S′, and each separated strand then serves as a template for the production of a new complementary partner strand that is identical to its former partner.
DNA as a template for its own duplication. As the nucleotide A successfully pairs only with T, and G with C, each strand of DNA can specify the sequence of nucleotides in its complementary strand. In this way, double-helical DNA can be copied precisely. (more...)
The ability of each strand of a DNA molecule to act as a template for producing a complementary strand enables a cell to copy, or replicate, its genes before passing them on to its descendants. In the next chapter we describe the elegant machinery the cell uses to perform this enormous task.
In Eucaryotes, DNA Is Enclosed in a Cell Nucleus
Nearly all the DNA in a eucaryotic cell is sequestered in a nucleus, which occupies about 10% of the total cell volume. This compartment is delimited by a nuclear envelope formed by two concentric lipid bilayer membranes that are punctured at intervals by large nuclear pores, which transport molecules between the nucleus and the cytosol. The nuclear envelope is directly connected to the extensive membranes of the endoplasmic reticulum. It is mechanically supported by two networks of intermediate filaments: one, called the nuclear lamina, forms a thin sheetlike meshwork inside the nucleus, just beneath the inner nuclear membrane; the other surrounds the outer nuclear membrane and is less regularly organized (Figure 4-9).
A cross-sectional view of a typical cell nucleus. The nuclear envelope consists of two membranes, the outer one being continuous with the endoplasmic reticulum membrane (see also Figure 12-9). The space inside the endoplasmic reticulum (the ER lumen) (more...)
The nuclear envelope allows the many proteins that act on DNA to be concentrated where they are needed in the cell, and, as we see in subsequent chapters, it also keeps nuclear and cytosolic enzymes separate, a feature that is crucial for the proper functioning of eucaryotic cells. Compartmentalization, of which the nucleus is an example, is an important principle of biology; it serves to establish an environment in which biochemical reactions are facilitated by the high concentration of both substrates and the enzymes that act on them.
Genetic information is carried in the linear sequence of nucleotides in DNA. Each molecule of DNA is a double helix formed from two complementary strands of nucleotides held together by hydrogen bonds between G-C and A-T base pairs. Duplication of the genetic information occurs by the use of one DNA strand as a template for formation of a complementary strand. The genetic information stored in an organism's DNA contains the instructions for all the proteins the organism will ever synthesize. In eucaryotes, DNA is contained in the cell nucleus.