5.29. Working with files

The location of a file (its file path) is specified as a string (see the screencast on Unix Paths). We use the open() function to open files. Whether a file is opened for reading or writing is determined by the mode argument. For example, mode="r" (the default) opens a file for reading and mode="w" opens it for writing. When a file is opened with mode="w", any pre-existing contents are lost.

As shown in the code blocks below and in the next section, calling open() in text mode returns a TextIOWrapper object that you can use to access the file contents. It does not return the file's contents.

Note

File objects have a close() method, which you should use when finished.

seqfile = open("data/sample.fasta", mode="r")
print(type(seqfile))
<class '_io.TextIOWrapper'>
print(seqfile)
<_io.TextIOWrapper name='data/sample.fasta' mode='r' encoding='UTF-8'>

We then close it using the close() method. Note that the printed representation of the object is unchanged by closing.

seqfile.close()
print(seqfile)
<_io.TextIOWrapper name='data/sample.fasta' mode='r' encoding='UTF-8'>
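
Since the printed representation does not reveal whether a file is open, a small sketch of how you might check is given here. It assumes a hypothetical file demo.fasta written to a temporary directory (a stand-in for data/sample.fasta), and uses the closed attribute. Attempting to read from a closed file raises a ValueError.

```python
import tempfile
from pathlib import Path

# hypothetical stand-in for data/sample.fasta, created in a temporary directory
path = Path(tempfile.mkdtemp()) / "demo.fasta"
path.write_text(">seq1\nGACG\n")

seqfile = open(path, mode="r")
first_line = seqfile.readline()  # reading works while the file is open
seqfile.close()

# the closed attribute records the state of the file object
print(seqfile.closed)

try:
    seqfile.read()
except ValueError as err:
    # reading from a closed file is an error
    message = str(err)
    print(message)
```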

There is another approach to ensuring the file is always closed. This involves using the with statement. This statement invokes what’s referred to as a “context manager”. The code indented under the with statement is executed and on leaving that indented block, Python closes the file for you.

with open("data/sample.fasta", mode="r") as seqfile:
    print(seqfile)
<_io.TextIOWrapper name='data/sample.fasta' mode='r' encoding='UTF-8'>
# after the context block, seqfile is now closed
seqfile.closed
True
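
One reason to prefer the with statement is that the file is closed even if the indented code fails part way through. The following is a minimal sketch of that behaviour, assuming a hypothetical file demo.txt in a temporary directory and a deliberately raised RuntimeError standing in for a real processing error.

```python
import tempfile
from pathlib import Path

# hypothetical file, created in a temporary directory for this illustration
path = Path(tempfile.mkdtemp()) / "demo.txt"
path.write_text("some text\n")

try:
    with open(path) as infile:
        # hypothetical failure while processing the file
        raise RuntimeError("something went wrong")
except RuntimeError:
    pass

# the context manager closed the file despite the exception
print(infile.closed)
```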

5.29.1. Reading contents of a file

There are several possible approaches to reading the contents of a file that you have opened. One approach uses the fact that file objects are iterable and the “unit” of iteration is a line, i.e. each iteration returns all data up to and including the next line-feed character. So you can treat a file object as if it were a list of lines. (Because only one line is read at a time, this approach also suits large files.)

# the default mode argument value is "r"
with open("data/sample.fasta") as seqfile:
    for line in seqfile:
        print(repr(line))
'>seq1\n'
'GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCAAACGGA\n'
'AAGGCCCTGATTTTGTGGGGTGCTCGAGTGGCGAACGGGTGAG\n'
'CGTTCCCGGGGCT\n'
'>seq2\n'
'GATGAACGCTGGCGGCGTGCTTAACACATACAAGTCGAACGAT\n'
'GAAATTGGGGATAAGTCGTAACAAGGTAACC\n'
'>seq3\n'
'AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCATGGATA\n'
'AGACATGCAAGTCGAACGGAGAATTTCAATAGCTTGCTAATGA\n'
'AATTCTTAGAGGCGGACGGGTTAGTAACAAGTGAAAATCCTAC\n'
'CCT\n'

Note

I’ve used the built-in function repr() (which shows the string representation of an object) here because it makes the new-line characters at the end of each line visible.

The .read(), .readline() and .readlines() methods provide alternative approaches to getting the contents. I demonstrate using .read() only. This method returns the entire contents of the file as a string. We can then use string methods to convert this into a line-based list that can be iterated over as per the previous code snippet.

# the default mode argument value is "r"
with open("data/sample.fasta") as seqfile:
    seqdata = seqfile.read()

seqdata = seqdata.splitlines()
print(seqdata)
['>seq1', 'GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCAAACGGA', 'AAGGCCCTGATTTTGTGGGGTGCTCGAGTGGCGAACGGGTGAG', 'CGTTCCCGGGGCT', '>seq2', 'GATGAACGCTGGCGGCGTGCTTAACACATACAAGTCGAACGAT', 'GAAATTGGGGATAAGTCGTAACAAGGTAACC', '>seq3', 'AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCATGGATA', 'AGACATGCAAGTCGAACGGAGAATTTCAATAGCTTGCTAATGA', 'AATTCTTAGAGGCGGACGGGTTAGTAACAAGTGAAAATCCTAC', 'CCT']
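
For completeness, here is a brief sketch of the other two methods mentioned above. It assumes a small hypothetical file demo.fasta written to a temporary directory. readline() returns a single line (including its new-line character) and readlines() returns the remaining lines as a list. Once the end of the file is reached, readline() returns an empty string.

```python
import tempfile
from pathlib import Path

# hypothetical stand-in for data/sample.fasta
path = Path(tempfile.mkdtemp()) / "demo.fasta"
path.write_text(">seq1\nGACG\nTTAA\n")

with open(path) as seqfile:
    first = seqfile.readline()   # one line, including its "\n"
    rest = seqfile.readlines()   # all remaining lines, as a list
    at_end = seqfile.readline()  # nothing left to read

print(repr(first))
print(rest)
print(repr(at_end))
```

Unlike splitlines(), neither method strips the new-line characters.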

5.29.2. Writing data to a file

In order to write data to a file, we must open it with mode="w".
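
As noted at the start of this section, opening an existing file with mode="w" discards its contents. The sketch below illustrates this using a hypothetical file demo.txt in a temporary directory. It also uses mode="a" (append), which is not otherwise covered here, to show the contrast.

```python
import tempfile
from pathlib import Path

# hypothetical output file in a temporary directory
path = Path(tempfile.mkdtemp()) / "demo.txt"

with open(path, mode="w") as outfile:
    outfile.write("first version\n")

# opening the same path with mode="w" again discards what was there
with open(path, mode="w") as outfile:
    outfile.write("second version\n")

# mode="a" adds to the end of the file instead of overwriting it
with open(path, mode="a") as outfile:
    outfile.write("an extra line\n")

with open(path) as infile:
    contents = infile.read()

print(repr(contents))
```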

The data also needs to be converted to strings. One way to do this is to use a string format conversion. For instance, consider the example of having a list of floats. If we try to write this list to a file, it will raise an exception.

nums = [0.378, 0.711, 0.349, 0.897]

with open("some-data.txt", mode="w") as outfile:
    outfile.writelines(nums)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 4
      1 nums = [0.378, 0.711, 0.349, 0.897]
      3 with open("some-data.txt", mode="w") as outfile:
----> 4     outfile.writelines(nums)

TypeError: write() argument must be str, not float

Note

I’ve used the writelines() method, which attempts to write every element of the series. It does not add line endings for you.

So we need to convert to strings AND we need to put a new-line character at the end of each one.

text = ["%f\n" % v for v in nums]
with open("some-data.txt", mode="w") as outfile:
    outfile.writelines(text)
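
The "%f" conversion writes six decimal places by default. As an alternative sketch (not the only way to do this), an f-string with a format specification controls the number of decimal places. The file is written to a temporary directory here, which is an assumption made so the example can run anywhere, and it is read back to confirm the values survive the round trip.

```python
import tempfile
from pathlib import Path

nums = [0.378, 0.711, 0.349, 0.897]
# hypothetical location for the output file
path = Path(tempfile.mkdtemp()) / "some-data.txt"

# "%f" uses six decimal places
default_format = "%f\n" % nums[0]

# f"{v:.3f}" formats each float with three decimal places
text = [f"{v:.3f}\n" for v in nums]
with open(path, mode="w") as outfile:
    outfile.writelines(text)

# read the lines back and convert each one to a float
with open(path) as infile:
    recovered = [float(line) for line in infile]

print(repr(default_format))
print(text)
print(recovered)
```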

5.29.3. Writing delimited output

Among the most common data file formats are those where the multiple fields on a line correspond to one record. The different fields are separated from each other by a common delimiter, a specific character. Such a format is very easy to parse.

For instance, the GFF format (General Feature Format) is a file format commonly employed in genomics for storing genome annotation data, e.g. locations of genes or exons. GFF is a plain text file format with the following fields:

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

According to the format specification, these fields are tab ('\t') delimited. To generate such output we need to store the field values in a series object (such as a list). This allows us to then use the string join() method to produce a single string with all field elements.

Note

Writing comma delimited files is done in the same way. Just replace '\t'.join with ','.join.
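
A minimal sketch of the join() approach follows, using a single made-up record (the values are hypothetical and only illustrate the layout). Because join() only accepts strings, any numbers must be converted first, as shown for the comma delimited case.

```python
# a hypothetical record, every field already a string
record = ["chr1", "example", "gene", "100", "200", ".", "+", ".", "ID=gene-1"]

# join the fields with tabs and terminate the line
line = "\t".join(record) + "\n"
print(repr(line))

# splitting on the same delimiter recovers the fields
fields = line.rstrip("\n").split("\t")

# non-string values must be converted to str before joining
mixed = ["chr1", 100, 200]
csv_line = ",".join(str(v) for v in mixed)
print(csv_line)
```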

5.30. Exercises

  1. Below I have two GFF records stored as a list of records, each record being a list. Write these data to a tab-delimited file.

    annotations = [
        [
            "scaffold-650",
            "projected",
            "gene",
            "71406",
            "72760",
            ".",
            "+",
            ".",
            "ID=TRIVIDRAFT_53420;Name=TRIVIDRAFT_53420",
        ],
        [
            "scaffold-650",
            "projected",
            "exon",
            "71406",
            "71690",
            ".",
            "+",
            "0",
            "Name=exon-1;Parent=TRIVIDRAFT_53420",
        ],
    ]
    
  2. On Linux and macOS, the \n character is used to denote line endings. Windows uses \r\n. Using help(open), figure out how you would specify that a file is written using line endings that differ from those of your operating system. Then do that for the data above.

  3. How can you check the line endings of a file using Python? Is there another tool for your operating system?

  4. The file [1] contains two columns: Donor_ID, Project_Code. Parse this file to produce a list of Donor_ID whose Project_Code equals "Skin-Melanoma". Use plain Python only (no third-party libraries).

  5. Read the lines from the file [1] and create a dict with keys corresponding to Project_Code and values being the list of all corresponding Donor_ID, e.g. {'CNS-PiloAstro': ['DO36068', 'DO35934', .... Use plain Python only (no third-party libraries).