PE100-05: Files

You can write a good deal of software that runs entirely inside Jupyter notebooks or that runs on the command line and only communicates through the screen and the keyboard. Sometimes, though, you have to deal with files. It may not be practical to hardcode all your data into assignment statements in Python, or maybe you have to deal with a number of files and therefore you can’t use pipes (see the Sys Fundamentals page).

We’ve already seen two basic ways to do Input and Output (often referred to as “I/O”). We’ve used input() to read from the keyboard and print() to send output to the screen. Those functions work quite well, except you might have to do a lot of typing or deal with your output scrolling off the screen. In neither case is the data durable - it goes away as soon as the program is done or you close the Jupyter notebook.

The input() and print() functions are just the tip of the proverbial iceberg in terms of getting information in and out of running Python code. Some of our other options include:

* GUI controls: text box, menu, dialog box…
* Networks: HTTP, TCP/IP sockets, Infiniband…
* Databases: Relational (SQL) and NoSQL
* Other: cameras, microphones, speakers, LabView

Files

Practically everyone is more-or-less familiar with the idea of a file, even if fairly few people know how they work. We’re going to ignore a lot of details for the moment and say this: a file is a long-lasting collection of bytes. It has a first byte, a last byte, and every one in between stays in the same order.

This raises the question “What is a byte?” A byte is just a small whole number from 0 to 255 (inclusive). We can assign meaning to those numbers, and if we’re smart about how we do it then we can represent any information a computer can process as long as we use enough of these bytes.

We like to think of files as being one of two types: binary files and text files. Binary files are pure data. We decide how to write bytes to a file to represent data. Then when we’re ready to read it in again, we read the bytes, process them somehow, and reconstruct the original data. It’s a great technique - it’s fast and efficient.

We won’t be talking about binary files in this notebook or even in this module. Fifteen years ago we wouldn’t have had a choice; we would have had to. These days, it’s unusual to have to deal with binary files, especially in Python, because there is so often a library function already available to do the work for us.

Text files, on the other hand, are probably something you’re already familiar with - they are what you get when you edit a “plain text” file in “notepad” or “textedit”. In a text file, every letter, digit, and punctuation mark is assigned its own number. For instance, capital “A” is 65. “B” is 66. Not that it should ever matter, but here’s a complete list and then some!

Let’s say you open an editor and type “CAT”. When you save that to a file, there will be a file that is three bytes long and contains the three bytes 67, 65, and 84. Actually there will usually be a fourth byte, 10, which is the character you get when you press “Enter” or “Return”.
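If you’d like to check those numbers yourself, here is a small sketch. It uses the string method encode() with the "ascii" encoding, which this lesson hasn’t covered, so treat it as a peek ahead rather than something to memorize.

```python
# Convert the text to bytes using the ASCII numbering, then list the numbers.
text = "CAT\n"
byte_values = list(text.encode("ascii"))

print(byte_values)   # [67, 65, 84, 10]
```

The last number, 10, is the newline character that the editor adds when you press Enter.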

For now, at least for a few minutes, we’re going to pretend the only language on earth is English. We’ll talk about other languages when we talk about networks.

It’s about time for an example, don’t you think?

In [4]:

my_file_object = open("/tmp/first_file.txt", "w")
my_file_object.write("First Post!")
my_file_object.close()

Three lines of code was all it took to create a file, write to it, and tidy up after ourselves. What does each of those lines do?

my_file_object = open("/tmp/first_file.txt", "w")

my_file_object is an object variable. Think of an object as a way to store data in a variable along with some functions that only make sense for that data. Objects hide a lot of complexity from us. A file object is one that keeps track of a filename, how to get to it, and how to use it. It has some functions built into it to help us do things to the file.

Python gives us the function “open”. It gets a file ready to be used by our code. It takes two arguments. The first is the file’s name, and the second is the mode we want to use the file in. In our example, we specified that the file’s name was “first_file.txt” and that it was in the “/tmp” directory. Then in the second argument we specified “w”, meaning we wanted to write to the file. The “w” mode will cause the file to be created if it didn’t already exist. If it did already exist, on the other hand, all the contents of it will be deleted and we’ll start writing from the beginning just as if the file was created from scratch. We’ll see more modes as we go.
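To see the “w” behavior on its own, here is a minimal sketch. The filename mode_demo.txt is made up for illustration, and the tempfile and os modules are used only to get a fresh scratch directory so nothing of yours gets overwritten.

```python
import os
import tempfile

# Hypothetical scratch file inside a brand-new temporary directory.
scratch_path = os.path.join(tempfile.mkdtemp(), "mode_demo.txt")

# The file does not exist yet, so "w" creates it.
f = open(scratch_path, "w")
f.write("Old contents that are about to be lost.")
f.close()

# Opening with "w" a second time empties the file before we write anything.
f = open(scratch_path, "w")
f.write("New.")
f.close()

f = open(scratch_path, "r")
what_is_left = f.read()
f.close()

print(what_is_left)   # New.
```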

my_file_object.write("First Post!")

This line uses one of those functions that are tucked away inside an object. In this case, we’re calling the file object’s “write” function. It does what we expect - it takes its argument, in this case “First Post!”, and causes it to be written to disk byte by byte.

my_file_object.close()

Finally, we call one more of the file object’s functions: close. When we run this, Python tells the operating system “Hey, we’re done with the file. You can get rid of any of the tedious housekeeping data that operating systems keep behind the scenes!”

Closing files is considered “good programming hygiene”. You’re allowed 1024 file objects to be open and connected to files in one program on the CLASSE cluster of computers. I’ll say from my experience: if you think you need that many, you’re probably doing something the wrong way.
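A file object remembers whether it has been closed, and you can ask it through its closed attribute. The sketch below assumes a made-up scratch file named close_demo.txt in a temporary directory.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
scratch_path = os.path.join(tempfile.mkdtemp(), "close_demo.txt")

f = open(scratch_path, "w")
f.write("Some text.")
closed_before = f.closed   # still open at this point
f.close()
closed_after = f.closed    # now closed

print(closed_before, closed_after)   # False True
```

There is a practical reason to close, too: Python may hold on to written text in memory for a while, and closing the file makes sure all of it actually reaches the file.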

Writing files, then, is fairly easy. What about reading files? I’m glad you asked.

In [6]:

input_file = open("/tmp/first_file.txt","r")
the_contents = input_file.read()
input_file.close()

the_contents

'First Post!'

You can probably tell mostly how that worked just by looking at it. We used the open() function again, but this time with an “r” for our mode. This means “read”. Also, this time we used read() instead of write(). The read() function reads in an entire file and returns it as a string, which we save in a variable. Finally, we call close() again to close the file and tidy up after ourselves.

Note that if the file is, say, 500 megabytes long, the string variable is going to be very, very large - roughly half a gigabyte. Python can handle this, but it may not be terribly convenient. If the file is more than 100-200 gigabytes, the CLASSE servers are probably not going to be able to handle it. I say “probably” because there are a lot of factors at play.
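One way to avoid holding a whole file in memory is to give read() a number, which is the most characters it should return in one call. Here is a minimal sketch with a tiny made-up file and a deliberately small chunk size; a real program would pick a much larger one.

```python
import os
import tempfile

# Hypothetical scratch file holding ten characters.
scratch_path = os.path.join(tempfile.mkdtemp(), "chunk_demo.txt")
f = open(scratch_path, "w")
f.write("abcdefghij")
f.close()

chunk_size = 4   # tiny on purpose so we can see several chunks
chunks = []

f = open(scratch_path, "r")
chunk = f.read(chunk_size)
while chunk != "":            # read() returns '' once nothing is left
    chunks.append(chunk)
    chunk = f.read(chunk_size)
f.close()

print(chunks)   # ['abcd', 'efgh', 'ij']
```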

Just writing one line to a file is probably not very useful. Let’s try writing two lines:

In [8]:

my_file_object = open("/tmp/first_file.txt", "w")
my_file_object.write("First line written.")
my_file_object.write("This is my second line.")
my_file_object.close()

When we run that, it will open /tmp/first_file.txt for writing and it will delete anything already in it (that’s what the “w” means, remember?). Then it will write “First line written.” and “This is my second line.”.

Let’s read the file again and prove to ourselves that it worked…

In [9]:

input_file = open("/tmp/first_file.txt","r")
the_contents = input_file.read()
input_file.close()

the_contents

'First line written.This is my second line.'

Oh no! The two lines ran together!

And that is one of the first differences we’ll see between write() and print(). By default, print() adds a newline character after whatever it prints, while write() writes exactly what you give it and nothing more. Remember when I said there would usually be a byte at the end of a line, represented by the number 10? This character is called “newline” and it, as the name implies, marks where a new line starts.
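The difference is easy to see side by side. This sketch sends one string to a file with write() and another with print(), using print()’s file= parameter to aim it at the file instead of the screen. The filename is made up for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file for comparing write() and print().
scratch_path = os.path.join(tempfile.mkdtemp(), "newline_demo.txt")

f = open(scratch_path, "w")
f.write("written with write()")           # no newline added
print("written with print()", file=f)     # newline added at the end
f.close()

f = open(scratch_path, "r")
contents = f.read()
f.close()

print(repr(contents))   # 'written with write()written with print()\n'
```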

In all likelihood, when we do two write() statements like we did, we want to put a newline character in the file to make it into two lines. Fortunately, there are several ways to do that. Here are two of them.

The first way is simple and direct - call write() three times instead of two and put a newline in there “by hand”, as it were:

In [17]:

my_file_object = open("/tmp/first_file.txt", "w")
my_file_object.write("First line written.")
my_file_object.write('\n')
my_file_object.write("This is my second line.")
my_file_object.close()

input_file = open("/tmp/first_file.txt","r")
the_contents = input_file.read()
input_file.close()

the_contents

'First line written.\nThis is my second line.'

The output looks a little strange. We added an extra write() call, but we gave it an odd-looking argument: '\n'. That is a backslash (usually between the Enter and the Backspace keys on a US keyboard) immediately followed by a lowercase “n”. Together, the two characters mean “newline character”. This much is fairly straightforward.

Next we read the contents of the file. This is just like before.

Finally, and this is where things take an unexpected turn, we evaluate the_contents and let Jupyter print that out for us. And when Jupyter does that, we see the “\n” there. It seems like Python didn’t convert those two characters to a newline, just sticking them in there as-is, and still left us with one long line. But is that true? Has Python forsaken us?

Run the code in the next cell:

In [18]:

print(the_contents)

First line written.
This is my second line.

Salvation! print() did the right thing. This is a key difference between just typing a variable or an expression at the end of a cell and letting Python evaluate it versus putting a print() in there and having absolute control over what gets sent to the notebook and on to the screen.
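What Jupyter shows for a bare expression is essentially what Python’s built-in repr() function returns: a version of the value written the way you would type it in code, quotes and backslashes included. A small sketch, with the string typed in by hand rather than read from the file:

```python
the_contents = 'First line written.\nThis is my second line.'

# repr() gives the "as you would type it" form, with a visible backslash-n.
typed_form = repr(the_contents)
print(typed_form)    # 'First line written.\nThis is my second line.'

# The string itself holds one real newline character, not two characters.
newline_count = the_contents.count('\n')
print(newline_count)   # 1
print(len('\n'))       # 1
```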

This also illustrates something else important and useful: all of the code cells in this notebook are being run by the same Python “interpreter”. This means if we set a variable to a value in one cell, we will see the same value stored in that variable in other cells. That’s how we were able to print what was stored in the_contents in the cell above even though we had set its value to the file contents two cells above that.
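Going back to the run-together lines for a moment: a second way to get the newline into the file (the write() loops later in this notebook do it this way) is to make it part of the string itself, either typed inside the quotes or glued on with +. A minimal sketch, using a made-up scratch filename:

```python
import os
import tempfile

# Hypothetical scratch file so the earlier example file is left alone.
scratch_path = os.path.join(tempfile.mkdtemp(), "two_lines.txt")

f = open(scratch_path, "w")
f.write("First line written.\n")             # newline typed inside the quotes
f.write("This is my second line." + "\n")    # newline added with +
f.close()

f = open(scratch_path, "r")
contents = f.read()
f.close()

print(contents)
```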

If a file only has a line or two, it’s not a big deal to handle it with string functions. If a file has millions of lines, then it becomes a bit of a hassle. We need a way to read a file one line at a time. Fortunately, there’s readline():

In [19]:

input_file = open("/tmp/first_file.txt","r")
line_one = input_file.readline()
line_two = input_file.readline()
input_file.close()

print(line_one)
print(line_two)

First line written.

This is my second line.

This does almost what we expect: it reads both lines from the file, one at a time, and prints them out. The only snag is that blank space between the lines. What has happened? It turns out readline() reads the entire line, even the newline character at the end. We can see this if we evaluate the string instead of just printing it:

In [20]:

line_one

'First line written.\n'

There’s that \n again! What about the second line?

In [21]:

line_two

'This is my second line.'

When readline() reads a line, it includes the newline character at the end unless it reaches the end of the file and the file didn’t end with a newline.

It’s rare that we would want to read a bunch of lines in a file with the newlines included. That’s just not something we do very often, and practically never in scientific software. We’ll almost always want to trim off the newline character. And for that, we have the rstrip() function. It takes a string, strips off any newlines on the right side of it, and returns that cleaned-up string. rstrip() does that for the right side of the string, lstrip() cleans up the left side (the beginning of the string) and strip() goes crazy and does both ends at the same time.

Let’s try it:

In [23]:

clean_first_line = line_one.rstrip('\n')
clean_second_line = line_two.rstrip('\n')

print(clean_first_line)
print(clean_second_line)

First line written.
This is my second line.

What’s going on here? A couple of things. The first thing to note is that rstrip() and its close companions lstrip() and strip() take one optional argument: a string containing the characters to be stripped. If you leave the argument out, they strip all whitespace (spaces, tabs, and newlines). Practically always we’ll want to get rid of the trailing newline character.
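Here is a short sketch comparing the three functions, with and without an argument. The strings are made up, and repr() is used so the leftover spaces and newlines are visible.

```python
padded = "  some text \n\n"

only_newlines = padded.rstrip("\n")   # trailing newlines go, the space stays
right_side = padded.rstrip()          # all trailing whitespace goes
left_side = padded.lstrip()           # all leading whitespace goes
both_sides = padded.strip()           # both ends at once

print(repr(only_newlines))   # '  some text '
print(repr(right_side))      # '  some text'
print(repr(left_side))       # 'some text \n\n'
print(repr(both_sides))      # 'some text'

# The argument is a set of characters, not a single word to match.
trimmed = "3.14,,\n".rstrip(",\n")
print(repr(trimmed))         # '3.14'
```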

The other interesting thing is how we called the rstrip() function in the first place. We gave the name of the string variable, a period, and the name of the function we were calling. This is just like how we called the close() function on a file object. And in fact, strings are another kind of object in Python. We’ll see a lot more on this later.

Historical note: One of the first programming languages built around objects was named “Smalltalk”. In Smalltalk, the functions that were inside of objects were called “methods”. You’ll still hear people call them that. Later, the “C++” language came along and it called methods “member functions”. When programmers talk about the functions that are contained in objects, we’ll use either term interchangeably, sometimes even switching in the middle of a sentence. We now return to your Python tutorial, already in progress…

We read both lines in the file we created. We were able to call readline() twice and know that we had all of our lines in the file because (1) we created the file ourselves and (2) we therefore knew it had precisely two lines. It wasn’t even too bad having to type those readline() and rstrip() lines twice. But what if we had a lot more lines? We would certainly want to use a loop.

For example, what do we do with a five-line file?

In [24]:

my_file_object = open("/tmp/five-liner.txt", "w")
my_file_object.write("Line 1.")
my_file_object.write('\n')
my_file_object.write("Line 2.")
my_file_object.write('\n')
my_file_object.write("Line 3.")
my_file_object.write('\n')
my_file_object.write("Line 4.")
my_file_object.write('\n')
my_file_object.write("Line 5.")
my_file_object.write('\n')
my_file_object.close()

input_file = open("/tmp/five-liner.txt","r")
for i in range(5):
    input_line = input_file.readline()
    print(input_line.rstrip('\n'))

input_file.close()

Line 1.
Line 2.
Line 3.
Line 4.
Line 5.

No problem - we just use a for loop and do the readline() inside of it. It repeats the five times we asked for. In this case, after we read each line we cleaned it up a little and printed it.

But what if we can’t know the number of lines ahead of time? One approach is to have whatever program creates the file write the number of lines that will be in it first. I won’t say this is a common approach in scientific software, but it isn’t exactly rare either.

In [25]:

my_file_object = open("/tmp/five-liner.txt", "w")
my_file_object.write("5")
my_file_object.write('\n')
my_file_object.write("Line 1.")
my_file_object.write('\n')
my_file_object.write("Line 2.")
my_file_object.write('\n')
my_file_object.write("Line 3.")
my_file_object.write('\n')
my_file_object.write("Line 4.")
my_file_object.write('\n')
my_file_object.write("Line 5.")
my_file_object.write('\n')
my_file_object.close()

input_file = open("/tmp/five-liner.txt","r")

first_line = input_file.readline()
how_many_lines = int(first_line.rstrip('\n'))

for i in range(how_many_lines):
    input_line = input_file.readline()
    print(input_line.rstrip('\n'))

input_file.close()

Line 1.
Line 2.
Line 3.
Line 4.
Line 5.

The overall scheme for this is probably obvious by now. In the first half, when we’re writing the file, we write a “5” on its own line, and then write five more lines. In the second part, we

1. Read the first line.
2. rstrip() to get rid of the trailing newline.
3. Use the results of that as the argument to int(), converting that string (“5”) to an actual integer (5).
4. And finally go through a for loop that many times just like before.
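In practice the count usually isn’t typed by hand; the writing program works it out. The sketch below assumes the data is sitting in a Python list (the list contents and the filename are made up) and lets len() supply the number for the first line.

```python
import os
import tempfile

# Hypothetical data and scratch file for the count-first scheme.
samples = ["Line 1.", "Line 2.", "Line 3."]
scratch_path = os.path.join(tempfile.mkdtemp(), "counted.txt")

# Writing: the count goes first, then one line per item.
f = open(scratch_path, "w")
f.write(str(len(samples)) + "\n")
for sample in samples:
    f.write(sample + "\n")
f.close()

# Reading: the first line tells us how many readline() calls to make.
f = open(scratch_path, "r")
how_many_lines = int(f.readline().rstrip("\n"))
read_back = []
for i in range(how_many_lines):
    read_back.append(f.readline().rstrip("\n"))
f.close()

print(how_many_lines)   # 3
print(read_back)        # ['Line 1.', 'Line 2.', 'Line 3.']
```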

Most of the time we won’t have the luxury of knowing how many lines are in a file, though. We need a way to read all of the lines, line by line, without limit. For that, we can loop through the file and quit when Python returns an empty string with not even a newline character.

In [27]:

input_file = open("/tmp/five-liner.txt","r")

line = input_file.readline()
while line != '':
    print(line.rstrip('\n'))
    line = input_file.readline()
input_file.close()

5
Line 1.
Line 2.
Line 3.
Line 4.
Line 5.

The while loop behaved just like we expected: start by reading a line, and then every time the line isn’t empty, print it out and read another line. When readline() finally returns a completely empty string, meaning there is nothing left to read, exit the while loop and close the file.
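You might worry that a blank line in the middle of a file would stop this loop early. It won’t: a blank line still has its newline, so readline() returns '\n' for it, and only the end of the file gives ''. A small sketch with a made-up three-line file whose middle line is blank:

```python
import os
import tempfile

# Hypothetical scratch file with a blank line in the middle.
scratch_path = os.path.join(tempfile.mkdtemp(), "blank_line.txt")
f = open(scratch_path, "w")
f.write("first\n\nthird\n")
f.close()

lines_seen = []
f = open(scratch_path, "r")
line = f.readline()
while line != '':
    lines_seen.append(line)
    line = f.readline()
after_the_end = f.readline()   # asking again at the end still gives ''
f.close()

print(lines_seen)            # ['first\n', '\n', 'third\n']
print(repr(after_the_end))   # ''
```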

Looping through a file all the way to the end is such a common thing to do that Python has a shortcut for it. Remember when we talked about a for loop iterating over an ordered set? A file can be thought of as an ordered set of strings. They’re not in alphabetical order, but rather they are ordered by line number. That means we can:

In [28]:

input_file = open("/tmp/five-liner.txt","r")

for line in input_file:
    print(line.rstrip('\n'))

input_file.close()

5
Line 1.
Line 2.
Line 3.
Line 4.
Line 5.
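The same for loop works just as well when you want to compute something from the lines instead of printing them. This sketch assumes a made-up file with one whole number per line, then counts the lines and adds up the numbers.

```python
import os
import tempfile

# Hypothetical scratch file with one number per line.
scratch_path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
f = open(scratch_path, "w")
for number in [3, 4, 5]:
    f.write(str(number) + "\n")
f.close()

line_count = 0
total = 0
f = open(scratch_path, "r")
for line in f:
    total = total + int(line.rstrip('\n'))   # text to integer, as before
    line_count = line_count + 1
f.close()

print(line_count, total)   # 3 12
```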

As you can imagine, reading isn’t the only file operation you can do with a loop. You can also write to a file that way. For instance,

In [29]:

my_file_object = open("/tmp/five-liner.txt", "w")
for i in range(7):
    output_string = str(i)
    my_file_object.write(output_string + '\n')
my_file_object.close()

input_file = open("/tmp/five-liner.txt","r")
for line in input_file:
    print(line.rstrip('\n'))
input_file.close()

When this runs, the file ends up holding the numbers 0 through 6, one per line, and the second loop prints them back out.

Finally, we don’t have to erase the contents of a file every time we write to it. It’s perfectly normal to append to an existing file, and for that the “a” mode can be used with open().

In [30]:

my_file_object = open("/tmp/five-liner.txt", "a")
for i in range(7,10):
    output_string = str(i)
    my_file_object.write(output_string + '\n')
my_file_object.close()

input_file = open("/tmp/five-liner.txt","r")
for line in input_file:
    print(line.rstrip('\n'))
input_file.close()

When you use the append mode, the write() calls add to the end of the file if it already exists. If it doesn’t exist, it is created and then written to as though we had used the “w” mode. Here the file already held 0 through 6, so after the append it holds 0 through 9.
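Both halves of that behavior show up in this sketch. It assumes a made-up file, run_log.txt, in a fresh temporary directory, so the file is guaranteed not to exist the first time it is opened.

```python
import os
import tempfile

# Hypothetical log file that does not exist yet.
log_path = os.path.join(tempfile.mkdtemp(), "run_log.txt")
existed_before = os.path.exists(log_path)

# First open with "a": the file is missing, so it gets created.
f = open(log_path, "a")
f.write("first run\n")
f.close()

# Second open with "a": the old contents stay and the new text goes at the end.
f = open(log_path, "a")
f.write("second run\n")
f.close()

f = open(log_path, "r")
log_contents = f.read()
f.close()

print(existed_before)   # False
print(log_contents)
```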

So far in this lesson we’ve acted like everything just works perfectly every time. In reality, it’s not that neat. Filenames get typed in wrong, disks get full, and lines that are supposed to be numbers might contain text instead. Any of these problems is enough to bring our Python code to a grinding halt. Our next lesson is all about how to handle these problems and many, many more like them. We’re going to learn about Exceptions!