Updated Mon Feb 27 20:26:11 EST 2023
[The Python programs shown here are stored as individual files in the directory py. It may be easier to copy them from there than to copy them from the web page.]
This page shows Python equivalents for some of the Awk programs that appeared in Studio 3. If you want to run a Python program, it's easiest to copy it into a file, save it as whatever.py, and then run it:
$ python whatever.py
On macOS, the default version of python is likely to be Python 2, which has some minor but irritating incompatibilities with Python 3. It might be easier to download Python 3 from python.org than to cope with the differences.
On Windows with WSL, you are likely to already have Python 3.
In either case, to find out, run Python and it will tell you what version you're running:
$ python
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
As an alternative to running Python on your own computer, you can invoke Python explicitly in Colab. Note the exclamation point !, which signals that the rest of the line is a shell command.
$ !python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
I do not fully understand this mechanism, but it seems to work if the input to the Python program comes from files that you have uploaded.
For the Awk program
$ awk '{print}'
which prints every input line, the simplest Python equivalent is this version of cat:
# cat: one way to read & copy a file
# one line at a time from stdin
# equivalent to Unix cat command for text
# awk '{print}'

import sys

line = sys.stdin.readline()
while line != "":
    print(line, end="")          # end="" keeps print from adding a second newline (Python 3 only)
    line = sys.stdin.readline()
readline() is a function that reads an input stream a line at a time, returning each line (including the newline character at the end). It returns an empty string when there is no more input. sys.stdin.readline() reads from the standard input stream, which may be the keyboard, a file redirected with <file, or a pipe.
To run this, put it in a file, say cat.py, then say
$ python cat.py
In Colab, cat.py doesn't seem to work right if the input comes from the keyboard, but it's ok if the input comes from a file like this:
$ !python cat.py <cat.py
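As an aside, Python can also loop over an input stream directly, which makes cat even shorter. This is just a sketch of that alternative style; the rest of these examples stick with the explicit readline() loop so they stay close to the Awk originals.
# cat (sketch): iterate directly over stdin
import sys
for line in sys.stdin:       # the for loop reads one line at a time
    print(line, end="")      # end="" keeps print from adding a second newline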
# cat2: read & copy a file one line at a time
# from stdin or list of filenames
# equivalent to cat for text
# awk '{print}'

import sys

def cat(f):    # print a single text file
    line = f.readline()
    while line != "":
        print(line, end="")    # end="" keeps print from adding a second newline (Python 3 only)
        line = f.readline()

def main():
    if len(sys.argv) == 1:     # no filename arguments: read standard input
        cat(sys.stdin)
    else:                      # otherwise process each named file in turn
        for i in range(1, len(sys.argv)):
            f = open(sys.argv[i])
            cat(f)
            f.close()

main()
This program defines two functions, main to handle the
overall processing, and cat to print a single file. Program
execution begins by calling main (the last line). This is a
very standard pattern for organizing a program.
The command-line arguments are available to a Python program in a list called sys.argv; sys.argv[0] is the name of the program itself, and sys.argv[1] is the first filename argument. So the program tests whether there are any filename arguments; if not, it reads the standard input, and otherwise it loops over the filenames. (Are you starting to get some idea of how Awk simplifies some aspects of computing? We'll mostly just read from the standard input from now on, but you can imagine adding this code to handle the more general case.)
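To see how the arguments arrive, here's a tiny sketch (args.py is just a made-up name for this illustration):
# args.py (sketch): show what ends up in sys.argv
import sys
print("program:", sys.argv[0])        # argv[0] is the program's own name
for i in range(1, len(sys.argv)):     # the real arguments start at argv[1]
    print("arg", i, "=", sys.argv[i])
Running it as python args.py 18.csv temp would print the program name followed by the two filenames.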
Awk splits each input line into fields, by default strings of characters separated by spaces and/or tabs. To achieve the same in Python, we have to explicitly split each input line.
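For example, at the interactive prompt (a quick sketch with made-up strings):
>>> "  one two   three ".split()      # default: split on runs of white space
['one', 'two', 'three']
>>> "one/two/three".split('/')        # split on a specific separator
['one', 'two', 'three']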
Here's the Awk program to just count the number of fields on each input line:
$ awk -F/ '{print NF}' 18.csv
And here's the same thing for Python.
# flds: read stdin one line at a time, text only
# split into fields, white space only
# awk '{print NF}'

import sys

line = sys.stdin.readline()
while line != "":
    flds = line.strip().split()   # split by white space
    print(len(flds))              # print NF
    line = sys.stdin.readline()
strip() strips white space from both ends of a string
of characters; split() splits a string of characters
separated by spaces into separate fields, like the default
behavior of Awk. The flds array starts at zero, however,
not at 1 as in Awk. Exercise: how could you fix this?
The Awk command in the pipeline
$ awk -F/ '{print $5}' 18.csv | sort | uniq -c | sort -n
that prints the 5th field could be replaced by a Python program that prints the 5th field; the principal lines would be
flds = line.strip().split('/') # split by /
print(flds[4])
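Here's what the whole program might look like (a sketch, using the same readline() loop as before; whatever.py is just the placeholder name used in the pipeline below):
# whatever.py (sketch): print field 5 (base 1) of each line
import sys
line = sys.stdin.readline()
while line != "":
    flds = line.strip().split('/')   # split by /
    print(flds[4])                   # field 5, since the list starts at 0
    line = sys.stdin.readline()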
and the pipeline would be
$ python whatever.py <18.csv | sort | uniq -c | sort -n
Some of this could be replaced by Python code that uses a dictionary to accumulate the different counts:
# fld5: print number of times each item in field 5 (base 1) occurs
# split on /
# equivalent to
# awk -F/ '{print $5}' 18.csv | sort | uniq -c

import sys

count = {}
line = sys.stdin.readline()
while line != "":
    flds = line.strip().split('/')   # split by /
    count[flds[4]] = count.get(flds[4], 0) + 1
    line = sys.stdin.readline()
for i in count:
    print(count[i], i)
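The final for loop prints the counts in whatever order the dictionary holds them, while the shell pipeline sorts them numerically. If you want the Python program to do the sorting itself, one way (a sketch) is to sort the items by count before printing:
# sketch: print the counts in increasing numeric order, like sort -n
for item, n in sorted(count.items(), key=lambda pair: pair[1]):
    print(n, item)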
An example for selecting some lines and printing some fields:
$ awk -F/ 'NR > 1 && $5 > 14 {print $5, $6, $7}' 18.csv
# gt14: print fields 5, 6, 7 (origin 1) of lines after the header where field 5 is greater than 14
# split on /
# equivalent to
# awk -F/ 'NR > 1 && $5 > 14 {print $5, $6, $7}' 18.csv

import sys

NR = 0
line = sys.stdin.readline()
while line != "":
    NR += 1
    flds = line.strip().split('/')   # split by /
    if NR > 1 and int(flds[4]) > 14:
        print(flds[4], flds[5], flds[6])
    line = sys.stdin.readline()
There are other ways to skip the first line. Any thoughts?
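One possibility (a sketch, assuming the header is always exactly one line) is to read and throw away a line before the loop starts, which makes the NR test unnecessary:
# sketch: discard the header up front instead of counting lines
sys.stdin.readline()          # read and ignore the first line
line = sys.stdin.readline()   # then process the rest as before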
Here's a pipeline that prints the authors that have only one entry, along with a Python equivalent:
$ awk -F/ '{print $3, $4}' 18.csv | sort | uniq -c | sort -nr | awk '$1 == 1'
# auth1: print names of authors with only one entry
# split on /
# equivalent to
# awk -F/ '{print $3, $4}' 18.csv | sort | uniq -c | sort -nr | awk '$1 == 1'

import sys

count = {}
line = sys.stdin.readline()
while line != "":
    flds = line.strip().split('/')   # split by /
    author = flds[2] + " " + flds[3]
    count[author] = count.get(author, 0) + 1
    line = sys.stdin.readline()
for i in count:
    if count[i] == 1:
        print(count[i], i)
Here's an Awk program that selects only the lines whose first two fields are both strings of digits, saving the result in a temporary file:
$ awk -F/ '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/' 18.csv >temp
Here's the Python version:
# ex4.py: print age if first and second field are both integers
# equivalent to
# awk -F/ '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {print $2-$1, $3, $4}'

import sys
import re

line = sys.stdin.readline()
while line != "":
    flds = line.strip().split('/')   # split by /
    if re.search('^[0-9]+$', flds[0]) != None and \
       re.search('^[0-9]+$', flds[1]) != None:
        print(int(flds[1]) - int(flds[0]))
    line = sys.stdin.readline()
It imports the re library, to use a single function,
re.search, which returns a Match object if there was a
match, and None if there was no match. Since we only
care about whether there was a match or not, the test is simple
and we can ignore the Match object.
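For example (a small sketch; the exact printed form of the Match object depends on the Python version):
# sketch: re.search returns a Match object or None
import re
print(re.search('^[0-9]+$', '1906'))   # a Match object: the string is all digits
print(re.search('^[0-9]+$', 'n.d.'))   # None: it is not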
If we have created a temporary file with valid-dates-only data, then subsequent processing is easier.
$ awk -F/ '{ ages = ages + $2 - $1 } # add up all the ages
END { print "average age =", ages / NR }' <temp
$ awk -F/ '$2-$1 > max { max = $2 - $1; fname = $3; lname = $4 }
END { print "oldest:", fname, lname, " age", max }' <temp
Combining these two into a single Python program:
# ages.py: compute ages assuming first and second field are both integers
# equivalent to
# awk -F/ '{ ages = ages + $2 - $1 } # add up all the ages
# END { print "average age =", ages / NR }' <temp
# awk -F/ '$2-$1 > max { max = $2 - $1; fname = $3; lname = $4 }
# END { print "oldest:", fname, lname, " age", max }' <temp

import sys
import re

ages = 0       # running total of all the ages
max = 0        # largest age seen so far
fname = ""
lname = ""
NR = 0         # line counter, like Awk's NR
line = sys.stdin.readline()
while line != "":
    NR += 1
    flds = line.strip().split('/')   # split by /
    age = int(flds[1]) - int(flds[0])
    ages += age
    if age > max:
        max = age
        fname = flds[2]
        lname = flds[3]
    line = sys.stdin.readline()
print("average age =", ages / NR)
print("oldest:", fname, lname, "age", max)
As you can see, it's a lot more work to write a Python program to do quick and dirty explorations than it is with Awk. But once you have the lay of the land, then you can switch to Python, perhaps with a set of functions that you have written for your specific case.