Search This Blog

Tuesday, August 04, 2009

Reading from rather big files in Python

Recently I needed to open the big file (apache log - 14 GB or so) and cut some information from it. Of course use of file.read() and/or file.readlines() method wasn't possible. On the other hand, using file.readline() few (rather more than 20) million times doesn't sound right. Therefore, I looked for another resolution and found that you can limit the size of readlines().
f=open('filename','r')
opensize=2**27
longlist=[]
while 1:
   shortlist=[[l.split()[n] for n in [0,4,-2,-1]] for l in f.readlines(opensize)]
   if not list:
       break
   else:
       longlist.extend(shortlist)

The script open the 'filename' file and next in the loop:
  • read from that file lines of size close to 128 Mb (2**27),
  • cut first, fifth, next to last and last column from each line,
  • add created (temporary) list to the output list.
It's worth to note that if
shortlist
is not created the script will leave the loop (lines 6 and 7). It not obligatory, but I like to work with 2 powers, therefore opensize=2**27.

No comments: