Parsing a very large XML file with Python

While running some experiments, I needed a suitable text corpus: ideally a few GB in size and free of charge. Looking around, I found that Wikipedia publishes dumps of its pages, and decided that the 2013 dump would be a suitable bag of text for my experiments: a single 9GB XML file, and that is the compressed size. Uncompressed it is 42GB. I had on my computer the largest XML file I’ve seen in my life, and I didn’t have enough disk space left to even extract it; parsing it would be a hell of a job.

Fortunately I already knew how to parse XML with the nice lxml Python module, but how to proceed when I cannot even extract the file? It turns out that there is a module called bz2, which exposes an interface for opening .bz2 files in the same fashion as the open() function. After some hours of struggling, searching several sites, blogs and forums, and filling all of memory plus swap more than once, I managed to parse the file without blowing up the RAM with this Python script:

from lxml import etree
import sys
import bz2
import unicodedata

TAG = '{http://www.mediawiki.org/xml/export-0.8/}text'

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    # modified to call func() only for the event and elem needed
    for event, elem in context:
        if event == 'end' and elem.tag == TAG:
            func(elem, *args, **kwargs)
        # free the element and any already-processed siblings before it
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem, fout):
    global counter
    # elem.text is None for an empty element
    text = elem.text or ''
    # decompose accented characters, then drop everything that is not ASCII
    normalized = unicodedata.normalize('NFKD', text) \
            .encode('ascii', 'ignore').decode('ascii').lower()
    print(normalized.replace('\n', ' '), file=fout)
    if counter % 10000 == 0:
        print("Doc " + str(counter))
    counter += 1

def main():
    global counter
    counter = 0
    # BZ2File decompresses on the fly, so the file is never extracted to disk
    with bz2.BZ2File(sys.argv[1], 'r') as fin, \
            open('2013_wikipedia_en_pages_articles.txt', 'w') as fout:
        context = etree.iterparse(fin)
        fast_iter(context, process_element, fout)

if __name__ == "__main__":
    main()
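
The normalization step in process_element can be tried on its own. This is a minimal sketch assuming Python 3; the helper name to_ascii_line is made up for illustration. NFKD splits an accented character into a base letter plus a combining mark, and encoding to ASCII with errors set to 'ignore' then drops the mark. Characters with no ASCII base letter are dropped entirely.

```python
import unicodedata

def to_ascii_line(text):
    """Return text as a single lowercase, ASCII-only line."""
    # 'é' becomes 'e' followed by a combining accent, and so on.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop every code point that has no ASCII encoding.
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    return ascii_only.lower().replace('\n', ' ')

print(to_ascii_line('Café\nZürich'))  # prints: cafe zurich
```

Note that this is lossy: text in non-Latin scripts disappears from the output, which may or may not matter for a given experiment.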

The output format is a single file with one document per line. It would probably be better to put all these lines in a DBMS or a Berkeley DB file, as the lines can be really, really big, but for now the plain text format will suffice. Every document is normalized to contain only ASCII characters, avoiding potential encoding problems, but besides that no further processing is done. I parsed only the tag “text” from the XML, but if you need to parse more than one tag, change TAG into a set:

TAG = {
    # for example, page titles in addition to the article text
    '{http://www.mediawiki.org/xml/export-0.8/}title',
    '{http://www.mediawiki.org/xml/export-0.8/}text'
}
(...)
        if event == 'end' and elem.tag in TAG:
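
To try the multi-tag variant without the real dump, here is a small sketch that runs on a made-up, in-memory sample. It assumes Python 3 and swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra packages; the sample XML and the names TAGS and iter_wanted are hypothetical.

```python
import bz2
import io
import xml.etree.ElementTree as ET

NS = '{http://www.mediawiki.org/xml/export-0.8/}'
# Fully qualified names of the tags to keep.
TAGS = {NS + 'title', NS + 'text'}

# Made-up miniature "dump" standing in for the real file.
SAMPLE = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
    '<page><title>A</title><revision><text>first doc</text></revision></page>'
    '<page><title>B</title><revision><text>second doc</text></revision></page>'
    '</mediawiki>'
).encode('utf-8')

def iter_wanted(fileobj, tags):
    """Yield (short_name, text) for every element whose tag is in tags."""
    for event, elem in ET.iterparse(fileobj, events=('end',)):
        if elem.tag in tags:
            # Strip the '{namespace}' prefix to get the short tag name.
            yield elem.tag.split('}', 1)[1], elem.text or ''
        # Empty the element once its end tag has been handled. Unlike the
        # lxml version, the emptied element stays attached to its parent.
        elem.clear()

# BZ2File accepts an open binary file object as well as a filename.
with bz2.BZ2File(io.BytesIO(bz2.compress(SAMPLE)), 'r') as fin:
    found = list(iter_wanted(fin, TAGS))
print(found)
```

Elements are reported in the order their closing tags appear, so each title comes out before the text of the same page.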

The URL enclosed in braces is the namespace of the tags, which lxml annoyingly insists on using, so I just ended up using the “full name” of the tag I needed. Also, I know that globals are evil; I just wanted to easily count the number of documents processed and I was too lazy to properly write a class. I killed the process when the output file reached around 20GB, which I considered big enough for my experimental needs. Now all that is left is to actually run the benchmarks and see what happens.
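
For the record, the class I was too lazy to write might look something like the sketch below. It assumes Python 3; the name DocCounter and its report_every parameter are made up, and the normalization step is left out to keep it short. An instance is callable with the same (elem, fout) arguments that fast_iter passes to its callback, so the count lives on the object instead of in a global.

```python
import io
import xml.etree.ElementTree as ET

class DocCounter:
    """Callable that writes one document per line and counts them."""

    def __init__(self, report_every=10000):
        self.count = 0
        self.report_every = report_every

    def __call__(self, elem, fout):
        # elem.text is None for an empty element
        text = elem.text or ''
        fout.write(text.replace('\n', ' ') + '\n')
        if self.count % self.report_every == 0:
            print("Doc " + str(self.count))
        self.count += 1

# Small demonstration with hand-built elements and an in-memory output file.
out = io.StringIO()
process = DocCounter(report_every=2)
for body in ['first\ndoc', 'second doc', 'third doc']:
    elem = ET.Element('text')
    elem.text = body
    process(elem, out)
print(process.count)
```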

And just for the record: I’m not a Python specialist, so feel free to point out any problems or bottlenecks in this code.