Friday, July 1, 2011

Python, json and garbage collection

We have a web app that exposes a REST interface, and JSON is the data format for most of its APIs. About 98% of the JSON messages are smaller than 500 KB, but the remaining 2% can exceed 100 MB. We observed that after processing one such 100 MB message, the process memory rose to 500 MB and stayed there. It never came down, even after the web app had run for hours processing only small JSON messages. Interestingly, memory also stayed at 500 MB after processing several more 100 MB messages, so usage did not keep growing with each large message. Analysis pointed to 'json.loads' as the culprit. Calling gc.collect does release the memory, and for now that seems to be the only solution.
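As a rough illustration of that workaround, the sketch below forces a full collection once a large message has been handled. It is not the code from our web app: `handle_message`, `handler` and the `LARGE_PAYLOAD_BYTES` cutoff are hypothetical names and values picked for the example.

```python
import gc
import json

# Hypothetical cutoff; payloads at or above this size trigger an explicit collection.
LARGE_PAYLOAD_BYTES = 10 * 1024 * 1024


def handle_message(payload, handler, large_threshold=LARGE_PAYLOAD_BYTES):
    """Parse a JSON payload, pass the result to handler, and collect after large ones."""
    is_large = len(payload) >= large_threshold
    try:
        # The parsed object is only referenced for the duration of this call.
        return handler(json.loads(payload))
    finally:
        if is_large:
            # Explicit full collection, the workaround described above.
            gc.collect()
```

Small messages skip the explicit collection, so the cost of a full gc.collect is paid only on the rare large payloads.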

Since an explicit call to gc.collect releases the memory, it does not appear to be held in any cache or in Python's internal memory allocator. It seems the GC threshold was never reached, so automatic garbage collection never kicked in. Still, it is strange that the threshold was not reached even after the web app had been running for hours.
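For context, the gc module documentation says a generation 0 collection starts when the number of container allocations minus deallocations exceeds the first threshold. The sketch below is one way to watch that counter move. The helper name `count_after_allocations` is made up for this example, and automatic collection is switched off during the measurement so that it cannot reset the counter.

```python
import gc


def count_after_allocations(n):
    """Return the generation 0 count before and after creating n lists."""
    gc.collect()                  # start from a freshly collected state
    was_enabled = gc.isenabled()
    gc.disable()                  # stop automatic collection from resetting the counter
    try:
        before = gc.get_count()[0]
        keep = [[] for _ in range(n)]  # every list is a GC-tracked container
        after = gc.get_count()[0]
    finally:
        if was_enabled:
            gc.enable()
    del keep
    return before, after


before, after = count_after_allocations(1000)
print(gc.get_threshold(), before, after)
```

The counter only rises while container objects are being allocated faster than they are freed, which is why a process can sit below the threshold for a long time.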

Test code

The test code shown below simulates the situation. If the call to json.loads is omitted, the issue is not observed. The GC counts printed after invoking jsontest are (150, 9, 0), which indicates that the GC thresholds (700, 10, 10) have not been met.

import json
import time
import gc
from itertools import count


def keygen(size):
    # Yield zero-padded decimal strings of width `size`: '00...01', '00...02', ...
    for i in count(1):
        s = str(i)
        yield '0' * (size - len(s)) + s

def jsontest(num):
    keys = keygen(20)
    # Build a JSON object with num entries: 20-character keys, 200-character values.
    kvjson = json.dumps(dict((next(keys), '0' * 200) for i in range(num)))
    kvpairs = json.loads(kvjson)
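    del kvpairs  # Not required. Just to see if it makes any difference
    print('load completed')

# NOTE: the original listing is truncated from here on. The jsontest call,
# its argument and the sleep loop are assumptions; 500000 entries gives a
# JSON document of roughly 100 MB.
jsontest(500000)
print(gc.get_count())

while 1:
    time.sleep(1)  # keep the process alive so its memory can be watched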
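
To make the comparison between the printed counts and the thresholds explicit, here is a small sketch that checks each generation's count against its threshold. The function name `generations_over_threshold` is invented for this example, and it is a simplified model: CPython only looks at the older generations when a generation 0 collection is triggered.

```python
def generations_over_threshold(counts, thresholds):
    """Return the indexes of generations whose count exceeds its threshold."""
    return [gen
            for gen, (count, limit) in enumerate(zip(counts, thresholds))
            if limit > 0 and count > limit]


# Counts and thresholds reported for the test code above.
print(generations_over_threshold((150, 9, 0), (700, 10, 10)))  # → []
```

An empty list means no generation is due for collection, which matches the observation that automatic garbage collection never ran.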
