I have a web-scraping routine in my personal stock screener app that runs a function making roughly 2000 sequential HTTP GET requests, with a little processing time between each. Every once in a while I get a system fork error (from a little research, it appears to be a case of the underlying OS getting wedged).

My setup: Arc 3.1 (customized), a Linode running Ubuntu Jaunty, and Apache with mod_proxy passing data to the Arc server.

I have a few ideas that I think might help (or make it worse - lol):

1. I have already done some work towards moving to nginx (thanks to paslecam); I could make finishing that migration a priority. (See the config sketch below.)

2. I could thread the HTTP requests so they run concurrently. The idea is that the threads would somehow release memory faster or more incrementally, although I don't see why a sequential routine wouldn't gc well enough. I also don't fully understand whether network traffic eats RAM in a way that makes Arc's gc a factor. (See the worker-pool sketch below.)

3. I could slow the routine down to only do an http-get every x seconds or so, but if that's what it takes, I worry my server isn't robust enough to handle increased traffic. (See the throttling sketch below.)

4. I notice my CPU usage sits at around 96%. I do quite a few file operations (2 writes and 2 deletes per request), so I am guessing the CPU is spending a lot of time on those. I could load the result of each request directly into memory instead of saving a file and then loading it, and I could save all the results at the end instead of incrementally. (See the in-memory sketch below.)

Just wondering, based on your experiences, which of these factors make sense to focus on. (In retrospect, #4 is starting to look like the best option.)
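For option 1, here is a minimal nginx reverse-proxy sketch, analogous to the Apache mod_proxy setup; the Arc server's port (8080) is an assumption, so adjust it to whatever your Arc server actually listens on:

  # minimal reverse-proxy config; assumes the Arc server is on localhost:8080
  server {
      listen 80;

      location / {
          proxy_pass http://127.0.0.1:8080;
          proxy_set_header Host $host;
          proxy_set_header X-Real-IP $remote_addr;
      }
  }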
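For option 2, a rough worker-pool sketch in Arc 3.1; http-get and process-page stand in for whatever your routine already does, and urls* is assumed to hold the ~2000 urls. A handful of workers bounds concurrency so you don't have 2000 requests in flight at once:

  ; shared work list; atomic keeps two workers from popping the same url
  (= pending* urls*)

  (def worker ()
    (whilet u (atomic (pop pending*))
      (process-page (http-get u))))

  ; spawn 5 workers instead of one sequential loop
  (repeat 5 (thread (worker)))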
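For option 3, throttling is just a sleep between iterations; again, http-get and process-page are stand-in names for your existing code:

  ; one request every 2 seconds instead of back-to-back
  (def slow-scrape (urls)
    (each u urls
      (process-page (http-get u))
      (sleep 2)))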
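For option 4, a sketch that accumulates parsed results in a table and writes to disk once at the end, replacing the 2 writes and 2 deletes per request; parse and the filename are assumptions, and writefile is Arc 3.1's write-a-value-to-file helper:

  ; results keyed by url, held in memory, written in one shot at the end
  (= results* (table))

  (def scrape-all (urls)
    (each u urls
      (= (results* u) (parse (http-get u))))
    (writefile results* "results.arc"))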
[Edit: a new error from tonight's run; I haven't seen this one until now:

  ...
  PLT Scheme virtual machine has run out of memory; aborting
  Aborted]