Setting up a distributed processing farm

So we’re in the process of switching all our animations over to new characters. This consists of saving out about 10,000 animations and reloading them onto new rigs. We’ve had to do things like this multiple times, as well as other tasks, like batch processing the gazillion gigs of textures in our game.

Since these tasks are far too large for a couple of machines, and are easily parallelized, I ended up writing a distributed work system. It works by creating a ‘work log’ that lives on the network; whenever it is read or written, it ‘locks’, and other requests queue up for their reads/writes (it’s like a memory/sync lock between separate CPU threads, if each CPU were a computer and the memory were a file on the network).
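
To make that concrete, here’s a rough sketch of the shape of it (not my actual code; the share path, log format and field names are made up for illustration). The ‘lock’ is just a directory created atomically next to the log file:

```python
import os, time, json

LOG_PATH = r"\\server\share\work_log.json"   # made-up network path
LOCK_PATH = LOG_PATH + ".lock"               # lock directory next to the log

def acquire_lock(timeout=60):
    """Spin until we can atomically create the lock directory."""
    start = time.time()
    while True:
        try:
            os.mkdir(LOCK_PATH)   # directory creation is atomic, even on a share
            return
        except OSError:
            if time.time() - start > timeout:
                raise RuntimeError("timed out waiting for work log lock")
            time.sleep(0.5)

def release_lock():
    os.rmdir(LOCK_PATH)

def claim_next_item(worker_name):
    """Lock the log, mark the first pending item as claimed by us, unlock."""
    acquire_lock()
    try:
        with open(LOG_PATH) as f:
            log = json.load(f)
        for item in log["items"]:
            if item["status"] == "pending":
                item["status"] = "claimed"
                item["worker"] = worker_name
                with open(LOG_PATH, "w") as f:
                    json.dump(log, f, indent=2)
                return item["path"]
        return None   # nothing left to do
    finally:
        release_lock()
```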

The details are unimportant and maybe I’ll get to them later. Mostly what I’m curious about is: what do people want out of a ‘distributed processing’ farm, and has anyone written something like it?

Hey Rob,

I had something like that set up for a few months at work to churn through data by running a few commandlets in Unreal. In the end I made the commandlet fast enough that I didn’t need to parallelize, but the farm was working.

Here’s how I did it; to answer your question more directly, I’ll list the requirements I see in a farm below.

I started by doing something similar to what you have: a MySQL database set up to keep track of all of the work, with the clients grabbing the latest available job from the DB. It worked fine until I wanted different clients to do different things and pass data back and forth between them.
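
For reference, the ‘grab the latest available job’ part boils down to an atomic claim, something like this (just a sketch; I’m using sqlite here as a stand-in for the MySQL table, and the column names are invented):

```python
# Sketch of the job-claim query, with sqlite standing in for the MySQL table.
import sqlite3, socket, uuid

conn = sqlite3.connect("farm.db", isolation_level=None)
conn.execute("""CREATE TABLE IF NOT EXISTS jobs (
                    id     INTEGER PRIMARY KEY,
                    asset  TEXT,
                    status TEXT DEFAULT 'pending',
                    worker TEXT)""")

def claim_next_job():
    """Atomically mark one pending job as ours, then fetch it back."""
    tag = "%s:%s" % (socket.gethostname(), uuid.uuid4().hex)   # unique claim tag
    cur = conn.execute(
        """UPDATE jobs SET status = 'claimed', worker = ?
           WHERE id = (SELECT id FROM jobs
                       WHERE status = 'pending'
                       ORDER BY id LIMIT 1)
             AND status = 'pending'""",
        (tag,))
    if cur.rowcount == 0:
        return None                      # nothing left to do
    return conn.execute(
        "SELECT id, asset FROM jobs WHERE worker = ?", (tag,)).fetchone()
```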

The pipeline was: run a commandlet on a piece of data, parse the output, and dump it to a database. It was used for asset management and basic data mining.

Parsing that data quickly became the choke point, as some of the files were over 100 megs of text. So I rewrote the system in Python with Twisted and set up a master server that managed all of the requested jobs and held the temporary data passed between clients.

So the flow was: client A logged in to the server as a crawler, grabbed an asset name, ran whatever it needed to run, then pickled the data and sent it up to the server, rinse and repeat. Clients C, D and E would then grab that data, parse it, and push it into the DB.

The advantage was that clients C, D and E didn’t need any software installed beyond Python, so I could harness every available desktop I could get my hands on.
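
The real thing used Twisted, but the claim/submit round trip looked roughly like this. Here’s a stripped-down stand-in using the standard library’s xmlrpc instead, just to show the shape (all the names are invented):

```python
# Stand-in for the Twisted master: names and queues are invented for illustration.
from xmlrpc.server import SimpleXMLRPCServer

pending_assets = ["tree_01", "tree_02"]   # asset names waiting for a crawler
result_queue = []                         # pickled crawler output waiting for a parser

def get_asset():
    """Hand a crawler (client A) the next asset name, or '' when nothing is left."""
    return pending_assets.pop(0) if pending_assets else ""

def put_result(blob):
    """A crawler sends its pickled output up; the master holds it for the parsers."""
    result_queue.append(blob.data)        # blob arrives as an xmlrpc Binary
    return True

def get_result():
    """A parser (client C/D/E) pulls one pickled result to unpickle and push to the DB."""
    return result_queue.pop(0) if result_queue else ""

server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
for fn in (get_asset, put_result, get_result):
    server.register_function(fn)
server.serve_forever()
```

A crawler client then just loops on ServerProxy("http://master:8000").get_asset(), pickles whatever the commandlet produced, and calls put_result(); the parse clients drain get_result(), unpickle, and write to the DB.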

That’s the five-second overview; now on to the requirements.

Need to log which files need to be worked on
Need to know which machine grabbed which file
Need to know when that machine grabbed the file, and have the job time out if the machine doesn’t report back
Ideally it’s nice to see which machine is doing what and to be able to cancel jobs remotely (see the sketch after this list)
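
In practice those points don’t amount to much more than one record per file plus a timeout sweep, something like this (a sketch; the field names and timeout value are made up):

```python
import time

JOB_TIMEOUT = 30 * 60       # seconds before a silent machine's job is requeued (arbitrary)

# One record per file, wherever it lives (DB row, dict on the master, ...).
job = {
    "file":       r"\\server\textures\rock_diffuse.tga",   # what needs to be worked on
    "worker":     None,          # which machine grabbed it
    "claimed_at": None,          # when that machine grabbed it
    "status":     "pending",     # pending / running / done / failed
}

def reap_stale_jobs(jobs):
    """Requeue anything whose machine never reported back within the timeout."""
    now = time.time()
    for j in jobs:
        if j["status"] == "running" and now - j["claimed_at"] > JOB_TIMEOUT:
            j.update(worker=None, claimed_at=None, status="pending")
```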

That’s the bare bones. Then you can get fancy: keep track of average times, figure out which machines are faster and throw larger files at them, build a smarter client so that if a shared machine is suddenly needed for a dog-and-pony show you can kill its job remotely, and add machine pools. For example, we can have 10 dedicated machines that run all day, but at night turn on an extra 50 and have them shut off automatically.

Hope that helps some,

Luiz

A couple of years back I wrote something similar as an exercise. It used Pyro, a Python package designed to share objects across multiple machines. My tool accepted jobs in the form of Python scripts, batch files or command-line strings. It was a good learning experience. Pyro is cool, but today I would definitely write it with Twisted. It’s a steeper learning curve, but more robust.
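
The core of the Pyro side was really just exposing a job-queue object over the network. From memory, it looked roughly like this, rewritten against the current Pyro4 API (the class and method names are invented, not my original code):

```python
# From-memory sketch with Pyro4: a shared job queue any machine on the network can hit.
import socket
import Pyro4

@Pyro4.expose
class JobQueue(object):
    def __init__(self):
        self.jobs = []                    # command-line strings waiting to be run

    def submit(self, command_line):
        self.jobs.append(command_line)

    def next_job(self):
        return self.jobs.pop(0) if self.jobs else None

daemon = Pyro4.Daemon(host=socket.gethostname())   # listen on the network, not just localhost
uri = daemon.register(JobQueue(), objectId="jobqueue")
print("Workers connect with Pyro4.Proxy(%r)" % str(uri))
daemon.requestLoop()
```

A worker then just connects with Pyro4.Proxy on that URI, pulls next_job() in a loop, and hands each string off to subprocess.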

PiCloud is another interesting Python package, designed to make cloud processing simple and accessible to Python programmers. I’ve only toyed with it, but it seems to have promise. Offsite cloud work might not be viable for what you’re doing, however.

A couple of guys at Bungie gave an interesting GDC talk a while back on their distributed processing farm. Here are the slides:
http://www.bungie.net/images/Inside/publications/presentations/Life_on_the_Bungie_Farm.pptx

As for features, I’d ultimately want to make sure the tool had a simple, intuitive UI and practical documentation. I’ve found it’s too easy for tools like this to fall into the realm of arcane lore, usable and tweakable only by one or two people at the studio. UI, docs, training and support strategies can overcome that, but it takes attention and time.

More specifically, I’d want a job management interface that allowed restarting, canceling and re-prioritizing jobs, plus notifications and a good amount of job status information and logging. A web-based interface that’s remotely accessible would add to the appeal, even if the main UI was a regular GUI. Full-fledged Twisted applications are particularly good at offering multiple client interfaces.