As people who work with data begin to automate their processes, they inevitably write batch jobs. These jobs need to run on a schedule, typically have a set of dependencies on other existing datasets, and have other jobs that depend on them. Throw a few data workers together for even a short amount of time and you quickly have a growing, complex graph of computation batch jobs. Now run a fast-paced, medium-sized data team for a few years on an evolving data infrastructure, and you have a massively complex network of computation jobs on your hands. This complexity can become a significant burden for data teams to manage, or even comprehend.

These networks of jobs are typically DAGs (directed acyclic graphs) and have the following properties:

- Scheduled: each job should run at a certain scheduled interval.
- Mission critical: if some of the jobs aren't running, we are in trouble.
- Evolving: as the company and the data team mature, so does the data processing.
- Heterogeneous: the stack for modern analytics is changing quickly, and most companies run multiple systems that need to be glued together.

Workflow management has become such a common need that most companies have multiple ways of creating and scheduling jobs internally. There's always the good old cron scheduler to get started, and many vendor packages ship with scheduling capabilities. The next step forward is to have scripts call other scripts, and that can work for a short period of time. Eventually, simple frameworks emerge to solve problems like storing the status of jobs and their dependencies. Typically these solutions grow reactively, as a response to the increasing need to schedule individual jobs, and usually because the current incarnation of the system doesn't allow for simple scaling.
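To make the shape of the problem concrete, here is a minimal sketch of the kind of homegrown framework that tends to emerge: jobs declared as a DAG, run in dependency order, with each run's status persisted. This is an illustrative assumption, not any particular company's system; the job names, the `run_job` placeholder, and the `status.json` file are all invented for the example.

```python
import json
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each job lists the jobs it depends on; together they form a DAG.
# These job names are hypothetical examples.
deps = {
    "extract_events": [],
    "extract_users": [],
    "build_sessions": ["extract_events"],
    "daily_report": ["build_sessions", "extract_users"],
}

def run_job(name: str) -> bool:
    """Placeholder for the real batch work; returns True on success."""
    print(f"running {name}")
    return True

def run_dag(deps: dict[str, list[str]], status_path: str = "status.json") -> None:
    status: dict[str, str] = {}
    # static_order() yields jobs in an order that respects dependencies,
    # and raises CycleError if the graph is not actually acyclic.
    for job in TopologicalSorter(deps).static_order():
        if any(status.get(d) != "success" for d in deps[job]):
            status[job] = "skipped"  # an upstream dependency failed
            continue
        status[job] = "success" if run_job(job) else "failed"
    # Persisting run status is exactly the bookkeeping these
    # homegrown frameworks end up reinventing.
    with open(status_path, "w") as f:
        json.dump(status, f, indent=2)

if __name__ == "__main__":
    run_dag(deps)
```

Even this toy version hints at what is missing: scheduling at intervals, retries, backfills, and visibility into failures, which is where the reactive, hard-to-scale growth described above comes from.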