lor.tasks package

Submodules

lor.tasks.fs module

Utility tasks for the local filesystem

class lor.tasks.fs.EnsureExistsOnLocalFilesystemTask(*args, **kwargs)

Bases: luigi.task.ExternalTask

description = 'Ensure a file/dir exists on the local filesystem'
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note
If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

path = <luigi.parameter.Parameter object>
run()

The task run method, to be overridden in a subclass.

See Task.run

lor.tasks.general module

General utility tasks

class lor.tasks.general.AlwaysRunsTask(*args, **kwargs)

Bases: luigi.task.ExternalTask

complete()

If the task has any outputs, return True if all outputs exist. Otherwise, return False.

However, you may freely override this method with custom logic.

description = 'A trivial task that always runs successfully'
class lor.tasks.general.DictPluckTask(*args, **kwargs)

Bases: luigi.task.ExternalTask

description = 'Pluck a single output from a task that produces a dict of outputs'
key = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note
If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a Subclasses can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

upstream_task = <luigi.parameter.Parameter object>

lor.tasks.hdfs module

Utility tasks for HDFS

class lor.tasks.hdfs.ClusterDeployTask(*args, **kwargs)

Bases: luigi.task.Task

description = 'Deploy a local file/dir to the cluster'
destination()
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note
If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

run()

The task run method, to be overridden in a subclass.

See Task.run

source()
class lor.tasks.hdfs.DownloadFromHdfsTask(*args, **kwargs)

Bases: luigi.task.Task

description = 'Move a file/dir from HDFS to the local filesystem'
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note
If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

output_path = <luigi.parameter.Parameter object>
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a Subclasses can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

upstream_task = <luigi.parameter.TaskParameter object>
class lor.tasks.hdfs.EnsureExistsOnHdfsTask(*args, **kwargs)

Bases: luigi.task.ExternalTask

description = 'Ensure a file/dir exists on HDFS'
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note
If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

path = <luigi.parameter.Parameter object>
run()

The task run method, to be overridden in a subclass.

See Task.run

class lor.tasks.hdfs.MoveToHdfsTask(*args, **kwargs)

Bases: luigi.task.Task

Move the output of a task (assuming it’s a LocalTarget) onto HDFS

cache_invalidator = <luigi.parameter.Parameter object>
description = 'Move the output of a task to HDFS'
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note
If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a Subclasses can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

upstream_task = <luigi.parameter.TaskParameter object>

lor.tasks.tar module

class lor.tasks.tar.TarballTask(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

A task that puts another task’s output (assuming it outputs a FileTarget) into a tarball)

describe = "Package a task's output into an uncompressed tarball."
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note
If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

output_path = <luigi.parameter.Parameter object>
program_args()

Override this method to map your task parameters to the program arguments

Returns:list to pass as args to subprocess.Popen
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a Subclasses can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

upstream_task = <luigi.parameter.TaskParameter object>

Module contents