Parallel writes

Parallel writes is a configurable setting that controls how many web service requests a DataHub instance, on a given node, will issue in parallel. It is important to note that increasing the number of parallel writes does not always mean faster or better performance; several factors have to be taken into account.

Parallel Writes & Memory Usage

  • A job in DataHub does not use a fixed amount of memory
  • Memory usage for individual jobs varies based on a number of factors; the most significant is the number of files and how those files are distributed (all in one folder, spread throughout sub-folders, etc.)
  • To avoid excessive memory usage related to how content is distributed, DataHub recommends preserving the system default for Directory Item Limits | max_items_per_container
  • The main drivers of memory usage on a DataHub node are the number of concurrent jobs, the Parallel Writes value for each job, and the memory impact of the specific jobs
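As a back-of-the-envelope illustration of the last point (this formula is a sizing aid, not a DataHub API), the write concurrency a node must sustain grows with the number of concurrent jobs and each job's Parallel Writes value:

```python
# Illustrative sketch only: estimate the total number of simultaneous
# write requests on a node as the sum of each running job's
# parallel_writes value. All names here are hypothetical.
def total_parallel_writes(jobs_parallel_writes):
    """jobs_parallel_writes: list of parallel_writes values, one per concurrent job."""
    return sum(jobs_parallel_writes)

# Three concurrent jobs, each at the default of 2:
print(total_parallel_writes([2, 2, 2]))  # 6 simultaneous write requests
```

Memory pressure scales with this total plus the per-job factors above, which is why reducing concurrent jobs is as relevant as reducing the per-job setting.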

Addressing Memory Issues

If memory issues occur after increasing the Directory Item Limit or Parallel Writes, the only mitigation is to reduce the number of concurrent jobs or break the source content into multiple jobs. DataHub will keep using memory until it runs out (it does not self-limit) and will eventually reach the environment maximum. Reaching the environment maximum may result in a non-graceful termination of DataHub, which can cause jobs to re-transfer files, permissions, or metadata. Larger jobs stopped in this manner will enter recovery mode, consume all available memory again, and be stopped again, looping and causing a loss of throughput.

Parallel Writes Default = 2

Set Parallel Writes Globally

This can be done by adding the following to your appSettings.json file:

{
  "performance": {
    "parallel_writes": 2
  }
}
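As a minimal sketch, the effective global value can be read from an appSettings.json-style document, falling back to the documented default of 2 when the key is absent. The JSON structure comes from the snippet above; the function name is hypothetical:

```python
import json

# Sketch: resolve the effective global parallel_writes value from an
# appSettings.json-style document. Falls back to the documented
# default of 2 when no explicit setting is present.
def effective_parallel_writes(settings_text, default=2):
    settings = json.loads(settings_text)
    return settings.get("performance", {}).get("parallel_writes", default)

print(effective_parallel_writes('{"performance": {"parallel_writes": 4}}'))  # 4
print(effective_parallel_writes('{}'))  # 2 (the system default)
```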

Set Parallel Writes Per-Job

Add the following to the transfer block of your job:

{
  "performance": {
    "parallel_writes": {
      "requested": 2
    }
  }
}

Example:

{
    "name": "Test Parallel Writes",
    "kind": "transfer",
    "transfer": {
        "audit_level": "trace",
        "transfer_type": "copy",
        "performance": {
            "parallel_writes": {
                "requested": 2
            }
        },
        "source": {
            "connection": {
                "id": "{{cloud_connection}}"
            },
            "target": {
                "path": "/MASTER_TESTS/BASIC TRANSFER TESTS"
            }
        },
        "destination": {
            "connection": {
                "id": "{{cloud_connection}}"
            },
            "target": {
                "path": "/SAP/LB/Test_ParallelWrites"
            }
        }
    },
    "schedule": {
        "mode": "manual"
    }
}
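If you build job definitions programmatically, the per-job performance block can be merged into an existing job document. This is a sketch based on the job structure shown above; the function name is hypothetical:

```python
# Sketch: merge the per-job performance block (shown in the example
# above) into a job definition dict. Only the JSON shape is taken from
# the documentation; the helper itself is illustrative.
def set_requested_parallel_writes(job, requested):
    transfer = job.setdefault("transfer", {})
    transfer.setdefault("performance", {})["parallel_writes"] = {"requested": requested}
    return job

job = {"name": "Test Parallel Writes", "kind": "transfer",
       "transfer": {"transfer_type": "copy"}}
set_requested_parallel_writes(job, 2)
print(job["transfer"]["performance"])  # {'parallel_writes': {'requested': 2}}
```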

To review your job and confirm your requested parallel writes value, use:

GET {{url}}v1/jobs?include=all
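A sketch of pulling the requested value out of a job document returned by that call. The nesting mirrors the job definition above; the wrapper shape of the GET response (a "jobs" list) is an assumption:

```python
# Sketch: extract transfer.performance.parallel_writes.requested from a
# job document. Returns None when the job has no explicit setting.
def requested_parallel_writes(job):
    return (job.get("transfer", {})
               .get("performance", {})
               .get("parallel_writes", {})
               .get("requested"))

# Assumed response shape -- adjust to what your DataHub version returns.
jobs_response = {"jobs": [{"name": "Test Parallel Writes",
                           "transfer": {"performance": {"parallel_writes": {"requested": 2}}}}]}
for job in jobs_response["jobs"]:
    print(job["name"], requested_parallel_writes(job))  # Test Parallel Writes 2
```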

Update Parallel Writes on an Existing Job

The following body in a PATCH request to {{url}}v1/jobs/{{job}} will update the parallel_writes value to 8.

{
    "kind": "transfer",
    "transfer": {
        "performance": {
            "parallel_writes": {
                "requested": 8
            }
        }
    }
}
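A sketch of building and issuing that PATCH with Python's standard library. The body matches the snippet above; the base URL, job id, and any authentication headers are placeholders you would substitute for your environment:

```python
import json
import urllib.request

# Sketch: build the PATCH body shown above for a new requested value.
def build_parallel_writes_patch(requested):
    return {"kind": "transfer",
            "transfer": {"performance": {"parallel_writes": {"requested": requested}}}}

# Hypothetical endpoint standing in for {{url}}v1/jobs/{{job}};
# auth headers omitted -- add whatever your deployment requires.
body = json.dumps(build_parallel_writes_patch(8)).encode()
req = urllib.request.Request("https://datahub.example/v1/jobs/JOB_ID",
                             data=body, method="PATCH",
                             headers={"Content-Type": "application/json"})
print(req.method, req.get_full_url())  # PATCH https://datahub.example/v1/jobs/JOB_ID
# To actually send it: urllib.request.urlopen(req)
```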

Related Links