Support dynamic formation of process groups
Copyright (c) 2018 Intel, Inc. All rights reserved.
This document is subject to all provisions relating to code contributions to the PMIx community as defined in the community’s LICENSE file. Code Components extracted from this document must include the License text as described in that file.
A PMIx Group is defined as a collection of processes that want a unified identifier for purposes such as passing events or participating in PMIx fence operations. Groups differ from processes that call PMIx_Connect on each other in the following key areas:
Relation to the host resource manager (RM)
- Calls to PMIx_Connect are relayed to the host RM. This means that the RM should treat the failure of any process in the specified assemblage as a reportable event and take appropriate action. However, the RM does not define a new identifier for the connected assemblage, nor does it define a new rank for each process within that assemblage. In addition, the PMIx server does not provide any tracking support for the assemblage. Thus, the caller is responsible for maintaining the membership list of the assemblage.
- Calls to the PMIx_Group APIs are first processed within the local PMIx server, which creates a tracker that associates the specified processes with the user-provided group identifier. Each process in the group is assigned a group rank based on its relative position in the array of processes provided in the call to PMIx_Group_construct. Members of the group can subsequently use the provided group identifier in PMIx function calls to address the group’s members. Note that such calls must include the PMIX_GROUP_OPERATION attribute to inform PMIx that the nspace being provided in the call is a PMIx Group identifier and not an RM-assigned value.
Construction procedure
- PMIx_Connect calls do not complete until every specified process has called the API; that is, the operation is modeled on the traditional bulk-synchronous MPI connect/accept methodology.
- PMIx Groups are designed to be more flexible in their construction procedure by also providing for dynamic definition of membership based on an invite/join model. Processes wishing to form or extend a group can "invite" another process to join and receive notification when it does so. Invitations result in a PMIx event being delivered to the invited process, with subsequent events being delivered to each member of the group when the process joins via the PMIx_Group_join API.
Departure procedure
- Processes that combine via PMIx_Connect must all depart the assemblage together; that is, no member can depart while leaving the remaining members in it. Even the non-blocking form of "disconnect" retains this requirement in that members remain a part of the assemblage until all members have called PMIx_Disconnect_nb.
- Members of a PMIx Group may depart the group at any time via the PMIx_Group_leave API. Other members are notified via event of the intent to depart, and the departing process will not complete the function call until all members confirm the departure. This is intended to provide a chance for any in-progress collective operations to be resolved prior to the member’s departure.
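The group-rank rule described above (a member's group rank is its position in the array handed to the construct call) can be sketched in plain C. This is a minimal model and not PMIx code: `demo_proc_t` and `demo_group_rank` are hypothetical names standing in for `pmix_proc_t` and the server-side tracker, and the 64-byte namespace field is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for pmix_proc_t: RM-assigned namespace plus rank. */
typedef struct {
    char nspace[64];
    uint32_t rank;
} demo_proc_t;

/* Return the group rank of `who`, i.e. its index in the array that was
 * passed to the construct call, or -1 if `who` is not in that array. */
int demo_group_rank(const demo_proc_t *procs, size_t nprocs,
                    const demo_proc_t *who)
{
    assert(procs != NULL && who != NULL);
    for (size_t i = 0; i < nprocs; i++) {
        if (procs[i].rank == who->rank &&
            strcmp(procs[i].nspace, who->nspace) == 0) {
            return (int)i;
        }
    }
    return -1;
}
```

Under this model, a process listed second in the array has group rank 1 regardless of its RM-assigned rank, which is why every participant needs to supply the array in the same order.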
Another way to think of it might be that members of PMIx Groups are "loosely" coupled as opposed to "tightly" connected when constructed via PMIx_Connect. The APIs are explained below.
Critical Note: Because the PMIx Group concept relies on PMIx events, processes using these APIs must register for the corresponding events. Failure to do so will likely lead to operational failures. Users are advised to use the PMIX_TIMEOUT directive (or maintain an internal timer) on calls to PMIx Group APIs, especially the blocking forms, because processes that have not registered for the required events will never respond.
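The reason for the timeout advice can be illustrated with a toy model of a blocking group call. Everything here is hypothetical (the `demo_` names and status codes are not PMIx constants); the only behavior taken from the text is that a member without a registered handler never responds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch only; not PMIx constants. */
enum {
    DEMO_SUCCESS = 0,
    DEMO_ERR_TIMEOUT = -1,
    DEMO_WOULD_BLOCK_FOREVER = -2
};

/* Model of a blocking group call that needs one response per member.
 * registered[i] is true if member i registered a handler for the event;
 * only those members ever respond. timeout_sec <= 0 means "no timeout". */
int demo_wait_for_members(const bool *registered, size_t n, int timeout_sec)
{
    assert(registered != NULL || n == 0);
    size_t responses = 0;
    for (size_t i = 0; i < n; i++) {
        if (registered[i]) {
            responses++;
        }
    }
    if (responses == n) {
        return DEMO_SUCCESS;
    }
    /* At least one member will never answer: the caller only gets
     * control back if it asked for a timeout. */
    return (timeout_sec > 0) ? DEMO_ERR_TIMEOUT : DEMO_WOULD_BLOCK_FOREVER;
}
```

The third outcome is the one the note warns about: without a timeout, a single unregistered member leaves the caller waiting indefinitely.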
PMIX_EXPORT pmix_status_t PMIx_Group_construct(const pmix_group_identifier_t id, const pmix_proc_t *procs, size_t nprocs, const pmix_info_t *info, size_t ninfo);
Construct a new group composed of the specified processes and identified with the provided pmix_group_identifier_t. Both blocking and non-blocking versions are provided (the callback function for the non-blocking form will be called once all specified processes have joined the group). The group identifier is a user-defined, NULL-terminated character array of length less than or equal to PMIX_MAX_NSLEN. Only characters accepted by standard string comparison functions (e.g., strncmp) are supported.
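A caller may want to check a candidate identifier against the length rule before passing it to the construct call. The sketch below is an assumption-laden illustration: `demo_group_id_ok` is a hypothetical helper, `DEMO_MAX_NSLEN` stands in for `PMIX_MAX_NSLEN` (255 is assumed here only so the example compiles on its own), and rejecting an empty string is a choice of this sketch rather than something the text requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed stand-in for PMIX_MAX_NSLEN; use the value from the PMIx
 * headers in real code. */
#define DEMO_MAX_NSLEN 255

/* Check that `id` is NULL-terminated within `buflen` bytes and that its
 * length does not exceed DEMO_MAX_NSLEN. */
bool demo_group_id_ok(const char *id, size_t buflen)
{
    if (id == NULL) {
        return false;
    }
    /* Search for the terminator without reading past the buffer. */
    for (size_t i = 0; i < buflen; i++) {
        if (id[i] == '\0') {
            /* Assumption of this sketch: an empty identifier is rejected. */
            return i > 0 && i <= DEMO_MAX_NSLEN;
        }
    }
    return false; /* no terminator found within buflen */
}
```

For a string literal, `demo_group_id_ok("solver-team", sizeof("solver-team"))` passes the full buffer size including the terminator.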
Processes may engage in multiple simultaneous group construct operations as desired so long as each is provided with a unique group ID. The info array can be used to pass user-level directives regarding timeout constraints and other options available from the PMIx server. The construct API will return an error if any specified process fails or terminates prior to calling PMIx_Group_construct or its non-blocking version.
Open question: should the API return information about which processes failed to join the group, or which ones succeeded?
Some API-specific info keys are provided for this operation:
- PMIX_GROUP_NOTIFY_EACH (bool): generate an event notification using the PMIX_PROC_HAS_JOINED event each time a process joins the group. The default is false.
- PMIX_GROUP_NOTIFY_REQ (bool): notify each of the indicated processes that they are requested to join using the PMIX_GROUP_REQUESTED event. The default is false.
- PMIX_GROUP_OPTIONAL (bool): participation is optional, so do not return an error if any of the specified processes terminate without having joined. The default is false.
- PMIX_GROUP_NOTIFY_DEPARTURE (bool): notify remaining members when another member requests to leave the group by calling PMIx_Group_leave. The default is false.
- PMIX_GROUP_NOTIFY_TERMINATION (bool): notify remaining members when another member terminates without first leaving the group. The default is false.
- PMIX_TIMEOUT (int): return an error if the group doesn’t assemble within the specified number of seconds. This targets the scenario where a process never calls PMIx_Group_construct because it is hung.
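The defaults listed above can be modeled as a small options record. This is a sketch only: `demo_info_t` is a hypothetical stand-in for `pmix_info_t`, the attribute names are compared as plain strings purely for illustration (the actual key strings behind these macros are not given in this document), and treating a timeout of 0 as "none" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for pmix_info_t: a key name and an integer
 * value (non-zero means true for the boolean keys). */
typedef struct {
    const char *key;
    int value;
} demo_info_t;

typedef struct {
    bool notify_each;
    bool notify_req;
    bool optional;
    bool notify_departure;
    bool notify_termination;
    int timeout_sec; /* 0 = no timeout requested (assumption) */
} demo_group_opts_t;

/* Start from the documented defaults (all false) and apply each key. */
demo_group_opts_t demo_parse_group_opts(const demo_info_t *info, size_t ninfo)
{
    demo_group_opts_t o = {false, false, false, false, false, 0};
    assert(info != NULL || ninfo == 0);
    for (size_t i = 0; i < ninfo; i++) {
        const char *k = info[i].key;
        bool on = (info[i].value != 0);
        assert(k != NULL);
        if (strcmp(k, "PMIX_GROUP_NOTIFY_EACH") == 0) o.notify_each = on;
        else if (strcmp(k, "PMIX_GROUP_NOTIFY_REQ") == 0) o.notify_req = on;
        else if (strcmp(k, "PMIX_GROUP_OPTIONAL") == 0) o.optional = on;
        else if (strcmp(k, "PMIX_GROUP_NOTIFY_DEPARTURE") == 0) o.notify_departure = on;
        else if (strcmp(k, "PMIX_GROUP_NOTIFY_TERMINATION") == 0) o.notify_termination = on;
        else if (strcmp(k, "PMIX_TIMEOUT") == 0) o.timeout_sec = info[i].value;
        /* Unknown keys are ignored in this sketch. */
    }
    return o;
}

/* Construct fails when a listed process terminated before joining,
 * unless participation was marked optional. */
bool demo_construct_fails(const demo_group_opts_t *o, size_t nterminated)
{
    assert(o != NULL);
    return nterminated > 0 && !o->optional;
}
```

Passing an empty info array leaves every flag at its default, so a single early termination is reported as a failure unless the caller opted in to optional participation.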
Processes in the group under construction are not allowed to accept requests to join the group from external processes, approve requests to leave the group, or issue invitations to join the group until the group construction is complete.
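The restriction above amounts to gating membership operations on completion of the construct. A minimal sketch of such a gate follows; the tracker layout and all `demo_` names are hypothetical and are not taken from any PMIx implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tracker for one in-progress construct operation. */
typedef struct {
    size_t nprocs;   /* processes named in the construct call */
    size_t narrived; /* how many of them have called construct so far */
} demo_construct_tracker_t;

/* The three operations that are disallowed during construction. */
typedef enum {
    DEMO_OP_ACCEPT_JOIN,
    DEMO_OP_APPROVE_LEAVE,
    DEMO_OP_INVITE
} demo_member_op_t;

/* Record that one more named process has called construct. */
void demo_construct_arrive(demo_construct_tracker_t *t)
{
    assert(t != NULL && t->narrived < t->nprocs);
    t->narrived++;
}

bool demo_construct_complete(const demo_construct_tracker_t *t)
{
    assert(t != NULL);
    return t->narrived == t->nprocs;
}

/* All three operations are permitted only once construction is done. */
bool demo_op_allowed(const demo_construct_tracker_t *t, demo_member_op_t op)
{
    switch (op) {
    case DEMO_OP_ACCEPT_JOIN:
    case DEMO_OP_APPROVE_LEAVE:
    case DEMO_OP_INVITE:
        return demo_construct_complete(t);
    }
    return false;
}
```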
PMIX_EXPORT pmix_status_t PMIx_Group_join(const pmix_group_identifier_t grp, const pmix_info_t *info, size_t ninfo);
Request to join an existing group. The group must previously have been constructed by at least one process; that is, one process must initialize the group via a call to PMIx_Group_construct before any other process requests to join it. A request to join a PMIx group generates a PMIX_GROUP_JOIN_REQUEST event that specifies the provided group ID as the intended recipient. Thus, the PMIx event notification system will only deliver the event to processes that already belong to the group and have registered to receive that event.
Since the PMIx server hosting the requestor may not know the given identifier (i.e., no local process has previously joined that group), it may not have any knowledge of the group’s membership. In such a case, it must ask its host RM to broadcast the request event. The default range of the broadcast is dependent on the host environment, but typically spans the allocation of the requestor. Requestors may wish to specify the range in the directives. Upon receiving the request, each PMIx server shall check its local list of groups to determine if it knows of the group – if so, then it will call a PMIx server function to communicate the request to each local member.
The client library utilizes the PMIX_GROUP_JOIN_REQUEST event to alert the member process to the request. The registered callback function must respond to the event chain with a result that indicates either PMIX_SUCCESS (indicating approval of the request), or a PMIx error code rejecting the request. This result will be returned to the server by the client library.
Once the server has received a response from each local member, it shall communicate the aggregated answer (including any error codes so the caller can ascertain the reason for a rejection) plus the list of current known members to the requestor’s PMIx server. The requestor’s server will use the list to determine when a response has been received from all existing members – at that time, it will send the collective response to the requestor.
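The bookkeeping on the requestor's server can be sketched as follows. This is a simplified model under stated assumptions: members are identified by small integer group ranks, the fixed capacity `DEMO_MAX_MEMBERS` is arbitrary, only the first error code is retained, and all `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { DEMO_SUCCESS = 0, DEMO_MAX_MEMBERS = 64 };

/* Hypothetical tracker kept by the requestor's server for one join. */
typedef struct {
    bool expected[DEMO_MAX_MEMBERS]; /* members reported by any server */
    bool answered[DEMO_MAX_MEMBERS]; /* members whose answer has arrived */
    int status;                      /* first rejection code, if any */
} demo_join_tracker_t;

void demo_join_init(demo_join_tracker_t *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));
    t->status = DEMO_SUCCESS;
}

/* Record one remote server's reply: the ranks of its local members
 * (whose answers are folded into `status`) and the full member list
 * that server currently knows about. */
void demo_join_reply(demo_join_tracker_t *t,
                     const int *local, size_t nlocal,
                     const int *known, size_t nknown, int status)
{
    assert(t != NULL);
    for (size_t i = 0; i < nknown; i++) {
        assert(known[i] >= 0 && known[i] < DEMO_MAX_MEMBERS);
        t->expected[known[i]] = true;
    }
    for (size_t i = 0; i < nlocal; i++) {
        assert(local[i] >= 0 && local[i] < DEMO_MAX_MEMBERS);
        t->answered[local[i]] = true;
    }
    if (status != DEMO_SUCCESS && t->status == DEMO_SUCCESS) {
        t->status = status; /* keep the first rejection reason */
    }
}

/* True once every member named in any reply has answered. */
bool demo_join_complete(const demo_join_tracker_t *t)
{
    bool any = false;
    assert(t != NULL);
    for (size_t i = 0; i < DEMO_MAX_MEMBERS; i++) {
        if (t->expected[i]) {
            any = true;
            if (!t->answered[i]) {
                return false;
            }
        }
    }
    return any;
}
```

Once `demo_join_complete` returns true, `status` holds the collective answer to send back to the requestor. The model also makes the dependence on consistent member lists visible: if two servers report different lists, the completion test can fire early or never.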
Note: there are some inherent race conditions here that remain to be resolved. It may be necessary to require some kind of "fence" operation to ensure that members aren’t added to a server’s list during processing of another "join" request, thus causing the lists across the servers to fall out of sync.
PMIX_EXPORT pmix_status_t PMIx_Group_invite(const pmix_group_identifier_t grp, const pmix_proc_t *procs, size_t nprocs, const pmix_info_t *info, size_t ninfo);
Invite specified processes to join a PMIx group by sending an event notification to the specified procs inviting them to join the group. The group may or may not have been previously constructed.
PMIX_EXPORT pmix_status_t PMIx_Group_leave(const pmix_group_identifier_t grp, const pmix_info_t *info, size_t ninfo);
Leave a PMIx Group. The caller will issue a PMIX_GROUP_LEAVE_REQUEST event notifying all members of the group of its intent to depart. The function will return (or the non-blocking function will execute the specified callback function) once a response is received from all members approving the departure.
Note: there is an inherent race condition here where processes may be joining the group while others are leaving it. It may prove difficult to coordinate adequately to ensure that the caller knows when "all" members have approved.
PMIX_EXPORT pmix_status_t PMIx_Group_destruct(const pmix_group_identifier_t grp, const pmix_info_t *info, size_t ninfo);
Tear down an existing group. This is a collective operation that returns once all members have called it.
Note: as with PMIx_Group_leave, the race with processes that are concurrently joining or leaving the group still needs to be handled.
Ralph H. Castain