w_assign

w_assign uses simulation output to assign walkers to user-specified bins and macrostates. These assignments are required by subsequent analysis tools, namely w_kinetics and w_kinavg.

w_assign supports parallelization (see general work manager options for more on command line options to specify a work manager).

Overview

Usage:

w_assign [-h] [-r RCFILE] [--quiet | --verbose | --debug] [--version]
               [-W WEST_H5FILE] [-o OUTPUT]
               [--bins-from-system | --bins-from-expr BINS_FROM_EXPR | --bins-from-function BINS_FROM_FUNCTION]
               [-p MODULE.FUNCTION]
               [--states STATEDEF [STATEDEF ...] | --states-from-file STATEFILE | --states-from-function STATEFUNC]
               [--wm-work-manager WORK_MANAGER] [--wm-n-workers N_WORKERS]
               [--wm-zmq-mode MODE] [--wm-zmq-info INFO_FILE]
               [--wm-zmq-task-endpoint TASK_ENDPOINT]
               [--wm-zmq-result-endpoint RESULT_ENDPOINT]
               [--wm-zmq-announce-endpoint ANNOUNCE_ENDPOINT]
               [--wm-zmq-listen-endpoint ANNOUNCE_ENDPOINT]
               [--wm-zmq-heartbeat-interval INTERVAL]
               [--wm-zmq-task-timeout TIMEOUT]
               [--wm-zmq-client-comm-mode MODE]

Command-Line Options

See the general command-line tool reference for more information on the general options.

Input/output Options

-W, --west-data /path/to/file

    Read simulation result data from *file*. (**Default:** The
    *hdf5* file specified in the configuration file, by default
    **west.h5**)

-o, --output /path/to/file

    Write assignment results to *file*. (**Default:** *hdf5*
    file **assign.h5**)

Binning Options

Specify how bins are to be assigned to the dataset:

--bins-from-system
  Use the binning scheme specified by the system driver; the system driver is
  defined in the WEST configuration file, by default named **west.cfg**.
  (**Default binning**)

--bins-from-expr bin_expr
  Use the binning scheme specified in *``bin_expr``*, which takes the form of
  a Python list of lists, where each inner list gives the bin boundaries for
  one dimension of the progress coordinate. For example,
  "[[0,1,2,4,inf],[-inf,0,inf]]" specifies bin boundaries for a
  two-dimensional progress coordinate. Note that this option accepts the
  special symbol 'inf' for floating-point infinity.

--bins-from-function bin_func
  Specify bins by calling an external function *``bin_func``*.
  *``bin_func``* should be formatted as '[PATH:]module.function', where the
  function 'function' in module 'module' will be called to construct the bin
  mapper (see the sketch below).
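
For example, a minimal ``bin_func`` might look like the following sketch
(the module name and the WESTPA 2.x import path
``westpa.core.binning.RectilinearBinMapper`` are assumptions; adjust to your
installation):

# mybins.py -- hypothetical module passed as --bins-from-function mybins.make_mapper
from westpa.core.binning import RectilinearBinMapper  # import path assumed


def make_mapper():
    # Two-dimensional rectilinear bins; float('inf') marks open-ended edges
    boundaries = [
        [0.0, 1.0, 2.0, 4.0, float('inf')],
        [float('-inf'), 0.0, float('inf')],
    ]
    return RectilinearBinMapper(boundaries)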

Macrostate Options

You can optionally specify how user-defined macrostates are assigned. Note that macrostates must be assigned for subsequent kinetics analysis tools, namely w_kinetics and w_kinavg:

--states statedef [statedef ...]
  Specify a macrostate for a single bin as *``statedef``*, formatted
  as a coordinate tuple that falls within the desired bin; for
  instance, '[1.0, 2.0]' assigns a macrostate to the bin that
  contains the (two-dimensional) progress coordinate point (1.0, 2.0).
  A macrostate label can optionally be specified; for instance,
  'bound:[1.0, 2.0]' names the macrostate assigned to the bin
  containing the given coordinates 'bound'. Multiple assignments can
  be specified with this option, but only one macrostate per bin is
  possible - if you wish to assign multiple bins to a single
  macrostate, use the *``--states-from-file``* option.

--states-from-file statefile
  Read macrostate assignments from the *yaml* file *``statefile``*.
  This option allows you to assign multiple bins to a single
  macrostate. The following example shows the contents of a
  *``statefile``* that specifies two macrostates, bound and unbound,
  spanning multiple bins of a two-dimensional progress coordinate:

---
states:
  - label: unbound
    coords:
      - [9.0, 1.0]
      - [9.0, 2.0]
  - label: bound
    coords:
      - [0.1, 0.0]

Specifying Progress Coordinate

By default, progress coordinate information for each iteration is taken from the *pcoord* dataset in the specified input file (which, by default, is **west.h5**). Optionally, you can specify a function to construct the progress coordinate for each iteration - this may be useful to consolidate data from several sources or otherwise preprocess the progress coordinate data:

--construct-pcoord module.function, -p module.function
  Use the function *module.function* to construct the progress
  coordinate for each iteration. This will be called once per
  iteration as *function(n_iter, iter_group)* and should return an
  array indexable as [seg_id][timepoint][dimension]. The **default**
  function returns the 'pcoord' dataset for that iteration (i.e. it
  executes return iter_group['pcoord'][...]).
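
For example, a function that keeps only the first dimension of the stored
progress coordinate could look like the following sketch (module and function
names are hypothetical):

# mypcoord.py -- hypothetical module passed as -p mypcoord.pcoord
import numpy as np


def pcoord(n_iter, iter_group):
    # Select the first progress coordinate dimension and reshape so the
    # result remains indexable as [seg_id][timepoint][dimension]
    data = iter_group['pcoord'][:, :, 0]
    return data[:, :, np.newaxis]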

Examples
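
Once w_assign has been run, its output file can be inspected directly with
h5py. The following sketch (file name assumed to be the default
**assign.h5**) reproduces the per-bin population sum described in the tool's
built-in help:

import h5py

with h5py.File('assign.h5', 'r') as f:
    nbins = f.attrs['nbins']                      # number of valid bins
    nstates = f.attrs['nstates']                  # number of valid macrostates
    labeled_pops = f['labeled_populations'][...]  # [iteration][state][bin]
    # Sum over macrostate labels and drop the catch-all bin entry to obtain
    # overall per-bin populations for each iteration
    bin_pops = labeled_pops.sum(axis=1)[:, :-1]
    print(nbins, nstates, bin_pops.shape)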

westpa.cli.tools.w_assign module

westpa.cli.tools.w_assign.seg_id_dtype

alias of int64

westpa.cli.tools.w_assign.weight_dtype

alias of float64

westpa.cli.tools.w_assign.index_dtype

alias of uint16

westpa.cli.tools.w_assign.assign_and_label(nsegs_lb, nsegs_ub, parent_ids, assign, nstates, state_map, last_labels, pcoords, subsample)

Assign trajectories to bins and last-visited macrostates for each timepoint.

westpa.cli.tools.w_assign.accumulate_labeled_populations(weights, bin_assignments, label_assignments, labeled_bin_pops)

For a set of segments in one iteration, calculate the average population in each bin, with separation by last-visited macrostate.

class westpa.cli.tools.w_assign.WESTParallelTool(wm_env=None)

Bases: WESTTool

Base class for command-line tools parallelized with wwmgr. This automatically adds and processes wwmgr command-line arguments and creates a work manager at self.work_manager.

make_parser_and_process(prog=None, usage=None, description=None, epilog=None, args=None)

A convenience function to create a parser, call add_all_args(), and then call process_all_args(). The argument namespace is returned.

add_args(parser)

Add arguments specific to this tool to the given argparse parser.

process_args(args)

Take argparse-processed arguments associated with this tool and deal with them appropriately (setting instance variables, etc)

go()

Perform the analysis associated with this tool.

main()

A convenience function to make a parser, parse and process arguments, then run self.go() in the master process.

class westpa.cli.tools.w_assign.WESTDataReader

Bases: WESTToolComponent

Tool for reading data from WEST-related HDF5 files. Coordinates finding the main HDF5 file from west.cfg or command line arguments, caching of certain kinds of data (eventually), and retrieving auxiliary data sets from various places.

add_args(parser)

Add arguments specific to this component to the given argparse parser.

process_args(args)

Take argparse-processed arguments associated with this component and deal with them appropriately (setting instance variables, etc)

open(mode='r')
close()
property weight_dsspec
property parent_id_dsspec
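
A rough usage sketch of this component in a script (it is assumed, per the
input/output options above, that add_args() registers the -W/--west-data
option):

import argparse
from westpa.cli.tools.w_assign import WESTDataReader

parser = argparse.ArgumentParser()
reader = WESTDataReader()
reader.add_args(parser)                       # registers -W/--west-data (assumed)
args = parser.parse_args(['-W', 'west.h5'])
reader.process_args(args)
reader.open(mode='r')                         # open the main WEST HDF5 file
# ... read data, e.g. via the dsspec properties above ...
reader.close()
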
class westpa.cli.tools.w_assign.WESTDSSynthesizer(default_dsname=None, h5filename=None)

Bases: WESTToolComponent

Tool for synthesizing a dataset for analysis from other datasets. This may be done using a custom function, or a list of “data set specifications”. It is anticipated that if several source datasets are required, then a tool will have multiple instances of this class.

group_name = 'input dataset options'
add_args(parser)

Add arguments specific to this component to the given argparse parser.

process_args(args)

Take argparse-processed arguments associated with this component and deal with them appropriately (setting instance variables, etc)

class westpa.cli.tools.w_assign.BinMappingComponent

Bases: WESTToolComponent

Component for obtaining a bin mapper from one of several places based on command-line arguments. Such locations include an HDF5 file that contains pickled mappers (including the primary WEST HDF5 file), the system object, an external function, or (in the common case of rectilinear bins) a list of lists of bin boundaries.

Some configuration is necessary prior to calling process_args() if loading a mapper from HDF5. Specifically, either set_we_h5file_info() or set_other_h5file_info() must be called to describe where to find the appropriate mapper. In the case of set_we_h5file_info(), the mapper used for WE at the end of a given iteration will be loaded. In the case of set_other_h5file_info(), an arbitrary group and hash value are specified; the mapper corresponding to that hash in the given group will be returned.

In the absence of arguments, the mapper contained in an existing HDF5 file is preferred; if that is not available, the mapper from the system driver is used.

This component adds the following arguments to argument parsers:

--bins-from-system

Obtain bins from the system driver

--bins-from-expr=EXPR

Construct rectilinear bins by parsing EXPR and calling RectilinearBinMapper() with the result. EXPR must therefore be a list of lists.

--bins-from-function=[PATH:]MODULE.FUNC

Call an external function FUNC in module MODULE (optionally adding PATH to the search path when loading MODULE) which, when called, returns a fully-constructed bin mapper.

--bins-from-file

Load bin definitions from a YAML configuration file.

--bins-from-h5file

Load bins from the file being considered; this is intended to mean the master WEST HDF5 file or results of other binning calculations, as appropriate.

add_args(parser, description='binning options', suppress=[])

Add arguments specific to this component to the given argparse parser.

add_target_count_args(parser, description='bin target count options')

Add options to the given parser corresponding to target counts.

process_args(args)

Take argparse-processed arguments associated with this component and deal with them appropriately (setting instance variables, etc)

set_we_h5file_info(n_iter=None, data_manager=None, required=False)

Set up to load a bin mapper from the master WEST HDF5 file. The mapper is actually loaded from the file when self.load_bin_mapper() is called, if and only if command line arguments direct this. If required is true, then a mapper must be available at iteration n_iter, or else an exception will be raised.

set_other_h5file_info(topology_group, hashval)

Set up to load a bin mapper from (any) open HDF5 file, where bin topologies are stored in topology_group (an h5py Group object) and the desired mapper has hash value hashval. The mapper itself is loaded when self.load_bin_mapper() is called.
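
A minimal sketch of the configuration order described above, using the
system-driver path (running it requires an initialized WESTPA environment,
and the ``mapper`` attribute name on the component is an assumption):

import argparse
from westpa.cli.tools.w_assign import BinMappingComponent

parser = argparse.ArgumentParser()
binning = BinMappingComponent()
binning.add_args(parser)
args = parser.parse_args(['--bins-from-system'])
# If the mapper were to come from an HDF5 file instead, set_we_h5file_info()
# or set_other_h5file_info() would need to be called before process_args().
binning.process_args(args)
mapper = binning.mapper   # resulting bin mapper (attribute name assumed)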

class westpa.cli.tools.w_assign.ProgressIndicatorComponent

Bases: WESTToolComponent

add_args(parser)

Add arguments specific to this component to the given argparse parser.

process_args(args)

Take argparse-processed arguments associated with this component and deal with them appropriately (setting instance variables, etc)

class westpa.cli.tools.w_assign.WESTPAH5File(*args, **kwargs)

Bases: File

Generalized input/output for WESTPA simulation (or analysis) data.

Create a new file object.

See the h5py user guide for a detailed explanation of the options.

name

Name of the file on disk, or file-like object. Note: for files created with the ‘core’ driver, HDF5 still requires this be non-empty.

mode

r         Readonly, file must exist (default)
r+        Read/write, file must exist
w         Create file, truncate if exists
w- or x   Create file, fail if exists
a         Read/write if exists, create otherwise

driver

Name of the driver to use. Legal values are None (default, recommended), ‘core’, ‘sec2’, ‘direct’, ‘stdio’, ‘mpio’, ‘ros3’.

libver

Library version bounds. Supported values: ‘earliest’, ‘v108’, ‘v110’, ‘v112’ and ‘latest’. The ‘v108’, ‘v110’ and ‘v112’ options can only be specified with the HDF5 1.10.2 library or later.

userblock_size

Desired size of user block. Only allowed when creating a new file (mode w, w- or x).

swmr

Open the file in SWMR read mode. Only used when mode = ‘r’.

rdcc_nbytes

Total size of the dataset chunk cache in bytes. The default size is 1024**2 (1 MiB) per dataset. Applies to all datasets unless individually changed.

rdcc_w0

The chunk preemption policy for all datasets. This must be between 0 and 1 inclusive and indicates the weighting according to which chunks which have been fully read or written are penalized when determining which chunks to flush from cache. A value of 0 means fully read or written chunks are treated no differently than other chunks (the preemption is strictly LRU) while a value of 1 means fully read or written chunks are always preempted before other chunks. If your application only reads or writes data once, this can be safely set to 1. Otherwise, this should be set lower depending on how often you re-read or re-write the same data. The default value is 0.75. Applies to all datasets unless individually changed.

rdcc_nslots

The number of chunk slots in the raw data chunk cache for this file. Increasing this value reduces the number of cache collisions, but slightly increases the memory used. Due to the hashing strategy, this value should ideally be a prime number. As a rule of thumb, this value should be at least 10 times the number of chunks that can fit in rdcc_nbytes bytes. For maximum performance, this value should be set approximately 100 times that number of chunks. The default value is 521. Applies to all datasets unless individually changed.

track_order

Track dataset/group/attribute creation order under root group if True. If None use global default h5.get_config().track_order.

fs_strategy

The file space handling strategy to be used. Only allowed when creating a new file (mode w, w- or x). Defined as:

“fsm”        FSM, Aggregators, VFD
“page”       Paged FSM, VFD
“aggregate”  Aggregators, VFD
“none”       VFD

If None use HDF5 defaults.

fs_page_size

File space page size in bytes. Only used when fs_strategy=”page”. If None use the HDF5 default (4096 bytes).

fs_persist

A boolean value to indicate whether free space should be persistent or not. Only allowed when creating a new file. The default value is False.

fs_threshold

The smallest free-space section size that the free space manager will track. Only allowed when creating a new file. The default value is 1.

page_buf_size

Page buffer size in bytes. Only allowed for HDF5 files created with fs_strategy=”page”. Must be a power of two and greater than or equal to the file space page size when creating the file. It is not used by default.

min_meta_keep

Minimum percentage of metadata to keep in the page buffer before allowing pages containing metadata to be evicted. Applicable only if page_buf_size is set. Default value is zero.

min_raw_keep

Minimum percentage of raw data to keep in the page buffer before allowing pages containing raw data to be evicted. Applicable only if page_buf_size is set. Default value is zero.

locking

The file locking behavior. Defined as:

  • False (or “false”) – Disable file locking

  • True (or “true”) – Enable file locking

  • “best-effort” – Enable file locking but ignore some errors

  • None – Use HDF5 defaults

Warning

The HDF5_USE_FILE_LOCKING environment variable can override this parameter.

Only available with HDF5 >= 1.12.1 or 1.10.x >= 1.10.7.

alignment_threshold

Together with alignment_interval, this property ensures that any file object greater than or equal in size to the alignment threshold (in bytes) will be aligned on an address which is a multiple of alignment interval.

alignment_interval

This property should be used in conjunction with alignment_threshold. See the description above. For more details, see https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT

meta_block_size

Set the current minimum size, in bytes, of new metadata block allocations. See https://portal.hdfgroup.org/display/HDF5/H5P_SET_META_BLOCK_SIZE

Additional keywords

Passed on to the selected file driver.

default_iter_prec = 8
replace_dataset(*args, **kwargs)
iter_object_name(n_iter, prefix='', suffix='')

Return a properly-formatted per-iteration name for iteration n_iter. (This is used in create/require/get_iter_group, but may also be useful for naming datasets on a per-iteration basis.)

create_iter_group(n_iter, group=None)

Create a per-iteration data storage group for iteration number n_iter in the group group (which is ‘/iterations’ by default).

require_iter_group(n_iter, group=None)

Ensure that a per-iteration data storage group for iteration number n_iter is available in the group group (which is ‘/iterations’ by default).

get_iter_group(n_iter, group=None)

Get the per-iteration data group for iteration number n_iter from within the group group (‘/iterations’ by default).
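
A short sketch of the per-iteration helpers (assuming an existing west.h5
laid out with the default '/iterations' group):

from westpa.cli.tools.w_assign import WESTPAH5File

with WESTPAH5File('west.h5', 'r') as h5file:
    name = h5file.iter_object_name(12)       # formatted name for iteration 12
    iter_group = h5file.get_iter_group(12)   # group under '/iterations' by default
    pcoord = iter_group['pcoord'][...]
    print(name, pcoord.shape)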

westpa.cli.tools.w_assign.get_object(object_name, path=None)

Attempt to load the given object, using additional path information if given.

westpa.cli.tools.w_assign.parse_pcoord_value(pc_str)
class westpa.cli.tools.w_assign.WAssign

Bases: WESTParallelTool

prog = 'w_assign'
description = 'Assign walkers to bins, producing a file (by default named "assign.h5")\nwhich can be used in subsequent analysis.\n\nFor consistency in subsequent analysis operations, the entire dataset\nmust be assigned, even if only a subset of the data will be used. This\nensures that analyses that rely on tracing trajectories always know the\noriginating bin of each trajectory.\n\n\n-----------------------------------------------------------------------------\nSource data\n-----------------------------------------------------------------------------\n\nSource data is provided either by a user-specified function\n(--construct-dataset) or a list of "data set specifications" (--dsspecs).\nIf neither is provided, the progress coordinate dataset \'\'pcoord\'\' is used.\n\nTo use a custom function to extract or calculate data whose probability\ndistribution will be calculated, specify the function in standard Python\nMODULE.FUNCTION syntax as the argument to --construct-dataset. This function\nwill be called as function(n_iter,iter_group), where n_iter is the iteration\nwhose data are being considered and iter_group is the corresponding group\nin the main WEST HDF5 file (west.h5). The function must return data which can\nbe indexed as [segment][timepoint][dimension].\n\nTo use a list of data set specifications, specify --dsspecs and then list the\ndesired datasets one-by-one (space-separated in most shells). These data set\nspecifications are formatted as NAME[,file=FILENAME,slice=SLICE], which will\nuse the dataset called NAME in the HDF5 file FILENAME (defaulting to the main\nWEST HDF5 file west.h5), and slice it with the Python slice expression SLICE\n(as in [0:2] to select the first two elements of the first axis of the\ndataset). The ``slice`` option is most useful for selecting one column (or\nmore) from a multi-column dataset, such as arises when using a progress\ncoordinate of multiple dimensions.\n\n\n-----------------------------------------------------------------------------\nSpecifying macrostates\n-----------------------------------------------------------------------------\n\nOptionally, kinetic macrostates may be defined in terms of sets of bins.\nEach trajectory will be labeled with the kinetic macrostate it was most\nrecently in at each timepoint, for use in subsequent kinetic analysis.\nThis is required for all kinetics analysis (w_kintrace and w_kinmat).\n\nThere are three ways to specify macrostates:\n\n  1. States corresponding to single bins may be identified on the command\n     line using the --states option, which takes multiple arguments, one for\n     each state (separated by spaces in most shells). Each state is specified\n     as a coordinate tuple, with an optional label prepended, as in\n     ``bound:1.0`` or ``unbound:(2.5,2.5)``. Unlabeled states are named\n     ``stateN``, where N is the (zero-based) position in the list of states\n     supplied to --states.\n\n  2. States corresponding to multiple bins may use a YAML input file specified\n     with --states-from-file. This file defines a list of states, each with a\n     name and a list of coordinate tuples; bins containing these coordinates\n     will be mapped to the containing state. 
For instance, the following\n     file::\n\n        ---\n        states:\n          - label: unbound\n            coords:\n              - [9.0, 1.0]\n              - [9.0, 2.0]\n          - label: bound\n            coords:\n              - [0.1, 0.0]\n\n     produces two macrostates: the first state is called "unbound" and\n     consists of bins containing the (2-dimensional) progress coordinate\n     values (9.0, 1.0) and (9.0, 2.0); the second state is called "bound"\n     and consists of the single bin containing the point (0.1, 0.0).\n\n  3. Arbitrary state definitions may be supplied by a user-defined function,\n     specified as --states-from-function=MODULE.FUNCTION. This function is\n     called with the bin mapper as an argument (``function(mapper)``) and must\n     return a list of dictionaries, one per state. Each dictionary must contain\n     a vector of coordinate tuples with key "coords"; the bins into which each\n     of these tuples falls define the state. An optional name for the state\n     (with key "label") may also be provided.\n\n\n-----------------------------------------------------------------------------\nOutput format\n-----------------------------------------------------------------------------\n\nThe output file (-o/--output, by default "assign.h5") contains the following\nattributes datasets:\n\n  ``nbins`` attribute\n    *(Integer)* Number of valid bins. Bin assignments range from 0 to\n    *nbins*-1, inclusive.\n\n  ``nstates`` attribute\n    *(Integer)* Number of valid macrostates (may be zero if no such states are\n    specified). Trajectory ensemble assignments range from 0 to *nstates*-1,\n    inclusive, when states are defined.\n\n  ``/assignments`` [iteration][segment][timepoint]\n    *(Integer)* Per-segment and -timepoint assignments (bin indices).\n\n  ``/npts`` [iteration]\n    *(Integer)* Number of timepoints in each iteration.\n\n  ``/nsegs`` [iteration]\n    *(Integer)* Number of segments in each iteration.\n\n  ``/labeled_populations`` [iterations][state][bin]\n    *(Floating-point)* Per-iteration and -timepoint bin populations, labeled\n    by most recently visited macrostate. The last state entry (*nstates-1*)\n    corresponds to trajectories initiated outside of a defined macrostate.\n\n  ``/bin_labels`` [bin]\n    *(String)* Text labels of bins.\n\nWhen macrostate assignments are given, the following additional datasets are\npresent:\n\n  ``/trajlabels`` [iteration][segment][timepoint]\n    *(Integer)* Per-segment and -timepoint trajectory labels, indicating the\n    macrostate which each trajectory last visited.\n\n  ``/state_labels`` [state]\n    *(String)* Labels of states.\n\n  ``/state_map`` [bin]\n    *(Integer)* Mapping of bin index to the macrostate containing that bin.\n    An entry will contain *nbins+1* if that bin does not fall into a\n    macrostate.\n\nDatasets indexed by state and bin contain one more entry than the number of\nvalid states or bins. For *N* bins, axes indexed by bin are of size *N+1*, and\nentry *N* (0-based indexing) corresponds to a walker outside of the defined bin\nspace (which will cause most mappers to raise an error). 
More importantly, for\n*M* states (including the case *M=0* where no states are specified), axes\nindexed by state are of size *M+1* and entry *M* refers to trajectories\ninitiated in a region not corresponding to a defined macrostate.\n\nThus, ``labeled_populations[:,:,:].sum(axis=1)[:,:-1]`` gives overall per-bin\npopulations, for all defined bins and\n``labeled_populations[:,:,:].sum(axis=2)[:,:-1]`` gives overall\nper-trajectory-ensemble populations for all defined states.\n\n\n-----------------------------------------------------------------------------\nParallelization\n-----------------------------------------------------------------------------\n\nThis tool supports parallelized binning, including reading/calculating input\ndata.\n\n\n-----------------------------------------------------------------------------\nCommand-line options\n-----------------------------------------------------------------------------\n'
add_args(parser)

Add arguments specific to this tool to the given argparse parser.

process_args(args)

Take argparse-processed arguments associated with this tool and deal with them appropriately (setting instance variables, etc)

parse_cmdline_states(state_strings)
load_config_from_west(scheme)
load_state_file(state_filename)
states_from_dict(ystates)
load_states_from_function(statefunc)
assign_iteration(n_iter, nstates, nbins, state_map, last_labels)

Method to encapsulate the segment slicing (into n_worker slices) and parallel job submission. Submits job(s), waits on completion, and splices the results back together. Returns: assignments, trajlabels, pops for this iteration.

go()

Perform the analysis associated with this tool.

westpa.cli.tools.w_assign.entry_point()
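
For reference, the tool can also be driven programmatically; a minimal sketch
(assuming WAssign() can be constructed with default arguments, as for
WESTParallelTool):

from westpa.cli.tools.w_assign import WAssign

if __name__ == '__main__':
    # Equivalent in spirit to the console-script entry point: build the
    # parser, process arguments, then run go() in the master process.
    WAssign().main()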