### Abstract

The time-average reward for a discrete-time controlled Markov process subject to a time-average cost constraint is maximized over the class of all causal policies. At each epoch a reward, depending on the state and action, is earned, and a similarly constituted cost is assessed; the time average of the former is maximized, subject to a hard limit on the time average of the latter. The state space is assumed finite, and the action space compact metric. An accessibility hypothesis makes it possible to utilize a Lagrange multiplier formulation involving the dynamic programming equation, thus reducing the constrained optimization problem to an unconstrained optimization parametrized by the multiplier. The parametrized dynamic programming equation possesses compactness and convergence properties that lead to the following: if the constraint can be satisfied by any causal policy, the supremum of the time-average reward over all causal policies is attained by either a simple or a mixed policy; the latter is equivalent to choosing independently at each epoch between two specified simple policies by the throw of a biased coin.
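The mixed-policy construction in the abstract can be illustrated with a toy simulation. The sketch below uses hypothetical per-epoch rewards and costs for two simple (stationary deterministic) policies, labeled A and B; it is not the paper's construction, which obtains the two policies and the mixing probability from the Lagrangian dynamic programming equation. Here the coin bias `p` is simply chosen so that the expected per-epoch cost meets the constraint exactly.

```python
import random

# Hypothetical per-epoch rewards and costs of two simple policies:
# policy A earns more but also costs more than policy B.
reward = {"A": 10.0, "B": 4.0}
cost = {"A": 3.0, "B": 1.0}
cost_limit = 2.0  # hard limit on the time-average cost

# Bias p of the coin, chosen so the expected cost meets the limit:
#   p * cost[A] + (1 - p) * cost[B] = cost_limit
p = (cost_limit - cost["B"]) / (cost["A"] - cost["B"])

random.seed(0)
T = 100_000
total_reward = total_cost = 0.0
for _ in range(T):
    # Independently at each epoch, a biased coin toss selects
    # which of the two simple policies acts.
    a = "A" if random.random() < p else "B"
    total_reward += reward[a]
    total_cost += cost[a]

# Time averages: cost ≈ cost_limit, reward between the two pure policies' rates.
print(total_reward / T, total_cost / T)
```

The point of the example is only that per-epoch randomization between two simple policies yields a time-average cost that saturates the constraint while achieving a reward no pure policy satisfying the constraint could exceed.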

Original language | English (US)
---|---
Pages (from-to) | 236-252
Number of pages | 17
Journal | Journal of Mathematical Analysis and Applications
Volume | 112
Issue number | 1
DOIs | https://doi.org/10.1016/0022-247X(85)90288-4
State | Published - Nov 15 1985


### ASJC Scopus subject areas

- Analysis
- Applied Mathematics

### Cite this

Beutler, Frederick J.; Ross, Keith. **Optimal policies for controlled Markov chains with a constraint.** *Journal of Mathematical Analysis and Applications*, vol. 112, no. 1, 1985, pp. 236-252. https://doi.org/10.1016/0022-247X(85)90288-4

Research output: Contribution to journal › Article

TY - JOUR

T1 - Optimal policies for controlled Markov chains with a constraint

AU - Beutler, Frederick J.

AU - Ross, Keith

PY - 1985/11/15

Y1 - 1985/11/15

N2 - The time-average reward for a discrete-time controlled Markov process subject to a time-average cost constraint is maximized over the class of all causal policies. At each epoch a reward, depending on the state and action, is earned, and a similarly constituted cost is assessed; the time average of the former is maximized, subject to a hard limit on the time average of the latter. The state space is assumed finite, and the action space compact metric. An accessibility hypothesis makes it possible to utilize a Lagrange multiplier formulation involving the dynamic programming equation, thus reducing the constrained optimization problem to an unconstrained optimization parametrized by the multiplier. The parametrized dynamic programming equation possesses compactness and convergence properties that lead to the following: if the constraint can be satisfied by any causal policy, the supremum of the time-average reward over all causal policies is attained by either a simple or a mixed policy; the latter is equivalent to choosing independently at each epoch between two specified simple policies by the throw of a biased coin.

AB - The time-average reward for a discrete-time controlled Markov process subject to a time-average cost constraint is maximized over the class of all causal policies. At each epoch a reward, depending on the state and action, is earned, and a similarly constituted cost is assessed; the time average of the former is maximized, subject to a hard limit on the time average of the latter. The state space is assumed finite, and the action space compact metric. An accessibility hypothesis makes it possible to utilize a Lagrange multiplier formulation involving the dynamic programming equation, thus reducing the constrained optimization problem to an unconstrained optimization parametrized by the multiplier. The parametrized dynamic programming equation possesses compactness and convergence properties that lead to the following: if the constraint can be satisfied by any causal policy, the supremum of the time-average reward over all causal policies is attained by either a simple or a mixed policy; the latter is equivalent to choosing independently at each epoch between two specified simple policies by the throw of a biased coin.

UR - http://www.scopus.com/inward/record.url?scp=0022151359&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0022151359&partnerID=8YFLogxK

U2 - 10.1016/0022-247X(85)90288-4

DO - 10.1016/0022-247X(85)90288-4

M3 - Article

AN - SCOPUS:0022151359

VL - 112

SP - 236

EP - 252

JO - Journal of Mathematical Analysis and Applications

JF - Journal of Mathematical Analysis and Applications

SN - 0022-247X

IS - 1

ER -