Optimal policies for controlled Markov chains with a constraint

Frederick J. Beutler, Keith Ross

    Research output: Contribution to journal › Article

    Abstract

    The time-average reward for a discrete-time controlled Markov process subject to a time-average cost constraint is maximized over the class of all causal policies. Each epoch, a reward depending on the state and action is earned, and a similarly constituted cost is assessed; the time average of the former is maximized, subject to a hard limit on the time average of the latter. It is assumed that the state space is finite and the action space compact metric. An accessibility hypothesis makes it possible to utilize a Lagrange multiplier formulation involving the dynamic programming equation, thus reducing the optimization problem to an unconstrained optimization parametrized by the multiplier. The parametrized dynamic programming equation possesses compactness and convergence properties that lead to the following: if the constraint can be satisfied by any causal policy, the supremum of the time-average reward over all causal policies is attained by either a simple or a mixed policy; the latter is equivalent to choosing independently at each epoch between two specified simple policies by the throw of a biased coin.
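
    The Lagrangian reformulation described in the abstract can be sketched as follows; the notation here (reward r, cost c, constraint level \alpha, multiplier \lambda) is illustrative and not drawn from the paper itself. The constrained problem is

    \[
      \sup_{\pi}\;\liminf_{n\to\infty}\frac{1}{n}\,E_{\pi}\sum_{t=1}^{n} r(X_t, A_t)
      \quad\text{subject to}\quad
      \limsup_{n\to\infty}\frac{1}{n}\,E_{\pi}\sum_{t=1}^{n} c(X_t, A_t)\le\alpha .
    \]

    For each multiplier \lambda \ge 0, the unconstrained problem maximizes the time average of r - \lambda c, and the parametrized dynamic programming equation takes the standard average-reward form

    \[
      g(\lambda) + v_{\lambda}(x)
      = \max_{a\in A}\Big[\, r(x,a) - \lambda c(x,a) + \sum_{y} P(y\mid x,a)\, v_{\lambda}(y) \Big],
    \]

    where g(\lambda) is the optimal average Lagrangian reward and v_{\lambda} a relative value function. A mixed policy in the sense of the abstract randomizes independently at each epoch between two simple (stationary, deterministic) policies with a fixed bias p, chosen so that the resulting time-average cost meets the constraint with equality.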

    Original language: English (US)
    Pages (from-to): 236-252
    Number of pages: 17
    Journal: Journal of Mathematical Analysis and Applications
    Volume: 112
    Issue number: 1
    DOI: 10.1016/0022-247X(85)90288-4
    State: Published - Nov 15 1985

    ASJC Scopus subject areas

    • Analysis
    • Applied Mathematics

    Cite this

    Optimal policies for controlled Markov chains with a constraint. / Beutler, Frederick J.; Ross, Keith.

    In: Journal of Mathematical Analysis and Applications, Vol. 112, No. 1, 15.11.1985, p. 236-252.

    Research output: Contribution to journal › Article

    Beutler, Frederick J.; Ross, Keith. / Optimal policies for controlled Markov chains with a constraint. In: Journal of Mathematical Analysis and Applications. 1985; Vol. 112, No. 1. pp. 236-252.
    @article{21115d36ad9a45ebb44565e3c56152e2,
    title = "Optimal policies for controlled Markov chains with a constraint",
    abstract = "The time average reward for a discrete-time controlled Markov process subject to a time-average cost constraint is maximized over the class of al causal policies. Each epoch, a reward depending on the state and action, is earned, and a similarly constituted cost is assessed; the time average of the former is maximized, subject to a hard limit on the time average of the latter. It is assumed that the state space is finite, and the action space compact metric. An accessibility hypothesis makes it possible to utilize a Lagrange multiplier formulation involving the dynamic programming equation, thus reducing the optimization problem to an unconstrained optimization parametrized by the multiplier. The parametrized dynamic programming equation possesses compactness and convergence properties that lead to the following: If the constraint can be satisfied by any causal policy, the supremum over time-average rewards respective to all causal policies is attained by either a simple or a mixed policy; the latter is equivalent to choosing independently at each epoch between two specified simple policies by the throw of a biased coin.",
    author = "Beutler, {Frederick J.} and Keith Ross",
    year = "1985",
    month = "11",
    day = "15",
    doi = "10.1016/0022-247X(85)90288-4",
    language = "English (US)",
    volume = "112",
    pages = "236--252",
    journal = "Journal of Mathematical Analysis and Applications",
    issn = "0022-247X",
    publisher = "Academic Press Inc.",
    number = "1",

    }

    TY - JOUR

    T1 - Optimal policies for controlled Markov chains with a constraint

    AU - Beutler, Frederick J.

    AU - Ross, Keith

    PY - 1985/11/15

    Y1 - 1985/11/15

    N2 - The time-average reward for a discrete-time controlled Markov process subject to a time-average cost constraint is maximized over the class of all causal policies. Each epoch, a reward depending on the state and action is earned, and a similarly constituted cost is assessed; the time average of the former is maximized, subject to a hard limit on the time average of the latter. It is assumed that the state space is finite and the action space compact metric. An accessibility hypothesis makes it possible to utilize a Lagrange multiplier formulation involving the dynamic programming equation, thus reducing the optimization problem to an unconstrained optimization parametrized by the multiplier. The parametrized dynamic programming equation possesses compactness and convergence properties that lead to the following: if the constraint can be satisfied by any causal policy, the supremum of the time-average reward over all causal policies is attained by either a simple or a mixed policy; the latter is equivalent to choosing independently at each epoch between two specified simple policies by the throw of a biased coin.

    AB - The time-average reward for a discrete-time controlled Markov process subject to a time-average cost constraint is maximized over the class of all causal policies. Each epoch, a reward depending on the state and action is earned, and a similarly constituted cost is assessed; the time average of the former is maximized, subject to a hard limit on the time average of the latter. It is assumed that the state space is finite and the action space compact metric. An accessibility hypothesis makes it possible to utilize a Lagrange multiplier formulation involving the dynamic programming equation, thus reducing the optimization problem to an unconstrained optimization parametrized by the multiplier. The parametrized dynamic programming equation possesses compactness and convergence properties that lead to the following: if the constraint can be satisfied by any causal policy, the supremum of the time-average reward over all causal policies is attained by either a simple or a mixed policy; the latter is equivalent to choosing independently at each epoch between two specified simple policies by the throw of a biased coin.

    UR - http://www.scopus.com/inward/record.url?scp=0022151359&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=0022151359&partnerID=8YFLogxK

    U2 - 10.1016/0022-247X(85)90288-4

    DO - 10.1016/0022-247X(85)90288-4

    M3 - Article

    AN - SCOPUS:0022151359

    VL - 112

    SP - 236

    EP - 252

    JO - Journal of Mathematical Analysis and Applications

    JF - Journal of Mathematical Analysis and Applications

    SN - 0022-247X

    IS - 1

    ER -