Reinforcement Learning Viva Questions & Answers – MU AI & DS Semester 8

1. Introduction to Reinforcement Learning (4 Hours)

  • Key features and Elements of RL
  • Types of RL
  • Rewards
  • Reinforcement Learning Algorithms:
    • Q-Learning
    • State-Action-Reward-State-Action (SARSA)

q1
What is Reinforcement Learning (RL)?
a1
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. It differs from supervised learning in that it does not require labeled input-output pairs but instead learns from the consequences of its actions.


q2
What are the key features and elements of Reinforcement Learning?
a2
The key features of RL include an agent, an environment, actions, states, and rewards. The agent interacts with the environment by taking actions based on its current state, and the environment responds with a new state and a reward. The goal of the agent is to maximize the cumulative reward over time.


q3
Explain the term “reward” in the context of Reinforcement Learning.
a3
In RL, a reward is a scalar feedback signal that the agent receives after performing an action in a particular state. It indicates the immediate benefit of the action and helps the agent determine whether it is on the right path to achieving its goal. The agent aims to maximize the total reward over time.


q4
What are the different types of Reinforcement Learning?
a4
The main types of RL are:

  1. Model-Free RL: The agent learns directly from interactions with the environment without building a model of it.
  2. Model-Based RL: The agent builds a model of the environment’s dynamics to plan and predict the outcomes of actions.

q5
How does Q-Learning work in Reinforcement Learning?
a5
Q-Learning is a model-free RL algorithm where the agent learns the value of state-action pairs. It uses a Q-table to store Q-values (expected future rewards) for each state-action pair. The Q-values are updated iteratively based on the Bellman equation until the agent converges to the optimal policy.


q6
What is the Bellman equation in Q-Learning?
a6
The Bellman equation defines the relationship between the value of a state-action pair and the rewards the agent can expect from that state onward. In Q-Learning, the Q-value is updated using the formula:
Q(s, a) = Q(s, a) + α [r + γ * max(Q(s’, a’)) – Q(s, a)],
where r is the reward, γ is the discount factor, and α is the learning rate.
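To make the update concrete, here is a minimal sketch of the tabular Q-Learning update for a toy problem; the state/action counts, learning rate, and discount factor are illustrative assumptions, not part of the syllabus.

```python
import numpy as np

# Toy Q-table over integer-indexed states and actions (sizes are illustrative).
n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9                 # learning rate and discount factor
Q = np.zeros((n_states, n_actions))     # Q-values initialised to zero

def q_learning_update(s, a, r, s_next):
    """One Q-Learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])   # off-policy target: best next action
    Q[s, a] += alpha * (td_target - Q[s, a])    # scale the TD error by the learning rate

# Example transition: in state 0, action 1 yielded reward 1.0 and led to state 3.
q_learning_update(0, 1, 1.0, 3)
```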


q7
Explain the difference between Q-Learning and SARSA.
a7
Q-Learning is an off-policy algorithm, meaning it updates the Q-values based on the best possible action (maximizing the future reward) regardless of the agent’s actual actions.
SARSA (State-Action-Reward-State-Action) is an on-policy algorithm, where Q-values are updated based on the action actually taken by the agent in the next state.


q8
What is the “State-Action-Reward-State-Action” (SARSA) algorithm in RL?
a8
SARSA is an on-policy reinforcement learning algorithm that updates Q-values based on the action actually taken in the next state. The agent uses the formula:
Q(s, a) = Q(s, a) + α [r + γ * Q(s’, a’) – Q(s, a)],
where r is the reward, γ is the discount factor, α is the learning rate, and s’, a’ are the next state and action taken.
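A matching sketch of the SARSA update, assuming the same kind of integer-indexed Q-table as in the Q-Learning sketch above; note that the target uses the action the agent actually takes next rather than the greedy maximum.

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9
Q = np.zeros((n_states, n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    """One SARSA step: the target uses Q(s', a') for the action actually chosen."""
    td_target = r + gamma * Q[s_next, a_next]   # on-policy target (no max over actions)
    Q[s, a] += alpha * (td_target - Q[s, a])

# The agent took action 1 in state 0, got reward 1.0, reached state 3, and chose action 0 there.
sarsa_update(0, 1, 1.0, 3, 0)
```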


q9
How do exploration and exploitation balance in RL algorithms like Q-Learning and SARSA?
a9
In RL, exploration refers to trying new actions to discover their effects, while exploitation involves choosing actions that are known to give high rewards. In Q-Learning and SARSA, this balance is often achieved using an ε-greedy strategy, where the agent explores with probability ε and exploits with probability (1-ε).
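A minimal sketch of ε-greedy action selection over one row of a Q-table; the ε value and Q-values below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded for reproducibility

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the current best action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # random exploratory action
    return int(np.argmax(q_values))               # greedy (exploiting) action

action = epsilon_greedy(np.array([0.2, 0.5, 0.1]), epsilon=0.1)
```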


q10
What role does the discount factor (γ) play in RL algorithms?
a10
The discount factor (γ) determines the importance of future rewards compared to immediate rewards. A value of γ close to 0 makes the agent focus more on immediate rewards, while a γ close to 1 makes the agent consider long-term rewards. It helps balance short-term vs. long-term decision making.

2. Bandit Problems and Online Learning (7 Hours)

  • n-Armed Bandit Problem
  • Action-Value Methods
  • Tracking a Nonstationary Problem
  • Optimistic Initial Values
  • Upper-Confidence-Bound Action Selection
  • Gradient Bandits

q1
What is the n-Armed Bandit Problem in reinforcement learning?
a1
The n-Armed Bandit Problem is a simplified RL problem where an agent has to choose from n actions (bandit arms), each associated with an unknown reward distribution. The goal is to maximize the total reward by selecting the optimal actions over time, balancing exploration (trying new actions) and exploitation (choosing the best-known actions).


q2
Explain Action-Value Methods in the context of Bandit Problems.
a2
Action-Value Methods are strategies used to estimate the value of each action in the bandit problem. The value of an action is represented by the expected reward it produces. The agent updates the value of each action based on the rewards received over time, usually using running averages to improve its decision-making.


q3
How do Action-Value Methods work to solve the n-Armed Bandit Problem?
a3
In Action-Value Methods, the agent maintains an estimate of the expected reward for each arm. After selecting an arm and receiving a reward, the agent updates the estimated value of that arm using the formula:
Q(a) = Q(a) + (1/n) * (r – Q(a)),
where Q(a) is the estimated value, r is the reward, and n is the number of times action a has been taken.
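A sketch of this incremental sample-average update for a toy 10-armed bandit; the arm count is an assumption for illustration.

```python
import numpy as np

n_arms = 10
Q = np.zeros(n_arms)   # estimated value of each arm
N = np.zeros(n_arms)   # number of times each arm has been pulled

def update_estimate(a, r):
    """Incremental running average: Q(a) <- Q(a) + (1/n) * (r - Q(a))."""
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]   # step size 1/n reproduces the sample mean

update_estimate(3, 1.0)   # arm 3 returned a reward of 1.0
```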


q4
What is the challenge of tracking a nonstationary problem in Bandit Problems?
a4
In nonstationary problems, the reward distributions of the actions change over time, making it difficult for the agent to maintain accurate estimates of action values. The challenge is to adjust the action-value estimates quickly in response to these changes while still making good decisions.
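A common remedy (standard in the bandit literature, though not spelled out in the answer above) is to replace the 1/n step size with a small constant α, which weights recent rewards more heavily than old ones; a minimal sketch:

```python
import numpy as np

n_arms = 10
alpha = 0.1            # constant step size instead of 1/n
Q = np.zeros(n_arms)

def update_nonstationary(a, r):
    """Exponential recency-weighted average: recent rewards dominate old ones."""
    Q[a] += alpha * (r - Q[a])

update_nonstationary(2, 0.5)
```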


q5
How can Optimistic Initial Values help in solving Bandit Problems?
a5
Optimistic Initial Values are a technique where the agent starts with high initial estimates of the value of each action. This encourages exploration, as the agent is motivated to try different actions to improve the value estimates. Over time, the estimates are refined, leading to more exploitation of the best-performing actions.


q6
What is the Upper-Confidence-Bound (UCB) action selection strategy?
a6
The Upper-Confidence-Bound (UCB) strategy is a method for balancing exploration and exploitation. It selects actions based on both the current estimated value of the action and the uncertainty in that estimate, which shrinks as the action is tried more often. The action that maximizes the upper confidence bound is chosen, encouraging exploration of actions with high uncertainty.


q7
Explain how Upper-Confidence-Bound Action Selection works mathematically.
a7
The UCB action selection works by selecting the action a with the highest value of:
Q(a) + c * sqrt((2 * ln(t)) / n(a)),
where Q(a) is the estimated action value, c is a constant controlling exploration, t is the total number of actions taken, and n(a) is the number of times action a has been taken. This equation ensures that actions with high uncertainty (low n(a)) are more likely to be selected.
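A sketch of UCB selection matching the formula above, assuming Q and N arrays like those in the earlier bandit sketches; untried arms are selected first so that n(a) is never zero in the bonus term.

```python
import numpy as np

n_arms = 10
Q = np.zeros(n_arms)   # estimated action values
N = np.zeros(n_arms)   # pull counts per arm

def ucb_select(t, c=1.0):
    """Pick the arm maximising Q(a) + c * sqrt((2 * ln(t)) / N(a))."""
    if np.any(N == 0):
        return int(np.argmin(N))                    # try every arm at least once
    bonus = c * np.sqrt(2.0 * np.log(t) / N)        # uncertainty bonus shrinks as N(a) grows
    return int(np.argmax(Q + bonus))

arm = ucb_select(t=1)
```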


q8
What are Gradient Bandits in the context of the Bandit Problem?
a8
Gradient Bandits are a class of methods that directly optimize the parameters of the action-selection policy in bandit problems. Instead of estimating a value for each action, they maintain a numerical preference for each action, which is converted into selection probabilities (typically via a softmax distribution). The preferences are adjusted over time by stochastic gradient ascent on the expected reward.


q9
How does the Gradient Bandit Algorithm update the action preferences?
a9
In the Gradient Bandit algorithm, the action preferences are updated based on the reward received relative to a baseline. After selecting action a and receiving reward r, the update rules are:
H(a) ← H(a) + α * (r – b) * (1 – π(a)) for the selected action a, and
H(a') ← H(a') – α * (r – b) * π(a') for every other action a',
where H(a) is the preference for action a, α is the step size, r is the received reward, b is a baseline (typically the average reward so far), and π(a) is the probability of selecting action a under the softmax of the preferences. This ensures that actions yielding rewards above the baseline become more likely to be chosen.
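A sketch of the gradient-bandit update with a softmax policy over preferences and the running average reward as the baseline; the arm count and step size are illustrative.

```python
import numpy as np

n_arms, alpha = 10, 0.1
H = np.zeros(n_arms)        # action preferences
baseline, t = 0.0, 0        # running average reward used as the baseline b

def softmax(h):
    z = np.exp(h - np.max(h))    # subtract the max for numerical stability
    return z / z.sum()

def gradient_bandit_update(a, r):
    """Raise preferences of actions that beat the baseline, lower the others."""
    global baseline, t
    t += 1
    baseline += (r - baseline) / t               # incremental average reward
    pi = softmax(H)
    H[a] += alpha * (r - baseline) * (1 - pi[a])          # selected action
    others = np.arange(n_arms) != a
    H[others] -= alpha * (r - baseline) * pi[others]      # all other actions

gradient_bandit_update(3, 1.0)
```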


q10
What is the role of exploration vs exploitation in solving Bandit Problems?
a10
Exploration vs exploitation is a key challenge in Bandit Problems. Exploration means trying different actions to discover their reward distributions, while exploitation involves choosing actions that are already known to provide high rewards. Balancing both is essential to maximize the total reward over time, and strategies like ε-greedy, UCB, and Gradient Bandits are designed to manage this balance.

3. Markov Decision Processes (7 Hours)

  • The Agent–Environment Interface
  • Goals and Rewards
  • Returns
  • Markov Properties
  • Markov Decision Process
  • Value Functions
  • Optimal Value Functions

q1
What is the Agent–Environment Interface in Reinforcement Learning?
a1
The Agent–Environment Interface refers to the interaction between the agent and the environment. The agent takes actions based on the current state of the environment, and the environment responds by transitioning to a new state and providing a reward. This cycle continues as the agent aims to maximize cumulative rewards through its actions.


q2
Explain the concepts of Goals and Rewards in the context of Markov Decision Processes.
a2
In Markov Decision Processes (MDPs), the agent’s goal is to maximize its cumulative reward over time. The reward is a feedback signal that indicates the immediate benefit of an action in a given state. The agent strives to find a policy that maximizes the long-term total reward, often referred to as the return.


q3
What are Returns in Reinforcement Learning?
a3
In Reinforcement Learning, the return refers to the total accumulated reward the agent receives from a specific point in time onward. The return can be discounted over time to account for future rewards. It is typically represented as the sum of rewards, either in a discounted or undiscounted form, depending on the problem setup.


q4
What are the Markov Properties in the context of MDPs?
a4
The Markov Property (also known as the memoryless property) states that the future state of the system depends only on the current state, not on the sequence of states that preceded it. In MDPs, this means that the decision-making process depends solely on the current state and action, not on the history of past states or actions.


q5
What is a Markov Decision Process (MDP)?
a5
A Markov Decision Process (MDP) is a mathematical model used to describe decision-making problems where an agent interacts with an environment that follows the Markov Property. An MDP is defined by the tuple (S, A, P, R, γ), where:

  • S is the set of states,
  • A is the set of actions,
  • P is the transition probability function,
  • R is the reward function,
  • γ is the discount factor.
The agent’s objective is to choose a policy that maximizes the expected cumulative reward over time.
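A minimal sketch of how the tuple (S, A, P, R, γ) can be written down for a toy two-state, two-action problem; every number below is illustrative.

```python
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] = probability of moving to s' when taking action a in state s.
P = np.array([
    [[1.0, 0.0], [0.2, 0.8]],   # transitions from state 0 under actions 0 and 1
    [[1.0, 0.0], [0.0, 1.0]],   # transitions from state 1 under actions 0 and 1
])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([
    [0.0, 1.0],
    [0.0, 2.0],
])
```

The Dynamic Programming sketches in the next unit reuse this array layout (P with shape states × actions × states, R with shape states × actions).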

q6
What are Value Functions in Markov Decision Processes?
a6
A Value Function is a function that estimates the long-term reward that can be expected from a given state or state-action pair. The two main types are:

  • State Value Function (V(s)): The expected return starting from state s and following a certain policy.
  • Action Value Function (Q(s, a)): The expected return starting from state s, taking action a, and following a certain policy.
The value function helps the agent evaluate how good a state or action is in terms of the rewards it will eventually yield.

q7
What is the difference between State Value Function (V(s)) and Action Value Function (Q(s, a))?
a7
The State Value Function (V(s)) represents the expected return starting from a state s and following a particular policy.
The Action Value Function (Q(s, a)) represents the expected return starting from state s, taking action a, and following a particular policy.
In essence, the state value function is concerned with the long-term value of being in a state, while the action value function is concerned with the value of taking a specific action in that state.


q8
What is the Optimal Value Function?
a8
The Optimal Value Function is the value function that gives the highest possible expected return from any state. It is the value of the state under the best possible policy. The optimal state value function is denoted V*(s), and the optimal action value function is denoted Q*(s, a). These functions represent the highest achievable rewards, regardless of the specific policy currently followed.


q9
How do you compute the Optimal Value Function in MDPs?
a9
The Optimal Value Function can be computed using Dynamic Programming techniques such as Value Iteration or Policy Iteration. In Value Iteration, the function is updated iteratively by using the Bellman equation for optimality:
V(s) = max_a [R(s, a) + γ * Σ P(s’ | s, a) * V(s’)],
where R(s, a) is the immediate reward for action a in state s, γ is the discount factor, and P(s’ | s, a) is the transition probability from state s to state s’ when action a is taken.


q10
What is the role of the discount factor (γ) in Markov Decision Processes?
a10
The discount factor (γ) in MDPs determines how much future rewards are valued compared to immediate rewards. A discount factor close to 0 makes the agent focus mainly on immediate rewards, while a value close to 1 encourages the agent to consider future rewards more significantly. It plays a crucial role in determining the long-term strategy of the agent.

4. Dynamic Programming (7 Hours)

  • Policy Evaluation (Prediction)
  • Policy Improvement
  • Policy Iteration
  • Value Iteration
  • Asynchronous Dynamic Programming
  • Generalized Policy Iteration

q1
What is Policy Evaluation (Prediction) in Dynamic Programming?
a1
Policy Evaluation (also known as Prediction) refers to the process of calculating the value function for a given policy. It involves iteratively updating the value of each state based on the expected returns when following the policy. The goal is to determine how good each state is under the current policy, and it is done using the Bellman equation.


q2
How is the Policy Evaluation process implemented?
a2
Policy evaluation is performed by iteratively updating the value function for each state. The value of each state V(s) is updated using the equation:
V(s) ← Σ P(s’ | s, π(s)) [R(s, π(s), s’) + γ * V(s’)],
where π(s) is the action prescribed by the policy in state s, P(s’ | s, π(s)) is the transition probability from state s to state s’, and R(s, π(s), s’) is the reward for the transition from state s to state s’.
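A sketch of iterative policy evaluation for a deterministic policy, using expected rewards R(s, a) rather than R(s, π(s), s') for brevity; the toy dynamics below are illustrative and follow the array layout from the MDP sketch in the previous unit.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # toy uniform dynamics
R = np.ones((n_states, n_actions))                            # toy rewards
pi = np.zeros(n_states, dtype=int)                            # deterministic policy: one action per state

def policy_evaluation(pi, theta=1e-6):
    """Repeat the Bellman expectation backup until the values stop changing."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            a = pi[s]
            v_new = R[s, a] + gamma * P[s, a] @ V   # expected reward + discounted next value
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

V = policy_evaluation(pi)
```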


q3
What is Policy Improvement in Dynamic Programming?
a3
Policy Improvement is the process of improving the current policy by making it greedier with respect to the current value function. After performing policy evaluation, the agent updates its policy by choosing the action that maximizes the expected return for each state based on the current value function. This leads to a better policy, which can then be further evaluated and improved.


q4
How is Policy Improvement performed mathematically?
a4
Policy improvement is performed by selecting the action a in each state s that maximizes the action value function Q(s, a), which is derived from the current value function V(s). The new policy π’ is updated as:
π'(s) = argmax_a [R(s, a) + γ * Σ P(s’ | s, a) * V(s’)],
where the action a that maximizes the right-hand side is chosen as the new policy for state s.
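A matching sketch of the greedy improvement step, reusing the same array layout; with expected rewards R(s, a), the bracketed term is simply Q(s, a) computed from the current V.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)   # toy dynamics
R = np.ones((n_states, n_actions))                             # toy rewards

def policy_improvement(V):
    """pi'(s) = argmax_a [R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')]."""
    q = R + gamma * P @ V            # shape (n_states, n_actions): Q(s, a) implied by V
    return np.argmax(q, axis=1)      # greedy action in each state

new_pi = policy_improvement(np.zeros(n_states))
```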


q5
What is Policy Iteration in Dynamic Programming?
a5
Policy Iteration is an iterative process that alternates between policy evaluation and policy improvement. In each iteration, the agent first evaluates the current policy and then improves the policy by selecting actions that maximize the value function. This process continues until the policy converges to the optimal policy.
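A compact, self-contained sketch of the full loop, combining the evaluation and improvement steps sketched above and stopping once the policy is stable; the toy dynamics and rewards are again illustrative.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)   # toy dynamics
R = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])             # toy rewards per (state, action)

def evaluate(pi, theta=1e-6):
    V = np.zeros(n_states)
    while True:
        V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

def policy_iteration():
    pi = np.zeros(n_states, dtype=int)
    while True:
        V = evaluate(pi)                                   # 1. policy evaluation
        new_pi = np.argmax(R + gamma * P @ V, axis=1)      # 2. greedy policy improvement
        if np.array_equal(new_pi, pi):                     # 3. stop when the policy is stable
            return pi, V
        pi = new_pi

best_pi, best_V = policy_iteration()
```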


q6
Explain the steps involved in Policy Iteration.
a6
The steps of Policy Iteration are:

  1. Policy Evaluation: Evaluate the current policy by calculating the value function for all states.
  2. Policy Improvement: Improve the policy by selecting actions that maximize the expected return based on the current value function.
  3. Repeat the above steps until the policy converges (i.e., no changes are made during the improvement step).

q7
What is Value Iteration and how does it differ from Policy Iteration?
a7
Value Iteration is an approach where the value function is iteratively updated until it converges to the optimal value function. Unlike Policy Iteration, which updates the policy and value function in separate steps, Value Iteration combines policy evaluation and policy improvement into a single step. At each iteration, the value of each state is updated using the Bellman equation, and the policy is implicitly improved as the value function converges.


q8
How does Value Iteration update the value function?
a8
In Value Iteration, the value function for each state is updated iteratively using the Bellman optimality equation:
V(s) ← max_a [R(s, a) + γ * Σ P(s’ | s, a) * V(s’)],
where the maximum is taken over all possible actions a. The process continues until the value function converges, indicating that the optimal policy has been reached.
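A sketch of value iteration with the same array layout; each sweep applies the Bellman optimality backup to every state at once, and the greedy policy is read off at the end.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)   # toy dynamics
R = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])             # toy rewards

def value_iteration(theta=1e-6):
    """Replace V(s) with max_a [R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')] until convergence."""
    V = np.zeros(n_states)
    while True:
        V_new = np.max(R + gamma * P @ V, axis=1)   # Bellman optimality backup for all states
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    policy = np.argmax(R + gamma * P @ V, axis=1)   # extract the implied greedy policy
    return V, policy

V_opt, pi_opt = value_iteration()
```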


q9
What is Asynchronous Dynamic Programming?
a9
Asynchronous Dynamic Programming is a variation of Dynamic Programming where the updates to the value function or policy are done asynchronously rather than synchronously for all states. This means that updates are applied to states at different times, allowing for faster convergence in some cases. It is particularly useful when the state space is large, and updating all states at once is computationally expensive.


q10
What is Generalized Policy Iteration (GPI)?
a10
Generalized Policy Iteration (GPI) refers to the combination of policy evaluation and policy improvement processes in a way that they do not have to be performed in strict alternation. In GPI, both the policy and the value function are updated in parallel, and the updates may occur at different rates. The process of GPI continues until the policy and value function converge to their optimal forms, making it a more flexible approach compared to traditional policy iteration or value iteration.

5. Monte Carlo Methods and Temporal-Difference Learning (7 Hours)

  • Monte Carlo Prediction
  • Monte Carlo Estimation of Action Values
  • Monte Carlo Control
  • TD Prediction
  • TD Control using Q-Learning

q1
What is Monte Carlo Prediction in Reinforcement Learning?
a1
Monte Carlo Prediction is a method for estimating the value function of a policy by averaging the returns (sum of rewards) from multiple episodes. It works by generating episodes using the policy, and then calculating the average return for each state encountered in those episodes. The value function is updated based on these averages, providing an estimate of how good it is to be in each state under the given policy.


q2
How does Monte Carlo Estimation of Action Values work?
a2
Monte Carlo Estimation of Action Values is similar to Monte Carlo prediction but focuses on estimating the value of taking specific actions in states. The Q-value (action value) is estimated by averaging the returns observed after taking a particular action in a given state:
Q(s, a) = (1/n) * Σ G_i,
where n is the number of times action a has been taken in state s, and G_i is the return (cumulative discounted reward) observed after the i-th time that action was taken in that state.
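A sketch of every-visit Monte Carlo estimation of action values, assuming episodes are available as lists of (state, action, reward) tuples; the discount factor and the toy episode are illustrative.

```python
from collections import defaultdict

gamma = 0.9
returns_sum = defaultdict(float)   # running sum of returns per (state, action)
returns_count = defaultdict(int)   # number of returns observed per (state, action)
Q = defaultdict(float)             # current action-value estimates

def mc_update(episode):
    """Walk backwards through the episode, accumulating the discounted return G."""
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                        # return that followed this (s, a)
        returns_sum[(s, a)] += G
        returns_count[(s, a)] += 1
        Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]

mc_update([(0, 1, 0.0), (1, 0, 1.0)])            # toy two-step episode
```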


q3
What is Monte Carlo Control in Reinforcement Learning?
a3
Monte Carlo Control is a method used to find the optimal policy by combining Monte Carlo prediction and policy improvement. It works by repeatedly evaluating the current policy and improving it based on the action-value function. The policy is improved by selecting actions that maximize the Q-values, and this process continues until the policy converges to the optimal one.


q4
How is Monte Carlo Control used to find the optimal policy?
a4
Monte Carlo Control finds the optimal policy through an iterative process of policy evaluation and policy improvement. In each iteration, the action-value function Q(s, a) is estimated for each state-action pair by averaging the returns. Then, the policy is updated by selecting the action with the highest Q-value in each state. This process continues until the policy converges to the optimal policy.


q5
What is TD Prediction in Temporal-Difference Learning?
a5
TD Prediction (Temporal-Difference Prediction) is a method for estimating the value function of a policy by updating estimates after each step rather than waiting until the end of an episode. In TD learning, the value of a state is updated based on the observed reward and the value of the next state, using the formula:
V(s) ← V(s) + α [r + γ * V(s’) – V(s)],
where r is the reward, γ is the discount factor, V(s) is the current state value, and V(s’) is the value of the next state.
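A sketch of the tabular TD(0) update, which can be applied online after every transition; the state labels, step size, and discount factor are illustrative.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9
V = defaultdict(float)    # state values, defaulting to zero

def td0_update(s, r, s_next):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error

td0_update(s=0, r=1.0, s_next=1)   # one observed transition
```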


q6
How does TD Prediction differ from Monte Carlo Prediction?
a6
The main difference is that TD Prediction updates the value function after each step based on the observed reward and the estimated value of the next state, without waiting for the end of the episode. In contrast, Monte Carlo Prediction updates the value function only after the entire episode has completed, using the total return from that episode. TD learning is thus more efficient as it can update the value function incrementally.


q7
What is TD Control using Q-Learning?
a7
TD Control using Q-Learning is a model-free method that combines Temporal-Difference (TD) learning with action-value estimation to learn the optimal policy. The agent learns the Q-values for state-action pairs by using the following update rule:
Q(s, a) ← Q(s, a) + α [r + γ * max_a’ Q(s’, a’) – Q(s, a)],
where r is the reward, γ is the discount factor, and the max_a’ Q(s’, a’) term ensures that the agent selects the action that maximizes future rewards. Q-Learning is an off-policy algorithm, meaning the agent does not need to follow the current policy to update the Q-values.


q8
What makes Q-Learning an off-policy algorithm?
a8
Q-Learning is an off-policy algorithm because it updates the Q-values based on the assumption of taking the optimal action, even if the agent is not actually following the optimal policy. The update rule uses the maximum Q-value for the next state max_a’ Q(s’, a’), regardless of the action actually taken. This allows Q-Learning to learn the optimal policy even if the agent is exploring suboptimal actions.


q9
Explain how the TD Learning algorithm can be used to balance exploration and exploitation.
a9
In TD Learning, the balance between exploration and exploitation can be controlled by the agent’s action selection strategy. Common strategies include ε-greedy, where the agent selects the action with the highest Q-value most of the time (exploitation) but occasionally selects a random action (exploration) with probability ε. By using this strategy, the agent can explore new actions and learn from them while still exploiting its knowledge of the best actions.


q10
What are the advantages of using Temporal-Difference Learning over Monte Carlo methods?
a10
Temporal-Difference Learning has several advantages over Monte Carlo methods:

  1. TD learning updates the value function incrementally after each step, making it more efficient than Monte Carlo, which waits until the end of an episode to update.
  2. TD learning does not require the agent to wait for the final outcome, enabling online learning.
  3. TD learning can also be applied to continuing (non-terminating) tasks, where episodes never end or end after an unknown number of steps.

6. Applications and Case Studies (5 Hours)

  • Elevator Dispatching
  • Dynamic Channel Allocation
  • Job-Shop Scheduling

q1
What is Elevator Dispatching in the context of Reinforcement Learning?
a1
Elevator Dispatching is a classic application where an agent (the elevator control system) must efficiently dispatch elevators to serve incoming requests in a building. The goal is to minimize the waiting time for passengers while considering factors like distance, direction, and load. Reinforcement learning can be used to optimize the dispatching process by learning the best action (which elevator to send) based on the current state (the location of the elevators and requests).


q2
How can Reinforcement Learning be applied to Elevator Dispatching?
a2
Reinforcement Learning can be applied by modeling the elevator dispatching system as a Markov Decision Process (MDP), where the agent’s state includes the positions of all elevators and the list of pending requests. The agent’s actions involve selecting which elevator to dispatch for each request. The reward could be the inverse of the waiting time for passengers, and the objective is to learn the policy that minimizes the total waiting time by exploring different strategies (e.g., nearest elevator, load balancing, or even optimizing for peak traffic times).


q3
What is Dynamic Channel Allocation and how is it related to Reinforcement Learning?
a3
Dynamic Channel Allocation refers to the process of allocating communication channels to users in a network, such as in cellular networks or Wi-Fi systems. The goal is to efficiently allocate channels to users in real-time to minimize interference and maximize throughput. Reinforcement learning can be applied to learn policies for allocating channels based on the current network conditions, such as traffic load, interference levels, and user demands.


q4
How can Reinforcement Learning improve Dynamic Channel Allocation?
a4
Reinforcement Learning can improve Dynamic Channel Allocation by treating the allocation decision as an action that the agent takes in each time step. The state of the system could include factors such as the number of users, current traffic load, and interference levels. The agent learns to allocate channels in a way that maximizes throughput and minimizes interference through rewards (e.g., successful transmissions). By learning from past experiences, the system can adapt to changing network conditions and optimize the channel allocation dynamically.


q5
What is Job-Shop Scheduling and how can Reinforcement Learning be applied to it?
a5
Job-Shop Scheduling refers to the problem of scheduling a set of jobs on a set of machines such that each job goes through a series of operations, each requiring a specific machine. The objective is often to minimize the makespan (the total time required to complete all jobs) or to optimize machine utilization. Reinforcement learning can be applied to learn policies for scheduling the jobs dynamically by treating the scheduling decisions as actions and using the rewards to guide the learning process.


q6
How does Reinforcement Learning solve the Job-Shop Scheduling problem?
a6
In Job-Shop Scheduling, Reinforcement Learning can be used by modeling the scheduling process as an MDP. The state includes information about the current status of the machines, the jobs that are waiting for resources, and the time remaining for each job. The actions involve assigning jobs to machines. The reward is typically negative (penalty) for inefficient scheduling, such as idle machine time or late job completion. The agent learns the optimal scheduling policy that minimizes makespan or maximizes machine utilization by exploring different scheduling strategies.


q7
What are the challenges in applying Reinforcement Learning to Elevator Dispatching systems?
a7
Challenges in applying Reinforcement Learning to Elevator Dispatching include:

  1. Large State Space: The system may have a large state space, with many possible positions for elevators and requests, making it difficult to learn an optimal policy.
  2. Real-Time Constraints: The dispatching decisions need to be made quickly to avoid passenger delays, which can be challenging for RL algorithms that require many interactions with the environment.
  3. Exploration vs. Exploitation: The system must balance exploration (trying new dispatching strategies) with exploitation (using the current best strategy), which can be difficult in real-time settings.
  4. Multi-Agent Coordination: In buildings with many elevators, there may be a need to coordinate between multiple agents (elevators), which increases the complexity.

q8
How does Dynamic Channel Allocation benefit from Reinforcement Learning in real-world networks?
a8
Dynamic Channel Allocation benefits from Reinforcement Learning in real-world networks by enabling the system to adapt to changing network conditions in real-time. RL algorithms can learn optimal channel allocation policies that respond to varying traffic loads, interference levels, and the number of users. This dynamic adaptation leads to better utilization of the available spectrum, reduced interference, and improved user experience by ensuring that the channels are allocated efficiently according to current conditions.


q9
What are some advantages of using Reinforcement Learning for Job-Shop Scheduling over traditional methods?
a9
Advantages of using Reinforcement Learning for Job-Shop Scheduling over traditional methods include:

  1. Adaptability: RL can adapt to changing job characteristics, machine capabilities, and dynamic conditions without needing predefined rules.
  2. Optimal Decision Making: RL can discover policies that minimize makespan or optimize other objectives without explicitly programming the logic, allowing it to potentially outperform heuristic methods.
  3. Learning from Experience: RL improves with experience and can find solutions in complex or previously unseen scenarios, unlike rule-based methods that require manual adjustments.

q10
What performance metrics are typically used to evaluate Elevator Dispatching, Dynamic Channel Allocation, and Job-Shop Scheduling systems?
a10
Performance metrics typically used to evaluate these systems include:

  1. Elevator Dispatching:
    • Average passenger waiting time
    • Total number of passenger requests served
    • Energy consumption
    • Elevator utilization and efficiency
  2. Dynamic Channel Allocation:
    • Channel throughput
    • Network interference levels
    • User satisfaction (e.g., reduced call drops, faster data rates)
    • Channel utilization
  3. Job-Shop Scheduling:
    • Makespan (the total time required to complete all jobs)
    • Machine utilization
    • Job completion time and tardiness
    • Total penalty (for delayed jobs)

Disclaimer:
These viva questions are based on the syllabus for the MU AI & DS Semester 8 course on Reinforcement Learning. Please note that during the viva, you may also be asked questions regarding the practicals you have performed, as well as questions that require an in-depth understanding of concepts, which may not be covered in this set of viva questions. It is advised to be well-prepared for any additional questions or topics not directly included here.
