Improving MARL Robustness for Agent Blindness in VMAS

Quick Summary

A Multi-Agent Reinforcement Learning research project investigating how swarm agents handle partial observability through random “blindness” events. Built custom blindness scenarios in VMAS (Vectorized Multi-Agent Simulator) where agents lose sensory input unpredictably, testing whether cooperative strategies can emerge despite information loss. Applied MAPPO (Multi-Agent Proximal Policy Optimization) with extensive hyperparameter tuning to achieve robust performance even when agents randomly lose observations.

Tech Stack: Python, PyTorch, TorchRL, VMAS, MAPPO, Multi-Agent Deep RL
Status: ✅ Complete - Course Capstone Project
GitHub: View Source Code

The Challenge

The Real-World Problem: Autonomous systems must cooperate even when sensors fail unexpectedly.

Examples:

Autonomous vehicles: Camera systems malfunction mid-drive—must avoid collisions using limited data
Drone swarms: GPS signal loss—drones must maintain formation without position data
Robot teams: Sensor failures—robots must complete tasks despite missing observations
Satellite networks: Communication blackouts—satellites must coordinate blindly

The Research Question: Can multi-agent reinforcement learning systems learn to cooperate robustly when agents randomly lose sensory input (go “blind”)?

System Architecture

VMAS Framework Selection

What is VMAS?

Vectorized Multi-Agent Simulator
GPU-accelerated physics simulator for MARL research
Developed by Prorok Lab (University of Cambridge)
Supports continuous control tasks with swarm-like behavior

The Balance Scenario:

The base environment features agents cooperating to roll a red ball to a green goal by collectively pushing a platform the ball rests on.

Key Mechanics:

Physics: Realistic gravity, momentum, balance dynamics
Cooperation Required: One agent can’t succeed alone—must coordinate pushing
Swarm Behavior: Parameter sharing across all agents (homogeneous policy)
Full Observability (baseline): All agents see complete state

View Balance Scenario GIF

Custom Blindness Extension

Core Innovation: Random sensory deprivation during episodes

Six Blindness Scenarios Implemented:

1. BlindOneRandomAgentEveryStep

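import random
from torchrl.envs.transforms import Transform

class BlindOneRandomAgentEveryStep(Transform):
    def __init__(self, n_agents):
        super().__init__()
        self._n_agents = n_agents  # read by _step below

    def _step(self, tensordict, next_tensordict):
        # Zero the observation of one randomly chosen agent on every step
        next_tensordict[("agents", "observation")][
            ..., random.randrange(self._n_agents), :
        ] = 0
        return next_tensordict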

2. BlindAllAgentsEveryStep

All agents simultaneously blinded every step
Extreme difficulty baseline (near-impossible task)

3. BlindOneRandomAgentIfProbability

class BlindOneRandomAgentIfProbability(Transform):
    def __init__(self, n_agents, blind_prob=0.1):
        super().__init__()
        self._n_agents = n_agents
        self._blind_prob = blind_prob
    
    def _step(self, tensordict, next_tensordict):
        if random.random() < self._blind_prob:
            # With probability blind_prob (default 10%), one random agent is blind for this step
            next_tensordict[("agents", "observation")][
                ..., random.randrange(self._n_agents), :
            ] = 0
        return next_tensordict

4. BlindRandomAgentsIfProbability

Multiple agents can be blinded simultaneously
Each agent has independent probability per step
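
The independent per-agent sampling in this scenario can be sketched without any framework. This is a minimal illustration, not the project's Transform: `sample_blind_mask` and `apply_blindness` are hypothetical helper names, and observations are plain nested lists rather than TorchRL tensordicts.

```python
import random

def sample_blind_mask(n_agents, blind_prob, rng=random):
    """Return one bool per agent; each agent is blinded independently."""
    return [rng.random() < blind_prob for _ in range(n_agents)]

def apply_blindness(observations, mask):
    """Zero the observation vector of every agent whose mask entry is True."""
    return [
        [0.0] * len(obs) if blinded else list(obs)
        for obs, blinded in zip(observations, mask)
    ]

# Example: 3 agents with 2 features each; only agent 1 is blinded.
obs = [[0.5, -1.0], [2.0, 3.0], [1.5, 0.1]]
blinded_obs = apply_blindness(obs, [False, True, False])
```

Because each agent draws its own random number, zero, one, or several agents can be blind on the same step.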

5. BlindOneRandomAgentIfProbabilityForJSteps

class BlindOneRandomAgentIfProbabilityForJSteps(Transform):
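    def __init__(self, n_agents, blind_prob=0.1, max_blind_steps=10):
        super().__init__()
        self._n_agents = n_agents
        self._blind_prob = blind_prob
        self.max_blind_steps = max_blind_steps
        # Remaining blind steps per agent (0 = currently sighted)
        self.blind_remaining = {i: 0 for i in range(n_agents)}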
    
    def _step(self, tensordict, next_tensordict):
        # Each agent is checked independently, so several agents can be blind at once
        for agent_idx in range(self._n_agents):
            if self.blind_remaining[agent_idx] > 0:
                # Agent still blind, decrement counter
                self.blind_remaining[agent_idx] -= 1
                next_tensordict[("agents", "observation")][
                    ..., agent_idx, :
                ] = 0
            elif random.random() < self._blind_prob:
                # New blindness event lasting 1 to max_blind_steps steps (this step included)
                blind_duration = random.randint(1, self.max_blind_steps)
                self.blind_remaining[agent_idx] = blind_duration - 1
                next_tensordict[("agents", "observation")][
                    ..., agent_idx, :
                ] = 0
        return next_tensordict

6. BlindRandomAgentsIfProbabilityForJSteps

Multiple agents can experience multi-step blindness simultaneously
Most realistic scenario (models real sensor failures)
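
The duration bookkeeping behind multi-step blindness can be sketched in plain Python. This is an illustrative stand-in rather than the project's code: `BlindnessScheduler` is a hypothetical name, and it only produces a per-agent mask that a transform could then use to zero observations.

```python
import random

class BlindnessScheduler:
    """Tracks, per agent, how many more steps that agent stays blind."""

    def __init__(self, n_agents, blind_prob=0.1, max_blind_steps=10, seed=None):
        self.blind_prob = blind_prob
        self.max_blind_steps = max_blind_steps
        self.remaining = [0] * n_agents
        self.rng = random.Random(seed)

    def step(self):
        """Advance one environment step and return the per-agent blind mask."""
        mask = []
        for i, left in enumerate(self.remaining):
            if left > 0:
                # Ongoing event: consume one blind step.
                self.remaining[i] = left - 1
                mask.append(True)
            elif self.rng.random() < self.blind_prob:
                # New event lasting 1..max_blind_steps steps, this step included.
                duration = self.rng.randint(1, self.max_blind_steps)
                self.remaining[i] = duration - 1
                mask.append(True)
            else:
                mask.append(False)
        return mask
```

Keeping one counter per agent is what lets several blindness events overlap in time.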

Blindness Mechanism:

Observation vector set to zeros (agent receives no sensory input)
Actions still processed (agent can move/act, just blindly)
Other agents unaffected (only blinded agent loses vision)

Algorithm Highlights

1. MAPPO (Multi-Agent Proximal Policy Optimization)

Why MAPPO?

Strong results on cooperative MARL benchmarks (Yu et al., 2022)
Competitive with or better than off-policy methods such as MADDPG and QMIX on those benchmarks
Stable training with centralized critic, decentralized execution
Parameter sharing enables swarm-like behavior

Architecture:

Policy Network (Decentralized Actor):
├── MultiAgentMLP
│   ├── Input: Agent observation (n_obs_per_agent)
│   ├── Hidden: 2 layers × 256 units (Tanh activation)
│   ├── Output: 2 × n_actions (mean & std for continuous actions)
│   └── Share parameters: True (all agents use same policy)
├── NormalParamExtractor (split output into loc & scale)
└── TanhNormal distribution (bounded continuous actions)

Critic Network (Centralized Value Function):
├── MultiAgentMLP
│   ├── Input: All agent observations (centralized)
│   ├── Hidden: 2 layers × 256 units (Tanh activation)
│   ├── Output: 1 value per agent
│   └── Share parameters: True
└── State value estimation for PPO advantage

Key MAPPO Features:

Centralized Critic: Sees all agent observations during training
Decentralized Actor: Each agent acts from own observation only
Parameter Sharing: Single policy network shared across all agents
PPO Clipping: Prevents drastic policy updates (stability)
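
As a rough illustration of the clipping feature listed above, the standard PPO clipped surrogate for a single sample can be written in a few lines. This is a scalar sketch for intuition only; the project relies on TorchRL's ClipPPOLoss, and `clipped_surrogate` is a hypothetical helper name.

```python
def clipped_surrogate(ratio, advantage, clip_epsilon=0.2):
    """PPO clipped objective for one sample (a quantity to maximize).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage.
    """
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage the objective stops growing once the ratio exceeds 1 + clip_epsilon, which is what bounds the size of each policy update.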

2. Hyperparameter Configuration

Sampling Parameters:

frames_per_batch = 6_000  # Collected experiences per iteration
n_iters = 25              # Training iterations
total_frames = 150_000    # Total training frames
max_steps = 200           # Episode length
num_vmas_envs = 30        # Parallel environments

Training Parameters:

num_epochs = 20           # Optimization passes per iteration
minibatch_size = 200      # Batch size for gradient updates
lr = 3e-4                 # Learning rate (Adam optimizer)
max_grad_norm = 1.0       # Gradient clipping

PPO-Specific:

clip_epsilon = 0.2        # PPO clipping range
gamma = 0.9               # Discount factor
lmbda = 0.9               # GAE lambda
entropy_eps = 1e-4        # Entropy bonus coefficient
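
A quick arithmetic check shows how these values relate to each other. The sketch assumes every environment runs a full max_steps episode per batch, which may not hold if episodes end early; the variable names on the left of the last four lines are illustrative, not taken from the project.

```python
# Values copied from the configuration above
frames_per_batch = 6_000
n_iters = 25
max_steps = 200
gamma = 0.9

total_frames = frames_per_batch * n_iters               # 150_000
envs_for_full_episodes = frames_per_batch // max_steps  # 30
episodes_per_run = envs_for_full_episodes * n_iters     # 750 if every episode runs to max_steps
effective_horizon = 1 / (1 - gamma)                     # about 10 steps
```

With gamma = 0.9 the discounting horizon is roughly 10 steps, short compared with the 200-step episodes.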

3. Normalization Strategy

Problem: Agents experience wildly different scenarios

Blinded agent: Acts on an all-zero observation for the duration of the event
Non-blinded agents: Continue acting on full observations
Variance in returns destabilizes training

Solution: Advantage normalization across agents

loss_module = ClipPPOLoss(
    actor_network=policy,
    critic_network=critic,
    normalize_advantage=True  # Critical for blindness scenarios
)

Impact:

Prevents blinded agents from dominating gradient updates
Stabilizes learning despite observation variance
Essential for scenarios with random blindness events
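
For intuition, advantage normalization amounts to standardizing the advantages in a batch. The sketch below uses plain Python; the choice of population standard deviation and the eps value are assumptions, and TorchRL's internal implementation may differ in those details.

```python
import statistics

def normalize_advantages(advantages, eps=1e-8):
    """Shift advantages to zero mean and scale them to roughly unit variance."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]

# Example: a batch where one sample has a much larger advantage than the rest.
normalized = normalize_advantages([0.1, -0.2, 0.05, 8.0])
```

After normalization the scale of the policy gradient no longer depends on the raw reward magnitude of a given batch.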

4. GAE (Generalized Advantage Estimation)

Purpose: Estimate how much better an action was than expected

loss_module.make_value_estimator(
    ValueEstimators.GAE, 
    gamma=0.9,   # Future reward discount
    lmbda=0.9    # Bias-variance tradeoff
)

Why GAE for Blindness:

Smooths advantage estimates across uncertain states
Reduces variance when agents randomly lose observations
Helps credit assignment: “Was my action good despite blindness?”
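
The GAE recursion itself is short. The following single-trajectory sketch is for illustration only (the project uses TorchRL's built-in estimator); `compute_gae` and its argument layout are assumptions made for this example.

```python
def compute_gae(rewards, values, next_value, dones, gamma=0.9, lmbda=0.9):
    """Return GAE advantages for one trajectory, computed backwards in time.

    values[t] is V(s_t); next_value is V(s_T) for the state after the last step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        # One-step TD error
        delta = rewards[t] + gamma * v_next * not_done - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lmbda * not_done * gae
        advantages[t] = gae
    return advantages
```

Setting lmbda closer to 0 leans on the critic (lower variance, more bias), while values closer to 1 lean on observed returns.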

5. Training Loop Architecture

def train_environment_variables(env, description, norm=True, 
                                clipVal=0.2, batchSize=200, 
                                numEpochs=20):
    # 1. Setup networks (policy + critic)
    policy = ProbabilisticActor(...)
    critic = TensorDictModule(...)
    
    # 2. Create data collector
    collector = SyncDataCollector(env, policy, frames_per_batch)
    
    # 3. Setup replay buffer
    replay_buffer = ReplayBuffer(storage=LazyTensorStorage(...))
    
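    # 4. Create PPO loss module, GAE estimator, and optimizer
    loss_module = ClipPPOLoss(...)
    loss_module.make_value_estimator(ValueEstimators.GAE, gamma=gamma, lmbda=lmbda)
    GAE = loss_module.value_estimator
    optim = torch.optim.Adam(loss_module.parameters(), lr)
    episode_reward_mean_list = []
    
    # 5. Training loop
    for tensordict_data in collector:
        # Compute GAE advantages
        with torch.no_grad():
            GAE(tensordict_data, ...)
        
        # Store experiences
        replay_buffer.extend(tensordict_data.reshape(-1))
        
        # Multiple optimization epochs
        for _ in range(numEpochs):
            for _ in range(frames_per_batch // batchSize):
                # Sample minibatch
                subdata = replay_buffer.sample()
                
                # Compute losses
                loss_vals = loss_module(subdata)
                loss = (loss_vals["loss_objective"] + 
                        loss_vals["loss_critic"] + 
                        loss_vals["loss_entropy"])
                
                # Backprop and update
                loss.backward()
                torch.nn.utils.clip_grad_norm_(loss_module.parameters(), max_grad_norm)
                optim.step()
                optim.zero_grad()
        
        # Update collector policy
        collector.update_policy_weights_()
        
        # Log mean reward of episodes that finished in this batch
        # (assumes "done" was expanded to per-agent shape under ("next", "agents", "done"))
        done = tensordict_data[("next", "agents", "done")]
        episode_reward_mean = tensordict_data[
            ("next", "agents", "episode_reward")
        ][done].mean().item()
        episode_reward_mean_list.append(episode_reward_mean)
    
    return episode_reward_mean_list, policy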

Key Technical Achievements

✅ 6 Custom Blindness Scenarios - Probabilistic, duration-based, single/multi-agent
✅ MAPPO Implementation - Centralized critic with decentralized execution
✅ Advantage Normalization - Critical for handling observation variance
✅ Extensive Hyperparameter Study - 50+ experiments across 7 dimensions
✅ Vectorized Simulation - 30 parallel environments for efficient data collection
✅ Parameter Sharing - Single policy shared across all agents (swarm behavior)
✅ GAE Value Estimation - Smooth advantage calculation for uncertain states
✅ Automated Experimentation - Configurable training function for rapid iteration

Experimental Results

Experiment 1: Baseline Comparison Across Blindness Scenarios

Setup: 7 environments, 25 iterations, no normalization, clip=0.2

Results:

Environment	Final Reward	Difficulty
Normal (no blindness)	~450	Baseline
Blind 1 agent randomly (1 step)	~380	Easy
Blind 1 agent every step	~320	Medium
Blind 1 agent random duration	~280	Hard
Blind random agents randomly	~240	Very Hard
Blind random agents random duration	~180	Extreme
Blind all agents every step	~50	Nearly Impossible

Key Finding: Agents can learn despite blindness, but performance degrades with:

Increased blindness frequency
Longer blindness duration
More agents affected simultaneously

Surprising Result: Random 1-step blindness performs reasonably well (380/450 = 84% of baseline) → suggests robust cooperative strategies

Experiment 2: Normalization Impact

Setup: All 7 scenarios × 2 conditions (normalized vs. non-normalized)

Results:

Normalized Advantage (True):

Blind 1 agent randomly (1 step): ~420 (↑ 40 vs. the non-normalized run in Experiment 1)
Blind 1 agent random duration: ~340 (↑ 60 vs. the non-normalized run in Experiment 1)
Blind random agents random duration: ~220 (↑ 40 vs. the non-normalized run in Experiment 1)

Non-Normalized Advantage (False):

Much higher variance in training curves
Lower final performance across all scenarios
More unstable learning (sudden drops)

Conclusion: Normalization is essential for blindness scenarios

Deals with variance from agents experiencing different observation states
Prevents blinded agents from dominating gradient updates with large errors
Stabilizes training even in extreme conditions

All future experiments use normalization enabled.

Experiment 3: PPO Clipping Value

Setup: Clip values = [0.01, 0.1, 0.2, 0.3, 0.5, 0.75]
Environment: Blind 1 random agent for random duration (hardest single-agent scenario)

Results:

Clip Value	Final Reward	Observations
0.01	~180	Too restrictive, can’t learn
0.1	~220	Slow learning
0.2	~280	Balanced (default)
0.3	~360	Best performance
0.5	~310	Unstable, overshoots
0.75	~290	Too aggressive updates

Key Insight: clip=0.3 provides optimal balance

Not too conservative (can learn important features)
Not too aggressive (maintains stability)
About +29% improvement over default 0.2 (~360 vs. ~280)

Why It Works:

Allows larger policy updates when agents discover how to compensate for blindness
Still prevents catastrophic updates from outlier experiences
Sweet spot for this level of environment stochasticity

Experiment 4: Minibatch Size

Setup: Batch sizes = [10, 50, 100, 200, 300, 500, 1000]

Results:

Batch Size	Final Reward	Training Speed
10	~340	Fast epochs, noisy
50	~380	Good balance
100	~400	Optimal
200	~360	Default (good)
300	~310	Sample reuse issues
500	~280	Overfitting to batches
1000	~260	Severe overfitting

Key Finding: Smaller batches (50-100) outperform default (200)

Explanation:

Too large: Samples reused too often within epoch → overfitting
Too small: High variance, unstable gradients
Sweet spot (50-100): Fresh samples, stable gradients

Trade-off:

Batch=100: Slightly slower per iteration but better final performance
Batch=200: Faster iteration but lower asymptotic reward
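
One factor worth keeping in mind when reading this sweep: in the training loop, each epoch runs frames_per_batch // batchSize minibatch updates, so changing the batch size also changes how many gradient steps are taken. The sketch below only does that arithmetic, assuming the default 20 epochs and an unchanged learning rate across the sweep; `updates_per_iteration` is a hypothetical helper.

```python
frames_per_batch = 6_000
num_epochs = 20  # default used in this sweep (assumption)

def updates_per_iteration(minibatch_size):
    """Gradient updates performed on one collected batch."""
    return num_epochs * (frames_per_batch // minibatch_size)

sweep = {b: updates_per_iteration(b) for b in [10, 50, 100, 200, 300, 500, 1000]}
```

Batch size 10 yields 12,000 updates per iteration and batch size 1000 only 120, so part of the difference between runs may come from the number of optimizer steps rather than from the batch size alone.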

Experiment 5: Number of Epochs

Setup: Epochs = [5, 10, 15, 20, 30, 50]

Results:

Epochs	Final Reward	Observations
5	~240	Underutilizes data
10	~300	Improvement
15	~340	Good
20	~360	Default (good)
30	~420	Optimal
50	~425	Marginal gain (+5)

Key Finding: Returns diminish beyond 30 epochs

Explanation:

More epochs = more optimization steps per batch
Complex blindness scenarios benefit from thorough learning
30 vs. 50: Only +5 reward but +67% training time

Recommendation: Use 30 epochs for best time/performance tradeoff

Experiment 6: Blindness Probability Sensitivity

Setup: Probability = [0.01, 0.05, 0.10, 0.20, 0.30, 0.50]
Environment: Blind 1 random agent for random duration

Results (Normalized by Blindness Probability):

Reward_normalized = Reward × P(blind) × E[blind_steps]

P(blind)	Raw Reward	Normalized Reward	Relative Performance
0.01	~440	~22	100% (baseline)
0.05	~420	~105	95%
0.10	~380	~190	86%
0.20	~320	~320	73%
0.30	~260	~390	59%
0.50	~180	~450	41%
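
The table values can be reproduced from the formula above. The numbers match when E[blind_steps] is taken as 5 (half of max_blind_steps = 10); that value is inferred from the table rather than stated in the source, and a uniform draw from 1 to 10 would strictly have a mean of 5.5. Relative performance is the raw reward divided by the raw reward at P(blind) = 0.01.

```python
EXPECTED_BLIND_STEPS = 5  # assumption inferred from the table (max_blind_steps / 2)

# Raw rewards copied from the table above
raw_rewards = {0.01: 440, 0.05: 420, 0.10: 380, 0.20: 320, 0.30: 260, 0.50: 180}

def normalized_reward(reward, p_blind, expected_steps=EXPECTED_BLIND_STEPS):
    """Reward x P(blind) x E[blind_steps], as defined above."""
    return reward * p_blind * expected_steps

baseline = raw_rewards[0.01]
table = {
    p: (normalized_reward(r, p), round(100 * r / baseline))
    for p, r in raw_rewards.items()
}
```

Each entry of `table` holds the normalized reward and the relative performance in percent for one blindness probability.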

Key Insight: Agents maintain proportional performance until P(blind) > 0.2

Up to 20% blindness probability: Graceful degradation
Beyond 30%: Sharp performance cliff
Suggests learned strategies have robustness threshold

Interpretation:

Agents learn to anticipate and compensate for occasional blindness
When blindness becomes dominant (>30%), cooperation breaks down
Similar to human teams: tolerate occasional member unavailability, fail under chronic absences

Experiment 7: Maximum Blindness Duration

Setup: Max blind steps = [1, 2, 3, 5, 7, 10, 20]
Environment: P(blind) = 0.1 fixed

Results (Normalized):

Max Steps	Raw Reward	Normalized Reward	Performance
1	~420	~21	100%
2	~400	~40	95%
3	~380	~57	90%
5	~350	~87.5	83%
7	~320	~112	76%
10	~280	~140	67%
20	~180	~180	43%

Key Finding: Performance degrades linearly with duration up to ~10 steps, then crashes

Explanation:

Short blindness (1-5 steps): Agents maintain momentum, other agents compensate
Medium blindness (7-10 steps): Coordination suffers, recovery possible
Long blindness (20 steps): 10% of episode—too disruptive for recovery

Real-World Parallel:

Brief sensor glitch (1-5s): Vehicle can coast safely
Extended failure (10-20s): Requires pulling over or emergency protocols

Experiment 8: Number of Agents

Setup: n_agents = [2, 3, 4, 5, 7, 10]
Environment: Blind 1 random agent randomly (10% prob, 1-10 steps)

Results:

Agents	Final Reward	Observations
2	~280	Hard (50% capacity loss when 1 blinded)
3	~380	Baseline (33% loss)
4	~420	Better (25% loss)
5	~460	Good (20% loss)
7	~510	Optimal
10	~490	Coordination overhead

Key Insight: More agents improve robustness—to a point

Explanation:

2-3 agents: One blind = significant capacity loss
4-7 agents: Redundancy allows compensation
10 agents: Coordination complexity outweighs redundancy benefits

Sweet Spot: 5-7 agents

Enough redundancy to handle blindness
Not so many that coordination becomes bottleneck

Experiment 9: Best Combined Hyperparameters

Setup: Combine all improvements from previous experiments

Configurations Tested:

Config	Norm	Clip	Batch	Epochs	Final Reward
Original	False	0.2	200	20	~280
Best	True	0.3	100	30	~520
Best - Clip	True	0.2	100	30	~480
Best - Batch	True	0.3	200	30	~490
Best - Epochs	True	0.3	100	20	~470

Key Finding: Combined improvements yield +86% performance over baseline
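
The headline percentage and the cost of reverting each individual setting follow directly from the table. This sketch only re-does that arithmetic on the approximate rewards listed above; `pct_change` is a hypothetical helper.

```python
# Approximate final rewards copied from the configuration table
results = {
    "Original": 280,
    "Best": 520,
    "Best - Clip": 480,
    "Best - Batch": 490,
    "Best - Epochs": 470,
}

def pct_change(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

gain_over_original = pct_change(results["Best"], results["Original"])
ablation_drop = {
    name: results["Best"] - reward
    for name, reward in results.items()
    if name.startswith("Best - ")
}
```

Here `gain_over_original` is about 85.7, which rounds to the quoted 86%, and reverting the epoch count costs the most of the three single-setting ablations (about 50 reward).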

Optimal Hyperparameters:

normalize_advantage = True
clip_epsilon = 0.3
minibatch_size = 100
num_epochs = 30

Video Evidence: Trained agents successfully balance platform and move ball to goal despite random multi-step blindness events

Challenges & Solutions

Challenge: VMAS Documentation Nightmare

Problem: “VMAS is one of the worst to work in”

Specific Issues:

Sparse, incomplete documentation
Conflicting library syntaxes (BenchMARL, TorchRL, VMAS)
Examples use different APIs (can’t copy-paste)
Hours spent restructuring code to match VMAS syntax

Example Frustration:

# BenchMARL syntax (doesn't work in TorchRL)
env = make_vmas_env(scenario="balance", ...)

# TorchRL syntax (required)
env = VmasEnv(scenario="balance", ...)
env = TransformedEnv(env, RewardSum(...))

# Different observation keys between libraries!

Impact: 60%+ of project time spent on framework compatibility vs. actual research

Solution:

Found working TorchRL MAPPO tutorial
Built everything from that single example
Avoided BenchMARL/MADDPG code (incompatible)

Lesson Learned: Framework maturity matters—prioritize well-documented tools

Challenge: Failed MADDPG/QMIX Implementation

Problem: Wanted to compare MAPPO to other algorithms

Attempted:

QMIX: Couldn’t translate from different library syntax
MADDPG: Ran without errors but no learning occurred

MADDPG Symptoms:

Rewards oscillate wildly
No convergence after 25 iterations
Likely implementation bug in state/action handling

Decision: Abandon other algorithms, focus on MAPPO mastery

Trade-off:

Lost algorithmic comparison
Gained depth in single algorithm with extensive hyperparameter study

Future Work: Use BenchMARL directly (if compatible with blindness transforms)

Challenge: Partner Abandonment

Problem: Group project became solo project mid-semester

Timeline:

Week 1-3: Collaborated on proposal
Week 4-8: Minimal communication
Week 9+: Complete radio silence

Impact:

Reduced scope (no multi-algorithm comparison)
Increased workload
Stress of solo research project

Coping Strategy:

Focused on depth over breadth
Hyperparameter study replaces multi-algorithm study
Quality over quantity

Silver Lining: Full ownership of codebase, complete understanding of implementation

Challenge: Training Time Constraints

Problem: 25 iterations × 6,000 frames per batch (collected across the 30 parallel envs) = 150,000 frames per run, which is slow to train on

Training Times:

Single run: ~15-20 minutes
Full experiment (7 scenarios): ~2 hours
Hyperparameter sweep (6 values): ~12 hours

Solution:

Ran experiments overnight
Prioritized most impactful hyperparameters
Used GPU acceleration (when available)

Limitation: Couldn’t test extreme scales (50+ iterations, 100+ agents)

Code Architecture

Main Components

Morgans4900Project.py (1,500+ lines)
├── Imports & Setup
│   ├── PyTorch, TorchRL, VMAS
│   ├── Device configuration (GPU/CPU)
│   └── Random seed (reproducibility)
│
├── Hyperparameters
│   ├── Sampling (frames, iterations, envs)
│   ├── Training (epochs, batch size, learning rate)
│   ├── PPO (clip, gamma, lambda, entropy)
│   └── Environment (max steps, agents)
│
├── Environment Creation
│   ├── VmasEnv (base Balance scenario)
│   ├── TransformedEnv (add RewardSum)
│   └── 6 Blindness Variants (env2-env7)
│
├── Custom Transforms (6 classes)
│   ├── BlindOneRandomAgentEveryStep
│   ├── BlindAllAgentsEveryStep
│   ├── BlindOneRandomAgentIfProbability
│   ├── BlindRandomAgentsIfProbability
│   ├── BlindOneRandomAgentIfProbabilityForJSteps
│   └── BlindRandomAgentsIfProbabilityForJSteps
│
├── Training Function
│   ├── train_environment_variables()
│   │   ├── Network setup (policy + critic)
│   │   ├── Collector & replay buffer
│   │   ├── PPO loss module
│   │   ├── Training loop
│   │   └── Return rewards & policy
│
└── Experiments (10 total)
    ├── Exp 1: Baseline comparison (7 scenarios)
    ├── Exp 2: Normalization (True/False)
    ├── Exp 3: Clipping (6 values)
    ├── Exp 4: Batch size (7 values)
    ├── Exp 5: Epochs (6 values)
    ├── Exp 6: Blindness probability (6 values)
    ├── Exp 7: Max blind steps (7 values)
    ├── Exp 8: Number of agents (6 values)
    ├── Exp 9: Other scenarios (5 scenarios)
    └── Exp 10: Best combined parameters

Design Patterns Used

1. Transform Pattern (VMAS/TorchRL):

class BlindTransform(Transform):
    def _step(self, tensordict, next_tensordict):
        # Modify observations mid-rollout
        next_tensordict[("agents", "observation")][...] = 0
        return next_tensordict

2. Functional Training:

def train_environment_variables(env, description, **kwargs):
    # Configurable training function
    # Returns: rewards_list, trained_policy

3. Experiment Loop:

for env, desc in environments:
    for hyperparam in hyperparams:
        rewards, policy = train_environment_variables(...)
        results.append({...})
# Batch plotting after all experiments
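
A runnable version of this loop pattern might look like the sketch below. It is an illustration under assumptions: `run_grid` is a hypothetical helper, the training function is passed in as `train_fn`, and `fake_train` is a placeholder that performs no training and returns no real results.

```python
from itertools import product

def run_grid(train_fn, environments, grid):
    """Call train_fn once per (environment, hyperparameter combination)."""
    results = []
    names = list(grid)
    for (env, desc), values in product(environments, product(*grid.values())):
        params = dict(zip(names, values))
        rewards, policy = train_fn(env, desc, **params)
        results.append({"description": desc, "params": params, "rewards": rewards})
    return results

def fake_train(env, description, **kwargs):
    """Placeholder with the same return shape as the training function."""
    return [], None

runs = run_grid(
    fake_train,
    environments=[("env_a", "A"), ("env_b", "B")],
    grid={"clipVal": [0.2, 0.3], "numEpochs": [20, 30]},
)
```

Passing the training function as an argument keeps the sweep logic testable without launching a simulator.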

4. Automated Visualization:

Every experiment generates matplotlib plot
Rollout visualization with trained policies
Consistent color coding across experiments

What I Learned

This project taught me:

Multi-Agent Reinforcement Learning

MAPPO algorithm and centralized training, decentralized execution
Parameter sharing for swarm-like homogeneous policies
Advantage normalization critical for variance management
GAE for smooth advantage estimation in stochastic environments

Hyperparameter Sensitivity

Clipping value has non-monotonic relationship with performance
Batch size sweet spot (too large = overfitting, too small = noise)
Epochs: diminishing returns after threshold
Normalization: not optional for partial observability

VMAS Framework (Despite Frustration)

Vectorized simulation for parallel data collection
Custom transforms for environment modifications
TorchRL integration (after much struggle)
GPU acceleration for faster training

Research Methodology

Ablation studies (change one variable at a time)
Baseline comparisons (always test default)
Normalized metrics (account for scenario difficulty)
Visual evidence (rollout videos validate learning)

Robustness & Real-World AI

Partial observability is realistic and challenging
Systems need redundancy (more agents help)
Performance gracefully degrades until threshold
Generalization doesn’t come free (scenario-specific policies)

Soft Skills

Working solo on group project
Adapting scope when constraints appear
Prioritizing experiments with limited time
Honest acknowledgment of limitations

Future Improvements

If I were to extend this research, I would:

Fix Other Scenarios - Debug blindness implementation for Wheel, Give Way, etc.
Implement MADDPG/QMIX Properly - Use BenchMARL for fair algorithm comparison
Communication Protocols - Allow agents to signal “I’m blind, help!”
Partial Blindness - Noisy observations instead of complete zeros
Dynamic Blindness Models - Learn to predict when blindness will occur
Multi-Task Training - Train on multiple scenarios simultaneously
Longer Training - 100+ iterations to see if performance plateaus
Attention Mechanisms - Let agents attend to non-blind teammates
Curriculum Learning - Gradually increase blindness difficulty
Real Robot Deployment - Test on actual drone/robot swarms
Adversarial Blindness - Opponent strategically blinds agents
Meta-Learning - Train to adapt quickly to new blindness patterns

Research Impact & Applications

Key Findings Summary

1. Robustness is Learnable:

Agents can cooperate despite 10-20% blindness probability
Performance degrades gracefully until ~30% threshold
Suggests potential for real autonomous systems

2. Hyperparameters Matter More Than Expected:

Optimal config: +86% performance over default
Normalization absolutely essential (not optional)
Clipping sweet spot (0.3) specific to stochasticity level

3. Redundancy Helps (Up to a Point):

5-7 agents optimal for Balance scenario
Too few = no redundancy, too many = coordination overhead
Real-world implication: design for sweet spot

4. Generalization Fails:

Balance-trained policies don’t transfer to other tasks
Suggests need for multi-task training or meta-learning

Real-World Applications

1. Autonomous Vehicle Fleets:

Handle camera/sensor failures gracefully
Maintain collision avoidance despite partial blindness
Coordinate lane changes when one car loses GPS

2. Drone Swarms:

Continue formation flight during communication blackouts
Compensate for drones with malfunctioning sensors
Search-and-rescue with unreliable equipment

3. Satellite Networks:

Maintain orbit coordination during solar flares (sensor noise)
Handle antenna failures in constellation
Redundant communication pathways

4. Industrial Robot Teams:

Factory robots handle vision system glitches
Warehouse robots coordinate despite localization errors
Construction robots compensate for failed team members

5. Underwater/Space Exploration:

Robots in harsh environments with frequent sensor failures
Redundant sensing across team members
Mission completion despite partial information loss

Files & Resources

Project Files:

Morgans4900Project.py - Complete implementation (1,500+ lines)
COMP 4900 Project doc.pdf - Presentation slides and documentation
Training logs and plots (generated during experiments)

Framework Requirements:

Python 3.8+
PyTorch 1.13+
TorchRL 0.2.0
VMAS (Vectorized Multi-Agent Simulator)
TensorDict (data structure for RL)
Matplotlib (visualization)

Installation:

# Install PyTorch (with CUDA if available)
pip install torch torchvision torchaudio

# Install TorchRL and dependencies
pip install torchrl
pip install tensordict

# Install VMAS
pip install vmas

# Additional dependencies
pip install matplotlib tqdm

Running Experiments:

# Single training run
rewards, policy = train_environment_variables(
    env=env6,  # Blind 1 random agent for random steps
    description="Blindness Test",
    norm=True,
    clipVal=0.3,
    batchSize=100,
    numEpochs=30
)

# Visualize trained policy on the environment it was trained in
with torch.no_grad():
    env6.rollout(
        max_steps=200,
        policy=policy,
        callback=lambda env, _: env.render()
    )

Key Hyperparameters (Optimal):

# Sampling
frames_per_batch = 6_000
n_iters = 25
max_steps = 200
num_vmas_envs = 30

# Training
num_epochs = 30          # ← Increased from 20
minibatch_size = 100     # ← Decreased from 200
lr = 3e-4

# PPO
clip_epsilon = 0.3       # ← Increased from 0.2
normalize_advantage = True  # ← Critical!
gamma = 0.9
lmbda = 0.9

References & Prior Work

Key Papers:

Bettini et al. (2022) - “VMAS: A Vectorized Multi-Agent Simulator for Collective Robot Learning”
- Framework foundation
- Balance scenario design
Yu et al. (2022) - “The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games”
- Justification for MAPPO baseline
- Benchmark performance comparisons
Lowe et al. (2017) - “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments”
- MADDPG algorithm (attempted implementation)
- Centralized training framework
Lin et al. (2020) - “On the Robustness of Cooperative Multi-Agent Reinforcement Learning”
- Inspiration for robustness investigation
- Adversarial MARL concepts

Full citations in project documentation.

Takeaway

This project demonstrates that Multi-Agent Reinforcement Learning systems can be trained to handle realistic sensor failures through careful environment design and hyperparameter optimization. While agents don’t achieve perfect robustness, they learn cooperative strategies that gracefully degrade under partial observability—exhibiting resilience patterns similar to human teams.

The research highlights critical gaps in MARL tooling (VMAS documentation, library compatibility) that consume researcher time, underscoring the need for better infrastructure in the MARL community. Despite these challenges and the constraint of working solo on a group project, the systematic hyperparameter study provides actionable insights for building robust multi-agent systems.

Most importantly, this work validates that robustness can be learned, not just engineered—agents discover emergent compensation strategies when teammates go blind, suggesting promising avenues for deploying cooperative AI in real-world scenarios where sensor reliability cannot be guaranteed.

The 86% performance improvement through hyperparameter optimization (normalization + optimal clip/batch/epochs) suggests that careful tuning is not just beneficial but essential for MARL in stochastic environments with partial observability. This finding has direct implications for practitioners building real autonomous systems.