Hadoop Big Data Tools and Ecosystem - YARN Resource Management

A Deep Dive into YARN Architecture and Multi-Tenant Resource Management

1. YARN Core Components and Workflow

1.1 YARN Architecture Overview

Client
  ↓
ResourceManager (RM)
  ├── Scheduler (resource scheduling)
  └── ApplicationsManager (application management)
  ↓
NodeManager (NM) (one per node)
  ├── Container
  └── ApplicationMaster (AM, itself runs inside a container)

1.2 YARN Core Components in Detail

ResourceManager (RM)

  • Global resource manager: the final arbiter for all cluster resources
  • Key sub-components:
  • Scheduler: a pure scheduler that does not track application state
  • ApplicationsManager: accepts job submissions and launches each ApplicationMaster

NodeManager (NM)

  • Per-node agent: manages the resources and tasks on a single node
  • Responsibilities:
  • Launching Containers to run tasks
  • Monitoring resource usage (CPU, memory)
  • Reporting node status to the RM

ApplicationMaster (AM)

  • Application-level manager: one AM per application
  • Responsibilities:
  • Negotiating resources with the RM
  • Working with NMs to launch and monitor tasks
  • Handling fault tolerance

Container

  • Unit of resource encapsulation: bundles CPU, memory, and other resources
  • Task execution environment: runs MapTasks, ReduceTasks, and so on (see the commands below for inspecting these components on a live cluster)
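To see these components from the operator's side, you can list the NodeManagers registered with the RM and inspect the resources a single node offers for containers. A minimal sketch; the node ID hadoop-slave1:45454 is an assumption, substitute a node ID exactly as reported by your own cluster:

# List all NodeManagers and their states (RUNNING, LOST, ...)
yarn node -list -all

# Show one node in detail: total/used memory and vcores, running containers.
# The node ID is host:port exactly as printed by -list.
yarn node -status hadoop-slave1:45454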

2. YARN Workflow in Depth

2.1 Application Submission and Execution Flow

// Pseudocode walkthrough of a YARN application submission
public class YARNWorkflow {

    /**
     * 1. The client submits the application
     */
    public void submitApplication() {
        // Create the application submission context
        ApplicationSubmissionContext appContext = 
            Records.newRecord(ApplicationSubmissionContext.class);

        // Describe the ApplicationMaster container
        ContainerLaunchContext amContainer = 
            Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("java ApplicationMaster"));

        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1)); // 1 GB memory, 1 vcore

        // Submit to the ResourceManager (yarnClient is a started YarnClient)
        yarnClient.submitApplication(appContext);
    }

    /**
     * 2. ResourceManager processing
     */
    public void rmProcess() {
        // On receiving the submission, the RM:
        // - assigns an ApplicationAttemptId
        // - picks a suitable node for the ApplicationMaster
        // - asks that NodeManager to launch the AM container
    }

    /**
     * 3. ApplicationMaster execution
     */
    public void amProcess() {
        // The AM registers with the RM
        // The AM requests resources from the RM as needed
        // The AM works with NMs to launch task containers
        // The AM monitors task execution status
    }
}

2.2 Detailed Workflow Steps

1. Client → RM: submit the application
2. RM → NM: allocate the AM container
3. NM: launch the AM
4. AM → RM: register the AM
5. AM → RM: request resources
6. RM → AM: allocate resources
7. AM → NM: launch task containers
8. NM: run the tasks
9. AM → RM: report status
10. AM → RM: application complete

These steps can be observed end to end on a live cluster, as sketched below.
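A minimal way to watch the lifecycle: submit any example job in the background and poll its state as it moves from ACCEPTED through RUNNING to FINISHED. A sketch, assuming the MapReduce examples jar shipped with your distribution:

# Submit a job in the background (pi example: 2 maps, 100 samples each)
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100 &

# Give the RM a moment to accept it, then grab the newest application ID
sleep 10
APP_ID=$(yarn application -list -appStates ALL 2>/dev/null | awk 'NR>2 {print $1}' | sort | tail -1)

# Poll the state (rerun to see ACCEPTED → RUNNING → FINISHED)
yarn application -status "$APP_ID"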

3. Schedulers in Detail

3.1 FIFO Scheduler (First In, First Out)

<!-- yarn-site.xml -->
<configuration>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
    </property>
</configuration>

Characteristics

  • Simple first-come, first-served ordering
  • Not suitable for multi-user environments
  • Small jobs can be blocked behind large ones (easy to reproduce, see the sketch below)
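The blocking behavior can be demonstrated with the sleep test program from the MapReduce jobclient tests jar: submit a large sleep job, then a small one, and watch the small one sit in ACCEPTED until the large one releases resources. A sketch; the tests jar path varies by distribution:

# Large job: 50 map tasks, each sleeping for 60 s - fills the cluster under FIFO
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    sleep -m 50 -r 0 -mt 60000 &

# Small job submitted right after - under FIFO it waits behind the big one
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    sleep -m 1 -r 0 -mt 1000 &

# The small job stays in ACCEPTED until the large one finishes
yarn application -list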

3.2 Capacity Scheduler - the most widely used in enterprises

Example queue configuration

<?xml version="1.0"?>
<!-- capacity-scheduler.xml -->
<configuration>
    <!-- Queue hierarchy -->
    <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>dev,prod,research</value>
    </property>

    <!-- Dev queue -->
    <property>
        <name>yarn.scheduler.capacity.root.dev.capacity</name>
        <value>40</value>  <!-- 40% of cluster resources -->
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
        <value>60</value>  <!-- may elastically grow to 60% -->
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.dev.user-limit-factor</name>
        <value>2</value>   <!-- one user may use up to 2x the queue's capacity -->
    </property>

    <!-- Prod queue -->
    <property>
        <name>yarn.scheduler.capacity.root.prod.capacity</name>
        <value>40</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.prod.maximum-capacity</name>
        <value>80</value>
    </property>

    <!-- Research queue -->
    <property>
        <name>yarn.scheduler.capacity.root.research.capacity</name>
        <value>20</value>
    </property>

    <!-- ACL-based access control.
         ACL values take the form "user1,user2 group1,group2";
         a leading space means "no users, groups only". -->
    <property>
        <name>yarn.scheduler.capacity.root.dev.acl_submit_applications</name>
        <value> dev_group</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.prod.acl_submit_applications</name>
        <value>admin prod_group</value>  <!-- user admin plus group prod_group -->
    </property>
</configuration>
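After editing capacity-scheduler.xml, the changes can be applied and verified without restarting the RM. A sketch using the scheduler REST endpoint; the hostname and jq filter are assumptions:

# Apply queue changes without restarting the ResourceManager
yarn rmadmin -refreshQueues

# Verify the effective per-queue capacities
curl -s "http://hadoop-master:8088/ws/v1/cluster/scheduler" \
    | jq '.scheduler.schedulerInfo.queues.queue[] | {queueName, capacity, maxCapacity}'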

3.3 Fair Scheduler

Fair Scheduler configuration

<?xml version="1.0"?>
<!-- fair-scheduler.xml -->
<allocations>
    <!-- Default queue -->
    <queue name="default">
        <minResources>4096 mb,4 vcores</minResources>
        <maxResources>32768 mb,16 vcores</maxResources>
        <maxRunningApps>50</maxRunningApps>
        <weight>1.0</weight>
        <schedulingMode>fair</schedulingMode>
    </queue>

    <!-- Dev queue -->
    <queue name="dev">
        <minResources>8192 mb,8 vcores</minResources>
        <maxResources>65536 mb,32 vcores</maxResources>
        <maxRunningApps>100</maxRunningApps>
        <weight>2.0</weight>
        <schedulingMode>fair</schedulingMode>

        <!-- Child queues -->
        <queue name="dev_bi">
            <minResources>4096 mb,4 vcores</minResources>
            <maxResources>16384 mb,8 vcores</maxResources>
        </queue>
        <queue name="dev_etl">
            <minResources>4096 mb,4 vcores</minResources>
            <maxResources>16384 mb,8 vcores</maxResources>
        </queue>
    </queue>

    <!-- Prod queue -->
    <queue name="prod">
        <minResources>16384 mb,16 vcores</minResources>
        <maxResources>131072 mb,64 vcores</maxResources>
        <weight>3.0</weight>
        <schedulingMode>fifo</schedulingMode>  <!-- FIFO ordering within the prod queue -->
    </queue>

    <!-- Queue placement policy -->
    <queuePlacementPolicy>
        <rule name="specified" create="false"/>
        <rule name="primaryGroup" create="false"/>
        <rule name="default" queue="default"/>
    </queuePlacementPolicy>
</allocations>

Enabling the Fair Scheduler

<!-- yarn-site.xml -->
<configuration>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>
    <property>
        <name>yarn.scheduler.fair.allocation.file</name>
        <value>/etc/hadoop/fair-scheduler.xml</value>
    </property>
    <property>
        <name>yarn.scheduler.fair.user-as-default-queue</name>
        <value>false</value>
    </property>
</configuration>
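With these settings in place, the Fair Scheduler re-reads the allocation file whenever it changes, so queue edits need no restart. The computed queues and shares can be checked over REST; a sketch, noting that the exact JSON layout of the Fair Scheduler response varies a little between Hadoop versions:

# Inspect Fair Scheduler queues and their resource limits
curl -s "http://hadoop-master:8088/ws/v1/cluster/scheduler" \
    | jq '.scheduler.schedulerInfo.rootQueue.childQueues'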

4. Multi-Tenant Resource Management in Practice

4.1 Queue Configuration and Management Commands

Inspecting and managing the queue structure

# Check the status of a queue
yarn queue -status default

# List all queues (the yarn CLI has no -list option; use the mapred CLI)
mapred queue -list

# Refresh queue configuration (no RM restart needed)
yarn rmadmin -refreshQueues

# List applications and their queues (first 10 lines)
yarn application -list | head -10

Monitoring queue usage

# Live view of cluster and queue resource usage
yarn top

# Per-queue statistics
yarn queue -status dev

# Queue information via the REST API
curl -s "http://hadoop-master:8088/ws/v1/cluster/scheduler" | python -m json.tool

4.2 Submitting Applications to a Specific Queue

Setting the queue in a MapReduce driver

// Set the queue in the driver program
Configuration conf = new Configuration();
conf.set("mapreduce.job.queuename", "dev");

Job job = Job.getInstance(conf, "queue-aware job");

// Or set it on the job's configuration before submission
job.getConfiguration().set("mapreduce.job.queuename", "prod");

Submitting to a queue from the command line

# MapReduce job with an explicit queue
hadoop jar wordcount.jar WordCount \
    -Dmapreduce.job.queuename=dev \
    /input /output

# Spark job with an explicit queue
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --queue dev \
    --class org.apache.spark.examples.SparkPi \
    /path/to/spark-examples.jar

# The same via the yarn command
yarn jar wordcount.jar WordCount \
    -Dmapreduce.job.queuename=research \
    /input /output

4.3 User and Group Permission Management

Configuring Linux users and groups

# Create groups
groupadd dev_group
groupadd prod_group
groupadd research_group

# Create users and assign them to their groups
useradd -g dev_group dev_user1
useradd -g prod_group prod_user1
useradd -g research_group research_user1

# Configure Hadoop proxy users (on the master node)
# core-site.xml additions:
<!-- core-site.xml proxy user settings -->
<property>
    <name>hadoop.proxyuser.yarn.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.yarn.groups</name>
    <value>*</value>
</property>

Capacity Scheduler ACL configuration

Two details are easy to miss here: queue ACLs are only enforced when yarn.acl.enable is set to true in yarn-site.xml, and ACLs are evaluated hierarchically, so the permissive default on root (*) must be tightened or it will override restrictive child ACLs. ACL values use the format "user1,user2 group1,group2"; a leading space means "groups only".

<!-- capacity-scheduler.xml -->
<property>
    <name>yarn.scheduler.capacity.root.dev.acl_submit_applications</name>
    <value>admin dev_group</value>  <!-- user admin plus group dev_group -->
</property>
<property>
    <name>yarn.scheduler.capacity.root.dev.acl_administer_queue</name>
    <value>admin</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.prod.acl_submit_applications</name>
    <value>admin prod_group</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.research.acl_submit_applications</name>
    <value>admin research_group</value>
</property>
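A quick way to confirm the ACLs behave as intended is to view the queue ACLs as each user. A sketch, assuming the users created above exist on the node you run this from:

# Each user sees which operations they may perform on each queue;
# dev_user1 should have SUBMIT_APPLICATIONS on dev but not on prod
sudo -u dev_user1 mapred queue -showacls
sudo -u prod_user1 mapred queue -showacls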

5. Application Lifecycle Management

5.1 YARN Application State Management

Application state transitions

NEW → NEW_SAVING → SUBMITTED → ACCEPTED → RUNNING → FINISHED/FAILED/KILLED

Application management commands

# List all applications
yarn application -list

# Filter by state
yarn application -list -appStates RUNNING
yarn application -list -appStates FINISHED
yarn application -list -appStates FAILED

# Show application details
yarn application -status <ApplicationId>

# Kill an application
yarn application -kill <ApplicationId>

# Fetch application logs
yarn logs -applicationId <ApplicationId>
yarn logs -applicationId <ApplicationId> -containerId <ContainerId>

# List attempts and their containers
yarn applicationattempt -list <ApplicationId>
yarn container -list <ApplicationAttemptId>
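These commands combine well for day-to-day triage. A sketch that pulls the logs of every failed application into per-application files; note that yarn logs only works after containers exit if log aggregation (yarn.log-aggregation-enable) is turned on:

#!/bin/bash
# Collect logs for all FAILED applications (assumes log aggregation is enabled)
for app in $(yarn application -list -appStates FAILED 2>/dev/null | awk 'NR>2 {print $1}'); do
    echo "Fetching logs for ${app}"
    yarn logs -applicationId "${app}" > "/tmp/${app}.log" 2>/dev/null
done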

5.2 Resource Requests and Allocation Strategy

ApplicationMaster resource negotiation example

public class CustomApplicationMaster {

    // Assumed to be initialized and started elsewhere in the AM:
    // amRMClient (AMRMClient) and nmClientAsync (NMClientAsync).

    public void requestResources() {
        // The AMRMClient API takes one ContainerRequest per desired container.
        // Request 10 containers for map tasks: 2 GB memory, 2 vcores each.
        for (int i = 0; i < 10; i++) {
            AMRMClient.ContainerRequest mapRequest = new AMRMClient.ContainerRequest(
                Resource.newInstance(2048, 2), // 2 GB memory, 2 vcores
                null,                          // no preferred nodes
                null,                          // no preferred racks
                Priority.newInstance(1));      // lower value = higher priority
            amRMClient.addContainerRequest(mapRequest);
        }

        // Request 5 containers for reduce tasks at higher priority (0 beats 1)
        for (int i = 0; i < 5; i++) {
            AMRMClient.ContainerRequest reduceRequest = new AMRMClient.ContainerRequest(
                Resource.newInstance(4096, 4), // 4 GB memory, 4 vcores
                null, null,
                Priority.newInstance(0));
            amRMClient.addContainerRequest(reduceRequest);
        }
    }

    public void onContainersAllocated(List<Container> containers) {
        for (Container container : containers) {
            // Build the launch context for the task container
            ContainerLaunchContext ctx = 
                Records.newRecord(ContainerLaunchContext.class);

            // Command the container will run
            ctx.setCommands(Collections.singletonList("java -Xmx2048m MapTask"));

            // Ask the NodeManager to start the container
            nmClientAsync.startContainerAsync(container, ctx);
        }
    }
}

5.3 Application Monitoring and Metrics Collection

Monitoring with the YARN REST API

#!/bin/bash
# yarn-monitor.sh

CLUSTER_URL="http://hadoop-master:8088/ws/v1/cluster"

# Cluster-level metrics
echo "=== Cluster metrics ==="
curl -s "${CLUSTER_URL}/metrics" | jq '.clusterMetrics'

# Scheduler information
echo -e "\n=== Scheduler info ==="
curl -s "${CLUSTER_URL}/scheduler" | jq '.scheduler.schedulerInfo'

# Running applications
echo -e "\n=== Running applications ==="
curl -s "${CLUSTER_URL}/apps?states=RUNNING" | jq '.apps.app[] | {id, name, user, queue}'

# Node status (the REST field is availMemoryMB, not availableMemoryMB)
echo -e "\n=== Node status ==="
curl -s "${CLUSTER_URL}/nodes" | jq '.nodes.node[] | {nodeHostName, state, availMemoryMB, usedMemoryMB}'

A custom monitoring script

#!/usr/bin/env python3
# yarn_dashboard.py

import requests
from datetime import datetime

class YARNDashboard:
    def __init__(self, rm_host="hadoop-master", rm_port=8088):
        self.base_url = f"http://{rm_host}:{rm_port}/ws/v1/cluster"

    def get_cluster_metrics(self):
        """Fetch cluster-level metrics."""
        response = requests.get(f"{self.base_url}/metrics")
        return response.json()['clusterMetrics']

    def get_queue_metrics(self, queue_name):
        """Fetch metrics for a single queue."""
        response = requests.get(f"{self.base_url}/scheduler")
        scheduler_info = response.json()['scheduler']['schedulerInfo']

        # Walk the queue hierarchy looking for the target queue.
        # The Capacity Scheduler REST layout nests children as
        # {'queues': {'queue': [...]}} - handle both keys.
        def find_queue(queues, target_name):
            if isinstance(queues, list):
                for queue in queues:
                    result = find_queue(queue, target_name)
                    if result:
                        return result
            elif isinstance(queues, dict):
                if queues.get('queueName') == target_name:
                    return queues
                for key in ('queues', 'queue'):
                    if key in queues:
                        result = find_queue(queues[key], target_name)
                        if result:
                            return result
            return None

        return find_queue(scheduler_info, queue_name)

    def generate_report(self):
        """Print a monitoring report."""
        metrics = self.get_cluster_metrics()

        print(f"=== YARN cluster monitoring report {datetime.now()} ===")
        print(f"Active nodes: {metrics['activeNodes']}")
        print(f"Total memory: {metrics['totalMB']} MB")
        print(f"Allocated memory: {metrics['allocatedMB']} MB")
        print(f"Available memory: {metrics['availableMB']} MB")
        print(f"Memory utilization: {(metrics['allocatedMB']/metrics['totalMB'])*100:.1f}%")
        print(f"Total vCores: {metrics['totalVirtualCores']}")
        print(f"Allocated vCores: {metrics['allocatedVirtualCores']}")
        print(f"Running applications: {metrics['appsRunning']}")

        # Check per-queue status
        for queue in ['dev', 'prod', 'research']:
            queue_info = self.get_queue_metrics(queue)
            if queue_info:
                print(f"\nQueue {queue}:")
                print(f"  Used capacity: {queue_info.get('usedCapacity', 0):.1f}%")
                print(f"  Absolute used capacity: {queue_info.get('absoluteUsedCapacity', 0):.1f}%")
                print(f"  Applications: {queue_info.get('numApplications', 0)}")

if __name__ == "__main__":
    dashboard = YARNDashboard()
    dashboard.generate_report()

6. Advanced YARN Features and Tuning

6.1 Resource Scheduling Optimization

Container resource allocation settings

<!-- yarn-site.xml resource settings -->
<configuration>
    <!-- Minimum container memory allocation -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
    </property>

    <!-- Maximum container memory allocation -->
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>16384</value>
    </property>

    <!-- Minimum container vcore allocation -->
    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
    </property>

    <!-- Maximum container vcore allocation -->
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>8</value>
    </property>

    <!-- NodeManager resources -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>16384</value>  <!-- 16 GB -->
    </property>

    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>8</value>      <!-- 8 vcores -->
    </property>

    <!-- Virtual memory check -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>  <!-- often disabled in production: JVM off-heap usage easily trips it -->
    </property>

    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>true</value>
    </property>
</configuration>
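To confirm the NodeManager settings took effect, query each node's advertised resources through the RM REST API. A sketch; the hostname is an assumption:

# Each node should report 16384 MB total, with availMemoryMB shrinking
# as containers are allocated
curl -s "http://hadoop-master:8088/ws/v1/cluster/nodes" \
    | jq '.nodes.node[] | {nodeHostName, availMemoryMB, usedMemoryMB, availableVirtualCores}'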

6.2 Queue Preemption

Capacity Scheduler preemption configuration

<!-- yarn-site.xml: enable the preemption monitor -->
<property>
    <name>yarn.resourcemanager.scheduler.monitor.enable</name>
    <value>true</value>
</property>

<!-- capacity-scheduler.xml: preemption can be opted out per queue -->
<property>
    <name>yarn.scheduler.capacity.<queue-path>.disable_preemption</name>
    <value>false</value>
</property>

<!-- Global preemption tuning (yarn-site.xml) -->
<property>
    <name>yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill</name>
    <value>15000</value>  <!-- wait 15 s before force-killing containers -->
</property>
<property>
    <name>yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round</name>
    <value>0.1</value>    <!-- preempt at most 10% of resources per round -->
</property>
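The effective values can be checked through the RM's configuration servlet. A sketch; /conf returns the full effective configuration as XML:

# Verify the preemption monitor and its tuning knobs are active
curl -s "http://hadoop-master:8088/conf" | grep -E "scheduler\.monitor\.enable|preemption"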

6.3 Node Labels and Resource Partitions

Configuring node labels

# Node labels must first be enabled in yarn-site.xml:
#   yarn.node-labels.enabled=true
#   yarn.node-labels.fs-store.root-dir=hdfs:///yarn/node-labels

# Add labels to the cluster
yarn rmadmin -addToClusterNodeLabels "GPU,SSD"

# Map nodes to labels (entries are space-separated; a node belongs to at
# most one partition, and the optional port is the NodeManager port)
yarn rmadmin -replaceLabelsOnNode "hadoop-slave1=GPU hadoop-slave2=SSD"

# List cluster node labels
yarn cluster --list-node-labels

Granting queues access to node labels

<!-- capacity-scheduler.xml -->
<property>
    <name>yarn.scheduler.capacity.root.dev.accessible-node-labels</name>
    <value>GPU,SSD</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.dev.accessible-node-labels.GPU.capacity</name>
    <value>50</value>
</property>
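Jobs then target labeled nodes through a node label expression. A sketch with Spark on YARN, reusing the dev queue from earlier; spark.yarn.executor.nodeLabelExpression is a standard Spark-on-YARN property:

# Run Spark executors only on GPU-labeled nodes in the dev queue
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --queue dev \
    --conf spark.yarn.executor.nodeLabelExpression=GPU \
    --class org.apache.spark.examples.SparkPi \
    /path/to/spark-examples.jar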

7. Hands-On: Building a Multi-Tenant Environment

7.1 A Complete Multi-Tenant Setup Script

#!/bin/bash
# setup-multi-tenant.sh

set -e

echo "Setting up the Hadoop multi-tenant environment..."

# 1. Create users and groups
echo "Creating users and groups..."
groupadd dev_group
groupadd prod_group
groupadd research_group

useradd -g dev_group -m dev_user1
useradd -g dev_group -m dev_user2
useradd -g prod_group -m prod_user1
useradd -g research_group -m research_user1

# passwd --stdin is RHEL/CentOS-specific; use chpasswd on Debian/Ubuntu
echo "hadoop123" | passwd --stdin dev_user1
echo "hadoop123" | passwd --stdin prod_user1

# 2. Create the HDFS directory structure
echo "Creating the HDFS directory structure..."
sudo -u hdfs hdfs dfs -mkdir -p /user/dev_user1
sudo -u hdfs hdfs dfs -mkdir -p /user/dev_user2
sudo -u hdfs hdfs dfs -mkdir -p /user/prod_user1
sudo -u hdfs hdfs dfs -mkdir -p /user/research_user1

sudo -u hdfs hdfs dfs -chown dev_user1:dev_group /user/dev_user1
sudo -u hdfs hdfs dfs -chown dev_user2:dev_group /user/dev_user2
sudo -u hdfs hdfs dfs -chown prod_user1:prod_group /user/prod_user1
sudo -u hdfs hdfs dfs -chown research_user1:research_group /user/research_user1

# 3. Deploy the Capacity Scheduler configuration
echo "Configuring the Capacity Scheduler..."
cat > /tmp/capacity-scheduler.xml << 'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>dev,prod,research</value>
  </property>

  <!-- Dev Queue -->
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.acl_submit_applications</name>
    <value> dev_group</value>  <!-- leading space: groups only -->
  </property>

  <!-- Prod Queue -->
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.maximum-capacity</name>
    <value>80</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.acl_submit_applications</name>
    <value>admin prod_group</value>  <!-- users, then groups -->
  </property>

  <!-- Research Queue -->
  <property>
    <name>yarn.scheduler.capacity.root.research.capacity</name>
    <value>20</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.research.acl_submit_applications</name>
    <value> research_group</value>  <!-- leading space: groups only -->
  </property>
</configuration>
EOF

cp /tmp/capacity-scheduler.xml $HADOOP_HOME/etc/hadoop/

# 4. Configure yarn-site.xml
# Properties must live inside the <configuration> element, so insert them
# before the closing tag rather than appending to the end of the file.
# (The Capacity Scheduler picks up capacity-scheduler.xml from the Hadoop
# config directory automatically; no extra property is needed for that.)
echo "Configuring yarn-site.xml..."
sed -i 's|</configuration>|  <property>\n    <name>yarn.resourcemanager.scheduler.class</name>\n    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>\n  </property>\n</configuration>|' \
    $HADOOP_HOME/etc/hadoop/yarn-site.xml

# 5. Restart YARN
echo "Restarting YARN services..."
stop-yarn.sh
start-yarn.sh

# 6. Verify the configuration
echo "Verifying the configuration..."
yarn rmadmin -refreshQueues
mapred queue -list

echo "Multi-tenant environment setup complete!"

7.2 Multi-User Job Testing

#!/bin/bash
# test-multi-tenant.sh

echo "=== Multi-tenant environment test ==="

# 1. Prepare test data
echo "Preparing test data..."
sudo -u hdfs hdfs dfs -mkdir -p /shared/test_data
echo "test data for multi-tenant environment" | sudo -u hdfs hdfs dfs -put - /shared/test_data/sample.txt

# 2. Submit a job as dev_user1
# (assumes HADOOP_HOME is set in each user's login environment)
echo "Testing job submission as a dev user..."
sudo -u dev_user1 -i << 'EOF'
hdfs dfs -ls /user/dev_user1
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
    -Dmapreduce.job.queuename=dev \
    /shared/test_data /user/dev_user1/output_wordcount
echo "dev user job finished"
EOF

# 3. Submit a job as prod_user1
echo "Testing job submission as a prod user..."
sudo -u prod_user1 -i << 'EOF'
hdfs dfs -ls /user/prod_user1
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
    -Dmapreduce.job.queuename=prod \
    /shared/test_data /user/prod_user1/output_wordcount
echo "prod user job finished"
EOF

# 4. Check job and queue status
echo "Currently running applications:"
yarn application -list

echo "Queue status:"
yarn queue -status dev
yarn queue -status prod
yarn queue -status research

7.3 Monitoring and Alerting Script

#!/usr/bin/env python3
# yarn_alert.py

import requests
import smtplib
from email.mime.text import MIMEText
import time

class YARNAlert:
    def __init__(self, rm_host="hadoop-master", rm_port=8088):
        self.base_url = f"http://{rm_host}:{rm_port}/ws/v1/cluster"
        self.thresholds = {
            'memory_usage': 0.85,   # alert above 85% memory utilization
            'queue_capacity': 0.90, # alert above 90% queue capacity
            'pending_apps': 10      # alert above 10 pending applications
        }

    def check_cluster_health(self):
        """Check cluster health against the thresholds."""
        try:
            metrics = requests.get(f"{self.base_url}/metrics").json()['clusterMetrics']

            alerts = []

            # Memory utilization
            memory_ratio = metrics['allocatedMB'] / metrics['totalMB']
            if memory_ratio > self.thresholds['memory_usage']:
                alerts.append(f"Memory utilization too high: {memory_ratio:.1%}")

            # Pending applications
            if metrics['appsPending'] > self.thresholds['pending_apps']:
                alerts.append(f"Too many pending applications: {metrics['appsPending']}")

            return alerts

        except Exception as e:
            return [f"Monitoring check failed: {str(e)}"]

    def send_alert(self, alerts):
        """Send out alerts."""
        if not alerts:
            return

        subject = "YARN cluster alert"
        body = "\n".join([f"• {alert}" for alert in alerts])

        # Hook up email (smtplib/MIMEText above), Slack, WeChat, etc. here
        print(f"ALERT: {subject}")
        print(body)

    def run_monitoring(self):
        """Run the monitoring loop."""
        while True:
            alerts = self.check_cluster_health()
            self.send_alert(alerts)
            time.sleep(300)  # check every 5 minutes

if __name__ == "__main__":
    monitor = YARNAlert()
    monitor.run_monitoring()

Summary

This article covered:

  • ✅ YARN's core architectural components and how they work together
  • ✅ Configuring and using the three schedulers
  • ✅ Multi-tenant resource management and queue configuration
  • ✅ Application lifecycle management and monitoring
  • ✅ Advanced features: node labels and resource preemption
  • ✅ A complete hands-on multi-tenant environment setup

Key takeaways

  1. ResourceManager: global resource management and scheduling
  2. NodeManager: per-node resource management and task execution
  3. Capacity Scheduler: the enterprise-grade multi-tenant scheduling option
  4. Queue management: resource allocation, access control, and monitoring
  5. Application management: submission, monitoring, and tuning

Next up: "Hadoop Ecosystem Tools in Practice: Hive, Sqoop, Flume" will take you into the world of Hadoop ecosystem tooling!


Suggested Exercises

# Hands-on practice
1. Configure a multi-queue Capacity Scheduler environment
2. Submit jobs to different queues as different users
3. Monitor cluster state with the YARN REST API
4. Experiment with the resource scheduling optimizations
5. Build a complete multi-tenant monitoring and alerting pipeline
