Hadoop Big Data Tools and Its Ecosystem - Advanced Hadoop Features

Advanced Hadoop Features: Security, High Availability, and Monitoring

1. Kerberos Security Authentication

1.1 Kerberos Principles and Architecture

Kerberos Authentication Flow

Client → AS: request a TGT
AS → Client: TGT (encrypted with the client's password-derived key)
Client → TGS: request a service ticket (presenting the TGT)
TGS → Client: service ticket
Client → Service: access the service (presenting the service ticket)

KDC Components

  • AS (Authentication Server): authentication server; issues TGTs
  • TGS (Ticket Granting Server): ticket-granting server; issues service tickets
  • Database: stores all principals and their keys
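From the client side, the whole exchange can be observed with the standard MIT Kerberos tools. A minimal sketch, assuming the HADOOP.COM realm configured below is already in place and a test user principal (called alice here purely as an example) has been created:

# AS exchange: obtain a TGT for the user (password prompt)
kinit alice@HADOOP.COM

# The TGT shows up as krbtgt/HADOOP.COM@HADOOP.COM
klist

# Any Hadoop command performs the TGS exchange transparently; afterwards klist
# also lists a service ticket such as hdfs/hadoop-master@HADOOP.COM
hdfs dfs -ls /
klist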

1.2 Kerberos Installation and Configuration

Install the KDC Server

# Install the KDC on the master node
yum install -y krb5-server krb5-workstation krb5-libs

# Configure Kerberos
cat > /etc/krb5.conf << 'EOF'
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = HADOOP.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true
 default_tgs_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
 default_tkt_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
 permitted_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96

[realms]
 HADOOP.COM = {
  kdc = hadoop-master:88
  admin_server = hadoop-master:749
  default_domain = hadoop.com
 }

[domain_realm]
 .hadoop.com = HADOOP.COM
 hadoop.com = HADOOP.COM
EOF

# Configure the KDC
cat > /var/kerberos/krb5kdc/kdc.conf << 'EOF'
[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 HADOOP.COM = {
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes256-cts:normal aes128-cts:normal
  max_life = 24h
  max_renewable_life = 10d
 }
EOF

# Create the ACL file
cat > /var/kerberos/krb5kdc/kadm5.acl << 'EOF'
*/admin@HADOOP.COM *
EOF

# Create the Kerberos database
kdb5_util create -s -P hadoop123

# Start the Kerberos services
systemctl start krb5kdc
systemctl start kadmin
systemctl enable krb5kdc kadmin

Create Hadoop Service Principals

# Create principals from the Kerberos admin shell
kadmin.local -q "addprinc root/admin"
kadmin.local -q "addprinc -randkey hdfs/hadoop-master@HADOOP.COM"
kadmin.local -q "addprinc -randkey yarn/hadoop-master@HADOOP.COM"
kadmin.local -q "addprinc -randkey mapred/hadoop-master@HADOOP.COM"
kadmin.local -q "addprinc -randkey HTTP/hadoop-master@HADOOP.COM"

# Create principals for the DataNodes
kadmin.local -q "addprinc -randkey hdfs/hadoop-slave1@HADOOP.COM"
kadmin.local -q "addprinc -randkey yarn/hadoop-slave1@HADOOP.COM"
kadmin.local -q "addprinc -randkey HTTP/hadoop-slave1@HADOOP.COM"

# Generate keytab files
kadmin.local -q "xst -k /etc/security/keytabs/hdfs.keytab hdfs/hadoop-master hdfs/hadoop-slave1 HTTP/hadoop-master HTTP/hadoop-slave1"
kadmin.local -q "xst -k /etc/security/keytabs/yarn.keytab yarn/hadoop-master yarn/hadoop-slave1 HTTP/hadoop-master HTTP/hadoop-slave1"
kadmin.local -q "xst -k /etc/security/keytabs/mapred.keytab mapred/hadoop-master HTTP/hadoop-master"

# Set keytab file permissions
chown hdfs:hadoop /etc/security/keytabs/hdfs.keytab
chown yarn:hadoop /etc/security/keytabs/yarn.keytab
chown mapred:hadoop /etc/security/keytabs/mapred.keytab
chmod 400 /etc/security/keytabs/*.keytab

1.3 Hadoop Kerberos Configuration

core-site.xml Configuration

<configuration>
    <!-- Kerberos security settings -->
    <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
    </property>
    <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
    </property>
    <property>
        <name>hadoop.security.auth_to_local</name>
        <value>
            RULE:[2:$1@$0](.*@HADOOP.COM)s/.*/hadoop/
            DEFAULT
        </value>
    </property>
</configuration>
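The auth_to_local rule above maps any two-component principal in HADOOP.COM to the local user hadoop. A quick way to sanity-check such rules is the hadoop kerbname subcommand (available in Hadoop 3.x; the principal below is just an example):

# Should print the mapped local user, i.e. "hadoop" with the rule above
hadoop kerbname hdfs/hadoop-master@HADOOP.COM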

hdfs-site.xml Configuration

<configuration>
    <!-- NameNode security settings -->
    <property>
        <name>dfs.namenode.kerberos.principal</name>
        <value>hdfs/_HOST@HADOOP.COM</value>
    </property>
    <property>
        <name>dfs.namenode.keytab.file</name>
        <value>/etc/security/keytabs/hdfs.keytab</value>
    </property>

    <!-- DataNode security settings -->
    <property>
        <name>dfs.datanode.kerberos.principal</name>
        <value>hdfs/_HOST@HADOOP.COM</value>
    </property>
    <property>
        <name>dfs.datanode.keytab.file</name>
        <value>/etc/security/keytabs/hdfs.keytab</value>
    </property>

    <!-- Web UI security settings -->
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.web.authentication.kerberos.principal</name>
        <value>HTTP/_HOST@HADOOP.COM</value>
    </property>
    <property>
        <name>dfs.web.authentication.kerberos.keytab</name>
        <value>/etc/security/keytabs/hdfs.keytab</value>
    </property>
</configuration>

yarn-site.xml Configuration

<configuration>
    <!-- ResourceManager security settings -->
    <property>
        <name>yarn.resourcemanager.principal</name>
        <value>yarn/_HOST@HADOOP.COM</value>
    </property>
    <property>
        <name>yarn.resourcemanager.keytab</name>
        <value>/etc/security/keytabs/yarn.keytab</value>
    </property>

    <!-- NodeManager security settings -->
    <property>
        <name>yarn.nodemanager.principal</name>
        <value>yarn/_HOST@HADOOP.COM</value>
    </property>
    <property>
        <name>yarn.nodemanager.keytab</name>
        <value>/etc/security/keytabs/yarn.keytab</value>
    </property>

    <!-- Timeline service security settings -->
    <property>
        <name>yarn.timeline-service.principal</name>
        <value>yarn/_HOST@HADOOP.COM</value>
    </property>
    <property>
        <name>yarn.timeline-service.keytab</name>
        <value>/etc/security/keytabs/yarn.keytab</value>
    </property>

    <!-- Web UI security settings -->
    <property>
        <name>yarn.webapp.principal</name>
        <value>HTTP/_HOST@HADOOP.COM</value>
    </property>
    <property>
        <name>yarn.webapp.keytab</name>
        <value>/etc/security/keytabs/yarn.keytab</value>
    </property>
</configuration>
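Once the daemons are restarted with the settings above, a short smoke test confirms that authentication is actually enforced (a sketch using the keytabs created earlier; adjust hostnames to your cluster):

# Authenticate with the HDFS service keytab and list the root directory
kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/hadoop-master@HADOOP.COM
hdfs dfs -ls /

# Destroy the ticket cache; the same command should now fail with a
# "No valid credentials provided" / GSSException error
kdestroy
hdfs dfs -ls /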

2. HDFS High Availability Configuration

2.1 NameNode HA Architecture

HA Architecture Design

Active NameNode ←→ Standby NameNode
        ↓               ↓
JournalNodes (QJM) ←→ Zookeeper
        ↓
    DataNodes

2.2 QJM-Based HA Configuration

Configure the JournalNodes

# Run on every JournalNode host
mkdir -p /data/hadoop/journalnode
chown -R hdfs:hadoop /data/hadoop/journalnode

# Start the JournalNode daemon
hdfs --daemon start journalnode

hdfs-site.xml HA Configuration

<configuration>
    <!-- HA settings -->
    <property>
        <name>dfs.nameservices</name>
        <value>mycluster</value>
    </property>
    <property>
        <name>dfs.ha.namenodes.mycluster</name>
        <value>nn1,nn2</value>
    </property>

    <!-- NameNode RPC addresses -->
    <property>
        <name>dfs.namenode.rpc-address.mycluster.nn1</name>
        <value>hadoop-master:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.mycluster.nn2</name>
        <value>hadoop-slave1:8020</value>
    </property>

    <!-- NameNode HTTP addresses -->
    <property>
        <name>dfs.namenode.http-address.mycluster.nn1</name>
        <value>hadoop-master:9870</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.mycluster.nn2</name>
        <value>hadoop-slave1:9870</value>
    </property>

    <!-- JournalNode settings -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://hadoop-master:8485;hadoop-slave1:8485;hadoop-slave2:8485/mycluster</value>
    </property>

    <!-- Client failover proxy provider -->
    <property>
        <name>dfs.client.failover.proxy.provider.mycluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>

    <!-- Automatic failover -->
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>

    <!-- SSH fencing settings -->
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/hdfs/.ssh/id_rsa</value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>30000</value>
    </property>
</configuration>

core-site.xml HA Configuration

<configuration>
    <!-- Use the HA nameservice as the default filesystem -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://mycluster</value>
    </property>

    <!-- ZooKeeper quorum for automatic failover -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>hadoop-master:2181,hadoop-slave1:2181,hadoop-slave2:2181</value>
    </property>
</configuration>
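After distributing the two files, the client-side view of the HA setup can be verified before starting anything (a quick sketch):

# Confirm the nameservice and the NameNodes the client will resolve
hdfs getconf -confKey fs.defaultFS      # expected: hdfs://mycluster
hdfs getconf -namenodes                 # expected: hadoop-master hadoop-slave1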

2.3 ZooKeeper Configuration

Install ZooKeeper

# Run on every ZooKeeper node
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.6.3/apache-zookeeper-3.6.3-bin.tar.gz
tar -xzf apache-zookeeper-3.6.3-bin.tar.gz -C /usr/local/
cd /usr/local && ln -s apache-zookeeper-3.6.3-bin zookeeper

# Configure environment variables
echo 'export ZOOKEEPER_HOME=/usr/local/zookeeper' >> /etc/profile
echo 'export PATH=$ZOOKEEPER_HOME/bin:$PATH' >> /etc/profile
source /etc/profile

# Create data directories and the configuration file
mkdir -p /data/zookeeper/data
mkdir -p /data/zookeeper/logs

cat > $ZOOKEEPER_HOME/conf/zoo.cfg << 'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/logs
clientPort=2181
maxClientCnxns=60
autopurge.snapRetainCount=3
autopurge.purgeInterval=1

server.1=hadoop-master:2888:3888
server.2=hadoop-slave1:2888:3888
server.3=hadoop-slave2:2888:3888
EOF

# Create the myid file (a different value on each node)
# hadoop-master:
echo "1" > /data/zookeeper/data/myid
# hadoop-slave1:
echo "2" > /data/zookeeper/data/myid
# hadoop-slave2:
echo "3" > /data/zookeeper/data/myid

# Start ZooKeeper
zkServer.sh start
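Once all three nodes are up, a quick status check should show one leader and two followers (a sketch):

# Run on each ZooKeeper node; exactly one should report "Mode: leader"
zkServer.sh status

# Optional connectivity check from any node
zkCli.sh -server hadoop-master:2181 ls /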

2.4 HA Cluster Initialization and Management

Initialize the HA Cluster

# 1. Format the HA state znode in ZooKeeper
hdfs zkfc -formatZK

# 2. Start the JournalNodes
hdfs --daemon start journalnode

# 3. Format the first NameNode
hdfs namenode -format -clusterId mycluster

# 4. Start the first NameNode
hdfs --daemon start namenode

# 5. Sync metadata on the second NameNode host
hdfs namenode -bootstrapStandby

# 6. Start the second NameNode
hdfs --daemon start namenode

# 7. Start all DataNodes
hdfs --daemon start datanode

# 8. Start the ZKFC daemons
hdfs --daemon start zkfc
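When the ZKFCs are running, the automatic-failover state lives under the /hadoop-ha znode in ZooKeeper; a quick way to confirm it was created (a sketch, the exact znode layout may vary slightly by version):

# The nameservice znode should contain entries such as ActiveStandbyElectorLock
zkCli.sh -server hadoop-master:2181 ls /hadoop-ha/mycluster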

HA Management Commands

# Check NameNode state
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manual failover
hdfs haadmin -failover nn1 nn2

# Transition a NameNode to Active state
hdfs haadmin -transitionToActive nn1

# Transition a NameNode to Standby state
hdfs haadmin -transitionToStandby nn2

# Check NameNode health
hdfs haadmin -checkHealth nn1

3. NameNode Federation

3.1 Federation Architecture Design

Advantages of Federation

  • Namespace scaling: multiple NameNodes each manage a separate namespace
  • Better performance: load is spread across multiple NameNodes
  • Isolation: different workloads can use different namespaces

Federation Architecture

NameNode1 (ns1)     NameNode2 (ns2)
     ↓                    ↓
Block Pool 1        Block Pool 2
     ↓                    ↓
DataNode1 (BP1+BP2)   DataNode2 (BP1+BP2)

3.2 Federation Configuration

Configure Multiple Namespaces

<!-- hdfs-site.xml -->
<configuration>
    <!-- Define the nameservices -->
    <property>
        <name>dfs.nameservices</name>
        <value>ns1,ns2</value>
    </property>

    <!-- First namespace -->
    <property>
        <name>dfs.namenode.rpc-address.ns1</name>
        <value>hadoop-master:8020</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.ns1</name>
        <value>hadoop-master:9870</value>
    </property>

    <!-- Second namespace -->
    <property>
        <name>dfs.namenode.rpc-address.ns2</name>
        <value>hadoop-slave1:8020</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.ns2</name>
        <value>hadoop-slave1:9870</value>
    </property>

    <!-- DataNode storage directories -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data1/hadoop/datanode,/data2/hadoop/datanode</value>
    </property>
</configuration>

core-site.xml Configuration

<configuration>
    <!-- Default namespace -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ns1</value>
    </property>

    <!-- ViewFS mount table (optional) -->
    <property>
        <name>fs.viewfs.mounttable.default.link./data</name>
        <value>hdfs://ns1/data</value>
    </property>
    <property>
        <name>fs.viewfs.mounttable.default.link./logs</name>
        <value>hdfs://ns2/logs</value>
    </property>
</configuration>
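Note that the mount table above only applies to paths accessed through the viewfs scheme; with fs.defaultFS still set to hdfs://ns1, ordinary paths go straight to ns1. A sketch of how the mount points would behave if fs.defaultFS were instead pointed at a ViewFS URI (the mount table name "default" matches the properties above):

# With fs.defaultFS on the viewfs scheme, the mount points become transparent:
hdfs dfs -ls /data      # resolved through the mount table to hdfs://ns1/data
hdfs dfs -ls /logs      # resolved through the mount table to hdfs://ns2/logs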

3.3 Federation Management

Initialize Federation

# Format the first NameNode
hdfs namenode -format -clusterId myfederation

# Start the first NameNode
hdfs --daemon start namenode

# On the second node, initialize the new namespace with the same cluster ID
hdfs namenode -format -clusterId myfederation

# Start the second NameNode
hdfs --daemon start namenode

# Start all DataNodes
hdfs --daemon start datanode

Working with Different Namespaces

# Access the default namespace
hdfs dfs -ls hdfs://ns1/

# Access the second namespace
hdfs dfs -ls hdfs://ns2/

# Create directories in a specific namespace
hdfs dfs -mkdir hdfs://ns1/data
hdfs dfs -mkdir hdfs://ns2/logs

# View block pool information
hdfs dfsadmin -report

4. Cluster Monitoring and Alerting

4.1 Ambari Monitoring Platform

Ambari Architecture

Ambari Server → Ambari Agents → Hadoop Services
     ↓
PostgreSQL (metadata store)
     ↓
Web UI (monitoring interface)

Ambari Installation

# Configure the Ambari repository
wget -O /etc/yum.repos.d/ambari.repo http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.5.0/ambari.repo

# Install Ambari Server
yum install -y ambari-server

# Install Ambari Agent (all nodes)
yum install -y ambari-agent

# Set up Ambari Server
ambari-server setup

# Start Ambari
ambari-server start
ambari-agent start

4.2 Custom Monitoring Scripts

HDFS Health Check Script

#!/bin/bash
# hdfs_health_check.sh

echo "=== HDFS Health Check Report ==="
echo "Check time: $(date)"

# 1. Check NameNode state
echo -e "\n1. NameNode state:"
hdfs haadmin -getServiceState nn1 2>/dev/null || echo "NameNode1: unreachable"
hdfs haadmin -getServiceState nn2 2>/dev/null || echo "NameNode2: unreachable"

# 2. Check HDFS capacity usage
echo -e "\n2. HDFS capacity usage:"
hdfs dfsadmin -report | grep -E "Configured Capacity|Present Capacity|DFS Used|DFS Remaining"

# 3. Check DataNode status (extract the numeric counts from the report)
echo -e "\n3. DataNode status:"
LIVE_NODES=$(hdfs dfsadmin -report | grep "Live datanodes" | grep -oE '[0-9]+' | head -1)
DEAD_NODES=$(hdfs dfsadmin -report | grep "Dead datanodes" | grep -oE '[0-9]+' | head -1)
echo "Live nodes: ${LIVE_NODES:-0}"
echo "Dead nodes: ${DEAD_NODES:-0}"

# 4. Check block status
echo -e "\n4. Block status:"
UNDER_REPLICATED=$(hdfs fsck / 2>/dev/null | grep -i "under.replicated" | grep -oE '[0-9]+' | head -1)
MISSING_BLOCKS=$(hdfs fsck / 2>/dev/null | grep -i "missing blocks" | grep -oE '[0-9]+' | head -1)
echo "Under-replicated blocks: ${UNDER_REPLICATED:-0}"
echo "Missing blocks: ${MISSING_BLOCKS:-0}"

# 5. Check safe mode
echo -e "\n5. Safe mode:"
hdfs dfsadmin -safemode get

# 6. Raise alerts
if [ "${DEAD_NODES:-0}" -gt 0 ]; then
    echo "ALERT: $DEAD_NODES dead DataNode(s) detected!"
fi

if [ "${MISSING_BLOCKS:-0}" -gt 0 ]; then
    echo "ALERT: $MISSING_BLOCKS missing block(s) detected!"
fi
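To turn the check into continuous monitoring, the script can simply be scheduled from cron (paths here are illustrative):

# crontab -e: run every 30 minutes and append the output to a log file
*/30 * * * * /opt/scripts/hdfs_health_check.sh >> /var/log/hdfs_health_check.log 2>&1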

YARN Resource Monitoring Script

#!/bin/bash
# yarn_monitor.sh (requires python3 and bc)

echo "=== YARN Resource Monitoring Report ==="
echo "Check time: $(date)"

# 1. Fetch cluster metrics from the ResourceManager REST API
CLUSTER_METRICS=$(curl -s "http://hadoop-master:8088/ws/v1/cluster/metrics")

# 2. Parse the JSON payload
TOTAL_MEMORY=$(echo "$CLUSTER_METRICS" | python3 -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['totalMB'])")
USED_MEMORY=$(echo "$CLUSTER_METRICS" | python3 -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['allocatedMB'])")
TOTAL_VCORES=$(echo "$CLUSTER_METRICS" | python3 -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['totalVirtualCores'])")
USED_VCORES=$(echo "$CLUSTER_METRICS" | python3 -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['allocatedVirtualCores'])")
ACTIVE_NODES=$(echo "$CLUSTER_METRICS" | python3 -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['activeNodes'])")
LOST_NODES=$(echo "$CLUSTER_METRICS" | python3 -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['lostNodes'])")

# 3. Compute utilization percentages
MEMORY_USAGE=$(echo "scale=2; $USED_MEMORY * 100 / $TOTAL_MEMORY" | bc)
VCORE_USAGE=$(echo "scale=2; $USED_VCORES * 100 / $TOTAL_VCORES" | bc)

# 4. Print the report
echo -e "\nResource usage:"
echo "Memory used: $USED_MEMORY MB / $TOTAL_MEMORY MB ($MEMORY_USAGE%)"
echo "vCores used: $USED_VCORES / $TOTAL_VCORES ($VCORE_USAGE%)"
echo "Active nodes: $ACTIVE_NODES"
echo "Lost nodes: $LOST_NODES"

# 5. Check running applications
RUNNING_APPS=$(echo "$CLUSTER_METRICS" | python3 -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['appsRunning'])")
PENDING_APPS=$(echo "$CLUSTER_METRICS" | python3 -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['appsPending'])")

echo -e "\nApplication status:"
echo "Running applications: $RUNNING_APPS"
echo "Pending applications: $PENDING_APPS"

# 6. Alert checks
if (( $(echo "$MEMORY_USAGE > 85" | bc -l) )); then
    echo "ALERT: memory usage is above 85%!"
fi

if [ "$LOST_NODES" -gt 0 ]; then
    echo "ALERT: $LOST_NODES lost node(s) detected!"
fi
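If jq is available on the monitoring host, the same ResourceManager metrics can be extracted in one call instead of one python invocation per field (a sketch using the field names above):

curl -s "http://hadoop-master:8088/ws/v1/cluster/metrics" \
  | jq '.clusterMetrics | {totalMB, allocatedMB, totalVirtualCores, allocatedVirtualCores, activeNodes, lostNodes, appsRunning, appsPending}'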

4.3 Monitoring with Ganglia

Ganglia Installation and Configuration

# Install on the monitoring server
yum install -y ganglia-gmetad ganglia-gmond ganglia-web

# Configure gmetad
cat > /etc/ganglia/gmetad.conf << 'EOF'
data_source "hadoop-cluster" hadoop-master hadoop-slave1 hadoop-slave2
setuid_username "nobody"
EOF

# Configure gmond (on all nodes)
cat > /etc/ganglia/gmond.conf << 'EOF'
cluster {
  name = "hadoop-cluster"
  owner = "hadoop"
  latlong = "unspecified"
  url = "unspecified"
}

udp_send_channel {
  host = hadoop-master
  port = 8649
  ttl = 1
}

udp_recv_channel {
  port = 8649
}

tcp_accept_channel {
  port = 8649
}
EOF

# Start the services
systemctl start gmond
systemctl start gmetad
systemctl start httpd

Hadoop Ganglia Configuration

# hadoop-metrics2.properties
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10

namenode.sink.ganglia.servers=hadoop-master:8649
datanode.sink.ganglia.servers=hadoop-master:8649
resourcemanager.sink.ganglia.servers=hadoop-master:8649
nodemanager.sink.ganglia.servers=hadoop-master:8649
mrappmaster.sink.ganglia.servers=hadoop-master:8649

5. Backup and Recovery Strategies

5.1 HDFS Snapshot Management

Enable Snapshots

# Allow snapshots on a directory
hdfs dfsadmin -allowSnapshot /user/important_data

# Create a snapshot
hdfs dfs -createSnapshot /user/important_data backup_$(date +%Y%m%d)

# List snapshots
hdfs dfs -ls /user/important_data/.snapshot

# Delete a snapshot
hdfs dfs -deleteSnapshot /user/important_data backup_20231201

# Restore a file from a snapshot
hdfs dfs -cp /user/important_data/.snapshot/backup_20231201/lost_file.txt /user/important_data/
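Before restoring anything, hdfs snapshotDiff shows exactly what changed between two snapshots of the same directory (snapshot names here are examples):

# Lists created (+), deleted (-), modified (M) and renamed (R) entries
hdfs snapshotDiff /user/important_data backup_20231201 backup_20231202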

Automated Snapshot Script

#!/bin/bash
# hdfs_snapshot_manager.sh

SNAPSHOT_DIRS=("/user/important_data" "/data/warehouse" "/logs")
RETENTION_DAYS=7

for DIR in "${SNAPSHOT_DIRS[@]}"; do
    # Create today's snapshot
    SNAPSHOT_NAME="snapshot_$(date +%Y%m%d)"
    hdfs dfs -createSnapshot "$DIR" "$SNAPSHOT_NAME"

    # Remove expired snapshots
    hdfs dfs -ls "$DIR/.snapshot" | grep -o "snapshot_[0-9]*" | while read SNAP; do
        SNAP_DATE=$(echo "$SNAP" | grep -o "[0-9]*")
        CURRENT_TS=$(date +%s)
        SNAP_TS=$(date -d "$SNAP_DATE" +%s 2>/dev/null || echo 0)
        DAYS_OLD=$(( (CURRENT_TS - SNAP_TS) / 86400 ))

        if [ "$DAYS_OLD" -gt "$RETENTION_DAYS" ]; then
            hdfs dfs -deleteSnapshot "$DIR" "$SNAP"
            echo "Deleted expired snapshot: $DIR/$SNAP"
        fi
    done
done

5.2 Cross-Cluster Backup with DistCp

Data Synchronization Between Clusters

# Basic DistCp copy
hadoop distcp \
    hdfs://cluster1:8020/user/data \
    hdfs://cluster2:8020/user/data

# Incremental sync
hadoop distcp \
    -update \
    -delete \
    hdfs://cluster1:8020/user/data \
    hdfs://cluster2:8020/user/data

# Snapshot-based incremental sync (-diff must be combined with -update)
hadoop distcp \
    -update \
    -diff snapshot1 snapshot2 \
    hdfs://cluster1:8020/user/data \
    hdfs://cluster2:8020/user/data

# Bandwidth-limited sync (per-map bandwidth in MB/s, 20 maps)
hadoop distcp \
    -bandwidth 100 \
    -m 20 \
    hdfs://cluster1:8020/user/data \
    hdfs://cluster2:8020/user/data
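A typical snapshot-driven workflow ties the two features together: take matching snapshots on both sides after the initial copy, then let -diff transfer only the delta. A sketch with illustrative paths and snapshot names; both directories must already be snapshottable (hdfs dfsadmin -allowSnapshot) and the target must stay unmodified between runs:

# Initial full copy plus a matching snapshot s1 on both clusters
hdfs dfs -createSnapshot hdfs://cluster1:8020/user/data s1
hadoop distcp hdfs://cluster1:8020/user/data hdfs://cluster2:8020/user/data
hdfs dfs -createSnapshot hdfs://cluster2:8020/user/data s1

# ... new data arrives on the source ...

# Incremental pass: snapshot s2 on the source, copy only the s1 -> s2 delta
hdfs dfs -createSnapshot hdfs://cluster1:8020/user/data s2
hadoop distcp -update -diff s1 s2 \
    hdfs://cluster1:8020/user/data \
    hdfs://cluster2:8020/user/data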

Automated Backup Script

#!/bin/bash
# cross_cluster_backup.sh

SOURCE_CLUSTER="hdfs://cluster1:8020"
TARGET_CLUSTER="hdfs://cluster2:8020"
BACKUP_PATHS=("/user/important" "/data/warehouse")

LOG_FILE="/var/log/hadoop_backup.log"

{
    echo "=== Cross-cluster backup started $(date) ==="

    # Note: do not name the loop variable PATH, or the shell's command lookup breaks
    for BACKUP_PATH in "${BACKUP_PATHS[@]}"; do
        echo "Backing up path: $BACKUP_PATH"

        hadoop distcp \
            -update \
            -delete \
            -m 50 \
            -bandwidth 50 \
            "${SOURCE_CLUSTER}${BACKUP_PATH}" \
            "${TARGET_CLUSTER}${BACKUP_PATH}"

        if [ $? -eq 0 ]; then
            echo "Backup succeeded: $BACKUP_PATH"
        else
            echo "Backup failed: $BACKUP_PATH"
            exit 1
        fi
    done

    echo "=== Backup finished $(date) ==="
} >> "$LOG_FILE" 2>&1

5.3 Metadata Backup and Recovery

NameNode Metadata Backup

#!/bin/bash
# namenode_metadata_backup.sh

# Backup directory
BACKUP_DIR="/backup/hadoop/namenode"
DATE=$(date +%Y%m%d_%H%M%S)

# Create the backup directory
mkdir -p "$BACKUP_DIR/$DATE"

# Enter safe mode
hdfs dfsadmin -safemode enter

# Save the namespace image
hdfs dfsadmin -saveNamespace

# Copy the metadata files
cp -r /data/hadoop/namenode/current "$BACKUP_DIR/$DATE/"

# Leave safe mode
hdfs dfsadmin -safemode leave

# Remove backups older than 7 days (only the dated subdirectories)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} \;

echo "NameNode metadata backup finished: $BACKUP_DIR/$DATE"

Metadata Recovery Procedure

#!/bin/bash
# namenode_metadata_restore.sh

RESTORE_DIR="/backup/hadoop/namenode/latest_backup"

# Stop the NameNode
hdfs --daemon stop namenode

# Clear the current metadata
rm -rf /data/hadoop/namenode/current/*

# Restore from the backup
cp -r "$RESTORE_DIR/current"/* /data/hadoop/namenode/current/

# Start the NameNode
hdfs --daemon start namenode

echo "NameNode metadata recovery finished"

6. Security Auditing and Compliance

6.1 HDFS Audit Logging

Enable Audit Logging

<!-- hdfs-site.xml -->
<property>
    <name>dfs.namenode.audit.loggers</name>
    <value>default</value>
</property>
<property>
    <name>dfs.namenode.audit.log.async</name>
    <value>true</value>
</property>
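Whether the audit events end up in their own file is decided by log4j rather than hdfs-site.xml. Hadoop's stock log4j.properties ships an RFAAUDIT rolling-file appender driven by the hdfs.audit.logger property, so routing the audit stream to it is usually one line in hadoop-env.sh (a sketch; verify the appender name against the log4j.properties in your distribution):

# hadoop-env.sh: send FSNamesystem.audit events to the RFAAUDIT appender,
# which writes ${hadoop.log.dir}/hdfs-audit.log by default
export HDFS_NAMENODE_OPTS="$HDFS_NAMENODE_OPTS -Dhdfs.audit.logger=INFO,RFAAUDIT"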

Audit Log Analysis Script

#!/bin/bash
# hdfs_audit_analyzer.sh
# The default HDFS audit log format is tab-separated key=value pairs, e.g.
# allowed=true  ugi=alice (auth:KERBEROS)  ip=/10.0.0.5  cmd=delete  src=/tmp/x  dst=null  perm=...

AUDIT_LOG="/var/log/hadoop/hdfs/audit.log"
REPORT_FILE="/tmp/hdfs_audit_report_$(date +%Y%m%d).txt"

{
    echo "=== HDFS Audit Analysis Report $(date) ==="

    # Operation type statistics
    echo -e "\n1. Operations by type:"
    grep -o 'cmd=[^[:space:]]*' "$AUDIT_LOG" | sort | uniq -c | sort -nr

    # User activity statistics
    echo -e "\n2. Activity by user:"
    grep -o 'ugi=[^[:space:]]*' "$AUDIT_LOG" | sort | uniq -c | sort -nr

    # Most frequently accessed paths
    echo -e "\n3. Most active paths:"
    grep -o 'src=[^[:space:]]*' "$AUDIT_LOG" | sort | uniq -c | sort -nr | head -10

    # Suspicious activity checks
    echo -e "\n4. Suspicious activity:"

    # Large number of delete operations
    DEL_COUNT=$(grep 'cmd=delete' "$AUDIT_LOG" | wc -l)
    if [ "$DEL_COUNT" -gt 100 ]; then
        echo "WARNING: large number of delete operations detected ($DEL_COUNT)"
    fi

    # Denied operations
    DENIED_COUNT=$(grep 'allowed=false' "$AUDIT_LOG" | wc -l)
    if [ "$DENIED_COUNT" -gt 50 ]; then
        echo "WARNING: large number of denied operations detected ($DENIED_COUNT)"
    fi

} > "$REPORT_FILE"

echo "Audit report written to: $REPORT_FILE"

6.2 YARN Application Auditing

Application Audit Configuration

# Enable YARN audit logging
echo 'export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -Dyarn.resourcemanager.audit.logger=INFO,RFAUDIT"' >> $HADOOP_HOME/etc/hadoop/yarn-env.sh

# Audit log location
# /var/log/hadoop-yarn/audit/resourcemanager-audit.log

Summary

Through this article you have covered:

  • ✅ Kerberos authentication principles and configuration
  • ✅ HDFS high-availability architecture and failover
  • ✅ NameNode Federation configuration
  • ✅ Building cluster monitoring and alerting
  • ✅ Data backup and recovery strategies
  • ✅ Security auditing and compliance management

Key Takeaways

  1. Kerberos: the standard for enterprise-grade authentication
  2. HA architecture: keeps the NameNode service continuously available
  3. Federation: scales the namespace horizontally
  4. Monitoring and alerting: real-time visibility into cluster health
  5. Backup and recovery: the safety net for your data
  6. Security auditing: meets compliance requirements

Coming next: "Hadoop Cluster Operations and Performance Tuning in Production" will take you into hands-on, enterprise-grade Hadoop operations!


Hands-On Practice

# Exercises
1. Set up a Kerberos-secured cluster
2. Configure an HDFS high-availability environment
3. Deploy a monitoring and alerting system
4. Implement a data backup strategy
5. Analyze security audit logs
6. Test the failover procedure
