1. Kerberos Security Authentication
1.1 Kerberos Principles and Architecture
Kerberos authentication flow
Client → AS: request a TGT
AS → Client: TGT plus session key (the session key is encrypted with the client's long-term key; the TGT itself is encrypted with the TGS/krbtgt key)
Client → TGS: request a service ticket (presenting the TGT)
TGS → Client: service ticket
Client → Service: access the service (presenting the service ticket)
KDC components
- AS (Authentication Server): issues TGTs
- TGS (Ticket Granting Server): issues service tickets
- Database: stores all principals and their keys
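The same exchange can be observed from a client shell once the realm configured below is in place. This is a minimal sketch using the standard MIT client tools (kinit, kvno, klist); the test principal name is an assumption:
# AS exchange: obtain a TGT for a (hypothetical) test user
kinit testuser@HADOOP.COM
# TGS exchange: request a service ticket for the HDFS service principal
kvno hdfs/hadoop-master@HADOOP.COM
# The ticket cache now holds both the TGT and the service ticket
klist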
1.2 Installing and Configuring Kerberos
Install the KDC server
# Install the KDC and client packages on the master node
yum install -y krb5-server krb5-workstation krb5-libs
# Write the Kerberos client/library configuration
cat > /etc/krb5.conf << 'EOF'
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = HADOOP.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
default_tgs_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
default_tkt_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
permitted_enctypes = aes256-cts-hmac-sha1-96 aes128-cts-hmac-sha1-96
[realms]
HADOOP.COM = {
  kdc = hadoop-master:88
  admin_server = hadoop-master:749
  default_domain = hadoop.com
}
[domain_realm]
.hadoop.com = HADOOP.COM
hadoop.com = HADOOP.COM
EOF
# Write the KDC configuration
cat > /var/kerberos/krb5kdc/kdc.conf << 'EOF'
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
HADOOP.COM = {
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes256-cts:normal aes128-cts:normal
  max_life = 24h
  max_renewable_life = 10d
}
EOF
# Create the kadmin ACL file
cat > /var/kerberos/krb5kdc/kadm5.acl << 'EOF'
*/admin@HADOOP.COM *
EOF
# Create the Kerberos database
kdb5_util create -s -P hadoop123
# Start and enable the Kerberos services
systemctl start krb5kdc
systemctl start kadmin
systemctl enable krb5kdc kadmin
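A quick sanity check, assuming the steps above completed without errors, is to confirm the daemons are up and the database answers queries:
# KDC and kadmin daemons should be active and listening on 88/749
systemctl status krb5kdc kadmin --no-pager
ss -lntu | grep -E ':88|:749'
# The database should answer kadmin.local queries
kadmin.local -q "listprincs"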
Create Hadoop principals
# Create an admin principal and the Hadoop service principals with kadmin.local
kadmin.local -q "addprinc root/admin"
kadmin.local -q "addprinc -randkey hdfs/hadoop-master@HADOOP.COM"
kadmin.local -q "addprinc -randkey yarn/hadoop-master@HADOOP.COM"
kadmin.local -q "addprinc -randkey mapred/hadoop-master@HADOOP.COM"
kadmin.local -q "addprinc -randkey HTTP/hadoop-master@HADOOP.COM"
# Create principals for the DataNode hosts
kadmin.local -q "addprinc -randkey hdfs/hadoop-slave1@HADOOP.COM"
kadmin.local -q "addprinc -randkey yarn/hadoop-slave1@HADOOP.COM"
kadmin.local -q "addprinc -randkey HTTP/hadoop-slave1@HADOOP.COM"
# Export keytab files (create /etc/security/keytabs first if it does not exist)
kadmin.local -q "xst -k /etc/security/keytabs/hdfs.keytab hdfs/hadoop-master hdfs/hadoop-slave1 HTTP/hadoop-master HTTP/hadoop-slave1"
kadmin.local -q "xst -k /etc/security/keytabs/yarn.keytab yarn/hadoop-master yarn/hadoop-slave1 HTTP/hadoop-master HTTP/hadoop-slave1"
kadmin.local -q "xst -k /etc/security/keytabs/mapred.keytab mapred/hadoop-master HTTP/hadoop-master"
# Restrict keytab ownership and permissions
chown hdfs:hadoop /etc/security/keytabs/hdfs.keytab
chown yarn:hadoop /etc/security/keytabs/yarn.keytab
chown mapred:hadoop /etc/security/keytabs/mapred.keytab
chmod 400 /etc/security/keytabs/*.keytab
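Before distributing the keytabs, it is worth verifying that each one actually authenticates. A minimal check using the principals exported above:
# List the principals and key versions stored in the keytab
klist -kt /etc/security/keytabs/hdfs.keytab
# Obtain a TGT non-interactively with the keytab, then discard it
kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/hadoop-master@HADOOP.COM
klist
kdestroy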
1.3 Hadoop Kerberos Configuration
core-site.xml configuration
<configuration>
<!-- Kerberos security settings -->
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1@$0](.*@HADOOP.COM)s/.*/hadoop/
DEFAULT
</value>
</property>
</configuration>
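The auth_to_local rules can be tested without restarting anything; recent Hadoop releases ship a hadoop kerbname helper that prints the local user a principal maps to. A sketch against the rule above (alice is a hypothetical user):
# Two-component principals in HADOOP.COM are mapped to the local user "hadoop"
hadoop kerbname hdfs/hadoop-master@HADOOP.COM
# One-component principals fall through to the DEFAULT rule
hadoop kerbname alice@HADOOP.COM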
hdfs-site.xml configuration
<configuration>
<!-- NameNode security settings -->
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>hdfs/_HOST@HADOOP.COM</value>
</property>
<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/hdfs.keytab</value>
</property>
<!-- DataNode security settings -->
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>hdfs/_HOST@HADOOP.COM</value>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/hdfs.keytab</value>
</property>
<!-- Web UI security settings -->
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>HTTP/_HOST@HADOOP.COM</value>
</property>
<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/security/keytabs/hdfs.keytab</value>
</property>
</configuration>
yarn-site.xml configuration
<configuration>
<!-- ResourceManager security settings -->
<property>
<name>yarn.resourcemanager.principal</name>
<value>yarn/_HOST@HADOOP.COM</value>
</property>
<property>
<name>yarn.resourcemanager.keytab</name>
<value>/etc/security/keytabs/yarn.keytab</value>
</property>
<!-- NodeManager security settings -->
<property>
<name>yarn.nodemanager.principal</name>
<value>yarn/_HOST@HADOOP.COM</value>
</property>
<property>
<name>yarn.nodemanager.keytab</name>
<value>/etc/security/keytabs/yarn.keytab</value>
</property>
<!-- Timeline service security settings -->
<property>
<name>yarn.timeline-service.principal</name>
<value>yarn/_HOST@HADOOP.COM</value>
</property>
<property>
<name>yarn.timeline-service.keytab</name>
<value>/etc/security/keytabs/yarn.keytab</value>
</property>
<!-- Web UI security settings -->
<property>
<name>yarn.webapp.principal</name>
<value>HTTP/_HOST@HADOOP.COM</value>
</property>
<property>
<name>yarn.webapp.keytab</name>
<value>/etc/security/keytabs/yarn.keytab</value>
</property>
</configuration>
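After distributing these files and restarting HDFS and YARN, a quick end-to-end check is to compare an unauthenticated and an authenticated request. A sketch, assuming the keytabs created earlier:
# Without a ticket, RPC calls should be rejected with a GSS/SASL error
kdestroy
hdfs dfs -ls /
# With a valid ticket, the same call succeeds and the NodeManagers are visible
kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/hadoop-master@HADOOP.COM
hdfs dfs -ls /
yarn node -list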
2. HDFS High Availability
2.1 NameNode HA Architecture
HA architecture
Active NameNode ←→ Standby NameNode
↓ ↓
JournalNodes (QJM) ←→ Zookeeper
↓
DataNodes
2.2 QJM-based HA Configuration
Configure the JournalNodes
# Run on every JournalNode host
mkdir -p /data/hadoop/journalnode
chown -R hdfs:hadoop /data/hadoop/journalnode
# Start the JournalNode daemon
hdfs --daemon start journalnode
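A quick check that each JournalNode came up (8485 is the default RPC port, 8480 the default HTTP port):
jps | grep JournalNode
ss -lnt | grep -E ':8485|:8480'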
hdfs-site.xml HA configuration
<configuration>
<!-- HA settings -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<!-- NameNode RPC addresses -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>hadoop-master:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>hadoop-slave1:8020</value>
</property>
<!-- NameNode HTTP addresses -->
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>hadoop-master:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>hadoop-slave1:9870</value>
</property>
<!-- JournalNode quorum -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop-master:8485;hadoop-slave1:8485;hadoop-slave2:8485/mycluster</value>
</property>
<!-- Client failover proxy provider -->
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Automatic failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- SSH fencing -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hdfs/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
</configuration>
core-site.xml HA configuration
<configuration>
<!-- Use the HA nameservice as the default filesystem -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<!-- ZooKeeper quorum for automatic failover -->
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop-master:2181,hadoop-slave1:2181,hadoop-slave2:2181</value>
</property>
</configuration>
2.3 ZooKeeper Configuration
Install ZooKeeper
# Run on every ZooKeeper node
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.6.3/apache-zookeeper-3.6.3-bin.tar.gz
tar -xzf apache-zookeeper-3.6.3-bin.tar.gz -C /usr/local/
cd /usr/local && ln -s apache-zookeeper-3.6.3-bin zookeeper
# Set environment variables
echo 'export ZOOKEEPER_HOME=/usr/local/zookeeper' >> /etc/profile
echo 'export PATH=$ZOOKEEPER_HOME/bin:$PATH' >> /etc/profile
source /etc/profile
# Create the data directories and configuration file
mkdir -p /data/zookeeper/data
mkdir -p /data/zookeeper/logs
cat > $ZOOKEEPER_HOME/conf/zoo.cfg << 'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/logs
clientPort=2181
maxClientCnxns=60
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
server.1=hadoop-master:2888:3888
server.2=hadoop-slave1:2888:3888
server.3=hadoop-slave2:2888:3888
EOF
# Create the myid file (a different id on each node)
# hadoop-master:
echo "1" > /data/zookeeper/data/myid
# hadoop-slave1:
echo "2" > /data/zookeeper/data/myid
# hadoop-slave2:
echo "3" > /data/zookeeper/data/myid
# Start ZooKeeper
zkServer.sh start
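Once all three nodes are started, the ensemble can be verified: one node should report itself as leader and the others as followers. The srvr four-letter command is on the default whitelist in ZooKeeper 3.5+:
# Local check on each node
zkServer.sh status
# Remote check via the four-letter-word interface
echo srvr | nc hadoop-master 2181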
2.4 HA Cluster Initialization and Management
Initialize the HA cluster
# 1. Initialize the HA state in ZooKeeper (ZooKeeper must already be running)
hdfs zkfc -formatZK
# 2. Start the JournalNodes (on every JournalNode host)
hdfs --daemon start journalnode
# 3. Format the first NameNode (on nn1 only)
hdfs namenode -format -clusterId mycluster
# 4. Start the first NameNode
hdfs --daemon start namenode
# 5. Sync the metadata to the second NameNode (run on nn2)
hdfs namenode -bootstrapStandby
# 6. Start the second NameNode (on nn2)
hdfs --daemon start namenode
# 7. Start the DataNodes (on every DataNode host)
hdfs --daemon start datanode
# 8. Start a ZKFC on both NameNode hosts
hdfs --daemon start zkfc
HA management commands
# Check NameNode state
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Manual failover
hdfs haadmin -failover nn1 nn2
# Transition to Active (requires --forcemanual when automatic failover is enabled)
hdfs haadmin -transitionToActive nn1
# Transition to Standby
hdfs haadmin -transitionToStandby nn2
# Check NameNode health
hdfs haadmin -checkHealth nn1
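A simple failover drill ties these commands together: note which NameNode is active, stop its process, and watch the standby take over (ZKFC plus the sshfence method handle the transition). A sketch:
# Record the current states
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# On the active NameNode's host, stop the process to simulate a crash
# hdfs --daemon stop namenode
# Watch the standby become active
watch -n 5 'hdfs haadmin -getServiceState nn1; hdfs haadmin -getServiceState nn2'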
3. NameNode Federation
3.1 Federation Architecture
Federation benefits
- Namespace scaling: multiple NameNodes manage separate namespaces
- Higher throughput: load is spread across several NameNodes
- Isolation: different workloads can be confined to different namespaces
Federation architecture
NameNode1 (ns1) NameNode2 (ns2)
↓ ↓
Block Pool 1 Block Pool 2
↓ ↓
DataNode1 (BP1+BP2) DataNode2 (BP1+BP2)
3.2 Federation Configuration
Configure multiple nameservices
<!-- hdfs-site.xml -->
<configuration>
<!-- Define the nameservices -->
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<!-- First nameservice -->
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>hadoop-master:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.ns1</name>
<value>hadoop-master:9870</value>
</property>
<!-- Second nameservice -->
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>hadoop-slave1:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.ns2</name>
<value>hadoop-slave1:9870</value>
</property>
<!-- DataNode storage directories (shared by all block pools) -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/data1/hadoop/datanode,/data2/hadoop/datanode</value>
</property>
</configuration>
core-site.xml configuration
<configuration>
<!-- Default nameservice -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://ns1</value>
</property>
<!-- ViewFS mount table (optional) -->
<property>
<name>fs.viewfs.mounttable.default.link./data</name>
<value>hdfs://ns1/data</value>
</property>
<property>
<name>fs.viewfs.mounttable.default.link./logs</name>
<value>hdfs://ns2/logs</value>
</property>
</configuration>
3.3 Federation Management
Initialize the Federation
# Format the first NameNode
hdfs namenode -format -clusterId myfederation
# Start the first NameNode
hdfs --daemon start namenode
# Format the second NameNode with the same cluster ID (this creates the second namespace)
hdfs namenode -format -clusterId myfederation
# Start the second NameNode
hdfs --daemon start namenode
# Start the DataNodes (each DataNode registers with both NameNodes)
hdfs --daemon start datanode
Working with different namespaces
# Access the first namespace
hdfs dfs -ls hdfs://ns1/
# Access the second namespace
hdfs dfs -ls hdfs://ns2/
# Create directories in a specific namespace
hdfs dfs -mkdir hdfs://ns1/data
hdfs dfs -mkdir hdfs://ns2/logs
# View block pool and DataNode information
hdfs dfsadmin -report
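dfsadmin reports are scoped to a single nameservice, so the -fs generic option is a convenient way to inspect each block pool separately; a sketch:
# Capacity and DataNode view as seen by ns1 and ns2
hdfs dfsadmin -fs hdfs://ns1 -report | head -20
hdfs dfsadmin -fs hdfs://ns2 -report | head -20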
4. Cluster Monitoring and Alerting
4.1 The Ambari Monitoring Platform
Ambari architecture
Ambari Server → Ambari Agents → Hadoop Services
↓
PostgreSQL (metadata store)
↓
Web UI (monitoring console)
Ambari installation
# Add the Ambari yum repository
wget -O /etc/yum.repos.d/ambari.repo http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.5.0/ambari.repo
# Install Ambari Server
yum install -y ambari-server
# Install Ambari Agent (on every node)
yum install -y ambari-agent
# Run the Ambari Server setup wizard
ambari-server setup
# Start Ambari
ambari-server start
ambari-agent start
4.2 Custom Monitoring Scripts
HDFS health check script
#!/bin/bash
# hdfs_health_check.sh
echo "=== HDFS Health Check Report ==="
echo "Checked at: $(date)"
# 1. NameNode state
echo -e "\n1. NameNode state:"
hdfs haadmin -getServiceState nn1 2>/dev/null || echo "NameNode1: unreachable"
hdfs haadmin -getServiceState nn2 2>/dev/null || echo "NameNode2: unreachable"
# 2. HDFS capacity usage
echo -e "\n2. HDFS capacity usage:"
hdfs dfsadmin -report | grep -E "Configured Capacity|Present Capacity|DFS Used|DFS Remaining"
# 3. DataNode state (extract the counts from the "Live/Dead datanodes (N):" summary lines)
echo -e "\n3. DataNode state:"
LIVE_NODES=$(hdfs dfsadmin -report | grep "Live datanodes" | grep -o '[0-9]\+' | head -1)
DEAD_NODES=$(hdfs dfsadmin -report | grep "Dead datanodes" | grep -o '[0-9]\+' | head -1)
echo "Live nodes: ${LIVE_NODES:-0}"
echo "Dead nodes: ${DEAD_NODES:-0}"
# 4. Block state (taken from the fsck summary)
echo -e "\n4. Block state:"
UNDER_REPLICATED=$(hdfs fsck / 2>/dev/null | grep -i "Under[- ]replicated blocks" | awk '{print $3}')
MISSING_BLOCKS=$(hdfs fsck / 2>/dev/null | grep -i "Missing blocks" | head -1 | awk '{print $3}')
echo "Under-replicated blocks: ${UNDER_REPLICATED:-0}"
echo "Missing blocks: ${MISSING_BLOCKS:-0}"
# 5. Safe mode
echo -e "\n5. Safe mode:"
hdfs dfsadmin -safemode get
# 6. Alerts
if [ "${DEAD_NODES:-0}" -gt 0 ]; then
    echo "ALERT: $DEAD_NODES dead DataNode(s) detected!"
fi
if [ "${MISSING_BLOCKS:-0}" -gt 0 ]; then
    echo "ALERT: $MISSING_BLOCKS missing block(s) detected!"
fi
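To be useful the check has to run unattended; a minimal sketch that appends it to the current user's crontab every 10 minutes (the script path and log file are illustrative):
( crontab -l 2>/dev/null; \
  echo '*/10 * * * * /opt/scripts/hdfs_health_check.sh >> /var/log/hadoop/hdfs_health.log 2>&1' ) | crontab -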
YARN resource monitoring script
#!/bin/bash
# yarn_monitor.sh
echo "=== YARN Resource Monitoring Report ==="
echo "Checked at: $(date)"
# 1. Fetch cluster metrics from the ResourceManager REST API
CLUSTER_METRICS=$(curl -s "http://hadoop-master:8088/ws/v1/cluster/metrics")
# 2. Parse the JSON response
TOTAL_MEMORY=$(echo $CLUSTER_METRICS | python -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['totalMB'])")
USED_MEMORY=$(echo $CLUSTER_METRICS | python -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['allocatedMB'])")
TOTAL_VCORES=$(echo $CLUSTER_METRICS | python -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['totalVirtualCores'])")
USED_VCORES=$(echo $CLUSTER_METRICS | python -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['allocatedVirtualCores'])")
ACTIVE_NODES=$(echo $CLUSTER_METRICS | python -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['activeNodes'])")
LOST_NODES=$(echo $CLUSTER_METRICS | python -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['lostNodes'])")
# 3. Compute utilization percentages
MEMORY_USAGE=$(echo "scale=2; $USED_MEMORY * 100 / $TOTAL_MEMORY" | bc)
VCORE_USAGE=$(echo "scale=2; $USED_VCORES * 100 / $TOTAL_VCORES" | bc)
# 4. Print the report
echo -e "\nResource usage:"
echo "Memory: $USED_MEMORY MB / $TOTAL_MEMORY MB ($MEMORY_USAGE%)"
echo "vCores: $USED_VCORES / $TOTAL_VCORES ($VCORE_USAGE%)"
echo "Active nodes: $ACTIVE_NODES"
echo "Lost nodes: $LOST_NODES"
# 5. Check application counts
RUNNING_APPS=$(echo $CLUSTER_METRICS | python -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['appsRunning'])")
PENDING_APPS=$(echo $CLUSTER_METRICS | python -c "import json,sys; obj=json.load(sys.stdin); print(obj['clusterMetrics']['appsPending'])")
echo -e "\nApplication status:"
echo "Running applications: $RUNNING_APPS"
echo "Pending applications: $PENDING_APPS"
# 6. Alerts
if (( $(echo "$MEMORY_USAGE > 85" | bc -l) )); then
    echo "ALERT: memory utilization is above 85%!"
fi
if [ "$LOST_NODES" -gt 0 ]; then
    echo "ALERT: $LOST_NODES lost NodeManager(s) detected!"
fi
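If jq is available, the repeated python invocations above can be collapsed into a single pass over the same /ws/v1/cluster/metrics response; a sketch:
curl -s "http://hadoop-master:8088/ws/v1/cluster/metrics" | jq -r '
  .clusterMetrics |
  "Memory: \(.allocatedMB)/\(.totalMB) MB",
  "vCores: \(.allocatedVirtualCores)/\(.totalVirtualCores)",
  "Nodes:  active=\(.activeNodes) lost=\(.lostNodes)",
  "Apps:   running=\(.appsRunning) pending=\(.appsPending)"'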
4.3 Monitoring with Ganglia
Ganglia installation and configuration
# Install on the monitoring server
yum install -y ganglia-gmetad ganglia-gmond ganglia-web
# Configure gmetad
cat > /etc/ganglia/gmetad.conf << 'EOF'
data_source "hadoop-cluster" hadoop-master hadoop-slave1 hadoop-slave2
setuid_username "nobody"
EOF
# Configure gmond (on every node)
cat > /etc/ganglia/gmond.conf << 'EOF'
cluster {
name = "hadoop-cluster"
owner = "hadoop"
latlong = "unspecified"
url = "unspecified"
}
udp_send_channel {
host = hadoop-master
port = 8649
ttl = 1
}
udp_recv_channel {
port = 8649
}
tcp_accept_channel {
port = 8649
}
EOF
# Start the services
systemctl start gmond
systemctl start gmetad
systemctl start httpd
Hadoop Ganglia metrics configuration
# hadoop-metrics2.properties
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
namenode.sink.ganglia.servers=hadoop-master:8649
datanode.sink.ganglia.servers=hadoop-master:8649
resourcemanager.sink.ganglia.servers=hadoop-master:8649
nodemanager.sink.ganglia.servers=hadoop-master:8649
mrappmaster.sink.ganglia.servers=hadoop-master:8649
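After distributing hadoop-metrics2.properties and restarting the daemons, gmond's TCP channel should include Hadoop metric groups in its XML dump; a rough check, assuming nc is installed:
nc hadoop-master 8649 | grep -o 'METRIC NAME="[^"]*"' | grep -ciE 'dfs|rpc|jvm'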
5. Backup and Recovery Strategies
5.1 HDFS Snapshot Management
Enable snapshots
# Allow snapshots on a directory
hdfs dfsadmin -allowSnapshot /user/important_data
# Create a snapshot
hdfs dfs -createSnapshot /user/important_data backup_$(date +%Y%m%d)
# List snapshots
hdfs dfs -ls /user/important_data/.snapshot
# Delete a snapshot
hdfs dfs -deleteSnapshot /user/important_data backup_20231201
# Restore a file from a snapshot
hdfs dfs -cp /user/important_data/.snapshot/backup_20231201/lost_file.txt /user/important_data/
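Two snapshots can also be compared directly, which is handy before deciding what to restore; a sketch assuming a second snapshot named backup_20231202 exists:
# Output lines are prefixed with + (created), - (deleted), M (modified), R (renamed)
hdfs snapshotDiff /user/important_data backup_20231201 backup_20231202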
Automated snapshot script
#!/bin/bash
# hdfs_snapshot_manager.sh
SNAPSHOT_DIRS=("/user/important_data" "/data/warehouse" "/logs")
RETENTION_DAYS=7
for DIR in "${SNAPSHOT_DIRS[@]}"; do
    # Create today's snapshot
    SNAPSHOT_NAME="snapshot_$(date +%Y%m%d)"
    hdfs dfs -createSnapshot "$DIR" "$SNAPSHOT_NAME"
    # Remove snapshots older than the retention window
    hdfs dfs -ls "$DIR/.snapshot" | grep -o "snapshot_[0-9]*" | while read SNAP; do
        SNAP_DATE=${SNAP#snapshot_}
        CURRENT_TS=$(date +%s)
        SNAP_TS=$(date -d "$SNAP_DATE" +%s 2>/dev/null || echo 0)
        DAYS_OLD=$(( (CURRENT_TS - SNAP_TS) / 86400 ))
        if [ "$DAYS_OLD" -gt "$RETENTION_DAYS" ]; then
            hdfs dfs -deleteSnapshot "$DIR" "$SNAP"
            echo "Deleted expired snapshot: $DIR/$SNAP"
        fi
    done
done
5.2 Cross-Cluster Backup with DistCp
Synchronizing data between clusters
# Basic DistCp copy
hadoop distcp \
hdfs://cluster1:8020/user/data \
hdfs://cluster2:8020/user/data
# Incremental sync (update changed files, remove files that no longer exist on the source)
hadoop distcp \
-update \
-delete \
hdfs://cluster1:8020/user/data \
hdfs://cluster2:8020/user/data
# Snapshot-based incremental sync (-diff is only valid together with -update)
hadoop distcp \
-update \
-diff snapshot1 snapshot2 \
hdfs://cluster1:8020/user/data \
hdfs://cluster2:8020/user/data
# Throttled copy (per-map bandwidth in MB/s, number of map tasks)
hadoop distcp \
-bandwidth 100 \
-m 20 \
hdfs://cluster1:8020/user/data \
hdfs://cluster2:8020/user/data
Automated backup script
#!/bin/bash
# cross_cluster_backup.sh
SOURCE_CLUSTER="hdfs://cluster1:8020"
TARGET_CLUSTER="hdfs://cluster2:8020"
BACKUP_PATHS=("/user/important" "/data/warehouse")
LOG_FILE="/var/log/hadoop_backup.log"
{
echo "=== Cross-cluster backup started $(date) ==="
# Note: do not name the loop variable PATH, which would clobber the shell's search path
for BKP_PATH in "${BACKUP_PATHS[@]}"; do
    echo "Backing up: $BKP_PATH"
    hadoop distcp \
        -update \
        -delete \
        -m 50 \
        -bandwidth 50 \
        "${SOURCE_CLUSTER}${BKP_PATH}" \
        "${TARGET_CLUSTER}${BKP_PATH}"
    if [ $? -eq 0 ]; then
        echo "Backup succeeded: $BKP_PATH"
    else
        echo "Backup failed: $BKP_PATH"
        exit 1
    fi
done
echo "=== Backup finished $(date) ==="
} >> "$LOG_FILE" 2>&1
5.3 Metadata Backup and Recovery
NameNode metadata backup
#!/bin/bash
# namenode_metadata_backup.sh
# Backup directory
BACKUP_DIR="/backup/hadoop/namenode"
DATE=$(date +%Y%m%d_%H%M%S)
# Create the backup directory
mkdir -p "$BACKUP_DIR/$DATE"
# Enter safe mode
hdfs dfsadmin -safemode enter
# Checkpoint the namespace to a fresh fsimage
hdfs dfsadmin -saveNamespace
# Copy the metadata files
cp -r /data/hadoop/namenode/current "$BACKUP_DIR/$DATE/"
# Leave safe mode
hdfs dfsadmin -safemode leave
# Remove backups older than 7 days (-mindepth/-maxdepth keep the top-level directory itself)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} \;
echo "NameNode metadata backup complete: $BACKUP_DIR/$DATE"
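A lighter-weight complement that avoids safe mode is to pull the most recent fsimage straight from the NameNode over HTTP with dfsadmin -fetchImage; it does not capture edit logs, so pair it with the full procedure above. A sketch (directory name is illustrative):
FSIMAGE_DIR="/backup/hadoop/namenode/fsimage_$(date +%Y%m%d)"
mkdir -p "$FSIMAGE_DIR"
hdfs dfsadmin -fetchImage "$FSIMAGE_DIR"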
Metadata recovery procedure
#!/bin/bash
# namenode_metadata_restore.sh
RESTORE_DIR="/backup/hadoop/namenode/latest_backup"
# Stop the NameNode
hdfs --daemon stop namenode
# Clear the current metadata
rm -rf /data/hadoop/namenode/current/*
# Restore from the backup
cp -r "$RESTORE_DIR/current"/* /data/hadoop/namenode/current/
# Start the NameNode
hdfs --daemon start namenode
echo "NameNode metadata recovery complete"
6. Security Auditing and Compliance
6.1 HDFS Audit Logging
Enable the audit log
<!-- hdfs-site.xml -->
<property>
<name>dfs.namenode.audit.loggers</name>
<value>default</value>
</property>
<property>
<name>dfs.namenode.audit.log.async</name>
<value>true</value>
</property>
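After restarting the NameNode, a quick smoke test is to trigger an auditable operation and confirm an entry appears; the log path depends on your log4j configuration, and the one below simply matches the analyzer script that follows:
hdfs dfs -ls /tmp
tail -n 5 /var/log/hadoop/hdfs/audit.log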
Audit log analysis script
#!/bin/bash
# hdfs_audit_analyzer.sh
# Assumes the default NameNode audit format of tab-separated key=value fields, e.g.
# allowed=true  ugi=alice (auth:KERBEROS)  ip=/10.0.0.1  cmd=listStatus  src=/user  dst=null  perm=null
AUDIT_LOG="/var/log/hadoop/hdfs/audit.log"
REPORT_FILE="/tmp/hdfs_audit_report_$(date +%Y%m%d).txt"
{
echo "=== HDFS Audit Analysis Report $(date) ==="
# Count operations by type
echo -e "\n1. Operations by type:"
grep -o 'cmd=[^[:space:]]*' "$AUDIT_LOG" | sort | uniq -c | sort -nr
# Count activity by user
echo -e "\n2. Activity by user:"
grep -o 'ugi=[^[:space:]]*' "$AUDIT_LOG" | sort | uniq -c | sort -nr
# Most frequently accessed paths
echo -e "\n3. Most accessed paths:"
grep -o 'src=[^[:space:]]*' "$AUDIT_LOG" | sort | uniq -c | sort -nr | head -10
# Detect suspicious activity
echo -e "\n4. Suspicious activity:"
# Large numbers of delete operations
DEL_COUNT=$(grep -c 'cmd=delete' "$AUDIT_LOG")
if [ "$DEL_COUNT" -gt 100 ]; then
    echo "WARNING: high volume of delete operations ($DEL_COUNT)"
fi
# Denied operations (allowed=false)
FAILED_COUNT=$(grep -c 'allowed=false' "$AUDIT_LOG")
if [ "$FAILED_COUNT" -gt 50 ]; then
    echo "WARNING: high volume of denied operations ($FAILED_COUNT)"
fi
} > "$REPORT_FILE"
echo "Audit report written to: $REPORT_FILE"
6.2 YARN Application Auditing
Application audit configuration
# Enable ResourceManager audit logging.
# The stock Hadoop log4j.properties defines an rm.audit.logger variable and an
# RMAUDIT rolling-file appender; adjust the names below if your distribution differs.
echo 'export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -Drm.audit.logger=INFO,RMAUDIT"' >> $HADOOP_HOME/etc/hadoop/yarn-env.sh
# Audit log location (depends on the log4j appender configuration), e.g.:
# /var/log/hadoop-yarn/audit/resourcemanager-audit.log
Summary
By the end of this article you should have mastered:
- ✅ Kerberos authentication principles and configuration
- ✅ HDFS high availability and failover
- ✅ NameNode Federation configuration
- ✅ Cluster monitoring and alerting
- ✅ Data backup and recovery strategies
- ✅ Security auditing and compliance
Key takeaways:
- Kerberos: the de facto standard for enterprise authentication
- HA architecture: keeps the NameNode service continuously available
- Federation: scales the namespace horizontally
- Monitoring and alerting: real-time visibility into cluster health
- Backup and recovery: the safety net for your data
- Security auditing: meeting compliance requirements
Coming up next: "Operating and Tuning Hadoop Clusters in Production" takes you into hands-on, enterprise-grade Hadoop operations!
Hands-on suggestions:
# Practice exercises
1. Build a Kerberos-secured cluster
2. Configure an HDFS high-availability environment
3. Deploy a monitoring and alerting stack
4. Implement a data backup strategy
5. Analyze security audit logs
6. Run a failover drill