Configuring and Using Oozie, Flume, and Mahout

-------------------------Oozie--------------------

[I. Deployment]

1) Install the Oozie server
[root@cMaster ~]# sudo yum install oozie    # run as root on cMaster; installs the Oozie server

2) Install the Oozie client
[root@iClient ~]# sudo yum install oozie-client

[II. Configuration files]

3) Edit /etc/oozie/conf/oozie-env.sh
#export CATALINA_BASE=/var/lib/oozie/tomcat-deployment
export CATALINA_BASE=/usr/lib/oozie/oozie-server
# export OOZIE_CONFIG_FILE=oozie-site.xml
export OOZIE_CONFIG=/etc/oozie/conf
# export OOZIE_LOG=${OOZIE_HOME}/logs
export OOZIE_LOG=/var/log/oozie

4) Append the following to /etc/hadoop/conf/core-site.xml:
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>

[III. Database and jar setup]

Restart the Hadoop cluster, then create the database.
$ for x in `cd /etc/init.d; ls hadoop-*`; do service $x restart; done
# ` is the backtick, typed with the key below Esc in English input mode. [Warning: not recommended.] Run this on every machine except iClient.

5) Create the Oozie database schema
[root@cMaster ~]# sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run    # cMaster only
[root@cMaster ~]# mkdir /tmp/ooziesharelib
[root@cMaster ~]# cd /tmp/ooziesharelib
[root@cMaster ~]# tar xzf /usr/lib/oozie/oozie-sharelib-yarn.tar.gz
# [Important] oozie-sharelib-yarn.tar.gz may instead be a directory under /usr/lib/oozie/. In that case, just copy it into /tmp/ooziesharelib.

[IV. Starting the service]

[root@cMaster ~]# sudo service oozie start

One thing to note when running jobs: the argument to -config is a local path.
oozie job -oozie http://cMaster:11000/oozie -config /usr/share/doc/oozie-4.0.0+cdh5.0.0+54/examples/apps/map-reduce/job.properties -run    # absolute local path
oozie job -oozie http://cMaster:11000/oozie -config job.properties -run    # relative local path

HDFS path: job.properties and the other files must exist under /user/<username>/examples/apps/map-reduce/ (here the username is root). That HDFS location is hard-coded.
Local path (relative form): first change into the local directory that contains job.properties, for example /usr/share/doc/oozie-4.0.0+cdh5.0.0+54/examples/apps/map-reduce/.

------------------------- The following is redundant -------------------------

Oozie deployment [21]

Oozie acts as a client of Hadoop, so only one machine in the cluster needs the Oozie server. Any number of clients can connect to Oozie, and each of them needs the Oozie client installed. This section installs the Oozie server on cMaster and the Oozie client on iClient.

1) Install the Oozie server
[root@cMaster ~]# sudo yum install oozie
# Run as root on cMaster; installs the Oozie server

2) Install the Oozie client
[root@iClient ~]# sudo yum install oozie-client

3) Configure Oozie
In /etc/oozie/conf/oozie-env.sh, comment out the original CATALINA_BASE value and set a new one. Pointing it at oozie-server-0.20 makes Oozie support MRv1; pointing it at oozie-server makes it support YARN. Configure both cMaster and iClient, and keep the two identical.
#export CATALINA_BASE=/usr/lib/oozie/oozie-server-0.20
export CATALINA_BASE=/usr/lib/oozie/oozie-server

Add the following between the <configuration> tags in /etc/hadoop/conf/core-site.xml. All 6 machines need this change, and it takes effect only after every Hadoop service in the cluster has been restarted.
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>*</value>
</property>

The command to restart the Hadoop cluster:
$ for x in `cd /etc/init.d; ls hadoop-*`; do service $x restart; done    # run on every machine except iClient

4) Create the Oozie database schema
[root@cMaster ~]# sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run    # cMaster only

5) Set up the Oozie web UI
[root@cMaster ~]# cd /var/lib/oozie/
[root@cMaster oozie]# sudo -u oozie wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip
[root@cMaster oozie]# sudo -u oozie unzip ext-2.2.zip

6) Load the common Oozie jars into HDFS
[root@cMaster ~]# sudo -u hdfs hdfs dfs -mkdir /user/oozie
[root@cMaster ~]# sudo -u hdfs hdfs dfs -chown oozie:oozie /user/oozie
[root@cMaster ~]# mkdir /tmp/ooziesharelib
[root@cMaster ~]# cd /tmp/ooziesharelib
[root@cMaster ~]# tar xzf /usr/lib/oozie/oozie-sharelib-yarn.tar.gz
[root@cMaster ~]# sudo -u oozie hdfs dfs -put share /user/oozie/share

7) Start the Oozie service
[root@cMaster ~]# sudo service oozie start

8) Check the Oozie service
Once Oozie is deployed and started on cMaster, and provided ext-2.2 has been set up, opening "cMaster:11000" in a browser on iClient shows the Oozie web UI. The following command also reports Oozie's status.
[root@iClient ~]# oozie admin -oozie http://cMaster:11000/oozie -status

------------------------ Usage ------------------------

2. Oozie access interface

The command line is Oozie's most commonly used interface. Its web interface only shows the jobs Oozie manages; it cannot be used to configure jobs.

[Example 6-6] Complete the following tasks:
    ① Enter the Oozie client and list the common commands.
    ② Run the Oozie MR example program.
    ③ Run the Oozie Pig, Hive, and other examples.
    ④ Write a workflow.xml that runs WordCount once.
    ⑤ Write a workflow.xml that runs WordCount twice, with the first WC's output as the second WC's input.

Solution: For task ①, run the following command on iClient, as either root or joe.
[root@iClient ~]# sudo -u joe oozie help
# List all Oozie commands

For task ②, first unpack the Oozie examples archive, then fix the addresses in the example configuration, and finally upload the examples to the cluster and run them:
[root@iClient ~]# cd /usr/share/doc/oozie-4.0.0+cdh5.0.0+54
[root@iClient oozie-4.0.0+cdh5.0.0+54]# tar -zxvf oozie-examples.tar.gz

Edit examples/apps/map-reduce/job.properties and replace these two lines:
    nameNode=hdfs://localhost:8020
    jobTracker=localhost:8021
with the address and ports the cluster is actually configured with:
    nameNode=hdfs://cMaster:8020
    jobTracker=cMaster:8032

Then upload examples to HDFS and run the job with the oozie command:
[root@iClient oozie-4.0.0+cdh5.0.0+54]# sudo -u joe hdfs dfs -put examples examples
[root@iClient oozie-4.0.0+cdh5.0.0+54]# cd
[root@iClient ~]# sudo -u joe oozie job -oozie http://cMaster:11000/oozie -config /usr/share/doc/oozie-4.0.0+cdh5.0.0+54/examples/apps/map-reduce/job.properties -run

Task ③ is really the same as ②: follow the same procedure to run the Pig, Hive, and other example scripts through oozie. Remember to edit the matching configuration (for example examples/apps/pig/job.properties) before uploading to the cluster, and point the command at the matching path when running (for example sudo -u joe oozie ....../apps/pig/job.properties -run).

For task ④, use "examples/apps/map-reduce/workflow.xml" as a reference. Its jar is under "examples/apps/map-reduce/lib"; the DemoMapper.class and DemoReducer.class inside it are the WordCount code, and the corresponding source is under "examples/src". Proceed as follows.

(1) Edit "examples/apps/map-reduce/workflow.xml" and find this block:
    <property>
        <name>mapred.mapper.class</name>
        <value>org.apache.oozie.example.SampleMapper</value>
    </property>
    <property>
        <name>mapred.reducer.class</name>
        <value>org.apache.oozie.example.SampleReducer</value>
    </property>
Replace it with:
    <property>
        <name>mapred.mapper.class</name>
        <value>org.apache.oozie.example.DemoMapper</value>
    </property>
    <property>
        <name>mapred.reducer.class</name>
        <value>org.apache.oozie.example.DemoReducer</value>
    </property>
    <property>
        <name>mapred.output.key.class</name>
        <value>org.apache.hadoop.io.Text</value>
    </property>
    <property>
        <name>mapred.output.value.class</name>
        <value>org.apache.hadoop.io.IntWritable</value>
    </property>

(2) Delete the old examples directory from HDFS. The upload and run commands are the same as in task ②.
[root@iClient ~]# sudo -u joe hdfs dfs -rm -r -f examples    # delete the old examples directory from HDFS

(3) Upload examples to HDFS again and run the job with the oozie command:
[root@iClient oozie-4.0.0+cdh5.0.0+54]# sudo -u joe hdfs dfs -put examples examples
[root@iClient oozie-4.0.0+cdh5.0.0+54]# cd
[root@iClient ~]# sudo -u joe oozie job -oozie http://cMaster:11000/oozie -config /usr/share/doc/oozie-4.0.0+cdh5.0.0+54/examples/apps/map-reduce/job.properties -run

Task ⑤ is the situation that comes up most often in real business logic. Suppose the data processing flow is "M1" → "R1" → "Java1" → "Pig1" → "Hive1" → "M2" → "R2" → "Java2". After writing each class or script separately, write the workflow.xml that expresses this flow. For brevity, only the workflow.xml skeleton is given below; readers are left to fill in the two map-reduce configurations along the lines of task ④.
    <workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
        <start to="mr-node"/>
        <action name="mr-node">
            <map-reduce>configuration of the first WordCount</map-reduce>
            <ok to="mr-wc2"/>
            <error to="fail"/>
        </action>
        <action name="mr-wc2">
            <map-reduce>configuration of the second WordCount</map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Map/Reduce failed error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>

-------------------------Flume--------------------

1. Flume deployment [21]

One machine in the cluster running Flume is enough to receive data. The examples below also need a machine that acts as the data source and sends data to the Hadoop cluster, so Flume is installed on both cMaster and iClient.

(1) Install the Flume receiver
[root@cMaster ~]# sudo yum install flume-ng-agent    # install Flume on cMaster

(2) Install the Flume sender
[root@iClient ~]# sudo yum install flume-ng-agent    # install Flume on iClient

------------------------ Usage ------------------------

2. Flume access interface

Flume provides a command-line interface and a programmatic interface, but it is used in a distinctive way: both interfaces require a Flume configuration file. This reflects one of Flume's architectural ideas: it is a configuration-driven tool.

[Example 6-7] Complete the following tasks:
    ① Enter the Flume command line and list the common commands.
    ② Have the sender iClient use telnet to send data to cMaster, while the receiver cMaster listens on port 44444 and prints the received data to the command line.
    ③ Have the sender iClient send the local file "/home/joe/source.txt" to the receiver cMaster, which stores the data in HDFS.
    ④ In task ③ the receiver cMaster runs a Flume service that accepts data. Since it accepts data from iClient, it will just as readily accept data from the machine iHacker (an attacker). How can port attacks be minimized and the data kept safe?

Solution: For task ①, run the following command on iClient.
[root@iClient ~]# flume-ng    # list the common Flume commands

For task ②, first configure and start Flume on cMaster as required (it acts as the receiving process and passively waits for data), then use telnet on iClient to send data to cMaster. The steps are as follows.

On cMaster, as root, create the file "/etc/flume-ng/conf/flume.conf" with the following content:
-----------------------------------------------------------------------------------------------
# Name this agent a1, and name its sources r1, channels c1, and sinks k1
a1.sources=r1
a1.channels=c1
a1.sinks=k1
# Source properties: listen on port 44444 of cMaster for data sent over the netcat protocol
a1.sources.r1.type=netcat
a1.sources.r1.bind=cMaster
a1.sources.r1.port=44444
# Channel properties: buffer the data in memory
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
# Sink of type logger: write the received data straight to the console
a1.sinks.k1.type=logger
# Bind the source to the channel, and the channel to the sink
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
-----------------------------------------------------------------------------------------------
Then start Flume on cMaster in the foreground with this configuration:
[root@cMaster ~]# flume-ng agent -c /etc/flume-ng/ -f /etc/flume-ng/conf/flume.conf -n a1

The receiver cMaster is now configured and running. Next start the sender by running this on iClient:
[root@iClient ~]# telnet cMaster 44444

Type anything into this session and press Enter; telnet sends it to cMaster. Back in the cMaster terminal where Flume was started, the data typed on iClient appears. To leave telnet on iClient, press Ctrl+] (hold Ctrl and ] together), then type "quit" at the telnet prompt and press Enter. To stop Flume on cMaster, press Ctrl+C.

Task ③ takes more steps. First, on cMaster, create the file "/etc/flume-ng/conf/flume.conf.hdfs" with the following content:
-----------------------------------------------------------------------------------------------
# Name this agent a1, and name its sources r1, channels c1, and sinks k1
a1.sources=r1
a1.sinks=k1
a1.channels=c1
# Source type and properties:
# an avro source that listens on port 4141 of cMaster for data sent over the avro protocol
a1.sources.r1.type=avro
a1.sources.r1.bind=cMaster
a1.sources.r1.port=4141
# Channel type and properties: buffer the data in memory
a1.channels.c1.type=memory
# Sink of type hdfs: store the received data as plain text in the given HDFS directory
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=/user/joe/flume/cstorArchive
a1.sinks.k1.hdfs.fileType=DataStream
# Bind the source to the channel, and the channel to the sink
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
-----------------------------------------------------------------------------------------------
Next, on iClient, create the file "/root/businessLog" with the following content:
--------------------------------------
cccccccccccccccccccccssssssssssssssssssssssssttttttttttttttttttttttttttttttttttoooooooooooooooooorrrrrrrrrrrrrrrrrrrrrrrrrrrrr
--------------------------------------
Also on iClient, create the file "/etc/flume-ng/conf/flume.conf.exce" with the following content:
-----------------------------------------------------------------------------------------------
# Name this agent a1, and name its sources r1, channels c1, and sinks k1
a1.sources=r1
a1.channels=c1
a1.sinks=k1
# Source type and properties: an exec source that
# reads /root/businessLog with the Linux cat command and writes what it reads to the channel
a1.sources.r1.type=exec
a1.sources.r1.command=cat /root/businessLog
# Channel properties: buffer the data in memory
a1.channels.c1.type=memory
# Sink of type avro: send the channel's data to port 4141 on cMaster over the avro protocol
a1.sinks.k1.type=avro
a1.sinks.k1.hostname=cMaster
a1.sinks.k1.port=4141
# Bind the source to the channel, and the channel to the sink
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
-----------------------------------------------------------------------------------------------
Flume is now configured on both the sender iClient and the receiver cMaster. What remains is to create the directory in HDFS and start the receiving and sending Flume services.

Start Flume on cMaster. The "flume-ng ... a1" command starts Flume with the flume.conf.hdfs configuration; the argument a1 is the agent name defined at the top of that file.
[root@cMaster ~]# sudo -u joe hdfs dfs -mkdir flume    # create the HDFS directory /user/joe/flume
[root@cMaster ~]# sudo -u joe flume-ng agent -c /etc/flume-ng/ -f /etc/flume-ng/conf/flume.conf.hdfs -n a1

Finally, start the sending process on iClient. As in the previous command, a1 here is the agent defined in flume.conf.exce:
[root@iClient ~]# flume-ng agent -c /etc/flume-ng/ -f /etc/flume-ng/conf/flume.conf.exce -n a1

Now open "cMaster:50070" in a browser on iClient and navigate to the directory "/user/joe/flume/cstorArchive" to see the file sent over from iClient.

-------------------------Mahout--------------------

1. Mahout deployment [21]

As a client of Hadoop, Mahout only needs to be installed on one machine, either inside the cluster or on a client machine outside it. This walkthrough installs Mahout on iClient.
[root@iClient ~]# sudo yum install mahout

2. Mahout access interface
Mahout provides programmatic and command-line interfaces. By studying the many machine learning algorithms Mahout already ships with, programmers can also parallelize an algorithm of their own.

[Example 6-8] As user joe, run the Mahout example program naivebayes: download the data, build the learner, train it, and finally evaluate the learner against test data.

Solution: First download the training and test data sets, then run the training MR job and the test MR job. Mahout's algorithms, however, require their input as binary data in Value and vector format, so a few intermediate steps are needed to convert the data into the required format. The script naivebayes.sh below performs all of these steps.
-------------------------------------------------------------------------
#!/bin/sh
# Create the local directories and the HDFS directory
mkdir -p /tmp/mahout/20news-bydate /tmp/mahout/20news-all && hdfs dfs -mkdir mahout
# Download the training and test data sets
curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \
  -o /tmp/mahout/20news-bydate.tar.gz
# Unpack and merge the data sets, then upload them to HDFS
cd /tmp/mahout/20news-bydate && tar xzf /tmp/mahout/20news-bydate.tar.gz && cd
cp -R /tmp/mahout/20news-bydate/*/* /tmp/mahout/20news-all
hdfs dfs -put /tmp/mahout/20news-all mahout/20news-all
# Use the seqdirectory tool to convert the text data to binary data
mahout seqdirectory -i mahout/20news-all -o mahout/20news-seq -ow
# Use the seq2sparse tool to convert the binary data into the matrix-style binary data the algorithm can process
mahout seq2sparse -i mahout/20news-seq -o mahout/20news-vectors -lnorm -nv -wt tfidf
# Randomly split the data into a training part and a test part.
# The original text describes an 80% / 20% split; --randomSelectionPct is the percentage
# held out as test data, so the value 40 used here gives roughly 60% / 40%.
mahout split -i mahout/20news-vectors/tfidf-vectors --trainingOutput mahout/20news-train-vectors \
  --testOutput mahout/20news-test-vectors \
  --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
# Train the Naive Bayes model
mahout trainnb -i mahout/20news-train-vectors -el -o mahout/model -li mahout/labelindex -ow
# Test the model on the training set itself (the result may reflect overfitting)
mahout testnb -i mahout/20news-train-vectors -m mahout/model -l mahout/labelindex \
  -ow -o mahout/20news-testing
# Test the model on the test data
mahout testnb -i mahout/20news-test-vectors -m mahout/model -l mahout/labelindex \
  -ow -o mahout/20news-testing
-------------------------------------------------------------------------
For brevity the script is bare-bones. It must be run on iClient as user joe, and it can only be run once. Before running it again, delete all the data it created. Run it as follows:
[root@iClient ~]# cp naivebayes.sh /home/joe
[root@iClient ~]# chown joe:joe /home/joe/naivebayes.sh
[root@iClient ~]# sudo -u joe chmod +x /home/joe/naivebayes.sh
[root@iClient ~]# sudo -u joe sh /home/joe/naivebayes.sh

While the script runs, open the web UI at "cMaster:8088" to watch the Mahout jobs in progress, or open "cMaster:50070" and navigate to "/user/joe/mahout/" to watch the directory change.
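
The requirement to delete all data before a second run can be sketched as a small helper. This is a minimal sketch, not part of the original script: the function name cleanup_cmds is hypothetical, and it assumes the only data to remove is the local directory /tmp/mahout and the HDFS directory mahout (relative to the home directory of the user who ran naivebayes.sh). It prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that remove what naivebayes.sh created.
# Assumed paths (taken from the script in this walkthrough):
#   local: /tmp/mahout      HDFS: mahout (relative to /user/<user>)
cleanup_cmds() {
    local_dir="${1:-/tmp/mahout}"   # local working directory
    hdfs_dir="${2:-mahout}"         # HDFS directory, relative to the user's HDFS home
    echo "rm -rf $local_dir"
    echo "hdfs dfs -rm -r -f $hdfs_dir"
}

# Print the commands for review; nothing is deleted at this point.
cleanup_cmds
```

After checking the output, the commands could be executed as joe with something like `cleanup_cmds | sudo -u joe sh`, so that the relative HDFS path resolves to /user/joe/mahout.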
Reposted from: https://www.cnblogs.com/Raodi/p/11053256.html