The following problems came up while using Hadoop in practice; the solutions are given below.
Chinese character problem
Chinese text parsed out of a URL still prints as garbled characters in Hadoop? We used to assume Hadoop simply did not support Chinese, but after reading the source code we found that Hadoop merely does not support writing its output in GBK.
Below is the relevant code from TextOutputFormat.class. Hadoop's default outputs all derive from FileOutputFormat, which has two subclasses: one for binary stream output, and the text-based one, TextOutputFormat.
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {

  protected static class LineRecordWriter<K, V> implements RecordWriter<K, V> {

    private static final String utf8 = "UTF-8"; // the charset is hard-coded to UTF-8 here
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }
    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength()); // this is the other place that needs to change
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }
As the code shows, Hadoop's default output is hard-coded to UTF-8. As long as the Chinese text was decoded correctly in the first place, setting the Linux client's character set to UTF-8 will display the Chinese output properly.
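A quick way to see why the output always comes out as UTF-8: Hadoop's Text type keeps its payload as UTF-8 bytes internally, so writing to.getBytes() directly emits UTF-8 no matter what. A minimal standalone sketch (the class and variable names here are ours, not from the Hadoop source):

import org.apache.hadoop.io.Text;

public class TextEncodingDemo {
  public static void main(String[] args) throws Exception {
    Text t = new Text("中文");

    // The backing buffer returned by getBytes() is UTF-8 encoded and may be
    // longer than the payload, so only the first getLength() bytes are valid.
    String roundTrip = new String(t.getBytes(), 0, t.getLength(), "UTF-8");
    System.out.println(roundTrip); // the original string, if the terminal uses UTF-8

    // Getting GBK bytes requires decoding to a String first and re-encoding.
    byte[] gbkBytes = t.toString().getBytes("GBK");
    System.out.println(gbkBytes.length); // 4: two GBK bytes per Chinese character
  }
}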
But what if the database defines its fields in GBK, and you want Hadoop to output Chinese in GBK so it stays compatible with the database?
We need to define a new class:
public class GBKOutputFormat<K, V> extends FileOutputFormat<K, V> {

  protected static class LineRecordWriter<K, V> implements RecordWriter<K, V> {

    private static final String gbk = "gbk"; // charset changed to GBK
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }
    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        // Re-encode through String so Text values are written as GBK
        // instead of their internal UTF-8 bytes
        out.write(o.toString().getBytes(gbk));
      } else {
        out.write(o.toString().getBytes(gbk)); // non-Text values are converted the same way
      }
    }
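For completeness, here is a rough sketch of plugging the new format into a job driver, assuming GBKOutputFormat is a complete copy of TextOutputFormat with only the charset changed as above. It uses the old org.apache.hadoop.mapred API that these excerpts are written against; GbkJobDriver, the job name, and the argument paths are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class GbkJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(GbkJobDriver.class);
    conf.setJobName("gbk-output-demo");

    // Mapper/reducer setup omitted; the identity classes are used by default.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Replace the default TextOutputFormat with the GBK-aware copy.
    conf.setOutputFormat(GBKOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}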
Setting the number of reduces
How many reduce tasks is actually appropriate?
Our testing so far suggests that setting the number of reduces to roughly half the total number of cores across the cluster's datanodes works well.
For example, with 32 datanodes of 8 cores each, setting the number of reduces to 128 was fastest.
With 8 cores per machine, 4 doing map work and 4 doing reduce work, the load fits nicely.
A small benchmark on the same program:
reduce num = 32, reduce time = 6 min
reduce num = 128, reduce time = 2 min
reduce num = 320, reduce time = 5 min
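The reduce count itself is a per-job setting. A small sketch of the rule of thumb, again with the old mapred API; the constants simply restate the 32-node, 8-core example above:

import org.apache.hadoop.mapred.JobConf;

public class ReduceCountConfig {
  // Rule of thumb from above: reduces ~ half of the cluster's total
  // datanode cores (32 nodes * 8 cores / 2 = 128 in the example).
  public static void configure(JobConf conf) {
    int datanodes = 32;
    int coresPerNode = 8;
    conf.setNumReduceTasks(datanodes * coresPerNode / 2);
  }
}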
Error when starting a MapReduce job
java.io.IOException: All datanodes xxx.xxx.xxx.xxx:xxx are bad. Aborting…
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

java.io.IOException: Could not get block locations. Aborting…
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
Investigation showed that the cause was too many open files on the Linux machines. Linux's default open-file limit is 1024; the limit can be raised with the ulimit -n command, which requires root privileges.
Alternatively, edit /etc/security/limits.conf and add a line such as hadoop soft nofile 65535, then rerun the program (ideally make this change on every datanode). This resolved the problem.
PS: It is said that Hadoop DFS cannot manage more than 100M files in total; this still needs to be verified.
Error when stopping Hadoop services
no tasktracker to stop
no datanode to stop
The cause is that when stopping, Hadoop relies on the saved process IDs of the mapred and dfs daemons.
By default these PID files are written under /tmp, and Linux periodically (typically every month or roughly every 7 days) cleans out files in that directory.
So once PID files such as hadoop-hadoop-jobtracker.pid and hadoop-hadoop-namenode.pid have been deleted, the stop scripts naturally can no longer find the processes they are supposed to stop.
The fix is to set export HADOOP_PID_DIR in $HADOOP_HOME/conf/hadoop-env.sh to a directory that is not cleaned out automatically.