Hive dynamic partitions: usage and caveats

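Suppose you have a Hive table that was created without partitions. Later you want to build a new table partitioned by one of its columns and load the data from the unpartitioned table into it. How do you do that? This post records a test of letting SQL create the partitions automatically.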

Step 1: create the original, unpartitioned table test1.

hive> create table test1 (id int, name string);

Step 2: load data into the test table. Here a small node.js script generates a SQL statement with 1000 rows:

// sql.js
var n = 1000;
var str = "insert into test1 values ";
var comma = "";
// Build one multi-row INSERT with rows (1, 'n1') through (1000, 'n1000').
for (var i = 1; i <= n; i++) {
  str += comma + "(" + i + ", 'n" + i + "')";
  comma = ", ";
}
console.log(str + ";");

# Run node sql.js > 1.sql to generate the SQL file
# Its content looks roughly like: insert into test1 values (1, 'n1'), (2, 'n2')...;

# Run hive -f 1.sql to load the data into the Hive table test1

Step 3: create the new partitioned table, using name as the partition column.

hive>
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.created.files=20000;
set hive.exec.max.dynamic.partitions.pernode=3000;
set hive.exec.max.dynamic.partitions=10000;
create table test2 ( id int  ) PARTITIONED BY (name string);
insert overwrite table test2 partition(name) select id, name   from test1 ;
create table test3 ( name string  ) PARTITIONED BY (id int);
insert overwrite table test3 partition(id) select name, id  from test1 limit 2;
# Note: in insert ... select, the columns that feed the partitions must come last in the select list.
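
The column-order rule matters because Hive matches dynamic partition columns by position, not by name. As a minimal sketch, the node.js helper below assembles the statement so the partition columns always end up last. buildDynamicInsert is a hypothetical name introduced for illustration; it is not part of Hive.

```javascript
// Hypothetical helper (not a Hive API): builds an INSERT ... SELECT statement
// with the partition columns appended after the regular data columns.
function buildDynamicInsert(target, source, dataCols, partitionCols) {
  const selectList = dataCols.concat(partitionCols).join(", ");
  const partitionSpec = partitionCols.join(", ");
  return "insert overwrite table " + target +
    " partition(" + partitionSpec + ") select " + selectList +
    " from " + source + ";";
}

// Regenerates statements of the same shape as the test2 and test3 inserts.
console.log(buildDynamicInsert("test2", "test1", ["id"], ["name"]));
console.log(buildDynamicInsert("test3", "test1", ["name"], ["id"]));
```

Passing the data columns and partition columns separately makes it hard to put a partition column in the wrong position by accident.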

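# Notes:
hive.exec.dynamic.partition controls whether dynamic partitioning is enabled. It defaults to false in older Hive releases (true since Hive 0.9.0) and must be true here.
hive.exec.dynamic.partition.mode defaults to strict, which requires at least one static partition column, so not every partition column can be dynamic. It must be set to nonstrict here.
hive.exec.max.created.files caps the number of HDFS files a single SQL statement may create. The default is 100,000; raise it yourself if you need more.
hive.exec.max.dynamic.partitions.pernode caps the number of dynamic partitions each mapper or reducer node may create. The default is 100; tune it as needed.
hive.exec.max.dynamic.partitions caps the total number of dynamic partitions a single SQL statement may create. The default is 1000; tune it as needed.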
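
To see why the two partition limits are raised for this test, here is a rough node.js sketch that counts the distinct partition values and compares them with the default and the step 3 values. The rows array is a stand-in for test1 (1000 rows with unique names), and countPartitions and fitsLimits are hypothetical helpers. The per-node check assumes the worst case, where a single task writes every partition.

```javascript
// Defaults quoted in the notes above, and the values set in step 3.
const defaults = { total: 1000, perNode: 100 };
const configured = { total: 10000, perNode: 3000 };

// Hypothetical helper: number of distinct values in the partition column.
function countPartitions(rows, partitionCol) {
  return new Set(rows.map((row) => row[partitionCol])).size;
}

// Worst case assumed here: one task ends up writing every partition,
// so the per-node cap is compared against the full partition count.
function fitsLimits(needed, limits) {
  return needed <= limits.total && needed <= limits.perNode;
}

// Stand-in for test1: 1000 rows, each with a unique name.
const rows = [];
for (let i = 1; i <= 1000; i++) {
  rows.push({ id: i, name: "n" + i });
}

const needed = countPartitions(rows, "name");
console.log("partitions needed:", needed);
console.log("fits defaults:", fitsLimits(needed, defaults));
console.log("fits step 3 settings:", fitsLimits(needed, configured));
```

With one partition per distinct name, the default per-node cap of 100 is the first limit this test would hit.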

Step 4: inspect the partitions of the new table and query the result.

hive>
# List the partitions
show partitions test2;

select count(*) from test2;

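tips:
Most of the time the settings used in this walkthrough are enough for the job to run. If you hit an error like "unable to create new native thread" during execution, check the process limit (nproc) configured for the hadoop user on the nodes where the NodeManager service runs. Make sure the allowed number of threads is greater than the number of dynamic partitions that a task on each node has to create. After changing the configuration, remember to restart the YARN NodeManager service. To see how many threads a task is using while it runs: pstree -p task_pid | wc -l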
cat /etc/security/limits.conf

hadoop soft nofile 1000000
hadoop hard nofile 1000000
hadoop soft nproc 32000
hadoop hard nproc 32000
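
The rule of thumb above (thread limit greater than the dynamic partitions per node) can be checked with a small node.js sketch. It parses limits.conf-style text held in a string that mirrors the entries above; readLimit is a hypothetical helper, and it only handles numeric values (an entry such as unlimited would come back as NaN).

```javascript
// Sample text mirroring the limits.conf entries above.
const limitsConf = [
  "hadoop soft nofile 1000000",
  "hadoop hard nofile 1000000",
  "hadoop soft nproc 32000",
  "hadoop hard nproc 32000",
].join("\n");

// Hypothetical helper: returns the numeric value for user/type/item, or null.
function readLimit(text, user, type, item) {
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue; // skip blanks and comments
    const [domain, limitType, limitItem, value] = trimmed.split(/\s+/);
    if (domain === user && limitType === type && limitItem === item) {
      return Number(value);
    }
  }
  return null;
}

const softNproc = readLimit(limitsConf, "hadoop", "soft", "nproc");
const partitionsPerNode = 3000; // hive.exec.max.dynamic.partitions.pernode from step 3
console.log("soft nproc:", softNproc);
console.log("greater than partitions per node:", softNproc > partitionsPerNode);
```

To check a real node, the same function could be fed the file contents read with fs.readFileSync("/etc/security/limits.conf", "utf8").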