python连接hive数据库 Python 连接hive(Linux)

编程之家2023-10-22219次浏览

各位老铁们，大家好，今天由我来为大家分享python连接hive数据库，以及Python 连接hive(Linux)的相关问题知识，希望对大家有所帮助。如果可以帮助到大家，还望关注收藏下本站，您的支持是我们最大的动力，谢谢大家了哈，下面我们开始吧！

python连接hive,怎么安装thrifthive

HiveServer2的启动

启动HiveServer2

HiveServer2的启动十分简便：

$$HIVE_HOME/bin/hiveserver2

或者

$$HIVE_HOME/bin/hive--service hiveserver2

默认情况下，HiverServer2的Thrift监听端口是10000，其WEB UI端口是10002。可通过http://localhost:10002来查看HiveServer2的Web UI界面，这里显示了Hive的一些基本信息。如果Web界面不能查看，则说明HiveServer2没有成功运行。

使用beeline测试客户端连接

HiveServer2成功运行后，我们可以使用Hive提供的客户端工具beeline连接HiveServer2。

$$HIVE_HOME/bin/beeline

beeline>!connect jdbc:hive2://localhost:10000

如果成功登录将出现如下的命令提示符，此时可以编写HQL语句。

0: jdbc:hive2://localhost:10000>

报错：User: xxx is not allowed to impersonate anonymous

在beeline使用!connect连接HiveServer2时可能会出现如下错误信息：

Caused by: org.apache.hadoop.ipc.RemoteException:

User: xxx is not allowed to impersonate anonymous

这里的xxx是我的操作系统用户名称。这个问题的解决方法是在hadoop的core-size.xml文件中添加xxx用户代理配置：

<property><name>hadoop.proxyuser.xxx.groups</name><value>*</value></property><property><name>hadoop.proxyuser.xxx.hosts</name><value>*</value></property>

重启HDFS后，再用beeline连接HiveServer2即可成功连接。

常用配置

HiveServer2的配置可以参考官方文档《Setting Up HiveServer2》

这里列举一些hive-site.xml的常用配置：

hive.server2.thrift.port：监听的TCP端口号。默认为10000。

hive.server2.thrift.bind.host：TCP接口的绑定主机。

hive.server2.authentication：身份验证方式。默认为NONE（使用 plain SASL），即不进行验证检查。可选项还有NOSASL, KERBEROS, LDAP, PAM and CUSTOM.

hive.server2.enable.doAs：是否以模拟身份执行查询处理。默认为true。

Python客户端连接HiveServer2

python中用于连接HiveServer2的客户端有3个：pyhs2，pyhive，impyla。官网的示例采用的是pyhs2，但pyhs2的官网已声明不再提供支持，建议使用impyla和pyhive。我们这里使用的是impyla。

impyla的安装

impyla必须的依赖包括：

six

bit_array

thriftpy(python2.x则是thrift)

为了支持Hive还需要以下两个包：

sasl

thrift_sasl

可在Python PI中下载impyla及其依赖包的源码。

impyla示例

以下是使用impyla连接HiveServer2的示例：

from impala.dbapi import connect

conn= connect(host='127.0.0.1', port=10000, database='default', auth_mechanism='PLAIN')

cur= conn.cursor()

cur.execute('SHOW DATABASES')print(cur.fetchall())

cur.execute('SHOW Tables')print(cur.fetchall())

Python 连接hive(Linux)

之所以选择基于Linux系统用Python连接hive，是因为在window下会出现Hadoop认证失败的问题。会出现执行python脚本的机器无目标hive的kerberos认证信息类似错误，也会出现sasl调用问题：

该错误我尝试多次，未能解决（有知道window下解决方案的欢迎留言），所以建议使用Linux系统。

VMware Workstation+Ubuntu

网上教程很多，本文推荐一个教程： https://blog.csdn.net/stpeace/article/details/78598333

主要是以下四个包：

在安装包sasl的过程会出现麻烦，主要是Ubuntu中缺乏sasl.h的问题，这里可以通过下面语句解决

这和centos有一些区别。

本文是基于本机虚拟机用Python连接的公司测试环境的hive（生产环境和测试环境是有隔离的，生产环境需要堡垒机才能连接）

因缺乏工程和计算机基础的知识，对很多的地方都了解的不够深入，欢迎大神指点，最后向以下两位大佬的帖子致谢：

[1] https://www.zhihu.com/question/269333988/answer/581126392

[2] https://mp.weixin.qq.com/s/cdFxkphMtJASQ7-nKt13mg

python连接hive的时候必须要依赖sasl类库吗

客户端连接Hive需要使用HiveServer2。HiveServer2是HiveServer的重写版本，HiveServer不支持多个客户端的并发请求。当前HiveServer2是基于Thrift RPC实现的。它被设计用于为像JDBC、ODBC这样的开发API客户端提供更好的支持。Hive 0.11版本引入的HiveServer2。

HiveServer2的启动

启动HiveServer2

HiveServer2的启动十分简便：

$$HIVE_HOME/bin/hiveserver2

或者

$$HIVE_HOME/bin/hive--service hiveserver2

默认情况下，HiverServer2的Thrift监听端口是10000，其WEB UI端口是10002。可通过来查看HiveServer2的Web UI界面，这里显示了Hive的一些基本信息。如果Web界面不能查看，则说明HiveServer2没有成功运行。

使用beeline测试客户端连接

HiveServer2成功运行后，我们可以使用Hive提供的客户端工具beeline连接HiveServer2。

$$HIVE_HOME/bin/beeline

beeline>!connect jdbc:hive2://localhost:10000

如果成功登录将出现如下的命令提示符，此时可以编写HQL语句。

0: jdbc:hive2://localhost:10000>

报错：User: xxx is not allowed to impersonate anonymous

在beeline使用!connect连接HiveServer2时可能会出现如下错误信息：

12Caused by: org.apache.hadoop.ipc.RemoteException:User: xxx is not allowed to impersonate anonymous

这里的xxx是我的操作系统用户名称。这个问题的解决方法是在hadoop的core-size.xml文件中添加xxx用户代理配置：

123456789<spanclass="hljs-tag"><<spanclass="hljs-title">property><spanclass="hljs-tag"><<spanclass="hljs-title">name>hadoop.proxyuser.xxx.groups<spanclass="hljs-tag"></<spanclass="hljs-title">name><spanclass="hljs-tag"><<spanclass="hljs-title">value>*<spanclass="hljs-tag"></<spanclass="hljs-title">value><spanclass="hljs-tag"></<spanclass="hljs-title">property><spanclass="hljs-tag"><<spanclass="hljs-title">property><spanclass="hljs-tag"><<spanclass="hljs-title">name>hadoop.proxyuser.xxx.hosts<spanclass="hljs-tag"></<spanclass="hljs-title">name><spanclass="hljs-tag"><<spanclass="hljs-title">value>*<spanclass="hljs-tag"></<spanclass="hljs-title">value><spanclass="hljs-tag"></<spanclass="hljs-title">property>

重启HDFS后，再用beeline连接HiveServer2即可成功连接。

常用配置

HiveServer2的配置可以参考官方文档《Setting Up HiveServer2》

这里列举一些hive-site.xml的常用配置：

hive.server2.thrift.port：监听的TCP端口号。默认为10000。

hive.server2.thrift.bind.host：TCP接口的绑定主机。

hive.server2.authentication：身份验证方式。默认为NONE（使用 plain SASL），即不进行验证检查。可选项还有NOSASL, KERBEROS, LDAP, PAM and CUSTOM.

hive.server2.enable.doAs：是否以模拟身份执行查询处理。默认为true。

Python客户端连接HiveServer2

impyla的安装

impyla必须的依赖包括：

six

bit_array

thriftpy(python2.x则是thrift)

为了支持Hive还需要以下两个包：

sasl

thrift_sasl

可在Python PI中下载impyla及其依赖包的源码。

impyla示例

以下是使用impyla连接HiveServer2的示例：

1234567891011 fromimpala.dbapi importconnectconn=connect(host='127.0.0.1', port=10000, database='default', auth_mechanism='PLAIN')cur=conn.cursor()cur.execute('SHOW DATABASES')print(cur.fetchall())cur.execute('SHOW Tables')print(cur.fetchall())

好了，文章到此结束，希望可以帮助到大家。