Trying machine learning with CentOS 7.2 (1511) + PostgreSQL 9.5 (SCL) + MADlib 1.9.1 (GA)

2 years ago

清, 宇


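Introduction

I set out to build a MADlib environment on the latest stack (as of 2016/09/26). MADlib is a project billed as "Big Data Machine Learning in SQL". It implements statistical and machine learning algorithms on PostgreSQL (and PostgreSQL-derived databases such as Greenplum and HAWQ), so you can run machine learning directly from SQL, which is very nice.

Note that I rarely use PostgreSQL and know almost nothing about machine learning, so this post may contain mistakes.

This time I mainly followed the official document below. Please refer to it for details.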

Quick Start Guide for Users

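Environment

CentOS 7.2 + yum update

PostgreSQL 9.5 (SCL version)
MADlib Binaries v1.9.1

Almost everything can be installed from rpm packages, so it is not difficult, although SCL may be unfamiliar. I recommend the IDCF Tech-Blog post below. It is a little old, but little has changed and it still applies.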

Software Collections for CentOS 6を使おう！ (Let's use Software Collections for CentOS 6!)

Installing PostgreSQL (SCL) on CentOS

From here on, I roughly follow the guide below.

Installation Guide

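Because this setup is CentOS 7.2 + SCL, some steps differ from the guide, but the MADlib part is almost the same.

Starting the virtual machine

On IDCF Cloud, start a virtual machine from the standard CentOS 7.2 template. If you follow the guide below, this should go smoothly.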

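ご利用ガイド (User Guide)

Once the virtual machine is up, run yum -y update and reboot.

Installing PostgreSQL (SCL version)

First, enable SCL. (This only requires installing the SCL release packages.)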

# yum -y install centos-release-scl centos-release-scl-rh

Next, install the SCL version of PostgreSQL 9.5 (again, just rpm packages). MADlib also requires PL/Python, so install it at the same time.

# yum -y install rh-postgresql95-postgresql-server rh-postgresql95-postgresql-plpython

After installation, initialize the database and start it. Do not forget to run scl enable rh-postgresql95 bash first.

# scl enable rh-postgresql95 bash

# postgresql-setup --initdb
 * Initializing database in '/var/opt/rh/rh-postgresql95/lib/pgsql/data'
 * Initialized, logs are in /var/lib/pgsql/initdb_rh-postgresql95-postgresql.log

# systemctl start rh-postgresql95-postgresql

Once PostgreSQL is running, set a password for the postgres database user.

# su - postgres

$ scl enable rh-postgresql95 bash

$ psql
psql (9.5.2)
Type "help" for help.

postgres=# ALTER USER postgres with password '6zCYTSn4';
ALTER ROLE
postgres=# \q

$ exit
$ exit
#

Once the password is set, change the authentication method for localhost to md5. (The MADlib installer only accepts password authentication.)

# vi /var/opt/rh/rh-postgresql95/lib/pgsql/data/pg_hba.conf 

# IPv4 local connections:
host    all             all             127.0.0.1/32            md5

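This sets authentication for 127.0.0.1/32 to md5.

In this environment, postgresql.conf had DateStyle set to iso, ymd. (PostgreSQL's built-in default is ISO, MDY, but initdb writes a locale-dependent value into the configuration file.) MADlib's install-check fails unless DateStyle is iso, mdy, so change it to iso, mdy.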

# vi /var/opt/rh/rh-postgresql95/lib/pgsql/data/postgresql.conf 

#datestyle = 'iso, ymd'
datestyle = 'iso, mdy'

# systemctl restart rh-postgresql95-postgresql
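If you would rather script this edit than open vi, the following is a minimal sketch in Python. The helper name set_datestyle is hypothetical (it is not part of PostgreSQL or MADlib), and it only transforms the text of the file. Reading and writing the real postgresql.conf, and restarting the service, are left to you.

```python
import re


def set_datestyle(conf_text, value="iso, mdy"):
    """Comment out any active datestyle line and append a new setting.

    Hypothetical helper for illustration: it works on the file contents as
    a string and does not touch the file system.
    """
    out = []
    for line in conf_text.splitlines():
        # An active setting starts with the parameter name (names are
        # case-insensitive in postgresql.conf); commented lines start with '#'.
        if re.match(r"\s*datestyle\s*=", line, flags=re.IGNORECASE):
            out.append("#" + line)
        else:
            out.append(line)
    out.append("datestyle = '%s'" % value)
    return "\n".join(out) + "\n"


# Example with an in-memory snippet instead of the real file.
sample = "port = 5432\ndatestyle = 'iso, ymd'\n"
print(set_datestyle(sample))
```

To apply it for real, you would read /var/opt/rh/rh-postgresql95/lib/pgsql/data/postgresql.conf, pass its contents through the function, write the result back, and then restart PostgreSQL as shown above.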

Installing MADlib

MADlib is also officially distributed as a binary rpm, so install it from that file.

# yum -y install https://dist.apache.org/repos/dist/release/incubator/madlib/1.9.1-incubating/apache-madlib-1.9.1-incubating-bin-Linux.rpm

Now install MADlib into the PostgreSQL database.
(I accidentally deleted the logs, so what follows is written from memory and may contain errors...)

# su - postgres

$ scl enable rh-postgresql95 bash

$ /usr/local/madlib/bin/madpack -p postgres -c postgres@127.0.0.1 install
Password for user postgres:

Check that MADlib has been installed correctly into PostgreSQL.

$ /usr/local/madlib/bin/madpack -p postgres -c postgres@127.0.0.1 install-check
Password for user postgres:

If everything passes, the installation should be fine.

Trying the Quick Start

From here on, I roughly follow the guide below.

Quick Start Guide for Users

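This trains a logistic regression model and uses it for prediction. The test problem uses data on heart attack patients with the following columns.

second_attack

Whether the patient had another heart attack within one year.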

treatment

Whether the patient received anger control treatment.

trait_anxiety

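The patient's trait anxiety score. (The higher the score, the more prone the patient is to anxiety.)

I am not entirely sure, but I think that is roughly what the columns mean. The model relates whether the patient received anger control treatment and the patient's trait anxiety level to whether a second heart attack occurred.

First, create the training data.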

# su - postgres

$ scl enable rh-postgresql95 bash

$ psql
psql (9.5.2)
Type "help" for help.

postgres=#


DROP TABLE IF EXISTS patients, patients_logregr, patients_logregr_summary;

CREATE TABLE patients( id INTEGER NOT NULL,
                        second_attack INTEGER,
                        treatment INTEGER,
                        trait_anxiety INTEGER);

INSERT INTO patients VALUES                                                     
(1,     1,      1,      70),
(3,     1,      1,      50),
(5,     1,      0,      40),
(7,     1,      0,      75),
(9,     1,      0,      70),
(11,    0,      1,      65),
(13,    0,      1,      45),
(15,    0,      1,      40),
(17,    0,      0,      55),
(19,    0,      0,      50),
(2,     1,      1,      80),
(4,     1,      0,      60),
(6,     1,      0,      65),
(8,     1,      0,      80),
(10,    1,      0,      60),
(12,    0,      1,      50),
(14,    0,      1,      35),
(16,    0,      1,      50),
(18,    0,      0,      45),
(20,    0,      0,      60);

Using this data, train a logistic regression model.

SELECT madlib.logregr_train(
    'patients',                                 -- source table
    'patients_logregr',                         -- output table
    'second_attack',                            -- labels
    'ARRAY[1, treatment, trait_anxiety]',       -- features
    NULL,                                       -- grouping columns
    20,                                         -- max number of iteration
    'irls'                                      -- optimizer
    );

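I do not really understand this part yet. Please see the documentation! (Things get a bit hand-wavy from here on, sorry.)
Logistic Regression
Without some understanding of logistic regression, this part may be hard to follow.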
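To get a feel for what the 'irls' optimizer is doing, here is a minimal sketch of textbook iteratively reweighted least squares (Newton's method) for logistic regression in plain Python, run on the same 20 rows. This is not MADlib's implementation. The function names and the convergence tolerance are my own assumptions, and the only claim is that a standard Newton iteration on this data should land close to the coefficients MADlib reports.

```python
import math

# (second_attack, treatment, trait_anxiety), copied from the INSERT above.
ROWS = [
    (1, 1, 70), (1, 1, 50), (1, 0, 40), (1, 0, 75), (1, 0, 70),
    (0, 1, 65), (0, 1, 45), (0, 1, 40), (0, 0, 55), (0, 0, 50),
    (1, 1, 80), (1, 0, 60), (1, 0, 65), (1, 0, 80), (1, 0, 60),
    (0, 1, 50), (0, 1, 35), (0, 1, 50), (0, 0, 45), (0, 0, 60),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_logistic(rows, max_iter=20, tol=1e-10):
    """Newton / IRLS fit with features [1, treatment, trait_anxiety]."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(max_iter):
        grad = [0.0] * 3                      # X^T (y - p)
        hess = [[0.0] * 3 for _ in range(3)]  # X^T W X with W = p (1 - p)
        for y, treatment, anxiety in rows:
            x = [1.0, float(treatment), float(anxiety)]
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            w = p * (1.0 - p)
            for i in range(3):
                grad[i] += (y - p) * x[i]
                for j in range(3):
                    hess[i][j] += w * x[i] * x[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


print(fit_logistic(ROWS))
```

The three numbers printed correspond to the intercept, the treatment coefficient and the trait_anxiety coefficient, in the same order as the ARRAY[1, treatment, trait_anxiety] feature expression passed to logregr_train.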

I am not sure of the details, but this should have written the training result (the model) to the patients_logregr table. Let's take a look.

postgres=# \x on
Expanded display is on.

postgres=# SELECT * from patients_logregr;
-[ RECORD 1 ]------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
coef                     | {-6.36346994178187,-1.02410605239327,0.119044916668606}
log_likelihood           | -9.41018298388876
std_err                  | {3.21389766375091,1.17107844860318,0.0549790458269303}
z_stats                  | {-1.9799852414576,-0.874498248699553,2.1652779686892}
p_values                 | {0.0477051870698109,0.381846973530448,0.0303664045046153}
odds_ratios              | {0.00172337630923231,0.359117354054955,1.12642051220895}
condition_no             | 326.081922791559
num_rows_processed       | 20
num_missing_rows_skipped | 0
num_iterations           | 5
variance_covariance      | {{10.3291381930635,-0.474304665195729,-0.171995901260048},{-0.474304665195729,1.37142473278283,-0.00119520703381591},{-0.171995901260048,-0.00119520703381591,0.00302269548003971}}

There is something in the table, so the model must have been built.
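Several of these columns can be derived from coef and std_err. The sketch below assumes the usual textbook definitions (odds ratio = exp(coef), z = coef / std_err, two-sided p-value from the standard normal distribution). I have not checked MADlib's source, so treat these formulas as my reading of the output rather than a description of the library.

```python
import math

# Values copied from the patients_logregr output above.
coef = [-6.36346994178187, -1.02410605239327, 0.119044916668606]
std_err = [3.21389766375091, 1.17107844860318, 0.0549790458269303]

# Odds ratio: multiplicative change in the odds per unit of each feature.
odds_ratios = [math.exp(c) for c in coef]

# Wald z statistic: coefficient divided by its standard error.
z_stats = [c / s for c, s in zip(coef, std_err)]

# Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)),
# which equals erfc(|z| / sqrt(2)).
p_values = [math.erfc(abs(z) / math.sqrt(2.0)) for z in z_stats]

print(odds_ratios)
print(z_stats)
print(p_values)
```

Read this way, the trait_anxiety coefficient (third entry) has a p-value of about 0.03, while the treatment coefficient (second entry) has a p-value of about 0.38, so with only 20 rows the treatment effect is not clearly distinguishable from zero.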

Now let's use this trained model to predict second_attack (whether a heart attack recurs). As a first try, predict on the original training data.

postgres=# \x off
Expanded display is off.

postgres=# SELECT id, second_attack, madlib.logregr_predict(coef, ARRAY[1, treatment, trait_anxiety]), madlib.logregr_predict_prob(coef, ARRAY[1, treatment, trait_anxiety])
FROM patients p, patients_logregr m ORDER BY id;
 id | second_attack | logregr_predict | logregr_predict_prob 
----+---------------+-----------------+----------------------
  1 |             1 | t               |    0.720223028941525
  2 |             1 | t               |    0.894354902502046
  3 |             1 | f               |    0.192269541755172
  4 |             1 | t               |    0.685513072239347
  5 |             1 | f               |     0.16774788150886
  6 |             1 | t               |     0.79809810891514
  7 |             1 | t               |    0.928568075752502
  8 |             1 | t               |     0.95930576369357
  9 |             1 | t               |    0.877576117431451
 10 |             1 | t               |    0.685513072239347
 11 |             0 | t               |    0.586700895943316
 12 |             0 | f               |    0.192269541755172
 13 |             0 | f               |    0.116032010632995
 14 |             0 | f               |   0.0383829143134989
 15 |             0 | f               |   0.0674976224147607
 16 |             0 | f               |    0.192269541755172
 17 |             0 | t               |    0.545870774302622
 18 |             0 | f               |    0.267675422387135
 19 |             0 | f               |    0.398618639285114
 20 |             0 | t               |    0.685513072239347
(20 rows)

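second_attack is the second_attack value from the original data, that is, whether a heart attack actually recurred.

logregr_predict is the prediction, returned as true/false.

logregr_predict_prob is also the prediction, returned as a probability. It looks like a probability of 50% or higher corresponds to true above.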
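The probabilities above can be reproduced by hand with the logistic function applied to the dot product of coef and the feature array. The sketch below is plain Python, not a call into MADlib. The function names are my own, and the 0.5 threshold is inferred from the output above rather than taken from the MADlib documentation.

```python
import math

# coef copied from the patients_logregr table.
COEF = [-6.36346994178187, -1.02410605239327, 0.119044916668606]


def predict_prob(coef, features):
    """Logistic function of the linear predictor coef . features."""
    z = sum(c * x for c, x in zip(coef, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(coef, features):
    """True when the predicted probability is at least 0.5 (assumed threshold)."""
    return predict_prob(coef, features) >= 0.5


# Patient id 1: treatment = 1, trait_anxiety = 70.
print(predict_prob(COEF, [1, 1, 70]), predict(COEF, [1, 1, 70]))
# Patient id 5: treatment = 0, trait_anxiety = 40.
print(predict_prob(COEF, [1, 0, 40]), predict(COEF, [1, 0, 40]))
```

The leading 1 in each feature list plays the same role as in ARRAY[1, treatment, trait_anxiety]: it multiplies the intercept term.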

Predicting on the original data, 15 of the 20 cases were predicted correctly.
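The count can be checked by tallying the result table. This small sketch hard-codes the second_attack and logregr_predict columns in id order (1 to 20), exactly as printed above, and also breaks the result down into a confusion matrix.

```python
# second_attack for ids 1..20, from the query output.
ACTUAL = [1] * 10 + [0] * 10

# logregr_predict for ids 1..20 ('t' = true, 'f' = false), from the query output.
PREDICTED = [c == "t" for c in "ttftfttttt" "tffffftfft"]

tp = sum(1 for a, p in zip(ACTUAL, PREDICTED) if a == 1 and p)      # hit
fn = sum(1 for a, p in zip(ACTUAL, PREDICTED) if a == 1 and not p)  # missed attack
fp = sum(1 for a, p in zip(ACTUAL, PREDICTED) if a == 0 and p)      # false alarm
tn = sum(1 for a, p in zip(ACTUAL, PREDICTED) if a == 0 and not p)  # correct negative

correct = tp + tn
print("correct:", correct, "of", len(ACTUAL))
print("tp, fn, fp, tn:", tp, fn, fp, tn)
```

Keep in mind that this is accuracy on the same data the model was trained on, so it says little about how well the model would do on new patients.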

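Conclusion

For someone with a reasonable understanding of logistic regression and the like, SQL makes it easy to do machine learning. Well, in my case, the "reasonable understanding" part is the real problem...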