[Dataset] German Credit Data

The German Credit Data dataset is a binary classification dataset. Download: statlog/german

Overview

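  • The file german.data provides 20 attributes (13 categorical + 7 numerical) for 1000 instances; the last column is the class label (1 or 2)
  • The file german.data-numeric provides 24 numerical attributes for 1000 instances; the last column is the class label (1 or 2)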

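Originally the dataset provided only german.data. The file german.data-numeric was added later, with additional attributes and every attribute converted to a numeric value.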
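To make the difference between categorical and numeric attributes concrete, here is a minimal sketch of one common way to turn categorical codes into numbers: one-hot encoding with pandas. The rows and column names below are made up for illustration, and this is not necessarily the encoding that was used to produce german.data-numeric.

```python
import pandas as pd

# Toy stand-in for a few german.data rows: made-up values, not real records.
# Column names are hypothetical; the raw file has no header row.
toy = pd.DataFrame({
    "status": ["A11", "A12", "A14", "A11"],  # categorical code
    "duration": [6, 48, 12, 42],             # numeric attribute
    "label": [1, 2, 1, 2],                   # class label, 1 or 2
})

features = toy.drop(columns="label")

# Treat every non-numeric column as categorical.
cat_cols = [c for c in features.columns
            if not pd.api.types.is_numeric_dtype(features[c])]

# One-hot encode the categorical columns; numeric columns pass through.
encoded = pd.get_dummies(features, columns=cat_cols, dtype=float)

# Same label convention as the loader in this post: class 2 -> 1, class 1 -> 0.
y = (toy["label"] == 2).astype(int).to_numpy()

print(encoded)
print(y)
```

One-hot encoding widens the table by one column per category value, so a model that expects numeric input can use the file without treating the codes as ordered numbers.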

Reading with Python

Use pandas to read the data file, use sklearn to split it into a training set and a test set, and then standardize the data

# -*- coding: utf-8 -*-

"""
@author: zj
@file: german.py
@time: 2019-12-13
"""

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


def load_german_data(data_path, shuffle=True, tsize=0.8):
    # The file is whitespace-delimited and has no header row
    data_list = pd.read_csv(data_path, header=None, sep=r'\s+')

    # The last column is the class label; all other columns are attributes
    data_array = data_list.values
    height, width = data_array.shape[:2]
    data_x = data_array[:, :(width - 1)]
    data_y = data_array[:, (width - 1)]

    x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, train_size=tsize, test_size=(1 - tsize),
                                                        shuffle=shuffle)

    # Remap the labels: class 2 -> 1, class 1 -> 0
    y_train = (y_train == 2).astype(int)
    y_test = (y_test == 2).astype(int)

    return x_train, x_test, y_train, y_test


if __name__ == '__main__':
    data_path = '/home/zj/data/german/german.data-numeric'
    x_train, x_test, y_train, y_test = load_german_data(data_path)

    x_train = x_train.astype(np.double)
    x_test = x_test.astype(np.double)
    # Compute the mean and variance of each attribute on the training set
    mu = np.mean(x_train, axis=0)
    var = np.var(x_train, axis=0)
    eps = 1e-8
    # Rescale to zero mean and unit variance, using the training statistics for both splits
    x_train = (x_train - mu) / np.sqrt(var + eps)
    x_test = (x_test - mu) / np.sqrt(var + eps)

    print(x_test)
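The standardization step at the end of the script can also be tried on its own. The following is a minimal sketch on made-up toy values. The helper name `standardize` is hypothetical and not part of the original script. It shows that the test split is scaled with the training statistics, and that `eps` keeps a constant column from causing a division by zero.

```python
import numpy as np


def standardize(x_train, x_test, eps=1e-8):
    """Scale both splits with the training set's per-column mean and variance."""
    x_train = np.asarray(x_train, dtype=np.double)
    x_test = np.asarray(x_test, dtype=np.double)
    mu = x_train.mean(axis=0)
    var = x_train.var(axis=0)
    scale = np.sqrt(var + eps)  # eps guards against zero-variance columns
    return (x_train - mu) / scale, (x_test - mu) / scale


# Made-up toy values, not taken from the dataset.
# The second column is constant, so its variance is zero.
toy_train = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
toy_test = np.array([[2.0, 10.0]])

z_train, z_test = standardize(toy_train, toy_test)
print(z_train)
print(z_test)
```

Computing `mu` and `var` from the training split only, as the script does, keeps information from the test set out of the preprocessing.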
