구분 | 설명 | 값 |
---|---|---|
Survival | 생존 여부 | Survival. 0 = No, 1 = Yes |
Pclass | 티켓의 클래스 | Ticket class. 1 = 1st, 2 = 2nd, 3 = 3rd |
Sex | 성별(Sex) | 남(male)/여(female) |
Age | 나이(Age in years.) | |
SibSp | 함께 탑승한 형제와 배우자의 수 /siblings, spouses aboard the Titanic. | |
Parch | 함께 탑승한 부모, 아이의 수 | # of parents / children aboard the Titanic. |
Ticket | 티켓 번호(Ticket number) | (ex) CA 31352, A/5. 2151 |
Fare | 탑승료(Passenger fare) | |
Cabin | 객실 번호(Cabin number) | |
Embarked | 탑승 항구(Port of Embarkation) | C = Cherbourg, Q = Queenstown, S = Southampton |
import pandas as pd
train = pd.read_csv("data/titanic/train.csv")
test = pd.read_csv("data/titanic/test.csv")
sub = pd.read_csv("data/titanic/gender_submission.csv")
print(train.shape)
print(test.shape)
print(sub.shape)
(891, 12) (418, 11) (418, 2)
print(train.columns)
print(test.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object') Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 418 non-null int64 1 Pclass 418 non-null int64 2 Name 418 non-null object 3 Sex 418 non-null object 4 Age 332 non-null float64 5 SibSp 418 non-null int64 6 Parch 418 non-null int64 7 Ticket 418 non-null object 8 Fare 417 non-null float64 9 Cabin 91 non-null object 10 Embarked 418 non-null object dtypes: float64(2), int64(4), object(5) memory usage: 36.0+ KB
train.isnull().sum()
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64
test.isnull().sum()
PassengerId 0 Pclass 0 Name 0 Sex 0 Age 86 SibSp 0 Parch 0 Ticket 0 Fare 1 Cabin 327 Embarked 0 dtype: int64
train.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
test.describe()
PassengerId | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|
count | 418.000000 | 418.000000 | 332.000000 | 418.000000 | 418.000000 | 417.000000 |
mean | 1100.500000 | 2.265550 | 30.272590 | 0.447368 | 0.392344 | 35.627188 |
std | 120.810458 | 0.841838 | 14.181209 | 0.896760 | 0.981429 | 55.907576 |
min | 892.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 |
25% | 996.250000 | 1.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
50% | 1100.500000 | 3.000000 | 27.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 1204.750000 | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.500000 |
max | 1309.000000 | 3.000000 | 76.000000 | 8.000000 | 9.000000 | 512.329200 |
train.describe(include=['O'])
Name | Sex | Ticket | Cabin | Embarked | |
---|---|---|---|---|---|
count | 891 | 891 | 891 | 204 | 889 |
unique | 891 | 2 | 681 | 147 | 3 |
top | Moore, Mr. Leonard Charles | male | 1601 | G6 | S |
freq | 1 | 577 | 7 | 4 | 644 |
test.describe(include=['O'])
Name | Sex | Ticket | Cabin | Embarked | |
---|---|---|---|---|---|
count | 418 | 418 | 418 | 91 | 418 |
unique | 418 | 2 | 363 | 76 | 3 |
top | Ford, Mr. Arthur | male | PC 17608 | B57 B59 B63 B66 | S |
freq | 1 | 266 | 5 | 3 | 270 |
from sklearn.neighbors import KNeighborsClassifier
# 데이터 준비 - 빠른 모델 생성을 위해 데이터 전처리가 불필요한 피처(변수)만 선택
# 'Survived'를 제외 (예측해야할 피처) ,
# 'Embarked', 'Sex'',Name', 'Ticket' => 문자포함
# 'Age' : 결측치가 있음
sel = ['PassengerId', 'Pclass', 'SibSp', 'Parch' ]
# 학습에 사용될 데이터 준비 X_train, y_train
X_train = train[sel]
y_train = train['Survived']
# 학습 후, 테스트에 사용될 피처 선택
X_test = test[sel]
X_train.shape, y_train.shape, X_test.shape
((891, 4), (891,), (418, 4))
# 모델 선택
model = KNeighborsClassifier()
# 학습
model.fit(X_train, y_train)
# 예측
pred = model.predict(X_test)
pred[:15] # 예측한 값을 15개 정도 확인
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# 제출
sub.columns
Index(['PassengerId', 'Survived'], dtype='object')
sub['Survived'] = pred
sub.to_csv("first_sub.csv", index=False)