Titanic Survival Prediction Competition

Learning Objectives

  • Use the Ticket and Sex (gender) features.
  • Use GridSearchCV to find good hyperparameters.
  • Practice transforming the data.
  • Learn how to create new variables.

Contents

01. Loading the data
02. Data preprocessing
03. Modeling
04. Prediction

Data

Data Fields

Field | Description | Values
Survival | Survival status | 0 = No, 1 = Yes
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
Sex | Sex | male / female
Age | Age in years |
SibSp | Number of siblings / spouses aboard the Titanic |
Parch | Number of parents / children aboard the Titanic |
Ticket | Ticket number | e.g. CA 31352, A/5. 2151
Fare | Passenger fare |
Cabin | Cabin number |
Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton
  • siblings : brother, sister, stepbrother, stepsister
  • spouses : husband, wife (mistresses and fiancés were ignored)
  • Parch : Parent(mother, father), child(daughter, son, stepdaughter, stepson)

01. Loading the data


Reference notebooks

  • Full Titanic notebooks
    • https://www.kaggle.com/code/pliptor/how-am-i-doing-with-my-score/report
    • https://www.kaggle.com/code/pliptor/titanic-ticket-only-study/notebook
In [109]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
In [110]:
sel_f = ['Ticket', 'Pclass', 'Sex']

train = pd.read_csv("data/titanic/train.csv", 
                    usecols=['PassengerId', 'Survived']+sel_f)

test = pd.read_csv("data/titanic/test.csv", 
                   usecols=['PassengerId'] + sel_f)

sub = pd.read_csv("data/titanic/gender_submission.csv")
In [111]:
# Add a placeholder Survived column to test, then stack train and test
test['Survived'] = np.nan
all_df = pd.concat([train, test])
all_df.head()
Out[111]:
PassengerId Survived Pclass Sex Ticket
0 1 0.0 3 male A/5 21171
1 2 1.0 1 female PC 17599
2 3 1.0 3 female STON/O2. 3101282
3 4 1.0 1 female 113803
4 5 0.0 3 male 373450

02. Inspecting the Ticket variable


In [112]:
all_df.Ticket.unique()
Out[112]:
array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450',
       '330877', '17463', '349909', '347742', '237736', 'PP 9549',
       '113783', 'A/5. 2151', '347082', '350406', '248706', '382652',
       '244373', '345763', '2649', '239865', '248698', '330923', '113788',
       '347077', '2631', '19950', '330959', '349216', 'PC 17601',
       'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789', '2677',
       'A./5. 2152', '345764', '2651', '7546', '11668', '349253',
       'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371', '14311',
       '2662', '349237', '3101295', 'A/4. 39886', 'PC 17572', '2926',
       '113509', '19947', 'C.A. 31026', '2697', 'C.A. 34651', 'CA 2144',
       '2669', '113572', '36973', '347088', 'PC 17605', '2661',
       'C.A. 29395', 'S.P. 3464', '3101281', '315151', 'C.A. 33111',
       'S.O.C. 14879', '2680', '1601', '348123', '349208', '374746',
       '248738', '364516', '345767', '345779', '330932', '113059',
       'SO/C 14885', '3101278', 'W./C. 6608', 'SOTON/OQ 392086', '343275',
       '343276', '347466', 'W.E.P. 5734', 'C.A. 2315', '364500', '374910',
       'PC 17754', 'PC 17759', '231919', '244367', '349245', '349215',
       '35281', '7540', '3101276', '349207', '343120', '312991', '349249',
       '371110', '110465', '2665', '324669', '4136', '2627',
       'STON/O 2. 3101294', '370369', 'PC 17558', 'A4. 54510', '27267',
       '370372', 'C 17369', '2668', '347061', '349241',
       'SOTON/O.Q. 3101307', 'A/5. 3337', '228414', 'C.A. 29178',
       'SC/PARIS 2133', '11752', '7534', 'PC 17593', '2678', '347081',
       'STON/O2. 3101279', '365222', '231945', 'C.A. 33112', '350043',
       '230080', '244310', 'S.O.P. 1166', '113776', 'A.5. 11206',
       'A/5. 851', 'Fa 265302', 'PC 17597', '35851', 'SOTON/OQ 392090',
       '315037', 'CA. 2343', '371362', 'C.A. 33595', '347068', '315093',
       '363291', '113505', 'PC 17318', '111240', 'STON/O 2. 3101280',
       '17764', '350404', '4133', 'PC 17595', '250653', 'LINE',
       'SC/PARIS 2131', '230136', '315153', '113767', '370365', '111428',
       '364849', '349247', '234604', '28424', '350046', 'PC 17610',
       '368703', '4579', '370370', '248747', '345770', '3101264', '2628',
       'A/5 3540', '347054', '2699', '367231', '112277',
       'SOTON/O.Q. 3101311', 'F.C.C. 13528', 'A/5 21174', '250646',
       '367229', '35273', 'STON/O2. 3101283', '243847', '11813',
       'W/C 14208', 'SOTON/OQ 392089', '220367', '21440', '349234',
       '19943', 'PP 4348', 'SW/PP 751', 'A/5 21173', '236171', '347067',
       '237442', 'C.A. 29566', 'W./C. 6609', '26707', 'C.A. 31921',
       '28665', 'SCO/W 1585', '367230', 'W./C. 14263',
       'STON/O 2. 3101275', '2694', '19928', '347071', '250649', '11751',
       '244252', '362316', '113514', 'A/5. 3336', '370129', '2650',
       'PC 17585', '110152', 'PC 17755', '230433', '384461', '110413',
       '112059', '382649', 'C.A. 17248', '347083', 'PC 17582', 'PC 17760',
       '113798', '250644', 'PC 17596', '370375', '13502', '347073',
       '239853', 'C.A. 2673', '336439', '347464', '345778', 'A/5. 10482',
       '113056', '349239', '345774', '349206', '237798', '370373',
       '19877', '11967', 'SC/Paris 2163', '349236', '349233', 'PC 17612',
       '2693', '113781', '19988', '9234', '367226', '226593', 'A/5 2466',
       '17421', 'PC 17758', 'P/PP 3381', 'PC 17485', '11767', 'PC 17608',
       '250651', '349243', 'F.C.C. 13529', '347470', '29011', '36928',
       '16966', 'A/5 21172', '349219', '234818', '345364', '28551',
       '111361', '113043', 'PC 17611', '349225', '7598', '113784',
       '248740', '244361', '229236', '248733', '31418', '386525',
       'C.A. 37671', '315088', '7267', '113510', '2695', '2647', '345783',
       '237671', '330931', '330980', 'SC/PARIS 2167', '2691',
       'SOTON/O.Q. 3101310', 'C 7076', '110813', '2626', '14313',
       'PC 17477', '11765', '3101267', '323951', 'C 7077', '113503',
       '2648', '347069', 'PC 17757', '2653', 'STON/O 2. 3101293',
       '349227', '27849', '367655', 'SC 1748', '113760', '350034',
       '3101277', '350052', '350407', '28403', '244278', '240929',
       'STON/O 2. 3101289', '341826', '4137', '315096', '28664', '347064',
       '29106', '312992', '349222', '394140', 'STON/O 2. 3101269',
       '343095', '28220', '250652', '28228', '345773', '349254',
       'A/5. 13032', '315082', '347080', 'A/4. 34244', '2003', '250655',
       '364851', 'SOTON/O.Q. 392078', '110564', '376564', 'SC/AH 3085',
       'STON/O 2. 3101274', '13507', 'C.A. 18723', '345769', '347076',
       '230434', '65306', '33638', '113794', '2666', '113786', '65303',
       '113051', '17453', 'A/5 2817', '349240', '13509', '17464',
       'F.C.C. 13531', '371060', '19952', '364506', '111320', '234360',
       'A/S 2816', 'SOTON/O.Q. 3101306', '113792', '36209', '323592',
       '315089', 'SC/AH Basle 541', '7553', '31027', '3460', '350060',
       '3101298', '239854', 'A/5 3594', '4134', '11771', 'A.5. 18509',
       '65304', 'SOTON/OQ 3101317', '113787', 'PC 17609', 'A/4 45380',
       '36947', 'C.A. 6212', '350035', '315086', '364846', '330909',
       '4135', '26360', '111427', 'C 4001', '382651', 'SOTON/OQ 3101316',
       'PC 17473', 'PC 17603', '349209', '36967', 'C.A. 34260', '226875',
       '349242', '12749', '349252', '2624', '2700', '367232',
       'W./C. 14258', 'PC 17483', '3101296', '29104', '2641', '2690',
       '315084', '113050', 'PC 17761', '364498', '13568', 'WE/P 5735',
       '2908', '693', 'SC/PARIS 2146', '244358', '330979', '2620',
       '347085', '113807', '11755', '345572', '372622', '349251',
       '218629', 'SOTON/OQ 392082', 'SOTON/O.Q. 392087', 'A/4 48871',
       '349205', '2686', '350417', 'S.W./PP 752', '11769', 'PC 17474',
       '14312', 'A/4. 20589', '358585', '243880', '2689',
       'STON/O 2. 3101286', '237789', '13049', '3411', '237565', '13567',
       '14973', 'A./5. 3235', 'STON/O 2. 3101273', 'A/5 3902', '364848',
       'SC/AH 29037', '248727', '2664', '349214', '113796', '364511',
       '111426', '349910', '349246', '113804', 'SOTON/O.Q. 3101305',
       '370377', '364512', '220845', '31028', '2659', '11753', '350029',
       '54636', '36963', '219533', '349224', '334912', '27042', '347743',
       '13214', '112052', '237668', 'STON/O 2. 3101292', '350050',
       '349231', '13213', 'S.O./P.P. 751', 'CA. 2314', '349221', '8475',
       '330919', '365226', '349223', '29751', '2623', '5727', '349210',
       'STON/O 2. 3101285', '234686', '312993', 'A/5 3536', '19996',
       '29750', 'F.C. 12750', 'C.A. 24580', '244270', '239856', '349912',
       '342826', '4138', '330935', '6563', '349228', '350036', '24160',
       '17474', '349256', '2672', '113800', '248731', '363592', '35852',
       '348121', 'PC 17475', '36864', '350025', '223596', 'PC 17476',
       'PC 17482', '113028', '7545', '250647', '348124', '34218', '36568',
       '347062', '350048', '12233', '250643', '113806', '315094', '36866',
       '236853', 'STON/O2. 3101271', '239855', '28425', '233639',
       '349201', '349218', '16988', '376566', 'STON/O 2. 3101288',
       '250648', '113773', '335097', '29103', '392096', '345780',
       '349204', '350042', '29108', '363294', 'SOTON/O2 3101272', '2663',
       '347074', '112379', '364850', '8471', '345781', '350047',
       'S.O./P.P. 3', '2674', '29105', '347078', '383121', '36865',
       '2687', '113501', 'W./C. 6607', 'SOTON/O.Q. 3101312', '374887',
       '3101265', '12460', 'PC 17600', '349203', '28213', '17465',
       '349244', '2685', '2625', '347089', '347063', '112050', '347087',
       '248723', '3474', '28206', '364499', '112058', 'STON/O2. 3101290',
       'S.C./PARIS 2079', 'C 7075', '315098', '19972', '368323', '367228',
       '2671', '347468', '2223', 'PC 17756', '315097', '392092', '11774',
       'SOTON/O2 3101287', '2683', '315090', 'C.A. 5547', '349213',
       '347060', 'PC 17592', '392091', '113055', '2629', '350026',
       '28134', '17466', '233866', '236852', 'SC/PARIS 2149', 'PC 17590',
       '345777', '349248', '695', '345765', '2667', '349212', '349217',
       '349257', '7552', 'C.A./SOTON 34068', 'SOTON/OQ 392076', '211536',
       '112053', '111369', '370376', '330911', '363272', '240276',
       '315154', '7538', '330972', '2657', '349220', '694', '21228',
       '24065', '233734', '2692', 'STON/O2. 3101270', '2696', 'C 17368',
       'PC 17598', '2698', '113054', 'C.A. 31029', '13236', '2682',
       '342712', '315087', '345768', '113778', 'SOTON/O.Q. 3101263',
       '237249', 'STON/O 2. 3101291', 'PC 17594', '370374', '13695',
       'SC/PARIS 2168', 'SC/A.3 2861', '349230', '348122', '349232',
       '237216', '347090', '334914', 'F.C.C. 13534', '330963', '2543',
       '382653', '349211', '3101297', 'PC 17562', '359306', '11770',
       '248744', '368702', '19924', '349238', '240261', '2660', '330844',
       'A/4 31416', '364856', '347072', '345498', '376563', '13905',
       '350033', 'STON/O 2. 3101268', '347471', 'A./5. 3338', '11778',
       '365235', '347070', '330920', '383162', '3410', '248734', '237734',
       '330968', 'PC 17531', '329944', '2681', '13050', '367227',
       '392095', '368783', '350045', '211535', '342441',
       'STON/OQ. 369943', '113780', '2621', '349226', '350409', '2656',
       '248659', 'SOTON/OQ 392083', '17475', 'SC/A4 23568', '113791',
       '349255', '3701', '350405', 'S.O./P.P. 752', '347469', '110489',
       'SOTON/O.Q. 3101315', '335432', '220844', '343271', '237393',
       'PC 17591', '17770', '7548', 'S.O./P.P. 251', '2670', '2673',
       '233478', '7935', '239059', 'S.O./P.P. 2', 'A/4 48873', '28221',
       '111163', '235509', '347465', '347066', 'C.A. 31030', '65305',
       'C.A. 34050', 'F.C. 12998', '9232', '28034', 'PC 17613', '349250',
       'SOTON/O.Q. 3101308', '347091', '113038', '330924', '32302',
       'SC/PARIS 2148', '342684', 'W./C. 14266', '350053', 'PC 17606',
       '350054', '370368', '242963', '113795', '3101266', '330971',
       '350416', '2679', '250650', '112377', '3470', 'SOTON/O2 3101284',
       '13508', '7266', '345775', 'C.A. 42795', 'AQ/4 3130', '363611',
       '28404', '345501', '350410', 'C.A. 34644', '349235', '112051',
       'C.A. 49867', 'A. 2. 39186', '315095', '368573', '2676',
       'SC 14888', 'CA 31352', 'W./C. 14260', '315085', '364859',
       'A/5 21175', 'SOTON/O.Q. 3101314', '2655', 'A/5 1478', 'PC 17607',
       '382650', '2652', '345771', '349202', '113801', '347467', '347079',
       '237735', '315092', '383123', '112901', '315091', '2658',
       'LP 1588', '368364', 'AQ/3. 30631', '28004', '350408', '347075',
       '2654', '244368', '113790', 'SOTON/O.Q. 3101309', '236854',
       'PC 17580', '2684', '349229', '110469', '244360', '2675', '2622',
       'C.A. 15185', '350403', '348125', '237670', '2688', '248726',
       'F.C.C. 13540', '113044', '1222', '368402', '315083', '112378',
       'SC/PARIS 2147', '28133', '248746', '315152', '29107', '680',
       '366713', '330910', 'SC/PARIS 2159', '349911', '244346', '364858',
       'C.A. 30769', '371109', '347065', '21332', '17765',
       'SC/PARIS 2166', '28666', '334915', '365237', '347086',
       'A.5. 3236', 'SOTON/O.Q. 3101262', '359309'], dtype=object)
  • Tickets come in three forms: purely numeric, numeric with an alphanumeric prefix, and the special value LINE, described here as issued to crew.
In [113]:
all_df.loc[all_df['Ticket']=='LINE']
Out[113]:
PassengerId Survived Pclass Sex Ticket
179 180 0.0 3 male LINE
271 272 1.0 3 male LINE
302 303 0.0 3 male LINE
597 598 0.0 3 male LINE
  • Change LINE to LINE 0 so that it has the same "prefix number" shape as the other tickets.
In [114]:
all_df['Ticket'] = all_df['Ticket'].replace('LINE', 'LINE 0')
all_df[all_df['Ticket']=='LINE 0']
Out[114]:
PassengerId Survived Pclass Sex Ticket
179 180 0.0 3 male LINE 0
271 272 1.0 3 male LINE 0
302 303 0.0 3 male LINE 0
597 598 0.0 3 male LINE 0

Checking for duplicate tickets

In [115]:
dup_tickets = all_df.groupby('Ticket').size()
dup_tickets
Out[115]:
Ticket
110152         3
110413         3
110465         2
110469         1
110489         1
              ..
W./C. 6608     5
W./C. 6609     1
W.E.P. 5734    2
W/C 14208      1
WE/P 5735      2
Length: 929, dtype: int64

Number of passengers per ticket

In [116]:
all_df['중복티켓수'] = all_df['Ticket'].map(dup_tickets)
plt.xlabel('duplications')
plt.ylabel('frequency')
plt.title('Duplicate Tickets')
all_df['중복티켓수'].hist(bins=20)
Out[116]:
<AxesSubplot:title={'center':'Duplicate Tickets'}, xlabel='duplications', ylabel='frequency'>
  • Tickets held by a single passenger are by far the most common.
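
The per-row count above is built in two steps (a groupby size, then a map). As a minimal sketch on a toy frame, assuming hypothetical ticket values and column names (n_map, n_transform) that are not in the original notebook, the same counts can also be obtained in one step with transform:

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the competition files)
toy = pd.DataFrame({"Ticket": ["A1", "A1", "B2", "C3", "A1"]})

# Two-step form used in the notebook: ticket -> count Series, then map onto rows
counts = toy.groupby("Ticket").size()
toy["n_map"] = toy["Ticket"].map(counts)

# One-step form: broadcast each group's size back to its rows
toy["n_transform"] = toy.groupby("Ticket")["Ticket"].transform("size")

print(toy)
```

Both columns should hold the same values; which form to use is mostly a matter of taste.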

Cleaning the ticket values

  • Remove '.' and '/' and convert to lowercase.
In [117]:
all_df['Ticket'] = all_df['Ticket'].apply(lambda x: x.replace('.','').replace('/','').lower())
all_df.head()
Out[117]:
PassengerId Survived Pclass Sex Ticket 중복티켓수
0 1 0.0 3 male a5 21171 1
1 2 1.0 1 female pc 17599 2
2 3 1.0 3 female stono2 3101282 1
3 4 1.0 1 female 113803 2
4 5 0.0 3 male 373450 1
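
The point of this cleanup is that differently punctuated spellings of one prefix collapse to a single value. A small sketch, using four ticket strings that appear in the unique-value listing above (the helper name normalize_ticket is introduced here for illustration only):

```python
def normalize_ticket(ticket: str) -> str:
    """Drop '.' and '/' and lowercase, mirroring the lambda used above."""
    return ticket.replace(".", "").replace("/", "").lower()

# Four spellings of the same prefix found in the raw Ticket column
variants = ["A/5 21171", "A/5. 2151", "A./5. 2152", "A.5. 11206"]

normalized = [normalize_ticket(t) for t in variants]
prefixes = {t.split(" ")[0] for t in normalized}

print(normalized)
print(prefixes)  # a single shared prefix remains
```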

Split each Ticket on the space and create a variable from the leading token

In [118]:
"aaaaa 000000".split(' ')[0][0]    # first character of the first token
Out[118]:
'a'
In [119]:
def get_prefix(ticket):
    lead = ticket.split(' ')[0][0]
    
    # Check whether the first character is a letter
    if lead.isalpha():
        return ticket.split(' ')[0]
    else:
        return 'NoPrefix'
    
all_df['Prefix'] = all_df['Ticket'].apply(lambda x: get_prefix(x))
all_df.head()
Out[119]:
PassengerId Survived Pclass Sex Ticket 중복티켓수 Prefix
0 1 0.0 3 male a5 21171 1 a5
1 2 1.0 1 female pc 17599 2 pc
2 3 1.0 3 female stono2 3101282 1 stono2
3 4 1.0 1 female 113803 2 NoPrefix
4 5 0.0 3 male 373450 1 NoPrefix
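
To see how the prefix rule behaves on the less common shapes, here is a compact restatement of the same logic applied to a few already-cleaned ticket strings. The inputs are chosen for illustration; "scah basle 541" is the cleaned form of 'SC/AH Basle 541' from the listing above.

```python
def get_prefix(ticket: str) -> str:
    """Return the first token if it starts with a letter, else 'NoPrefix'."""
    first_token = ticket.split(" ")[0]
    return first_token if first_token[0].isalpha() else "NoPrefix"

examples = ["pc 17599", "113803", "line 0", "scah basle 541"]
prefixes = {t: get_prefix(t) for t in examples}
print(prefixes)
```

Note that a three-token ticket keeps only its first token as the prefix; the middle token is discarded.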
In [120]:
"a5 21171".split(' ')[-1]
Out[120]:
'21171'
In [121]:
str("a5 21171")[0]
Out[121]:
'a'
In [122]:
val = int( "a5 21171".split(' ')[-1] )
str(val)
Out[122]:
'21171'
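  • TNumeric : the numeric part of the ticket (its last token) as an integer
  • TNlen : the number of digits in TNumeric
  • LeadingDigit : the first digit of TNumeric
  • TGroup : TNumeric with its last digit dropped (integer division by 10), stored as a string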
In [123]:
all_df['TNumeric'] = all_df['Ticket'].apply(lambda x: int(x.split(' ')[-1]))
all_df['TNlen'] = all_df['TNumeric'].apply(lambda x : len(str(x))) 
all_df['LeadingDigit'] = all_df['TNumeric'].apply(lambda x : int(str(x)[0]))
all_df['TGroup'] = all_df['Ticket'].apply(lambda x: str(int(x.split(' ')[-1])//10))
all_df.head()
Out[123]:
PassengerId Survived Pclass Sex Ticket 중복티켓수 Prefix TNumeric TNlen LeadingDigit TGroup
0 1 0.0 3 male a5 21171 1 a5 21171 5 2 2117
1 2 1.0 1 female pc 17599 2 pc 17599 5 1 1759
2 3 1.0 3 female stono2 3101282 1 stono2 3101282 7 3 310128
3 4 1.0 1 female 113803 2 NoPrefix 113803 6 1 11380
4 5 0.0 3 male 373450 1 NoPrefix 373450 6 3 37345
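
As a worked example, the four derived values can be computed for a single ticket with plain Python. The helper name ticket_features is hypothetical; the logic follows the four lambdas above.

```python
def ticket_features(ticket: str) -> dict:
    """Derive the four ticket-number features from a cleaned ticket string."""
    number = int(ticket.split(" ")[-1])       # numeric part (last token)
    return {
        "TNumeric": number,
        "TNlen": len(str(number)),            # digit count
        "LeadingDigit": int(str(number)[0]),  # first digit
        "TGroup": str(number // 10),          # drop the last digit
    }

first_row = ticket_features("a5 21171")
line_row = ticket_features("line 0")
print(first_row)
print(line_row)
```

The first call reproduces row 0 of the table above. The second shows why the LINE tickets end up with LeadingDigit 0 and TGroup '0'.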
In [124]:
pd.crosstab(all_df['Pclass'],all_df['LeadingDigit'])
Out[124]:
LeadingDigit 0 1 2 3 4 5 6 7 8 9
Pclass
1 0 288 8 18 0 5 4 0 0 0
2 0 32 205 37 0 1 0 2 0 0
3 4 22 136 476 22 4 17 18 5 5
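
The raw counts are easier to read as row shares. The sketch below re-enters the counts from the crosstab above and reports, for each class, the most common leading digit and the share of passengers that have it. One possible reading (an interpretation, not stated in the original) is that LeadingDigit already carries much of the Pclass information.

```python
import pandas as pd

# Counts copied from the crosstab above (rows: Pclass, columns: LeadingDigit 0-9)
counts = pd.DataFrame(
    [[0, 288, 8, 18, 0, 5, 4, 0, 0, 0],
     [0, 32, 205, 37, 0, 1, 0, 2, 0, 0],
     [4, 22, 136, 476, 22, 4, 17, 18, 5, 5]],
    index=pd.Index([1, 2, 3], name="Pclass"),
    columns=pd.Index(range(10), name="LeadingDigit"),
)

# Share of each class whose ticket number starts with each digit
row_share = counts.div(counts.sum(axis=1), axis=0)

# Per class: the most common leading digit and the share of passengers with it
summary = pd.DataFrame({
    "top_digit": row_share.idxmax(axis=1),
    "share": row_share.max(axis=1).round(3),
})
print(summary)
```

On the raw data, the same row shares can be requested directly with pd.crosstab(..., normalize='index').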
In [125]:
all_df = all_df.drop(columns=['Ticket', 'TNumeric', 'Pclass'])
all_df
Out[125]:
PassengerId Survived Sex 중복티켓수 Prefix TNlen LeadingDigit TGroup
0 1 0.0 male 1 a5 5 2 2117
1 2 1.0 female 2 pc 5 1 1759
2 3 1.0 female 1 stono2 7 3 310128
3 4 1.0 female 2 NoPrefix 6 1 11380
4 5 0.0 male 1 NoPrefix 6 3 37345
... ... ... ... ... ... ... ... ...
413 1305 NaN male 1 a5 4 3 323
414 1306 NaN female 3 pc 5 1 1775
415 1307 NaN male 1 sotonoq 7 3 310126
416 1308 NaN male 1 NoPrefix 6 3 35930
417 1309 NaN male 3 NoPrefix 4 2 266

1309 rows × 8 columns

In [126]:
all_df['Prefix']
Out[126]:
0            a5
1            pc
2        stono2
3      NoPrefix
4      NoPrefix
         ...   
413          a5
414          pc
415     sotonoq
416    NoPrefix
417    NoPrefix
Name: Prefix, Length: 1309, dtype: object
In [127]:
all_df = pd.concat([pd.get_dummies(all_df[['Prefix','TGroup']]), 
            all_df[['PassengerId','Survived','중복티켓수','TNlen','LeadingDigit', 'Sex']]],
            axis=1)

all_df
Out[127]:
Prefix_NoPrefix Prefix_a Prefix_a4 Prefix_a5 Prefix_aq3 Prefix_aq4 Prefix_as Prefix_c Prefix_ca Prefix_casoton ... TGroup_847 TGroup_85 TGroup_923 TGroup_954 PassengerId Survived 중복티켓수 TNlen LeadingDigit Sex
0 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 1 0.0 1 5 2 male
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 2 1.0 2 5 1 female
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 3 1.0 1 7 3 female
3 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 4 1.0 2 6 1 female
4 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 5 0.0 1 6 3 male
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
413 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 1305 NaN 1 4 3 male
414 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1306 NaN 3 5 1 female
415 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1307 NaN 1 7 3 male
416 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1308 NaN 1 6 3 male
417 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1309 NaN 3 4 2 male

1309 rows × 451 columns
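
One practical benefit of having stacked train and test before this step is that both parts receive the same dummy columns. A minimal sketch with hypothetical toy frames (train_toy, test_toy) shows the mismatch that appears when each part is encoded separately:

```python
import pandas as pd

# Toy frames (hypothetical): the test part contains a prefix unseen in train
train_toy = pd.DataFrame({"Prefix": ["pc", "a5", "NoPrefix"]})
test_toy = pd.DataFrame({"Prefix": ["pc", "sotonoq"]})

# Encoding each part on its own yields different column sets
sep_train = pd.get_dummies(train_toy)
sep_test = pd.get_dummies(test_toy)

# Encoding the stacked frame, then splitting, keeps the columns aligned
combined = pd.get_dummies(pd.concat([train_toy, test_toy], ignore_index=True))
both_train = combined.iloc[: len(train_toy)]
both_test = combined.iloc[len(train_toy):]

print(sorted(sep_train.columns), sorted(sep_test.columns))
print(list(both_train.columns) == list(both_test.columns))
```

The cost is a very wide, sparse matrix: here 451 columns for 1309 rows, most of them TGroup indicators.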

In [135]:
dict_s = {'male':0, 'female':1}
all_df['Sex'] = all_df['Sex'].map(dict_s)
In [136]:
predictors = sorted(list(set(all_df.columns) - set(['PassengerId','Survived'])))
predictors
Out[136]:
['LeadingDigit',
 'Prefix_NoPrefix',
 'Prefix_a',
 'Prefix_a4',
 'Prefix_a5',
 'Prefix_aq3',
 'Prefix_aq4',
 'Prefix_as',
 'Prefix_c',
 'Prefix_ca',
 'Prefix_casoton',
 'Prefix_fa',
 'Prefix_fc',
 'Prefix_fcc',
 'Prefix_line',
 'Prefix_lp',
 'Prefix_pc',
 'Prefix_pp',
 'Prefix_ppp',
 'Prefix_sc',
 'Prefix_sca3',
 'Prefix_sca4',
 'Prefix_scah',
 'Prefix_scow',
 'Prefix_scparis',
 'Prefix_soc',
 'Prefix_sop',
 'Prefix_sopp',
 'Prefix_sotono2',
 'Prefix_sotonoq',
 'Prefix_sp',
 'Prefix_stono',
 'Prefix_stono2',
 'Prefix_stonoq',
 'Prefix_swpp',
 'Prefix_wc',
 'Prefix_wep',
 'Sex',
 'TGroup_0',
 'TGroup_1048',
 'TGroup_11015',
 'TGroup_11041',
 'TGroup_11046',
 'TGroup_11048',
 'TGroup_11056',
 'TGroup_11081',
 'TGroup_11116',
 'TGroup_11124',
 'TGroup_11132',
 'TGroup_11136',
 'TGroup_11142',
 'TGroup_1120',
 'TGroup_11205',
 'TGroup_11227',
 'TGroup_11237',
 'TGroup_11290',
 'TGroup_11302',
 'TGroup_11303',
 'TGroup_11304',
 'TGroup_11305',
 'TGroup_11350',
 'TGroup_11351',
 'TGroup_11357',
 'TGroup_11376',
 'TGroup_11377',
 'TGroup_11378',
 'TGroup_11379',
 'TGroup_11380',
 'TGroup_116',
 'TGroup_1166',
 'TGroup_1175',
 'TGroup_1176',
 'TGroup_1177',
 'TGroup_1181',
 'TGroup_1196',
 'TGroup_122',
 'TGroup_1223',
 'TGroup_1246',
 'TGroup_1274',
 'TGroup_1275',
 'TGroup_1299',
 'TGroup_1303',
 'TGroup_1304',
 'TGroup_1305',
 'TGroup_1321',
 'TGroup_1323',
 'TGroup_1350',
 'TGroup_1352',
 'TGroup_1353',
 'TGroup_1354',
 'TGroup_1356',
 'TGroup_1369',
 'TGroup_1390',
 'TGroup_1420',
 'TGroup_1425',
 'TGroup_1426',
 'TGroup_1431',
 'TGroup_147',
 'TGroup_1487',
 'TGroup_1488',
 'TGroup_1497',
 'TGroup_1518',
 'TGroup_158',
 'TGroup_160',
 'TGroup_1696',
 'TGroup_1698',
 'TGroup_1724',
 'TGroup_1731',
 'TGroup_1736',
 'TGroup_174',
 'TGroup_1742',
 'TGroup_1745',
 'TGroup_1746',
 'TGroup_1747',
 'TGroup_1748',
 'TGroup_1753',
 'TGroup_1755',
 'TGroup_1756',
 'TGroup_1757',
 'TGroup_1758',
 'TGroup_1759',
 'TGroup_1760',
 'TGroup_1761',
 'TGroup_1775',
 'TGroup_1776',
 'TGroup_1777',
 'TGroup_1850',
 'TGroup_1872',
 'TGroup_1987',
 'TGroup_1992',
 'TGroup_1994',
 'TGroup_1995',
 'TGroup_1997',
 'TGroup_1998',
 'TGroup_1999',
 'TGroup_200',
 'TGroup_2058',
 'TGroup_207',
 'TGroup_21153',
 'TGroup_2117',
 'TGroup_212',
 'TGroup_2122',
 'TGroup_213',
 'TGroup_2133',
 'TGroup_214',
 'TGroup_2144',
 'TGroup_215',
 'TGroup_216',
 'TGroup_21862',
 'TGroup_21953',
 'TGroup_22036',
 'TGroup_22084',
 'TGroup_222',
 'TGroup_22359',
 'TGroup_22659',
 'TGroup_22687',
 'TGroup_22841',
 'TGroup_22923',
 'TGroup_23008',
 'TGroup_23013',
 'TGroup_23043',
 'TGroup_231',
 'TGroup_23191',
 'TGroup_23194',
 'TGroup_23347',
 'TGroup_23363',
 'TGroup_23373',
 'TGroup_23386',
 'TGroup_234',
 'TGroup_23436',
 'TGroup_23460',
 'TGroup_23468',
 'TGroup_23481',
 'TGroup_23550',
 'TGroup_2356',
 'TGroup_23617',
 'TGroup_23685',
 'TGroup_23721',
 'TGroup_23724',
 'TGroup_23739',
 'TGroup_23744',
 'TGroup_23756',
 'TGroup_23766',
 'TGroup_23767',
 'TGroup_23773',
 'TGroup_23778',
 'TGroup_23779',
 'TGroup_23905',
 'TGroup_23985',
 'TGroup_23986',
 'TGroup_24026',
 'TGroup_24027',
 'TGroup_2406',
 'TGroup_24092',
 'TGroup_2416',
 'TGroup_24296',
 'TGroup_24384',
 'TGroup_24388',
 'TGroup_24425',
 'TGroup_24427',
 'TGroup_24431',
 'TGroup_24434',
 'TGroup_24435',
 'TGroup_24436',
 'TGroup_24437',
 'TGroup_2457',
 'TGroup_2458',
 'TGroup_246',
 'TGroup_24865',
 'TGroup_24869',
 'TGroup_24870',
 'TGroup_24872',
 'TGroup_24873',
 'TGroup_24874',
 'TGroup_25',
 'TGroup_25064',
 'TGroup_25065',
 'TGroup_254',
 'TGroup_262',
 'TGroup_263',
 'TGroup_2636',
 'TGroup_264',
 'TGroup_265',
 'TGroup_26530',
 'TGroup_266',
 'TGroup_267',
 'TGroup_2670',
 'TGroup_268',
 'TGroup_269',
 'TGroup_270',
 'TGroup_2704',
 'TGroup_2726',
 'TGroup_2784',
 'TGroup_2800',
 'TGroup_2803',
 'TGroup_281',
 'TGroup_2813',
 'TGroup_2820',
 'TGroup_2821',
 'TGroup_2822',
 'TGroup_2840',
 'TGroup_2842',
 'TGroup_2855',
 'TGroup_286',
 'TGroup_2866',
 'TGroup_290',
 'TGroup_2901',
 'TGroup_2903',
 'TGroup_2910',
 'TGroup_2917',
 'TGroup_292',
 'TGroup_2939',
 'TGroup_2956',
 'TGroup_2975',
 'TGroup_3063',
 'TGroup_3076',
 'TGroup_308',
 'TGroup_310126',
 'TGroup_310127',
 'TGroup_310128',
 'TGroup_310129',
 'TGroup_310130',
 'TGroup_310131',
 'TGroup_3102',
 'TGroup_3103',
 'TGroup_31299',
 'TGroup_313',
 'TGroup_3135',
 'TGroup_3141',
 'TGroup_31503',
 'TGroup_31508',
 'TGroup_31509',
 'TGroup_31515',
 'TGroup_3192',
 'TGroup_323',
 'TGroup_3230',
 'TGroup_32359',
 'TGroup_32395',
 'TGroup_32466',
 'TGroup_32994',
 'TGroup_33084',
 'TGroup_33087',
 'TGroup_33090',
 'TGroup_33091',
 'TGroup_33092',
 'TGroup_33093',
 'TGroup_33095',
 'TGroup_33096',
 'TGroup_33097',
 'TGroup_33098',
 'TGroup_3311',
 'TGroup_333',
 'TGroup_33491',
 'TGroup_33509',
 'TGroup_33543',
 'TGroup_33567',
 'TGroup_3359',
 'TGroup_3363',
 'TGroup_33643',
 'TGroup_338',
 'TGroup_3405',
 'TGroup_3406',
 'TGroup_341',
 'TGroup_34182',
 'TGroup_3421',
 'TGroup_3424',
 'TGroup_34244',
 'TGroup_3426',
 'TGroup_34268',
 'TGroup_34271',
 'TGroup_34282',
 'TGroup_34309',
 'TGroup_34312',
 'TGroup_34327',
 'TGroup_34536',
 'TGroup_34549',
 'TGroup_34550',
 'TGroup_34557',
 'TGroup_34576',
 'TGroup_34577',
 'TGroup_34578',
 'TGroup_346',
 'TGroup_3464',
 'TGroup_3465',
 'TGroup_347',
 'TGroup_34705',
 'TGroup_34706',
 'TGroup_34707',
 'TGroup_34708',
 'TGroup_34709',
 'TGroup_34746',
 'TGroup_34747',
 'TGroup_34774',
 'TGroup_34812',
 'TGroup_34920',
 'TGroup_34921',
 'TGroup_34922',
 'TGroup_34923',
 'TGroup_34924',
 'TGroup_34925',
 'TGroup_34990',
 'TGroup_34991',
 'TGroup_35002',
 'TGroup_35003',
 'TGroup_35004',
 'TGroup_35005',
 'TGroup_35006',
 'TGroup_35040',
 'TGroup_35041',
 'TGroup_3527',
 'TGroup_3528',
 'TGroup_353',
 'TGroup_354',
 'TGroup_3585',
 'TGroup_35858',
 'TGroup_359',
 'TGroup_35930',
 'TGroup_3620',
 'TGroup_36231',
 'TGroup_36327',
 'TGroup_36329',
 'TGroup_36359',
 'TGroup_36361',
 'TGroup_36449',
 'TGroup_36450',
 'TGroup_36451',
 'TGroup_36484',
 'TGroup_36485',
 'TGroup_36522',
 'TGroup_36523',
 'TGroup_3656',
 'TGroup_36671',
 'TGroup_36722',
 'TGroup_36723',
 'TGroup_36765',
 'TGroup_36832',
 'TGroup_36836',
 'TGroup_36840',
 'TGroup_36857',
 'TGroup_3686',
 'TGroup_36870',
 'TGroup_36878',
 'TGroup_3692',
 'TGroup_3694',
 'TGroup_3696',
 'TGroup_3697',
 'TGroup_36994',
 'TGroup_370',
 'TGroup_37012',
 'TGroup_37036',
 'TGroup_37037',
 'TGroup_37106',
 'TGroup_37110',
 'TGroup_37111',
 'TGroup_37136',
 'TGroup_37262',
 'TGroup_37345',
 'TGroup_37474',
 'TGroup_37488',
 'TGroup_37491',
 'TGroup_37656',
 'TGroup_3767',
 'TGroup_38264',
 'TGroup_38265',
 'TGroup_38312',
 'TGroup_38316',
 'TGroup_38446',
 'TGroup_38652',
 'TGroup_390',
 'TGroup_3918',
 'TGroup_39207',
 'TGroup_39208',
 'TGroup_39209',
 'TGroup_39414',
 'TGroup_3988',
 'TGroup_400',
 'TGroup_413',
 'TGroup_4279',
 'TGroup_434',
 'TGroup_4538',
 'TGroup_457',
 'TGroup_4887',
 'TGroup_4986',
 'TGroup_54',
 'TGroup_5451',
 'TGroup_5463',
 'TGroup_554',
 'TGroup_572',
 'TGroup_573',
 'TGroup_621',
 'TGroup_6530',
 'TGroup_656',
 'TGroup_660',
 'TGroup_68',
 'TGroup_69',
 'TGroup_707',
 'TGroup_726',
 'TGroup_75',
 'TGroup_753',
 'TGroup_754',
 'TGroup_755',
 'TGroup_759',
 'TGroup_793',
 'TGroup_847',
 'TGroup_85',
 'TGroup_923',
 'TGroup_954',
 'TNlen',
 '중복티켓수']
In [137]:
all_df2 = all_df[predictors + ['Survived']]
all_df2.head()
Out[137]:
LeadingDigit Prefix_NoPrefix Prefix_a Prefix_a4 Prefix_a5 Prefix_aq3 Prefix_aq4 Prefix_as Prefix_c Prefix_ca ... TGroup_755 TGroup_759 TGroup_793 TGroup_847 TGroup_85 TGroup_923 TGroup_954 TNlen 중복티켓수 Survived
0 2 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 5 1 0.0
1 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 5 2 1.0
2 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 7 1 1.0
3 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 6 2 1.0
4 3 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 6 1 0.0

5 rows × 450 columns

In [138]:
df_train = all_df2.loc[all_df2['Survived'].notna()]
df_test  = all_df2.loc[all_df2['Survived'].isna()]

print(df_train.shape)
df_train.head()
(891, 450)
Out[138]:
LeadingDigit Prefix_NoPrefix Prefix_a Prefix_a4 Prefix_a5 Prefix_aq3 Prefix_aq4 Prefix_as Prefix_c Prefix_ca ... TGroup_755 TGroup_759 TGroup_793 TGroup_847 TGroup_85 TGroup_923 TGroup_954 TNlen 중복티켓수 Survived
0 2 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 5 1 0.0
1 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 5 2 1.0
2 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 7 1 1.0
3 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 6 2 1.0
4 3 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 6 1 0.0

5 rows × 450 columns

In [139]:
print(df_test.shape)
df_test.head()
(418, 450)
Out[139]:
LeadingDigit Prefix_NoPrefix Prefix_a Prefix_a4 Prefix_a5 Prefix_aq3 Prefix_aq4 Prefix_as Prefix_c Prefix_ca ... TGroup_755 TGroup_759 TGroup_793 TGroup_847 TGroup_85 TGroup_923 TGroup_954 TNlen 중복티켓수 Survived
0 3 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 6 1 NaN
1 3 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 6 1 NaN
2 2 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 6 1 NaN
3 3 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 6 1 NaN
4 3 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 7 2 NaN

5 rows × 450 columns

03. Modeling


In [140]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
In [141]:
model = KNeighborsClassifier(n_neighbors=11, metric = 'manhattan')

param_grid = ({'n_neighbors':[6,7,8,9,11],
               'metric':['manhattan','minkowski'],
               'p':[1,2]})

grs = GridSearchCV(model, param_grid, 
                   cv = 28, 
                   n_jobs=1, 
                   return_train_score = True,
                   pre_dispatch=1)

grs.fit(np.array(df_train[predictors]), np.array(df_train['Survived']))
Out[141]:
GridSearchCV(cv=28,
             estimator=KNeighborsClassifier(metric='manhattan', n_neighbors=11),
             n_jobs=1,
             param_grid={'metric': ['manhattan', 'minkowski'],
                         'n_neighbors': [6, 7, 8, 9, 11], 'p': [1, 2]},
             pre_dispatch=1, return_train_score=True)
In [142]:
print("Best parameters " + str(grs.best_params_))
gpd = pd.DataFrame(grs.cv_results_)
# Mean cross-validated accuracy of the best candidate
print("Accuracy: {0:1.4f}".format(gpd['mean_test_score'][grs.best_index_]))
Best parameters {'metric': 'manhattan', 'n_neighbors': 9, 'p': 1}
Accuracy: 0.7969
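
The grid above has 5 × 2 × 2 = 20 candidates, but some of them describe the same distance: under the Minkowski metric, p=1 is the Manhattan distance and p=2 is the Euclidean distance. A hedged sketch of a smaller grid that covers both distances once, run on synthetic stand-in data rather than the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical; not the Titanic features)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Varying p alone covers Manhattan (p=1) and Euclidean (p=2) without duplicates
param_grid = {"n_neighbors": [6, 7, 8, 9, 11], "p": [1, 2]}

search = GridSearchCV(KNeighborsClassifier(metric="minkowski"), param_grid, cv=5)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_, round(search.best_score_, 4))
```

search.best_score_ is the same quantity that the cell above reads from mean_test_score at best_index_.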

04. Prediction


In [144]:
pred_knn = grs.predict(np.array(df_test[predictors]))

sub = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':pred_knn})
sub.to_csv('ticket__sex_knn.csv', index = False, float_format='%1d')
sub.head()
Out[144]:
PassengerId Survived
0 892 0.0
1 893 1.0
2 894 0.0
3 895 0.0
4 896 1.0
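
The predictions come back as floats because the training target was stored as float after the concat with NaN. The cell above handles this at write time with float_format. An alternative sketch, assuming hypothetical toy predictions, casts the column to int before writing so that the frame and the file agree:

```python
import io
import pandas as pd

# Hypothetical predictions; a model trained on a float target returns floats
preds = pd.Series([0.0, 1.0, 0.0])
sub_toy = pd.DataFrame({"PassengerId": [892, 893, 894],
                        "Survived": preds.astype(int)})

# Write to an in-memory buffer instead of a file to inspect the result
buffer = io.StringIO()
sub_toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```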

Submission score: 0.72009

Exercises

    1. Add other features and submit again.
      • 'PassengerId', 'Pclass', 'SibSp', 'Parch'
    2. Add Age, Fare, and Embarked and submit again.