Field | Description | Values |
---|---|---|
Survived | Survival status | 0 = No, 1 = Yes |
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Sex | Sex | male / female |
Age | Age in years | |
SibSp | Number of siblings / spouses aboard the Titanic | |
Parch | Number of parents / children aboard the Titanic | |
Ticket | Ticket number | e.g. CA 31352, A/5. 2151 |
Fare | Passenger fare | |
Cabin | Cabin number | |
Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sel_f = ['Ticket', 'Pclass', 'Sex']
train = pd.read_csv("data/titanic/train.csv",
                    usecols=['PassengerId', 'Survived'] + sel_f)
test = pd.read_csv("data/titanic/test.csv",
                   usecols=['PassengerId'] + sel_f)
sub = pd.read_csv("data/titanic/gender_submission.csv")
# Add an empty target column to test, then concatenate train and test
test['Survived'] = np.nan
all_df = pd.concat([train, test])
all_df.head()
| | PassengerId | Survived | Pclass | Sex | Ticket |
---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | male | A/5 21171 |
1 | 2 | 1.0 | 1 | female | PC 17599 |
2 | 3 | 1.0 | 3 | female | STON/O2. 3101282 |
3 | 4 | 1.0 | 1 | female | 113803 |
4 | 5 | 0.0 | 3 | male | 373450 |
all_df.Ticket.unique()
array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450', '330877', '17463', '349909', '347742', '237736', 'PP 9549', '113783', 'A/5. 2151', '347082', '350406', '248706', '382652', '244373', '345763', '2649', '239865', '248698', '330923', '113788', '347077', '2631', '19950', '330959', '349216', 'PC 17601', 'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789', '2677', 'A./5. 2152', '345764', '2651', '7546', '11668', '349253', 'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371', '14311', '2662', '349237', '3101295', 'A/4. 39886', 'PC 17572', '2926', '113509', '19947', 'C.A. 31026', '2697', 'C.A. 34651', 'CA 2144', '2669', '113572', '36973', '347088', 'PC 17605', '2661', 'C.A. 29395', 'S.P. 3464', '3101281', '315151', 'C.A. 33111', 'S.O.C. 14879', '2680', '1601', '348123', '349208', '374746', '248738', '364516', '345767', '345779', '330932', '113059', 'SO/C 14885', '3101278', 'W./C. 6608', 'SOTON/OQ 392086', '343275', '343276', '347466', 'W.E.P. 5734', 'C.A. 2315', '364500', '374910', 'PC 17754', 'PC 17759', '231919', '244367', '349245', '349215', '35281', '7540', '3101276', '349207', '343120', '312991', '349249', '371110', '110465', '2665', '324669', '4136', '2627', 'STON/O 2. 3101294', '370369', 'PC 17558', 'A4. 54510', '27267', '370372', 'C 17369', '2668', '347061', '349241', 'SOTON/O.Q. 3101307', 'A/5. 3337', '228414', 'C.A. 29178', 'SC/PARIS 2133', '11752', '7534', 'PC 17593', '2678', '347081', 'STON/O2. 3101279', '365222', '231945', 'C.A. 33112', '350043', '230080', '244310', 'S.O.P. 1166', '113776', 'A.5. 11206', 'A/5. 851', 'Fa 265302', 'PC 17597', '35851', 'SOTON/OQ 392090', '315037', 'CA. 2343', '371362', 'C.A. 33595', '347068', '315093', '363291', '113505', 'PC 17318', '111240', 'STON/O 2. 
3101280', '17764', '350404', '4133', 'PC 17595', '250653', 'LINE', 'SC/PARIS 2131', '230136', '315153', '113767', '370365', '111428', '364849', '349247', '234604', '28424', '350046', 'PC 17610', '368703', '4579', '370370', '248747', '345770', '3101264', '2628', 'A/5 3540', '347054', '2699', '367231', '112277', 'SOTON/O.Q. 3101311', 'F.C.C. 13528', 'A/5 21174', '250646', '367229', '35273', 'STON/O2. 3101283', '243847', '11813', 'W/C 14208', 'SOTON/OQ 392089', '220367', '21440', '349234', '19943', 'PP 4348', 'SW/PP 751', 'A/5 21173', '236171', '347067', '237442', 'C.A. 29566', 'W./C. 6609', '26707', 'C.A. 31921', '28665', 'SCO/W 1585', '367230', 'W./C. 14263', 'STON/O 2. 3101275', '2694', '19928', '347071', '250649', '11751', '244252', '362316', '113514', 'A/5. 3336', '370129', '2650', 'PC 17585', '110152', 'PC 17755', '230433', '384461', '110413', '112059', '382649', 'C.A. 17248', '347083', 'PC 17582', 'PC 17760', '113798', '250644', 'PC 17596', '370375', '13502', '347073', '239853', 'C.A. 2673', '336439', '347464', '345778', 'A/5. 10482', '113056', '349239', '345774', '349206', '237798', '370373', '19877', '11967', 'SC/Paris 2163', '349236', '349233', 'PC 17612', '2693', '113781', '19988', '9234', '367226', '226593', 'A/5 2466', '17421', 'PC 17758', 'P/PP 3381', 'PC 17485', '11767', 'PC 17608', '250651', '349243', 'F.C.C. 13529', '347470', '29011', '36928', '16966', 'A/5 21172', '349219', '234818', '345364', '28551', '111361', '113043', 'PC 17611', '349225', '7598', '113784', '248740', '244361', '229236', '248733', '31418', '386525', 'C.A. 37671', '315088', '7267', '113510', '2695', '2647', '345783', '237671', '330931', '330980', 'SC/PARIS 2167', '2691', 'SOTON/O.Q. 3101310', 'C 7076', '110813', '2626', '14313', 'PC 17477', '11765', '3101267', '323951', 'C 7077', '113503', '2648', '347069', 'PC 17757', '2653', 'STON/O 2. 3101293', '349227', '27849', '367655', 'SC 1748', '113760', '350034', '3101277', '350052', '350407', '28403', '244278', '240929', 'STON/O 2. 
3101289', '341826', '4137', '315096', '28664', '347064', '29106', '312992', '349222', '394140', 'STON/O 2. 3101269', '343095', '28220', '250652', '28228', '345773', '349254', 'A/5. 13032', '315082', '347080', 'A/4. 34244', '2003', '250655', '364851', 'SOTON/O.Q. 392078', '110564', '376564', 'SC/AH 3085', 'STON/O 2. 3101274', '13507', 'C.A. 18723', '345769', '347076', '230434', '65306', '33638', '113794', '2666', '113786', '65303', '113051', '17453', 'A/5 2817', '349240', '13509', '17464', 'F.C.C. 13531', '371060', '19952', '364506', '111320', '234360', 'A/S 2816', 'SOTON/O.Q. 3101306', '113792', '36209', '323592', '315089', 'SC/AH Basle 541', '7553', '31027', '3460', '350060', '3101298', '239854', 'A/5 3594', '4134', '11771', 'A.5. 18509', '65304', 'SOTON/OQ 3101317', '113787', 'PC 17609', 'A/4 45380', '36947', 'C.A. 6212', '350035', '315086', '364846', '330909', '4135', '26360', '111427', 'C 4001', '382651', 'SOTON/OQ 3101316', 'PC 17473', 'PC 17603', '349209', '36967', 'C.A. 34260', '226875', '349242', '12749', '349252', '2624', '2700', '367232', 'W./C. 14258', 'PC 17483', '3101296', '29104', '2641', '2690', '315084', '113050', 'PC 17761', '364498', '13568', 'WE/P 5735', '2908', '693', 'SC/PARIS 2146', '244358', '330979', '2620', '347085', '113807', '11755', '345572', '372622', '349251', '218629', 'SOTON/OQ 392082', 'SOTON/O.Q. 392087', 'A/4 48871', '349205', '2686', '350417', 'S.W./PP 752', '11769', 'PC 17474', '14312', 'A/4. 20589', '358585', '243880', '2689', 'STON/O 2. 3101286', '237789', '13049', '3411', '237565', '13567', '14973', 'A./5. 3235', 'STON/O 2. 3101273', 'A/5 3902', '364848', 'SC/AH 29037', '248727', '2664', '349214', '113796', '364511', '111426', '349910', '349246', '113804', 'SOTON/O.Q. 3101305', '370377', '364512', '220845', '31028', '2659', '11753', '350029', '54636', '36963', '219533', '349224', '334912', '27042', '347743', '13214', '112052', '237668', 'STON/O 2. 3101292', '350050', '349231', '13213', 'S.O./P.P. 751', 'CA. 
2314', '349221', '8475', '330919', '365226', '349223', '29751', '2623', '5727', '349210', 'STON/O 2. 3101285', '234686', '312993', 'A/5 3536', '19996', '29750', 'F.C. 12750', 'C.A. 24580', '244270', '239856', '349912', '342826', '4138', '330935', '6563', '349228', '350036', '24160', '17474', '349256', '2672', '113800', '248731', '363592', '35852', '348121', 'PC 17475', '36864', '350025', '223596', 'PC 17476', 'PC 17482', '113028', '7545', '250647', '348124', '34218', '36568', '347062', '350048', '12233', '250643', '113806', '315094', '36866', '236853', 'STON/O2. 3101271', '239855', '28425', '233639', '349201', '349218', '16988', '376566', 'STON/O 2. 3101288', '250648', '113773', '335097', '29103', '392096', '345780', '349204', '350042', '29108', '363294', 'SOTON/O2 3101272', '2663', '347074', '112379', '364850', '8471', '345781', '350047', 'S.O./P.P. 3', '2674', '29105', '347078', '383121', '36865', '2687', '113501', 'W./C. 6607', 'SOTON/O.Q. 3101312', '374887', '3101265', '12460', 'PC 17600', '349203', '28213', '17465', '349244', '2685', '2625', '347089', '347063', '112050', '347087', '248723', '3474', '28206', '364499', '112058', 'STON/O2. 3101290', 'S.C./PARIS 2079', 'C 7075', '315098', '19972', '368323', '367228', '2671', '347468', '2223', 'PC 17756', '315097', '392092', '11774', 'SOTON/O2 3101287', '2683', '315090', 'C.A. 5547', '349213', '347060', 'PC 17592', '392091', '113055', '2629', '350026', '28134', '17466', '233866', '236852', 'SC/PARIS 2149', 'PC 17590', '345777', '349248', '695', '345765', '2667', '349212', '349217', '349257', '7552', 'C.A./SOTON 34068', 'SOTON/OQ 392076', '211536', '112053', '111369', '370376', '330911', '363272', '240276', '315154', '7538', '330972', '2657', '349220', '694', '21228', '24065', '233734', '2692', 'STON/O2. 3101270', '2696', 'C 17368', 'PC 17598', '2698', '113054', 'C.A. 31029', '13236', '2682', '342712', '315087', '345768', '113778', 'SOTON/O.Q. 3101263', '237249', 'STON/O 2. 
3101291', 'PC 17594', '370374', '13695', 'SC/PARIS 2168', 'SC/A.3 2861', '349230', '348122', '349232', '237216', '347090', '334914', 'F.C.C. 13534', '330963', '2543', '382653', '349211', '3101297', 'PC 17562', '359306', '11770', '248744', '368702', '19924', '349238', '240261', '2660', '330844', 'A/4 31416', '364856', '347072', '345498', '376563', '13905', '350033', 'STON/O 2. 3101268', '347471', 'A./5. 3338', '11778', '365235', '347070', '330920', '383162', '3410', '248734', '237734', '330968', 'PC 17531', '329944', '2681', '13050', '367227', '392095', '368783', '350045', '211535', '342441', 'STON/OQ. 369943', '113780', '2621', '349226', '350409', '2656', '248659', 'SOTON/OQ 392083', '17475', 'SC/A4 23568', '113791', '349255', '3701', '350405', 'S.O./P.P. 752', '347469', '110489', 'SOTON/O.Q. 3101315', '335432', '220844', '343271', '237393', 'PC 17591', '17770', '7548', 'S.O./P.P. 251', '2670', '2673', '233478', '7935', '239059', 'S.O./P.P. 2', 'A/4 48873', '28221', '111163', '235509', '347465', '347066', 'C.A. 31030', '65305', 'C.A. 34050', 'F.C. 12998', '9232', '28034', 'PC 17613', '349250', 'SOTON/O.Q. 3101308', '347091', '113038', '330924', '32302', 'SC/PARIS 2148', '342684', 'W./C. 14266', '350053', 'PC 17606', '350054', '370368', '242963', '113795', '3101266', '330971', '350416', '2679', '250650', '112377', '3470', 'SOTON/O2 3101284', '13508', '7266', '345775', 'C.A. 42795', 'AQ/4 3130', '363611', '28404', '345501', '350410', 'C.A. 34644', '349235', '112051', 'C.A. 49867', 'A. 2. 39186', '315095', '368573', '2676', 'SC 14888', 'CA 31352', 'W./C. 14260', '315085', '364859', 'A/5 21175', 'SOTON/O.Q. 3101314', '2655', 'A/5 1478', 'PC 17607', '382650', '2652', '345771', '349202', '113801', '347467', '347079', '237735', '315092', '383123', '112901', '315091', '2658', 'LP 1588', '368364', 'AQ/3. 30631', '28004', '350408', '347075', '2654', '244368', '113790', 'SOTON/O.Q. 3101309', '236854', 'PC 17580', '2684', '349229', '110469', '244360', '2675', '2622', 'C.A. 
15185', '350403', '348125', '237670', '2688', '248726', 'F.C.C. 13540', '113044', '1222', '368402', '315083', '112378', 'SC/PARIS 2147', '28133', '248746', '315152', '29107', '680', '366713', '330910', 'SC/PARIS 2159', '349911', '244346', '364858', 'C.A. 30769', '371109', '347065', '21332', '17765', 'SC/PARIS 2166', '28666', '334915', '365237', '347086', 'A.5. 3236', 'SOTON/O.Q. 3101262', '359309'], dtype=object)
all_df.loc[all_df['Ticket']=='LINE']
| | PassengerId | Survived | Pclass | Sex | Ticket |
---|---|---|---|---|---|
179 | 180 | 0.0 | 3 | male | LINE |
271 | 272 | 1.0 | 3 | male | LINE |
302 | 303 | 0.0 | 3 | male | LINE |
597 | 598 | 0.0 | 3 | male | LINE |
all_df['Ticket'] = all_df['Ticket'].replace('LINE', 'LINE 0')
all_df[all_df['Ticket']=='LINE 0']
| | PassengerId | Survived | Pclass | Sex | Ticket |
---|---|---|---|---|---|
179 | 180 | 0.0 | 3 | male | LINE 0 |
271 | 272 | 1.0 | 3 | male | LINE 0 |
302 | 303 | 0.0 | 3 | male | LINE 0 |
597 | 598 | 0.0 | 3 | male | LINE 0 |
dup_tickets = all_df.groupby('Ticket').size()
dup_tickets
Ticket
110152         3
110413         3
110465         2
110469         1
110489         1
              ..
W./C. 6608     5
W./C. 6609     1
W.E.P. 5734    2
W/C 14208      1
WE/P 5735      2
Length: 929, dtype: int64
all_df['중복티켓수'] = all_df['Ticket'].map(dup_tickets)
plt.xlabel('duplications')
plt.ylabel('frequency')
plt.title('Duplicate Tickets')
all_df['중복티켓수'].hist(bins=20)
<AxesSubplot:title={'center':'Duplicate Tickets'}, xlabel='duplications', ylabel='frequency'>
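The ticket-count column above is built in two steps: `groupby('Ticket').size()` followed by `map`. As a minimal sketch on made-up rows (the `toy` frame and its values are illustrative only, not taken from the data set), the same counts can also be obtained in one step with `groupby(...).transform('size')`:

```python
import pandas as pd

# Hypothetical toy data: three passengers share one ticket, one travels alone.
toy = pd.DataFrame({"Ticket": ["PC 17599", "113803", "PC 17599", "PC 17599"]})

# Two-step version, as in the notebook: count per ticket, then map back onto rows.
counts = toy.groupby("Ticket").size()
two_step = toy["Ticket"].map(counts)

# One-step alternative: transform broadcasts each group's size to its rows.
one_step = toy.groupby("Ticket")["Ticket"].transform("size")

print(two_step.tolist())
print(one_step.tolist())
```

Both lists should be identical; the two-step form has the side benefit of keeping the per-ticket counts around for inspection.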
all_df['Ticket'] = all_df['Ticket'].apply(lambda x: x.replace('.','').replace('/','').lower())
all_df.head()
| | PassengerId | Survived | Pclass | Sex | Ticket | 중복티켓수 |
---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | male | a5 21171 | 1 |
1 | 2 | 1.0 | 1 | female | pc 17599 | 2 |
2 | 3 | 1.0 | 3 | female | stono2 3101282 | 1 |
3 | 4 | 1.0 | 1 | female | 113803 | 2 |
4 | 5 | 0.0 | 3 | male | 373450 | 1 |
"aaaaa 000000".split(' ')[0][0] # first character of the first word
'a'
def get_prefix(ticket):
    lead = ticket.split(' ')[0][0]
    # Check whether the first character is a letter
    if lead.isalpha():
        return ticket.split(' ')[0]
    else:
        return 'NoPrefix'
all_df['Prefix'] = all_df['Ticket'].apply(lambda x: get_prefix(x))
all_df.head()
| | PassengerId | Survived | Pclass | Sex | Ticket | 중복티켓수 | Prefix |
---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | male | a5 21171 | 1 | a5 |
1 | 2 | 1.0 | 1 | female | pc 17599 | 2 | pc |
2 | 3 | 1.0 | 3 | female | stono2 3101282 | 1 | stono2 |
3 | 4 | 1.0 | 1 | female | 113803 | 2 | NoPrefix |
4 | 5 | 0.0 | 3 | male | 373450 | 1 | NoPrefix |
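To see what the cleaning and prefix steps do to different spellings, here is a small self-contained sketch that applies the same logic to a few raw ticket strings from the `unique()` output above. The helper name `normalize` is introduced here for illustration; the notebook does the same thing with an inline lambda.

```python
def normalize(ticket: str) -> str:
    """Drop '.' and '/' and lowercase, mirroring the cleaning step above."""
    return ticket.replace(".", "").replace("/", "").lower()


def get_prefix(ticket: str) -> str:
    """Return the first token if it starts with a letter, else 'NoPrefix'."""
    first = ticket.split(" ")[0]
    return first if first[0].isalpha() else "NoPrefix"


# Raw ticket strings taken from the unique() output above.
samples = ["A/5 21171", "A./5. 2152", "STON/O2. 3101282",
           "STON/O 2. 3101294", "113803"]
prefixes = [get_prefix(normalize(t)) for t in samples]
print(prefixes)
# → ['a5', 'a5', 'stono2', 'stono', 'NoPrefix']
```

Removing punctuation merges variants such as `A/5` and `A./5.`, but a difference in spacing (`STON/O2.` versus `STON/O 2.`) still produces two separate prefixes, `stono2` and `stono`.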
"a5 21171".split(' ')[-1]
'21171'
str("a5 21171")[0]
'a'
val = int( "a5 21171".split(' ')[-1] )
str(val)
'21171'
all_df['TNumeric'] = all_df['Ticket'].apply(lambda x: int(x.split(' ')[-1]))
all_df['TNlen'] = all_df['TNumeric'].apply(lambda x : len(str(x)))
all_df['LeadingDigit'] = all_df['TNumeric'].apply(lambda x : int(str(x)[0]))
all_df['TGroup'] = all_df['Ticket'].apply(lambda x: str(int(x.split(' ')[-1])//10))
all_df.head()
| | PassengerId | Survived | Pclass | Sex | Ticket | 중복티켓수 | Prefix | TNumeric | TNlen | LeadingDigit | TGroup |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | male | a5 21171 | 1 | a5 | 21171 | 5 | 2 | 2117 |
1 | 2 | 1.0 | 1 | female | pc 17599 | 2 | pc | 17599 | 5 | 1 | 1759 |
2 | 3 | 1.0 | 3 | female | stono2 3101282 | 1 | stono2 | 3101282 | 7 | 3 | 310128 |
3 | 4 | 1.0 | 1 | female | 113803 | 2 | NoPrefix | 113803 | 6 | 1 | 11380 |
4 | 5 | 0.0 | 3 | male | 373450 | 1 | NoPrefix | 373450 | 6 | 3 | 37345 |
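The four number-based columns can be checked by hand on a single ticket. The sketch below wraps the same expressions in a helper (`ticket_number_features` is a name made up for this example) and runs it on one ordinary ticket and on the patched `line 0` value:

```python
def ticket_number_features(ticket: str) -> dict:
    """Derive the number-based features from a cleaned ticket string."""
    number = int(ticket.split(" ")[-1])      # last token is the numeric part
    return {
        "TNumeric": number,
        "TNlen": len(str(number)),           # number of digits
        "LeadingDigit": int(str(number)[0]), # first digit
        "TGroup": str(number // 10),         # number with its last digit dropped
    }


print(ticket_number_features("a5 21171"))
# → {'TNumeric': 21171, 'TNlen': 5, 'LeadingDigit': 2, 'TGroup': '2117'}
print(ticket_number_features("line 0"))
# → {'TNumeric': 0, 'TNlen': 1, 'LeadingDigit': 0, 'TGroup': '0'}
```

This is also why `LINE` was replaced with `LINE 0` earlier: without a numeric last token, `int(...)` would raise a `ValueError`.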
pd.crosstab(all_df['Pclass'],all_df['LeadingDigit'])
Pclass \ LeadingDigit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 288 | 8 | 18 | 0 | 5 | 4 | 0 | 0 | 0 |
2 | 0 | 32 | 205 | 37 | 0 | 1 | 0 | 2 | 0 | 0 |
3 | 4 | 22 | 136 | 476 | 22 | 4 | 17 | 18 | 5 | 5 |
all_df = all_df.drop(columns=['Ticket', 'TNumeric', 'Pclass'])
all_df
| | PassengerId | Survived | Sex | 중복티켓수 | Prefix | TNlen | LeadingDigit | TGroup |
---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | male | 1 | a5 | 5 | 2 | 2117 |
1 | 2 | 1.0 | female | 2 | pc | 5 | 1 | 1759 |
2 | 3 | 1.0 | female | 1 | stono2 | 7 | 3 | 310128 |
3 | 4 | 1.0 | female | 2 | NoPrefix | 6 | 1 | 11380 |
4 | 5 | 0.0 | male | 1 | NoPrefix | 6 | 3 | 37345 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 1305 | NaN | male | 1 | a5 | 4 | 3 | 323 |
414 | 1306 | NaN | female | 3 | pc | 5 | 1 | 1775 |
415 | 1307 | NaN | male | 1 | sotonoq | 7 | 3 | 310126 |
416 | 1308 | NaN | male | 1 | NoPrefix | 6 | 3 | 35930 |
417 | 1309 | NaN | male | 3 | NoPrefix | 4 | 2 | 266 |
1309 rows × 8 columns
all_df['Prefix']
0             a5
1             pc
2         stono2
3       NoPrefix
4       NoPrefix
          ...
413           a5
414           pc
415      sotonoq
416     NoPrefix
417     NoPrefix
Name: Prefix, Length: 1309, dtype: object
all_df = pd.concat([pd.get_dummies(all_df[['Prefix', 'TGroup']]),
                    all_df[['PassengerId', 'Survived', '중복티켓수', 'TNlen', 'LeadingDigit', 'Sex']]],
                   axis=1)
all_df
| | Prefix_NoPrefix | Prefix_a | Prefix_a4 | Prefix_a5 | Prefix_aq3 | Prefix_aq4 | Prefix_as | Prefix_c | Prefix_ca | Prefix_casoton | ... | TGroup_847 | TGroup_85 | TGroup_923 | TGroup_954 | PassengerId | Survived | 중복티켓수 | TNlen | LeadingDigit | Sex |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0.0 | 1 | 5 | 2 | male |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 2 | 1.0 | 2 | 5 | 1 | female |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 3 | 1.0 | 1 | 7 | 3 | female |
3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 4 | 1.0 | 2 | 6 | 1 | female |
4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 5 | 0.0 | 1 | 6 | 3 | male |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1305 | NaN | 1 | 4 | 3 | male |
414 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1306 | NaN | 3 | 5 | 1 | female |
415 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1307 | NaN | 1 | 7 | 3 | male |
416 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1308 | NaN | 1 | 6 | 3 | male |
417 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1309 | NaN | 3 | 4 | 2 | male |
1309 rows × 451 columns
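One reason to one-hot encode the combined frame rather than train and test separately is that both parts then share the same dummy columns. A minimal sketch on made-up rows (the `train_toy` and `test_toy` frames are hypothetical); note that pandas 2.0 and later return boolean dummy columns by default, so `dtype=int` is passed here to get the 0/1 values shown in the output above:

```python
import pandas as pd

# Hypothetical toy rows standing in for the train and test parts.
train_toy = pd.DataFrame({"Prefix": ["a5", "pc"]})
test_toy = pd.DataFrame({"Prefix": ["pc", "wc"]})

# Encoding the combined frame yields one column per category seen in either part.
combined = pd.concat([train_toy, test_toy], ignore_index=True)
encoded = pd.get_dummies(combined, dtype=int)  # dtype=int forces 0/1 output

print(list(encoded.columns))
print(encoded.values.tolist())
```

Here `wc` only occurs in the test rows, yet the train rows still get a `Prefix_wc` column filled with zeros, so a model fitted on one part can be applied to the other.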
dict_s = {'male':0, 'female':1}
all_df['Sex'] = all_df['Sex'].map(dict_s)
predictors = sorted(list(set(all_df.columns) - set(['PassengerId','Survived'])))
predictors
['LeadingDigit', 'Prefix_NoPrefix', 'Prefix_a', 'Prefix_a4', 'Prefix_a5', 'Prefix_aq3', 'Prefix_aq4', 'Prefix_as', 'Prefix_c', 'Prefix_ca', 'Prefix_casoton', 'Prefix_fa', 'Prefix_fc', 'Prefix_fcc', 'Prefix_line', 'Prefix_lp', 'Prefix_pc', 'Prefix_pp', 'Prefix_ppp', 'Prefix_sc', 'Prefix_sca3', 'Prefix_sca4', 'Prefix_scah', 'Prefix_scow', 'Prefix_scparis', 'Prefix_soc', 'Prefix_sop', 'Prefix_sopp', 'Prefix_sotono2', 'Prefix_sotonoq', 'Prefix_sp', 'Prefix_stono', 'Prefix_stono2', 'Prefix_stonoq', 'Prefix_swpp', 'Prefix_wc', 'Prefix_wep', 'Sex', 'TGroup_0', 'TGroup_1048', 'TGroup_11015', 'TGroup_11041', 'TGroup_11046', 'TGroup_11048', 'TGroup_11056', 'TGroup_11081', 'TGroup_11116', 'TGroup_11124', 'TGroup_11132', 'TGroup_11136', 'TGroup_11142', 'TGroup_1120', 'TGroup_11205', 'TGroup_11227', 'TGroup_11237', 'TGroup_11290', 'TGroup_11302', 'TGroup_11303', 'TGroup_11304', 'TGroup_11305', 'TGroup_11350', 'TGroup_11351', 'TGroup_11357', 'TGroup_11376', 'TGroup_11377', 'TGroup_11378', 'TGroup_11379', 'TGroup_11380', 'TGroup_116', 'TGroup_1166', 'TGroup_1175', 'TGroup_1176', 'TGroup_1177', 'TGroup_1181', 'TGroup_1196', 'TGroup_122', 'TGroup_1223', 'TGroup_1246', 'TGroup_1274', 'TGroup_1275', 'TGroup_1299', 'TGroup_1303', 'TGroup_1304', 'TGroup_1305', 'TGroup_1321', 'TGroup_1323', 'TGroup_1350', 'TGroup_1352', 'TGroup_1353', 'TGroup_1354', 'TGroup_1356', 'TGroup_1369', 'TGroup_1390', 'TGroup_1420', 'TGroup_1425', 'TGroup_1426', 'TGroup_1431', 'TGroup_147', 'TGroup_1487', 'TGroup_1488', 'TGroup_1497', 'TGroup_1518', 'TGroup_158', 'TGroup_160', 'TGroup_1696', 'TGroup_1698', 'TGroup_1724', 'TGroup_1731', 'TGroup_1736', 'TGroup_174', 'TGroup_1742', 'TGroup_1745', 'TGroup_1746', 'TGroup_1747', 'TGroup_1748', 'TGroup_1753', 'TGroup_1755', 'TGroup_1756', 'TGroup_1757', 'TGroup_1758', 'TGroup_1759', 'TGroup_1760', 'TGroup_1761', 'TGroup_1775', 'TGroup_1776', 'TGroup_1777', 'TGroup_1850', 'TGroup_1872', 'TGroup_1987', 'TGroup_1992', 'TGroup_1994', 'TGroup_1995', 'TGroup_1997', 
'TGroup_1998', 'TGroup_1999', 'TGroup_200', 'TGroup_2058', 'TGroup_207', 'TGroup_21153', 'TGroup_2117', 'TGroup_212', 'TGroup_2122', 'TGroup_213', 'TGroup_2133', 'TGroup_214', 'TGroup_2144', 'TGroup_215', 'TGroup_216', 'TGroup_21862', 'TGroup_21953', 'TGroup_22036', 'TGroup_22084', 'TGroup_222', 'TGroup_22359', 'TGroup_22659', 'TGroup_22687', 'TGroup_22841', 'TGroup_22923', 'TGroup_23008', 'TGroup_23013', 'TGroup_23043', 'TGroup_231', 'TGroup_23191', 'TGroup_23194', 'TGroup_23347', 'TGroup_23363', 'TGroup_23373', 'TGroup_23386', 'TGroup_234', 'TGroup_23436', 'TGroup_23460', 'TGroup_23468', 'TGroup_23481', 'TGroup_23550', 'TGroup_2356', 'TGroup_23617', 'TGroup_23685', 'TGroup_23721', 'TGroup_23724', 'TGroup_23739', 'TGroup_23744', 'TGroup_23756', 'TGroup_23766', 'TGroup_23767', 'TGroup_23773', 'TGroup_23778', 'TGroup_23779', 'TGroup_23905', 'TGroup_23985', 'TGroup_23986', 'TGroup_24026', 'TGroup_24027', 'TGroup_2406', 'TGroup_24092', 'TGroup_2416', 'TGroup_24296', 'TGroup_24384', 'TGroup_24388', 'TGroup_24425', 'TGroup_24427', 'TGroup_24431', 'TGroup_24434', 'TGroup_24435', 'TGroup_24436', 'TGroup_24437', 'TGroup_2457', 'TGroup_2458', 'TGroup_246', 'TGroup_24865', 'TGroup_24869', 'TGroup_24870', 'TGroup_24872', 'TGroup_24873', 'TGroup_24874', 'TGroup_25', 'TGroup_25064', 'TGroup_25065', 'TGroup_254', 'TGroup_262', 'TGroup_263', 'TGroup_2636', 'TGroup_264', 'TGroup_265', 'TGroup_26530', 'TGroup_266', 'TGroup_267', 'TGroup_2670', 'TGroup_268', 'TGroup_269', 'TGroup_270', 'TGroup_2704', 'TGroup_2726', 'TGroup_2784', 'TGroup_2800', 'TGroup_2803', 'TGroup_281', 'TGroup_2813', 'TGroup_2820', 'TGroup_2821', 'TGroup_2822', 'TGroup_2840', 'TGroup_2842', 'TGroup_2855', 'TGroup_286', 'TGroup_2866', 'TGroup_290', 'TGroup_2901', 'TGroup_2903', 'TGroup_2910', 'TGroup_2917', 'TGroup_292', 'TGroup_2939', 'TGroup_2956', 'TGroup_2975', 'TGroup_3063', 'TGroup_3076', 'TGroup_308', 'TGroup_310126', 'TGroup_310127', 'TGroup_310128', 'TGroup_310129', 'TGroup_310130', 'TGroup_310131', 
'TGroup_3102', 'TGroup_3103', 'TGroup_31299', 'TGroup_313', 'TGroup_3135', 'TGroup_3141', 'TGroup_31503', 'TGroup_31508', 'TGroup_31509', 'TGroup_31515', 'TGroup_3192', 'TGroup_323', 'TGroup_3230', 'TGroup_32359', 'TGroup_32395', 'TGroup_32466', 'TGroup_32994', 'TGroup_33084', 'TGroup_33087', 'TGroup_33090', 'TGroup_33091', 'TGroup_33092', 'TGroup_33093', 'TGroup_33095', 'TGroup_33096', 'TGroup_33097', 'TGroup_33098', 'TGroup_3311', 'TGroup_333', 'TGroup_33491', 'TGroup_33509', 'TGroup_33543', 'TGroup_33567', 'TGroup_3359', 'TGroup_3363', 'TGroup_33643', 'TGroup_338', 'TGroup_3405', 'TGroup_3406', 'TGroup_341', 'TGroup_34182', 'TGroup_3421', 'TGroup_3424', 'TGroup_34244', 'TGroup_3426', 'TGroup_34268', 'TGroup_34271', 'TGroup_34282', 'TGroup_34309', 'TGroup_34312', 'TGroup_34327', 'TGroup_34536', 'TGroup_34549', 'TGroup_34550', 'TGroup_34557', 'TGroup_34576', 'TGroup_34577', 'TGroup_34578', 'TGroup_346', 'TGroup_3464', 'TGroup_3465', 'TGroup_347', 'TGroup_34705', 'TGroup_34706', 'TGroup_34707', 'TGroup_34708', 'TGroup_34709', 'TGroup_34746', 'TGroup_34747', 'TGroup_34774', 'TGroup_34812', 'TGroup_34920', 'TGroup_34921', 'TGroup_34922', 'TGroup_34923', 'TGroup_34924', 'TGroup_34925', 'TGroup_34990', 'TGroup_34991', 'TGroup_35002', 'TGroup_35003', 'TGroup_35004', 'TGroup_35005', 'TGroup_35006', 'TGroup_35040', 'TGroup_35041', 'TGroup_3527', 'TGroup_3528', 'TGroup_353', 'TGroup_354', 'TGroup_3585', 'TGroup_35858', 'TGroup_359', 'TGroup_35930', 'TGroup_3620', 'TGroup_36231', 'TGroup_36327', 'TGroup_36329', 'TGroup_36359', 'TGroup_36361', 'TGroup_36449', 'TGroup_36450', 'TGroup_36451', 'TGroup_36484', 'TGroup_36485', 'TGroup_36522', 'TGroup_36523', 'TGroup_3656', 'TGroup_36671', 'TGroup_36722', 'TGroup_36723', 'TGroup_36765', 'TGroup_36832', 'TGroup_36836', 'TGroup_36840', 'TGroup_36857', 'TGroup_3686', 'TGroup_36870', 'TGroup_36878', 'TGroup_3692', 'TGroup_3694', 'TGroup_3696', 'TGroup_3697', 'TGroup_36994', 'TGroup_370', 'TGroup_37012', 'TGroup_37036', 'TGroup_37037', 
'TGroup_37106', 'TGroup_37110', 'TGroup_37111', 'TGroup_37136', 'TGroup_37262', 'TGroup_37345', 'TGroup_37474', 'TGroup_37488', 'TGroup_37491', 'TGroup_37656', 'TGroup_3767', 'TGroup_38264', 'TGroup_38265', 'TGroup_38312', 'TGroup_38316', 'TGroup_38446', 'TGroup_38652', 'TGroup_390', 'TGroup_3918', 'TGroup_39207', 'TGroup_39208', 'TGroup_39209', 'TGroup_39414', 'TGroup_3988', 'TGroup_400', 'TGroup_413', 'TGroup_4279', 'TGroup_434', 'TGroup_4538', 'TGroup_457', 'TGroup_4887', 'TGroup_4986', 'TGroup_54', 'TGroup_5451', 'TGroup_5463', 'TGroup_554', 'TGroup_572', 'TGroup_573', 'TGroup_621', 'TGroup_6530', 'TGroup_656', 'TGroup_660', 'TGroup_68', 'TGroup_69', 'TGroup_707', 'TGroup_726', 'TGroup_75', 'TGroup_753', 'TGroup_754', 'TGroup_755', 'TGroup_759', 'TGroup_793', 'TGroup_847', 'TGroup_85', 'TGroup_923', 'TGroup_954', 'TNlen', '중복티켓수']
all_df2 = all_df[predictors + ['Survived']]
all_df2.head()
| | LeadingDigit | Prefix_NoPrefix | Prefix_a | Prefix_a4 | Prefix_a5 | Prefix_aq3 | Prefix_aq4 | Prefix_as | Prefix_c | Prefix_ca | ... | TGroup_755 | TGroup_759 | TGroup_793 | TGroup_847 | TGroup_85 | TGroup_923 | TGroup_954 | TNlen | 중복티켓수 | Survived |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 | 0.0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2 | 1.0 |
2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 1 | 1.0 |
3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2 | 1.0 |
4 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | 0.0 |
5 rows × 450 columns
df_train = all_df2.loc[all_df2['Survived'].notna()]
df_test = all_df2.loc[all_df2['Survived'].isna()]
print(df_train.shape)
df_train.head()
(891, 450)
| | LeadingDigit | Prefix_NoPrefix | Prefix_a | Prefix_a4 | Prefix_a5 | Prefix_aq3 | Prefix_aq4 | Prefix_as | Prefix_c | Prefix_ca | ... | TGroup_755 | TGroup_759 | TGroup_793 | TGroup_847 | TGroup_85 | TGroup_923 | TGroup_954 | TNlen | 중복티켓수 | Survived |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 1 | 0.0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2 | 1.0 |
2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 1 | 1.0 |
3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2 | 1.0 |
4 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | 0.0 |
5 rows × 450 columns
print(df_test.shape)
df_test.head()
(418, 450)
| | LeadingDigit | Prefix_NoPrefix | Prefix_a | Prefix_a4 | Prefix_a5 | Prefix_aq3 | Prefix_aq4 | Prefix_as | Prefix_c | Prefix_ca | ... | TGroup_755 | TGroup_759 | TGroup_793 | TGroup_847 | TGroup_85 | TGroup_923 | TGroup_954 | TNlen | 중복티켓수 | Survived |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | NaN |
1 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | NaN |
2 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | NaN |
3 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | NaN |
4 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 2 | NaN |
5 rows × 450 columns
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=11, metric = 'manhattan')
param_grid = {'n_neighbors': [6, 7, 8, 9, 11],
              'metric': ['manhattan', 'minkowski'],
              'p': [1, 2]}
grs = GridSearchCV(model, param_grid,
                   cv=28,
                   n_jobs=1,
                   return_train_score=True,
                   pre_dispatch=1)
grs.fit(np.array(df_train[predictors]), np.array(df_train['Survived']))
GridSearchCV(cv=28, estimator=KNeighborsClassifier(metric='manhattan', n_neighbors=11), n_jobs=1, param_grid={'metric': ['manhattan', 'minkowski'], 'n_neighbors': [6, 7, 8, 9, 11], 'p': [1, 2]}, pre_dispatch=1, return_train_score=True)
print("Best parameters " + str(grs.best_params_))
gpd = pd.DataFrame(grs.cv_results_)
print("Accuracy: {0:1.4f}".format(gpd['mean_test_score'][grs.best_index_]))
Best parameters {'metric': 'manhattan', 'n_neighbors': 9, 'p': 1}
Accuracy: 0.7969
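The grid above searches over both `'manhattan'` and `'minkowski'` together with `p` in `[1, 2]`. Mathematically, the Minkowski distance of order `p = 1` is the Manhattan distance, so some of these combinations likely describe the same neighbours. The plain-Python sketch below only illustrates that identity on two made-up vectors; it does not use scikit-learn and says nothing about how the library handles `p` internally.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical feature vectors, e.g. a few dummy columns plus small counts.
u, v = [0, 1, 5, 2], [1, 1, 6, 0]

print(manhattan(u, v))     # 1 + 0 + 1 + 2 = 4
print(minkowski(u, v, 1))  # same value as Manhattan, as a float
print(minkowski(u, v, 2))  # Euclidean distance, sqrt(1 + 0 + 1 + 4)
```

Under that reading, the search effectively compares two distances (L1 and L2) across the listed `n_neighbors` values, and a smaller grid might cover the same ground.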
pred_knn = grs.predict(np.array(df_test[predictors]))
sub = pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':pred_knn})
sub.to_csv('ticket__sex_knn.csv', index = False, float_format='%1d')
sub.head()
| | PassengerId | Survived |
---|---|---|
0 | 892 | 0.0 |
1 | 893 | 1.0 |
2 | 894 | 0.0 |
3 | 895 | 0.0 |
4 | 896 | 1.0 |
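The predictions come back as floats because the training labels were floats, and the notebook relies on `float_format='%1d'` to write them as integers. An alternative, sketched here on made-up predictions (`pred` and `sub_toy` are illustrative names), is to cast the column to `int` before writing, so the frame and the file agree:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical predictions; a classifier trained on float labels returns floats.
pred = np.array([0.0, 1.0, 0.0])
sub_toy = pd.DataFrame({"PassengerId": [892, 893, 894],
                        "Survived": pred.astype(int)})  # cast once, up front

# Write to an in-memory buffer instead of a file to inspect the CSV text.
buffer = io.StringIO()
sub_toy.to_csv(buffer, index=False)
print(buffer.getvalue())
```

With the cast in place, `sub_toy.head()` shows `0` and `1` rather than `0.0` and `1.0`, and no `float_format` argument is needed.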