<TIL> 2024-04-09

배또가또 2024. 4. 9. 22:08

What I did today
- Wrote Python functions for the final project (retention)
- SQL coding test

https://datarian.io/blog/rolling-retention

Retention (2) Rolling Retention (datarian.io): rolling retention focuses on whether a user has stayed without churning, which is why it is also called Unbounded Retention.

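Today I wrote Python functions that compute rolling retention and n-day retention.

For rolling retention I followed the Datarian article linked above.

Rolling retention is the share of users who came back at least once on or after a given reference day.

In other words, if a user has any visit recorded after the reference day, they are counted as not having churned as of that day.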
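
To make the definition concrete, here is a minimal sketch in plain Python. The visit data and the helper name `rolling_retention_rate` are made up for illustration and are not part of the project code; the only assumption is that each user's visits are given as day offsets from the reference date.

```python
# Hypothetical data: day offsets (relative to the reference date) on which each user visited.
visits = {
    "u1": [0, 3, 40],
    "u2": [0, 1],
    "u3": [0],
    "u4": [0, 10, 25],
}

def rolling_retention_rate(visits_by_user, day):
    """Share of users with at least one visit on or after `day`."""
    # A user is still "alive" on `day` if their latest visit is on or after it.
    retained = sum(1 for days in visits_by_user.values() if max(days) >= day)
    return retained / len(visits_by_user)

for day in (0, 1, 7, 30):
    print(day, rolling_retention_rate(visits, day))
```

Because only the latest visit matters, the rate can never increase as the day offset grows.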

The functions are as follows.

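import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from IPython.display import display  # notebook helper used by the functions in this post

def visualize_retention(retention_percent: pd.DataFrame):
    # Plot the single row of retention percentages as a retention curve
    plt.plot(retention_percent.loc[0])
    plt.xticks(rotation=45)
    plt.show()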

def rolling_retention(retention_table: pd.DataFrame, retention_period: int) -> pd.DataFrame:
    # Work on a copy so the caller's unrolled table is not modified in place
    retention_table = retention_table.copy()
    for idx in retention_table.index:
        # A visit in a later window marks every earlier window as retained
        if retention_table.loc[idx, f'retention_{retention_period*6}'] == 1:
            retention_table.loc[idx] = 1
        elif retention_table.loc[idx, f'retention_{retention_period*5}'] == 1:
            retention_table.loc[idx, [f'retention_{retention_period}', f'retention_{retention_period*2}', f'retention_{retention_period*3}', f'retention_{retention_period*4}']] = 1
        elif retention_table.loc[idx, f'retention_{retention_period*4}'] == 1:
            retention_table.loc[idx, [f'retention_{retention_period}', f'retention_{retention_period*2}', f'retention_{retention_period*3}']] = 1
        elif retention_table.loc[idx, f'retention_{retention_period*3}'] == 1:
            retention_table.loc[idx, [f'retention_{retention_period}', f'retention_{retention_period*2}']] = 1
        elif retention_table.loc[idx, f'retention_{retention_period*2}'] == 1:
            retention_table.loc[idx, [f'retention_{retention_period}']] = 1

    return retention_table

def retention(transaction: pd.DataFrame, year_month: str, retention_period: int):
    
    # Keep only the columns needed for the calculation
    # (del only drops the local name; the caller's DataFrame is unaffected)
    transaction_time_by_customer = transaction[['customer_id', 'created_at']]
    del transaction
    
    # Truncate timestamps to dates so they can be compared against day boundaries
    # (assumes created_at is already a timezone-naive datetime column)
    transaction_time_by_customer = transaction_time_by_customer.assign(created_at=transaction_time_by_customer['created_at'].dt.normalize())
    
    # Parse the requested month, e.g. '2018-01'
    picked_time = datetime.strptime(year_month, '%Y-%m')
    for_retain_start_time = picked_time - timedelta(days=retention_period) # one retention period before picked_time (defines the base cohort)
    for_retain_end_time = picked_time + timedelta(days=retention_period*6) # six retention periods after picked_time
    
    # Data used for the retention calculation
    picked_data = transaction_time_by_customer[(transaction_time_by_customer['created_at']>=for_retain_start_time)
                                               &(transaction_time_by_customer['created_at']<for_retain_end_time)]
    # Drop the reference to the full table
    del transaction_time_by_customer
    
    # Split the data into one DataFrame per retention window
    # (window 0 is the period just before picked_time and defines the base cohort)
    retain_period = []
    retain_period.append(picked_data.loc[(picked_data['created_at']>=for_retain_start_time)&
                                         (picked_data['created_at']<picked_time)])
    for i in range(6):
        retain_period.append(picked_data.loc[(picked_data['created_at']>=picked_time+timedelta(days=retention_period*i))&
                                             (picked_data['created_at']<picked_time+timedelta(days=retention_period*(i+1)))])
    
    # Count visits per customer in each window
    # (groupby('customer_id').count(); the counts are capped to 0/1 further down)
    retention_group = []
    for i, df in enumerate(retain_period):
        retention_group.append(df.groupby('customer_id').count().rename(columns={'created_at':f'retention_{i*retention_period}'}))
    
    # Build the retention table (one row per customer, one column per window)
    # The local is named retention_table so it does not shadow this function's own name
    retention_table = pd.concat(retention_group, axis=1)
    retention_table = retention_table[~retention_table['retention_0'].isnull()] # keep only customers in the base cohort
    retention_table = retention_table.fillna(0) # remaining nulls mean no visit in that window
    retention_table = retention_table.astype(int)
    retention_table[retention_table>1] = 1 # turn visit counts into 0/1 flags
        
    # Apply rolling to the retention table
    rolling_retention_table = rolling_retention(retention_table, retention_period)
    rolling_retention_percent = (rolling_retention_table.mean().to_frame().T*100).round(2)
    
    print('rolling_retention_table')
    display(rolling_retention_table)
    print('-----------------------------------------------\nrolling_retention_percentage')
    display(rolling_retention_percent)
    print('-----------------------------------------------\nrolling_retention_curve')
    visualize_retention(rolling_retention_percent)
    
    return retention_table, rolling_retention_table, rolling_retention_percent

unrolled_retention, rolling_retention_table, rolling_retention_percent = retention(paid_transaction_trend, '2018-01', 60)

Within each retention_period window, a customer counts as retained if they have at least one visit in that window,

and the rolling step then marks every earlier window as retained too whenever the customer shows up in a later one.
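
The rolling step described above can also be written without a per-row loop. The following is a sketch rather than the code used in the project: it assumes the table holds only 0/1 flags with columns ordered from the earliest to the latest window, and the helper name `roll_flags` and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical 0/1 visit flags: one row per customer, one column per window.
flags = pd.DataFrame(
    {
        "retention_0": [1, 1, 1],
        "retention_60": [0, 1, 0],
        "retention_120": [1, 0, 0],
        "retention_180": [0, 0, 0],
    },
    index=pd.Index(["a", "b", "c"], name="customer_id"),
)

def roll_flags(table: pd.DataFrame) -> pd.DataFrame:
    """Flag a window as retained if the customer appears in it or in any later window."""
    reversed_cols = table.iloc[:, ::-1]      # latest window first
    rolled = reversed_cols.cummax(axis=1)    # running maximum, moving from latest to earliest
    return rolled.iloc[:, ::-1]              # restore the original column order

rolled = roll_flags(flags)
print(rolled)
```

Customer a has no visit in the retention_60 window but does in retention_120, so the rolled table marks retention_60 as 1. The input table is left unchanged, which keeps the unrolled version available for comparison.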

The function also computes the retention percentages and plots the retention curve.
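
The percentage row works because the mean of a 0/1 column is the fraction of customers flagged in that window. A small sketch with made-up flags (not the project data):

```python
import pandas as pd

# Hypothetical 0/1 retention flags for four customers.
table = pd.DataFrame(
    {
        "retention_0": [1, 1, 1, 1],
        "retention_60": [1, 1, 0, 0],
        "retention_120": [1, 0, 0, 0],
    }
)

# Column means give the retained fraction; transpose into a single row of percentages.
percent = (table.mean().to_frame().T * 100).round(2)
print(percent)
```

The single row has index 0, which is why the plotting helper in this post reads the curve from `.loc[0]`.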

I then adapted the code above slightly to write an n-day retention function as well.

def retention_nday(transaction: pd.DataFrame, year_month_day: str, n_day: list):
    
    # Keep only the columns needed for the calculation
    # (del only drops the local name; the caller's DataFrame is unaffected)
    transaction_time_by_customer = transaction[['customer_id', 'created_at']]
    del transaction
    
    # Truncate timestamps to dates so they can be matched against exact days
    # (assumes created_at is already a timezone-naive datetime column)
    transaction_time_by_customer = transaction_time_by_customer.assign(created_at=transaction_time_by_customer['created_at'].dt.normalize())
    
    # Parse the reference date, e.g. '2018-01-01'
    picked_time = datetime.strptime(year_month_day, '%Y-%m-%d')
    
    n_day = [0] + list(n_day) # prepend day 0 on a new list so the caller's list is not modified
    
    # Collect the rows that fall exactly on each n-th day
    retain_period = []
    for n in n_day:
        retain_period.append(transaction_time_by_customer[(transaction_time_by_customer['created_at']==picked_time+timedelta(days=n))])
    
    # Drop the reference to the full table
    del transaction_time_by_customer
     
    # Count visits per customer on each n-th day
    # (groupby('customer_id').count(); the counts are capped to 0/1 further down)
    retention_group = []
    for i, df in enumerate(retain_period):
        retention_group.append(df.groupby('customer_id').count().rename(columns={'created_at':f'retention_{n_day[i]}'}))
    
    # Build the retention table (one row per customer, one column per n-th day)
    retention_table = pd.concat(retention_group, axis=1)
    retention_table = retention_table[~retention_table['retention_0'].isnull()] # keep only customers who purchased on day 0
    retention_table = retention_table.fillna(0) # remaining nulls mean no purchase on that day
    retention_table = retention_table.astype(int)
    retention_table[retention_table>1] = 1 # turn purchase counts into 0/1 flags
        
    retention_percent = (retention_table.mean().to_frame().T*100).round(2)
    
    print('retention_table')
    display(retention_table)
    print('-----------------------------------------------\nretention_percentage')
    display(retention_percent)
    print('-----------------------------------------------\nretention_curve')
    visualize_retention(retention_percent)
    
    return retention_table, retention_percent

For example, passing [1, 7, 30] as the n_day list shows whether customers purchased again

exactly 1, 7, and 30 days after the given date.
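
As a small illustration of what n-day retention measures, here is a sketch in plain Python. The purchase log and the helper name `n_day_retention` are hypothetical; the sketch only assumes that the cohort is everyone who purchased on the reference date and that a customer counts for day n only if they purchased exactly n days later.

```python
from datetime import date, timedelta

# Hypothetical purchase log: (customer_id, purchase date).
purchases = [
    ("a", date(2018, 1, 1)),
    ("b", date(2018, 1, 1)),
    ("c", date(2018, 1, 1)),
    ("a", date(2018, 1, 2)),
    ("a", date(2018, 1, 8)),
    ("b", date(2018, 1, 8)),
    ("d", date(2018, 1, 8)),   # not in the day-0 cohort, so not counted
    ("c", date(2018, 1, 31)),
]

def n_day_retention(log, start, offsets):
    """Fraction of the day-0 cohort that purchased again exactly n days later."""
    cohort = {cid for cid, day in log if day == start}
    if not cohort:
        return {n: 0.0 for n in offsets}
    rates = {}
    for n in offsets:
        target = start + timedelta(days=n)
        returned = {cid for cid, day in log if day == target and cid in cohort}
        rates[n] = len(returned) / len(cohort)
    return rates

rates = n_day_retention(purchases, date(2018, 1, 1), [1, 7, 30])
print(rates)
```

Unlike rolling retention, a purchase on day 8 does not count toward day 1 or day 7 here, so the values can go up as well as down between offsets.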

Since this is fashion data, most customers had not come back one day later,

and only a small number had repurchased 30 days later.

Tomorrow I plan to organize the EDA and prepare the slides for the midterm presentation.