I gave a talk titled "rebuild.fmの統計学" (The Statistics of rebuild.fm).

The slides are available on Speaker Deck!

speakerdeck.com

I presented it at "Ltech#3 【podcast × IT】LT Night !", an event hosted by LIFULL.

lifull.connpass.com

All of the code I used is on GitHub, but I'd like to briefly walk through what I did.

python-sandbox/rebuild.fmの統計学.ipynb at master · ikedaosushi/python-sandbox · GitHub

I'm also planning to present this material at a study group this Saturday. Registration is still open, so please join if you're interested!

tskubapy.connpass.com

Fetching data from the site

f:id:mergyi:20181213052504p:plain

I use requests-html to fetch the data from the site.

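import time

import pandas as pd
from requests_html import HTMLSession
from tqdm import tqdm_notebook

endpoint = 'https://rebuild.fm/{}/'
session = HTMLSession()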
episodes = []
for number in tqdm_notebook(range(1, 223)):
    for plus in ['', 'a']: # Normal / Aftershow
        url = endpoint.format(str(number)+plus)
        r = session.get(url)
        if r.status_code != 200:  # a 404 comes back when there is no Aftershow
            continue

        # Extract each field with a CSS selector
        date_ = r.html.find('#contents > div > span', first=True).text
        date_ = date_.replace('\n', '-')

        record_time = r.html.find('#contents > div > div.post > p > i', first=True).text
        record_time = record_time.replace('収録時間: ', '').replace(' |', '')

        title = r.html.find('#contents > div > h2 > a', first=True).text

        description = r.html.find('#contents > div > div.post > div.episode-description > p', first=True).text

        persons = []
        person_elements = r.html.find('#contents > div > div.post > div.episode-description > div.episode-people > ul > li')
        for person_element in person_elements:
            persons.append(person_element.text)

        shownotes = []
        shownote_elements = r.html.find('#show_notes_ > ul > li > a')
        for shownote_element in shownote_elements:
            shownotes.append(shownote_element.text)

        # Collect the fields into a dict and append it
        episode = {
            'date': date_,
            'record_time': record_time, 
            'title': title, 
            'persons': persons,
            'shownotes': shownotes
        }
        episodes.append(episode)
        
        # Wait one second between requests to avoid putting load on the site
        time.sleep(1)

# Finally, convert to a pd.DataFrame
df = pd.DataFrame(episodes)

Nothing here is particularly difficult, but main episodes live at https://rebuild.fm/1/ and Aftershows at https://rebuild.fm/1a/, and, as listeners know, some episodes have an Aftershow and some do not, so the code is written with that in mind.
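As a minimal sketch of that URL scheme, the candidate URLs can be generated without touching the network. The helper name `candidate_urls` is hypothetical and not part of the original notebook; the Aftershow URLs it yields still have to be checked for a 404, as the loop above does.

```python
from typing import Iterator

ENDPOINT = "https://rebuild.fm/{}/"


def candidate_urls(first: int, last: int) -> Iterator[str]:
    """Yield the main-episode URL and the possible Aftershow URL for each number."""
    for number in range(first, last + 1):
        for suffix in ("", "a"):  # "" = main episode, "a" = Aftershow
            yield ENDPOINT.format(f"{number}{suffix}")


urls = list(candidate_urls(1, 2))
print(urls)
```

Separating URL generation from fetching makes it easy to check the scheme before sending any requests.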

Preprocessing

From here on is the data-wrangling part, using pandas.

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Store the number of show notes per episode
df['n_shownotes'] = df['shownotes'].str.len()

# Convert the recording time (e.g. 52:53 or 1:02:33) to minutes (float) with regular expressions
hours = df['record_time'].str.extract(r'^(\d)(?=:\d\d:\d\d$)', expand=False).fillna(0).astype(int)
minutes = df['record_time'].str.extract(r'(\d\d)(?=:)', expand=False).fillna(0).astype(int)
# The seconds are the two digits after the last colon
seconds = df['record_time'].str.extract(r'(?<=:)(\d\d)$', expand=False).fillna(0).astype(int)
df['minutes'] = hours*60 + minutes + seconds/60

# Whether the row is an Aftershow
df['is_aftershow'] = df['title'].str.contains('Aftershow', na=False)
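The recording-time conversion is the step most likely to go wrong silently, so it can help to check it on a few sample strings. The following is only a sketch, assuming the values always look like `MM:SS` or `H:MM:SS`; the function name `to_minutes` is hypothetical and it uses `str.split` rather than the regular expressions above.

```python
def to_minutes(record_time: str) -> float:
    """Convert 'MM:SS' or 'H:MM:SS' into minutes as a float."""
    parts = [int(p) for p in record_time.strip().split(":")]
    if len(parts) == 2:  # MM:SS, so there is no hour component
        parts = [0] + parts
    if len(parts) != 3:
        raise ValueError(f"unexpected duration format: {record_time!r}")
    hours, minutes, seconds = parts
    return hours * 60 + minutes + seconds / 60


print(to_minutes("52:53"))
print(to_minutes("1:02:33"))
```

With a DataFrame like the one above, this could be applied as `df['record_time'].map(to_minutes)`.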

Dates come in a form like Feb 13-2013, but the pd.to_datetime function converts them to proper datetime values without any extra arguments. For more on handling time-series data in pandas, see sinhrks's blog, which covers it in detail.

sinhrks.hatenablog.com
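If automatic inference ever misreads the dates, the format can also be spelled out. This is a small sketch with made-up sample strings in the same shape as the scraped values; the format string `%b %d-%Y` is my reading of that shape, not something taken from the original notebook, and `%b` assumes English month abbreviations.

```python
import pandas as pd

# Sample strings in the shape described above (values are illustrative)
raw = pd.Series(["Feb 13-2013", "Dec 13-2018"])

# An explicit format avoids relying on automatic inference
parsed = pd.to_datetime(raw, format="%b %d-%Y")
print(parsed)
```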

# Episode number: main-episode titles start with the number, Aftershow titles with "Aftershow <number>"
df['show_no'] = df['title'].str.extract(r'^(\d{1,3})', expand=False)
df['show_no'] = df['show_no'].fillna(df['title'].str.extract(r'^Aftershow (\d{1,3})', expand=False))

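Also, given the nature of this analysis, there are cases where I want to add the main episode and its Aftershow together (for example, to look at the recording time per episode), so I prepare for that here.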

# Split into main episodes and Aftershows, then merge them on the episode number
dfm = pd.merge(df.query('~is_aftershow'), df.query('is_aftershow'), how='left', on='show_no', suffixes=('_main', '_after'))
dfm['date'] = dfm['date_main']
dfm['minutes'] = dfm['minutes_main'].fillna(0) + dfm['minutes_after'].fillna(0)
dfm['persons'] = dfm['persons_main']
dfm['n_shownotes'] = dfm['n_shownotes_main'].fillna(0) + dfm['n_shownotes_after'].fillna(0)
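To make the effect of the left merge concrete, here is a toy example with invented numbers. Episode 1 has an Aftershow and episode 2 does not, so the `_after` columns for episode 2 are NaN and `fillna(0)` keeps the sum well defined.

```python
import pandas as pd

# Toy data: episode "1" has an Aftershow, episode "2" does not
df_demo = pd.DataFrame({
    "show_no": ["1", "1", "2"],
    "is_aftershow": [False, True, False],
    "minutes": [50.0, 10.0, 40.0],
})

main = df_demo[~df_demo["is_aftershow"]]
after = df_demo[df_demo["is_aftershow"]]

# how="left" keeps every main episode even when no Aftershow row matches
merged = pd.merge(main, after, how="left", on="show_no", suffixes=("_main", "_after"))
merged["minutes"] = merged["minutes_main"].fillna(0) + merged["minutes_after"].fillna(0)
print(merged[["show_no", "minutes_main", "minutes_after", "minutes"]])
```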

That is it for the basic preprocessing for now; I'll explain the more involved steps as they come up.

Visualization

With the data ready, let's visualize it.

Histogram

I usually write the matplotlib API calls from scratch for each chart, but for the histogram I want fine-grained control over the display, so I wrapped it in a function.

import matplotlib.pyplot as plt
import seaborn as sns

def plot_hist(s, title, bins=30):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)

    # Summary statistics shown in the top-right corner
    mean = s.mean().round(2)
    median = s.median().round(2)
    std = s.std().round(2)

    # Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is its successor
    sns.distplot(s, ax=ax, bins=bins, kde_kws={"color": "k", "lw": 3})
    ax.set_title(title, fontsize=20)
    ax.tick_params(axis='x', which='major', labelsize=20)
    ax.set_xlabel('')
    ax.set_ylabel('')
    # Show the density on the y axis as percentages
    vals = ax.get_yticks()
    ax.set_yticklabels(['{:,.2%}'.format(x) for x in vals])
    ax.text(0.99, 0.99, f"Mean: {mean:.2f}\nMedian: {median:.2f}\nStd: {std:.2f}", horizontalalignment='right', verticalalignment='top', transform=ax.transAxes, fontsize=20)

Calling it like this produces the plot.

plot_hist(df.query('~is_aftershow')['minutes'], '[Main episodes] Distribution of recording time with kernel density estimate (min)')

f:id:mergyi:20181213053754p:plain - Distribution of recording time

Time series

Next, let's plot the time series.

fig = plt.figure(figsize=(16, 8))
ax = fig.add_subplot(1, 1, 1)
dfm.plot('date', 'minutes', linewidth=5, linestyle='--', ax=ax)
dfm.set_index('date')['minutes'].rolling(window=4).mean().plot(linewidth=5, ax=ax)
ax.set_title('[Overall] Is the recording time per episode getting longer? (min)', fontsize=20)
ax.tick_params(axis='x', labelsize='xx-large')
ax.set_xlabel('')
ax.set_ylabel('')

f:id:mergyi:20181213054212p:plain - Is the recording time per episode getting longer?

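Setting the datetime column as the index with set_index is all it takes to plot a time series. The point to note is that the dashed line is the raw data and the solid line is a moving average over 4 episodes. The moving average is easy to compute with pandas' rolling.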
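As a quick illustration of what `rolling(window=4).mean()` computes, here is a sketch on five invented values. The first three results are NaN because a full window of four observations is not yet available.

```python
import pandas as pd

# Five illustrative recording times indexed by date
minutes = pd.Series(
    [40.0, 60.0, 50.0, 70.0, 80.0],
    index=pd.to_datetime(
        ["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22", "2018-01-29"]
    ),
)

# Each value is the mean of the current and the three preceding observations
moving_avg = minutes.rolling(window=4).mean()
print(moving_avg)
```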

Next, let's also plot the totals per month.

fig = plt.figure(figsize=(16, 4))
ax = fig.add_subplot(1, 1, 1)
# Total minutes per calendar month (on pandas 2.2+ the month-end alias is 'ME' rather than 'M')
monthly = df.groupby(pd.Grouper(key='date', freq='1M'))['minutes'].sum()
ax = monthly.plot(linewidth=5, linestyle='--', ax=ax)
ax = monthly.rolling(window=4).mean().plot(linewidth=5, ax=ax)
ax.set_title('Total minutes per month', fontsize=20)
ax.tick_params(axis='x', labelsize='xx-large')
ax.set_xlabel('')
ax.set_ylabel('')

f:id:mergyi:20181213054243p:plain - Total minutes per month

The difference from the previous plot is that pd.Grouper is used to set the period. sinhrks's blog covers this in detail as well, so please have a look.

sinhrks.hatenablog.com
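A small sketch of grouping by month with `pd.Grouper`, using invented data. It uses the month-start alias `'MS'`, which labels each bin by the first day of the month instead of the last; that is a variation I chose for the example, not the setting used above. Months without any rows still appear, with a sum of 0.

```python
import pandas as pd

demo = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-03-02"]),
    "minutes": [50.0, 30.0, 45.0],
})

# Group rows into calendar months based on the 'date' column
monthly = demo.groupby(pd.Grouper(key="date", freq="MS"))["minutes"].sum()
print(monthly)
```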

Bar chart

Next is the bar chart.
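The plotting code in this section and the next uses `s_persons` and `dfms`, which are not defined in this post. My assumption is that `s_persons` is a Series with one entry per guest appearance and `dfms` is the merged table with one row per episode and guest. Under that assumption, and with invented names and numbers, they could be built roughly like this:

```python
import pandas as pd

# Toy stand-in for the merged DataFrame; 'persons' holds a list per episode
dfm_demo = pd.DataFrame({
    "show_no": ["1", "2", "3"],
    "minutes": [60.0, 40.0, 75.0],
    "persons": [["Alice", "Bob"], ["Alice"], ["Bob", "Carol"]],
})

# Assumption: one row per (episode, guest), with the column renamed to 'person'
dfms_demo = dfm_demo.explode("persons").rename(columns={"persons": "person"})
s_persons_demo = dfms_demo["person"]

counts = s_persons_demo.value_counts()
print(counts)
```

The notebook on GitHub is the authoritative source for how these variables are actually created.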

fig = plt.figure(figsize=(16,8))

# Top 10 guests by number of appearances, as a DataFrame with 'name' and 'value' columns
tmp_df = s_persons.value_counts()[:10].rename_axis('name').reset_index(name='value')
ax = sns.barplot(x='value', y='name', data=tmp_df)
max_ = tmp_df['value'].max()

# Print the count to the right of each bar
for i, (_, row) in enumerate(tmp_df.iterrows()):
    ax.text(row['value'] + max_*.03, i+0.1, row['value'], color='black', ha="center", fontsize=20)

# Hide the frame, ticks, and grid so that only the bars and labels remain
for spine in ax.spines.values():
    spine.set_visible(False)
ax.tick_params(bottom=False, left=False, labelbottom=False)
ax.tick_params(axis='y', labelsize='x-large')
ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title('Ranking by number of appearances', fontsize=20)
ax.patch.set_facecolor('white')
ax.patch.set_alpha(0)
plt.grid(False)

f:id:mergyi:20181213054916p:plain - Ranking by number of appearances

The way I plot the bar chart borrows heavily from the Qiita article below. I use seaborn because it is slightly easier to write and it still renders nicely when there are many items.

qiita.com

box-plot

For the box-plot I'm just using seaborn's function as is, so I'll keep this brief. (lol)

# Showing everyone would be too much, so narrow it down to the top 10
top10_speaker = s_persons.value_counts()[:10].index.tolist()
ax = sns.boxplot(data=dfms.query('person in @top10_speaker').sort_values('person'), x='person', y='minutes')
ax.set_title('Distribution of recording time by guest', fontsize=20)
ax.tick_params(axis='both', labelsize='x-large')
ax.set_xlabel('')
ax.set_ylabel('')

f:id:mergyi:20181213055047p:plain - Distribution of recording time by guest

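I did not use it this time because the result was not very readable, but the hue option can be handy. The plot below uses the hue option to show the main-episode and Aftershow distributions at the same time.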

ax = sns.boxplot(data=dfs.query('person in @top10_speaker').sort_values('person'), x='person', y='minutes', hue='is_aftershow')
ax.set_title('Distribution of recording time by guest', fontsize=20)
ax.tick_params(axis='both', labelsize='x-large')
ax.set_xlabel('')
ax.set_ylabel('')

f:id:mergyi:20181213055348p:plain - With the hue option added

Word cloud

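Last is the word cloud. Surprisingly, this too is easy to make with a Python library. First, preprocess the data as shown below: the show note text is split on whitespace, and common words and symbols are removed. Splitting on whitespace is not ideal, because a name like "Alpha Go" is broken into "Alpha" and "Go". The standard approach is to first build a dictionary of proper nouns such as "Alpha Go", but that is a lot of work, so I kept it simple this time. (lol)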

shownotes = []
for i, row in df.iterrows():
    tmp_shownote = row['shownotes']
    shownotes.extend(tmp_shownote)
s_shownotes = pd.Series(shownotes)

# Exclude common words and symbols
ignore_word = [':', '/', '-', '–', 'to', 'your', 'for', 'the', 'and', '|', 'in', 'of', 'a', 'is', 'on', 'with', 'how', 'new', 'by', '2', 'at', 'rebuild:']

shownotes_split = []
for i, row in df.iterrows():
    tmp_shownote = row['shownotes']
    for t in tmp_shownote:
        tmp_split = t.split()
        shownotes_split.extend(tmp_split)
shownotes_split = [s.lower().replace(',', '') for s in shownotes_split]
shownotes_split = [s for s in shownotes_split if s not in ignore_word]
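For the "Alpha Go" problem mentioned above, one lightweight workaround is to join known multi-word names before splitting. This is only a sketch, not what the original notebook does; the phrase list, the shortened ignore list, and the function name `tokenize` are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List

# Hypothetical list of multi-word proper nouns to keep together
PHRASES = ["alpha go"]
# Shortened, illustrative stop-word list
IGNORE = {"to", "the", "and", "of", "a", "is"}


def tokenize(titles: Iterable[str]) -> List[str]:
    """Lowercase, protect known phrases, split on whitespace, drop stop words."""
    tokens: List[str] = []
    for title in titles:
        text = title.lower().replace(",", "")
        for phrase in PHRASES:
            # Join the phrase with an underscore so split() keeps it as one token
            text = text.replace(phrase, phrase.replace(" ", "_"))
        tokens.extend(t for t in text.split() if t not in IGNORE)
    return tokens


tokens = tokenize(["Alpha Go and the future of Go", "Introduction to Go"])
print(Counter(tokens).most_common(3))
```

If the counts are computed this way, WordCloud's `generate_from_frequencies` accepts a word-to-count mapping directly.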

With the data ready, let's make the word cloud. Someone has kindly written the library below, so producing one is simply a matter of using it. See its documentation for detailed usage.

github.com

from wordcloud import WordCloud

# Create a dedicated figure and axes so the settings below apply to this plot
fig, ax = plt.subplots(figsize=(15, 12))
wordcloud = WordCloud(background_color="white", width=900, height=500).generate(" ".join(shownotes_split))
ax.imshow(wordcloud)
# Hide the axes, ticks, and frame around the image
ax.axis('off')