[pandas] Loading HTML (Web) Tables

HAN_PY 2021. 1. 14. 16:41

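The pandas read_html() function finds every piece of tabular data inside <table> tags in an HTML page and converts each one into a DataFrame. It returns a list whose elements are those DataFrames, one per table.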
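To make the return type concrete, here is a minimal sketch that parses a small inline HTML string instead of a file. The HTML content and the variable names are made up for illustration, and it assumes a parser backend that read_html can use (such as lxml) is installed. The string is wrapped in StringIO because recent pandas versions deprecate passing literal HTML directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML containing two <table> elements (illustration only).
html = """
<table>
  <tr><th>c0</th><th>c1</th></tr>
  <tr><td>0</td><td>1</td></tr>
  <tr><td>2</td><td>3</td></tr>
</table>
<table>
  <tr><th>name</th><th>year</th></tr>
  <tr><td>NumPy</td><td>2006</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))

print(type(tables), len(tables))
for i, table in enumerate(tables):
    # Each element is an ordinary DataFrame.
    print(i, table.shape, list(table.columns))
```

Because the result is always a list, even a page with a single table has to be accessed as tables[0].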

Suppose we have the following HTML document. Its source looks like this.

<table>
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>c0</th>
      <th>c1</th>
      <th>c2</th>
      <th>c3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>1</td>
      <td>4</td>
      <td>7</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1</td>
      <td>2</td>
      <td>5</td>
      <td>8</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2</td>
      <td>3</td>
      <td>6</td>
      <td>9</td>
    </tr>
  </tbody>
</table>


<table>
  <thead>
    <tr style="text-align: right;">
      <th>name</th>
      <th>year</th>
      <th>developer</th>
      <th>opensource</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>NumPy</th>
      <td>2006</td>
      <td>Travis Oliphant</td>
      <td>True</td>
    </tr>
    <tr>
      <th>matplotlib</th>
      <td>2003</td>
      <td>John D. Hunter</td>
      <td>True</td>
    </tr>
    <tr>
      <th>pandas</th>
      <td>2008</td>
      <td>Wes Mckinneye</td>
      <td>True</td>
    </tr>
  </tbody>
</table>

A detailed explanation of HTML is omitted here; see the linked post for more on that. Now let's load the document and check the result.

# Load a document that was saved locally as html_file.html.
# To read a web page instead, assign its URL to `url` rather than a file path.
import pandas as pd
url = './html_file.html'

tables = pd.read_html(url)

print(tables)

#output
[   Unnamed: 0  c0  c1  c2  c3
 0           0   0   1   4   7
 1           1   1   2   5   8
 2           2   2   3   6   9,
          name  year        developer  opensource
 0       NumPy  2006  Travis Oliphant        True
 1  matplotlib  2003   John D. Hunter        True
 2      pandas  2008    Wes Mckinneye        True]
 
 
 
 
# tables is a list, so pick the second table (index 1) and use 'name' as the index
df = tables[1]
print(df.set_index('name'))

#output
            year        developer  opensource
name                                         
NumPy       2006  Travis Oliphant        True
matplotlib  2003   John D. Hunter        True
pandas      2008    Wes Mckinneye        True

As the output shows, the tables come back as a list. If you are not familiar with set_index, review the pandas basics in the linked post first.
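As a possible alternative to indexing into the list and calling set_index afterwards, read_html itself accepts match (keep only the tables whose text matches a string or regex) and index_col (the column to use as the index). The sketch below uses a made-up inline HTML string rather than html_file.html, and assumes a parser backend such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables (illustration only).
html = """
<table>
  <thead><tr><th>c0</th><th>c1</th></tr></thead>
  <tbody><tr><td>0</td><td>1</td></tr></tbody>
</table>
<table>
  <thead><tr><th>name</th><th>year</th></tr></thead>
  <tbody>
    <tr><td>NumPy</td><td>2006</td></tr>
    <tr><td>pandas</td><td>2008</td></tr>
  </tbody>
</table>
"""

# match keeps only the tables containing the text "NumPy";
# index_col=0 uses the first column ('name') as the index.
tables = pd.read_html(StringIO(html), match="NumPy", index_col=0)

# The result is still a list, now holding only the matching table.
df = tables[0]
print(df)
```

Whether this is preferable to set_index is a matter of taste: filtering at read time avoids depending on the position of a table in the page, which can change if the page layout changes.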