A web scraping Python script with the pandas and requests packages
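I wrote a Python script that fetches a web page and extracts some baseball data from it.
Pandas has a function called “read_html”, so you don’t have to write the parsing code yourself with a package like BeautifulSoup! (pandas still needs an HTML parser such as lxml installed to do the work behind the scenes.)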
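To show what read_html gives you back, here is a minimal sketch that runs on a small made-up HTML string instead of a real page. The sample table, the player names, and the variable names are placeholders for illustration only, not data from the site.

```python
from io import StringIO

import pandas as pd

# Made-up sample HTML standing in for a downloaded page (not real data).
SAMPLE_HTML = """
<table>
  <thead><tr><th>Player</th><th>HR</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>10</td></tr>
    <tr><td>Player B</td><td>7</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds,
# so you pick the one you want by index.
tables = pd.read_html(StringIO(SAMPLE_HTML))
df = tables[0]
print(df)
```

Because the result is always a list, the `[0]` at the end of a read_html call simply takes the first matching table.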
The structure of the script is the following:
- Get the page's HTML with the requests package.
- Read the HTML table into pandas.
- Write the result to a CSV file.
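from io import StringIO

import pandas as pd
import requests

URL = 'https://www.baseball-almanac.com/hitting/hihr5.shtml'

def get_table(html, table):
    # `table` is a dict of HTML attributes identifying the <table> to read.
    # Wrap the string in StringIO: newer pandas deprecates literal HTML input.
    df = pd.read_html(StringIO(html), attrs=table, header=1)[0]
    return df

def main():
    response = requests.get(URL, timeout=30)
    response.raise_for_status()  # stop early on an HTTP error
    html = response.text
    df = get_table(html, {'class': 'boxed'})
    df.to_csv('HR Year-by-Year Leaders.csv', index=False)

if __name__ == '__main__':
    main()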
Hope this is helpful in some way for those who are learning / using Python.
Hi, thanks for highlighting that feature. I have a question about your code: why do you define “table” as an argument in the function that creates the DataFrame object?
My intention was that since we’re scraping tables on the web page, I called it table. You can feel free to change it however you want though!
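For anyone curious what that argument can do, here is a small sketch, assuming the `table` argument is passed through to read_html's `attrs` parameter so the caller decides which table to pick. The two-table HTML string and its values are made up for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second has class="boxed".
SAMPLE_HTML = """
<table class="plain">
  <thead><tr><th>Note</th></tr></thead>
  <tbody><tr><td>ignore me</td></tr></tbody>
</table>
<table class="boxed">
  <thead><tr><th>Year</th><th>HR</th></tr></thead>
  <tbody>
    <tr><td>2001</td><td>10</td></tr>
    <tr><td>2002</td><td>12</td></tr>
  </tbody>
</table>
"""


def get_table(html, table):
    """Return the first table whose HTML attributes match the `table` dict."""
    return pd.read_html(StringIO(html), attrs=table)[0]


# The same function can now pull either table, depending on the dict passed in.
boxed = get_table(SAMPLE_HTML, {"class": "boxed"})
plain = get_table(SAMPLE_HTML, {"class": "plain"})
print(boxed)
print(plain)
```

Used this way, the argument earns its place: the selection rule lives with the caller instead of being fixed inside the function.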