Michael DeReus
Towards Data Science
Gathering data is a vital part of any machine learning project, but many tutorials tend to use data that already exists in a convenient format. This is great for example cases, but not great for learning the whole process. In the real world, not all data can be found on Google… at least not yet. This is the part of the tutorial where we collect all of the statistics necessary for training our neural net for NHL prediction, but more importantly, I will show you how you can apply these concepts and collect any data you want from the world wide web.
For this example, we will be taking an entire season's worth of NHL daily game stats. The output will consist of two lines of data per game, each row being one team's perspective of the match. This data will be saved to a text file in CSV (Comma-Separated Values) format, which is essentially just an Excel worksheet, except that instead of lines separating the cells it has commas and newlines. For a frame of reference, we will start with an NHL.com daily stats page and end up with an output file containing roughly 2,100 lines of these team-by-team game records.
Selenium is essentially a library that can open an instance of a web browser and interact with it automatically, and BeautifulSoup is what parses the HTML pulled from the page into something we can work with. NumPy and Pandas are both used for data handling and formatting, which is not required if you just want to make a web scraper, but since we will be using this data later with TensorFlow, it's easiest to use NumPy arrays right off the bat, considering my plans for this data in particular (not to mention that it is more efficient than plain Python logic).
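If you want to follow along, the imports below cover that stack. This is a minimal sketch of the setup; you will also need the browser driver for your browser of choice (geckodriver for Firefox) available on your PATH for Selenium to work.

```python
from selenium import webdriver
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
```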
Getting right into the actual scraping, the first thing we need is a function that can access a web page and turn it into something more user-friendly. The plan is simply to write a short method that opens the page in Selenium, takes the HTML (the text-based framework behind every web page), turns it into soup that we can work with, and returns it to the caller.
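Here is a minimal sketch of such a function, assuming Firefox as the browser and lxml as the parser:

```python
def url_to_soup(url):
    # Open the page in a real Firefox window controlled by Selenium
    driver = webdriver.Firefox()
    driver.get(url)
    # Grab the raw HTML, then close the browser as soon as we have it
    source = driver.page_source
    driver.close()
    # Turn the raw HTML into soup we can search through
    soup = BeautifulSoup(source, 'lxml')
    return soup
```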
This short method does all of the actual 'scraping'. If you give it a URL, the HTML of the page will be returned in a format you can use to find whatever you need from the web page. Let's dig into how it works.
The first step is to define a web driver, which is essentially just the browser that Selenium will use. Something I like about Selenium is that it opens the web page in real time, so you can watch every action your program takes. I decided to use Firefox, but you can use any browser and Selenium will do all the work of controlling it; just define which one you would like to use. Now all you have to do is give the driver a URL and it will open the page automatically, but don't forget to close the driver once you are done pulling from the page.
The source variable contains the raw HTML pulled from the page. I use the term "raw" because it is exactly that: raw and messy. We must first turn this storm of text into soup that we can easily consume. So we simply run it through the BeautifulSoup function, which takes two inputs: a source page and a parser.
The parser argument is not a parser as you probably know it. You can look at the BeautifulSoup documentation to find the different kinds of parsers, but they all do the same thing, which is clean up the HTML so that we can parse through it with our own code. Some might work while others won't, but 'lxml' just happened to do the trick in this case. There is one catch, though: before you can specify your parser, you must open up a command line and install it. Since I chose lxml, I would have to first run:
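```
pip install lxml
```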
After that you're all set, and the last two lines of the function, which build the soup and return it, will run without complaint.
Before we can dig through the soup, we need to collect it first. All we have done so far is define a method that extracts data from one single web page, but we need an entire season's worth of daily game data, so we will be calling this method many times. First we need to figure out a way to rapidly gather all of this information and save it together for processing.
One vital concept for understanding this web scraper is the manipulation of URLs. The NHL website happens to use the same URL structure for every single day of statistics in the table. If you open any page of daily statistics on NHL.com and look at its URL, you will see a date embedded in it, in two different places. If you change those to a different date you will find that the link still works, and you will be redirected to the page with the statistics for that specific day.
We can use this to our advantage by automatically altering the date that is embedded in the URL. To put this concept into effect, we need a function that will generate a link to the NHL.com page for any specific date, and that is exactly what the few lines of code below do.
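A sketch of that function follows. The query string here is only a placeholder following the pattern described above (the same date appearing twice, as dateFrom and dateTo); the real report URL on NHL.com carries more parameters, so copy yours straight from the browser's address bar.

```python
def nhl_daily_data(year, month, day):
    # Build the date string that appears twice in the NHL.com URL
    date = f"{year}-{month:02d}-{day:02d}"
    # Placeholder URL pattern: substitute the full query string from your browser
    url = f"http://www.nhl.com/stats/teams?reportType=game&dateFrom={date}&dateTo={date}&gameType=2"
    return url_to_soup(url)
```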
The function does this with the use of f-strings. You may not be familiar with this tool, but it is built into Python 3.6 and later. Essentially, whatever date you enter into the function's parameters, the f-string fills into the year, month, and day positions in the NHL.com URL. It then utilizes the url_to_soup() function from earlier to return the soup for that page, so instead of having to call both functions separately, this function does it for you.
For example, if you want to find statistics for the games played on 10/08/2018, you would call the function as nhl_daily_data(2018, 10, 8). It would then pull the source from that specific NHL.com web page, run it through the parser, and return it as soup.
You might be asking yourself: why would you need this function in your project? Well, the truth is, maybe you don't. The first step to finding web data is coming up with a plan of attack. Maybe all of your desired information is stored on the same page, in which case you could skip this step and just take all of the data you need in one sweep. If you aren't able to use this URL trick, Selenium offers functions for "manual" clicking that you can use to navigate through the website; you can find those in the Selenium documentation.
Next, I will go over how to read through HTML yourself so that you can tell your program what to look for.
In order to do this next part, you will need to know how to do some HTML reading. The first step is to navigate to a page that you need data from; I went to a random daily NHL stats page. Take a look at where your data is on the page. You can see that the game statistics are all embedded in some kind of table, so it is likely that this table is its own class. Your data might not be inside a unique JavaScript element like mine is, but the process is still the same.
Next, right-click on the text you want to use and press "Inspect Element". This will bring up the inspection pane in your web browser, where you can find all of the classes and sub-classes of the page. Look through these and find the "route" to the class where your desired object, chart, or information is located. You need this in order to tell your scraper where to look.
The find() function is what does the looking for you in the code. We don't need to make the program finger through every fold in the HTML to get to the table; we just need to specify which class to look for, and BeautifulSoup will find it.
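The one-liner might look something like the following. The tag or class you pass to find() is whatever you located in the inspection pane, so treat the 'tbody' below as a stand-in:

```python
# 'tbody' is a stand-in: use the tag or class you found with Inspect Element
table_body = soup.find('tbody')
```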
This one-liner locates the class that holds the table data and turns it into the object "table_body", which we can then parse through to record the data we need.
The table is effectively in the format of a two-dimensional array, with each game being two rows; each row holds the statistics from one team's perspective of that game.
The chunk of code below may look like the hard part, but really we are on the home stretch.
Using what we know from reading the HTML, we can simply use the find_all() function with a for loop in front of it, and it will automatically pull everything within the class, one row at a time. But we can't stop there, because the table element contains more than just the data we want; it also has things we don't want, such as headers and divisions. If we look at the specific HTML for the data portion of the table, we can see that it has a different tag: td (table data). This means we need a nested for loop in order to narrow the search down to only this table data.
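A sketch of that nested loop, continuing from the table_body object above, could look like this:

```python
game_rows = []
for row in table_body.find_all('tr'):
    # Only keep rows that actually contain table data (td) cells,
    # which skips header and divider rows
    cells = row.find_all('td')
    if cells:
        game_rows.append([cell.get_text(strip=True) for cell in cells])
```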
This finds a list of the row data for us, one row at a time. We can then select what information we want, sort it, and save it.
To wrap this all up with a nice bow, I defined a batch collection method. This method calls the nhl_daily_data() method one day at a time for a range of dates, builds a table of everything we have collected, and then writes it to your computer as a CSV file.
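One possible sketch of such a batch collector is below; the date range, output filename, and the use of Pandas for the CSV write are my own choices here, so adjust them to your season and setup. It reuses the pd import, url_to_soup(), and nhl_daily_data() from earlier.

```python
from datetime import date, timedelta

def batch_collect(start, end, out_path='nhl_season_stats.csv'):
    # Walk through every calendar day in the range and scrape its stats page
    all_rows = []
    day = start
    while day <= end:
        soup = nhl_daily_data(day.year, day.month, day.day)
        table_body = soup.find('tbody')  # same stand-in selector as above
        if table_body is not None:
            for row in table_body.find_all('tr'):
                cells = [td.get_text(strip=True) for td in row.find_all('td')]
                if cells:
                    all_rows.append(cells)
        day += timedelta(days=1)
    # Dump everything into one CSV file for later processing
    pd.DataFrame(all_rows).to_csv(out_path, index=False, header=False)

# Example: collect the 2018-19 regular season
batch_collect(date(2018, 10, 3), date(2019, 4, 6))
```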
From here this data will need a lot more processing, but everything we need is now lumped into one file in a format that is easy to read.