LinkedIn Scraping with Selenium and Beautiful Soup !!!

Nikos kalikis
7 min read · Jan 20, 2021

The subject I will discuss, as you may have figured out from the title, is web scraping with two of the most powerful scraping tools: Selenium and Beautiful Soup. Certainly, a quick search online will turn up many applications that could do the work for you, but consider their cost and the time it takes to learn them. With my approach you also have to put in some effort to learn the modules, but it is a skill you can reuse for every source you want to parse data from; and believe me, it becomes a piece of cake after some practice.

Why do we need data?

Where can I find the information that I need?

How to transform the information to a readable file?

How many times have you had a business idea, been curious about the current market price of your favourite car, or wondered what salary you deserve based on your skills? Although we could answer these kinds of questions, we get discouraged by the idea of collecting the data that would lead us to the answer. In this article I will try to present the scraping process as simply as I can, in order to motivate you to answer your own questions.

Create a problem

I really love learning by trying to solve problems, then constructing and optimising the solution. So, here is the problem: I want to start my own business and I am wondering about the competition in the market; hmm, do you have the same problem?


Where could I find the data?

We need a platform that lists companies along with some of their details; first thought: LinkedIn, a platform from which we can parse many valuable features.

But, before diving into the scraping process, I would like to present the two modules being used and their functionality.

The scraping tools

Beautiful Soup and Selenium are tools for scraping unstructured data and converting it into machine-readable, structured data. Beautiful Soup is a module for parsing static web pages: whatever you can see in the page source, you can extract. Selenium, on the other hand, can scrape data that lives in interactive pages, pages rendered with JavaScript.

Download the essential web driver

The LinkedIn platform is not a static page, so we will use both modules to scrape the data we need. First, install the modules; second, install a web driver so the scraping process can be automated. In my case, I installed the latest version of the Chrome web driver. Note that it is essential to pick the driver that matches your current browser version; if you don't know your browser version, you can download the latest driver and update your browser. The last thing to take care of before jumping into the scraping part is to save the driver in a path you can easily find. For instance, I saved the driver in the same folder as all my code files.

Let’s start scraping the page.

We are now all set to open an editor and start coding. I am working in a Jupyter notebook, a platform where you can print the output of every coding step. I think Jupyter is well suited to web scraping and, in general, an amazing programming tool for data analysis.

Let’s start by importing the required libraries.
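A minimal set of imports for this walkthrough might look like the sketch below; the original code screenshot is not reproduced here, so the exact module list is my assumption.

```python
# Minimal imports for this walkthrough
# Install the third-party packages first: pip install selenium beautifulsoup4
from selenium import webdriver                 # drives the browser
from selenium.webdriver.common.by import By    # locates elements on the page
from bs4 import BeautifulSoup                  # parses the page source
import time                                    # pauses while pages load
import csv                                     # writes the final file
```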

After importing the modules, we are ready to open a browser with the web driver and navigate the page. I will use Chrome, but you can use whichever browser you have by downloading the compatible web driver. Notice that it is essential to keep the browser window it opens always open, as this window will be the base of our next search steps.

Opening the Chrome Driver

And load the LinkedIn login page using the URL “https://www.linkedin.com/login”.
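A sketch of these two steps, assuming chromedriver can be found by Selenium (older Selenium versions also accept the driver path directly, e.g. webdriver.Chrome('./chromedriver')):

```python
# Start a Chrome session controlled by the web driver
driver = webdriver.Chrome()

# Navigate to the LinkedIn login page
driver.get('https://www.linkedin.com/login')
```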

Running the code, we get the following window, where we have to fill in our personal details. In order to keep my credentials secure, I created a .txt file in which I saved my email and my password. If you are not planning to share your code, you can simply define two variables holding your credentials.
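Roughly, the sign-in step can be sketched like this; the credentials.txt layout (email on the first line, password on the second) and the element ids 'username' and 'password' are my assumptions and may differ on the live page:

```python
# Read the credentials from a local file: email on line 1, password on line 2
with open('credentials.txt') as f:
    email, password = [line.strip() for line in f.readlines()[:2]]

# Fill in the sign-in form (the element ids are assumptions and may change)
driver.find_element(By.ID, 'username').send_keys(email)
driver.find_element(By.ID, 'password').send_keys(password)

# Submit the form
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()
```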

Running the notebook cells one by one, the Sign In form is completed automatically.

Reaching the LinkedIn home page

The Selenium module automatically filled in my personal details and, by clicking submit, logged me into my LinkedIn home page.

Now we are on the main LinkedIn page and we want to parse data based on a certain search. We could implement this step in different ways, but I think the easiest is to determine the link of the page we are looking for. Otherwise, we would have to use Selenium to click the search box, type a query and adjust the filters in order to get the information we need. By using the direct link we avoid many lines of code.

Determine the search link.

This code is adjustable, so anyone can search for companies in specific sectors. If you are interested in something else, change the link and follow the next steps. Notice that I used Python's time module and its sleep function. Don't skip this command: give the browser some time to load the page so that the page source can be parsed properly.
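As a sketch, the search link can be stored and loaded like this; the URL below is only a placeholder, so copy the exact link from your own browser's address bar after applying your filters:

```python
# Example search URL: replace with the link copied from your browser;
# the query parameters here are placeholders, not a guaranteed URL scheme
search_url = 'https://www.linkedin.com/search/results/companies/?keywords=logistics'

driver.get(search_url)
time.sleep(5)  # give the page a few seconds to finish loading before parsing
```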

Load the scraping page

We are now on the page that contains the information we need to parse. At this step it is necessary to know some HTML structure in order to inspect the required elements. I used the Beautiful Soup library to parse the elements on this page. We are effectively dealing with a static page now, and we can detect the elements by right-clicking and choosing Inspect.
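Handing the rendered page over to Beautiful Soup is a one-liner:

```python
# Parse the HTML that Selenium has already rendered
soup = BeautifulSoup(driver.page_source, 'html.parser')
```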

After inspecting the page and determining the elements we need, it is time to develop a scraping function. This part scared me when I started web scraping, because I was hunting for small, specific parts. The best way to reach the essential data on a page is to find one big element first, for instance a class that contains a list of objects, and then break the list down into smaller elements such as title, description, number of followers, etc. Let's jump to the point and see the code:

Creating the scraping function
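Here is a hedged sketch of such a function. The tag and class names passed to find and find_all are hypothetical, since LinkedIn changes its markup frequently; replace them with whatever you actually see in the inspector.

```python
def scrape_companies(page_source):
    """Extract company name, description and follower count from one results page.

    The class names below are placeholders; inspect the live page and
    substitute the selectors you actually find there.
    """
    soup = BeautifulSoup(page_source, 'html.parser')
    companies = []

    # Start from one big container per result, then break it into smaller pieces
    for card in soup.find_all('li', class_='search-result'):
        name = card.find('span', class_='entity-name')
        description = card.find('p', class_='entity-description')
        followers = card.find('span', class_='entity-followers')

        companies.append({
            'name': name.get_text(strip=True) if name else None,
            'description': description.get_text(strip=True) if description else None,
            'followers': followers.get_text(strip=True) if followers else None,
        })

    return companies
```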

We are almost done with the scraping process. I say almost because we still have to iterate through all the available search result pages and apply the scraping function to each one. In my case there are 13 pages, and I figured out that I could visit every page by appending a page number to the end of the URL.

Note: the page numbers must not exceed the number of available search result pages.

Finding the total number of search pages and appending all the search URLs to a list:

I would suggest defining the number of pages you want to inspect. Sometimes your query will return more than a hundred pages, and you may not want to scrape all of them; so you can define a threshold regardless of how much data you need.
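One way to build that list, assuming pagination works by appending a page parameter to the search URL (13 pages in my case) and capping it with a threshold:

```python
# Cap how many result pages to visit, even if the query returns more
max_pages = 13

# Build one URL per page by appending the page number to the search link
page_urls = [f'{search_url}&page={page}' for page in range(1, max_pages + 1)]
```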

Iterating through all the scraping pages
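The loop itself is short: load each URL, wait for it to render, and feed the page source to the scraping function. This is a sketch reusing the names defined above.

```python
all_companies = []

for url in page_urls:
    driver.get(url)
    time.sleep(5)  # let the page render before grabbing its source
    all_companies.extend(scrape_companies(driver.page_source))
```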

Convert the list to a tidy CSV file
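Finally, the list of dictionaries can be written out with the standard csv module (companies.csv is just an example filename):

```python
# Write the collected records to a CSV file
with open('companies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'description', 'followers'])
    writer.writeheader()
    writer.writerows(all_companies)
```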

Be creative

I am definitely not saying this is the optimal solution; maybe you can find a more efficient way to parse data from all the pages. Furthermore, keep in mind that web pages have different structures, and occasionally you need to be creative. Finally, we have all the parts we need. So, iterate through all the pages, extract each page's source code and apply the scraping function; follow the same process for every available page. Lastly, don't forget to save the features into a list, and here is the file with the full code.

Open the CSV file and be ready to use the information to extract valuable insights.

Conclusion

We did it! We now have a file containing information extracted from multiple web pages using Beautiful Soup and Selenium! I believe we now have the basic skills to parse data from different pages.

I hope you enjoyed the process and if you followed along, let me know how it went!

Thanks for your time!!!
