GitHub Topics Scraper | Python Web-Scraping


Web scraping is a technique used to extract data from websites. It allows us to gather information from web pages and use it for various purposes, such as data analysis, research, or building applications.

In this article, we will explore a Python project called “GitHub Topics Scraper,” which leverages web scraping to extract information from the GitHub topics page and retrieve repository names and details for each topic.

GitHub is a widely popular platform for hosting and collaborating on code repositories. It offers a feature called “topics” that allows users to categorize repositories based on specific subjects or themes. The GitHub Topics Scraper project automates the process of scraping these topics and retrieving relevant repository information.

The GitHub Topics Scraper is implemented using Python and utilizes the following libraries:

  • requests: Used for making HTTP requests to retrieve the HTML content of web pages.
  • BeautifulSoup: A powerful library for parsing HTML and extracting data from it.
  • pandas: A versatile library for data manipulation and analysis, used for organizing the scraped data into a structured format.

Let’s dive into the code and understand how each component of the project works.

import requests
from bs4 import BeautifulSoup
import pandas as pd

The above code snippet imports three libraries: requests, BeautifulSoup, and pandas.

def topic_page_authentication(url):
    topics_url = url
    response = requests.get(topics_url)
    page_content = response.text
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc

This snippet defines a function called topic_page_authentication that takes a URL as an argument.

Here’s a breakdown of what the code does:

1. topics_url = url: This line assigns the provided URL to the variable topics_url. This URL identifies the web page whose content we want to retrieve.

2. response = requests.get(topics_url): This line uses the requests.get() function to send an HTTP GET request to the topics_url and stores the response in the response variable. This request is used to fetch the HTML content of the web page.

3. page_content = response.text: This line extracts the HTML content from the response object and assigns it to the page_content variable. The response.text attribute retrieves the text content of the response.

4. doc = BeautifulSoup(page_content, 'html.parser'): This line creates a BeautifulSoup object called doc by parsing the page_content using the 'html.parser' parser. This allows us to navigate and extract information from the HTML structure of the web page.

5. return doc: This line returns the BeautifulSoup object doc from the function. This means that when the topic_page_authentication function is called, it will return the parsed HTML content as a BeautifulSoup object.

The purpose of this function is to retrieve the HTML content of the web page at the provided URL. Despite its name, it performs no authentication: it uses the requests library to send an HTTP GET request, reads the response content, and then parses it using BeautifulSoup to create a navigable object representing the HTML structure.

Please note that this function only handles the initial steps of fetching and parsing the page; it doesn’t perform any specific scraping or data extraction tasks.
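
One thing the function does not do is check whether the request actually succeeded. A minimal sketch of such a check follows, assuming a hypothetical helper named page_text_or_raise that is not part of the original script. The SimpleNamespace objects are stand-ins for real responses so the example runs without network access.

```python
from types import SimpleNamespace


def page_text_or_raise(response):
    """Return the page HTML, or raise if the server did not answer 200 OK.

    Hypothetical helper: `response` is anything exposing `status_code` and
    `text`, as the object returned by requests.get() does.
    """
    if response.status_code != 200:
        raise RuntimeError(f"Failed to load page: HTTP {response.status_code}")
    return response.text


# Offline stand-ins for real responses (assumed values, for illustration only)
ok_response = SimpleNamespace(status_code=200, text="<html><p>Topics</p></html>")
missing_response = SimpleNamespace(status_code=404, text="Not Found")

html = page_text_or_raise(ok_response)

try:
    page_text_or_raise(missing_response)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

In the scraper itself, this could wrap the existing call as page_text_or_raise(requests.get(topics_url, timeout=10)); the timeout argument keeps a stalled connection from hanging the script.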

def topicSraper(doc):
    # Extract title
    title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': title_class})

    # Extract description
    description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': description_class})

    # Extract link
    link_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': link_class})

    # Extract all the topic names
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)

    # Extract the description text of each topic
    topic_description = []
    for tag in topic_desc_tags:
        topic_description.append(tag.text.strip())

    # Extract the URL of each topic
    topic_urls = []
    base_url = "https://github.com"
    for tags in topic_link_tags:
        topic_urls.append(base_url + tags['href'])

    topics_dict = {
        'Title': topic_titles,
        'Description': topic_description,
        'URL': topic_urls
    }

    topics_df = pd.DataFrame(topics_dict)

    return topics_df

This snippet defines a function called topicSraper (the spelling used throughout the script) that takes a BeautifulSoup object (doc) as an argument.

Here’s a breakdown of what the code does:

1. title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary': This line defines the CSS class name (title_class) for the HTML element that contains the topic titles on the web page.

2. topic_title_tags = doc.find_all('p', {'class':title_class}): This line uses the find_all() method of the BeautifulSoup object to find all HTML elements (<p>) with the specified CSS class (title_class). It retrieves a list of BeautifulSoup Tag objects representing the topic title tags.

3. description_class = 'f5 color-fg-muted mb-0 mt-1': This line defines the CSS class name (description_class) for the HTML element that contains the topic descriptions on the web page.

4. topic_desc_tags = doc.find_all('p', {'class':description_class}): This line uses the find_all() method to find all HTML elements (<p>) with the specified CSS class (description_class). It retrieves a list of BeautifulSoup Tag objects representing the topic description tags.

5. link_class = 'no-underline flex-1 d-flex flex-column': This line defines the CSS class name (link_class) for the HTML element that contains the topic links on the web page.

6. topic_link_tags = doc.find_all('a',{'class':link_class}): This line uses the find_all() method to find all HTML elements (<a>) with the specified CSS class (link_class). It retrieves a list of BeautifulSoup Tag objects representing the topic link tags.

7. topic_titles = []: This line initializes an empty list to store the extracted topic titles.

8. for tag in topic_title_tags: ...: This loop iterates over the topic_title_tags list and appends the text content of each tag to the topic_titles list.

9. topic_description = []: This line initializes an empty list to store the extracted topic descriptions.

10. for tag in topic_desc_tags: ...: This loop iterates over the topic_desc_tags list and appends the stripped text content of each tag to the topic_description list.

11. topic_urls = []: This line initializes an empty list to store the extracted topic URLs.

12. base_url = "https://github.com": This line defines the base URL of the website.

13. for tags in topic_link_tags: ...: This loop iterates over the topic_link_tags list and appends the concatenated URL (base URL + href attribute) of each tag to the topic_urls list.

14. topics_dict = {...}: This block creates a dictionary (topics_dict) that contains the extracted data: topic titles, descriptions, and URLs.

15. topics_df = pd.DataFrame(topics_dict): This line converts the topics_dict dictionary into a pandas DataFrame, where each key becomes a column in the DataFrame.

16. return topics_df: This line returns the pandas DataFrame containing the extracted data.

The purpose of this function is to scrape and extract information from the provided BeautifulSoup object (doc). It retrieves the topic titles, descriptions, and URLs from specific HTML elements on the web page and stores them in a pandas data frame for further analysis or processing.
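
To see the three find_all() calls in action without contacting GitHub, the sketch below runs them against a small hand-written HTML fragment. The fragment is an assumption that imitates the structure the scraper expects; the live markup may differ. Note that when the class value passed to find_all() contains spaces, Beautiful Soup matches only elements whose class attribute equals that exact string, in the same order.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the structure the scraper expects (an assumption;
# the live GitHub markup may differ)
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D modeling uses specialized software.
  </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    Ajax is a technique for creating interactive web applications.
  </p>
</a>
"""

doc = BeautifulSoup(sample_html, 'html.parser')

# Same selectors as in topicSraper
title_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
desc_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})

titles = [tag.text for tag in title_tags]
descriptions = [tag.text.strip() for tag in desc_tags]  # strip() drops the padding
urls = ["https://github.com" + tag['href'] for tag in link_tags]
```

Because the three lists are built independently, they only line up row by row when every topic on the page has a title, a description, and a link.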

def topic_url_extractor(dataframe):
    url_lst = []
    for i in range(len(dataframe)):
        topic_url = dataframe['URL'][i]
        url_lst.append(topic_url)
    return url_lst

This snippet defines a function called topic_url_extractor that takes a pandas DataFrame (dataframe) as an argument.

Here’s a breakdown of what the code does:

1. url_lst = []: This line initializes an empty list (url_lst) to store the extracted URLs.

2. for i in range(len(dataframe)): ...: This loop iterates over the indices of the DataFrame rows.

3. topic_url = dataframe['URL'][i]: This line retrieves the value of the ‘URL’ column for the current row index (i) in the data frame.

4. url_lst.append(topic_url): This line appends the retrieved URL to the url_lst list.

5. return url_lst: This line returns the url_lst list containing the extracted URLs.

The purpose of this function is to extract the URLs from the ‘URL’ column of the provided DataFrame.

It iterates over each row of the DataFrame, retrieves the URL value for each row, and adds it to a list. Finally, the function returns the list of extracted URLs.

This function can be useful when you want to extract the URLs from a DataFrame for further processing or analysis, such as visiting each URL or performing additional web scraping on the individual web pages.
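
For comparison, pandas can produce the same list in a single call. The sketch below uses a toy DataFrame with illustrative values rather than scraped data.

```python
import pandas as pd

# Toy data standing in for the scraped topics (values are illustrative)
topics_df = pd.DataFrame({
    'Title': ['3D', 'Ajax'],
    'URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# Series.tolist() converts the whole column to a plain Python list
url_lst = topics_df['URL'].tolist()
```

A side benefit is that tolist() does not depend on the row labels. The loop version looks rows up by label with dataframe['URL'][i], which works here because a freshly built DataFrame is labelled 0, 1, 2, and so on, but could fail on a DataFrame that has been filtered or re-indexed.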

def parse_star_count(stars_str):
    stars_str = stars_str.strip()[6:]
    if stars_str[-1] == 'k':
        stars_str = float(stars_str[:-1]) * 1000
    return int(stars_str)

This snippet defines a function called parse_star_count that takes a string (stars_str) as an argument.

Here’s a breakdown of what the code does:

1. stars_str = stars_str.strip()[6:]: This line removes leading and trailing whitespace from the stars_str string using the strip() method. It then drops the first six characters with the slice [6:] and assigns the result back to stars_str. The purpose of this line is to discard the text that precedes the number in the tag, so that only the star count remains.

2. if stars_str[-1] == 'k': ...: This line checks if the last character of stars_str is ‘k’, indicating that the star count is in thousands.

3. stars_str = float(stars_str[:-1]) * 1000: This line converts the numeric part of the string (excluding the ‘k’) to a float and then multiplies it by 1000 to convert it to the actual star count.

4. return int(stars_str): This line converts the stars_str to an integer and returns it.

The purpose of this function is to parse and convert the star count from a string representation to an integer value. It handles cases where the star count is in thousands (‘k’) by multiplying the numeric part of the string by 1000. The function returns the parsed star count as an integer.

This function can be useful when you have star counts represented as strings, such as ‘1.2k’ for 1,200 stars, and you need to convert them to numerical values for further analysis or processing.
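
The fixed slice [6:] only works while the text before the number is exactly six characters long. As a hedged alternative, the sketch below searches for the number with a regular expression instead. The function name parse_star_text and the sample strings are assumptions made for illustration; they are not taken from the original script or from GitHub's actual markup.

```python
import re


def parse_star_text(text):
    """Hypothetical variant of parse_star_count that searches for the number
    rather than relying on a fixed six-character prefix."""
    match = re.search(r'(\d+(?:\.\d+)?)\s*(k?)', text.replace(',', ''), re.IGNORECASE)
    if match is None:
        raise ValueError(f"No star count found in {text!r}")
    number, suffix = match.groups()
    value = float(number)
    if suffix:
        value *= 1000  # 'k' means thousands
    # round() guards against float results that land just below a whole number
    return int(round(value))


thousands = parse_star_text("Star 1.2k")
plain = parse_star_text("  Star\n  987 ")
with_comma = parse_star_text("1,234")
```

Here round() is used before int() because int() truncates, so a product that came out a hair under a whole number would otherwise lose one star.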

def get_repo_info(h3_tags, star_tag):
    base_url = 'https://github.com'
    a_tags = h3_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

This snippet defines a function called get_repo_info that takes two arguments: h3_tags and star_tag.

Here’s a breakdown of what the code does:

1. base_url = 'https://github.com': This line defines the base URL of the GitHub website.

2. a_tags = h3_tags.find_all('a'): This line uses the find_all() method of the h3_tags object to find all HTML elements (<a>) within it. It retrieves a list of BeautifulSoup Tag objects representing the anchor tags.

3. username = a_tags[0].text.strip(): This line extracts the text content of the first anchor tag (a_tags[0]) and assigns it to the username variable. It also removes any leading or trailing whitespace using the strip() method.

4. repo_name = a_tags[1].text.strip(): This line extracts the text content of the second anchor tag (a_tags[1]) and assigns it to the repo_name variable. It also removes any leading or trailing whitespace using the strip() method.

5. repo_url = base_url + a_tags[1]['href']: This line retrieves the value of the ‘href’ attribute from the second anchor tag (a_tags[1]) and concatenates it with the base_url to form the complete URL of the repository. The resulting URL is assigned to the repo_url variable.

6. stars = parse_star_count(star_tag.text.strip()): This line extracts the text content of the star_tag object, removes any leading or trailing whitespace, and passes it as an argument to the parse_star_count function. The function returns the parsed star count as an integer, which is assigned to the stars variable.

7. return username, repo_name, stars, repo_url: This line returns a tuple containing the extracted information: username, repo_name, stars, and repo_url.

The purpose of this function is to extract information about a GitHub repository from the provided h3_tags and star_tag objects. It retrieves the username, repository name, star count, and repository URL by navigating and extracting specific elements from the HTML structure. The function then returns this information as a tuple.

This function can be useful when you want to extract repository information from a web page that contains a list of repositories, such as when scraping GitHub topics.
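
Since the function returns a plain tuple, callers have to remember which position holds which value. One possible refinement, sketched below, is a named tuple. The class name RepoInfo and the sample values are hypothetical and do not appear in the original script.

```python
from typing import NamedTuple


class RepoInfo(NamedTuple):
    """Hypothetical container mirroring the four values get_repo_info returns."""
    username: str
    repo_name: str
    stars: int
    repo_url: str


# Illustrative values, not scraped data
info = RepoInfo('example-user', 'example-repo', 1200,
                'https://github.com/example-user/example-repo')

# Fields are readable by name, yet the object still unpacks like a plain tuple
username, repo_name, stars, repo_url = info
```

With this shape, info.stars reads more clearly than info[2], while existing code that indexes by position keeps working.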

def topic_information_scraper(topic_url):
    # page authentication
    topic_doc = topic_page_authentication(topic_url)

    # extract name
    h3_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_class})

    # get star tag
    star_class = 'tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default'
    star_tags = topic_doc.find_all('a', {'class': star_class})

    # get information about the topic
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

This snippet defines a function called topic_information_scraper that takes a topic_url as an argument.

Here’s a breakdown of what the code does:

1. topic_doc = topic_page_authentication(topic_url): This line calls the topic_page_authentication function to fetch and parse the HTML content of the topic_url. The parsed HTML content is assigned to the topic_doc variable.

2. h3_class = 'f3 color-fg-muted text-normal lh-condensed': This line defines the CSS class name (h3_class) for the HTML element that contains the repository names within the topic page.

3. repo_tags = topic_doc.find_all('h3', {'class':h3_class}): This line uses the find_all() method of the topic_doc object to find all HTML elements (<h3>) with the specified CSS class (h3_class). It retrieves a list of BeautifulSoup Tag objects representing the repository name tags.

4. star_class = 'tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default': This line defines the CSS class name (star_class) for the HTML element that contains the star counts within the topic page.

5. star_tags = topic_doc.find_all('a',{'class':star_class}): This line uses the find_all() method to find all HTML elements (<a>) with the specified CSS class (star_class). It retrieves a list of BeautifulSoup Tag objects representing the star count tags.

6. topic_repos_dict = {...}: This block creates a dictionary (topic_repos_dict) that will store the extracted repository information: username, repository name, star count, and repository URL.

7. for i in range(len(repo_tags)): ...: This loop iterates over the indices of the repo_tags list, assuming that it has the same length as the star_tags list.

8. repo_info = get_repo_info(repo_tags[i], star_tags[i]): This line calls the get_repo_info function to extract information about a specific repository. It passes the current repository name tag (repo_tags[i]) and star count tag (star_tags[i]) as arguments. The returned information is assigned to the repo_info variable.

9. topic_repos_dict['username'].append(repo_info[0]): This line appends the extracted username from repo_info to the ‘username’ list in topic_repos_dict.

10. topic_repos_dict['repo_name'].append(repo_info[1]): This line appends the extracted repository name from repo_info to the ‘repo_name’ list in topic_repos_dict.

11. topic_repos_dict['stars'].append(repo_info[2]): This line appends the extracted star count from repo_info to the ‘stars’ list in topic_repos_dict.

12. topic_repos_dict['repo_url'].append(repo_info[3]): This line appends the extracted repository URL from repo_info to the ‘repo_url’ list in topic_repos_dict.

13. return pd.DataFrame(topic_repos_dict): This line converts the topic_repos_dict dictionary into a pandas DataFrame, where each key becomes a column in the DataFrame. The resulting data frame contains the extracted repository information.

The purpose of this function is to scrape and extract information about the repositories within a specific topic on GitHub. It fetches and parses the HTML content of the topic page, then extracts the repository names and star counts using specific CSS class names.

It calls the get_repo_info function for each repository to retrieve the username, repository name, star count, and repository URL.

The extracted information is stored in a dictionary and then converted into a pandas DataFrame, which is returned by the function.
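
The loop relies on repo_tags and star_tags having the same length. The sketch below shows what happens when they do not, and one way to notice it, using plain lists as stand-ins for the BeautifulSoup tags. The variable mismatch_warning and its message are illustrative additions.

```python
# Stand-in lists; in the scraper these would be BeautifulSoup tags
repo_tags = ['repo-a', 'repo-b', 'repo-c']
star_tags = ['1.2k', '987']  # one star tag is missing

# Indexing star_tags[i] for every i in range(len(repo_tags)) would raise
# IndexError here; zip stops at the shorter list instead
pairs = list(zip(repo_tags, star_tags))

if len(repo_tags) != len(star_tags):
    mismatch_warning = (
        f"Found {len(repo_tags)} repositories but {len(star_tags)} star counts; "
        "the CSS classes may be out of date"
    )
else:
    mismatch_warning = None
```

Keep in mind that zip only avoids the crash. If a star tag is missing from the middle of the page, the remaining pairs would be shifted, so a length mismatch is best treated as a sign that the selectors need checking.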

if __name__ == "__main__":
    url = 'https://github.com/topics'
    topic_dataframe = topicSraper(topic_page_authentication(url))
    topic_dataframe.to_csv('GitHubtopics.csv', index=None)

    # Make other CSV files according to the topics
    url = topic_url_extractor(topic_dataframe)
    name = topic_dataframe['Title']
    for i in range(len(topic_dataframe)):
        new_df = topic_information_scraper(url[i])
        new_df.to_csv(f'GitHubTopic_CSV-Files/{name[i]}.csv', index=None)

The code snippet demonstrates the main execution flow of the script.

Here’s a breakdown of what the code does:

1. if __name__ == "__main__":: This conditional statement checks if the script is being run directly (not imported as a module).

2. url = 'https://github.com/topics': This line defines the URL of the GitHub topics page.

3. topic_dataframe = topicSraper(topic_page_authentication(url)): This line retrieves the topic page’s HTML content using topic_page_authentication, and then passes the parsed HTML (doc) to the topicSraper function. It assigns the resulting data frame (topic_dataframe) to a variable.

4. topic_dataframe.to_csv('GitHubtopics.csv', index=None): This line exports the topic_dataframe DataFrame to a CSV file named ‘GitHubtopics.csv’. The index=None argument ensures that the row indices are not included in the CSV file.

5. url = topic_url_extractor(topic_dataframe): This line calls the topic_url_extractor function, passing the topic_dataframe as an argument. It retrieves a list of URLs (url) extracted from the data frame.

6. name = topic_dataframe['Title']: This line retrieves the ‘Title’ column from the topic_dataframe and assigns it to the name variable.

7. for i in range(len(topic_dataframe)): ...: This loop iterates over the indices of the topic_dataframe DataFrame.

8. new_df = topic_information_scraper(url[i]): This line calls the topic_information_scraper function, passing the URL (url[i]) as an argument. It retrieves repository information for the specific topic URL and assigns it to the new_df DataFrame.

9. new_df.to_csv(f'GitHubTopic_CSV-Files/{name[i]}.csv', index=None): This line exports the new_df DataFrame to a CSV file. The file name is dynamically generated using an f-string, incorporating the topic name (name[i]). The index=None argument ensures that the row indices are not included in the CSV file.

The purpose of this script is to scrape and extract information from the GitHub topics page and create CSV files containing the extracted data. It first scrapes the main topics page, saves the extracted information in ‘GitHubtopics.csv’, and then proceeds to scrape individual topic pages using the extracted URLs.

For each topic, it creates a new CSV file named after the topic and saves the repository information in it.

This script can be executed directly to perform the scraping and generate the desired CSV files.
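
Two practical details are worth keeping in mind when writing the per-topic files: pandas does not create a missing output directory, and the topic titles are stored without strip(), so they may carry whitespace or characters that are awkward in file names. The sketch below shows one way to handle both, assuming a hypothetical helper named safe_csv_path that is not part of the original script.

```python
import os
import re
import tempfile


def safe_csv_path(directory, topic_name):
    """Hypothetical helper: build a CSV path for a topic, creating the
    directory if needed and replacing characters that are awkward in file names."""
    os.makedirs(directory, exist_ok=True)
    # Keep letters, digits, underscore, dot, plus and hyphen; replace the rest
    safe_name = re.sub(r'[^\w.+-]', '_', topic_name.strip())
    return os.path.join(directory, f"{safe_name}.csv")


# A temporary directory keeps this demonstration from touching real files
base_dir = os.path.join(tempfile.mkdtemp(), 'GitHubTopic_CSV-Files')
path = safe_csv_path(base_dir, ' C++ ')
other_path = safe_csv_path(base_dir, 'ASP.NET / Core')
```

In the main loop, the resulting path could be passed to new_df.to_csv() in place of the f-string.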

url = 'https://github.com/topics'
topic_dataframe = topicSraper(topic_page_authentication(url))
topic_dataframe.to_csv('GitHubtopics.csv', index=None)

Once this code runs, it generates a CSV file named ‘GitHubtopics.csv’ that contains every topic’s name, description, and URL.

[Figure: GitHubtopics.csv]

url = topic_url_extractor(topic_dataframe)
name = topic_dataframe['Title']
for i in range(len(topic_dataframe)):
    new_df = topic_information_scraper(url[i])
    new_df.to_csv(f'GitHubTopic_CSV-Files/{name[i]}.csv', index=None)

Next, this code creates one CSV file for each topic saved in ‘GitHubtopics.csv’. The files are stored in a directory called ‘GitHubTopic_CSV-Files’, and each one is named after its topic.

[Figure: GitHubTopic_CSV-Files]

Each topic CSV file stores information about the repositories in that topic: the username, the repository name, the star count, and the repository URL.

[Figure: 3D.csv]

Note: The tags and CSS classes on the website may change, so before running this Python script, check them against the current pages and update the code if needed.

The full script is available at: https://github.com/PrajjwalSule21/GitHub-Topic-Scraper/blob/main/RepoScraper.py
