Scraping GitHub

For an upcoming project I need to dynamically fetch information about a GitHub repository: the number of stars, watchers and forks, plus the repo description and URL.

Looking at the API I didn’t see a simple way of doing it so I decided to scrape my repo instead.

Using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net) the process is simple. First include simple_html_dom.php, then load the page for my repo:

include 'simple_html_dom.php';
$html = file_get_html('https://github.com/simple-mvc-framework/framework');

Next I need the watchers, stars and forks. Each is contained in an a link with a class of social-count, which is perfect: I can use the class to get all links with that class:

$html->find('a.social-count', 0)->innertext;

The number is the index of the match to return. I could loop through the results using a foreach, but I wanted to be specific and add them to an array like this:

$info = array(
    'watching' => trim($html->find('a.social-count', 0)->innertext),
    'starred' => trim($html->find('a.social-count', 1)->innertext),
    'forked' => trim($html->find('a.social-count', 2)->innertext)
);

I’ve wrapped each result in trim() to remove any surrounding whitespace.
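For comparison, the foreach approach mentioned above could look roughly like this. It is only a sketch: label_counts is a hypothetical helper name, and it assumes the links appear on the page in the order watching, starred, forked, which is what the indexes 0, 1 and 2 rely on as well.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): pair each matched link
// with a label, in page order. $links is the array returned by
// $html->find('a.social-count') when no index is passed.
function label_counts(array $links, array $labels)
{
    $info = array();
    foreach ($labels as $i => $label) {
        // Skip labels that have no matching link instead of raising a notice.
        if (isset($links[$i])) {
            $info[$label] = trim($links[$i]->innertext);
        }
    }
    return $info;
}

// Usage with the parsed page from earlier in the post:
// $info = label_counts($html->find('a.social-count'), array('watching', 'starred', 'forked'));
```

If the page has fewer matching links than labels, the missing keys are simply left out of the result.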

That’s the stats taken care of. Next is the repo description, which is stored in a div with a class of ‘repository-description’:

$html->find('div.repository-description', 0)->innertext;

Finally the repo URL, which is inside a div with a class of ‘repository-website’:

strip_tags($html->find('div.repository-website', 0)->innertext)

This time I want to remove the a link, so I use strip_tags, which removes all markup.

Putting this all together:

$info = array(
    'watching' => trim($html->find('a.social-count', 0)->innertext),
    'starred' => trim($html->find('a.social-count', 1)->innertext),
    'forked' => trim($html->find('a.social-count', 2)->innertext),
    'desc' => trim($html->find('div.repository-description', 0)->innertext),
    'sitelink' => trim(strip_tags($html->find('div.repository-website', 0)->innertext))
);

Now any time I want to display one of these I can call the relevant part, such as $info['starred'].
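One thing to watch with scraping: when a selector matches nothing, find() with an index returns null, and reading innertext from null raises an error. This can happen if a repo has no website set or GitHub changes its markup. A minimal sketch of a guard, where find_text is a hypothetical helper name and the empty-string default is an assumption:

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): return the trimmed,
// tag-free text of the $index-th match for $selector, or $default when
// nothing matches.
function find_text($html, $selector, $index = 0, $default = '')
{
    $node = $html->find($selector, $index);
    if (!$node) {
        return $default;
    }
    return trim(strip_tags($node->innertext));
}

// Usage with the parsed page from earlier in the post:
// $info = array(
//     'watching' => find_text($html, 'a.social-count', 0),
//     'sitelink' => find_text($html, 'div.repository-website', 0),
// );
```

Note that this sketch runs strip_tags on every value, so it would also strip any markup from the description, which the array above keeps.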


Now that I have the stats it would be nice to display recent commits, say the most recent 5.

This time I call a different URL. The commits are stored in a series of li’s with a class of commit, so this time I loop through them.

I store the commit and title in variables, then use str_replace to make sure the URLs on the a links point to GitHub.

I only want 5, so a check is run: once $i is equal to 5, break the loop.

$i = 0;
$html = file_get_html('https://github.com/simple-mvc-framework/framework/commits/master');
foreach ($html->find('li.commit') as $e) {
    $commit = $e->find('div.commit-meta', 0)->innertext.'<br>';
    $title = $e->find('p.commit-title', 0)->innertext.'<br>';
    echo '<p>';
    // Rewrite relative links so they point at github.com
    echo str_replace('href="/', 'href="https://github.com/', $title);
    echo str_replace('href="/', 'href="https://github.com/', $commit);
    echo '</p>';

    // Stop after five commits
    $i++;
    if ($i == 5) {
        break;
    }
}
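To see what the str_replace step does on its own, here is a small sketch run against a made-up link. The absolutize_links name and the commit hash abc123 are hypothetical and only there for illustration.

```php
<?php
// Hypothetical helper: turn root-relative hrefs ("/owner/repo/...") into
// absolute GitHub URLs. Only double-quoted href attributes are matched.
function absolutize_links($markup, $base = 'https://github.com/')
{
    return str_replace('href="/', 'href="' . $base, $markup);
}

// Made-up sample markup, shaped like a link scraped from the commits page.
$sample = '<a href="/simple-mvc-framework/framework/commit/abc123">Fix typo</a>';
$result = absolutize_links($sample);
// $result is now:
// <a href="https://github.com/simple-mvc-framework/framework/commit/abc123">Fix typo</a>
```

Because this is a plain string replacement, links that are already absolute are left alone, but a protocol-relative href starting with // would be rewritten incorrectly.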


It would have been nice to use an official API, but it would have meant multiple calls for the information. Scraping the information is much quicker and easier in this case.

David Carr

For the past 12 years, I’ve been developing applications for the web using mostly PHP. I do this for a living and love what I do as every day there is something new and exciting to learn.

In my spare time, the web development community is a big part of my life. Whether managing online programming groups and blogs or attending a conference, I find keeping involved helps me stay up to date. This is also my chance to give back to the community that helped me get started, a place I am proud to be a part of.

Besides programming, I love spending time with friends and family. We can often be found going out to catch the latest movie, staying in playing games on the sofa or planning a trip to someplace I’ve never been before.