Using Selenium with PHP to crawl web pages


Selenium is an awesome tool for automating the testing of your application, although several better-performing headless solutions are available today for testing (PhantomJS, Zombie.js). Selenium can still be extremely useful for loading a web page, performing actions such as a search, and extracting data from it.

Here are two answers I posted on Stack Overflow demonstrating the basic concept. Let’s have a look and get up and running with Selenium.

How to use Selenium with PHP?
Disable images in Selenium Python

Installing the tools

Before we do anything, we need to install all the tools. This is straightforward, nothing out of the ordinary here.

1. Download and install facebook/php-webdriver into your project.

composer require facebook/webdriver

2. Download Selenium and start it (you can place the jar anywhere).

# Download it
curl -O http://selenium-release.storage.googleapis.com/2.53/selenium-server-standalone-2.53.0.jar

# Start it
java -jar selenium-server-standalone-2.53.0.jar

# INFO - Launching a standalone Selenium Server
# INFO - Java: Oracle Corporation 25.45-b02
# ... ... ...
# INFO - RemoteWebDriver instances should connect to:
# INFO - Selenium Server is up and running

3. Download QuickJava and place it into your project directory.

# Download QuickJava to your project dir
# We'll need to reference it later on
curl -O https://addons.cdn.mozilla.net/user-media/addons/1237/quickjava-2.0.8-fx.xpi

4. Download Firefox if you don’t have it already. We’ll use Firefox as the browser for this tutorial.
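
Before moving on, it can help to confirm that the server from step 2 is actually listening. The snippet below is a minimal sketch that assumes Selenium runs on the same machine on port 4444 (the usual default for the standalone server, adjust it if yours differs). `isPortOpen` is a hypothetical helper name, not part of php-webdriver.

```php
<?php
// Hypothetical helper: returns true if something accepts TCP connections on $host:$port.
function isPortOpen(string $host, int $port, float $timeout = 2.0): bool
{
    // Suppress the warning fsockopen() raises when the connection is refused.
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }
    fclose($socket);
    return true;
}

// Assumption: Selenium runs locally on its default port 4444.
if (isPortOpen('127.0.0.1', 4444)) {
    echo "Selenium port is open\n";
} else {
    echo "Nothing is listening on port 4444, is the server started?\n";
}
```

This only tells you that a process is listening on the port, not that it is Selenium, but it catches the most common mistake of forgetting to start the server.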

Using Selenium with Facebook's PHP WebDriver

Now that we’ve got everything installed and running, let’s start playing with it!

Somewhere in your project (or in a new PHP script), place the following code. Remember to change the QuickJava path to wherever you downloaded it.

use Facebook\WebDriver\Firefox\FirefoxProfile;
use Facebook\WebDriver\Firefox\FirefoxDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

// Change this to the path of your xpi
// (here: the file downloaded in step 3, sitting next to this script)
$extensionPath = __DIR__.'/quickjava-2.0.8-fx.xpi';

// Build our Firefox profile and load the QuickJava extension into it
$profile = new FirefoxProfile();
$profile->addExtension($extensionPath);
$profile->setPreference('thatoneguydotnet.QuickJava.curVersion', '');

// Disable all these
$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.Images', 2);
$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.AnimatedImage', 2);
$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.CSS', 2);
$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.Flash', 2);
$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.Java', 2);
$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.Silverlight', 2);

// Except Cookies & JavaScript
//$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.Cookies', 2);
//$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.JavaScript', 2);

// Create DC
$dc = DesiredCapabilities::firefox();
$dc->setCapability(FirefoxDriver::PROFILE, $profile);

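// Create our new driver
// Assumption: Selenium is running locally on its usual default address
$host = 'http://localhost:4444/wd/hub';
$driver = RemoteWebDriver::create($host, $dc);

// Load a page (example URL, swap in the page you want to crawl)
$driver->get('https://stackoverflow.com/');

// The HTML source code
$html = $driver->getPageSource();

// Firefox should be open and you can see that no images or CSS were loaded.
// Call $driver->quit() when you're done to close the browser.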
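
Once we have the page source in `$html`, we still need to pull the data out of it. Here is a minimal sketch using PHP's built-in DOM extension. `extractLinks` is a hypothetical helper, and the sample markup simply stands in for whatever `$driver->getPageSource()` returned.

```php
<?php
// Hypothetical helper: collect the text and href of every link in an HTML string.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid markup, so silence libxml parse warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);
    // Only anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'text' => trim($anchor->textContent),
            'href' => $anchor->getAttribute('href'),
        ];
    }
    return $links;
}

// Stand-in for the string returned by $driver->getPageSource().
$html = '<html><body><a href="/questions">Questions</a> <a href="/tags"> Tags </a></body></html>';
print_r(extractLinks($html));
```

Parsing the source yourself keeps the browser session short. If you would rather query the live page, php-webdriver also lets you look up elements through the driver.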

Disable loading images with Selenium

You can also disable the loading of images. This dramatically improves loading speed and saves bandwidth.

We can do this by simply setting a preference on our FirefoxProfile.

// Build our firefox profile

$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.Images', 2);

// Create DC ...

Disable loading CSS or JavaScript with Selenium

You may also want to disable the loading of CSS and JavaScript. We can do this by setting the following preferences on our FirefoxProfile.

// Build our firefox profile

$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.CSS', 2);
$profile->setPreference('thatoneguydotnet.QuickJava.startupStatus.JavaScript', 2);

// Create DC ...
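
All of these preferences follow the same naming pattern, so if you toggle features often it may be tidier to build them from a list. This is only a sketch: `quickJavaDisablePreferences` is a hypothetical helper, and the value `2` is simply the one used throughout this post to switch a feature off.

```php
<?php
// Hypothetical helper: map QuickJava feature names to "disabled" preferences.
// The value 2 is the one the examples in this post use to switch a feature off.
function quickJavaDisablePreferences(array $features): array
{
    $preferences = [];
    foreach ($features as $feature) {
        $preferences['thatoneguydotnet.QuickJava.startupStatus.' . $feature] = 2;
    }
    return $preferences;
}

$preferences = quickJavaDisablePreferences(['Images', 'CSS', 'JavaScript']);

// With a FirefoxProfile in scope you would apply them like this:
// foreach ($preferences as $name => $value) {
//     $profile->setPreference($name, $value);
// }
print_r($preferences);
```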