Clearing Facebook’s Cached Fetches

A while back, I ran into a website with a configuration problem that resulted in a redirect loop.  While the loop was in place, the site was linked in a Facebook post.  As expected, rather than displaying site content in the preview it puts together, Facebook noted in the preview that the URL returned a 301 redirect.

The trouble was that even after the configuration problem was resolved, Facebook continued to display the site URL as a 301 redirect, which makes sense – you don’t serve content to 1.6 billion active users without some serious caching.

To get Facebook to fetch the most recent content you have two options:

  1. Wait
  2. Force it to re-grab the preview data via the Facebook object debugger tool

To use the debugger, visit https://developers.facebook.com/tools/debug in your browser – you’ll be asked to log in if you aren’t already.

Then pop the URL into the Input URL box and submit.  Facebook will display the previously-cached information with that 301 redirect that you want it to forget.

Facebook Object Debugger

The secret sauce is the “Fetch New Scrape Information” button.  Click it and Facebook will dutifully re-scrape your site, refreshing its cache and happily serving the updated data to its users, ftw.
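
If you’d rather script this than click through the debugger, the Graph API has also (at least historically) accepted a forced re-scrape via a POST request.  The sketch below is hedged – it assumes you have a valid access token and that the scrape parameter still behaves this way:

# Ask Facebook to re-scrape a URL (the access token value is a placeholder)
curl -X POST \
  -d "id=http://example.com/" \
  -d "scrape=true" \
  -d "access_token=YOUR_ACCESS_TOKEN" \
  https://graph.facebook.com/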

Antivirus with ClamAV on Ubuntu 14.04 Server

Running regular antivirus scans on your server is never a bad idea.  One of the easiest ways to accomplish this is with the ClamAV utility.

ClamAV can be run in two ways: as a background daemon via clamdscan, which depends on clamd, or on-demand via clamscan.  Because clamd stays resident in the background and doesn’t have to reload its virus database each time, using it will usually reduce scanning time, CPU load, and memory usage.

Clamd listens for incoming connections on a Unix and/or TCP socket and scans files or directories.  It reads the configuration file /etc/clamav/clamd.conf, and most of its behavior can be defined there.  Clamscan, by contrast, needs to load the virus database each time it is run, and takes its configuration as arguments on the command line.

To install Clam AV, run the following:

sudo apt-get install clamav clamav-daemon -y

Now it’s time to run the initial configuration process – it will ask you a whole bunch of questions that define the base behavior of ClamAV on your system.  You can pretty much go with the defaults, but read the explanation of each to make your final determination.  The questions cover things like which socket to listen on, whether to scan emails, how many threads to run concurrently, file size limits, and so on:

sudo dpkg-reconfigure clamav-base

Once installed, you will want to grab the most recent virus definitions:

sudo freshclam

…and start the daemon.

sudo /etc/init.d/clamav-daemon start

The daemon will start automatically at boot and run in the background thereafter, ready to scan whatever is sent its way.  If you want it to actively scan certain directories nightly, you can cron a script.

Below is a cronnable example of a script you can run nightly to scan specific folders (here, /home, /etc, and /var/www) for issues:

#!/bin/sh
# Start each night with a fresh log file
rm -f /var/log/nightly-clamav-scan.log
touch /var/log/nightly-clamav-scan.log
# Scan the target directories via the clamd daemon, logging only infected files
clamdscan /home/ /etc/ /var/www/ --infected --multiscan --fdpass --log=/var/log/nightly-clamav-scan.log

Here is what the man page has to say about the options we added:

--infected – Only print infected files. (clamdscan)

--multiscan – In the multiscan mode clamd will attempt to scan the directory contents in parallel using available threads. This option is especially useful on multiprocessor and multi-core systems. If you pass more than one file or directory in the command line, they are put in a queue and sent to clamd individually. This means that single files are always scanned by a single thread. Similarly, clamdscan will wait for clamd to finish a directory scan (performed in multiscan mode) before sending a request to scan another directory. This option can be combined with --fdpass (see below). (clamdscan)

--fdpass – Pass the file descriptor permissions to clamd. This is useful if clamd is running as a different user, as it is faster than streaming the file to clamd. Only available if connected to clamd via a local (unix) socket. (clamdscan)

--log – Save the scan report to FILE. (clamdscan)
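
To confirm that the daemon is actually catching things, you can scan the standard EICAR test string – a harmless file that every antivirus engine is designed to flag.  A quick sanity check, using /tmp/eicar.txt as a throwaway path:

# Create the harmless EICAR test file and scan it; clamdscan should report it as FOUND
printf '%s\n' 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' > /tmp/eicar.txt
clamdscan --fdpass /tmp/eicar.txt
rm /tmp/eicar.txt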

Now, make this script executable (edit your path as needed):

chmod +x /path/to/script.sh

…and cron it to run each night, scanning the appropriate folders and logging any results to the log file specified.  To set up the schedule, fire up crontab:

crontab -e

…and add the following line, then save, to schedule the script to run at midnight each night:

0 0 * * * /bin/sh /path/to/script.sh


Links – Oct 10-24

A few interesting links from the last week or so – the Google Marker link really helped out on a project we’re working on which requires displaying lots of information about a limited number of locations.

Some easy website optimization wins

Here is a list of my favorite quick-and-easy optimizations which will result in faster load times for your websites:

Caching

The fastest communication is that which doesn’t occur, and sending appropriate content expiration headers with your data can accomplish just that – browsers and intermediate caches will hold on to your data after transmission and not bother your server with future requests. On the server side, you can speed things up by making use of any application-supported key/value stores such as memcached or redis, or even enable simple file-based caching to return data without bothering your database. If using PHP, enabling an opcode cache such as APC or Zend Optimizer will make sure that PHP doesn’t have to parse your scripts every time they’re needed, resulting in significantly faster scripts. A reverse proxy installed on the server can drastically reduce the number of requests that even make it to the web service stack by serving data straight out of memory. If you do have to talk to the database, make sure it is caching query results so that common queries are returned swiftly.
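
A quick way to sanity-check that your expiration headers are actually making it out the door is to inspect a response with curl (the URL below is just a placeholder):

# Look for Cache-Control / Expires headers on a static asset
curl -sI http://example.com/css/style.css | grep -iE 'cache-control|expires'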

Use a CDN for static assets

Not only will this reduce the number of requests that are made of your web server, but static assets will be served more quickly as they will have to travel less to get to your user. There are lots of solutions for this, but we’ve been really liking Amazon’s Cloudfront of late.
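
If you do go the Cloudfront route, you can verify that requests are being served from the edge cache by checking the X-Cache header it adds to responses (the distribution hostname below is a placeholder):

# "Hit from cloudfront" means the edge served it without touching your origin
curl -sI http://d1234example.cloudfront.net/js/site.min.js | grep -i x-cache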

Run a recent Linux kernel

Make sure the server and receiver’s TCP stacks communicate as efficiently as possible by keeping your server kernel up to date. Among other things, this will ensure that as much data can get out as early as possible in the communication process, that the max receive window size is large (resulting in greater throughput), and that it gets to that size as quickly as possible.
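
Checking where you stand and pulling in updates is quick on Ubuntu (a reboot is required to actually boot into a newly installed kernel):

# See which kernel is currently running
uname -r
# Refresh package lists and pull in any newer kernel/package versions
sudo apt-get update && sudo apt-get dist-upgrade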

Combine/Minify CSS and JS

An oldie but a goodie, and one that is surprisingly often overlooked. This practice adheres nicely to the two tenets of sending as little as possible in as few round trips as possible between your server and your user’s browser.
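
There are plenty of build tools for this; as a rough example, assuming you have installed the uglify-js and clean-css packages from npm, something like the following produces combined, minified bundles (the file names are placeholders):

# Combine and minify two scripts into one bundle
uglifyjs jquery.plugins.js site.js -o js/site.min.js -c -m
# Combine and minify stylesheets
cleancss -o css/site.min.css reset.css site.css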

Gzip Compression

Instant and significant bandwidth savings for non-binary assets when sending to browsers that support gzip compression, which is pretty much all of them. Enable mod_deflate in Apache, or simply make sure the gzip directive is set if using Nginx.
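
On an Apache box running Ubuntu, for example, enabling the module is typically a one-liner:

# mod_deflate ships with Apache; enable it and restart to start compressing text responses
sudo a2enmod deflate && sudo service apache2 restart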

DNS

Resolving hostnames to IP addresses is the first part of the connection process, and can itself go awry. Use a quality DNS service with good performance and set a reasonably high (but not too high, in case you need to make changes) TTL on your zone records. This will ensure that results are usually cached nearby, with resolution happening as quickly as possible for your users. Reducing the use of unnecessary CNAME records will decrease the number of lookups that need to happen and get your user an IP to connect to more quickly.
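
You can check what TTLs your records are currently being served with using dig (substitute your own hostname):

# The second column of the answer is the remaining TTL, in seconds
dig example.com A +noall +answer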

Load Javascript asynchronously

Loading as much javascript as possible in an asynchronous manner will prevent script blocking and will result in a faster time-to-paint.

Lazy-load images

This will especially help your mobile users. By not loading non-visible images until they’re needed, your time to paint can be reduced. And those images – you’re optimizing them, right?

WIN!  Mod_pagespeed

A fantastic tool that will accomplish a lot of these optimizations very easily is mod_pagespeed, which is available for both Apache and Nginx web servers. If you use this, start with a small number of optimizations and enable new ones one at a time, testing thoroughly after enabling each – they can sometimes have unintended consequences.

How to add Google Authenticator two-factor auth to your Drupal instance

Sleep better by adding an extra layer of security to your Drupal site with the Google Authenticator module:

  1. Download it here and install it into your modules directory as normal, and enable it (it’s called GA Login in the GUI module list).  If you use drush, just run “drush dl ga_login && drush en -y ga_login”.
  2. Check out the configuration screen at /admin/config/people/ga_login.  You can adjust the realm name or account suffix to change how it will appear in your Google Authenticator app.  It’s probably fine as is.  You can also adjust the skew for the time or HMAC-based authentication methods.  Basically, this adjusts how much leeway GA will give you in terms of submitting your code.  The default is 10, and should be fine.  Check the box to force uid 1 to log in with two-factor auth as well – this will force your primary admin account to use GA as well.  Then save your configuration.
  3. Now visit your user page at /user and click the GA Login tab.  Click the Get Started tab to get your authenticator URL.  Select the method you want GA to use – I just go with the time-based default – and click Create Code.  On the following page you’ll see a URL, account name and key, which you can manually plug into your GA mobile app to configure it for your site, or you can simply scan the URL embodied in the QR code you’ll see (easier!).
  4. After you have configured your mobile app, be sure to check the “I have successfully scanned the current code” checkbox and click Use This Code to finalize your settings in Drupal.

At this point if you log out, you should be challenged with an additional field on your login screen, labeled Code.  You’ll need to fire up your Google Authenticator app and enter the current code along with your login and password to get in.

Fear not!  If something terrible happens, you can always disable the ga_login module to get in with simply your username and password.  To do this with drush, run drush dis ga_login.
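
For reference, here is that escape hatch as a copy-pasteable pair of commands (Drush 6/7-era syntax):

# Disable the module so logins fall back to username/password only
drush dis -y ga_login
# Re-enable it once you are back in and have sorted things out
drush en -y ga_login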

Google maintains a list of mobile apps you can use to run the authenticator here:

https://support.google.com/accounts/answer/1066447?hl=en

How to prevent forwarding to the web server back-end port when using Varnish

Depending on your web server and Varnish configurations, you may find that URLs that do not end in a trailing slash get redirected to Varnish’s back-end port:

http://example.com/about --> http://example.com:8080/about/

If your firewall is blocking direct access to the back-end port (and hopefully it is), you can end up with a timeout error.

The solution to this is to put your web service back on port 80, rather than 8080 or something else, and configure Varnish to talk to it on the loopback address 127.0.0.1 on port 80 (not localhost or the server’s IP address):

backend default {
    .host = "127.0.0.1";
    .port = "80";
}

If using Apache, set your ports.conf so it listens on port 80 again (bound to 127.0.0.1, so it doesn’t conflict with Varnish), and change any port references in your enabled site config files.  Then restart Varnish and Apache.

This allows front-end and back-end communications to operate on the same port but on different interfaces, so there are no conflicts, and any redirects that are generated will no longer point users at a port blocked by the firewall.
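
Once everything is restarted, you can confirm from the outside that redirects no longer leak the back-end port by checking the Location header (swap in your own hostname):

# The Location header should now point at port 80, with no :8080 in sight
curl -sI http://example.com/about | grep -i location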

Get your Drupal site crawled by Google more efficiently with the XML Sitemap module

Google does a frighteningly good job of finding all the stuff you add to your site, but even it will miss some of the darker corners of your online properties.  The best way to ensure that the pages you WANT to have crawled are indeed crawled is via a Google Sitemap.

There are lots of solutions to this – you can roll your own, use external crawlers, or install standalone scripts that will do it for you.  If you’re using the Drupal platform, there are a number of projects out there you can use.  One of the easiest to set up is (logically) called XML Sitemap.  Installation is a quick thing:

  1. Add it to your site by either downloading it and dropping it into your /sites/all/modules folder, or use Drush and simply issue: drush dl xmlsitemap
  2. Enable at least the following two submodules – XML Sitemap and XML Sitemap Node.  This is done either via the Modules section in your browser, or via Drush at the commandline: drush en -y xmlsitemap xmlsitemap_node
  3. By default it will only include your site root.  Now, go to Structure->Content Types and open any content types you want to include.  Click the edit link, and navigate to the XML Sitemap section of that content type.  Change Disabled to Enabled to include it, and give it a priority based on how often you will be updating it.
  4. Once completed, create your sitemap by visiting http://example.com/admin/config/search/xmlsitemap.
  5. Your sitemap will then be available at http://example.com/sitemap.xml

You can then submit the sitemap via Webmaster Tools, and keep the search engines up to date automatically by enabling the XML Sitemap Engines submodule (drush en xmlsitemap_engines).  In the configuration for this module, you can choose to resubmit to Google and/or Bing whenever updates are made to any of your sitemap content (you’ll want to make sure your site is already registered with them).
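
One thing worth noting: XML Sitemap regenerates its files during Drupal’s cron runs, so make sure cron fires regularly.  A sample crontab entry using drush, assuming your docroot is at /var/www/example.com (adjust the path to your own site):

# Run Drupal cron hourly so the sitemap (and its search-engine pings) stay current
0 * * * * /usr/bin/drush --root=/var/www/example.com --quiet cron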

Get a daily server overview via email with logwatch

Keeping up with activity on your Linux server can be a full-time job if you allow it to be (as well it should be, it can easily be argued).  There are many services that can ease the burden by providing you with a detailed overview of what’s going on at the moment (things like Monit, Munin, Cloudpassage, and a host of others).

Something that can augment this nicely is the 30,000-foot view that logwatch provides.

Logwatch is a series of Perl scripts packaged together that will detect, parse, and summarize a wide variety of log file types, including Apache, sshd, postfix, iptables, and a host of others.  Simply tell it who to email reports to, and schedule it to run at least once a day to receive a regular email digest of server activity you can quickly scan for issues.

To install, you can either grab the files via the project website at http://sourceforge.net/projects/logwatch/files/, or via your server’s package manager.  On flavors of Ubuntu, this would look like:

sudo apt-get install logwatch

The install will also pull in a few required Perl libraries if they’re not already on your server (libdate-manip-perl and libyaml-syck-perl).

At this point, you can run logwatch from the commandline and you’ll see some text output with information about what’s happened in the last day.  It’ll list out things like what packages have been installed/removed, the 404s and 500s Apache has served, how much drive space is left, any segfaults detected, and what services like postfix and sshd have been up to.

To customize the configuration, you’ll want to edit the config file found in the slightly out-of-the-way location of

/usr/share/logwatch/default.conf/logwatch.conf

In here you’ll want to change the email address from root to your own email address.  Logwatch can send its report as plain text or HTML – select whichever you prefer.

Output = mail #change from stdout
Format = text
MailTo = you@example.com

There’s a detail level you can set as well – by default it’s set to the lowest level (0), which itself sends a fair amount of info.  If you find that you want to receive more, increase it as needed, up to 10.  It also supports text values of Low, Medium and High, which translate to 0, 5, and 10:

Detail = Low
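
You can also override these settings for a single run from the command line, which is handy for previewing what a higher detail level looks like before committing it to the config file (using the same address as above):

# Send a one-off report covering today's activity at high detail
sudo logwatch --detail High --mailto you@example.com --range today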

By default, logwatch will recursively look for logs in /var/log, due to the configuration directive:

LogDir = /var/log

If you have additional logs in other locations, you can specify them by simply adding other LogDir lines:

LogDir = /var/log
LogDir = /var/www/example.com/logs/

Finally, schedule it by adding it to crontab – in this example, we schedule it to run at midnight by running crontab -e and adding the line:

0 0 * * * /usr/sbin/logwatch

…now sit back and wait for handy overviews to appear nightly in your inbox!

Enabling gzip compression for your Nginx-served Cloudfront content

Recently, I decided to put my own site behind Amazon’s Cloudfront CDN service. One way that they offer to accomplish this is via an origin pull. This means that the Cloudfront service connects to your own server, pulls your content from it, and distributes it to its various edge locations.

In a nutshell, you create a distribution at AWS, which is essentially a series of rules that define what origin server to pull from, whether to pay attention to things like cookies and querystrings, how long to cache things for, what data centers to cache the data at, etc.  Amazon responds by creating a unique subdomain for your distribution on the cloudfront.net domain, and connects it transparently to a combination of EC2 instances and S3 storage containers (you don’t need to worry about these – they take care of it).  At that point Cloudfront opens persistent connections to your server to pull, cache, and distribute content to end users as needed. You can even point to your Cloudfront domain via a CNAME record in your own domain.

This is quick and easy to set up, and works well.

After getting things in place, I was perusing the response headers, and I noticed that gzip compression was not being applied to my content after it had popped out of Cloudfront, even static content like javascript files and stylesheets. This puzzled me, because I definitely had gzip enabled in my nginx configuration.

After hunting through my logs, I noticed something about the requests from Cloudfront to my origin server. Here’s an example line:

216.137.32.250 - - [06/Jan/2013:06:43:24 -0500] "GET /?feed=atom HTTP/1.0" 200 10084 "-" "Amazon CloudFront"

It turns out that the HTTP/1.0 protocol Cloudfront uses for its requests was causing the issue – by default, nginx’s gzip_http_version is 1.1, so gzip is only applied to HTTP/1.1 requests. As long as it’s set that way, your content will be sent to Cloudfront uncompressed, which can increase your bandwidth costs and result in longer wait times, since Amazon will also be sending more data to your users per request.

Remedy this by making sure that for content you’re distributing with Cloudfront from an origin server, your nginx configuration on that server includes:

gzip_http_version 1.0;

In the end, my gzip settings ended up being:

gzip  on;
gzip_comp_level 6;
gzip_http_version 1.0;
gzip_types text/plain text/html text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript;
gzip_proxied any; 
gzip_disable "msie6";

…and my content is once again compressed when transmitted through AWS.
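
An easy way to confirm the change took is to imitate Cloudfront’s HTTP/1.0 request with curl and check for the Content-Encoding header on a response from your origin (swap in your own hostname and path):

# A gzipped response from the origin will include "Content-Encoding: gzip"
curl -s -o /dev/null -D - --http1.0 -H 'Accept-Encoding: gzip' http://example.com/ | grep -i content-encoding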

Chaining and piping at the command line

At the command line, you will find that you frequently need to run multiple commands to accomplish your end goal. It’s important, therefore, to understand the various options you have to link commands together:

Consecutively

The simplest way to run multiple commands is to connect them with semi-colons:

command1 ; command2 ; command3

Doing this will run command 1, then command 2, then command 3, consecutively. The shell waits for each command to complete before starting the next one.

AND-ed

Alternately, you can connect commands with two ampersands between each, which will add a logic component to the operation by AND-ing them together. This runs the commands consecutively, but has the additional effect of paying attention to how each command exits, proceeding to the next only if the previous one succeeded. So, for example:

command1 && command2

This will run command 1, and if that is successful, command 2 will run. If command 1 fails, command 2 will not be run.
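
A classic real-world use of this is package management, where you only want the upgrade step to run if refreshing the package lists succeeded:

# The upgrade only runs if the update step exits successfully
sudo apt-get update && sudo apt-get upgrade -y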

OR-ed

On the flip side, if you OR them together with two pipes, you flip the logic so that it only proceeds on failure:

command1 || command2

In this case, command 2 will run only if command 1 does not successfully complete. This is useful if you need to log something on failure, for instance.
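
For example (backup.sh and the log path here are hypothetical):

# Record a note only if the backup script exits with a failure
/usr/local/bin/backup.sh || echo "backup failed at $(date)" >> /var/log/backup-failures.log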

Grouping

Additionally, you can utilize parentheses to group commands together, just like mathematical operations:

command1 || (command2 && command3)

This will run command 1, and if it fails, command 2 will run, and on its success, command 3.
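
A more concrete (and hypothetical) version of the same pattern – if a deploy script fails, log it and send an alert, assuming a working local mail setup:

# On failure of deploy.sh, write to syslog and email an alert
./deploy.sh || (logger "deploy.sh failed" && echo "deploy failed on $(hostname)" | mail -s "Deploy failed" admin@example.com)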

In Parallel

If the commands don’t each depend on the previous one having completed successfully, you can connect them with a single ampersand to run them in parallel, which lets you complete them all in a shorter amount of time:

command1 & command2 & command3

When you run commands this way, they are all started simultaneously in the background. The shell assigns each one a job number and prints it alongside the process ID (PID) of each:

[1] 10945
[2] 10946
[3] 10947

Knowing these is handy in case you want to stop one of them. You can kill a job by its process ID:

kill 10945

or via its index number in the job list:

kill %1

Additionally, you can check on the progress of any job in the list by issuing the foreground command followed by its index number:

fg %1

Running this will bring the first job into the foreground, and you’ll see any output it’s generating in your terminal window. To put it back into the background, hit ctrl-z to suspend it and get the command line back, and then enter:

bg %1

…to resume running it in the background again. That’s the difference between ctrl-c and ctrl-z – ctrl-c will kill a process with a SIGINT, while ctrl-z will merely suspend it with a SIGTSTP, which lets you continue it at a later time.

As each job completes, it outputs a line saying that it’s done:

[1] Done command1

If, during their runtime, you need to see which job is which, and what their current run states are, just enter the command

jobs

…and you’ll get a list of what’s running:

[1] Running command1 &
[2]- Running command2 &
[3]+ Running command3 &

Piping

Finally, by gluing commands together with a single pipe, you can pass the output of one command on to the next. This is extremely powerful for getting stuff done.

For instance, to find all of the files in a directory listing whose names contain the term foo, you could redirect the output of ls to a text file:

ls > bar.txt

…and then grep on it:

grep foo bar.txt

But you can save a step and some time by just chaining them together with pipe to get the same result without generating a file:

ls | grep foo

Using piping, in two or three commands you can get an enormous amount of stuff done in a single line of code.
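
As a slightly meatier example, here’s a pipeline that finds the five largest files under /var/log – each command’s output feeds straight into the next:

# List the size of everything under /var/log, sort numerically in reverse, and keep the top five
du -a /var/log 2>/dev/null | sort -rn | head -n 5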